8-bit GPTQ quantization output #35460
Comments
Thanks for the report @joshuaongg21! This shouldn't be the case. @Qubitium, does this also happen with gptqmodel?
@joshuaongg21 Please try to replicate this error (requantize) using gptqmodel. This should not happen: our previous tests show that 8-bit GPTQ quantization gives high-quality results almost effortlessly. Usually when I see totally corrupted output, my first instinct is to double-check the tokenizer. @SunMarc I will add 2-8 bit unit tests to gptqmodel to check for regressions, just in case.
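For reference, a requantization run with GPTQModel roughly follows the sketch below. It is based on GPTQModel's documented quantization flow; the model id, calibration slice, and save path are placeholders, and the exact API may differ between versions:

```python
# Sketch of an 8-bit requantization run with GPTQModel (placeholders throughout;
# the exact API may vary between GPTQModel versions).
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "Llama-3.1-8B-Instruct-gptq-8bit"

# Small calibration slice for illustration; production quants use more samples.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

quant_config = QuantizeConfig(bits=8, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration, batch_size=2)
model.save(quant_path)
```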
@SunMarc @joshuaongg21 We have added 2-8 bit validation to GPTQModel using TinyLlama, evaluating the ARC-Challenge score post-quantization: https://github.com/ModelCloud/GPTQModel/pull/995/files The 8-bit quant has the highest score, followed by 4-bit, which is healthy and normal, so there is no regression in 8-bit quants. We do, however, see a 3-bit regression relative to 2-bit that needs further investigation, since no one I know of has actually produced a widely used 3-bit GPTQ quant. At the moment we are not sure why 3-bit is worse than 2-bit. In short, 4-bit and 8-bit look good, 2-bit looks normal, and 3-bit is currently the outlier. Acc scores for the model:
TORCH_QLINEAR_QUANTIZED_MODEL_ARC_CHALLENGE_EXPECTS = {
2: {'acc,none': 0.22610921501706485, 'acc_norm,none': 0.2909556313993174},
3: {'acc,none': 0.21245733788395904, 'acc_norm,none': 0.24744027303754265},
4: {'acc,none': 0.2738907849829352, 'acc_norm,none': 0.3122866894197952},
8: {'acc,none': 0.2841296928327645, 'acc_norm,none': 0.302901023890785},
}
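For anyone who wants to reproduce numbers like these locally, a minimal lm-evaluation-harness sketch is below; the quantized-model path is a hypothetical placeholder, not the path used in GPTQModel's CI:

```python
# Sketch of evaluating a quantized model on ARC-Challenge with lm-eval.
# The local model path is a hypothetical placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./tinyllama-gptq-8bit",
    tasks=["arc_challenge"],
)
# The per-task dict holds the same keys as above, e.g. 'acc,none' and 'acc_norm,none'.
print(results["results"]["arc_challenge"])
```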
Thanks for the details, @Qubitium!
System Info
Hi, I noticed that with 8-bit quantization using GPTQConfig, model inference generally produces very poor results, with outputs that often don't make sense. Could this be an engineering issue with GPTQ, or is this typical behavior for GPTQ at 8 bits? Thank you in advance!
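For context, quantizing a model with GPTQConfig at 8 bits follows the standard transformers flow. The sketch below is a minimal illustration, not the reporter's exact script; the calibration dataset, prompt, and generation settings are placeholders:

```python
# Minimal sketch of 8-bit GPTQ quantization with transformers' GPTQConfig.
# The calibration dataset, prompt, and generation settings are placeholders.
# Requires optimum plus a GPTQ backend (auto-gptq or gptqmodel) to be installed.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits=8 is the setting under discussion; calibration runs on the "c4" dataset.
gptq_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

# Quantization happens inside from_pretrained and can take a while on a 7-8B model.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Quick sanity check of the quantized model's output.
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```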
Who can help?
@SunMarc @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8
meta-llama/Llama-3.1-8B-Instruct
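Loading the pre-quantized checkpoint and generating can be sketched as follows; the chat prompt is a placeholder, not the reporter's exact input:

```python
# Sketch of the reproduction path with the pre-quantized 8-bit checkpoint.
# The prompt is a placeholder, not the exact input used to trigger the issue.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain GPTQ quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```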
Expected behavior
The model generates !!!!!!!!!!!!!!!!!..... as the output; the generated output does not make sense at all.