
8bits GPTQ quantization output #35460

Open

joshuaongg21 opened this issue Dec 30, 2024 · 4 comments

joshuaongg21 commented Dec 30, 2024

System Info

Hi, I noticed that with 8-bit quantization using GPTQConfig, model inference generally produces very poor results, with outputs that often don't make sense. Could this be an engineering issue with GPTQ, or is this typical behavior for GPTQ with 8-bit quantization? Thank you in advance!

Who can help?

@SunMarc @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below) - NQOpen

Reproduction

  1. By using the pre-quantized model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8
  2. By quantizing meta-llama/Llama-3.1-8B-Instruct with GPTQConfig:
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "meta-llama/Llama-3.1-8B-Instruct"

    quantization = GPTQConfig(
        bits=8,
        group_size=128,
        dataset="c4",
        desc_act=False,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization,
        device_map="auto",
    )
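
A minimal generation call of the kind that exhibits the issue might look like this (the prompt and generation settings are illustrative assumptions, not the exact ones used in the report):

    # Illustrative only: the prompt and generation settings are assumptions.
    prompt = "Who wrote the novel Frankenstein?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))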

Expected behavior

  1. The model tends to generate unrelated outputs, such as !!!!!!!!!!!!!!!!!..... The generated output does not make sense at all.
@joshuaongg21 joshuaongg21 changed the title 8bits quantization 8bits GPTQ quantization output Dec 30, 2024
SunMarc (Member) commented Jan 2, 2025

Thanks for the report @joshuaongg21! This shouldn't be the case. @Qubitium, does this also happen with gptqmodel?

Qubitium (Contributor) commented Jan 2, 2025

@joshuaongg21 Please try to replicate this error (requantize) using gptqmodel.

This should not happen. Our previous tests show that 8-bit GPTQ quantization yields high-quality results almost effortlessly.
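
A rough sketch of what requantizing with GPTQModel could look like (the calibration-data handling, argument names, and output path below are assumptions rather than an exact recipe):

    # Sketch only: calibration handling, argument names, and paths are assumptions.
    from datasets import load_dataset
    from gptqmodel import GPTQModel, QuantizeConfig

    model_id = "meta-llama/Llama-3.1-8B-Instruct"

    # Small C4 slice as calibration data (the sample count is an arbitrary choice).
    calibration = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00000-of-01024.json.gz",
        split="train",
    ).select(range(512))["text"]

    quant_config = QuantizeConfig(bits=8, group_size=128)

    model = GPTQModel.load(model_id, quant_config)
    model.quantize(calibration, batch_size=2)
    model.save("Llama-3.1-8B-Instruct-gptqmodel-8bit")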

Usually when I see total corruption of the output, my first instinct is to double-check the tokenizer.
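
A quick tokenizer sanity check along those lines might look like this (the prompt is an arbitrary example):

    # Round-trip the tokenizer to rule out a tokenizer/vocab mismatch as the
    # source of garbage output. The prompt is an arbitrary example.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8")
    text = "The capital of France is"
    ids = tok(text)["input_ids"]
    print(ids)
    print(tok.decode(ids))  # should reproduce the original text

    # For chat models, also confirm the chat template renders as expected.
    print(tok.apply_chat_template(
        [{"role": "user", "content": text}],
        tokenize=False,
        add_generation_prompt=True,
    ))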

@SunMarc I will add 2-8 bit unit tests in gptqmodel to check for regressions just in case.

Qubitium (Contributor) commented Jan 3, 2025

@SunMarc @joshuaongg21 We have added 2-8 bit validation to GPTQModel using TinyLlama, evaluating the ARC-Challenge score post-quantization.

https://github.com/ModelCloud/GPTQModel/pull/995/files

What we see is that 8-bit has the highest score, followed by 4-bit, which is healthy and normal; there is no regression in 8-bit quants. We do, however, see a 3-bit regression versus 2-bit that needs further investigation, since no one I know of has actually produced a widely used 3-bit GPTQ quant. At this moment we are not sure why 3-bit scores worse than 2-bit. In short, 4-bit and 8-bit look good, 2-bit looks normal, and 3-bit is currently the outlier.

Acc scores for model: ModelCloud/TinyLlama-1.1B-Chat-v1.0

    TORCH_QLINEAR_QUANTIZED_MODEL_ARC_CHALLENGE_EXPECTS = {
        2: {'acc,none': 0.22610921501706485, 'acc_norm,none': 0.2909556313993174},
        3: {'acc,none': 0.21245733788395904, 'acc_norm,none': 0.24744027303754265},
        4: {'acc,none': 0.2738907849829352, 'acc_norm,none': 0.3122866894197952},
        8: {'acc,none': 0.2841296928327645, 'acc_norm,none': 0.302901023890785},
    }
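
As a rough illustration of how such a regression check can be expressed with lm-eval (the quantized-model path and the tolerance below are assumptions, not taken from the linked PR):

    # Sketch: compare measured ARC-Challenge scores against expected values.
    # The quantized-model path and tolerance are assumptions for illustration.
    import lm_eval

    EXPECTED_8BIT = {"acc,none": 0.2841296928327645, "acc_norm,none": 0.302901023890785}

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=./TinyLlama-1.1B-Chat-v1.0-gptq-8bit",
        tasks=["arc_challenge"],
    )["results"]["arc_challenge"]

    for metric, expected in EXPECTED_8BIT.items():
        assert abs(results[metric] - expected) < 0.01, f"regression on {metric}"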

SunMarc (Member) commented Jan 3, 2025

Thanks for the details @Qubitium!
