
[Badcase]: Inference with Qwen2.5-32B-Instruct-GPTQ-Int4 produces garbled text !!!!!!!!!!!!!!!!!! #945

zhanaali opened this issue Sep 23, 2024 · 18 comments
Labels: help wanted (Extra attention is needed)

@zhanaali

Model Series

Qwen2.5

What are the models used?

Qwen2.5-32B-Instruct-GPTQ-Int4

What is the scenario where the problem happened?

When running inference on Qwen2.5-32B-Instruct-GPTQ-Int4 with vLLM, the output is garbled text: !!!!!!!!!!!!!!!!!!

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

python==3.10
gpu: A100 80GB * 2
CUDA Version: 12.4
Driver Version: 550.54.15
PyTorch: 2.3.0+cu121
pip list

anaconda-anon-usage 0.4.4
archspec 0.2.3
boltons 23.0.0
Brotli 1.0.9
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
conda 24.7.1
conda-content-trust 0.2.0
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
cryptography 42.0.5
distro 1.9.0
frozendict 2.4.2
idna 3.7
jsonpatch 1.33
jsonpointer 2.1
libmambapy 1.5.8
menuinst 2.1.2
packaging 24.1
pip 24.2
platformdirs 3.10.0
pluggy 1.0.0
pycosat 0.6.6
pycparser 2.21
PySocks 1.7.1
requests 2.32.3
ruamel.yaml 0.17.21
setuptools 72.1.0
tqdm 4.66.4
truststore 0.8.0
urllib3 2.2.2
wheel 0.43.0
zstandard 0.22.0

Description

Steps to reproduce

This happens with Qwen2.5-32B-Instruct-GPTQ-Int4.
The badcase can be reproduced with the following steps:

  1. ...
  2. ...

The following example input & output can be used:

{
   "content": "你好",
   "role": "user"
}
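
For reference, a minimal way to send that message through the OpenAI-compatible API; the server address, API key, and served model name below are assumptions rather than values taken from this report:

from openai import OpenAI

# Assumed server address and served model name; adjust to the actual deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-GPTQ-Int4",
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)  # in this badcase the text is a run of "!"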

Expected results

{"model":"Qwen2-7B-Instruct","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","function_call":null},"finish_reason":"stop"}],"created":1727075660}

Attempts to fix

After switching to the Qwen2.5-72B-Instruct-GPTQ-Int4 model, the output is normal.

Anything else helpful for investigation

I find that this problem also happens with Qwen1.5-32B-Instruct-GPTQ-Int4.

@zhanaali (Author)

Inference file
openai_api_32b.txt

@zhanaali (Author)

The same script can run inference on the Qwen2.5-32B-Instruct-GPTQ-Int8 model and the output is normal. Is it a problem with the inference parameters?

@hzhwcmhf (Member)

Have you tried upgrading the vllm and autogptq packages?

@zhanaali (Author)

It's still the same after the upgrade.
@hzhwcmhf

@leavegee

I ran into the same problem. The model I used is the 32B GPTQ-quantized model.
Name: vllm
Version: 0.6.1.post2
Inference command:
vllm serve qwen25-32b --quantization gptq --host 0.0.0.0 --port 8080
Hoping for a solution.

@jklj077 (Collaborator) commented Sep 26, 2024

Hi, could you try installing the latest vllm in a fresh environment?

conda create -n vllm python=3.11
conda activate vllm
pip install vllm

This should install:

  • vllm 0.6.2
  • torch 2.4.0
  • CUDA 12.1 runtime

Tested with:

  • 2 or 8 NVIDIA A10
  • NVIDIA Driver 535.183.06

The result appears normal. [screenshots of the normal output omitted]

@linzhengtian

Both Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 and Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 show this problem; once the prompt exceeds 60 tokens the output returns to normal.

@jklj077 (Collaborator) commented Sep 26, 2024

@linzhengtian Please provide steps to reproduce. I cannot reproduce it with vLLM using the settings above. (The input sequence length is also about 30 tokens.)

@noanti commented Sep 26, 2024

Ran into the same problem.
vllm==0.6.1.post2
GPUs: 2x V100.
In the same environment, deploying qwen2.5-72b-gptq-int4 and qwen2.5-14b-gptq-int4 both works fine; only the 32b fails, and it only outputs exclamation marks.

@jklj077 (Collaborator) commented Sep 27, 2024

@noanti see this comment: #945 (comment)

@QwertyJack

@noanti see this comment: #945 (comment)

Tested on V100; it still fails with an infinite stream of !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

@QwertyJack

Both Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 and Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 show this problem; once the prompt exceeds 60 tokens the output returns to normal.

The same here.

@featherace

Inference file openai_api_32b.txt

Try setting quantization = "gptq_marlin" or quantization = None.
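
If the inference script uses vLLM's offline LLM API, a sketch of that change could look like the following; the model path, tensor-parallel size, and sampling settings are assumptions, not values taken from the script in this thread:

from vllm import LLM, SamplingParams

# Sketch only: pick the GPTQ Marlin kernel explicitly;
# leaving quantization unset/None lets vLLM choose a backend itself.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    quantization="gptq_marlin",
    tensor_parallel_size=2,
)
outputs = llm.generate("你好", SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)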

@QwertyJack

Try to set quantization = "gptq_marlin" or quantization = None

Unfortunately, V100 is SM 7.0, so it does not support Marlin.

@noanti commented Oct 17, 2024

@noanti see this comment: #945 (comment)

I tried vllm 0.6.2 and 0.6.3 and the problem is still there. As mentioned earlier, increasing the prompt to more than 50 tokens makes the output normal, which is strange…

@shilei4260

@noanti My GPUs are 4x V100 16G. Deploying qwen2.5-72b-gptq-int4 gives an out-of-memory error. How did you deploy it? Could you share your command?

@github-actions (bot)

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.

@wfnian commented Dec 24, 2024

On vllm 0.6.4.post1 with two P100 GPUs, Qwen2.5-32B-Instruct-GPTQ-Int4 also produced the infinite-exclamation-mark (!!!!!!) problem.
I then found in the ModelScope comment section the only workaround that has worked for me so far: before each question, add an extra round of dialogue right after the system prompt:

if len(messages) <= 1:
    # Pad the context with a dummy round right after the system prompt.
    messages.extend([
        {"role": "user", "content": "你好"},
        {"role": "assistant", "content": "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"},
    ])

This workaround is not elegant, but it does work for me. Admittedly, my system prompt is also over 50 tokens.
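
A self-contained version of that padding trick with the OpenAI client might look like this; the server address, model name, and system prompt are assumptions, and only the dummy user/assistant round comes from the comment above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "system", "content": "You are a helpful assistant."}]
if len(messages) <= 1:
    # Dummy round after the system prompt, so the prompt exceeds the ~50-token threshold reported above.
    messages.extend([
        {"role": "user", "content": "你好"},
        {"role": "assistant", "content": "!" * 80},
    ])
messages.append({"role": "user", "content": "你好"})

resp = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-GPTQ-Int4",
    messages=messages,
)
print(resp.choices[0].message.content)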

@github-actions github-actions bot removed the inactive label Dec 25, 2024