[Badcase]: Why are the token IDs so strange when " 那明天呢?" is encoded with the Qwen2.5-3B tokenizer? #1149

Open
4 tasks done
dongyang-mt opened this issue Dec 26, 2024 · 2 comments

Comments

@dongyang-mt

Model Series

Qwen2.5

What are the models used?

Qwen/Qwen2.5-3B-Instruct

What is the scenario where the problem happened?

transformers

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: Ubuntu 22.04
Python: Python 3.10
GPUs: 8 x NVIDIA A100
NVIDIA driver: 535 (from nvidia-smi)
CUDA compiler: 12.1 (from nvcc -V)
PyTorch: 2.2.1+cu121 (from python -c "import torch; print(torch.__version__)")

Description

Steps to reproduce

import torch
from transformers import AutoTokenizer

# Load the tokenizer
model_name_or_path = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Example text
text = " 那明天呢?"

# Inspect each token
tokens = tokenizer.tokenize(text)
print("Tokenized Text:", tokens)

# Convert the tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Decode the IDs back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded Text:", decoded_text)

Encoded Text: " 那明天呢?"
Tokenized Text: ['Ġé', 'Ĥ', '£', 'æĺİå¤©', 'åĳ¢', '?']
Token IDs: [18137, 224, 96, 104807, 101036, 30]
Decoded Text: 那明天呢?
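
For readers puzzled by token strings such as 'Ġé', 'Ĥ' and '£': they look like the printable stand-ins that byte-level BPE tokenizers use for raw UTF-8 bytes. The sketch below is not taken from the Qwen code base. It assumes the tokenizer uses the same byte-to-character table as GPT-2, and the helper names `to_display` and `from_display` are hypothetical. Under that assumption, the first three tokens are simply the four UTF-8 bytes of " 那" (0x20 0xE9 0x82 0xA3) split across three vocabulary entries.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table: each of the 256 byte values -> one printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Control, whitespace and other non-printable bytes are moved to U+0100 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, map(chr, code_points)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_display(text):
    """Hypothetical helper: show text the way a byte-level vocabulary spells it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_display(tokens):
    """Hypothetical helper: join token strings, map them back to bytes, decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[c] for tok in tokens for c in tok)
    return raw.decode("utf-8")


print(to_display(" 那"))                    # ĠéĤ£
print(repr(from_display(["Ġé", "Ĥ", "£"])))  # ' 那'
```

If this assumption holds, the IDs 18137, 224 and 96 are not corrupt. They would indicate that the vocabulary has no single merged entry for a space followed by 那, so the tokenizer falls back to smaller byte-level pieces, and decoding the full ID sequence still restores the original text.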

image

@1006076811

Which tool is the token calculator you are using?

@dongyang-mt
Author

> Which tool is the token calculator you are using?

I got the result with tokenizer = AutoTokenizer.from_pretrained(model_name_or_path).
Web-based token calculator: https://dashscope.console.aliyun.com/tokenizer
Both give the same result:
