Weird results for 70b models #2584

BeksultanSagyndyk commented Dec 19, 2024

I have forked the repository and created custom tasks in the lm_eval/tasks directory. All of them are MMLU-style multiple-choice tasks in the Kazakh language.
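
For context, each of these tasks follows the harness's standard multiple-choice YAML format. A minimal sketch of one of them; the dataset path and field names below are placeholders, not the exact values from my fork:

```yaml
# Sketch of one custom task config; dataset_path and the question/answer
# field names are placeholders for the actual Kazakh datasets.
task: kk_biology_unt_mc
dataset_path: my-org/kk_biology_unt  # placeholder Hugging Face dataset id
output_type: multiple_choice
doc_to_text: "{{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer  # assumed to hold the index of the gold choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```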

I am running evaluations using the following command:

```bash
lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
  --batch_size 2 \
  --num_fewshot 0 \
  --apply_chat_template \
  --tasks mmlu_translated_kk,kazakh_and_literature_unt_mc,kk_biology_unt_mc,kk_constitution_mc,kk_dastur_mc,kk_english_unt_mc,kk_geography_unt_mc,kk_history_of_kazakhstan_unt_mc,kk_human_society_rights_unt_mc,kk_world_history_unt_mc \
  --output output \
  --device cuda:0
```

For the Llama 3 8B Instruct model, I get an average accuracy of around 0.3–0.4, which looks correct; I have verified this by running the model manually in generation mode and checking its answers.
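
The manual check was a plain transformers generation loop along these lines (a sketch; the prompt here is a placeholder, the real questions come from the Kazakh datasets):

```python
# Sketch of the manual generation check with a standard transformers
# setup; the prompt content is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

# One MMLU-style question; options A-D, single-letter reply expected.
messages = [{
    "role": "user",
    "content": "Question...\nA. ...\nB. ...\nC. ...\nD. ...\n"
               "Answer with a single letter.",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```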

However, for larger models such as Llama 3.3 70B Instruct, Llama 3.1 70B, gemma-2-27b-it, and others, I keep getting an accuracy of ~0.1 across all tasks.

What am I doing wrong?
