
Significant Discrepancy in ARC-Challenge Accuracy: Llama-3.2-3B Official vs lm-evaluation-harness #2605

Open
luoxuan-cs opened this issue Dec 31, 2024 · 1 comment

luoxuan-cs commented Dec 31, 2024

Description

I've observed a substantial difference between the officially reported ARC-Challenge accuracy and my local evaluation results using lm-evaluation-harness.

Official Results

Reported 25-shot arc_challenge accuracy on the HuggingFace repo (meta-llama/Llama-3.2-3B): 69.1%

Local Evaluation Results

Base Model Test

My command line:
```bash
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B,trust_remote_code=True \
    --tasks arc_challenge \
    --batch_size 8 \
    --num_fewshot 25
```

Results:

| Tasks         | Version | Filter | n-shot | Metric   | Value  | Stderr   |
|---------------|---------|--------|--------|----------|--------|----------|
| arc_challenge | 1       | none   | 25     | acc      | 0.4667 | ± 0.0146 |
| arc_challenge | 1       | none   | 25     | acc_norm | 0.5068 | ± 0.0146 |

Instruct Model

My hypothesis is that the Llama model requires a chat template. Therefore, I tested meta-llama/Llama-3.2-3B-Instruct with the official chat template:

```bash
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,trust_remote_code=True,dtype="bfloat16" \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size 8
```

However, this produces similar results:

| Tasks         | Version | Filter | n-shot | Metric   | Value  | Stderr   |
|---------------|---------|--------|--------|----------|--------|----------|
| arc_challenge | 1       | none   | 25     | acc      | 0.4983 | ± 0.0146 |
| arc_challenge | 1       | none   | 25     | acc_norm | 0.5341 | ± 0.0146 |

I'm wondering what caused this inconsistency.

baberabb commented Jan 2, 2025

Hi! They use the MMLU format for this task: the question followed by the lettered answer choices (`A. <answer_choice>`, ...), with `Answer:` scored against the letter (e.g. `"A"`), rather than the standard cloze format, where the prompt is just the question plus `Answer:` and the model is scored against the full text of each answer choice.
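
To make the difference concrete, here is a rough sketch of the two prompt styles on a sample ARC question. This is only illustrative; the exact templates (wording, spacing, few-shot formatting) used by the harness and by Meta's reported numbers may differ.

```python
# Illustrative sketch only: not the harness's exact templates.
question = "Which factor will most likely cause a person to develop a fever?"
choices = {
    "A": "a leg muscle relaxing after exercise",
    "B": "a bacterial population in the bloodstream",
    "C": "several viral particles on the skin",
    "D": "carbohydrates being digested in the stomach",
}
answer = "B"

# Default arc_challenge (cloze): the model is scored on the full answer text.
cloze_prompt = f"Question: {question}\nAnswer:"
cloze_target = f" {choices[answer]}"

# MMLU-style: the choices are listed in the prompt and the model is scored on the letter.
mmlu_prompt = (
    f"Question: {question}\n"
    + "".join(f"{letter}. {text}\n" for letter, text in choices.items())
    + "Answer:"
)
mmlu_target = f" {answer}"

print(cloze_prompt + cloze_target, mmlu_prompt + mmlu_target, sep="\n\n")
```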

You can run it with the llama_arc_challenge task from #2556.
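
For example, something like this (flags mirroring your commands above; assuming the branch from that PR is checked out and installed, since it adds the task):

```bash
# Requires the lm-evaluation-harness branch from PR #2556, which adds llama_arc_challenge.
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B \
    --tasks llama_arc_challenge \
    --num_fewshot 25 \
    --batch_size 8
```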

baberabb added the asking questions label on Jan 2, 2025