
Significant Discrepancy in ARC-Challenge Accuracy: Llama-3.2-3B Official vs lm-evaluation-harness #2605

Open
luoxuan-cs opened this issue Dec 31, 2024 · 1 comment

luoxuan-cs commented Dec 31, 2024

Description

I've observed a substantial difference between the officially reported ARC-Challenge accuracy and my local evaluation results using lm-evaluation-harness.

Official Results

Reported 25-shot arc_challenge accuracy on the HuggingFace repo (meta-llama/Llama-3.2-3B): 69.1%

Local Evaluation Results

Base Model Test

My command line:
```bash
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B,trust_remote_code=True \
    --tasks arc_challenge \
    --batch_size 8 \
    --num_fewshot 25
```

Results:

| Tasks         | Version | Filter | n-shot | Metric   | Value  | Stderr   |
|---------------|---------|--------|--------|----------|--------|----------|
| arc_challenge | 1       | none   | 25     | acc      | 0.4667 | ± 0.0146 |
| arc_challenge | 1       | none   | 25     | acc_norm | 0.5068 | ± 0.0146 |

Instruct Model

My hypothesis is that the Llama model requires a chat template. Therefore, I tested meta-llama/Llama-3.2-3B-Instruct with the official chat template:

```bash
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,trust_remote_code=True,dtype="bfloat16" \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size 8
```

However, this produces similar results:

| Tasks         | Version | Filter | n-shot | Metric   | Value  | Stderr   |
|---------------|---------|--------|--------|----------|--------|----------|
| arc_challenge | 1       | none   | 25     | acc      | 0.4983 | ± 0.0146 |
| arc_challenge | 1       | none   | 25     | acc_norm | 0.5341 | ± 0.0146 |

I'm wondering what caused this inconsistency.

baberabb commented Jan 2, 2025

Hi! They use the MMLU format for this task: the question followed by the lettered answer choices (`A. <answer_choice>`, ...), with `Answer:` scored against the letter (e.g. `"A"`), rather than the standard cloze format, where the prompt is just the question plus `Answer:` and the model is scored against the full text of each answer choice.
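
To make the difference concrete, here is a rough sketch of the two prompt styles on a sample ARC question. This is only illustrative; the exact templates (wording, spacing, few-shot formatting) used by the harness and by Meta's reported numbers may differ.

```python
# Illustrative sketch only: not the harness's exact templates.
question = "Which factor will most likely cause a person to develop a fever?"
choices = {
    "A": "a leg muscle relaxing after exercise",
    "B": "a bacterial population in the bloodstream",
    "C": "several viral particles on the skin",
    "D": "carbohydrates being digested in the stomach",
}
answer = "B"

# Default arc_challenge (cloze): the model is scored on the full answer text.
cloze_prompt = f"Question: {question}\nAnswer:"
cloze_target = f" {choices[answer]}"

# MMLU-style: the choices are listed in the prompt and the model is scored on the letter.
mmlu_prompt = (
    f"Question: {question}\n"
    + "".join(f"{letter}. {text}\n" for letter, text in choices.items())
    + "Answer:"
)
mmlu_target = f" {answer}"

print(cloze_prompt + cloze_target, mmlu_prompt + mmlu_target, sep="\n\n")
```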

You can run it with the llama_arc_challenge task from #2556.
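
For example, something like this (flags mirroring your commands above; assuming the branch from that PR is checked out and installed, since it adds the task):

```bash
# Requires the lm-evaluation-harness branch from PR #2556, which adds llama_arc_challenge.
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B \
    --tasks llama_arc_challenge \
    --num_fewshot 25 \
    --batch_size 8
```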

baberabb added the asking questions label on Jan 2, 2025