lm-eval v0.4.2 Release Notes

We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness: as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), to enable controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!

New Additions

  • Request Caching by @inf3rnus - speeds up startup by caching the construction of documents/requests’ contexts
  • Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!
  • New Tasks
  • Re-introduction of TemplateLM base class for lower-code new LM class implementations by @anjor
  • Run the library with the metrics/scoring stage skipped via --predict_only by @baberabb (see the example command after this list)
  • Many more miscellaneous improvements by a lot of great contributors!
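For example, a prediction-only run from the command line might look like the following (the model and output path here are illustrative placeholders, not a prescribed configuration):

lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks arc_easy \
    --output_path ./predictions \
    --log_samples \
    --predict_only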

Backwards Incompatibilities

There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:

TaskManager API

Previously, users had to call lm_eval.tasks.initialize_tasks() to register the library's default tasks, or lm_eval.tasks.include_path() to include a custom directory of task YAML configs.

Old usage:

import lm_eval

lm_eval.tasks.initialize_tasks()
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])

New intended usage:

import lm_eval

# Optional: only needs to be instantiated separately if you want to pass a custom path!
task_manager = lm_eval.tasks.TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)

get_task_dict() now also optionally accepts a TaskManager object, for use when loading custom tasks.
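As a rough sketch (the custom directory and the task name "my_custom_task" below are hypothetical placeholders), passing a TaskManager through get_task_dict() might look like:

from lm_eval.tasks import TaskManager, get_task_dict

# Register a custom directory of task YAML configs alongside the library defaults.
task_manager = TaskManager(include_path="/path/to/my/custom/tasks")

# Resolve task names (built-in or custom) into Task objects via the TaskManager.
task_dict = get_task_dict(["arc_easy", "my_custom_task"], task_manager)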

This should allow for much faster library startup times due to lazily loading requested tasks or groups.

Updated Stderr Aggregation

Previous versions of the library reported erroneously large stderr values for groups of tasks such as MMLU.

We've since updated the formula to correctly aggregate standard errors for groups of tasks whose accuracies are reported as a mean across the pooled dataset -- see #1390 and #1427 for more information.
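For intuition, here is a minimal sketch (not the library's exact implementation) of the standard error of a sample-size-weighted group mean, assuming independent subtasks that each report a mean, a stderr, and a sample size:

import math

def group_mean_stderr(means, stderrs, sizes):
    """Stderr of a size-weighted mean over independent subtasks.

    The group score is sum(n_i * mean_i) / N, so its variance is
    sum((n_i / N)**2 * stderr_i**2) when subtasks are treated as independent.
    """
    total = sum(sizes)
    group_mean = sum(n * m for n, m in zip(sizes, means)) / total
    variance = sum((n / total) ** 2 * se**2 for n, se in zip(sizes, stderrs))
    return group_mean, math.sqrt(variance)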

As always, please feel free to give us feedback or request new features! We're grateful for the community's support.

What's Changed

New Contributors

Full Changelog: v0.4.1...v0.4.2