
v0.4.4

@haileyschoelkopf haileyschoelkopf released this 05 Sep 15:13
· 97 commits to main since this release
543617f

lm-eval v0.4.4 Release Notes

New Additions

  • This release includes the official Open LLM Leaderboard 2 task implementations! These can be run by using --tasks leaderboard. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these. You can read more about their Open LLM Leaderboard 2 release here.

  • API support is overhauled! It now supports concurrent requests, chat templates, tokenization, batching, and improved customization. This makes API support more generalizable to new providers and should dramatically speed up API model inference.

    • The URL can be specified by passing base_url to --model_args, for example, base_url=http://localhost:8000/v1/completions. Concurrent requests are controlled with the num_concurrent argument, and tokenization is controlled with tokenized_requests.
    • Other arguments (such as top_p, top_k, etc.) can be passed to the API using --gen_kwargs as usual.
    • Note: Instruct-tuned models, not just base models, can be used with local-completions using --apply_chat_template (either with or without tokenized_requests).
      • They can also be used with local-chat-completions (e.g. with an OpenAI Chat API endpoint), but only local-completions supports loglikelihood tasks (e.g. multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits on prompt/input tokens, which prevents easy measurement of multi-token continuations' log probabilities.
    • example with OpenAI completions API (using vllm serve):
      • lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu
    • example with chat API:
      • lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k
    • We recommend evaluating Llama-3.1-405B models by serving them with vllm and then running the evaluation under local-completions!
  • We've reworked the Task Grouping system to make it clearer when to report an aggregated average score across multiple subtasks and when not to. See the Backwards Incompatibilities section below for more information on the changes and migration instructions.

  • A combination of data-parallel and model-parallel (using HF's device_map functionality for "naive" pipeline parallel) inference using --model hf is now supported. Thank you to @NathanHB and team!
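
The API options above can be combined in a single invocation. As a minimal sketch, the Python helper below assembles such a command line from the flags and arguments named in these notes. The helper names (format_args, build_lm_eval_command) are hypothetical and not part of lm-eval, the top_p value is an arbitrary placeholder, and the sketch only prints the command rather than running it.

```python
import shlex
from typing import Dict, List, Optional


def format_args(args: Dict[str, object]) -> str:
    """Join key/value pairs into the comma-separated key=value form."""
    return ",".join(f"{key}={value}" for key, value in args.items())


def build_lm_eval_command(
    model: str,
    model_args: Dict[str, object],
    tasks: List[str],
    gen_kwargs: Optional[Dict[str, object]] = None,
    apply_chat_template: bool = False,
) -> List[str]:
    """Assemble an lm_eval argument list (hypothetical helper, illustration only)."""
    cmd = [
        "lm_eval",
        "--model", model,
        "--model_args", format_args(model_args),
        "--tasks", ",".join(tasks),
    ]
    if gen_kwargs:
        cmd += ["--gen_kwargs", format_args(gen_kwargs)]
    if apply_chat_template:
        cmd.append("--apply_chat_template")
    return cmd


cmd = build_lm_eval_command(
    model="local-completions",
    model_args={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "base_url": "http://localhost:8000/v1/completions",
        "num_concurrent": 10,
        "tokenized_requests": True,
    },
    tasks=["gsm8k"],
    gen_kwargs={"top_p": 0.9},  # placeholder sampling value
    apply_chat_template=True,
)
# Print a shell-quoted command that can be copied into a terminal.
print(shlex.join(cmd))
```

Building the argument list programmatically keeps the comma-separated --model_args string consistent when sweeping over values such as num_concurrent.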

Other new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!
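
As background for the note above on why loglikelihood tasks need a completions-style endpoint: scoring a multiple-choice answer amounts to summing the log probabilities of the continuation tokens given the prompt, which requires per-token log probabilities over the input text. The sketch below illustrates the arithmetic only. The function names and numbers are made up for illustration, and this is not the harness's actual implementation.

```python
from typing import Dict, List, Tuple


def continuation_logprob(token_logprobs: List[float], num_context_tokens: int) -> float:
    """Sum the log-probs of the tokens that follow the context."""
    return sum(token_logprobs[num_context_tokens:])


def pick_choice(scored: Dict[str, Tuple[List[float], int]]) -> str:
    """Return the choice whose continuation has the highest total log-prob."""
    return max(scored, key=lambda choice: continuation_logprob(*scored[choice]))


# Illustrative (made-up) per-token log-probs for "context + choice".
# The first 3 entries of each list belong to the shared context.
scored = {
    "Paris": ([-2.0, -1.5, -0.5, -0.25], 3),
    "Rome": ([-2.0, -1.5, -0.5, -1.0, -2.0], 3),
}
best = pick_choice(scored)
print(best)
```

A chat-style endpoint that returns log probabilities only for newly generated tokens cannot supply the per-token values this calculation needs for a fixed, pre-specified continuation.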

New Tasks

A number of new tasks have been contributed to the library.

As a further discoverability improvement, lm_eval --tasks list now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.

New tasks as of v0.4.4 include:

Backwards Incompatibilities

tags versus groups, and how to migrate

Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for "parent" tasks like mmlu to aggregate and report a unified score across a set of component "subtasks".

There were two ways to add a task to a given group name: 1) to provide (a list of) values to the group field in a given subtask's config file:

# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...

or 2) to define a "group config file" and specify a group along with its constituent subtasks:

# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
  # ...

These would both have the same effect of reporting an averaged metric for group_name1 when calling lm_eval --tasks group_name1. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.

We've now separated these two use-cases ("shorthand" groupings and hierarchical subtask collections) into two distinct properties, tag and group!

To register a shorthand (now called a tag), simply rename the group field within your task's config to tag (group_alias keys will no longer be supported in task configs):

# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...
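
For repositories with many task configs, the rename can be scripted. The following is a rough sketch under stated assumptions: migrate_task_config is a hypothetical helper (not part of lm-eval), it works on raw text rather than parsing YAML, and it treats a file as a task config only when the top-level task key has a scalar value on the same line. Group config files, where task is followed by a list, are left untouched.

```python
import re

# Top-level `group:` key (no indentation). `group_alias:` does not match.
_GROUP_KEY = re.compile(r"^group([ \t]*:)", flags=re.MULTILINE)
# Top-level `task:` key with a scalar value on the same line,
# e.g. `task: my_task1` (not a block list and not an inline `[...]` list).
_SCALAR_TASK = re.compile(r"^task[ \t]*:[ \t]*[^\s#\[]", flags=re.MULTILINE)


def migrate_task_config(yaml_text: str) -> str:
    """Rename the top-level `group` key to `tag` in a *task* config.

    Text that looks like a group config is returned unchanged.
    """
    if not _SCALAR_TASK.search(yaml_text):
        return yaml_text  # looks like a group config; leave as is
    return _GROUP_KEY.sub(r"tag\1", yaml_text, count=1)


task_yaml = "# this is a *task* yaml file.\ngroup: group_name1\ntask: my_task1\n"
group_yaml = "group: group_name1\ntask:\n  - subtask_name1\n  - subtask_name2\n"

print(migrate_task_config(task_yaml))
print(migrate_task_config(group_yaml))
```

Any group_alias keys still need to be removed by hand, and it is worth reviewing the resulting diff before committing.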

Group config files may remain as is if aggregation is not desired. To opt in to reporting aggregated scores across a group's subtasks, add the following to your group config file:

# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
  # ...
### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True # `True` reports a *micro*-averaged score (subtasks weighted by size), `False` a *macro*-averaged one. Defaults to `True`.
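
The difference between micro- and macro-averaging referred to by weight_by_size can be illustrated with a small worked example. This sketch assumes that weight_by_size: True corresponds to weighting each subtask's score by its number of examples (a micro-average) and False to a plain mean over subtasks (a macro-average). The function name and the scores and sizes are invented for illustration, and this is not the harness's own aggregation code.

```python
from typing import Sequence


def aggregate_scores(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask scores into one group score."""
    if weight_by_size:
        # Micro-average: each subtask counts in proportion to its size.
        return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
    # Macro-average: every subtask counts equally.
    return sum(scores) / len(scores)


# Two hypothetical subtasks: a small one and a large one.
scores = [0.5, 0.9]
sizes = [100, 300]

micro = aggregate_scores(scores, sizes, weight_by_size=True)   # about 0.8
macro = aggregate_scores(scores, sizes, weight_by_size=False)  # about 0.7
print(micro, macro)
```

The two averages diverge whenever subtask sizes are unbalanced, which is one reason aggregation is now opt-in instead of automatic.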
  

Please see our documentation here for more information. We apologize for any headaches this migration may create. However, we believe separating these two functionalities will make it less likely for users to encounter confusion or errors caused by unintended aggregation.

Future Plans

We plan to make more planning documents public and to standardize on (likely) one new PyPI release per month! Stay tuned.

Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)

What's Changed

New Contributors

Full Changelog: v0.4.3...v0.4.4