Add support for generative answering of multiple_choice tasks #2601
Conversation
For reference, some tinyBenchmark results with Claude:
doc_system_instruction += " " | ||
if multiple_choice_generate == "abcd": |
May I suggest not hardcoding these. What if doc_system_instruction is supposed to use some other delimiter? What if the set of choices is not 4 letters, not these 4 letters, or not letters at all? This framework supports external tasks and already has multiple forks, so there may be (I am not saying "are", since I have no proof at hand) multiple-choice tasks set up differently from "abcd".
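For illustration, a minimal sketch of what a non-hardcoded version could look like. The helper name, its parameters, and the idea of passing in the task's own labels and delimiter are hypothetical here, not something this PR implements:

```python
# Hypothetical sketch only: build the generative-answer instruction from the
# task's own choice labels and delimiter instead of hardcoding "abcd".
def build_mc_instruction(choice_labels, delimiter=" "):
    """choice_labels: the labels the task actually uses, e.g. ["A", "B", "C", "D"],
    ["1", "2", "3", "4"], or localized strings; delimiter: whatever separator
    doc_system_instruction is supposed to use."""
    labels = ", ".join(str(label) for label in choice_labels)
    return (
        delimiter
        + 'Please include "ANSWER: <choice>" in your response, '
        + f"where <choice> is one of: {labels}."
    )
```

A task with numeric options could then call something like `build_mc_instruction(["1", "2", "3", "4"], delimiter="\n")` without touching the harness code.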
doc_system_instruction += "Please include \"ANSWER: <letter>\" in your response with the letter of the correct last answer." | ||
else: | ||
doc_system_instruction += "Please answer with the letter of the correct last answer." |
What about non-English tasks that are already in this repo?
As they stand, multiple_choice tasks cannot be evaluated with popular API models that support only generate_until, not logprob.
This PR introduces a flag that allows these tasks to still be evaluated, via an emulation mode of sorts that simply asks the model to generate the answer. The abcd approach and most of the prompt are borrowed from openai/evals.
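For context, a rough sketch of the emulation idea (names and structure here are assumptions for illustration, not the PR's actual code): format the lettered options into the prompt, call the model through a generate-style API, and parse the letter back out of the completion:

```python
import re
import string

# Illustrative sketch of generative multiple-choice emulation; generate_fn and
# this function are hypothetical stand-ins, not the harness's real API.
def emulate_multiple_choice(question, choices, generate_fn):
    letters = string.ascii_uppercase[: len(choices)]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    prompt = (
        f"{question}\n{options}\n"
        'Please include "ANSWER: <letter>" in your response '
        "with the letter of the correct answer."
    )
    completion = generate_fn(prompt)
    # Fall back to None if the model ignored the requested format.
    match = re.search(r"ANSWER:\s*([A-Z])", completion)
    return letters.index(match.group(1)) if match and match.group(1) in letters else None
```

The predicted index can then be scored against the gold answer much like a logprob-based multiple_choice prediction.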