Added caseHOLD task #2570

Open · wants to merge 1 commit into main
54 changes: 54 additions & 0 deletions lm_eval/tasks/casehold/README.md
@@ -0,0 +1,54 @@
# CaseHOLD

### Paper

Title: `When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset`

Abstract: `https://arxiv.org/abs/2104.08671`

CaseHOLD (Case Holdings On Legal Decisions) is a legal dataset comprising over 53,000 multiple-choice questions that ask for the relevant holding of a cited case. The task is fundamental for lawyers, and it is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). The citing context from the judicial decision serves as the prompt for the question, and the answer choices are holding statements derived from citations following the text in a legal decision. Each citing context has five answer choices: the correct answer is the holding statement that corresponds to the citing context, and the four incorrect answers are other holding statements.
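
For orientation, here is a minimal sketch (not part of this PR) of loading the dataset from the Hugging Face Hub and inspecting one record; the column names shown (`citing_prompt`, `holding_0`–`holding_4`, `label`) are the ones referenced by `utils.py` in this task folder.

```python
# Sketch: load CaseHOLD and inspect one example (column names per utils.py below).
from datasets import load_dataset

# trust_remote_code mirrors the dataset_kwargs set in casehold.yaml
ds = load_dataset("casehold/casehold", split="validation", trust_remote_code=True)

example = ds[0]
print(example["citing_prompt"])        # the citing context used as the prompt
for i in range(5):
    print(i, example[f"holding_{i}"])  # the five candidate holding statements
print("gold index:", example["label"]) # index of the correct holding
```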


Homepage: `https://github.com/reglab/casehold`

### Citation

```
@inproceedings{zhengguha2021,
title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
year={2021},
eprint={2104.08671},
archivePrefix={arXiv},
primaryClass={cs.CL},
booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
publisher={Association for Computing Machinery}
}
```

### Groups, Tags, and Tasks

#### Groups

None.

#### Tags

None.

#### Tasks

* `casehold`: Complete the following excerpt from a US court opinion. Multiple choice with 5 options.
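
As a rough usage sketch (not part of this PR), the task should be runnable through the harness's Python API once it is registered; the model below is only a placeholder.

```python
# Sketch: evaluate a small HF model on CaseHOLD via the harness's Python API.
# "EleutherAI/pythia-160m" is just a placeholder model for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["casehold"],
    num_fewshot=0,
)
print(results["results"]["casehold"])  # acc, acc_norm, and micro_f1
```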

### Checklist

For adding novel benchmarks/datasets to the library:
* [X] Is the task an existing benchmark in the literature?
* [X] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
  - Note: I am running into some problems while reproducing the results; I will update the checklist once they are resolved.

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
28 changes: 28 additions & 0 deletions lm_eval/tasks/casehold/casehold.yaml
@@ -0,0 +1,28 @@
tag:
  - multiple_choice
task: casehold
dataset_path: casehold/casehold
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{prompt}}"
doc_to_target: "{{label}}"
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    aggregation: !function utils.micro_f1_score
    higher_is_better: true
    name: micro_f1
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
39 changes: 39 additions & 0 deletions lm_eval/tasks/casehold/utils.py
@@ -0,0 +1,39 @@
import datasets
from evaluate import load


# Column names of the five candidate holding statements in the CaseHOLD dataset.
choices_keys = [
    "holding_0",
    "holding_1",
    "holding_2",
    "holding_3",
    "holding_4",
]


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    """Build the prompt, the five answer choices, and the gold index for each doc."""

    def _process_doc(doc):
        prompt = f"Complete the following excerpt from a US court opinion:\n{doc['citing_prompt']}\nAnswer:"
        choices = [doc[key] for key in choices_keys]
        out_doc = {
            "prompt": prompt,
            "choices": choices,
            "gold": int(doc["label"]),
        }
        return out_doc

    return dataset.map(_process_doc)


def micro_f1_score(items):
    # items is a list of (gold, prediction) pairs collected by the harness.
    f1_metric = load("f1")
    golds, preds = list(zip(*items))
    f1_score = f1_metric.compute(references=golds, predictions=preds, average="micro")[
        "f1"
    ]
    return f1_score


def macro_f1_score(items):
    # Same as above, but with macro averaging over the five classes.
    f1_metric = load("f1")
    golds, preds = list(zip(*items))
    f1_score = f1_metric.compute(references=golds, predictions=preds, average="macro")[
        "f1"
    ]
    return f1_score
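
As a quick sanity check (not part of the PR), the helpers above can be exercised on a tiny in-memory dataset; the record below is made up and only mirrors the CaseHOLD column names.

```python
# Sketch: exercise process_docs and micro_f1_score on toy data (made-up record).
import datasets
from utils import process_docs, micro_f1_score  # assumes this file is importable as `utils`

toy = datasets.Dataset.from_list([
    {
        "citing_prompt": "The court held that ... (citation omitted)",
        "holding_0": "holding that A",
        "holding_1": "holding that B",
        "holding_2": "holding that C",
        "holding_3": "holding that D",
        "holding_4": "holding that E",
        "label": "2",
    }
])

processed = process_docs(toy)
print(processed[0]["prompt"])
print(processed[0]["choices"])
print(processed[0]["gold"])  # -> 2

# The harness hands the aggregation function a list of (gold, prediction) pairs.
print(micro_f1_score([(2, 2), (0, 1), (3, 3)]))  # -> 0.666...
```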