
Evaluation for Code Language Models

To run perplexity evaluation of GPT-Neo or other Hugging Face models on our test dataset, first install this repository:

pip install -e .

To evaluate Hugging Face models such as GPT-Neo or CodeParrot, specify the checkpoint with pretrained= and run the following:

python main.py --model gpt2 --model_args pretrained=EleutherAI/gpt-neo-2.7B \
	--device cuda:0 --batch_size 1 \
	--tasks code_python,code_c++,code_c#,code_c,code_php,code_go,code_scala,code_java,code_javascript,code_typescript,code_ruby,code_rust
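
These code_* tasks report word_perplexity, byte_perplexity, and bits_per_byte. As a rough sketch of how those numbers relate to the model's total log-likelihood over a document (standard definitions; the harness's exact implementation may differ in small details):

import math

def perplexity_metrics(total_loglikelihood, num_words, num_bytes):
    """Relate a document's summed natural-log likelihood to the reported metrics.

    total_loglikelihood: sum of token log-probabilities (natural log) over the document
    num_words: whitespace-separated word count of the document
    num_bytes: length of the document in UTF-8 bytes
    """
    return {
        "word_perplexity": math.exp(-total_loglikelihood / num_words),
        "byte_perplexity": math.exp(-total_loglikelihood / num_bytes),
        # bits_per_byte is log2 of the byte perplexity, so it is tokenizer-independent.
        "bits_per_byte": -total_loglikelihood / (num_bytes * math.log(2)),
    }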

Language Model Evaluation Harness


Overview

This project provides a unified framework to test autoregressive language models (GPT-2, GPT-3, GPT-Neo, etc.) on a large number of different evaluation tasks.


Install

pip install lm-eval

Basic Usage

To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command:

python main.py \
	--model gpt2 \
	--device cuda:0 \
	--tasks lambada,hellaswag

(This uses the 117M-parameter gpt2 checkpoint by default, per the Hugging Face defaults; use --model_args to specify other GPT-2 sizes.)

Additional arguments can be provided to the model constructor using the --model_args flag. Most importantly, the gpt2 model can be used to load an arbitrary Hugging Face model. For example, to run GPT-Neo use the following:

python main.py \
	--model gpt2 \
	--model_args pretrained=EleutherAI/gpt-neo-2.7B \
	--device cuda:0 \
	--tasks lambada,hellaswag

If you have access to the OpenAI API, you can also evaluate GPT-3:

export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
	--model gpt3 \
	--model_args engine=davinci \
	--tasks lambada,hellaswag

To evaluate mesh-transformer-jax models that are not available on HF, please invoke the eval harness through this script.

Cite as

@software{eval-harness,
  author       = {Gao, Leo and
                  Tow, Jonathan and
                  Biderman, Stella and
                  Black, Sid and
                  DiPofi, Anthony and
                  Foster, Charles and
                  Golding, Laurence and
                  Hsu, Jeffrey and
                  McDonell, Kyle and
                  Muennighoff, Niklas and
                  Phang, Jason and
                  Reynolds, Laria and
                  Tang, Eric and
                  Thite, Anish and
                  Wang, Ben and
                  Wang, Kevin and
                  Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628}
}

Full Task List

Task Name | Val/Test Docs | Metrics
cola | 1043 | mcc
mnli | 9815 | acc
mnli_mismatched | 9832 | acc
mrpc | 408 | acc, f1
rte | 277 | acc
qnli | 5463 | acc
qqp | 40430 | acc, f1
sst | 872 | acc
wnli | 71 | acc
boolq | 3270 | acc
cb | 56 | acc, f1
copa | 100 | acc
multirc | 4848 | acc
record | 10000 | f1, em
wic | 638 | acc
wsc | 104 | acc
coqa | 500 | f1, em
drop | 9536 | em, f1
lambada | 5153 | ppl, acc
lambada_cloze | 5153 | ppl, acc
wikitext | 62 | word_perplexity, byte_perplexity, bits_per_byte
piqa | 1838 | acc, acc_norm
prost | 18736 | acc, acc_norm
pubmedqa | 1000 | acc
sciq | 1000 | acc, acc_norm
qa4mre_2011 | 120 | acc, acc_norm
qa4mre_2012 | 160 | acc, acc_norm
qa4mre_2013 | 284 | acc, acc_norm
triviaqa | 11313 | acc
arc_easy | 2376 | acc, acc_norm
arc_challenge | 1172 | acc, acc_norm
logiqa | 651 | acc, acc_norm
hellaswag | 10042 | acc, acc_norm
openbookqa | 500 | acc, acc_norm
squad2 | 11873 | exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1
race | 1045 | acc
headqa | 2742 | acc, acc_norm
mathqa | 2985 | acc, acc_norm
webqs | 2032 | acc
wsc273 | 273 | acc
winogrande | 1267 | acc
anli_r1 | 1000 | acc
anli_r2 | 1000 | acc
anli_r3 | 1200 | acc
ethics_cm | 3885 | acc
ethics_deontology | 3596 | acc, em
ethics_justice | 2704 | acc, em
ethics_utilitarianism_original | 4808 | acc
ethics_utilitarianism | 4808 | acc
ethics_virtue | 4975 | acc, em
math_algebra | 1187 | acc
math_counting_and_prob | 474 | acc
math_geometry | 479 | acc
math_intermediate_algebra | 903 | acc
math_num_theory | 540 | acc
math_prealgebra | 871 | acc
math_precalc | 546 | acc
arithmetic_2da | 2000 | acc
arithmetic_2ds | 2000 | acc
arithmetic_3da | 2000 | acc
arithmetic_3ds | 2000 | acc
arithmetic_4da | 2000 | acc
arithmetic_4ds | 2000 | acc
arithmetic_5da | 2000 | acc
arithmetic_5ds | 2000 | acc
arithmetic_2dm | 2000 | acc
arithmetic_1dc | 2000 | acc
hendrycksTest-abstract_algebra | 100 | acc, acc_norm
hendrycksTest-anatomy | 135 | acc, acc_norm
hendrycksTest-astronomy | 152 | acc, acc_norm
hendrycksTest-business_ethics | 100 | acc, acc_norm
hendrycksTest-clinical_knowledge | 265 | acc, acc_norm
hendrycksTest-college_biology | 144 | acc, acc_norm
hendrycksTest-college_chemistry | 100 | acc, acc_norm
hendrycksTest-college_computer_science | 100 | acc, acc_norm
hendrycksTest-college_mathematics | 100 | acc, acc_norm
hendrycksTest-college_medicine | 173 | acc, acc_norm
hendrycksTest-college_physics | 102 | acc, acc_norm
hendrycksTest-computer_security | 100 | acc, acc_norm
hendrycksTest-conceptual_physics | 235 | acc, acc_norm
hendrycksTest-econometrics | 114 | acc, acc_norm
hendrycksTest-electrical_engineering | 145 | acc, acc_norm
hendrycksTest-elementary_mathematics | 378 | acc, acc_norm
hendrycksTest-formal_logic | 126 | acc, acc_norm
hendrycksTest-global_facts | 100 | acc, acc_norm
hendrycksTest-high_school_biology | 310 | acc, acc_norm
hendrycksTest-high_school_chemistry | 203 | acc, acc_norm
hendrycksTest-high_school_computer_science | 100 | acc, acc_norm
hendrycksTest-high_school_european_history | 165 | acc, acc_norm
hendrycksTest-high_school_geography | 198 | acc, acc_norm
hendrycksTest-high_school_government_and_politics | 193 | acc, acc_norm
hendrycksTest-high_school_macroeconomics | 390 | acc, acc_norm
hendrycksTest-high_school_mathematics | 270 | acc, acc_norm
hendrycksTest-high_school_microeconomics | 238 | acc, acc_norm
hendrycksTest-high_school_physics | 151 | acc, acc_norm
hendrycksTest-high_school_psychology | 545 | acc, acc_norm
hendrycksTest-high_school_statistics | 216 | acc, acc_norm
hendrycksTest-high_school_us_history | 204 | acc, acc_norm
hendrycksTest-high_school_world_history | 237 | acc, acc_norm
hendrycksTest-human_aging | 223 | acc, acc_norm
hendrycksTest-human_sexuality | 131 | acc, acc_norm
hendrycksTest-international_law | 121 | acc, acc_norm
hendrycksTest-jurisprudence | 108 | acc, acc_norm
hendrycksTest-logical_fallacies | 163 | acc, acc_norm
hendrycksTest-machine_learning | 112 | acc, acc_norm
hendrycksTest-management | 103 | acc, acc_norm
hendrycksTest-marketing | 234 | acc, acc_norm
hendrycksTest-medical_genetics | 100 | acc, acc_norm
hendrycksTest-miscellaneous | 783 | acc, acc_norm
hendrycksTest-moral_disputes | 346 | acc, acc_norm
hendrycksTest-moral_scenarios | 895 | acc, acc_norm
hendrycksTest-nutrition | 306 | acc, acc_norm
hendrycksTest-philosophy | 311 | acc, acc_norm
hendrycksTest-prehistory | 324 | acc, acc_norm
hendrycksTest-professional_accounting | 282 | acc, acc_norm
hendrycksTest-professional_law | 1534 | acc, acc_norm
hendrycksTest-professional_medicine | 272 | acc, acc_norm
hendrycksTest-professional_psychology | 612 | acc, acc_norm
hendrycksTest-public_relations | 110 | acc, acc_norm
hendrycksTest-security_studies | 245 | acc, acc_norm
hendrycksTest-sociology | 201 | acc, acc_norm
hendrycksTest-us_foreign_policy | 100 | acc, acc_norm
hendrycksTest-virology | 166 | acc, acc_norm
hendrycksTest-world_religions | 171 | acc, acc_norm
wmt14-en-fr | 3003 | bleu, chrf, ter
wmt14-fr-en | 3003 | bleu, chrf, ter
wmt16-en-ro | 1999 | bleu, chrf, ter
wmt16-ro-en | 1999 | bleu, chrf, ter
wmt16-de-en | 2999 | bleu, chrf, ter
wmt16-en-de | 2999 | bleu, chrf, ter
wmt20-cs-en | 664 | bleu, chrf, ter
wmt20-de-en | 785 | bleu, chrf, ter
wmt20-de-fr | 1619 | bleu, chrf, ter
wmt20-en-cs | 1418 | bleu, chrf, ter
wmt20-en-de | 1418 | bleu, chrf, ter
wmt20-en-iu | 2971 | bleu, chrf, ter
wmt20-en-ja | 1000 | bleu, chrf, ter
wmt20-en-km | 2320 | bleu, chrf, ter
wmt20-en-pl | 1000 | bleu, chrf, ter
wmt20-en-ps | 2719 | bleu, chrf, ter
wmt20-en-ru | 2002 | bleu, chrf, ter
wmt20-en-ta | 1000 | bleu, chrf, ter
wmt20-en-zh | 1418 | bleu, chrf, ter
wmt20-fr-de | 1619 | bleu, chrf, ter
wmt20-iu-en | 2971 | bleu, chrf, ter
wmt20-ja-en | 993 | bleu, chrf, ter
wmt20-km-en | 2320 | bleu, chrf, ter
wmt20-pl-en | 1001 | bleu, chrf, ter
wmt20-ps-en | 2719 | bleu, chrf, ter
wmt20-ru-en | 991 | bleu, chrf, ter
wmt20-ta-en | 997 | bleu, chrf, ter
wmt20-zh-en | 2000 | bleu, chrf, ter
iwslt17-en-ar | 1460 | bleu, chrf, ter
iwslt17-ar-en | 1460 | bleu, chrf, ter
anagrams1 | 10000 | acc
anagrams2 | 10000 | acc
cycle_letters | 10000 | acc
random_insertion | 10000 | acc
reversed_words | 10000 | acc
pile_arxiv | 2407 | word_perplexity, byte_perplexity, bits_per_byte
pile_books3 | 269 | word_perplexity, byte_perplexity, bits_per_byte
pile_bookcorpus2 | 28 | word_perplexity, byte_perplexity, bits_per_byte
pile_dm-mathematics | 1922 | word_perplexity, byte_perplexity, bits_per_byte
pile_enron | 1010 | word_perplexity, byte_perplexity, bits_per_byte
pile_europarl | 157 | word_perplexity, byte_perplexity, bits_per_byte
pile_freelaw | 5101 | word_perplexity, byte_perplexity, bits_per_byte
pile_github | 18195 | word_perplexity, byte_perplexity, bits_per_byte
pile_gutenberg | 80 | word_perplexity, byte_perplexity, bits_per_byte
pile_hackernews | 1632 | word_perplexity, byte_perplexity, bits_per_byte
pile_nih-exporter | 1884 | word_perplexity, byte_perplexity, bits_per_byte
pile_opensubtitles | 642 | word_perplexity, byte_perplexity, bits_per_byte
pile_openwebtext2 | 32925 | word_perplexity, byte_perplexity, bits_per_byte
pile_philpapers | 68 | word_perplexity, byte_perplexity, bits_per_byte
pile_pile-cc | 52790 | word_perplexity, byte_perplexity, bits_per_byte
pile_pubmed-abstracts | 29895 | word_perplexity, byte_perplexity, bits_per_byte
pile_pubmed-central | 5911 | word_perplexity, byte_perplexity, bits_per_byte
pile_stackexchange | 30378 | word_perplexity, byte_perplexity, bits_per_byte
pile_uspto | 11415 | word_perplexity, byte_perplexity, bits_per_byte
pile_ubuntu-irc | 22 | word_perplexity, byte_perplexity, bits_per_byte
pile_wikipedia | 17511 | word_perplexity, byte_perplexity, bits_per_byte
pile_youtubesubtitles | 342 | word_perplexity, byte_perplexity, bits_per_byte

Usage

Evaluate a task

As in the basic example above, additional arguments can be provided to the model constructor using the --model_args flag. The gpt2 model type can load an arbitrary Hugging Face model; for example, to evaluate GPT-Neo 1.3B with two few-shot examples:

python main.py \
	--model gpt2 \
	--model_args pretrained=EleutherAI/gpt-neo-1.3B \
	--device cuda:0 \
	--tasks lambada,hellaswag \
	--num_fewshot 2

To inspect what the LM inputs look like, you can run the following command:

python write_out.py \
	--tasks all_tasks \
	--provide_description \
	--num_fewshot 5 \
	--num_examples 10 \
	--output_base_path /path/to/output/folder

This will write out one text file for each task.

Code Structure

There are two major components of the library:

  1. LMs (language models), e.g. GPT-2, GPT-3
  2. Tasks, e.g. MNLI, RTE, SQuAD (coming soon)

Both LMs (lm_eval.models) and Tasks (lm_eval.tasks) are kept in a registry data structure, for easy CLI instantiation.

If you want to extend either models or tasks, simply add a new LM or Task subclass, and decorate with the registry decorator.
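
As an illustration only (the class and method names below follow the general shape of the Task API in lm_eval.base, but treat the exact signatures and the registration step as assumptions and check the source before relying on them), a new task might look roughly like this:

from lm_eval.base import Task, rf
from lm_eval.metrics import mean

class MyNewTask(Task):
    VERSION = 0  # bump whenever a breaking change alters the reported metrics

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def validation_docs(self):
        # Normally loaded from the downloaded dataset; hard-coded here for brevity.
        return [{"question": "2 + 2 =", "answer": " 4"}]

    def doc_to_text(self, doc):
        return doc["question"]

    def doc_to_target(self, doc):
        return doc["answer"]

    def construct_requests(self, doc, ctx):
        # rf.loglikelihood yields (log-likelihood, exact-match flag) placeholders
        # that the evaluator fills in after running the model on the context.
        ll, is_greedy = rf.loglikelihood(ctx, self.doc_to_target(doc))
        return ll, is_greedy

    def process_results(self, doc, results):
        ll, is_greedy = results
        return {"acc": float(is_greedy)}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}

# Registration: expose the class under a CLI name in the task registry, e.g.
# TASK_REGISTRY["my_new_task"] = MyNewTask in lm_eval/tasks/__init__.py
# (the registry decorator mentioned above wraps this same step, if present in your version).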

The GPT-3 Evaluations Project tracks our progress implementing new tasks. Right now, we are focused on getting all the datasets loaded so that we can dedupe against the training data; implementing the actual evaluations is a lower priority for the moment.

Task Versioning

To help improve reproducibility, all tasks have a VERSION field. When run from the command line, the version is reported in a column of the results table, or in the "version" field of the evaluator return dict. The purpose of the version is that, if a task definition changes (e.g. to fix a bug), we know exactly which metrics were computed using the old, buggy implementation and can avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of all tasks remains the same as when they were first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one.

When reporting eval harness results, please also report the version of each task. This can be done either with a separate column in the table, or by reporting the task name with the version appended, for example: taskname-v0.
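
For example, assuming the evaluator's return dict carries the per-task versions alongside the results (the exact key names here are an assumption), the versioned names can be generated rather than typed by hand:

def versioned_results(eval_output):
    # Assumed layout: {"results": {task: metrics}, "versions": {task: version}}.
    return {
        f"{task}-v{version}": eval_output["results"][task]
        for task, version in eval_output["versions"].items()
    }

# e.g. a lambada entry would then be reported under the key "lambada-v0"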

Description

1. LM Evaluation

Given an LM, we want to evaluate it on a wide range of NLU tasks. We should at least cover the set of tasks in the GPT-3 paper, and any other tasks/benchmarks that are relevant. We will follow the GPT-3 format of a) zero-shot, b) one-shot, c) few-shot evaluation.

To do this, we need three components:

  1. The data downloader downloads the data for the relevant tasks.
  2. The task formatter turns the task input data into an LM-usable format (a rough sketch of this step follows the list).
  3. The task evaluator scores a task.
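
Put together, the flow is: download the docs, format each one into a zero/one/few-shot prompt, and score the model's output. A minimal sketch of the formatting step in the GPT-3 style (the helper below is illustrative, not the harness's actual API):

def build_fewshot_prompt(description, fewshot_docs, query_doc, doc_to_text, doc_to_target):
    """Assemble a GPT-3-style prompt: task description, k solved examples, then the query."""
    parts = [description] if description else []
    for doc in fewshot_docs:  # k = 0 gives zero-shot, k = 1 one-shot, etc.
        parts.append(doc_to_text(doc) + doc_to_target(doc))
    # The model is scored on how it continues this final, unanswered example.
    parts.append(doc_to_text(query_doc))
    return "\n\n".join(parts)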

2. Removing val/test data from LM training set

With the data downloader in place, we simply need to (1) expose the val/test examples, and (2) remove them from the training set.
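
One simple way to carry out step (2), shown here only to illustrate the idea rather than as the project's actual pipeline, is to index n-grams from the exposed val/test examples and drop any training document that shares one:

def ngrams(text, n=13):
    # Whitespace-tokenized, lower-cased n-grams; the window size and normalization
    # are arbitrary choices for this sketch.
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(training_docs, eval_docs, n=13):
    """Yield only the training documents that share no n-gram with any val/test document."""
    contaminated = set()
    for doc in eval_docs:
        contaminated |= ngrams(doc, n)
    for doc in training_docs:
        if ngrams(doc, n).isdisjoint(contaminated):
            yield doc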

3. Adding task training data to LM training set

This part is the easiest: we write out text files containing the task training data and let the usual LM preprocessing pipeline handle the rest.
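
Concretely, that could be as little as the following sketch (doc_to_text/doc_to_target are the task's own formatting hooks; the file layout is just one possible choice):

def write_training_data(task, path):
    # One formatted prompt-plus-target per blank-line-separated block; tokenization,
    # packing, and shuffling are left to the existing LM preprocessing pipeline.
    with open(path, "w", encoding="utf-8") as f:
        for doc in task.training_docs():
            f.write(task.doc_to_text(doc) + task.doc_to_target(doc))
            f.write("\n\n")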