Awesome

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Data Collection: L-Eval (preview on 🤗 HuggingFace Datasets • check our 📃 paper ) is a comprehensive Long Context Language Models (LCLMs) evaluation suite with 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input length (3k～200k tokens). L-Eval has 2 groups: closed-ended tasks and open-ended tasks. The closed-ended group primarily tests the reasoning and understanding ability regarding a longer context, and the open-ended group consists of more summarization tasks that require aggregation of long document information (download the data).

Long Context LLMs Evaluation: Closed-ended tasks typically do not present issues with evaluation fairness. However, in real-world long-context tasks, open-ended tasks tend to be more common. We have found that n-gram metrics such as ROUGE and F1 cannot accurately reflect the abilities of LCLMs. As such, L-Eval does not solely rely on metrics used in previous text generation benchmarks. Instead, L-Eval primarily utilizes Length-Instruction-Enhanced (LIE) evaluation, and LLM judges (battling with Turbo-16k or Llama2). Please refer to open-ended tasks evaluation).

We hope L-Eval could help researchers and developers track the progress of long-context language models (LCLMs) and understand the strengths/shortcomings of different methods. We will also keep up with the latest releases of instruction-following LCLMs.

Other features of this repo:

🧭️ Handle CUDA OOM with memory-efficient inference
🖇️ Build a retrieval-based baseline with Langchain
✏️ Flask web client for editing local jsonl files
🔖 View the Leaderboard
📨 How to submit your results
Previous long sequence datasets used in L-Eval

Long context abilities of LLMs on closed/open-ended tasks:

🔥 Updates of L-Eval

[2024-4-25] We add the results for Llama3 8b/70b.

Model	TOEFL	QuALITY	Coursera	SFiction	GSM	CodeU
Llama3-8b-Instruct	82.89	64.85	53.77	69.53	79.00	2.22
Llama3-70b-Instruct	84.75	80.19	75.87	72.65	90.00	6.67
GPT4-32k (2023)	84.38	82.17	75.58	74.99	96.00	25.55

[2023-10-7] Final version of our paper can be found here.
[2023-8-30] We have annotated two new closed-ended tasks: (i) A scientific fiction dataset to test the loyalty to input and (ii) a code understanding dataset. 📢 L-Eval has been supported by OpenCompass. You can test L-Eval together with other benchmarks for foundation models here.

Folders

The repository is structured as follows:

├── Baselines/ # scripts to generate the prediction files with baseline models
├── Baselines-light/ # scripts to generate the prediction files with 24G gpus
├── Evaluation/ # evaluation scripts
├── LEval-data/ # test samples
│   ├── Closed-ended-tasks/ # exact match tasks (like multiple-choice)
│   │   ├── test_file.jsonl 
│   │   └── ...
│   ├── Open-ended-tasks/ # generation tasks
│   │   ├── test_file.jsonl
│   │   └── ...
├── Predictions/ # output of models
│   ├── exam_eval/turbo-16k-0613
│   │              ├── <task_name>.pred.jsonl
│   │              └── ... 
│   ├── llm_gpt4_eval  
│   │             ├──<model_name>.pred.jsonl
│   ├── ngram_eval  
│   │             ├──model_name
│   │                     └──task_name.pred.jsonl
│   ├── ...
└── Tools/ # useful scripts

Quick use

Step 1. Download the data

It is easy to load the 20 test data in one line with huggingface datasets, and we give the example scripts:

from datasets import load_dataset, disable_caching

datasets = ["coursera", "gsm100", "quality", "topic_retrieval_longchat", "tpo", "codeU", "sci_fi" ,"financial_qa", "gov_report_summ", "legal_contract_qa", "meeting_summ", "multidoc_qa", "narrative_qa", "natural_question", "news_summ", "paper_assistant", "patent_summ", "review_summ", "scientific_qa", "tv_show_summ"]
# The corresponding NAMEs in the paper
# "coursera", "GSM(16-shot)", "QuALITY", "TopicRet", "TOFEL", "codeU", "SFiction", "LongFQA", "GovReport", "CUAD", "QMSum", "MultiDoc2Dial", "NarrativeQA", "NQ", "Multi-news", "Openreview", "BigPatent", "SPACE", "Qasper", "SummScreen"]

for testset in datasets:
    # disable_caching()  uncomment this if you cannot download codeU and sci_fi 
    data = load_dataset('L4NLP/LEval', testset, split='test')
    # evaluate your model

You can also directly clone this repo:

git clone https://github.com/OpenLMLab/LEval.git

The test data is in LEval-data.

Each long document has multiple queries and corresponding responses. The format of each sample is as follows:

{
    "instructions": ["What is the main goal of data science?\nA. Analyze and predict future trends\nB. Generate massive amounts of data\nC. Answer questions using data\nD. Increase the use of technology", "..."], // a list of instructions (questions need LLMs to answer)
    "outputs": ["C","A", "..."], // the ground truth or reference of corresponding instructions
    "input": "A very long document", // LLMs need to respond to instructions based on this long document.
    "source": "domain the document belongs to", // meeting, narrative_qa, etc.
    "evaluation": "Metrics used for evaluation" // e.g., exam, human, LLM, ROUGE, F1, etc.
}

Step 2. Generate your prediction results (Closed-ended tasks)

Examples of closed-ended tasks

Multiple Choice Question (single correct option). Example predicted answer: A, BCD
Math Word Problems. Example predicted answer: 3

We test all the baselines with a single 80G A800 GPU. If you encounter the OOM problem, please refer to multiple GPUs inference. To generate the output files, you need to add a new file to Baseline folder and then replace the model name with your own model. An example of testing gpt3.5-turbo-16k on closed-ended tasks:

python Baselines/turbo16k-test.py  --metric exam_eval (for closed-ended group)  --task_name quality [Optional, if you only want to test one task]

The script will save the prediction results to a local file. You need to press enter to confirm the path. Details about open-ended tasks can be found in the next section.

Step 3. Evaluate the prediction file

Given the prediction file generated in Step 2, please run the following command to calculate the metric:

python Evaluation/auto_eval.py --pred_file Predictions/exam_eval/turbo-16k-0613/quality.pred.jsonl

Evaluating LCLMs on open-ended tasks

In this part, we mainly introduce how to evaluate LCLMs on open-ended tasks.

Examples of open-ended tasks

Summarization. Example predicted answer: This paper proposes a new method for ...
Abstractive question answering. Example predicted answer: The main goal of data science is to answer questions using data.

Generate prediction results on open-ended tasks:

CMD: python Baselines/turbo16k-test.py --metric ngram_eval (for open-ended group)  --task_name narrative_qa [Optional, if you only want to test one task]

Generate prediction results on the 96-question subset (GPT-4 evaluation subset):

CMD: python Baselines/turbo16k-test.py --metric llm_gpt4_eval

Generate prediction results on the 85-question subset (human evaluation subset):

CMD: python Baselines/turbo16k-test.py --metric human_eval

Generate prediction results on the 2 subsets (181 questions) :

CMD: python Baselines/turbo16k-test.py --metric llm_turbo_eval

Automatic Metrics

we use the following automatic metrics to evaluate the performance of generation tasks:

GPT-4/3.5 Evaluation. We suggest using GPT-4 as a judge and battling with turbo-16k-0613. We report the win-rate in our paper. Turbo-16k serves as a strong baseline, and you could also opt for Llama2-4k to directly demonstrate the extent of your improvements.

python Evaluation/llm_eval.py --pred_file Predictions/ngram_eval/vicuna-13b-16k/narrative_qa.pred.jsonl --judge_model gpt-4 (or gpt-3.5-turbo) --battle_with Predictions/ngram_eval/turbo-16k-0613 (or llama2-13b-chat)/narrative_qa.pred.jsonl

Please add the following judgment prompt in Long context settings:

Additional details or information that are not mentioned in the reference answer cannot be considered as advantages and do not let them sway your judgment.

N-gram Match Evaluation (biased), traditional automatic metrics like F1, ROUGE, is very cheap and efficient to calculate. However, they are biased towards the length of the predicted answer.

python Evaluation/auto_eval.py --pred_file Predictions/ngram_eval/vicuna-13b-16k/narrative_qa.pred.jsonl

❗ Length-Instruction-Enhanced Evaluation

For open-ended tasks, models are informed of the ground truth length via a length instruction,e.g, We need a 20 words summary where 20 is the length of reference answer to reduce the length bias in automatic metrics. The figure below shows the improvement in Kendall-Tau correlation with human assessment brought by length-instruction-enhanced evaluation.

Human evaluation

we provide a very easy-to-use flask web app running on localhost 127.0.0.1:5000. You need to copy your prediction file <model_name>.pred.jsonl (samples with evaluation: human) to the Predictions/human_eval folder and then run:

python Evaluation/web_human_eval.py  --mode begin (or continue)

where --mode denotes whether you are starting a new evaluation or continuing your previous annotation. Feel free to close the browser and set --mode continue to continue from your last annotation. Once running the script, you have to provide the annotator name and your annotation results will be saved to Predictions/human_eval/annotation_from_<name>.jsonl. See the running screenshot here. We have provided the prediction files from 5 popular models as baselines for human evaluation. if you want to add outputs from other baselines, you can also move the corresponding prediction file to the Predictions/human_eval folder.

Statistics of the data:

How to Submit

The leaderboard contains 5 parts: Exact Match, GPT-4 evaluator, GPT-3.5 Evaluator, F1, ROUGE,

To submit your results on our leaderboard, you can send an email to levalbenchmark@gmail.com.

Your submission should include 4 things:

Metadata: Model name, number of parameters, and links to your paper/blog/GitHub/demo.
Output files: Please submit 1 folder named with your model (e.g., Predictions/turbo-16k-0613 ) for ngram matching evaluation and a jsonl file, e.g., Predictions/LLM_Eval/claude100k.pred.jsonl(The file naming format is model_name.pred.jsonl) for LLM evaluation, as described in Evaluation scripts section.
Results: Please submit the results produced by our evaluation scripts. Results should contain all keys in the leaderboard.
Judgements from turbo3.5 and gpt4 (The output file produced by llm_eval.py)

We will randomly verify some results with the submitted output files.

Explanation of keys in the leaderboard

Keys in Exact Match
- Avg: averaging over 4 datasets performance score.
- Max-Ctx: the maximum context length of your model.
- Tokens: the number of input tokens in experiments.
- Ret.: whether using retrieval.
- PE: whether doing prompt engineering (e.g., modifying the original prompt to improve the performance, providing in-context examples).
- IDD: whether using in-domain data (e.g. data from qmsum, narrative_qa training set) into further finetuning. Please don't hack this evaluation set. But considering most of the sources are open, if your dataset potentially contains some in-domain data, you don't need to remove them. In that case, please set this value to 'yes'. If the construction of the IFT data is not transparent, you can leave it blank.
Keys in F1_and ROUGE
- F1 avg: the average over each dataset’s overall F1 score on QA-style tasks
- ROUGE avg: the average over each dataset’s overall ROUGE-L score on Summarization-style tasks.
- Length: the average length of the generated outputs.
Keys in GPT-4/3.5 Evaluator
- n_wins: number of wins including results of swapping the position of two answers.
- n_draws number of draws including results of swapping the position of two answers.
- win % vs turbo16k The win rate of your model in the battle with turbo-16k-0613
- Length: the average length of the generated outputs.

Memory-efficient inference and multiple GPUs inference

Using Flash Attention during inference 🚀

Please first try Flash Attention if you have a 80G GPU. Based on our experiments, it works well when the sequence length is less than 32k (Flash-attn v2). if you still encounter OOM, please refer to the next section. If you are using LLaMA, we support FlashAttention in inference which can save your gpu memory, please add the param --flash. The code is similar for other models. Flash attention for Chatglm is implemented with torch2.0. Please ensure that you have successfully installed it.

If you encounter installation issues, it's likely due to the CUDA and Torch versions mismatch. Here is my running env:

python>=3.8
torch==1.13.1+cu117
CUDA Driver Version: 525.105.17   CUDA Toolkit: 11.7
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/
[if flashAttn-v1] git checkout tags/v1.0.0 
python setup.py install

python Baselines/longchat-test.py --task_path LEval-data/Open-ended-tasks/narrative.jsonl --max_length 16k --gpu 0 --metric ngram_eval --flash

Memory-efficient inference with LightLLM 🚂

Using lightLLM can make the inference procedure on a single or multiple 24G GPUs by optimizing the storage of KV cache but sacrificing inference speed.

Installation

Download L-Eval and the data.
Install LightLLM according to the official instructions.

Examples of running L-Eval with LightLLM

You must first download the model you would like to evaluate. LightLLM does not support automatic downloads yet.

Code for running L-Eval with LightLLM is located in the Baselines-light directory.

The following command evaluates vicuna-7b-v1.5-16k on 4 RTX 3090 GPUs.

python Baselines-lightllm/vicuna-test.py --metric exam_eval --max_length 16k --model_path /.../.../vicuna-7b-v1.5-16k/ --lightllm_extra_args "--tp 4 --max_total_token_num 130000 --trust_remote_code --max_req_total_len 16384 --max_req_input_len 15900"

You don't actually need 4 GPUs to run this example. But performance will improve with more GPUs.

--lightllm_extra_args are extra arguments passed to LightLLM server. View the LightLLM documentation for more information on how to set these arguments. model_dir is automatically passed and do not need to be specified again.

The script assumes LightLLM server is listening on port 8000.

Known Issues

LightLLM server process might not properly terminate after the evaluation script stops. If you don't have other Python processes running, you can run killall -HUP python to terminate LightLLM server.

Other Tools

Using Langchain to build retrieval-based baselines

You can use the script turbo4k-retrieve-test.py in Baselines to enhance a regular LLM with a sparser or dense retriever. An example is as follows:

python Baselines/turbo4k-retrieve-test.py --metric exam_eval (or ngram_eval, human_eval, llm_turbo_eval, llm_gpt4_eval) --retriever BM25 (or AdaEmbedding)

The retrieval-based method is implemented with langchain. If you want to use BM25 retriever, please first install Elasticsearch. If you want to try ada embedding (cheap but effective), please fill your api-key.

A flask-based annotation website for jsonl files

We have also released a very easy-to-use annotation website for L-Eval and make sure you have installed flask. Firstly, you have to preprocess your files into a jsonl format which should contains 3 keys input:str, instructions:list and, outputs:list (see the examples in LEval-data folder). To annotate new instruction-output pairs, please run the script to view and annotate the local jsonl file: Start running the website on 127.0.0.1:5000 by:

python Tools/web_annotate_jsonl.py --path LEval-data/Generation/meeting_summ.jsonl --mode begin --new_pairs_num 2

where --new_pairs_num means the number of new QA pairs you want to add and --mode (begin or continue) means whether you want to continue from previous annotation results. The input file denoted by --path should be a jsonl file like the examples in LEval-data folder. In this case, we annotate two new QA pairs based on the long input. After clicking submit, the results will be saved to the disk.

Example of our annotation website

Example of the human evaluation website

<div align="center"> <img src="figs/human_eval.png" border="0" width=660px/> </div> You can score the outputs from different models via the website. After completing the annotation, the result page is like: <div align="center"> <img src="figs/res_page.png" border="0"/> </div>

Acknowledgement

<a name="ack"></a> This work is done by Fudan University and The University of Hong Kong. Primary contributors: Chenxin An, Shansan Gong, Ming Zhong, Xingjian zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu.

We would like to express our gratitude towards Siyu Ren, Zhiyong Wu, Qinyuan Cheng, Bo Wang, and Yukang Chen for their valuable suggestions and insights!

We sincerely appreciate the assistance provided by the following works for L-Eval:

We download the videos to form the long documents from Coursera website
we extract 100 math problems from GSM8k and we use 8 long examples from {chain-of-thought-hub
topic retrieval data is collected from LongChat
QuALITY is from their official github
TOEFL Practice Online data comes from TOEFL-QA Other open-sourced datasets are collected from: gov_report, cuad, qmsum, Multidoc2dial narrativeQA, Natural Questions, review advisor, multi-news bigpatent, SPACE, Qasper, SummScreen

Please kindly cite the original papers when using L-Eval. Thanks again for their effort!!

We are very pleased to answer any questions about L-Eval: cxan20@fudan.edu.cn

Citation

@misc{an2023leval,
      title={L-Eval: Instituting Standardized Evaluation for Long Context Language Models}, 
      author={Chenxin An and Shansan Gong and Ming Zhong and Mukai Li and Jun Zhang and Lingpeng Kong and Xipeng Qiu},
      year={2023},
      eprint={2307.11088},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}