CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

CLongEval is a Chinese benchmark for evaluating long-context LLMs, characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels.

<div align="center"> <img src="assets/schema.png" border="0" width=400px/> </div>

Dataset Statistics

The figure below presents statistics for each task. We stratify the benchmark into small, medium, and large subsets: the small set primarily contains test data with lengths from 1K to 16K tokens, the medium set mainly covers lengths from 16K to 50K tokens, and the large set primarily spans lengths from 50K to 100K tokens.

<div align="center"> <img src="assets/statistics.png" border="0" width=600px/> </div>

Benchmark Results

The tables below report model scores on the three subsets, computed with automated evaluation metrics. The evaluation of GLM-4-128K was conducted up to the cut-off date of February 21, 2024, whereas the other models were evaluated by February 15, 2024.

Small Set

| Model | LStQA | LCvMem | LStSum | StNLab | StTDet | KpRet | TblQry |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zh-LLAMA2-7B-64K | 29.34 | 41.40 | 10.29 | 0.59 | 0 | 2.86 | 7.50 |
| Zh-Alpaca2-7B-64K | 35.52 | 29.34 | 14.29 | 4.97 | 0.09 | 6.39 | 9.75 |
| Qwen-7B-32K | 31.94 | 47.71 | 11.20 | 4.31 | 0 | 11.18 | 6.64 |
| ChatGLM3-6B-32K | 49.36 | 53.40 | 16.37 | 0.46 | 0.91 | 33.67 | 22.60 |
| InternLM2-7B-32K | 49.55 | 58.34 | 17.29 | 16.46 | 2.27 | 21.87 | 20.75 |
| InternLM2-20B-32K | 53.82 | 57.41 | 17.00 | 11.16 | 0.91 | 34.97 | 17.25 |
| GLM-4-128K | 52.74 | 46.74 | 20.29 | 87.93 | 17.40 | 81.47 | 73.25 |
| Moonshot-v1-32K | 60.21 | 51.76 | 21.56 | 89.01 | 25.36 | 86.74 | 66.50 |
| GPT-4-Turbo-128K | 66.19 | 63.42 | 21.96 | 79.70 | 38.35 | 84.24 | 82.35 |

Medium Set

| Model | LStQA | LCvMem | LStSum | StNLab | StTDet | KpRet | TblQry |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zh-LLAMA2-7B-64K | 16.90 | 26.30 | 7.74 | 0 | 0 | 1.21 | N/A |
| Zh-Alpaca2-7B-64K | 18.41 | 22.45 | 8.56 | 0 | 0 | 0.93 | N/A |
| InternLM2-7B-200K | 29.59 | 32.07 | 8.13 | 0 | 0 | 1.45 | 4.50 |
| InternLM2-20B-200K | 25.13 | 36.84 | 13.99 | 0 | 0 | 1.64 | 6.25 |
| Moonshot-v1-128K | 51.20 | 38.29 | 18.81 | 86.30 | 11.33 | 78.64 | 66.50 |
| GPT-4-Turbo-128K | 52.63 | 54.18 | 17.38 | 37.40 | 9.32 | 22.34 | 52.76 |

Large Set

| Model | LStQA | LCvMem | LStSum | StNLab | StTDet | KpRet | TblQry |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternLM2-7B-200K | 19.03 | 18.16 | 2.36 | 0 | 0 | 0.89 | 2.67 |
| InternLM2-20B-200K | 15.62 | 28.39 | 8.31 | 0 | 0 | 0.51 | 0.67 |
| Moonshot-v1-128K | 41.52 | 32.59 | 16.38 | 78.48 | 4.33 | 51.50 | 52.00 |

Reproducing Main Results

Downloading Data

We have uploaded CLongEval to Hugging Face. The files can be downloaded from this link and manually placed in the data directory.
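
If you prefer to script the download, the sketch below uses huggingface_hub; the repository ID is a placeholder (replace it with the actual CLongEval dataset repository on Hugging Face), and the target directory is assumed to be data as described above.

```python
# Minimal download sketch using huggingface_hub. The repo_id below is a
# placeholder, not the real repository ID.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<clongeval-dataset-repo>",  # placeholder: CLongEval repo on Hugging Face
    repo_type="dataset",
    local_dir="data",                    # place the files in the data directory
)
```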

Inference

We use the lmdeploy framework for InternLM2-series inference, Hugging Face's native generation methods for Qwen-7B-32K inference, and the vLLM framework for the other open-source models. Before running inference, modify the model paths in config/model2path.json to ensure the models load correctly from local paths; an illustrative entry is shown below. Our code is adapted from LongBench.
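
For illustration, config/model2path.json is expected to map each --model_name value to a local checkpoint path. The entry below is a sketch with placeholder paths; the key names are assumed to match the --model_name arguments used in the commands that follow.

```json
{
  "internlm2-7b-200k": "/path/to/internlm2-7b-200k",
  "<other-model-name>": "/path/to/another/checkpoint"
}
```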

Taking InternLM2-7B as an example, use the following command for single-GPU inference:

```bash
python inference.py --model_name internlm2-7b-200k --size small.jsonl --dataset_name long_story_qa --gpu_memory_utilization 0.8 --tensor_parallel_size 1 --gpus "0"
```

For multi-GPU inference, use the following command:

```bash
python inference.py --model_name internlm2-7b-200k --size small.jsonl --dataset_name long_story_qa --gpu_memory_utilization 0.8 --tensor_parallel_size 2 --gpus "0,1"
```

See inference_example.sh for the complete commands to run inference for InternLM2-7B and obtain all results. After the above command completes, the inference results are saved in inference_results/internlm2-7b-200k/. We also provide the inference results of GPT-4-Turbo and Moonshot-v1 in inference_results/.

Evaluation

Use the following command to obtain the model's performance on a specific dataset:

```bash
python eval.py --model_name internlm2-7b-200k --datasets long_story_qa
```

The evaluated scores will be saved in eval_results/internlm2-7b-200k/.
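
The exact on-disk format of the score files is determined by eval.py; as a minimal sketch, assuming each dataset's scores are written as a JSON file in that directory, they could be inspected like this:

```python
# Hedged sketch: assumes eval.py writes one JSON score file per dataset under
# eval_results/<model_name>/ (the file naming and format are assumptions).
import json
from pathlib import Path

results_dir = Path("eval_results/internlm2-7b-200k")
for path in sorted(results_dir.glob("*.json")):
    with path.open(encoding="utf-8") as f:
        print(path.stem, json.load(f))
```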

Reproducing Visualization

Citation

If you find CLongEval useful in your research, please consider citing:

```bibtex
@misc{qiu2024clongeval,
      title={CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models},
      author={Zexuan Qiu and Jingjing Li and Shijue Huang and Wanjun Zhong and Irwin King},
      year={2024},
      eprint={2403.03514},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Acknowledgement