CLEX: Continuous Length Extrapolation for Large Language Models

This repo provides the official implementation of our paper, "CLEX: Continuous Length Extrapolation for Large Language Models".

<div style='display:flex; gap: 0.25rem; '> <!-- <a href='https://huggingface.co/DAMO-NLP-SG'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-blue'></a> --> <a href='https://huggingface.co/spaces/DAMO-NLP-SG/CLEX-Chat'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a> <a href='https://huggingface.co/papers/2310.16450'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper-blue'></a> </div>

News

Features and Highlights of CLEX

CLEX_diagram

If you have any questions, feel free to contact us. (Emails: guanzzh.chen@gmail.com, lixin4ever@gmail.com)

Model Zoo

<div align="center">

| Model Name | Model Type | Starting Point | Train Data | Train Length | MAX Test Length | HF Repo |
| --- | --- | --- | --- | --- | --- | --- |
| CLEX-LLaMA-2-7B-16K | base | LLaMA-2-7B | Redpajama-Book | 16K | 64K | link |
| CLEX-LLaMA-2-7B-Chat-16K | chat | CLEX-7B-16K | UltraChat | 16K | 64K | link |
| CLEX-LLaMA-2-7B-64K | base | LLaMA-2-7B | Redpajama-Book | 64K | 256K | link |
| CLEX-Phi-2-32K | base | Phi-2-2.7B | LongCorpus-2.5B | 32K | 128K | link |
| CLEX-Mixtral-8x7B-32K | base | Mixtral-8x7B-v0.1 | LongCorpus-2.5B | 32K | >128K | link |
| CLEX-Mixtral-8x7B-Chat-32k | chat | CLEX-Mixtral-8x7B-32K | Ultrachat 200k | 32K | >128K | link |

</div>

Supported LLMs

Usage

Environment Setup

conda create -yn clex python=3.9
conda activate clex

git clone https://github.com/DAMO-NLP-SG/CLEX.git
cd CLEX
pip install -r requirements.txt
# install flash-attn separately
pip install flash-attn --no-build-isolation

Code Snippet for Minimal Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True lets transformers load the custom model code from the Hub repo.
tokenizer = AutoTokenizer.from_pretrained("DAMO-NLP-SG/CLEX-7B-16K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
  "DAMO-NLP-SG/CLEX-7B-16K",
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  use_flash_attention_2=True
).to("cuda")  # FlashAttention-2 kernels only run on a CUDA device
# Keep the inputs on the same device as the model.
inputs = tokenizer("What is CLEX?", return_tensors="pt").to(model.device)
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))

Inference with Command Line Interface

We replicate the command-line interface of FastChat here. Use the command below to start a streaming chat with CLEX. CLEX-7B-Chat-4K supports input sequences of up to 16k tokens.

python3 serve/cli.py --model-path DAMO-NLP-SG/CLEX-7B-Chat-4K --num-gpu 1

You can also try our web GUI demo here.

LongCorpus-2.5B

We collect a 2.5B-token training dataset from various domains for long-context continual pre-training. Its composition is as follows (partially inspired by Long-Data-Collection):

| Domain | Proportion | Source |
| --- | --- | --- |
| Book | 40% | Redpajama-Book |
| Arxiv | 20% | Redpajama-Arxiv |
| General | 20% | Redpajama |
| Code | 10% | LCC-Python |
| QA | 5% | Natural Questions |
| Summarization | 5% | BookSum |

We have also curated a test dataset comprising 250 million tokens, mirroring the same composition. The selection criteria ensured that the average n-gram similarity (for n=2, 3, 4) with the training set is below 10%. This threshold effectively excludes all QA and Summarization data, resulting in a test corpus where the distribution of tokens across Book, Arxiv, General, and Code categories follows a ratio of 4:2:2:1, respectively.
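The filtering rule above can be sketched as follows. This is a minimal illustration rather than the actual curation pipeline: it assumes similarity is measured per document, as the fraction of that document's n-grams that also occur in the training set, averaged over n = 2, 3, 4. The function names (`ngrams`, `avg_ngram_similarity`, `filter_test_docs`) and the toy sentences are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def avg_ngram_similarity(doc_tokens, train_ngrams, ns=(2, 3, 4)):
    """Mean over n of the share of the document's n-grams also seen in training.

    Assumption: this overlap ratio stands in for the unspecified similarity metric.
    """
    scores = []
    for n in ns:
        doc = ngrams(doc_tokens, n)
        if doc:  # skip n values the document is too short for
            scores.append(len(doc & train_ngrams[n]) / len(doc))
    return sum(scores) / len(scores) if scores else 0.0


def filter_test_docs(candidates, train_docs, threshold=0.10, ns=(2, 3, 4)):
    """Keep candidate documents whose average similarity is below the threshold."""
    train_ngrams = {n: set().union(*(ngrams(d, n) for d in train_docs)) for n in ns}
    return [d for d in candidates if avg_ngram_similarity(d, train_ngrams, ns) < threshold]


# Toy data: the first candidate overlaps heavily with training, the second not at all.
train_docs = ["the cat sat on the mat".split()]
candidates = [
    "the cat sat on the sofa".split(),
    "a completely different sentence about whales".split(),
]
kept = filter_test_docs(candidates, train_docs)
print(kept)
```

On real corpora the training n-gram sets would be too large to hold in plain Python sets, so a hashed or sampled index would likely be needed; the sketch only shows the selection logic.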

Training

To train the long-context LLM with CLEX, run the script scripts/train_lm.sh as follows:

./scripts/train_lm.sh

For training the chat model, run the script scripts/train_chat.sh instead.

Note that we use on-the-fly tokenization, which supports any desired training length without pre-tokenizing. Because the total number of steps is therefore not known in advance, you may need to specify the max_steps argument in the training arguments if you use a learning rate scheduler (e.g., cosine). You can estimate it from the size of the training data.
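As a rough aid for choosing max_steps, the sketch below divides the corpus size by the number of tokens consumed per optimizer step. It assumes every sample is filled to the full training length and that the effective batch is per-device batch size × gradient accumulation steps × number of GPUs. The function name and the example numbers are illustrative and are not the settings used in scripts/train_lm.sh.

```python
import math


def estimate_max_steps(total_tokens, train_length, per_device_batch_size,
                       gradient_accumulation_steps, num_gpus, num_epochs=1):
    """Estimate optimizer steps needed to consume `total_tokens` `num_epochs` times."""
    tokens_per_step = (train_length * per_device_batch_size
                       * gradient_accumulation_steps * num_gpus)
    return math.ceil(total_tokens * num_epochs / tokens_per_step)


# Hypothetical setup: 2.5B tokens, 16k training length, 8 GPUs,
# batch size 1 per GPU, 8 gradient accumulation steps.
max_steps = estimate_max_steps(
    total_tokens=2_500_000_000,
    train_length=16 * 1024,
    per_device_batch_size=1,
    gradient_accumulation_steps=8,
    num_gpus=8,
)
print(max_steps)  # 2385
```

If samples are shorter than the training length on average, the real step count will be higher than this estimate.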

Customization

We now support LLaMA-2, Phi-2, and Mixtral-8x7B. To equip your own RoPE-based LLM with CLEX, follow these three steps:

  1. Initialize the CLEX layer and obtain the packed cos and sin embeddings of the CLEX-scaled RoPE.
  2. Pass the cos and sin embeddings to the attention layer.
  3. Move the update of past_key_value before the RoPE is applied. This ensures that all keys, cached and new, are rotated by the same cos and sin embeddings.
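The ordering in step 3 can be illustrated with a small pure-Python sketch. It is not the repository's attention code: a single scalar `scale` stands in for whatever scaling the CLEX layer produces, and the sketch assumes that this scaling can differ between decoding steps. All function names are hypothetical. The cache stores unrotated keys, and every key is rotated with the current embeddings at each step.

```python
import math


def rotate(vec, position, scale, base=10000.0):
    """Apply RoPE to one key vector, pairing dimension i with i + dim/2.

    `scale` divides the rotation frequencies; it is a placeholder for CLEX scaling.
    """
    dim, half = len(vec), len(vec) // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position / (scale * base ** (2 * i / dim))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + half] * s
        out[i + half] = vec[i + half] * c + vec[i] * s
    return out


def keys_for_attention(raw_cache, new_raw_keys, scale):
    """Update the cache with unrotated keys first, then rotate all keys together."""
    raw_cache = raw_cache + new_raw_keys  # cache update happens before RoPE
    rotated = [rotate(k, pos, scale) for pos, k in enumerate(raw_cache)]
    return rotated, raw_cache


keys = [[1.0, 0.0, 0.5, 2.0], [0.3, -1.0, 0.7, 0.2], [2.0, 1.0, -0.5, 0.1]]

# Step 1 sees two tokens with scale 1.0; step 2 adds a third token with scale 2.0.
_, cache = keys_for_attention([], keys[:2], scale=1.0)
rotated, cache = keys_for_attention(cache, keys[2:], scale=2.0)

# Every key now matches a from-scratch rotation with the current scale.
reference = [rotate(k, pos, 2.0) for pos, k in enumerate(keys)]
consistent = rotated == reference

# Had the rotated key been cached at step 1, it would be stale at step 2.
stale = rotate(keys[1], 1, 1.0) != reference[1]
print(consistent, stale)
```

The cost of this ordering is that cached keys are re-rotated at every step, which is the price of keeping them consistent with embeddings that may change.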

Evaluation

Language Modelling

Here are the evaluation perplexities (PPLs) of the base models trained with CLEX. Training and evaluation use a 2B-token subset of the RedPajama-Book corpus, split 99:1 into training and test sets.

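| Models | Train Length | Eval.(4k) | Eval.(8k) | Eval.(16k) | Eval.(32k) | Eval.(64k) |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA-2-7B | 4k | 6.04 | 20.54 | >100 | >1000 | >1000 |
| CodeLLaMA-7B | 16k | 7.6 | 7.4 | 7.33 | 15.12 | 52.02 |
| Naive FT | 16k | 5.98 | 5.93 | 5.91 | 18.31 | >100 |
| PI | 16k | 5.9 | 5.71 | 5.72 | 6.05 | 8.75 |
| Yarn (s=16) | 16k | 6.5 | 5.71 | 5.73 | 5.99 | 8.51 |
| Yarn (s=32) | 16k | 6.61 | 5.94 | 5.96 | 6.08 | 6.22 |
| CL-Scaling | 16k | 24.99 | 5.86 | 5.87 | 10.56 | 41.09 |
| ALIBI | 4k | 6.34 | 6.39 | 6.41 | 6.5 | 6.51 |
| RandomPos | 4k | 5.88 | >100 | >1000 | >1000 | >1000 |
| CLEX-LLaMA-2-7B-4K | 4k | 5.86 | 5.7 | 5.87 | 14.53 | 30.51 |
| CLEX-LLaMA-2-7B-16K | 16k | 5.88 | 5.68 | 5.52 | 5.55 | 5.64 |
| CLEX-LLaMA-2-13B-4k | 4k | 5.43 | 5.31 | 5.34 | 6.40 | 12.15 |

| Models | Train Length | Eval.(32k) | Eval.(64k) | Eval.(128k) | Eval.(256k) |
| --- | --- | --- | --- | --- | --- |
| CLEX-LLaMA-2-7B | 64k | 5.99 | 5.89 | 6.04 | 5.98 |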
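For readers reproducing numbers like those above, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows only that final step. It assumes the per-token losses (in nats) have already been collected from the model for sequences of the chosen evaluation length; how the data is windowed is not specified here, and the function name is hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods given in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(4)] * 10)
print(round(uniform_ppl, 6))
```

Averaging over all tokens before exponentiating matters: averaging per-sequence perplexities instead would give a different, usually larger, number.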

CLEX-Phi-2-2.7B and CLEX-Mixtral-8x7B are trained on LongCorpus-2.5B. Their evaluation results on its test set are listed below.

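| Models | Train Length | Eval.(32k) | Eval.(64k) | Eval.(128k) | Eval.(256k) |
| --- | --- | --- | --- | --- | --- |
| Phi-2-2.7B | 2k | >100 | >100 | >100 | >100 |
| CLEX-Phi-2-2.7B | 32k | 5.11 | 5.17 | 6.55 | - |
| Mixtral-8x7B | 32k | 2.78 | 3.44 | 5.88 | 14.20 |
| CLEX-Mixtral-8x7B | 32k | 2.56 | 2.53 | 2.57 | 3.78 |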

LongBench

We evaluate the chat models trained with CLEX on LongBench, where the average length of most tasks ranges from 5k to 16k. The baseline results are taken from the LongBench leaderboard, except for those marked with †, which we evaluated ourselves. ** denotes methods that need to truncate the input sequence to the train length.

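| Model | Train Length | Avg. | Single-Document QA | Multi-Document QA | Summarization | Few-shot Learning | Synthetic Task | Code Completion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16K | - | 44.66 | 45.1 | 36.23 | 23.9 | 57.58 | 51 | 54.15 |
| CodeLLaMA-7B<sup>†</sup> | 16k | 33.42 | 32.19 | 21.49 | 20.06 | 57.73 | 8.92 | 60.11 |
| Vicuna-v1.5-7B | 16k | 30.54 | 31.75 | 18.8 | 23.25 | 56.83 | 5.33 | 47.25 |
| LongChat-v1.5-7B | 32k | 31.59 | 28.78 | 20.33 | 22.45 | 50.8 | 13.03 | 54.15 |
| XGen-7B<sup>**</sup> | 8k | 24.96 | 22.15 | 18.02 | 19.05 | 47.23 | 4.7 | 38.6 |
| InternLM-7B<sup>**</sup> | 8k | 22.64 | 21.45 | 17.9 | 15.2 | 41.55 | 3.3 | 36.45 |
| Llama2-7B-chat<sup>**</sup> | 4k | 26.76 | 21.65 | 18.2 | 18.53 | 49.95 | 4.13 | 48.1 |
| Baichuan-13B<sup>†</sup> (ALiBi) | 4k | 13.49 | 18.36 | 6.79 | 9.93 | 11.72 | 1.85 | 32.28 |
| ALiBi-7B-4K<sup>†</sup> | 4k | 9.93 | 7.23 | 5.98 | 7.4 | 5.69 | 0.67 | 32.61 |
| CLEX-7B-Chat-4K | 4k | 32.72 | 29.38 | 20.08 | 23.25 | 56.02 | 9.67 | 57.94 |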
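The ** note above does not say how inputs are truncated to the train length. One option, shown here purely as an assumption, is middle truncation: keep the head and tail of the prompt and drop the middle, on the reasoning that instructions tend to sit at either end. The function name is hypothetical and the sketch works on plain lists of token ids.

```python
def truncate_middle(token_ids, max_length):
    """Keep the first and last tokens so that at most `max_length` remain."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(token_ids) <= max_length:
        return list(token_ids)
    tail = max_length // 2
    head = max_length - tail  # head gets the extra token when max_length is odd
    return list(token_ids[:head]) + list(token_ids[len(token_ids) - tail:])


tokens = list(range(10))
print(truncate_middle(tokens, 4))  # [0, 1, 8, 9]
print(truncate_middle(tokens, 5))  # [0, 1, 2, 8, 9]
```

Models that extrapolate beyond their train length, which is the goal of CLEX, can skip this step for inputs within their supported range.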

InfiniteBench

We also evaluate CLEX-Mixtral-8x7B-Chat-32k on InfiniteBench, a 128k-length benchmark covering various tasks. We compare it with GPT-4, YaRN-Mistral-7B, Kimi-Chat, Claude 2, and the vanilla Mixtral-8x7B-Instruct-v0.1.

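| Task Name | GPT-4 | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | CLEX-Mixtral-8x7B-Chat-32k | Mixtral-8x7B-Instruct-v0.1 |
| --- | --- | --- | --- | --- | --- | --- |
| Retrieve.PassKey | 100% | 92.71% | 98.14% | 97.80% | 99.72% | 96.78% |
| Retrieve.Number | 100% | 56.61% | 95.42% | 98.14% | 76.10% | 76.61% |
| Retrieve.KV | 89.00% | <5% | 53.60% | 65.40% | <5% | <5% |
| En.Sum | 14.73% | 9.09% | 17.93% | 14.45% | 15.48% | 14.3% |
| En.QA | 22.22% | 9.55% | 16.52% | 11.97% | 15.52% | 16.81% |
| En.MC | 67.25% | 27.95% | 72.49% | 62.88% | 58.96% | 56.77% |
| En.Dia | 8.50% | 7.50% | 11.50% | 46.50% | 9% | <5% |
| Code.Debug | 39.59% | <5% | 18.02% | <5% | 21.32% | <5% |
| Code.Run | 23.25% | <5% | <5% | <5% | <5% | <5% |
| Math.Calc | <5% | <5% | <5% | <5% | <5% | <5% |
| Math.Find | 60.00% | 17.14% | 12.57% | 32.29% | 28% | 26.57% |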

Key points:

Acknowledgement

We would like to express our gratitude to the following open-source efforts that CLEX benefits from:

Citation

If you find our project useful, please consider starring our repo and citing our paper as follows:

@article{damonlpsg2023clex,
  author = {Chen, Guanzheng and Li, Xin and Meng, Zaiqiao and Liang, Shangsong and Bing, Lidong},
  title = {CLEX: Continuous Length Extrapolation for Large Language Models},
  year = 2023,
  journal = {arXiv preprint arXiv:2310.16450},
  url = {https://arxiv.org/abs/2310.16450}
}