πŸ“ RULER: What’s the Real Context Size of Your Long-Context Language Models?

This repository contains code for our paper "RULER: What's the Real Context Size of Your Long-Context Language Models?" RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. We benchmark 17 open-source models across 4 task categories (13 tasks in total) in RULER, evaluating long-context capabilities beyond simple in-context recall. Our main results are shown below.

| Models | Claimed Length | Effective Length | 4K | 8K | 16K | 32K | 64K | 128K | Avg. | wAvg. (inc) | wAvg. (dec) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Llama2 (7B) | 4K | | 85.6 | | | | | | | | |
| Jamba-1.5-large* (94B/398B) | 256K | >128K | <ins>96.7</ins> | <ins>96.6</ins> | <ins>96.4</ins> | <ins>96.0</ins> | <ins>95.4</ins> | <ins>95.1</ins> | 96.0 | 95.7 (1st) | 96.3 (1st) |
| Gemini-1.5-pro | 1M | >128K | <ins>96.7</ins> | <ins>95.8</ins> | <ins>96.0</ins> | <ins>95.9</ins> | <ins>95.9</ins> | <ins>94.4</ins> | 95.8 | 95.5 (2nd) | 96.1 (2nd) |
| Jamba-1.5-mini (12B/52B) | 256K | >128K | <ins>95.6</ins> | <ins>95.6</ins> | <ins>94.8</ins> | <ins>94.6</ins> | <ins>92.8</ins> | <ins>90.0</ins> | 93.9 | 93.1 (3rd) | 94.8 (3rd) |
| GPT-4-1106-preview | 128K | 64K | <ins>96.6</ins> | <ins>96.3</ins> | <ins>95.2</ins> | <ins>93.2</ins> | <ins>87.0</ins> | 81.2 | 91.6 | 89.0 (4th) | 94.1 (4th) |
| Llama3.1 (70B) | 128K | 64K | <ins>96.5</ins> | <ins>95.8</ins> | <ins>95.4</ins> | <ins>94.8</ins> | <ins>88.4</ins> | 66.6 | 89.6 | 85.5 (9th) | 93.7 (5th) |
| Command-R-plus-0824 (104B) | 128K | 32K | <ins>96.0</ins> | <ins>95.1</ins> | <ins>94.0</ins> | <ins>92.4</ins> | 85.4 | 64.6 | 87.9 | 83.4 (12th) | 92.4 (6th) |
| Qwen2 (72B) | 128K | 32K | <ins>96.9</ins> | <ins>96.1</ins> | <ins>94.9</ins> | <ins>94.1</ins> | 79.8 | 53.7 | 85.9 | 79.6 (16th) | 92.3 (7th) |
| Command-R-plus (104B) | 128K | 32K | <ins>95.6</ins> | <ins>95.2</ins> | <ins>94.2</ins> | <ins>92.0</ins> | 84.3 | 63.1 | 87.4 | 82.7 (13th) | 92.1 (8th) |
| Command-R-0824 (32B) | 128K | 64K | <ins>94.7</ins> | <ins>93.7</ins> | <ins>93.1</ins> | <ins>90.8</ins> | <ins>86.6</ins> | 74.7 | 88.9 | 86.0 (7th) | 91.9 (9th) |
| GLM4 (9B) | 1M | 64K | <ins>94.7</ins> | <ins>92.8</ins> | <ins>92.1</ins> | <ins>89.9</ins> | <ins>86.7</ins> | 83.1 | 89.9 | 88.0 (5th) | 91.7 (10th) |
| Llama3.1 (8B) | 128K | 32K | <ins>95.5</ins> | <ins>93.8</ins> | <ins>91.6</ins> | <ins>87.4</ins> | 84.7 | 77.0 | 88.3 | 85.4 (10th) | 91.3 (11th) |
| Command-R (35B) | 128K | 32K | <ins>93.8</ins> | <ins>93.3</ins> | <ins>92.4</ins> | <ins>89.5</ins> | 84.9 | 76.0 | 88.3 | 85.5 (8th) | 91.1 (12th) |
| MegaBeam-Mistral (7B) | 512K | 32K | <ins>93.8</ins> | <ins>92.5</ins> | <ins>92.0</ins> | <ins>89.2</ins> | 83.7 | 83.7 | 89.1 | 87.3 (6th) | 91.0 (13th) |
| Mistral-Large (123B) | 128K | 32K | <ins>96.2</ins> | <ins>96.1</ins> | <ins>95.1</ins> | <ins>93.0</ins> | 78.8 | 23.7 | 80.5 | 70.6 (22nd) | 90.4 (14th) |
| GradientAI/Llama3 (70B) | 1M | 16K | <ins>95.1</ins> | <ins>94.4</ins> | <ins>90.8</ins> | 85.4 | 80.9 | 72.1 | 86.5 | 82.6 (14th) | 90.3 (15th) |
| Mixtral-8x22B (39B/141B) | 64K | 32K | <ins>95.6</ins> | <ins>94.9</ins> | <ins>93.4</ins> | <ins>90.9</ins> | 84.7 | 31.7 | 81.9 | 73.5 (20th) | 90.3 (16th) |
| Yi (34B) | 200K | 32K | <ins>93.3</ins> | <ins>92.2</ins> | <ins>91.3</ins> | <ins>87.5</ins> | 83.2 | 77.3 | 87.5 | 84.8 (11th) | 90.1 (17th) |
| Phi3-mini (3.8B) | 128K | 32K | <ins>92.2</ins> | <ins>91.5</ins> | <ins>90.7</ins> | <ins>87.5</ins> | 80.6 | 66.7 | 84.8 | 80.9 (15th) | 88.7 (18th) |
| Phi3-medium (14B) | 128K | 32K | <ins>93.3</ins> | <ins>93.2</ins> | <ins>91.1</ins> | <ins>86.8</ins> | 78.6 | 46.1 | 81.5 | 74.8 (19th) | 88.3 (19th) |
| Mixtral-8x7B (12.9B/46.7B) | 32K | 32K | <ins>94.9</ins> | <ins>92.1</ins> | <ins>92.5</ins> | <ins>85.9</ins> | 72.4 | 44.5 | 80.4 | 72.8 (21st) | 87.9 (20th) |
| GradientAI/Llama3 (8B) | 1M | 16K | <ins>92.8</ins> | <ins>90.3</ins> | <ins>85.7</ins> | 79.9 | 76.3 | 69.5 | 82.4 | 78.5 (17th) | 86.3 (21st) |
| FILM-7B* (7B) | 32K | 32K | <ins>92.8</ins> | <ins>88.2</ins> | <ins>88.1</ins> | <ins>86.9</ins> | 70.1 | 27.1 | 75.5 | 66.4 (24th) | 84.7 (22nd) |
| InternLM2.5 (7B) | 1M | 4K | <ins>88.1</ins> | 85.5 | 84.5 | 82.7 | 75.5 | 68.9 | 80.9 | 77.8 (18th) | 83.9 (23rd) |
| Mistral (7B) | 32K | 16K | <ins>93.6</ins> | <ins>91.2</ins> | <ins>87.2</ins> | 75.4 | 49.0 | 13.8 | 68.4 | 55.6 (26th) | 81.2 (24th) |
| Mistral-Nemo | 128K | 16K | <ins>87.8</ins> | <ins>87.2</ins> | <ins>87.7</ins> | 69.0 | 46.8 | 19.0 | 66.2 | 54.7 (27th) | 77.8 (25th) |
| GLM3 (6B) | 128K | 4K | <ins>87.8</ins> | 83.4 | 78.6 | 69.9 | 56.0 | 42.0 | 69.6 | 62.0 (25th) | 77.2 (26th) |
| LWM (7B) | 1M | <4K | 82.3 | 78.4 | 73.7 | 69.1 | 68.1 | 65.0 | 72.8 | 69.9 (23rd) | 75.7 (27th) |
| DBRX (36B/132B) | 32K | 8K | <ins>95.1</ins> | <ins>93.8</ins> | 83.6 | 63.1 | 2.4 | 0.0 | 56.3 | 38.0 (28th) | 74.7 (28th) |
| Qwen1.5 (72B) | 32K | 8K | <ins>94.9</ins> | <ins>93.8</ins> | 78.0 | 67.8 | 0.0 | 0.0 | 55.7 | 37.5 (29th) | 74.0 (29th) |
| Together (7B) | 32K | 4K | <ins>88.2</ins> | 81.1 | 69.4 | 63.0 | 0.0 | 0.0 | 50.3 | 33.8 (30th) | 66.7 (30th) |
| LongChat (7B) | 32K | <4K | 84.7 | 79.9 | 70.8 | 59.3 | 0.0 | 0.0 | 49.1 | 33.1 (31st) | 65.2 (31st) |
| LongAlpaca (13B) | 32K | <4K | 60.6 | 57.0 | 56.6 | 43.6 | 0.0 | 0.0 | 36.3 | 24.7 (32nd) | 47.9 (32nd) |
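
The aggregate columns can be recomputed from the per-length scores. The sketch below rests on two assumptions that are inferred from the numbers in the table rather than stated in this README: wAvg. (inc) and wAvg. (dec) use weights that grow or shrink linearly with sequence length (1 to 6 and 6 to 1), and the effective length is the longest length up to which every score exceeds the Llama2 (7B) score at 4K (85.6). The function names are illustrative and are not part of the RULER code base. With these assumptions the Llama3.1 (70B) row is reproduced.

```python
LENGTHS = ["4K", "8K", "16K", "32K", "64K", "128K"]
THRESHOLD = 85.6  # Llama2 (7B) score at 4K, taken from the table


def average(scores):
    """Unweighted mean over the evaluated lengths (the Avg. column)."""
    return sum(scores) / len(scores)


def weighted_average(scores, increasing=True):
    """Linearly weighted mean; assumed weights are 1..n (inc) or n..1 (dec)."""
    n = len(scores)
    weights = list(range(1, n + 1))
    if not increasing:
        weights.reverse()
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def effective_length(scores, threshold=THRESHOLD):
    """Longest length up to which every score exceeds the threshold (assumed rule)."""
    passed = 0
    for score in scores:
        if score <= threshold:
            break
        passed += 1
    if passed == 0:
        return "<" + LENGTHS[0]
    if passed == len(LENGTHS):
        return ">" + LENGTHS[-1]
    return LENGTHS[passed - 1]


llama31_70b = [96.5, 95.8, 95.4, 94.8, 88.4, 66.6]
print(round(average(llama31_70b), 1))                  # 89.6
print(round(weighted_average(llama31_70b, True), 1))   # 85.5
print(round(weighted_average(llama31_70b, False), 1))  # 93.7
print(effective_length(llama31_70b))                   # 64K
```

Under these weights, wAvg. (inc) rewards models that hold up at long inputs, while wAvg. (dec) favors strong short-context performance, which is why the two rankings differ.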

💡 Requirements

cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t cphsieh/ruler:0.2.0 .

πŸ” Evaluate long-context LMs

1. Download data

cd scripts/data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh

2. Download model

3. Run evaluation pipeline

GPUS="" # number of GPUs
ROOT_DIR="" # the path that stores generated task samples and model predictions. 
MODEL_DIR="" # the path that contains individual model folders from Huggingface.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.

# Per-model settings: add one branch for each model you want to evaluate.
case $MODEL_NAME in
    YOUR_HF_MODEL_NAME)
        MODEL_PATH=${MODEL_DIR}/YOUR_MODEL_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="" # hf or vllm
        ;;
    YOUR_TRTLLM_ENGINE_NAME)
        MODEL_PATH=${ENGINE_DIR}/YOUR_ENGINE_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="trtllm"
        ;;
    YOUR_OPENAI_MODEL_NAME)
        MODEL_PATH="" # OpenAI model name listed in https://platform.openai.com/docs/models/
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="openai"
        TOKENIZER_PATH="cl100k_base"
        TOKENIZER_TYPE="openai"
        OPENAI_API_KEY="" # your OpenAI API key
        ;;
    YOUR_GEMINI_MODEL_NAME)
        MODEL_PATH="" # Gemini model name listed in https://ai.google.dev/gemini-api/docs/models/gemini
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="gemini"
        TOKENIZER_PATH=$MODEL_PATH
        TOKENIZER_TYPE="gemini"
        GEMINI_API_KEY="" # your Gemini API key
        ;;
esac

# Start the evaluation on the default `synthetic` benchmark.
bash run.sh YOUR_MODEL_NAME synthetic

🧠 (Optional) Customize task complexity

The tasks to evaluate are listed in scripts/config_tasks.sh, and the configuration of each task is defined in scripts/synthetic.yaml. To adjust the complexity of a task, change the arguments described in detail below.

| Category | Task name | Configurations |
|:---|:---|:---|
| Retrieval | niah | type_haystack: repeat/essay/needle<br># repeat: repeated noise sentences<br># essay: Paul Graham Essays<br># needle: distractor needles<br><br>type_needle_k: words/numbers/uuids<br>type_needle_v: words/numbers/uuids<br># words: adjective-noun<br># numbers: 7 digits<br># uuids: 32 digits<br><br>num_needle_k: int >= 1<br># add multiple needles in haystack<br>num_needle_v: int >= 1<br># retrieve multiple values from a single key<br>num_needle_q: int >= 1<br># retrieve multiple values from multiple keys |
| Multi-hop<br>Tracing | variable_tracking | num_chains: int >= 1<br># number of variable name-binding chains<br>num_hops: int >= 1<br># number of times variable names are bound in each chain |
| Aggregation | common_words_extraction | freq_cw: int >= 1<br># frequency of common words<br>freq_ucw: int >= 1<br># frequency of uncommon words<br>num_cw: int >= 1<br># number of common words |
| Aggregation | freq_words_extraction | alpha: float > 1.0<br># parameter of the distribution from which synthetic words are drawn. Reduce alpha to increase the difficulty of this task. Increasing the number of words to return also increases the difficulty; we use 3 in our evaluations because models perform worse at short context sizes when more words must be returned. |
| Question<br>Answering | qa | dataset: squad or hotpotqa<br># the short-context qa dataset we use |
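
To make the retrieval and tracing arguments more concrete, here is a minimal sketch of how such examples could be built. It is not the generator shipped in this repository: the function names, the filler sentence, the word lists, and the prompt wording are all hypothetical. It only mirrors the meaning of type_haystack: repeat, word keys with 7-digit number values, num_needle_k, num_chains, and num_hops as described in the table. The variable-tracking example omits the surrounding noise text for brevity.

```python
import random

# Hypothetical filler sentence standing in for the "repeat" haystack type.
NOISE = "The grass is green. The sky is blue. The sun is yellow."


def make_niah(num_noise=20, num_needle_k=1, seed=0):
    """Hide num_needle_k key-value needles in repeated noise and query one key."""
    rng = random.Random(seed)
    adjectives = ["quiet", "brave", "amber", "rapid"]
    nouns = ["falcon", "harbor", "lantern", "meadow"]
    all_keys = [f"{a}-{n}" for a in adjectives for n in nouns]
    keys = rng.sample(all_keys, num_needle_k)  # distinct adjective-noun keys
    needles = {k: str(rng.randint(1_000_000, 9_999_999)) for k in keys}
    sentences = [NOISE] * num_noise
    for key, value in needles.items():
        position = rng.randint(0, len(sentences))  # random depth in the haystack
        sentences.insert(position, f"The special magic number for {key} is {value}.")
    query_key = rng.choice(keys)
    question = f"What is the special magic number for {query_key}?"
    return " ".join(sentences), question, needles[query_key]


def make_variable_tracking(num_chains=1, num_hops=2, seed=0):
    """Build name-binding chains such as VAR A = 12345, VAR B = VAR A."""
    rng = random.Random(seed)
    values = rng.sample(range(10_000, 100_000), num_chains)  # one distinct value per chain
    statements, chains = [], []
    for c, value in enumerate(values):
        # num_hops bindings produce num_hops + 1 variable names per chain.
        names = [f"X{c}_{h}" for h in range(num_hops + 1)]
        statements.append(f"VAR {names[0]} = {value}")
        for prev, cur in zip(names, names[1:]):
            statements.append(f"VAR {cur} = VAR {prev}")
        chains.append((value, names))
    value, answer = chains[0]
    question = f"Find all variables that are assigned the value {value}."
    return " ".join(statements), question, answer


context, question, answer = make_niah(num_noise=5, num_needle_k=2)
print(question, "->", answer)
text, question, names = make_variable_tracking(num_chains=2, num_hops=3)
print(question, "->", names)
```

Raising num_needle_k adds distractor key-value pairs that compete with the queried key, and raising num_hops lengthens the chain a model must follow, so both make the example harder at a fixed context length.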

🚀 (Optional) Contribute a new synthetic task

1. Create a python script for data preparation

2. Add task template

3. Add evaluation metric

4. Add required configurations
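
As a rough illustration of the first three steps, the sketch below prepares samples for a toy task, fills a prompt template, and scores predictions with a string-match metric. Everything in it is an assumption made for illustration: the task itself, the template text, the record fields (index, input, outputs), and the function names are hypothetical and may not match the interfaces this repository expects, so check the existing tasks before adapting it.

```python
import json
import random

# Hypothetical task template; placeholders are filled per sample.
TEMPLATE = "Memorize the list below.\n{context}\nQuestion: {query}\nAnswer:"


def make_sample(index, num_items, rng):
    """Create one sample for a toy 'find the largest number' task."""
    items = rng.sample(range(1000, 10000), num_items)
    prompt = TEMPLATE.format(
        context=" ".join(str(i) for i in items),
        query="What is the largest number in the list?",
    )
    return {"index": index, "input": prompt, "outputs": [str(max(items))]}


def to_jsonl(samples):
    """Serialize samples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(s) for s in samples) + "\n"


def string_match_recall(predictions, references):
    """Average share of reference strings found in each prediction, in percent."""
    scores = [
        sum(ref.lower() in pred.lower() for ref in refs) / len(refs)
        for pred, refs in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)


rng = random.Random(0)
samples = [make_sample(i, num_items=50, rng=rng) for i in range(3)]
print(to_jsonl(samples)[:80])
# A prediction that repeats the reference answer gets full credit.
perfect = ["The answer is " + s["outputs"][0] for s in samples]
print(string_match_recall(perfect, [s["outputs"] for s in samples]))
```

A task like this has a known upper bound (every answer is present in the input), which matches the design goal of keeping short-context performance near perfect so that degradation at longer inputs is easy to observe.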

🛠️ Limitations

Although the tasks in RULER are designed to be configurable, we evaluate the above models with only 13 task configurations. We selected these tasks because most models achieve good (some almost perfect) performance at short context sizes (<= 4K), which leaves ample room to observe degradation as the input length grows. We did not include more complex tasks on which models perform worse at short context sizes, and we did not stress test every model with more difficult task configurations. RULER covers four task categories, extends previous evaluation protocols, and provides a clean test bed for sanity-checking LMs with known upper-bound performance. Even so, it is by no means comprehensive, and it cannot replace realistic tasks, which remain preferable. We welcome contributions of new tasks and new task categories to help evaluate long-context capabilities.

πŸ“ Citation

@article{hsieh2024ruler,
  title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
  author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
  year={2024},
  journal={arXiv preprint arXiv:2404.06654},
}

Disclaimer: This project is strictly for research purposes, and not an official product from NVIDIA.