SysBench

Code for SysBench: Can Large Language Models Follow System Messages?

Introduction

In this section, we describe how to use the attached code, including the dataset, the steps to evaluate a customized model, and the steps to reproduce the results reported in the paper.

The Dataset

datas/system_benchmark_eval_datas.json is the dataset file, a JSON array containing 500 dialogues with system messages. The important JSON fields of each entry are described in the table below.

Field Name    | Meaning
system_id     | The ID of the system message (or dialogue).
system_prompt | The content of the system message.
messages      | A JSON array containing the roles and contents of the system message and the whole 5-turn conversation. The contents under the "assistant" role are the ground truth.
prompt_infos  | A JSON array of 5 entries, one per user instruction. For each entry, the alignment field denotes the instruction's alignment with the system message, and the criteria field is a checklist with labeled constraint types.
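
For illustration, the dataset can be inspected with a few lines of Python. This is only a quick sketch based on the fields listed above, not part of the evaluation code:

import json

# Load the 500 dialogues with system messages
with open("datas/system_benchmark_eval_datas.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

entry = dataset[0]
print(entry["system_id"])       # ID of the system message (or dialogue)
print(entry["system_prompt"])   # content of the system message

# "messages" holds the system message plus the 5-turn conversation;
# the "assistant" contents are the ground truth
for msg in entry["messages"]:
    print(msg["role"], msg["content"][:50])

# "prompt_infos" has one entry per user instruction
for info in entry["prompt_infos"]:
    print(info["alignment"], info["criteria"])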

Evaluating Customized Model

This section presents the steps to evaluate a customized model on SysBench.

Software Dependencies

Only Python (>= 3.10) and the openai (>= 1.0) package are required for the code base itself.
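
For example, in a fresh Python 3.10+ environment:

pip install "openai>=1.0"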

Implement Interface

We provide a template model file to make adding a new model easy. Suppose the customized model is named myModel; copy the template:

cd models
cp template.py myModel.py
cd ..

Then, rename the class to myModel and implement its __call__ method. The method receives a list called messages, where each element is a Python dictionary with role and content keys, representing the full dialogue history from which the next model output is generated. It should return the model's response as a string.

Some of the existing code in this directory can be used for reference: gpt4o.py targets an OpenAI-style API, glm_9b_client.py a vLLM server, and qwen2_7b.py offline inference. A minimal sketch is also shown below.
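
As an illustration only, a custom model backed by an OpenAI-compatible endpoint might look roughly like the following. The endpoint URL, API key, and model name are placeholders, and the exact structure of template.py may differ:

# models/myModel.py -- minimal sketch; replace the placeholders with your own values
from openai import OpenAI

class myModel:
    def __init__(self):
        # One shared client; avoid mutable per-call state so __call__ stays re-entrant
        self.client = OpenAI(api_key="YOUR_API_KEY", base_url="https://your-endpoint/v1")

    def __call__(self, messages):
        # `messages` is the full dialogue history as a list of
        # {"role": ..., "content": ...} dictionaries
        response = self.client.chat.completions.create(
            model="your-model-name",
            messages=messages,
            temperature=0.0,
        )
        # Return the next assistant reply as a plain string
        return response.choices[0].message.content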

Prepare GPT-4o as Verifier

GPT-4o is used as the model-based verifier. Please fill in the OpenAI API key and base URL in gpt4o.py to configure GPT-4o inference, then run the following command to test that it works:

python models/gpt4o.py

Run Evaluation

Run the following command for evaluation:

python -m eval_system_bench \
    --infer_model_name myModel \
    --output_dir output \
    --max_threads 20

Note: It is highly recommended to use online inference and to keep your __call__ method re-entrant (safe to call concurrently from multiple threads). If you cannot guarantee this, set max_threads to 1.

Calculate Metrics

After finishing the evaluation step, the detailed model and verifier outputs are both automatically stored in the output/myModel directory by default. To calculate the metric scores, run:

python -m eval_output \
    --infer_model_name myModel \
    --output_dir output

This command will output the metric scores with detailed information.

Reproducing Results from Provided Data

Since all API keys have been removed from the provided data for privacy and anonymity reasons, reproducing every result in the paper from scratch is more involved; those instructions are given in the next subsection. This section describes the much simpler path of reproducing the results from our provided raw data.

Software Dependencies

Python (>= 3.10), matplotlib (>= 3.9), pandas (>= 2.2), and openpyxl (>= 3.1) are required.
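
They can be installed, for example, with:

pip install "matplotlib>=3.9" "pandas>=2.2" "openpyxl>=3.1"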

Plot Figures

Run the following commands to plot figures and generate tables (in LaTeX code). We recommend installing the missing fonts for better display:

mkdir figures # create the output directory

# Plot each figure
python plot/fig3_stat.py
python plot/fig4_radar.py
python plot/fig5_hgt_histo.py
python plot/fig6_atscore.py

# Generate each table in LaTeX code
python plot/tab1_category.py
python plot/tab2_overall.py
python plot/tab3_align.py
python plot/tab4_turn.py
python plot/tab6_csr_full.py
python plot/tab7_align_full.py

These commands will parse the raw data in output/ and generate figures and tables presented in the paper.

Expected Results

All results should be strictly consistent with those presented in the paper.

Reproducing Results from Scratch

To reproduce from scratch, you need to obtain the API keys (for the closed models) and prepare the checkpoints (for the open models). The detailed steps are listed below.

Hardware Dependencies

GPU instances are required to run the open-source models. For the largest model, Qwen-72B, we use 4× NVIDIA H100 80GB GPUs.

Software Dependencies

In addition to the dependencies above, transformers (>= 4.44.0) and vLLM (>= 0.5.0) are required.
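
They can be installed, for example, with:

pip install "transformers>=4.44.0" "vllm>=0.5.0"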

Configure Models

Please modify all the model files listed in the ./models directory. For models with a public API, fill in your API keys and base URLs. For open-source models running inference locally (i.e., the Qwen family, the Llama family, and GLM-4 9B), we recommend deploying a vLLM server for online serving; see glm_9b_client.py for reference and modify the others accordingly.

We also provide a sample script that starts the vLLM server, at servers/run_vllm_serve.sh.
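
If you prefer to launch the server manually, an OpenAI-compatible vLLM server can typically be started along these lines; the checkpoint path, served model name, tensor-parallel size, and port below are placeholders and may differ from what the provided script uses:

# Placeholder command; adjust the checkpoint path, GPU count, and port to your setup
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/your/checkpoint \
    --served-model-name qwen2_72b \
    --tensor-parallel-size 4 \
    --port 8000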

Backup Our Data (Optional)

The following experiments will overwrite the output/ directory, so back up the provided data first if you want to keep it:

mv output output-backup && mkdir output

Exp. 1: Evaluate Models

For each model, run the following command for evaluation. Please set max_threads to 1 for models whose __call__ implementation is not guaranteed to be re-entrant.

python -m eval_system_bench \
    --infer_model_name <model_name> \
    --output_dir output \
    --max_threads 20

Then, the detailed evaluation results are available in the directory output/<model_name>.

Exp. 2: Ground-truth History

In this experiment, the historical model responses in the dialogue are replaced with the ground-truth responses:

OUTDIR=output/with_gt_history_output
python -m eval_system_bench_with_gt \
    --infer_model_name <model_name> \
    --output_dir $OUTDIR \
    --max_threads 20

To reproduce the corresponding figure in the paper, the following models should be run with the command above: qwen2_72b, claude35_opus, ernie4, and llama3_8b. All results will be stored in output/with_gt_history_output for later use.

Exp. 3: Attention Score

To explore the distribution of attention scores, first specify the Hugging Face checkpoint paths of the glm4-9b, llama31-8b, and qwen-72b models in Lines 20-22 of attenscore/main.py.
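
The variable names below are hypothetical (check the actual file); conceptually, the edit just points three path strings at your local checkpoints:

# Hypothetical illustration of the edit around Lines 20-22 of attenscore/main.py;
# the real variable names may differ -- only the local checkpoint paths matter
GLM4_9B_PATH = "/path/to/glm-4-9b-chat"
LLAMA31_8B_PATH = "/path/to/llama-3.1-8b-instruct"
QWEN_72B_PATH = "/path/to/qwen-72b"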

Then, change the working directory to ./attenscore and run the provided script:

cd attenscore
bash run_all.sh

You can change the value of the --id flag to explore a system message other than the one presented in the figure. Setting --id to -1 runs all 500 system messages on the current model, but this is very time-consuming. All results will be stored in output/attenscore for plotting the figure later.

Reproduce Figures and Tables

Finally, when all experimental data are ready in output/, follow the plotting instructions above to reproduce the figures and tables. Note that there are additional command-line flags for the attention-score figure; run the following command for more details:

python plot/fig6_atscore.py -h

Expected Results

Although there is unavoidable randomness and fluctuation, especially for the closed models, all figures and tables should statistically match the patterns reported in the paper.

Citation

@article{qin2024sysbench,
  title={SysBench: Can Large Language Models Follow System Messages?},
  author={Qin, Yanzhao and Zhang, Tao and Shen, Yanjun and Luo, Wenjing and Sun, Haoze and Zhang, Yan and Qiao, Yujing and Chen, Weipeng and Zhou, Zenan and Zhang, Wentao and others},
  journal={arXiv preprint arXiv:2408.10943},
  year={2024}
}