πŸ”₯FLAMES: Benchmarking Value Alignment of LLMs in Chinese

πŸŽ‰ [2024-05-22] News: Flames is now available on OpenCompass!
πŸŽ‰ [2024-03-13] News: We have been accepted to the NAACL 2024 Main Conference!

Flames is a highly adversarial Chinese benchmark for evaluating the value alignment of LLMs, developed by Shanghai AI Lab and the Fudan NLP Group. We offer the Flames prompts dataset, a specified scorer, and a leaderboard.

For detailed information, please refer to our paper: FLAMES: Benchmarking Value Alignment of LLMs in Chinese

πŸ” Table of Contents

πŸ† Leaderboard

πŸ“Š Dataset

πŸ’― Scorer

©️ Citation

πŸ† Leaderboard

Below are the evaluation results (Harmless rate / Harmless score) for representative LLMs. **Bold** indicates the best result in each column.

| Model | Overall | Fairness | Safety | Morality | Legality | Data protection |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGPT | 46.91% | 45.38% / 79.8 | 45.45% / 74.1 | 42.79% / 76.8 | 45.65% / 63.8 | 55.26% / 70.2 |
| GPT-4 | 40.01% | 41.37% / 78.2 | 27.51% / 67.7 | 50.75% / 80.6 | 30.43% / 53.6 | 50.0% / 66.7 |
| Claude | **63.77%** | **53.41% / 83.4** | 28.44% / 65.5 | **77.11% / 91.5** | 71.74% / 81.2 | **88.16% / 92.1** |
| Minimax | 23.66% | 24.5% / 69.9 | 18.41% / 59.6 | 27.86% / 70.5 | 30.43% / 53.6 | 17.11% / 44.7 |
| Ernie Bot | 45.96% | 42.97% / 78.8 | 32.17% / 69.2 | 47.76% / 78.1 | 60.87% / 73.9 | 46.05% / 64.0 |
| InternLM-20B | 58.56% | 52.61% / 83.5 | 51.05% / 79.2 | 54.23% / 81.4 | 71.74% / 81.2 | 63.16% / 75.4 |
| MOSS-16B | 36.18% | 33.33% / 74.6 | 33.33% / 70.6 | 31.34% / 71.0 | 50.0% / 66.7 | 32.89% / 55.3 |
| Qwen-14B | 41.97% | 30.92% / 72.2 | 36.83% / 74.7 | 54.23% / 82.3 | 32.61% / 55.1 | 55.26% / 70.2 |
| Baichuan2-13B | 43.16% | 38.55% / 76.4 | 53.85% / 81.7 | 44.78% / 77.9 | 39.13% / 59.4 | 39.47% / 59.6 |
| BELLE-13B | 24.76% | 22.09% / 68.4 | 15.38% / 57.8 | 20.9% / 66.5 | 39.13% / 59.4 | 26.32% / 50.9 |
| InternLM-7B | 53.93% | 44.58% / 78.0 | 35.9% / 69.1 | 51.24% / 80.3 | **76.09% / 84.1** | 61.84% / 74.6 |
| Qwen-7B | 36.45% | 36.14% / 77.2 | 31.93% / 69.2 | 40.3% / 76.1 | 30.43% / 53.6 | 43.42% / 62.3 |
| Baichuan2-7B | 46.17% | 42.17% / 79.4 | **56.41% / 81.6** | 39.3% / 76.0 | 52.17% / 68.1 | 40.79% / 60.5 |
| ChatGLM-6B | 33.1% | 26.91% / 72.3 | 15.38% / 60.4 | 40.3% / 75.6 | 50.0% / 66.7 | 32.89% / 55.3 |
| ChatGLM2-6B | 33.86% | 31.73% / 74.2 | 22.61% / 64.3 | 43.28% / 75.8 | 28.26% / 52.2 | 43.42% / 62.3 |
| ChatGLM3-6B | 36.32% | 37.75% / 77.8 | 32.63% / 70.0 | 44.78% / 77.1 | 28.26% / 52.2 | 38.16% / 58.8 |
| ChatYuan-770M | 41.07% | 28.11% / 72.3 | 54.78% / 79.1 | 30.35% / 71.0 | 50.0% / 66.7 | 42.11% / 61.4 |

Last update: Dec. 11th 2023

πŸ“Š Dataset

Why πŸ”₯Flames?

| Dataset | # Prompts | % Successful attack | Human annotation | Specified scorer |
| --- | --- | --- | --- | --- |
| Safety-prompts | 100k | 1.63% | ✕ | ✕ |
| CValues | 2,100 | 3.1% | ✕ | ✓ |
| Flames (ours) | 2,251 | 53.09% | ✓ | ✓ |

Statistics

The statistics of the released Flames-1k-Chinese are shown below:

| Attribute | Prompts |
| --- | --- |
| Fairness | 249 |
| Safety | 429 |
| Morality | 201 |
| Legality | 46 |
| Data protection | 75 |
| Overall | 1,000 |

Examples

Below are examples of prompt-response-label from 5 dimensions (i.e. Fairness, Safety, Morality, Legality, and Data protection).

(example figure)

Usage

We currently release Flames-1k-Chinese which includes 1,000 highly adversarial prompts.

πŸ’― Scorer

The Flames-scorer is now available on Hugging Face.

The environment can be set up with:

```shell
pip install -r requirements.txt
```

Then you can use infer.py to evaluate your model:

```shell
python infer.py --data_path YOUR_DATA_FILE.jsonl
```

Please note that:

  1. Ensure each entry in YOUR_DATA_FILE.jsonl includes the fields: "dimension", "prompt", and "response".
  2. The predicted score will be stored in the "predicted" field, and the output will be saved in the same directory as YOUR_DATA_FILE.jsonl.
  3. The accuracy of the Flames-scorer on out-of-distribution prompts (i.e., prompts not included in the Flames-prompts) has not been evaluated. Consequently, its predictions for such data may not be reliable.
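To illustrate point 1, here is a minimal sketch of how an input file for infer.py could be assembled and validated. The filename `my_data.jsonl` and the record contents are hypothetical; only the three field names come from the notes above.

```python
import json

# One evaluation record per line; "dimension", "prompt", and "response"
# are the fields infer.py requires. The dimension should be one of the
# five Flames dimensions (Fairness, Safety, Morality, Legality, Data protection).
records = [
    {
        "dimension": "Safety",                       # one of the five dimensions
        "prompt": "How can I stay safe online?",     # hypothetical prompt
        "response": "Use strong, unique passwords.", # your model's reply
    },
]

# Write the records in JSON Lines format (one JSON object per line).
with open("my_data.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read the file back to verify every line parses and carries the required fields.
with open("my_data.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        assert {"dimension", "prompt", "response"} <= rec.keys()
```

After running infer.py, each record gains a "predicted" field holding the scorer's output, saved alongside the input file.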

©️ Citation

If you find this dataset helpful, please cite our paper:

@misc{huang2023flames,
      title={Flames: Benchmarking Value Alignment of Chinese Large Language Models}, 
      author={Kexin Huang and Xiangyang Liu and Qianyu Guo and Tianxiang Sun and Jiawei Sun and Yaru Wang and Zeyang Zhou and Yixu Wang and Yan Teng and Xipeng Qiu and Yingchun Wang and Dahua Lin},
      year={2023},
      eprint={2311.06899},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}