
<p> <a href="https://inklab.usc.edu/CommonGen/"> <img src="https://img.shields.io/badge/Website-💻-green"> </a> <a href="https://arxiv.org/abs/1911.03705"> <img src="https://img.shields.io/badge/Paper-📝-orange"> </a> <a href="https://huggingface.co/datasets/allenai/commongen_lite"> <img src="https://img.shields.io/badge/Dataset-🤗-blue"> </a> </p>

CommonGen-Eval

Evaluating LLMs on CommonGen using the CommonGen-lite dataset (400 examples + 900 human references). We use GPT-4 to evaluate the constrained text generation ability of LLMs. See our paper for more details.

Leaderboard

| model | len | cover | pos | win_tie | overall |
|---|---:|---:|---:|---:|---:|
| human (upper bound) | 12.84 | 99.00 | 98.11 | 100.00 | 97.13 |
| human (lower bound) | 12.84 | 99.00 | 98.11 | 50.00 | 48.57 |
| gpt-4-0613 | 14.13 | 97.44 | 91.78 | 50.44 | 45.11 |
| gpt-4-1106-preview | 14.90 | 96.33 | 90.11 | 50.78 | 44.08 |
| gpt-3.5-turbo | 12.76 | 92.11 | 83.00 | 49.78 | 38.06 |
| Yi-34b-chat | 13.45 | 80.11 | 75.11 | 39.44 | 23.73 |
| Pallas-0.5 | 14.83 | 86.67 | 79.56 | 32.22 | 22.22 |
| vicuna-13b-v1.5 | 15.02 | 85.89 | 79.56 | 27.44 | 18.75 |
| tulu-2-dpo-70b | 17.89 | 88.78 | 80.11 | 23.00 | 16.36 |
| Mixtral-8x7B-Instruct-v0.1 | 20.15 | 84.11 | 73.33 | 17.89 | 11.03 |
| Llama-2-7b-chat-hf | 16.06 | 88.56 | 76.44 | 15.44 | 10.45 |
| zephyr-7b-beta | 15.76 | 82.44 | 72.78 | 16.89 | 10.13 |
| Yi-6b-chat | 13.32 | 71.67 | 63.56 | 22.11 | 10.07 |

Link: https://inklab.usc.edu/CommonGen/leaderboard.html

Installation

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_lg
```

Run model inference

Example:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python vllm_infer.py \
    --data_name "commongen" \
    --model_name 01-ai/Yi-34b-chat --tensor_parallel_size 4 --dtype bfloat16 \
    --output_folder "model_outputs/" \
    --top_p 1 --temperature 0 --batch_size 8 --max_tokens 128
```
<details> <summary>Instruction Prompt Template (2-shots prompting)</summary>
# Instruction

Given several concepts (i.e., nouns or verbs), write a short and simple sentence that contains *all* the required words.
The sentence should describe a common scene in daily life, and the concepts should be used in a natural way.

# Examples

## Example 1
- Concepts: "dog(noun), frisbee(noun), catch(verb), throw(verb)"
- Sentence: The dog catches the frisbee when the boy throws it into the air.

## Example 2
- Concepts: "apple(noun), place(verb), tree(noun), pick(verb)"
- Sentence: A girl picks some apples from a tree and places them into her basket.

# Your Task 

- Concepts: "{$concept_list}"
- Sentence: 
</details>
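Before inference, the `{$concept_list}` placeholder in the template above is filled with each example's tagged concepts. A minimal sketch of that substitution (the template text follows the prompt above; the `build_prompt` helper name is ours, not the actual code in `vllm_infer.py`):

```python
PROMPT_TEMPLATE = """# Instruction

Given several concepts (i.e., nouns or verbs), write a short and simple sentence that contains *all* the required words.
The sentence should describe a common scene in daily life, and the concepts should be used in a natural way.

# Examples

## Example 1
- Concepts: "dog(noun), frisbee(noun), catch(verb), throw(verb)"
- Sentence: The dog catches the frisbee when the boy throws it into the air.

## Example 2
- Concepts: "apple(noun), place(verb), tree(noun), pick(verb)"
- Sentence: A girl picks some apples from a tree and places them into her basket.

# Your Task 

- Concepts: "{$concept_list}"
- Sentence: """

def build_prompt(concepts: list[str]) -> str:
    # Join the POS-tagged concepts and substitute them into the template.
    return PROMPT_TEMPLATE.replace("{$concept_list}", ", ".join(concepts))

prompt = build_prompt(["car(noun)", "drive(verb)", "road(noun)"])
```

The prompt ends at `- Sentence: `, so the model's completion is the candidate sentence itself.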

Run GPT-4-based evaluation

To add your model to the leaderboard, please create an issue or PR to submit the inference script. I'll run the following evaluation script and update the leaderboard. You do not need to run the evaluation script yourself (the script requires special access to an HF dataset).

Scripts: see scripts/all_gpt_eval.sh and evaluate.py for more details.

Example:

```bash
models=("zephyr-7b-beta" "tulu-2-dpo-70b" "vicuna-13b-v1.5")
for model in "${models[@]}"
do
    # Each evaluation runs in the background so the models are scored in parallel.
    python evaluate.py --mode "compare" \
        --model_output_file "model_outputs/${model}.json" \
        --eval_output_file "eval_outputs/${model}.eval_result.gpt-4-1106-preview.json" \
        --model gpt-4-1106-preview &
done
wait  # block until all background evaluation jobs finish
```
<details> <summary>Evaluation Prompt Template (Pairwise Comparison)</summary>
# Data

Given several concepts (i.e., nouns or verbs), we ask models to write a short and simple sentence that contains *all* the required words. 
The sentence should describe a common scene in daily life, and the concepts should be used in a natural way.

Concepts: "{$concept_list}"

Model A: "{$candidate_A}"

Model B: "{$candidate_B}"

# Your Task

Your task is to choose a better sentence from the two candidates. Decide which model's sentence is better in terms of the naturalness and commonness of the scenes they describe. 

## Rules: 
- A better sentence should describe a common scene in daily life, and all concepts should be used in a natural way.
- You should prefer sentences that use all given concepts with correct part-of-speech tags. 
- A simpler and shorter sentence is preferred if it describes the same scene as the other sentence.
- If you think both sentences are equally good or bad, please choose "tie".

Now, please output your choice ("A" or "B" or "tie").

Your choice: 
</details>
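Each pairwise comparison yields a verdict of "A", "B", or "tie", and the leaderboard's win_tie column is the percentage of examples where the model wins or ties. A minimal tally sketch (the function name and input shape are our assumptions, not the actual `evaluate.py` schema):

```python
from collections import Counter

def win_tie_rate(choices: list[str], model_side: str = "A") -> float:
    """Percentage of pairwise comparisons the model wins or ties.

    `choices` holds GPT-4's per-example verdicts ("A", "B", or "tie");
    `model_side` says which candidate slot the evaluated model occupied.
    """
    counts = Counter(choices)
    win_tie = counts[model_side] + counts["tie"]
    return 100.0 * win_tie / len(choices)

print(win_tie_rate(["A", "tie", "B", "A"]))  # 75.0
```

By this accounting, a model indistinguishable from the human references would score near 50.0, which matches the human lower-bound row in the leaderboard.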

Case studies

Here are some examples of sentences generated by the models.

<details> <summary> Example 1 </summary> </details> <details> <summary> Example 2 </summary> </details> <details> <summary> Example 3 </summary> </details> <details> <summary> Example 4 </summary> </details>

Citation

@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Zhou, Wangchunshu  and
      Shen, Ming  and
      Zhou, Pei  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    pages = "1823--1840", 
}