
<p align="center"> <img src="assets/taco_logo.png" width="200" height="200"> </p>

TACO (Topics in Algorithmic COde generation dataset)

<p align="center"> 🤗 <a href="https://huggingface.co/datasets/BAAI/TACO">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp <a href="https://data.baai.ac.cn/details/BAAI-TACO"><img src="assets/baai.png" width="18"/> BAAI DataHub</a>&nbsp&nbsp | &nbsp&nbsp <a href="https://arxiv.org/abs/2312.14852">Paper</a> </p> <br>

TACO (Topics in Algorithmic COde generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more challenging training set and evaluation benchmark for code generation models. It consists of programming competition problems that are harder and closer to real programming scenarios than existing benchmarks, and it emphasizes improving or evaluating a model's understanding and reasoning in practical application scenarios rather than merely implementing predefined functions.

News and Updates

Download and Use

🤗 <a href="https://huggingface.co/datasets/BAAI/TACO">Hugging Face</a>

First, install the datasets package.

pip install -U datasets

Then, load the dataset with the following program.

from datasets import load_dataset
taco = load_dataset('BAAI/TACO', token=YOUR_HF_TOKEN)
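
Each sample bundles the problem statement with reference solutions and test cases. As a quick sanity check, you can inspect one record like this (a minimal sketch, assuming the solutions field is a JSON-encoded list, as in APPS-style datasets):

import json

sample = taco['train'][0]
print(sample['question'][:300])               # beginning of the problem statement
solutions = json.loads(sample['solutions'])   # assumed: JSON-encoded list of reference solutions
print(f"{len(solutions)} reference solutions")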

<img src="assets/baai.png" width="18"/><a href="https://data.baai.ac.cn/details/BAAI-TACO">BAAI DataHub</a> First, download the dataset and unzip it into a folder named "BAAI-TACO." Then, load the dataset with the following program.

from datasets import load_from_disk
taco = load_from_disk(PATH_TO_BAAI_TACO)  # path to the unzipped BAAI-TACO folder

Statistics of TACO

| Comparison Dimension | TACO | CodeContest | APPS | HumanEval(/-X) | MBP(/X)P |
|---|---|---|---|---|---|
| Problem Scale (train/dev/test) | 25443/-/1000 | 13328/117/165 | 5000/-/5000 | -/-/164 | 374/-/500 |
| No Answers in Test Set | 0 | 43/165 | 1235/5000 | 0 | 0 |
| Duplicate Questions | No Duplication | No Duplication | No Duplication | Duplicates Removed | Duplicates Removed |
| Duplicate Answers | Duplicates Removed | No Duplication | No Duplication | Duplicates Removed | Duplicates Removed |
| Avg. Test Cases per Problem | 202.3 | 203.7 | 20.99 | 7.77 | 3 |
| Task Topics | Yes | Yes | No | No | No |
| Algorithm Tags | Yes | No | No | No | No |
| Programming Skills | Yes | No | No | No | No |
| Difficulty Tags | Yes | Yes | Yes | No | No |

The distribution of algorithm tags in TACO is as follows:

<center> <img src="assets/algo.png" width="600"/> </center>

The distribution of programming skills in TACO is as follows:

<center> <img src="assets/skill.png" width="600"/> </center>

Evaluation with TACO

First, initialize the model and tokenizer, and choose the difficulties or skills you want to evaluate on.

# Initialize model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'codellama/CodeLlama-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda:0"
model = model.to(device)


# Initialize evaluation dataset 
difficulties = ['ALL']
# difficulties = ["EASY", "MEDIUM", "MEDIUM_HARD", "HARD", "VERY_HARD"] 
# skills = ['ALL']
# skills = ["Data structures", "Sorting", "Range queries", "Complete search", "Amortized analysis", "Dynamic programming", "Bit manipulation", "Greedy algorithms"]

from datasets import load_dataset
taco = load_dataset('BAAI/TACO', split='test', difficulties=difficulties)
# taco = load_dataset('BAAI/TACO', split='test', skills=skills)
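
To sanity-check the filter, you can print the size of the loaded split and the difficulty of one sample (the difficulty field name follows the dataset card):

print(len(taco))
print(taco[0]['difficulty'])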

Then, run generations with code models.

# Sampling configuration: generations per problem and decoding hyperparameters
n_samples = 200
temperature = 0.2
top_p = 0.95
output = []
for idx, sample in enumerate(taco):
    prompt = sample['question']
    results = {"task_id": idx, "prompt": prompt}
    generations = []
    for i in range(n_samples):
        seed = i
        # predict and truncate_after_eof_strings are defined in generation.py
        generation = predict(device, model, tokenizer, prompt, seed, top_p, temperature, max_length=2048)
        clean_code = truncate_after_eof_strings(generation)
        generations.append(clean_code)
    results["output"] = generations
    output.append(results)
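
predict and truncate_after_eof_strings above are helpers defined in generation.py. For reference, a minimal sketch of what such a predict helper could look like (an assumption for illustration, not the repo's exact implementation):

import torch
from transformers import set_seed

def predict(device, model, tokenizer, prompt, seed, top_p, temperature, max_length=2048):
    # Hypothetical sketch: sample one completion for the prompt with a fixed seed.
    set_seed(seed)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_length).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            max_length=max_length,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens after the prompt.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)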

generation.py gives a complete example of generating TACO result samples with CodeLlama; it outputs a JSON file, generation.json, in the following format.

[
    {
        "task_id": 0,
        "prompt": "The city park of IT City contains n east to ...",
        "output": [
            "\ndef solve(n):\n    return n**5 - 10*n**4 + 40*n**3 ...",
            "\ndef solve(n):\n    return n**5 - 10*n**4 + 40*n**3 ...",
            ...
        ]
    },
    {
        "task_id": "1",
        "prompt": "Zookeeper is buying a carton of fruit to feed ...",
        "output": [
            "\ndef solve(n, s):\n    pre, suf, ans = [0]*n, [0]*n, ...",
            "\ndef solve(n, s):\n    pre, suf, ans = [0]*n, [0]*n, ...",
            ...
        ]
    },
    ...
]

Finally, execute the generated code and compute metrics. compute_metric.py gives a complete example of code execution and pass@k computation, using the generation.json produced in the previous step.
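
For reference, pass@k is typically computed with the unbiased estimator from the Codex paper, given n samples per task of which c pass; compute_metric.py may differ in details, so treat this as a sketch:

import numpy as np

def pass_at_k(n, c, k):
    # Probability that at least one of k samples drawn from n (c of them correct) passes.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(200, 25, 10))  # e.g. 200 samples, 25 passing -> pass@10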

The result file taco_metrics.json looks like this:

{
    "pass@1": 0.0932,
    "pass@10": 0.1515,
    "pass@100": 0.1999,
    "detail" : {
        "pass@1": {
            "0": ...,
            "1": ...,
            ...
        },
        "pass@10": {
            "0": ...,
            "1": ...,
            ...
        },
        "pass@100": {
            "0": ...,
            "1": ...,
            ...
        }
    }
}

Finetuning with TACO

First, tokenize the training set of TACO. We provide a Python script, pretokenizing.py, and an example shell script, pretokenize.sh, to help you. This step writes the pretokenized training data into cache_dir under the name dataset_name. Below is an example of tokenizing with CodeLlama-7b.

python pretokenizing.py \
    --tokenizer_dir codellama/CodeLlama-7b-hf \
    --cache_dir . \
    --dataset_name codellama_tokenized 
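
Conceptually, pretokenization converts each (problem, solution) pair into token IDs once, so training does not pay the tokenization cost on every epoch. A minimal sketch of the idea (the prompt layout and field handling here are assumptions; pretokenizing.py defines the actual format):

import json
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
train = load_dataset("BAAI/TACO", split="train")

def tokenize_pair(example):
    # Assumed layout: problem statement followed by one reference solution.
    solution = json.loads(example["solutions"])[0]
    text = example["question"] + "\n" + solution
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = train.map(tokenize_pair, remove_columns=train.column_names)
tokenized.save_to_disk("codellama_tokenized")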

Then, finetune with the pretokenized training data. We provide a Python script, train.py, and an example shell script, finetune.sh, to help you. This step writes checkpoints to output_dir. Below is an example of finetuning CodeLlama-7b.

torchrun --nproc_per_node=8 --nnodes=1 train.py \
    --model_name_or_path codellama/CodeLlama-7b-hf \
    --data_path codellama_tokenized \
    --bf16 True \
    --output_dir codellama_ft \
    --num_train_epochs 2 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --warmup_ratio 0.1 \
    --logging_steps 1 \
    --resume_from_checkpoint True \
    --gradient_checkpointing True \
    --deepspeed ds_configs/deepspeed_z2_config_bf16.json
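
The command above points at ds_configs/deepspeed_z2_config_bf16.json, which ships with the repo. For orientation, a representative ZeRO stage 2 bf16 DeepSpeed configuration looks roughly like this (a sketch, not necessarily the repo's exact file):

{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}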

Evaluation Results

We conducted experiments on the TACO test set and training set with GPT-4 and a code generation model trained on a large amount of code data; detailed results are reported in the paper.

Citation

If you use the models, data, or code from this project, please cite the original paper:

@article{li2023taco,
  title={TACO: Topics in Algorithmic COde generation dataset},
  author={Rongao Li and Jie Fu and Bo-Wen Zhang and Tao Huang and Zhihong Sun and Chen Lyu and Guang Liu and Zhi Jin and Ge Li},
  journal={arXiv preprint arXiv:2312.14852},
  year={2023}
}

License

The TACO dataset, authored by BAAI, Shandong Normal University, and Peking University, is released under an Apache 2.0 License. However, the data also includes content under other permissive licenses such as the MIT License, as well as web-crawled data used under the terms of the CC BY 4.0 license (Creative Commons Attribution 4.0 International).

We gratefully acknowledge the contributions of the following: