DevEval

DevEval is a code generation benchmark collected through a rigorous pipeline. DevEval contains 1,825 testing samples, collected from 115 real-world code repositories and covering 10 programming topics. Compared to existing benchmarks, DevEval aligns with real-world repositories in multiple dimensions, e.g., code and dependency distributions. More details about DevEval can be found in our paper (link).

Metadata

The metadata of DevEval is stored in data.tar.gz. Users can uncompress the file to get the metadata - data.jsonl.

tar -xzvf data.tar.gz

Each line of the metadata is a JSON object describing one sample. Its fields include, among others, the sample's namespace, its repository (project_path), the file to be completed (completion_file), and the location of the code to be generated (body_position).
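
As a quick orientation, the sketch below iterates over data.jsonl and groups samples by repository. The field names namespace and project_path follow the references elsewhere in this document; check data.jsonl for the exact schema.

import json
from collections import defaultdict

# Group the metadata samples by the repository they come from.
samples_by_repo = defaultdict(list)
with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        samples_by_repo[sample["project_path"]].append(sample["namespace"])

print(f"{sum(len(v) for v in samples_by_repo.values())} samples "
      f"across {len(samples_by_repo)} repositories")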

Setup

Before running the evaluation, researchers need to download the repositories and the dependency data.

Repositories

The original repositories can be downloaded from link. Users need to uncompress the repositories and put them in the root directory (e.g., DevEval/Source_Code).

The project contexts are stored in Source_Code. Source_Code contains 10 subfolders, each of which corresponds to a programming topic, e.g., Text Processing. Each topic contains multiple repositories. For each sample in the metadata, we can find its repository based on the key project_path. Please do not modify the file structure of the repositories; otherwise, the evaluation script cannot work properly.
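
The sketch below resolves a sample's repository folder from project_path, assuming the repositories were extracted to DevEval/Source_Code as described above; whether project_path is relative to Source_Code or to the DevEval root may differ, so adjust the base path if needed.

import json
from pathlib import Path

SOURCE_CODE_ROOT = Path("/home/user/DevEval/Source_Code")  # adjust to your setup

# Take the first metadata sample and locate its repository on disk.
with open("data.jsonl", "r", encoding="utf-8") as f:
    sample = json.loads(f.readline())

repo_dir = SOURCE_CODE_ROOT / sample["project_path"]
print(repo_dir, "exists:", repo_dir.is_dir())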

Dependency Data

The dependency data of the repositories is used to evaluate the recall of reference dependencies and is available at link. Users need to uncompress the dependency data and put it in the root directory (e.g., DevEval/Dependency_Data). Please do not modify the file names of the dependency data; otherwise, the evaluation script cannot load the cached dependency data.

Evaluation

Environment Setup

Create a virtual conda environment and install the required packages.

conda create --name DevEval --file environment.txt
conda activate DevEval
pip install -r requirement.txt
# replace the path with your own path
echo "export NLTK_DATA=/home/user/DevEval/nltk_data" >> ~/.bashrc
source ~/.bashrc
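
A quick way to confirm that the NLTK_DATA path set above is visible to Python is sketched below; which NLTK resources the evaluation actually needs is not listed here, so this only checks the search path.

import os
import nltk

# NLTK_DATA should point at the nltk_data folder exported in ~/.bashrc above.
print("NLTK_DATA =", os.environ.get("NLTK_DATA"))
print("nltk search path:", nltk.data.path)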

Users can execute python pass_k.py to check whether the environment is set up correctly. This command runs the test cases on the ground-truth code and stores failed samples in failed_samples.jsonl.
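
If pass_k.py reports failures, the sketch below lists them; the exact schema of failed_samples.jsonl is an assumption (a namespace field per line), so inspect the file first.

import json

# Count and show the samples whose ground-truth tests did not pass.
with open("failed_samples.jsonl", "r", encoding="utf-8") as f:
    failed = [json.loads(line) for line in f]

print(f"{len(failed)} samples failed on ground truth")
for sample in failed[:10]:
    print(sample.get("namespace", sample))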

Completion Format

Users need to convert the models' predictions into a jsonl file. Each line of the jsonl file is a JSON object storing the completion for one requirement. An example of such a JSON object is shown below.

{
    "namespace": "benedict.utils.type_util.is_bool",
    "completion": "    config_options = {}\n    # Code to retrieve options from the \"twtxt\" section of the config file\n    # If the section does not exist, return an empty dictionary\n    return config_options\n"
}
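
A minimal sketch of producing such a file from raw model outputs is shown below; raw_outputs is a hypothetical in-memory dict keyed by namespace, and only the namespace and completion keys are required by the format above.

import json

# Hypothetical model outputs: {namespace: generated function body}.
raw_outputs = {
    "benedict.utils.type_util.is_bool": "    return isinstance(val, bool)\n",
}

with open("completion.jsonl", "w", encoding="utf-8") as f:
    for namespace, completion in raw_outputs.items():
        f.write(json.dumps({"namespace": namespace, "completion": completion}) + "\n")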

We provide processed completion files of a few LLMs in Experiments.

Notes

  1. The pipeline of our evaluation is described as follows. For each sample, we first locate its completion_file and body_position. Then, we replace the code at body_position with the generated code and run the test cases to check it. Afterwards, we restore the original code at body_position before moving on to the next sample (see the sketch after this list).

  2. Because a repository often contains multiple samples of DevEval, researchers should avoid running multiple evaluation scripts on the same repository simultaneously. Otherwise, the evaluation results may be incorrect.
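
A minimal sketch of the replace-and-restore step in note 1 is given below. It assumes body_position is a 1-indexed, inclusive [start, end] line range within completion_file; the actual pass_k.py may handle details such as indentation differently.

from pathlib import Path

def splice(completion_file: Path, body_position, generated_code: str) -> str:
    """Replace the lines at body_position with generated_code; return the original text."""
    original = completion_file.read_text(encoding="utf-8")
    lines = original.splitlines(keepends=True)
    start, end = body_position  # assumed 1-indexed and inclusive
    completion_file.write_text("".join(lines[:start - 1] + [generated_code] + lines[end:]),
                               encoding="utf-8")
    return original

def restore(completion_file: Path, original: str) -> None:
    """Write the original content back after the tests have run."""
    completion_file.write_text(original, encoding="utf-8")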

Pass@k (Functional Correctness)

Users can run run_pass_k.sh to compute the Pass@k. The script is shown below.

ROOT=/home/user/DevEval
TASK=without_context
Model=gpt-4-1106-preview_greedy

python $ROOT/check_source_code.py $ROOT/Source_Code

python pass_k.py \
    --output_file $ROOT/Experiments/$TASK/$Model/completion.jsonl \
    --log_file $ROOT/Experiments/$TASK/$Model/test_output.jsonl \
    --source_code_root $ROOT/Source_Code \
    --data_file $ROOT/data.jsonl \
    --n 1 \
    --k 1

The arguments are explained as follows.

  --output_file: the jsonl file containing the model's completions.
  --log_file: the file where the execution results of the test cases are written.
  --source_code_root: the root directory of the downloaded repositories (Source_Code).
  --data_file: the metadata file data.jsonl.
  --n: the number of completions generated per sample.
  --k: the k in Pass@k.
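
For reference, Pass@k is commonly computed with the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of generated samples per task and c the number that pass all tests; whether pass_k.py uses exactly this estimator is an assumption, and with n = 1 and k = 1 it reduces to the plain pass rate. A minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 out of 10 generations pass -> Pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))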

Recall@k (Recall of Reference Dependency)

Users can run parser/run_recall_k.sh to compute the Recall@k. The script is shown below.

cd parser
ROOT=/home/user/DevEval
TASK=without_context
Model=gpt-4-1106-preview_greedy

python $ROOT/check_source_code.py $ROOT/Source_Code

python recall_k.py \
    --output_file $ROOT/Experiments/$TASK/$Model/completion.jsonl \
    --log_file $ROOT/Experiments/$TASK/$Model/dependency_results.jsonl \
    --source_code_root $ROOT/Source_Code \
    --dependency_data_root $ROOT/Dependency_Data \
    --data_file $ROOT/data.jsonl \
    --k 1 

The arguments are explained as follows.

  --output_file: the jsonl file containing the model's completions.
  --log_file: the file where the parsed dependency results are written.
  --source_code_root: the root directory of the downloaded repositories (Source_Code).
  --dependency_data_root: the root directory of the downloaded dependency data (Dependency_Data).
  --data_file: the metadata file data.jsonl.
  --k: the k in Recall@k.
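
Conceptually, Recall@k measures how many of the reference dependencies (the APIs invoked by the ground-truth code) also appear among the dependencies parsed from the generated code. The sketch below shows that recall for a single sample; the dependency sets are hypothetical, and the actual parsing is done by the scripts in parser/.

def dependency_recall(reference_deps: set[str], predicted_deps: set[str]) -> float:
    """Fraction of reference dependencies that the generated code also uses."""
    if not reference_deps:
        return 1.0  # nothing to recall
    return len(reference_deps & predicted_deps) / len(reference_deps)

# Hypothetical example: one of the two reference dependencies is recalled -> 0.5
print(dependency_recall({"utils.load", "utils.save"}, {"utils.load", "os.path.join"}))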

Repository-level Code Generation

Experimental Settings

Prompts

We release all prompts used in the above three settings. The prompts are stored in Experiments/prompt. LM_prompt_elements.jsonl contains the prompt elements for language models, and other prompt files are used for instruction-tuned models.

Running LLMs

We reuse the code implementation in EvoCodeBench to run LLMs. Users can refer to the link.

Model's Completion

We release the models' predictions on DevEval. Users can find the completions in the Experiments/model_prediction folder.

Leaderboard

Citation

If you have any questions or suggestions, please email us at lijia@stu.pku.edu.cn.

If you find this repository useful, please cite our paper:

@article{DevEval,
  title={DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories},
  author={Li, Jia and Li, Ge and Zhao, Yunfei and Li, Yongmin and Liu, Huanyu and Zhu, Hao and Wang, Lecheng and Liu, Kaibo and Fang, Zheng and Wang, Lanshen and others},
  journal={arXiv preprint arXiv:2405.19856},
  year={2024}
}