EvoCodeBench

EvoCodeBench is an evolutionary code generation benchmark aligned with real-world code repositories. Details of EvoCodeBench can be found in our paper "EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories".

News

[Mar 29, 2024] We release EvoCodeBench and its first version - EvoCodeBench-2403.

Released Versions

Version             | Release Date | Covered Period      | Number of Samples | Number of Repositories | Link
EvoCodeBench-2403   | Mar 27, 2024 | Dec 2023 - Feb 2024 | 275               | 25                     | HuggingFace

Covered Period refers to the time period in which the covered repositories were created. You can download each version of EvoCodeBench from the links above.

Metadata

Example

Each sample in EvoCodeBench contains a set of metadata fields, including (among others) the namespace of the target function, its completion_file, and its body_position, which are used in the evaluation pipeline below.

Repositories and Dependency Data

The original repositories and dependency data of EvoCodeBench can be downloaded from the link in the Released Versions section. Researchers need to uncompress the original repositories and put them in the root directory (e.g., EvoCodeBench/Source_Code and EvoCodeBench/Dependency_Data).
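
For illustration, a minimal sketch of laying out the downloaded data is shown below. The archive names are assumptions; use the names of the files you actually downloaded.

import tarfile

# Hypothetical archive names -- replace them with the files downloaded from HuggingFace.
for archive in ("Source_Code.tar.gz", "Dependency_Data.tar.gz"):
    with tarfile.open(archive) as tar:
        tar.extractall(path=".")  # extract into the EvoCodeBench root directory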

Evaluation

Before Evaluation

  1. The pipeline of our evaluation is as follows. For each requirement, we first locate its completion_file and body_position. We then replace the code at body_position with the generated code, run the test functions, and store the execution results. Afterwards, we recover the original code at body_position before moving on to the next requirement (see the sketch after this list).

  2. Because a project often contains multiple samples in EvoCodeBench, researchers should avoid simultaneously running multiple evaluation scripts within the same repository. Otherwise, the evaluation results may be incorrect.

  3. If the Pass@k and Recall@k values are abnormal, the problem may be caused by the following reasons: (1) the environment is not set up correctly; (2) the generated code is not in the correct JSON format (see Output Format below).
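
As referenced in step 1, the sketch below illustrates the replace-run-restore loop. It is not the actual pass_k.py: it assumes body_position is a 1-based (start, end) line range and uses a hypothetical run_tests helper that calls pytest on an assumed tests field.

import subprocess

def run_tests(sample):
    # Hypothetical stand-in for the real test runner; assumes a `tests` field
    # listing the sample's test functions and uses pytest to execute them.
    proc = subprocess.run(["pytest", *sample.get("tests", [])], capture_output=True)
    return proc.returncode == 0

def evaluate_requirement(sample, generated_code, source_code_root):
    path = f"{source_code_root}/{sample['completion_file']}"
    with open(path) as f:
        original = f.readlines()
    start, end = sample["body_position"]          # assumed 1-based (start, end) line range
    patched = original[:start - 1] + [generated_code] + original[end:]
    with open(path, "w") as f:                    # 1. replace the code at body_position
        f.writelines(patched)
    try:
        passed = run_tests(sample)                # 2. run the test functions
    finally:
        with open(path, "w") as f:                # 3. recover the original code
            f.writelines(original)
    return passed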

Environment Setup

We strongly recommend that researchers conduct evaluations on Ubuntu and use conda to create a virtual environment.

cd EvoCodeBench
conda create --name EvoCodeBench python=3.10
conda activate EvoCodeBench
pip install numpy
pip install func-timeout
pip install tqdm
# textwrap is part of the Python standard library, so no install is needed
pip install psutil
pip install tiktoken

Then, researchers can build the execution environment by running the following command. Researchers need to modify the Root variable in the setup_env.sh script first.

bash setup_env.sh

Building the execution environment may take a few hours.

Output Format

Before evaluation, researchers need to convert the models' outputs into a jsonl file. Each line of the jsonl file is a JSON object storing a completion for one requirement. An example of such a JSON object is shown below.

{
    "namespace": "benedict.utils.type_util.is_bool",
    "completion": "    config_options = {}\n    # Code to retrieve options from the \"twtxt\" section of the config file\n    # If the section does not exist, return an empty dictionary\n    return config_options\n"
}
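
For illustration, a completion file in this format can be written as follows; raw_outputs is a hypothetical mapping from namespace to generated function body.

import json

# Hypothetical model outputs: namespace -> generated function body.
raw_outputs = {
    "benedict.utils.type_util.is_bool": "    return isinstance(val, bool)\n",
}

with open("completion.jsonl", "w") as f:
    for namespace, completion in raw_outputs.items():
        f.write(json.dumps({"namespace": namespace, "completion": completion}) + "\n")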

In our experiments, we employ the following metrics to evaluate the generated code. The definitions of these metrics can be found in our paper.

Pass@k (Functional Correctness)

Researchers can directly run the script run_pass_k.sh to evaluate the generated code. The script is shown below.

# replace the following paths with your own
ROOT=/home/user/EvoCodeBench

# recover the repositories
python check_source_code.py $ROOT/Source_Code

python pass_k.py \
    --output_file /path/to/completion.jsonl \
    --log_file /path/to/store/execution_results.jsonl \
    --source_code_root $ROOT/Source_Code \
    --data_file $ROOT/data.jsonl \
    --n 1 \
    --k 1

The arguments are explained as follows: --output_file is the jsonl file of generated completions; --log_file is where the execution results are stored; --source_code_root and --data_file point to the downloaded repositories and the benchmark's metadata file; --n is the number of completions generated per requirement; and --k is the k in Pass@k.
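
For reference, a minimal sketch of the standard unbiased Pass@k estimator (Chen et al., 2021) is shown below; we assume pass_k.py follows the same definition, with n samples per requirement of which c pass all tests.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n: generated samples per requirement, c: samples passing all tests, k: the k in Pass@k
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with n=10 samples and c=3 passing, Pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))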

Recall@k (Recall of Reference Dependency)

The Recall@k evaluation is run from the parser directory:

cd parser

Researchers can then directly run the script run_recall_k.sh to evaluate the generated code. The script is shown below.

# replace the following paths with your own paths
ROOT=/home/user/EvoCodeBench

# recover the repositories
python ../check_source_code.py $ROOT/Source_Code

python recall_k.py \
    --output_file /path/to/completion.jsonl \
    --log_file /path/to/store/dependency_results.jsonl \
    --k 1 \
    --source_code_root $ROOT/Source_Code \
    --dependency_data_root $ROOT/Dependency_Data \
    --data_file $ROOT/data.jsonl

The arguments are explained as follows: --output_file is the jsonl file of generated completions; --log_file is where the parsed dependency results are stored; --k is the k in Recall@k; --source_code_root and --dependency_data_root point to the downloaded repositories and dependency data; and --data_file is the benchmark's metadata file.
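
For reference, a rough sketch of the Recall@k idea is shown below, assuming the metric takes the best dependency recall among the k generated programs for each requirement; recall_k.py's exact implementation may differ.

def dependency_recall(generated_deps: set, reference_deps: set) -> float:
    # Fraction of the reference dependencies that the generated program invokes.
    if not reference_deps:
        return 1.0
    return len(generated_deps & reference_deps) / len(reference_deps)

def recall_at_k(per_sample_deps: list, reference_deps: set) -> float:
    # per_sample_deps: parsed dependencies for each of the k generated programs.
    return max(dependency_recall(deps, reference_deps) for deps in per_sample_deps)

# Example: two generations, the better one covers 2 of the 3 reference dependencies.
print(recall_at_k([{"a.f"}, {"a.f", "b.g"}], {"a.f", "b.g", "c.h"}))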

Repository-level Code Generation

Experimental Settings

prompt/prompt_elements.jsonl stores the contexts used in the three settings above (baseline, local_completion, and local_infilling). Researchers can use these contexts to reproduce our experimental results.
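
A minimal sketch for inspecting these contexts (without assuming particular field names) is shown below.

import json

# Print the fields available in the first record of prompt/prompt_elements.jsonl.
with open("prompt/prompt_elements.jsonl") as f:
    first = json.loads(next(f))
print(sorted(first.keys()))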

Code Generation

First, we produce prompts for different settings. Before running the following command, researchers need to install the tiktoken library.

# --setting: baseline, local_completion, or local_infilling
# --context_window: the maximum length (in tokens) of the context
# --max_tokens: the maximum length of the generated code
python make_prompt.py \
    --setting baseline \
    --output_file /path/to/store/prompt.jsonl \
    --context_window 16384 \
    --max_tokens 500
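
For illustration, truncating a context to the --context_window token budget can be done with tiktoken roughly as follows; the encoding name cl100k_base is an assumption, and the real logic lives in make_prompt.py.

import tiktoken

def truncate_context(text: str, context_window: int = 16384) -> str:
    # Keep only the first `context_window` tokens of the context.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return enc.decode(tokens[:context_window])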

Then, we invoke OpenAI's API to generate code based on the prompts. Researchers need to install the openai library.

# --model: gpt-3.5 or gpt-4
# --moda: greedy or sampling
# --api_key_file: the file containing the OpenAI API key
python gpt_generation.py \
    --prompt_file /path/to/prompt.jsonl \
    --output_dir /path/to/store/completion.jsonl \
    --model gpt-3.5 \
    --moda greedy \
    --api_key_file /path/to/api_key.txt
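
For illustration, a single prompt can be sent to OpenAI's API roughly as follows (a sketch, not the repo's gpt_generation.py); it uses the openai>=1.0 client, approximates greedy decoding with temperature=0, and the model name is a placeholder.

from openai import OpenAI

client = OpenAI(api_key=open("/path/to/api_key.txt").read().strip())
response = client.chat.completions.create(
    model="gpt-4",                    # placeholder; pass the model you evaluate
    messages=[{"role": "user", "content": "<prompt from prompt.jsonl>"}],
    temperature=0,                    # greedy; use a higher temperature for sampling
    max_tokens=500,
)
completion = response.choices[0].message.content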

To generate code with open-source LLMs, researchers need to install the vllm library before running the following command.

# --setting: baseline, local_completion, or local_infilling
# --model: deepseek-7b or other models
# --moda: greedy or sampling
python LM_inference.py \
    --setting baseline \
    --output_dir /path/to/store/completion.jsonl \
    --model deepseek-7b \
    --moda greedy
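
For illustration, greedy generation with vllm looks roughly like the sketch below (not the repo's LM_inference.py); the model name is illustrative.

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-base")   # illustrative model name
params = SamplingParams(temperature=0, max_tokens=500)    # greedy decoding
outputs = llm.generate(["<prompt from prompt.jsonl>"], params)
print(outputs[0].outputs[0].text)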

After generating code, researchers need to extract the function body from the models' raw outputs. This can be done with the following command.

cd prompt
# --model_type: lm or gpt
python process_completion.py \
    --model_type lm \
    --completion_file /path/to/raw_completion.jsonl \
    --output_file /path/to/completion.jsonl \
    --data_file /path/to/data.jsonl

The resulting completion.jsonl file can then be used in the evaluation process.
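
For illustration, a rough sketch of the extraction step is shown below; process_completion.py is the authoritative implementation, and the sketch simply keeps the indented body after the first def line.

def extract_body(raw_completion: str) -> str:
    lines = raw_completion.splitlines()
    # Drop everything up to and including the first "def ..." line, if present.
    for i, line in enumerate(lines):
        if line.lstrip().startswith("def "):
            lines = lines[i + 1:]
            break
    body = []
    for line in lines:
        if line.strip() and not line.startswith((" ", "\t")):
            break                      # stop at the first non-indented line
        body.append(line)
    return "\n".join(body) + "\n"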

Model's Completion

We release models' completions and their evaluation results on EvoCodeBench-2403. Researchers can find the completions in the model_completion folder.

Leaderboard

We evaluate 10 popular LLMs on EvoCodeBench-2403, and the results are shown in the following Table.


Citation

If you have any questions or suggestions, please email us at lijia@stu.pku.edu.cn.