# DomainEval
## Environment Setup
```bash
cd DomainEval/setup
env_name="your_env_name"
conda create -n "$env_name" python=3.9 -y
conda activate "$env_name"
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements_py39.txt
```
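Optionally, you can sanity-check the installation before moving on. The snippet below is not part of DomainEval; it only prints the installed PyTorch version and whether CUDA is visible:

```bash
# Optional: confirm the environment resolves PyTorch and CUDA correctly
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```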
## Benchmark Construction

### Domain Repository Collection
Move each collected code repository into the directory of its corresponding domain, i.e. `{src_data}/{domain}`; a sketch is shown below.
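For example, assuming a source-data root and a domain name of your choosing (both names below are placeholders, not names required by DomainEval), the layout can be prepared like this:

```bash
# Hypothetical layout: {src_data}/{domain}/{repository}
src_data="your_src_data_dir"   # placeholder for the {src_data} root
domain="your_domain"           # placeholder domain name
mkdir -p "${src_data}/${domain}"
# Move (or clone) each collected repository under its domain directory
mv path/to/collected_repo "${src_data}/${domain}/"
```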
### Test-Method Matching & Selection
```bash
domain="your_domain"
version="your_version"
srcdata_dir="{src_data}"
cd DomainEval
mkdir -p "log_${version}"
# Test-method matching: runs in the background, logging per domain
nohup python -u sandbox.py \
    --domain "$domain" \
    --srcdata_dir "$srcdata_dir" \
    --output_dir "bench_${version}" \
    > "log_${version}/result_sandbox_${domain}.txt" &
# Selection: filter the matched code (run after sandbox.py has finished)
python -u codefilter.py \
    --bench_dir "bench_${version}" \
    > "log_${version}/result_codefilter.txt"
```
### Instruction Generation
```bash
version="your_version"
nohup python -u datagenerate.py \
    --eval_dir "domaineval_${version}" \
    > "log_${version}/result_datagenerate.txt" &
```
## Dataset

The final dataset is located in `domaineval_{your version}`. The data is stored in JSON Lines format: each line is a JSON object with the following structure (a quick way to inspect a sample is shown after the schema):
```
{
    "method_name": ...,
    "full_method_name": ...,
    "method_path": ...,
    "method_code": ...,
    "test_code_list": [
        {"test_code": ..., "code_start": ..., "test_path": ...},
        {"test_code": ..., "code_start": ..., "test_path": ...}
    ],
    "instruction": ...,
    "method_code_mask": ...
}
```
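To spot-check the built dataset, you can pretty-print a single record. The file name below is only a placeholder; use an actual JSONL file produced under `domaineval_{your version}`:

```bash
# Pretty-print the first record of one dataset file (file name is a placeholder)
head -n 1 "domaineval_${version}/your_domain.jsonl" | python -m json.tool
```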
## Evaluation

First, include the path and name of your model in `self.model_path_dict` within `modeleval.py`. To evaluate a model served through an API, add your model's API call to `get_message` within `utils/utils_chat.py` and add its name to `self.model_name_list_api` within `modeleval.py`.
```bash
model_name="your_model_name"  # or "std"
# set the k in pass@k; currently it can only be 1 or 5
k_pass=1  # or k_pass=5
# set the version of the dataset
version="your_version"
eval_dir="domaineval_${version}"

# model inference
nohup python -u modeleval.py \
    -m "$model_name" \
    -b "$eval_dir" \
    -k "$k_pass" \
    > "result_modeleval_${model_name}_pass@${k_pass}.txt" &

# result execution and analysis
nohup python -u resultexec.py \
    -m "$model_name" \
    -v "$eval_dir" \
    -k "$k_pass" \
    > result_exec.txt &
resultexec_pid=$!
echo $resultexec_pid
wait $resultexec_pid

mkdir -p "analyseresult/pass@${k_pass}"
python resultanalyse.py \
    -m "$model_name" \
    -v "$eval_dir" \
    -k "$k_pass" \
    > "analyseresult/pass@${k_pass}/result_analyse_${model_name}.txt"
```
**Tips:** To evaluate LLMs using the `domaineval_20240711` dataset, first set `model_name="std"`, `k_pass=1`, and `version="20240711"`, then run the commands in Evaluation to verify the environment. With a correctly installed environment, the accuracy of `std` should be 100%, with the only possible failure being a `timed out` error. You can also use our `setup/Dockerfile` to build the execution Docker image, but be aware that two data points might time out. A minimal verification run is sketched below.
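The sketch below simply runs the Evaluation commands in the foreground with the `std` settings from the tip above:

```bash
# Verify the environment with the reference ("std") solutions; expected accuracy: 100%
model_name="std"
k_pass=1
version="20240711"
eval_dir="domaineval_${version}"
python -u modeleval.py -m "$model_name" -b "$eval_dir" -k "$k_pass" \
    > "result_modeleval_${model_name}_pass@${k_pass}.txt"
python -u resultexec.py -m "$model_name" -v "$eval_dir" -k "$k_pass" > result_exec.txt
mkdir -p "analyseresult/pass@${k_pass}"
python resultanalyse.py -m "$model_name" -v "$eval_dir" -k "$k_pass" \
    > "analyseresult/pass@${k_pass}/result_analyse_${model_name}.txt"
```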
## Submission

Now you have the results of your model on the dataset:

- `DomainEval/modelresult/${eval_dir}/${model_name}/pass_${k_pass}`: completed code generated by your LLM.
- `DomainEval/executeresult/${eval_dir}/${model_name}/pass_${k_pass}`: execution results of the generated code.
- `DomainEval/analyseresult/pass@${k_pass}/result_analyse_${model_name}.txt`: analysis results of the generated code.
The next step is to submit a pull request for the project:
- Fork the repository into your own GitHub account.
- Clone the repository to your local machine.
- Checkout a new branch from `main`.
- Add the result directories and file listed above (i.e. `./modelresult/${eval_dir}/${model_name}`, `./executeresult/${eval_dir}/${model_name}`, and `./analyseresult/pass@${k_pass}/result_analyse_${model_name}.txt`); a sketch of the git commands follows this list.
- Submit the Pull Request.
- The maintainers will review your Pull Request soon.
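A sketch of the corresponding git commands, assuming your fork is already cloned and `eval_dir`, `model_name`, and `k_pass` are set as in Evaluation (the branch name and commit message are placeholders):

```bash
# Run inside your local clone of the fork
git checkout -b add-your-model-results main
git add "modelresult/${eval_dir}/${model_name}" \
        "executeresult/${eval_dir}/${model_name}" \
        "analyseresult/pass@${k_pass}/result_analyse_${model_name}.txt"
git commit -m "Add ${model_name} results on ${eval_dir}"
git push origin add-your-model-results
# Then open a Pull Request against the upstream main branch on GitHub
```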
Once your pull request is accepted, we will update the Leaderboard with your results.