AiXcoder NL2Code Evaluation Benchmark (aix-bench)
Paper available: https://arxiv.org/abs/2206.13179
Introduction
This is a method-level benchmark for evaluating code generation (synthesis) models, which take natural language as input and produce code as output. The AiXcoder NL2Code Evaluation Benchmark is divided into two datasets:
- Automated Test Dataset: Each sample contains a functionally independent, well-described natural language function description, the Java signature of the function, and a set of Java unit tests that verify the correctness of the function. This dataset is mainly used to automatically evaluate the correctness of the code generated by the model.
- NL Task Description Dataset: Each sample contains a relatively independent functional description. This part of the data is closer to real method descriptions found in code, and includes some functional descriptions whose details are not fully specified. The code generated by the model has to be evaluated by humans; see the evaluation criteria described later.
| Datasets | Automated Test Dataset | NL Task Description Dataset |
|---|---|---|
| Test Set Size | 175 | 161 |
Currently, these two datasets contain only Java code, and the natural language descriptions are written in English and Chinese. If you only care about code correctness, you can use just the Automated Test Dataset.
License
The code in this project uses the MIT open source license.
The data in this project is licensed under the Computational Use of Data Agreement (C-UDA).
Referencing
If you use code or data from this project, please cite it as follows:
@misc{2206.13179,
Author = {Yiyang Hao and Ge Li and Yongqiang Liu and Xiaowei Miao and He Zong and Siyuan Jiang and Yang Liu and He Wei},
Title = {AixBench: A Code Generation Benchmark Dataset},
Year = {2022},
Eprint = {arXiv:2206.13179},
}
Automated Test Dataset
Data file path: src/main/resources/dataset_autotest_nl.jsonl
This dataset is a hand-picked collection of "Method Comments" taken from open-source "Method Comment - Java Method Implementation" pairs. Our selection criteria are:
- Comments well describe a function that can be implemented.
- The functions are relatively independent and do not depend on the understanding of the context of the project and business logic.
- The functionality is reasonable and could occur in a developer's day-to-day work, rather than being a programming-competition quiz or coursework.
- Comments are descriptions of the objective, rather than descriptions of the implementation process.
On this basis, we extracted the descriptions from the comments and then supplemented them so that:
- The description contains the specific information necessary to implement the function. For example, "Returns whether or no the JDK version is high enough." gives no clear standard for "high enough", so we manually amended it to "Returns whether or no the JDK version is 1.7u40 and above."
- The parts of the description irrelevant to the task are deleted. For example, we removed the second half of the original "max() that works on three integers. Like many of the other max() functions in this class".
Just as in real-world scenarios, the natural language descriptions may contain grammatical errors, punctuation mistakes, or inconsistent capitalization. We keep these because we believe such perturbations test the model's robustness to noisy input.
NL Task Description Dataset
Data file path: src/main/resources/dataset_manual_nl.jsonl
This dataset is a hand-picked collection of "Method Comments" taken from open-source "Method Comment - Java Method Implementation" pairs. Our selection criteria are:
- Comments well describe a function that can be implemented.
- The functions are relatively independent and do not depend on the understanding of the context of the project and business logic.
- The functionality is reasonable and could occur in a developer's day-to-day work, rather than being a programming-competition quiz or coursework.
- We allow a certain degree of ambiguity. For example, in "Read the encoded image data from a JPEG image.", we do not specify how the read data should be handled. During evaluation, as long as the code generated by the model fully implements the functionality described, full marks are awarded for correctness.
Evaluation standard
We manually evaluate the code generated by the model in three dimensions.
Correctness:
- 4 points: The specified function is fully realized.
- 3 points: The main function is realized, but some details are missing, which does not affect the correctness of the overall logic. A little modification is needed to meet all the requirements.
- 2 points: Only the core function is implemented. Most of the requirements are not reflected in the code. More modifications are required to meet the requirements.
- 1 point: The specified function is not implemented at all.
Code Quality:
- 3 points: The details are in place. There is no obviously better-performing alternative. Where applicable, resources are released properly. No obvious code smell.
- 2 points: Some details are not in place. There are code smells of low severity.
- 1 point: There is a significantly better solution in terms of performance, or there is a serious code smell.
Maintainability:
- 5 points: The method implementation is very standardized, the variable naming is semantically straightforward, the method is not unnecessarily bloated, the readability is good, the code is short, and the code blocks are clearly structured.
- 4 points: The method implementation is relatively standardized, the variable naming is mostly semantically clear, and the readability is good.
- 3 points: The method implementation meets certain specifications, but some variable names are meaningless, and some defective code or deprecated methods are used.
- 2 points: The code is written in a confusing way, does not follow a consistent specification, contains many meaningless variable names, or has repeated and redundant code. Poor readability.
- 1 point: Very confusing, completely illogical, hard-to-read code.
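For bookkeeping, the following is a minimal Java sketch of how the three manual scores for each sample could be recorded and averaged across the dataset. The simple per-dimension mean shown here is an assumption for illustration; it is not prescribed by the benchmark.

import java.util.List;

/** Minimal sketch: the three manual scores assigned to one generated sample. */
class ManualScore {
    final int correctness;      // 1-4
    final int codeQuality;      // 1-3
    final int maintainability;  // 1-5

    ManualScore(int correctness, int codeQuality, int maintainability) {
        this.correctness = correctness;
        this.codeQuality = codeQuality;
        this.maintainability = maintainability;
    }

    /** Average each dimension over all scored samples (assumed aggregation, illustration only). */
    static double[] average(List<ManualScore> scores) {
        double c = 0, q = 0, m = 0;
        for (ManualScore s : scores) {
            c += s.correctness;
            q += s.codeQuality;
            m += s.maintainability;
        }
        int n = scores.size();
        return new double[]{c / n, q / n, m / n};
    }
}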
Dataset
The Automated Test Dataset includes 175 hand-picked code examples that occur frequently in Java programming, and each example includes the following fields:
{
"task_id": 166,
"raw_nl": "通过反射为对象的对应字段注入值",
"signature": "public <T> T initByReflect(String name, String value, T t)"
}
The task_id field is the serial number of the example; raw_nl is the natural-language description (here in Chinese, meaning "inject a value into the corresponding field of an object via reflection"); and signature is the signature of the function to be generated. raw_nl and signature together form the input to the model.
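To make the input format concrete, here is a minimal sketch (not part of the repository) that reads dataset_autotest_nl.jsonl and prints the fields that make up each model prompt. Using Gson for JSON parsing is an assumption about your dependencies; any JSON library will do.

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Minimal sketch (not part of the repository): iterate over the benchmark samples. */
public class DatasetReader {
    public static void main(String[] args) throws Exception {
        // Each line of the .jsonl file is one sample with task_id, raw_nl and signature.
        for (String line : Files.readAllLines(Paths.get("src/main/resources/dataset_autotest_nl.jsonl"))) {
            if (line.trim().isEmpty()) continue;
            JsonObject sample = JsonParser.parseString(line).getAsJsonObject();
            int taskId = sample.get("task_id").getAsInt();
            String rawNl = sample.get("raw_nl").getAsString();
            String signature = sample.get("signature").getAsString();
            // raw_nl and signature together are the prompt handed to the model.
            System.out.println(taskId + ": " + rawNl + " | " + signature);
        }
    }
}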
Project structure
- src/main/java/com/aixcode/autoTest/evaluation/: automated test classes for testing each example
- src/main/java/com/aixcode/autoTest/generate/: function-level classes that hold the model output; one class must be created manually per example
- src/main/java/com/aixcode/autoTest/Executor.java: the automated test executor
- src/main/java/com/aixcode/autoTest/predictionHelper.java: converts predicted methods into classes that can be tested automatically
How to use
1. Download the dataset
git clone https://github.com/aixcoder-plugin/nl2code-dataset.git
2. Build the project
Open the project with IntelliJ IDEA, add the required libraries to the classpath, and run the project by executing Executor.java.
3. Get model predictions
For each test sample, take raw_nl and signature as input and obtain the model's output. The output becomes the only method of a class whose name is prefix+task_id, where the prefix is user-defined. This class must also extend the GenerateMethodBase class. For the following example, based on the model's predicted output, the user needs to manually create the class below, named Aixcoder166 (Aixcoder + 166), which extends GenerateMethodBase.
public class Aixcoder166 extends GenerateMethodBase {
/**
* 通过反射为对象的对应字段注入值
*/
public <T> T initByReflect(String name, Object value, T t) {
if (null == t) {
throw new NullPointerException("t can not be null");
}
if (null == value) {
return null;
}
Class<?> clazz = t.getClass();
if (!clazz.isAssignableFrom(value.getClass())) {
throw new IllegalArgumentException("value must be assignable to" + clazz);
}
try {
Field field = clazz.getDeclaredField(name);
field.setAccessible(true);
field.set(t, value);
} catch (NoSuchFieldException e) {
throw new IllegalArgumentException("no such field:" + name);
} catch (IllegalAccessException e) {
throw new IllegalArgumentException("illegal access:" + name);
}
return t;
}
}
The above process can be carried out in batches. Using the assembleFile method of the predictionHelper class, all classes can be generated in batches from the model's prediction output. Required dependencies still need to be imported manually in each generated class. Execute the following code:
public class predictionHelper {
public static void main(String[] args) {
assembleFile("src/main/resources/prediction.jsonl");
}
}
4. Finally execute Executor
4.1 Test samples can be executed one at a time
class Executor{
private static void evaluationOneExample(String basePackage,String prefix,String fileId){
try {
int[] result= evaluationGenerateMethod(fileId,basePackage,prefix);
System.out.println(prefix+" result:"+result[0]+"/"+result[1]);
}catch (Exception e){
e.printStackTrace();
}
}
}
You can execute the example above like this:
class Executor{
public static void main(String[] args) {
try {
String taskId = "166";
String basePackage = "com.aixcode.autoTest.generate.aixcoder";
String prefix = "Aixcoder";
evaluationOneExample(basePackage, prefix, taskId);
} catch (Exception e) {
e.printStackTrace();
}
}
}
4.2 Executing all test samples at once
class Executor{
//Executing all samples. This will iterate through all evaluation classes under src/main/java/com/aixcode/autoTest/evaluation
public static double[] runAllTest(String basePackage, String prefix, int minFileId, int maxFileId) {
try {
List<String> fileNames = listFiles("src/main/java/com/aixcode/autoTest/evaluation");
List<String> fileIds = fileNames.stream().map(fileName -> fileName.substring("Evaluation".length(), fileName.lastIndexOf("."))).collect(Collectors.toList());
double copilot_score = 0;
int copilotExactCount = 0;
int totalCount = 0;
for (String fileId : fileIds) {
if (!(Integer.parseInt(fileId) >= minFileId && Integer.parseInt(fileId) <= maxFileId)) {
continue;
}
totalCount++;
int[] result = evaluationGenerateMethod(fileId, basePackage, prefix);
if (result != null && result.length == 2 && result[1] != 0) {
copilot_score += (double) result[0] / result[1];
if (result[0] == result[1]) {
copilotExactCount++;
}
}
}
return new double[]{copilot_score, copilotExactCount, totalCount};
} catch (Exception e) {
e.printStackTrace();
}
return new double[]{0, 0, 0};
}
}
To run all test samples, you can do the following:
class Executor {
public static void main(String[] args) {
try {
double[] res=runAllTest("com.aixcode.autoTest.generate.aixcoderFirstHalf", "AixcoderAuto", 0, 103);
System.out.println("result:"+res[0]+"/"+res[1]+"/"+res[2]);
} catch (Exception e) {
e.printStackTrace();
}
}
}
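For reference, here is a short sketch of turning the array returned by runAllTest into summary metrics. From the code above, res[0] accumulates each sample's ratio of passed to total unit tests, res[1] counts samples whose tests all passed, and res[2] is the number of samples evaluated; reporting the two ratios below is just one possible way to present the result.

class Executor {
    public static void main(String[] args) {
        double[] res = runAllTest("com.aixcode.autoTest.generate.aixcoderFirstHalf", "AixcoderAuto", 0, 103);
        // res[0]: sum over samples of (passed tests / total tests)
        // res[1]: number of samples whose tests all passed
        // res[2]: number of samples evaluated
        double avgTestPassRatio = res[2] > 0 ? res[0] / res[2] : 0;
        double fullyPassedRate = res[2] > 0 ? res[1] / res[2] : 0;
        System.out.printf("average test pass ratio: %.3f, fully passed rate: %.3f%n", avgTestPassRatio, fullyPassedRate);
    }
}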
Contributing
- Fork the repository
- Create a Feat_xxx branch
- Commit your code
- Create pull request