JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models
JavaBench is a project-level Java benchmark that contains four projects at graduate-level difficulty. The difficulty and quality of JavaBench are validated by students' submissions across four academic years. Please check our Leaderboard for a visualization of the evaluation results.
Updates
- 2024-06-08 Publish benchmark and leaderboard
- 2024-07-24 Add instructions for submitting results
Benchmark Dataset
The four Java projects in JavaBench were designed for undergraduate students across the four academic years from 2019 to 2022. We use the students' overall scores as evidence of the difficulty levels.
The benchmark dataset is accessible at `./datasets`. We provide three types of datasets with different context settings.
- Maximum Context: The dataset contains as much context information as possible (limited by the LLM's context length).
- Minimum Context: The dataset contains no context information.
- Selective Context: The dataset contains only the method signatures of dependencies, extracted by jdeps.
Below is the structure of the dataset:
- `task_id`: The ID of the completion task, composed of the assignment number and class name.
- `target`: The file path of the task in the Java project.
- `code`: The code snippet that needs to be completed, marked with `// TODO`.
- `code_context`: The context information that provides additional information for code completion.
Each `code` entry is a class skeleton with `// TODO` placeholders; an illustrative example of a record is shown below.
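The following sketch is hypothetical (the task ID, file path, Java class, method, and context are invented rather than taken from JavaBench), but it reflects the record structure described above.

```python
import json

# Hypothetical JavaBench-style record: the field names match the dataset
# structure above, but the task ID, paths, and Java code are invented.
record = {
    "task_id": "PA19-BankAccount",
    "target": "src/main/java/bank/BankAccount.java",
    "code": (
        "public class BankAccount {\n"
        "    private double balance;\n"
        "\n"
        "    public void deposit(double amount) {\n"
        "        // TODO\n"
        "    }\n"
        "}\n"
    ),
    "code_context": (
        "// Method signatures of dependencies (Selective Context)\n"
        "public interface Auditable { void audit(String event); }\n"
    ),
}

# Each line of datasets/*/data-*.jsonl is one such record serialized as JSON.
print(json.dumps(record, indent=2))
```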
Generation Strategies
Usage
Setup
Create a Python virtual environment and install the requirements.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Inference
We provide `inference.py` to generate code snippets with a given model. The following is an example of generating code snippets using the WizardCoder-Python-34B-V1.0 model.
We recommend setting the number of samples to 5 or more for reliable evaluation results.
python inference.py \
--model-path WizardLM/WizardCoder-Python-34B-V1.0 \
--num-sample 5 \
--output output/result-PA19/WizardCoder-Python-34B-V1.0/samples.jsonl \
--data datasets/selective-context/data-PA19.jsonl \
--mode holistic
The inference functionality is implemented on top of FastChat's command-line tool. FastChat supports specifying models from Hugging Face or local models, as well as configuring underlying options such as the device and GPUs. For more details, please refer to https://github.com/lm-sys/FastChat.
In addition to the parameters provided by FastChat, we add the following arguments:
Argument | Type | Default | Description |
---|---|---|---|
`--data` | str | (required) | Path to the input dataset file |
`--output` | str | (required) | Path to save the output |
`--mode` | `"holistic"` \| `"independent"` \| `"incremental"` | `"holistic"` | Synthesis strategy: holistic, independent, or incremental |
`--num-sample` | int | 10 | Number of samples to generate for each class |
`--incremental-mode` | `"seq"` \| `"rev"` \| `"rand"` | `"seq"` | Order for the incremental synthesis strategy: sequential, reverse, or random |
`--temperature` | float | 0.2 | Sampling temperature for generation |
`--repetition_penalty` | float | 1.0 | Penalty for repetition in generation |
`--max-new-tokens` | int | 4096 | Maximum number of new tokens to generate |
Below is the inference output format; each line of `samples.jsonl` contains the following fields:
- `task_id`: The ID of the completion task, composed of the assignment number and class name.
- `prompt`: The combined prompt given to the LLM, including the code and its context.
- `target`: The file path of the task in the Java project.
- `completion`: The result of the inference.
- `mediate`: For the independent and incremental synthesis strategies, this field records the intermediate steps.
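As a quick sanity check before evaluation, you can load the generated samples with a few lines of Python. This is only a sketch: the path comes from the inference example above, and the `// TODO` substring test is a rough approximation of the `has_todo` check performed later by the evaluator.

```python
import json

# Output file produced by the inference example above; adjust as needed.
path = "output/result-PA19/WizardCoder-Python-34B-V1.0/samples.jsonl"

with open(path) as f:
    samples = [json.loads(line) for line in f if line.strip()]

# Roughly flag completions that still contain an unfinished "// TODO".
unfinished = sum("// TODO" in s["completion"] for s in samples)
print(f"{len(samples)} samples, {unfinished} still contain // TODO")
```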
Evaluation
We provide `evaluation.py` to evaluate the generated code snippets based on the test cases.
There are two evaluation granularities.
Class-wise
Class-wise granularity replaces one class in the canonical solution with the corresponding generated class at a time.
Usage: evaluation.py class-wise [OPTIONS] DATA
Options:
--output TEXT Output file for evaluation [required]
--help Show this message and exit.
For example:
python evaluation.py class-wise \
--output output/result-PA21/gpt-3.5-turbo/single_class.json \
output/result-PA21/gpt-3.5-turbo/samples.jsonl
Below is the class-wise evaluation output format:
- `task_id`: The ID of the completion task, composed of the assignment number and class name.
- `compile_errors`: The number of compilation errors, where 0 indicates successful compilation.
- `test_result`: Two numbers giving the number of passing test cases and the total number of test cases for the project. This result is only available if compilation succeeds.
- `has_todo`: Indicates whether the inference result still contains `// TODO`, as LLMs may exhibit laziness.
- `can_replace`: Indicates whether the inference result contains a complete class.
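The snippet below sketches one way to summarize these results. It assumes `single_class.json` is a JSON list of records with the fields above and that `test_result` is a `[passed, total]` pair; adapt the loading step if the actual layout differs.

```python
import json

# Class-wise results from the evaluation example above; adjust as needed.
with open("output/result-PA21/gpt-3.5-turbo/single_class.json") as f:
    results = json.load(f)  # assumed: a list of per-class records

# Classes that compiled cleanly (compile_errors == 0).
compiled = [r for r in results if r["compile_errors"] == 0]
print(f"compiled: {len(compiled)}/{len(results)} classes")

# test_result is only present when compilation succeeded.
passed = sum(r["test_result"][0] for r in compiled)
total = sum(r["test_result"][1] for r in compiled)
print(f"tests passed: {passed}/{total}")
```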
Test-wise
Test-wise granularity iterates over all the test cases in the test suites and takes the average result. For each test case, we replace the classes related to that test case with the generated ones while keeping the other classes in the canonical solution unchanged.
Usage: evaluation.py test-wise [OPTIONS] DATA
Options:
--output TEXT Output file for evaluation [required]
--test TEXT Test configuration for evaluation [required]
--help Show this message and exit.
For example:
python evaluation.py test-wise \
--output output/result-PA19/gpt-3.5-turbo/result-full.json \
--test dataset/testcase/test-PA19.jsonl \
output/result-PA19/gpt-3.5-turbo/samples.jsonl
By default, `evaluation.py` prints all the Gradle build logs to stderr. You can redirect the logs to a file by appending `2> log.txt` to the command.
Below is the test-wise evaluation output format:
- `test_id`: The ID of the test suite, composed of the assignment number and test suite name.
- `compilable`: Indicates whether the generated code is compilable.
- `n_pass`: Two numbers giving the number of passing test cases and the total number of test cases for the project. This result is only available if compilation succeeds.
- `has_todo`: Indicates whether the inference result still contains `// TODO`, as LLMs may exhibit laziness.
- `can_replace`: Indicates whether the inference result contains a complete class.
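Similarly, the test-wise results can be aggregated with a short script. This is a sketch under the assumption that `result-full.json` is a JSON list of records with the fields above and that `n_pass` is a `[passed, total]` pair; adjust it to the actual file layout.

```python
import json

# Test-wise results from the evaluation example above; adjust as needed.
with open("output/result-PA19/gpt-3.5-turbo/result-full.json") as f:
    results = json.load(f)  # assumed: a list of per-test-suite records

compilable = [r for r in results if r["compilable"]]
print(f"compilable: {len(compilable)}/{len(results)} test suites")

# Average pass rate over the test suites that compiled.
ratios = [p / t for p, t in (r["n_pass"] for r in compilable) if t]
average = sum(ratios) / len(ratios) if ratios else 0.0
print(f"average pass rate: {average:.2%}")
```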
Submission
Now you have three files:
- `samples.jsonl`: Completed code generated by the LLM.
- `single_class.json`: Evaluation results at class-wise granularity.
- `result-full.json`: Evaluation results at test-wise granularity.
If you're having trouble with the evaluation step, you can just upload `samples.jsonl` and we'll evaluate it for you!
The next step is to submit a pull request for the project:
- Fork the repository into your own GitHub account.
- Clone the repository to your local machine.
- Check out a new branch from `main`.
- Make a new directory under the output folder corresponding to the dataset (e.g., `./output/holistic-selective/result-PA19/gpt-3.5-turbo-1106`) and copy all the files above.
- Submit the Pull Request.
- The maintainers will review your Pull Request soon.
Once your pull request is accepted, we will update the Leaderboard with your results.