JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

JavaBench is a project-level Java benchmark that contains four projects at graduate-level difficulty. Its difficulty and quality are validated and guaranteed by undergraduate students across four academic years. Please check our Leaderboard for a visualization of the evaluation results.

Updates

Benchmark Dataset

The four Java projects in JavaBench were designed for undergraduate students throughout the four academic years from 2019 to 2022. We use the students' overall scores as evidence of the projects' difficulty levels.

Dataset

The benchmark dataset is accessible at ./datasets. We provide three types of datasets with different context settings.
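
The dataset files are stored as JSONL, one JSON record per line (see the .jsonl paths in the example commands below). As a minimal sketch, not part of the official tooling, the following Python snippet loads one split and lists the fields of its first record; the path is taken from the inference example later in this document, and no assumptions are made about the field names:

import json

# Path taken from the inference example below; pick whichever context
# setting and project file you want to inspect.
data_path = "datasets/selective-context/data-PA19.jsonl"

with open(data_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records in {data_path}")
# List the keys of the first record to see what each entry contains;
# the full structure is described below.
print("fields:", sorted(records[0].keys()))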

Below is the structure of the dataset:

Here is a skeleton of the code:

Generation Strategies

JavaBench supports three synthesis strategies: holistic, independent, and incremental (selected via the --mode option described below).

Usage

Setup

Create a Python virtual environment and install the requirements.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Inference

We provide inference.py to generate code snippets with the model under evaluation. The following example generates code snippets using the WizardCoder-Python-34B-V1.0 model.

We recommend setting the number of samples to 5 or more for precise evaluation results.

python inference.py \
    --model-path WizardLM/WizardCoder-Python-34B-V1.0 \
    --num-sample 5 \
    --output output/result-PA19/WizardCoder-Python-34B-V1.0/samples.jsonl \
    --data datasets/selective-context/data-PA19.jsonl \
    --mode holistic

The inference functionality is built on FastChat's command-line tool. FastChat supports loading models from Hugging Face or from a local path, as well as configuring underlying options such as the device and GPUs. For more details, please refer to https://github.com/lm-sys/FastChat.

In addition to the parameters provided by FastChat, we add the following parameters:

Argument              Type                                          Default     Description
--data                str                                           (required)  Path to the input dataset file
--output              str                                           (required)  Path to save the output
--mode                "holistic" | "independent" | "incremental"    "holistic"  Synthesis strategy: holistic, independent, or incremental
--num-sample          int                                           10          Number of samples to generate for each class
--incremental-mode    "seq" | "rev" | "rand"                        "seq"       Mode for the incremental synthesis strategy: sequential, reverse, or random
--temperature         float                                         0.2         Sampling temperature for generation
--repetition_penalty  float                                         1.0         Penalty for repetition in generation
--max-new-tokens      int                                           4096        Maximum number of new tokens to generate
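
After inference finishes, it can be handy to sanity-check the generated samples.jsonl before running the evaluation. The snippet below is only a rough sketch, not part of the official tooling: it assumes nothing beyond standard JSONL, reuses the output path from the example command above, and simply reports how many samples were produced and which fields each record carries (the authoritative description is in the output-format instructions below):

import json

# Output path from the inference example above; adjust to your own run.
output_path = "output/result-PA19/WizardCoder-Python-34B-V1.0/samples.jsonl"

with open(output_path, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(f"{len(samples)} generated samples in {output_path}")
print("fields of the first sample:", sorted(samples[0].keys()))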

Below are the instructions for the inference output format:

Evaluation

We provide evaluation.py to evaluate the generated code snippets based on the test cases.

There are two evaluation granularities.

Class-wise

Class-wise granularity replaces one class in the canonical solution with its generated counterpart at a time.

Usage: evaluation.py class-wise [OPTIONS] DATA

Options:
  --output TEXT  Output file for evaluation  [required]
  --help         Show this message and exit.

For example:

python evaluation.py class-wise \
    --output output/result-PA21/gpt-3.5-turbo/single_class.json \
    output/result-PA21/gpt-3.5-turbo/samples.jsonl

Below are the instructions for the class-wise evaluation output format:

Test-wise

Test-wise granularity iterates over all the test cases in the test suites and takes the average result. For each test case, we replace the classes related to that test case with their generated counterparts while keeping the other classes in the canonical solution unchanged.

Usage: evaluation.py test-wise [OPTIONS] DATA

Options:
  --output TEXT  Output file for evaluation  [required]
  --test TEXT    Test configuration for evaluation  [required]
  --help         Show this message and exit.

For example:

python evaluation.py test-wise \
    --output output/result-PA19/gpt-3.5-turbo/result-full.json \
    --test datasets/testcase/test-PA19.jsonl \
    output/result-PA19/gpt-3.5-turbo/samples.jsonl

By default, evaluation.py prints all the Gradle build logs to stderr. You can redirect them to a file by appending 2> log.txt to the end of the command.
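
Both evaluation granularities write their results as JSON files. The snippet below is only a convenience sketch, not part of the official tooling, for peeking at a result file (either the class-wise or the test-wise output); it assumes nothing about the schema beyond the file being valid JSON, and the authoritative description is in the output-format instructions below:

import json
import sys

# Pass the result file produced by evaluation.py, e.g.
# output/result-PA19/gpt-3.5-turbo/result-full.json
result_path = sys.argv[1]

with open(result_path, encoding="utf-8") as f:
    result = json.load(f)

# Show only the top-level structure; see the output-format instructions
# for the meaning of each field.
if isinstance(result, dict):
    print("top-level keys:", sorted(result.keys()))
else:
    print(f"top-level JSON is a {type(result).__name__} with {len(result)} entries")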

Below are the instructions for the test-wise evaluation output format:

Submission

Now you have three files:

  1. samples.jsonl (the inference output)
  2. single_class.json (the class-wise evaluation result)
  3. result-full.json (the test-wise evaluation result)

If you're having trouble with the evaluation step, you can just upload samples.jsonl and we'll evaluate it for you!

The next step is to submit a pull request for the project:

  1. Fork the repository into your own GitHub account.
  2. Clone the repository to your local machine.
  3. Checkout a new branch from main.
  4. Make a new directory under the output folder corresponding to the dataset (e.g., ./output/holistic-selective/result-PA19/gpt-3.5-turbo-1106) and copy all the files above into it.
  5. Submit the Pull Request.
  6. The maintainers will review your Pull Request soon.

Once your pull request is accepted, we will update the Leaderboard with your results.

Contributors

Citation