
<center> <img src="https://github.com/bigcode-bench/bigcode-bench.github.io/blob/main/asset/bigcodebench_banner.svg?raw=true" alt="BigCodeBench"> </center> <p align="center"> <a href="https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard"><img src="https://img.shields.io/badge/🤗&nbsp&nbsp%F0%9F%8F%86-leaderboard-%23ff8811"></a> <a href="https://huggingface.co/collections/bigcode/bigcodebench-666ed21a5039c618e608ab06"><img src="https://img.shields.io/badge/🤗-collection-pink"></a> <a href="https://bigcode-bench.github.io/"><img src="https://img.shields.io/badge/%F0%9F%8F%86-website-8A2BE2"></a> <a href="https://arxiv.org/abs/2406.15877"><img src="https://img.shields.io/badge/arXiv-2406.15877-b31b1b.svg"></a> <a href="https://pypi.org/project/bigcodebench/"><img src="https://img.shields.io/pypi/v/bigcodebench?color=g"></a> <a href="https://pepy.tech/project/bigcodebench"><img src="https://static.pepy.tech/badge/bigcodebench"></a> <a href="https://github.com/bigcodebench/bigcodebench/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/bigcodebench"></a> <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-evaluate" title="Docker-Eval"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-evaluate"></a> <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-generate" title="Docker-Gen"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-generate"></a> </p> <p align="center"> <a href="#-impact">💥 Impact</a> • <a href="#-news">📰 News</a> • <a href="#-quick-start">🔥 Quick Start</a> • <a href="#-remote-evaluation">🚀 Remote Evaluation</a> • <a href="#-llm-generated-code">💻 LLM-generated Code</a> • <a href="#-advanced-usage">🧑 Advanced Usage</a> • <a href="#-result-submission">📰 Result Submission</a> • <a href="#-citation">📜 Citation</a> </p>

💥 Impact

BigCodeBench has been used by many LLM teams including:

📰 News

<details><summary>More News <i>:: click to expand ::</i></summary> <div> </div> </details>

🌸 About

BigCodeBench

BigCodeBench is an easy-to-use benchmark for solving practical and challenging tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.

There are two splits in BigCodeBench, `complete` and `instruct`, selected with the `--split` option.

Why BigCodeBench?

BigCodeBench focuses on task automation via code generation with diverse function calls and complex instructions, with:

🔥 Quick Start

To get started, please first set up the environment:

# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench --upgrade

# We recommend `flash-attn` for generating code samples.
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you run into installation problems, consider using the pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
<details><summary>⏬ Install nightly version <i>:: click to expand ::</i></summary> <div>
# Install to use bigcodebench.generate
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
</div> </details>

🚀 Remote Evaluation

We use greedy decoding as an example to show how to evaluate generated code samples via the remote API.

[!Warning]

To speed up generation, batch inference is used by default. However, batch inference results can vary across batch sizes and across versions, at least for the vLLM backend. For more deterministic greedy-decoding results, set `--bs` to 1.

[!Note]

Remote execution typically takes 6-7 minutes on BigCodeBench-Full and 4-5 minutes on BigCodeBench-Hard.

bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split [complete|instruct] \
  --subset [full|hard] \
  --backend [vllm|openai|anthropic|google|mistral|hf]
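For scripted runs, the placeholders in the command above can be filled in programmatically. The following is a minimal sketch and not part of BigCodeBench: `build_evaluate_command` is a hypothetical helper that only assembles an argument list from the flags shown in this README (`--model`, `--split`, `--subset`, `--backend`, and optionally `--bs`) and rejects values outside the listed choices.

```python
import shlex

# Choices copied from the command template in this README.
SPLITS = ("complete", "instruct")
SUBSETS = ("full", "hard")
BACKENDS = ("vllm", "openai", "anthropic", "google", "mistral", "hf")


def build_evaluate_command(model, split, subset, backend, bs=None):
    """Hypothetical helper: assemble a bigcodebench.evaluate argument list."""
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    if subset not in SUBSETS:
        raise ValueError(f"subset must be one of {SUBSETS}, got {subset!r}")
    if backend not in BACKENDS:
        raise ValueError(f"backend must be one of {BACKENDS}, got {backend!r}")
    cmd = [
        "bigcodebench.evaluate",
        "--model", model,
        "--split", split,
        "--subset", subset,
        "--backend", backend,
    ]
    if bs is not None:
        # --bs 1 trades speed for more deterministic greedy decoding.
        cmd += ["--bs", str(bs)]
    return cmd


cmd = build_evaluate_command(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", "instruct", "hard", "vllm", bs=1
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once `bigcodebench` is installed.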

[!Note]

BigCodeBench uses different prompts for base and chat models. With the `hf` or `vllm` backend, the mode is detected by default from `tokenizer.chat_template`. For other backends, only chat mode is allowed.

Therefore, if your base model comes with a `tokenizer.chat_template`, please add `--direct_completion` so that it is not evaluated in chat mode.
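The rule described above can be sketched as a small decision function. This is an illustration of the stated behavior only, not BigCodeBench's actual implementation: `uses_chat_prompt` is a hypothetical name, and the `SimpleNamespace` objects stand in for real tokenizers, of which only the `chat_template` attribute is read.

```python
from types import SimpleNamespace

# Backends for which the README says the mode is detected from the tokenizer.
LOCAL_BACKENDS = ("hf", "vllm")


def uses_chat_prompt(backend, tokenizer=None, direct_completion=False):
    """Return True if the chat prompt would be used under the rule above."""
    if backend not in LOCAL_BACKENDS:
        # API backends: only chat mode is allowed.
        return True
    if direct_completion:
        # --direct_completion forces base-model (completion) prompting.
        return False
    # Otherwise, fall back to detection via tokenizer.chat_template.
    return getattr(tokenizer, "chat_template", None) is not None


# Stand-ins for tokenizers (hypothetical values, for illustration only).
base_tok = SimpleNamespace(chat_template=None)
chat_tok = SimpleNamespace(chat_template="{{ messages }}")

print(uses_chat_prompt("vllm", base_tok))
print(uses_chat_prompt("vllm", chat_tok))
print(uses_chat_prompt("vllm", chat_tok, direct_completion=True))
print(uses_chat_prompt("openai"))
```

The third call shows the case the note warns about: a base model whose tokenizer ships a chat template is treated as a chat model unless `--direct_completion` is passed.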

Access OpenAI APIs from OpenAI Console

export OPENAI_API_KEY=<your_openai_api_key>

Access Anthropic APIs from Anthropic Console

export ANTHROPIC_API_KEY=<your_anthropic_api_key>

Access Mistral APIs from Mistral Console

export MISTRAL_API_KEY=<your_mistral_api_key>

Access Gemini APIs from Google AI Studio

export GOOGLE_API_KEY=<your_google_api_key>
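Before launching a long run, it can help to check that the key for the chosen backend is actually set. The sketch below assumes that each API backend name maps to the environment variable listed above (for example, `google` to `GOOGLE_API_KEY`); this mapping is inferred from the names in this README, and `missing_api_key` is a hypothetical helper, not part of BigCodeBench.

```python
import os

# Assumed mapping from backend name to the environment variable listed above.
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_api_key(backend, env=None):
    """Return the name of the unset variable for `backend`, or None if fine."""
    env = os.environ if env is None else env
    var = API_KEY_VARS.get(backend)
    if var is None:
        # No key is listed in this README for this backend (e.g. hf, vllm).
        return None
    return None if env.get(var) else var


# Explicit dictionaries are used here so the example does not depend on
# the real environment; omit `env` to check os.environ instead.
print(missing_api_key("openai", env={}))
print(missing_api_key("openai", env={"OPENAI_API_KEY": "placeholder"}))
print(missing_api_key("vllm", env={}))
```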

💻 LLM-generated Code

We share pre-generated code samples from LLMs we have evaluated:

🧑 Advanced Usage

Please refer to the ADVANCED USAGE for more details.

📰 Result Submission

To contribute your model to the leaderboard, please email both the generated code samples and the execution results to terry.zhuo@monash.edu. The files should be named `[model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl` (generated samples) and `[model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json` (execution results). If we do not respond to your email within 3 days, you can file an issue to remind us.
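The naming pattern is easy to get wrong by hand, so a small formatter may help. The sketch below is a hypothetical helper, not part of BigCodeBench. It assumes that the `full` and `hard` subsets correspond to the `bigcodebench` and `bigcodebench-hard` tags, and that `model_name` is passed in an already file-safe form; how characters such as `/` should be handled is not specified here. All values in the example call are placeholders.

```python
# Assumed correspondence between --subset values and the tags in the pattern.
SUBSET_TAGS = {"full": "bigcodebench", "hard": "bigcodebench-hard"}
SPLITS = ("instruct", "complete")


def submission_filenames(model_name, revision, subset, split, backend,
                         temp, n_samples):
    """Return (samples_file, results_file) following the pattern above."""
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    stem = (
        f"{model_name}--{revision}--{SUBSET_TAGS[subset]}-{split}"
        f"--{backend}-{temp}-{n_samples}-sanitized_calibrated"
    )
    return f"{stem}.jsonl", f"{stem}_eval_results.json"


# Placeholder values for illustration only.
samples, results = submission_filenames(
    "my-model", "main", "hard", "instruct", "vllm", "0", 1
)
print(samples)
print(results)
```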

📜 Citation

@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}

🙏 Acknowledgement