HumanEval.jl

This project is a Julia version of HumanEval. Our goal is to gain a better understanding of how the latest LLMs perform on the Julia programming language.

| model | evalplus * | basic ** |
|:---|---:|---:|
| gpt-4-0125-preview | 0.774 | 0.823 |
| gpt-4-turbo | 0.756 | 0.823 |
| mistral-large-instruct-2407 | 0.744 | 0.823 |
| gpt-4o | 0.738 | 0.817 |
| claude-3-5-sonnet-20240620 | 0.72 | 0.823 |
| gpt-4-1106-preview | 0.72 | 0.805 |
| DeepSeek-Coder-V2-Instruct | 0.695 | 0.774 |
| DeepSeek-V2-Chat | 0.689 | 0.756 |
| Llama-3.1-405B-Instruct | 0.628 | 0.744 |
| claude-3-opus-20240229 | 0.61 | 0.689 |
| Qwen2-72B-Instruct | 0.598 | 0.665 |
| Phind-CodeLlama-34B-v2 | 0.591 | 0.659 |
| gpt-3.5-turbo-0125 | 0.591 | 0.652 |
| mistral-large-latest | 0.573 | 0.659 |
| gpt-3.5-turbo-0613 | 0.567 | 0.64 |
| gpt-3.5-turbo-1106 | 0.555 | 0.628 |
| DeepSeek-Coder-33B-instruct | 0.543 | 0.598 |
| Magicoder-S-DS-6.7B | 0.543 | 0.616 |
| WizardCoder-33B-V1.1 | 0.543 | 0.604 |
| Qwen1.5-110B-Chat | 0.53 | 0.598 |
| yi-large | 0.524 | 0.652 |
| deepseek-coder-6.7b-instruct | 0.488 | 0.549 |
| CodeLlama-70b-Instruct-hf | 0.457 | 0.561 |
| code-millenials-34b | 0.439 | 0.5 |
| Magicoder-S-CL-7B | 0.402 | 0.463 |
| CodeLlama-34b-Instruct-hf | 0.311 | 0.366 |
| Starling-LM-7B-alpha | 0.299 | 0.354 |
| Yi-34B-Chat | 0.232 | 0.317 |
<sub> <strong>* evalplus:</strong> scores are calculated based on test cases from both <a href="https://github.com/openai/human-eval">HumanEval</a> and <a href="https://github.com/evalplus/evalplus">evalplus</a>.<br> <strong>** basic:</strong> scores are calculated based on test cases from <a href="https://github.com/openai/human-eval">HumanEval</a> only. <br> By default, all results are calculated by <code>pass@1</code> using greedy decoding. Models are deployed with <a href="https://github.com/vllm-project/vllm">vllm</a> which uses a predefined chat template stored in the tokenizer. Feel free to <a href="https://github.com/01-ai/HumanEval.jl/issues">create an issue</a> if you'd like to evaluate some other models. <br> </sub>
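
For reference, pass@1 with greedy decoding means a single completion is generated per problem, and the score is the fraction of problems whose completion passes all tests. The general pass@k metric from the HumanEval paper uses an unbiased estimator; a minimal Julia sketch of both (illustrative only, not code from this repository):

# Unbiased pass@k estimator (Chen et al., 2021): n samples per task,
# c of which pass, evaluated at sampling budget k.
function pass_at_k(n, c, k)
    n - c < k && return 1.0
    return 1.0 - prod((n - c - i) / (n - i) for i in 0:k-1)
end

# With greedy decoding there is one sample per task (n = 1, k = 1),
# so pass@1 reduces to the fraction of tasks solved:
pass_at_1(passed) = count(passed) / length(passed)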

Getting Started

First, deploy the model you'd like to evaluate behind an OpenAI-compatible endpoint, for example with vLLM or Ollama. We'll need the OPENAI_API_KEY and OPENAI_BASE_URL in the next step.

To test models from Anthropic, you should set ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL instead.
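
If you prefer configuring these from within a Julia session rather than the shell, a minimal sketch (placeholder values; this assumes the evaluation code reads the variables from ENV at runtime):

# Replace the placeholders with your own endpoint and key.
ENV["OPENAI_API_KEY"] = "YOUR_SECRET"
ENV["OPENAI_BASE_URL"] = "http://localhost:8000/v1"

# For Anthropic models, set these instead:
# ENV["ANTHROPIC_API_KEY"] = "YOUR_SECRET"
# ENV["ANTHROPIC_BASE_URL"] = "YOUR_ANTHROPIC_ENDPOINT"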

Evaluate with Docker

docker run -it --rm \
  -v /PATH/TO/SAVE/RESULTS/generations:/workspace/HumanEval.jl/generations \
  -e OPENAI_API_KEY=YOUR_SECRET \
  -e OPENAI_BASE_URL=http://localhost:8000/v1 \
  -e RETESTITEMS_NWORKERS=16 \
  -e RETESTITEMS_TESTITEM_TIMEOUT=15 \
  -e MODEL=gpt-3.5-turbo-0613 \
  ghcr.io/01-ai/humaneval.jl:latest

Evaluate with a local development environment

  1. Make sure you have the latest Julia installed.
  2. Clone and enter the root of this project.
  3. Start the Julia REPL with the following command:
OPENAI_API_KEY=debug OPENAI_BASE_URL=http://localhost:8000/v1 RETESTITEMS_NWORKERS=16 RETESTITEMS_TESTITEM_TIMEOUT=15 MODEL=gpt-3.5-turbo-0613 julia --project

The meanings of these environment variables are the same as above.

  4. Execute the following commands in the Julia REPL.
julia> import Pkg; Pkg.instantiate();

julia> include("src/evaluation.jl")

julia> evaluate("YOUR_MODEL_NAME")

Once finished, the results will be displayed. You may find more details under the generations directory.
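
For example, to list every file produced for a run (the exact layout under generations depends on the model you evaluated):

julia> for (root, _, files) in walkdir("generations")
           foreach(f -> println(joinpath(root, f)), files)
       end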

Related Work

Future Work

We're hiring! If you're interested in working on code LLMs at 01.ai, please contact yi@01.ai.

FAQ

Acknowledgement