<h1 align="center"> <img src="assets/icon.png" width="32px" height="auto"> HumanEval-V Benchmark </h1> <p align="center"> <a href="https://arxiv.org/abs/2410.12381">📄 Paper</a> • <a href="https://humaneval-v.github.io">🏠 Home Page</a> • <a href="https://humaneval-v.github.io/#leaderboard">🏆 Leaderboard</a> • <a href="https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark">🤗 Dataset</a> • <a href="https://huggingface.co/spaces/HumanEval-V/HumanEval-V-Benchmark-Viewer">🤗 Dataset Viewer</a> </p>

Welcome to the official repository for the paper "HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks".

🌟News

👀Introduction

<h6 align="center">An example coding task from HumanEval-V. Each task involves completing a Python function<br> based on a single image, the function signature, and problem descriptions provided in the comment block.</h6> <p align="center"> <img src="./assets/introduction_example.png" style="width:75%; margin-left: auto; margin-right: auto;"> </p>

HumanEval-V is a novel and lightweight benchmark designed to evaluate the visual understanding and reasoning capabilities of Large Multimodal Models (LMMs) through coding tasks. The dataset comprises 108 entry-level Python programming challenges, adapted from platforms like CodeForces and Stack Overflow. Each task includes visual context that is indispensable to the problem, requiring models to perceive, reason, and generate Python code solutions accordingly.

Benchmark Components

Each coding task in HumanEval-V consists of three main components:

  1. Image: A single image containing the essential visual context necessary to solve the task.
  2. Function Signature: Includes the problem description, necessary imports, and the function signature that needs to be completed.
  3. Test Cases: Used to execute and validate the correctness of the generated code.

The LMM must generate the complete function body given the image and the function signature. Below is the conversational prompt used to query the LMMs; the {code_context} placeholder is replaced with the function signature.

**Instructions:**
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.
Please complete the function based on the provided image and code context. Return the complete solution, including the function signature, in a single response, formatted within a Python code block.

**Code Context:**
```python
{code_context}
```
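
To make the placeholder concrete, the sketch below shows a purely hypothetical code context of this kind. It is not taken from the benchmark, and the exact layout of real tasks may differ; it simply illustrates a problem description that refers to the image, the necessary imports, and the function signature to complete.

```python
# Hypothetical code context for illustration only (not an actual benchmark task).
from typing import List

def count_shaded_cells(grid: List[List[int]]) -> int:
    """
    The image shows a rectangular grid in which some cells are shaded.
    Given the grid as a 2D list, where 1 marks a shaded cell and 0 an
    empty cell, return the total number of shaded cells.
    """
```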

After the LMM generates a response, the Python code block is extracted from it and the resulting solution is executed against the task's test cases to determine correctness.
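
A minimal sketch of this extract-and-execute step is shown below, assuming the test cases are assert-style checks; the actual harness lives in evaluate.py and would additionally need the sandboxing and timeouts this sketch omits.

```python
import re

def extract_solution(response: str) -> str:
    """Pull the first Python code block out of the model's response."""
    match = re.search(r"`{3}python\n(.*?)`{3}", response, re.DOTALL)
    return match.group(1) if match else response

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run the candidate solution followed by its test cases.

    Assumes `test_code` contains assert-style checks; returns True when
    nothing raises. A real harness would also sandbox and time out runs.
    """
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # execute the task's test cases
        return True
    except Exception:
        return False
```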

⚡Quick Start

1. Environment Setup

```bash
git clone https://github.com/HumanEval-V/HumanEval-V-Benchmark.git
cd HumanEval-V-Benchmark
conda create -n humanevalv python=3.10
conda activate humanevalv
pip install -r requirements.infer.txt  # For using our inference script
pip install -r requirements.txt        # For running only the evaluation with your own inference script
```

2. Run Model Inference

Option 1: Using Your Own Inference Script

If you have your own inference script, organize the model predictions in the following format (an example prediction file can be found in output/example_pred_sample_20.json):

```json
[
  {
    "qid": "XXX",
    "predictions": [
      "XXX",
      "XXX"
    ]
  }
]
```
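
If it helps, here is a small, self-contained sketch for serializing your own model's outputs into that layout; the function name and the dummy values are illustrative, only the JSON structure comes from the example above.

```python
import json
from typing import Dict, List

def write_prediction_file(predictions: Dict[str, List[str]], out_path: str) -> None:
    """Serialize {qid: [prediction, ...]} into the expected JSON layout."""
    records = [{"qid": qid, "predictions": preds} for qid, preds in predictions.items()]
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

# Example with dummy predictions (two samples for one task id):
write_prediction_file(
    {"q1": ["def solve():\n    ...", "def solve():\n    ..."]},
    "output/your_lmm_sample_2.json",
)
```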

Option 2: Implementing Inference from Scratch

If you need to implement your own inference script, refer to the example script at models/example.py. You mainly need to implement the query method in the LMM class, which takes an image and textual prompt and generates predictions. We provide example implementations for OpenAI GPT-4o-mini and vllm-based InternVL2-4B in models/openai_model.py and models/vllm_model.py.
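
As a rough skeleton of such an implementation: the exact base class and method signature are defined by models/example.py, so treat the interface below as an assumption; the API call targets the official openai Python client with a vision-capable model.

```python
# Hedged skeleton only; check models/example.py for the interface the
# inference script actually expects before copying this.
import base64
from openai import OpenAI

class LMM:
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model_name = model_name

    def query(self, image_path: str, prompt: str) -> str:
        """Send one image plus the textual prompt and return the raw completion."""
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content
```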

After implementing your model, run the following command:

```bash
python inference.py --model_name your_model_name --prediction_file output/your_lmm_sample_N.json --sample_num N --temperature T
```

This command will create a JSON file at output/your_lmm_sample_N.json for evaluation. Set model_name to the filename of your implemented LMM in the models directory. Set sample_num to 1 (temperature to 0) for calculating pass@1 or to 20 (temperature to 0.8) for pass@10.

An example command for running the GPT-4o-mini model is as follows:

```bash
# For generating a single sample for pass@1
python inference.py --model_name openai_model --prediction_file output/gpt_4o_mini_sample_1.json --sample_num 1 --temperature 0
# For generating 20 samples for pass@10
python inference.py --model_name openai_model --prediction_file output/gpt_4o_mini_sample_20.json --sample_num 20 --temperature 0.8
```

If you encounter interruptions during inference, you can simply rerun the inference script using the same prediction file path to resume generating predictions without overwriting the existing results.

3. Running Evaluation

After obtaining the model predictions in the specified format, run the evaluation as follows:

```bash
python evaluate.py --prediction_file output/your_lmm_sample_N.json
```

This command will execute the predictions using the test cases and calculate the pass@k score, saving the results to output/your_lmm_sample_N_executed.json. For subsequent evaluations without re-executing code solutions, you can use:

```bash
python evaluate.py --prediction_file output/your_lmm_sample_N.json --score_only
```
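
For reference, pass@k on code benchmarks is conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021); assuming evaluate.py follows the same convention, a minimal sketch of the estimator looks like this.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per task, c: samples that passed all tests,
    k: the k in pass@k. Estimates the probability that at least one of
    k randomly drawn samples passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 20 samples, 5 of them correct -> pass@10
print(pass_at_k(n=20, c=5, k=10))
```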

💘Citation

```bibtex
@article{zhang2024humanevalv,
  title={HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks}, 
  author={Zhang, Fengji and Wu, Linquan and Bai, Huiyu and Lin, Guancheng and Li, Xiao and Yu, Xiao and Wang, Yue and Chen, Bei and Keung, Jacky},
  journal={arXiv preprint arXiv:2410.12381},
  year={2024},
}
```