SciCode

Homepage | Paper

This repo contains the evaluation code for the paper "SciCode: A Research Coding Benchmark Curated by Scientists".

🔔News

[2024-11-04]: The leaderboard is live! Check here. We have also added Claude3.5-Sonnet (new) results.

[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.

[2024-09-26]: SciCode was accepted to the NeurIPS 2024 D&B Track.

[2024-08-22]: The SciCode benchmark has been successfully integrated into OpenCompass.

[2024-07-24]: We added the scientist-annotated background and support for the with-background evaluation setup.

Introduction

SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code to solve realistic scientific research problems. It covers 16 subdomains from 6 domains, including Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It also offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only 7.7% of the problems in the most realistic setting.

Broadly, SciCode reflects scientists' realistic everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress toward becoming helpful assistants for scientists but also sheds light on the future building and evaluation of scientific AI.

Dataset Creation

SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on three kinds of tasks: (1) numerical methods, (2) simulation of systems, and (3) scientific calculation. We believe these tasks require intense scientific knowledge and reasoning, which makes them well suited to testing LMs' science capability.

🏆 Leaderboard

| Models | Main Problem Resolve Rate (%) | Subproblem (%) |
|---|:---:|:---:|
| 🥇 OpenAI o1-preview | 7.7 | 28.5 |
| 🥈 Claude3.5-Sonnet | 4.6 | 26.0 |
| 🥉 Claude3.5-Sonnet (new) | 4.6 | 25.3 |
| Deepseek-Coder-v2 | 3.1 | 21.2 |
| GPT-4o | 1.5 | 25.0 |
| GPT-4-Turbo | 1.5 | 22.9 |
| OpenAI o1-mini | 1.5 | 22.2 |
| Gemini 1.5 Pro | 1.5 | 21.9 |
| Claude3-Opus | 1.5 | 21.5 |
| Llama-3.1-405B-Chat | 1.5 | 19.8 |
| Claude3-Sonnet | 1.5 | 17.0 |
| Qwen2-72B-Instruct | 1.5 | 17.0 |
| Llama-3.1-70B-Chat | 0.0 | 17.0 |
| Mixtral-8x22B-Instruct | 0.0 | 16.3 |
| Llama-3-70B-Chat | 0.0 | 14.6 |
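
The gap between the two columns is easier to read with a scoring rule in mind. The sketch below assumes (this README does not state it explicitly) that a main problem counts as resolved only when every one of its subproblems passes, while the subproblem rate is the share of all subproblems that pass. The function name `resolve_rates` and the example outcomes are hypothetical and are not part of the scicode package.

```python
from typing import Dict, List, Tuple


def resolve_rates(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (main_problem_rate, subproblem_rate) as percentages.

    `results` maps a main-problem ID to the pass/fail outcome of each of
    its subproblems. Assumption: a main problem is resolved only if all
    of its subproblems pass.
    """
    if not results:
        raise ValueError("results must contain at least one main problem")

    total_sub = sum(len(subs) for subs in results.values())
    if total_sub == 0:
        raise ValueError("results must contain at least one subproblem")

    passed_sub = sum(sum(subs) for subs in results.values())
    solved_main = sum(1 for subs in results.values() if subs and all(subs))

    main_rate = 100 * solved_main / len(results)
    sub_rate = 100 * passed_sub / total_sub
    return main_rate, sub_rate


# Hypothetical outcomes for three main problems (not real SciCode data).
example = {
    "problem_a": [True, True, True],
    "problem_b": [True, False, True, True],
    "problem_c": [False, True, True],
}

main_rate, sub_rate = resolve_rates(example)
print(f"main: {main_rate:.1f}%, subproblem: {sub_rate:.1f}%")
# prints "main: 33.3%, subproblem: 80.0%"
```

Under this assumption, a single failing subproblem is enough to leave a main problem unresolved, which is one way a model could pass most subproblems and still score low in the main problem column.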

Instructions to evaluate a new model

1. Clone this repository: `git clone git@github.com:scicode-bench/SciCode.git`
2. Install the scicode package with `pip install -e .`
3. Download the numeric test results and save them as `./eval/data/test_data.h5`
4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the eval/scripts readme for more information)
5. Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests
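
Before running the last two steps, it can help to confirm that the files named above are in place. The following is a minimal sketch of such a pre-flight check, using only the paths mentioned in the steps. The helper `missing_prerequisites` and the constant `REQUIRED_PATHS` are hypothetical names introduced for illustration and are not part of the scicode package.

```python
from pathlib import Path
from typing import List

# Paths named in the steps above, relative to the repository root.
REQUIRED_PATHS = (
    "eval/data/test_data.h5",
    "eval/scripts/gencode_json.py",
    "eval/scripts/test_generated_code.py",
)


def missing_prerequisites(repo_root: str = ".") -> List[str]:
    """Return the required paths that do not exist as files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing before evaluation can run:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All files named in the steps above are present.")
```

Run it from the repository root. It only checks that the files exist; it does not validate the contents of `test_data.h5` or invoke either script.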

More information and FAQ

More information, including an FAQ section, is provided on our website. If you have trouble reaching the website, you can find the markdown source in its GitHub repository.

Contact

Minyang Tian: mtian8@illinois.edu
Eliu Huerta: elihu@anl.gov
Hao Peng: haopeng@illinois.edu

Citation

@misc{tian2024scicode,
    title={SciCode: A Research Coding Benchmark Curated by Scientists},
    author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
    year={2024},
    eprint={2407.13168},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}