
<div align="center"> <h1>👨‍💻 Awesome Code LLM</h1> <a href="https://awesome.re"> <img src="https://awesome.re/badge.svg" alt="Awesome"> </a> <a href="https://img.shields.io/badge/PRs-Welcome-red"> <img src="https://img.shields.io/badge/PRs-Welcome-red" alt="PRs Welcome"> </a> <a href="https://img.shields.io/github/last-commit/huybery/Awesome-Code-LLM?color=green"> <img src="https://img.shields.io/github/last-commit/huybery/Awesome-Code-LLM?color=green" alt="Last Commit"> </a> </div>

&nbsp;

🔆 How to Contribute

Contributions are welcome! If you have any resources, tools, papers, or insights related to Code LLMs, feel free to submit a pull request. Let's work together to make this project better!

&nbsp;

News

&nbsp;

🧵 Table of Contents

&nbsp;

🚀 Top Code LLMs

Sorted by HumanEval Pass@1 (a sketch of the pass@k metric follows the table).
| Rank | Model | Params | HumanEval | MBPP | Source |
|------|-------|--------|-----------|------|--------|
| 1 | o1-mini-2024-09-12 | - | 97.6 | 93.9 | paper |
| 2 | o1-preview-2024-09-12 | - | 95.1 | 93.4 | paper |
| 3 | Qwen2.5-Coder-32B-Instruct | 32B | 92.7 | 90.2 | github |
| 4 | Claude-3.5-Sonnet-20241022 | - | 92.1 | 91.0 | paper |
| 5 | GPT-4o-2024-08-06 | - | 92.1 | 86.8 | paper |
| 6 | Qwen2.5-Coder-14B-Instruct | 14B | 89.6 | 86.2 | github |
| 7 | Claude-3.5-Sonnet-20240620 | - | 89.0 | 87.6 | paper |
| 8 | GPT-4o-mini-2024-07-18 | - | 87.8 | 86.0 | paper |
| 9 | Qwen2.5-Coder-7B-Instruct | 7B | 88.4 | 83.5 | github |
| 10 | DS-Coder-V2-Instruct | 21/236B | 85.4 | 89.4 | github |
| 11 | Qwen2.5-Coder-3B-Instruct | 3B | 84.1 | 73.6 | github |
| 12 | DS-Coder-V2-Lite-Instruct | 2.4/16B | 81.1 | 82.8 | github |
| 13 | CodeQwen1.5-7B-Chat | 7B | 83.5 | 70.6 | github |
| 14 | DeepSeek-Coder-33B-Instruct | 33B | 79.3 | 70.0 | github |
| 15 | DeepSeek-Coder-6.7B-Instruct | 6.7B | 78.6 | 65.4 | github |
| 16 | GPT-3.5-Turbo | - | 76.2 | 70.8 | github |
| 17 | CodeLlama-70B-Instruct | 70B | 72.0 | 77.8 | paper |
| 18 | Qwen2.5-Coder-1.5B-Instruct | 1.5B | 70.7 | 69.2 | github |
| 19 | StarCoder2-15B-Instruct-v0.1 | 15B | 67.7 | 78.0 | paper |
| 20 | Qwen2.5-Coder-0.5B-Instruct | 0.5B | 61.6 | 52.4 | github |
| 21 | Pangu-Coder2 | 15B | 61.6 | - | paper |
| 22 | WizardCoder-15B | 15B | 57.3 | 51.8 | paper |
| 23 | CodeQwen1.5-7B | 7B | 51.8 | 61.8 | github |
| 24 | CodeLlama-34B-Instruct | 34B | 48.2 | 61.1 | paper |
| 25 | Code-Davinci-002 | - | 47.0 | - | paper |
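
The scores above are Pass@1: the probability that a single sampled completion passes all unit tests for a problem. For reference, here is a minimal sketch of the unbiased pass@k estimator introduced in "Evaluating Large Language Models Trained on Code" (listed in the pre-training papers below); it assumes you have sampled `n` completions per problem and counted the `c` that pass. Averaging this quantity over all problems gives the benchmark score.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper (Chen et al., 2021).

    n: completions sampled for the problem
    c: completions that passed all unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 37 passing -> pass@1 reduces to c/n.
print(pass_at_k(200, 37, 1))   # 0.185
print(pass_at_k(200, 37, 10))  # chance at least one of 10 draws passes
```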

&nbsp;

💡 Evaluation Toolkit:

&nbsp;

🚀 Awesome Code LLMs Leaderboard

| Leaderboard | Description |
|---|---|
| Evalperf Leaderboard | Evaluates LLMs for efficient code generation. |
| Aider Code Editing Leaderboard | Measures an LLM's coding ability and whether it can write new code that integrates into existing code. |
| BigCodeBench Leaderboard | Evaluates LLMs on practical and challenging programming tasks. |
| LiveCodeBench Leaderboard | Holistic and contamination-free evaluation of large language models for code. |
| Big Code Models Leaderboard | Compares base multilingual code generation models on the HumanEval benchmark and MultiPL-E. |
| BIRD Leaderboard | BIRD contains 12,751 unique question-SQL pairs over 95 big databases with a total size of 33.4 GB, covering more than 37 professional domains such as blockchain, hockey, healthcare, and education. |
| CanAiCode Leaderboard | Leaderboard for the CanAiCode benchmark. |
| Coding LLMs Leaderboard | Leaderboard ranking coding LLMs. |
| CRUXEval Leaderboard | CRUXEval is a benchmark complementary to HumanEval and MBPP that measures code reasoning, understanding, and execution capabilities. |
| EvalPlus Leaderboard | EvalPlus evaluates AI coders with rigorous tests. |
| InfiBench Leaderboard | InfiBench is a comprehensive benchmark for code LLMs that evaluates their ability to answer free-form, real-world questions in the code domain. |
| InterCode Leaderboard | InterCode is a benchmark for evaluating language models on interactive coding tasks: given a natural language request, an agent interacts with a software system (e.g., a database or terminal) through code to resolve the issue. |
| Program Synthesis Models Leaderboard | Ranks open-source code models by capability and market adoption, with a leadership-quadrant graph to help researchers identify the best model. |
| Spider Leaderboard | Spider is a large-scale, complex, cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students; the goal of the Spider challenge is to develop natural language interfaces to cross-domain databases (an illustrative question-SQL pair follows this table). |
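
To make the text-to-SQL rows (BIRD, Spider) concrete, here is a hypothetical question-SQL pair in the shape both datasets use; the schema and content are invented for illustration, not taken from either benchmark.

```python
# Hypothetical text-to-SQL pair (schema and content invented for illustration).
example = {
    "db_id": "music_venues",  # which database schema the query targets
    "question": "How many singers are from France?",
    "query": "SELECT COUNT(*) FROM singer WHERE country = 'France';",
}
```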

&nbsp;

📚 Awesome Code LLMs Papers

🌊 Awesome Code Pre-Training Papers

| Title | Venue | Date | Code | Resources |
|---|---|---|---|---|
| OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models | Preprint | 2024.11 | Github | HF |
| Qwen2.5-Coder Technical Report | Preprint | 2024.09 | Github | HF |
| DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence | Preprint | 2024.06 | Github | HF |
| StarCoder 2 and The Stack v2: The Next Generation | Preprint | 2024.02 | Github | HF |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | Preprint | 2024.01 | Github | HF |
| Code Llama: Open Foundation Models for Code | Preprint | 2023.08 | Github | HF |
| Textbooks Are All You Need | Preprint | 2023.06 | - | HF |
| CodeT5+: Open Code Large Language Models for Code Understanding and Generation | Preprint | 2023.05 | Github | HF |
| StarCoder: may the source be with you! | Preprint | 2023.05 | Github | HF |
| CodeGen2: Lessons for Training LLMs on Programming and Natural Languages | ICLR'23 | 2023.05 | Github | HF |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X | Preprint | 2023.03 | Github | HF |
| SantaCoder: don't reach for the stars! | Preprint | 2023.01 | - | HF |
| CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis | ICLR'23 | 2022.03 | Github | HF |
| Evaluating Large Language Models Trained on Code | Preprint | 2021.07 | Github | - |
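
Beyond next-token prediction, several of the models above (SantaCoder, StarCoder, and DeepSeek-Coder among them) also pre-train with fill-in-the-middle (FIM): a document is split into prefix, middle, and suffix, then reordered around sentinel tokens so the model learns to infill. A minimal sketch of the PSM (prefix-suffix-middle) transform follows; the sentinel token names are StarCoder's, and other tokenizers define their own.

```python
import random

def fim_transform(document: str, fim_rate: float = 0.5) -> str:
    """Reorder one training document for fill-in-the-middle (PSM variant).

    With probability fim_rate, split the text at two random points and emit
    prefix/suffix/middle around sentinel tokens, so the model learns to
    generate the middle conditioned on both sides. Sentinel names follow
    StarCoder's tokenizer; other models define their own.
    """
    if random.random() >= fim_rate or len(document) < 2:
        return document  # keep as ordinary left-to-right text
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```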

&nbsp;

๐Ÿณ Awesome Code Instruction-Tuning Papers

| Title | Venue | Date | Code | Resources |
|---|---|---|---|---|
| Magicoder: Source Code Is All You Need | ICML'24 | 2023.12 | Github | HF |
| OctoPack: Instruction Tuning Code Large Language Models | ICLR'24 | 2023.08 | Github | HF |
| WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Preprint | 2023.07 | Github | HF |
| Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions | Preprint | 2023.xx | Github | HF |
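
Most of the datasets behind these papers boil down to the same record shape: a natural-language instruction, an optional input, and a target completion (Code Alpaca follows the original Alpaca schema). A hypothetical record, with field names from that schema and content invented for illustration:

```python
# Hypothetical Alpaca-style instruction-tuning record (fields follow the
# Alpaca schema; content invented for illustration).
record = {
    "instruction": "Write a Python function that checks whether a string is a palindrome.",
    "input": "",  # optional extra context; empty for self-contained tasks
    "output": (
        "def is_palindrome(s: str) -> bool:\n"
        "    s = s.lower()\n"
        "    return s == s[::-1]"
    ),
}
```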

&nbsp;

๐Ÿฌ Awesome Code Alignment Papers

| Title | Venue | Date | Code | Resources |
|---|---|---|---|---|
| ProSec: Fortifying Code LLMs with Proactive Security Alignment | Preprint | 2024.11 | - | - |
| PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models | Preprint | 2024.06 | - | - |
| PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback | Preprint | 2023.07 | - | - |
| RLTF: Reinforcement Learning from Unit Test Feedback | Preprint | 2023.07 | Github | - |
| Execution-based Code Generation using Deep Reinforcement Learning | TMLR'23 | 2023.01 | Github | - |
| CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning | NeurIPS'22 | 2022.07 | Github | - |
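
The common thread in these papers is converting program execution into a learning signal. Below is a minimal sketch of a unit-test reward function in the spirit of CodeRL and RLTF; the tier values are illustrative choices, not the exact constants from either paper.

```python
import subprocess
import sys

def execution_reward(candidate: str, test_code: str, timeout: float = 5.0) -> float:
    """Map unit-test execution onto a scalar reward, CodeRL-style.

    Tier values are illustrative; the papers differ in the exact scale.
    """
    program = candidate + "\n" + test_code
    try:
        compile(program, "<candidate>", "exec")  # syntax check first
    except SyntaxError:
        return -1.0                              # does not even compile
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return -0.6                              # ran but never finished
    if result.returncode != 0:
        return -0.3                              # runtime error or failed assert
    return 1.0                                   # all tests passed
```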

&nbsp;

๐Ÿ‹ Awesome Code Prompting Papers

| Title | Venue | Date | Code | Resources |
|---|---|---|---|---|
| From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Preprint | 2024.10 | Github | - |
| Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs | AAAI'25 | 2024.06 | Github | - |
| Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | ACL'24 | 2024.02 | Github | - |
| SelfEvolve: A Code Evolution Framework via Large Language Models | Preprint | 2023.06 | - | - |
| Demystifying GPT Self-Repair for Code Generation | ICLR'24 | 2023.06 | Github | - |
| Teaching Large Language Models to Self-Debug | ICLR'24 | 2023.06 | - | - |
| LEVER: Learning to Verify Language-to-Code Generation with Execution | ICML'23 | 2023.02 | Github | - |
| Coder Reviewer Reranking for Code Generation | ICML'23 | 2022.11 | Github | - |
| CodeT: Code Generation with Generated Tests | ICLR'23 | 2022.07 | Github | - |
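
A recurring pattern across these prompting papers is execution-guided selection: sample several candidate programs, run them against tests (model-generated in CodeT, provided in LEVER), and keep the candidate that executes best. The sketch below reduces the idea to "count passed tests"; CodeT's full method additionally clusters candidates that agree on test outcomes.

```python
from typing import Callable, List

def rerank_by_execution(
    candidates: List[str],
    tests: List[Callable[[dict], bool]],
) -> str:
    """Pick the candidate whose namespace passes the most tests.

    Each candidate is exec'd in a fresh namespace; each test receives that
    namespace and returns True/False. Simplified from CodeT, which also
    clusters candidates by agreement on test outcomes.
    """
    def score(code: str) -> int:
        ns: dict = {}
        try:
            exec(code, ns)  # never run untrusted code outside a sandbox
        except Exception:
            return -1
        passed = 0
        for test in tests:
            try:
                passed += bool(test(ns))
            except Exception:
                pass
        return passed
    return max(candidates, key=score)

# Usage: tests probe the namespace produced by each candidate.
candidates = [
    "def add(a, b): return a - b",  # buggy
    "def add(a, b): return a + b",  # correct
]
tests = [lambda ns: ns["add"](2, 3) == 5, lambda ns: ns["add"](0, 0) == 0]
print(rerank_by_execution(candidates, tests))  # prints the correct candidate
```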

&nbsp;

๐Ÿ™ Awesome Code Benchmark & Evaluation Papers

| Dataset | Title | Venue | Date | Code | Resources |
|---|---|---|---|---|---|
| CodeArena | Evaluating and Aligning CodeLLMs on Human Preference | Preprint | 2024.12 | Github | HF |
| FullStack Bench | FullStack Bench: Evaluating LLMs as Full Stack Coders | Preprint | 2024.12 | Github | HF, Github |
| GitChameleon | GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models | Preprint | 2024.11 | Github | - |
| Evalperf | Evaluating Language Models for Efficient Code Generation | COLM'24 | 2024.08 | Github | HF |
| LiveCodeBench | LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code | Preprint | 2024.03 | Github | HF |
| DevBench | DevBench: A Comprehensive Benchmark for Software Development | Preprint | 2024.03 | Github | - |
| SWE-bench | SWE-bench: Can Language Models Resolve Real-World GitHub Issues? | ICLR'24 | 2024.03 | Github | HF |
| CrossCodeEval | CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | NeurIPS'23 | 2023.11 | Github | - |
| RepoCoder | Repository-Level Code Completion Through Iterative Retrieval and Generation | EMNLP'23 | 2023.10 | Github | - |
| LongCoder | LongCoder: A Long-Range Pre-trained Language Model for Code Completion | ICML'23 | 2023.10 | Github | - |
| - | Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation | Preprint | 2023.08 | - | - |
| BioCoder | BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models | ISMB'24 | 2023.08 | Github | - |
| RepoBench | RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems | ICLR'24 | 2023.06 | Github | HF |
| Evalplus | Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | NeurIPS'23 | 2023.05 | Github | HF |
| Coeditor | Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing | ICLR'24 | 2023.05 | Github | - |
| DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML'23 | 2022.11 | Github | HF |
| MultiPL-E | MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Preprint | 2022.08 | Github | HF |
| MBPP | Program Synthesis with Large Language Models | Preprint | 2021.08 | Github | HF |
| APPS | Measuring Coding Challenge Competence With APPS | NeurIPS'21 | 2021.05 | Github | HF |
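
For the function-level benchmarks above (HumanEval, MBPP, EvalPlus, MultiPL-E), a problem is essentially a prompt, an entry point, and a test harness executed against the model's completion. Below is a hypothetical record in roughly the HumanEval JSONL layout; the field names match the released dataset, but the content is invented for illustration.

```python
# Hypothetical problem in roughly the HumanEval JSONL layout
# (field names follow the released dataset; content invented).
problem = {
    "task_id": "Example/0",
    "prompt": (
        "def double(x: int) -> int:\n"
        '    """Return twice the input."""\n'
    ),
    "entry_point": "double",
    "canonical_solution": "    return 2 * x\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(0) == 0\n"
        "    assert candidate(3) == 6\n"
    ),
}

# Evaluation concatenates prompt + model completion, then runs the checker.
ns: dict = {}
exec(problem["prompt"] + problem["canonical_solution"], ns)
exec(problem["test"], ns)
ns["check"](ns[problem["entry_point"]])  # raises AssertionError on failure
```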

&nbsp;

🙌 Contributors

<a href="https://github.com/huybery"><img src="https://avatars.githubusercontent.com/u/13436140?v=4" width="50" /></a> <a href="https://github.com/Yangjiaxi"><img src="https://avatars.githubusercontent.com/u/6203054?v=4" width="50" /></a> <a href="https://github.com/GanjinZero"><img src="https://avatars.githubusercontent.com/u/19466330?v=4" width="50" /></a> <a href="https://github.com/TyDunn"><img src="https://avatars.githubusercontent.com/u/13314504?v=4" width="50" /></a> <a href="https://github.com/Hambaobao"><img src="https://avatars.githubusercontent.com/u/48345096?v=4" width="50" /></a>

This is an active repository and your contributions are always welcome! If you have any questions about this opinionated list, do not hesitate to contact me at huybery@gmail.com.

&nbsp;

Cite as

```bibtex
@software{awesome-code-llm,
  author = {Binyuan Hui and Lei Zhang},
  title = {An awesome and curated list of the best code LLMs for research},
  howpublished = {\url{https://github.com/huybery/Awesome-Code-LLM}},
  year = {2023},
}
```

&nbsp;

Acknowledgement

This project is inspired by Awesome-LLM.

&nbsp;

Star History

Star History Chart

⬆ Back to ToC