Home

Awesome

Awesome-Code-LLM

This is the repo for our survey Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code - a comprehensive review of LLM research for code. Works in each category are ordered chronologically. If you have a basic understanding of machine learning but are new to NLP, we also provide a list of recommended readings in Section 9.

<p align='center'> <img src='imgs/overview.png' style='width: 80%; '> </p>

News

🔥🔥🔥 [2024/08/03] Latest papers:

🔥🔥     [2024/07] We have compiled 118 papers from June 2024 in one WeChat article.

🔥         [2024/06] We have compiled 90 papers from May 2024 in one WeChat article.

Table of Contents

  1. Surveys

  2. Models

    2.1 Off-the-Shelf LLM

    2.2 Existing LLM Adapted to Code

    2.3 General Pretraining on Code


    2.4 (Instruction) Fine-Tuning on Code

    2.5 Reinforcement Learning on Code

  3. When Coding Meets Reasoning

    3.1 Coding for Reasoning

    3.2 Code Simulation

    3.3 Code Agents

    3.4 Interactive Coding

  4. Code LLM for Low-Resource, Low-Level, and Domain-Specific Languages

  5. Methods/Models for Downstream Tasks

  6. Analysis of AI-Generated Code

  7. User-LLM Interaction

  8. Datasets

    8.1 Pretraining

    8.2 Benchmarks

  9. Recommended Readings

  10. Citation

  11. Star History

  12. Join Us

1. Surveys

We list several recent surveys on similar topics. While they all address language models for code, surveys 1-2 approach the topic from the NLP side, 3-6 from the SE side, and 7-11 were released after ours.

  1. "Large Language Models Meet NL2Code: A Survey" [2022-12] [ACL 2023] [paper]

  2. "A Survey on Pretrained Language Models for Neural Code Intelligence" [2022-12] [paper]

  3. "An Empirical Comparison of Pre-Trained Models of Source Code" [2023-02] [ICSE 2023] [paper]

  4. "Large Language Models for Software Engineering: A Systematic Literature Review" [2023-08] [paper]

  5. "Towards an Understanding of Large Language Models in Software Engineering Tasks" [2023-08] [paper]

  6. "Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey" [2023-10] [paper]

  7. "A Survey on Large Language Models for Software Engineering" [2023-12] [paper]

  8. "Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit" [2023-12] [paper]

  9. "A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond" [2024-03] [paper]

  10. "Tasks People Prompt: A Taxonomy of LLM Downstream Tasks in Software Verification and Falsification Approaches" [2024-04] [paper]

  11. "Automatic Programming: Large Language Models and Beyond" [2024-05] [paper]

2. Models

2.1 Off-the-Shelf LLM

These LLMs are not specifically trained for code, but have demonstrated varying degrees of coding capability.

  1. LaMDA: "LaMDA: Language Models for Dialog Applications" [2022-01] [paper]

  2. PaLM: "PaLM: Scaling Language Modeling with Pathways" [2022-04] [JMLR] [paper]

  3. GPT-NeoX: "GPT-NeoX-20B: An Open-Source Autoregressive Language Model" [2022-04] [ACL 2022 Workshop on Challenges & Perspectives in Creating LLMs] [paper] [repo]

  4. BLOOM: "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model" [2022-11] [paper] [model]

  5. LLaMA: "LLaMA: Open and Efficient Foundation Language Models" [2023-02] [paper]

  6. GPT-4: "GPT-4 Technical Report" [2023-03] [paper]

  7. LLaMA 2: "Llama 2: Open Foundation and Fine-Tuned Chat Models" [2023-07] [paper] [repo]

  8. Phi-1.5: "Textbooks Are All You Need II: phi-1.5 technical report" [2023-09] [paper] [model]

  9. Baichuan 2: "Baichuan 2: Open Large-scale Language Models" [2023-09] [paper] [repo]

  10. Qwen: "Qwen Technical Report" [2023-09] [paper] [repo]

  11. Mistral: "Mistral 7B" [2023-10] [paper] [repo]

  12. Gemini: "Gemini: A Family of Highly Capable Multimodal Models" [2023-12] [paper]

  13. Phi-2: "Phi-2: The surprising power of small language models" [2023-12] [blog]

  14. YAYI2: "YAYI 2: Multilingual Open-Source Large Language Models" [2023-12] [paper] [repo]

  15. DeepSeek: "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism" [2024-01] [paper] [repo]

  16. Mixtral: "Mixtral of Experts" [2024-01] [paper] [blog]

  17. DeepSeekMoE: "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" [2024-01] [paper] [repo]

  18. Orion: "Orion-14B: Open-source Multilingual Large Language Models" [2024-01] [paper] [repo]

  19. OLMo: "OLMo: Accelerating the Science of Language Models" [2024-02] [paper] [repo]

  20. Gemma: "Gemma: Open Models Based on Gemini Research and Technology" [2024-02] [paper] [blog]

  21. Claude 3: "The Claude 3 Model Family: Opus, Sonnet, Haiku" [2024-03] [paper] [blog]

  22. Yi: "Yi: Open Foundation Models by 01.AI" [2024-03] [paper] [repo]

  23. Poro: "Poro 34B and the Blessing of Multilinguality" [2024-04] [paper] [model]

  24. JetMoE: "JetMoE: Reaching Llama2 Performance with 0.1M Dollars" [2024-04] [paper] [repo]

  25. LLaMA 3: "The Llama 3 Herd of Models" [2024-04] [blog] [repo] [paper]

  26. Reka Core: "Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models" [2024-04] [paper]

  27. Phi-3: "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" [2024-04] [paper]

  28. OpenELM: "OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework" [2024-04] [paper] [repo]

  29. Tele-FLM: "Tele-FLM Technical Report" [2024-04] [paper] [model]

  30. DeepSeek-V2: "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" [2024-05] [paper] [repo]

  31. GECKO: "GECKO: Generative Language Model for English, Code and Korean" [2024-05] [paper] [model]

  32. MAP-Neo: "MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series" [2024-05] [paper] [repo]

  33. Skywork-MoE: "Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models" [2024-06] [paper]

  34. Xmodel-LM: "Xmodel-LM Technical Report" [2024-06] [paper]

  35. GEB: "GEB-1.3B: Open Lightweight Large Language Model" [2024-06] [paper]

  36. HARE: "HARE: HumAn pRiors, a key to small language model Efficiency" [2024-06] [paper]

  37. DCLM: "DataComp-LM: In search of the next generation of training sets for language models" [2024-06] [paper]

  38. Nemotron-4: "Nemotron-4 340B Technical Report" [2024-06] [paper]

  39. ChatGLM: "ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools" [2024-06] [paper]

  40. YuLan: "YuLan: An Open-source Large Language Model" [2024-06] [paper]

  41. Gemma 2: "Gemma 2: Improving Open Language Models at a Practical Size" [2024-06] [paper]

  42. H2O-Danube3: "H2O-Danube3 Technical Report" [2024-07] [paper]

  43. Qwen2: "Qwen2 Technical Report" [2024-07] [paper]

  44. ALLaM: "ALLaM: Large Language Models for Arabic and English" [2024-07] [paper]

  45. SeaLLMs 3: "SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages" [2024-07] [paper]

  46. AFM: "Apple Intelligence Foundation Language Models" [2024-07] [paper]

2.2 Existing LLM Adapted to Code

These models are general-purpose LLMs further pretrained on code-related data.

2.3 General Pretraining on Code

These models are Transformer encoders, decoders, and encoder-decoders pretrained from scratch on code, using objectives originally developed for general language modeling.

<p align='center'> <img src='imgs/model_detail.png' style='width: 90%; '> </p>

Encoder

  1. CuBERT (MLM + NSP): "Learning and Evaluating Contextual Embedding of Source Code" [2019-12] [ICML 2020] [paper] [repo]

  2. CodeBERT (MLM + RTD): "CodeBERT: A Pre-Trained Model for Programming and Natural Languages" [2020-02] [EMNLP findings 2020] [paper] [repo]

  3. GraphCodeBERT (MLM + DFG Edge Prediction + DFG Node Alignment): "GraphCodeBERT: Pre-training Code Representations with Data Flow" [2020-09] [ICLR 2021] [paper] [repo]

  4. SynCoBERT (MLM + Identifier Prediction + AST Edge Prediction + Contrastive Learning): "SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation" [2021-08] [paper]

  5. DISCO (MLM + Node Type MLM + Contrastive Learning): "Towards Learning (Dis)-Similarity of Source Code from Program Contrasts" [2021-10] [ACL 2022] [paper]

  6. Code-MVP (MLM + Type Inference + Contrastive Learning): "CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training" [2022-05] [NAACL 2022 Technical Track] [paper]

  7. CodeSage (MLM + Deobfuscation + Contrastive Learning): "Code Representation Learning At Scale" [2024-02] [ICLR 2024] [paper]

  8. CoLSBERT (MLM): "Scaling Laws Behind Code Understanding Model" [2024-02] [paper]
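
Most of the encoder models above build on masked language modeling (MLM): mask a fraction of input tokens and train the model to recover them. A minimal sketch of BERT-style 80/10/10 masking (the token IDs, vocabulary size, and rates here are illustrative assumptions, not any particular model's configuration):

```python
import random

MASK, VOCAB_SIZE = 103, 30000  # illustrative special-token ID and vocab size

def mlm_mask(token_ids, mask_rate=0.15, seed=0):
    """Return (corrupted_ids, labels); labels are -100 where no loss is taken."""
    rng = random.Random(seed)
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_rate:
            continue
        labels[i] = tok                               # predict the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                       # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

The graph-, syntax-, and contrast-based objectives listed above (DFG edge prediction, identifier prediction, contrastive learning) are layered on top of this basic recipe.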

Decoder

  1. GPT-C (CLM): "IntelliCode Compose: Code Generation Using Transformer" [2020-05] [ESEC/FSE 2020] [paper]

  2. CodeGPT (CLM): "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [2021-02] [NeurIPS Datasets and Benchmarks 2021] [paper] [repo]

  3. CodeParrot (CLM) [2021-12] [blog]

  4. PolyCoder (CLM): "A Systematic Evaluation of Large Language Models of Code" [2022-02] [DL4C@ICLR 2022] [paper] [repo]

  5. CodeGen (CLM): "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis" [2022-03] [ICLR 2023] [paper] [repo]

  6. InCoder (Causal Masking): "InCoder: A Generative Model for Code Infilling and Synthesis" [2022-04] [ICLR 2023] [paper] [repo]

  7. PyCodeGPT (CLM): "CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation" [2022-06] [IJCAI-ECAI 2022] [paper] [repo]

  8. PanGu-Coder (CLM): "PanGu-Coder: Program Synthesis with Function-Level Language Modeling" [2022-07] [paper]

  9. SantaCoder (FIM): "SantaCoder: don't reach for the stars!" [2023-01] [paper] [model]

  10. CodeGeeX (CLM): "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [2023-03] [paper] [repo]

  11. StarCoder (FIM): "StarCoder: may the source be with you!" [2023-05] [paper] [model]

  12. Phi-1 (CLM): "Textbooks Are All You Need" [2023-06] [paper] [model]

  13. CodeFuse (CLM): "CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model" [2023-10] [paper] [model]

  14. DeepSeek Coder (CLM+FIM): "DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence" [2024-01] [paper] [repo]

  15. StarCoder2 (CLM+FIM): "StarCoder 2 and The Stack v2: The Next Generation" [2024-02] [paper] [repo]

  16. CodeShell (CLM+FIM): "CodeShell Technical Report" [2024-03] [paper] [repo]

  17. CodeQwen1.5 [2024-04] [blog]

  18. Granite: "Granite Code Models: A Family of Open Foundation Models for Code Intelligence" [2024-05] [paper] "Scaling Granite Code Models to 128K Context" [2024-07] [paper]

  19. NT-Java: "Narrow Transformer: Starcoder-Based Java-LM For Desktop" [2024-07] [paper]
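
Several of the decoder models above (SantaCoder, StarCoder, DeepSeek Coder, CodeShell) train with fill-in-the-middle (FIM): a document is split into prefix, middle, and suffix, then reordered with sentinel tokens so a left-to-right model learns infilling. A sketch of the PSM (prefix-suffix-middle) transformation, with plain strings standing in for the dedicated special tokens real tokenizers use:

```python
import random

# String sentinels stand in for special tokens; real models use dedicated
# tokenizer entries (e.g. StarCoder's <fim_prefix>/<fim_suffix>/<fim_middle>).
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim_psm(doc, rng=None):
    """Split a document at two random points and emit the PSM ordering:
    prefix, suffix, then middle -- so a causal LM learns to generate the
    middle conditioned on both surrounding contexts."""
    rng = rng or random.Random(0)
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```

At inference time the prompt ends right after the `<fim_middle>` sentinel, and the model's continuation fills the gap between prefix and suffix.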

Encoder-Decoder

  1. PyMT5 (Span Corruption): "PyMT5: multi-mode translation of natural language and Python code with transformers" [2020-10] [EMNLP 2020] [paper]

  2. DOBF (MLM + Deobfuscation): "DOBF: A Deobfuscation Pre-Training Objective for Programming Languages" [2021-02] [NeurIPS 2021] [paper] [repo]

  3. Mastropaolo et al. (Span Corruption): "Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks" [2021-02] [ICSE 2021] [paper] [repo]

  4. PLBART (DAE): "Unified Pre-training for Program Understanding and Generation" [2021-03] [NAACL 2021] [paper] [repo]

  5. CodeT5 (Span Corruption + Identifier Tagging + Masked Identifier Prediction + Text2Code + Code2Text): "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation" [2021-09] [EMNLP 2021] [paper] [repo]

  6. SPT-Code (Span Corruption + NSP + Method Name Prediction): "SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations" [2022-01] [ICSE 2022 Technical Track] [paper]

  7. AlphaCode (MLM + CLM): "Competition-Level Code Generation with AlphaCode" [2022-02] [Science] [paper] [blog]

  8. NatGen (Code Naturalization): "NatGen: Generative pre-training by "Naturalizing" source code" [2022-06] [ESEC/FSE 2022] [paper] [repo]

  9. ERNIE-Code (Span Corruption + Pivot-based Translation LM): "ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages" [2022-12] [ACL 2023 Findings] [paper] [repo]

  10. CodeT5+ (Span Corruption + CLM + Text-Code Contrastive Learning + Text-Code Translation): "CodeT5+: Open Code Large Language Models for Code Understanding and Generation" [2023-05] [EMNLP 2023] [paper] [repo]

  11. AST-T5 (Span Corruption): "AST-T5: Structure-Aware Pretraining for Code Generation and Understanding" [2024-01] [ICML 2024] [paper]
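
Span corruption, the T5 objective used by several encoder-decoder models above (CodeT5, AST-T5, and others), replaces contiguous token spans with numbered sentinels in the input and trains the decoder to emit the hidden spans. A minimal sketch (the sentinel naming follows T5's `<extra_id_k>` convention; the span-selection policy is left out):

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption on a token list: each (start, end) span is
    replaced by a sentinel in the source; the target lists each sentinel
    followed by the tokens it hid. Spans must be sorted and non-overlapping."""
    src, tgt, prev = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        src += tokens[prev:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        prev = end
    src += tokens[prev:]
    return src, tgt
```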

UniLM

  1. CugLM (MLM + NSP + CLM): "Multi-task Learning based Pre-trained Language Model for Code Completion" [2020-12] [ASE 2020] [paper]

  2. UniXcoder (MLM + NSP + CLM + Span Corruption + Contrastive Learning + Code2Text): "UniXcoder: Unified Cross-Modal Pre-training for Code Representation" [2022-03] [ACL 2022] [paper] [repo]

2.4 (Instruction) Fine-Tuning on Code

These models apply instruction fine-tuning techniques to enhance the capabilities of Code LLMs.

  1. WizardCoder (StarCoder + Evol-Instruct): "WizardCoder: Empowering Code Large Language Models with Evol-Instruct" [2023-06] [ICLR 2024] [paper] [repo]

  2. PanGu-Coder 2 (StarCoder + Evol-Instruct + RRTF): "PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback" [2023-07] [paper]

  3. OctoCoder (StarCoder) / OctoGeeX (CodeGeeX2): "OctoPack: Instruction Tuning Code Large Language Models" [2023-08] [ICLR 2024 Spotlight] [paper] [repo]

  4. "At Which Training Stage Does Code Data Help LLMs Reasoning" [2023-09] [ICLR 2024 Spotlight] [paper]

  5. InstructCoder: "InstructCoder: Instruction Tuning Large Language Models for Code Editing" [paper] [repo]

  6. MFTCoder: "MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning" [2023-11] [KDD 2024] [paper] [repo]

  7. "LLM-Assisted Code Cleaning For Training Accurate Code Generators" [2023-11] [ICLR 2024] [paper]

  8. Magicoder: "Magicoder: Empowering Code Generation with OSS-Instruct" [2023-12] [ICML 2024] [paper]

  9. WaveCoder: "WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation" [2023-12] [paper]

  10. Astraios: "Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models" [2024-01] [paper]

  11. DolphCoder: "DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning" [2024-02] [paper]

  12. SafeCoder: "Instruction Tuning for Secure Code Generation" [2024-02] [ICML 2024] [paper]

  13. CCT: "Code Comparison Tuning for Code Large Language Models" [2024-03] [paper]

  14. SAT: "Structure-aware Fine-tuning for Code Pre-trained Models" [2024-04] [paper]

  15. CodeFort: "CodeFort: Robust Training for Code Generation Models" [2024-04] [paper]

  16. XFT: "XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts" [2024-04] [paper] [repo]

  17. AIEV-Instruct: "AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct" [2024-05] [paper]

  18. AlchemistCoder: "AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data" [2024-05] [paper]

  19. "From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers" [2024-05] [paper]

  20. "Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning" [2024-05] [paper]

  21. PLUM: "PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models" [2024-06] [paper]

  22. mCoder: "McEval: Massively Multilingual Code Evaluation" [2024-06] [paper]

  23. "Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models" [2024-06] [paper]

  24. Code-Optimise: "Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency" [2024-06] [paper]

  25. UniCoder: "UniCoder: Scaling Code Large Language Model via Universal Code" [2024-06] [paper]

  26. "Brevity is the soul of wit: Pruning long files for code generation" [2024-06] [paper]

  27. "Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning" [2024-07] [paper]

  28. InverseCoder: "InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct" [2024-07] [paper]

  29. "Curriculum Learning for Small Code Language Models" [2024-07] [paper]

  30. Genetic-Instruct: "Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models" [2024-07] [paper]

2.5 Reinforcement Learning on Code

  1. CompCoder: "Compilable Neural Code Generation with Compiler Feedback" [2022-03] [ACL 2022] [paper]

  2. CodeRL: "CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning" [2022-07] [NeurIPS 2022] [paper] [repo]

  3. PPOCoder: "Execution-based Code Generation using Deep Reinforcement Learning" [2023-01] [TMLR 2023] [paper] [repo]

  4. RLTF: "RLTF: Reinforcement Learning from Unit Test Feedback" [2023-07] [paper] [repo]

  5. B-Coder: "B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis" [2023-10] [ICLR 2024] [paper]

  6. StepCoder: "StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback" [2024-02] [paper]

  7. RLPF & DPA: "Performance-Aligned LLMs for Generating Fast Code" [2024-04] [paper]

  8. "Measuring memorization in RLHF for code completion" [2024-06] [paper]

  9. "Applying RLAIF for Code Generation with API-usage in Lightweight LLMs" [2024-06] [paper]

3. When Coding Meets Reasoning

3.1 Coding for Reasoning

  1. PAL: "PAL: Program-aided Language Models" [2022-11] [ICML 2023] [paper] [repo]

  2. PoT: "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks" [2022-11] [TMLR 2023] [paper] [repo]

  3. CSV: "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification" [2023-08] [ICLR 2024] [paper]

  4. MathCoder: "MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning" [2023-10] [ICLR 2024] [paper]

  5. CoC: "Chain of Code: Reasoning with a Language Model-Augmented Code Emulator" [2023-12] [ICML 2024] [paper]

  6. ReGAL: "ReGAL: Refactoring Programs to Discover Generalizable Abstractions" [2024-01] [ICML 2024] [paper]

  7. "Executable Code Actions Elicit Better LLM Agents" [2024-02] [ICML 2024] [paper]

  8. FlowMind: "FlowMind: Automatic Workflow Generation with LLMs" [2024-03] [paper]

  9. Think-and-Execute: "Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models" [2024-04] [paper]

  10. CoRE: "CoRE: LLM as Interpreter for Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI Agents" [2024-05] [paper]

  11. MuMath-Code: "MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning" [2024-05] [paper]

  12. COGEX: "Learning to Reason via Program Generation, Emulation, and Search" [2024-05] [paper]

  13. "Arithmetic Reasoning with LLM: Prolog Generation & Permutation" [2024-05] [paper]

  14. "Can LLMs Reason in the Wild with Programs?" [2024-06] [paper]

  15. DotaMath: "DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning" [2024-07] [paper]

  16. CIBench: "CIBench: Evaluating Your LLMs with a Code Interpreter Plugin" [2024-07] [paper]

  17. PyBench: "PyBench: Evaluating LLM Agent on various real-world coding tasks" [2024-07] [paper]

  18. AdaCoder: "AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering" [2024-07] [paper]

  19. PyramidCoder: "Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering" [2024-07] [paper]
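
The core idea behind PAL and PoT is to have the LLM write a short program and let a Python interpreter, not the model, compute the final answer. A minimal harness sketch (the `answer` variable convention and the absence of sandboxing are simplifying assumptions; real systems isolate untrusted code):

```python
def run_program_of_thought(code, answer_var="answer"):
    """Execute model-generated Python and return the value bound to
    `answer_var`, offloading all arithmetic to the interpreter.
    (No sandboxing here -- production systems must isolate untrusted code.)"""
    env = {}
    exec(code, env)  # in practice: restricted globals, timeouts, a sandbox
    return env[answer_var]

# A solution an LLM might emit for: "Roger has 5 balls and buys 2 cans of
# 3 balls each. How many balls does he have now?"
generated = """
balls = 5
balls += 2 * 3
answer = balls
"""
```

This decoupling is why these methods shine on numerical reasoning: the model only needs to get the program right, not the arithmetic.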

3.2 Code Simulation

3.3 Code Agents

  1. Self-collaboration: "Self-collaboration Code Generation via ChatGPT" [2023-04] [paper]

  2. ChatDev: "Communicative Agents for Software Development" [2023-07] [paper] [repo]

  3. MetaGPT: "MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework" [2023-08] [paper] [repo]

  4. CodeChain: "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules" [2023-10] [ICLR 2024] [paper]

  5. CONLINE: "CONLINE: Complex Code Generation and Refinement with Online Searching and Correctness Testing" [2024-03] [paper]

  6. LCG: "When LLM-based Code Generation Meets the Software Development Process" [2024-03] [paper]

  7. RepairAgent: "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair" [2024-03] [paper]

  8. MAGIS: "MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution" [2024-03] [paper]

  9. SoA: "Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization" [2024-04] [paper]

  10. AutoCodeRover: "AutoCodeRover: Autonomous Program Improvement" [2024-04] [paper]

  11. SWE-agent: "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" [2024-05] [paper]

  12. MapCoder: "MapCoder: Multi-Agent Code Generation for Competitive Problem Solving" [2024-05] [paper]

  13. "Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks?" [2024-05] [paper]

  14. FunCoder: "Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation" [2024-05] [paper]

  15. CTC: "Multi-Agent Software Development through Cross-Team Collaboration" [2024-06] [paper]

  16. MASAI: "MASAI: Modular Architecture for Software-engineering AI Agents" [2024-06] [paper]

  17. AgileCoder: "AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology" [2024-06] [paper]

  18. CodeNav: "CodeNav: Beyond tool-use to using real-world codebases with LLM agents" [2024-06] [paper]

  19. INDICT: "INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness" [2024-06] [paper]

  20. AppWorld: "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents" [2024-07] [paper]

3.4 Interactive Coding

4. Code LLM for Low-Resource, Low-Level, and Domain-Specific Languages

5. Methods/Models for Downstream Tasks

For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and occasionally static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer-based methods (e.g. BERT, GPT, T5).

<p align='center'> <img src='imgs/downstream-1.png' style='width: 100%; '> <img src='imgs/downstream-2.png' style='width: 100%; '> <img src='imgs/downstream-3.png' style='width: 100%; '> <img src='imgs/downstream-4.png' style='width: 100%; '> <img src='imgs/downstream-5.png' style='width: 100%; '> </p>

Code Generation

Code Translation

Code Summarization

Program Repair

Code Similarity and Embedding (Clone Detection, Code Search)

Type Prediction

Repository-Level Coding

Frontend Development & Web Agents

Text-To-SQL

Test Generation

Oracle Generation

Mutation Testing

Vulnerability Detection

Malicious Code Detection

Compiler Optimization

Decompilation

Commit Message Generation

Code Review

Log Analysis

Software Modeling

Requirement Engineering

6. Analysis of AI-Generated Code

Security and Vulnerabilities

Correctness

Hallucination

Efficiency

Robustness

Interpretability

AI-Generated Code Detection

Others

7. User-LLM Interaction

8. Datasets

8.1 Pretraining

  1. CodeSearchNet: "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [2019-09] [paper] [repo] [data]

  2. The Pile: "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" [2020-12] [paper] [data]

  3. CodeParrot [2022-02] [data]

  4. The Stack: "The Stack: 3 TB of permissively licensed source code" [2022-11] [paper] [data]

  5. ROOTS: "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset" [2023-03] [NeurIPS 2022 Datasets and Benchmarks Track] [paper] [data]

  6. The Stack v2: "StarCoder 2 and The Stack v2: The Next Generation" [2024-02] [paper] [data]

8.2 Benchmarks

Integrated Benchmarks

Program Synthesis

| Date | Venue | Benchmark | Size | Language | Source |
| --- | --- | --- | --- | --- | --- |
| 2018-02 | LREC 2018 | NL2Bash | 9305 | Bash | "NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System" [paper] [data] |
| 2018-08 | EMNLP 2018 | CONCODE | 104K | Java | "Mapping Language to Code in Programmatic Context" [paper] [data] |
| 2019-10 | EMNLP-IJCNLP 2019 | JuICe | 1.5M/3725 * | Python | "JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation" [paper] [data] |
| 2021-05 | NeurIPS 2021 | APPS | 10000 | Python | "Measuring Coding Challenge Competence With APPS" [paper] [data] |
| 2021-07 | arXiv | HumanEval | 164 | Python | "Evaluating Large Language Models Trained on Code" [paper] [data] |
| 2021-08 | arXiv | MBPP/MathQA-Python | 974/23914 | Python | "Program Synthesis with Large Language Models" [paper] [MBPP] [MathQA-Python] |
| 2021-08 | ACL/IJCNLP 2021 | PlotCoder | 40797 | Python | "PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context" [paper] [data] |
| 2022-01 | arXiv | DSP | 1119 | Python | "Training and Evaluating a Jupyter Notebook Data Science Assistant" [paper] [data] |
| 2022-02 | Science | CodeContests | 13610 | C++, Python, Java | "Competition-Level Code Generation with AlphaCode" [paper] [data] |
| 2022-03 | EACL 2023 Findings | MCoNaLa | 896 | Python | "MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages" [paper] [data] |
| 2022-06 | arXiv | AixBench | 336 | Java | "AixBench: A Code Generation Benchmark Dataset" [paper] [data] |
| 2022-08 | IEEE Trans. Software Engineering | MultiPL-E | | | "MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation" [paper] [data] |
| 2022-10 | ICLR 2023 | MBXP | 12.4K | Python, Java, JS, TypeScript, Go, C#, PHP, Ruby, Kotlin, C++, Perl, Scala, Swift | "Multi-lingual Evaluation of Code Generation Models" [paper] [data] |
| 2022-10 | ICLR 2023 | Multilingual HumanEval | 1.9K | Python, Java, JS, TypeScript, Go, C#, PHP, Ruby, Kotlin, Perl, Scala, Swift | "Multi-lingual Evaluation of Code Generation Models" [paper] [data] |
| 2022-10 | ICLR 2023 | MathQA-X | 5.6K | Python, Java, JS | "Multi-lingual Evaluation of Code Generation Models" [paper] [data] |
| 2022-11 | arXiv | ExeDS | 534 | Python | "Execution-based Evaluation for Data Science Code Generation Models" [paper] [data] |
| 2022-11 | arXiv | DS-1000 | 1000 | Python | "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation" [paper] [data] |
| 2022-12 | arXiv | ODEX | 945 | Python | "Execution-Based Evaluation for Open-Domain Code Generation" [paper] [data] |
| 2023-02 | arXiv | CoderEval | 460 | Python, Java | "CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models" [paper] [data] |
| 2023-03 | arXiv | xCodeEval | 5.5M | C, C#, C++, Go, Java, JS, Kotlin, PHP, Python, Ruby, Rust | "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [paper] [data] |
| 2023-03 | arXiv | HumanEval-X | 820 | Python, C++, Java, JS, Go | "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [paper] [data] |
| 2023-05 | arXiv | HumanEval+ | 164 | Python | "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" [paper] [data] |
| 2023-06 | arXiv | StudentEval | 1749 $^\dagger$ | Python | "StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code" [paper] [data] |
| 2023-08 | ICLR 2024 Spotlight | HumanEvalPack | 984 | Python, JS, Go, Java, C++, Rust | "OctoPack: Instruction Tuning Code Large Language Models" [paper] [data] |
| 2023-06 | NeurIPS 2023 | DotPrompts | 10538 $^\ddagger$ | Java | "Guiding Language Models of Code with Global Context using Monitors" [paper] [data] |
| 2023-09 | arXiv | CodeApex | 476 | C++ | "CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models" [paper] [data] |
| 2023-09 | arXiv | VerilogEval | 8645/156 $^\diamond$ | Verilog | "VerilogEval: Evaluating Large Language Models for Verilog Code Generation" [paper] [data] |
| 2023-11 | arXiv | ML-Bench | 10040 | Bash | "ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks" [paper] [data] |
| 2023-12 | arXiv | TACO | 26,433 | Python | "TACO: Topics in Algorithmic COde generation dataset" [paper] [data] |
| 2024-01 | HPDC | ParEval | 420 | C++, CUDA, HIP | "Can Large Language Models Write Parallel Code?" [paper] [data] |
| 2024-04 | arXiv | USACO | 307 | Python | "Can Language Models Solve Olympiad Programming?" [paper] [data] |
| 2024-04 | LREC-COLING 2024 | PECC | 2396 | Python | "PECC: Problem Extraction and Coding Challenges" [paper] [data] |
| 2024-04 | arXiv | CodeGuard+ | 23 | Python, C | "Constrained Decoding for Secure Code Generation" [paper] [data] |
| 2024-05 | arXiv | NaturalCodeBench | 402 | Python, Java | "NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts" [paper] [data] |
| 2024-05 | arXiv | MHPP | 140 | Python | "MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation" [paper] [repo] |
| 2024-06 | arXiv | VHDL-Eval | 202 | VHDL | "VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation" [paper] |
| 2024-06 | arXiv | AICoderEval | 492 | Python | "AICoderEval: Improving AI Domain Code Generation of Large Language Models" [paper] [data] |
| 2024-06 | arXiv | VersiCode | 98,692 | Python | "VersiCode: Towards Version-controllable Code Generation" [paper] [data] |
| 2024-06 | IEEE AITest 2024 | ScenEval | 12,864 | Java | "ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation" [paper] |
| 2024-06 | arXiv | BigCodeBench | 1,140 | Python | "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions" [paper] [data] |
| 2024-07 | arXiv | CodeUpdateArena | 670 | Python | "CodeUpdateArena: Benchmarking Knowledge Editing on API Updates" [paper] [data] |
| 2024-07 | arXiv | LBPP | 161 | Python | "On Leakage of Code Generation Evaluation Datasets" [paper] [data] |
| 2024-07 | arXiv | NoviCode | 150 | Python | "NoviCode: Generating Programs from Natural Language Utterances by Novices" [paper] [data] |
| 2024-07 | arXiv | Case2Code | 1.3M | Python | "Case2Code: Learning Inductive Reasoning with Synthetic Data" [paper] [data] |
| 2024-07 | arXiv | SciCode | 338 | Python | "SciCode: A Research Coding Benchmark Curated by Scientists" [paper] [data] |
| 2024-07 | arXiv | auto-regression | 460 | Python | "Generating Unseen Code Tests In Infinitum" [paper] |
| 2024-07 | arXiv | WebApp1K | 1000 | JavaScript | "WebApp1K: A Practical Code-Generation Benchmark for Web App Development" [paper] [data] |

* Automatically mined/human-annotated

$^\dagger$ 1749 prompts for 48 problems

$^\ddagger$ 10538 prompts for 1420 problems

$^\diamond$ Machine/human prompts

Visually Grounded Program Synthesis

| Date | Venue | Benchmark | Size | Language | Source |
| --- | --- | --- | --- | --- | --- |
| 2024-04 | arXiv | MMCode | 3548 | Python | "MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems" [paper] [data] |
| 2024-05 | arXiv | Plot2Code | 132 | Python | "Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots" [paper] [data] |
| 2024-06 | arXiv | ChartMimic | 1000 | Python | "ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation" [paper] [data] |

Code Reasoning and QA

| Date | Venue | Benchmark | Size | Language | Source |
| --- | --- | --- | --- | --- | --- |
| 2021-09 | EMNLP Findings 2021 | CodeQA | 120K/70K | Java/Python | "CodeQA: A Question Answering Dataset for Source Code Comprehension" [paper] [data] |
| 2022-10 | NAACL 2022 | CS1QA | 9237 | Python | "CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course" [paper] [data] |
| 2023-09 | arXiv | CodeApex | 250 | C++ | "CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models" [paper] [data] |
| 2024-01 | ICML 2024 | CRUXEval | 800 | Python | "CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution" [paper] [data] |
| 2024-05 | arXiv | PythonIO | 2650 | Python | "Multiple-Choice Questions are Efficient and Robust LLM Evaluators" [paper] [data] |
| 2024-05 | arXiv | StaCCQA | 270K | Python | "Aligning LLMs through Multi-perspective User Preference Ranking-based Feedback for Programming Question Answering" [paper] [data] |
| 2024-06 | arXiv | RepoQA | 500 | Python, C++, Java, Rust, TypeScript | "RepoQA: Evaluating Long Context Code Understanding" [paper] [data] |

Text-to-SQL

| Date | Venue | Benchmark | Size | Source |
|------|-------|-----------|------|--------|
| 2017-08 | arXiv | WikiSQL | 80654 | "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning" [paper] [data] |
| 2018-06 | CL 2018 | Advising | 4570 | "Improving Text-to-SQL Evaluation Methodology" [paper] [data] |
| 2018-09 | EMNLP 2018 | Spider | 10181 | "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task" [paper] [data] |
| 2019-06 | ACL 2019 | SParC | 12726 | "SParC: Cross-Domain Semantic Parsing in Context" [paper] [data] |
| 2019-07 | WWW 2020 | MIMICSQL | 10000 | "Text-to-SQL Generation for Question Answering on Electronic Medical Records" [paper] [data] |
| 2019-09 | EMNLP-IJCNLP 2019 | CoSQL | 15598 | "CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases" [paper] [data] |
| 2020-05 | LREC 2020 | Criteria-to-SQL | 2003 | "Dataset and Enhanced Model for Eligibility Criteria-to-SQL Semantic Parsing" [paper] [data] |
| 2020-10 | EMNLP 2020 Findings | Squall | 11276 | "On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries" [paper] [data] |
| 2020-10 | NAACL-HLT 2021 | Spider-Realistic | 508 | "Structure-Grounded Pretraining for Text-to-SQL" [paper] [data] |
| 2021-06 | ACL/IJCNLP 2021 | Spider-Syn | 8034 | "Towards Robustness of Text-to-SQL Models against Synonym Substitution" [paper] [data] |
| 2021-06 | NLP4Prog 2021 | SEDE | 12023 | "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data" [paper] [data] |
| 2021-06 | ACL/IJCNLP 2021 | KaggleDBQA | 400 | "KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers" [paper] [data] |
| 2021-09 | EMNLP | Spider-DK | 535 | "Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization" [paper] [data] |
| 2022-05 | NAACL 2022 Findings | Spider-SS/CG | 8034/45599 | "Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment" [paper] [data] |
| 2023-05 | arXiv | BIRD | 12751 | "Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs" [paper] [data] |
| 2023-06 | ACL 2023 | XSemPLR | 24.4K | "XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations" [paper] [data] |
| 2024-05 | ACL Findings 2024 | EHR-SeqSQL | 31669 | "EHR-SeqSQL: A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records" [paper] |
| 2024-06 | NAACL 2024 | BookSQL | 100K | "BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain" [paper] [data] |
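
Most of these benchmarks (e.g. Spider and BIRD) score predictions by execution accuracy: the gold and predicted queries are both run against the database and their result sets compared. A minimal sketch on a toy schema (the `singer` table below is illustrative, not taken from any benchmark):

```python
import sqlite3

def execution_match(db: sqlite3.Connection, gold_sql: str, pred_sql: str) -> bool:
    """Compare result sets of gold and predicted SQL, order-insensitively."""
    try:
        gold = db.execute(gold_sql).fetchall()
        pred = db.execute(pred_sql).fetchall()
    except sqlite3.Error:  # predicted SQL may not even parse
        return False
    return sorted(map(repr, gold)) == sorted(map(repr, pred))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (name TEXT, country TEXT, age INT)")
db.executemany("INSERT INTO singer VALUES (?, ?, ?)",
               [("A", "US", 30), ("B", "UK", 25), ("C", "US", 40)])
print(execution_match(db,
                      "SELECT name FROM singer WHERE country = 'US'",
                      "SELECT name FROM singer WHERE country = 'US' ORDER BY age"))  # True
```

Real harnesses are stricter (per-component matching, value normalization, test-suite databases), but the core comparison looks like this.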

Code Translation

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2020-06 | NeurIPS 2020 | Transcoder GeeksforGeeks | 1.4K | C++, Java, Python | "Unsupervised Translation of Programming Languages" [paper] [data] |
| 2021-02 | NeurIPS Datasets and Benchmarks 2021 | CodeTrans | 11.8K | Java, C# | "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [paper] [data] |
| 2021-08 | ACL 2023 Findings | Avatar | 9515 | Java, Python | "AVATAR: A Parallel Corpus for Java-Python Program Translation" [paper] [data] |
| 2022-06 | AAAI 2022 | CoST | 132K | C++, Java, Python, C#, JS, PHP, C | "Multilingual Code Snippets Training for Program Translation" [paper] [data] |
| 2022-06 | arXiv | XLCoST | 567K | C++, Java, Python, C#, JS, PHP, C | "XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence" [paper] [data] |
| 2023-03 | arXiv | xCodeEval | 5.6M | C, C#, C++, Go, Java, JS, Kotlin, PHP, Python, Ruby, Rust | "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [paper] [data] |
| 2023-03 | arXiv | HumanEval-X | 1640 | Python, C++, Java, JS, Go | "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [paper] [data] |
| 2023-08 | arXiv | G-TransEval | 4000 | C++, Java, C#, JS, Python | "On the Evaluation of Neural Code Translation: Taxonomy and Benchmark" [paper] [data] |
| 2023-10 | arXiv | CodeTransOcean | 270.5K | 45 | "CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation" [paper] [data] |

Program Repair

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2014-07 | ISSTA 2014 | Defects4J | 357 | Java | "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs" [paper] [data] |
| 2015-12 | IEEE Trans. Software Engineering | ManyBugs/IntroClass | 185/998 | C | "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs" [paper] [data] |
| 2016-11 | FSE 2016 | BugAID | 105K | JS | "Discovering Bug Patterns in JavaScript" [paper] [data] |
| 2017-02 | AAAI 2017 | DeepFix | 6971 | C | "DeepFix: Fixing Common C Language Errors by Deep Learning" [paper] [data] |
| 2017-05 | ICSE-C 2017 | Codeflaws | 3902 | C | "Codeflaws: a programming competition benchmark for evaluating automated program repair tools" [paper] [data] |
| 2017-10 | SPLASH 2017 | QuixBugs | 80 | Java, Python | "QuixBugs: a multi-lingual program repair benchmark set based on the Quixey Challenge" [paper] [data] |
| 2018-05 | MSR 2018 | Bugs.jar | 1158 | Java | "Bugs.jar: a large-scale, diverse dataset of real-world Java bugs" [paper] [data] |
| 2018-12 | ACM Trans. Softw. Eng. Methodol. | BFP | 124K | Java | "An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation" [paper] [data] |
| 2019-01 | SANER 2019 | Bears | 251 | Java | "Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies" [paper] [data] |
| 2019-01 | ICSE 2019 | unnamed | 21.8K * | Java | "On Learning Meaningful Code Changes via Neural Machine Translation" [paper] [data] |
| 2019-04 | ICST 2019 | BugsJS | 453 | JS | "BugsJS: a Benchmark of JavaScript Bugs" [paper] [data] |
| 2019-05 | ICSE 2019 | BugSwarm | 1827/1264 | Java/Python | "BugSwarm: mining and continuously growing a dataset of reproducible failures and fixes" [paper] [data] |
| 2019-05 | ICSE 2019 | CPatMiner | 17K * | Java | "Graph-based mining of in-the-wild, fine-grained, semantic code change patterns" [paper] [data] |
| 2019-05 | MSR 2020 | ManySStuBs4J | 154K | Java | "How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset" [paper] [data] |
| 2019-11 | ASE 2019 | Refactory | 1783 | Python | "Re-factoring based program repair applied to programming assignments" [paper] [data] |
| 2020-07 | ISSTA 2020 | CoCoNut | 24M | Java, Python, C, JS | "CoCoNuT: combining context-aware neural translation models using ensemble for program repair" [paper] [data] |
| 2020-10 | Inf. Softw. Technol. | Review4Repair | 58021 | Java | "Review4Repair: Code Review Aided Automatic Program Repairing" [paper] [data] |
| 2020-11 | ESEC/FSE 2020 | BugsInPy | 493 | Python | "BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies" [paper] [data] |
| 2021-07 | ICML 2021 | TFix | 105K | JS | "TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer" [paper] [data] |
| 2021-08 | arXiv | Megadiff | 663K * | Java | "Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size" [paper] [data] |
| 2022-01 | MSR 2022 | SSB/TSSB | 9M/3M | Python | "TSSB-3M: Mining single statement bugs at massive scale" [paper] [data] |
| 2022-10 | MSR 2022 | FixJS | 324K | JS | "FixJS: a dataset of bug-fixing JavaScript commits" [paper] [data] |
| 2022-11 | ESEC/FSE 2022 | TypeBugs | 93 | Python | "PyTER: Effective Program Repair for Python Type Errors" [paper] [data] |
| 2023-03 | arXiv | xCodeEval | 4.7M | C, C#, C++, Go, Java, JS, Kotlin, PHP, Python, Ruby, Rust | "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [paper] [data] |
| 2023-04 | arXiv | RunBugRun | 450K | C, C++, Java, Python, JS, Ruby, Go, PHP | "RunBugRun -- An Executable Dataset for Automated Program Repair" [paper] [data] |
| 2023-08 | arXiv | HumanEvalPack | 984 | Python, JS, Go, Java, C++, Rust | "OctoPack: Instruction Tuning Code Large Language Models" [paper] [data] |
| 2024-01 | arXiv | DebugBench | 4253 | C++, Java, Python | "DebugBench: Evaluating Debugging Capability of Large Language Models" [paper] [data] |

* These are code-change datasets, and only a subset therein concerns bug fixing.
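
Most repair benchmarks above (Defects4J, QuixBugs, BugsInPy, ...) validate candidate patches by re-running the project's test suite: a patch that makes all tests pass is deemed plausible. A toy sketch of that validation loop (the `middle` example and helper names are illustrative):

```python
def is_plausible(candidate_src: str, fn_name: str, tests: list) -> bool:
    """Return True iff the candidate patch passes every (args, expected) test."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # the candidate may not even compile
        return all(namespace[fn_name](*args) == expected for args, expected in tests)
    except Exception:
        return False

buggy = "def middle(x, y, z):\n    return sorted([x, y, z])[0]"  # bug: returns the minimum
patch = "def middle(x, y, z):\n    return sorted([x, y, z])[1]"
tests = [((1, 2, 3), 2), ((3, 1, 2), 2), ((5, 5, 1), 5)]
print(is_plausible(buggy, "middle", tests), is_plausible(patch, "middle", tests))  # False True
```

Note that "plausible" is weaker than "correct": a patch may pass the suite while still being wrong, which is why several of these benchmarks include manually checked correctness labels.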

Code Summarization

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2016-08 | ACL 2016 | CODE-NN | 66K/32K | C#/SQL | "Summarizing Source Code using a Neural Attention Model" [paper] [data] |
| 2017-07 | IJCNLP 2017 | unnamed | 150K | Python | "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation" [paper] [data] |
| 2018-05 | ICPC 2018 | DeepCom | 588K | Java | "Deep code comment generation" [paper] [data] |
| 2018-07 | IJCAI 2018 | TL-CodeSum | 411K | Java | "Summarizing Source Code with Transferred API Knowledge" [paper] [data] |
| 2018-11 | ASE 2018 | unnamed | 109K | Python | "Improving Automatic Source Code Summarization via Deep Reinforcement Learning" [paper] [data] |
| 2019-09 | arXiv | CodeSearchNet | 2.3M | Go, JS, Python, PHP, Java, Ruby | "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [paper] [data] |
| 2023-08 | arXiv | HumanEvalPack | 984 | Python, JS, Go, Java, C++, Rust | "OctoPack: Instruction Tuning Code Large Language Models" [paper] [data] |
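
Datasets like CodeSearchNet build (code, summary) pairs by mining functions together with their docstrings. A minimal sketch of such mining with the standard `ast` module (requires Python 3.9+ for `ast.unparse`; real pipelines additionally strip the docstring from the code side and filter low-quality pairs):

```python
import ast

def mine_pairs(source: str):
    """Yield (function_code, summary) pairs, using the docstring's first line
    as the summary, in the spirit of CodeSearchNet-style mining."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # skip undocumented functions
                yield ast.unparse(node), doc.strip().splitlines()[0]

src = '''
def area(r):
    """Return the area of a circle with radius r."""
    return 3.14159 * r * r
'''
for code, summary in mine_pairs(src):
    print(summary)  # Return the area of a circle with radius r.
```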

Defect/Vulnerability Detection

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2018-01 | NDSS 2018 | CGD | 62K | C, C++ | "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection" [paper] [data] |
| 2018-04 | IEEE Trans. Ind. Informatics | unnamed | 32988 | C, C++ | "Cross-Project Transfer Representation Learning for Vulnerable Function Discovery" [paper] [data] |
| 2018-07 | ICMLA 2018 | Draper VDISC | 12.8M | C, C++ | "Automated Vulnerability Detection in Source Code Using Deep Representation Learning" [paper] [data] |
| 2018-07 | IEEE TDSC | SySeVR | 15591 | C, C++ | "SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities" [paper] [data] |
| 2019-02 | MSR 2019 | unnamed | 624 | Java | "A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software" [paper] [data] |
| 2019-09 | NeurIPS 2019 | Devign | 49K | C | "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks" [paper] [data] |
| 2019-11 | IEEE TDSC | unnamed | 170K | C, C++ | "Software Vulnerability Discovery via Learning Multi-Domain Knowledge Bases" [paper] [data] |
| 2019-12 | ICLR 2020 | GREAT | 2.8M | Python | "Global Relational Models of Source Code" [paper] [data] |
| 2020-01 | IEEE TDSC | MVD | 182K | C, C++ | "μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection" [paper] [data] |
| 2020-02 | ICICS 2019 | unnamed | 1471 | C | "Deep Learning-Based Vulnerable Function Detection: A Benchmark" [paper] [data] |
| 2020-09 | IEEE Trans. Software Eng. | ReVeal | 18K | C | "Deep Learning based Vulnerability Detection: Are We There Yet?" [paper] [data] |
| 2020-09 | MSR 2020 | Big-Vul | 265K | C, C++ | "A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries" [paper] [data] |
| 2021-02 | ICSE (SEIP) 2021 | D2A | 1.3M | C, C++ | "D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis" [paper] [data] |
| 2021-05 | NeurIPS 2021 | PyPIBugs | 2374 | Python | "Self-Supervised Bug Detection and Repair" [paper] [data] |
| 2021-07 | PROMISE 2021 | CVEfixes | 5495 | 27 | "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software" [paper] [data] |
| 2021-08 | ESEC/FSE 2021 | CrossVul | 27476 | 40+ | "CrossVul: a cross-language vulnerability dataset with commit data" [paper] [data] |
| 2023-04 | RAID 2023 | DiverseVul | 349K | C, C++ | "DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection" [paper] [data] |
| 2023-06 | arXiv | VulnPatchPairs | 26K | C | "Limits of Machine Learning for Automatic Vulnerability Detection" [paper] [data] |
| 2023-11 | arXiv | VulBench | 455 | C | "How Far Have We Gone in Vulnerability Detection Using Large Language Models" [paper] [data] |
| 2024-03 | arXiv | PrimeVul | 236K | C/C++ | "Vulnerability Detection with Code Language Models: How Far Are We?" [paper] |
| 2024-06 | arXiv | VulDetectBench | 1000 | C/C++ | "VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models" [paper] [data] |

Code Retrieval

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2018-03 | WWW 2018 | StaQC | 148K/120K | Python/SQL | "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" [paper] [data] |
| 2018-05 | ICSE 2018 | DeepCS | 16.2M | Java | "Deep Code Search" [paper] [data] |
| 2018-05 | MSR 2018 | CoNaLa | 600K/2.9K | Python | "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow" [paper] [data] |
| 2019-08 | arXiv | unnamed | 287 | Java | "Neural Code Search Evaluation Dataset" [paper] [data] |
| 2019-09 | arXiv | CodeSearchNet | 2.3M/99 | Go, PHP, JS, Python, Java, Ruby | "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [paper] [data] |
| 2020-02 | SANER 2020 | CosBench | 52 | Java | "Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries" [paper] [data] |
| 2020-08 | arXiv | SO-DS | 2.2K | Python | "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent" [paper] [data] |
| 2020-10 | ACM Trans. Knowl. Discov. Data | FB-Java | 249K | Java | "Deep Graph Matching and Searching for Semantic Code Retrieval" [paper] [data] |
| 2021-02 | NeurIPS Datasets and Benchmarks 2021 | AdvTest/WebQueryTest | 280K/1K | Python | "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [paper] [data] |
| 2021-05 | ACL/IJCNLP 2021 | CoSQA | 21K | Python | "CoSQA: 20,000+ Web Queries for Code Search and Question Answering" [paper] [data] |
| 2024-03 | arXiv | ProCQA | 5.2M | C, C++, Java, Python, Ruby, Lisp, JS, C#, Go, Rust, PHP | "ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search" [paper] [data] |
| 2024-06 | arXiv | CoSQA+ | 109K | Python | "CoSQA+: Enhancing Code Search Dataset with Matching Code" [paper] [data] |
| 2024-07 | arXiv | CoIR | ~2M | 14 | "CoIR: A Comprehensive Benchmark for Code Information Retrieval Models" [paper] [data] |
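
The task these benchmarks evaluate, retrieving a code snippet for a natural-language query, can be illustrated with a deliberately simple lexical baseline: split identifiers into subtokens, build bag-of-words vectors, and rank snippets by cosine similarity. Modern systems use dense embeddings instead; everything below is a toy sketch with made-up corpus snippets:

```python
import math
import re
from collections import Counter

def tokens(text: str):
    """Lowercase subtokens: splits on punctuation and camelCase boundaries."""
    out = []
    for part in re.findall(r"[A-Za-z]+|\d+", text):
        out += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [t.lower() for t in out]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, corpus):
    """Return the corpus snippet most similar to the query."""
    q = Counter(tokens(query))
    return max(corpus, key=lambda s: cosine(q, Counter(tokens(s))))

corpus = ["def read_json_file(path): ...",
          "def sort_list_desc(xs): ...",
          "def connect_to_database(url): ..."]
print(search("open a JSON file", corpus))  # def read_json_file(path): ...
```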

Type Inference

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2019-12 | ESEC/FSE 2020 | TypeWriter OSS | 208K | Python | "TypeWriter: Neural Type Prediction with Search-based Validation" [paper] [data] |
| 2020-04 | PLDI 2020 | Typilus | 252K | Python | "Typilus: Neural Type Hints" [paper] [data] |
| 2020-04 | ICLR 2020 | LambdaNet | 300 * | TypeScript | "LambdaNet: Probabilistic Type Inference using Graph Neural Networks" [paper] [data] |
| 2021-04 | MSR 2021 | ManyTypes4Py | 869K | Python | "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference" [paper] [data] |
| 2022-10 | MSR 2022 | ManyTypes4TypeScript | 9.1M | TypeScript | "ManyTypes4TypeScript: a comprehensive TypeScript dataset for sequence-based type inference" [paper] [data] |
| 2023-02 | ECOOP 2023 | TypeWeaver | 513 * | TypeScript | "Do Machine Learning Models Produce TypeScript Types That Type Check?" [paper] [data] |
| 2023-03 | ICLR 2023 | BetterTypes4Py/InferTypes4Py | 608K/4.6K | Python | "TypeT5: Seq2seq Type Inference using Static Analysis" [paper] [data] |
| 2023-05 | arXiv | OpenTau | 744 * | TypeScript | "Type Prediction With Program Decomposition and Fill-in-the-Type Training" [paper] [data] |

* These are project counts.

Commit Message Generation

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2017-03 | ICPC 2017 | unnamed | 509K | Java | "Towards Automatic Generation of Short Summaries of Commits" [paper] [data] |
| 2017-04 | ACL 2017 | CommitGen | 153K | Python, JS, C++, Java | "A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes" [paper] [data] |
| 2017-08 | ASE 2017 | CommitGen | 32K/75K * | Java | "Automatically Generating Commit Messages from Diffs using Neural Machine Translation" [paper] [data] |
| 2018-09 | ASE 2018 | NNGen | 27K | Java | "Neural-machine-translation-based commit message generation: how far are we?" [paper] [data] |
| 2019-05 | MSR 2019 | PtrGNCMsg | 64.9K | Java | "Generating commit messages from diffs using pointer-generator network" [paper] [data](https://zenodo.org/records/2593787) |
| 2019-08 | IJCAI 2019 | CoDiSum | 90.7K | Java | "Commit message generation for source code changes" [paper] [data] |
| 2019-12 | IEEE Trans. Software Eng. | ATOM | 160K | Java | "ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking" [paper] [data] |
| 2021-05 | arXiv | CommitBERT | 346K | Python, PHP, Go, Java, JS, Ruby | "CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model" [paper] [data] |
| 2021-07 | ICSME 2021 | MCMD | 2.25M | Java, C#, C++, Python, JS | "On the Evaluation of Commit Message Generation Models: An Experimental Study" [paper] [data] |
| 2021-07 | ACM Trans. Softw. Eng. Methodol. | CoRec | 107K | Java | "Context-aware Retrieval-based Deep Commit Message Generation" [paper] [data] |
| 2023-07 | ASE 2023 | ExGroFi | 19263 | Java | "Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models" [paper] [data] |
| 2023-08 | ASE 2023 | CommitChronicle | 10.7M | 20 | "From Commit Message Generation to History-Aware Commit Message Completion" [paper] [data] |

* with/without verb-direct object filter

Repo-Level Coding

| Date | Venue | Benchmark | Size | Language | Source |
|------|-------|-----------|------|----------|--------|
| 2023-03 | arXiv | RepoEval | 1600/1600/373 * | Python | "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation" [paper] [data] |
| 2023-06 | ICLR 2024 | RepoBench | 890K/9M/43K $^\dagger$ | Python, Java | "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems" [paper] [data] |
| 2023-06 | NeurIPS 2023 | PragmaticCode | 880 ** | Java | "Guiding Language Models of Code with Global Context using Monitors" [paper] [data] |
| 2023-06 | arXiv | Stack-Repo | 816K | Java | "RepoFusion: Training Code Models to Understand Your Repository" [paper] [data] |
| 2023-09 | ISMB 2024 | BioCoder | 2269/460/460 | Python, Java | "BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models" [paper] [data] |
| 2023-09 | arXiv | CodePlan | 645/21 $^\ddagger$ | C#/Python $^\ddagger$ | "CodePlan: Repository-level Coding using LLMs and Planning" [paper] [data] |
| 2023-10 | arXiv | SWE-Bench | 2294 | Python | "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [paper] [data] |
| 2023-10 | arXiv | CrossCodeEval | 9928 | Python, Java, TypeScript, C# | "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion" [paper] [data] |
| 2024-03 | arXiv | EvoCodeBench | 275 | Python | "EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories" [paper] [data] |
| 2024-05 | arXiv | DevEval | 1874 | Python | "DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories" [paper] [data] |
| 2024-06 | arXiv | JavaBench | 389 | Java | "Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench" [paper] [data] |
| 2024-06 | arXiv | HumanEvo | 200/200 | Python/Java | "Towards more realistic evaluation of LLM-based code generation: an experimental study and beyond" [paper] [data] |
| 2024-06 | arXiv | RepoExec | 355 | Python | "REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark" [paper] |
| 2024-06 | arXiv | RES-Q | 100 | Python, JavaScript | "RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale" [paper] [data] |

*Line Completion/API Invocation Completion/Function Completion

$^\dagger$ Retrieval/Completion/Pipeline

** File count

$^\ddagger$ Migration/Temporal Edit

Other tasks are coming soon!

9. Recommended Readings

30 papers as a primer on LLMs.

| Date | Keyword | Paper | TL;DR |
|------|---------|-------|-------|
| 2014-09 | Attention | Neural Machine Translation by Jointly Learning to Align and Translate | The original attention, proposed for encoder-decoder RNN |
| 2015-08 | BPE | Neural Machine Translation of Rare Words with Subword Units | Byte-pair encoding: split rare words into subword units |
| 2017-06 | Transformer | Attention Is All You Need | Replace LSTM with self-attention for long-range dependency and parallel training |
| 2017-10 | Mixed Precision Training | Mixed Precision Training | Store model weights in fp16 to save memory |
| 2018-04 | GLUE | GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | A language understanding benchmark |
| 2018-06 | GPT | Improving Language Understanding by Generative Pre-Training | Pretraining-finetuning paradigm applied to Transformer decoder |
| 2018-10 | BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Masked Language Modeling (MLM) applied to Transformer encoder for pretraining |
| 2019-02 | GPT-2 | Language Models are Unsupervised Multitask Learners | GPT made larger (1.5B). They found language models implicitly learn about downstream tasks (such as translation) during pretraining. |
| 2019-05 | SuperGLUE | SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems | Another language understanding benchmark |
| 2019-07 | RoBERTa | RoBERTa: A Robustly Optimized BERT Pretraining Approach | An optimized BERT |
| 2019-09 | Megatron-LM | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Model parallelism |
| 2019-10 | ZeRO | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | Memory-efficient distributed optimization |
| 2019-10 | T5 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Transformer encoder-decoder pretrained with an MLM-like denoising objective |
| 2020-05 | GPT-3 | Language Models are Few-Shot Learners | By training an even larger version of GPT-2 (175B), they discovered a new learning paradigm: In-Context Learning (ICL) |
| 2020-09 | MMLU | Measuring Massive Multitask Language Understanding | A world-knowledge and complex reasoning benchmark |
| 2020-12 | Pile | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | A diverse pretraining dataset |
| 2021-06 | LoRA | LoRA: Low-Rank Adaptation of Large Language Models | Memory-efficient finetuning |
| 2021-09 | FLAN | Finetuned Language Models Are Zero-Shot Learners | Instruction finetuning |
| 2021-10 | T0 | Multitask Prompted Training Enables Zero-Shot Task Generalization | Also instruction finetuning, but applied to the much smaller T5 |
| 2021-12 | Gopher | Scaling Language Models: Methods, Analysis & Insights from Training Gopher | A 280B LLM with comprehensive experiments |
| 2022-01 | CoT | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Chain-of-Thought reasoning |
| 2022-03 | InstructGPT | Training language models to follow instructions with human feedback | GPT-3 instruction finetuned with RLHF (reinforcement learning from human feedback) |
| 2022-03 | Chinchilla | Training Compute-Optimal Large Language Models | A smaller (70B) version of Gopher that's pretrained on more data |
| 2022-04 | PaLM | PaLM: Scaling Language Modeling with Pathways | The largest dense model at the time (540B) |
| 2022-05 | 0-shot CoT | Large Language Models are Zero-Shot Reasoners | Tell LLMs to think step by step, and they can actually do it |
| 2022-06 | BIG Bench | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | Another world-knowledge and complex reasoning benchmark |
| 2022-06 | Emergent Ability | Emergent Abilities of Large Language Models | A review on emergent abilities |
| 2022-10 | Flan | Scaling Instruction-Finetuned Language Models | Consolidate all the existing instruction tuning datasets, and you get SOTA |
| 2022-11 | BLOOM | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | The largest open-source LLM at the time, trained on 46 languages, with detailed discussion about training and evaluation |
| 2022-12 | Self-Instruct | Self-Instruct: Aligning Language Models with Self-Generated Instructions | Instruction tuning using LLM-generated data |
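
Some entries above are easiest to internalize from a toy implementation. For instance, the BPE merge-learning loop from "Neural Machine Translation of Rare Words with Subword Units" can be sketched in a few lines; the word frequencies below follow the paper's running example, and all function names are our own:

```python
from collections import Counter

def merge_word(symbols, pair):
    """Replace adjacent occurrences of `pair` in a symbol tuple with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def bpe_merges(word_freqs, num_merges):
    """Greedily learn BPE merge rules from a {word: frequency} vocabulary."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_word(s, best): f for s, f in vocab.items()}
    return merges

freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(bpe_merges(freqs, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```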

This list aims to provide the essential background for understanding current LLM technologies, and thus excludes more recent models such as LLaMA, GPT-4, or PaLM 2. For comprehensive reviews of these more general topics, we refer readers to other sources such as this paper or these repositories: Awesome-LLM, Awesome AIGC Tutorials. For LLM applications in other specific domains, see: Awesome Domain LLM, Awesome Tool Learning, Awesome-LLM-MT, Awesome Education LLM.

10. Citation

If you find this repo or our survey helpful, please consider citing us:

@article{zhang2023unifying,
      title={Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code},
      author={Ziyin Zhang and Chaoyu Chen and Bingchang Liu and Cong Liao and Zi Gong and Hang Yu and Jianguo Li and Rui Wang},
      year={2023},
      journal={CoRR},
      volume={abs/2311.07989},
      url={https://doi.org/10.48550/arXiv.2311.07989},
      doi={10.48550/ARXIV.2311.07989},
      eprint={2311.07989},
      eprinttype={arXiv},
}

11. Star History

Star History Chart

12. Join Us

We are the AI Native team within the Platform Technology Business Group at Ant Group, dedicated to the intelligent transformation of Ant Group's platform engineering. Established for over three years, our team has played a pivotal role in supporting the intelligent operation and maintenance of Ant Group's cloud computing infrastructure. Our mission is to build algorithm services and platforms with a wide user base through world-class technological innovation and impact, supporting the implementation of internal and external products and businesses. Embracing an innovation-driven ethos, our team not only supports business implementation but also propels technological influence. Over the past three years, we have published more than 20 papers at top conferences like ICLR, NeurIPS, KDD, and ACL. Our innovative business outcomes have earned us two of Ant Technology's highest T-Star awards and one SuperMA award from Ant Group. Our open-source project CodeFuse has received 4K stars as of February 2024, and our models have been downloaded over 1.5 million times on Hugging Face and ModelScope.

We are on the lookout for top talents to join our vibrant team! If you're eager to develop your career in an environment filled with energy, innovation, and a culture of excellence, we welcome you to explore our career opportunities for both campus and experienced hires. Join us and be a part of creating the next milestone in the industry.

Campus Recruitment: https://hrrecommend.antgroup.com/guide.html?code=8uoP5mlus5DqQYbE_EnqcE2FD5JZH21MwvMUIb9mb6X3osXPuBraG54SyM8GLn_7

Experienced Hires: https://talent.antgroup.com/off-campus-position?positionId=1933830
