Awesome Machine Learning On Source Code

Notice: This repository is no longer actively maintained. No further updates will be made, and issues and PRs will not be answered. An actively maintained alternative is available at the ml4code.github.io repository.

A curated list of awesome research papers, datasets and software projects devoted to machine learning and source code. #MLonCode

Digests
Conferences
Competitions
Papers
- Program Synthesis and Induction
- Source Code Analysis and Language Modeling
- Neural Network Architectures and Algorithms
- Embeddings in Software Engineering
- Program Translation
- Code Suggestion and Completion
- Program Repair and Bug Detection
- APIs and Code Mining
- Code Optimization
- Topic Modeling
- Sentiment Analysis
- Code Summarization
- Clone Detection
- Differentiable Interpreters
- Related research<details><summary>(links require "Related research" spoiler to be open)</summary>
Posts
Talks
Software
- Machine Learning
- Utilities
Datasets
Credits
Contributions
License

Digests

Learning from "Big Code" - Techniques, challenges, tools, datasets on "Big Code".
A Survey of Machine Learning for Big Code and Naturalness - Survey and literature review on Machine Learning on Source Code.

Conferences

<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> ACM International Conference on Software Engineering, ICSE
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> ACM International Conference on Automated Software Engineering, ASE
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE)
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> 2018 IEEE 25th International Conference on Software Analysis, Evolution, and Reengineering (SANER)
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> Machine Learning for Programming
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> Workshop on NLP for Software Engineering
<img src="badges/origin-industry-green.svg" alt="origin-industry" align="top"> SysML
- Talks
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> Mining Software Repositories
<img src="badges/origin-industry-green.svg" alt="origin-industry" align="top"> AIFORSE
<img src="badges/origin-industry-green.svg" alt="origin-industry" align="top"> source{d} tech talks
- Talks
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> NIPS Neural Abstract Machines and Program Induction workshop
- Talks
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> CamAIML
- Learning to Code: Machine Learning for Program Induction - Alexander Gaunt.
<img src="badges/origin-academia-blue.svg" alt="origin-academia" align="top"> MASES 2018

Competitions

CodRep - competition on automatic program repair: given a source line, find the insertion point.

Papers

Program Synthesis and Induction

<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Program Synthesis and Semantic Parsing with Learned Code Idioms - Richard Shin, Miltiadis Allamanis, Marc Brockschmidt, Oleksandr Polozov, 2019.
<img src="badges/16-pages-gray.svg" alt="16-pages" align="top"> Synthetic Datasets for Neural Program Synthesis - Richard Shin, Neel Kant, Kavi Gupta, Chris Bender, Brandon Trabucco, Rishabh Singh, Dawn Song, ICLR 2019.
<img src="badges/15-pages-gray.svg" alt="15-pages" align="top"> Execution-Guided Neural Program Synthesis - Xinyun Chen, Chang Liu, Dawn Song, ICLR 2019.
<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing - Xiao Liu, Xiaoting Li, Rupesh Prajapati, Dinghao Wu, AAAI 2019.
<img src="badges/12-pages-beginner-brightgreen.svg" alt="12-pages-beginner" align="top"> NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System - Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst, LREC 2018.
<img src="badges/18-pages-gray.svg" alt="18-pages" align="top"> Recent Advances in Neural Program Synthesis - Neel Kant, 2018.
<img src="badges/16-pages-gray.svg" alt="16-pages" align="top"> Neural Sketch Learning for Conditional Program Generation - Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine, ICLR 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Neural Program Search: Solving Programming Tasks from Description and Examples - Illia Polosukhin, Alexander Skidanov, ICLR 2018.
<img src="badges/16-pages-gray.svg" alt="16-pages" align="top"> Neural Program Synthesis with Priority Queue Training - Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le, 2018.
<img src="badges/31-pages-gray.svg" alt="31-pages" align="top"> Towards Synthesizing Complex Programs from Input-Output Examples - Xinyun Chen, Chang Liu, Dawn Song, ICLR 2018.
<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> Glass-Box Program Synthesis: A Machine Learning Approach - Konstantina Christakopoulou, Adam Tauman Kalai, AAAI 2018.
<img src="badges/14-pages-beginner-brightgreen.svg" alt="14-pages" align="top"> Synthesizing Benchmarks for Predictive Modeling - Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather, CGO 2017.
<img src="badges/17-pages-beginner-brightgreen.svg" alt="17-pages-beginner" align="top"> Program Synthesis for Character Level Language Modeling - Pavol Bielik, Veselin Raychev, Martin Vechev, ICLR 2017.
<img src="badges/13-pages-beginner-brightgreen.svg" alt="13-pages-beginner" align="top"> SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning - Xiaojun Xu, Chang Liu, Dawn Song, 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Learning to Select Examples for Program Synthesis - Yewen Pu, Zachery Miranda, Armando Solar-Lezama, Leslie Pack Kaelbling, 2017.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Neural Program Meta-Induction - Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew Hausknecht, Pushmeet Kohli, NIPS 2017.
<img src="badges/14-pages-beginner-brightgreen.svg" alt="14-pages-beginner" align="top"> Learning to Infer Graphics Programs from Hand-Drawn Images - Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Joshua B. Tenenbaum, 2017.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Neural Attribute Machines for Program Generation - Matthew Amodio, Swarat Chaudhuri, Thomas Reps, 2017.
<img src="badges/11-pages-beginner-brightgreen.svg" alt="11-pages-beginner" align="top"> Abstract Syntax Networks for Code Generation and Semantic Parsing - Maxim Rabinovich, Mitchell Stern, Dan Klein, ACL 2017.
<img src="badges/20-pages-gray.svg" alt="20-pages" align="top"> Making Neural Programming Architectures Generalize via Recursion - Jonathon Cai, Richard Shin, Dawn Song, ICLR 2017.
<img src="badges/14-pages-gray.svg" alt="14-pages" align="top"> A Syntactic Neural Model for General-Purpose Code Generation - Pengcheng Yin, Graham Neubig, ACL 2017.
<img src="badges/12-pages-beginner-brightgreen.svg" alt="12-pages-beginner" align="top"> Program Synthesis from Natural Language Using Recurrent Neural Networks - Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, Michael Ernst, 2017.
<img src="badges/18-pages-beginner-brightgreen.svg" alt="18-pages-beginner" align="top"> RobustFill: Neural Program Learning under Noisy I/O - Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli, ICML 2017.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Lifelong Perceptual Programming By Example - Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow, 2017.
<img src="badges/7-pages-gray.svg" alt="7-pages" align="top"> Neural Programming by Example - Chengxun Shu, Hongyu Zhang, AAAI 2017.
<img src="badges/21-pages-gray.svg" alt="21-pages" align="top"> DeepCoder: Learning to Write Programs - Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, Daniel Tarlow, ICLR 2017.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> A Differentiable Approach to Inductive Logic Programming - Fan Yang, Zhilin Yang, William W. Cohen, 2017.
<img src="badges/12-pages-beginner-brightgreen.svg" alt="12-pages-beginner" align="top"> Latent Attention For If-Then Program Synthesis - Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, Mingcheng Chen, NIPS 2016.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top" id="card2code"> Latent Predictor Networks for Code Generation - Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom, ACL 2016.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao, NIPS 2016.
<img src="badges/5-pages-gray.svg" alt="5-pages" align="top"> Programs as Black-Box Explanations - Sameer Singh, Marco Tulio Ribeiro, Carlos Guestrin, NIPS 2016.
<img src="badges/15-pages-gray.svg" alt="15-pages" align="top"> Search-Based Generalization and Refinement of Code Templates - Tim Molderez, Coen De Roover, SSBSE 2016.
<img src="badges/14-pages-gray.svg" alt="14-pages" align="top"> Structured Generative Models of Natural Source Code - Chris J. Maddison, Daniel Tarlow, ICML 2014.

Source Code Analysis and Language Modeling

<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Modeling Vocabulary for Big Code Machine Learning - Hlib Babii, Andrea Janes, Romain Robbes, 2019.
<img src="badges/24-pages-gray.svg" alt="24-pages" align="top"> Generative Code Modeling with Graphs - Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, Oleksandr Polozov, ICLR 2019.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> NL2Type: Inferring JavaScript Function Types from Natural Language Information - Rabee Sohail Malik, Jibesh Patra, Michael Pradel, ICSE 2019.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> A Novel Neural Source Code Representation based on Abstract Syntax Tree - Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, Xudong Liu, ICSE 2019.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Deep Learning Type Inference - Vincent J. Hellendoorn, Christian Bird, Earl T. Barr and Miltiadis Allamanis, FSE 2018. Code.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Tree2Tree Neural Translation Model for Learning Source Code Changes - Saikat Chakraborty, Miltiadis Allamanis, Baishakhi Ray, 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> code2seq: Generating Sequences from Structured Representations of Code - Uri Alon, Omer Levy, Eran Yahav, 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Syntax and Sensibility: Using language models to detect and correct syntax errors - Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral, SANER 2018.
<img src="badges/25-pages-gray.svg" alt="25-pages" align="top"> code2vec: Learning Distributed Representations of Code - Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, 2018.
<img src="badges/16-pages-gray.svg" alt="16-pages" align="top"> Learning to Represent Programs with Graphs - Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi, ICLR 2018.
<img src="badges/36-pages-gray.svg" alt="36-pages" align="top"> A Survey of Machine Learning for Big Code and Naturalness - Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton, 2017.
<img src="badges/36-pages-gray.svg" alt="36-pages" align="top"> Are Deep Neural Networks the Best Choice for Modeling Source Code? - Vincent J. Hellendoorn, Premkumar Devanbu, FSE 2017.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> A deep language model for software code - Hoa Khanh Dam, Truyen Tran, Trang Pham, 2016.
<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> Convolutional Neural Networks over Tree Structures for Programming Language Processing - Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin, AAAI-16. Code.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Suggesting Accurate Method and Class Names - Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton, FSE 2015.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Mining Source Code Repositories at Massive Scale using Language Modeling - Miltiadis Allamanis, Charles Sutton, MSR 2013.

Neural Network Architectures and Algorithms

<img src="badges/19-pages-gray.svg" alt="19-pages" align="top"> Learning Compositional Neural Programs with Recursive Tree Search and Planning - Thomas Pierrot, Guillaume Ligner, Scott Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, Nando de Freitas, 2019.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> From Programs to Interpretable Deep Models and Back - Eran Yahav, CAV 2018.
<img src="badges/13-pages-gray.svg" alt="13-pages" align="top"> Neural Code Comprehension: A Learnable Representation of Code Semantics - Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler, NIPS 2018.
<img src="badges/16-pages-gray.svg" alt="16-pages" align="top"> A General Path-Based Representation for Predicting Program Properties - Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, PLDI 2018.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks - Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu, AAAI 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification - Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang, SANER 2018.
<img src="badges/17-pages-gray.svg" alt="17-pages" align="top"> Syntax-Directed Variational Autoencoder for Structured Data - Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, Le Song, ICLR 2018.
<img src="badges/19-pages-gray.svg" alt="19-pages" align="top"> Divide and Conquer with Neural Networks - Alex Nowak, Joan Bruna, ICLR 2018.
<img src="badges/13-pages-gray.svg" alt="13-pages" align="top"> Hierarchical Multiscale Recurrent Neural Networks - Junyoung Chung, Sungjin Ahn, Yoshua Bengio, ICLR 2017.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Learning Efficient Algorithms with Hierarchical Attentive Memory - Marcin Andrychowicz, Karol Kurach, 2016.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> Learning Operations on a Stack with Neural Turing Machines - Tristan Deleu, Joseph Dureau, NIPS 2016.
<img src="badges/5-pages-gray.svg" alt="5-pages" align="top"> Probabilistic Neural Programs - Kenton W. Murray, Jayant Krishnamurthy, NIPS 2016.
<img src="badges/13-pages-gray.svg" alt="13-pages" align="top"> Neural Programmer-Interpreters - Scott Reed, Nando de Freitas, ICLR 2016.
<img src="badges/9-pages-gray.svg" alt="9-pages" align="top"> Neural GPUs Learn Algorithms - Łukasz Kaiser, Ilya Sutskever, ICLR 2016.
<img src="badges/17-pages-gray.svg" alt="17-pages" align="top"> Neural Random-Access Machines - Karol Kurach, Marcin Andrychowicz, Ilya Sutskever, ERCIM News 2016.
<img src="badges/18-pages-gray.svg" alt="18-pages" align="top"> Neural Programmer: Inducing Latent Programs with Gradient Descent - Arvind Neelakantan, Quoc V. Le, Ilya Sutskever, ICLR 2015.
<img src="badges/25-pages-gray.svg" alt="25-pages" align="top"> Learning to Execute - Wojciech Zaremba, Ilya Sutskever, 2015.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Armand Joulin, Tomas Mikolov, NIPS 2015.
<img src="badges/26-pages-gray.svg" alt="26-pages" align="top"> Neural Turing Machines - Alex Graves, Greg Wayne, Ivo Danihelka, 2014.
<img src="badges/15-pages-gray.svg" alt="15-pages" align="top"> From Machine Learning to Machine Reasoning - Léon Bottou, Journal of Machine Learning 2011.

Embeddings in Software Engineering

<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> A Literature Study of Embeddings on Source Code - Zimin Chen and Martin Monperrus, 2019.
<img src="badges/3-pages-gray.svg" alt="3-pages" align="top"> AST-Based Deep Learning for Detecting Malicious PowerShell - Gili Rusak, Abdullah Al-Dujaili, Una-May O'Reilly, 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Deep Code Search - Xiaodong Gu, Hongyu Zhang, Sunghun Kim, ICSE 2018.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> Word Embeddings for the Software Engineering Domain - Vasiliki Efstathiou, Christos Chatzilenas, Diomidis Spinellis, MSR 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces - Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, Thomas Reps, FSE 2018.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Document Distance Estimation via Code Graph Embedding - Zeqi Lin, Junfeng Zhao, Yanzhen Zou, Bing Xie, Internetware 2017.
<img src="badges/3-pages-gray.svg" alt="3-pages" align="top"> Combining Word2Vec with revised vector space model for better code retrieval - Thanh Van Nguyen, Anh Tuan Nguyen, Hung Dang Phan, Trong Duc Nguyen, Tien N. Nguyen, ICSE 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> From word embeddings to document similarities for improved information retrieval in software engineering - Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, Chang Liu, ICSE 2016.
<img src="badges/3-pages-gray.svg" alt="3-pages" align="top"> Mapping API Elements for Code Migration with Vector Representation - Trong Duc Nguyen, Anh Tuan Nguyen, Tien N. Nguyen, ICSE 2016.

Program Translation

<img src="badges/18-pages-gray.svg" alt="18-pages" align="top"> Towards Neural Decompilation - Omer Katz, Yuval Olshaker, Yoav Goldberg, Eran Yahav, 2019.
<img src="badges/14-pages-gray.svg" alt="14-pages" align="top"> Tree-to-tree Neural Networks for Program Translation - Xinyun Chen, Chang Liu, Dawn Song, ICLR 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Code Attention: Translating Code to Comments by Exploiting Domain Features - Wenhao Zheng, Hong-Yu Zhou, Ming Li, Jianxin Wu, 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Automatically Generating Commit Messages from Diffs using Neural Machine Translation - Siyuan Jiang, Ameer Armaly, Collin McMillan, ASE 2017.
<img src="badges/5-pages-gray.svg" alt="5-pages" align="top"> A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation - Antonio Valerio Miceli Barone, Rico Sennrich, IJCNLP 2017.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes - Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo, ACL 2017.

Code Suggestion and Completion

<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Aroma: Code Recommendation via Structural Code Search - Sifei Luan, Di Yang, Koushik Sen and Satish Chandra, 2019.
<img src="badges/9-pages-gray.svg" alt="9-pages" align="top"> Intelligent Code Reviews Using Deep Learning - Anshul Gupta, Neel Sundaresan, KDD DL Day 2018.
<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> Code Completion with Neural Attention and Pointer Networks - Jian Li, Yue Wang, Irwin King, Michael R. Lyu, 2017.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel, 2016.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav, PLDI 2014.

Program Repair and Bug Detection

<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> SampleFix: Learning to Correct Programs by Sampling Diverse Fixes - Hossein Hajipour, Apratim Bhattacharya, Mario Fritz, 2019.
<img src="badges/15-pages-gray.svg" alt="15-pages" align="top"> Maximal Divergence Sequential Autoencoder for Binary Software Vulnerability Detection - Tue Le, Tuan Nguyen, Trung Le, Dinh Phung, Paul Montague, Olivier De Vel, Lizhen Qu, ICLR 2019.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Neural Program Repair by Jointly Learning to Localize and Repair - Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, Rishabh Singh, ICLR 2019.
<img src="badges/11-pages-beginner-brightgreen.svg" alt="11-pages" align="top"> Compiler Fuzzing through Deep Learning - Chris Cummins, Pavlos Petoumenos, Alastair Murray, Hugh Leather, ISSTA 2018.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Automatically assessing vulnerabilities discovered by compositional analysis - Saahil Ognawala, Ricardo Nales Amato, Alexander Pretschner and Pooja Kulkarni, MASES 2018.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation - Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk, ASE 2018.
<img src="badges/23-pages-gray.svg" alt="23-pages" align="top"> DeepBugs: A Learning Approach to Name-based Bug Detection - Michael Pradel, Koushik Sen, 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Learning How to Mutate Source Code from Bug-Fixes - Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk, 2018.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> A deep tree-based model for software defect prediction - HK Dam, T Pham, SW Ng, T Tran, J Grundy, A Ghose, T Kim, CJ Kim, 2018.
<img src="badges/7-pages-gray.svg" alt="7-pages" align="top"> Automated Vulnerability Detection in Source Code Using Deep Representation Learning - Rebecca L. Russell, Louis Kim, Lei H. Hamilton, Tomo Lazovich, Jacob A. Harer, Onur Ozdemir, Paul M. Ellingwood, Marc W. McConley, 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Shaping Program Repair Space with Existing Patches and Similar Code - Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, Xiangqun Chen, 2018. (code).
<img src="badges/15-pages-gray.svg" alt="15-pages" align="top"> Learning to Repair Software Vulnerabilities with Generative Adversarial Networks - Jacob A. Harer, Onur Ozdemir, Tomo Lazovich, Christopher P. Reale, Rebecca L. Russell, Louis Y. Kim, Peter Chin, 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Dynamic Neural Program Embedding for Program Repair - Ke Wang, Rishabh Singh, Zhendong Su, ICLR 2018.
<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> Estimating defectiveness of source code: A predictive model using GitHub content - Ritu Kapur, Balwinder Sodhi, 2018.
<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> Automated software vulnerability detection with machine learning - Jacob A. Harer, Louis Y. Kim, Rebecca L. Russell, Onur Ozdemir, Leonard R. Kosta, Akshay Rangamani, Lei H. Hamilton, Gabriel I. Centeno, Jonathan R. Key, Paul M. Ellingwood, Marc W. McConley, Jeffrey M. Opper, Peter Chin, Tomo Lazovich, IWSPA 2018.
<img src="badges/34-pages-gray.svg" alt="34-pages" align="top"> Learning a Static Analyzer from Data - Pavol Bielik, Veselin Raychev, Martin Vechev, CAV 2017. video.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> To Type or Not to Type: Quantifying Detectable Bugs in JavaScript - Zheng Gao, Christian Bird, Earl Barr, ICSE 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities - Martin White, Michele Tufano, Matías Martínez, Martin Monperrus, Denys Poshyvanyk, 2017.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Semantic Code Repair using Neuro-Symbolic Transformation Networks - Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli, 2017.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma, FSE 2017.
<img src="badges/31-pages-gray.svg" alt="31-pages" align="top"> SmartPaste: Learning to Adapt Source Code - Miltiadis Allamanis, Marc Brockschmidt, 2017.
<img src="badges/7-pages-gray.svg" alt="7-pages" align="top"> End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks - Min-je Choi, Sehun Jeong, Hakjoo Oh, Jaegul Choo, IJCAI 2017.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Tailored Mutants Fit Bugs Better - Miltiadis Allamanis, Earl T. Barr, René Just, Charles Sutton, 2016.

APIs and Code Mining

<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> SAR: Learning Cross-Language API Mappings with Little Knowledge - Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang, FSE 2019.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code - Nghi D. Q. Bui, Lingxiao Jiang, ICSE 2018.
<img src="badges/7-pages-gray.svg" alt="7-pages" align="top"> DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, IJCAI 2017.
<img src="badges/9-pages-gray.svg" alt="9-pages" align="top"> Mining Change Histories for Unknown Systematic Edits - Tim Molderez, Reinout Stevens, Coen De Roover, MSR 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Deep API Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, FSE 2016.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen, Journal of Systems and Software 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou, 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Parameter-Free Probabilistic API Mining across GitHub - Jaroslav Fowkes, Charles Sutton, FSE 2016.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> A Subsequence Interleaving Model for Sequential Pattern Mining - Jaroslav Fowkes, Charles Sutton, KDD 2016.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> Lean GHTorrent: GitHub data on demand - Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, Andy Zaidman, MSR 2014.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Mining idioms from source code - Miltiadis Allamanis, Charles Sutton, FSE 2014.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> The GHTorrent Dataset and Tool Suite - Georgios Gousios, MSR 2013.

Code Optimization

<img src="badges/27-pages-gray.svg" alt="27-pages" align="top"> The Case for Learned Index Structures - Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis, SIGMOD 2018.
<img src="badges/14-pages-gray.svg" alt="14-pages" align="top"> End-to-end Deep Learning of Optimization Heuristics - Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather, PACT 2017.
<img src="badges/14-pages-gray.svg" alt="14-pages" align="top"> Learning to Superoptimize Programs - Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H.S. Torr, Pushmeet Kohli, ICLR 2017.
<img src="badges/18-pages-gray.svg" alt="18-pages" align="top"> Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang, USENIX Security Symposium 2017.
<img src="badges/25-pages-gray.svg" alt="25-pages" align="top"> Adaptive Neural Compilation - Rudy Bunel, Alban Desmaison, Pushmeet Kohli, Philip H.S. Torr, M. Pawan Kumar, NIPS 2016.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Learning to Superoptimize Programs - Workshop Version - Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, Pushmeet Kohli, NIPS 2016.

Topic Modeling

<img src="badges/9-pages-gray.svg" alt="9-pages" align="top"> A Language-Agnostic Model for Semantic Source Code Labeling - Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe and David Slater, MASES 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Topic modeling of public repositories at scale using names in source code - Vadim Markovtsev, Eiso Kant, 2017.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code - Miltiadis Allamanis, Charles Sutton, MSR 2013.
<img src="badges/30-pages-gray.svg" alt="30-pages" align="top"> Semantic clustering: Identifying topics in source code - Adrian Kuhn, Stéphane Ducasse, Tudor Girba, Information & Software Technology 2007.

Sentiment Analysis

<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> A Benchmark Study on Sentiment Analysis for Software Engineering Research - Nicole Novielli, Daniela Girardi, Filippo Lanubile, MSR 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Sentiment Analysis for Software Engineering: How Far Can We Go? - Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, Rocco Oliveto, ICSE 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Leveraging Automated Sentiment Analysis in Software Engineering - Md Rakibul Islam, Minhaz F. Zibran, MSR 2017.
<img src="badges/27-pages-gray.svg" alt="27-pages" align="top"> Sentiment Polarity Detection for Software Development - Fabio Calefato, Filippo Lanubile, Federico Maiorano, Nicole Novielli, Empirical Software Engineering 2017.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> SentiCR: A Customized Sentiment Analysis Tool for Code Review Interactions - Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, Shahram Rahimi, ASE 2017.

Code Summarization

<img src="badges/7-pages-gray.svg" alt="7-pages" align="top"> Summarizing Source Code with Transferred API Knowledge - Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, Zhi Jin, IJCAI 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Deep Code Comment Generation - Xing Hu, Ge Li, Xin Xia, David Lo, Zhi Jin, ICPC 2018.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> A Neural Framework for Retrieval and Summarization of Source Code - Qingying Chen, Minghui Zhou, ASE 2018.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Improving Automatic Source Code Summarization via Deep Reinforcement Learning - Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu and Philip S. Yu, ASE 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> A Convolutional Attention Network for Extreme Summarization of Source Code - Miltiadis Allamanis, Hao Peng, Charles Sutton, ICML 2016.
<img src="badges/4-pages-gray.svg" alt="4-pages" align="top"> TASSAL: Autofolding for Source Code Summarization - Jaroslav Fowkes, Pankajan Chanthirasegaran, Razvan Ranca, Miltiadis Allamanis, Mirella Lapata, Charles Sutton, ICSE 2016.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Summarizing Source Code using a Neural Attention Model - Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer, ACL 2016.
<img src="badges/13-pages-gray.svg" alt="13-pages" align="top"> Automatic Generation of Pull Request Descriptions - Zhongxin Liu, Xin Xia, Christoph Treude, David Lo, Shanping Li, ASE 2019.

Clone Detection

<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection - Lutz Büch and Artur Andrzejak, SANER 2019.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Oreo: detection of clones in the twilight zone - Vaibhav Saini, Farima Farmahinifarahani, Yadong Lu, Pierre Baldi, and Cristina V. Lopes, FSE 2018.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> A Deep Learning Approach to Program Similarity - Niccolò Marastoni, Roberto Giacobazzi and Mila Dalla Preda, MASES 2018.
<img src="badges/6-pages-gray.svg" alt="6-pages" align="top"> Recurrent Neural Network for Code Clone Detection - Arseny Zorin and Vladimir Itsykson, SEIM 2018.
<img src="badges/8-pages-gray.svg" alt="8-pages" align="top"> The Adverse Effects of Code Duplication in Machine Learning Models of Code - Miltiadis Allamanis, 2018.
<img src="badges/28-pages-gray.svg" alt="28-pages" align="top"> DéjàVu: a map of code duplicates on GitHub - Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek, Proceedings of the ACM on Programming Languages (OOPSLA) 2017.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Some from Here, Some from There: Cross-project Code Reuse in GitHub - Mohammad Gharehyazie, Baishakhi Ray, Vladimir Filkov, MSR 2017.
<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk, ASE 2016.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> A study of repetitiveness of code changes in software evolution - HA Nguyen, AT Nguyen, TT Nguyen, TN Nguyen, H Rajan, ASE 2013.

Differentiable Interpreters

<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer - Joseph Suarez, Justin Johnson, Fei-Fei Li, 2018.
<img src="badges/16-pages-gray.svg" alt="16-pages" align="top"> Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator Abstraction - Da Xiao, Jo-Yu Liao, Xingyuan Yuan, ICLR 2018.
<img src="badges/10-pages-gray.svg" alt="10-pages" align="top"> Differentiable Programs with Neural Libraries - Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow, ICML 2017.
<img src="badges/15-pages-gray.svg" alt="15-pages" align="top"> Differentiable Functional Program Interpreters - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow, 2017.
<img src="badges/18-pages-gray.svg" alt="18-pages" align="top"> Programming with a Differentiable Forth Interpreter - Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel, ICML 2017.
<img src="badges/15-pages-gray.svg" alt="15-pages" align="top"> Neural Functional Programming - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow, ICLR 2017.
<img src="badges/7-pages-gray.svg" alt="7-pages" align="top"> TerpreT: A Probabilistic Programming Language for Program Induction - Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, Daniel Tarlow, NIPS 2016.

<details> <summary>Related research</summary>

AST Differencing

<img src="badges/12-pages-gray.svg" alt="12-pages" align="top"> ClDiff: Generating Concise Linked Code Differences - Kaifeng Huang, Bihuan Chen, Xin Peng, Daihong Zhou, Ying Wang, Yang Liu, Wenyun Zhao, ASE 2018. Code.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Generating Accurate and Compact Edit Scripts Using Tree Differencing - Veit Frick, Thomas Grassauer, Fabian Beck, Martin Pinzger, ICSME 2018.
<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> Fine-grained and Accurate Source Code Differencing - Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, Martin Monperrus, ASE 2014.

Binary Data Modeling

Clustering Binary Data with Bernoulli Mixture Models - Neal S. Grantham.
A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data - Matthieu Marbac and Mohammed Sedki, Computational Statistics & Data Analysis 2017.
BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data - Panagiotis Papastamoulis and Magnus Rattray, R Journal 2016.

Soft Clustering Using T-mixture Models

Robust mixture modelling using the t distribution - D. Peel and G. J. McLachlan, Statistics and Computing 2000.
Robust mixture modeling using the skew t distribution - Tsung I. Lin, Jack C. Lee and Wan J. Hsieh, Statistics and Computing 2010.

Natural Language Parsing and Comprehension

<img src="badges/11-pages-gray.svg" alt="11-pages" align="top"> A Fast Unified Model for Parsing and Sentence Understanding - Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, Christopher Potts, ACL 2016.

</details>

Posts

Talks

Software

Machine Learning

Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
vecino - Finds similar Git repositories.
apollo - Source code deduplication at scale, research.
gemini - Source code deduplication at scale, production.
enry - Insanely fast file-based programming language detector.
hercules - Git repository mining framework with batteries on top of go-git.
DeepCS - Keras and Pytorch implementations of DeepCS (Deep Code Search).
Code Neuron - Recurrent neural network to detect code blocks in natural language text.
Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
Summarizing Source Code using a Neural Attention Model - CODE-NN, which uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch implementation for C# and SQL.
Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
Clone Digger - Clone detection for Python and Java.
Sensibility - Uses LSTMs to detect and correct syntax errors in Java source code.
DeepBugs - Framework for learning bug detectors from an existing code corpus.
DeepSim - Deep learning-based approach to measure code functional similarity.
rnn-autocomplete - Neural code autocompletion with RNN (bachelor's thesis).
MindsDB - Explainable AutoML framework for developers. With MindsDB you can build, train and use state-of-the-art ML models in as little as one line of code.

Utilities

go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
bblfsh - Self-hosted server for source code parsing.
engine - Scalable and distributed data retrieval pipeline for source code.
minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
source{d} models - Machine Learning models for MLonCode trained using the source{d} stack.

Datasets

Neural-Code-Search-Evaluation-Dataset - Links to 4.7M methods from 24k+ repositories, with 287 StackOverflow questions and code snippet answers.
CodeSearchNet - Collection of datasets and benchmarks for code retrieval using natural language. Contains 2M pairs of (comment, code).
Public Git Archive - 6 TB of Git repositories from GitHub.
StackOverflow Question-Code Dataset - ~148K Python and ~120K SQL question-code pairs mined from StackOverflow.
GitHub Issue Titles and Descriptions for NLP Analysis - ~8 million GitHub issue titles and descriptions from 2017.
GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
452M commits on GitHub - Metadata of ≈ 452M commits from 16M repositories on GitHub (October 2016).
GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repositories.
GitHub Source Code Names - Names (identifiers in source code, not names of people) extracted from 13M GitHub repositories.
GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
GitHub Java Corpus - Set of Java projects collected from GitHub that its authors have used in a number of their publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
card2code - Language-to-code datasets described in the paper Latent Predictor Networks for Code Generation.
NL2Bash - Set of ~10,000 Bash one-liners collected from websites such as StackOverflow, with English descriptions written by Bash programmers, as described in the paper.
GitHub JavaScript Dump October 2016 - Dataset consisting of 494,352 syntactically-valid JavaScript files obtained from the top ~10000 starred JavaScript repositories on GitHub, with licenses, and parsed ASTs.
BigCloneBench - Clone detection benchmark of 8 million function clone pairs in the IJaDataset.