A Collection of Datasets for Big Code Analysis
A collection of datasets (and other resources) for big code analysis.
If you want to contribute to this list, please send a pull request.
Datasets
Name | Description | Tag | Language | Link |
---|---|---|---|---|
CodeSearchNet | Dataset and benchmarks for code retrieval using natural language | Code Retrieval, NLP | Multiple (Python) | link |
PY150 | 150k Python programs and corresponding abstract syntax trees, released by OOPSLA'16 Probabilistic Model for Code with Decision Trees | General | Python | link |
OJ-104 | Code from an online judge system, consisting of 104 classes of C programs, released by AAAI'16 Convolutional Neural Networks over Tree Structures for Programming Language Processing | Code Classification, Clone Detection | C | link, also used in ASTNN |
code2seq | Datasets released with the ICLR'19 paper code2seq and related work such as code2vec | Code Completion | Java, C# | link |
BigCloneBench | Clone detection benchmark of known clones in the IJaDataset source repository | Clone Detection | Java | link |
Google Code Jam | Projects collected from the Google Code Jam competition | Clone Detection | Java | link |
CodeChef | Program classification dataset released on Kaggle | Code Classification | Java | link |
OOPSLA19Li | Dataset released by the OOPSLA'19 paper Improving Bug Detection via Context-based Code Representation Learning and Attention-based Neural Networks | Bug Detection | Java | link |
Devign | Dataset released by NeurIPS'19 Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks | Vulnerability Identification | C | link |
Draper | The dataset consists of the source code of 1.27 million functions mined from open source software, labelled by static analysis for potential vulnerabilities. The dataset is released by ICMLA'18 Automated Vulnerability Detection in Source Code Using Deep Representation Learning | Vulnerability Identification | C | link |
VulDeePecker | Code Gadget Database (CGD) dataset, released by NDSS'18 VulDeePecker: A Deep Learning-Based System for Vulnerability Detection | Vulnerability Detection | C/C++ | link |
SySeVR | The Semantics-based Vulnerability Candidate (SeVC) dataset released by arXiv'18 SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities | Vulnerability Detection | C | link |
Seahymn | Vulnerable functions from 9 open-source software projects | Vulnerability Detection | C | link |
Big-Vul | A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries | Vulnerability Detection | C/C++ | link |
RAISE19Ferenc | Dataset released by RAISE'19 Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions | Vulnerability Detection | JavaScript | link |
D2A | Differential Analysis Dataset released by ICSE-SEIP'21 paper D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis | Vulnerability Detection | C/C++ | link |
TypeWriter | Dataset released by FSE'20 TypeWriter: Neural Type Prediction with Search-based Validation | Type Inference | Python | link |
DeepTyper | Dataset released by FSE'18 Deep Learning Type Inference | Type Inference | JavaScript | link |
Typilus | Dataset released by PLDI'20 paper Typilus: Neural Type Hints | Type Inference | Python | link |