A Collection of Datasets for Big Code Analysis

A collection of datasets (and other resources) for big code analysis.

If you want to contribute to this list, please send a pull request.

Datasets

| Name | Description | Tag | Language | Link |
| --- | --- | --- | --- | --- |
| CodeSearchNet | Dataset and benchmarks for code retrieval using natural language | Code Retrieval, NLP | Multiple (Python) | link |
| PY150 | 150k Python programs and their corresponding abstract syntax trees, released by OOPSLA'16 "Probabilistic Model for Code with Decision Trees" | General | Python | link |
| OJ-104 | Code from an online judge system, consisting of 104 classes of C programs, released by AAAI'16 "Convolutional Neural Networks over Tree Structures for Programming Language Processing" | Code Classification, Clone Detection | C | link, also used in ASTNN |
| code2seq | Dataset released with code2vec, code2seq (ICLR), etc. | Code Completion | Java, C# | link |
| BigCloneBench | Clone detection benchmark of known clones in the dataset source repository | Clone Detection | Java | link |
| Google Code Jam | Projects collected from the Google Code Jam competition | Clone Detection | Java | link |
| CodeChef | Program classification dataset released on Kaggle | Code Classification | Java | link |
| OOPSLA19Li | Dataset released by OOPSLA'19 "Improving Bug Detection via Context-based Code Representation Learning and Attention-based Neural Networks" | Bug Detection | Java | link |
| Devign | Dataset released by NeurIPS'19 "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks" | Vulnerability Identification | C | link |
| Draper | Source code of 1.27 million functions mined from open source software, labelled by static analysis for potential vulnerabilities. Released by ICMLA'18 "Automated Vulnerability Detection in Source Code Using Deep Representation Learning" | Vulnerability Identification | C | link |
| VulDeePecker | Semantics-based Vulnerability Candidate (SeVC) dataset. Released by NDSS'18 "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection" | Vulnerability Detection | C/C++ | link |
| SySeVR | Semantics-based Vulnerability Candidate (SeVC) dataset, released by arXiv'18 "SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities" | Vulnerability Detection | C | link |
| Seahymn | Vulnerable functions from 9 open-source software projects | Vulnerability Detection | C | link |
| Big-Vul | A C/C++ code vulnerability dataset with code changes and CVE summaries | Vulnerability Detection | C/C++ | link |
| RAISE19Ferenc | Dataset released by RAISE'19 "Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions" | Vulnerability Detection | JavaScript | link |
| D2A | Differential analysis dataset released by the ICSE-SEIP'21 paper "D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis" | Vulnerability Detection | C/C++ | link |
| TypeWriter | Dataset released by FSE'20 "TypeWriter: Neural Type Prediction with Search-based Validation" | Type Inference | Python | link |
| DeepTyper | Dataset released by FSE'18 "Deep Learning Type Inference" | Type Inference | JavaScript | link |
| Typilus | Dataset released by the PLDI'20 paper "Typilus: Neural Type Hints" | Type Inference | Python | link |

Resources