Home

Awesome

DeepBugs: Deep Learning to Find Bugs

DeepBugs is a framework for learning name-based bug detectors from an existing code corpus. See our OOPSLA'18 paper for a detailed description.

Quick Start

A quick and easy way to play with a simplified version of DeepBugs is a Jupyter notebook, which you can run on Google's Colaboratory. To use the full DeepBugs tool, read on.

Overview

Requirements

JavaScript Corpus

Learning a Bug Detector

Creating a bug detector consists of two main steps:

  1. Extract positive (i.e., likely correct) and negative (i.e., likely buggy) training examples from code.
  2. Train a classifier to distinguish correct from incorrect code examples.

Each bug detector addresses a particular bug pattern, e.g.:

Step 1: Extract positive and negative training examples

node javascript/extractFromJS.js calls --parallel 4 data/js/programs_50_training.txt data/js/programs_50

Step 2: Train a classifier to identify bugs

A) Train and validate the classifier python3 python/BugLearnAndValidate.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json --validation_data calls_yy*.json

B) Train a classifier for later use python3 python/BugLearn.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json

Note that learning a bug detector from the very small corpus of 50 programs will yield a classifier with low accuracy that is unlikely to be useful. To leverage the full power of DeepBugs, you'll need a larger code corpus, e.g., the JS150 corpus mentioned above.

Finding Bugs

Finding bugs in one or more source files consists of these two steps:

  1. Extract code pieces
  2. Use a trained classifier to identify bugs

Step 1: Extract code pieces

node javascript/extractFromJS.js calls --files <list of files>

Step 2: Use a trained classifier to identify bugs

python3 python/BugFind.py --pattern SwappedArgs --threshold 0.95 --model some/dir --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --testing_data calls_xx*.json

Embeddings for Identifiers

The above bug detectors rely on a vector representation for identifier names and literals. To use our framework, the easiest is to use the shipped token_to_vector.json file. Alternatively, you can learn the embeddings via Word2Vec as follows:

  1. Extract identifiers and tokens:

node javascript/extractFromJS.js tokens --parallel 4 data/js/programs_50_training.txt data/js/programs_50

  1. Encode identifiers and literals with context into arrays of numbers (for faster reading during learning):

python3 python/TokensToTopTokens.py tokens_*.json

  1. Learn embeddings for identifiers and literals:

python3 python/EmbeddingLearnerWord2Vec.py token_to_number_*.json encoded_tokens_*.json