Function-level Vulnerability Detection and Dataset

Hi there, welcome!

This is an open-source project for function-level vulnerability detection. It takes source code functions as input and uses Word2vec to generate code embeddings. The output is the probability that the input function is vulnerable. The project includes 6 mainstream neural network models and can easily be extended with other network models implemented in Keras or TensorFlow.

For this project, we also collected vulnerable functions from 9 open-source software projects written in C. See Dataset for more details. The data collection and evaluation processes are detailed in a paper that is currently under review. We will publish all the data once the review process is complete.

Requirements

Instructions & Usage

After unzipping the archive of this repository, one will see the following folders:

There are also two Python script files: Word_to_vec_embedding.py (used in Step 1) and main.py (used in Steps 2 and 3).

Step 1: Train a Word2vec model

To use the provided data samples for model training, first train a Word2vec model by executing the following command:

python Word_to_vec_embedding.py --data_dir <path to code base.> --output_dir <path to the output file.>

This trains a Word2vec model on the code found under <path to code base.>. The Word2vec model converts source code tokens into meaningful embeddings for the neural network models to learn from.
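As a rough illustration of what "source code tokens" can look like, the sketch below splits a C function into tokens with a regular expression. The tokenizer, its pattern, and the names `tokenize`, `TOKEN_PATTERN`, and `COMMENT_PATTERN` are assumptions made for this example; the preprocessing actually used by Word_to_vec_embedding.py may differ.

```python
import re

# Hypothetical tokenizer for illustration only; not the project's actual code.
# Naive comment stripping: it would also cut "//" that appears inside a string.
COMMENT_PATTERN = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

TOKEN_PATTERN = re.compile(
    r"""
    [A-Za-z_]\w*                  # identifiers and keywords
  | 0[xX][0-9a-fA-F]+             # hexadecimal literals
  | \d+\.\d+ | \d+                # decimal literals
  | "(?:\\.|[^"\\])*"             # string literals
  | '(?:\\.|[^'\\])*'             # character literals
  | -> | \+\+ | -- | << | >> | <= | >= | == | != | && | \|\|   # two-character operators
  | [^\s\w]                       # any other single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(function_source):
    """Return the list of tokens of one C function, with comments removed."""
    without_comments = COMMENT_PATTERN.sub(" ", function_source)
    return TOKEN_PATTERN.findall(without_comments)


# Each function becomes one "sentence" (a list of tokens) in the corpus.
functions = [
    "int main() { return 0; }",
    "p->len >= 0x1F; // check",
]
corpus = [tokenize(source) for source in functions]
```

A corpus shaped like this, a list of token lists, is the kind of input that gensim's Word2vec training accepts.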

To use data samples from other sources, change the value of --data_dir to point to the directory where the sources are stored. The parameters available for the Word_to_vec_embedding.py script are as follows:

| Option | Description |
| --- | --- |
| data_dir | Path to the code base (obtained by downloading and unzipping the files under the data folder). By default, it is data/. |
| output_dir | The output path of the trained Word2vec model (two files). By default, it is result/. |
| n_workers | The number of threads for training. |
| size | The dimensionality of the word vectors. This is also the embedding dimension used in the subsequent steps. |
| window | The maximum distance between the current and predicted word within a sentence. |
| min_count | Ignores all words with a total frequency lower than this. |
| algorithm | Training algorithm: 1 for skip-gram; otherwise CBOW. |
| seed | The seed for the random number generator. |

One may check the parameter type and default value by using the option --help for this script. For more detailed configurations for Word2vec training, please refer to: https://radimrehurek.com/gensim/models/word2vec.html.
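To make the relationship between these options and gensim's Word2vec arguments concrete, here is a minimal sketch using argparse. The functions `build_parser` and `to_gensim_kwargs` are hypothetical, and every default except data/ and result/ is a placeholder chosen for the example, not the script's real default. Note that gensim 4 renamed the `size` argument to `vector_size`.

```python
import argparse


def build_parser():
    """Declare the options from the table above (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Train a Word2vec model on source code.")
    parser.add_argument("--data_dir", default="data/", help="Path to the code base.")
    parser.add_argument("--output_dir", default="result/", help="Where to save the model.")
    # The defaults below are placeholders, not the script's actual defaults.
    parser.add_argument("--n_workers", type=int, default=4, help="Training threads.")
    parser.add_argument("--size", type=int, default=100, help="Embedding dimension.")
    parser.add_argument("--window", type=int, default=5, help="Context window size.")
    parser.add_argument("--min_count", type=int, default=5, help="Minimum token frequency.")
    parser.add_argument("--algorithm", type=int, default=0, help="1 = skip-gram, else CBOW.")
    parser.add_argument("--seed", type=int, default=1, help="Random seed.")
    return parser


def to_gensim_kwargs(args, gensim_major=4):
    """Map parsed options to keyword arguments of gensim's Word2Vec class."""
    # gensim >= 4 calls the dimensionality "vector_size"; older versions use "size".
    size_key = "vector_size" if gensim_major >= 4 else "size"
    return {
        size_key: args.size,
        "window": args.window,
        "min_count": args.min_count,
        "workers": args.n_workers,
        "sg": 1 if args.algorithm == 1 else 0,  # 1 = skip-gram, 0 = CBOW
        "seed": args.seed,
    }


# Example: parse an explicit argument list instead of sys.argv.
example_args = build_parser().parse_args(["--size", "64", "--algorithm", "1"])
example_kwargs = to_gensim_kwargs(example_args)
```

The resulting dictionary could then be passed as `Word2Vec(corpus, **example_kwargs)`, assuming gensim is installed and `corpus` is a list of token lists.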

Step 2: Train a neural network model

When the Word2vec model is ready, one can train a neural network model. The parameters related to experiment/model settings are stored in a YAML configuration file, so users can adjust the settings by changing only the configuration file. See documentation and examples for more details.

Once the configuration file is ready, one may run the following command to train a neural network model.

python main.py --config config/config.yaml --data_dir <path_to_your_code>

By default, the training data is read from the data/ folder, the trained models are saved to the result/models/ folder, and the training logs are written to logs/. To visualize the training process, point TensorBoard at the logs/ folder (for example, tensorboard --logdir logs/).

Step 3: Test a trained neural network model

When training is completed, a user can test a network model on the test set by using the following command:

python main.py --config config/config.yaml --test --trained_model D:\Path\of\the\trained_model.h5

Users can use their own test set by setting using_separate_test_set to True in the config.yaml file.
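Since the model outputs a probability for each function, test results are usually summarized by thresholding those probabilities and comparing them with the ground-truth labels. The sketch below shows one way to do that. The 0.5 threshold, the choice of precision and recall, and the helper names `to_labels` and `precision_recall` are assumptions for illustration and are not taken from main.py.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities into 0/1 labels (1 = vulnerable)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the vulnerable class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and model outputs for four test functions.
y_true = [1, 0, 1, 0]
probabilities = [0.9, 0.6, 0.2, 0.1]
y_pred = to_labels(probabilities)
precision, recall = precision_recall(y_true, y_pred)
```

Because vulnerable functions are typically rare compared with non-vulnerable ones, a lower threshold may be preferable when missing a vulnerability is more costly than a false alarm.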

The options available for the main.py script are listed below:

| Option | Description |
| --- | --- |
| config | Path to the configuration file. |
| seed | Random seed for reproduction of the results. |
| data_dir | The path of the code base for training (obtained by downloading and unzipping the files under the data folder). By default, it is data/. |
| logdir | Path to store training logs (log files for TensorBoard). By default, it is logs/. |
| output_dir | The output path of the trained network model. By default, it is result/models/<model_name.h5>. |
| trained_model | The path of the trained model for testing. By default, the trained models are in result/models/. |
| test | Switch to the test mode. |
| verbose | Show all messages. |

Dataset and Results

Contact:

You are welcome to use and modify our code. Bug reports and suggestions for improvement are appreciated. If you use the code or data in your work, please cite our paper once it is published. To request more data or for other inquiries, please contact: junzhang@swin.edu.au.

Thanks!