Function-level Vulnerability Detection and Dataset
Hi there, welcome!
This is an open-source project for function-level vulnerability detection. We use source code functions as inputs and use Word2vec to generate code embeddings. The output is the probability that the corresponding input function is vulnerable. This project includes 6 mainstream neural network models and can be easily extended with other network models implemented in Keras or TensorFlow.
For this project, we also collect vulnerable functions from 9 open-source software projects (written in the C programming language). See Dataset for more details. We have detailed the data collection and evaluation processes in a paper which is currently under review. When the review process is completed, we will publish all the data.
Requirements
- Environments -- Please refer to required_packages.txt
- Hardware -- A GPU with at least 4GB RAM is recommended.
Instructions & Usage
After unzipping the archive of this repository, you will see the following folders:
- The config folder -- containing the configuration file.
- The data folder -- containing the source code functions (vulnerable and non-vulnerable).
- The graph folder -- containing the sample results and data statistics.
- The src folder -- containing the code for model training and testing.
And there are two Python script files:
- main.py -- for training and testing a specified network model.
- Word_to_vec_embedding.py -- for training a Word2vec embedding model.
Step 1: Train a Word2vec model
To use the provided data samples for model training, we first have to train a Word2vec model by executing the following command:
python Word_to_vec_embedding.py --data_dir <path to code base.> --output_dir <path to the output file.>
This trains a Word2vec model on the code found in <path to code base.>. The purpose of training the Word2vec model is to convert source code tokens into meaningful embeddings for the neural network models to learn from.
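Word2vec is trained on sequences of tokens, so each function has to be split into tokens first. The snippet below is only a minimal sketch of that step and is not the tokenizer used by this project: the regular expression and the name `tokenize_function` are assumptions made for illustration, and comments and string literals are not treated specially.

```python
import re
from typing import List

# Hypothetical token pattern; alternatives are tried from left to right.
TOKEN_PATTERN = re.compile(
    r"[A-Za-z_]\w*"                            # identifiers and keywords
    r"|0[xX][0-9a-fA-F]+"                      # hexadecimal literals
    r"|\d+\.?\d*"                              # decimal integer and float literals
    r"|->|\+\+|--|<<|>>|<=|>=|==|!=|&&|\|\|"   # common two-character operators
    r"|\S"                                     # any other single non-space character
)


def tokenize_function(source: str) -> List[str]:
    """Split the text of one C function into a flat list of tokens."""
    return TOKEN_PATTERN.findall(source)


sample = "int add(int a, int b) { return a + b; }"
tokens = tokenize_function(sample)
print(tokens)
```

Each function then becomes one "sentence" (a list of tokens), which is the input shape that gensim's Word2Vec expects.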
To use data samples from other sources, set the --data_dir option to the directory where the sources are stored. The parameters available for the Word_to_vec_embedding.py script are as follows:
Options | Description |
---|---|
data_dir | Path to the code base (can be obtained by downloading and unzipping the files under the data folder). By default, it is data/. |
output_dir | The output path of the trained Word2vec model (two files). By default, it is result/. |
n_workers | The number of threads for training. |
size | The dimensionality of the word vectors. This is also the Embedding dimension used in the subsequent steps. |
window | The maximum distance between the current and predicted word within a sentence. |
min_count | Ignores all words with total frequency lower than this. |
algorithm | Training algorithm: 1 for skip-gram; otherwise CBOW. |
seed | The seed for the random number generator |
One may check the type and default value of each parameter by passing the --help option to this script. For more detailed configurations for Word2vec training, please refer to: https://radimrehurek.com/gensim/models/word2vec.html.
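To make the option table more concrete, here is a rough sketch of how such options could be declared with `argparse` and translated into keyword arguments for gensim's Word2Vec. It is not the actual code of Word_to_vec_embedding.py: the function names are hypothetical, and every default except `data/` and `result/` is a placeholder (use --help to see the real ones). Note that gensim 4 renamed the `size` argument to `vector_size`.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in the table above (sketch only)."""
    parser = argparse.ArgumentParser(description="Train a Word2vec model (sketch).")
    parser.add_argument("--data_dir", default="data/", help="Path to the code base.")
    parser.add_argument("--output_dir", default="result/", help="Output path of the model.")
    # The numeric defaults below are placeholders, not the script's real defaults.
    parser.add_argument("--n_workers", type=int, default=4, help="Number of threads.")
    parser.add_argument("--size", type=int, default=100, help="Embedding dimension.")
    parser.add_argument("--window", type=int, default=5, help="Context window size.")
    parser.add_argument("--min_count", type=int, default=5, help="Minimum word frequency.")
    parser.add_argument("--algorithm", type=int, default=0, help="1 = skip-gram, else CBOW.")
    parser.add_argument("--seed", type=int, default=1, help="Random seed.")
    return parser


def to_gensim_kwargs(args: argparse.Namespace, gensim_major: int = 4) -> Dict[str, Any]:
    """Map the parsed options to Word2Vec keyword arguments."""
    size_key = "vector_size" if gensim_major >= 4 else "size"
    return {
        size_key: args.size,
        "window": args.window,
        "min_count": args.min_count,
        "workers": args.n_workers,
        "sg": 1 if args.algorithm == 1 else 0,  # sg=1 is skip-gram, sg=0 is CBOW
        "seed": args.seed,
    }


args = build_parser().parse_args(["--size", "64", "--algorithm", "1"])
kwargs = to_gensim_kwargs(args)
print(kwargs)
```

With gensim installed, the resulting dictionary could be passed as `Word2Vec(sentences, **kwargs)`, where `sentences` is a list of token lists.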
Step 2: Train a neural network model
Once the Word2vec model is ready, one can train a neural network model. The parameters related to experiment/model settings are stored in a YAML configuration file. This allows users to conveniently adjust the settings by changing only the configuration file. See documentation and examples for more details.
Once the configuration file is ready, one may run the following command to train a neural network model.
python main.py --config config/config.yaml --data_dir <path_to_your_code>
By default, the training data is read from the data/ folder, the trained models are placed in the result/models/ folder, and the training logs are written to logs/. A user can visualize the training process by pointing Tensorboard at the logs/ folder.
Step 3: Test a trained neural network model
When training is completed, a user can test a network model on the test set by using the following command:
python main.py --config config/config.yaml --test --trained_model D:\Path\of\the\trained_model.h5
Users can use their own test set by setting using_separate_test_set to True in the config.yaml file.
There are also some options available which are listed below:
Options | Description |
---|---|
config | Path to the configuration file. |
seed | Random seed for reproduction of the results. |
data_dir | The path of the code base for training (can be obtained by downloading and unzipping the files under the data folder). By default, it is data/. |
logdir | Path to store training logs (log files for Tensorboard). By default, it is logs/ |
output_dir | The output path of the trained network model. By default, it is result/models/<model_name.h5> |
trained_model | The path of the trained model for test. By default, the trained models are in result/models/ |
test | Switch to the test mode. |
verbose | Show all messages. |
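Path separators differ between Windows and Unix-like systems, so one option is to assemble the training and test commands from Python with `pathlib`. The helper below is only a sketch under that assumption: `build_command` is a hypothetical name, `trained_model.h5` is a placeholder file name, and only the options listed in the table above are used.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_command(config: Path,
                  data_dir: Optional[Path] = None,
                  trained_model: Optional[Path] = None) -> List[str]:
    """Build the argument list for main.py in training or test mode."""
    cmd = [sys.executable, "main.py", "--config", str(config)]
    if trained_model is not None:
        # Test mode: evaluate an already trained model.
        cmd += ["--test", "--trained_model", str(trained_model)]
    elif data_dir is not None:
        # Training mode with an explicit code base directory.
        cmd += ["--data_dir", str(data_dir)]
    return cmd


config_path = Path("config") / "config.yaml"
train_cmd = build_command(config_path, data_dir=Path("data"))
test_cmd = build_command(
    config_path,
    trained_model=Path("result") / "models" / "trained_model.h5",  # placeholder name
)
print(train_cmd[1:])
print(test_cmd[1:])
```

Either list can then be executed from the repository root with `subprocess.run(train_cmd, check=True)`.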
Dataset and Results
- Dataset -- containing vulnerable and non-vulnerable functions labeled/collected from 9 open-source projects and data statistics.
- Training and Evaluation Results -- containing test results for reference.
Contact:
You are welcome to use/modify our code. Any bug reports or improvement suggestions will be appreciated. Please kindly cite our paper (when it is published) if you use the code/data in your work. To acquire more data or for other inquiries, please contact: junzhang@swin.edu.au.
Thanks!