Emulating malware authors for proactive protection using GANs over a distributed image visualization of dynamic file behavior
https://arxiv.org/abs/1807.07525
Cite as: V.S. Bhaskara and D. Bhattacharyya. arXiv preprint arXiv:1807.07525 [stat.ML] (2018).
References to the code
- The trained WGAN-GP model is based on the code published at https://github.com/igul222/improved_wgan_training. We used the improved_wgan_training/gan_64x64.py script with the network architectures defined by the GoodGenerator and GoodDiscriminator functions.
- The 64-bit dHash used per channel is based on the implementation at https://github.com/JohannesBuchner/imagehash. An extension of the hash for color images, which concatenates the dHashes across the channels, is presented in the color_dHash192.py script.
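The channel-wise concatenation described above can be sketched as follows. This is a minimal, dependency-free illustration and not the contents of color_dHash192.py: the function names, the R, G, B concatenation order, and the nearest-neighbour downsampling (used here in place of a library resize) are all assumptions.

```python
def dhash64(gray):
    """64-bit difference hash of a 2-D grayscale grid (a list of rows)."""
    height, width = len(gray), len(gray[0])
    # Downsample to 8 rows x 9 columns. Nearest-neighbour sampling is an
    # assumption made to keep this sketch free of image libraries.
    small = [[gray[r * height // 8][c * width // 9] for c in range(9)]
             for r in range(8)]
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            # One bit per horizontally adjacent pair: 1 if brightness increases.
            bits = (bits << 1) | (right > left)
    return bits


def color_dhash192(rgb):
    """Concatenate per-channel 64-bit dHashes (assumed order: R, G, B)."""
    value = 0
    for channel in range(3):
        plane = [[pixel[channel] for pixel in row] for row in rgb]
        value = (value << 64) | dhash64(plane)
    return value


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 64x64 image: red rises left to right, green is flat, blue falls.
image = [[(x * 4, 128, 255 - x * 4) for x in range(64)] for _ in range(64)]
digest = color_dhash192(image)
print(f"{digest:048x}")  # 48 hex digits = 192 bits
```

Two images can then be compared by the Hamming distance between their 192-bit hashes, with a smaller distance indicating more similar images.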
Dataset
dataset_filedetails.csv: Lists the file SHA256 hashes and the file names of the 12,006 distinct executables used.
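To check whether a local executable appears in this list, its SHA256 digest can be computed and compared against the listed hashes. The helper below is a small sketch using the Python standard library; the function name and the temporary demo file are illustrative, and no assumption is made about the CSV's column names.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; replace the path with a real executable.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```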
API Calls Hooked
HookedApiCallList.txt: Lists all the 1,984 individual API calls that were hooked for determining the call invocation sequences of executables.
Figures (full resolution)
Figure 3a (figure3a_samples_clean_preview.png): Samples of 64x64 image representations corresponding to 64 distinct Clean files, arranged in a grid of 8x8, chosen randomly from the dataset.
Figure 3b (figure3b_samples_malware_preview.png): Samples of 64x64 image representations corresponding to 64 distinct Malicious files, arranged in a grid of 8x8, chosen randomly from the dataset.
Figure 7a (figure7a_samples_malware_gan_train.png): Samples of 64x64 image representations corresponding to 32 distinct Malicious files randomly chosen from the images used for training the WGAN-GP model.
Figure 7b (figure7b_samples_malware_gan_valid.png): Samples of 64x64 image representations corresponding to 32 distinct Malicious files randomly chosen from the images used for validating the WGAN-GP model.
Figure 10b (figure10b_wgan_generated_samples.png): Samples of 64 synthetic 64x64 images generated by the Generator after training the WGAN-GP model for 45,000 generator iterations.
Software Categorization
software_categorization_details/: Contains the 64x64 PNGs of the scaled images used in Table 3 of the paper for demonstrating software categorization using images.
software_categorization_details/table3_filedetails.csv: Lists the details of the files used in Table 3 of the paper, including the file names, SHA256 digests, and their corresponding image hashes (SHA256 and 192-bit color dHash).
software_categorization_details/figure5_filedetails_categories_dhash_cutoff.csv: Lists the details of the 254 files belonging to 21 file categories used for determining an optimal dHash cutoff demonstrated in Figure 5 of the paper.
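One plausible way to pick a dHash cutoff from categorized files is to sweep the Hamming-distance threshold and score how well it separates same-category pairs from different-category pairs. The sketch below illustrates that idea only; it is not necessarily the procedure used for Figure 5, and the function names, the pairwise-accuracy criterion, and the toy 4-bit hashes are all assumptions.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def best_cutoff(samples, max_bits=192):
    """samples: list of (category, hash_int). Returns (cutoff, accuracy)."""
    pairs = [(hamming(h1, h2), c1 == c2)
             for (c1, h1), (c2, h2) in combinations(samples, 2)]
    best = (0, -1.0)
    for cutoff in range(max_bits + 1):
        # A pair is predicted "same category" when its distance <= cutoff.
        correct = sum((dist <= cutoff) == same for dist, same in pairs)
        accuracy = correct / len(pairs)
        if accuracy > best[1]:  # keep the smallest cutoff on ties
            best = (cutoff, accuracy)
    return best


# Toy data with 4-bit hashes standing in for 192-bit color dHashes.
samples = [("editor", 0b0000), ("editor", 0b0001),
           ("browser", 0b1111), ("browser", 0b1110)]
cutoff, accuracy = best_cutoff(samples, max_bits=4)
print(cutoff, accuracy)
```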
Vector Arithmetic and Image Decodings
vector_arithmetic_and_decodings/: Contains the PNGs used to demonstrate the decoding of the images to the API information, and the vector arithmetic in the noise space versus the pixel space.
The decodings of the corresponding images are contained in the vector_arithmetic_and_decodings/image_decodings/ folder.
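The difference between arithmetic in the noise space and in the pixel space can be illustrated with a toy example. In the sketch below, `toy_generator` is a hypothetical stand-in for a trained generator (it is not the WGAN-GP network), and the noise vectors are made-up values; the point is only that, for a nonlinear generator, combining noise vectors and then generating generally differs from combining the generated pixels directly.

```python
import math


def combine(a, b, c):
    """Element-wise a - b + c."""
    return [x - y + z for x, y, z in zip(a, b, c)]


def toy_generator(z):
    """Hypothetical nonlinear map from noise values to pixels in [0, 255]."""
    return [round(255 * (math.tanh(v) + 1) / 2) for v in z]


def clip_pixels(values):
    """Clamp arithmetic results back into the valid pixel range."""
    return [min(255, max(0, round(v))) for v in values]


# Made-up 2-dimensional noise vectors, for illustration only.
z_a, z_b, z_c = [1.0, -1.0], [0.5, 0.5], [-0.5, 1.0]

# Noise space: do the arithmetic on the inputs, then generate.
noise_space = toy_generator(combine(z_a, z_b, z_c))

# Pixel space: generate each image first, then do the arithmetic on pixels.
pixel_space = clip_pixels(
    combine(toy_generator(z_a), toy_generator(z_b), toy_generator(z_c)))

print(noise_space, pixel_space)
```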
Training Information
Training the GAN
The WGAN-GP model was trained on 4 NVIDIA GTX TITAN X GPUs for about a day (~1.7 seconds per generator iteration) using TensorFlow 1.5.0 on an Ubuntu 14.04 system with NVIDIA driver version 389.80, cuDNN 7, and CUDA 9.0.
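As a rough consistency check, and assuming the full run corresponds to the 45,000 generator iterations reported for Figure 10b, the per-iteration time gives a total close to one day:

```latex
45{,}000 \ \text{iterations} \times 1.7 \ \frac{\text{s}}{\text{iteration}}
  = 76{,}500 \ \text{s} \approx 21.25 \ \text{h}
```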
Training the XGBoost Model
The XGBoost model of Section 4 of the paper was trained with the XGBoost 0.6 release using the following booster hyperparameters:
{'eval_metric': 'mlogloss', 'num_estimators': 200, 'alpha': 0, 'num_class': 2, 'booster': 'gbtree', 'colsample_bytree': 0.7, 'min_child_weight': 1e-06, 'subsample': 0.5, 'eta': 0.1, 'objective': 'multi:softprob', 'max_depth': 10, 'gamma': 0, 'lambda': 0}
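A minimal sketch of how these hyperparameters might be passed to XGBoost's native training API is shown below. It assumes that `num_estimators` denotes the number of boosting rounds, so it is split out and supplied as `num_boost_round`; the helper names and the `features`/`labels` arguments are illustrative, and this is not the authors' training script.

```python
PAPER_PARAMS = {
    'eval_metric': 'mlogloss', 'num_estimators': 200, 'alpha': 0,
    'num_class': 2, 'booster': 'gbtree', 'colsample_bytree': 0.7,
    'min_child_weight': 1e-06, 'subsample': 0.5, 'eta': 0.1,
    'objective': 'multi:softprob', 'max_depth': 10, 'gamma': 0, 'lambda': 0,
}


def split_rounds(params):
    """Separate the round count (assumed meaning of 'num_estimators')."""
    booster_params = dict(params)  # copy so the original stays untouched
    rounds = booster_params.pop('num_estimators')
    return booster_params, rounds


def train_classifier(features, labels):
    """Train a booster; features is a 2-D array, labels are 0/1 classes."""
    import xgboost as xgb  # third-party package, imported only when training

    booster_params, rounds = split_rounds(PAPER_PARAMS)
    dtrain = xgb.DMatrix(features, label=labels)
    return xgb.train(booster_params, dtrain, num_boost_round=rounds)


booster_params, rounds = split_rounds(PAPER_PARAMS)
print(rounds, booster_params['objective'])
```

With the `multi:softprob` objective, the trained booster predicts a probability for each of the `num_class` classes rather than a single label.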