Emulating malware authors for proactive protection using GANs over a distributed image visualization of dynamic file behavior

https://arxiv.org/abs/1807.07525

Cite as: V.S. Bhaskara and D. Bhattacharyya. arXiv preprint arXiv:1807.07525 [stat.ML] (2018).

References to the code

Dataset

dataset_filedetails.csv: Lists the file SHA256 hashes and the file names of the 12,006 distinct executables used.
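A minimal sketch of how a local executable could be checked against the hashes in this CSV. The column header `sha256` and the helper names are assumptions made for illustration; check the actual header row of `dataset_filedetails.csv` before use.

```python
import csv
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_known_hashes(csv_path, hash_column="sha256"):
    """Collect lowercase digests from the CSV.

    `hash_column` is a hypothetical header name, not confirmed by the dataset.
    """
    with open(csv_path, newline="") as handle:
        return {row[hash_column].strip().lower() for row in csv.DictReader(handle)}


def is_in_dataset(path, known_hashes):
    """True if the file's SHA256 digest appears in the loaded hash set."""
    return sha256_of_file(path) in known_hashes
```

Typical use would be `known = load_known_hashes("dataset_filedetails.csv")` followed by `is_in_dataset(some_path, known)`.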

API Calls Hooked

HookedApiCallList.txt: Lists all the 1,984 individual API calls that were hooked for determining the call invocation sequences of executables.
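As a rough sketch, the list can be loaded into a name-to-index lookup for working with call sequences. This assumes the file holds one API name per line, which is not stated above; the function names are illustrative.

```python
def load_hooked_api_calls(path):
    """Read API names, assuming one name per line, and skip blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def index_api_calls(names):
    """Map each API name to its position in the list, rejecting duplicates."""
    index = {}
    for position, name in enumerate(names):
        if name in index:
            raise ValueError(f"duplicate API name: {name}")
        index[name] = position
    return index
```

If the one-name-per-line assumption holds, `len(load_hooked_api_calls("HookedApiCallList.txt"))` should equal 1,984.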

Figures (full resolution)

Figure 3a: figure3a_samples_clean_preview.png: Samples of 64x64 image representations corresponding to 64 distinct Clean files, arranged in a grid of 8x8, chosen randomly from the dataset.

Figure 3b: figure3b_samples_malware_preview.png: Samples of 64x64 image representations corresponding to 64 distinct Malicious files, arranged in a grid of 8x8, chosen randomly from the dataset.

Figure 7a: figure7a_samples_malware_gan_train.png: Samples of 64x64 image representations corresponding to 32 distinct Malicious files randomly chosen from the images used for Training the WGAN-GP model.

Figure 7b: figure7b_samples_malware_gan_valid.png: Samples of 64x64 image representations corresponding to 32 distinct Malicious files randomly chosen from the images used for Validating the WGAN-GP model.

Figure 10b: figure10b_wgan_generated_samples.png: Samples of 64 synthetic 64x64 images produced by the Generator after training the WGAN-GP model for 45,000 generator iterations.

Software Categorization

software_categorization_details/: Contains the 64x64 PNGs of the scaled images used in Table 3 of the paper for demonstrating software categorization using images.

software_categorization_details/table3_filedetails.csv: Lists the details of the files used in Table 3 of the paper, including the file names, SHA256 digests, and their corresponding image hashes (SHA256 and 192-bit color dHash).

software_categorization_details/figure5_filedetails_categories_dhash_cutoff.csv: Lists the details of the 254 files belonging to 21 file categories used for determining an optimal dHash cutoff demonstrated in Figure 5 of the paper.
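To illustrate how a dHash cutoff can be applied, the sketch below compares two 192-bit dHashes by Hamming distance. It assumes the hashes are stored as 48-character hex strings and that two images match when the distance is at or below the cutoff; both the storage format and the comparison rule are assumptions here, and the cutoff value itself should be taken from the paper.

```python
def hamming_distance(hash_a, hash_b):
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def within_cutoff(hash_a, hash_b, cutoff):
    """True if the two hashes differ in at most `cutoff` bits (assumed rule)."""
    return hamming_distance(hash_a, hash_b) <= cutoff
```

A 192-bit hash gives distances between 0 and 192, so a cutoff is an integer in that range.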

Vector Arithmetic and Image Decodings

vector_arithmetic_and_decodings/: Contains the PNGs used to demonstrate the decoding of images into API information, and vector arithmetic in noise space versus pixel space. The decodings of these images are in the vector_arithmetic_and_decodings/image_decodings/ folder.

Training Information

Training the GAN

The WGAN-GP model was trained on 4 NVIDIA GTX TITAN X GPUs for about a day (~1.7 seconds per generator iteration) using TensorFlow 1.5.0 on an Ubuntu 14.04 system with NVIDIA driver version 389.80, cuDNN 7, and CUDA 9.0.
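As a quick consistency check on "about a day", the per-iteration time can be multiplied by the 45,000 generator iterations quoted for Figure 10b. Treating 45,000 as the length of the full training run is an assumption.

```python
SECONDS_PER_GENERATOR_ITERATION = 1.7  # from the training note above
GENERATOR_ITERATIONS = 45_000          # assumed to be the full run (Figure 10b)


def estimated_training_hours(iterations, seconds_per_iteration):
    """Convert an iteration count and per-iteration time into hours."""
    return iterations * seconds_per_iteration / 3600.0


hours = estimated_training_hours(GENERATOR_ITERATIONS, SECONDS_PER_GENERATOR_ITERATION)
print(f"{hours:.2f} hours")  # prints "21.25 hours"
```

Under that assumption the estimate comes to a little over 21 hours, in line with the stated training time.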

Training the XGBoost Model

The XGBoost model of Section 4 of the paper was trained with the XGBoost 0.6 release using the following booster hyperparameters:

{'eval_metric': 'mlogloss', 'num_estimators': 200, 'alpha': 0, 'num_class': 2, 'booster': 'gbtree', 'colsample_bytree': 0.7, 'min_child_weight': 1e-06, 'subsample': 0.5, 'eta': 0.1, 'objective': 'multi:softprob', 'max_depth': 10, 'gamma': 0, 'lambda': 0}
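A hedged sketch of passing these settings to XGBoost's native Python API. The native `xgb.train` call takes the number of boosting rounds through its `num_boost_round` argument, so this sketch assumes `num_estimators` was meant as the round count and passes it separately; that reading, along with the helper names and the `features`/`labels` inputs, is an assumption and not the authors' training script.

```python
PARAMS = {
    'eval_metric': 'mlogloss', 'num_estimators': 200, 'alpha': 0,
    'num_class': 2, 'booster': 'gbtree', 'colsample_bytree': 0.7,
    'min_child_weight': 1e-06, 'subsample': 0.5, 'eta': 0.1,
    'objective': 'multi:softprob', 'max_depth': 10, 'gamma': 0, 'lambda': 0,
}


def split_params(params, rounds_key='num_estimators'):
    """Separate the assumed round count from the remaining booster parameters."""
    booster_params = {k: v for k, v in params.items() if k != rounds_key}
    return booster_params, int(params[rounds_key])


def train_classifier(features, labels, params=PARAMS):
    """Train a booster on a 2-D feature array and integer class labels (0 or 1)."""
    import xgboost as xgb  # imported lazily so split_params works without xgboost

    booster_params, num_rounds = split_params(params)
    dtrain = xgb.DMatrix(features, label=labels)
    return xgb.train(booster_params, dtrain, num_boost_round=num_rounds)
```

With `multi:softprob` and `num_class` set to 2, calling `predict` on the returned booster yields one probability per class for each row.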