Getting an API Key
AndroTotal has simplified the process for getting an API key. Log in or create an account at http://andrototal.org/, then open your profile settings. The API tab there contains your key.
This repository contains a set of scripts to automate the process of gathering data from malware samples, training a machine learning model on that data, and plotting its classification accuracy.
1. Make a copy of config-template.ini called config.ini and edit it.
2. Ensure that the "tools" subdirectory has been initialized:
       $ git submodule update --init tools
3. Either use get_samples.py to download samples or copy them into "all_apks" from another source. If you're using get_samples.py, you can monitor it in another shell by running:
       $ watch "ls -l *.apk | wc -l"
4. sort_malicious.py uses andrototal.org to sort the samples into "malicious_apk" and "benign_apk" folders. You can monitor it in another shell by running:
       $ watch "ls -l benign_apk/*.apk | wc -l && ls -l malicious_apk/*.apk | wc -l"
5. extract_apks_parallel.sh unpacks the .apk files into folders and processes some of the data therein. You can monitor it in another shell by running:
       $ watch "wc -l benign_apk/valid_apks.txt; wc -l malicious_apk/valid_apks.txt"
6. Run one of the following scripts to generate feature vectors:
   - parse_xml.py for permissions. "app_permission_vectors.json" is generated.
   - parse_maline_output.py for syscalls. "app_syscall_vectors.json" is generated. You will have to run maline first for this to work.
   - parse_disassembled.py for API calls. "app_method_vectors.json" is generated.
   - parse_ssdeep.py for fuzzy hashes. "app_hash_vectors.json" is generated. You will have to run ssdeep first for this to work.
   - combine_features.py for a combination of the top weighted features. "app_feature_vectors.json" is generated. This only works if you've previously trained a network on the specified features, and the feature weights files are named appropriately.
7. Run
       $ run_trials.sh app_feature_vectors.json
   (or whichever json you want), which runs the tensorflow_learn.py script (where the ML happens) a number of times and puts the results into a folder. It also runs plot_data.py and match_features.py to create a plot and a list of top weighted features, respectively.
8. Change the parameters or input data and repeat step 6. It should be non-destructive so you can compare the results of different runs.
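
As an alternative to running the separate watch commands for the download, sorting, and extraction steps, progress can be checked from a single Python script. The following is only a minimal sketch, not part of the repository: it assumes it is run from the repository root, that downloaded samples live in "all_apks", and that the folder and file names match those mentioned in the steps above. The function names (count_apks, count_lines, pipeline_status) are made up for illustration.

```python
from pathlib import Path


def count_apks(folder: Path) -> int:
    """Count the .apk files directly inside folder (0 if the folder is missing)."""
    if not folder.is_dir():
        return 0
    return sum(1 for p in folder.glob("*.apk") if p.is_file())


def count_lines(path: Path) -> int:
    """Count the lines in a file (0 if the file is missing)."""
    if not path.is_file():
        return 0
    with path.open("rb") as fh:
        return sum(1 for _ in fh)


def pipeline_status(root: Path) -> dict:
    """Summarize how far the data-gathering steps have progressed under root.

    The directory layout is an assumption taken from the README steps.
    """
    return {
        "config_present": (root / "config.ini").is_file(),
        "all_apks": count_apks(root / "all_apks"),
        "benign_apk": count_apks(root / "benign_apk"),
        "malicious_apk": count_apks(root / "malicious_apk"),
        "benign_valid": count_lines(root / "benign_apk" / "valid_apks.txt"),
        "malicious_valid": count_lines(root / "malicious_apk" / "valid_apks.txt"),
    }


if __name__ == "__main__":
    for name, value in pipeline_status(Path(".")).items():
        print(f"{name}: {value}")
```

Missing folders and files are reported as 0 rather than raising an error, so the script can be run at any point in the process, including before any samples have been downloaded.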
Note: If you want to use an SVM instead of a neural network, use sklearn_svm.py in place of tensorflow_learn.py. You can also use sklearn_tree.py to use a decision tree.