<div align="center"> <img src="others/images/logo/export.png" height="100px" alt="Logo"> </div> <div align="center"> <b><i>Platform for Automatic Analysis of Malicious Applications<br>Using Artificial Intelligence Algorithms</i></b> </div> <br>

Description 🖼️
dike (pronounced /ˈdaɪkiː/) is an open-source platform that combines malware analysis with artificial intelligence, more precisely with its machine learning subfield.
Objectives 🎯
At the moment, dike can analyze only files in the Portable Executable (PE) and Object Linking and Embedding (OLE) formats. Within this limitation, it has three main objectives:
- Regression of malice
- Classification in malware families
- Similarity analysis.
Features 🧰
The software enables the creation of analysis pipelines (called models in the context of the platform), which deal with the specific steps of malware analysis and data engineering:
- Dataset management, where it uses three main sources of labeled PE and OLE files:
  - The open-source dataset DikeDataset
  - Accurate results of analysis made by the analysts of the organization in which the platform is set up
  - Results of automatic VirusTotal scans
- Feature extraction, in which extractors are used to obtain relevant information such as:
  - Strings
  - Characteristics of the file format
  - Opcodes
  - Windows API calls
  - Macros
- Feature preprocessing, where preprocessors are used to transform the features into a format better suited to machine learning algorithms:
  - Transformations
  - Binarization
  - Discretization
  - Counting (and in a special approach, for categories of opcodes and API calls)
  - Vectorization
  - NGrams
  - Scaling
  - Dimensionality reduction
  - Transformations
- Training of machine learning models with included cross-validation and evaluation (regression-wise and classification-wise).
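
The preprocessing steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not dike's implementation: the function names, the sample opcode trace, and the vocabulary are all hypothetical and exist only to show what n-gram extraction, counting, binarization, discretization, and scaling do to a feature.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """NGrams: return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vector(items: Sequence, vocabulary: Sequence) -> List[int]:
    """Counting: occurrences of each vocabulary entry, in vocabulary order."""
    counts = Counter(items)
    return [counts[entry] for entry in vocabulary]


def binarize(values: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Binarization: 1 if the value is above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def discretize(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Discretization: bin index of each value (number of edges <= value)."""
    return [sum(1 for e in edges if v >= e) for v in values]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scaling: map values linearly to [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical opcode trace of one sample (not real extractor output).
opcodes = ["push", "mov", "call", "mov", "call", "ret"]
bigrams = ngrams(opcodes, 2)

# Hypothetical vocabulary of opcode bigrams chosen for this illustration.
vocabulary = [("mov", "call"), ("push", "mov"), ("call", "ret")]
counts = count_vector(bigrams, vocabulary)

print(counts)                       # bigram counts in vocabulary order
print(binarize(counts, 1))          # which bigrams occur more than once
print(discretize(counts, [1, 2]))   # bin index per count
print(min_max_scale(counts))        # counts rescaled to [0, 1]
```

In a real pipeline these steps would more likely be delegated to a library such as scikit-learn, which the platform lists among its resources; the plain functions are used here only to keep the sketch self-contained.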
Important Observation ⚠️
dike is part of my Bachelor thesis, which aims to demonstrate that artificial intelligence techniques can improve malware analysis. The document and the presentation (in Romanian 🇷🇴 only) can be found in a separate repository.
At the moment, the thesis is the only place where the following information can be found:
- Software requirements
- Architecture (more detailed than the description above)
- Testing
- Evaluation
- Further development.
Setup 🛠️
- Download the script `manage.sh` from the folder `infrastructure`.
- Obtain a VirusTotal API key.
- Create and host (on a server which the platform can access) a TGZ archive containing two folders, `ghidra` (with a Ghidra project) and `qiling` (with the dynamically linked libraries needed by Qiling).
- Run the script and follow the instructions.
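
The archive required by the setup could be produced with a short script such as the one below. This is only a sketch under assumptions: the function name `build_archive` is hypothetical, and placing the `ghidra` and `qiling` folders at the root of the archive is a guess, since the layout expected by `manage.sh` is not documented here.

```python
import tarfile
from pathlib import Path
from typing import List


def build_archive(source_dir: str, archive_path: str) -> List[str]:
    """Pack the ghidra and qiling folders from source_dir into a TGZ archive.

    Returns the names of the archive members so the result can be inspected.
    """
    source = Path(source_dir)
    required = ("ghidra", "qiling")

    # Fail early if one of the two expected folders is missing.
    for name in required:
        if not (source / name).is_dir():
            raise FileNotFoundError(f"missing folder: {source / name}")

    with tarfile.open(archive_path, "w:gz") as archive:
        for name in required:
            # arcname puts each folder at the archive root (an assumption).
            archive.add(str(source / name), arcname=name)

    # Reopen the archive for reading and list what was written.
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

For example, `build_archive("resources", "resources.tgz")` packs `resources/ghidra` and `resources/qiling`; both paths are placeholders. The resulting file still has to be hosted on a server that the platform can reach.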
For Private Repositories 🙊
If the repository hosting the platform is private, two steps need to be performed beforehand:
- Generate an asymmetric key pair via `ssh-keygen -t ed25519 -C "EMAIL_ADDRESS"`, where `EMAIL_ADDRESS` needs to be replaced with your email address.
- Add the public key to the deploy keys section of the GitHub repository.
Typical Usage 🔎
For Clients 👨💼
<details> <summary>Malice Prediction</summary> <kbd> <img src="others/screenshots/malice.png" alt="Malice Prediction"/> </kbd> </details> <details> <summary>Similarity Analysis</summary> <kbd> <img src="others/screenshots/similarity.png" alt="Similarity Analysis"/> </kbd> </details> <details> <summary>Feature-wise Comparison of Samples</summary> <kbd> <img src="others/screenshots/comparison.png" alt="Feature-wise Comparison of Samples"/> </kbd> </details> <details> <summary>Model Evaluation</summary> <kbd> <img src="others/screenshots/evaluation.png" alt="Model Evaluation"/> </kbd> </details> <details> <summary>Settings</summary> <kbd> <img src="others/screenshots/settings.png" alt="Settings"/> </kbd> </details>

For Administrators 👩💻
Administrators can use a powerful command line interface by running the `dike` command on a leader server. Some of the available commands are demonstrated in the recording below.
Administrators also edit YAML files manually, following a schema that depends on the context in which each file is used. Some existing files (one per type, for example purposes only) contain comments that document these schemas.
For Other Systems 🖥️
Other systems of the organization can use the scan services of the platform by sending HTTP or HTTPS requests (depending on the configuration) to the following API endpoints.
Route | Action |
---|---|
/get_malware_families | Retrieves the used malware families. |
/get_evaluation/MODEL_NAME | Retrieves the evaluation of a model. |
/get_configuration/MODEL_NAME | Retrieves the configuration. |
/get_features/MODEL_NAME/FILE_HASH | Retrieves the features of a file from the platform's dataset. |
/create_ticket/MODEL_NAME | Creates a prediction ticket. |
/get_ticket/TICKET_NAME | Retrieves the content of a prediction ticket. |
/publish/MODEL_NAME | Publishes for a specific model the results of a scan. |
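
A client in another system might start from a small helper that builds the URLs for the routes in the table. The sketch below is illustrative only: the `ROUTES` keys, the function name `endpoint_url`, and the host `dike.example.org` are hypothetical. The table does not specify the HTTP methods, request bodies, authentication, or response formats, so the sketch deliberately stops at URL construction.

```python
from urllib.parse import quote

# Route templates copied from the table above; the dictionary keys are
# hypothetical short names introduced for this sketch.
ROUTES = {
    "malware_families": "/get_malware_families",
    "evaluation": "/get_evaluation/{model_name}",
    "configuration": "/get_configuration/{model_name}",
    "features": "/get_features/{model_name}/{file_hash}",
    "create_ticket": "/create_ticket/{model_name}",
    "ticket": "/get_ticket/{ticket_name}",
    "publish": "/publish/{model_name}",
}


def endpoint_url(base_url: str, route: str, **params: str) -> str:
    """Build the full URL for a route, percent-encoding each path parameter."""
    template = ROUTES[route]
    # safe="" also encodes "/" so a parameter cannot add path segments.
    encoded = {key: quote(value, safe="") for key, value in params.items()}
    return base_url.rstrip("/") + template.format(**encoded)


# Hypothetical host, model name and file hash, used only for illustration.
print(endpoint_url("https://dike.example.org", "malware_families"))
print(endpoint_url("https://dike.example.org", "features",
                   model_name="demo", file_hash="abc123"))
```

The scheme in `base_url` (HTTP or HTTPS) has to match the configuration of the platform. A missing path parameter raises `KeyError`, which surfaces mistakes before any request is sent.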
Resources 🥣
The most important resources used are listed in the table below.
Name | Description | Link |
---|---|---|
Ghidra | Software reverse engineering framework | repository |
VirusTotal API | Scanning API that aggregates multiple antivirus engines | website |
Qiling | Python 3 emulation framework | repository |
Pandas | Python 3 data analysis and manipulation library | repository |
scikit-learn | Python 3 machine learning library | repository |
Python 3 | General-purpose programming language | website |
Docker | Software product for OS-level virtualization | website |
Docker Compose | Tool for running multi-container applications on Docker | repository |
GitHub | Git repository hosting service | website |
YAML | Data-serialization language | website |