Awesome
Graph Similar Maldoc Images
A script that extracts embedded images from Office Open XML (OOXML) documents and generates image hash similarity graphs that cluster visually similar images together. The script computes the Average Hash of each extracted image, then graphs the images if they meet the similarity threshold. The script can be used as a technique for visually identifying malware campaigns involving documents. To use the script, supply a directory containing OOXML files. If LibreOffice is in your PATH you can optionally convert non-OOXML Word, Excel, PowerPoint and Rich Text File documents to OOXML. The script outputs DOT files that can be exported as images using Graphviz. If Graphviz is in your PATH you can also export to an SVG (preferred) or PNG image.
Application
You can find regular posts with results of using this script at https://github.com/jstrosch/malware-samples
Output
Example image hash similarity graph (cropped). Here each node is a unique image that is connected by edges to other images that met the similarity threshold:
<img src="https://user-images.githubusercontent.com/1920756/103389929-6651e800-4ad7-11eb-9c67-cc24ca0642ad.png" width="700">Example CSV output of the script in detect mode, which lists images that match the similarity threshold with the signatures in the blacklist file,image_hash_signatures.txt:
<img src="https://raw.githubusercontent.com/cryptogramfan/Malware-Analysis-Scripts/master/graph_similar_document_images/images/graph_similar_document_images_screenshot_2.png" width="700">Abuse.ch Integration for Malware Signatures
The script also queries the Abuse.ch API to retrieve the malware signature of each sample, if available. Currently, this information is added as a label to the graph (although hard to see) as well as textual output upon script completion.
<img src="https://user-images.githubusercontent.com/1920756/103390472-39530480-4ada-11eb-9c42-5165a3b30980.png" width="700">For the look-up to work, the input files must be named with their MD5 hash (no extension). If you would like to use a different hashing algorithm, ensure that you update the parameters for the Abuse.ch API.
Example usage
Convert documents to OOXML, extract images from the documents, identify images that are similar to the blacklist and then graph images that meet the similarity threshold:
$ graph_similar_document_images.py -f ~/Samples -d image_hash_signatures.txt -c -g -t 80 -o svg
Help
usage: graph_similar_document_images.py [-h] -f INPUT_DIR
[-t MIN_SIMILARITY_THRESHOLD]
[-d SIG_FILE] [-g] [-c] [-o {svg,png}]
Usage: graph_similar_document_images.py -f <directory_containing_documents> -c -d <image_hash_signatures.txt>
-g -t <min_similarity_threshold> -o <svg|png>
optional arguments:
-h, --help show this help message and exit
-f INPUT_DIR, --files INPUT_DIR
Directory to process
-t MIN_SIMILARITY_THRESHOLD, --threshold MIN_SIMILARITY_THRESHOLD
Minimum percentage similarity between images to graph
(0 to 100)
-d SIG_FILE, --detect SIG_FILE
Detect mode identifies images that are similar to a
blacklist of known-bad images
-g, --graph Graph mode creates a graph of images that meet the
similarity threshold
-c, --convert Try converting documents to OOXML using LibreOffice
-o {svg,png}, --output {svg,png}
Output image format
Supported platforms
Tested on Ubuntu 18.04 with Python 3.
Installation
First install Graphviz and LibreOffice:
$ sudo add-apt-repository ppa:libreoffice/ppa
$ sudo apt update
$ sudo apt install graphviz libreoffice
Afterwards, install the required Python libraries:
$ python3 -m pip install -r requirements.txt
To view SVG files produced by the script you can use a viewer such as Inkscape. Outputting to PNG isn't recommended because the resulting files can be large.
License
Released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.