

FFRI Dataset scripts

This script allows you to create datasets in the same format as the FFRI dataset.


We recommend using Docker to create datasets. For more information, refer to the Using Docker section.

Alternatively, you can run this script natively on a tested platform after installing its dependencies. For detailed instructions, see the Run This Script Natively section.

Using Docker

Make A CSV File

This script requires a CSV file that contains file information such as labels, dates, and file paths. For example:


Please note that the file paths in the CSV file should be specified as relative paths from the container's working directory.
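
Because relative paths are resolved against the container's working directory, a quick check before a long run can catch rows that would not resolve. The following is a minimal sketch and not part of this repository: the function name find_bad_paths and its path_column argument are hypothetical, and it assumes the CSV has a header row. Pass whatever name your CSV uses for the file-path column.

```python
import csv
from pathlib import Path, PurePosixPath


def find_bad_paths(csv_path, path_column, workdir="."):
    """Return (line_number, path, reason) tuples for rows that would not resolve.

    Assumption: the CSV has a header row, so the first data row is line 2.
    """
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for line_number, row in enumerate(csv.DictReader(handle), start=2):
            raw = row[path_column]
            if PurePosixPath(raw).is_absolute():
                # Inside the container, paths are expected to be relative.
                problems.append((line_number, raw, "absolute path"))
            elif not (Path(workdir) / raw).is_file():
                problems.append((line_number, raw, "file not found"))
    return problems
```

Run it from the host directory that mirrors the container's working directory. An empty list means every listed path is relative and points at an existing file.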

Make Datasets

You can create datasets using the following commands:

docker build --target production --tag ffridataset-scripts .
docker run -v <path/to/here>/testbin:/work/testbin ffridataset-scripts test_main.py
# Note: The data directory should contain a CSV file and the executable files you want to process.
docker run -v <path/to/here>/data:/work/data -v <path/to/here>/out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver <version_string>

Please ensure the following:

To process non-PE files, include the --not-pe-only flag:

docker run -v <path/to/here>/data:/work/data -v <path/to/here>/out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver <version_string> --not-pe-only
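
If you launch the container from another script, the command line above can be assembled programmatically. This is a minimal sketch under stated assumptions: the helper name docker_run_args is hypothetical, the image tag and container paths are copied from the commands above, and the host directories are resolved to absolute paths so the bind mounts do not depend on the current directory.

```python
import shlex
from pathlib import Path


def docker_run_args(here, version, image="ffridataset-scripts", not_pe_only=False):
    """Build the `docker run` argument list shown above (hypothetical helper)."""
    here = Path(here).resolve()  # absolute host path for the bind mounts
    args = [
        "docker", "run",
        "-v", f"{here / 'data'}:/work/data",
        "-v", f"{here / 'out_dir'}:/work/out_dir",
        image, "main.py",
        "--csv", "./data/target.csv",
        "--out", "./out_dir",
        "--log", "./dataset.log",
        "--ver", version,
    ]
    if not_pe_only:
        # Also process non-PE files.
        args.append("--not-pe-only")
    return args


def as_shell_command(args):
    """Render the argument list as a single, safely quoted shell command."""
    return shlex.join(args)
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems when the host path contains spaces.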

Run This Script Natively

Prepare To Use

Attention: We recommend running the following commands in the working directory (the ffridataset-scripts directory).

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

sudo apt update
sudo apt install -y --no-install-recommends wget git gcc g++ make autoconf libfuzzy-dev unar cmake mlocate libssl-dev libglib2.0-0 curl libboost-regex-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev build-essential libpcre2-dev libdouble-conversion-dev
sudo apt install -y --no-install-recommends libqt5core5a libqt5svg5 libqt5gui5 libqt5widgets5 libqt5opengl5 libqt5dbus5 libqt5scripttools5 libqt5script5 libqt5network5 libqt5sql5
sudo apt install -y --no-install-recommends libffi-dev libncurses5-dev zlib1g zlib1g-dev libreadline-dev libbz2-dev libsqlite3-dev liblzma-dev
sudo apt install -y --no-install-recommends software-properties-common gpg-agent gpg clang
wget https://github.com/horsicq/DIE-engine/releases/download/3.09/die_3.09_Ubuntu_22.04_amd64.deb
sudo apt --fix-broken install ./die_3.09_Ubuntu_22.04_amd64.deb
rm die_3.09_Ubuntu_22.04_amd64.deb

wget mark0.net/download/trid_linux_64.zip
unar trid_linux_64.zip
cp trid_linux_64/trid ./
chmod u+x trid
cp triddefs_dir/triddefs-dataset2024.trd triddefs.trd

cd workspace

git clone https://github.com/JPCERTCC/impfuzzy.git
cd impfuzzy
git checkout b30548d005c9d980b3e3630648b39830597293fc
cd ../

git clone https://github.com/JusticeRage/Manalyze.git
cd Manalyze
git checkout b6800ffcf2f7f4e82fe1f94d0eb2736e75e175ec
cmake .
cd ../

git clone https://github.com/lief-project/LIEF.git
cd LIEF
git checkout 573c885de5a2bb217d4d0255b54f9b53d9a4d7c9
git apply ../../patches/lief.patch
cd ../

git clone https://github.com/trendmicro/tlsh.git
cd tlsh
git checkout 96536e3f5b9b322b44ce88d36126121685e45a77
cd ../

git clone https://github.com/erocarrera/pefile.git
cd pefile
git checkout ceab92e003b3436d2e52b74e9c903e812a4aeae1
cd ../../

wget https://github.com/ninja-build/ninja/releases/download/v1.12.1/ninja-linux.zip
unar ninja-linux.zip
sudo mv ninja /usr/bin/

poetry install --no-root

If something goes wrong, refer to the Dockerfile.
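
One thing worth checking when a build misbehaves is whether each cloned repository is actually at its pinned commit. The following is a minimal sketch, not part of this repository: the names PINNED, git_head and mismatched_pins are hypothetical, the hashes are copied from the steps above, and the directory names assume the default names that git clone creates under workspace.

```python
import subprocess
from pathlib import Path

# Repository directory -> commit pinned in the setup steps above.
PINNED = {
    "impfuzzy": "b30548d005c9d980b3e3630648b39830597293fc",
    "Manalyze": "b6800ffcf2f7f4e82fe1f94d0eb2736e75e175ec",
    "LIEF": "573c885de5a2bb217d4d0255b54f9b53d9a4d7c9",
    "tlsh": "96536e3f5b9b322b44ce88d36126121685e45a77",
    "pefile": "ceab92e003b3436d2e52b74e9c903e812a4aeae1",
}


def git_head(repo_dir):
    """Return the commit hash currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def mismatched_pins(workspace="workspace", head_of=git_head):
    """Return {repo: actual_head} for every repo that is not at its pinned commit."""
    mismatches = {}
    for name, expected in PINNED.items():
        actual = head_of(Path(workspace) / name)
        if actual != expected:
            mismatches[name] = actual
    return mismatches
```

Calling mismatched_pins() from the working directory returns an empty dictionary when all five checkouts match. The head_of parameter exists only so the lookup can be replaced in tests.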

Run Tests

Attention: Do not store a file named test.exe in the working directory. The test script copies testbin/test.exe into that directory and then removes it.

poetry run python test_main.py

Make Datasets

Before running this script, create the CSV file described in the Make A CSV File section and pass its path as an argument. Unlike the Docker workflow, file paths in the CSV may be absolute.

Attention: Do not store malware or cleanware in the working directory. This script copies each file into that directory and then removes it.

poetry run python main.py --csv <path/to/csv> --out <path/to/output_dataset_dir> --log <path/to/log_file> --ver <version_string>
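
Since the script copies samples into the working directory and deletes them afterwards, it can help to confirm beforehand that no input file already lives there. This is a minimal sketch with a hypothetical helper name, samples_in_workdir. It reads "in the working directory" as "directly in it, not in a subdirectory", which is our interpretation rather than something the script documents.

```python
from pathlib import Path


def samples_in_workdir(workdir, sample_paths):
    """Return the sample paths whose parent directory is the working directory."""
    work = Path(workdir).resolve()
    flagged = []
    for sample in sample_paths:
        # resolve() normalizes relative paths and symlinks before comparing.
        if Path(sample).resolve().parent == work:
            flagged.append(str(sample))
    return flagged
```

Feed it the file paths from your CSV and move any flagged file elsewhere before starting the run.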

Notes About Hashes

Notes About TrID Definition File



Profiling Measurement

First, create two folders:

mkdir out_dir
mkdir measurement

Next, build a Docker image by specifying the measurement target:

docker build --target measurement --tag ffridataset-scripts .

Then, run the following command to generate the executables and a CSV file:

docker run -v <path/to/here>/testbin:/work/testbin -v <path/to/here>/measurement:/work/measurement ffridataset-scripts poetry run python create_measurement_env.py

Now you're ready to do profiling. To generate a cProfile result file, run:

docker run -v <path/to/here>/measurement:/work/data -v <path/to/here>/out_dir:/work/out_dir ffridataset-scripts poetry run python -m cProfile -o ./out_dir/profiling.stats main.py --csv ./data/test.csv --out ./out_dir --log ./test.log --ver v2023

Then, start SnakeViz in the container to serve the result file:

docker run -v <path/to/here>/out_dir:/work/out_dir --rm -p 8080:8080 ffridataset-scripts poetry run snakeviz /work/out_dir/profiling.stats -s -p 8080 -H

You can now view the profiling results in your browser on port 8080 of the Docker host.
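
When a browser is not available, the same profiling.stats file can be read with Python's standard pstats module. The following is a minimal sketch: the helper name print_top is hypothetical, and the path in the usage note assumes the out_dir folder created earlier.

```python
import pstats
import sys
from pstats import SortKey


def print_top(stats_path, limit=10, stream=None):
    """Print the `limit` entries with the highest cumulative time."""
    stats = pstats.Stats(str(stats_path), stream=stream or sys.stdout)
    # strip_dirs() shortens file names; CUMULATIVE includes time spent in callees.
    stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(limit)
```

For example, print_top("out_dir/profiling.stats", 20) lists the twenty call sites where main.py spent the most cumulative time.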


