Neural Nets Can Learn Function Type Signatures From Binaries

Authors

EKLAVYA was designed by Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang.

Dataset

The dataset available from this page is a collection of function type signatures, which includes function binaries, the number of arguments, and the argument types. It is a good starting point for anyone who wants to try learning techniques or heuristic approaches in binary analysis without spending much effort on collecting and preprocessing data.

The dataset consists of three parts, which are described below:

Function Representation

A function is represented as a dictionary having the following fields:

Example:

FuncDict = {
	"num_args": 3,
	"args_type": ["int", "char", "struct structure1*"], 
	"ret_type": "int", 
	"inst_strings": ["mov eax, 1", "nop", "push 3"],
	"inst_bytes": [[0x01, 0x02], [0xff], [0x20, 0x30, 0x40]],
	"boundaries": (0x80010, 0x800f0)
}
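
A function dictionary like the one above can be sanity-checked before it is fed to a model. The following is a minimal sketch, not part of the released code: the helper names `check_func_dict` and `flatten_bytes` are hypothetical, and the two consistency rules (that `num_args` equals the length of `args_type`, and that each entry of `inst_strings` pairs with one entry of `inst_bytes`) are assumptions inferred from the example rather than documented guarantees.

```python
def check_func_dict(func):
    """Return a list of consistency problems found in one function dictionary.

    Both rules below are assumptions inferred from the example layout.
    """
    problems = []
    if func["num_args"] != len(func["args_type"]):
        problems.append("num_args does not match len(args_type)")
    if len(func["inst_strings"]) != len(func["inst_bytes"]):
        problems.append("inst_strings and inst_bytes differ in length")
    return problems


def flatten_bytes(func):
    """Concatenate the per-instruction byte lists into one bytes object."""
    return bytes(b for inst in func["inst_bytes"] for b in inst)


# The example dictionary from this page.
func_dict = {
    "num_args": 3,
    "args_type": ["int", "char", "struct structure1*"],
    "ret_type": "int",
    "inst_strings": ["mov eax, 1", "nop", "push 3"],
    "inst_bytes": [[0x01, 0x02], [0xff], [0x20, 0x30, 0x40]],
    "boundaries": (0x80010, 0x800f0),
}

print(check_func_dict(func_dict))       # []
print(flatten_bytes(func_dict).hex())   # 0102ff203040
```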

Binary Representation

Each binary saved in pickles.tar.gz and clean_pickles.tar.gz is represented as a Dict object with the following fields:

Example:

BinaryFileDict = {
    "functions": {
        "function1": funcDict1,
        "function2": funcDict2
    },
    "structures": {
        "structure1": ["int", "char [16]", "float"],
        "structure2": ["long", "long", "short"]
    },
    "text_addr": 0x800000,
    "binRawBytes": "\x00\x12\x22...",
    "arch": "i386",
    "binary_filename": "gcc-32-O1-coreutils-ls",
    "function_calls": {
        "func1": [
            {
                # caller's name
                "caller": "func2",
                # the indices of calling instructions
                "call_instr_indices": [10, 17, 29]
            },
            {
                "caller": "func2",
                "call_instr_indices": [19]
            }
        ]
    }
}
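
The nested "function_calls" field can be flattened into one record per call site. The sketch below is illustrative only: `iter_call_sites` is a hypothetical helper, the dictionary is assumed to be already loaded into memory, and the code assumes that the keys of "function_calls" are callee names and that each index refers to an instruction inside the caller.

```python
def iter_call_sites(binary):
    """Yield (callee, caller, instruction_index) for every recorded call.

    Assumption: keys of "function_calls" name the callee, and every entry
    lists one caller together with the indices of its call instructions.
    """
    for callee, records in binary["function_calls"].items():
        for record in records:
            for index in record["call_instr_indices"]:
                yield callee, record["caller"], index


# A trimmed-down copy of the example above (only the field used here).
binary_file_dict = {
    "function_calls": {
        "func1": [
            {"caller": "func2", "call_instr_indices": [10, 17, 29]},
            {"caller": "func2", "call_instr_indices": [19]},
        ]
    },
}

for site in iter_call_sites(binary_file_dict):
    print(site)
```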

Code

Requirements

Train Embedding Model

Prepare the input file for training the embedding model

python prep_embed_input.py [options] -i input_folder

Link to prep_embed_input.py

Options:

Train the embedding model

python train_embed.py -i input_path

Link to train_embed.py

Options:

Save the embedding vectors

python save_embeddings.py [options] -p embed_pickle_path -m model_path

Link to save_embeddings.py

Options:

Train RNN Model

python train.py [options] -d data_folder -o output_dir -f split_func_path -e embed_path

Link to train.py

Options:

split_func_path Format

The split_func_path file stores the function names for the training and testing datasets. To predict type signatures from callees (function bodies), represent each function name as "binary_file_name#func_name". To predict type signatures from callers, represent each function name as "binary_file_name#callee_name#caller_name#call_insn_indice".
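
The two name formats can be told apart by counting the '#'-separated fields. The following is a minimal sketch under that assumption; `SampleName` and `parse_sample_name` are hypothetical names introduced for illustration, and the code assumes that '#' never occurs inside a binary or function name.

```python
from typing import NamedTuple, Optional


class SampleName(NamedTuple):
    binary: str
    callee: str
    caller: Optional[str] = None      # only set for caller-side names
    call_index: Optional[int] = None  # only set for caller-side names


def parse_sample_name(name):
    """Split a callee-side (2 fields) or caller-side (4 fields) name."""
    parts = name.split("#")
    if len(parts) == 2:
        return SampleName(parts[0], parts[1])
    if len(parts) == 4:
        return SampleName(parts[0], parts[1], parts[2], int(parts[3]))
    raise ValueError(f"unexpected number of '#'-separated fields: {name!r}")


callee_side = parse_sample_name("gcc-32-O1-binutils-objdump.pkl#OP_Rounding")
caller_side = parse_sample_name("gcc-32-O3-coreutils-numfmt.pkl#process_line#main#386")
print(callee_side)
print(caller_side)
```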

Example of a split_func_path file for callees:

splitFuncDict = {
    'train':[
                'gcc-32-O1-binutils-objdump.pkl#OP_Rounding',
                'clang-32-O1-coreutils-csplit.pkl#keep_new_line',
                'gcc-32-O3-coreutils-mv.pkl#copy_internal',
                ...
            ],
    'test': [
                'gcc-32-O3-findutils-find.pkl#parse_amin',
                'gcc-32-O2-findutils-find.pkl#pred_size',
                'clang-32-O1-findutils-find.pkl#debug_strftime',
                ...
            ]
}

Example of a split_func_path file for callers:

splitFuncDict = {
    'train':[
                'clang-32-O3-utillinux-dmesg.pkl#strnchr#print_record#283',
                'gcc-32-O3-coreutils-numfmt.pkl#process_line#main#386',
                'gcc-32-O3-coreutils-numfmt.pkl#process_line#main#557',
                ...
            ],
    'test': [
                'clang-32-O0-utillinux-utmpdump.pkl#gettok#undump#123',
                'gcc-32-O0-utillinux-ionice.pkl#ioprio_print#main#315',
                'clang-32-O1-inetutils-ping.pkl#ping_set_packetsize#ping_echo#24',
                ...
            ]
}
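
Before training, it can be useful to confirm that a split dictionary does not list the same name in both parts and to see how its entries are distributed across binaries. This is a small sketch rather than part of the released scripts: `summarize_split` is a hypothetical helper, and it only assumes the 'train'/'test' layout and the '#'-separated names shown above.

```python
from collections import Counter


def summarize_split(split):
    """Return (names present in both parts, per-part entry counts by binary)."""
    overlap = set(split["train"]) & set(split["test"])
    per_binary = {
        part: Counter(name.split("#", 1)[0] for name in split[part])
        for part in ("train", "test")
    }
    return overlap, per_binary


# Caller-side names taken from the example above, without the elided entries.
split_func_dict = {
    "train": [
        "clang-32-O3-utillinux-dmesg.pkl#strnchr#print_record#283",
        "gcc-32-O3-coreutils-numfmt.pkl#process_line#main#386",
        "gcc-32-O3-coreutils-numfmt.pkl#process_line#main#557",
    ],
    "test": [
        "clang-32-O0-utillinux-utmpdump.pkl#gettok#undump#123",
        "gcc-32-O0-utillinux-ionice.pkl#ioprio_print#main#315",
        "clang-32-O1-inetutils-ping.pkl#ping_set_packetsize#ping_echo#24",
    ],
}

overlap, per_binary = summarize_split(split_func_dict)
print(overlap)
print(per_binary["train"])
```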

Testing RNN Model

Usage:

python eval.py [options] -d data_folder -f split_func_path -e embed_path -m model_dir -o output_dir

Link to eval.py

Options:

Disclaimer

The code is a research-quality proof of concept and is still under development, with more features and bug fixes to come.

References

Neural Nets Can Learn Function Type Signatures From Binaries

Zheng Leong Chua, Shiqi Shen, Prateek Saxena, Zhenkai Liang.

In the 26th USENIX Security Symposium (USENIX Security 2017)

Project Members

Zheng Leong Chua, Shiqi Shen, Prateek Saxena, Zhenkai Liang, Valentin Ghita.