Awesome

QSynth Synthesis Artefacts

Datasets

These artifacts provide the following four datasets:

syntia: 500 functions obfuscated with EncodeArithmetic and EncodeData
custom_EA: 500 functions obfuscated with EncodeArithmetic
custom_VR_EA: 500 functions obfuscated with Virtualize and EncodeArithmetic
custom_EA_ED: 500 functions obfuscated with EncodeArithmetic and Virtualize (the database trace is truncated to 293 functions)

A benchmark

Each dataset directory contains the following files:

original.c: source file containing all the functions
obfuscated.c: obfuscated source file as generated by Tigress
run_tigress.sh: Shell script to regenerating the obfuscated.c file
ground_truth.json: a JSON file containing the ground truth. For each function it contains the original expression (in original.c) and its obfuscated counter-part in (obfuscated.c)
obfuscated: pre-compiled binary as x86_64 (no PIE)
trace.db: the execution trace as a SQLite database (likely the most interesting file)
results:
- qsynth.log: the raw output of the synthesis script
- qsynth.json: QSynth processable output (very likely the second most interesting file)

Result

The qsynth.json contains for each function the the following entries such as:

{
  "target_1": {
    "fun_name": "target_1",
    "start": 164,
    "stop": 238,
    "orig_ast": "~(b*d ^ d*d) - b",
    "orig_node": 10,
    "orig_depth": 5,
    "obfu_ast": "(((b & d)*(b | d) + (b & ~d)*(~b & d) & (d & d)*(d | d) + (d & ~d)*(~d & d)) - ((b & d)*(b | d) + (b & ~d)*(~b & d) | (d & d)*(d | d) + (d & ~d)*(~d & d)) - 1 ^ b) - ((~(((b & d)*(b | d) + (b & ~d)*(~b & d) & (d & d)*(d | d) + (d & ~d)*(~d & d)) - ((b & d)*(b | d) + (b & ~d)*(~b & d) | (d & d)*(d | d) + (d & ~d)*(~d & d)) - 1) & b) + (~(((b & d)*(b | d) + (b & ~d)*(~b & d) & (d & d)*(d | d) + (d & ~d)*(~d & d)) - ((b & d)*(b | d) + (b & ~d)*(~b & d) | (d & d)*(d | d) + (d & ~d)*(~d & d)) - 1) & b))",
    "obfu_node": 229,
    "obfu_depth": 12,
    "triton_ast": "(((b & d)*(b | d) + (~b & d)*(~d & b) & d*d) - (d*d | (b & d)*(b | d) + (~b & d)*(~d & b)) - 1 ^ b) - (((d*d ^ (b & d)*(b | d) + (~b & d)*(~d & b)) & b) + ((d*d ^ (b & d)*(b | d) + (~b & d)*(~d & b)) & b))",
    "triton_node": 95,
    "triton_depth": 10,
    "synthesized_ast": "~((d*d ^ d*b) + b)",
    "synth_node": 10,
    "synth_depth": 5,
    "dse_t": 0.10445356369018555,
    "synthesis_t": 0.03491616249084473,
    "sem_orig_obf": "UNK",
    "sem_obf_trit": "UNK",
    "sem_orig_synth": "OK",
    "is_simplified": true,
    "is_fully_synthesized": true
  }
}

Start and stop are the offsets in the execution trace (thus id in the DB). Then for each expression the expression itself and its node size and depth. Then dse_t, synthesis_t gives respectively the symbolic execution time and the synthesis time. is_simplified and is_fully_synthesized indicates if the function was simplified and if so if it was fully. sem_orig_obf, sem_obf_trit and sem_orig_synth indicates if the semantic is preserved between for instance original and synthesized (sem_orig_synth). "UNK" indicates that it has not been checked or that it yielded a timeout.

Synthesis oracle tables

The tables used for the benchmarks are available here: https://ret2libc.com/static/various/lts_15/ They are Python pickle objects. Expressions are encoded with a similar Reverse-Polish-Notation (RPN) than Syntia.