DataTracker

About

DataTracker is a tool for collecting high-fidelity data provenance from unmodified Linux programs. It is based on the Intel Pin Dynamic Binary Instrumentation framework and the libdft Dynamic Taint Analysis library. The taint marks supported by the original libdft are of limited size and cannot provide adequate fidelity for provenance tracking. For this reason, DataTracker uses a modified version of the library, developed at VU University Amsterdam.
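To illustrate the fidelity point: provenance tracking needs each tainted byte to name the exact input bytes it was derived from, which small fixed-size tags cannot express. The sketch below models such high-fidelity tags as sets of (file, offset) pairs. It is purely illustrative, not the actual libdft data structures (the real tags live in C++ shadow memory):

```python
# Illustrative model of high-fidelity taint tags: each tag is a set of
# (filename, offset) pairs identifying the exact input bytes a value
# was derived from. NOT the actual libdft implementation.

def make_tag(filename, offset):
    """Tag for one byte read from an input file."""
    return frozenset({(filename, offset)})

def combine(tag_a, tag_b):
    """Taint union: the result of an instruction depends on both inputs."""
    return tag_a | tag_b

# Reading 2 bytes from a file taints each destination byte individually.
shadow = {
    0x1000: make_tag("input.txt", 0),
    0x1001: make_tag("input.txt", 1),
}

# An instruction combining both bytes yields a tag naming both origins.
result = combine(shadow[0x1000], shadow[0x1001])
print(sorted(result))   # [('input.txt', 0), ('input.txt', 1)]
```

With the original libdft's small tags, only a handful of distinct taint sources can be distinguished; set-like tags can name arbitrarily many input bytes, at a memory cost.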

DataTracker was developed at VU University Amsterdam by Manolis Stamatogiannakis and presented at IPAW'14. You can get a copy of the paper from the VU Digital Archive Repository (VU-DARE). We also have a demo on YouTube. Presentation slides are available upon request.

Requirements

DataTracker currently works only with 32-bit Linux programs. This limitation is imposed by the current version of libdft. However, the techniques used by both tools are not platform-specific, so in principle they can be ported to any platform supported by Intel Pin. The requirements for running DataTracker are:

Installation

After cloning DataTracker, follow these steps to compile it.

Multiarch setup (intel64 only): On intel64 (a.k.a. x86_64) hosts, DataTracker and libdft need to be cross-compiled to ia32. For this, you will need a working multiarch setup. Google and Server Fault are your friends here.

Build environment: On Debian/Ubuntu systems, install the build-essential meta-package, which provides a C++ compiler and GNU Make. On other systems, either install an equivalent meta-package or install the tools one by one using trial and error.

Intel Pin: You can manually download a suitable Pin version and extract it into the pin directory. For convenience, a makefile is provided which downloads and extracts a suitable Pin version for you. Invoke it using:

make -C support -f makefile.pin

libdft: The modified libdft is included as a git submodule of DataTracker. You need to disable Git's certificate checking to successfully retrieve it. Because libdft does not use Pin's makefile infrastructure, you need to set the PIN_ROOT environment variable before compiling it. E.g.:

export PIN_ROOT=$(pwd)/pin
GIT_SSL_NO_VERIFY=true git submodule update --init
make support-libdft

dtracker Pin tool: Finally, compile the Pin tool of DataTracker using:

make

If all the steps above were successful, obj-ia32/dtracker.so will have been created. This is the Pin tool containing all the instrumentation required to capture provenance.

Running

Capturing raw provenance

To capture provenance from a program, launch it from the Unix shell using something like this:

./pin/pin.sh -follow_execv -t ./obj-ia32/dtracker.so <knobs> -- <program> <args>

The command runs the program under Pin. In addition to the standard Pin knobs, DataTracker supports these tool-specific knobs:

Note that launching large programs using the method above takes a long time. For such programs, it is suggested to first launch the program normally and then attach DataTracker to the running process like this:

./pin/pin.sh -follow_execv -pid <pid> -t ./obj-ia32/dtracker.so <knobs>

The raw provenance generated by DataTracker is written to the file rawprov.out. Any additional debugging information is written to pintool.log.

Converting to PROV

The raw2ttl.py script converts the raw provenance generated by DataTracker to the PROV format, in Turtle syntax. The converter works as a filter, so a conversion looks like this:

python raw2ttl.py < rawprov.out > prov.ttl
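The filter pattern above (raw records in, Turtle triples out) can be sketched as follows. Note that the raw record format (`u <program> <file>`) and the emitted triples here are invented for illustration; the real rawprov.out format and the output of raw2ttl.py differ:

```python
# Illustrative sketch of a provenance filter in the spirit of raw2ttl.py.
# The record format 'u <program> <file>' and the prov:used triples are
# ASSUMPTIONS for illustration, not DataTracker's actual raw format.

PREFIX = "@prefix prov: <http://www.w3.org/ns/prov#> .\n"

def convert(lines):
    """Map hypothetical 'u <program> <file>' records to prov:used triples."""
    out = [PREFIX]
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "u":
            prog, path = parts[1], parts[2]
            out.append("<%s> prov:used <%s> .\n" % (prog, path))
    return "".join(out)

print(convert(["u cat.bin input.txt"]), end="")
```

Hooking such a function to sys.stdin/sys.stdout gives exactly the shell pipeline shown above.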

Visualizing provenance

For visualizing the generated provenance, we suggest using provconvert from Luc Moreau's ProvToolbox. We recommend using the binary release.

Of course, any other PROV-compatible tool can be used, either directly or via conversion of the Turtle file to a supported syntax. If you produce any good-looking provenance graphs, we'd love to incorporate them in these pages.

Sample programs

This repository also includes a few sample programs we used for evaluating the effectiveness of DataTracker. You can find them in the samples directory. To build them, use:

make -C samples
<!-- Integration with SPADE ----------------------- ``` <provenance> ::= <provenance> <element> | <element> <element> ::= <node> | <dependency> <node> ::= <node-type> <node-id> <annotation-list> <node-type> ::= type: <vertex-type> <vertex-type> ::= Agent | Process | Artifact <node-id> ::= id: <vertex-id> <vertex-id> ::= <unique-identifier> <annotation-list> ::= <annotation-list> <annotation> | <annotation> <annotation> ::= <key> : <value> <dependency> ::= <dependency-type> <start-node> <end-node> <annotation-list> <dependency-type> ::= type: <edge-type> <edge-type> ::= WasControlledBy | WasGeneratedBy | Used | WasTriggeredBy | WasDerivedFrom <start-node> ::= from: <vertex-id> <end-node> ::= to: <vertex-id> ``` -->