DataTracker
===========

About
-----
DataTracker is a tool for collecting high-fidelity data provenance from unmodified Linux programs. It is based on the Intel Pin Dynamic Binary Instrumentation framework and the libdft Dynamic Taint Analysis library. The taint marks supported by the original libdft are of limited size and cannot provide adequate fidelity for use in provenance tracking. For this reason, DataTracker uses a modified version of the library, developed at VU University Amsterdam.
DataTracker was developed at VU University Amsterdam by Manolis Stamatogiannakis and was presented at IPAW14. You can get a copy of the paper from the VU Digital Archive Repository (VU-DARE). We also have a demo on YouTube. Presentation slides are available upon request.
Requirements
------------
DataTracker currently works with 32-bit Linux programs. This limitation is imposed by the current version of libdft. However, the techniques used by both tools are not platform-specific, so in principle they can be ported to any platform supported by Intel Pin. The requirements for running DataTracker are:
- A C++11 compiler and unix build utilities (e.g. GNU Make).
- A recent (>=2.13) version of Intel Pin. The framework must be present in the `pin` directory inside the DataTracker top directory.
- A suitable version of the modified libdft - typically the latest available. This must be placed in the `support/libdft` directory.
- Python 2.7 for converting raw provenance to PROV format in Turtle syntax.
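As a quick sanity check, you can verify that the basic build tools are available before proceeding (a minimal sketch; exact versions and package names vary per distribution):

```
# check the basic prerequisites for building DataTracker
g++ --version          # must support C++11
make --version
python2.7 --version    # needed by the raw2ttl.py converter
```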
Installation
------------
After cloning DataTracker, follow these steps to compile it.
**Multiarch setup (intel64 only):** On intel64 (a.k.a. x86_64) hosts, DataTracker and libdft need to be cross-compiled to ia32. For this, you will need a working multiarch setup. Google and serverfault are your friends for this.
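For example, on a Debian or Ubuntu host, a working ia32 multiarch setup can usually be obtained along these lines (a sketch only; package names differ on other distributions):

```
# enable the i386 architecture and install compilers able to produce 32-bit code
sudo dpkg --add-architecture i386
sudo apt-get update
sudo apt-get install gcc-multilib g++-multilib
```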
**Build environment:** On Debian/Ubuntu systems, you should install the `build-essential` meta-package, which provides a C++ compiler and GNU Make. On other systems, you should either install an equivalent meta-package or install the tools one by one using trial and error.
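On Debian/Ubuntu this amounts to:

```
sudo apt-get install build-essential
```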
**Intel Pin:** You can manually download a suitable Pin version and extract it into the `pin` directory. For convenience, a makefile is provided which takes care of this, i.e. it downloads and extracts a suitable Pin version. Invoke it using:

    make -C support -f makefile.pin
**libdft:** The modified libdft is included as a git submodule of DataTracker. You need to disable git's certificate checking to successfully retrieve it. Because libdft does not use Pin's makefile infrastructure, you need to set the `PIN_ROOT` environment variable before compiling it. E.g.:

    export PIN_ROOT=$(pwd)/pin
    GIT_SSL_NO_VERIFY=true git submodule update --init
    make support-libdft
**dtracker pin tool:** Finally, compile the Pin tool of DataTracker using:

    make
If all the above steps were successful, `obj-ia32/dtracker.so` will be created. This is the Pin tool containing all the instrumentation required to capture provenance.
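As an optional sanity check, the standard `file` utility should report the tool as a 32-bit ELF shared object, since it is cross-compiled to ia32:

```
# the dtracker Pin tool is built for ia32
file obj-ia32/dtracker.so
```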
Running
-------

### Capturing raw provenance
To capture provenance from a program, launch it from the unix shell using something like this:

    ./pin/pin.sh -follow_execv -t ./obj-ia32/dtracker.so <knobs> -- <program> <args>
This command runs the program under Pin. In addition to the standard Pin knobs, DataTracker supports the following tool-specific knobs:
- `-stdin [1|0]`: Turns tracking of data read from the standard input on or off. Default is off.
- `-stdout [1|0]`: Turns logging of the provenance of data written to the standard output on or off. Default is on.
- `-stderr [1|0]`: Turns logging of the provenance of data written to the standard error on or off. Default is off.
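For example, to additionally track data read from the standard input and log the provenance of data written to the standard error, an invocation could look like this (a sketch; `<program>` must be a 32-bit binary and `input.txt` is just a placeholder):

```
# tool-specific knobs go between the tool and the '--' separator
./pin/pin.sh -follow_execv -t ./obj-ia32/dtracker.so -stdin 1 -stderr 1 -- <program> < input.txt
```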
Note that launching large programs using the method above takes a lot of time. For such programs, it is suggested to first launch the program and then attach DataTracker to the running process like this:

    ./pin/pin.sh -follow_execv -pid <pid> -t ./obj-ia32/dtracker.so <knobs>
The raw provenance generated by DataTracker is written to the file `rawprov.out`. Any additional debugging information is written to the file `pintool.log`.
### Converting to PROV
The `raw2ttl.py` script converts the raw provenance generated by DataTracker to PROV format in Turtle syntax. The converter works as a filter, so a conversion would look like this:

    python raw2ttl.py < rawprov.out > prov.ttl
### Visualizing provenance
For visualizing the generated provenance, we suggest using `provconvert` from Luc Moreau's ProvToolbox, preferably its binary release.
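A typical conversion to a renderable graph could look like the following (a sketch; it assumes `provconvert` is on your `PATH` and that Graphviz is installed for graphical output formats):

```
# convert the Turtle file into an SVG provenance graph
provconvert -infile prov.ttl -outfile prov.svg
```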
Of course, any other PROV-compatible tool can be used, either directly or via conversion of the Turtle file to a supported syntax. If you manage to produce any good-looking provenance graphs, we'd love to incorporate them in these pages.
Sample programs
---------------
This repository also includes a few sample programs which we used for evaluating the effectiveness of DataTracker. You can find these programs in the `samples` directory. To build them, use:

    make -C samples
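Putting everything together, an end-to-end run on one of the samples could look roughly like this (a sketch only; replace `<sample>` with one of the binaries built in the `samples` directory):

```
# capture raw provenance from a sample program and convert it to PROV/Turtle
./pin/pin.sh -follow_execv -t ./obj-ia32/dtracker.so -- ./samples/<sample>
python raw2ttl.py < rawprov.out > prov.ttl
```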
<!--
Integration with SPADE
-----------------------
```
<provenance> ::= <provenance> <element> | <element>
<element> ::= <node> | <dependency>
<node> ::= <node-type> <node-id> <annotation-list>
<node-type> ::= type: <vertex-type>
<vertex-type> ::= Agent | Process | Artifact
<node-id> ::= id: <vertex-id>
<vertex-id> ::= <unique-identifier>
<annotation-list> ::= <annotation-list> <annotation> | <annotation>
<annotation> ::= <key> : <value>
<dependency> ::= <dependency-type> <start-node> <end-node>
<annotation-list>
<dependency-type> ::= type: <edge-type>
<edge-type> ::= WasControlledBy | WasGeneratedBy | Used | WasTriggeredBy | WasDerivedFrom
<start-node> ::= from: <vertex-id>
<end-node> ::= to: <vertex-id>
```
-->