Data Polygamy

Data Polygamy is a scalable topology-based framework that allows users to query for statistically significant relationships between spatio-temporal datasets. For more detailed information about our framework, please refer to our SIGMOD paper:

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets, F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2016

We strongly suggest that users read our paper before using our code.

The team includes:

Our code and data in this repository are available under the BSD license.

Contents

This README file is divided into the following sections:

1. Repository Overview

2. Dependencies

The Data Polygamy framework uses Java 1.7.0_45 and has the following dependencies:

3. Preliminaries

This section describes the data used by the framework and what must be in place before executing it.

3.1. HDFS Directory

The code reads from and writes to HDFS. It assumes that the HDFS home directory has the following structure:

.
+-- data/
|   +-- datasets
|   +-- dataset
|   +-- dataset.header
|   +-- dataset.defaults
|   +-- ...
+-- pre-processing/
|   +-- ...
+-- aggregates/
|   +-- ...
+-- index/
|   +-- ...
+-- mergetree/
|   +-- ...
+-- relationships/
|   +-- ...
+-- relationships-ids/
|   +-- ...
+-- correlations/
|   +-- ...
+-- neighborhood
+-- neighborhood-graph
+-- zipcode
+-- zipcode-graph

where:

To automatically create the required directories, take a look at the load-hdfs-structure script.
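
If you prefer to set up the structure manually, the commands below are a minimal sketch using the standard HDFS shell; the directory names are taken from the layout above (the spatial resolution files, e.g. neighborhood and zipcode-graph, are uploaded separately, as described in the next section):

$ hdfs dfs -mkdir -p data pre-processing aggregates index mergetree
$ hdfs dfs -mkdir -p relationships relationships-ids correlations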

3.2. Spatial Resolutions

The current implementation of Data Polygamy supports five spatial resolutions: GPS, neighborhood, zipcode, grid, and city. The grid resolution has only been used for testing, not in our final experiments. Note that the framework assumes that all the data fed to the pre-processing step corresponds to a single city; therefore, if you are handling data from more than one city, you probably need to provide a suitable resolution conversion under the resolution directory.

To use the neighborhood and zipcode resolutions, two files must be provided for each: a polygons file, and a graph file. The former contains all the polygons that represent the different regions of the resolution (e.g.: neighborhoods or zipcodes) with their corresponding ids. A polygon, in this case, is represented by a set of GPS points, where the last point is the same as the first one. The format is the following:

<region-id>              # first region
<number-of-polygons>
<number-of-data-points>  # first polygon
<point-1>
<point-2>
...    
<number-of-data-points>  # second polygon
...
<region-id>              # second region
...

The files neighborhood.txt and zipcode.txt are examples of such files for New York City.

The graph file represents a graph over the resolution's regions: each region is a node, and there is an edge between two regions if they are neighbors. The first line of the file contains the number of nodes and the number of edges, and each following line represents one edge of the graph. The files neighborhood-graph.txt and zipcode-graph.txt are examples of such files for New York City.
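
For concreteness, a toy graph file for a resolution with four regions and three adjacencies could look as follows. This is an illustrative sketch only: it assumes that nodes are referred to by their region ids and that each edge line lists the two region ids it connects; check neighborhood-graph.txt for the exact format used by the framework.

4 3
0 1
1 2
2 3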

The script load-spatial can be used to automatically upload our spatial resolution files to HDFS.

3.3. Data

The data directory under HDFS contains all the datasets used by the framework.

Dataset Attributes

We assume the following types of attributes for a dataset:

All other attributes can be ignored by enclosing their values in either double quotes (e.g.: "ignore me!") or the symbol $ (e.g.: $ignore me!$). Alternatively, you can simply delete these attributes before executing the framework.
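
For example, in the following hypothetical comma-separated record (made up purely for illustration), the two free-text fields are marked to be ignored, one with double quotes and one with $:

2011-05-03 10:15:00,40.7484,-73.9857,"ignore me!",$and me too$,12.50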

Dataset Files

For each dataset, three files are required and must be located under the data directory. For the purpose of this documentation, assume a dataset named taxi:

In addition to these dataset files, a file named datasets must be created under the data directory, containing a mapping between dataset names and dataset ids. An example of such a file is available here.
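
As a rough illustration only (the linked example defines the exact format), such a mapping could associate each dataset name with a numeric id, one dataset per line:

taxi 0
weather 1
311 2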

4. How To Build

We use Apache Maven 3.3.9 to build the Data Polygamy framework:

$ cd data-polygamy/
$ ./install-repositories
$ mvn clean package

This generates a jar file, with the following name and path: data-polygamy/target/data-polygamy-0.1-jar-with-dependencies.jar. For simplicity, we refer to this file as data-polygamy.jar throughout this documentation.

Note that Maven takes care of all the dependencies except JIDT, Java-ML, and JavaMI, which are not available in the Maven Central repository. Therefore, we include these libraries, as well as their corresponding licenses, under data-polygamy/lib. We did not modify any of these libraries.

5. How To Run

To run our framework, you will need Apache Hadoop. The framework can be summarized as follows:

<div align="center"><img src="framework.png" height="125"></div>

Each step of the framework is represented by a map-reduce job. The Pre-Processing step is executed once for each dataset, while the other steps can be executed once for multiple datasets.

5.1. Common Arguments

The following command-line arguments are available in all the steps of the framework:

Required Arguments:

Optional Arguments:

5.2. Pre-Processing Step

The Pre-Processing step is responsible for selecting the data (from a dataset) that corresponds to spatial, temporal, identifier, and numerical attributes. This step also performs a pre-aggregation that is fed to the scalar function computation step.

To run the pre-processing step:

$ hadoop jar data-polygamy.jar edu.nyu.vida.data_polygamy.pre_processing.PreProcessing -m <machine> -n <number-nodes> -dn <dataset name> -dh <dataset header file> -dd <dataset defaults file> -t <temporal resolution> -s <spatial resolution> -cs <current spatial resolution> -i <temporal index> <spatial indices> ...

where:

The results are stored under the pre-processing directory.
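
As a purely hypothetical example, an invocation for a dataset named taxi could look like the following; every argument value below is a placeholder (machine name, number of nodes, resolutions, and attribute indices), so check the argument descriptions above and the scripts under sigmod16/ for values that match your setup:

$ hadoop jar data-polygamy.jar edu.nyu.vida.data_polygamy.pre_processing.PreProcessing -m cluster -n 20 -dn taxi -dh taxi.header -dd taxi.defaults -t hour -s nbhd -cs points -i 0 1 2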

5.3. Step 1: Scalar Function Computation

The Scalar Function Computation step is responsible for generating all possible scalar functions at different spatio-temporal resolutions.

To run the scalar function computation step:

$ hadoop jar data-polygamy.jar edu.nyu.vida.data_polygamy.scalar_function_computation.Aggregation -m <machine> -n <number-nodes> -g <datasets>

where:

The results are stored under the aggregates directory.

5.4. Step 2: Feature Identification

The Feature Identification step creates the merge tree indices (if they have not been created yet) and identifies the set of features for the different scalar functions.

To run the feature identification step:

$ hadoop jar data-polygamy.jar edu.nyu.vida.data_polygamy.feature_identification.IndexCreation -m <machine> -n <number-nodes> -g <datasets> -t

where:

The file data/thresholds must have the following format:

<dataset-name>
<scalar-function-id> <threshold-salient-feature> <threshold-extreme-feature>
<scalar-function-id> <threshold-salient-feature> <threshold-extreme-feature>
...
<dataset-name>
<scalar-function-id> <threshold-salient-feature> <threshold-extreme-feature>
<scalar-function-id> <threshold-salient-feature> <threshold-extreme-feature>
...

In this file, the values in a line are separated by the tab character (i.e., \t). To find out which scalar function ids to use, take a look at the pre-processing/*.aggregates file corresponding to the dataset of interest.
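
As a purely illustrative example, a thresholds file for two hypothetical datasets could look as follows (dataset names, scalar function ids, and threshold values are made up; columns are shown with spaces here for readability, but in the actual file they must be separated by tabs):

taxi
0 0.3 0.8
1 0.3 0.8
weather
0 0.5 0.9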

The results (set of features for each scalar function at different resolutions) are stored under the index directory. Merge tree indices are stored under the mergetree directory.

5.5. Step 3: Relationship Computation (Query Evaluation)

The Relationship Computation step evaluates the relationships between all the possible pairs of functions corresponding to the input query, i.e., the query evaluation happens in this step.

To run the relationship computation step:

$ hadoop jar data-polygamy.jar edu.nyu.vida.data_polygamy.relationship_computation.Relationship -m <machine> -n <number-nodes> -g1 <datasets> -g2 <datasets> -sc <score-threshold> -st <strength threshold> -c -id -r

where:

This step supports the general form of the relationship query:

<b><i>Find relationships between G1 and G2 satisfying CLAUSE.</i></b>

G1 and G2 are the groups of datasets corresponding to arguments -g1 and -g2: all the possible relationships between G1 and G2 are evaluated. If G2 is not provided, we assume that G2 encompasses all the datasets in the corpus (i.e., under the data directory), thus allowing hypothesis generation. The remaining arguments and flags are part of the CLAUSE. If you want to specify custom thresholds for computing salient and extreme features, rather than doing so as part of the CLAUSE, it is better to first re-execute the feature identification step (specifying the desired thresholds) and then execute the relationship computation step.
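
As a hypothetical example, a query for relationships between the taxi and weather datasets with both thresholds set to 0.5 could be issued as follows (the machine, node count, and threshold values are placeholders):

$ hadoop jar data-polygamy.jar edu.nyu.vida.data_polygamy.relationship_computation.Relationship -m cluster -n 20 -g1 taxi -g2 weather -sc 0.5 -st 0.5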

The results are stored under the relationships directory if flag -id is not used; otherwise, results are stored under the relationships-ids directory.

Relationship Output

Consider datasets taxi, weather, and 311, and assume that there is only one possible spatio-temporal resolution, hour-city, for simplicity. The results are stored in the following structure:

.
+-- relationships/
|   +-- taxi-weather/
|       +-- hour-city-events-restricted/
|           +-- ...
|       +-- hour-city-outliers-restricted/
|           +-- ...
|   +-- taxi-311/
|       +-- hour-city-events-restricted/
|           +-- ...
|       +-- hour-city-outliers-restricted/
|           +-- ...
|   +-- 311-weather/
|       +-- hour-city-events-restricted/
|           +-- ...
|       +-- hour-city-outliers-restricted/
|           +-- ...

Relationships are computed for both salient features (regarded in the results as events) and extreme features (regarded in the results as outliers). If flag -c is used (i.e., complete randomization), restricted is replaced with complete in the results.

To download the results for, say, taxi and weather with salient features:

$ hdfs dfs -getmerge relationships/taxi-weather/hour-city-events-restricted output

For each relationship (pair of scalar functions), the following values are outputted (in this order): relationship score, relationship strength, p-value, number of matched events, number of matched positive events, number of matched negative events, number of positive events on the first scalar function only, number of negative events on the first scalar function only, number of positive events on the second scalar function only, number of negative events on the second scalar function only, number of positive-positive relationships, number of negative-negative relationships, number of positive-negative relationships, and number of negative-positive relationships.
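
The short Python sketch below is one way to inspect these values in the merged output file; it assumes (this is not specified here) that the 14 values above appear at the end of each line and that fields are separated by commas or tabs, so adjust the parsing to the actual layout of your output:

import re

# Names for the 14 trailing values, in the order listed above.
FIELDS = [
    "score", "strength", "p_value",
    "matched_events", "matched_positive", "matched_negative",
    "pos_first_only", "neg_first_only", "pos_second_only", "neg_second_only",
    "pos_pos", "neg_neg", "pos_neg", "neg_pos",
]

with open("output") as f:
    for line in f:
        tokens = re.split(r"[,\t]+", line.strip())
        if len(tokens) < len(FIELDS):
            continue  # skip malformed or header lines
        record = dict(zip(FIELDS, tokens[-len(FIELDS):]))
        print(record["score"], record["strength"], record["p_value"])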

5.6. Alternate Step: Correlation Computation

The Correlation Computation step, which is not an "official" step of our framework, is used to compute relationships among datasets that are based on standard correlation techniques (rather than on topology features): Pearson's correlation coefficient (PCC), mutual information (MI), and dynamic time warping (DTW). We use this step for comparison purposes only.

This should be executed after the scalar function computation step:

<div align="center"><img src="framework-standard-techniques.png" height="110"></div>

To run the correlation computation step:

$ hadoop jar data-polygamy.jar edu.nyu.vida.data_polygamy.standard_techniques.CorrelationTechniques -m <machine> -n <number-nodes> -g1 <datasets> -g2 <datasets>

where -g1 and -g2 are equivalent to the arguments in the relationship computation step.

Correlation Output

The results are stored in a similar structure as in the relationship computation step, except that there are no salient and extreme features, and Monte Carlo tests are always restricted. For each pair of scalar functions, the following values are outputted (in this order): PCC, MI, DTW, p-value for PCC, p-value for MI, and p-value for DTW.

6. Experiments

In this section, we show how to reproduce the results of our SIGMOD'16 paper.

We provide a pre-built jar file for the Data Polygamy framework at sigmod16/data-polygamy.jar. If you want to build the code yourself, follow the instructions here or take a look at prepareSoftware.sh.

We provide the following scripts to make it easier to re-run our experiments (these must be run from the sigmod16/ directory):

It is important to note that, since we cannot make the 911, Taxi, and Twitter datasets available, the scripts that we provide here do not take these datasets into account; as a consequence, the performance results and plots will be consistent with, but visually different from, the ones published in the paper. Please modify the scripts accordingly if you obtain the remaining datasets elsewhere.

Alternatively, we provide ReproZip packages for the original plots published in the paper, from which you can obtain the original performance results. The ReproZip packages were mostly created on an Ubuntu 12.04 LTS machine with the same versions of Python and matplotlib.

More detailed information about our experiments can be found in the following sections.

6.1. Machine Configuration

The experiments were executed on a cluster with 20 compute nodes, each node running Red Hat Enterprise Linux Server release 6.7 and equipped with an AMD Opteron(TM) 6272 processor (4x16 cores) at 2.1GHz and 256GB of RAM. The installed software is the following:

All the files related to the experiments are located under sigmod16/. All the scripts assume that this software and these libraries are properly installed.

6.2. Datasets

Gas Prices

The Gas Prices dataset that we used in the experiments is available here.

The original dataset is available at the U.S. Energy Information Administration website.

Vehicle Collisions

The Vehicle Collisions dataset that we used in the experiments is available here.

The original dataset is available at the NYC Open Data portal.

311 Complaints

The 311 dataset that we used in the experiments is available here.

The original dataset is available at the NYC Open Data portal.

911 Calls

This dataset is not open source, and therefore, we cannot make it available online.

Citi Bike Data

The Citi Bike dataset that we used in the experiments is available here.

The original dataset is available at the Citi Bike website.

Weather Data

The Weather dataset that we used in the experiments is available here.

The original dataset is available at the National Climatic Data Center website.

Traffic Speed

The Traffic Speed dataset that we used in the experiments is available here.

Taxi Data

The version of the Taxi dataset that we used in the experiments is not open source, and therefore, we cannot make it available online. However, the Taxi and Limousine Commission has made the trip data available on their website.

Twitter

The Twitter data that we used in the experiments is too large to share (approximately 700 GB). We obtained this dataset using Twitter's streaming API.

Datasets from NYC Open Data

The 300 datasets from NYC Open Data (NYC Open collection) that we used in the experiments are available here.

6.3. Initial Setup

First, run the hdfs_dir script to create the appropriate HDFS directory structure:

$ cd sigmod16/setup/
$ ./hdfs_dir

To load the datasets from the NYC Urban collection, run the load_nyc_urban script:

$ cd sigmod16/setup/
$ ./load_nyc_urban

Please note that, since we cannot make the 911, Taxi, and Twitter datasets available, these datasets will not be loaded into your HDFS.

To load the datasets from NYC Open Data (NYC Open collection), run the load_nyc_open script:

$ cd sigmod16/setup/
$ ./load_nyc_open

After loading all the datasets, run the following scripts to execute the pre-processing step:

$ cd sigmod16/pre-processing/
$ ./pre-processing-nyc-urban  ## NYC Urban collection
$ ./pre-processing-nyc-open   ## NYC Open collection

Finally, execute the following script to download some additional required data:

$ cd sigmod16/
$ ./download-time-series

6.4. Performance Evaluation

Section 6.1 of the paper

Merge Tree Index Performance (Figure 7)

This experiment was run for a single dataset (Taxi data, using its density function), for both city (1D) and neighborhood (3D) resolutions. Here, we used a single node in the cluster, and the experiment assumes that the system has at least 100GB of unused memory.

To generate the data, run the following:

$ cd sigmod16/performance-evaluation/merge-tree-index/
$ python run.py

Then, to produce the plots:

$ cd sigmod16/performance-evaluation/merge-tree-index/
$ python merge-tree-index-performance.py  ## Figures 7(a) and 7(b)

Alternatively, you can download the ReproZip package figure-7.rpz to reproduce the original plots:

$ reprounzip vagrant setup figure-7.rpz figure-7/
$ reprounzip vagrant run figure-7/
## Figure 7(a)
$ reprounzip vagrant download figure-7/ standalone-index-1d.png:figure-7a.png
## Figure 7(b)
$ reprounzip vagrant download figure-7/ standalone-index-3d.png:figure-7b.png

Feature Indexing and Identification (Figure 8)

First, run the following scripts:

$ cd sigmod16/performance-evaluation/
$ cd nyc-open/                       ## NYC Open collection
$ ./run-varying > run-varying.out
$ cd ../nyc-urban/                   ## NYC Urban collection
$ ./run-varying > run-varying.out

Then, to produce the plots:

$ cd sigmod16/performance-evaluation/
$ cd nyc-urban/                                                                           ## NYC Urban collection
$ python running-time-preprocessing.py metadata/ run-varying.out True                     ## Figure 8(a)
$ cd ../nyc-open/                                                                         ## NYC Open collection
$ python running-time-preprocessing.py metadata/ run-varying.out False nyc-open-metadata  ## Figure 8(b)

Alternatively, you can download the ReproZip package figure-8.rpz to reproduce the original plots:

$ reprounzip vagrant setup figure-8.rpz figure-8/
## Reproducing Figure 8(a)
$ reprounzip vagrant run figure-8/ 8a
$ reprounzip vagrant download figure-8/ output-nyc-urban.png:figure-8a.png
## Reproducing Figure 8(b)
$ reprounzip vagrant run figure-8/ 8b
$ reprounzip vagrant download figure-8/ output-nyc-open.png:figure-8b.png

Query Performance (Figure 9)

First, make sure to run the scripts for feature indexing and identification, i.e.:

$ cd sigmod16/performance-evaluation/
$ cd nyc-open/                       ## NYC Open collection
$ ./run-varying > run-varying.out
$ cd ../nyc-urban/                   ## NYC Urban collection
$ ./run-varying > run-varying.out

Then, to produce the plots:

$ cd sigmod16/performance-evaluation/
$ cd nyc-urban/                                                                          ## NYC Urban collection
$ python running-time-relationship.py metadata/ run-varying.out True                     ## Figure 9(a)
$ cd ../nyc-open/                                                                        ## NYC Open collection
$ python running-time-relationship.py metadata/ run-varying.out False nyc-open-metadata  ## Figure 9(b)

Alternatively, you can download the ReproZip package figure-9.rpz to reproduce the original plots:

$ reprounzip vagrant setup figure-9.rpz figure-9/
## Reproducing Figure 9(a)
$ reprounzip vagrant run figure-9/ 9a
$ reprounzip vagrant download figure-9/ output-nyc-urban.png:figure-9a.png
## Reproducing Figure 9(b)
$ reprounzip vagrant run figure-9/ 9b
$ reprounzip vagrant download figure-9/ output-nyc-open.png:figure-9b.png

Scalability (Figure 10)

The scalability experiment was the only one performed on Amazon Web Services (AWS), using Amazon S3 (for storage) and Amazon Elastic MapReduce (EMR). All the scripts related to this experiment are available under sigmod16/performance-evaluation/scalability/.

To run this experiment, install the following dependencies:

First, create a bucket on Amazon S3 (we refer to this bucket as BUCKET). Then, under BUCKET, create all the necessary directories presented before (e.g.: data, pre-processing, aggregates, ...), including a directory named emr-logs. Once all the directories are created, load all the data to Amazon S3:
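
If you want to create the bucket and the empty directory markers with the AWS CLI (this is just one way of doing it and assumes the AWS CLI is installed and configured), a minimal sketch is:

$ aws s3 mb s3://BUCKET
$ for d in data pre-processing aggregates index mergetree relationships relationships-ids correlations emr-logs; do
      aws s3api put-object --bucket BUCKET --key "$d/"
  done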

$ cd sigmod16/performance-evaluation/scalability/setup/
$ ./load-data BUCKET
$ ./load-nyc-urban BUCKET

To run the pre-processing step:

$ cd sigmod16/performance-evaluation/scalability/
$ python run-pre-processing.py BUCKET N_NODES AWS_ID AWS_KEY

where N_NODES is the number of nodes, AWS_ID is the AWS Access Key Id, and AWS_KEY is the AWS Secret Access Key.

Then, to finally run the scalability experiment, run each of the following commands:

$ cd sigmod16/performance-evaluation/scalability/
$ python run-cluster.py BUCKET 2 AWS_ID AWS_KEY
$ python run-cluster.py BUCKET 4 AWS_ID AWS_KEY
$ python run-cluster.py BUCKET 8 AWS_ID AWS_KEY
$ python run-cluster.py BUCKET 16 AWS_ID AWS_KEY

Important: after executing one command, wait for the cluster to be terminated before running the next one.

To produce the speedup plot (Figure 10):

$ cd sigmod16/performance-evaluation/scalability/
$ python get-stdout.py BUCKET
$ python speedup.py

Alternatively, you can download the ReproZip package figure-10.rpz to reproduce the original plot:

$ reprounzip vagrant setup figure-10.rpz figure-10/
$ reprounzip vagrant run figure-10/
$ reprounzip vagrant download figure-10/ speedup.png:figure-10.png

Relationship Pruning (Figure 11)

First, make sure to run the scripts for query performance. Then, run the following to download the relationship data:

$ cd sigmod16/performance-evaluation/nyc-open/
$ ./download-relationships
$ cd ../nyc-urban/
$ ./download-relationships

Next, we need to analyze the number of relationships for a varying number of datasets by running the following:

$ cd sigmod16/performance-evaluation/nyc-urban/pruning/
$ ./get-pruning-data
$ cd ../../nyc-open/pruning/
$ ./get-pruning-data

To produce the pruning plots (Figure 11):

$ cd sigmod16/performance-evaluation/nyc-urban/pruning/
$ python pruning.py results events restricted week city True   ## Figure 11(a)
$ cd ../../nyc-open/pruning/
$ python pruning.py results events restricted week city False  ## Figure 11(b)

Note that you can generate the same plots for different resolutions. For instance, run python pruning.py results events restricted hour nbhd True to generate the pruning plots for the hour-neighborhood resolution.

Alternatively, you can download the ReproZip package figure-11.rpz to reproduce the original plots:

$ reprounzip vagrant setup figure-11.rpz figure-11/
## Reproducing Figure 11(a)
$ reprounzip vagrant run figure-11/ 11a
$ reprounzip vagrant download figure-11/ events-restricted-week-city-urban:figure-11a.png
## Reproducing Figure 11(b)
$ reprounzip vagrant run figure-11/ 11b
$ reprounzip vagrant download figure-11/ events-restricted-week-city-open:figure-11b.png

6.5. Correctness and Robustness

Section 6.2 of the paper

Although the following experiments use Hadoop, they run on a single machine (i.e., there is no need for a cluster).

Correctness

To run the correctness experiment:

$ cd sigmod16/
$ ./correctness

where:

Robustness (Figures 12, I, II, and III)

To run the robustness experiment:

$ cd sigmod16/robustness/
$ ./robustness

To produce the plots:

$ cd sigmod16/robustness/
$ python robustness.py noise-exp-taxi-city.out 1D False

In addition to the plots for relationship score and strength (originally published in the paper), the script also plots the p-values with increasing levels of noise.

Alternatively, you can download the ReproZip package robustness.rpz to reproduce the original plots:

$ reprounzip vagrant setup robustness.rpz robustness/
$ reprounzip vagrant run robustness/
$ reprounzip vagrant download robustness/ 12a:figure-12a.png    ## Figure 12(a)
$ reprounzip vagrant download robustness/ 12b:figure-12b.png    ## Figure 12(b)
$ reprounzip vagrant download robustness/ Ia:figure-Ia.png      ## Figure I(a)
$ reprounzip vagrant download robustness/ Ib:figure-Ib.png      ## Figure I(b)
$ reprounzip vagrant download robustness/ IIa:figure-IIa.png    ## Figure II(a)
$ reprounzip vagrant download robustness/ IIb:figure-IIb.png    ## Figure II(b)
$ reprounzip vagrant download robustness/ IIIa:figure-IIIa.png  ## Figure III(a)
$ reprounzip vagrant download robustness/ IIIb:figure-IIIb.png  ## Figure III(b)

You can also retrieve all the plots simultaneously:

$ reprounzip vagrant download robustness/ --all

6.6. Standard Techniques

Section 6.4 of the paper

To run the standard techniques on the NYC Urban collection:

$ cd sigmod16/
$ ./standard-techniques

Please note that this assumes that the script sigmod16/performance-evaluation/nyc-urban/run-varying has already been run, i.e., that the scalar function computation step has already been executed.

6.7. Relationships

First, make sure to run the scripts for query performance. Then, run the following to download the relationship data from the NYC Urban collection:

$ cd sigmod16/performance-evaluation/nyc-urban/
$ ./download-relationships

Relationships for salient features can be found under sigmod16/performance-evaluation/nyc-urban/output-events, while relationships for extreme features can be found under sigmod16/performance-evaluation/nyc-urban/output-outliers.