
SparkProject

Invoking Apache Spark from ArcGIS for Desktop

This project contains two modules, SparkApp and SparkToolbox.

SparkApp uses Apache Spark to perform spatial binning of data residing in HDFS. The spatial bins in this project are a set of honeycomb-shaped two-dimensional polygons that are generated by the SparkToolbox.
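
To make the idea of spatial binning concrete, here is a minimal single-process Python sketch. It is not the SparkApp code, which runs on Spark against HDFS; it only illustrates counting how many points fall inside each bin polygon. The function names `contains` and `bin_points` and the two square sample bins are hypothetical and chosen for the example.

```python
from collections import Counter


def contains(polygon, px, py):
    """Ray-casting test: True if (px, py) lies inside the polygon.

    polygon is a list of (x, y) vertices; points exactly on an edge
    may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def bin_points(bins, points):
    """Return sorted (bin id, point count) rows for bins holding at least one point.

    bins maps a bin id to its polygon; each point is counted in at most one bin.
    """
    counts = Counter()
    for px, py in points:
        for bin_id, polygon in bins.items():
            if contains(polygon, px, py):
                counts[bin_id] += 1
                break
    return sorted(counts.items())


if __name__ == "__main__":
    # Two hypothetical unit-square bins side by side (assumption for the demo).
    bins = {
        1: [(0, 0), (1, 0), (1, 1), (0, 1)],
        2: [(1, 0), (2, 0), (2, 1), (1, 1)],
    }
    points = [(0.2, 0.5), (0.7, 0.1), (1.5, 0.5), (5.0, 5.0)]
    for bin_id, count in bin_points(bins, points):
        print(bin_id, count)
```

The scan over every bin for every point is acceptable for a sketch; with many bins, a spatial index would be the usual way to narrow the candidates.
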
SparkToolbox is a geoprocessing toolbox extension to ArcGIS for Desktop and has two tools, HexTool and SparkTool.
- HexTool is a geoprocessing tool that generates a polygon feature class in which each polygon is shaped as a hexagon. When invoked, the tool prompts the user for the hexagon width in map units and, based on the current map viewing extent, generates a honeycomb-style set of polygons that fills the map.
- SparkTool is another geoprocessing tool and acts as the Spark driver that executes the above SparkApp on a Spark cluster. When invoked, it prompts the user for the location of the Spark cluster, the feature class that contains the bins, and the HDFS location of the point data to bin. The result is a table with two columns: the first is the feature id of the bin, and the second is the total number of HDFS data points covered by the bin area.
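
As a rough illustration of the kind of output HexTool produces, the Python sketch below builds a honeycomb of flat-topped hexagons over a rectangular extent. It is not the HexTool implementation: the function names `hexagon` and `hex_grid`, the reading of "width" as the corner-to-corner distance, and the sequential feature ids are all assumptions made for the example.

```python
import math


def hexagon(cx, cy, width):
    """Six vertices of a flat-topped hexagon centred at (cx, cy).

    width is taken to be the corner-to-corner distance (an assumption).
    """
    r = width / 2.0
    return [
        (cx + r * math.cos(math.radians(60 * i)),
         cy + r * math.sin(math.radians(60 * i)))
        for i in range(6)
    ]


def hex_grid(xmin, ymin, xmax, ymax, width):
    """Honeycomb of hexagons over the extent, keyed by a sequential feature id."""
    r = width / 2.0
    dx = 1.5 * r               # horizontal distance between column centres
    dy = math.sqrt(3.0) * r    # vertical distance between row centres
    cells = {}
    fid = 0
    col = 0
    x = xmin
    while x <= xmax + r:
        # Odd columns are shifted up by half a row so the hexagons interlock.
        y = ymin + (dy / 2.0 if col % 2 else 0.0)
        while y <= ymax + dy / 2.0:
            cells[fid] = hexagon(x, y, width)
            fid += 1
            y += dy
        x += dx
        col += 1
    return cells


if __name__ == "__main__":
    # Hypothetical extent and hexagon width in map units.
    grid = hex_grid(0.0, 0.0, 10.0, 10.0, 2.0)
    print(len(grid), "hexagons generated")
    print("first hexagon:", grid[0])
```

Each entry of the returned dictionary corresponds to one polygon feature; in the toolbox these would be written to a feature class rather than kept in memory.
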

[Screenshot of the SparkTool]

Building Spark

I use the Cloudera Distribution of Hadoop for all my BigData Hadoop work, so I downloaded the 0.8.1-CDH4 tarball and uncompressed it onto a local shared drive. I placed it on a shared drive because I need it both for development and for distribution on my local Hadoop cluster.

I had to modify the top level pom.xml to explicitly reference a specific version of Hadoop:

<hadoop.version>2.0.0-mr1-cdh4.4.0</hadoop.version>

I added a reference to the Cloudera Maven repository:

<repository>
  <id>cloudera-repo</id>
  <name>Cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

and compiled it so that it is installed in my local Maven repo:

mvn -DskipTests install

so that I can reference it in my project pom.xml as:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.9.3</artifactId>
    <version>0.8.1-incubating</version>
</dependency>

Note to self - Dive into Scala and SBT !!

Getting Started

I highly recommend that you download the Cloudera Quick Start VM. Once started, for this project you only need ZooKeeper and HDFS up and running.

Copy the zip.zip file from the data folder into the VM. Unzip the file and put the resulting zip.txt file into HDFS:

$ hadoop fs -put zip.txt zip.txt

BTW - this file contains the centroid locations of all the ZIP codes in the United States.

Finally, copy the Spark distribution into the VM and start the Spark master and slave. Check out the Spark standalone mode documentation for more details.