Invoking Apache Spark from ArcGIS for Desktop

This project contains two modules, SparkApp and SparkToolbox.

Building Spark

I use the Cloudera Distribution of Hadoop for all my Big Data work, so I downloaded the 0.8.1-CDH4 tarball and uncompressed it onto a shared local drive. I placed it on a shared drive because I need it both for development and for distribution to my local Hadoop cluster.

I had to modify the top-level pom.xml to explicitly reference a specific version of Hadoop:

<hadoop.version>2.0.0-mr1-cdh4.4.0</hadoop.version>

I also added a reference to the Cloudera Maven repository:

<repository>
  <id>cloudera-repo</id>
  <name>Cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

and compiled it so that it is installed in my local Maven repository:

mvn -DskipTests install

so that I can reference it in my project pom.xml as:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.9.3</artifactId>
    <version>0.8.1-incubating</version>
</dependency>
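
To confirm that the install step put the artifact where this dependency will be resolved from, a sketch like the one below may help. It assumes the default local repository location of ~/.m2/repository (a custom settings.xml can point elsewhere); the helper name m2_dir and the M2_REPO override are my own, purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the local Maven repository directory for a
# groupId / artifactId / version triple. Assumes ~/.m2/repository unless
# M2_REPO is set.
m2_dir() {
  # Maven stores the groupId as nested directories: dots become slashes.
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf '%s/%s/%s/%s\n' "${M2_REPO:-$HOME/.m2/repository}" "$group_path" "$2" "$3"
}

# Coordinates taken from the dependency above.
dir=$(m2_dir org.apache.spark spark-core_2.9.3 0.8.1-incubating)
if [ -d "$dir" ]; then
  ls "$dir"
else
  echo "not installed yet: $dir"
fi
```

If the directory is missing, the project build will not find the artifact locally and Maven will try the remote repositories instead.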

Note to self: dive into Scala and SBT!

Getting Started

I highly recommend that you download the Cloudera Quick Start VM. Once it has started, this project only needs ZooKeeper and HDFS up and running.

Copy the zip.zip file from the data folder into the VM. Unzip it and put the resulting zip.txt file into HDFS:

$ hadoop fs -put zip.txt zip.txt

BTW, this file contains the centroid locations of all the ZIP codes in the United States.
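
To check that the upload worked, something like the following sketch should do. It assumes the hadoop command is on the PATH inside the VM; the function name check_zip and the HADOOP override variable are my own, for illustration only. Because the target path in the put command is relative, the file ends up in the current user's HDFS home directory.

```shell
#!/bin/sh
# HADOOP is a hypothetical override; it defaults to the hadoop CLI on the PATH.
HADOOP=${HADOOP:-hadoop}

# List the uploaded file, then print its first few lines.
check_zip() {
  "$HADOOP" fs -ls zip.txt &&
    "$HADOOP" fs -cat zip.txt | head -n 5
}
```

Run check_zip after the put. Setting HADOOP=echo first gives a dry run that only prints the arguments that would be passed.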

Finally, copy the Spark distribution into the VM, and start the Spark master and slave. Check out the standalone mode documentation for more details.
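
As a rough sketch of that last step, the helpers below wrap the standalone scripts. They rest on several assumptions: that SPARK_HOME points at the unpacked distribution (the default path here is made up), that the scripts follow the 0.8.x layout as I recall it (start-master.sh under bin/ and spark-class at the top level; later releases moved them), and that the master listens on its default port 7077. The function names are mine.

```shell
#!/bin/sh
# Assumed locations; adjust to your VM.
SPARK_HOME=${SPARK_HOME:-$HOME/spark}   # hypothetical install path
MASTER_HOST=${MASTER_HOST:-localhost}
MASTER_PORT=${MASTER_PORT:-7077}        # default standalone master port

# Build the spark:// URL that workers use to register with the master.
master_url() {
  printf 'spark://%s:%s\n' "$MASTER_HOST" "$MASTER_PORT"
}

# Start the master daemon (0.8.x layout assumed).
start_master() {
  "$SPARK_HOME/bin/start-master.sh"
}

# Start one worker in the foreground and register it with the master.
start_worker() {
  "$SPARK_HOME/spark-class" org.apache.spark.deploy.worker.Worker "$(master_url)"
}
```

Call start_master first, then start_worker in a second terminal. The master web UI (port 8080 by default) should then list the worker.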