<p align="center"><img src="./images/logo.png" alt="icon"></p>

See original announcements on:

For more information, see the gallia-core documentation, in particular:

Description

This is the Spark RDD-powered counterpart to the genemania parent repo, which uses Gallia's "poor man scaling" instead of Spark.

Test Run

You can test it by running the ./testrun.sh script at the root of the repo, provided you are set up with aws-cli and don't mind the cost (see below).

The script does the following:

To run it on a small subset (expect ~$3<sup>[2]</sup> in AWS charges), use:

./testrun.sh 10 4 # process first 10 files, using 4 workers

To run it in full (expect ~$18<sup>[2]</sup> in AWS charges), use:

./testrun.sh ALL <number-of-workers> # e.g. 60 workers

The full EMR run will take about 120 minutes with 60 workers<sup>[1]</sup>. As one would expect, it follows the distribution below:

[Figure: distribution]

Input

Same input as the parent repo, except that it is uploaded to an S3 bucket first: s3://<bucket>/input/
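If the input needs to be staged by hand, the upload could look roughly like the sketch below. The helper names `input_uri` and `print_upload_command` are hypothetical and not part of the repo, and the local directory and bucket name are assumptions. The sketch only prints the `aws s3 cp` command so that it can be reviewed before anything is transferred.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): build the S3 URI that the text
# gives for input files, from a bucket name.
input_uri() {
  printf 's3://%s/input/\n' "$1"
}

# Hypothetical helper: print the upload command instead of running it.
# $1 = local directory holding the input files (assumed), $2 = bucket name.
print_upload_command() {
  src_dir="$1"
  bucket="$2"
  printf 'aws s3 cp %s %s --recursive\n' "$src_dir" "$(input_uri "$bucket")"
}
```

Running `print_upload_command ./input my-bucket` prints the command for a bucket assumed to be named `my-bucket`; it can then be pasted into a shell once it looks right.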

Output

Same output as the parent repo, except that it is made available on the S3 bucket as s3://<bucket>/output/part-NNNNN.gz files.
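Because the output is split across several gzipped part files, it may be convenient to merge them after downloading. The following is a minimal sketch, not something the repo provides: `merge_parts` is a hypothetical helper, and it assumes the part files have already been copied to a local directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): decompress every part-*.gz file
# in a local directory and write the contents, in glob order, to stdout.
merge_parts() {
  dir="$1"
  for part in "$dir"/part-*.gz; do
    # Skip the unexpanded pattern when the directory holds no part files.
    [ -e "$part" ] || continue
    gzip -dc "$part"
  done
}

# Assumed usage; <bucket> is the placeholder from the text, left as-is:
#   aws s3 cp "s3://<bucket>/output/" ./output --recursive
#   merge_parts ./output > merged.txt
```

The zero-padded part numbers mean that the shell's glob ordering matches the numeric order of the parts.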

Limitations

Notable limitations are:

See the list of Spark-related tasks for more limitations.

Footnotes

Contact

You may contact the author at cros.anthony@gmail.com