<p align="center"><img src="./images/logo.png" alt="icon"></p>

See original announcements on:
- Spark mailing list
- GeneMania google group
- BioStars
For more information, see the gallia-core documentation.
Description
This is the Spark RDD-powered counterpart to the genemania parent repo, which uses Gallia's "poor man's scaling" instead of Spark.
Test Run
You can test it by running the ./testrun.sh script at the root of the repo, provided that aws-cli is set up and that you don't mind the cost (see below).
The script does the following:
- Creates an S3 bucket for the code and data
- Retrieves code and uploads it to the bucket (source+binaries)
- Retrieves the data (or a subset thereof) and uploads it to the bucket
- Creates an EMR Spark cluster and runs the program as a single step
- Waits for the cluster to terminate and logs the results
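The steps above can be sketched with aws-cli roughly as follows. This is a hypothetical outline, not the actual testrun.sh: the bucket name, jar path, main class, EMR release label and instance type are all assumptions, and the code and data are assumed to have been retrieved locally already. It defaults to a dry run that only prints the commands, so trying it incurs no AWS charges.

```shell
#!/bin/sh
# Hypothetical sketch of the steps listed above; this is NOT the actual testrun.sh.
# Bucket name, jar path, main class, release label and instance type are assumptions.
set -eu

BUCKET="${BUCKET:-my-genemania-test-bucket}"  # assumed bucket name
WORKERS="${WORKERS:-4}"                        # number of worker nodes
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print commands only (no AWS charges)

# Print the command on stderr in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*" >&2
  else
    "$@"
  fi
}

# Build the --steps shorthand for a single Spark step (class name "Main" is hypothetical).
spark_step() {
  echo "Type=Spark,Name=genemania,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,Main,s3://$BUCKET/code/app.jar,s3://$BUCKET/input/,s3://$BUCKET/output/]"
}

main() {
  # 1. Create an S3 bucket for the code and data
  run aws s3 mb "s3://$BUCKET"

  # 2. Upload the code (assumes ./app.jar was already built or downloaded)
  run aws s3 cp ./app.jar "s3://$BUCKET/code/app.jar"

  # 3. Upload the input data (assumes it was already retrieved into ./input)
  run aws s3 cp ./input "s3://$BUCKET/input/" --recursive

  # 4. Create an EMR Spark cluster that runs the program as a single step;
  #    the instance count is the workers plus one master node
  cluster_id="$(run aws emr create-cluster \
    --name genemania-spark-test \
    --release-label emr-6.15.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count "$((WORKERS + 1))" \
    --use-default-roles \
    --auto-terminate \
    --steps "$(spark_step)" \
    --query ClusterId --output text)"
  cluster_id="${cluster_id:-j-DRYRUN}"  # placeholder id when nothing was executed

  # 5. Wait for termination, then log the final cluster state
  run aws emr wait cluster-terminated --cluster-id "$cluster_id"
  run aws emr describe-cluster --cluster-id "$cluster_id" \
    --query Cluster.Status.State --output text
}

main
```

Setting DRY_RUN=0 would execute the commands for real. For long runs, the aws-cli waiter may give up before the cluster terminates, in which case the wait would need to be retried in a loop.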
To run it on a small subset (expect ~$3<sup>[2]</sup> in AWS charges), use:
./testrun.sh 10 4 # process first 10 files, using 4 workers
To run it in full (expect ~$18<sup>[2]</sup> in AWS charges), use:
./testrun.sh ALL <number-of-workers> # e.g. 60 workers
The full EMR run will take about 120 minutes with 60 workers<sup>[1]</sup>. As one would expect, it follows the distribution below:
Input
The same input as the parent repo, except that it is first uploaded to an S3 bucket: s3://<bucket>/input/
Output
The same output as the parent repo, except that it is made available in the S3 bucket as s3://<bucket>/output/part-NNNNN.gz files.
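Since the output is split across several gzipped part files, one way to inspect it is to download the parts and merge them into a single stream. The helper names below (fetch_parts, merge_parts) are hypothetical and not part of this repo; this is only a minimal sketch that assumes aws-cli and gzip are available.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this repo) to fetch and merge the part files.
set -eu

# Download only the part files from the bucket into a local directory.
# usage: fetch_parts <bucket> <local-dir>
fetch_parts() {
  aws s3 cp "s3://$1/output/" "$2" --recursive --exclude "*" --include "part-*.gz"
}

# Decompress all part files in name order and concatenate them on stdout.
# usage: merge_parts <local-dir>
merge_parts() {
  gzip -dc "$1"/part-*.gz
}
```

For example, `fetch_parts <bucket> ./output` followed by `merge_parts ./output | head` would show the first few lines of the combined output.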
Limitations
Notable limitations are:
- Only available for Scala 2.12 because:
  - sbt-assembly does not seem to be available for 2.13
  - Spark support for 2.13 is still immature
- The I/O abstractions need to be aligned with the core's; they are somewhat hacky at the moment:
  - gallia-core's io.in mechanisms (fluency, actions and atoms) vs gallia-spark's
  - gallia-core's io.out mechanisms (fluency, actions and atoms) vs gallia-spark's
See the list of Spark-related tasks for more limitations.
Footnotes
- <sup>[1]</sup> <a name="number-of-workers"></a>Plus roughly 1 hour to accumulate the input data and upload it to the S3 bucket (using a 5-second courtesy delay between requests)
- <sup>[2]</sup> <a name="cost-estimate"></a>The cost estimates provided are not guaranteed at all; run it at your own risk (but please let me know if your costs are significantly different)
Contact
You may contact the author at cros.anthony@gmail.com