SparkTDA

The scalable topological data analysis package for Apache Spark. This project aims to implement the following features:

Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves
Scalable Mapper Implementation
Scalable Multiscale Mapper Implementation
Scalable Tower Computation for Multiscale Mapper
Scalable Persistent Homology Computation on Top of Apache Spark

To learn how to use the features listed above, or to read more about their implementation details, please follow the links.

Status

WIP and EXPERIMENTAL. This package is still a proof of concept of scalable topological data analysis support for Apache Spark, so it is not ready for production use.

Examples

Mapper

2-skeletons of Reeb Diagram of MNIST (40 intervals on the 1st principal component with 50% overlap)	2-skeletons of Reeb Diagram of MNIST (20 intervals on the 1st principal component with 50% overlap)
60k images clustered in 784 dimensions without any projection loss	60k images clustered in 784 dimensions without any projection loss
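The captions above describe a cover of the filter range (here, the first principal component) by equal-length intervals with 50% overlap. As a rough illustration of what such a cover looks like, here is a minimal plain-Scala sketch. The names `CoverSketch`, `Interval`, `uniformCover` and `memberships` are hypothetical and are not part of this package's API; the package may construct its cover differently.

```scala
// Illustrative sketch only: these names are assumptions, not SparkTDA's API.
object CoverSketch {
  final case class Interval(lower: Double, upper: Double) {
    def contains(x: Double): Boolean = x >= lower && x <= upper
  }

  /** Covers [min, max] with `n` equal-length intervals such that each
    * consecutive pair shares a fraction `overlap` of its length. */
  def uniformCover(min: Double, max: Double, n: Int, overlap: Double): Seq[Interval] = {
    require(n > 0, "need at least one interval")
    require(max > min, "max must be greater than min")
    require(overlap >= 0.0 && overlap < 1.0, "overlap must be in [0, 1)")
    // n intervals of length len, each shifted by step = len * (1 - overlap),
    // must span the whole range: (n - 1) * step + len = max - min.
    val len = (max - min) / (1.0 + (n - 1) * (1.0 - overlap))
    val step = len * (1.0 - overlap)
    (0 until n).map { i =>
      val lower = min + i * step
      Interval(lower, lower + len)
    }
  }

  /** Indices of all intervals that contain the filter value `x`. */
  def memberships(cover: Seq[Interval], x: Double): Seq[Int] =
    cover.zipWithIndex.collect { case (iv, i) if iv.contains(x) => i }
}
```

With 50% overlap, a filter value away from the two ends of the range falls into two intervals, which is what lets clusters from neighbouring intervals share points and thus be connected in the resulting diagram. For example, `CoverSketch.uniformCover(0.0, 10.0, 4, 0.5)` yields the intervals [0, 4], [2, 6], [4, 8] and [6, 10].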

Requirements

This library requires Spark 2.0+

Building and Running Unit Tests

To compile this project, run sbt package from the project home directory. This will also run the Scala unit tests. To run the unit tests, run sbt test from the project home directory.

This project uses the sbt-spark-package plugin, which provides the spPublish and spPublishLocal tasks. We recommend using this library with Apache Spark by supplying a comma-delimited list of Maven coordinates to --packages, so that the package and its dependencies are downloaded from the local repository or the official Spark Packages repository.

The package can be published locally with:

$ sbt spPublishLocal

The package can be published to Spark Packages with (requires authentication and authorization):

$ sbt spPublish

Using with Spark Shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the Spark shell:

$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11

Future Work

Mapper

Write Wiki
Implement Python APIs
Publish to Spark Packages
Benchmark
Consider using GraphFrames instead of plain GraphX
Implement some useful filter functions, e.g., Gaussian density, graph Laplacian, etc., as transformers
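
To make the last item concrete, here is a minimal plain-Scala sketch of a Gaussian density filter, which assigns each point the sum of Gaussian kernel values to all points in the data set. It is not part of this package: `GaussianDensitySketch`, `density` and the bandwidth parameter `sigma` are hypothetical names, the value is left unnormalized, and the quadratic all-pairs loop is for illustration only and would not scale as a Spark transformer without further work.

```scala
// Illustrative sketch only: not SparkTDA's API and not a scalable implementation.
object GaussianDensitySketch {
  type Point = Array[Double]

  private def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /** Unnormalized Gaussian density for every point x:
    * f(x) = sum over y of exp(-||x - y||^2 / (2 * sigma^2)). */
  def density(points: Seq[Point], sigma: Double): Seq[Double] = {
    require(sigma > 0.0, "sigma must be positive")
    val denom = 2.0 * sigma * sigma
    points.map { x =>
      points.map(y => math.exp(-squaredDistance(x, y) / denom)).sum
    }
  }
}
```

Points in dense regions receive larger filter values than isolated points, so a cover over this filter separates the data by local density.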

Related Software & Projects

References

Mapper

KNN/ANN/SNN

LSH

M. S. Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms. 34th STOC.