Awesome
SparkTDA
The scalable topological data analysis package for Apache Spark. This project aims to implement the following features:
- Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves
- Scalable Mapper Implementation
- Scalable Multiscale Mapper Implementation
- Scalable Tower Computation for Multiscale Mapper
- Scalable Persistent Homology Computation on Top of Apache Spark
If you would like to know how to use and/or learn more the implementation details of the above mentioned features, please follow the links.
Status
WIP and EXPERIMENTAL. This package is still a proof-of-concept of scalable topological data analysis support for Apache Spark, hence you cannot expect that this package is ready for production use.
Examples
Mapper
2-skeltons of Reeb Diagram of MNIST (40 intervals on the 1st primcipal component with 50% overlap) | 2-skeltons of Reeb Diagram of MNIST (20 intervals on the 1st primcipal component with 50% overlap) |
---|---|
60k images clustered in 784 dimensions without any projection loss | 60k images clustered in 784 dimensions witout any projection loss |
Requirements
This library requires Spark 2.0+
Building and Running Unit Tests
To compile this project, run sbt package
from the project home directory. This will also run the Scala unit tests.
To run the unit tests, run sbt test
from the project home directory. This project uses the
sbt-spark-package plugin, which provides the 'spPublish' and
'spPublishLocal' task. We recommend users to use this library with Apache Spark including the dependencies by
supplying a comma-delimited list of Maven coordinates with --packages
and download the package from the locally
repository or official Spark Packages repository.
The package can be published locally with:
$ sbt spPublishLocal
The package can be published to Spark Packages with (requires authentication and authorization):
$ sbt spPublish
Using with Spark Shell
This package can be added to Spark using the --packages
command line option. For example, to include it when starting
the spark shell:
$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11
Future Works
Mapper
- Write Wiki
- Implement Python APIs
- Publish to Spark Packages
- Benchmark
- Consider using GraphFrames instead of plain GraphX
- Implement some useful filter functions, e.g., Gaussian Density, Graph Laplacian, etc as transformers
Related Softwares & Projects
References
Mapper
- G. Singh, F. Memoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007.
- J. Curry (2013). Sheaves, Cosheaves and Applications, arXiv 2013
- T. K. Dey, F. Memoli, Y. Wang (2015), Mutiscale Mapper: A Framework for Topological Summarization of Data and Maps, arXiv 2015
- E. Munch, B. Wang (2015). Convergence between Categorical Representations of Reeb Space and Mapper, arXiv 2015
- E. Munch, B. Wang (2015). Reeb Space Approximation with Guarantees, The 25th Fall Workshop on Computational Geometry 2015.
- H. E. Kim (2015). Evaluating Ayasdi's Topological Data Analysis for Big Data, Master Thesis, Goethe University Frankfurt 2015.
KNN/ANN/SNN
- L. Ting, et al (2004). An investigation of practical approximate nearest neighbor algorithms, Advances in neural information processing systems. 2004.
- L. Ting, C. Rosenberg, H. Rowley (2007). Clustering billions of images with large scale nearest neighbor search. Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on. IEEE, 2007.
- D. Ravichandran, P. Pantel, E. Hovy (2005). Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering, ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics pp 622-629
- M. Steinbach, L. Ertoez, V. Kumar (2004). The Challenges of Clustering High Dimensional Data, New Directions in Statistical Physics, pp 273-309
- L. Ertoez, M. Steinbach, Vipin Kumar (2003). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, Proceedings of the Third SIAM International Conference on Data Mining, 2003.
- M. E. Houle, H. P. Kriegel, P. Kroeger, E. S. A. Zimek (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?, Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, 2010.