Awesome
Introduction to FastPG
Fast phenograph-like clustering of millions of items with scores of features.
See our preprint article introducing FastPG:
Thomas Bodenheimer, Mahantesh Halappanavar, Stuart Jefferys, Ryan Gibson, Siyao Liu, Peter J Mucha, Natalie Stanley, Joel S Parker, Sara R Selitsky. FastPG: Fast clustering of millions of single cells. bioRxiv 2020.06.19.159749; doi: https://doi.org/10.1101/2020.06.19.159749
License
This package is licensed under the MIT license, except for the Grappolo C++ library by Mahantesh Halappanavar (hala@pnnl.gov) which is included and licensed under the BSD-3 license as described in the headings of the relevant *.cpp files. Some alterations to the original Grappolo source have been made to support installation and use within an R package. The original Grappolo C++ library is available at https://hpc.pnl.gov/people/hala/grappolo.html
Installation
You can install locally from GitHub, or use the Docker container from DockerHub:
Local install
Before installing, you must have R, of course, but also:
-
If you are on Windows you will need the version of Rtools appropriate for the version of R (and need to set the path correctly), see https://cran.r-project.org/bin/windows/Rtools/. If you upgraded from R 3 to R 4 and have not already done so, you should uninstall the old and install the new Rtools.
-
If you are on Mac OS X, you will need the x-code command line tools and probably need to install the OpenMP library, which Apple no longer distributes as part of x-code. One way to do that is to install the data.table package as described here: https://firas.io/post/data.table_openmp/. Afterwards, FastPG and any other programs that use OpenMP should work.
You can install the FastPG
package from GitHub using the Bioconductor installation manager. If you don't have Bioconductor you need to install the BiocManager
and remotes
packages from CRAN. Then you can install using:
# Requires the CRAN packages "remotes" and "BiocManager" to be installed.
BiocManager::install("sararselitsky/FastPG")
To install the latest code from a branch or a specific tagged version, append "@<source>" to the repository string used above, e.g BiocManager::install("sararselitsky/FastPG@dev")
or BiocManager::install("sararselitsky/FastPG@0.0.8")
Use a pre-built Docker container
Another way to use FastPG is via a Docker container, jefferys/fastpg:latest
, that is already set up and has the package pre-installed. You can just run it interactively, using the R command line inside it. To use the pre-built FastPG container from DockerHub you should be running on a 64 bit Intel/AMD compatible machine.
docker pull jefferys/fastpg:latest
docker run -it --rm -v $PWD:$PWD -w $PWD jefferys/fastpg:latest
R
# or (singularity < 3.0)
singularity pull docker://jefferys/fastpg:latest
singularity shell -B $PWD --pwd $PWD -C fastpg-latest.simg
R
# or (singularity 3.0+)
singularity pull docker://jefferys/fastpg:latest
singularity shell -B $PWD --pwd $PWD -C fastpg_latest.sif
Note that you should consider this container version like an "application" and not an environment. You may have problems installing additional packages into it. To do that, see Extending the Docker container.
Building your own Docker container
If you want to build your own container instead of pulling a pre-build one, you can use the Dockerfile
included in the repository in the Docker/
directory as a guide. The build.sh
file in the same directory automatically builds and tags the container with the name and tags used by the pre-built container at DockerHub, you should change the tags by editing the parameters in the build.sh
file, or by manually building it and tagging it yourself.
git clone --single-branch https://github.com/sararselitsky/FastPG.git
cd FastPG/Docker
# Edit Docker tags in build.sh for your use
./build.sh --no-cache
You can just use this with Docker as above, but you can't just use a local Docker container with singularity versions less than 3.0. You have to push the container to some Docker registry before you can run it. With 3.0+ you can pull a local container directly by using docker-daemon://
instead of docker://
to get a local container.
Extending the Docker container
If you want to add additional things to the container, you should build your own. You can extend the existing container either by editing the provided Dockerfile
or by using the existing container as the FROM
that your own Docker container is based on.
Clustering with FastPG
Clustering is as simple as:
clusters <- FastPG::fastCluster( data, k, num_threads )
The fastCluster()
function takes a number of additional tuning parameters, but
those have reasonable defaults.
The data
parameter
The main input is the numeric data to cluster as a matrix, where rows are elements to cluster and columns are the features that make elements similar or different. Any data matrix will do, for this example we extract a 265,627 x 32 numeric data matrix from the GitHub-published clustering benchmark mass cytometry data set, sourced from the phenograph paper [@Levine-2015].
url <- "https://github.com/lmweber/benchmark-data-Levine-32-dim/raw/master/data/Levine_32dim.fcs"
file <- "Levine_32dim.fcs"
download.file( url, file, mode="wb") # This downloads a 41.5 MB binary file
dataColumns <- c( 5:36 ) # extract only the data columns, whatever they are
data <- flowCore::exprs( flowCore::read.FCS( file, truncate_max_range= FALSE ))
data <- data[ , dataColumns ]
The k
parameter
To cluster a data set, a local neighborhood size needs to be specified as a parameter.
k <- 30
The num_threads
parameter
The number of cpus to use should be specified, it defaults to 1. However, it will only limit the k nearest neighbors part of the clustering (see Internal algorithm). The rest of the clustering will use all available cpus regardless of what this is set to.
num_threads <- 4
Results
fastCluster()
returns a list with two elements
$modularity
= The modularity of the network created from the overlapping nearest neighbor graphs.$communities
= An integer vector where the nth element is the nth element from the input data. The value is the cluster that each input element has been assigned to.table(clusters$communities)
shows a count by cluster id. Note that these id's are arbitrary and not deterministic.
Caution: -1 indicates a point that was not clustered; each can be considered their own cluster, even though they are all labeled “-1”. It is probable that you will not have any singleton clusters, but they can occur.
Internal algorithm
FastPG utilizes the same three main steps as the phenograph algorithm [@Levine-2015, @Chen-2016], but uses fast, parallel implementations.
- The k nearest neighbors determining step is implemented using hierarchical navigable small world graphs [@Malkov-2016] via the RcppHNSW library.
- The nearest-neighbor distances are generated using an included parallel Jaccard metric function.
- Clustering is implemented as community detection in the graph formed from the overlapping "k best friends" for each element. This is done using a parallel Louvain algorithm as implemented by Grappolo [@Lu-2015]. Code for Grappolo has been included within this package; the standalone application with additional functionality is available for download as described in the license section above.
Calling fastCluster()
is equivalent to and is essentially implemented as the following sequence of commands:
# Approximate k nearest neighbors
all_knn <- RcppHNSW::hnsw_knn( data, k= k, distance= 'l2',
n_threads= num_threads )
ind <- all_knn$idx
# Parallel Jaccard metric
links <- FastPG::rcpp_parallel_jce(ind)
links <- FastPG::dedup_links(links)
# Parallel Louvain clustering
clusters <- FastPG::parallel_louvain( links )
Note that RcppHNSW::hnsw_knn() and FastPG::parallel_louvain( links ) have numerous additional parameters; only the parameters used that are different from the defaults are shown above. The FastPG::fastCluster() wrapper allows setting all applicable parameters. See the function documentation for additional details.
References
Chen, Hao, Mai Chan Lau, Michael Thomas Wong, Evan W Newell, Michael Poidinger, and Jinmiao Chen. 2016. “Cytofkit: A Bioconductor Package for an Integrated Mass Cytometry Data Analysis Pipeline.” PLoS Comput Biol 12 (9).
Levine, Jacob H, Erin F Simonds, Sean C Bendall, Kara L Davis, El-ad D Amir, Michelle D Tadmor, Oren Litvin, et al. 2015. “Data-Driven Phenotypic Dissection of Aml Reveals Progenitor-Like Cells That Correlate with Prognosis.” Cell 162 (1): 184–97. https://doi.org/10.1016/j.cell.2015.05.047.
Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman. 2015. “Parallel Heuristics for Scalable Community Detection.” Parallel Computing 47: 19–37. https://doi.org/https://doi.org/10.1016/j.parco.2015.03.003.
Malkov, Yury A., and D. A. Yashunin. 2016. “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” CoRR abs/1603.09320. http://arxiv.org/abs/1603.09320.