Home

Awesome

Introduction to FastPG

Fast phenograph-like clustering of millions of items with scores of features.

See our preprint article introducing FastPG:

Thomas Bodenheimer, Mahantesh Halappanavar, Stuart Jefferys, Ryan Gibson, Siyao Liu, Peter J Mucha, Natalie Stanley, Joel S Parker, Sara R Selitsky. FastPG: Fast clustering of millions of single cells. bioRxiv 2020.06.19.159749; doi: https://doi.org/10.1101/2020.06.19.159749

License

This package is licensed under the MIT license, except for the Grappolo C++ library by Mahantesh Halappanavar (hala@pnnl.gov) which is included and licensed under the BSD-3 license as described in the headings of the relevant *.cpp files. Some alterations to the original Grappolo source have been made to support installation and use within an R package. The original Grappolo C++ library is available at https://hpc.pnl.gov/people/hala/grappolo.html

Installation

You can install locally from GitHub, or use the Docker container from DockerHub:

Local install

Before installing, you must have R, of course, but also:

You can install the FastPG package from GitHub using the Bioconductor installation manager. If you don't have Bioconductor you need to install the BiocManager and remotes packages from CRAN. Then you can install using:

# Requires the CRAN packages "remotes" and "BiocManager" to be installed.
BiocManager::install("sararselitsky/FastPG")

To install the latest code from a branch or a specific tagged version, append "@<source>" to the repository string used above, e.g BiocManager::install("sararselitsky/FastPG@dev") or BiocManager::install("sararselitsky/FastPG@0.0.8")

Use a pre-built Docker container

Another way to use FastPG is via a Docker container, jefferys/fastpg:latest, that is already set up and has the package pre-installed. You can just run it interactively, using the R command line inside it. To use the pre-built FastPG container from DockerHub you should be running on a 64 bit Intel/AMD compatible machine.

docker pull jefferys/fastpg:latest
docker run -it --rm -v $PWD:$PWD -w $PWD jefferys/fastpg:latest
R
   # or (singularity < 3.0)
singularity pull docker://jefferys/fastpg:latest
singularity shell -B $PWD --pwd $PWD -C fastpg-latest.simg
R
   # or (singularity 3.0+)
singularity pull docker://jefferys/fastpg:latest
singularity shell -B $PWD --pwd $PWD -C fastpg_latest.sif

Note that you should consider this container version like an "application" and not an environment. You may have problems installing additional packages into it. To do that, see Extending the Docker container.

Building your own Docker container

If you want to build your own container instead of pulling a pre-build one, you can use the Dockerfile included in the repository in the Docker/ directory as a guide. The build.sh file in the same directory automatically builds and tags the container with the name and tags used by the pre-built container at DockerHub, you should change the tags by editing the parameters in the build.sh file, or by manually building it and tagging it yourself.

git clone --single-branch https://github.com/sararselitsky/FastPG.git
cd FastPG/Docker
# Edit Docker tags in build.sh for your use
./build.sh --no-cache

You can just use this with Docker as above, but you can't just use a local Docker container with singularity versions less than 3.0. You have to push the container to some Docker registry before you can run it. With 3.0+ you can pull a local container directly by using docker-daemon:// instead of docker:// to get a local container.

Extending the Docker container

If you want to add additional things to the container, you should build your own. You can extend the existing container either by editing the provided Dockerfile or by using the existing container as the FROM that your own Docker container is based on.

Clustering with FastPG

Clustering is as simple as:

clusters <- FastPG::fastCluster( data, k, num_threads )

The fastCluster() function takes a number of additional tuning parameters, but those have reasonable defaults.

The data parameter

The main input is the numeric data to cluster as a matrix, where rows are elements to cluster and columns are the features that make elements similar or different. Any data matrix will do, for this example we extract a 265,627 x 32 numeric data matrix from the GitHub-published clustering benchmark mass cytometry data set, sourced from the phenograph paper [@Levine-2015].

url <- "https://github.com/lmweber/benchmark-data-Levine-32-dim/raw/master/data/Levine_32dim.fcs"
file <- "Levine_32dim.fcs"
download.file( url, file, mode="wb") # This downloads a 41.5 MB binary file
dataColumns <- c( 5:36 ) # extract only the data columns, whatever they are
data <-  flowCore::exprs( flowCore::read.FCS( file, truncate_max_range= FALSE ))
data <- data[ , dataColumns ]

The k parameter

To cluster a data set, a local neighborhood size needs to be specified as a parameter.

k <- 30

The num_threads parameter

The number of cpus to use should be specified, it defaults to 1. However, it will only limit the k nearest neighbors part of the clustering (see Internal algorithm). The rest of the clustering will use all available cpus regardless of what this is set to.

num_threads <- 4

Results

fastCluster() returns a list with two elements

Caution: -1 indicates a point that was not clustered; each can be considered their own cluster, even though they are all labeled “-1”. It is probable that you will not have any singleton clusters, but they can occur.

Internal algorithm

FastPG utilizes the same three main steps as the phenograph algorithm [@Levine-2015, @Chen-2016], but uses fast, parallel implementations.

Calling fastCluster() is equivalent to and is essentially implemented as the following sequence of commands:

# Approximate k nearest neighbors
all_knn <- RcppHNSW::hnsw_knn( data, k= k, distance= 'l2',
                               n_threads= num_threads )

ind <- all_knn$idx

# Parallel Jaccard metric
links <- FastPG::rcpp_parallel_jce(ind)
links <- FastPG::dedup_links(links)

# Parallel Louvain clustering
clusters <- FastPG::parallel_louvain( links )

Note that RcppHNSW::hnsw_knn() and FastPG::parallel_louvain( links ) have numerous additional parameters; only the parameters used that are different from the defaults are shown above. The FastPG::fastCluster() wrapper allows setting all applicable parameters. See the function documentation for additional details.

References

Chen, Hao, Mai Chan Lau, Michael Thomas Wong, Evan W Newell, Michael Poidinger, and Jinmiao Chen. 2016. “Cytofkit: A Bioconductor Package for an Integrated Mass Cytometry Data Analysis Pipeline.” PLoS Comput Biol 12 (9).

Levine, Jacob H, Erin F Simonds, Sean C Bendall, Kara L Davis, El-ad D Amir, Michelle D Tadmor, Oren Litvin, et al. 2015. “Data-Driven Phenotypic Dissection of Aml Reveals Progenitor-Like Cells That Correlate with Prognosis.” Cell 162 (1): 184–97. https://doi.org/10.1016/j.cell.2015.05.047.

Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman. 2015. “Parallel Heuristics for Scalable Community Detection.” Parallel Computing 47: 19–37. https://doi.org/https://doi.org/10.1016/j.parco.2015.03.003.

Malkov, Yury A., and D. A. Yashunin. 2016. “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” CoRR abs/1603.09320. http://arxiv.org/abs/1603.09320.