ECLAIR

Robust and scalable inference of cell lineages from gene expression data.

ECLAIR achieves a higher level of confidence in the estimated lineages by using approximation algorithms for consensus clustering and by combining the information from an ensemble of minimum spanning trees into an improved, aggregated lineage tree.

In addition, the present package features several customized algorithms for assessing the similarity between weighted graphs or unrooted trees and for estimating the reproducibility of each edge in a given tree.

How ECLAIR graphs and trees are generated

ECLAIR stands for Ensemble Clustering for Lineage Analysis, Inference and Robustness. It proceeds as follows:

Statistical performance of ECLAIR

To compare two lineage trees, one has to take into account not only their edge connections but also the sample contents of their nodes, since the variation associated with subsampling results in different clusters of samples. Although there are many papers on graph matching and graph comparison, we are not aware of any previously published method that takes the node differences into account. We therefore developed customized statistical tests suitable for comparing lineage trees.

The ECLAIR package includes a module entirely devoted to computing these statistical measures, along with a few additional tests on pairs of ECLAIR trees, using data structures and algorithms suited to the task.

Installation

ECLAIR is written in Python 2.7. It has been tested on Fedora Linux and on Ubuntu and should be supported by any other member of the UNIX-like family of operating systems.

Install ECLAIR by sending a request to the Python Package Index (PyPI) as follows:

Any missing or out-of-date dependency should be automatically resolved. Apart from the Python Standard Library, those include:

Please note that, as part of the installation of this package, some C code belonging to the Cluster_Ensembles package will be automatically compiled according to the specifications of your machine. For this step to succeed, CMake and GNU make must be available on your operating system. Cluster_Ensembles also requires the 32-bit version of the GNU C library. Please refer to the Cluster_Ensembles documentation for more information on how to meet these requirements on your Linux distribution.

Usage

To subject a dataset to an ECLAIR analysis:

To launch a full-fledged statistical performance analysis of ECLAIR and see how it consistently performs better than SPADE, a popular method for estimating cell lineages, run the ECLAIR_performance command from a terminal.

The eponymous folder ECLAIR_performance will be created in your current directory, recording on the fly the results of various statistical tests and comparisons of ECLAIR graphs and trees, as well as of SPADE trees.

In the current version, the statistical performance of ECLAIR is evaluated only on a fairly large (by the current standards of computational biology) flow cytometry dataset of half a million samples and 8 features, and on a qPCR dataset of mouse bone marrow samples. Anyone competent in Python should be able to peruse the ECLAIR source code and make the few changes required to submit their own data to a similar statistical analysis (those changes mostly pertain to domain-specific knowledge and to the format of the dataset). ECLAIR has been designed to accommodate arbitrarily large datasets, most notably through the use of HDF5 data structures.

Upon sending the ECLAIR_performance command, several "experiments" will be performed, including the comparisons of pairs of ECLAIR graphs or trees and pairs of SPADE trees generated on the same dataset. The comparison of ECLAIR instances and of SPADE instances generated on non-overlapping datasets and evaluated on a separate test set calls for detailed explanations.

We split a dataset into three equally sized, non-overlapping parts, S1, S2 and S3. We train an ECLAIR tree (Ecl_1) and a SPADE tree (Spd_1) on S1. We then train another ECLAIR tree (Ecl_2) and another SPADE tree (Spd_2) on S2.
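The three-way split described above can be sketched as follows. This is only an illustration using the Python standard library: the function name split_three_ways, the random shuffle and the handling of a remainder are assumptions, not the code used by ECLAIR_performance.

```python
import random

def split_three_ways(n_samples, seed=0):
    """Return three equally sized, non-overlapping lists of sample indices.

    Hypothetical helper: shuffling before splitting and dropping the
    remainder (at most 2 samples) are assumptions made for this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    part_size = n_samples // 3
    s1 = indices[:part_size]
    s2 = indices[part_size:2 * part_size]
    s3 = indices[2 * part_size:3 * part_size]
    return s1, s2, s3

# Example: indices of 3000 samples split into S1, S2 and S3.
s1, s2, s3 = split_three_ways(3000)
```

Each part can then play the role of a training set for one tree, or of the held-out test set.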

The training procedure for Ecl_1 involves 50 runs of downsampling and clustering of the samples within S1. The downsampling ratio is set at 50%. Therefore, Ecl_1 is an aggregation of 50 trees, all generated from S1 alone.
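As a rough sketch of the ensemble construction, the snippet below draws 50 subsamples of S1, each keeping 50% of its samples. Uniform sampling without replacement is an assumption made for illustration; the sampling scheme actually used by ECLAIR may differ, and the clustering and tree-building steps applied to each subsample are omitted. The name downsampled_runs is hypothetical.

```python
import random

N_RUNS = 50               # number of downsampling and clustering runs
DOWNSAMPLING_RATIO = 0.5  # fraction of S1 kept in each run

def downsampled_runs(sample_ids, n_runs=N_RUNS, ratio=DOWNSAMPLING_RATIO, seed=0):
    """Return n_runs subsamples of sample_ids, each drawn without replacement.

    Assumption: uniform sampling stands in for whatever scheme ECLAIR uses.
    """
    rng = random.Random(seed)
    n_keep = int(len(sample_ids) * ratio)
    return [rng.sample(sample_ids, n_keep) for _ in range(n_runs)]

# Stand-in for the sample indices of S1.
s1 = list(range(1000))
runs = downsampled_runs(s1)
```

Each of the 50 subsamples would yield one tree, and the 50 trees would then be aggregated into Ecl_1.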

In order to compare Ecl_1 with Ecl_2, the cells in S3 are mapped to the clusters/nodes in Ecl_1 and in Ecl_2 to which they are nearest in the high-dimensional gene expression space.
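A minimal sketch of this mapping step is given below. It assumes that each node is summarized by a centroid in expression space and that "nearest" means smallest Euclidean distance; both choices, as well as the function names and the toy coordinates, are illustrative assumptions rather than the package's implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_to_nearest_node(cells, node_centroids):
    """For each cell, return the index of the closest node centroid."""
    assignments = []
    for cell in cells:
        distances = [squared_distance(cell, c) for c in node_centroids]
        assignments.append(distances.index(min(distances)))
    return assignments

# Toy two-dimensional example (hypothetical values).
ecl_1_nodes = [(0.0, 0.0), (10.0, 10.0)]
ecl_2_nodes = [(0.0, 10.0), (10.0, 0.0), (5.0, 5.0)]
s3_cells = [(1.0, 1.0), (9.0, 8.0), (2.0, 9.0)]

in_ecl_1 = assign_to_nearest_node(s3_cells, ecl_1_nodes)  # [0, 1, 1]
in_ecl_2 = assign_to_nearest_node(s3_cells, ecl_2_nodes)  # [2, 2, 0]
```

Once every test cell carries a node label in both trees, the two trees can be compared on a common set of samples even though their nodes were built from different training sets.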

The same mapping is applied when comparing Spd_1 and Spd_2.

The procedure outlined above is repeated 10 times. We end up with two lists of 30 correlation coefficients, one for ECLAIR and one for SPADE, each coefficient measuring the similarity of a pair of trees. Indeed, while the procedure has been described as involving only the evaluation of Ecl_1 and Ecl_2 using S3 as a test set, one can also generate an ECLAIR tree using S3 as a training set. This allows the additional comparisons of Ecl_1 with Ecl_3 and of Ecl_2 with Ecl_3, so that each repetition yields three pairwise comparisons per method, hence 30 coefficients over 10 repetitions.

It also bears pointing out that we use the same test set (S3) for assessing the similarity of pairs of ECLAIR trees (Ecl_1 vs. Ecl_2) as for assessing the similarity of pairs of SPADE trees (Spd_1 vs. Spd_2).
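The count of 30 coefficients per method follows from 10 repetitions times 3 pairs of trees. The enumeration below makes this explicit. Using the part left out of a pair as its test set is an assumption for the pairs involving Ecl_3, since the text only states the test set for Ecl_1 vs. Ecl_2.

```python
from itertools import combinations

N_REPEATS = 10
parts = ["S1", "S2", "S3"]

comparisons = []
for repeat in range(N_REPEATS):
    for part_a, part_b in combinations(parts, 2):
        # Assumption: the part not used to train either tree is the test set.
        test_part = [p for p in parts if p not in (part_a, part_b)][0]
        comparisons.append((repeat, part_a, part_b, test_part))

n_coefficients = len(comparisons)  # 10 repetitions x 3 pairs = 30
```

The same enumeration applies to the SPADE trees, which gives the second list of 30 coefficients.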

References