Clustering benchmark data: 13-dimensional data set from Levine et al. (2015)

This repository contains R code to prepare the benchmark data set Levine_13dim, which can be used to test clustering algorithms.

The data set is a 13-dimensional mass cytometry (CyTOF) data set, consisting of protein expression levels for n = 167,044 cells, p = 13 protein markers (dimensions), and k = 24 manually gated cell populations (clusters), from one individual. Cluster labels are available for 49% (81,747) of the cells.

This is a companion repository to benchmark-data-Levine-32-dim, which contains R code to prepare a similar benchmark data set with higher dimensionality (32 dimensions). For background information and further details on both data sets, see that repository.

The data set is sourced from the following paper:

Raw data can be accessed through Cytobank:

If you use these data sets, please cite the paper by Levine et al. (2015).

Background

For background information on mass cytometry (CyTOF), and additional details on the Levine et al. (2015) paper and the benchmark data sets, see the companion repository, benchmark-data-Levine-32-dim.

13-dimensional benchmark data set

Levine et al. (2015) used two benchmark mass cytometry (CyTOF) data sets from healthy samples to demonstrate the performance of the PhenoGraph algorithm.

The 13-dimensional benchmark data set consists of protein expression levels from bone marrow mononuclear cells (BMMCs) from one healthy human individual. (This data set is referred to as "benchmark data set 1" in Levine et al. 2015.)

The data set contains n = 167,044 cells, with a dimensionality of p = 13 surface marker proteins. Manually gated cell population (cluster) labels are provided for k = 24 major immune cell populations. Cluster labels are available for 49% (81,747) of the cells, with the remaining 51% (85,297) labeled as "unassigned". All cells are from a single individual.
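The counts above can be cross-checked, and the unassigned cells set aside before scoring a clustering result, with a few lines of Python. This is a minimal sketch and not part of the repository: the helper name `split_labeled` is hypothetical, and it assumes unassigned cells carry the literal string label "unassigned" in whatever file you load. Check the actual export format before relying on it.

```python
# Figures quoted in the text above.
N_CELLS = 167_044
N_LABELED = 81_747
N_UNASSIGNED = 85_297

# The labeled and unassigned cells should account for every cell.
assert N_LABELED + N_UNASSIGNED == N_CELLS

labeled_pct = round(100 * N_LABELED / N_CELLS)
unassigned_pct = round(100 * N_UNASSIGNED / N_CELLS)


def split_labeled(labels, unassigned="unassigned"):
    """Return (labeled_indices, unassigned_indices) for a sequence of labels.

    Assumption: unassigned cells carry the literal label `unassigned`;
    change the marker if the exported file encodes them differently.
    """
    labeled, rest = [], []
    for i, label in enumerate(labels):
        (rest if label == unassigned else labeled).append(i)
    return labeled, rest


# Toy example with made-up labels (not real data).
toy_labels = ["B_cells", "unassigned", "T_cells", "unassigned", "B_cells"]
labeled_idx, unassigned_idx = split_labeled(toy_labels)
print(labeled_pct, unassigned_pct)
print(labeled_idx, unassigned_idx)
```

Since only about half of the cells have manually gated labels, agreement with the reference populations can only be scored on the labeled subset. The algorithm itself can still be run on all cells.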

The 13 surface markers are: CD45, CD45RA, CD19, CD11b, CD4, CD8, CD34, CD20, CD33, CD123, CD38, CD90, and CD3. All 13 surface markers were used for manual gating. A "DNA * cell length" gating step was also applied to remove platelets. See Levine et al. (2015), Supplemental Experimental Procedures, for more details.
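When loading a table of expression values for clustering, it can help to select the 13 marker columns explicitly, so that any non-marker columns are not clustered on by accident. The following is a minimal sketch under stated assumptions: it assumes a tab-delimited table with a header row whose column names match the marker names above. The helper `marker_columns` and the extra columns (`Time`, `label`) in the toy table are hypothetical and not taken from the repository.

```python
import csv
import io

# The 13 surface markers listed above.
MARKERS = ["CD45", "CD45RA", "CD19", "CD11b", "CD4", "CD8", "CD34",
           "CD20", "CD33", "CD123", "CD38", "CD90", "CD3"]


def marker_columns(header, markers=MARKERS):
    """Return the index of each marker in `header`, in marker order.

    Names are matched exactly (not by prefix), because several markers
    share prefixes, e.g. CD4 / CD45 / CD45RA and CD3 / CD33 / CD34 / CD38.
    Raises ValueError if a marker is missing, rather than silently
    clustering on fewer than 13 dimensions.
    """
    missing = [m for m in markers if m not in header]
    if missing:
        raise ValueError(f"missing marker columns: {missing}")
    return [header.index(m) for m in markers]


# Toy table: hypothetical column layout with made-up values (not real data).
toy = "Time\t" + "\t".join(MARKERS) + "\tlabel\n"
toy += "1\t" + "\t".join(str(0.1 * i) for i in range(13)) + "\t3\n"

rows = list(csv.reader(io.StringIO(toy), delimiter="\t"))
header, data = rows[0], rows[1:]
idx = marker_columns(header)
expression = [[float(row[i]) for i in idx] for row in data]
print(len(MARKERS), len(expression[0]))
```

Selecting columns by exact name rather than by position keeps the example robust if the column order in a given file differs from the order in which the markers are listed here.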

This repository

This repository contains an R script to pre-process and export the 13-dimensional benchmark data set in standard formats, so that researchers from other fields can more easily use it to test clustering algorithms. The script performs the following steps:

For more details, see the repository for the 32-dimensional benchmark data set, benchmark-data-Levine-32-dim.

Contents

The files in this repository are:

References and links

The benchmark data sets are sourced from the paper by Levine et al. (2015):

Data from Levine et al. (2015) are publicly available through Cytobank at the following links. Note that a (free) Cytobank account is required.

Additional information can also be found on the Dana Pe'er lab web page, at: http://www.c2b2.columbia.edu/danapeerlab/html/phenograph.html

The 13-dimensional benchmark data set was originally published by Bendall et al. (2011):