A comparison of automatic cell identification methods for single-cell RNA-sequencing data

We present a comprehensive evaluation of the performance of state-of-the-art classification methods, in addition to general-purpose classifiers, for automatic cell identification in single-cell RNA-sequencing datasets. Our goal is to provide the community with a fair evaluation of all available methods, to facilitate users’ choice and to direct further development toward the challenging aspects of automated cell type identification. (Published in Genome Biology, Sep. 2019: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1795-z)

Repository description

We provide all the scripts to run and evaluate all classifiers, and to reproduce the results introduced in the paper.

The 'Scripts' folder contains wrapper functions that read the data and apply a given classification method.
The Cross_Validation R script can be used to produce training and test indices for cross-validation.
The rank_gene_dropouts Python script can be used to apply feature selection using the dropout method and rank genes accordingly.
The evaluate R script can be used to evaluate the predictions of a given classifier and obtain scores such as accuracy, median F1-score, and % unlabeled cells.

For more details, please check the documentation of each function.

General Usage

To benchmark and fairly evaluate the performance of different classifiers on the benchmark datasets (filtered datasets can be downloaded from https://zenodo.org/record/3357167), apply the following steps:

Step 1

Apply the Cross_Validation R function to the corresponding dataset to obtain fixed training and test cell indices, stratified across the different cell types. For example, using the Tabula Muris (TM) dataset:

Cross_Validation('~/TM/Labels.csv', 1, '~/TM/')

This command will create a CV_folds.RData file used as input in Step 2.
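To illustrate what "stratified" means here, the following is a minimal Python sketch, not the repository's Cross_Validation implementation. The function name `stratified_folds`, the choice of 5 folds, and the round-robin assignment are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Return (train_idx, test_idx) pairs with each cell type spread over all folds.

    Hypothetical helper for illustration; n_folds=5 is an assumed default.
    """
    rng = random.Random(seed)

    # Group cell indices by their cell-type label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Shuffle each cell type and deal its cells round-robin across the folds,
    # so every fold receives a similar share of every type.
    test_sets = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            test_sets[position % n_folds].append(idx)

    # The training set of a fold is every cell that is not in its test set.
    all_idx = set(range(len(labels)))
    folds = []
    for test in test_sets:
        folds.append((sorted(all_idx - set(test)), sorted(test)))
    return folds


labels = ["T cell"] * 10 + ["B cell"] * 5
folds = stratified_folds(labels, n_folds=5, seed=1)
```

Fixing the seed keeps the indices identical across runs, which is what allows every classifier to be trained and tested on exactly the same cells.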

Step 2

Run each classifier wrapper. For example, to run scPred on the TM dataset:

run_scPred('~/TM/Filtered_TM_data.csv','~/TM/Labels.csv','~/TM/CV_folds.RData','~/Results/TM/')

This command will output the true and predicted cell labels as csv files, as well as the classifier computation time.

Step 3

Evaluate the classifier prediction by

result <- evaluate('~/Results/TM/scPred_True_Labels.csv', '~/Results/TM/scPred_Pred_Labels.csv')

This command will return the corresponding accuracy, median F1-score, F1-scores for all cell populations, % unlabeled cells, and confusion matrix.
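As a rough picture of how these scores relate to each other, here is a minimal Python sketch. It is not the repository's evaluate function: the name `evaluate_predictions`, the `"Unknown"` tag for unlabeled cells, and the choice to count unlabeled cells as errors in the accuracy are all assumptions for illustration.

```python
from collections import Counter
from statistics import median


def evaluate_predictions(true_labels, pred_labels, unlabeled_tags=("Unknown",)):
    """Compute accuracy, per-population F1, median F1 and % unlabeled cells.

    Hypothetical helper; the unlabeled tag names are assumed, not taken
    from the repository.
    """
    if not true_labels or len(true_labels) != len(pred_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    # Confusion matrix stored sparsely as {(true, predicted): count}.
    confusion = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    f1_scores = {}
    for population in true_counts:
        tp = confusion[(population, population)]
        if tp == 0:
            f1_scores[population] = 0.0
            continue
        precision = tp / pred_counts[population]
        recall = tp / true_counts[population]
        f1_scores[population] = 2 * precision * recall / (precision + recall)

    n_cells = len(true_labels)
    n_correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    n_unlabeled = sum(p in unlabeled_tags for p in pred_labels)
    return {
        "accuracy": n_correct / n_cells,
        "f1": f1_scores,
        "median_f1": median(f1_scores.values()),
        "pct_unlabeled": 100.0 * n_unlabeled / n_cells,
        "confusion": confusion,
    }


true = ["A", "A", "A", "B", "B", "C"]
pred = ["A", "A", "B", "B", "Unknown", "C"]
scores = evaluate_predictions(true, pred)
```

The median F1-score is taken over cell populations, so a small population counts as much as a large one, whereas accuracy is dominated by the abundant populations.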

Usage with feature selection

Step 1

Cross_Validation('~/TM/Labels.csv', 1, '~/TM/')

This command will create a CV_folds.RData file used as input in Steps 2 and 3.

Step 2

Apply the rank_gene_dropouts Python script to obtain the gene ranking for each training fold using the dropout criterion:

rank_gene_dropouts('~/TM/Filtered_TM_data.csv', '~/TM/CV_folds.RData', '~/TM/')

This command will create a rank_genes_dropouts.csv file used as input in Step 3.
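The idea behind dropout-based feature selection is that informative genes tend to show more zero counts than their average expression level would suggest. The sketch below is a simplified variant under stated assumptions, not the repository's rank_gene_dropouts script: it fits a straight line of dropout rate against log2(mean + 1) and ranks genes by their residual. The function name `rank_genes_by_dropout`, the linear fit, and the log transform are all assumptions; the actual script may use a different curve or normalization. In the workflow above, such a ranking would be computed on the training cells of each fold.

```python
import math


def rank_genes_by_dropout(expression, gene_names):
    """Rank genes by how much their dropout rate exceeds a fitted trend.

    expression: list of rows (cells), each a list of counts (one per gene).
    Hypothetical helper using a simple least-squares line as the trend.
    """
    n_cells = len(expression)
    n_genes = len(gene_names)

    # Per gene: log mean expression (x) and fraction of zero counts (y).
    xs, ys = [], []
    for g in range(n_genes):
        column = [row[g] for row in expression]
        xs.append(math.log2(sum(column) / n_cells + 1))
        ys.append(sum(value == 0 for value in column) / n_cells)

    # Ordinary least-squares fit of dropout rate against log mean expression.
    x_bar = sum(xs) / n_genes
    y_bar = sum(ys) / n_genes
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar

    # A positive residual means more dropouts than expected for that expression level.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    order = sorted(range(n_genes), key=lambda g: residuals[g], reverse=True)
    return [gene_names[g] for g in order]


# Toy data: 4 cells (rows) x 3 genes (columns).
genes = ["housekeeping", "marker", "low"]
counts = [
    [5, 20, 1],
    [5, 0, 0],
    [5, 0, 1],
    [5, 0, 0],
]
ranking = rank_genes_by_dropout(counts, genes)
```

In the toy data, "marker" and "housekeeping" have the same mean expression, but "marker" is expressed in only one cell, so it is ranked first.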

Step 3

Run each classifier wrapper. For example, to run scPred on the TM dataset with the top 1000 ranked genes:

run_scPred('~/TM/Filtered_TM_data.csv','~/TM/Labels.csv','~/TM/CV_folds.RData','~/Results/TM/',
GeneOrderPath = '~/TM/rank_genes_dropouts.csv',NumGenes = 1000)

This command will output the true and predicted cell labels as csv files, as well as the classifier computation time.
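Conceptually, the GeneOrderPath and NumGenes arguments restrict the classifier to the top-ranked genes. A minimal Python sketch of that subsetting step follows; the function `select_top_genes` and the representation of the ranking as a list of gene names are assumptions for illustration, and the actual layout of rank_genes_dropouts.csv may differ.

```python
def select_top_genes(expression, gene_names, ranked_genes, num_genes):
    """Keep only the columns of the num_genes best-ranked genes.

    expression: list of rows (cells), each a list of values (one per gene).
    ranked_genes: gene names ordered from most to least informative.
    Hypothetical helper; not part of the repository.
    """
    column_of = {name: i for i, name in enumerate(gene_names)}
    # Preserve the ranking order and ignore names missing from the data.
    keep = [column_of[g] for g in ranked_genes[:num_genes] if g in column_of]
    subset = [[row[i] for i in keep] for row in expression]
    return subset, [gene_names[i] for i in keep]


expression = [[1, 2, 3], [4, 5, 6]]
gene_names = ["g1", "g2", "g3"]
ranked_genes = ["g3", "g1", "g2"]
subset, kept = select_top_genes(expression, gene_names, ranked_genes, num_genes=2)
```

Because the ranking is computed per training fold, the test cells never influence which genes are selected.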

Step 4

Evaluate the classifier prediction by

result <- evaluate('~/Results/TM/scPred_True_Labels.csv', '~/Results/TM/scPred_Pred_Labels.csv')

This command will return the corresponding accuracy, median F1-score, F1-scores for all cell populations, % unlabeled cells, and confusion matrix.

Evaluate Marker-based methods using DE genes

To evaluate the marker-based methods SCINA, DigitalCellSorter and Garnett using DE genes learned from the data, you may follow these steps:

Step 1

Cross_Validation('~/Zheng_sorted/Labels.csv', 1, '~/Zheng_sorted/')

This command will create a CV_folds.RData file used as input in Step 2.

Step 2

For each fold, use the training data to obtain the DE genes with the DEgenesMAST R function, then pass these DE genes to the corresponding method (SCINA in this example) to obtain cell predictions for the test data.

# Load the folds created in Step 1; n_folds, Train_Idx, Test_Idx, col_Index and Cells_to_Keep are expected to come from this file
load('~/Zheng_sorted/CV_folds.RData')
Data <- read.csv('~/Zheng_sorted/Filtered_DownSampled_SortedPBMC_data',row.names = 1)
Labels <- as.matrix(read.csv('~/Zheng_sorted/Labels.csv'))
Labels <- as.vector(Labels[,col_Index])
# Keep only the cells retained by the cross-validation step
Data <- Data[Cells_to_Keep,]
Labels <- Labels[Cells_to_Keep]

for (i in seq_len(n_folds))
{
    ## learn the DE (marker) genes from the training cells of fold i
    MarkerGenes <- DEgenesMAST(t(Data[Train_Idx[[i]],]), Labels[Train_Idx[[i]]], Normalize = TRUE, LogTransform = TRUE)
    ## write the MarkerGenes into a marker genes file format, depending on the tested method, for example for SCINA
    write.csv(MarkerGenes, 'MarkerGenes.csv')
    ## run the SCINA wrapper on the test cells of fold i using these DE marker genes
    run_SCINA(Data[Test_Idx[[i]],], Labels[Test_Idx[[i]]], 'MarkerGenes.csv', '~/Results/Zheng_sorted/')
}

Snakemake

To support future extension of this benchmarking work with new classifiers and datasets, we provide a Snakemake workflow to automate the performed benchmarking analyses (https://github.com/tabdelaal/scRNAseq_Benchmark/tree/snakemake_and_docker).