CIPR-Shiny

Cluster Identity Predictor

<br>

During the analysis of single cell RNA sequencing (scRNAseq) data, annotating the biological identity of cell clusters is an important step before downstream analyses and it remains technically challenging. The current solutions for annotating single cell clusters generally lack a graphical user interface, can be computationally intensive or have a limited scope. On the other hand, manually annotating single cell clusters by examining the expression of marker genes can be subjective and labor-intensive.

To improve the quality and efficiency of annotating cell clusters in scRNAseq data, we present a web-based R/Shiny app and R package, Cluster Identity PRedictor (CIPR), which provides a graphical user interface to quickly score gene expression profiles of unknown cell clusters against mouse or human references, or against a custom dataset provided by the user. CIPR can be easily integrated into existing pipelines to facilitate scRNAseq data analysis.

CIPR performs analyses at the individual cluster level and generates informative graphical outputs to help users assess the quality of algorithmic predictions (see the example outputs below).

This repository contains the source code for the Shiny implementation of the CIPR pipeline. For the CIPR R package, please check out the CIPR-Package repository.

Reference datasets available in CIPR

- Immunological Genome Project (ImmGen) microarray data from sorted mouse immune cells. This dataset was prepared using both the V1 and V2 ImmGen releases and contains 296 samples from 20 different cell types (253 subtypes).
- Mouse RNAseq data from sorted cells reported in Benayoun et al. (2019). This dataset contains 358 sorted immune and nonimmune samples from 18 different lineages (28 subtypes).
- Blueprint/Encode RNAseq data that contains 259 sorted human immune and nonimmune samples from 24 different lineages (43 subtypes).
- Human Primary Cell Atlas that contains microarray data from 713 sorted immune and nonimmune cells (37 main cell types and 157 subtypes).
- DICE (Database for Immune Cell Expression/eQTLs/Epigenomics) that contains 1561 human immune samples from 5 main cell types (15 subtypes). To reduce object sizes, mean TPM values per cell type are used.
- Human microarray data from sorted hematopoietic cells reported in Novershtern et al. (2011). This dataset contains data from 211 samples and 17 main cell types (38 subtypes).
- Human RNAseq data from sorted cells reported in Monaco et al. (2019). This dataset contains 114 samples originating from 11 main cell types (29 subtypes).
- A custom reference dataset provided by the user. This dataset can be obtained from a number of high-throughput methods, including microarray and bulk RNAseq. For details about how to prepare a custom reference, please see the How-to tab on the Shiny website.

Analytical approach

CIPR calculates pairwise identity scores between individual unknown clusters and the reference samples, generating a vector of identity scores for each cluster in the experiment. CIPR offers two main approaches for this:

Comparison of differentially expressed genes. In this method, users provide an input data frame that contains the log fold-change (logFC) values of differentially expressed genes in each cluster. The algorithm first calculates differential expression within the reference data frame for each gene by taking the ratio of the expression value in each individual subset to the average expression of that gene across the entire data frame. The CIPR pipeline then compares these reference logFC values to the logFC values from the experimental clusters. Users can select one of three methods for these comparisons:
- LogFC dot product: LogFC values of the matching genes are multiplied and added together to yield an aggregate identity score.
- LogFC Spearman's correlation: Rank correlation is calculated between the logFC values of the experimental and reference data.
- LogFC Pearson's correlation: Linear correlation is calculated between the logFC values of the experimental and reference data.
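
The logFC dot product method described above can be sketched as follows. This is a minimal illustration in Python, not the CIPR source code: the gene names, expression values, and function names are hypothetical, and the reference values are assumed to be on a linear scale so that the ratio to the mean can be log2-transformed (for log-scale data the ratio would become a difference).

```python
import math

# Hypothetical toy reference: gene -> expression per reference subset (linear scale).
reference = {
    "Cd3e": {"T cell": 8.0, "B cell": 1.0, "Monocyte": 1.0},
    "Cd19": {"T cell": 1.0, "B cell": 8.0, "Monocyte": 1.0},
    "Lyz2": {"T cell": 1.0, "B cell": 1.0, "Monocyte": 16.0},
}

# Hypothetical logFC values of differentially expressed genes in one cluster.
cluster_logfc = {"Cd3e": 2.0, "Cd19": -1.0, "Lyz2": -1.5, "Gapdh": 0.1}


def reference_logfc(ref):
    """Per gene: log2(subset expression / mean expression across all subsets)."""
    out = {}
    for gene, by_subset in ref.items():
        mean_expr = sum(by_subset.values()) / len(by_subset)
        out[gene] = {s: math.log2(v / mean_expr) for s, v in by_subset.items()}
    return out


def dot_product_scores(cluster, ref_lfc):
    """Multiply logFC values of shared genes and sum them per reference subset."""
    shared = [g for g in cluster if g in ref_lfc]  # genes absent from the reference are skipped
    subsets = next(iter(ref_lfc.values())).keys()
    return {s: sum(cluster[g] * ref_lfc[g][s] for g in shared) for s in subsets}


scores = dot_product_scores(cluster_logfc, reference_logfc(reference))
best = max(scores, key=scores.get)
print(best, scores)
```

Genes that move in the same direction in the cluster and in a reference subset contribute positively to that subset's score, so the subset with the most concordant profile receives the highest value.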
<br>
Comparison of all genes. In this method, users provide an input data frame that contains the average gene expression per cluster. The algorithm compares the expression profile of each individual cluster to those of the reference dataset. All genes common to the experimental and reference data are used in the analysis, regardless of their expression values and differential expression status. Users can choose one of two methods in this approach:
- Spearman's correlation: Rank correlation between the experimental clusters and reference cell subsets.
- Pearson's correlation: Linear correlation between the experimental clusters and reference cell subsets (which can be especially beneficial when using custom references where the reference and experimental data are obtained using similar methodologies).
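
As an illustration of the all-genes approach, the sketch below restricts two expression profiles to their common genes and computes both correlation types. It is written in plain Python for clarity and is not taken from CIPR; the gene names, values, and helper functions are hypothetical.

```python
import math


def pearson(x, y):
    """Linear (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def spearman(x, y):
    """Rank (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical average expression per gene for one cluster and one reference sample.
cluster_avg = {"Cd3e": 5.0, "Cd19": 0.2, "Lyz2": 0.1, "Actb": 9.0, "Xist": 1.0}
reference_avg = {"Cd3e": 40.0, "Cd19": 1.0, "Lyz2": 0.5, "Actb": 300.0, "Foxp3": 2.0}

# Only genes present in both inputs enter the comparison.
common = sorted(set(cluster_avg) & set(reference_avg))
x = [cluster_avg[g] for g in common]
y = [reference_avg[g] for g in common]

rho = spearman(x, y)
r = pearson(x, y)
print(common, rho, r)
```

In this toy example the two profiles rank the genes identically, so the rank correlation is perfect even though the values sit on very different scales, while the linear correlation is lower. This is one reason a rank-based measure may be preferable when the experimental and reference data come from different platforms.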

Flexible options

To be adaptable to various experimental contexts, CIPR enables users to:

- Select only the reference subsets of interest from the provided reference datasets.
- Limit the analysis to the genes whose expression variance (in the reference dataset) is above a certain quantile determined by the user.
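
The variance-based gene filter can be pictured with the short Python sketch below. It is an assumption-laden illustration rather than CIPR's implementation: the data, the function names, and the choice of a linearly interpolated quantile with the sample variance are all hypothetical.

```python
import statistics

# Hypothetical reference: gene -> expression across reference samples.
reference = {
    "Actb": [10.0, 10.1, 9.9, 10.0],
    "Cd3e": [8.0, 1.0, 1.0, 1.0],
    "Cd19": [1.0, 6.0, 1.0, 1.0],
    "Gapdh": [9.0, 9.0, 9.1, 9.0],
    "Lyz2": [1.0, 1.0, 16.0, 12.0],
}


def quantile(sorted_vals, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def filter_by_variance(ref, q):
    """Return the genes whose variance across samples is strictly above quantile q (0 to 1)."""
    variances = {g: statistics.variance(v) for g, v in ref.items()}
    cutoff = quantile(sorted(variances.values()), q)
    return {g for g, var in variances.items() if var > cutoff}


kept = filter_by_variance(reference, 0.5)
print(sorted(kept))
```

Raising the threshold removes genes with nearly constant expression across the reference (such as the housekeeping-like genes in this toy data), which carry little information for telling reference subsets apart.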

Sample outputs

Results per cluster

In the plot below, the x-axis shows the individual samples within the reference data frame (ImmGen in this example). Reference cell types are marked by different colors. Each data point indicates the identity score calculated for Cluster 1 in the input data against one reference sample. Shaded regions demarcate 1 and 2 standard deviations around the average identity score across the reference dataset. The logFC dot product method was used in this analysis.
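
To make the shaded regions concrete, the following Python sketch computes the mean and standard deviation of one cluster's identity scores and lists the reference samples that fall more than 1 or 2 standard deviations above the mean. The scores and sample names are hypothetical, and the use of the sample standard deviation is an assumption, not a statement about how CIPR computes the bands.

```python
import statistics

# Hypothetical identity scores of one cluster against eight reference samples.
scores = {
    "ref_a": 10.0, "ref_b": 0.5, "ref_c": -0.5, "ref_d": 1.0,
    "ref_e": -1.0, "ref_f": 0.0, "ref_g": 0.2, "ref_h": -0.2,
}

mean_score = statistics.mean(scores.values())
sd_score = statistics.stdev(scores.values())  # sample standard deviation (assumed)

# Distance of each score from the mean, in units of standard deviation.
z = {name: (s - mean_score) / sd_score for name, s in scores.items()}

above_1sd = sorted(name for name, value in z.items() if value > 1)
above_2sd = sorted(name for name, value in z.items() if value > 2)
print(above_1sd, above_2sd)
```

A reference sample that clearly rises above the 2 standard deviation band, as `ref_a` does here, stands out from the background of the remaining references.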

Summary of top hits per cluster

It is often easier to examine the top predictions in one graph. This plot shows the top 5 scoring reference samples for each cluster (clusters are shown in different colors). Drawing a rectangle around data points displays a table with further details beneath the plot.
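
Selecting the top hits amounts to sorting each cluster's identity scores and keeping the highest entries. The Python sketch below shows the idea with hypothetical cluster names, reference names, and scores; it is not CIPR code.

```python
# Hypothetical identity scores: cluster -> {reference sample: score}.
scores_by_cluster = {
    "cluster_1": {"ref_a": 9.1, "ref_b": 2.0, "ref_c": 7.4, "ref_d": -1.2},
    "cluster_2": {"ref_a": -0.3, "ref_b": 5.5, "ref_c": 0.8, "ref_d": 6.1},
}


def top_hits(scores, n=5):
    """Return the n highest-scoring (reference, score) pairs for each cluster."""
    return {
        cluster: sorted(by_ref.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for cluster, by_ref in scores.items()
    }


summary = top_hits(scores_by_cluster, n=2)
for cluster, hits in summary.items():
    print(cluster, hits)
```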