Home

Awesome

SCALE

Allele-Specific Expression by Single-Cell RNA Sequencing

Author

Yuchao Jiang, Nancy R. Zhang, Mingyao Li

Maintainer

Yuchao Jiang yuchaoj@email.unc.edu

Description

SCALE is a statistical framework for Single Cell ALlelic Expression analysis. SCALE estimates kinetic parameters that characterize the transcriptional bursting process at the allelic level, while accounting for technical bias and other complicating factors such as cell size. SCALE detects genes with significantly different bursting kinetics between the two alleles, as well as genes where the two alleles exhibit dependence in their bursting processes.

Installation

install.packages("rje")
install.packages("tsne")
install.packages("scatterplot3d")
install.packages("devtools")
library(devtools)
install_github("yuchaojiang/SCALE/package")

Demo code & Vignettes

Questions & Issues

If you have any questions with the package, feel free to post in our Google user group https://groups.google.com/d/forum/SCALE_scRNAseq or email us at SCALE_scRNAseq@googlegroups.com. We will try our best to reply as soon as we can.

Citation

Yuchao Jiang, Nancy R. Zhang, and Mingyao Li. "SCALE: modeling allele-specific gene expression by single-cell RNA sequencing." Genome Biology 18.1 (2017): 74. link

Common questions

SCALE needs to be applied to a homogeneous cell population, where the same bursting kinetics are shared across all cells. Possible heterogeneity due to, for example, cell subgroups, lineages, and donor effects, can lead to biased downstream analysis. We find that an excessive number of significant genes showing coordinated bursting between the two alleles can be indicative of heterogeneity with the cell population, which shoud be further stratified. Therefore, it is recommended that users adopt dimensionality reduction and clustering methods (e.g., t-SNE, PCA, ZIFA, RCA, hierarchical clustering, SC3, etc.) on the expression matrix for clustering. SCALE can then be applied to a homogeneous cell cluster that is identified.

SCALE uses external spike-ins to estimate the technical-noise associated paramters. SCALE does not require spike-in in every cell in the experiment. As long as some cells from the same batch have spike-ins, they can be used to model and capture the dropout events and the amplification and sequencing biases during the library prep step. Make sure that the column names of spikein_input are the same as (a subset of) those of alleleA and alleleB.

The technical noise associated parameters include: 𝛼_𝑐 which models the capture and sequencing efficiency; 𝛽_𝑐 which models the amplification bias; πœ…_𝑐 and 𝜏_𝑐 which characterize whether a transcript is successfully captured in the library. To estimate 𝛼_𝑐 and 𝛽_𝑐, SCALE uses the external spike-ins to fit a log-linear regression with the true number of molecules as response and the observed number of reads per locus as predictor. SCALE adopts the Nelder-Mead simplex algorithm to jointly optimize πœ…_c and 𝜏_c, which model the probability of non-dropout. Refer to the Method section in our paper for more details. A pdf plot with estimated 𝛼_𝑐, 𝛽_𝑐, πœ…_c, and 𝜏_c will be generated by SCALE by default.

Note: Sometimes there can be a β€œpoor" fitting of πœ… and 𝜏, as compared to Supplementary Figure S5 in our paper. This is due to low sequencing depth as well as low amplification efficiency in the library prep and the sequencing procedure. These are reflected by small 𝛼 and 𝛽 terms, which results in the 𝛼*(Y^𝛽) being much smaller than Y and even significantly smaller than 1 for the lowly spiked-in molecules or lowly expressed endogenous genes. In this case, when sequencing is performed, no matter whether the transcript is dropped out (modeled by πœ… and 𝜏) or not, it will almost always result in a zero from the Poisson sampling. This will result in a large estimate of πœ… and 𝜏 by SCALE, with the rate pi = expit(kappa + tau * log(Y)) being almost always 1 meaning that there is no drop out but rather the zeros are from Poisson sampling with a small lambda = 𝛼*(Y^𝛽). However, this doesn’t mean there is no dropout β€” it simply means that the sequencing depth followed by the amplification is so low that we cannot estimate dropout using spike-ins (aka, πœ… and 𝜏 are not identifiable). We've carried out empirical analysis and showed that this issue does not affect the downstream analysis of bursting kinetics, since the lowly expressed transcript results in zero read counts will be modeled and captured, whether it is due to Poisson sampling or dropout events.

If you do not have spike-ins in your scRNA-seq dataset, you should still adopt other denoising/imputation methods for data normalization. The technical dropout affects the estimation of burst frequency and the amplification (and potentially sequencing) bias affects the estimation of burst size. Suppose you have a normalized gene expression matrix, you can pre-specify log(alpha), beta, kappa, and tau to be log(1), 1, -10000, 0 so that SCALE does not perform another round of normalization.

abkt = c(0, 1, 10000, 0); names(abkt) = c("log(alpha)", "beta", "kappa", "tau")

Cell size can be inferred from the GAPDH expression levels (with adjustment of library size factor which is just due to sequencing depth rather than cell size difference, see first equation under Methods section in our paper) or the ratio of endogenous reads over the total number of spike-in reads. For consistency, these two measurements should NOT be combined together.

To correctly calculate the cell size factor baed on sequencing reads alone isn’t trivial. We are hoping that there could be measurements available from the single cell isolation machine (e.g., Fluidigm C1) soon. We recommend first estimating the bursting kinetics with all cell size factors set to 1. After this, one can try again using the estimated cell size factors. The correlations between the bursting kinetics of the two alleles can serve as a good sanity check (on the genome wide scale, they should be correlated with a fairly decent correlation coefficient).

Quality control procedures are recommended to filter out both extreme cells and genes before applying SCALE. Some metrics may include: library size factor (see first equation under Methods in our paper), PCA result (to remove cell outliers or heterogeneity), allelic ratio (standard deviation of a gene across all cells), ratio of reads that map to spike-ins versus endogenous genes (i.e., cells with extreme cell sizes), and true number of spike-in molecules (first column of spikein_input, where spike-ins with small number of molecules should be removed).

In this paper, a gene is categorized to be monoallelically expressed, if only one allele is expressed in a nonzero proportion of cells. Gene categorization is carried out through the empirical Bayes framework outlined in our paper. While SCALE is focused on detecting differential bursting kinetics between the two alleles, monoallelic expression is an extreme case where one allele is completely off (which is equivalent to having an infinitely large burst frequency). We don't infer bursting kinetics -- there is no way nor need to do so -- or perform hypothesis testings on these monoallelically expressed genes. However, we think these genes are just as, if not more, important in biology. The gene categorization results are included in the table generated in the last step of SCALE.