qsmooth

Global normalization methods and their assumptions

Global normalization methods such as quantile normalization have become a standard part of the analysis pipeline for high-throughput data to remove unwanted technical variation. These methods, and others that rely solely on observed data without external information (e.g. spike-ins), are based on the assumption that only a minority of genes are differentially expressed (or that an equivalent number of genes increase and decrease across biological conditions). This assumption can be interpreted in different ways, leading to different global normalization procedures. For example, one normalization procedure assumes that the mean expression level across genes should be the same across samples. In contrast, quantile normalization assumes that the only difference between the statistical distributions of the samples is technical variation. Normalization is achieved by forcing the observed distributions to be the same; the average distribution, obtained by taking the average of each quantile across samples, is used as the reference.
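The reference-distribution idea can be illustrated with a minimal base-R sketch. This is not code from the qsmooth package: the function name `quantile_normalize` and the toy matrix are hypothetical, and the sketch assumes no ties or missing values.

```r
# Minimal base-R sketch of quantile normalization (assumes no ties or NAs).
# Rows are genes, columns are samples.
quantile_normalize <- function(x) {
  # Sort each column so that row i holds the i-th smallest value per sample
  sorted <- apply(x, 2, sort)
  # Reference distribution: average of each quantile across samples
  reference <- rowMeans(sorted)
  # Replace each value with the reference value at its rank within the sample
  ranks <- apply(x, 2, rank, ties.method = "first")
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Hypothetical toy data: 4 genes by 3 samples, filled column by column
x <- matrix(c(5, 2, 3, 4,
              4, 1, 6, 2,
              3, 4.5, 6, 8), ncol = 3,
            dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
qn <- quantile_normalize(x)

# After normalization every sample has the same set of values
apply(qn, 2, sort)
```

Because every sample is forced onto the same reference, any genuine global shift between samples is removed along with the technical variation.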

How to evaluate if global normalization methods are appropriate?

While these assumptions may be reasonable in certain experiments, they are not always appropriate. Recently, an R/Bioconductor package (quantro) has been developed to test for global differences between groups of distributions, in order to evaluate whether global normalization methods such as quantile normalization should be applied. If global differences are found between groups of distributions, these differences may be technical or biological in origin. If they are technical (e.g. batch effects), then global normalization methods should be applied. If they are related to a biological factor (e.g. normal/tumor or two tissues), then global normalization methods should not be applied, because they will remove the interesting biological variation (i.e. differentially expressed genes) and artificially induce differences between genes that were not differentially expressed. When the global differences between groups of distributions reflect biological conditions, quantile normalization is therefore not an appropriate normalization method. In these cases, we can consider a more relaxed assumption about the data, namely that the statistical distribution of each sample should be the same within biological conditions or groups (compared to the more stringent assumption of quantile normalization, which states that the statistical distribution is the same across all samples).
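The relaxed assumption amounts to applying quantile normalization separately within each group. The following base-R sketch illustrates this under stated assumptions: the function name `quantile_normalize_within`, the simulated data, and the group labels are all hypothetical, and ties and missing values are not handled.

```r
# Sketch: quantile normalization within each biological group
# (assumes no ties or NAs and at least two genes; names are illustrative).
quantile_normalize_within <- function(x, group) {
  out <- x
  for (g in unique(group)) {
    cols <- which(group == g)
    sub <- x[, cols, drop = FALSE]
    # Group-specific reference: average of each quantile within the group
    reference <- rowMeans(apply(sub, 2, sort))
    for (j in cols) {
      out[, j] <- reference[rank(x[, j], ties.method = "first")]
    }
  }
  out
}

# Simulated data: 20 genes, two samples per group, with a global shift in group B
set.seed(1)
x <- cbind(matrix(rnorm(40, mean = 5), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
colnames(x) <- c("A1", "A2", "B1", "B2")
group <- c("A", "A", "B", "B")

wn <- quantile_normalize_within(x, group)

# Samples agree within a group, while the shift between groups is preserved
colMeans(wn)
```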

qsmooth: a generalization of quantile normalization

Here we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is a weighted average of the two types of assumptions about the data. The qsmooth R package contains the qsmooth() function, which computes a weight at every quantile that compares the variability between groups relative to the variability within groups. At one extreme, quantile normalization is applied across all samples; at the other extreme, quantile normalization is applied within each biological condition. The weight shrinks the group-level quantile normalized data towards the overall reference quantiles if the variability between groups is sufficiently smaller than the variability within groups. The algorithm is described in the figure below (see vignettes/qsmooth-vignette.pdf for more details).

[Figure: qsmooth algorithm]
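To make the weighting idea concrete, here is a simplified base-R sketch. It is not the implementation used by the qsmooth package: the function name `smooth_qn_sketch`, the specific weight formula (one minus the ratio of between-group to total sum of squares at each quantile), and the absence of any smoothing of the weights are all assumptions made for illustration. See the vignette for the actual algorithm.

```r
# Simplified illustration of the weighting idea; NOT the qsmooth implementation.
# Assumes no ties or NAs. Rows are genes, columns are samples.
smooth_qn_sketch <- function(x, group) {
  group <- factor(group)
  sorted <- apply(x, 2, sort)               # quantiles of each sample
  overall_ref <- rowMeans(sorted)           # reference across all samples
  # Group-level reference quantiles, expanded to one column per sample
  group_ref <- sapply(levels(group), function(g)
    rowMeans(sorted[, group == g, drop = FALSE]))
  group_ref_by_sample <- group_ref[, as.character(group), drop = FALSE]
  # Variability at each quantile: between groups versus total
  ss_between <- rowSums((group_ref_by_sample - overall_ref)^2)
  ss_total <- rowSums((sorted - overall_ref)^2)
  # Weight near 1: little between-group variability, favor the overall reference
  weight <- ifelse(ss_total > 0, 1 - ss_between / ss_total, 1)
  weight <- pmin(pmax(weight, 0), 1)        # guard against rounding error
  # Weighted average of the two references at every quantile
  target <- weight * overall_ref + (1 - weight) * group_ref_by_sample
  # Map each sample from quantile order back to its original gene order
  ranks <- apply(x, 2, rank, ties.method = "first")
  out <- sapply(seq_len(ncol(x)), function(j) target[ranks[, j], j])
  dimnames(out) <- dimnames(x)
  list(data = out, weights = weight)
}

# Simulated data: 20 genes, two samples per group, group B shifted upwards
set.seed(1)
x <- cbind(matrix(rnorm(40, mean = 5), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
colnames(x) <- c("A1", "A2", "B1", "B2")
res <- smooth_qn_sketch(x, group = c("A", "A", "B", "B"))

# Weights near 0 mean the group-level reference dominates at that quantile
summary(res$weights)
```

In this sketch a weight of 1 reproduces ordinary quantile normalization and a weight of 0 reproduces quantile normalization within each group, which matches the two extremes described above.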

Installing qsmooth

The latest version of the qsmooth R package can be installed from GitHub using the devtools R package:

library(devtools)
install_github("stephaniehicks/qsmooth")

It can also be installed using Bioconductor:

# install BiocManager from CRAN (if not already installed)
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

# install qsmooth package
BiocManager::install("qsmooth")

After installation, the package can be loaded into R.

library(qsmooth)

Using qsmooth

The main function in the qsmooth package is qsmooth(). It needs two objects: (1) a data frame or matrix with observations (e.g. probes or genes) on the rows and samples on the columns (here called eset) and (2) a group-level factor, passed as the group_factor argument (here called outcome). The order of this factor must match the order of the columns in eset, because it records which group each sample belongs to.
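As a sketch of what these two inputs might look like, the following base-R code builds a simulated matrix and a matching factor. The data, the sample names, the group labels, and the `pheno` table are hypothetical; only the names eset and outcome come from the text above.

```r
set.seed(42)

# Hypothetical toy data: 100 genes (rows) by 6 samples (columns)
eset <- matrix(rnorm(600, mean = 8, sd = 2), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))

# One entry per column of eset, in the same order as the columns
outcome <- factor(c("normal", "normal", "normal", "tumor", "tumor", "tumor"))

# Sanity checks before calling qsmooth()
stopifnot(is.factor(outcome), length(outcome) == ncol(eset))

# If the labels come from a sample table in a different order,
# realign them to the columns of eset by sample name
pheno <- data.frame(sample = rev(colnames(eset)),
                    condition = rev(as.character(outcome)))
outcome_aligned <- factor(pheno$condition[match(colnames(eset), pheno$sample)])
stopifnot(identical(outcome_aligned, outcome))
```

Matching by sample name rather than by position helps avoid silently assigning samples to the wrong group.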

To run the qsmooth() function:

qs <- qsmooth(object = eset, group_factor = outcome)

Individual slots can be extracted using accessor methods:

qsmoothData(qs) # extract smoothed quantile normalized data
qsmoothWeights(qs) # extract the smoothed weights computed at each quantile

The weights can be directly plotted using the qsmoothPlotWeights() function.

qsmoothPlotWeights(qs) # plot weights 

See vignettes/qsmooth-vignette.pdf for more details.

Bug reports

Report bugs as issues on the GitHub repository

Contributors