Awesome
README for pRSEM demo
Table of Contents
<a name="Introduction"></a> Introduction
Prior-enhanced RSEM (pRSEM) is an RNA-seq quantification method that utilizes external data for the task of transcript abundance estimation. The workflow of pRSEM is illustrated in the following figure.
This repository is a mini-example for running pRSEM. It contains all the required software packages, input files, and scripts. More details is described below. This demo runs in 4 threads. The installation and running will take about 20 to 30 minutes on a 4 x 2.4GHz core machine depending on which R/Bioconductor libraries users have already installed.
<a name="Download"></a> Download
There are two ways to download this demo and all three required submodules
- Download the tar ball or zip file via https. They are also listed on the release page with name pRSEM_demo_and_AllSubmodules
- Use a git command:
git clone --recursive git@github.com:pliu55/pRSEM_demo
<a name="SystemRequirements"></a> System Requirements
- Linux
- Hard drive space > 2.5G
- Perl version >= 5.8.8
- Python version >= 2.7.3
- R version >= 3.3.1
- POSIX Threads
- Boost (C++ libraries)
<a name="Submodules"></a> Submodules
This demo requires three submodules:
- pRSEM: prior-enhanced RSEM
- STAR: aligner for RNA-seq reads, version 2.5.2b
- Bowtie: aligned for ChIP-seq reads, version 1.0.1
<a name="Usage"></a> Usage
Go to this demo's folder and type
./run_pRSEM_demo.sh
This script will carry out the following tasks:
- Install Bowtie, STAR, RSEM, and all required libraries not yet installed by users
- Prepare genome references for Bowtie, STAR, and RSEM
- Derive prior parameters from RNA Polymerase II ChIP-seq data and use them to quantify RNA-seq data
- Derive prior parameters from a combination of four histone modfication ChIP-seq data sets and use them to quantify RNA-seq data
- Perform a testing procedure using a combination of four histone modification ChIP-seq data sets as the external data
- Perform a testing procedure using RNA Polymerase II ChIP-seq peaks as the external data
<a name="Input"></a> Input
All of the following data sets are under the folder input/. The RNA-seq and PolII ChIP-seq data were derived from ENCODE2 mouse Mel cell line. Although they are derived from a cell line rather than from tissue, we named them with keyword mmliver just to be consistent with the examples given in RSEM's documentation. The four histone modification ChIP-seq data sets were derived from Lara-Astiaso and Weiner et al. Science 2014 345:943.
- RNA-seq, paired-end reads in FASTQ format, a small subset of the RNA-seq data on Mel's biological replicate 1
- mmliver_1.fa.gz: first mate
- mmliver_2.fa.gz: second mate
- Pol II ChIP-seq reads in FASTQ format
- mmliver_PolIIRep1.fq.gz: replicate 1 for RNA polymerase II, a small subset of the Pol II ChIP-seq data on Mel's biological replicate 1
- mmliver_PolIIRep2.fq.gz: replicate 2 for RNA polymerase II, a small subset of the Pol II ChIP-seq data on Mel's biological replicate 2
- mmliver_ChIPseqCtrl.fa.gz: replicate 1 for control, a small subset of the the control ChIP-seq data on Mel's biological replicate 1
- Pol II ChIP-seq peaks in BED format
- PolII.bed.gz: ChIP-seq peaks for RNA polymerase II. It was derived from the above FASTQ files using ENCODE2's pipeline with an IDR threshold at 0.05.
- Histone modification ChIP-seq in FASTQ format
- H3K27Ac.fastq.gz: a small subset of H3K27Ac modification ChIP-seq data from LT_HSC cells
- H3K4me1.fastq.gz: a small subset of H3K4me1 modification ChIP-seq data from LT_HSC cells
- H3K4me2.fastq.gz: a small subset of H3K4me2 modification ChIP-seq data from LT_HSC cells
- H3K4me3.fastq.gz: a small subset of H3K4me3 modification ChIP-seq data from LT_HSC cells
- Genome sequence
- chr19.fa.gz: gzipped Chromosome 19 sequence from mouse's mm10 assembly.
- Transcript annotation
- chr19.gtf.gz: gzipped Protein-coding genes on chromsome 19 from GENCODE mouse release M4
- Mappability
- mm10.36mer.chr19.fake.bigWig: a mocked alignability files for mm10's chromosome 19
<a name="Output"></a> Output
All output files will be stored in the following four folders under output/:
-
genome_references/: genomic index files for Bowtie, STAR, and RSEM
-
quantification_results_PolII/: output files from pRSEM's expression-calculation step by using prior from Pol II ChIP-seq data. They are organized in the same way as RSEM's output. The only difference is that there are five files specifically generated by pRSEM under folder output/quantification_results_PolII/demo.stat/:
-
demo_prsem.all_tr_features: isofrom features to derive and assign pRSEM prior. The first line is a header and the rest is one isoform per line. The description for each column is:
column name descrption trid transcript ID from input annotation geneid gene ID from input anntation chrom isoform's chromosome name strand isoform's strand name start TSS if isoform is on '+' strand; TES if isoform is on '-' strand; where TSS stands for isoform's transcription start site, i.e. 5'-end, and TES stands for isoform's transcript end site, i.e. 3'-end end TES if isoform is on '+' strand; TSS if isoform is on '-' strand tss_mpp average mappability of [TSS-500bp, TSS+500bp] body_mpp average mappability of (TSS+500bp, TES-500bp) tes_mpp average mappability of [TES-500bp, TES+500bp] pme_count isoform's fragment or read count from RSEM's posterior mean estimates tss isoform's TSS tss_pk equal to 1 if isoform's [TSS-500bp, TSS+500bp] region overlaps with a RNA Pol II peak; 0 otherwise is_training equal to 1 if isoform is in the training set where Pol II prior is learned; 0 otherwise partition group number that isoform was partitioned to -
demo_prsem.all_tr_prior: prior parameters for every isoform. This file does not have a header. Each line contains an isoform's prior parameter and its transcript ID delimited by
#
-
demo_prsem.pval_LL: contains a p-value indicating if Pol II ChIP-seq data is informative for RNA-seq quantification and a log-likelihood denoting the fitness of pRSEM's Dirichlet-multinomial model to Pol II-partitioned training set isoforms
-
demo_uniform_prior_1.gene.results: RSEM's posterior mean estimates on the gene level with an initial pseudo-count of one for every isoform
-
demo_uniform_prior_1.isoform.results: RSEM's posterior mean estimates on the isoform level with an initial pseudo-count of one for every isoform
-
-
quantification_results_histone/: output files from pRSEM's expression-calculation step by using prior from four histone modification ChIP-seq data. They are very similar to those in quantification_results_PolII/. The only difference are in two files:
-
demo_prsem.all_tr_features: similar to quantification_results_PolII/demo.stat/demo_prsem.all_tr_features. It does not have the column tss_pk, but a few more new columns as shown below:
column name description pme_TPM isoform abundance from RSEM's posterior mean estimates H3K27Ac normalized H3K27Ac signals (in RPKM) of isoform's [TSS-500, TSS+500] H3K4me1 normalized H3K4me1 signals (in RPKM) of isoform's [TSS-500, TSS+500] H3K4me2 normalized H3K4me2 signals (in RPKM) of isoform's [TSS-500, TSS+500] H3K4me3 normalized H3K4me3 signals (in RPKM) of isoform's [TSS-500, TSS+500] prd_expr_prob probability of being expressed from a logistic regression model prior isoform's prior parameter -
demo_prsem.lgt_mdl.RData: It stores an R logistic regression object, which was obtained based on four histone modification signals and fragment counts of isoforms in the training set.
-
-
testing_procedure_histone_PolII/: output files from pRSEM's testing procedure. Test results are saved in demo.all.pval_LL. All the other files are similar to those in quantification_results_PolII/ and quantification_results_histone/.
- demo.all.pval__LL: contains three lines
- line 1: Header
- line 2: Testing procedure result when using four histone modification ChIP-seq data as the external data for RNA-seq quantification
- line 3: Testing procedure result when using Pol II peaks as the external data for RNA-seq quantification
- demo.all.pval__LL: contains three lines
Please note that, in order to shorten the running time as much as possible, the input ChIP-seq and RNA-seq files were prepared in extremely small (and unrealistic) sizes, and Gibbs sampling were set to run in just 100 instead of the default 1000 steps. As a result, the final quantification results may vary from time to time.
<a name="Contact"></a> Contact
Got a question? Please post it at RSEM's GitHub Issues page with @pliu55 mentioned.
<a name="License"></a> License
This demo is licensed under the GNU General Public License v3.