Awesome

RNAseq-transcript-nf

RNA-seq transcript-level analysis nextflow pipeline

Workflow representation

Description

Performs transcript identification and quantification from a series of BAM files using StringTie, following the Nature Protocol paper (Pertea et al. 2016; doi:10.1038/nprot.2016.095)

The twopass mode involves 4 steps:

a 1st pass identifies new transcripts for each BAM file
a merging step merges the list of transcripts of each BAM file
a 2nd pass quantifies transcripts from the merged file (without new transcript discovery) in each sample

Dependencies

Nextflow: for common installation procedures see the IARC-nf repository.
StringTie
R with packages tidyverse, ballgown, and tximeta

twopass mode

In twopass mode, original and novel transcript annotations gtfs are compared with 4. gffcompare

A conda receipe, and docker and singularity containers are available with all the tools needed to run the pipeline (see "Usage")

Input

Type	Description
input_folder	input folder with BAM files

Parameters

Mandatory

Name	Example value	Description
--gtf	ref_annot.gtf	annotation .gtf file

Optional

Name	Default value	Description
--input_file	null	File in TSV format containing columns "ID" (sample ID), "bam" (path to RNA-seq BAM file), and "readlength" (the sample's read length)
--output_folder	.	folder where output is written
--readlength	75	Read length for count computation (only if input_folder is used)
--mem	2	memory
--cpu	2	number of CPUs
--prepDE_input	none	File given to script prepDE from StringTie
--annot_organism, --annot_genome, --annot_provider, --annot_version, --ref	"Homo sapiens","hg38", Unknown, Unknown, Unknown	metainformation stored in SummarizedExperiment R object

Note that you have two ways of providing input: specifying a folder (then all bam files will be processed) or a file with columns "ID", "bam", and "readlength" (in any order). The file method is preferred when bam files do not all have the same read length.

Also note that of the metainformation used for the SummarizedExperiment object creation, only --annot_organism and --annot_genome are actually used to retrieve information about the annotation in the R script; the other parameters (including the reference) are just written in the metainformation of the object. When no genome or organism is specified, the script attempts to retrieve automatically the metainformation from the gtf file, and otherwise falls back to defaults (hg38 and Homo sapiens).

Flags

Name	Description
--help	print usage and optional parameters
--twopass	Enable StringTie 2pass mode

Usage

nextflow run iarcbioinfo/RNAseq-transcript-nf --input_folder BAM/ --output_folder out --gtf ref_annot.gtf

To run the pipeline on a series of BAM files in folder BAM and an annotation file ref_annot.gtf, one can type:

nextflow run IARCbioinfo/RNAseq-transcript-nf -r v2.2 -profile singularity --input_folder BAM/ --output_folder out --gtf ref_annot.gtf

To run the pipeline without singularity just remove "-profile singularity". Alternatively, one can run the pipeline using a docker container (-profile docker) the conda receipe containing all required dependencies (-profile conda).

Output

Type	Description
expr_matrices/*_matrix.csv	matrices with gene and transcript expression in different formats (counts, FPKM, and TPM)
logs/	StringTie logs
stats/	gatk stats files from mutect
intermediate_files/expression_matrices/	same as expr_matrices/*_matrix.csv but with one matrix per read length
intermediate_files/sample_folders/	(see below)
(optional) gtf/	if the twopass mode is enabled, stringtie_annot.gtf contains identified genes and transcripts and folder gffcmp stats comparing original gtf with stringtie gtf (see below)
Robjects	R data objects in ballgown format (bg.rda, containing both gene- and transcript-level information) and SummarizedExperiment format (gene.SE.rda and transcript.SE.rda)

The sample_folders/ folder contain subfolders (sample/ST1pass/ or sample/ST2pass depending on the twopass option), which themselves contain a folder for each sample, with:

an expression quantification file (*_gene_abund.tab) with FPKM and TPM
an annotation file (*_merged.gtf)
Ballgown input files for statistical analysis using R package ballgown (exon/transcript and intron/transcript ids correspondance e2t.ctab and i2t.ctab, exon, intron, and transcript-level quantification files e_data.ctab, i_data.ctab, and t_data.ctab)

The gtf/gffcmp folder contains an annotation file with the discovered and known transcripts (gtf/gffcmp_merged.annotated.gtf), along with information about naming (gffcmp_merged.stringtie_merged.gtf.refmap, gffcmp_merged.tracking) and positions (gffcmp_merged.loci, gffcmp_merged.stringtie_merged.gtf.tmap), and some statistics (gtf/gffcmp_merged.stats)

See the SummarizedExperiment documentation for details about the structure. In our case, the data contains:

4 assays accessible using function assay() (raw_counts, length--with transcript lengths--, abundance_FPKM, and abundance_TPM)
metadata about package versions, gtf (reference genome, version, annotation ) accessible using function metadata()
feature data (transcript and gene name, id) accessible using function rowData() and rowRanges() (detailed information)
sample data (name and read length) accessible using function colData()

Directed Acyclic Graph

Contributions

Name	Email	Description
Nicolas Alcala*	AlcalaN@fellows.iarc.fr	Developer to contact for support

References

Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature protocols, 11(9), 1650-1667.