Home

Awesome

Allele-specific RNA-seq pipeline

Docker Build Status Nextflow version Singularity version Docker version Build Status <img align="right" href="https://biocore.crg.eu/" src="https://raw.githubusercontent.com/CRG-CNAG/BioCoreMiscOpen/master/logo/biocore-logo_small.png" />

This Nextflow[1] based workflow allows alignment of either single or paired ends reads to a reference transcriptome described as a genome in fasta file plus an annotation in GTF format using STAR[2] aligner considering the variant information provided as a VCF file.

STAR implements the WASP[3] method for filtering of allele specific alignments and reports within the resulting alignments which variant is found and from which allele. A python script based on pysam [4] efficiently separates the alignment generated by STAR into four different files:

HTseq-count[5] tool is then used for counting tags separated in those three categories mapping to the genes. Finally a report is generated using multiQC[6] that summarize the results of each step together with an initial QC evaluation fo raw reads done using FastQC[7]

Install

You need to install either Docker or Singularity and Nextflow.

Then you can clone the repository:

git clone --depth 1 git@github.com:biocorecrg/allele_specific_RNAseq.git

or

git clone --depth 1 https://github.com/biocorecrg/allele_specific_RNAseq.git 

depending on your GitHub configuration.

Prepare the input data

You need to extract the SNP information from the global VCF file, so first of all you need to download the VCF file with the index:

wget ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz
wget ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz.tbi

then the reference genome:

wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

and the annotation from Ensembl. We used the version Mus_musculus.GRCm38.68 not available in Ensembl archive.

You can use the pipelines specifying between Docker and Singularity containers by using the options

-with-singularity

or

-with-docker

The module makeAnno can be used for generating a VCF file with SNP for the interesting species and a genome with SNP position masked with Ns.

For using the module:

cd makeAnno
NXF_VER=20.01.0 nextflow run make_anno.nf -with-singularity -bg --vcffile mgp.v5.merged.snps_all.dbSNP142.vcf.gz --speciesA CAST_EiJ --speciesB 129S1_SvImJ --genome GRCm38_68.fa --outvcf CAST_EiJ-129S1_SvImJ.vcf > log

Or in case you want to compare against the reference genome:

cd makeAnno
NXF_VER=20.01.0 nextflow run make_anno.nf -with-singularity -bg --vcffile mgp.v5.merged.snps_all.dbSNP142.vcf.gz --speciesA CAST_EiJ --speciesB reference --genome GRCm38_68.fa --outvcf CAST_EiJ-C57BL_6J.vcf > log

This can take some memory for masking the genome. In the output folder named filteredVCF you will find both gzipped masked genome and gzipped vcf annotations.

Run the pipeline

cd allele_specific_RNAseq; 
NXF_VER=20.01.0 nextflow run as_rnaseq.nf -with-singularity -bg > log

optionally you might want to check your pipeline on Nextflow's Tower website. You need to register using an istitutional mail and set the token provided in a variable as described:

export TOWER_ACCESS_TOKEN=<<<<<TOKEN NUMBER>>>>>>
NXF_VER=20.01.0 nextflow run as_rnaseq.nf -with-singularity -bg -with-tower > log 

you can check the status of your pipeline live on the tower website.

The parameters for running the pipeline are defined in the file params.config that can be changed accordingly.

parametervalue
reads$baseDir/test/*_{1,2}.fastq.gz
genome$baseDir/test/GRCm38_68_19.masked.fa.gz
annotation$baseDir/test/Mus_musculus.GRCm38.68_19.gtf
strandnessreverse
variants$baseDir/test/19_filt.vcf.gz
output$baseDir/output_test
singleNO
varcut1
titleAllele specific RNAseq project
subtitleThis is my wonderful RNA experiment
PILuca Cozzuto
UserLuca Cozzuto
UCSCgenomeIDmm10
emailmymail@mydomain.eu

providing a real email address will deliver a mail with the multiqc report when the analysis is finished.

Fastq reads

Fastq paired ends reads can be either plain or gzipped. You need to specify this syntax for capturing the identifier from single end samples:

PATH/*.fastq.gz

and this for paired end ones:

PATH/*_{1,2}.fastq.gz

Genome

Gzipped masked fasta file of the genome obtained running makeAnno

Strandness

It can be either "unstranded" or "forward" or "reverse". It depends on the kind of sequencing.

annotation

GTF file

variants

Gzipped VCF file obtained running makeAnno

single

YES: single end reads. NO: paired ends

varcut

Number of SNP needed for assigning a read to a variant.

Results

The following folder will contain the final outputs:

propA = alleleA/(alleleA+alleleB) * composite 
propB = alleleB/(alleleA+alleleB) * composite