Awesome
Allele-specific RNA-seq pipeline
<img align="right" href="https://biocore.crg.eu/" src="https://raw.githubusercontent.com/CRG-CNAG/BioCoreMiscOpen/master/logo/biocore-logo_small.png" />
This Nextflow[1] based workflow allows alignment of either single or paired ends reads to a reference transcriptome described as a genome in fasta file plus an annotation in GTF format using STAR[2] aligner considering the variant information provided as a VCF file.
STAR implements the WASP[3] method for filtering of allele specific alignments and reports within the resulting alignments which variant is found and from which allele. A python script based on pysam [4] efficiently separates the alignment generated by STAR into four different files:
- Allele A
- Allele B
- Neither A nor B
- Ambiguous (found both variants on one or different pairs)
HTseq-count[5] tool is then used for counting tags separated in those three categories mapping to the genes. Finally a report is generated using multiQC[6] that summarize the results of each step together with an initial QC evaluation fo raw reads done using FastQC[7]
Install
You need to install either Docker or Singularity and Nextflow.
Then you can clone the repository:
git clone --depth 1 git@github.com:biocorecrg/allele_specific_RNAseq.git
or
git clone --depth 1 https://github.com/biocorecrg/allele_specific_RNAseq.git
depending on your GitHub configuration.
Prepare the input data
You need to extract the SNP information from the global VCF file, so first of all you need to download the VCF file with the index:
wget ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz
wget ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz.tbi
then the reference genome:
wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
and the annotation from Ensembl. We used the version Mus_musculus.GRCm38.68 not available in Ensembl archive.
You can use the pipelines specifying between Docker and Singularity containers by using the options
-with-singularity
or
-with-docker
The module makeAnno can be used for generating a VCF file with SNP for the interesting species and a genome with SNP position masked with Ns.
For using the module:
cd makeAnno
NXF_VER=20.01.0 nextflow run make_anno.nf -with-singularity -bg --vcffile mgp.v5.merged.snps_all.dbSNP142.vcf.gz --speciesA CAST_EiJ --speciesB 129S1_SvImJ --genome GRCm38_68.fa --outvcf CAST_EiJ-129S1_SvImJ.vcf > log
Or in case you want to compare against the reference genome:
cd makeAnno
NXF_VER=20.01.0 nextflow run make_anno.nf -with-singularity -bg --vcffile mgp.v5.merged.snps_all.dbSNP142.vcf.gz --speciesA CAST_EiJ --speciesB reference --genome GRCm38_68.fa --outvcf CAST_EiJ-C57BL_6J.vcf > log
This can take some memory for masking the genome. In the output folder named filteredVCF you will find both gzipped masked genome and gzipped vcf annotations.
Run the pipeline
cd allele_specific_RNAseq;
NXF_VER=20.01.0 nextflow run as_rnaseq.nf -with-singularity -bg > log
optionally you might want to check your pipeline on Nextflow's Tower website. You need to register using an istitutional mail and set the token provided in a variable as described:
export TOWER_ACCESS_TOKEN=<<<<<TOKEN NUMBER>>>>>>
NXF_VER=20.01.0 nextflow run as_rnaseq.nf -with-singularity -bg -with-tower > log
you can check the status of your pipeline live on the tower website.
The parameters for running the pipeline are defined in the file params.config that can be changed accordingly.
parameter | value |
---|---|
reads | $baseDir/test/*_{1,2}.fastq.gz |
genome | $baseDir/test/GRCm38_68_19.masked.fa.gz |
annotation | $baseDir/test/Mus_musculus.GRCm38.68_19.gtf |
strandness | reverse |
variants | $baseDir/test/19_filt.vcf.gz |
output | $baseDir/output_test |
single | NO |
varcut | 1 |
title | Allele specific RNAseq project |
subtitle | This is my wonderful RNA experiment |
PI | Luca Cozzuto |
User | Luca Cozzuto |
UCSCgenomeID | mm10 |
mymail@mydomain.eu |
providing a real email address will deliver a mail with the multiqc report when the analysis is finished.
Fastq reads
Fastq paired ends reads can be either plain or gzipped. You need to specify this syntax for capturing the identifier from single end samples:
PATH/*.fastq.gz
and this for paired end ones:
PATH/*_{1,2}.fastq.gz
Genome
Gzipped masked fasta file of the genome obtained running makeAnno
Strandness
It can be either "unstranded" or "forward" or "reverse". It depends on the kind of sequencing.
annotation
GTF file
variants
Gzipped VCF file obtained running makeAnno
single
YES: single end reads. NO: paired ends
varcut
Number of SNP needed for assigning a read to a variant.
Results
The following folder will contain the final outputs:
- Index: the indexed genome
- QC: containing fastQC results
- Alignments: containing sorted BAM files as they were from normal bulk RNAseq
- Report: A detailed report for the pipeline run, that will be sent via email.
- Allele_alignments: containing BAM files containing reads with variants marked as: alleleA, alleleB, ref and ambiguous. The count is done considering the strand protocol used and specified as a parameter.
- cut_N * A folder that is generated for each SNP cut off chosen containing the following sub-folders:
- Allele_single_counts: containing the count per gene per sample and single alleles. The count is done considering the strand protocol used and specified as a parameter.
- Allele_merged_Counts: containing the count per gene per allele in a single file for each sample. The count is done considering the strand protocol used and specified as a parameter.
- Counts: composite counts per gene per sample. No distincion between alleles. Four fields: gene id, tot counts considering the sequencing unstranded, tot counts considering the sequencind done on the forward strand, tot counts considering reverse strand. This information is useful for validating the strand specific protocol used in the sequencing step.
- Proportions: for each sample the read count per allele are divided for the sum of counts of the variants and multiplied for the composite count. The count is done considering the strand protocol used and specified as a parameter.
propA = alleleA/(alleleA+alleleB) * composite
propB = alleleB/(alleleA+alleleB) * composite