gatk4-DataPreProcessing-nf
Nextflow pipeline for pre-processing BAM file(s) with hg38 and GATK4, following GATK Best Practices.
<div style="text-align:center"><img src="https://us.v-cdn.net/5019796/uploads/editor/3o/dznasg7toiq1.png" width="200" /></div>
Description
This pipeline is tailored to re-analyzing BAM files under the new GATK4 Best Practices, using the hg38 versions of all reference databases.
Dependencies
- This pipeline is based on nextflow. As we have several nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.
- GATK4 executables
- Picard tools
- BWA, especially BWAKIT, because a post-alignment treatment is required (more info).
- Sambamba.
- Qualimap binary in your PATH (for a nice QC per BAM).
- References (genome in fasta, dbSNP vcf, 1000 Genomes vcf, Mills and 1000 Genomes Gold Standard vcf), available in GATK Bundle.
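Before launching the pipeline, it can help to confirm that the command-line dependencies listed above are reachable. The following is a minimal sketch, not part of the pipeline: `missing_tools` is a hypothetical helper, and the executable names (`nextflow`, `bwa`, `sambamba`, `qualimap`) are assumptions that may differ on your installation. GATK4 is not checked here because the pipeline takes its path through --gatk_exec.

```shell
# missing_tools is a hypothetical helper: it prints every name not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names below are assumptions; adjust them to your installation.
missing="$(missing_tools nextflow bwa sambamba qualimap)"
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on a single line.
    echo "Not found on PATH:" $missing >&2
fi
```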
IMPORTANT note about post-alignment: according to this post, BWA has an implicit alt-aware mode. To get the expected behavior from the postalt.js step, make sure that the <name_of_ref>.fasta.alt file is also present in the FASTA reference folder.
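The check described in the note can be sketched as a small shell test. This is only an illustration: `has_alt_index` is a hypothetical helper name, and the default reference path is borrowed from the usage example, so replace it with your own.

```shell
# has_alt_index is a hypothetical helper, not part of the pipeline.
# It succeeds when a non-empty <reference>.alt file sits beside the FASTA.
has_alt_index() {
    local ref_fasta="$1"
    [ -s "${ref_fasta}.alt" ]
}

# Assumed example path; override it by setting REF_FASTA.
ref="${REF_FASTA:-/ref/Homo_sapiens_assembly38.fasta}"
if has_alt_index "$ref"; then
    echo "ALT file found: ${ref}.alt"
else
    echo "WARNING: ${ref}.alt is missing; postalt.js may not behave as expected" >&2
fi
```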
Input
- --input : your input BAM file(s). Do not forget the quotes when passing multiple BAM files, e.g. --input "test_*.bam".
- --output_dir : the folder that will contain your aligned, recalibrated, analysis-ready BAM file(s).
- --ref_fasta : your reference in FASTA.
- --dbsnp : dbSNP VCF file.
- --onekg : 1000 Genomes High Confidence SNV VCF file.
- --mills : Mills and 1000 Genomes Gold Standard SID VCF file.
- --gatk_exec : the full path to your GATK4 binary file.
- --interval_list : a file for the intervals to call on. More information on interval_list format.
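The reason for quoting the --input pattern can be seen with a small experiment. The sketch below assumes nothing about the pipeline itself: `count_args` is a hypothetical helper, and the two BAM files are empty placeholders created in a scratch directory. Without quotes, the shell expands the glob before the command runs, so a single option would receive several separate arguments; with quotes, the pattern is passed through as one string.

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Work in a scratch directory holding two empty placeholder BAM files.
workdir="$(mktemp -d)"
touch "$workdir/test_1.bam" "$workdir/test_2.bam"
cd "$workdir" || exit 1

# Unquoted: the shell expands the glob first, so two separate arguments arrive.
unquoted="$(count_args test_*.bam)"
# Quoted: the pattern arrives as one literal argument.
quoted="$(count_args "test_*.bam")"

echo "unquoted: $unquoted, quoted: $quoted"   # prints "unquoted: 2, quoted: 1"
```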
A nextflow.config file is also included; please modify it as needed to run outside our pre-configured clusters (see Nextflow configuration).
Usage for Cobalt cluster
nextflow run iarcbioinfo/gatk4-DataPreProcessing-nf -profile cobalt --input "/data/test_*.bam" --output_dir /data/myRecalBAMs --ref_fasta /ref/Homo_sapiens_assembly38.fasta --gatk_exec /bin/gatk-4.0.6.0/gatk --dbsnp /ref/dbsnp_146.hg38.vcf.gz --onekg /ref/1000G_phase1.snps.high_confidence.hg38.vcf.gz --mills Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --interval_list Exome.interval_list