
gatk4-DataPreProcessing-nf

Nextflow pipeline to pre-process BAM file(s) with hg38 and GATK4, following the GATK Best Practices.

Description

This pipeline is tailored to re-analyzing existing BAM files under the new GATK4 Best Practices, using the hg38 version of every reference database.

Dependencies

This pipeline is based on nextflow. As we have several nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.
GATK4 executables
Picard tools
BWA, especially BWAKIT, because a post-alignment treatment is required (more info).
Sambamba.
Qualimap binary in your PATH (for a nice QC per BAM).
References (genome in fasta, dbSNP vcf, 1000 Genomes vcf, Mills and 1000 Genomes Gold Standard vcf), available in GATK Bundle.
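Before launching the pipeline, it can help to confirm that the command-line tools listed above are actually reachable. The following is a minimal sketch, not part of the pipeline: `missing_tools` is a hypothetical helper, and the executable names in the usage comment are assumptions that depend on how each tool was installed (Picard, for instance, is often shipped as a jar rather than a command).

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not in PATH: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call (executable names are assumptions; adjust to your installation):
# missing_tools nextflow bwa sambamba qualimap
```

GATK4 is deliberately left out of the example call, since the pipeline takes its location explicitly through --gatk_exec rather than from PATH.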

IMPORTANT note about post-alignment: according to this post, BWA has an implicit alt-aware mode. For the postalt.js step to behave as expected, the <name_of_ref>.fasta.alt file must also be present in the FASTA reference folder.
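The requirement above can be checked with a few lines of shell before starting a run. This is a sketch under the assumption that the alt file is named exactly as the reference FASTA plus a `.alt` suffix, as described in the note; `check_alt_file` is a hypothetical helper and is not provided by the pipeline.

```shell
# Hypothetical helper: verify that <name_of_ref>.fasta.alt sits next to the
# reference FASTA, so that the postalt.js step can find it.
check_alt_file() {
    ref_fasta="$1"
    if [ -f "${ref_fasta}.alt" ]; then
        echo "found: ${ref_fasta}.alt"
        return 0
    fi
    echo "missing: ${ref_fasta}.alt" >&2
    return 1
}

# Example call, using the reference path from the usage section:
# check_alt_file /ref/Homo_sapiens_assembly38.fasta
```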

Input

--input : your input BAM file(s) (do not forget the quotes for multiple BAM files, e.g. --input "test_*.bam")
--output_dir : the folder that will contain your aligned, recalibrated, analysis-ready BAM file(s).
--ref_fasta : your reference in FASTA.
--dbsnp : dbSNP VCF file.
--onekg : 1000 Genomes High Confidence SNV VCF file.
--mills : Mills and 1000 Genomes Gold Standard SID VCF file.
--gatk_exec : the full path to your GATK4 binary file.
--interval_list : a file for the intervals to call on. More information on interval_list format.
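As a quick illustration of what --interval_list expects: a Picard-style interval_list starts with a SAM-style header (lines beginning with `@`), followed by one interval per line with five tab-separated fields (contig, start, end, strand, name), using 1-based inclusive coordinates. The sketch below is a rough sanity check only; `check_interval_list` is a hypothetical helper, it does not verify that the header matches your reference, and the pipeline itself does not run it.

```shell
# Hypothetical helper: rough sanity check of a Picard-style interval_list.
# Header lines start with "@"; every other line should hold 5 tab-separated
# fields: contig, start, end, strand (+ or -), name.
check_interval_list() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        /^@/ { next }    # skip the SAM-style header
        NF != 5 || ($2 + 0) > ($3 + 0) || ($4 != "+" && $4 != "-") {
            printf "line %d looks malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example call, using the file name from the usage section:
# check_interval_list Exome.interval_list
```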

A nextflow.config is also included; please modify it to suit environments other than our pre-configured clusters (see Nextflow configuration).

Usage for Cobalt cluster

nextflow run iarcbioinfo/gatk4-DataPreProcessing-nf -profile cobalt \
    --input "/data/test_*.bam" \
    --output_dir /data/myRecalBAMs \
    --ref_fasta /ref/Homo_sapiens_assembly38.fasta \
    --gatk_exec /bin/gatk-4.0.6.0/gatk \
    --dbsnp /ref/dbsnp_146.hg38.vcf.gz \
    --onekg /ref/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --mills Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --interval_list Exome.interval_list