gatk4-DataPreProcessing-nf

Nextflow pipeline to pre-process BAM file(s) with hg38 and GATK4, following GATK Best Practices.

<div style="text-align:center"><img src="https://us.v-cdn.net/5019796/uploads/editor/3o/dznasg7toiq1.png" width="200" /></div>

Description

Tailored to re-analyzing existing BAM files under the new GATK4 Best Practices, using the hg38 reference and its associated databases.

Dependencies

  1. This pipeline is based on Nextflow. Because we maintain several Nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully: it contains essential information on installing, using and configuring Nextflow and our pipelines.
  2. GATK4 executables
  3. Picard tools
  4. BWA, and in particular BWAKIT, because a post-alignment step is required (more info).
  5. Sambamba.
  6. Qualimap binary in your PATH (for a nice QC per BAM).
  7. References (genome in FASTA, dbSNP VCF, 1000 Genomes VCF, Mills and 1000 Genomes Gold Standard VCF), available in the GATK Bundle.
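
Before a first run, it can help to confirm that the required executables are reachable. The following is a minimal sketch, not part of the pipeline: `check_tools` is a hypothetical helper, and the binary names in the commented call are assumptions that depend on how each tool was installed.

```shell
# Hypothetical helper: print every given executable that is NOT on PATH.
# Returns 0 when all tools are found, 1 when at least one is missing.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if ! command -v "$ct_tool" >/dev/null 2>&1; then
            echo "$ct_tool"
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# Example call (binary names are assumptions, adjust them to your install):
# check_tools nextflow gatk bwa sambamba qualimap
```

Tools that are passed to the pipeline by full path, such as the GATK executable given with --gatk_exec, do not need to be on PATH and can be left out of the call.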

IMPORTANT note about post-alignment: according to this post, BWA has an implicit alt-aware mode. For the postalt.js step to behave as expected, the <name_of_ref>.fasta.alt file must also be present in the FASTA reference folder.
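
The requirement above can be checked with a short shell test. This is only a sketch: `has_alt_file` is a hypothetical helper, and it assumes the naming described in the note, i.e. the reference path with `.alt` appended.

```shell
# Hypothetical helper: succeed only if both the reference FASTA and its
# companion <name_of_ref>.fasta.alt file exist side by side.
# $1: path to the reference FASTA
has_alt_file() {
    [ -f "$1" ] && [ -f "$1.alt" ]
}
```

For example, `has_alt_file /ref/Homo_sapiens_assembly38.fasta || echo "missing .alt file" >&2` prints a warning when the .alt file is absent from the reference folder.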

Input

A nextflow.config file is also included; please adapt it if you run the pipeline outside our pre-configured clusters (see Nextflow configuration).

Usage for Cobalt cluster

nextflow run iarcbioinfo/gatk4-DataPreProcessing.nf -profile cobalt --input "/data/test_*.bam" --output_dir /data/myRecalBAMs --ref_fasta /ref/Homo_sapiens_assembly38.fasta --gatk_exec /bin/gatk-4.0.6.0/gatk --dbsnp /ref/dbsnp_146.hg38.vcf.gz --onekg /ref/1000G_phase1.snps.high_confidence.hg38.vcf.gz --mills Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --interval_list Exome.interval_list
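
In the command above, the --input pattern is quoted so that the shell passes it through unexpanded. To preview how many BAM files such a pattern matches before launching, something like the following could be used. It is a sketch only: `count_matches` is a hypothetical helper, and it assumes paths without whitespace.

```shell
# Hypothetical helper: count existing files matching a glob given as a
# quoted string, e.g. count_matches "/data/test_*.bam".
count_matches() {
    cm_count=0
    # $1 is left unquoted on purpose so the shell expands the glob here;
    # this also means paths containing whitespace are not supported.
    for cm_file in $1; do
        if [ -e "$cm_file" ]; then
            cm_count=$((cm_count + 1))
        fi
    done
    echo "$cm_count"
}
```

A result of 0 means the pattern matched nothing, which usually points to a wrong directory or file prefix.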