
Introduction

This is a small Python script that generates MAF files from the output of standard annotation programs. Currently, Annovar (table_annovar.pl) output and bcftools csq output can be converted to MAF.

$ python annovar2maf.py -h
usage: annovar2maf [-h] [-t TSB] [-b BUILD] [-p {refGene,ensGene}] [-c] input

Convert annovar and bcftools-csq annotations to MAF

positional arguments:
  input                 Annovar anotations file [Ex: myanno.hg19_multianno.txt] or a csq formatted file.

optional arguments:
  -h, --help            show this help message and exit
  -t TSB, --tsb TSB     Sample name. Default parses from the file name
  -b BUILD, --build BUILD
                        Reference genome build [Default: hg38]
  -p {refGene,ensGene}, --protocol {refGene,ensGene}
                        Protocol used to generate annovar annotations [Default: refGene]
  -c, --csq             Input file is a bcftools csq formatted output

annovar2maf

python annovar2maf.py -t foo -b GRCh37 tests/test_mutect.refseq.hg19_multianno.txt 

# For annovar annotations generated with ensGene as a protocol
python annovar2maf.py -p ensGene -t foo -b GRCh37 tests/test_mutect.ens.hg19_multianno.txt
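
When many annotation files need converting, the command line can be assembled programmatically from the flags listed in the help text above. The following is a minimal sketch, not part of annovar2maf.py: `build_command` is a hypothetical helper, the script is assumed to sit in the current working directory, and `-p` is left out for csq input on the assumption that the protocol only matters for Annovar files.

```python
import sys
from typing import List, Optional


def build_command(
    input_path: str,
    tsb: Optional[str] = None,
    build: str = "hg38",
    protocol: str = "refGene",
    csq: bool = False,
    script: str = "annovar2maf.py",  # assumed location of the script
) -> List[str]:
    """Assemble an annovar2maf.py command line from the documented flags."""
    if protocol not in ("refGene", "ensGene"):
        raise ValueError(f"unsupported protocol: {protocol}")
    cmd = [sys.executable, script, "-b", build]
    if tsb is not None:
        # Without -t the script derives the sample name from the file name.
        cmd += ["-t", tsb]
    if csq:
        cmd.append("-c")
    else:
        cmd += ["-p", protocol]
    cmd.append(input_path)
    return cmd


# Example (not executed here):
#   import subprocess
#   subprocess.run(build_command("tests/test_mutect.refseq.hg19_multianno.txt",
#                                tsb="foo", build="GRCh37"), check=True)
```

Returning the argument list rather than running it keeps the helper easy to inspect or log before anything is executed.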

csq2maf

Similar to VEP, the bcftools csq command can annotate variants with consequences. The program is lightweight and extremely fast. Its output can be converted to TSV with split-vep and bcftools query, and then converted to MAF.

ref="Homo_sapiens.GRCh37.dna.primary_assembly.fa"

# Get the GFF files for your ref build
## GRCh38 with and without the chr prefix
#wget ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/Homo_sapiens.GRCh38.110.chr.gff3.gz
#wget ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/Homo_sapiens.GRCh38.110.gff3.gz

## GRCh37 with and without the chr prefix
#wget ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.chr.gff3.gz
wget ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz

## Step-1: The commands below left-normalize the VCF, split multiallelic variants, and annotate the VCF with variant consequences, prioritizing the worst consequence for each variant.
bcftools norm -f ${ref} -m -both -Oz tests/test_mutect.vcf.gz | bcftools csq -c CSQ -f ${ref} -g Homo_sapiens.GRCh37.82.gff3.gz -p a | \
bcftools +split-vep /dev/stdin -Oz -o tests/test_mutect.csq.vcf.gz -c - -s worst

## Step-2: The command below converts the csq-annotated VCF to TSV
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%gene\t%transcript\t%Consequence\t%amino_acid_change\t%dna_change\n' tests/test_mutect.csq.vcf.gz > tests/test_mutect.csq.tsv

## Step-3: Now convert the TSV to MAF
python annovar2maf.py -c -t foo -b GRCh37 tests/test_mutect.csq.tsv
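
Before running Step 3 it can be useful to confirm that the TSV from Step 2 has the expected nine columns. The sketch below is only an illustration and is not taken from annovar2maf.py: `read_csq_tsv` is a hypothetical helper, and the column names are simply copied from the bcftools query format string above.

```python
import csv
from typing import Dict, Iterable, Iterator

# Column order taken from the bcftools query format string in Step 2.
CSQ_COLUMNS = [
    "CHROM", "POS", "REF", "ALT", "gene", "transcript",
    "Consequence", "amino_acid_change", "dna_change",
]


def read_csq_tsv(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per variant, keyed by the Step 2 column names."""
    # QUOTE_NONE keeps any quote characters in the annotations untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(CSQ_COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(CSQ_COLUMNS)} columns, got {len(row)}"
            )
        yield dict(zip(CSQ_COLUMNS, row))


# Example (not executed here):
#   with open("tests/test_mutect.csq.tsv") as handle:
#       for record in read_csq_tsv(handle):
#           print(record["gene"], record["Consequence"])
```

A column-count mismatch usually means the format string in Step 2 was changed, so failing early with the line number tends to be easier to debug than a malformed MAF.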