Awesome
Tip and tricks for BAM files
- Usefull tools
- Samtools organisation and repositories
- Other tools
- Tip and tricks
- Check if BAM is sorted
- Sort a BAM
- Extract run ID, flow cell ID and Lane number
- Extract sample name
- Change sample name
- Simple variant calling with freebayes
- Add new read group in header
- Calculate bed-positions coverage
Usefull tools
Samtools organisation and repositories
- File format specification
- Samtools github page
- Samtools webpage
- Samtools man page
Compilation (from here):
git clone --branch=develop git://github.com/samtools/htslib.git
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/samtools.git
cd bcftools; make
cd ../samtools; make
cd ../htslib; make
Other tools
- biobambam2 github
- Picard with its github page
- R package rbamtools
- bedtools documentation and github page
Tip and tricks
Check if BAM file is sorted
samtools view -H test.bam | grep @HD
Will show SO:coordinate
if sorted.
Sort a BAM file
samtools sort -o test_sorted.bam -T tmp test.bam
Extract run ID, flow cell ID and Lane number
This only works for recent Illumina BAM files but can easily be adapted for other types and for FASTQ files. We extract the first column from the first read of the BAM that looks like HISEQ:292:C5GGUACXX:6:1101:16516:24562
and take fields 2, 3 and 4 separated by :
:
samtools view test.bam | head -1 | cut -f1 | cut -f2-4 -d':' | tr ':' '\t'
Extract sample name
samtools view -H test.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq
To have the sample names of all BAM files in a folder with their file names:
for f in *.bam; do echo -ne "$f\t" ; samtools view -H $f | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq; done
Change sample name
For all read groups in a BAM file
samtools view -H test.bam | sed "s/SM:[^\t]*/SM:TEST_SAMPLE_NAME/g" | samtools reheader - test.bam > test_SM.bam
Simple variant calling using freebayes
freebayes -f ucsc.hg19.fasta --min-alternate-count 5 --no-complex --min-mapping-quality 20 --min-base-quality 20 \
--min-coverage 20 test.bam | vcffilter -f "QUAL > 20" | vcfbreakmulti | vt normalize - -q -r ucsc.hg19.fasta > test.vcf
Add a new ID-specified read group in the bam header
RGline=$(samtools view -H test.bam | grep '@RG' | head -1 | sed "s/ID:[^\t]*/ID:NEW_ID/g")
line_number=$(samtools view -H test.bam | sed -n '/@RG/=' - | head -1)
samtools view -H test.bam | sed "${line_number}"'i\'"${RGline}" | samtools reheader - test.bam > test_RGadded.bam
unset RGline; unset line_number
Calculate coverage for each position of a bed
bedtools coverage -d -a my_bed.bed -b my_bam.bam
Find where reads map in a BAM file
This will only report non-zero coverage and create contiguous regions with similar coverage (see bedtools
manual). Here for example we only keep regions with at least 10 reads:
$ bedtools genomecov -bg -ibam test.bam | awk '$4>=10'
chr1 154566162 154566163 14
chr1 154566163 154566164 15
chr1 154566164 154566167 18
chr1 154566167 154566171 19
It can also be useful to group these regions and to report the total number of reads in each region. The first step is to merge contiguous regions identified by the previous command with bedtools merge
. Here we also merge regions less than 100bp from each other:
$ bedtools genomecov -bg -ibam test.bam | awk '$4>=10' | bedtools merge -d 100 -i stdin
chr1 154566162 154566294
chr2 45189642 45189787
Finally we can now ask bedtools
to count the number of reads in each of these regions using coverage
, and the full command is:
$ bedtools genomecov -bg -ibam test.bam | awk '$4>=10' | bedtools merge -d 100 -i stdin | bedtools coverage -a stdin -b test.bam
chr1 154566162 154566294 21 132 132 1.0000000
Here for example there are 21 reads in region chr1:154566162-154566294. These 21 reads cover 132bp and the region itself is 132bp long, so 100% of the region is covered.
Convert BAM to FASTQ
Usefull links explaining why it's not so simple (for paired-end sequencing at least): http://crazyhottommy.blogspot.fr/2015/10/convert-bam-to-fastq-and-remap-it-with.html
Warning when reading old texts about this: htscmd bamshuf
has been successively renamed samtools bamshuf
and now samtools collate
(since samtools v1.3
). Similarly htscmd bam2fq
has been successively renamed samtools bam2fq
and now simply samtools fastq
.
This commands allows to do it without intermediate files, including the final zip:
samtools collate -uOn 128 in.bam tmp-prefix | samtools fastq -s se.fq.gz - | gzip > in_interleaved_reads.fq.gz
One might often do this to realign afterward the FASQ file. In this case it's even possibe to avoid creating the intermediate FASTQ and to go directly from the BAM to the realigned BAM in a single command:
samtools collate -uOn 128 in.bam tmp-prefix | samtools fastq -s se.fq.gz - | bwa mem -p ref.fa -
Actually one can even directly mark duplicates and sort the final BAM in a single command like speedseq is doing using samblaster and sambamba.