samclip

Filter SAM file for soft and hard clipped alignments

Introduction

Most short read aligners perform local alignment of reads to the reference genome. Examples include bwa mem, minimap2, and bowtie2 (in --local mode; its default is --end-to-end). This means the ends of the read may not be part of the best alignment.

This can be caused by:

Read aligners output a SAM file. Column 6 in this format stores the CIGAR string, which describes which parts of the read aligned and which didn't. The unaligned ends of the read can be "soft" or "hard" clipped, denoted with S and H at each end of the CIGAR string. It is possible for both types to be present, but that is not common. Soft and hard don't mean anything biologically; they just refer to whether the full read sequence is in the SAM file or not.
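
As a rough illustration of how clipping can be read off a CIGAR string, the Python sketch below pulls out the hard and soft clip lengths at each end. It is not samclip's own code (samclip is a Perl script); the function name `clip_lengths` and the sample CIGAR strings are hypothetical and chosen only for illustration.

```python
import re

# Optional hard clip (H), then optional soft clip (S), at each end of the CIGAR.
# The lazy middle group swallows everything between the clips (M, I, D, ...).
_CLIP_RE = re.compile(r"^(?:(\d+)H)?(?:(\d+)S)?(.*?)(?:(\d+)S)?(?:(\d+)H)?$")


def clip_lengths(cigar):
    """Return (left_hard, left_soft, right_soft, right_hard) for a CIGAR string."""
    match = _CLIP_RE.match(cigar)
    left_hard, left_soft, _core, right_soft, right_hard = match.groups()
    # Groups that did not match are None; treat them as zero clipped bases.
    return tuple(int(n) if n else 0
                 for n in (left_hard, left_soft, right_soft, right_hard))


# Hypothetical CIGAR strings, each describing a 100 bp read.
for cigar in ("100M", "10S90M", "90M10H", "5H10S80M3S2H"):
    print(cigar, clip_lengths(cigar))
```

An unclipped CIGAR such as `100M` yields zeros in all four positions, so the same function can be used to test whether an alignment is clipped at all.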

Some examples of a 100bp aligned read:

Motivation

One may wish to remove these alignments to avoid downstream problems. In particular, samclip was designed to remove clipped alignments to improve variant calling, by removing suspicious local alignments that cause false positives near structural variation. However, it does keep them if they hit the ends of contigs, which is particularly important given the lower coverage often observed at those locations.
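
The filtering idea described above can be sketched as a small decision function. This is a minimal Python sketch under stated assumptions, not samclip's actual implementation: the function name `keep_alignment` and its parameters are hypothetical, "hitting the end of a contig" is assumed to mean that the aligned region starts at position 1 or ends at the last base of the contig, and the threshold of 5 mirrors the default of `--max` shown in the help text.

```python
def keep_alignment(start, end, left_clip, right_clip, contig_len, max_clip=5):
    """Return True to keep an alignment, False to filter it out.

    start, end    : 1-based inclusive reference coordinates of the aligned part
    left_clip     : clipped bases (soft + hard) at the left end of the read
    right_clip    : clipped bases (soft + hard) at the right end of the read
    contig_len    : length of the contig the read aligned to
    max_clip      : longest clip tolerated at either end
    """
    # Assumption: a clip is excused when the alignment touches that contig end,
    # because the read simply runs off the edge of the reference sequence.
    if start <= 1:
        left_clip = 0
    if end >= contig_len:
        right_clip = 0
    return left_clip <= max_clip and right_clip <= max_clip


# A 30 bp left clip in the middle of a contig is rejected ...
print(keep_alignment(500, 569, 30, 0, contig_len=10000))
# ... but the same clip is tolerated when the alignment starts at the contig edge.
print(keep_alignment(1, 70, 30, 0, contig_len=10000))
```

The contig length is the only piece of information that does not come from the SAM record itself, which is one plausible reason a reference is required on the command line.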

Installation

samclip has no dependencies except Perl 5.10 or higher. It only uses core modules, so no CPAN needed.

Direct script download

% cd /usr/local/bin  # choose a folder in your $PATH
% wget https://raw.githubusercontent.com/tseemann/samclip/master/samclip
% chmod +x samclip

Homebrew

% brew install brewsci/bio/samclip

Conda

% conda install -c bioconda -c conda-forge samclip

GitHub

% git clone https://github.com/tseemann/samclip.git
% cp samclip/samclip /usr/local/bin # choose a folder in your $PATH

Test Installation

% ./samclip --version
samclip 0.2

% ./samclip --help
SYNOPSIS
  Filter SAM file for soft & hard clipped alignments
AUTHOR
  Torsten Seemann (@torstenseemann)
USAGE
  % samclip --ref ref.fa < in.sam > out.sam
  % minimap2 ref.fa R1.fq R2.fq | samclip --ref ref.fa | samtools sort > out.bam
OPTIONS
  --help         This help
  --version      Print version and exit
  --ref FASTA    Reference genome - needs FASTA.fai index
  --max NUM      Maximum clip length to allow (default=5)
  --invert       Output rejected SAM lines and ignore good ones
  --debug        Print verbose debug info to stderr
  --progress N   Print progress every NUM records (default=100000,none=0)
HOMEPAGE
  https://github.com/tseemann/samclip

Examples

% samclip --ref ref.fa < in.sam > out.sam

% samclip --ref ref.fa in1.sam in2.sam > out.sam

# the -h is to ensure the SAM header is included
% samtools view -h in.bam | samclip --ref ref.fa | samtools sort > out.bam

% bwa mem ref.fa R1.fq R2.fq | samclip --ref ref.fa | samtools sort > out.bam 
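
Every command above passes `--ref`, which according to the help text needs a FASTA.fai index next to the reference. Such an index is commonly created with `samtools faidx ref.fa`, and its first two tab-separated columns are the contig name and the contig length. The Python sketch below shows how contig lengths could be read from those columns; the function name `read_fai_lengths` and the sample index lines are hypothetical, and this is not a description of how samclip itself reads the file.

```python
def read_fai_lengths(lines):
    """Map contig name -> contig length from the lines of a .fai index."""
    lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # .fai columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Hypothetical index content for a reference with two contigs.
example_fai = [
    "contig1\t5000\t9\t60\t61\n",
    "contig2\t1200\t5102\t60\t61\n",
]
print(read_fai_lengths(example_fai))
```

With a real index, the same function can be fed an open file handle, for example `read_fai_lengths(open("ref.fa.fai"))`.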

Options

Issues

Submit feedback to the Issue Tracker

License

GPL v3

Author

Torsten Seemann