Awesome
Structural variant calling tutorial using long-reads.
Installation
Installation of required libraries depends on your linux distribution, e.g., for Ubuntu:
apt-get update
apt-get install -y autoconf build-essential cmake g++ git libcurl4-gnutls-dev libbz2-dev libdeflate-dev libgl1-mesa-dev libncurses-dev liblzma-dev pkg-config zlib1g-dev
Then clone the repository and install all dependencies using mamba:
git clone --recursive https://github.com/tobiasrausch/sv
cd sv
make all
Download the course data:
make FILE=1PfCy8yESCxvI8RJsfxTbF-QsygfnKNA2 download
Load the software stack
Load the mamba environment with all required tools.
if [ ! -z ${CONDA_PREFIX+x} ]; then conda deactivate; fi
export PATH=`pwd`/mamba/bin:${PATH}
SV Calling
Reconstructing a derivative chromosome in cancer
In this practical we will analyze structural variants (SVs) of a pancreatic cancer cell line from the Genome in a Bottle project. The data was subsampled and subset to chr1 and chr5 to speed up all subsequent analysis. The tumor genome alignment file is named tumor.bam
and the control genome alignment file is named control.bam
.
Structural variant alignment quality control
For the discovery of structural variants, you should first assess the quality of the sequencing data as, for example, local assembly methods suffer from short reads and a high sequencing error rate and read-depth methods from uneven coverage. Common quality criteria are the percentage of reads mapped, the duplicate rate, the read-length distribution and the error rate. Popular tools to compute long-read quality control metrics are NanoPlot and Alfred.
cd data/lr/
alfred qc -r genome.fa -o qc.tsv.gz -j qc.json.gz tumor.bam
zcat qc.tsv.gz | grep ^ME | datamash transpose
NanoPlot --bam tumor.bam -o qc_tumor
cat qc_tumor/NanoStats.txt
As you can see from the QC results, the data has been downsampled to fairly low coverage to speed up all analyses in this tutorial. This implies that some structural variants will have only weak support. In terms of QC interpretation, there are some general things to watch out for, such as unexpected high error rates (>4%), less than 80% of reads above Q10, an N50 read length below 10Kbp or unexpected patterns in the read length histogram.
Exercises
- What is the difference between the weighted read length histogram and the non-weighted histogram (WeightedHistogramReadlength.png, Non_weightedHistogramReadlength.png)?
- How is the N50 read length calculated?
- What is the median coverage of the tumor genome?
Germline Structural Variants
Before we dive into structural variant (SV) calling, let's get an idea of how SVs look like in long-read sequencing data. I have prepared a BED file with some "simple" germline structural variants like deletions and insertions and one complex example.
cat svs.bed
Using IGV we can browse these SVs interactively.
igv -g genome.fa
Once IGV has started use 'File' and 'Load from File' to load the tumor.bam
and control.bam
alignment file. Then import the svs.bed
file from your working directory using 'Regions' and 'Import Regions'.
You can then easily navigate to the structural variants with 'Regions' and 'Region Navigator'.
Select a structural variant in the region navigator and click 'View', which will center the IGV alignment view on the selected structural variant.
You can zoom in and out using the '+' and '-' signs in the toolbar at the top.
You may also want to visualize clipped reads using the 'View' menu, then click 'Preferences...' and then click the 'Alignments' tab to switch on 'Show soft-clipped reads'. For the last complex SV, you may also want to highlight one specific read named 8c0eba99-ad7d-4246-87f0-46177b9946d4
which is over 100Kbp long: Use a right click on tumor.bam
and then 'Select by name...'. We will later use that read to visualize this complex region.
Exercises
- Which of the two insertions could be a mobile element insertion? What typical features of a mobile element can you observe for that insertion?
- Do you see for heterozygous SVs, potential heterozygous SNPs in the vicinity that are on the same haplotype?
Plotting structural variants
IGV is excellent for interactive browsing but for large numbers of SVs you can use command-line tools such as wally to plot multiple SVs in batch.
wally region -R svs.bed -cp -g genome.fa tumor.bam control.bam
Complex structural variants
Even in germline genomes we can observe complex structural variants or potential reference errors. One example was in the svs.bed
file.
cat svs.bed | grep "complex"
In IGV, we already saw that a very long read named 8c0eba99-ad7d-4246-87f0-46177b9946d4
maps to that region.
Let's do a dotplot of that read against the reference region using wally.
wally dotplot -g genome.fa -r 8c0eba99-ad7d-4246-87f0-46177b9946d4 -e chr1:16500000-16600000 tumor.bam
This dotplot suggests a repeat expansion relative to GRCh38.
Exercises
- The UCSC genome browser suggests that there is already a patch of the reference for this area of the genome, called chr1_MU273333v1_fix.fa. You may want to try and realign the read to the patched genome using minimap2.
samtools view -F 3844 tumor.bam chr1:16500000-16650000 | grep "^8c0eba99-ad7d-4246-87f0-46177b9946d4" | cut -f 1,10 | awk '{print ">"$1"\n"$2;}' > read.fa
minimap2 -ax map-ont -L chr1_MU273333v1_fix.fa read.fa | samtools sort -o read.bam -
samtools index read.bam
wally dotplot -g chr1_MU273333v1_fix.fa -r 8c0eba99-ad7d-4246-87f0-46177b9946d4 -e chr1_MU273333v1_fix:300000-400000 read.bam
Delly structural variant calling
Delly is a method for detecting structural variants using short- or long-read sequencing data.
Using the tumor and normal genome alignment, delly calculates structural variants and outputs them as a BCF file, the binary encoding of VCF. Delly's long-read SV discovery mode uses the subcommand lr
.
delly lr -g genome.fa -o sv.bcf tumor.bam control.bam
VCF encoding of structural variants
VCF was originally designed for small variants such as single-nucleotide variants (SNVs) and short insertions and deletions (InDels). That's why all SV callers heavily use the VCF INFO fields to encode additional information about the SV such as the structural variant end position (INFO:END) and the SV type (INFO:SVTYPE). You can look at the header of the BCF file using grep where '-A 2' includes the first two structural variant records after the header in the file:
bcftools view sv.bcf | grep "^#" -A 2
Delly uses the VCF:INFO fields for structural variant site information, such as how confident the structural variant prediction is and how accurate the breakpoints are. The genotyping fields contain the actual sample genotype, its genotype quality and genotype likelihoods and various count fields for the variant and reference supporting reads. Please note that at this stage the BCF file contains germline and somatic structural variants but also false positives caused by mis-mappings or incomplete reference sequences.
Querying VCF files
Bcftools offers many possibilities to query and reformat SV calls. For instance, to output a table with the chromosome, start, end, identifier and genotype of each SV we can use:
bcftools query -e 'SVTYPE=="BND"' -f "%CHROM\t%POS\t%INFO/END\t%ID[\t%GT]\n" sv.bcf | head
Inter-chromosomal translocations with SV type BND
are a special case because they involve two different chromosomes.
bcftools query -i 'SVTYPE=="BND"' -f "%CHROM\t%POS\t%INFO/CHR2\t%INFO/POS2\t%ID[\t%GT]\n" sv.bcf
Bcftools can also filter by genomic regions. For instance, to verify delly found the Alu (mobile element) insertion we previously looked at in IGV, you can use:
grep "INS02" svs.bed
bcftools view sv.bcf chr5:56632119-56632158
Delly's consensus sequence (INFO:CONSENSUS) is a local assembly of all SV-supporting reads. So we can again create a dotplot using wally to highlight the insertion relative to GRCh38.
bcftools query -f "%ID\t%INFO/CONSENSUS\n" sv.bcf | grep "INS00001302" | awk '{print ">"$1"\n"$2;}' > ins.fa
samtools faidx genome.fa chr5:56631000-56633000 | sed 's/^>.*$/>hg38/' >> ins.fa
wally dotplot ins.fa
To identify additional Alu candidate insertions, we can for instance filter by size and select insertions with a poly-A tail.
bcftools query -i 'SVTYPE=="INS" && SVLEN>250 && SVLEN<350' -f "%CHROM\t%POS\t%ALT\n" sv.bcf | grep "AAAAAAAAAAA"
Exercises
- How many inter-chromosomal translocations were identified by delly?
- How can bcftools be used to count the number of structural variants for the different SV types (DEL, INS, DUP, INV, BND)?
- Using the bcftools query command above: How many potential Alu insertions are in forward and reverse orientation?
Somatic structural variant filtering
Delly's initial SV calling cannot differentiate somatic and germline structural variants. We therefore now use delly's somatic filtering, which requires a sample file listing tumor and control sample names from the VCF file.
cat spl.tsv
There are many parameters available to tune the somatic structural variant filtering like the minimum variant allele frequency to filter out subclonal variants, for instance. Let's leave the defaults for the time being and only filter for SVs with the VCF FILTER field set to PASS.
delly filter -p -f somatic -o somatic.bcf -s spl.tsv sv.bcf
As expected, the somatic SVs have a homozygous reference genotype in the control sample.
bcftools query -f "%CHROM\t%POS\t%INFO/END\t%ID[\t%GT]\n" somatic.bcf
Using Bcftools and wally we can also easily plot the intra-chromosomal somatic SVs.
bcftools query -e 'SVTYPE=="BND"' -f "%CHROM\t%POS\t%INFO/END\t%ID\n" somatic.bcf | awk '{print $1"\t"($2-50)"\t"($3+50)"\t"$4;}' > somatic.bed
wally region -R somatic.bed -cp -g genome.fa tumor.bam control.bam
Exercises
- Do you think all somatic variants are truly somatic? Which ones are likely false positive somatic SVs?
Reconstructing a derivative chromosome in cancer
IGV and wally are good for relatively small SVs but for large SVs like the duplication-type SV or inter-chromosomal translocations we need to integrate read-depth with structural variant predictions to get a better overview of complex somatic rearrangements. Let's first create a simple read-depth plot.
delly cnv -i 50000 -j 50000 -w 50000 -uo cnv.bcf -c cnv.cov.gz -g genome.fa -m map.fa.gz tumor.bam
Rscript cnBafSV.R cnv.cov.gz
Now we can overlay the somatic structural variants on top of the read-depth information.
bcftools query -f "%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\t%ID\t%INFO/CHR2\t%INFO/POS2\n" somatic.bcf > svs.tsv
Rscript cnBafSV.R cnv.cov.gz svs.tsv
Apparently, chr1 and chr5 are connected but the question is whether the left or the right end is joined from each inter-chromosomal translocation breakpoint. Delly outputs the orientation of each segment in the INFO:CT field.
bcftools query -i 'SVTYPE=="BND"' -f "%INFO/CT\n" somatic.bcf
In this case, 3to3
indicates that chr1p is joined with chr5p in inverted orientation. Here is a brief summary of the different connection types.
Exercises
- Does the derivative chromosome containing chr1p and chr5p contain a centromere?
Calling and phasing small variants (SNVs and InDels)
For phasing and haplo-tagging the tumor reads we first call small variants using longshot in the matched control.
longshot --no_haps -s HG008-N-P --bam control.bam --ref genome.fa --out control.longshot.vcf
bgzip control.longshot.vcf
tabix control.longshot.vcf.gz
We can then perform long-read based haplotyping using WhatsHap and phase heterozygous variants.
whatshap phase --ignore-read-groups --reference genome.fa control.longshot.vcf.gz control.bam -o control.whatshap.vcf
bgzip control.whatshap.vcf
tabix control.whatshap.vcf.gz
Let's briefly inspect some phasing statistics using the WhatsHap stats
subcommand.
whatshap stats --chr-lengths <(cut -f 1,2 genome.fa.fai) control.whatshap.vcf.gz
Exercises
- How many mega-basepairs is the longest phased block?
- What is the N50 phased block length?
Haplotagging of tumor reads
We can now use the haplotypes derived from the matched control to phase the tumor reads.
whatshap haplotag -o tumor.hp.bam --reference genome.fa --ignore-read-groups --tag-supplementary control.whatshap.vcf.gz tumor.bam
samtools index tumor.hp.bam
The tumor.hp.bam
file now has the HP
tag included for all the reads that could be phased to haplotype 1 or 2. IGV can group and color reads by HP
tag so let's revisit the SVs from the svs.bed
file that we used in the beginning of the tutorial.
igv -g genome.fa
As previously, once IGV has started use 'File' and 'Load from File' to load the tumor.hp.bam
alignment file with the HP
haplotype tags. Then import the svs.bed
file from your working directory using 'Regions' and 'Import Regions'. You can then easily navigate to the structural variants with 'Regions' and 'Region Navigator'. Once you have selected an SV, right click on tumor.hp.bam
, use 'Color alignments by' and select 'tag'. Then enter 'HP' to color by haplotype. Alternatively, you can also use 'Group alignments by' tag to sort reads according to assigned haplotype.
Exercises
- Did all heterozygous SVs separate correctly by haplotype?
Variant allele frequency of small variants across somatic copy-number alterations
Let's annotate the small variants of the control genome in the matched tumor.
longshot --no_haps --potential_variants control.longshot.vcf.gz --output-ref --min_cov 3 --min_alt_count 1 --min_alt_frac 0.1 -s HG008-T --bam tumor.bam --ref genome.fa --out tumor.longshot.vcf
bgzip tumor.longshot.vcf
tabix tumor.longshot.vcf.gz
Then we merge the tumor and control VCF files to annotate all heterozygous variants with their allelic depth in the tumor.
bcftools merge -O b -o tumor.control.bcf tumor.longshot.vcf.gz control.whatshap.vcf.gz
bcftools query -f "%CHROM\t%POS[\t%GT\t%AD]\n" tumor.control.bcf | grep "0|1" | cut -f 1,2,4 | sed 's/,/\t/' | awk '$3+$4>0 {print $1"\t"$2"\t"($3/($3+$4));}' > var.vaf
bcftools query -f "%CHROM\t%POS[\t%GT\t%AD]\n" tumor.control.bcf | grep "1|0" | cut -f 1,2,4 | sed 's/,/\t/' | awk '$3+$4>0 {print $1"\t"$2"\t"($4/($3+$4));}' >> var.vaf
head var.vaf
We can then overlay the variant allele frequencey (B-allele frequency) of small variants with the read-depth information.
Rscript cnBafSV.R cnv.cov.gz svs.tsv var.vaf
Exercises
- For the somatic duplication, we have an estimated total copy-number of 3. What are the expected B-allele frequencies in that region?