Home

Awesome

<p align="center"> <img src="resources/icon.png"/> </p>

Samovar is a mosaic SNV caller for 10x Genomics linked-read WGS data. Starting from the phased VCF and BAM files output by 10x Genomics Longranger pipeline, Samovar first selects candidate variant sites by filtering sites based on features of the aligned reads, such as the depth of coverage, that contribute to false positive variant calls (1). A random forest tree model is trained (3) with simulated mosaic variants as positive examples (2) and real homozygous/heterozygous variants (2) called in the sample by Longranger as negative examples. Candidate variant sites are scored with this model (4). Repeat regions and non-diploid copy-number regions are optionally filtered out (5) before a final filter removes false positives resulting from alignment errors (6) to produce scored variant calls (7).

<p align="center"> <img src="resources/samovar_wf.png"/> </p>

Dependencies and Configuration

Requires: Python (v.3 recommended, v.2 compatible) with pyfaidx; scikit-learn; simplesam; fisher; and installations of samtools, bedtools. Some steps are compatible with pypy
MARCC The SLURM sbatch parameter --ntasks N should be set to accompany Python script parameter --nproc N

Installation

Main Pipeline

./samovar
Options: generateVarfile, simulate, train, preFilter, classify, postFilter (run without options to see arguments)

samovar generateVarfile

Generates a tab-separated file with fields contig \t site \t VAF to specify where mosaic-like sites will be simulated

samovar generateVarfile --out out.varfile --vcf sample.vcf --fai genome.fa.fai

--out output varfile name [out.varfile]
--vcf sites to avoid in simulation, e.g. the VCF from your sample (takes comma-separated multiple file names)
--fai FASTA file index of the genome from samtools faidx
--variantspacing spacing along the genome of simulated sites [80000]
--vafspacing increment of variant allele fraction [0.05]

Notes:

samovar simulate

Simulates mosaic-like training examples at sites specified in varfile where there is at least one haplotype-discordant read

samovar simulate --bam sample.bam --varfile out.varfile --simulate --nproc 4 > mosaic.features.tsv

--simulate mosaic site simulation mode
--bam bam file to simulate sites from
--varfile varfile from 0.setup/generateVarfile.py of sites and VAF
--nproc uses multiprocessing.Pool (value <= 1 does not import the module and runs in serial)

Notes:

Computes features at known homozygous/heterozygous sites from VCF that have at least one haplotype-discordant read

samovar simulate --bam sample.bam --varfile sample.vcf --het --max 50000 --nproc 4 > het.features.tsv

--het will extract features from sites with genotype 1|0\|0|1 that PASS filter and are SNP [i.e. len(ref) == 1 and len(alt) == 1]
--hom will extract features from sites with genotype 0|0\|1|1 that PASS filter and are SNP [i.e. len(ref) == 1 and len(alt) == 1]
--bam bam file to simulate sites from
--varfile sites to extract features from, e.g. the VCF from your sample
--max maximum number of sites to extract features from
--nproc uses multiprocessing.Pool (value <= 1 does not import the module and runs in serial)
Must catch output from stdout.

Notes:

samovar train

Uses feature vectors from mosaic-like, heterozygous, and homozygous sites to train a random forest classifier

samovar train --mosaic mosaic.features.tsv --het het.features.tsv --hom hom.features.tsv

--mosaic file from --simulate in 1.simulate/simulate.py
--het file from --het in 1.simulate/simulate.py
--hom file from --simulate in 1.simulate/simulate.py Equal number of het and hom examples are randomly selected to give equal number of mosaic and germline training examples.
--out classifier file in Python's pickle binary format [clf.pkl]
--mindepth read coverage required to use a site as a training example [16] ] --nestimators [100] maxleafnodes [50] scikit-learn random forest hyperparameters
--modelsize maximum number of mosaic training examples [None: use all data]

Notes:

samovar preFilter

Scans genome, calculates features for sites that pass all filters. Outputs "vectors" for passing sites which will later be classified and ranked by the random forest

samovar preFilter --bam sample.bam --nproc 48 > vectors.txt 2> intervalsComplete.txt

--bam bam file of your sample
--nproc uses multiprocessing.Pool (value <= 1 does not import the module and runs in serial)

Notes:

samovar classify

Uses the random forest model to rank the feature vectors, outputting those with a higher-than-threshold probability in an output file with format: contig \t position \t position+1 \t Samovar score \t read depth \t Number of haplotype discordant read \t MAF \t minor allele base

samovar classify --clf clf.pkl --vectors vectors.txt > predictions.tsv

--clf classifier trained from 2.train/train.py
--vectors output file from filter step

Notes:

Use bedtools to apply region filters

Filter out simple repeats and CNV

bedtools intersect -v -a predictions.tsv -b repeatsb37.bed | bedtools intersect -v -a stdin -b CNVNATOR.bed > regionfiltered.tsv

Notes:

samovar postFilter

Calculates the minimum Fisher score for association between minor allele reads and mismatches, indels, or alignment endpoints

samovar postFilter --bam sample.bam --bed regionfiltered.tsv --ref genome.fa --vcfavoid sample.vcf --nproc 4 > predictionsFinal.tsv

--bam BAM file of your sample
--bed variant calls
--vcfavoid VCF of sites NOT to consider for possible linkage (i.e. VCF from your sample)
--ref reference genome FASTA
--p minimum Fisher p-value to report - default 0.005

Notes:

Utilities

0.setup/cnvnator2bed.py

python cnvnator2bed.py [infile] [outfile] [.fai of genome]

Converts cnvnator output into a .bed file with +/- 5 buffer around CNV regions

Example

example/EXAMPLE.txt

Reads and variants from Genome in a Bottle NA24385 GRCh38 extracted by:

samtools view -b [input.bam] chr1:200000000-201000000 > example.bam
samtools index example.bam
vcftools --vcf [input.vcf] -c --recode --chr chr1 --from-bp 200000000 --to-bp 201000000 --remove-indels --remove-filtered-all > example.vcf

Can replace "python" with "pymp" and/or add "--nproc N" to commands where permitted
Without pymp or parallelism, the filter step will take about 6 minutes; the rest will take a few seconds each.

python ../1.simulate/simulate.py --bam example.bam --varfile out.varfile --simulate > outs/mosaic.features.tsv
python ../1.simulate/simulate.py --bam example.bam --varfile example.vcf --hom  > outs/hom.features.tsv
python ../1.simulate/simulate.py --bam example.bam --varfile example.vcf --het  > outs/het.features.tsv
python ../2.train/train.py --mosaic outs/mosaic.features.tsv --het outs/het.features.tsv --hom outs/hom.features.tsv --nestimators 1 --maxleafnodes 5 --out outs/clf.pkl
python ../3.classifyAndFilter/filter.py --bam example.bam > outs/vectors.txt
python ../3.classifyAndFilter/classify.py --clf outs/clf.pkl --vectors outs/vectors.txt > outs/predictions.txt
python ../3.classifyAndFilter/linkageFilter.py --bam example.bam --bed outs/predictions.txt --vcfavoid example.vcf --ref [GRCh38 reference genome FASTA] > outs/finalPredictions.txt

Experiments

Scripts related to experiments in the forthcoming Samovar publication.