smoove-nf

Nextflow implementation of the smoove toolset (and some others) focused on reliably calling SVs in your data.

The workflow

The workflow consists of a number of steps, each generally writing to its own results directory.

Call genotypes

smoove call is run on individual bam or cram alignment files. Output is written to $outdir/smoove-called and includes $sample-smoove.genotyped.vcf.gz and an index.

Merge genotypes

Next, we collect all SVs across samples into a single, merged (union) VCF using smoove merge. Results are written to $outdir/smoove-merged and include the file $project.sites.vcf.gz.

Genotype all samples

Using the union of SVs across all samples, we genotype each sample at those sites using smoove genotype with duphold for depth annotations. Output is written to $outdir/smoove-genotyped/$sample-smoove.genotyped.vcf.gz.

Square and annotate VCF

All single-sample genotyped VCFs are pasted into a single, square, joint-called file using smoove paste. The variants are then annotated with smoove annotate, using the annotation supplied via --gff. Results are written to:

$outdir/smoove-squared/$project.smoove.square.anno.vcf.gz
- Annotated and indexed VCF for all SVs across all samples.
$outdir/bpbio/svvcf.html
- A report of SV counts per sample by SV type.

Coverage profiling

Using indexcov, we estimate coverage across the genome per sample and perform coverage-based quality control. The full output of goleft indexcov is written to $outdir/indexcov, and its HTML report to $outdir/indexcov/index.html.

Workflow report

Logs and output of various steps are aggregated and summarized into one report written to $outdir/smoove-nf.html.

Cumulative chromosome coverage is available in $outdir/covviz_report.html.
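
As a quick post-run check, the project-level files named above can be collected in one place. The following is a minimal shell sketch, not part of the workflow: expected_outputs and missing_outputs are hypothetical helper names, the paths are simply those listed in this README, and the defaults mirror the documented defaults for --outdir and --project. Per-sample files are left out.

```shell
# List the project-level result paths described in this README.
# Illustrative only: the paths are copied from the text, not read from the workflow.
expected_outputs() {
    local outdir="${1:-./results}"   # documented default of --outdir
    local project="${2:-sites}"      # documented default of --project
    printf '%s\n' \
        "$outdir/smoove-merged/$project.sites.vcf.gz" \
        "$outdir/smoove-squared/$project.smoove.square.anno.vcf.gz" \
        "$outdir/bpbio/svvcf.html" \
        "$outdir/indexcov/index.html" \
        "$outdir/smoove-nf.html" \
        "$outdir/covviz_report.html"
}

# Print one "missing: <path>" line for each expected file that does not exist.
missing_outputs() {
    local path
    expected_outputs "$@" | while IFS= read -r path; do
        [ -e "$path" ] || printf 'missing: %s\n' "$path"
    done
}
```

Calling missing_outputs ./results myproject after a run prints nothing when every listed file is present.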

Usage

A Docker container is maintained in parallel with this workflow (https://hub.docker.com/r/brentp/smoove) and will be pulled by Nextflow before data processing begins. The only dependencies you need to install yourself are Nextflow and either Docker or Singularity.

nextflow run brwnj/smoove-nf -latest [nextflow options] [smoove-nf options]

Running this using provided containers can be accomplished using the docker profile:

nextflow run brwnj/smoove-nf -latest -profile docker [nextflow options] [smoove-nf options]

Required parameters

--bams
- Aligned sequences in .bam and/or .cram format. Indexes (.bai/.crai) must be present.
- Use wildcards like 'SRP1234/alignments/*.cram' to specify your alignment files.

Note: Nextflow will handle wildcard expansion in this case, so it's critical that we quote the string, like:

nextflow run brwnj/smoove-nf -latest \
	--bams '~/SRP1234/alignments/*.cram'
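
Since an index must be present for every alignment file, it can be worth confirming that before launching. The following is a minimal sketch, assuming indexes are named either sample.bam.bai / sample.cram.crai or sample.bai / sample.crai. missing_indexes is a hypothetical helper, not something the workflow provides.

```shell
# Print every alignment file that has no index next to it.
# Assumption: the index is either "<file>.bai"/"<file>.crai" or has the
# alignment extension replaced ("sample.bai"/"sample.crai").
missing_indexes() {
    local aln
    for aln in "$@"; do
        case "$aln" in
            *.bam)  [ -e "$aln.bai" ]  || [ -e "${aln%.bam}.bai" ]   || printf '%s\n' "$aln" ;;
            *.cram) [ -e "$aln.crai" ] || [ -e "${aln%.cram}.crai" ] || printf '%s\n' "$aln" ;;
        esac
    done
}

# Usage (here the glob is deliberately unquoted so the shell expands it):
# missing_indexes ~/SRP1234/alignments/*.cram
```

An empty result means every listed file has an index under one of the assumed names.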

--fasta
- File path to reference fasta. Index (.fai) must be present.
- GRCh38 is available at: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome
--gff
- Annotation GFF used to annotate variants.
- GRCh38 reference is available at: ftp://ftp.ensembl.org/pub/release-95/gff3/homo_sapiens/Homo_sapiens.GRCh38.95.chr.gff3.gz
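
Putting the three required parameters together, a complete invocation might look like the sketch below. The file paths are placeholders invented for illustration, and smoove_nf_command is a hypothetical helper that only prints the command so it can be reviewed before running it.

```shell
# Placeholder paths for illustration; substitute your own files.
bams='SRP1234/alignments/*.cram'   # quoted so Nextflow, not the shell, expands the wildcard
fasta='reference/GRCh38.fa'
gff='annotation/Homo_sapiens.GRCh38.95.chr.gff3.gz'

# Print the full command instead of executing it.
smoove_nf_command() {
    printf "nextflow run brwnj/smoove-nf -latest -profile docker --bams '%s' --fasta '%s' --gff '%s'\n" \
        "$bams" "$fasta" "$gff"
}

smoove_nf_command
```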

Optional parameters

--outdir
- The base results directory for output
- default: './results'
--bed
- File path to bed of exclude regions for smoove call.
- Exclude regions for b37 and GRCh38 are made available by the Hall lab under speedseq.
--exclude
- regular expression of chromosomes to skip
- You should escape '$', e.g. "~random\$,~_alt\$"
- default: "~^HLA,~^hs,~:,~^GL,~M,~EBV,~^NC,~^phix,~decoy,~random\$,~Un,~hap,~_alt\$"
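
To get a feel for which chromosome names the default --exclude list would skip, the patterns can be tried against a few names in the shell. This is an illustrative sketch only, not the matching code used by smoove or this workflow. It assumes that entries starting with '~' are regular expressions searched anywhere in the name and that other entries are exact names. is_excluded and classify are hypothetical helpers.

```shell
# Return 0 if the chromosome name matches any entry of a comma-delimited list.
# Assumption: '~' marks a regular expression; anything else is an exact name.
is_excluded() {
    local chrom="$1" rest="$2," entry
    while [ -n "$rest" ]; do
        entry="${rest%%,*}"    # text before the first comma
        rest="${rest#*,}"      # text after the first comma
        case "$entry" in
            '~'*) printf '%s\n' "$chrom" | grep -Eq -- "${entry#?}" && return 0 ;;
            "$chrom") return 0 ;;
        esac
    done
    return 1
}

classify() {
    if is_excluded "$1" "$2"; then echo "$1: skipped"; else echo "$1: kept"; fi
}

# Single quotes keep '$' literal in the shell, so no backslash is needed here.
patterns='~^HLA,~^hs,~:,~^GL,~M,~EBV,~^NC,~^phix,~decoy,~random$,~Un,~hap,~_alt$'
for chrom in chr1 chrM chrUn_KI270302v1 chr1_KI270706v1_random 'HLA-A*01:01:01:01'; do
    classify "$chrom" "$patterns"
done
```

Under these assumptions only chr1 is reported as kept; the other four names each match at least one pattern.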
--project
- Acts as the file prefix for merged and squared sites
- default: 'sites'
--sexchroms
- Comma delimited names of the sex chromosome(s) used to infer sex, e.g. --sexchroms 'chrX,chrY'
- default: 'X,Y'
--sensitive
- Preserves more variants from being filtered throughout the workflow
- default: false

covviz params

--zthreshold
- a sample must be greater than this many standard deviations in order to be found significant
- default: 3.5
--distancethreshold
- consecutive significant points must span this distance in order to pass this filter
- default: 150000
--slop
- leading and trailing segments added to significant regions to make them more visible
- default: 500000
--minsamples
- Show all traces when analyzing this few samples; ignores z-threshold, distance-threshold, and slop
- default: 8

somalier params

--knownsites
- optional, but required in order to run somalier quality control
- VCF of known polymorphic sites -- download links can be found at https://github.com/brentp/somalier/releases, but any set of common variants will work
- default: false
--ped
- optional, but required in order to run somalier relate and generate somalier's HTML report
- sample relationship definitions
- default: false
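
For readers who have not written a pedigree file before, the sketch below writes a minimal one for a single trio in the standard six-column PED layout. All family and sample IDs are made up, and write_trio_ped is a hypothetical helper. It is an assumption here that the sample IDs should match the sample names in your alignment files.

```shell
# Write a minimal PED file for one trio. All IDs are invented for illustration.
# Tab-separated columns: family, sample, father, mother,
# sex (1=male, 2=female), phenotype (1=unaffected, 2=affected).
write_trio_ped() {
    local out="$1"
    # printf reuses the six-field format for each group of six arguments.
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
        fam1 kid1 dad1 mom1 1 2 \
        fam1 dad1 0    0    1 1 \
        fam1 mom1 0    0    2 1 > "$out"
}

# Usage:
# write_trio_ped trio.ped
```

The resulting file would then be passed with --ped alongside --knownsites.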

Updating

To pull changes made to the workflow and ensure you're running the latest version, use:

nextflow pull brwnj/smoove-nf

That will either pull any changes or confirm you're at the latest version.

Alternatively, when you run the workflow simply use:

nextflow run brwnj/smoove-nf -latest