smoove-nf
Nextflow implementation of the smoove toolset (and some others) focused on reliably calling SVs in your data.
The workflow
The workflow consists of a number of steps, each generally outputting to its own result directory.
Call genotypes
smoove call
is run on individual bam or cram alignment files. Output is written to $outdir/smoove-called
and includes $sample-smoove.genotyped.vcf.gz
and an index.
Merge genotypes
Next, we collect all SVs across samples into a single, merged (union) VCF using smoove merge
. Results are written to $outdir/smoove-merged
and include the file $project.sites.vcf.gz
.
Genotype all samples
Using the union of SVs across all samples, we genotype each sample at those sites using smoove genotype
with duphold
for depth annotations. Output is written to $outdir/smoove-genotyped/$sample-smoove.genotyped.vcf.gz
.
Square and annotate VCF
Take all single-sample genotyped VCFs and paste them into a single, square, joint-called file using smoove paste
. Then annotate the variants using the annotation supplied from --gff
with smoove annotate
. Results are written to:
$outdir/smoove-squared/$project.smoove.square.anno.vcf.gz
- Annotated and indexed VCF for all SVs across all samples.
$outdir/bpbio/svvcf.html
- A report of SV counts per sample by SV type.
Coverage profiling
Using indexcov, estimate coverage across the genome per sample and perform coverage-based quality control. The full output of goleft indexcov
is written to $outdir/indexcov
. Its report is written to $outdir/indexcov/index.html
.
Workflow report
Logs and output of various steps are aggregated and summarized into one report written to $outdir/smoove-nf.html
.
Cumulative chromosome coverage is available in $outdir/covviz_report.html
.
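As a quick way to confirm that a run produced the project-level files listed in the steps above, a small shell sketch such as the following may help. The function names `expected_outputs` and `check_outputs` are hypothetical helpers, not part of smoove-nf, and the per-sample files under `smoove-called` and `smoove-genotyped` are left out because they depend on your sample names.

```shell
# Print the project-level result paths named in this README.
# $1: output directory (README default for --outdir is ./results)
# $2: project prefix   (README default for --project is sites)
expected_outputs() (
    outdir=${1:-./results}
    project=${2:-sites}
    printf '%s\n' \
        "$outdir/smoove-merged/${project}.sites.vcf.gz" \
        "$outdir/smoove-squared/${project}.smoove.square.anno.vcf.gz" \
        "$outdir/bpbio/svvcf.html" \
        "$outdir/indexcov/index.html" \
        "$outdir/smoove-nf.html" \
        "$outdir/covviz_report.html"
)

# Report which of those files exist; the exit status is non-zero
# if at least one of them is missing.
check_outputs() (
    missing=0
    while IFS= read -r path; do
        if [ -e "$path" ]; then
            echo "ok      $path"
        else
            echo "missing $path"
            missing=1
        fi
    done <<EOF
$(expected_outputs "$@")
EOF
    [ "$missing" -eq 0 ]
)
```

After a run, `check_outputs ./results sites` prints one line per file and can be used in a script to stop early when something is absent.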
Usage
A Docker container is maintained in parallel with this workflow (https://hub.docker.com/r/brentp/smoove) and will be pulled by Nextflow before data processing begins. Apart from Nextflow and either Docker or Singularity, there are no dependencies to download or install.
nextflow run brwnj/smoove-nf -latest [nextflow options] [smoove-nf options]
To run the workflow with the provided container, use the docker
profile:
nextflow run brwnj/smoove-nf -latest -profile docker [nextflow options] [smoove-nf options]
Required parameters
--bams
- Aligned sequences in .bam and/or .cram format. Indexes (.bai/.crai) must be present.
- Use wildcards like
'SRP1234/alignments/*.cram'
to specify your alignment files.
Note: Nextflow will handle wildcard expansion in this case, so it's critical that we quote the string like:
nextflow run brwnj/smoove-nf -latest \
--bams '~/SRP1234/alignments/*.cram'
--fasta
- File path to reference fasta. Index (.fai) must be present.
- GRCh38 is available at: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome
--gff
- Annotation GFF used to annotate variants.
- GRCh38 reference is available at: ftp://ftp.ensembl.org/pub/release-95/gff3/homo_sapiens/Homo_sapiens.GRCh38.95.chr.gff3.gz
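Because both `--bams` and `--fasta` require index files to be present, it can be worth checking for them before launching. The sketch below is a hypothetical helper, not part of smoove-nf. It takes the wildcard as one quoted string, just as `--bams` does, and expands it itself. It accepts either `file.cram.crai` or `file.crai` style index names; whether the workflow accepts both layouts is an assumption.

```shell
# Hypothetical pre-flight helper, not part of smoove-nf.
# $1: quoted wildcard pattern, as it would be given to --bams
# $2: reference FASTA, as it would be given to --fasta
preflight() (
    pattern=$1
    fasta=$2
    status=0

    # Expand the pattern here, the way the receiving program would.
    # Unquoted on purpose; paths containing whitespace are not handled.
    for f in $pattern; do
        case "$f" in
            *.bam)  idx_a="$f.bai";  idx_b="${f%.bam}.bai" ;;
            *.cram) idx_a="$f.crai"; idx_b="${f%.cram}.crai" ;;
            *)      echo "skipping (not .bam/.cram): $f"; continue ;;
        esac
        if [ ! -e "$f" ]; then
            # An unmatched pattern is left as a literal string by the shell.
            echo "no files match: $f"; status=1
        elif [ ! -e "$idx_a" ] && [ ! -e "$idx_b" ]; then
            echo "missing index: $f"; status=1
        fi
    done

    # The reference needs a .fai index next to it.
    [ -e "$fasta.fai" ] || { echo "missing index: $fasta"; status=1; }
    [ "$status" -eq 0 ]
)
```

For example, `preflight 'SRP1234/alignments/*.cram' reference.fa` (with `reference.fa` standing in for your own reference) prints one line per problem and returns a non-zero status if any index is missing.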
Optional parameters
--outdir
- The base results directory for output
- default: './results'
--bed
- File path to bed of exclude regions for
smoove call
.
- Exclude regions for b37 and GRCh38 are made available by the Hall lab under speedseq.
--exclude
- regular expression of chromosomes to skip
- You should escape '$', e.g.
"~random\$,~_alt\$"
- default:
"~^HLA,~^hs,~:,~^GL,~M,~EBV,~^NC,~^phix,~decoy,~random\$,~Un,~hap,~_alt\$"
--project
- Acts as the file prefix for merged and squared sites
- default: 'sites'
--sexchroms
- Comma delimited names of the sex chromosome(s) used to infer sex, e.g.
--sexchroms 'chrX,chrY'
- default: 'X,Y'
--sensitive
- Keeps more variants that would otherwise be filtered out during the workflow
- default: false
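Putting the required parameters and the optional ones above together, a complete invocation might be assembled as in the sketch below. The file names `reference.fa`, `annotation.gff3.gz` and `exclude.bed` and the project name are placeholders, not files shipped with the workflow. Passing `--sensitive` with no value relies on Nextflow treating a bare parameter as true. The command is only printed here so it can be inspected before anything runs.

```shell
# Inside double quotes, \$ yields a literal dollar sign, so this holds the
# README's own example value. Passing --exclude replaces the longer default.
exclude_pattern="~random\$,~_alt\$"

# Build the argument list with `set --` so each value stays one argument.
# The single quotes around the wildcard keep the shell from expanding it.
set -- nextflow run brwnj/smoove-nf -latest -profile docker \
    --bams 'SRP1234/alignments/*.cram' \
    --fasta reference.fa \
    --gff annotation.gff3.gz \
    --outdir ./results \
    --bed exclude.bed \
    --project SRP1234 \
    --sexchroms 'chrX,chrY' \
    --exclude "$exclude_pattern" \
    --sensitive

# Dry run: show the command instead of executing it.
smoove_cmd=$(printf '%s ' "$@")
printf '%s\n' "$smoove_cmd"
```

To launch the workflow for real, replace the last two lines with `"$@"`, which runs the stored arguments exactly as built.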
covviz params
--zthreshold
- a sample must be greater than this many standard deviations in order to be found significant
- default: 3.5
--distancethreshold
- consecutive significant points must span this distance in order to pass this filter
- default: 150000
--slop
- leading and trailing segments added to significant regions to make them more visible
- default: 500000
--minsamples
- Show all traces when analyzing this few samples; ignores z-threshold, distance-threshold, and slop
- default: 8
somalier params
--knownsites
- optional, but required in order to run somalier quality control
- VCF of known polymorphic sites -- download links can be found at https://github.com/brentp/somalier/releases, but any set of common variants will work
- default: false
--ped
- optional, but required in order to run
somalier relate
and generate somalier's HTML report
- sample relationship definitions
- default: false
Updating
To pull changes made to the workflow and ensure you're running the latest version, use:
nextflow pull brwnj/smoove-nf
That will either pull any changes or confirm you're at the latest version.
Alternatively, when you run the workflow simply use:
nextflow run brwnj/smoove-nf -latest