Whole Genome Sequencing Structural Variation Pipeline
Quick start
# Install nextflow:
curl -fsSL get.nextflow.io | bash
mv ./nextflow ~/bin
# Set work dir to no-backup, put this in your .bashrc
export NXF_WORK=$SNIC_NOBACKUP/work
# Pull workflow from this repo, run manta, normalize, and variant effect predictor:
nextflow run -profile milou NBISweden/wgs-structvar --project <uppmax_project_id> --bam <bamfile.bam> --steps manta,normalize,vep
# Monitor log file
tail -f .nextflow.log
Your summary files will be in the results subdirectory.
General information
This is a pipeline for running the two structural variation callers fermikit and manta on UPPMAX. You can choose to run either of the two callers or both (and generate summary files). The main focus of this pipeline is to enable better comparisons with the SweGen dataset, so the default parameters for the tools are the same as those used for that dataset. If you have access to the structural variants in the SweGen dataset, you can add that file to the pipeline and thereby filter population-specific variants.
Profiles for running on Uppmax HPC clusters
It is possible to run the pipeline in a few different ways: either as a single-node job or by letting Nextflow distribute the tasks using the SLURM queueing engine. There are also some slight differences in module usage depending on which HPC system is used.
Specify the profile to use with the -profile option to Nextflow. The available profiles are listed under the command line options below.
Masking
Artifact masking
The pipeline will use the following mask files to remove known artifacts:
- From cc2qe/speedseq: https://github.com/cc2qe/speedseq/raw/master/annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed
- From lh3/varcmp: https://github.com/lh3/varcmp/raw/master/scripts/LCR-hs37d5.bed.gz
You can configure the location of the artifact mask files with the --mask_artifacts_dir command line option.
Cohort masking
The pipeline can take bed files to filter variants. To run the pipeline with filters, put the bed files in the mask_cohort/ subdirectory and add the mask_cohort option to the comma-separated --steps command line argument, e.g.:
cp some_bed_file.bed <path-to-wgs-structvar>/mask_cohort/
nextflow run -profile biancalocal <path-to-wgs-structvar>/main.nf --project <uppmax_project_id> --bam <bamfile.bam> --steps manta,normalize,vep,mask_cohort
You can configure the location of the cohort mask files with the --mask_cohort_dir command line option.
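As a rough sketch of preparing a cohort mask before copying it into place, the snippet below writes a small bed file and checks its basic shape. The file name, the regions, and the helper check_bed are hypothetical and only serve as illustration. The chromosome names are assumed to follow the same naming as your reference assembly.

```shell
# Hypothetical file name and regions, for illustration only.
bed="my_cohort_variants.bed"

# bed format: chromosome, 0-based start, end (tab separated).
printf '1\t10500\t12000\n'    > "$bed"
printf '2\t350000\t351200\n' >> "$bed"

# check_bed FILE (hypothetical helper, not part of the pipeline):
# succeeds only if every line has at least three columns and start < end.
check_bed() {
    awk -F '\t' '
        NF < 3 || $2 + 0 >= $3 + 0 { printf "bad line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad }' "$1"
}

check_bed "$bed" && echo "$bed looks OK"

# Then copy it into place (not executed here):
# cp "$bed" <path-to-wgs-structvar>/mask_cohort/
```

A check like this only catches formatting mistakes. It says nothing about whether the regions are sensible for your cohort.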
Detailed usage
Command line options
Run a local copy of the wgs-structvar WF:
nextflow main.nf --bam <bamfile> [more options]
OR run from github:
nextflow nbisweden/wgs-structvar --bam <bamfile> [more options]
Options:
Required
--bam Input bamfile
OR
--runfile Input runfile for multiple bamfiles in the same run.
Whitespace separated, first column is bam file,
second column is output directory and an optional third column
with a run id to more easily keep track of the run (otherwise
it's autogenerated).
--project Uppmax project to log cluster time to
-profile <profile>
Where profile can be any of milou, localmilou, bianca,
localbianca and devel. The local<x> are for running the
entire workflow on a single node on the cluster, without
the local prefix the slurm queueing system is used.
Optional
--help Show this message and exit
--fastq Input fastqfile (default is bam but with fq as fileending)
Used by fermikit, will be created from the bam file if
missing.
--steps Specify what steps to run, comma separated: (default: manta, vep)
Callers: manta, fermikit
Annotation: vep, snpeff
Extra: normalize (with vt),
mask_cohort (with bed files in mask_cohort/)
--sg_mask_ovlp Fractional overlap for use with the filter option
--no_sg_reciprocal Don't use a reciprocal overlap for the filter option
--outdir Directory where resultfiles are stored (default: results)
--prefix Prefix for result filenames (default: no prefix)
--mask_artifacts_dir
Directory with bed files for artifact filtering (default: mask_artifacts)
--mask_cohort_dir
Directory with bed files for cohort filtering (default: mask_cohort)
The log file .nextflow.log will be produced when running and can be monitored with e.g. tail -f .nextflow.log
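The --runfile format described in the options above (bam file, output directory, optional run id, whitespace separated) could be prepared along these lines. This is a minimal sketch: the sample paths, output directories, and run id are made up, and the column check is only a convenience, not something the pipeline requires.

```shell
# Hypothetical sample paths and run ids; replace with your own.
runfile="runfile.txt"

# One line per bam file: <bam file> <output directory> [run id]
printf '%s\t%s\t%s\n' /path/to/sample1.bam results/sample1 sample1_run  > "$runfile"
printf '%s\t%s\n'     /path/to/sample2.bam results/sample2             >> "$runfile"

# Sanity check: every line must have two or three whitespace-separated columns.
awk '
    NF < 2 || NF > 3 { printf "line %d: expected 2-3 columns, got %d\n", NR, NF; bad = 1 }
    END { exit bad }' "$runfile" && echo "runfile OK"

# Then start the pipeline with (not executed here):
# nextflow run -profile milou NBISweden/wgs-structvar --project <uppmax_project_id> --runfile runfile.txt
```

The second line leaves out the run id, in which case it is autogenerated according to the usage message.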
Customization
Nextflow can pull from GitHub (master branch), so if you specify this repo it will run
what is currently in it. However, if you want to customize the parameters further, you
will want to clone the repo and edit the nextflow.config file in it.
It's probably only the params scope of the config file that is of interest to customize.
The first part has the default values for the command line parameters, see the usage message for information on them.
The next section has the reference assembly to use, both as fasta and assembly name.
You may want to use different versions of the modules used by this workflow;
currently you have to edit the profiles to do that. On UPPMAX we have the
milou profile, which specifies all the modules and versions; see
config/milou.config.
The runtimes of the different programs are set in the config/standard.config
file. That file also specifies how to deal with errors and the interaction
with the Slurm scheduler. You probably don't want to change those settings
unless you know what you are doing.
The two folders mask_artifacts and mask_cohort contain bed files used to
filter the vcf files from the callers. The artifact directory contains files
that should remove problematic regions: every variant that overlaps a region
in the artifact mask by at least 25% is removed. The cohort directory is
for more stringent filtering of already known variants, and here the default
filter threshold is instead a reciprocal overlap of 95%. It can be customized
with the two options sg_mask_ovlp (default 0.95) and no_sg_reciprocal.
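To make the difference between the two thresholds concrete, here is a small arithmetic sketch. It is not the pipeline's own filtering code. The function overlap_check is hypothetical, and it assumes that the plain overlap fraction is measured relative to the variant, while a reciprocal overlap must hold relative to both the variant and the mask region.

```shell
# overlap_check START1 END1 START2 END2 MIN_FRACTION RECIPROCAL(0|1)
# Hypothetical helper: prints "masked" if the variant (interval 1) overlaps the
# mask region (interval 2) by at least MIN_FRACTION, otherwise "kept".
# Coordinates are half-open, as in bed files.
overlap_check() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" -v min="$5" -v recip="$6" 'BEGIN {
        # Force numeric interpretation of all arguments.
        s1 += 0; e1 += 0; s2 += 0; e2 += 0; min += 0; recip += 0
        len1 = e1 - s1
        len2 = e2 - s2
        if (len1 <= 0 || len2 <= 0) { print "kept"; exit }

        # Length of the shared stretch, zero if the intervals do not touch.
        start = (s1 > s2) ? s1 : s2
        end   = (e1 < e2) ? e1 : e2
        ovlp  = (end > start) ? end - start : 0

        frac_variant = ovlp / len1
        frac_mask    = ovlp / len2

        # Reciprocal mode also requires the mask region to be covered.
        hit = (frac_variant >= min) && (!recip || frac_mask >= min)
        print (hit ? "masked" : "kept")
    }'
}

# Artifact-style check: 50% of the variant is covered, threshold 25%.
overlap_check 1000 2000 1500 3000 0.25 0
# Cohort-style check: a small variant inside a huge region fails the reciprocal test.
overlap_check 1000 2000 0 100000 0.95 1
```

The second call shows why the reciprocal requirement is the more stringent one: the variant is fully covered by the region, but the region is hardly covered by the variant, so the variant is kept.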
Support
If you need help with this module, please create a support issue on GitHub.