Awesome
<img align="center" src="https://cloud.githubusercontent.com/assets/3366818/12507528/99afd6fa-c0f6-11e5-993b-5ac925d178fa.png" width="700">A multi-sample somatic variant caller
<img align="center" src="https://cloud.githubusercontent.com/assets/3366818/10489240/7da79144-729c-11e5-8cb3-0225106d9b06.jpg" width="500">A preprint describing needlestack is available on biorxiv: https://www.biorxiv.org/content/10.1101/639377v1
Contact: follm@iarc.fr
Jump over here for a quick start!
Needlestack development is support by the US National Cancer Institute (grant number R21CA175979) and the French Institut national du cancer.
WARNING
Since the last release, we have harmonized our pipelines and the following option names changed:
bam_folder
is nowinput_bams
and accepts text filefasta_ref
is nowref
max_DP
is nowmax_dp
out_folder
is nowoutput_folder
out_vcf
is nowoutput_vcf
and is now a mandatory argumentout_annotated_vcf
is nowoutput_annotated_vcf
pairs_file
is nowtn_pairs
no_plots
is nowplots
(see the Detailed description section)
Description
Needlestack is an ultra-sensitive multi-sample variant caller for Next Generation Sequencing (NGS) data. It is based on the idea that analysing several samples together can help estimate the distribution of sequencing errors to accurately identify variants. It has been initially developed for somatic variant calling using very deep NGS data from circulating free DNA, but is also applicable to lower coverage data like Whole Exome Sequencing (WES) or even Whole Genome Sequencing (WGS). It is a highly scalable and reproducible pipeline thanks to the use of nextflow and Docker/Singularity container technologies.
Here is a summary of the method:
- At each position and for each candidate variant, we model sequencing errors using a negative binomial regression with a linear link and a zero intercept. The data is extracted from the BAM files using samtools.
- Genetic variants are detected as being outliers from the error model. To avoid these outliers biasing the regression we use a robust estimator for the negative binomial regression (published here with code available here).
- We calculate for each sample a p-value for being a variant (outlier from the regression) that we further transform into q-values to account for multiple testing.
Note that needlestack is not performing a re-assembly of reads to determine haplotypes as some other modern variant callers are doing (Mutect2, Strelka2 etc.). For this reason we recommend using ABRA2, a realigner for NGS data, before running needlestack. We also host a nextflow pipeline for ABRA2.
The detection of a variant can be visualized graphically. Here are two examples of a variant detected, one from a series of samples with a coverage of 120X (left) and one with a very high coverage (50000X, right). Each dot represents a sequenced sample colored according to its Phred scale q-value. The black regression line shows the estimated sequencing-error rate along with the 99% confidence interval (black dotted lines) containing samples. Colored-dotted lines correspond to the limits of regions defined for different significance q-value thresholds. In both case one sample appears as outlier from the regression (in red), and is therefore classified as carrying the given mutation. Note that the position on the left has a high error rate, while the position on the right has a very low error rate, allowing the detection of a variant with a tiny allelic fraction (0.025%).
Dependencies
Needlestack works under most Linux distributions and Apple OS X.
-
This pipeline is based on nextflow. As we host several nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.
-
External software:
- bedtools
- samtools
- R with Rscript
- Compile the file
mpileup2readcounts.cc
located here - Add the previous tools to your path (executables are assumed to be respectively called
samtools
,Rscript
andmpileup2readcounts
) - Optionally, the bioconductor packages required for the alignments plots (see the Detailed description section)
You can avoid installing all the external software by using Docker or Singularity. See the IARC-nf repository for more information. Warning: the docker/singularity containers don't include the bioconductor packages needed to plot alignments.
To use needlestack without nextflow, in addition to the previous tools, download the files in this bin directory and add them to your path. See the Usage section to run needlestack without nextflow.
Input
Type | Description |
---|---|
BAM files | BAM files (called *.bam ) grouped in a single folder or described in a text file (one BAM path per line) along with their index files (called *.bam.bai or *.bai ). A minimum of 20 BAM files is recommended. |
fasta file | A reference fasta file (eventually compressed with bgzip) along with its faidx index (and *.gzi faidx index if compressed). |
bed file | Optional input file. Otherwise the variant calling is performed on the whole reference provided. |
A sample dataset is available here for testing.
Parameters
Type --help
to get the full list of options. All parameters are prefixed with a double-dash like in --help
.
-
Mandatory
Name | Example value | Description |
---|---|---|
input_bams | BAM/ or bams.txt | Folder containing the BAM files |
ref | fasta_name.fa | Reference fasta file |
output_vcf | vcf_name.vcf | Name of the output VCF file generated by needlestack |
-
Optional
Name | Default value | Description |
---|---|---|
min_dp * | 30 | Minimum median coverage to consider a site. In addition, at least 10 samples have to be covered by min_dp . |
min_ao * | 3 | Minimum number of non-ref reads in at least one sample to consider a site |
min_af * | 0 | Minimum allelic fraction in at least one sample to consider a site |
nsplit | 1 | Split the bed file in nsplit pieces and run in parallel |
min_qval | 50 | qvalue threshold in Phred scale to consider a variant |
sb_type | SOR | Strand bias measure, either SOR , RVSB or FS |
sb_snv | 100 or 1000 | Strand bias threshold for SNVs. The default of 100 (1000 if sb_type is FS ) means no filter |
sb_indel | 100 or 1000 | Strand bias threshold for indels. The default of 100 (1000 if sb_type is FS ) means no filter |
map_qual | 0 | Min mapping quality (passed to samtools ) |
base_qual | 13 | Min base quality (passed to samtools ) |
max_dp | 50000 | Downsample coverage per sample (passed to samtools ) |
plots | SOMATIC if tn_pairs provided, ALL if not | Creates pdf plots of regressions in the output. To remove plots set --plots to NONE (See Plot options paragraph). |
genome_release | - | Reference genome for the annotions on the alignments plots. Examples: Hsapiens.UCSC.hg19 , Hsapiens.UCSC.hg18 , Hsapiens.UCSC.hg38 , Mmusculus.UCSC.mm10 . The right terminology is described below (Plot options paragraph). WARNING: this option is mandatory if the --do_alignments flag is used |
output_folder | --bam_folder | Output folder, by default equals to the input bam folder |
bed | - | BED file containing a list of regions (or positions) where needlestack will perform variant calling |
region | - | A region in format CHR:START-END where calling should be done |
tn_pairs | - | A tab-delimited file containing two columns (normal and tumor sample names) for each sample in line. This enables matched tumor/normal pair calling features (see below) |
sigma_normal | 0.1 | Sigma parameter for negative binomial modeling of expected germline allelic fraction. We strongly recommend not to change this parameter unless you really know what it means |
min_af_extra_rob | 0.2 | Minimum allelic fraction to exclude a sample at a position for extra-robust regression |
min_prop_extra_rob | 0.1 | Minimum proportion of samples having an allelic fraction to be excluded from extra-robust regression |
max_prop_extra_rob | 0.8 | Maximum proportion of samples having an allelic fraction to be excluded from extra-robust regression |
power_min_af | Allelic fraction used to classify genotypes as 0/0 or ./. depending of the power to detect a variant at this fraction (see below) | |
input_vcf | A VCF file (from GATK) where calling should be done. Needlestack will extract DP and AO from this VCF (DP and AD fields) and annotate it with phred q-value score (FORMAT/QVAL field), error rate (INFO/ERR ) and over-dispersion sigma parameter (INFO/SIG ). WARNING: by default, only works with split (coreutils) version > 8.13 |
* Caution: min_dp
, min_ao
and min_af
are position-based filters, which means that needlestack might still identify a variant with an allelic fraction lower than min_af
in a sample if at the same position, another sample has an allelic fraction larger than min_af
.
By default, if neither --bed
nor --region
are provided, needlestack runs on the whole reference genome provided, building a bed file from fasta index.
If --bed
and --region
are both provided, it runs on the region only.
-
Flags
Flags are parameters without value.
Name | Description |
---|---|
help | Display help |
all_SNVs | Output all SNVs, even when no variant is detected |
extra_robust_gl | Perform extra-robust regression (useful for common germline SNPs, see below) |
no_labels | No label for the outliers on regression plots |
no_indels | Do not perform variant calling for insertions and deletions |
no_contours | Do not plot q-values contours (for q-value threshold={10,30,50,70,100} by default) and do not plot minimum detectable allelic fraction as a function function of coverage |
use_file_name | Use the bam file names as sample names. By default the sample name is extracted from the bam file SM tag |
do_alignments | Add alignments plots to the pdf plots of regressions |
Usage
-
Optionally download a sample dataset.
git clone --depth=1 https://github.com/IARCbioinfo/data_test
-
Run the pipeline using nextflow (here using docker).
cd data_test nextflow run iarcbioinfo/needlestack -with-docker \ --bed BED/TP53_all.bed \ --input_bams BAM/BAM_multiple/ \ --ref REF/17.fasta \ --output_vcf all_variants.vcf
You will find a VCF file called
all_variants.vcf
in theBAM_multiple/
folder once done.The first time it will take more time as the pipeline will be downloaded from github and the docker container from dockerhub.
Official releases can be found here. There is a corresponding official docker container for each release and one can run a particular version using (for example for v0.3):
nextflow run iarcbioinfo/needlestack -r v0.3 -with-docker \ --bed BED/TP53_all.bed --bam_folder BAM/BAM_multiple/ --fasta_ref REF/17.fasta --out_vcf all_variants.vcf
It is also possible to run the pipeline without nextflow (not recommended) using the needlestack.sh
. Here on the example dataset downloaded above:
bash cd data_test/ needlestack.sh --region=17:7572814-7573814 --bam_folder=BAM/BAM_multiple --ref=REF/17.fasta --output_vcf=all_variants.vcf
Output
Type | Description |
---|---|
VCF file | File containing all the variants detected. It contains a header of meta-information and data lines containing information about each variant position |
PDF files | For each variant detected a pdf is generated, this pdf file contains the regression plots and alignments plots if the --do_alignments flag is used |
Detailed description
Germline, somatic, matched Tumor-Normal pairs calling and contamination
When using matched tumor/normal, Needlestack can classify variants (VCF FORMAT/STATUS
field) according to the following table:
For this, one need to provide a tab-delimited file containing two columns with normal and tumor sample names using the --tn_pairs
option. The first line of this file is a header with TUMOR
and NORMAL
keywords. When one normal or one tumor is missing, one can write NA
. In this mode, the parameter power_min_af
defines the allelic fraction in the tumor one is trying to detect to classify genotypes as ./.
or 0/0
depending on the power to detect this allelic fraction. Variants found as somatic in a tumor, but germline in another sample of the series will be flagged as POSSIBLE_CONTAMINATION
. We found this particularly important, as needlestack is very sensitive to low allelic fractions, to filter out contamination among samples, coming for example from pooled exome capture.
In other cases (when there is no --tn_pairs
parameter defined), genotypes are defined as ./.
or 0/0
assuming one is looking for allelic fractions expected for germline variants (negative binomial distribution centered at 0.5 with over-dispersion parameter sigma=sigma_normal
, with sigma_normal=0.1
by default). If you are looking for somatic variants without matched-normal and assuming you are interesting to correctly distinguish ./.
and 0/0
genotypes, you can set the power_min_af
parameter to the lowest allelic fraction of somatic variants you are interested with (and your coverage allows you to find). Note that this is by far not the most common situation, and that in most cases you don't have to worry about the power_min_af
parameter.
Plot options
Two options to consider:
-
--plots
:SOMATIC
: produce pdf regression plots (like in the description above) only for somatic variants (option not available in the annotation mode). Default value when using matched tumor/normal.ALL
: produce pdf regression plots for all variants. Default value when not using matched tumor/normal or when using annotation mode.NONE
: remove pdf regression plots from the output.
-
--do_alignments
: To add the alignments plots after the regression plots. If this flag is used, the name of the reference genome (--genome_release
option) needs to be provided to choose the correct annotation (See--genome_release
option below).
The following bioconductor packages need to be installed for plotting the alignments:
- The Gviz package
- An Annotation package for TxDb objects.
- A Genome wide annotation, it contains mappings between Entrez Gene identifiers and GenBank accession numbers. Examples: the Genome wide annotation package for Human:
org.Hs.eg.db
and for the mouse:org.Mm.eg.db
.
Example: If the hg19 release of the human genome is used, the following packages should be installed: Gviz
, TxDb.Hsapiens.UCSC.hg19.knownGene
(hg18 and hg38 UCSC version can also be used) and org.Hs.eg.db
These packages exist for other organisms than Human but have not been tested. One can for example generate the alignments plot for mouse data by installing TxDb.Mmusculus.UCSC.mm10.knownGene
(mm9 UCSC version can also be used) and org.Mm.eg.db
. For the other organisms the packages need to have the same nomenclature as the ones listed above.
The --genome_release
option needs to be provided and corresponds to the TxDb annotation package name without its prefix and suffix. For the hg19 release of the human genome, one needs to set --genome_release
to Hsapiens.UCSC.hg19
.
Note that the packages chosen for the annotations are compatible with the UCSC notations since most of the Gviz fonctionalities can handle these notations. The reference genome used for the BAM alignments can be based on GENCODE, UCSC or ENSEMBL genome varieties.
Warning: the docker/singularity containers don't include the bioconductor packages needed to plot alignments.
The alignment plot represents from top to bottom:
- Chromosome representation: a red vertical line shows the position of the variant.
- Genomic axis associated with the alignment.
- BAM alignment(s) (50 bases on both sides of the variant): the variant position is highlighted in red. Note that insertions and deletions are not currently supported by Gviz.
- The reference genome: the variant position is highlighted in red.
- Genome annotation: the yellow blocks represent exons, the variant position is highlighted in red. The annotation is represented only if the variant is not in an intergenic region.
- Zoom-out on the genome annotation: representation of the whole gene with its name on the left side. The annotation is represented only if the variant is not in an intergenic region. A red vertical line shows the position of the variant.
- Genomic axis corresponding to the previous gene annotation. A red vertical line shows the position of the variant.
Common variants
Needlestack is designed to identify rare variants (i.e. only a few samples in your set of samples have a particular variant), because of the robust regression used. Therefore, common SNPs (>10%) or strong somatic hotspots will be missed. The optional extra_robust_gl
can overcome partially this issue for common germline mutations: it first discard high allelic fraction (>20%, assuming these are likely true variants) before fitting the regression model when between 10% and 50% of sample have such a high allelic fraction. A flag is written in the VCF INFO/WARN
field when this happened (EXTRA_ROBUST_GL
). Additionally, when an other allele than the reference allele is the most common, it is taken as the reference allele and a flag is also written in the VCF (INFO/WARN=INV_REF
).
Strand bias
For conventional variant callers, GATK WES/WGS recommended values for SOR strand bias are SOR < 4 for SNVs and < 10 for indels. We haven't found this particularly useful, but a more detailed evaluation is necessary. For amplicon based targeted sequencing, RVSB>0.85 seems to reduce some errors. There is no hard filter by default as this is easy to do afterward using bcftools filter command.
Contributions
Name | Description | |
---|---|---|
Matthieu Foll* | follm@iarc.fr | Lead developer, contact for support |
Tiffany Delhomme* | delhommet@students.iarc.fr | Main developer, contact for support |
Nicolas Alcala | alcalan@fellows.iarc.fr | tumor/normal pair mode, power calculation |
Aurelie Gabriel | gabriela@students.iarc.fr | Alignment plot, nextflow pipeline, mpileup2readcounts, bash version |
References
- Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data Tiffany M. Delhomme, Patrice H. Avogbe, Aurelie A.G. Gabriel, Nicolas Alcala, Noemie Leblay, Catherine Voegele, Maxime Vallee, Priscilia Chopard, Amelie Chabrier, Behnoush Abedi-Ardekani, Valerie Gaborieau, Ivana Holcatova, Vladimir Janout, Lenka Foretova, Sasa Milosavljevic, David Zaridze, Anush Mukeriya, Elisabeth Brambilla, Paul Brennan, Ghislaine Scelo, Lynnette Fernandez-Cuesta, Graham Byrnes, Florence Le Calvez-Kelm, James D D. McKay, Matthieu Foll bioRxiv 639377; doi: https://doi.org/10.1101/639377. https://www.biorxiv.org/content/10.1101/639377v1
Needlestack has been used in these papers:
- Urinary TERT promoter mutations as non-invasive biomarkers for the comprehensive detection of urothelial cancer Avogbe et al. 2019 EBioMedicine.
- Fernandez-Cuesta L, Perdomo S, Avogbe PH, Leblay N, Delhomme TM, Gaborieau V, Abedi-Ardekani B, Chanudet E, Olivier M, Zaridze D, Mukeria A, Vilensky M, Holcatova I, Polesel J, Simonato L, Canova C, Lagiou P, Brambilla C, Brambilla E, Byrnes G, Scelo G, Le Calvez-Kelm F, Foll M, McKay JD, Brennan P. Identification of Circulating Tumor DNA for the Early Detection of Small-cell Lung Cancer. EBioMedicine. 2016 Aug;10:117-23. doi: 10.1016/j.ebiom.2016.06.032. Epub 2016 Jun 25. PubMed PMID: 27377626; PubMed Central PMCID: PMC5036515.
- Le Calvez-Kelm F, Foll M, Wozniak MB, Delhomme TM, Durand G, Chopard P, Pertesi M, Fabianova E, Adamcakova Z, Holcatova I, Foretova L, Janout V, Vallee MP, Rinaldi S, Brennan P, McKay JD, Byrnes GB, Scelo G. KRAS mutations in blood circulating cell-free DNA: a pancreatic cancer case-control. Oncotarget. 2016 Nov 29;7(48):78827-78840. doi: 10.18632/oncotarget.12386. PubMed PMID: 27705932; PubMed Central PMCID: PMC5346680.