NOTE: New version supporting 12 tools (CNVbenchmarkeR2) can be found here.

CNVbenchmarkeR

CNVbenchmarkeR is a framework for benchmarking algorithms that detect germline copy number variations (CNVs) against different NGS datasets. The current version supports the DECoN, CoNVaDING, panelcn.MOPS, ExomeDepth and CODEX2 tools.

It is part of our publication in which we performed a benchmark of germline CNV calling tools for targeted gene-panel data. Citation: Moreno-Cabrera, J.M., del Valle, J., Castellanos, E. et al. Evaluation of CNV detection tools for NGS panel data in genetic diagnostics. Eur J Hum Genet (2020). https://doi.org/10.1038/s41431-020-0675-z

Prerequisites

Algorithms have to be properly installed. Links for algorithms installation:

How to use

Get Code

git clone https://github.com/TranslationalBioinformaticsIGTP/CNVbenchmarkeR

Configure algorithms.yaml to set which algorithms will be benchmarked. If you run DECoN, edit algorithms/decon/deconParams.yaml and set deconFolder to your DECoN installation folder. If you run CoNVaDING, edit algorithms/convading/convadingParams.yaml and set the convadingFolder param.
Configure datasets.yaml to define the datasets against which the algorithms will be executed. The files referenced in it must follow the exact expected format; pay special attention to validated_results_file and bed_file, which are tab-delimited files. Check the examples folder for reference.
Launch CNVbenchmarkeR

cd CNVbenchmarkeR
./runBenchmark.sh

Output

A summary file and a .csv results file will be generated in the output/summary folder. Stats include sensitivity, specificity, no-call rate, precision (PPV), NPV, F1, MCC and kappa coefficient.

Stats are calculated per ROI, per gene and at whole-strategy level (gene level including no-calls, i.e., low-quality regions).
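
For readers who want to sanity-check the reported numbers, the sketch below computes the listed stats from confusion-matrix counts using their standard textbook definitions. It is only an illustration: the function name cnv_metrics, its arguments, and the definition of no-call rate as no-calls divided by all evaluated items are assumptions, not taken from the CNVbenchmarkeR code.

```python
import math


def cnv_metrics(tp, fp, tn, fn, no_calls=0):
    """Standard classification stats from confusion-matrix counts.

    Hypothetical helper, not part of CNVbenchmarkeR.
    """
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    called = tp + fp + tn + fn
    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = ratio(tp + tn, called)
    expected = ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), called ** 2)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
        "kappa": ratio(observed - expected, 1 - expected),
        # Assumed definition: share of evaluated items left without a call.
        "no_call_rate": ratio(no_calls, called + no_calls),
    }


if __name__ == "__main__":
    for name, value in cnv_metrics(tp=90, fp=10, tn=880, fn=20, no_calls=50).items():
        print(f"{name}: {value:.4f}")
```

The same function can be applied to counts gathered per ROI or per gene, since only the unit being counted changes.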

Log files will be generated in the logs folder. Output for each algorithm and dataset will be generated in the output folder.

Troubleshooting

Two important checks to ensure that metrics are computed correctly:

The sample names in the validated_results_file should match the file names of your bam files (excluding the .bam extension). For example, if the validated_results_file contains sample names like mySample2312, your bam files should have file names like mySample2312.bam.
Use chromosome names in a single, consistent format. For example, do not mix "chr5" and "5" across your bed file and validated_results_file.
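
Both checks can be automated before launching a benchmark. The following is a minimal sketch that works on plain Python lists, so it makes no assumption about the column layout of validated_results_file or the bed file; the helper names unmatched_samples and mixed_chr_style are hypothetical and not part of CNVbenchmarkeR.

```python
from pathlib import Path


def unmatched_samples(validated_samples, bam_paths):
    """Return validated sample names that have no <sample>.bam file."""
    bam_stems = {
        Path(p).stem for p in bam_paths if Path(p).suffix == ".bam"
    }
    return sorted(set(validated_samples) - bam_stems)


def mixed_chr_style(chrom_names):
    """Return True if names mix the 'chr5' and '5' styles."""
    styles = {str(name).startswith("chr") for name in chrom_names}
    return len(styles) > 1


if __name__ == "__main__":
    # Illustrative values only; replace with names read from your own files.
    samples = ["mySample2312", "mySample2313"]
    bams = ["/data/bams/mySample2312.bam"]
    print("Samples without a bam file:", unmatched_samples(samples, bams))
    print("Mixed chromosome styles:", mixed_chr_style(["chr5", "5", "17"]))
```

An empty list from the first helper and False from the second indicate that both conditions hold for the values provided.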

Extra feature: optimizer

An optimizer is also included in the framework. It executes a CNV calling algorithm against a dataset with many different values for each param; up to 22 values are evaluated per param. It is implemented as a greedy algorithm that is started once from each param, so the CNV algorithm will be executed at most (n_params^2)*22 times.

It seeks to improve sensitivity while allowing the drops in specificity defined in optimizerParams.yaml.
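
To make the run-count bound concrete, here is one possible reading of such a greedy search, written as a toy sketch. It is not the code in optimizer.r: the function greedy_optimize, the evaluate callback, the max_spec_drop argument and the acceptance rule (keep a value if sensitivity rises and specificity stays within the allowed drop from the baseline) are all assumptions made for illustration.

```python
def greedy_optimize(param_grid, defaults, evaluate, max_spec_drop):
    """Greedy per-param search, restarted once from each param.

    param_grid: dict mapping param name -> list of candidate values.
    evaluate:   callable(params) -> (sensitivity, specificity).
    All names here are hypothetical, not CNVbenchmarkeR's API.
    """
    names = list(param_grid)
    base_sens, base_spec = evaluate(defaults)
    best_params, best_sens = dict(defaults), base_sens
    runs = 0
    for start in range(len(names)):
        # Rotate the param order so each param gets to go first once.
        order = names[start:] + names[:start]
        current, current_sens = dict(defaults), base_sens
        for name in order:
            for value in param_grid[name]:
                trial = {**current, name: value}
                sens, spec = evaluate(trial)
                runs += 1
                # Assumed rule: accept only if sensitivity improves and
                # specificity does not drop more than allowed.
                if sens > current_sens and spec >= base_spec - max_spec_drop:
                    current, current_sens = trial, sens
        if current_sens > best_sens:
            best_params, best_sens = current, current_sens
    return best_params, best_sens, runs


if __name__ == "__main__":
    # Toy scoring function standing in for a real CNV caller run.
    def toy_evaluate(p):
        return 0.5 + 0.1 * p["a"] + 0.05 * p["b"], 1.0 - 0.02 * p["a"] - 0.2 * p["b"]

    grid = {"a": [0, 1, 2], "b": [0, 1]}
    print(greedy_optimize(grid, {"a": 0, "b": 0}, toy_evaluate, max_spec_drop=0.05))
```

With n_params params and at most 22 values each, every restart performs at most n_params*22 evaluations, and there are n_params restarts, which gives the (n_params^2)*22 upper bound stated above.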

Prerequisites

An SGE cluster system has to be available.

How to use

Configure optimizers/optimizerParams.yaml by defining the optimizer params, the dataset and the algorithm to be optimized. Note: it is recommended to optimize over a random subset (training subset) of the original dataset. Performance can then be compared on the validation subset.
Execute the optimizer:

cd optimizers
Rscript optimizer.r optimizerParams.yaml