DeCiFer

DeCiFer is an algorithm that simultaneously selects mutation multiplicities and clusters SNVs by their corresponding descendant cell fractions (DCF), a statistic that quantifies the proportion of cells which acquired the SNV or whose ancestors acquired the SNV. DCF is related to the commonly used cancer cell fraction (CCF) but further accounts for SNVs which are lost due to deleterious somatic copy-number aberrations (CNAs), identifying clusters of SNVs which occur in the same phylogenetic branch of tumour evolution.

The algorithm and its application to published cancer datasets are described in full in:

Gryte Satas†, Simone Zaccaria†, Mohammed El-Kebir†,* and Ben Raphael*, 2021
† Joint First Authors
* Corresponding Authors

The results of the related paper are available at:

DeCiFer data

This repository includes detailed instructions for installation and requirements, demos and tutorials of DeCiFer, a list of current issues, and contacts. The repository is currently in a preliminary release, and improved versions are released frequently; during this stage, please keep checking for updates.

Contents

  1. Algorithm
  2. Installation
  3. Usage
  4. Development
  5. Contacts

<a name="algorithm"></a>

Algorithm

<img src="doc/decifer.png" width="500">

DeCiFer uses the Single Split Copy Number (SSCN) assumption and evolutionary constraints to enumerate potential genotype sets. This allows DeCiFer to exclude genotype sets with constant mutation multiplicity (CMM) that are not biologically likely (red crosses) and include additional genotype sets (green star) that are. DeCiFer simultaneously selects a genotype set for each SNV and clusters all SNVs based on a probabilistic model of DCFs, which summarize both the prevalence of the SNV and its evolutionary history.

<a name="installation"></a>

Installation

DeCiFer is mostly written in Python 2.7 and has an optional component in C++. The recommended installation is through conda but we also provide custom instructions to install DeCiFer in any Python environment.

Automatic installation

The recommended installation is through bioconda and requires conda, which can be obtained locally by installing one of the two most common freely available distributions: anaconda or miniconda. Please make sure to have executed the required channel setup for bioconda. After that, the following one-time, one-line command is sufficient to fully install DeCiFer within a virtual conda environment called decifer:

conda create -n decifer decifer -y -c bioconda

After such one-time installation, DeCiFer can be executed in every new session after activating the decifer environment as follows:

conda activate decifer

Manual installation

DeCiFer can also be installed in a conda environment directly from this repository. The following one-time commands create a virtual conda environment called decifer, activate it, and install DeCiFer into it:

git clone https://github.com/raphael-group/decifer.git && cd decifer/
conda create -c anaconda -c conda-forge -n decifer python=2.7 numpy scipy matplotlib-base pandas seaborn -y
conda activate decifer
pip install .

Custom installation

DeCiFer can be installed with pip by the command pip install . in any Python 2.7 environment with the following packages or compatible versions:

| Package | Tested version | Comments |
|---------|----------------|----------|
| numpy | 1.16.1 | Efficient scientific computations |
| scipy | 1.2.1 | Efficient mathematical functions and methods |
| pandas | 0.20.1 | Dataframe management |
| matplotlib | 2.0.2 | Basic plotting utilities |
| seaborn | 0.7.1 | Advanced plotting utilities |

Installation of C++ component

DeCiFer includes C++ code to enumerate state/genotype trees. The dependencies for this code are as follows.

| Package | Tested version | Comments |
|---------|----------------|----------|
| cmake | >= 2.8 | Build environment |
| lemon | 1.3.1 | C++ graph library |
| boost | >= 1.69.0 | C++ library for scientific computing |

To build this code, enter the following commands from the root of the repository:

mkdir build
cd build
# OPTIONAL: specify lemon and/or Boost paths if not detected automatically.
cmake ../src/decifer/cpp/ -DLIBLEMON_ROOT=/usr/local/ -DBOOST_ROOT=/scratch/software/boost_1_69_0/
make

<a name="usage"></a>

Usage

DeCiFer can be executed using the command decifer, whose manual describes the available parameters and argument options. See more details below.

  1. Required input data
  2. Optional input data
  3. Output
  4. System requirements
  5. Demos
  6. Recommendations and quality control

<a name="requireddata"></a>

Required input data

DeCiFer requires two input files:

  1. Input mutations with nucleotide counts and related copy numbers in a tab-separated (TSV) file with three header lines: (1) the first specifies the number of mutations; (2) the second specifies the number of samples; and (3) the third is equal to #sample_index sample_label character_index character_label ref var. Every subsequent row has the following values for every mutation in every sample:

     | Name | Description | Mandatory |
     |------|-------------|-----------|
     | Sample index | A unique number identifying the sample | Yes |
     | Sample label | A unique name for the sample | Yes |
     | Mutation index | A unique number identifying the mutation | Yes |
     | Mutation label | A unique name identifying the mutation | Yes |
     | REF | Number of reads with the reference allele for the mutation | Yes |
     | ALT | Number of reads with the alternate allele for the mutation | Yes |
     | Copy numbers and proportions | Tab-separated A B U, where A,B are the inferred allele-specific copy numbers for the segment harboring the mutation and U is the corresponding proportion of cells (normal and tumour) with those copy numbers. Groups of cells/clones with the same allele-specific copy numbers must be combined into a single proportion. | Yes |
     | Additional copy numbers | An arbitrary number of fields with the same format as Copy numbers and proportions, describing the proportions of cells with different copy numbers. Note that all proportions should always sum up to 1. | No |

  2. Input tumour purity in a two-column tab-separated file where every row SAMPLE-INDEX TUMOUR-PURITY defines the tumour purity TUMOUR-PURITY of the sample with index SAMPLE-INDEX.

For generating the input files for DeCiFer, please see the scripts directory for more information. Examples may be found in the data directory.
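The layout of the two required files can be sketched with a small reader. This is a minimal illustration and not part of DeCiFer: the function names `parse_mutations` and `parse_purity`, the header labels `#characters` and `#samples` in the example, and the tolerance used when checking that proportions sum to 1 are all assumptions made here for demonstration.

```python
import io
import re


def first_int(line):
    """Return the first integer found on a header line."""
    return int(re.search(r"\d+", line).group())


def parse_mutations(handle, tol=1e-3):
    """Parse a mutation file laid out as described above (hypothetical helper)."""
    num_mutations = first_int(handle.readline())
    num_samples = first_int(handle.readline())
    handle.readline()  # third header line holds the column names
    records = []
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\r\n").split("\t")
        # Six fixed fields, then one or more (A, B, U) triples.
        if len(fields) < 9 or (len(fields) - 6) % 3 != 0:
            raise ValueError("unexpected number of fields: %r" % line)
        states = [
            (int(fields[i]), int(fields[i + 1]), float(fields[i + 2]))
            for i in range(6, len(fields), 3)
        ]
        # All proportions for one mutation in one sample should sum to 1.
        if abs(sum(u for _, _, u in states) - 1.0) > tol:
            raise ValueError("proportions do not sum to 1: %r" % line)
        records.append({
            "sample_index": int(fields[0]),
            "sample_label": fields[1],
            "mutation_index": int(fields[2]),
            "mutation_label": fields[3],
            "ref": int(fields[4]),
            "alt": int(fields[5]),
            "copy_states": states,
        })
    return num_mutations, num_samples, records


def parse_purity(handle):
    """Parse the two-column purity file into {sample_index: purity}."""
    purity = {}
    for line in handle:
        if line.strip():
            index, value = line.split()
            purity[int(index)] = float(value)
    return purity


# Made-up example: two mutations in one sample; the second mutation sits in a
# segment where 40% of cells have copy numbers (1, 1) and 60% have (2, 1).
EXAMPLE = u"""2 #characters
1 #samples
#sample_index\tsample_label\tcharacter_index\tcharacter_label\tref\tvar
0\tS0\t0\tmutA\t80\t20\t1\t1\t1.0
0\tS0\t1\tmutB\t60\t40\t1\t1\t0.4\t2\t1\t0.6
"""

n_mut, n_samp, records = parse_mutations(io.StringIO(EXAMPLE))
purity = parse_purity(io.StringIO(u"0\t0.8\n"))
```

A reader like this can serve as a quick sanity check on hand-made input before running decifer; the files in the data directory remain the authoritative examples.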

<a name="optionaldata"></a>

Optional input data

DeCiFer can use the following additional and optional input data:

1. Data for fitting beta-binomial distributions to read count data

To use beta-binomial distributions to cluster mutations (default is binomial), pass the --betabinomial flag to decifer along with 2 additional arguments, --snpfile and --segfile, which are used to specify the locations of 2 files that contain information to parameterize the beta-binomial for each sample.

The file passed to DeCiFer via --snpfile contains information about the read counts of germline (not somatic) variants and has the following format:

| Field | Description |
|-------|-------------|
| SAMPLE | Name of a sample |
| CHR | Name of the chromosome |
| POS | Genomic position in CHR |
| REF_COUNT | Number of reads harboring reference allele in POS |
| ALT_COUNT | Number of reads harboring alternate allele in POS |

The file passed to DeCiFer via --segfile, which specifies the allele-specific copy number per segment, is the same as the best.seg.ucn file used by the vcf_2_decifer.py python script that generates the input files for DeCiFer. Please simply specify the location of this file.
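To illustrate the difference between the two read-count models mentioned above, the following sketch compares a binomial and a beta-binomial log-probability for the same read counts. It is not DeCiFer's implementation: the mean/precision parameterisation and the function names are assumptions chosen for illustration, and how DeCiFer derives its per-sample parameters from the --snpfile and --segfile inputs is not shown here.

```python
from math import lgamma, log, exp


def log_binom_coeff(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def binomial_logpmf(k, n, p):
    """Log-probability of k variant reads out of n under a binomial model."""
    return log_binom_coeff(n, k) + k * log(p) + (n - k) * log(1.0 - p)


def betabinomial_logpmf(k, n, mean, precision):
    """Log-probability under a beta-binomial model.

    Assumed parameterisation: alpha = mean * precision and
    beta = (1 - mean) * precision; smaller precision means more overdispersion.
    """
    alpha = mean * precision
    beta = (1.0 - mean) * precision
    return (log_binom_coeff(n, k)
            + log_beta(k + alpha, n - k + beta)
            - log_beta(alpha, beta))


# 80 variant reads out of 100 when the expected frequency is 0.5:
# the overdispersed model treats this as far less surprising.
binom_lp = binomial_logpmf(80, 100, 0.5)
betabinom_lp = betabinomial_logpmf(80, 100, 0.5, 20.0)
```

In this toy setting the beta-binomial assigns a much higher log-probability to the extreme count than the binomial does, which is the overdispersion that germline read counts are meant to help quantify.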

2. Custom state trees

Users may pass a file containing the set of all possible state trees for DeCiFer to evaluate. State trees have been pre-generated for the set of most common copy numbers; however, a dataset might contain a combination of copy numbers that has not been included. In this case, the user can run the command generatestatetrees to generate all the state trees needed for their dataset, for instance by following the instructions in the scripts directory. The script in this directory generates not only the input files for decifer, but also a file called cn_states.txt that lists all the unique CN states in the data. This file may be used with generatestatetrees as shown in the scripts directory under the section "Adressing the "Skipping mutation warning"".

<a name="output"></a>

Output

DeCiFer's main output file (ending with _output.tsv) is a single TSV file encoding a dataframe in which every row corresponds to an input mutation, with the following fields:

| Name | Description |
|------|-------------|
| mut_index | Unique identifier for a mutation |
| VAR_{SAMPLE} | Variant sequencing read count of the mutation for every sample with index {SAMPLE} |
| TOT_{SAMPLE} | Total sequencing read count of the mutation for every sample with index {SAMPLE} |
| cluster | Unique identifier of the inferred mutation cluster; cluster 1 is the truncal cluster, and the next p clusters (where p is the number of samples) are sample-specific clusters, or SNVs that are unique to one of the p samples |
| state_tree | Inferred state tree defined as a ->-separated edge list of genotypes |
| true_cluster_DCF{SAMPLE} | Inferred true cluster DCF of the mutation in every sample with index {SAMPLE}; when executed in CCF mode, DCF will be CCF instead; these values take the form cluster center;(lower cluster CI, upper cluster CI) |
| point_estimate_DCF{SAMPLE} | Point estimate of the mutation DCF in every sample with index {SAMPLE}; when executed in CCF mode, DCF will be CCF instead |
| cmm_CCF{SAMPLE} | Inferred CCF of the mutation under the previous CMM assumption in every sample with index {SAMPLE} |
| Explained | ;-separated list of all the clusters to which the mutation could be assigned |
| LHs | ;-separated list of the negative log-likelihoods of assigning the mutation to each cluster in Explained |
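The composite fields in this table can be unpacked with a few lines of Python. The sketch below is illustrative only: the helper names are hypothetical, and the exact spacing of the cluster center;(lower cluster CI, upper cluster CI) form is an assumption, so the pattern tolerates optional whitespace.

```python
import re

# Matches "center;(lower, upper)" with optional whitespace around the numbers.
_CLUSTER_DCF = re.compile(r"^\s*([^;]+);\s*\(([^,]+),([^)]+)\)\s*$")


def parse_cluster_dcf(value):
    """Split a true_cluster_DCF value into (center, lower CI, upper CI)."""
    match = _CLUSTER_DCF.match(value)
    if match is None:
        raise ValueError("unrecognised cluster DCF value: %r" % value)
    center, lower, upper = (float(group) for group in match.groups())
    return center, lower, upper


def pair_explained_with_lhs(explained, lhs):
    """Map each cluster in Explained to its negative log-likelihood in LHs."""
    clusters = explained.split(";")
    values = [float(item) for item in lhs.split(";")]
    if len(clusters) != len(values):
        raise ValueError("Explained and LHs have different lengths")
    return dict(zip(clusters, values))


# Made-up values in the documented shape.
center, lower, upper = parse_cluster_dcf("0.45;(0.40, 0.50)")
nll_by_cluster = pair_explained_with_lhs("1;3", "12.5;40.0")
```

Cluster identifiers are kept as strings here so that no assumption is made about how they are encoded in the file.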

For the true_cluster_DCF column, the CIs correspond to the 95% credible interval of the posterior distribution of the DCF cluster center (Eqn 8 in the manuscript and S23 in the supplement). These CIs have been corrected for multiple tests. Specifically, for each cluster, we find the lower CI by finding the X=[0.025/(number of hypothesis tests)] quantile, where the number of tests corresponds to (number of clusters)*(number of samples for patient). The same procedure is used for the upper CI, by finding the quantile that corresponds to 1-X.
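The correction described above amounts to simple arithmetic, sketched here as a small function. The function name and its `alpha` argument are hypothetical and only restate the formula in the text; this is not code taken from DeCiFer.

```python
def corrected_quantile_levels(num_clusters, num_samples, alpha=0.05):
    """Return the (lower, upper) quantile levels after multiple-test correction.

    The number of tests is (number of clusters) * (number of samples), and the
    lower level is (alpha / 2) divided by that number, as described above.
    """
    num_tests = num_clusters * num_samples
    lower = (alpha / 2.0) / num_tests
    return lower, 1.0 - lower


# Example: 5 clusters and 2 samples give 10 tests, so the interval spans the
# 0.0025 and 0.9975 quantiles of each posterior instead of 0.025 and 0.975.
lower_level, upper_level = corrected_quantile_levels(5, 2)
```

The more clusters and samples a patient has, the further into the tails these levels move, so the reported intervals widen accordingly.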

These cluster CIs may also be found in the output file ending in _cluster.CIs.tsv. This file contains the same information in a more condensed format, reporting only the lower and upper CIs for each cluster in each sample (in columns f_lb and f_ub, respectively). These values may be "NaN" if no mutations were assigned to that particular cluster.

The file ending in _model_selection.tsv shows how DeCiFer selected the best number of clusters K.

Lastly, the file ending in _Outliers_output.tsv contains SNVs that were flagged as outliers: the variant allele frequency (VAF) of the SNV was more than 1.5 (default) standard deviations away from the VAF of the assigned cluster center. Users may change this behavior via the --vafdevfilter option. This default behavior filters out noisy data or germline contamination that manifests as e.g. SNVs being assigned to the truncal cluster yet having very low DCF values in the point_estimate_DCF column of the output file.
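The outlier rule can be pictured with the following sketch. It rests on a loud assumption: this README does not state how the standard deviation is computed, so the example uses the binomial sampling standard deviation of the VAF at the cluster's expected frequency. The function name and arguments are hypothetical, and DeCiFer's actual filter may differ.

```python
from math import sqrt


def is_vaf_outlier(var_reads, total_reads, cluster_vaf, max_deviations=1.5):
    """Flag an SNV whose VAF is far from the VAF of its cluster center.

    ASSUMPTION: the standard deviation is that of a binomial proportion,
    sqrt(p * (1 - p) / n), with p the cluster VAF and n the total read count.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    observed_vaf = var_reads / float(total_reads)
    std_dev = sqrt(cluster_vaf * (1.0 - cluster_vaf) / total_reads)
    if std_dev == 0.0:
        # Degenerate cluster VAF of 0 or 1: any difference counts as an outlier.
        return observed_vaf != cluster_vaf
    return abs(observed_vaf - cluster_vaf) > max_deviations * std_dev


# With 100 reads and a cluster VAF of 0.5, the assumed standard deviation is
# 0.05, so the threshold is 0.075 on either side of 0.5.
low_vaf_flagged = is_vaf_outlier(5, 100, 0.5)
near_center_flagged = is_vaf_outlier(48, 100, 0.5)
```

The `max_deviations` argument plays the role that the text assigns to --vafdevfilter: raising it flags fewer SNVs, lowering it flags more.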

<a name="requirements"></a>

System requirements

DeCiFer is highly parallelized to make efficient the extensive computations needed to cluster thousands of mutations across multiple tumour samples from the same patient under a probabilistic model. We recommend executing DeCiFer on multi-core machines, as the running time scales down nearly proportionally with the number of parallel jobs, which can be specified with the argument -j. If this argument is not specified, DeCiFer will attempt to use all available CPUs; however, when using a computing cluster, we strongly recommend always specifying -j to match the number of requested CPUs and avoid competition for computing resources. Finally, note that the required memory also scales with the number of parallel processes; however, in all previous tests on thousands of mutations with a high number of parallel processes, DeCiFer never required more than 80GB of RAM. Please lower -j if memory is exceeded.

<a name="demos"></a>

Demos

Each demo is an exemplary, guided execution of DeCiFer. Each demo is simultaneously a guided description of the entire example and a BASH script that can be directly executed to run the complete demo from this repository. As such, the user can both read the guided description as a web page and run the same script to execute the demo. The following demos are currently available (more will be added soon):

| Demo | Description |
|------|-------------|
| A12 | Demo of DeCiFer basic command on prostate cancer patient A12 |

<a name="reccomendations"></a>

Recommendations and quality control

<a name="development"></a>

Development

DeCiFer is in active development; please report any issue or question, as this helps the development and improvement of DeCiFer. Known issues with the current version are reported below.

<a name="contacts"></a>

Contacts

DeCiFer has been developed and is actively maintained by three previous Postdoctoral Research Associates at Princeton University in the research group of Prof. Ben Raphael:

Additional active contributors to DeCiFer are: