Pipeline ITH

Description

A pipeline to study intratumor heterogeneity (ITH) with Canopy<sup>[1]</sup>.

Marathon

This pipeline was inspired by the Marathon pipeline<sup>[2]</sup> proposed by the authors of Canopy. Marathon describes a conceptual pipeline built on Falcon and Canopy; it is not a functional, automated pipeline. The goal of the ITH pipeline is to provide a working, automated one.

General overview

<img src="https://raw.githubusercontent.com/IARCbioinfo/marathon-wgs/master/images/pipeline_overview.png" style="width:70%; display:block; margin:auto;" />

Note

In this documentation, patient IDs are replaced with ##, and tumor IDs are replaced with T1 and T2.

Steps

Post-alignment

Germline calling

The output VCF file is then filtered to keep only PASS variants, using scripts/keep_pass.sh.
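
The PASS filtering can be illustrated with a minimal sketch. This is not the content of scripts/keep_pass.sh, which is not shown here; the function name `keep_pass` is hypothetical, and the sketch only assumes a standard tab-separated VCF in which FILTER is the 7th column.

```shell
# Hypothetical helper, not the actual scripts/keep_pass.sh.
# Reads a VCF on stdin and writes to stdout every header line (starting
# with "#") plus each record whose FILTER column (7th field) is PASS.
keep_pass() {
  awk -F'\t' '/^#/ || $7 == "PASS"'
}
```

It could be used as `keep_pass < calls.vcf > calls.PASS.vcf`, where both file names are placeholders.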

Somatic calling

The output VCF file is then filtered to keep only PASS variants, using scripts/keep_pass.sh.

Calling quality control

This step generates four charts:

<table> <thead> <tr> <th>Germline AF distribution</th> <th>Somatic Venn</th> </tr> </thead> <tbody> <tr> <td><img src="https://raw.githubusercontent.com/IARCbioinfo/marathon-wgs/master/images/Calling_quality_control_germline_AF.png" /></td> <td><img src="https://raw.githubusercontent.com/IARCbioinfo/marathon-wgs/master/images/Calling_quality_control_Venn.png" /></td> </tr> </tbody> <thead> <tr> <th>Somatic AF distributions and overlap</th> <th>Somatic / Germline overlap</th> </tr> </thead> <tbody> <tr> <td><img src="https://raw.githubusercontent.com/IARCbioinfo/marathon-wgs/master/images/Calling_quality_control_tumors_overlap.png" /></td> <td><img src="https://raw.githubusercontent.com/IARCbioinfo/marathon-wgs/master/images/Calling_quality_control_tumors_normal_overlap.png" /></td> </tr> </tbody> </table>

VCF normalization and annotation

The VCF can be normalized into a format compatible with Annovar.

Normalization
Annotation

(In this pipeline, Annovar was run on the somatic VCFs only.)

Tumor coverage

For each patient, somatic calling on tumor1 and tumor2 gives two variant lists, each with its own positions and coverage. Germline calling also gives a variant list with its own positions and coverage.

We need the coverage of these positions in the other samples. For example, we need the coverage in tumor2 at the positions of the tumor1 somatic variants. Conversely, we need the coverage in tumor1 at the positions of the tumor2 somatic variants.

We also need the tumor1 and tumor2 coverage at the positions of the germline variants.
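
As an illustration of the first half of this step, the sketch below turns the variant positions of one VCF into a BED file that can then be used to query coverage in another sample. The function name `vcf_positions_to_bed` is hypothetical and is not part of the pipeline; the tool the pipeline actually uses to compute coverage is not stated here.

```shell
# Hypothetical helper: reads a VCF on stdin and prints one BED interval per
# variant record. VCF positions are 1-based while BED starts are 0-based,
# so a variant at POS becomes the interval [POS-1, POS).
vcf_positions_to_bed() {
  awk -F'\t' 'BEGIN { OFS = "\t" } !/^#/ && NF >= 2 { print $1, $2 - 1, $2 }'
}
```

For example, `vcf_positions_to_bed < tumor1.somatic.PASS.vcf > tumor1_positions.bed` (placeholder file names) would produce a position list that a depth tool could read when run on the tumor2 BAM, for instance `samtools depth` with its `-b` option.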

<img src="https://raw.githubusercontent.com/IARCbioinfo/marathon-wgs/master/images/coverage.png" />
Tumor coverage at the other tumor positions
Tumor coverage at the germline positions

Somatic allele-specific copy numbers profiling

To get the copy numbers

The script is parallelized by chromosome.
To split the germline VCF by chromosome, use scripts/split_vcf_chromosome.sh.
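
A minimal sketch of such a split is given below. It is not the content of scripts/split_vcf_chromosome.sh; the function name and its three arguments are hypothetical, and the output naming (`prefix.CHROM.vcf`) is an assumption modeled on the file name used in the Falcon command of this section.

```shell
# Hypothetical helper, not the actual scripts/split_vcf_chromosome.sh.
# Usage: split_vcf_by_chromosome input.vcf output_dir prefix
# Writes one file per chromosome, named output_dir/prefix.CHROM.vcf,
# each starting with a full copy of the VCF header.
split_vcf_by_chromosome() {
  vcf="$1"; outdir="$2"; prefix="$3"
  mkdir -p "$outdir"
  awk -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
    # Collect header lines so they can be copied into every output file.
    /^#/ { header = header $0 "\n"; next }
    {
      out = outdir "/" prefix "." $1 ".vcf"
      # First record seen for this chromosome: write the header first.
      if (!(out in seen)) { printf "%s", header > out; seen[out] = 1 }
      print > out
    }
  ' "$vcf"
}
```

This keeps one output file open per chromosome, which is fine for a standard set of chromosomes but may hit open-file limits on references with many extra contigs.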

Rscript falcon.R /path/to/germline_VCF/splitted_by_chromosomes/sample.GERMLINE.chrY.vcf patient1 normal_sample_id tumor1_sample_id tumor2_sample_id Y /path/to/output/dir /path/to/marathon/libs/falcon.output.R /path/to/marathon/libs/falcon.qc.R
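
Since the script is run once per chromosome, the per-chromosome commands can be generated in a loop. The sketch below is a dry run that only prints the commands; the function name, the chromosome list (1 to 22, X, Y) and the `sample.GERMLINE.chr` file-name pattern are assumptions taken from the example command above, not from the pipeline code.

```shell
# Hypothetical dry-run helper: prints one falcon.R command per chromosome
# instead of executing it. Argument order follows the example command.
# Usage: print_falcon_commands vcf_dir patient normal tumor1 tumor2 out_dir libs_dir
print_falcon_commands() {
  vcf_dir="$1"; patient="$2"; normal="$3"; tumor1="$4"; tumor2="$5"
  out_dir="$6"; libs_dir="$7"
  # Assumed chromosome set: autosomes 1-22 plus X and Y.
  for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do
    printf 'Rscript falcon.R %s %s %s %s %s %s %s %s %s\n' \
      "$vcf_dir/sample.GERMLINE.chr$chr.vcf" "$patient" "$normal" \
      "$tumor1" "$tumor2" "$chr" "$out_dir" \
      "$libs_dir/falcon.output.R" "$libs_dir/falcon.qc.R"
  done
}
```

Each printed line could then be submitted as a separate job to whatever scheduler is available.
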
To get the copy numbers in the other tumor's regions with variations
Notes

Note that these Falcon scripts use custom libraries stored in marathon/libs/.
Some of these libraries are simple overrides of Falcon with minor modifications.

Heterogeneity characterization and tree generation

SNA pre-clustering and Markov chain Monte Carlo sampling

This step computes all input matrices required by Canopy, performs SNA pre-clustering, and then runs MCMC sampling to infer the subclones along with their composition and history.

This step is parallelized over the number of subclones.

Tree generation
Events filtering

patient_id = args[1]
data_path = args[2]
file_name = args[3]
input_somatic_VCF_t1 = args[4]
input_somatic_VCF_t2 = args[5]
only_exonic = args[6]

Notes

Note that these Canopy scripts use custom libraries stored in marathon/libs/.
Some of these libraries are simple overrides of Canopy with minor modifications.

References

[1] Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing
Jiang, Y., Qiu, Y., Minn, A.J. and Zhang, N.R., 2016.
Proceedings of the National Academy of Sciences.
http://www.pnas.org/content/pnas/113/37/E5528.full.pdf

[2] Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny
Urrutia, E., Chen, H., Zhou, Z., Zhang, N.R. and Jiang, Y., 2018.
Bioinformatics, bty057.
https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty057/4838234?redirectedFrom=fulltext