Awesome
TCGAmutations - An R data package for TCGA somatic mutations
Introduction
TCGAmutations
is an R data package containing somatic mutations from CCLE and TCGA cohorts. This is particularly useful for those working with mutation data from TCGA studies - where most of the time is spent on searching various databases, downloading, compiling and tidying up the data before even the actual analysis is started. This package tries to mitigate the issue by providing pre-compiled, curated somatic mutations from 33 TCGA cohorts and 2427 cell line profiles from CCLE - along with relevant clinical information for all sequenced samples.
Installation
BiocManager::install("PoisonAlien/TCGAmutations")
Usage
There are only two commands
tcga_available()
- Lists the available cohorts in the packagetcga_load()
- Takes a cohort name and returns the corresponding MAF object
There are sources from which MAF files were compiled:
Cohorts
tcga_available()
function lists the available cohorts along with the source and sample size.
> tcga_available()
Study_Abbreviation source n_samples
<char> <char> <char>
1: ACC MC3|Firehose|CCLE 92|62|NA
2: BLCA MC3|Firehose|CCLE 411|395|NA
3: BRCA MC3|Firehose|CCLE 1026|978|NA
4: CESC MC3|Firehose|CCLE 291|194|NA
5: CHOL MC3|Firehose|CCLE 36|35|NA
6: COAD MC3|Firehose|CCLE 406|367|NA
7: DLBC MC3|Firehose|CCLE 37|48|NA
8: ESCA MC3|Firehose|CCLE 185|185|NA
9: GBM MC3|Firehose|CCLE 400|283|NA
10: HNSC MC3|Firehose|CCLE 509|511|NA
11: KICH MC3|Firehose|CCLE 66|66|NA
12: KIRC MC3|Firehose|CCLE 370|476|NA
13: KIRP MC3|Firehose|CCLE 282|282|NA
14: LAML MC3|Firehose|CCLE 140|193|NA
15: LGG MC3|Firehose|CCLE 525|516|NA
16: LIHC MC3|Firehose|CCLE 365|373|NA
17: LUAD MC3|Firehose|CCLE 517|533|NA
18: LUSC MC3|Firehose|CCLE 485|178|NA
19: MESO MC3|Firehose|CCLE 82|NA|NA
20: OV MC3|Firehose|CCLE 411|466|NA
21: PAAD MC3|Firehose|CCLE 178|126|NA
22: PCPG MC3|Firehose|CCLE 184|179|NA
23: PRAD MC3|Firehose|CCLE 498|498|NA
24: READ MC3|Firehose|CCLE 150|122|NA
25: SARC MC3|Firehose|CCLE 239|247|NA
26: SKCM MC3|Firehose|CCLE 468|290|NA
27: STAD MC3|Firehose|CCLE 439|393|NA
28: TGCT MC3|Firehose|CCLE 134|147|NA
29: THCA MC3|Firehose|CCLE 500|496|NA
30: THYM MC3|Firehose|CCLE 123|120|NA
31: UCEC MC3|Firehose|CCLE 531|248|NA
32: UCS MC3|Firehose|CCLE 57|57|NA
33: UVM MC3|Firehose|CCLE 80|80|NA
34: CCLE_2024Q2 MC3|Firehose|CCLE NA|NA|2427
Study_Abbreviation source n_samples
TCGA cohorts
MC3
> luad <- TCGAmutations::tcga_load(study = "LUAD")
Loading LUAD. Please cite: https://doi.org/10.1016/j.cels.2018.03.002 for reference
> luad
An object of class MAF
ID summary Mean Median
<char> <char> <num> <num>
1: NCBI_Build GRCh37 NA NA
2: Center . NA NA
3: Samples 517 NA NA
4: nGenes 17130 NA NA
5: Frame_Shift_Del 4021 7.778 5
6: Frame_Shift_Ins 1185 2.292 1
7: In_Frame_Del 388 0.750 0
8: In_Frame_Ins 37 0.072 0
9: Missense_Mutation 133671 258.551 177
10: Nonsense_Mutation 11074 21.420 13
11: Nonstop_Mutation 179 0.346 0
12: Splice_Site 4469 8.644 5
13: Translation_Start_Site 225 0.435 0
14: total 155249 300.288 202
Clinical data for MC3 are obtained from harmonized clinical data resource. Thanks to @mitchellcheung8 for pointing to the reference and the data source.
Recommendations for survival analysis (as suggested by the publication)
Recommended use of the endpoints: For clinical outcome endpoints, we recommend the use of
PFI
for progression-free interval, andOS
for overall survival. Both endpoints are relatively accurate. Given the relatively short follow-up time,PFI
is preferred overOS
. Detailed recommendations please refer to Table 3 in the accompanying paper.
Below are the column names for the event and the timepoint.
endpoint | event column name | timepoint column name |
---|---|---|
PFI (Progression-free interval) | CDR_PFI | CDR_PFI.time |
OS (Overall survival) | CDR_OS | CDR_OS.time |
DSS (Disease-specific survival) | CDR_DSS | CDR_DSS.time |
DFI (Disease-free interval) | CDR_DFI | CDR_DFI.time |
example usage for survival:
#OS
maftools::mafSurvival(maf = brca, genes = c("TP53"), time = "CDR_OS.time", Status = "CDR_OS")
#PFI
maftools::mafSurvival(maf = brca, genes = c("TP53"), time = "CDR_PFI.time", Status = "CDR_PFI")
Firehose
Change source
argument to Firehose
for MAF files from Broad Firehose
WARNING: Use Firehose data at your own risk. MAF data has not been updated in a long time. It is strongly suggested to use the default MC3
cohort
> TCGAmutations::tcga_load(study = "LUAD", source = "Firehose")
Loading LUAD. Please cite: dx.doi.org/10.7908/C17P8XT3 for reference
An object of class MAF
ID summary Mean Median
1: NCBI_Build 37 NA NA
2: Center broad.mit.edu NA NA
3: Samples 533 NA NA
4: nGenes 16515 NA NA
5: Frame_Shift_Del 4018 7.538 5
6: Frame_Shift_Ins 1409 2.644 2
7: In_Frame_Del 526 0.987 1
8: In_Frame_Ins 74 0.139 0
9: Missense_Mutation 119156 223.557 157
10: Nonsense_Mutation 9521 17.863 12
11: Nonstop_Mutation 157 0.295 0
12: Splice_Site 7675 14.400 9
13: total 142536 267.422 187
Returned MAF objects can be passed to any functions from maftools for visualization and analysis.
CCLE
All the somatic point mutations and indels called in the DepMap cell lines.
Note that this object contains data from DepMap 24Q2 Public relase. Data was kindly formatted into MAF file and made available by DepMap project. See below References for proper citation.
> ccle = tcga_load(study = "CCLE_2024Q2", source = "CCLE")
Loading CCLE_2024Q2. Please cite: https://doi.org/10.25452/figshare.plus.25880521.v1 for reference
> ccle
An object of class MAF
Index: <ID>
ID summary Mean Median
<char> <char> <num> <num>
1: NCBI_Build GRCh38 NA NA
2: Center <NA> NA NA
3: Samples 2427 NA NA
4: nGenes 19605 NA NA
5: Frame_Shift_Del 52762 21.740 6
6: Frame_Shift_Ins 18349 7.560 2
7: In_Frame_Del 6250 2.575 2
8: In_Frame_Ins 1973 0.813 0
9: Missense_Mutation 776466 319.928 172
10: Nonsense_Mutation 51593 21.258 10
11: Nonstop_Mutation 1159 0.478 0
12: Splice_Site 30655 12.631 7
13: Translation_Start_Site 1921 0.792 0
14: total 941128 387.774 202
Above MAF includes 1788 unqiue cell lines (from 2427 profiles) spanning 87 primary diseases. See ccle@clinical.data
to learn more.
Below are some helpful subset commands:
#Get all AML cell lines
maftools::subsetMaf(maf = ccle, clinQuery = "DepmapModelType == 'AML'")
#Get all cell lines of cervical origin
maftools::subsetMaf(maf = ccle, clinQuery = "OncotreeLineage == 'Cervix'")
#Get HELA
maftools::subsetMaf(maf = ccle, clinQuery = "StrippedCellLineName == 'HELA'")
#Get cell lines with WGS
maftools::subsetMaf(maf = ccle, clinQuery = "Datatype == 'wgs'")
FAQ
Q:How did I compile the data?
**A:**See compile_MC3.R and compile_CCLE.R for the details.
Q: Are there any non-TCGA/external cohorts
**A:**Please open an issue if you have any particular publication in mind that you want me to include in the package.
References
For maftools
Maftools: efficient and comprehensive analysis of somatic variants in cancer. Mayakonda A, Lin DC, et. al. Genome Research
For MC3 cohort
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Kyle Ellrott, Matthew H. Bailey, Gordon Saksena, et. al. Cell Syst
For clinical data resource
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics.Liu, Jianfang et al. Cell
For CCLE
Please cite the below figshare for the data:
_DepMap, Broad (2024). DepMap 24Q2 Public. Figshare+. Dataset. https://doi.org/10.25452/figshare.plus.25880521.v1_
Please cite the below If you’d like to cite The DepMap project:
Tsherniak A, Vazquez F, Montgomery PG, Weir BA, Kryukov G, Cowley GS, Gill S, Harrington WF, Pantel S, Krill-Burger JM, Meyers RM, Ali L, Goodale A, Lee Y, Jiang G, Hsiao J, Gerath WFJ, Howell S, Merkel E, Ghandi M, Garraway LA, Root DE, Golub TR, Boehm JS, Hahn WC. Defining a Cancer Dependency Map. Cell. 2017 Jul 27;170(3):564-576.