Awesome TCGA
Curated list of awesome resources to access data from The Cancer Genome Atlas (TCGA) project, with a particular focus on computational tools allowing pan-TCGA analysis and/or giving access to the results of such tools.
Official links
General information
- TCGA project homepage
- Infographic summary of the project
- NCI TCGA Wiki - General help about the TCGA project. One page you may visit often is the TCGA barcode description.
- Data documentation - Describes how the data is generated, in particular the details of the bioinformatics pipeline used.
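Since the TCGA barcode comes up so often, here is a minimal Python sketch of how a full aliquot barcode can be split into its parts. It assumes the commonly documented layout (project, tissue source site, participant, sample type plus vial, portion plus analyte, plate, center) and the usual sample type ranges (01-09 tumor, 10-19 normal, 20-29 control). The names `TcgaBarcode`, `parse_aliquot_barcode` and `sample_class` are illustrative and do not come from any official tool; check the barcode description page for edge cases such as shorter barcodes.

```python
import re
from typing import NamedTuple


class TcgaBarcode(NamedTuple):
    """Fields of a full TCGA aliquot barcode (illustrative container)."""
    project: str
    tss: str          # tissue source site
    participant: str
    sample_type: str  # two-digit code, e.g. "01"
    vial: str
    portion: str
    analyte: str
    plate: str
    center: str


# Assumed layout: TCGA-TS-PART-SSV-PPA-PLAT-CC (full aliquot barcodes only).
_BARCODE_RE = re.compile(
    r"^(?P<project>TCGA)-(?P<tss>[A-Z0-9]{2})-(?P<participant>[A-Z0-9]{4})"
    r"-(?P<sample_type>\d{2})(?P<vial>[A-Z])"
    r"-(?P<portion>\d{2})(?P<analyte>[A-Z])"
    r"-(?P<plate>[A-Z0-9]{4})-(?P<center>\d{2})$"
)


def parse_aliquot_barcode(barcode: str) -> TcgaBarcode:
    """Split a full aliquot barcode into named fields, or raise ValueError."""
    match = _BARCODE_RE.match(barcode.strip().upper())
    if match is None:
        raise ValueError(f"not a full TCGA aliquot barcode: {barcode!r}")
    return TcgaBarcode(**match.groupdict())


def sample_class(sample_type: str) -> str:
    """Map a two-digit sample type code to a broad class."""
    code = int(sample_type)
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


if __name__ == "__main__":
    parsed = parse_aliquot_barcode("TCGA-02-0001-01C-01D-0182-01")
    print(parsed.participant, sample_class(parsed.sample_type))
```

The first three fields (`TCGA-02-0001` in the example) identify the participant, which is usually the key needed to join clinical and molecular tables.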
Data repositories
- Genomic Data Commons (GDC) data portal - The old TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/tcga/) is no longer operational and all TCGA data now resides at the GDC. Note that the GDC hosts other datasets besides TCGA.
- GDC homepage
- GDC data documentation
- GDC data release notes
- GDC legacy archive - The legacy data is the original data that uses the old genome build (hg19) as produced by the original submitter. The legacy data is not actively being updated in any way. Users should migrate to the harmonized data.
- List of cohorts with sample sizes - Shortcut to the GDC data portal with the list of all cancer sites with the number of cases and the number of available cases per data category.
Downloading the data
List of command line tools, APIs and R packages to download the data.
Official tools
- GDC Data Transfer Tool - Official command line tool, see here for a nice tutorial.
- GDC API - Official HTTP API. Note the BAM slicing feature, which can be quite useful.
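To give a feel for the HTTP API, here is a minimal Python sketch that builds a query for the `files` endpoint and, optionally, sends it. The endpoint URL, the nested `op`/`content` filter syntax and the field names (`cases.project.project_id`, `files.data_category`) follow the GDC API documentation as best recalled and should be checked against the current docs; the helper names `build_files_query` and `fetch_files` are made up for this example. Only open-access metadata is queried, so no token is needed.

```python
import json
from urllib import parse, request

# Assumed public endpoint of the GDC API for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_category, size=10):
    """Build the query parameters for a filtered search on the files endpoint."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "files.data_category",
                         "value": [data_category]}},
        ],
    }
    return {
        "filters": json.dumps(filters),   # filters are sent as a JSON string
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }


def fetch_files(params, timeout=30):
    """Send the query (network access required) and return the list of hits."""
    url = GDC_FILES_ENDPOINT + "?" + parse.urlencode(params)
    with request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["data"]["hits"]


if __name__ == "__main__":
    query = build_files_query("TCGA-BRCA", "Transcriptome Profiling", size=5)
    for hit in fetch_files(query):
        print(hit["file_id"], hit["file_name"])
```

The returned `file_id` values are the UUIDs that the data transfer tool and the download endpoints expect.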
Broad Institute GDAC
The Broad TCGA Data and Analyses (Broad GDAC) Firehose provides TCGA Level 3 data and Level 4 analyses packaged in a form amenable to immediate algorithmic analysis. This is a useful resource to access analysis results not produced by the GDC (e.g. MutSig2CV, correlation with clinical variables, mRNA clustering etc.). Firehose systematically and automatically runs the software usually seen in TCGA publications. However, the data is currently based on the old hg19 TCGA data for somatic variant calling.
- Firehose - Refers to the computational infrastructure.
- Python and UNIX wrappers
- FirebrowseR - An R package to download the results of the analyses performed by Firehose directly into R.
- Firebrowse - A web UI to visualise the results of the analyses performed by Firehose.
Others
The GDC hosts a list of such tools: https://gdc.cancer.gov/access-data/gdc-community-tools.
- TCGAbiolinks - An R/Bioconductor package to search, download and prepare relevant data for analysis in R. Very powerful and well documented.
- GDC Spreadsheet Download Tool - Tool to download clinical and/or biospecimen metadata for a given set of files in a tab-delimited format.
- GenomicDataCommons - An R/Bioconductor package for querying, accessing, and mining genomic datasets available from the GDC.
- gdctools - Broad Institute Python and UNIX CLI utilities to simplify search and retrieval of open-access data from the GDC.
Cloud computing
List of cloud computing facilities hosting the TCGA data.
- Cancer Genomics Cloud - Developed by Seven Bridges Genomics. They have a blog with useful case studies.
- ISB Cancer Genomics Cloud - Developed by the Institute for Systems Biology in Seattle.
- FireCloud - Developed by the Broad Institute.
Pan-TCGA analyses
List of analyses performed in a consistent manner on all (or at least several) TCGA datasets, where the results are freely available.
- Firehose - See above for the associated tools to download the data. They run many software tools on all TCGA cohorts and make the results available.
- Tumor Fusion Gene Data Portal - 9,966 tumor samples from 33 TCGA cancer types and 689 normal samples from 19 TCGA normal tissue types were analyzed with the PRADA pipeline, using realigned BAM files of the RNA-seq data.
- DriverDBv2 - WES and RNA-seq reanalysis to identify driver genes. Provides a nice graphical summary of mutation clustering in genes (e.g. for TP53).
- ChimerDB - A comprehensive database of fusion genes encompassing analysis of deep sequencing data (including TCGA) and manual curations.
- ASCAT Ploidy and Purity Estimates - COSMIC hosts a tab-separated table listing the ploidy and aberrant cell fraction (purity estimate) for TCGA samples re-analysed using ASCAT.
- BioXpress - RNA-seq-derived gene expression database, including TCGA among others.
Publications
- Official publication list
- Publication from Seven Bridges Genomics using their cloud solution.
- TCGAbiolinks - Paper describing the R TCGAbiolinks package.
- FirebrowseR - Paper describing the R FirebrowseR package.
- GenomicDataCommons - Paper describing the R GenomicDataCommons package.
- DriverDBv2
- ChimerDB - A new paper is in press for v3.0 according to rcsb.ewha.ac.kr/fusiongene.
- BioXpress