Stochastic GOEA Simulations

Stochastic simulations of multitudes of Gene Ontology Enrichment Analyses (GOEAs)
are used to generate simulated values of FDR, sensitivity, and specificity for GOEAs run using GOATOOLS.
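As a reminder of what the three reported values measure, here is a minimal sketch that computes them for a single simulated run. The function name `goea_metrics` and its arguments are hypothetical and not part of this repo or of GOATOOLS. Returning 0.0 when a denominator is zero is an assumed convention.

```python
def _ratio(num, den):
    """Return num/den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def goea_metrics(all_items, true_items, called_items):
    """Return (fdr, sensitivity, specificity) for one simulated run.

    all_items:    every item that was tested
    true_items:   items that are 'actually significant' in the simulation
    called_items: items the analysis reported as significant
    """
    all_items = set(all_items)
    true_items = set(true_items)
    called_items = set(called_items)
    tp = len(called_items & true_items)               # true positives
    fp = len(called_items - true_items)               # false positives
    fn = len(true_items - called_items)               # false negatives (misses)
    tn = len(all_items - true_items - called_items)   # true negatives
    fdr = _ratio(fp, tp + fp)          # fraction of calls that are wrong
    sensitivity = _ratio(tp, tp + fn)  # fraction of true items recovered
    specificity = _ratio(tn, tn + fp)  # fraction of null items left uncalled
    return fdr, sensitivity, specificity
```

For example, with 10 tested items, 4 truly significant, and 4 called of which 3 are correct, this gives an FDR of 0.25 and a sensitivity of 0.75.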

This repo also contains stochastic simulations showing the FDR, sensitivity, and specificity of multiple-test correction methods, including FDR Benjamini/Hochberg (non-negative) and Bonferroni one-step correction. These simulations were used to design the overall simulation strategy and to find an effective figure layout for displaying multiple sets of information at once.
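For reference, the two correction methods named above can be sketched in plain Python. This is a textbook-style illustration, not the code used in this repo. The function names are hypothetical, and each function returns one reject/keep decision per p-value in the input order.

```python
def bonferroni(pvals, alpha=0.05):
    """One-step Bonferroni: reject when p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini/Hochberg step-up procedure for independent or
    positively correlated tests."""
    m = len(pvals)
    # Indices of the p-values, sorted from smallest to largest p-value
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses holding the max_k smallest p-values
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject
```

Bonferroni controls the family-wise error rate and is therefore the stricter of the two. Benjamini/Hochberg controls the expected proportion of false discoveries and rejects at least as many hypotheses at the same alpha.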

Conclusions from Stochastic GOEA Simulations

  1. GO terms associated with huge numbers of genes (thousands, in human) cause FDR failures
  2. Removing even just 30 of the 17,000+ (human) GO terms that are highly annotated yields passing FDRs
  3. A study size of 4 genes in a GOEA will likely return an unacceptable number of misses (false negatives)
  4. As study size increases, sensitivity improves (i.e., fewer false negatives)
  5. As the percentage of 'actually significant genes' rises in the study set, so does sensitivity
  6. Using a version of propagate counts greatly improves sensitivity
  7. Remove selected highly annotated GO terms prior to running a GOEA using these criteria:
    • Highly annotated (e.g., top 1%). Example in human: remove GO terms associated with thousands of genes
    • Low depth (near the top of the GO hierarchy)
    • High descendant count
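The first criterion in the recommendation above can be sketched as follows, assuming the associations are held as a plain dict mapping each gene ID to a set of GO IDs. The function names and the `max_genes` parameter are hypothetical. The depth and descendant-count criteria need the GO DAG and are not covered here.

```python
from collections import defaultdict


def genes_per_go(assoc):
    """Invert gene -> {GO IDs} into GO ID -> {genes}."""
    go2genes = defaultdict(set)
    for gene, gos in assoc.items():
        for go_id in gos:
            go2genes[go_id].add(gene)
    return go2genes


def remove_broad_gos(assoc, max_genes=1000):
    """Drop GO terms annotated to more than max_genes genes.

    Returns (filtered associations, set of removed GO IDs).
    Genes left with no GO terms are kept with an empty set.
    """
    go2genes = genes_per_go(assoc)
    broad = {go_id for go_id, genes in go2genes.items() if len(genes) > max_genes}
    filtered = {gene: set(gos) - broad for gene, gos in assoc.items()}
    return filtered, broad
```

The filtered dict could then be passed to a GOEA in place of the original associations. Whether to drop genes that end up with no annotations is a separate choice this sketch leaves open.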

To Cite

Please cite the following paper if you mention the stochastic simulations in this repo in your research:

GOATOOLS: A Python library for Gene Ontology analyses
Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H
2018 | Scientific Reports | PMID:30022098 | DOI:10.1038/s41598-018-28948-z

Details

Recreating the stochastic simulations

To recreate all five of our stochastic GOEA simulation plots (100,000 stochastic simulations in total) featured in the GOATOOLS manuscript and supplemental data, clone the repository, https://github.com/dvklopfenstein/goatools_simulation, and run this make target from the command line:

  $ make run_ms

Manuscript Figures

Results for 40,000 GOATOOLS GOEA stochastic simulations (20,000 simulations for each panel) with varying sensitivity and consistently high specificity. GOEAs performed well on study groups of 8+ genes if the GOATOOLS GOEA option propagate_counts is set to True.

fig3
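To illustrate the general idea behind propagating counts, the toy sketch below adds every ancestor of an annotated GO term to the gene's annotations. It is not the GOATOOLS implementation. The `parents` mapping (GO ID to its direct parent IDs) and both function names are assumptions made for this example.

```python
def ancestors(go_id, parents):
    """All terms reachable by following parent links upward from go_id."""
    seen = set()
    stack = list(parents.get(go_id, ()))
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def propagate(assoc, parents):
    """Return gene -> {GO IDs} with each term's ancestors added."""
    propagated = {}
    for gene, gos in assoc.items():
        full = set(gos)
        for go_id in gos:
            full |= ancestors(go_id, parents)
        propagated[gene] = full
    return propagated
```

After propagation, a parent term is counted for every gene annotated to any of its descendants, so small study groups share more terms.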

Supplemental Figures

Supplemental Figure 1) Initial failing simulations

The first GOATOOLS GOEA simulations fail in panels A3 and A4, with FDR values exceeding the alpha of 0.05 set by the researcher. The values of failing FDRs are shown in red text. The source of the failures was false positives for GO terms annotated with large numbers of gene products. For mouse annotations in the biological_process branch, GO terms annotated with 1,000 or more genes were the source of failures.

suppfig1

Supplemental Figure 2) Enriched-only viewed

GOATOOLS GOEA stress tests with randomly shuffled associations nearly pass if only enriched GO terms are viewed. The associations are randomly shuffled while still maintaining the distribution of the number of GO terms per gene. The failing FDRs (above 0.05) are seen in panels A2 and A3 for gene groups having 96, 112, or 124 genes.

suppfig2
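One simple way to shuffle associations while keeping the distribution of GO terms per gene is to permute whole GO-term sets among the genes. The sketch below shows that scheme. It is an assumption about how such a shuffle could be done, not necessarily the procedure used for this figure, and `shuffle_associations` is a hypothetical name.

```python
import random


def shuffle_associations(assoc, seed=None):
    """Randomly reassign each gene's GO-term set to another gene.

    Every original set is handed out exactly once, so the multiset of
    'number of GO terms per gene' is unchanged.
    """
    rng = random.Random(seed)  # seeded generator for reproducible runs
    genes = list(assoc)
    go_sets = [set(assoc[gene]) for gene in genes]
    rng.shuffle(go_sets)
    return dict(zip(genes, go_sets))
```

Because sets are moved as units, any real link between a gene and its GO terms is broken, so GO terms reported as significant on shuffled data can be treated as false positives.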

Supplemental Figure 3) 30 Broad GO terms removed

GOATOOLS GOEA stress tests with randomly shuffled associations pass in all cases if only 30 of the 17k+ GO terms, those associated with more than 1,000 genes, are removed. The median number of genes per GO term in the mouse associations is 3. The number of genes per GO term ranges from 1 to ~7k (mean=16 genes/GO, SD=128).

suppfig3
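Summary statistics like the ones quoted for genes per GO term can be computed from a gene-to-GO mapping with the standard library. The sketch below assumes a dict of gene ID to a collection of GO IDs. The function name is hypothetical, and the population standard deviation (`pstdev`) is an assumption, since the text does not say which SD was used.

```python
from collections import Counter
from statistics import mean, median, pstdev


def genes_per_go_summary(assoc):
    """Summarize how many genes are annotated to each GO term."""
    # Count each GO term once per gene, even if listed more than once
    counts = Counter(go_id for gos in assoc.values() for go_id in set(gos))
    values = list(counts.values())
    return {
        "n_gos": len(values),
        "min": min(values),
        "max": max(values),
        "median": median(values),
        "mean": mean(values),
        "sd": pstdev(values),  # population SD (assumed)
    }
```

A median far below the mean, as reported above, indicates a heavily right-skewed distribution in which a few GO terms carry very large gene counts.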

Copyright (C) 2016-present, DV Klopfenstein, Haibao Tang. All rights reserved.