What is ORA and why do we need multiple-testing correction?
Over-representation analysis (ORA) is a statistical framework for asking whether a predefined set of genes (such as a pathway, GO term, or disease signature) contains more genes from your input list than expected by chance. It takes your selected gene set (in pathXcite this refers to the literature-derived genes you choose for analysis) and compares its overlap with each term in a database against a random expectation based on the background universe.
Concretely, ORA builds a 2×2 contingency table for each term and uses the hypergeometric test (equivalent to a one-sided Fisher's exact test) to calculate a p-value for enrichment. Terms with significantly more overlap than expected suggest biological processes or pathways that may be relevant to your research topic.
Because ORA typically tests hundreds or thousands of terms at once, the probability of obtaining some low p-values purely by chance becomes high, a problem known as the multiple testing problem. To control false positives and make results interpretable, p-values must be adjusted using correction methods such as Benjamini-Hochberg (FDR control) or Bonferroni/Holm (FWER control). The choice of correction affects the balance between discovery power and error control, and should reflect the structure of your data and your tolerance for false positives.
Quick chooser: what should I use?
| Situation | Test | Correction | Why |
|---|---|---|---|
| Standard ORA on many terms; terms assumed independent or mildly correlated | Hypergeometric (one-sided enrichment) | Benjamini-Hochberg (BH) | Controls FDR under independence/positive dependence; good power |
| Terms may be arbitrarily dependent (nested/overlapping, GO graph) | Hypergeometric | Benjamini-Yekutieli (BY) | FDR control under arbitrary dependence; conservative |
| Very small number of tests (≤ 10) or you must avoid any false positives | Hypergeometric | Holm (step-down) or Bonferroni | Controls FWER; Holm is never less powerful than Bonferroni and is usually more powerful |
| Want a bit more power than BH when a sizeable share of terms is truly enriched (two-stage) | Hypergeometric | Two-stage BH or Two-stage BKY | Estimates the proportion of true nulls \(\pi_0\); gains power over BH when \(\pi_0\) is clearly below 1, while still controlling FDR |
From overlap to statistics: the 2×2 table
For a given term (e.g. pathway) and your selected gene set, we construct a contingency table:
| | In term | Not in term | Total |
|---|---|---|---|
| Your genes (query) | k | \(n - k\) | n |
| Background | \(M - k\) | \(N - n - (M - k)\) | \(N - n\) |
| Total | M | \(N - M\) | N |
- N: background universe size (e.g., genes present in all Enrichr libraries)
- M: term size (genes in the term)
- n: your selected genes
- k: observed overlap count
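The four cells can be counted directly from gene sets. The following is a minimal Python sketch, assuming genes are plain string identifiers; the function name `contingency_table` and the toy gene symbols are illustrative and are not taken from pathXcite.

```python
def contingency_table(query, term, universe):
    """Return ((k, n - k), (M - k, N - n - M + k)) for one term.

    Genes outside the universe are dropped first, so that all four
    cells are counted against the same background.
    """
    universe = set(universe)
    query = set(query) & universe
    term = set(term) & universe
    N, M, n = len(universe), len(term), len(query)
    k = len(query & term)
    return (k, n - k), (M - k, N - n - M + k)


universe = {f"G{i}" for i in range(1, 21)}   # N = 20 toy genes
term = {"G1", "G2", "G3", "G4", "G5"}        # M = 5
query = {"G1", "G2", "G3", "G10", "G99"}     # G99 is outside the universe

print(contingency_table(query, term, universe))
```

Restricting both the query and the term to the universe before counting is a design choice of this sketch; it keeps the row and column totals consistent with N.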
Hypergeometric test (enrichment p-value)
Assuming random sampling without replacement from the background, the probability of observing exactly \(k\) overlapping genes is:
\[ \Pr(K = k) \;=\; \frac{\binom{M}{k}\,\binom{N-M}{\,n-k\,}}{\binom{N}{n}} \, . \]
The one-sided enrichment p-value is the tail probability of observing at least this much overlap by chance:
\[ p_{\text{enrich}} \;=\; \sum_{i=k}^{\min(n,M)} \frac{\binom{M}{i}\,\binom{N-M}{\,n-i\,}}{\binom{N}{n}} \, . \]
(\(\binom{a}{b}\) is “\(a\) choose \(b\)”.) This is numerically identical to Fisher's exact test (right-tailed) on the 2×2 table above.
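As an illustration, the tail sum can be evaluated exactly with the standard library. This is a sketch rather than the tool's implementation; the function name `hypergeom_enrichment_p` is hypothetical, and statistical libraries such as SciPy provide equivalent survival functions.

```python
from math import comb


def hypergeom_enrichment_p(k, N, M, n):
    """Right-tail probability P(K >= k) of the hypergeometric distribution."""
    upper = min(n, M)
    # Exact integer arithmetic for the numerator, one division at the end.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy case: N = 20 genes, term of M = 5, query of n = 4, overlap k = 3.
p = hypergeom_enrichment_p(3, N=20, M=5, n=4)
print(p)
```

Python integers have arbitrary precision, so this stays exact even for realistic universe sizes, at the cost of speed compared with a log-space implementation.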
Odds ratio (effect size)
The odds ratio summarizes the magnitude of enrichment as an effect size; unlike the p-value, it does not grow merely because the gene list or the background is larger:
\[ \OR \;=\; \frac{k/(n-k)}{\, (M-k)/(N - M - n + k) \,} \, . \]
- \(\OR > 1\): enrichment; \(\OR < 1\): depletion.
- We often report \(\log_2(\OR)\) for symmetry.
- Confidence intervals can be formed from the standard large-sample variance of \(\ln \OR\) (the sum of the reciprocals of the four cell counts); for large counts the resulting inference agrees closely with Fisher's exact test.
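A small sketch of the odds ratio, its \(\log_2\) transform, and an approximate 95% interval on the log scale. The function name is hypothetical, and adding 0.5 to every cell when a cell is zero is an assumption about how the zero-cell case is handled, not a confirmed detail of pathXcite.

```python
from math import exp, log, log2, sqrt


def odds_ratio_summary(k, N, M, n, z_crit=1.96):
    """Return (OR, log2 OR, (lower, upper)) with an approximate 95% interval."""
    a, b = k, n - k                # query genes: in term / not in term
    c, d = M - k, N - M - n + k    # background genes: in term / not in term
    if 0 in (a, b, c, d):
        # Assumption: add 0.5 to every cell so the ratio and its
        # variance stay finite when a cell is empty.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a / b) / (c / d)
    se_ln_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    lower = exp(log(odds_ratio) - z_crit * se_ln_or)
    upper = exp(log(odds_ratio) + z_crit * se_ln_or)
    return odds_ratio, log2(odds_ratio), (lower, upper)


or_, log2_or, (lo, hi) = odds_ratio_summary(k=3, N=20, M=5, n=4)
print(or_, log2_or, lo, hi)
```

With counts this small the interval is very wide, which is exactly the situation in which a large odds ratio should be read with caution.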
Z-score (deviation from expectation)
Given the hypergeometric expectation and variance for the overlap:
\[ \mu \;=\; \E[K] \;=\; n\,\frac{M}{N} \qquad\text{and}\qquad \sigma^2 \;=\; \Var[K] \;=\; n\,\frac{M}{N}\left(1-\frac{M}{N}\right)\frac{N-n}{N-1} \]
The z-score measures how many standard deviations the observation is from expectation:
\[ z \;=\; \frac{k - \mu}{\sigma} \, . \]
- \(z > 0\) suggests enrichment; larger means more surprising given the background.
- Z-scores complement p-values: p reflects tail probability; z reflects standardized effect vs. expectation.
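The expectation, standard deviation, and z-score follow directly from the formulas above. A minimal sketch, with a hypothetical function name and the same toy numbers used for illustration only:

```python
from math import sqrt


def hypergeom_z(k, N, M, n):
    """Return (z, mu, sigma) for an observed overlap k.

    Assumes 0 < M < N and 0 < n < N so that the variance is positive.
    """
    frac = M / N
    mu = n * frac
    var = n * frac * (1 - frac) * (N - n) / (N - 1)
    sigma = sqrt(var)
    return (k - mu) / sigma, mu, sigma


z, mu, sigma = hypergeom_z(k=3, N=20, M=5, n=4)
print(z, mu, sigma)
```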
Combined score (p × z composite)
Many tools (including pathXcite) expose a composite score to rank terms by both rarity and standardized surprise. A common form is:
\[ \text{combined\_score} \;=\; z \cdot \bigl(-\log_{10}(p_{\text{enrich}})\bigr) \, . \]
- Down-weights terms that are “barely significant” (weak \(-\log_{10} p\)) or “only large-\(z\) with mediocre \(p\).”
- Use for ranking; report adjusted p-values for statistical control.
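A sketch of ranking by the combined score as defined above. The term names and values are made up for illustration, and the floor on \(p\) is an added safeguard (an assumption of this sketch) against p-values that underflow to zero.

```python
from math import log10


def combined_score(z, p, p_floor=1e-300):
    """z * -log10(p); p is floored so an underflowed 0.0 cannot break log10."""
    return z * -log10(max(p, p_floor))


# Hypothetical rows: (term name, z-score, raw p-value)
rows = [
    ("term_A", 2.5, 0.03),
    ("term_B", 10.0, 1e-9),
    ("term_C", 4.0, 0.2),
]
ranked = sorted(rows, key=lambda r: combined_score(r[1], r[2]), reverse=True)
print([name for name, _, _ in ranked])
```

Here `term_C` has a larger z-score than `term_A` but a weaker p-value, so it ranks lower.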
Multiple testing: adjusted p-values (FDR/FWER)
Testing hundreds of terms inflates false positives. We adjust p-values across the set of \(m\) tests.
Two families of control
- FWER (family-wise error rate): probability of ≥1 false positive. Very strict. Methods: Bonferroni, Sidak, Holm, Holm-Sidak, Hochberg, Hommel.
- FDR (false discovery rate): expected proportion of false discoveries among rejections. More power. Methods: BH, BY, two-stage BH, two-stage BKY.
Benjamini-Hochberg (BH, FDR)
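Sort p-values ascending: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\). For a desired FDR level \(\alpha\), find the largest rank \(r\) with \(p_{(r)} \le (r/m)\alpha\) and reject the hypotheses belonging to \(p_{(1)},\dots,p_{(r)}\). (The rank \(r\) is unrelated to the overlap count \(k\) used earlier.) Adjusted p-values are: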
\[ p^{\mathrm{BH}}_{(i)} \;=\; \min_{j\ge i}\left\{ \frac{m}{j}\, p_{(j)} \right\} \;\; \text{clipped to }[0,1]. \]
- Valid under independence and positive dependence between tests.
- Good default for gene set libraries without extreme term overlap.
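The adjusted p-value formula can be computed with one pass from the largest p-value downwards, keeping a running minimum. A minimal sketch with a hypothetical function name; results are returned in the original input order.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0                                    # also clips at 1
    for rank in range(m, 0, -1):                         # largest p first
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(pvals))
```

A term is then called significant at FDR level \(\alpha\) when its adjusted value is at most \(\alpha\).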
Benjamini-Yekutieli (BY, FDR)
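Same step-up procedure as BH, but with thresholds tightened by the harmonic penalty \(c(m)=\sum_{i=1}^{m} \tfrac{1}{i}\), so the adjusted p-values are the BH values multiplied by \(c(m)\):
\[ p^{\mathrm{BY}}_{(i)} \;=\; \min_{j\ge i}\left\{ \frac{m \, c(m)}{j}\, p_{(j)} \right\} \;\; \text{clipped to }[0,1]. \]
- Controls FDR under arbitrary dependence; conservative because \(c(m)\approx \ln m + \gamma\) (about 9.8 for \(m=10{,}000\)).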
- Use when terms are heavily overlapping/hierarchical (e.g., GO graph).
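Because the harmonic penalty makes the procedure more conservative, BY-adjusted p-values are the BH-adjusted values scaled up by \(c(m)\) and clipped at 1. A minimal sketch with a hypothetical function name:

```python
def by_adjust(pvalues):
    """Benjamini-Yekutieli adjusted p-values, in the input order."""
    m = len(pvalues)
    c_m = sum(1 / i for i in range(1, m + 1))            # harmonic penalty c(m)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0                                    # also clips at 1
    for rank in range(m, 0, -1):                         # largest p first
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(by_adjust(pvals))
```

For these four tests \(c(4) = 25/12 \approx 2.08\), so every adjusted value is roughly twice its BH counterpart.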
Two-stage BH (Storey/Tibshirani-style)
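Estimate the proportion of true nulls \(\pi_0\) (e.g., via a tuning parameter \(\lambda\)), then replace \(m\) by \(m\hat{\pi}_0\) in the BH thresholds. This increases power when a substantial fraction of terms is truly enriched (\(\hat{\pi}_0\) well below 1); when almost all terms are null, \(\hat{\pi}_0 \approx 1\) and the result is essentially BH.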
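One common \(\lambda\)-based estimator counts the p-values above \(\lambda\), which are assumed to come mostly from true nulls, and rescales by \(1-\lambda\). The sketch below uses that estimator as an assumption; the source does not state which estimator pathXcite applies, and the function name and p-values are illustrative.

```python
def storey_pi0(pvalues, lam=0.5):
    """Estimate pi0 as the share of p-values above lam, rescaled by 1 - lam."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    # Capped at 1; note the estimate can be 0 if no p-value exceeds lam.
    return min(1.0, above / (m * (1 - lam)))


pvals = [0.001, 0.002, 0.004, 0.01, 0.02, 0.03, 0.2, 0.6, 0.75, 0.9]
pi0 = storey_pi0(pvals)
m_eff = pi0 * len(pvals)   # effective number of tests used in the BH thresholds
print(pi0, m_eff)
```

With an effective test count below \(m\), every BH threshold is relaxed by the factor \(1/\hat{\pi}_0\).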
Two-stage BKY (Benjamini-Krieger-Yekutieli)
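Adaptive step-up procedure that estimates the number of true nulls from the rejections of a first BH pass run at level \(\alpha/(1+\alpha)\), then reruns BH with a correspondingly relaxed threshold. It can be more powerful than BH when many terms are truly enriched, while still controlling FDR.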
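A sketch of the two-stage procedure as it is usually described: a first BH pass at level \(\alpha/(1+\alpha)\) yields \(r_1\) rejections, the number of true nulls is estimated as \(m - r_1\), and a second BH pass uses the level scaled by \(m/(m - r_1)\). Function names and the example p-values are illustrative assumptions.

```python
def bh_reject_count(sorted_p, alpha):
    """Number of rejections of the BH step-up procedure on ascending p-values."""
    m = len(sorted_p)
    count = 0
    for rank, p in enumerate(sorted_p, start=1):
        if p <= rank / m * alpha:
            count = rank          # keep the largest rank that passes
    return count


def bky_two_stage(pvalues, alpha=0.05):
    """Return a list of reject decisions (True/False) in the input order."""
    m = len(pvalues)
    sorted_p = sorted(pvalues)
    alpha1 = alpha / (1 + alpha)
    r1 = bh_reject_count(sorted_p, alpha1)               # stage 1
    if r1 == 0 or r1 == m:
        r2 = r1                                          # nothing to adapt
    else:
        m0_hat = m - r1                                  # estimated true nulls
        r2 = bh_reject_count(sorted_p, alpha1 * m / m0_hat)   # stage 2
    cutoff = sorted_p[r2 - 1] if r2 > 0 else -1.0
    return [p <= cutoff for p in pvalues]


pvals = [0.001, 0.002, 0.004, 0.01, 0.02, 0.03, 0.2, 0.6, 0.75, 0.9]
decisions = bky_two_stage(pvals)
print(sum(decisions))
```

In this example the first stage rejects five hypotheses and the relaxed second stage adds a sixth.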
Bonferroni (FWER)
\[ p_{\text{Bonf}} \;=\; \min\!\bigl(1,\, m\,p\bigr). \]
Sidak (FWER)
\[ p_{\text{Sidak}} \;=\; 1 - (1-p)^m \;\;\; (\approx m\,p, \text{ i.e. Bonferroni, when } m\,p\ll 1). \]
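The two single-step adjustments side by side, as a small sketch with hypothetical function names. The Sidak value is computed through `log1p` and `expm1`, a numerical choice of this sketch that avoids precision loss when \(p\) is tiny.

```python
from math import expm1, log1p


def bonferroni(p, m):
    """Bonferroni-adjusted p-value, clipped at 1."""
    return min(1.0, m * p)


def sidak(p, m):
    """1 - (1 - p)**m, computed via log1p/expm1 to stay accurate for tiny p."""
    if p >= 1.0:
        return 1.0
    return -expm1(m * log1p(-p))


for p, m in [(1e-6, 10_000), (0.01, 100)]:
    print(p, m, bonferroni(p, m), sidak(p, m))
```

The first pair is nearly identical for both methods because \(m\,p = 0.01\) is small; the second differs clearly because \(m\,p = 1\).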
Holm (step-down FWER)
Sort p-values ascending. For \(i=1,\dots,m\), compare \(p_{(i)}\) to \(\alpha/(m-i+1)\); stop at the first failure and reject only the hypotheses before it. Adjusted p-values:
\[ p^{\mathrm{Holm}}_{(i)} \;=\; \min\!\Bigl(1,\; \max_{j\le i}\left\{ (m-j+1)\, p_{(j)} \right\}\Bigr). \]
Never less powerful than Bonferroni and valid under any dependence between tests; recommended when you need FWER control.
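Holm's adjusted p-values can be computed with a running maximum over the ascending p-values. A minimal sketch with a hypothetical function name; the final clip at 1 keeps the result a valid p-value.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):          # smallest p first
        running_max = max(running_max, (m - rank + 1) * pvalues[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(holm_adjust(pvals))
```

Only the smallest p-value receives the full Bonferroni factor \(m\); later ranks are multiplied by progressively smaller factors.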
Holm-Sidak (step-down FWER)
As Holm, but replaces \(\alpha/(m-i+1)\) with the Sidak-derived thresholds \(1-(1-\alpha)^{1/(m-i+1)}\); slightly more power under independence.
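In adjusted p-value form, the Sidak transform replaces Holm's multiplication by the number of remaining hypotheses. A minimal sketch with a hypothetical function name:

```python
def holm_sidak_adjust(pvalues):
    """Holm-Sidak step-down adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):          # smallest p first
        remaining = m - rank + 1                         # hypotheses still in play
        running_max = max(running_max, 1 - (1 - pvalues[idx]) ** remaining)
        adjusted[idx] = running_max
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(holm_sidak_adjust(pvals))
```

Each value is slightly smaller than the corresponding Holm value (for example about 0.0591 instead of 0.06 for the two largest p-values here).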
Hochberg (step-up FWER; “Simes-Hochberg”)
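Sort p-values ascending. Working from the largest p-value down (\(i=m,\dots,1\)), compare \(p_{(i)}\) to \(\alpha/(m-i+1)\); at the first \(i\) that passes, reject the hypotheses belonging to \(p_{(1)},\dots,p_{(i)}\). At least as powerful as Holm, but its FWER guarantee requires independence or positive dependence between tests.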
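Hochberg uses the same multipliers as Holm but takes a running minimum from the largest p-value downwards instead of a running maximum from the smallest upwards. A minimal sketch with a hypothetical function name:

```python
def hochberg_adjust(pvalues):
    """Hochberg step-up adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0                                    # also clips at 1
    for rank in range(m, 0, -1):                         # largest p first
        idx = order[rank - 1]
        running_min = min(running_min, (m - rank + 1) * pvalues[idx])
        adjusted[idx] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(hochberg_adjust(pvals))
```

For these inputs the two largest p-values adjust to 0.04, whereas Holm's running maximum would give 0.06 for both.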
Hommel (FWER)
Closed-testing procedure built on Simes tests; algorithmically more involved than the step-wise methods above. At least as powerful as Hochberg and Holm, under the same independence or positive-dependence assumption as Hochberg.
Reading the result table
| Column | How it's computed | Interpretation |
|---|---|---|
| Terms | Gene sets (e.g., pathways) from the chosen library | Concepts tested for enrichment |
| Overlap | Shown as “\(k/M\)”, plus the list of matched genes | Matched genes vs. term size; not a probability by itself |
| P-value | Right-tailed hypergeometric (or Fisher exact) with \(N, M, n, k\) | Chance of \(\ge k\) overlap if genes were random |
| Odds Ratio | 2×2 table formula; add 0.5 if any cell is zero | Effect size (\(>1\) enrichment; \(<1\) depletion) |
| Z-score | \(\bigl(k - n\cdot M/N\bigr) / \sqrt{n\,(M/N)\,(1-M/N)\,((N-n)/(N-1))}\) | Standardized deviation from expectation |
| Combined Score | \(z \cdot \bigl(-\log_{10}(p)\bigr)\) | Ranking aid mixing rarity and standardized surprise |
| Adjusted P-value | Apply chosen correction (BH/BY/Bonferroni/Holm/…) | Controls FDR or FWER across terms |
Practical guidance & diagnostics
Choosing a correction
- BH (FDR): best default; balanced power/control.
- BY (FDR): if term dependence is extreme; expect fewer discoveries.
- Holm (FWER): when any false positive is unacceptable.
- Two-stage BH/BKY: larger screens where a sizeable fraction of terms is expected to be enriched; more power than BH.
When results look odd
- Huge \(\OR\) but modest \(p\): small \(k\) on tiny \(M\). Verify stability; check \(z\) and adjusted \(p\).
- Strong \(p\) but \(\OR\approx 1\): large \(n\) or \(N\) can make tiny effects significant; interpret with effect sizes.
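- Everything significant: the universe is probably mis-specified (for example, much larger than the set of genes that could actually have been selected, which shrinks \(\mu\)) or the selection is biased; revisit \(N\), \(n\), and filters.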
- Nothing significant: Try broader document/gene selection or switch library; check power via expected \(\mu\).
Mini-example (numbers)
Suppose \(N=20{,}000\), \(M=260\), \(n=120\), and you observe \(k=14\).
- Expectation: \(\mu = 120 \cdot (260/20000) = 1.56\); far below 14 → enrichment signal.
- Variance: \(\sigma^{2} = n\cdot \frac{M}{N} \cdot \left(1-\frac{M}{N}\right)\cdot \frac{N-n}{N-1} \approx 1.53\); \(\sigma \approx 1.24\)
- Z-score: \(z \approx (14-1.56)/1.24 \approx 10.1\) (very large)
- P-value (right tail hypergeometric): numerically \(\ll 10^{-6}\)
- Odds ratio: \((14/106)/(246/19634) \approx 10.5\), well above 1. The combined score is therefore very high, and the BH-adjusted \(p\) remains tiny even if you test, say, \(m=10{,}000\) terms.
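The numbers in this example can be reproduced with the standard library alone. This is a sketch for checking the arithmetic, not the tool's code; the Bonferroni product is used only as a simple upper bound on the BH-adjusted value.

```python
from math import comb, log10, sqrt

N, M, n, k = 20_000, 260, 120, 14

mu = n * M / N
var = mu * (1 - M / N) * (N - n) / (N - 1)
z = (k - mu) / sqrt(var)

# Exact right-tail hypergeometric probability P(K >= k)
tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, min(n, M) + 1))
p = tail / comb(N, n)

odds_ratio = (k / (n - k)) / ((M - k) / (N - M - n + k))
combined = z * -log10(p)
bonferroni_bound = min(1.0, 10_000 * p)   # BH-adjusted p cannot exceed this

print(mu, var, z, p, odds_ratio, combined, bonferroni_bound)
```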
Next steps
Keep this page handy while interpreting results. For workflow tuning, continue with: