Hypergeometric test, Odds Ratios, Z-scores & FDR Control

Understanding results in pathXcite

Understand exactly how overlap, p-values, odds ratios, z-scores, and multiple-testing corrections are computed, what they mean, and which options to choose for robust conclusions.

What is ORA and why do we need multiple-testing correction?

Over-representation analysis (ORA) is a statistical framework for asking whether a predefined set of genes (such as a pathway, GO term, or disease signature) contains more genes from your input list than expected by chance. It takes your selected gene set (in pathXcite this refers to the literature-derived genes you choose for analysis) and compares its overlap with each term in a database against a random expectation based on the background universe.

Concretely, ORA builds a 2×2 contingency table for each term and uses the hypergeometric test (equivalent to a one-sided Fisher's exact test) to calculate a p-value for enrichment. Terms with significantly more overlap than expected suggest biological processes or pathways that may be relevant to your research topic.

Because ORA typically tests hundreds or thousands of terms at once, the probability of obtaining some low p-values purely by chance becomes high; this is the multiple testing problem. To control false positives and make results interpretable, p-values must be adjusted using correction methods such as Benjamini-Hochberg (FDR control) or Bonferroni/Holm (FWER control). The choice of correction affects the balance between discovery power and error control, and should reflect the structure of your data and your tolerance for false positives.

Quick chooser: what should I use?

Situation | Test | Correction | Why
--- | --- | --- | ---
Standard ORA on many terms; libraries assumed independent or mildly correlated | Hypergeometric (one-sided enrichment) | Benjamini-Hochberg (BH) | Controls FDR under independence/positive dependence; good power
Terms may be arbitrarily dependent (nested/overlapping, GO graph) | Hypergeometric | Benjamini-Yekutieli (BY) | FDR control under arbitrary dependence; conservative
Very small number of tests (≤ 10) or you must avoid any false positives | Hypergeometric | Holm (step-down) or Bonferroni | Controls FWER; Holm usually strictly more powerful than Bonferroni
Want a bit more power than BH when an appreciable fraction of terms are truly non-null | Hypergeometric | Two-stage BH or Two-stage BKY | Estimates the proportion of true nulls; increases power while controlling FDR
Default in pathXcite: BH (FDR). You can switch the correction method in the Enrichment panel.

From overlap to statistics: the 2×2 table

For a given term (e.g. pathway) and your selected gene set, we construct a contingency table:

 | In term | Not in term | Total
--- | --- | --- | ---
Your genes (query) | \(k\) | \(n - k\) | \(n\)
Background (rest of universe) | \(M - k\) | \(N - n - (M - k)\) | \(N - n\)
Total | \(M\) | \(N - M\) | \(N\)

Here \(N\) is the size of the background universe, \(n\) the number of query genes, \(M\) the number of genes annotated to the term, and \(k\) the observed overlap between the two.

Universe choice matters: Using all genes in a library vs. only genes observed in your corpus changes expectations (\(\E[K]=n\cdot M/N\)) and p-values.
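
For concreteness, here is a minimal sketch of how the four counts can be derived from plain Python sets; the gene symbols and variable names are illustrative, not pathXcite internals.

```python
# Toy sketch (not pathXcite internals): derive N, M, n, k from plain Python sets.
universe = {"TP53", "BRCA1", "EGFR", "MYC", "KRAS", "PTEN", "AKT1", "STAT3"}  # background universe
term     = {"TP53", "PTEN", "AKT1", "STAT3"}                                  # one gene set from the library
query    = {"TP53", "AKT1", "GATA3"}                                          # your selected genes

# Restrict both sets to the universe so the 2x2 table is internally consistent.
query_u = query & universe
term_u  = term & universe

N = len(universe)           # background size
M = len(term_u)             # term size within the universe
n = len(query_u)            # query size within the universe
k = len(query_u & term_u)   # observed overlap

# 2x2 table: rows = (query, rest of background), cols = (in term, not in term)
table = [[k,     n - k],
         [M - k, N - n - (M - k)]]
print(N, M, n, k, table)
```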

Hypergeometric test (enrichment p-value)

Assuming random sampling without replacement from the background, the probability of observing exactly \(k\) overlapping genes is:

\[ \Pr(K = k) \;=\; \frac{\binom{M}{k}\,\binom{N-M}{\,n-k\,}}{\binom{N}{n}} \, . \]

The one-sided enrichment p-value is the tail probability of observing at least this much overlap by chance:

\[ p_{\text{enrich}} \;=\; \sum_{i=k}^{\min(n,M)} \frac{\binom{M}{i}\,\binom{N-M}{\,n-i\,}}{\binom{N}{n}} \, . \]

(\(\binom{a}{b}\) is “\(a\) choose \(b\)”.) This is numerically identical to Fisher's exact test (right-tailed) on the 2×2 table above.

Interpretation: Small p-values mean the observed overlap would be rare if your genes were unrelated to the term. P-values depend on \(N, M, n, k\).
Edge cases: If \(n=0\), \(M=0\), or \(N \le \max(n,M)\), the test is degenerate or undefined. If \(k=0\), the p-value is 1.
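
If you want to reproduce the enrichment p-value outside pathXcite, SciPy's hypergeometric distribution gives the same right tail; a minimal sketch (note that SciPy's parameter order differs from the \(N, M, n\) naming used here):

```python
from scipy.stats import hypergeom

def enrichment_pvalue(N: int, M: int, n: int, k: int) -> float:
    """Right-tailed hypergeometric p-value P(K >= k).

    N: universe size, M: term size, n: query size, k: overlap.
    """
    # SciPy's parameter order is (population size, successes in population, draws).
    return hypergeom(N, M, n).sf(k - 1)   # sf(k - 1) = P(K > k - 1) = P(K >= k)

# The same number comes from Fisher's exact test on the 2x2 table:
# from scipy.stats import fisher_exact
# _, p = fisher_exact([[k, n - k], [M - k, N - n - (M - k)]], alternative="greater")

print(enrichment_pvalue(N=20_000, M=260, n=120, k=14))
```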

Odds ratio (effect size)

The odds ratio summarizes enrichment magnitude, independent of sample size:

\[ \OR \;=\; \frac{k/(n-k)}{\, (M-k)/(N - M - n + k) \,} \, . \]

Zero cells? Add a small continuity offset (e.g., +0.5) to all cells before computing \(\OR\) to avoid division by zero.
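
A minimal sketch of the odds-ratio calculation, applying the +0.5 offset only when a zero cell occurs (one common convention; pathXcite's exact handling may differ):

```python
def odds_ratio(N: int, M: int, n: int, k: int) -> float:
    """Odds ratio for the 2x2 table; adds +0.5 to every cell if any cell is zero."""
    a, b = k, n - k                   # query row: in term / not in term
    c, d = M - k, N - n - (M - k)     # background row: in term / not in term
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))   # Haldane-Anscombe-style offset
    return (a / b) / (c / d)

print(odds_ratio(N=20_000, M=260, n=120, k=14))   # roughly 10.5
```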

Z-score (deviation from expectation)

Given the hypergeometric expectation and variance for the overlap:

\[ \mu \;=\; \E[K] \;=\; n\,\frac{M}{N} \qquad\text{and}\qquad \sigma^2 \;=\; \Var[K] \;=\; n\,\frac{M}{N}\left(1-\frac{M}{N}\right)\frac{N-n}{N-1} \]

The z-score measures how many standard deviations the observation is from expectation:

\[ z \;=\; \frac{k - \mu}{\sigma} \, . \]

Small \(\sigma\): When \(\sigma \approx 0\) (e.g., extreme \(n\) or \(M\)), \(z\) is unstable. Prefer p-value and \(\OR\) there.
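
The same quantities in code, a short sketch that follows the formulas above:

```python
from math import sqrt

def hypergeom_zscore(N: int, M: int, n: int, k: int) -> float:
    """Standardized deviation of the observed overlap from its expectation."""
    p = M / N
    mu = n * p                                    # expected overlap E[K]
    var = n * p * (1 - p) * (N - n) / (N - 1)     # hypergeometric variance Var[K]
    if var == 0:
        return float("nan")                       # degenerate case; rely on p-value / OR instead
    return (k - mu) / sqrt(var)

print(hypergeom_zscore(N=20_000, M=260, n=120, k=14))   # roughly 10
```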

Combined score (p × z composite)

Many tools (including pathXcite) expose a composite score to rank terms by both rarity and standardized surprise. A common form is:

\[ \text{combined\_score} \;=\; z \cdot \bigl(-\log_{10}(p_{\text{enrich}})\bigr) \, . \]

Implementations vary (e.g., some use \(z \cdot \ln(1/p)\)). The qualitative behavior is the same.
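
A small sketch using the \(-\log_{10}\) form shown above (the zero-p floor is an implementation choice, not part of the definition):

```python
from math import log10

def combined_score(z: float, p_enrich: float, floor: float = 1e-300) -> float:
    """Composite ranking score z * (-log10(p)); the floor guards against p == 0."""
    return z * (-log10(max(p_enrich, floor)))

print(combined_score(z=10.0, p_enrich=1e-9))   # 90.0
```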

Multiple testing: adjusted p-values (FDR/FWER)

Testing hundreds of terms inflates false positives. We adjust p-values across the set of \(m\) tests.

Two families of control

  • FWER (family-wise error rate): probability of ≥1 false positive. Very strict. Methods: Bonferroni, Sidak, Holm, Holm-Sidak, Hochberg, Hommel.
  • FDR (false discovery rate): expected proportion of false discoveries among rejections. More power. Methods: BH, BY, two-stage BH, two-stage BKY.
pathXcite default: BH (FDR). Switch to BY if terms are highly dependent and you prefer conservatism; to Holm if you must control FWER.

Benjamini-Hochberg (BH, FDR)

Sort p-values ascending: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\). For desired FDR \(\alpha\), find largest \(k\) with \(p_{(k)} \le (k/m)\alpha\). Reject \(p_{(1)},\dots,p_{(k)}\). Adjusted p-values are:

\[ p^{\mathrm{BH}}_{(i)} \;=\; \min_{j\ge i}\left\{ \frac{m}{j}\, p_{(j)} \right\} \;\; \text{clipped to }[0,1]. \]
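
A compact sketch of the BH adjustment that mirrors the step-up rule above:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: min over j >= i of (m/j) * p_(j), clipped to [0, 1]."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                                         # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)                   # (m / j) * p_(j)
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]   # running min over j >= i
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

print(bh_adjust([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.300, 0.740]))
```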

Benjamini-Yekutieli (BY, FDR)

Same as BH but with a harmonic penalty \(c(m)=\sum_{i=1}^{m} \tfrac{1}{i}\):

\[ p^{\mathrm{BY}}_{(i)} \;=\; \min_{j\ge i}\left\{ \frac{m \, p_{(j)}}{j \, c(m)} \right\}. \]

Two-stage BH (Storey/Tibshirani-style)

Estimate the proportion of true nulls \(\pi_0\) (e.g., via a tuning \(\lambda\)). Replace \(m\) by \(m\hat\pi_0\) in the BH thresholds. Increases power when an appreciable fraction of terms are truly non-null (\(\pi_0 < 1\)), while still controlling FDR.

Two-stage BKY (Benjamini-Krieger-Yekutieli)

Adaptive step-up that estimates \(\pi_0\) differently and can be more powerful than BH in some regimes while controlling FDR.

Bonferroni (FWER)

\[ p_{\text{Bonf}} \;=\; \min\!\bigl(1,\, m\,p\bigr). \]

Sidak (FWER)

\[ p_{\text{Sidak}} \;=\; 1 - (1-p)^m \;\approx\; m\,p \quad (\text{i.e., close to Bonferroni when } p\ll 1). \]

Holm (step-down FWER)

Sort p's ascending. For \(i=1,\dots,m\), compare \(p_{(i)}\) to \(\alpha/(m-i+1)\); stop at first failure. Adjusted p-values:

\[ p^{\mathrm{Holm}}_{(i)} \;=\; \max_{j\le i}\left\{ (m-j+1)\, p_{(j)} \right\}. \]

Uniformly more powerful than Bonferroni; recommended when you need FWER control.
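
A minimal sketch of the Holm adjustment, mirroring the formula above:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values: max over j <= i of (m - j + 1) * p_(j), clipped to [0, 1]."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                            # ascending p-values
    scaled = p[order] * (m - np.arange(m))           # (m - j + 1) * p_(j)
    adjusted_sorted = np.maximum.accumulate(scaled)  # enforce monotone non-decreasing adjustments
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

print(holm_adjust([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.300, 0.740]))
```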

Holm-Sidak (step-down FWER)

As Holm but replaces \(\alpha/(m-i+1)\) with Sidak-derived thresholds; slightly more power under independence.

Hochberg (step-up FWER; “Simes-Hochberg”)

Sort p's ascending. For \(i=m,\dots,1\), compare \(p_{(i)}\) to \(\alpha/(m-i+1)\); reject down to the first pass. More powerful than Holm under independence.

Hommel (FWER)

Closed-testing procedure based on Simes tests; generally the most powerful of the FWER methods listed here (at least as powerful as Hochberg and Holm), but algorithmically more involved.

Adjusted p-value (“q-value” in FDR context): the smallest \(\alpha\) at which the hypothesis would be called significant under the chosen procedure. Always report the method along with adjusted p-values (e.g., “BH-adjusted p”).
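
Rather than hand-rolling each procedure, a library routine can apply any of the corrections above. The sketch below uses statsmodels; the method strings are that library's names and may not match the labels shown in pathXcite's Enrichment panel:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.0002, 0.004, 0.030, 0.045, 0.120, 0.500]

for method in ("bonferroni", "sidak", "holm", "holm-sidak", "simes-hochberg",
               "hommel", "fdr_bh", "fdr_by", "fdr_tsbh", "fdr_tsbky"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>14}  adjusted: {p_adj.round(4)}  rejected: {reject.sum()}")
```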

Reading the result table

Column | How it's computed | Interpretation
--- | --- | ---
Terms | Gene sets (e.g., pathways) from the chosen library | Concepts tested for enrichment
Overlap | Shown as "\(k/M\)" plus the list of matched genes | Matched count vs. term size; not a probability by itself
P-value | Right-tailed hypergeometric (or Fisher exact) with \(N, M, n, k\) | Chance of \(\ge k\) overlap if the genes were random
Odds Ratio | 2×2 table formula; add 0.5 to each cell if any cell is zero | Effect size (\(>1\) enrichment; \(<1\) depletion)
Z-score | \(\bigl(k - n\,M/N\bigr) / \sqrt{n\,(M/N)\,(1-M/N)\,((N-n)/(N-1))}\) | Standardized deviation from expectation
Combined Score | \(z \cdot \bigl(-\log_{10}(p)\bigr)\) | Ranking aid mixing rarity and standardized surprise
Adjusted P-value | Apply the chosen correction (BH/BY/Bonferroni/Holm/…) | Controls FDR or FWER across terms

Practical guidance & diagnostics

Choosing a correction

  • BH (FDR): best default; balanced power/control.
  • BY (FDR): if term dependence is extreme; expect fewer discoveries.
  • Holm (FWER): when any false positive is unacceptable.
  • Two-stage BH/BKY: larger screens with many nulls; more power.

When results look odd

  • Huge \(\OR\) but modest \(p\): small \(k\) on tiny \(M\). Verify stability; check \(z\) and adjusted \(p\).
  • Strong \(p\) but \(\OR\approx 1\): large \(n\) or \(N\) can make tiny effects significant; interpret with effect sizes.
  • Everything significant: Universe too small or selection biased; revisit \(N/n\) and filters.
  • Nothing significant: Try broader document/gene selection or switch library; check power via expected \(\mu\).
Report best practice: term name; \(k/M\); \(\OR\) with CI (if available); raw \(p\); adjusted \(p\) (method); library & version; universe definition; selection criteria.

Mini-example (numbers)

Suppose \(N=20{,}000\), \(M=260\), \(n=120\), and you observe \(k=14\).
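
A sketch that reproduces the example end-to-end with SciPy (outputs are approximate):

```python
from math import sqrt
from scipy.stats import hypergeom

N, M, n, k = 20_000, 260, 120, 14

p_term = M / N
mu = n * p_term                                         # expected overlap, about 1.56
sigma = sqrt(n * p_term * (1 - p_term) * (N - n) / (N - 1))
z = (k - mu) / sigma                                    # roughly 10
odds = (k / (n - k)) / ((M - k) / (N - M - n + k))      # roughly 10.5
p_enrich = hypergeom(N, M, n).sf(k - 1)                 # on the order of 1e-9

print(f"mu={mu:.2f}  z={z:.1f}  OR={odds:.1f}  p={p_enrich:.2e}")
```

With these inputs the expected overlap is only about 1.56 genes, so observing \(k=14\) corresponds to \(z \approx 10\), \(\OR \approx 10.5\), and a tail probability on the order of \(10^{-9}\).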

Takeaway: big deviation from expectation + tiny tail probability → robust enrichment even after correction.

Next steps

Keep this page handy while interpreting results. For workflow tuning, continue with:

← Back to Tutorials