Tutorial: Gene Ranking Strategies, Absolute Frequency vs GF-IDF

Understand how different ranking strategies influence which genes are prioritized and learn how to choose the right one for your analysis.

1. Why gene ranking matters

Once genes have been identified from a set of scientific texts, the next step is deciding how to rank them. Ranking determines which genes you consider most relevant for further analysis, such as enrichment testing or network exploration. Two widely used strategies are absolute frequency and GF-IDF. Each captures a different notion of importance, and understanding their differences is key to interpreting your results correctly.

2. Ranking by absolute frequency

The most straightforward approach is to count how many times a gene appears across the set of articles you've collected. Genes that are mentioned more often receive a higher rank. This approach assumes that frequent mention is a proxy for biological importance or relevance.

Advantages

Limitations

Use absolute frequency when you want a broad picture of the most prominent genes associated with your search or when prioritizing coverage over specificity.
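The counting described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own implementation: the function name `rank_by_absolute_frequency`, the input layout (one list of gene symbols per article, with repeats for repeated mentions), and the placeholder gene names are all assumptions made for the example.

```python
from collections import Counter


def rank_by_absolute_frequency(articles):
    """Rank genes by total mention count across all articles.

    `articles` is a list of lists; each inner list holds the gene symbols
    mentioned in one article, repeated once per mention (assumed layout).
    """
    counts = Counter()
    for mentions in articles:
        counts.update(mentions)
    # Sort by count (descending), then by symbol so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data with placeholder gene names.
articles = [
    ["GENE_A", "GENE_B", "GENE_A"],
    ["GENE_A", "GENE_C"],
    ["GENE_B", "GENE_A"],
]
ranking = rank_by_absolute_frequency(articles)
for gene, count in ranking:
    print(gene, count)
```

Ties are broken alphabetically here only to keep the output stable; any other tie-breaking rule would fit the description above equally well.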

3. Ranking by GF-IDF

A more context-sensitive approach is to rank genes using GF-IDF (Gene Frequency–Inverse Document Frequency). This method re-weights how often a gene appears in your selected set by how common it is in the overall biomedical literature. The idea is to reward genes that are over-represented in your selection relative to their usual background frequency.

The GF-IDF formula

\(\require{html}\)

\[ \GFIDF\!\bigl(\tip{g}{a-gene};\ \tip{A}{selected-articles},\ \tip{P}{reference-corpus}\bigr) = \underbrace{ \frac{ \displaystyle \sum_{\tip{a \in A}{articles-in-A}} \tip{m(g,a)}{mentions-of-g-in-article-a} }{ \displaystyle \sum_{\tip{a \in A}{articles-in-A}} \sum_{\tip{g' \in \Gamma(A)}{genes-mentioned-in-A}} \tip{m(g',a)}{mentions-of-g'-in-article-a} } }_{\tip{\GF_A(g)}{gene-frequency-in-A}} \times \underbrace{ \ln \!\left( \frac{ \tip{|P|}{articles-in-P} + \tip{1}{add-one-smoothing} } { \tip{\df_P(g)}{articles-in-P-that-mention-g} + \tip{1}{add-one-smoothing} } \right) }_{\tip{\IDF_P(g)}{inverse-document-frequency-in-P}} \]

Equation 1. GF-IDF combines the gene frequency \(\GF_A(g)\) in your selected articles with a smoothed inverse document frequency \(\IDF_P(g)\) computed over the reference corpus.

Notation quick reference

\(g\) is a gene; \(A\) is the set of selected articles; \(P\) is the reference corpus; \(m(g,a)\) counts mentions of gene \(g\) in article \(a\); \(\Gamma(A)\) is the set of genes mentioned at least once in \(A\); \(\df_P(g)\) counts articles in \(P\) that mention \(g\); \(|P|\) is the total number of articles in \(P\); \(\ln\) is the natural logarithm; \(+1\) indicates add-one smoothing.
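Equation 1 translates fairly directly into code. The sketch below is one possible reading of it, under stated assumptions: the function name `gf_idf_scores`, the input layout (one list of gene symbols per selected article, repeated once per mention), and the `doc_freq` dictionary holding the reference-corpus counts are hypothetical. Treating a gene that is missing from `doc_freq` as having a document frequency of 0 is also an assumption, not something the formula specifies.

```python
import math
from collections import Counter


def gf_idf_scores(selected_articles, corpus_size, doc_freq):
    """Compute GF-IDF for every gene mentioned in the selected articles.

    selected_articles: list of lists of gene symbols (the set A).
    corpus_size: number of articles in the reference corpus, |P|.
    doc_freq: dict mapping gene -> number of articles in P mentioning it.
    """
    mentions = Counter()
    for article in selected_articles:
        mentions.update(article)
    total_mentions = sum(mentions.values())  # denominator of GF_A(g)
    if total_mentions == 0:
        return {}

    scores = {}
    for gene, count in mentions.items():
        gf = count / total_mentions
        # Smoothed IDF with the natural log, as in Equation 1.
        # Assumption: genes absent from doc_freq get a count of 0.
        idf = math.log((corpus_size + 1) / (doc_freq.get(gene, 0) + 1))
        scores[gene] = gf * idf
    return scores


# Hypothetical toy numbers: GENE_A is common in the corpus, GENE_B is rare.
selected = [["GENE_A", "GENE_B"], ["GENE_B", "GENE_A", "GENE_A"]]
scores = gf_idf_scores(selected, corpus_size=1000,
                       doc_freq={"GENE_A": 499, "GENE_B": 9})
for gene, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{gene}: {score:.4f}")
```

In this toy input GENE_A has more mentions in the selection, yet GENE_B scores higher because it is rare in the reference corpus, which is the re-weighting effect the formula is designed to produce.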

Advantages

Limitations

Worked micro-example

Suppose the articles in \(A\) contain 200 gene mentions in total, of which 10 are mentions of \(g\). The reference corpus \(P\) has \(|P|=1{,}000{,}000\) articles, and \(\df_P(g)=50{,}000\) of them mention \(g\).

| Component | Value | Computation |
| --- | --- | --- |
| Gene frequency in \(A\) (\(\GF_A(g)\)) | \(\dfrac{\displaystyle \sum_{a \in A} m(g,a)}{\displaystyle \sum_{a \in A} \sum_{g' \in \Gamma(A)} m(g',a)} = \dfrac{10}{200} = 0.05\) | relative frequency of \(g\) mentions in \(A\) |
| Inverse document frequency (\(\IDF_P(g)\)) | \(\ln \!\Bigl( \dfrac{1{,}000{,}000 + 1}{50{,}000 + 1} \Bigr) \approx 3.00\) | natural log with +1 smoothing |
| GF-IDF | \(0.05 \times 3.00 \approx 0.150\) | \(\GF_A(g) \times \IDF_P(g)\) |
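The arithmetic in the worked micro-example can be checked numerically. The variable names below are illustrative; the four input numbers are the ones given in the example.

```python
import math

gene_mentions = 10        # mentions of g in A
total_mentions = 200      # all gene mentions in A
corpus_size = 1_000_000   # |P|
doc_freq = 50_000         # articles in P that mention g

gf = gene_mentions / total_mentions
idf = math.log((corpus_size + 1) / (doc_freq + 1))  # natural log, +1 smoothing
gf_idf = gf * idf

print(f"GF = {gf:.2f}, IDF = {idf:.2f}, GF-IDF = {gf_idf:.3f}")
```

The unrounded IDF is slightly below 3, so the product is slightly below 0.15; both agree with the example once rounded to the precision shown there.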

4. Choosing the right ranking strategy

Whether you use absolute frequency or GF-IDF depends on the question you're asking:

| Criterion | Absolute Frequency | GF-IDF |
| --- | --- | --- |
| What it prioritizes | Most commonly mentioned genes | Genes unusually enriched in your selection |
| Best suited for | Highlighting well-known or canonical genes | Finding context-specific or novel candidates |
| Bias profile | Favors widely studied genes | Penalizes globally common genes |
| Recommended when | Creating overviews or summarizing mainstream knowledge | Exploring niche topics or reducing literature bias |

In many cases, examining both rankings provides complementary insight: frequency highlights the established core, while GF-IDF draws attention to less obvious but potentially important genes.
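One way to examine both rankings together is to list each gene's position under each strategy and look for large shifts. The sketch below assumes the same hypothetical input layout as a list of gene-symbol lists per article plus a dictionary of reference-corpus document frequencies; `compare_rankings` and the placeholder gene names are made up for illustration.

```python
import math
from collections import Counter


def compare_rankings(selected_articles, corpus_size, doc_freq):
    """Return (gene, frequency_rank, gf_idf_rank) tuples; ranks start at 1."""
    counts = Counter(g for article in selected_articles for g in article)
    total = sum(counts.values())
    # GF-IDF with natural log and add-one smoothing; genes missing from
    # doc_freq are assumed to have a document frequency of 0.
    gf_idf = {
        g: (c / total) * math.log((corpus_size + 1) / (doc_freq.get(g, 0) + 1))
        for g, c in counts.items()
    }
    by_freq = sorted(counts, key=lambda g: (-counts[g], g))
    by_gf_idf = sorted(gf_idf, key=lambda g: (-gf_idf[g], g))
    freq_rank = {g: i for i, g in enumerate(by_freq, start=1)}
    gf_idf_rank = {g: i for i, g in enumerate(by_gf_idf, start=1)}
    return [(g, freq_rank[g], gf_idf_rank[g]) for g in by_freq]


# Hypothetical toy data: GENE_A is mentioned most but is common in the corpus.
selected = [
    ["GENE_A", "GENE_A", "GENE_B"],
    ["GENE_A", "GENE_C", "GENE_B"],
]
comparison = compare_rankings(
    selected,
    corpus_size=10_000,
    doc_freq={"GENE_A": 4999, "GENE_B": 99, "GENE_C": 9},
)
for gene, f_rank, g_rank in comparison:
    print(f"{gene}: frequency rank {f_rank}, GF-IDF rank {g_rank}")
```

With these toy numbers the most frequently mentioned gene falls to last place under GF-IDF, while the two rarer genes move up. Genes whose rank changes sharply between the two columns are the ones worth a closer look.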

5. Practical workflow tips
