Tutorial: What is PubTator3?

Learn how PubTator3 powers gene annotation in pathXcite and why it's a cornerstone for literature-based enrichment analysis.

1. Introduction to PubTator3

PubTator3 is a large-scale text-mining and annotation service maintained by the National Center for Biotechnology Information (NCBI). It automatically scans biomedical literature, identifies key biological entities (genes, diseases, mutations, chemicals, and species), and links them to standardized database identifiers.

The tool continuously processes new and updated articles from PubMed and PubMed Central (PMC), producing a rich annotation layer over millions of abstracts and full-text articles. These annotations enable downstream applications like pathXcite to retrieve high-quality gene information directly from the literature without requiring users to perform manual entity recognition.

2. What PubTator3 does

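PubTator3 applies advanced named entity recognition (NER) and normalization methods to biomedical text. For each article, it detects mentions of genes, diseases, mutations, chemicals, and species and links each mention to a standardized database identifier.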

These annotations are regularly refreshed as the underlying NLP models improve and as the biomedical literature expands. Analyses performed today therefore tend to draw on more complete and accurate annotations than earlier ones. This matters for reproducibility: the same article list can yield a different gene set when it is annotated at a later date.
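Because annotations can change between runs, it can help to compare two snapshots of the same article set. The following is a minimal sketch, not part of pathXcite: the function name `diff_gene_annotations`, the snapshot layout (article ID mapped to a set of gene IDs), and the sample values are all assumptions made for illustration.

```python
def diff_gene_annotations(old, new):
    """Compare two snapshots that map article ID -> set of gene IDs.

    Returns only the articles whose gene set changed, with the gene IDs
    that were added or removed between the two snapshots.
    """
    changes = {}
    for article_id in sorted(set(old) | set(new)):
        before = old.get(article_id, set())
        after = new.get(article_id, set())
        added, removed = after - before, before - after
        if added or removed:
            changes[article_id] = {
                "added": sorted(added),
                "removed": sorted(removed),
            }
    return changes


# Hypothetical snapshots taken at two different dates (illustrative values only).
snapshot_old = {"PMID:1": {"7157"}, "PMID:2": {"672", "675"}}
snapshot_new = {"PMID:1": {"7157", "1956"}, "PMID:2": {"672"}}

changes = diff_gene_annotations(snapshot_old, snapshot_new)
for article_id, change in changes.items():
    print(article_id, change)
```

Recording the annotation date alongside each snapshot makes it easier to explain why two enrichment runs over the same article list differ.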

3. How pathXcite uses PubTator3

pathXcite relies on PubTator3 to perform gene annotation once you have curated a set of relevant articles. When you provide a list of PubMed or PMC identifiers, pathXcite queries the PubTator3 service and retrieves all gene mentions associated with those articles.

Each gene is then mapped to its Entrez Gene ID and added to your project's annotation database. These standardized annotations are essential for accurate gene ranking and enrichment, allowing downstream analysis against hundreds of gene set libraries without additional preprocessing.

Because PubTator3 handles entity recognition centrally, you do not need to parse articles or perform text mining yourself; pathXcite works with the annotations PubTator3 currently provides, with no extra setup.
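To make the retrieval step more concrete, here is a rough sketch of how gene mentions might be pulled out of one annotated document. It is not pathXcite's actual code. It assumes a BioC-style JSON layout (a document with `passages`, each carrying `annotations` whose `infons` hold a `type` and an `identifier`), and it assumes that several identifiers on one mention are separated by `;`. The function name `collect_gene_mentions` and the sample document are hypothetical.

```python
from collections import defaultdict


def collect_gene_mentions(document):
    """Map each gene identifier to the set of surface forms found in one document.

    Assumes a BioC-style dict: document -> passages -> annotations -> infons.
    """
    genes = defaultdict(set)
    for passage in document.get("passages", []):
        for annotation in passage.get("annotations", []):
            infons = annotation.get("infons", {})
            if infons.get("type") != "Gene":
                continue  # skip diseases, chemicals, species, and so on
            identifier = infons.get("identifier")
            if not identifier:
                continue  # mention was recognized but not normalized
            # Assumption: multiple IDs on one mention are ";"-separated.
            for gene_id in str(identifier).split(";"):
                gene_id = gene_id.strip()
                if gene_id:
                    genes[gene_id].add(annotation.get("text", ""))
    return dict(genes)


# Hand-written sample in the assumed layout (not real service output).
sample_document = {
    "id": "0000000",
    "passages": [
        {
            "infons": {"type": "title"},
            "annotations": [
                {"text": "TP53", "infons": {"type": "Gene", "identifier": "7157"}},
                {"text": "breast cancer",
                 "infons": {"type": "Disease", "identifier": "MESH:D001943"}},
            ],
        },
        {
            "infons": {"type": "abstract"},
            "annotations": [
                {"text": "p53", "infons": {"type": "Gene", "identifier": "7157"}},
                {"text": "EGFR", "infons": {"type": "Gene", "identifier": "1956"}},
            ],
        },
    ],
}

gene_mentions = collect_gene_mentions(sample_document)
```

Grouping by identifier rather than by mention text is what lets different spellings of the same gene (here "TP53" and "p53") collapse into a single entry before ranking and enrichment.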

4. Abstracts vs. full-text annotation

PubTator3 covers two main types of literature: abstracts from PubMed and full-text articles from PMC.

Full-text annotations can significantly increase the number and diversity of recognized gene mentions, improving downstream analyses such as ranking and enrichment. However, even abstract-level annotations are sufficient to capture the core biological signals for many workflows.
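One way to see what full text contributes is to separate the genes found in the title and abstract from those that appear only in the article body. The sketch below assumes a BioC-style document in which each passage has an `infons["type"]` label and that title and abstract passages are labeled `title` and `abstract`. Those labels, the function name `split_genes_by_source`, and the sample data are assumptions for illustration.

```python
ABSTRACT_TYPES = {"title", "abstract"}  # assumed passage labels


def split_genes_by_source(document):
    """Return (genes in title/abstract, genes found only in the body text)."""
    abstract_genes, body_genes = set(), set()
    for passage in document.get("passages", []):
        passage_type = passage.get("infons", {}).get("type", "")
        target = abstract_genes if passage_type in ABSTRACT_TYPES else body_genes
        for annotation in passage.get("annotations", []):
            infons = annotation.get("infons", {})
            if infons.get("type") == "Gene" and infons.get("identifier"):
                target.add(str(infons["identifier"]))
    # Genes already seen in the abstract are not counted as full-text gains.
    return abstract_genes, body_genes - abstract_genes


# Hand-written sample in the assumed layout (not real service output).
sample_full_text = {
    "passages": [
        {"infons": {"type": "abstract"},
         "annotations": [{"text": "TP53", "infons": {"type": "Gene", "identifier": "7157"}}]},
        {"infons": {"type": "paragraph"},
         "annotations": [
             {"text": "p53", "infons": {"type": "Gene", "identifier": "7157"}},
             {"text": "EGFR", "infons": {"type": "Gene", "identifier": "1956"}},
             {"text": "BRCA1", "infons": {"type": "Gene", "identifier": "672"}},
         ]},
    ]
}

abstract_level, full_text_only = split_genes_by_source(sample_full_text)
print(len(abstract_level), "gene(s) in abstract;", len(full_text_only), "more in full text")
```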

5. Update frequency and data coverage

PubTator3 is continuously updated to keep pace with the expanding biomedical literature. Newly published articles are typically annotated and available within days of appearing in PubMed or PMC. In addition, improvements in text-mining algorithms and entity dictionaries are regularly integrated into the pipeline.

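As a result, recently published articles can usually be annotated shortly after they are indexed, and the annotations for older articles may change as the pipeline is updated.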

6. Practical considerations

- Provide valid PMIDs or PMCIDs: PubTator3 retrieves annotations based on these identifiers.
- Include PMC open-access articles when possible to maximize annotation depth.
- Remember that subscription-only articles will still yield annotations from abstracts.
- Periodically rerun annotations if your project spans a long time frame: new data and improved models can refine results.
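A quick format check before submitting identifiers can catch stray DOIs or empty entries. This sketch only tests the shape of each identifier (a PMID is a positive integer; a PMCID is `PMC` followed by digits) and cannot tell whether the article actually exists. The function name `classify_identifiers` and the sample input are hypothetical.

```python
import re

PMID_PATTERN = re.compile(r"[1-9][0-9]*")
PMCID_PATTERN = re.compile(r"PMC[0-9]+", re.IGNORECASE)


def classify_identifiers(raw_ids):
    """Sort raw strings into PMIDs, PMCIDs, and entries that match neither shape."""
    result = {"pmid": [], "pmcid": [], "invalid": []}
    for raw in raw_ids:
        token = raw.strip()
        if PMID_PATTERN.fullmatch(token):
            result["pmid"].append(token)
        elif PMCID_PATTERN.fullmatch(token):
            result["pmcid"].append(token.upper())  # normalize "pmc42" -> "PMC42"
        else:
            result["invalid"].append(raw)
    return result


checked = classify_identifiers(
    ["12345678", " PMC1234567 ", "pmc42", "doi:10.1000/x", ""]
)
print(checked)
```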

7. Learn more

You can explore PubTator3 directly at the official NCBI page:

https://www.ncbi.nlm.nih.gov/research/pubtator3/

For an in-depth description, see:

8. Summary

PubTator3 is a foundational service that transforms unstructured biomedical text into structured, standardized annotations. By integrating PubTator3, pathXcite automatically extracts gene mentions from your literature corpus (both abstracts and full-text papers) and converts them into data ready for gene ranking and enrichment analysis. Its continuous updates and broad coverage help keep your analyses current with the evolving scientific landscape.
