NLM DIR Seminar Schedule

Seminars Home

Schedule Seminar

UPCOMING SEMINARS

Sept. 9, 2025 Chih-Hsuan Wei
No Data Left Behind: FAIR-SMart Enables FAIR Access to Supplementary Materials for Research Transparency
Sept. 16, 2025 James Leaman JR.
TBD
Sept. 23, 2025 Martha Nelson
TBD
Sept. 30, 2025 Erez Persi
TBD
Oct. 7, 2025 Liana Yeganova
TBD

RECENT SEMINARS

July 15, 2025 Noam Rotenberg
Cell phenotypes in the biomedical literature: a systematic analysis and the NLM CellLink text mining corpus
July 3, 2025 Matthew Diller
Using Ontologies to Make Knowledge Computable
July 1, 2025 Yoshitaka Inoue
Graph-Aware Interpretable Drug Response Prediction and LLM-Driven Multi-Agent Drug-Target Interaction Prediction
June 10, 2025 Aleksandra Foerster
Interactions at pre-bonding distances and bond formation for open p-shell atoms: a step toward biomolecular interaction modeling using electrostatics
June 3, 2025 MG Hirsch
Interactions among subclones and immunity controls melanoma progression

Scheduled Seminars on July 15, 2025

Speaker

Noam Rotenberg

PI/Lab

Richard Scheuermann

Time

11 a.m.

Presentation Title

Cell phenotypes in the biomedical literature: a systematic analysis and the NLM CellLink text mining corpus

Location

Hybrid
In-person: Building 38A/B2N14 NCBI Library or Meeting Link

Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.

Abstract:

Single-cell technologies enable the discovery of many novel cell phenotypes, but this growing body of knowledge remains fragmented across the scientific literature. Natural language processing (NLP) offers a promising approach to extract this information at scale.
We present NLM CellLink, a new corpus of excerpts from recent articles, manually annotated with mentions of human and mouse cell populations. The corpus distinguishes three types of mentions: (1) specific cell phenotypes (cell types and states), (2) heterogenous cell populations, and (3) vague cell population descriptions. Mentions of the first two types were linked to Cell Ontology identifiers where possible, using their meaning in context, with matches labeled as exact or related. Annotation was performed by four cell biologists using a multi-round process.
NLM CellLink contains over 22,000 annotations across more than 3,000 passages selected from 2,700 articles, covering nearly half the concepts in the current Cell Ontology. We fine-tune BiomedBERT in a named entity recognition (NER) task and apply SapBERT and MedCPT in a simplified entity linking framework to demonstrate consistency and usability of the corpus. The fine-tuned NER models perform significantly better than LLAMA and GPT with zero-shot inference.
The NLM CellLink corpus will be a valuable resource for developing automated systems to identify cell phenotype mentions in the biomedical literature, a challenging benchmark for evaluating biomedical NLP systems, and a foundation for the future extraction of relationships between cell types and key biomedical entities, including genes, anatomical structures, and diseases.

NLM DIR Seminar Schedule

UPCOMING SEMINARS

RECENT SEMINARS

Scheduled Seminars on July 15, 2025

Abstract:

ARCHIVES