NLM DIR Seminar Schedule

Scheduled Seminars on July 15, 2025

Speaker: Noam Rotenberg
PI/Lab: Richard Scheuermann
Time: 11 a.m.
Presentation Title: Cell phenotypes in the biomedical literature: a systematic analysis and the NLM CellLink text mining corpus
Location: Hybrid. In-person: Building 38A/B2N14 NCBI Library, or Meeting Link

Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.

Abstract:

Single-cell technologies enable the discovery of many novel cell phenotypes, but this growing body of knowledge remains fragmented across the scientific literature. Natural language processing (NLP) offers a promising approach to extract this information at scale.
We present NLM CellLink, a new corpus of excerpts from recent articles, manually annotated with mentions of human and mouse cell populations. The corpus distinguishes three types of mentions: (1) specific cell phenotypes (cell types and states), (2) heterogeneous cell populations, and (3) vague cell population descriptions. Mentions of the first two types were linked to Cell Ontology identifiers where possible, using their meaning in context, with matches labeled as exact or related. Annotation was performed by four cell biologists using a multi-round process.
NLM CellLink contains over 22,000 annotations across more than 3,000 passages selected from 2,700 articles, covering nearly half the concepts in the current Cell Ontology. We fine-tune BiomedBERT and LLAMA on a named entity recognition (NER) task and apply SapBERT and MedCPT in a simplified entity linking framework to demonstrate the consistency and usability of the corpus. The fine-tuned NER models perform significantly better than LLAMA and GPT used with zero-shot inference.
The NLM CellLink corpus will be a valuable resource for developing automated systems to identify cell phenotype mentions in the biomedical literature, a challenging benchmark for evaluating biomedical NLP systems, and a foundation for the future extraction of relationships between cell types and key biomedical entities, including genes, anatomical structures, and diseases.