NLM DIR Seminar Schedule
UPCOMING SEMINARS
July 15, 2025 Noam Rotenberg
Cell phenotypes in the biomedical literature: a systematic analysis and the NLM CellLink text mining corpus
RECENT SEMINARS
July 3, 2025 Matthew Diller
Using Ontologies to Make Knowledge Computable
July 1, 2025 Yoshitaka Inoue
Graph-Aware Interpretable Drug Response Prediction and LLM-Driven Multi-Agent Drug-Target Interaction Prediction
June 10, 2025 Aleksandra Foerster
Interactions at pre-bonding distances and bond formation for open p-shell atoms: a step toward biomolecular interaction modeling using electrostatics
June 3, 2025 MG Hirsch
Interactions among subclones and immunity controls melanoma progression
May 29, 2025 Harutyun Sahakyan
In silico evolution of globular protein folds from random sequences
Scheduled Seminars on July 15, 2025
In-person: Building 38A/B2N14 NCBI Library or Meeting Link
Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.
Abstract:
Single-cell technologies enable the discovery of many novel cell phenotypes, but this growing body of knowledge remains fragmented across the scientific literature. Natural language processing (NLP) offers a promising approach to extract this information at scale.
We present NLM CellLink, a new corpus of excerpts from recent articles, manually annotated with mentions of human and mouse cell populations. The corpus distinguishes three types of mentions: (1) specific cell phenotypes (cell types and states), (2) heterogeneous cell populations, and (3) vague cell population descriptions. Mentions of the first two types were linked to Cell Ontology identifiers where possible, using their meaning in context, with matches labeled as exact or related. Annotation was performed by four cell biologists using a multi-round process.
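As an illustration of how span annotations like these are commonly prepared for NER training, the sketch below converts character-offset mentions into token-level BIO tags. It is not the corpus's actual format or tooling: the `Mention` class, the label names `CELL_PHENOTYPE` and `CELL_HETEROGENEOUS`, the regex tokenizer, and the example sentence with its label assignments are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Mention:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # hypothetical label name for the mention type


def tokenize(text):
    """Split text into word and punctuation tokens with character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def to_bio(text, mentions):
    """Tag each token as B-<label>, I-<label>, or O (outside any mention)."""
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for mention in mentions:
        inside = False
        for i, (_, start, end) in enumerate(tokens):
            # A token belongs to a mention if it lies fully within its span.
            if start >= mention.start and end <= mention.end:
                tags[i] = ("I-" if inside else "B-") + mention.label
                inside = True
    return [tok for tok, _, _ in tokens], tags


# Made-up sentence; the label assignments are illustrative only.
text = "CD8+ T cells infiltrated the tumor alongside immune cells."
first = "CD8+ T cells"
second = "immune cells"
mentions = [
    Mention(text.index(first), text.index(first) + len(first), "CELL_PHENOTYPE"),
    Mention(text.index(second), text.index(second) + len(second), "CELL_HETEROGENEOUS"),
]

tokens, tags = to_bio(text, mentions)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A real pipeline would use the tokenizer of the model being fine-tuned and align labels to its subword pieces; the regex tokenizer here only keeps the example self-contained.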
NLM CellLink contains over 22,000 annotations across more than 3,000 passages selected from 2,700 articles, covering nearly half the concepts in the current Cell Ontology. We fine-tune BiomedBERT and LLAMA on a named entity recognition (NER) task and apply SapBERT and MedCPT in a simplified entity linking framework to demonstrate the consistency and usability of the corpus. The fine-tuned NER models perform significantly better than zero-shot inference with LLAMA and GPT.
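The general idea behind embedding-based entity linking can be sketched as a nearest-neighbor search: embed the mention, embed each ontology term name, and pick the most similar term. The example below is a minimal sketch of that idea only, not the framework used in this work. The `embed` function is a toy character-trigram stand-in for a learned encoder such as SapBERT or MedCPT, and the three-term dictionary is a tiny illustrative subset whose identifiers should be checked against the current Cell Ontology release.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a learned encoder: a bag of character n-grams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link(mention, ontology):
    """Return (identifier, name, score) for the most similar ontology term."""
    query = embed(mention)
    scored = [(cosine(query, embed(name)), ident, name)
              for ident, name in ontology.items()]
    score, ident, name = max(scored)
    return ident, name, score


# Illustrative subset only; verify identifiers against the Cell Ontology.
ontology = {
    "CL:0000084": "T cell",
    "CL:0000236": "B cell",
    "CL:0000540": "neuron",
}

ident, name, score = link("T cells", ontology)
print(ident, name, round(score, 3))
```

Swapping `embed` for a dense encoder and the dictionary for the full ontology (including synonyms) keeps the same structure; deciding between an exact and a related match would need an additional step that this sketch omits.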
The NLM CellLink corpus will be a valuable resource for developing automated systems to identify cell phenotype mentions in the biomedical literature, a challenging benchmark for evaluating biomedical NLP systems, and a foundation for the future extraction of relationships between cell types and key biomedical entities, including genes, anatomical structures, and diseases.