NLM DIR Seminar Schedule

Scheduled Seminars on Sept. 18, 2025

Speaker: James Robert Leaman Jr.
PI/Lab: Zhiyong Lu
Time: 3 p.m.
Presentation Title: Data-centric methods for resource construction in biomedical natural language processing
Location: Hybrid (in person at Building 38A/B2N14 NCBI Library, or online via the meeting link)

Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.

Abstract:

The biomedical literature is not static: new scientific concepts are often accompanied by new terminology. However, natural language processing (NLP) systems frequently struggle with novel terms and concepts, resulting in poor performance precisely when scientific interest in these new ideas is increasing. In this talk, I will present two case studies of building biomedical NLP resources using data-centric methods to efficiently capture new concepts while meeting requirements for coverage and quality. First, I will discuss applying human-in-the-loop machine learning and data programming (a form of weak supervision) to create the Long COVID collection. This work identified 763 distinct terms for the condition, yet 69.0% of articles referred to it by no recognizable name. Second, I will describe a document selection strategy designed to balance representativeness with targeted enrichment of less-frequent values. This method, recently used to construct the NLM CellLink corpus, provides a principled and efficient way to improve coverage of “long tail” values, so that models trained on the corpus generalize better and evaluations are more robust. Together, these use cases show how data-centric methods can be used to create resources that generalize better and evaluate systems more accurately, even as scientific language evolves.
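
For readers unfamiliar with the idea, the following is a minimal illustrative sketch of one way a document selection strategy could balance representativeness with enrichment of less-frequent values. It is not the method presented in the talk or used for the NLM CellLink corpus. The function name `select_documents`, the `random_fraction` parameter, the inverse-frequency score, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def select_documents(doc_values, budget, random_fraction=0.5, seed=0):
    """Pick `budget` documents from `doc_values` (hypothetical helper).

    doc_values maps a document id to the set of values it mentions
    (for example, cell types). Part of the budget is filled by uniform
    random sampling (representativeness); the rest is filled greedily,
    favoring documents whose values are still rare in the selection
    (enrichment of the long tail). The scoring rule is an assumption.
    """
    rng = random.Random(seed)
    doc_ids = sorted(doc_values)
    budget = min(budget, len(doc_ids))

    # Representative part: a uniform random sample of the pool.
    n_random = int(budget * random_fraction)
    selected = rng.sample(doc_ids, n_random)
    chosen = set(selected)

    # How often each value already appears in the selection.
    counts = Counter(v for d in selected for v in doc_values[d])

    # Enrichment part: greedily add the document with the highest
    # inverse-frequency score until the budget is used up.
    while len(selected) < budget:
        best, best_score = None, -1.0
        for d in doc_ids:
            if d in chosen:
                continue
            score = sum(1.0 / (1 + counts[v]) for v in doc_values[d])
            if score > best_score:
                best, best_score = d, score
        selected.append(best)
        chosen.add(best)
        counts.update(doc_values[best])
    return selected


if __name__ == "__main__":
    # Toy pool: "T cell" is common, the other values are rare.
    docs = {
        "d1": {"T cell"},
        "d2": {"T cell"},
        "d3": {"T cell", "B cell"},
        "d4": {"mast cell"},
        "d5": {"T cell"},
    }
    print(select_documents(docs, budget=4, random_fraction=0.5))
```

Setting `random_fraction` closer to 1 keeps the sample closer to the underlying distribution, while values closer to 0 spend more of the budget on rare values. In practice the split would be tuned to the coverage and quality requirements of the resource.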