NLM DIR Seminar Schedule
UPCOMING SEMINARS
RECENT SEMINARS
- Dec. 2, 2025 Qingqing Zhu
  CT-Bench & CARE-CT: Building Reliable Multimodal AI for Lesion Analysis in Computed Tomography
- Nov. 25, 2025 Jing Wang
  MIMIC-EXT-TE: Millions Clinical Temporal Event Time-Series Dataset
- Oct. 21, 2025 Yifan Yang
  TBD
- Oct. 14, 2025 Devlina Chakravarty
  TBD
- Oct. 9, 2025 Ziynet Nesibe Kesimoglu
  TBD
Scheduled Seminars on Sept. 18, 2025
In-person: Building 38A/B2N14 NCBI Library or Meeting Link
Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.
Abstract:
The biomedical literature is not static: new scientific concepts are often accompanied by new terminology. However, natural language processing (NLP) systems often struggle with novel terms and concepts, resulting in poor performance precisely when scientific interest in these new ideas is increasing. In this talk, I will present two case studies of building biomedical NLP resources using data-centric methods to efficiently capture new concepts while meeting requirements for coverage and quality. First, I will discuss applying human-in-the-loop machine learning and data programming (a form of semi-supervised learning) to create the Long COVID collection. This work identified 763 distinct terms for the condition, yet 69.0% of articles referred to it by no recognizable name. Second, I will describe a document selection strategy designed to balance representativeness with targeted enrichment of less-frequent values. This method, recently used to construct the NLM CellLink corpus, provides a principled and efficient way to improve coverage of “long tail” values, resulting in stronger training generalization and more robust evaluation. Together, these use cases show how data-centric methods can be used to create resources that generalize better and evaluate systems more accurately, even as scientific language evolves.
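As a rough illustration of the second idea in the abstract (balancing representativeness with enrichment of less-frequent values), the sketch below splits a selection budget between a uniform random sample and a greedy pass that favors documents whose values are still rare in the selection. This is an assumption-laden toy, not the procedure used for the NLM CellLink corpus: the function name `select_documents`, the `enrich_fraction` parameter, and the inverse-count scoring rule are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def select_documents(docs, k, enrich_fraction=0.5, seed=0):
    """Pick up to k document ids from `docs` (a dict: id -> set of values).

    Hypothetical sketch: part of the budget is a uniform random sample
    (representativeness); the rest is filled greedily with documents whose
    values are rarest in the selection so far (enrichment).
    """
    rng = random.Random(seed)
    ids = sorted(docs)          # sorted for deterministic behavior
    k = min(k, len(ids))

    # Representative part: a plain uniform sample without replacement.
    n_random = k - int(k * enrich_fraction)
    selected = rng.sample(ids, n_random)

    # Track how often each value already appears in the selection.
    counts = Counter(v for d in selected for v in docs[d])
    chosen = set(selected)
    remaining = [d for d in ids if d not in chosen]

    # Enrichment part: repeatedly take the document that adds the most
    # "rarity", scoring each value as 1 / (1 + current count).
    while len(selected) < k and remaining:
        best = max(
            remaining,
            key=lambda d: sum(1.0 / (1 + counts[v]) for v in docs[d]),
        )
        selected.append(best)
        remaining.remove(best)
        counts.update(docs[best])

    return selected


if __name__ == "__main__":
    # Toy data: 20 documents share a common value, one has a rare value.
    toy = {f"d{i}": {"common"} for i in range(20)}
    toy["rare_doc"] = {"rare"}
    print(select_documents(toy, k=4))
```

In this toy setup, a purely random sample of 4 out of 21 documents would usually miss the single document carrying the rare value, while the enrichment pass picks it up whenever the random part has not already done so. Setting `enrich_fraction` to 0 reduces the sketch to plain random sampling.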