NLM DIR Seminar Schedule
UPCOMING SEMINARS
- May 26, 2026 Harutyun Saakyan
  Emergence of ribonucleoproteins in molecular evolution simulations
- May 27, 2026 Brian Abraham
  Cis-Regulatory Organization and Transcription Factor Control of Cell Identity and Disease
- June 4, 2026 Yin Fang
  TBD
- June 9, 2026 Pascal Mutz
  TBD
- June 11, 2026 Angela Jiang
  TBD
RECENT SEMINARS
- May 19, 2026 Leann Lindsey
  Are Genomic Language Models Learning? Insights from Tokenization Analysis and Prophage Detection in Bacterial Genomes
- May 14, 2026 Brandon Colelough
  Biomedical LLM Hallucinations: Detection, Taxonomy, and Mechanistic Knowledge Localization
- May 12, 2026 John Bridgers
  A bi-partition function algorithm to evaluate inferred subclonal structures in single-cell sequencing data
- May 5, 2026 Benjamin Hou
  Machine Learning for Craniofacial Malocclusion Prediction
- April 28, 2026 Niccolo Marini
  From Unimodal Datasets to Multimodal Foundation Models: Synthetic Clinical Notes for Dermatology AI
Scheduled Seminars on April 16, 2026
In-person: Building 38A/B2N14 NCBI Library or Meeting Link
Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.
Abstract:
Common Data Elements (CDEs) are standardized, precisely defined questions paired with a set of allowable responses that, if used systematically, enable easier comparison and combination of data from multiple studies. The NIH CDE Repository is a centralized resource for finding common data elements that have been recommended or required for use by NIH institutes and centers for clinical and translational research studies. In Fiscal Year 2023, congressional appropriations language directed NIH and the Office of Data Science Strategy (ODSS) to “…encourage development and use of CDEs in disease areas where they currently do not exist, the agreement directs ODSS to work with ICs to support efforts to develop CDEs, including through collaborations with research stakeholders. The agreement also directs ODSS to provide a list of diseases and disease areas actively under development, such as immune and immune-mediated conditions, to inform further NIH efforts to support development of such elements.”
However, the NIH CDE Repository, containing more than 22,000 CDEs, has become increasingly challenging to navigate due to redundant entries, inconsistent formatting, and variable metadata quality. Researchers frequently create duplicate CDEs because they cannot easily find suitable existing options to meet their needs. For example, a search for "patient identifier" returns over 3,000 results that must be manually reviewed to determine if any are appropriate for a particular use. Data challenges such as these keep the CDE Repository from realizing its full potential.
In this project, we evaluated the completeness and consistency of CDE representations in a representative subset of CDEs. Additionally, we developed and tested automated approaches using language models—including large language models, transformer models, and small language models—and unsupervised clustering to identify semantically similar CDEs, and determined approaches to improve repository organization. The methods for identifying semantically similar CDEs can be applied to the existing complete CDE corpus and to future CDE submissions, enabling NIH to maximize its investment in CDEs and support the 2025-2030 NIH Strategic Plan for Data Science.
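As a rough illustration of what grouping semantically similar CDEs can look like, the sketch below clusters a handful of CDE names by text similarity. It is not the project's method: it swaps in simple bag-of-words cosine similarity for language model embeddings, and threshold-based single-link grouping for the unsupervised clustering described above. The function names, the 0.6 threshold, and the sample names (other than "patient identifier") are assumptions made for the example, not entries or settings from the NIH CDE Repository.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_similar(names, threshold=0.6):
    """Group names whose pairwise similarity meets the threshold.

    Uses union-find, so groups are connected components of the
    "similar enough" graph (single-link grouping). The threshold is
    a hypothetical value chosen for this example.
    """
    vectors = [Counter(tokenize(n)) for n in names]
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


# Illustrative names only; not taken from the repository.
cde_names = [
    "Patient identifier",
    "Patient identifier number",
    "Date of birth",
    "Birth date",
    "Systolic blood pressure",
]

clusters = cluster_similar(cde_names)
for group in clusters:
    print(group)
```

Word overlap misses synonyms and paraphrases (for example, "subject ID" versus "patient identifier"), which is one reason embedding-based similarity from a language model may be preferable for this task.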