NLM DIR Seminar Schedule

Scheduled Seminars on Feb. 3, 2026

Speaker: Matthew Diller
PI/Lab: Richard Scheuermann
Time: 11 a.m.
Presentation Title: Analyzing Similarity in Common Data Elements in the NIH CDE Repository via Semantic Clustering
Location: Hybrid
In-person: Building 38A/B2N14 NCBI Library or Meeting Link

Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.

Abstract:

Common Data Elements (CDEs) are standardized, precisely defined questions paired with a set of allowable responses that, if used systematically, enable easier comparison and combination of data from multiple studies. The NIH CDE Repository is a centralized resource for finding common data elements that have been recommended or required for use by NIH institutes and centers for clinical and translational research studies. In Fiscal Year 23, congressional appropriations language directed NIH and the Office of Data Science Strategy (ODSS) to “…encourage development and use of CDEs in disease areas where they currently do not exist, the agreement directs ODSS to work with ICs to support efforts to develop CDEs, including through collaborations with research stakeholders. The agreement also directs ODSS to provide a list of diseases and disease areas actively under development, such as immune and immune-mediated conditions, to inform further NIH efforts to support development of such elements.”

However, the NIH CDE Repository, which contains more than 22,000 CDEs, has become increasingly challenging to navigate due to redundant entries, inconsistent formatting, and variable metadata quality. Researchers frequently create duplicate CDEs because they cannot easily find suitable existing options to meet their needs. For example, a search for "patient identifier" returns over 3,000 results that must be manually reviewed to determine if any are appropriate for a particular use. Data challenges such as these keep the repository from realizing its full potential.

In this project, we evaluated the completeness and consistency of CDE representations in a representative subset of CDEs. Additionally, we developed and tested automated approaches that combine language models (including large language models, transformer models, and small language models) with unsupervised clustering to identify semantically similar CDEs, and we identified approaches for improving repository organization. The methods for identifying semantically similar CDEs can be applied to the existing complete CDE corpus and to future CDE submissions, enabling NIH to maximize its investment in CDEs and support the 2025-2030 NIH Strategic Plan for Data Science.
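
The general idea of grouping semantically similar CDEs can be illustrated with a minimal sketch. This is not the presenters' method, and the abstract does not specify which models or clustering algorithms were used. As a loudly stated assumption, the sketch swaps in a simple bag-of-words vector where a language-model embedding would go, and it groups items with a basic similarity threshold (single-link clustering). The function names, the threshold value, and the sample CDE texts are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts.

    ASSUMPTION: a real pipeline would use a language-model embedding here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_similar(texts, threshold=0.5):
    """Group texts linked by pairwise similarity >= threshold (single-link)."""
    vectors = [embed(t) for t in texts]
    parent = list(range(len(texts)))  # union-find forest over text indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, text in enumerate(texts):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


# Hypothetical CDE question texts, invented for illustration only.
cdes = [
    "Patient identifier number",
    "Unique patient identifier",
    "Date of birth",
    "Birth date of participant",
    "Systolic blood pressure",
]
clusters = cluster_similar(cdes, threshold=0.3)
for group in clusters:
    print(group)
```

In this toy run, the two patient-identifier items land in one group and the two birth-date items in another, while the blood pressure item stays alone. A word-count vector only catches shared vocabulary; an embedding from a language model would be needed to link entries that express the same concept in different words.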