NLM DIR Seminar Schedule
UPCOMING SEMINARS
- May 12, 2026 John Bridgers
  A bi-partition function algorithm to evaluate inferred subclonal structures in single-cell sequencing data
- May 14, 2026 Brandon Colelough
  TBD
- May 19, 2026 Leann Lindsey
  Are Genomic Language Models Learning? Insights from Tokenization Analysis and Prophage Detection in Bacterial Genomes
- May 26, 2026 Harutyun Saakyan
  TBD
- May 27, 2026 Brian Abraham
  Cis-Regulatory Organization and Transcription Factor Control of Cell Identity and Disease
RECENT SEMINARS
- May 5, 2026 Benjamin Hou
  Machine Learning for Craniofacial Malocclusion Prediction
- April 28, 2026 Niccolo Marini
  From Unimodal Datasets to Multimodal Foundation Models: Synthetic Clinical Notes for Dermatology AI
- April 21, 2026 Yoshitaka Inoue
  Drug Response Prediction: Generalization using Graph Neural Networks & Reasoning over Predictions using LLMs
- April 16, 2026 Matthew Diller
  Analyzing Similarity in Common Data Elements in the NIH CDE Repository via Semantic Clustering
- April 7, 2026 Henry Secaira Morocho
  Toward a systematic method of database enrichment for reference-based metagenomics
Scheduled Seminars on April 7, 2026
In-person: Building 38A/B2N14 NCBI Library or Meeting Link
Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.
Abstract:
Tracing millions of shotgun metagenomic sequencing reads to their source requires mapping them to a database of reference microbial genomes. While most efforts have improved the efficiency and accuracy of sequence aligners and taxonomic profilers, the engineering of reference databases has received inadequate attention. On the one hand, databases with few genomes yield a highly incomplete picture of the community because multiple taxonomic groups are missing. On the other hand, databases with the most comprehensive genome collections can introduce inaccuracies and misidentifications in the inferred community due to a highly skewed distribution, in which most sequenced genomes belong to a handful of microbial species of biomedical interest. Therefore, striking a balance between under- and over-represented taxa is essential for reference-based metagenomics, as database composition determines which taxa are inferred to be present in the community. Here, we explore whether database composition can be adaptively enriched for each sample. For this, we developed an iterative enrichment method that traverses a distance-based tree connecting all genomes to identify additional microbes of interest. Using simulated metagenomic datasets, we demonstrate that enriched databases improve metagenomic classification at the genome level and can recover microbial community composition with accuracy comparable to that of larger databases. Our results indicate that database engineering has a significant impact on metagenomic analysis, and incorporating enriched databases can yield a more accurate description of microbial communities.
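The abstract does not spell out the enrichment procedure, so the following is only a rough illustration of what an iterative, tree-guided enrichment loop might look like, not the speaker's method. Every name here is hypothetical: `parent` encodes a toy distance-based tree as a child-to-parent map, `detect` stands in for a real read classifier, and the rule "add all genomes that share a parent node with a detected genome" is an assumption chosen for simplicity.

```python
from collections import defaultdict


def leaves_under(children, node):
    """Return all leaf genomes below `node` in the tree."""
    stack, leaves = [node], set()
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.add(current)
    return leaves


def enrich_database(parent, seed, detect, max_rounds=10):
    """Grow a reference database around genomes detected in a sample.

    parent: dict mapping each node (genome or internal node) to its parent.
    seed:   initial set of genomes in the database.
    detect: hypothetical callable; given a database (set of genomes),
            returns the subset detected in the sample.
    """
    # Invert the child -> parent map so subtrees can be traversed.
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    database = set(seed)
    for _ in range(max_rounds):
        hits = detect(database)
        # Assumed rule: pull in every genome sharing a parent with a hit.
        candidates = set()
        for genome in hits:
            par = parent.get(genome)
            if par is not None:
                candidates |= leaves_under(children, par)
        new_genomes = candidates - database
        if not new_genomes:
            break  # no further enrichment possible
        database |= new_genomes
    return database


# Toy tree: root -> (A, B); A -> (g1, g2); B -> (g3, g4).
toy_parent = {"g1": "A", "g2": "A", "g3": "B", "g4": "B",
              "A": "root", "B": "root"}
truly_present = {"g1", "g2"}  # made-up ground truth for the toy sample


def toy_detect(database):
    # Stand-in classifier: reports database genomes that are truly present.
    return database & truly_present


result = enrich_database(toy_parent, {"g1", "g3"}, toy_detect)
print(sorted(result))  # ['g1', 'g2', 'g3']
```

In this toy run the database gains `g2`, a close relative of the detected `g1`, while the clade around the undetected `g3` is left unexpanded. A real implementation would replace `toy_detect` with an actual aligner or taxonomic profiler and would likely use distance thresholds rather than a fixed one-level neighborhood.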