NLM DIR Seminar Schedule
UPCOMING SEMINARS
RECENT SEMINARS
- April 7, 2026 Henry Secaira Morocho
  Toward a systematic method of database enrichment for reference-based metagenomics
- March 17, 2026 Roman Kogay
  Diversification vs Streamlining: Selection Landscapes of Prokaryotic Genome Evolution
- March 10, 2026 Zhizheng Wang
  Large Language Models for Gene Set Analysis
- March 5, 2026 Hasan Balci
  From Sketch to SBGN: An AI-Assisted and Interactive Workflow for Generating Pathway Maps
- March 3, 2026 Gianlucca Goncalves Nicastro
  Systematic identification of Salmonella T6SS effectors uncovers a lipid-targeting family.
Scheduled Seminars on April 7, 2026
In-person: Building 38A/B2N14 NCBI Library or Meeting Link
Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.
Abstract:
Tracing millions of shotgun metagenomic sequencing reads to their source requires mapping them to a database of reference microbial genomes. While most efforts have improved the efficiency and accuracy of sequence aligners and taxonomic profilers, the engineering of reference databases has received inadequate attention. On the one hand, databases with few genomes yield a highly incomplete picture of the community because multiple taxonomic groups are missing. On the other hand, databases with the most comprehensive genome collections can introduce inaccuracies and misidentifications in the inferred community due to a highly skewed distribution, in which most sequenced genomes belong to a handful of microbial species of biomedical interest. Therefore, striking a balance between under- and over-represented taxa is essential for reference-based metagenomics, as database composition determines which taxa are inferred to be present in the community. Here, we explore whether database composition can be adaptively enriched for each sample. For this, we developed an iterative enrichment method that traverses a distance-based tree connecting all genomes to identify additional microbes of interest. Using simulated metagenomic datasets, we demonstrate that enriched databases improve metagenomic classification at the genome level and can recover microbial community composition with accuracy comparable to that of larger databases. Our results indicate that database engineering has a significant impact on metagenomic analysis, and incorporating enriched databases can yield a more accurate description of microbial communities.