NLM DIR Seminar Schedule
RECENT SEMINARS
- June 11, 2026 Angela Jiang
  Identification and Evolutionary Analysis of Steroid-Metabolism Enzymes in Gut Microbes
- June 10, 2026 Luda Diatchenko
  New Insights on Pain Biology from Human Transcriptomics: How Stimulation of Immune Response Shapes Pain Resolution
- June 9, 2026 Pascal Mutz
  Characterization of covalently closed circular RNA replicators detected in (meta)transcriptomic data
- June 4, 2026 Madeleine Clore
  Explaining why AlphaFold struggles to predict mutational effects
- May 27, 2026 Brian Abraham
  Cis-Regulatory Organization and Transcription Factor Control of Cell Identity and Disease
Scheduled Seminars on May 19, 2026
In-person: Building 38A/B2N14 NCBI Library, or online via the meeting link
Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.
Abstract:
Genomic language models (gLMs) promise to decode the regulatory and functional logic encoded in DNA, yet whether current architectures learn meaningful biological representations remains contested. Recent studies question the foundational abilities of gLMs, demonstrating that they fail to outperform randomly initialized or simple supervised models on standard benchmarks, while model authors point to zero-shot performance and unsupervised motif discovery as evidence of foundational biological understanding. We present two complementary efforts to investigate this question. First, we systematically evaluate how tokenization strategy (nucleotide, k-mer, and byte-pair encoding) affects model behavior across three genomic benchmarks, probing whether token granularity shapes what gLMs capture at the nucleotide level. Second, we introduce LAMBDA, a gLM benchmark that uses bacteriophages as a test system to investigate the annotation abilities of these models. Unlike well-annotated model organism genomes, the vast majority of phage genomes remain poorly characterized, making them an ideal domain for testing whether gLMs identify meaningful sequence patterns beyond homology. LAMBDA evaluates gLM embeddings through phage-bacteria discrimination tasks of increasing complexity, including genome-wide prophage detection, and provides a rigorous framework for evaluating model performance on a genome-wide annotation task with direct relevance to microbiology and medicine.