NLM DIR Seminar Schedule

Scheduled Seminars on May 19, 2026

Speaker: Leann Lindsey
PI/Lab: Xiaofang Jiang
Time: 11 a.m.
Presentation Title: Are Genomic Language Models Learning? Insights from Tokenization Analysis and Prophage Detection in Bacterial Genomes
Location: Hybrid
In-person: Building 38A/B2N14 NCBI Library or Meeting Link

Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.

Abstract:

Genomic language models (gLMs) promise to decode the regulatory and functional logic encoded in DNA, yet whether current architectures learn meaningful biological representations remains contested. Recent studies question the foundational abilities of gLMs, demonstrating that they fail to outperform randomly initialized or simple supervised models on standard benchmarks, while model authors point to zero-shot performance and unsupervised motif discovery as evidence of foundational biological understanding. We present two complementary efforts to investigate this question. First, we systematically evaluate how tokenization strategy (nucleotide, k-mer, and byte-pair encoding) affects model behavior across three genomic benchmarks, probing whether token granularity shapes what gLMs capture at the nucleotide level. Second, we introduce LAMBDA, a genomic language model benchmark that leverages bacteriophages as a test system to investigate the annotation abilities of genomic language models. Unlike well-annotated model organism genomes, the vast majority of phage genomes remain poorly characterized, making them an ideal domain for testing whether gLMs identify meaningful sequence patterns beyond homology. LAMBDA evaluates gLM embeddings through phage-bacteria discrimination tasks of increasing complexity, including genome-wide prophage detection, and provides a rigorous framework for evaluating model performance on a genome-wide annotation task with direct relevance to microbiology and medicine.