Base by Base

339: cxt: A language model for population genetics



Korfmann K et al., Proceedings of the National Academy of Sciences (PNAS) - This episode examines cxt, a decoder-only transformer that performs next-coalescence prediction by translating local mutational context into pairwise TMRCA estimates. Trained on stdpopsim simulations, cxt delivers rapid, scalable coalescence-time inference, calibrated posteriors, and practical adaptations for empirical data. Key terms: language models, coalescent theory, uncertainty, stdpopsim, simulation-based inference.

Study Highlights:
The authors develop cxt, an autoregressive transformer that predicts discretized pairwise coalescence times from SFS-weighted mutation windows, framing TMRCA inference as a translation task. Trained on extensive stdpopsim simulations, cxt matches state-of-the-art accuracy in well-specified settings and generalizes to many out-of-sample species with some loss of accuracy. The model produces well-calibrated approximate posteriors, enables rapid GPU inference (millions of predictions in minutes), and can be fine-tuned or adapted for large Ne, missing data, or small sample sizes. Applications to human and Anopheles genomes recover known signals at LCT, HLA, inversion regions, and the Rdl insecticide-resistance locus.
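
As a rough illustration of the two encoding ideas mentioned above (mutation windows as input, discretized coalescence times as output), here is a minimal Python sketch. It is not the authors' code: the function names, the fixed window size, and the log-spaced time grid are all assumptions made for illustration, and the SFS weighting described in the paper is left out.

```python
import math
from bisect import bisect_right


def window_counts(positions, seq_length, window_size):
    """Count mutations falling in each fixed-size window along a sequence."""
    n_windows = math.ceil(seq_length / window_size)
    counts = [0] * n_windows
    for pos in positions:
        if 0 <= pos < seq_length:
            counts[int(pos // window_size)] += 1
    return counts


def log_time_edges(t_min, t_max, n_bins):
    """Return n_bins + 1 bin edges spaced evenly on a log scale (assumed grid)."""
    lo, hi = math.log(t_min), math.log(t_max)
    return [math.exp(lo + (hi - lo) * i / n_bins) for i in range(n_bins + 1)]


def tmrca_to_token(tmrca, edges):
    """Map a coalescence time to its time-bin index, clamped to the valid range."""
    idx = bisect_right(edges, tmrca) - 1
    return min(max(idx, 0), len(edges) - 2)


# Hypothetical toy input: five mutation positions on a 2,000 bp sequence.
counts = window_counts([120, 450, 980, 1500, 1999], seq_length=2000, window_size=500)
edges = log_time_edges(10, 100_000, n_bins=4)
tokens = [tmrca_to_token(t, edges) for t in (50, 500, 5000, 50_000)]
```

In this toy setup each pairwise TMRCA becomes a small integer token, which is what lets coalescence-time inference be treated as predicting the next token in a sequence.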

Conclusion:
cxt reframes coalescent inference as a language-modeling problem, providing a fast, scalable, and adaptable tool that learns priors from simulations to infer local TMRCA and aggregate demography while offering uncertainty quantification through approximate posteriors.
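
To make the uncertainty quantification concrete: when a model outputs a probability for each discrete time bin, a point estimate and a credible interval can be read off directly. The Python sketch below assumes such a categorical output; the helper names and the bin grid are hypothetical, and the interval is reported at bin resolution rather than interpolated.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def posterior_mean(probs, midpoints):
    """Expected TMRCA given bin probabilities and representative bin times."""
    return sum(p * m for p, m in zip(probs, midpoints))


def credible_interval(probs, edges, level=0.9):
    """Central credible interval, widened outward to the enclosing bin edges."""
    tail = (1.0 - level) / 2.0
    lower, upper = edges[0], edges[-1]
    found_lower = False
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if not found_lower and cumulative >= tail:
            lower = edges[i]          # left edge of the bin holding the lower quantile
            found_lower = True
        if cumulative >= 1.0 - tail:
            upper = edges[i + 1]      # right edge of the bin holding the upper quantile
            break
    return lower, upper


# Hypothetical posterior over four time bins (in generations).
edges = [10, 100, 1000, 10_000, 100_000]
probs = [0.02, 0.48, 0.48, 0.02]
interval = credible_interval(probs, edges, level=0.9)
```

Calibration, in this framing, means that intervals built this way at a 90% level contain the true coalescence time in about 90% of simulated test cases.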

Music:
Enjoy the music based on this article at the end of the episode.

Article title:
Coalescence and translation: A language model for population genetics

First author:
Korfmann K

Journal:
Proceedings of the National Academy of Sciences (PNAS)

DOI:
10.1073/pnas.2518956123

Reference:
Korfmann K., Pope N. S., Meleghy M., Tellier A., Kern A. D. Coalescence and translation: A language model for population genetics. Proc. Natl. Acad. Sci. U.S.A. 2026;123:e2518956123. doi:10.1073/pnas.2518956123

License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00

Official website: https://basebybase.com

On PaperCast Base by Base you'll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Episode link: https://basebybase.com/episodes/cxt-language-model-for-population-genetics

QC:
This episode was checked against the original article PDF and publication metadata for the episode release published on 2026-04-11.

QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Audited transcript sections describing the cxt model (architecture and next-coalescence prediction), training on stdpopsim, performance benchmarks, generalization to unseen species, empirical data applications (LCT, HLA, Rdl), missing data handling/adapters, and environmental considerations.
- transcript topics: ARG basics and coalescent theory as context; cxt architecture: decoder-only transformer and next-coalescence prediction; input encoding: mutational densities, SFS, rotary embeddings; training data: stdpopsim simulations and catalog breadth; benchmark comparisons: Singer+Polegon and SMC++; generalization to stdpopsim v0.3 and unseen species

QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 8
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0

Metadata Audited:
- article_doi
- article_title
- article_journal
- license

Factual Items Audited:
- cxt is a decoder-only transformer that autoregressively predicts local coalescence times (TMRCA) via next-coalescence prediction
- Trained on stdpopsim simulations; generalizes across demographies, including unseen species
- Inference is fast (millions of TMRCAs in minutes) on a single NVIDIA A100 GPU
- cxt yields well-calibrated approximate posteriors for TMRCA
- Compared to Singer+Polegon and SMC++, cxt is competitive and often superior in both well-specified and out-of-distribution scenarios
- Empirical data show clear LCT and HLA signals in humans and Rdl dynamics in Anopheles; missing data handling improves robustness

QC result: Pass.


Base by Base, by Gustavo Barra