
People are rushing to build bigger and bigger single-cell foundation models (trained on RNA sequencing data), but in my view we have not extracted even a small fraction of the knowledge and capabilities that already exist inside the models we have today.
To explain what I mean, I want to argue three things in this post, and then show the empirical work behind them.
Thesis 1: Biological foundation models are not like LLMs, and the field's habit of evaluating them the same way is causing us to systematically underestimate what they contain. When you interact with GPT, the surface-level outputs (the text it generates) are a fairly good proxy for the model's capabilities. You can read what it writes and form a reasonable opinion. Biological foundation models are fundamentally different in this respect. A model like Geneformer or scGPT takes a cell's gene expression profile and produces embeddings, predictions of masked genes, or cell type classifications. These surface-level outputs are only a small sliver of what the model is doing internally. The model has been trained on tens of millions of cells, and the representations it has built to solve its training objective contain compressed biological knowledge that [...]
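To make the contrast in Thesis 1 concrete, here is a minimal sketch of the gap between a model's surface outputs and its internal representations. It is not the pipeline of any particular paper: the checkpoint name "example/sc-foundation-model" is hypothetical, and real models such as Geneformer and scGPT each define their own gene tokenization, so this only illustrates the general shape of a BERT-style model over gene tokens.

```python
# A minimal sketch (hypothetical checkpoint, assumed gene-token vocabulary)
# contrasting surface outputs with internal representations.
import torch
from transformers import AutoModelForMaskedLM

# output_hidden_states=True asks the model to return every layer's activations,
# not just the final prediction head.
model = AutoModelForMaskedLM.from_pretrained(
    "example/sc-foundation-model", output_hidden_states=True
)
model.eval()

# Toy input: one cell encoded as a sequence of gene-token ids.
gene_token_ids = torch.tensor([[5, 812, 77, 3021, 9, 440]])

with torch.no_grad():
    out = model(input_ids=gene_token_ids)

# Surface output: per-position logits over the gene vocabulary -- this is what
# masked-gene evaluations score. Shape: (batch, seq_len, vocab_size).
masked_gene_logits = out.logits

# Internal state: one hidden tensor per layer. Mean-pooling the last layer is a
# common way to build a cell embedding, but every layer holds representations
# that surface-level evaluations never touch.
all_layers = out.hidden_states               # tuple of (batch, seq_len, d_model)
cell_embedding = all_layers[-1].mean(dim=1)  # (batch, d_model)
```

The point of the sketch: the logits are a thin slice of the forward pass, while the stack of hidden states is where the compressed biological knowledge lives.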
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
