


Note: Thank you to rush.cloud and latch.bio for sponsoring this episode!
Rush is augmenting drug discovery for all scientists with machine-driven superintelligence.
LatchBio is building agentic scientific tooling that can analyze a wide range of scientific data, with an early focus on spatial biology. Clip on them in the episode.
If you’re at all interested in sponsoring future episodes, reach out!
***
This is an interview with Yunha Hwang, an assistant professor at MIT (and co-founder of the non-profit Tatta Bio). She is working on building and applying genomic language models to help annotate the function of the (mostly unknown) universe of microbial genomes.
There are two reasons you should watch this episode.
One, Yunha is working on an absurdly difficult and interesting problem: microbial genome function annotation. Even for E. coli, one of the most studied organisms on Earth, we don’t know what half to two-thirds of its genes actually do. For a random microbe from soil, that number jumps to 80-90%. Her lab is one of the leading groups applying deep learning to the problem, and last year it released a paper that increasingly feels foundational to the field (with prior Owl Posting podcast guest Sergey Ovchinnikov an author on it!). We talk about that paper, its implications, and where the future of machine learning in metagenomics may go.
And two, I was especially excited to film this so I could help bring some light to a platform that she and her team at Tatta Bio have developed: SeqHub. There’s been a lot of discussion online about AI co-scientists in the biology space, but I have increasingly felt a vague suspicion that people are trying to be too broad with them. It feels like the value of these tools comes not from general scientific reasoning, but rather from deep integration with how a specific domain of research engages with its open problems. SeqHub feels like one of the few systems that mirrors this viewpoint, and while it isn’t something I can personally use (its use-case is primarily in annotating and sharing microbial genomes, neither of which I work on!), I would still love for it to succeed. If you’re in the metagenomics space, you should try it out!
YouTube: https://youtu.be/w6L9-ySnxZI?si=7RBusTAyy0Ums6Oh
Spotify: https://open.spotify.com/episode/2EgnV9Y1Mm9JV5m9KAY6yL?si=J5ZmF2i3TtuT10D40jjgaw
Apple Podcast: https://apple.co/4pu4TRB
Transcript: https://www.owlposting.com/p/we-dont-know-what-most-microbial
Timestamps:
00:02:07 – Introduction
00:02:23 – Why do microbial genomes matter
00:04:07 – Deep learning acceptance in metagenomics
00:05:25 – The case for genomic “context” over sequence matching
00:06:43 – OMG: the only ML-ready metagenomic dataset
00:09:27 – gLM2: A multimodal genomic language model
00:11:06 – What do you do with the output of genomic language models?
00:17:41 – How will OMG evolve?
00:20:26 – Why train on only microbial genomes, as opposed to all genomes?
00:22:58 – Do we need more sequences or more annotations?
00:23:54 – Is there a conserved microbial genome ‘language’?
00:28:11 – What non-obvious things can this genomic language model tell you?
00:33:08 – Semantic deduplication and evaluation
00:37:33 – How does benchmarking work for these types of models?
00:41:31 – Gaia: A genomic search engine
00:44:18 – Even ‘well-studied’ genomes are mostly unannotated
00:50:51 – Using agents on Gaia
00:54:53 – Will genomic language models reshape the tree of life?
00:59:18 – Current limitations of genomic language models
01:08:54 – Directed evolution as training data
01:12:35 – What is Tatta Bio?
01:19:02 – Building Google for genomic sequences (SeqHub)
01:25:46 – How to create communities around scientific OSS
01:29:06 – What’s the purpose in the centralization of the software?
01:35:37 – How will the way science is done change in 10 years?
By Abhishaike Mahajan