
Why it matters. Every LLM-based text-to-speech system shipping today carries a structural flaw: text tokens and audio frames move at incompatible rates inside the same model, forcing engineers to trade off reliability, quality, and inference cost. Hume AI's TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment eliminates the mismatch by enforcing strict one-to-one synchronization between text tokens and continuous acoustic vectors, producing zero content hallucinations across 1,000+ test samples and running at 5× the throughput of comparable systems.
Hume AI. Hume AI is a voice AI infrastructure company building training data pipelines, evaluation systems, and reinforcement learning tooling for frontier labs and enterprises. TADA ships fully open: code, pre-trained weights (a 1B English model and a 3B multilingual model covering eight languages, both built on the Llama architecture), the audio tokenizer, and the vocoder decoder. Resources: arXiv paper, GitHub repository, HuggingFace demo, PyPI package, Hume AI blog post.
The Researchers. The paper's authors are Trung Dang, Sharath Rao, Ananya Gupta, Christopher Gagne, Panagiotis Tzirakis, Alice Baird, Jakub Piotr Cłapa, Peter Chin, and Alan Cowen, founder and CEO of Hume AI, whose prior research on the taxonomy of human emotions underpins the company's focus on expressive, emotionally aware speech generation. Panagiotis Tzirakis and Alice Baird are established researchers in affective computing and speech-based emotion recognition. Trung Dang also holds an affiliation with Dartmouth College.
Key Technical Concepts. Prior LLM-TTS architectures, including VALL-E (Microsoft), SoundStorm (Google DeepMind), and systems built on EnCodec discrete tokens, all operate at fixed acoustic frame rates (12.5–75 Hz), producing speech sequences 5–35× longer than their text inputs and requiring either reduced frame rates (losing expressiveness) or intermediate semantic token layers (adding complexity and failure points). TADA replaces discrete acoustic tokens with continuous vectors aligned one-to-one to text tokens via a learned encoder-aligner pair, collapsing the sequence-length disparity entirely. A flow matching head decodes LLM hidden states into waveforms via a vocoder, enabling high-fidelity reconstruction without the computational overhead of fixed-rate codecs. To close the modality gap that appears when text and speech are co-generated, the paper introduces Speech Free Guidance, which blends logits from the text-only and text-speech inference modes. The architecture opens direct paths for extension: new modalities via tokenizer adaptation, long-context drift mitigation, and assistant-scenario fine-tuning.
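To make the guidance idea concrete, here is a minimal Python sketch of one plausible Speech Free Guidance decoding step, assuming a classifier-free-guidance-style formulation. The function name, the guidance_scale parameter, and the blending direction are illustrative assumptions, not the paper's published equations.

```python
import torch

def speech_free_guidance(
    logits_joint: torch.Tensor,      # next-token logits from the text+speech pass
    logits_text_only: torch.Tensor,  # next-token logits from a text-only pass
    guidance_scale: float = 1.5,     # hypothetical knob; the paper's setting may differ
) -> torch.Tensor:
    """Blend the two inference modes, classifier-free-guidance style.

    Pulling the joint-mode logits toward (or past) the text-only mode is one
    plausible reading of "blending logits from text-only and text-speech
    inference modes"; the exact formula in the TADA paper may differ.
    """
    return logits_joint + guidance_scale * (logits_text_only - logits_joint)

# Toy usage: two batched vocabulary distributions, blended before sampling.
vocab_size = 8
joint = torch.randn(1, vocab_size)
text_only = torch.randn(1, vocab_size)
blended = speech_free_guidance(joint, text_only)
next_token = torch.argmax(blended, dim=-1)
```

With guidance_scale between 0 and 1 this interpolates toward the text-only mode; values above 1 extrapolate past it, the usual classifier-free-guidance trick for strengthening a conditioning signal.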
Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.