The Future of Voice AI

Real-world problems with STT | Klemen Simonic (Soniox) & Kwindla Kramer (Daily)



In the Future of Voice AI series of interviews, I ask my guests three questions:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guests are Klemen Simonic, Co-Founder & CEO at Soniox, and Kwindla Hultman Kramer, Co-Founder & CEO at Daily.

Klemen Simonic is the CEO and Co-Founder of Soniox, where he leads the development of advanced voice AI models built for real-world performance. He brings over 16 years of experience across industry and academia, with a deep focus on artificial intelligence. He has worked on cutting-edge AI systems at Facebook, Google, Stanford University, and the University of Ljubljana. Klemen has been developing AI technologies since his undergraduate years, spanning speech, language, and large-scale knowledge systems.

Kwin is CEO and co-founder of Daily, a developer platform for real-time audio, video, and AI. He has been interested in large-scale networked systems and real-time video since his graduate student days at the MIT Media Lab. Before Daily, Kwin helped to found Oblong Industries, which built an operating system for spatial, multi-user, multi-screen, multi-device computing.



Takeaways

* Voice AI adoption is slow because real-time transcription still breaks on the most basic parts of a customer call.

* Real growth is happening quietly inside call centers, but teams won’t scale until transcription stops causing cascading errors.

* Even the top models fail on emails, addresses, and alphanumerics, which are the single points of failure in most B2B workflows.

* Consumer-grade demos hide the reality that long, multi-turn conversations still fall apart without rigorous context control.

* Moving from POC to production fails not because of the LLMs, but because engineering teams underestimate context management.

* A universal multilingual model can outperform single-language models by transferring entity knowledge across languages.

* Mixed-language conversations are the norm worldwide, and current systems break the moment a user switches language.

* Latency, accuracy, and cost must be solved simultaneously; optimizing only one kills the use case.

* Feeding both sides of the conversation into STT gives models more context and improves accuracy.

* Domain-specific accuracy matters far more than general accuracy, and most models still fail in specialized environments.

* Industry “context boosting” tricks are hacks that break at scale; native learned context inside STT is the only path forward.

* Punctuation and intonation directly shape LLM reasoning, and stripping them for speed creates silent failure modes.

* Voice AI is shifting from speech-to-text to full speech understanding, and models that don’t evolve won’t survive.

* The future points toward fused audio plus LLM architectures that remove the brittle STT handoff entirely.



This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit voice-ai-newsletter.krisp.ai

The Future of Voice AI, by Davit Baghdasaryan