
In the Future of Voice AI series of interviews, I ask my guests three questions:
This episode’s guests are Klemen Simonic, Co-Founder & CEO at Soniox, and Kwindla Hultman Kramer, Co-Founder & CEO at Daily.
Klemen Simonic is the CEO and Co-Founder of Soniox, where he leads the development of advanced voice AI models built for real-world performance. He brings over 16 years of experience across industry and academia, with a deep focus on artificial intelligence. He has worked on cutting-edge AI systems at Facebook, Google, Stanford University, and the University of Ljubljana. Klemen has been developing AI technologies since his undergraduate years, spanning speech, language, and large-scale knowledge systems.
Kwin is CEO and co-founder of Daily, a developer platform for real-time audio, video, and AI. He has been interested in large-scale networked systems and real-time video since his graduate student days at the MIT Media Lab. Before Daily, Kwin helped to found Oblong Industries, which built an operating system for spatial, multi-user, multi-screen, multi-device computing.
Recap Video
Takeaways
* Voice AI adoption is slow because real-time transcription still breaks on the most basic parts of a customer call.
* Real growth is happening quietly inside call centers, but teams won’t scale until transcription stops causing cascading errors.
* Even the top models fail on emails, addresses, and alphanumerics, which are the single points of failure in most B2B workflows.
* Consumer-grade demos hide the reality that long, multi-turn conversations still fall apart without rigorous context control.
* The jump from POC to production fails not because of LLMs, but because engineering teams underestimate context management.
* A universal multilingual model can outperform single-language models by transferring entity knowledge across languages.
* Mixed-language conversations are the norm worldwide, and current systems break the moment a user switches language.
* Latency, accuracy, and cost must be solved at the same time; optimizing only one kills the use case.
* Feeding both sides of the conversation into STT gives models more context and improves accuracy.
* Domain-specific accuracy matters far more than general accuracy, and most models still fail in specialized environments.
* Industry “context boosting” tricks are hacks that break at scale; native learned context inside STT is the only path forward.
* Punctuation and intonation directly shape LLM reasoning, and stripping them for speed creates silent failure modes.
* Voice AI is shifting from speech-to-text to full speech understanding, and models that don’t evolve won’t survive.
* The future points toward fused audio plus LLM architectures that remove the brittle STT handoff entirely.
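The "context boosting" hacks mentioned in the takeaways usually amount to passing a list of domain phrases alongside the audio and bumping hypotheses that contain them. A minimal toy sketch of that rescoring idea, assuming a candidate list of (text, score) hypotheses (the function and its parameters are illustrative, not any specific vendor's API):

```python
# Toy illustration of "context boosting": rescore STT hypotheses by
# bumping candidates that contain caller-supplied domain phrases.
# Names and scores here are hypothetical placeholders.

def boost_rescore(hypotheses, phrases, boost=2.0):
    """hypotheses: list of (text, log_score); returns the best text
    after adding a flat bump per matched domain phrase."""
    def boosted(text, score):
        bumped = score
        for phrase in phrases:
            if phrase.lower() in text.lower():
                bumped += boost  # flat, phrase-level bump
        return bumped
    return max(hypotheses, key=lambda h: boosted(*h))[0]

# The acoustically likelier hypothesis mis-hears the entity name;
# supplying the domain phrase flips the winner.
hyps = [("call so nyox support", -3.1), ("call Soniox support", -3.4)]
print(boost_rescore(hyps, phrases=[]))          # prints "call so nyox support"
print(boost_rescore(hyps, phrases=["Soniox"]))  # prints "call Soniox support"
```

The fragility the guests describe is visible even in this sketch: the bump only fires on an exact substring match, so every entity must be enumerated up front, which is exactly what breaks at scale compared with context learned natively inside the STT model.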
By Davit Baghdasaryan