
By Davit Baghdasaryan

In the Future of Voice AI series of interviews, I ask my guests three questions.
This episode’s guest is Jack Piunti, GTM Lead for Communications at ElevenLabs.
Jack Piunti is the GTM lead for Communications at ElevenLabs, where he oversees go-to-market strategy across CPaaS, CCaaS, UCaaS, and customer experience. With a strong background in consultative technology partnerships and startup growth, Jack brings deep expertise in AI-driven communications. Prior to ElevenLabs, he spent six years at Twilio, helping shape enterprise adoption of real-time voice technologies. He is passionate about the future of connected applications and the role of AI in transforming how we communicate.
ElevenLabs is a voice AI company offering ultra-realistic text-to-speech, speech-to-text, voice cloning, multilingual dubbing, and conversational AI tools. Founded in 2022, it enables creators and developers to build voice apps and generate lifelike, emotionally rich speech in 70+ languages. Its latest models support expressive cues and multi-speaker dialogue.
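For developers curious what that looks like in practice, here is a minimal sketch of generating speech through ElevenLabs' text-to-speech REST API in Python. The endpoint, header, and field names follow the public documentation at the time of writing, and the voice ID is only an example; check the current API reference before building on this.

```python
# A minimal sketch, not production code: endpoint and field names follow
# ElevenLabs' public REST docs at the time of writing; verify against the
# current API reference before relying on them.
import requests

API_KEY = "YOUR_XI_API_KEY"        # assumption: read from env/secret store in practice
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # example voice ID; substitute one from your voice library

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Hello from the Voice AI Newsletter.",
        "model_id": "eleven_multilingual_v2",  # one of the multilingual models mentioned above
    },
    timeout=30,
)
resp.raise_for_status()

# The endpoint returns raw audio bytes (MP3 by default).
with open("hello.mp3", "wb") as f:
    f.write(resp.content)
```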
Recap Video
Takeaways
* Most AI failures in conversation don't come from the language model, but from inaccurate speech-to-text at the start.
* Bad transcription of critical details like names or codes breaks the entire user experience and can’t easily be recovered.
* Accurate speech-to-text is now a make-or-break factor for building reliable AI agents.
* Voice will soon replace typing as the main way humans interact with machines because it's more natural and efficient.
* Enterprises don’t want to stitch together multiple AI vendors; they want end-to-end platforms that simplify the stack and reduce latency.
* Demos often look impressive, but very few companies can scale real-time voice tech reliably in production environments.
* AI voice agents that sound expressive aren't enough — turn-taking and accuracy are still bigger challenges.
* Most companies ignore accessibility in AI, but modeling things like stuttering actually improves agent behavior.
* Streaming speech and voice models will unlock more lifelike, responsive AI agents, and they’re coming fast.
* Audio AI demands expertise beyond machine learning, including sound engineering and context-aware modeling of human speech.
* There’s a growing trend of AI companies going beyond voice to control the full audio experience, including music and sound effects.
* The way voice models are trained is fundamentally different from language models and requires much cleaner training data.
* Many agentic AI builders today are forced to cobble together solutions from different vendors, which adds latency and complexity (see the sketch after this list).
* True real-time voice AI must handle language switching, emotional cues, and speech disfluencies automatically to feel natural.
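
To make the stitching and latency points concrete, here is an illustrative Python sketch of the multi-vendor pipeline several of these takeaways describe. All three vendor calls are hypothetical stubs (none of these function names come from a real SDK); the point is the shape of the flow: three sequential network hops whose latencies add up, and a transcript produced at the first hop that every later stage has to trust.

```python
import time

# All three functions are hypothetical stubs standing in for separate
# vendor APIs; none of these names come from a real SDK.
def transcribe(audio: bytes) -> str:
    # Speech-to-text hop. If this mishears a name or code, nothing
    # downstream can recover it (per the takeaways above).
    return "my confirmation code is A13F"

def generate_reply(transcript: str) -> str:
    # LLM hop. It reasons only over the transcript it receives, so an
    # upstream transcription error is baked into the reply.
    return f"Thanks, I have your code as {transcript.split()[-1]}."

def synthesize(reply: str) -> bytes:
    # Text-to-speech hop.
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    t0 = time.perf_counter()
    transcript = transcribe(audio)      # hop 1: STT vendor
    reply = generate_reply(transcript)  # hop 2: LLM vendor
    speech = synthesize(reply)          # hop 3: TTS vendor
    print(f"turn latency: {time.perf_counter() - t0:.3f}s (three sequential hops)")
    return speech

handle_turn(b"...caller audio...")
```

An end-to-end platform collapses those hops into a single provider, which is exactly the simplification the enterprises above are asking for.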