ServiceNow Insights

EVA - A Framework for Evaluating Voice Agents by ServiceNow



Voice AI agent evaluation — why it's fundamentally harder than text, how cascade failures derail conversations invisibly, and ServiceNow's open-source framework to establish industry evaluation standards. Featuring real audio examples showing authentication failures, leaked reasoning, and latency problems.

WHAT WE COVER 

TARA BOGAVELLI — Research Engineer, ServiceNow
Leading the open-source voice agent evaluation framework. Explains why existing benchmarks don't measure what matters and what ServiceNow is releasing to establish industry standards.

KATRINA STANKIEWICZ — Staff Machine Learning Engineer, ServiceNow
Cascade model architecture expert. Breaks down STT → LLM → TTS failure modes, named entity transcription challenges, and real audio example analysis.

GABRIELLE GAUTHIER MELANÇON — Staff Applied Research Scientist, ServiceNow
Multi-language evaluation specialist. Reveals why Large Audio Language Models lag behind, the native speaker requirement, and bot-to-bot simulation methodology. 

CHAPTERS
0:00 Introduction — The evaluation gap
1:11 ServiceNow's Open-Source Framework Announcement — Tara Bogavelli
2:43 Meet the Researchers
3:43 Voice-Specific Challenges — Tara Bogavelli
5:03 Cascade Architecture: STT → LLM → TTS — Katrina Stankiewicz
7:57 The Named Entity Problem — Katrina Stankiewicz
10:06 Evaluation Metrics: Accuracy vs Experience — Gabrielle Gauthier Melançon
11:23 Bot-to-Bot Testing at Scale — Gabrielle Gauthier Melançon 
14:30 The LALM Gap: Why Audio AI Judges Struggle — Tara Bogavelli
16:57 Real Audio Example: Flight Rebooking Gone Wrong
21:58 Breaking Down the Failures — Katrina Stankiewicz
28:30 Wrap-Up & Resources

KEY INSIGHTS

The Cascade Failure Problem: STT → LLM → TTS errors propagate invisibly.
Named Entity Transcription: The #1 enterprise blocker. Names, confirmation codes, and emails break authentication.
Accuracy vs Experience: Perfect task completion means nothing if users hang up due to poor experience.
LALM Gap: Large Audio Language Models lag behind text LLMs, so human evaluators remain essential.
Latency Kills Conversations: Five-second pauses make users think the call dropped, breaking the experience even when tasks complete.
Open-Source Framework: ServiceNow is releasing evaluation tools, metrics, and bot-to-bot simulation methodology for the industry.

LEARN MORE

Website: https://servicenow.github.io/eva/
GitHub: https://github.com/servicenow/eva
Blog Post: https://huggingface.co/blog/ServiceNow-AI/eva
Dataset: https://huggingface.co/datasets/ServiceNow-AI/eva

ABOUT

Hosted by Bobby Brill. The ServiceNow Insights podcast explores AI research, real-world applications, and the people building the future of work.

#VoiceAI #AIEvaluation #ServiceNow #MachineLearning #OpenSource #ConversationalAI #STT #TTS #LLM #VoiceAgents #AIResearch #Podcast

See omnystudio.com/listener for privacy information.


ServiceNow Insights, by ServiceNow Community
