

In this episode we explore how companies should evaluate AI agents across multiple dimensions — including correctness, tool selection, multi-turn reasoning, and safety. The conversation covers building reliable evaluation frameworks, balancing automated vs. human-in-the-loop testing, and leveraging observability to debug agent behavior in production.
Links from the Show
AgentCore Evaluation: https://github.com/awslabs/agentcore-samples/tree/main/01-tutorials/07-AgentCore-evaluations
Strands Evaluation: https://strandsagents.com/docs/user-guide/evals-sdk/quickstart/
AWS Hosts: Nolan Chen & Malini Chatterjee
Email Your Feedback: [email protected]
By AWS re:Think
