

In this episode we explore how companies should evaluate AI agents across multiple dimensions — including correctness, tool selection, multi-turn reasoning, and safety. The conversation covers building reliable evaluation frameworks, balancing automated vs. human-in-the-loop testing, and leveraging observability to debug agent behavior in production.
Links from the Show
AgentCore Evaluation: https://github.com/awslabs/agentcore-samples/tree/main/01-tutorials/07-AgentCore-evaluations
Strands Evaluation: https://strandsagents.com/docs/user-guide/evals-sdk/quickstart/
AWS Hosts: Nolan Chen & Malini Chatterjee
Email Your Feedback: [email protected]
By AWS re:Think
