
Shipping systems powered by LLMs would be hard enough if the models stayed the same. In reality, they don't: models are updated and deprecated at a pace traditional software never sees, all while teams are still expected to hit reliability targets that look a lot like traditional SLAs.
In this episode, Tomás Hernando Koffman, Co-founder of Not Diamond, breaks down what it really takes to reach 99%+ accuracy when the underlying model is a moving target. He explains why non-determinism, model churn, and compounding errors make LLM reliability fundamentally different from classical software, and why "good enough" accuracy quickly falls apart in production.
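The compounding-error point is easy to see with a little arithmetic: if each step of a chained workflow succeeds 95% of the time, a 10-step pipeline lands at roughly 0.95^10 ≈ 60% end to end. A minimal sketch, with illustrative numbers rather than figures from the episode:

```python
# Back-of-the-envelope math for compounding errors in chained
# LLM calls. Each step is treated as an independent pass/fail.
for per_step in (0.99, 0.95, 0.90):
    for steps in (1, 5, 10, 20):
        end_to_end = per_step ** steps
        print(f"{per_step:.0%} per step x {steps:>2} steps "
              f"-> {end_to_end:5.1%} end to end")
```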
Drawing from an SRE mindset, Tomás walks through a practical framework for building reliable LLM applications: evaluations and golden datasets, semantic metrics, structured context, and workflow design that treats prompts and instructions as part of the system architecture. We also dive into prompt optimization, why manual tuning doesn’t scale, and how teams can systematically improve accuracy even as models keep changing underneath them.
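To make the golden-dataset idea concrete, here is a minimal evaluation loop in the spirit of that framework. The dataset, model call, and scorer below are hypothetical placeholders, not Not Diamond's API:

```python
# Minimal golden-dataset evaluation loop. The model call and the
# semantic scorer are stand-ins to be replaced with real ones.
golden_set = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "2 + 2 = ?", "expected": "4"},
]

def call_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM client here.
    return "Paris" if "France" in prompt else "4"

def semantic_score(output: str, expected: str) -> float:
    # Placeholder for a semantic metric (embedding similarity,
    # LLM-as-judge, etc.) instead of brittle exact matching.
    return 1.0 if expected.lower() in output.lower() else 0.0

def accuracy(threshold: float = 0.9) -> float:
    # Fraction of golden examples whose output clears the threshold.
    hits = sum(
        semantic_score(call_model(ex["input"]), ex["expected"]) >= threshold
        for ex in golden_set
    )
    return hits / len(golden_set)

print(f"golden-set accuracy: {accuracy():.0%}")
```

Running the same loop before and after a model swap or prompt change is what turns "it seems fine" into a number you can track against a reliability target.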
By Rootly