AI Adoption Playbook

incident.io's Lawrence Jones on Building AI That Automatically Investigates Technical Outages



When technical systems fail at companies like Netflix or Etsy, every minute of downtime can cost millions. That's why incident.io is building AI systems that can automatically investigate and diagnose technical problems faster than human engineers.

In this episode of The AI Adoption Playbook, Lawrence Jones, Product Engineer at incident.io, tells Ravin how they're creating an automated incident investigator that can analyze logs, traces, and metrics to determine what went wrong during an outage. He shares their methodical approach to AI development, focusing on measurable progress through evaluation metrics and scorecards rather than intuitive "vibe-based" changes.

Lawrence also discusses the evolution of their AI teams and roles, including their newly launched AI Engineer position designed specifically for the unique challenges of AI development, and how they use LLMs to evaluate the performance of their own AI systems.

Topics discussed:

  • Building an AI incident investigator that can automatically analyze logs, traces, and metrics to determine the root cause of technical outages.
  • Creating comprehensive evaluation frameworks with scorecards and metrics to measure AI performance against historical incident data.
  • Using LLMs as evaluators to determine if AI responses were helpful by analyzing post-incident conversations and user feedback.
  • Developing internal tooling that enables teams to rapidly test and improve AI systems while maintaining quality standards.
  • Evolving from individual "vibe-based" AI development to team-based systematic improvement with clear metrics for success.
  • Structuring AI engineering roles and teams to balance product engineering skills with specialized AI development knowledge.
  • Implementing product-focused AI features like chatbots that can help automate routine tasks during incident response.
  • Leveraging parallel human and AI processes to collect validation data and improve AI system performance over time.
  • Building versus buying AI evaluation tools and the advantages of custom solutions integrated with existing product data.
  • Exploring the future of AI in technical operations and whether AI will enhance or replace human roles in incident management.

Listen to more episodes on Apple, Spotify, and YouTube.


AI Adoption Playbook, by Credal