Cloud outages don’t have to be a mystery or a recurring fire drill. Host Dr. Darren interviews Dr. Helen Gu, professor at North Carolina State University and founder and CEO of InsightFinder, about how AI for cloud operations can detect, predict, and automatically fix outages before users feel the impact.
## Key Takeaways
- AI can move beyond simple alerting to **predictive outage prevention**, spotting early warning signs before they become incidents.
- **Unsupervised machine learning** helps discover hidden patterns in noisy machine data without requiring large sets of labeled examples.
- Real-world cloud environments are complex, with thousands of parameters, dynamic workloads, and interacting microservices that make manual troubleshooting difficult.
- A **closed-loop feedback system** lets teams review AI predictions, correct mistakes, and continuously improve model accuracy.
- InsightFinder’s **composite AI** approach combines predictive AI, causal inference, behavior learning, and small language models for more reliable operations.
- The same data-driven approach can support **cloud monitoring, edge environments, critical infrastructure, and other machine-generated data streams**.
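
To make the unsupervised-learning and feedback-loop takeaways concrete, here is a minimal sketch in Python. It is not InsightFinder's method: it swaps in a simple rolling z-score as the detector, and the class name `RollingAnomalyDetector`, the `feedback` method, the threshold values, and the sample latency numbers are all hypothetical, chosen only for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags points far from a rolling baseline; needs no labeled examples."""

    def __init__(self, window=20, threshold=3.0):
        self.values = deque(maxlen=window)  # recent normal-looking samples
        self.threshold = threshold          # z-score cutoff (assumed default)

    def observe(self, value):
        """Return True if `value` looks anomalous against recent history."""
        is_anomaly = False
        if len(self.values) >= 5:  # wait for a small baseline first
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        # Learn only from normal-looking points so a spike
        # does not become part of the baseline.
        if not is_anomaly:
            self.values.append(value)
        return is_anomaly

    def feedback(self, was_real_incident):
        """Operator review closes the loop by adjusting sensitivity."""
        if was_real_incident:
            # Confirmed incident: become slightly more sensitive.
            self.threshold = max(2.0, self.threshold - 0.1)
        else:
            # False alarm: become less sensitive.
            self.threshold += 0.5


detector = RollingAnomalyDetector()
latencies_ms = [100, 102, 98, 101, 99, 103, 100, 97, 250, 101]  # made-up data
flags = [detector.observe(v) for v in latencies_ms]
```

In this sketch only the 250 ms sample is flagged. A reviewer who judged that alert a false alarm would call `detector.feedback(False)`, which raises the cutoff so similar spikes are less likely to fire again. A production system would use far richer models and signals than a single metric and a single threshold.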
## Chapters
- 00:00 Introduction to AI that prevents cloud outages
- 01:05 Helen Gu’s origin story in NASA-funded Mars research
- 04:10 From video streaming on Mars to machine learning for reliability
- 07:00 Why machine data is harder than it looks
- 09:20 Unsupervised learning vs. supervised learning
- 12:10 From research to Google Cloud anomaly detection
- 14:40 Detection, prediction, and automatic remediation
- 17:10 Why cloud systems are so complex
- 19:45 The future of AI agents, models, and infrastructure monitoring
- 23:10 Hallucinations, false positives, and feedback loops
- 26:00 Composite AI and online learning in production
- 29:10 Adapting AI models to different environments
- 32:05 Fast deployment and time to value