Data Engineering Podcast

Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability



Summary 
In this episode Jacob Leverich, cofounder and CTO of Observe, talks about applying lakehouse architectures to observability workloads. Jacob discusses Observe’s decision to leverage cloud-native warehousing and open table formats for scale and cost efficiency. He digs into the core pain points teams face with fragmented tools, soaring costs, and data silos, and how a lakehouse approach (paired with streaming ingest via OpenTelemetry, Kafka-backed durability, curated/columnarized tables, and query orchestration) can deliver low-latency, interactive troubleshooting across logs, metrics, and traces at petabyte scale. He also explores the practicalities of loading and organizing telemetry by use case to reduce read amplification, the role of Iceberg (including v3’s JSON shredding) and Snowflake’s implementation, and why open table formats enable “your data in your lake” strategies.
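As a rough illustration of the read amplification point in the summary, here is a minimal sketch. It is not Observe's implementation: the record shape, the `build_files` and `read_amplification` helpers, and the rule that a reader must scan every file containing at least one matching row are all assumptions made for the example. It compares the same synthetic log records laid out by arrival time versus by service.

```python
# Hypothetical illustration only: record shape, file size, and the scan model are assumptions.

def build_files(records, key, rows_per_file):
    """Sort records by `key`, then cut them into fixed-size 'files'."""
    ordered = sorted(records, key=key)
    return [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]

def read_amplification(files, predicate):
    """Rows scanned divided by rows needed, assuming a reader must scan
    every file that contains at least one matching row."""
    needed = sum(1 for f in files for r in f if predicate(r))
    scanned = sum(len(f) for f in files if any(predicate(r) for r in f))
    return scanned / needed

# 1,000 synthetic log records interleaved across 10 services.
records = [{"ts": i, "service": f"svc-{i % 10}"} for i in range(1000)]

def wanted(record):
    """The query: all logs for one service."""
    return record["service"] == "svc-3"

# Layout 1: files cut in arrival order, so every file holds a few rows per service.
by_time = build_files(records, key=lambda r: r["ts"], rows_per_file=100)
# Layout 2: files organized by the use case (per-service lookups), then by time.
by_service = build_files(records, key=lambda r: (r["service"], r["ts"]), rows_per_file=100)

amp_time = read_amplification(by_time, wanted)
amp_service = read_amplification(by_service, wanted)
print(f"by time: {amp_time:.1f}x, by service: {amp_service:.1f}x")
```

In this toy setup the time-ordered layout touches every file to answer a per-service query, while the service-ordered layout touches only the file that holds the requested rows. Real systems also rely on file statistics, partitioning, and columnar pruning, which this sketch ignores.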
Announcements 
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
  • You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
  • Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
  • Your host is Tobias Macey and today I'm interviewing Jacob Leverich about how data lakehouse technologies can be applied to observability for unlimited scale and orders of magnitude improvement on economics

Interview
 
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what the major pain points have been in the observability space? (e.g. limited scale/retention, costs, integration fragmentation)
  • What are the elements of the ecosystem and tech stacks that led to that state of the world?
  • What are you building at Observe that circumvents those pain points?
  • What are the major ecosystem evolutions that make this a feasible architecture? (e.g. columnar storage, distributed compute, protocol consolidation)
  • Can you describe the architecture of the Observe platform?
  • How has the design of the platform evolved/changed direction since you first started working on it?
  • What was your process for determining which core technologies to build on top of?
  • What were the missing pieces that you had to engineer around to get a cohesive and performant platform?
  • The perennial problem with observability systems and data lakes is their tendency to succumb to entropy. What are the guardrails that you are relying on to help customers maintain a well-structured and usable repository of information?
  • Data lakehouses are excellent for flexibility and scaling to massive data volumes, but they're not known for being fast. What are the areas of investment in the ecosystem that are changing that narrative?
  • As organizations overcome the constraints of limited retention periods and anxiety over cost, what new use cases does that unlock for their observability data?
  • How do AI applications/agents change the requirements around observability data? (collection, scale, complexity, applications, etc.)
  • What are the most interesting, innovative, or unexpected ways that you have seen Observe/lakehouse technologies used for observability?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Observe?
  • When are Observe/lakehouse technologies the wrong choice?
  • What do you have planned for the future of Observe?

Contact Info
 
  • LinkedIn

Parting Question
 
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
 
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
 
  • Observe Inc.
  • Lakehouse Architecture
  • Splunk
  • Observability
  • RSyslog
  • GlusterFS
  • Dremel
  • Drill
  • BigQuery
  • Snowflake SIGMOD Paper
  • Prometheus
  • Datadog
  • NewRelic
  • AppDynamics
  • DynaTrace
  • Loki
  • Cortex
  • Mimir
  • Tempo
  • Cardinality
  • FluentBit
  • FluentD
  • OpenTelemetry
  • OTLP == OpenTelemetry Protocol
  • Kafka
  • VPC Flow Logs
  • Read Amplification
  • Lance
  • Iceberg
  • Hudi
  • PromQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA