The Data Engineering Show

Why 99% of Data Teams Give Up on Real-Time And How Artie Changes That



In this episode of The Data Engineering Show, Benjamin sits down with Artie CTO and co-founder Robin Tang to explore the complexities of high-performance data movement. Robin shares his journey from building Maxwell at Zendesk to scaling data systems at Opendoor, highlighting the gap between business-oriented SaaS connectors and the rigorous demands of production database replication.

Robin dives deep into Artie’s architecture, explaining how they leverage a split-plane model (Control Plane and Data Plane) to provide a "Bring Your Own Cloud" (BYOC) experience that engineering teams actually trust. You’ll hear about the technical nuances of CDC, from handling Postgres TOAST columns to the "economy of scale" challenges of processing billions of rows for Substack, Artie’s first customer. Whether you're struggling with real-time ingestion costs or curious about the future of platform-agnostic partitioning, this conversation provides a masterclass in modern data movement.


What You'll Learn:

  • Why the data movement market is bifurcating: Managed vendors like Fivetran excel at SaaS integrations (hundreds of connectors), while specialized vendors like Artie focus on production databases at high volume - a fundamentally different job to be done requiring expertise in failure recovery, observability, and advanced use cases.
  • How to design CDC architecture that doesn't break production databases: Use online backfill strategies (the DBLog framework) instead of long-running transactions that hold write locks; implement table-level parallelism so a single table error doesn't halt the entire pipeline.
  • The split-plane architecture pattern for flexible deployment models: Build control plane and data plane separation from day one, allowing customers to choose between fully managed cloud deployments or bring-your-own-cloud (BYOC) without compromising UX or architecture.
  • Why database-specific expertise matters more than breadth: SQL Server CDC requires reverse engineering undocumented code; Postgres has TOAST columns; MongoDB allows invalid timestamp values - each data source has hidden complexity that justifies deep specialization over connector sprawl.
  • How to build trust with early-stage customers on mission-critical workloads: Walk prospects through architecture and failure modes before implementation; encourage them to stress-test with real data volumes; establish deep engineering partnerships where both teams debug problems together (not sales-driven relationships).
  • The platform-specific optimization trap and how to solve it: Instead of requiring customers to understand nuances of BigQuery time partitioning vs. Snowflake's lack thereof, build platform-agnostic features (like soft partitioning) that work consistently across destinations while handling platform-specific optimizations under the hood.
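The table-level parallelism point above can be illustrated with a minimal sketch. This is not Artie's implementation; `sync_table`, the table names, and the simulated failure are all hypothetical. It only shows the general idea of running each table as its own unit of work so that one table's error is recorded without stopping the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sync_table(table: str) -> int:
    """Hypothetical per-table sync; returns the number of rows copied."""
    if table == "orders":
        # Simulated failure, standing in for a schema or type error on one table.
        raise RuntimeError("simulated failure on orders")
    return 100


def run_pipeline(tables):
    """Sync every table independently and collect successes and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(sync_table, t): t for t in tables}
        for fut in as_completed(futures):
            table = futures[fut]
            try:
                results[table] = fut.result()
            except Exception as exc:
                # Record the failure for this table only; other tables keep going.
                errors[table] = str(exc)
    return results, errors


results, errors = run_pipeline(["users", "orders", "payments"])
print("synced:", sorted(results))
print("failed:", sorted(errors))
```

In this sketch the `orders` failure ends up in `errors` while `users` and `payments` still complete, which is the isolation property the bullet describes.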

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

About the Guest(s)


Robin is the CTO and cofounder of Artie, a data movement platform built for high-volume, low-latency production database replication. With over a decade of experience building large-scale data systems, including early work on Maxwell (an open-source CDC framework at Zendesk) and database architecture at venture-backed startups, Robin identified a critical gap: existing tools optimize for SaaS integrations, not production databases at scale. In this episode, Robin shares hard-won lessons from building mission-critical infrastructure, including architectural innovations that prevent data loss and failure modes that only surface under real-world production load. His work at Artie has powered reliable data replication for companies like Substack, making this conversation essential for engineering teams building or evaluating real-time data movement solutions.

Quotes


"Artie helps companies make data streaming accessible." - Robin

"I didn't want to make any sort of compromises and it just turned out to be a really hard problem, so then we started a company around this." - Robin

"The complexity is not just at the destination level, the complexity is also at the source level." - Robin

"Every pipeline that we touch is mission critical for customers, or else they would just use either their existing pipeline or a managed vendor that's out there." - Robin

"We handle the whole thing, whereas other vendors more or less provide a component and expect engineers to either build or attach additional pieces." - Robin

"I think the biggest bottleneck for real time right now is accessibility. When people think about real time, they immediately think it's not worth it because they implicitly have a cost associated with it." - Robin

"We use Kafka transactions, so we do not commit offsets until the destination tells us the data has actually been flushed." - Robin

"There's so much nuance with every single data source that it becomes a whack-a-mole problem." - Robin

"When there's sufficient pain on the other side and they buy into your vision, it's easier to overcome obstacles during technical implementation." - Robin

"We're spending more time developing platform-agnostic solutions so customers don't have to understand platform nuances." - Robin
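The offset-handling quote above describes an ordering: flush to the destination first, advance the offset second. Below is a minimal in-memory sketch of that ordering. It does not use Kafka or Kafka transactions; `InMemoryLog`, `FlakyDestination`, and `run` are hypothetical stand-ins, and this simple loop only gives at-least-once delivery if a crash lands between flush and commit.

```python
class InMemoryLog:
    """Stand-in for one Kafka partition: ordered messages plus a committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to read after a restart

    def poll(self, max_records):
        return self.messages[self.committed:self.committed + max_records]

    def commit(self, offset):
        self.committed = offset


class FlakyDestination:
    """Hypothetical destination whose flush fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.rows = []

    def flush(self, batch):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("simulated flush failure")
        self.rows.extend(batch)


def run(log, dest, batch_size=2, max_attempts=10):
    for _ in range(max_attempts):
        batch = log.poll(batch_size)
        if not batch:
            break
        try:
            dest.flush(batch)
        except IOError:
            # Offset was not advanced, so the same batch is read again next time.
            continue
        # Only commit once the destination has accepted the batch.
        log.commit(log.committed + len(batch))


log = InMemoryLog(["a", "b", "c", "d", "e"])
dest = FlakyDestination(failures=2)
run(log, dest)
print(dest.rows, log.committed)
```

Because the offset moves only after a successful flush, the two simulated failures cause retries of the same batch rather than skipped data.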


Resources 

Connect on LinkedIn:
  • Robin Tang - https://www.linkedin.com/in/tang8330/
  • Benjamin Wagner - https://www.linkedin.com/in/wagjamin/


Websites:
  • Artie: https://www.artie.com/
  • Fivetran: https://www.fivetran.com
  • Estuary: https://www.estuary.dev
  • Airbyte: https://airbyte.com
  • Debezium: https://debezium.io

Tools & Platforms:
  • Maxwell – Open source CDC framework for MySQL to read binlog into Kafka
  • Kafka – Distributed event streaming platform for data movement
  • WarpStream – Cost-optimized Kafka alternative using object storage
  • Strimzi – Kubernetes-native Kafka deployment tool
  • Apache Iceberg – Open table format for data lakehouse architecture
  • Delta Live Tables – Databricks' data movement and transformation tool
  • ClickPipes – ClickHouse's native data ingestion platform
  • Snowpipe Streaming – Snowflake's real-time data ingestion service
  • Google Datastream – Google Cloud's CDC and data movement service
  • AWS MSK Tiered Storage – Amazon managed Kafka with tiered storage capabilities

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:
  • Zach Wilson on What Makes a Great Data Engineer
  • Joe Reis and Matt Housley on The Fundamentals of Data Engineering
  • Bill Inmon, The Godfather of Data Warehousing
The Data Engineering Show, by The Firebolt Data Bros
