Data Engineering Podcast

Unfreezing The Data Lake: The Future-Proof File Format



Summary 
In this episode, PhD researcher Xinyu Zeng talks about F3, the “future-proof file format” designed to address today’s hardware realities and evolving workloads. He digs into the limitations of Parquet and ORC (especially CPU-bound decoding, metadata overhead for wide-table projections, and poor random-access behavior for ML training and serving) and how F3 rethinks layout and encodings to be efficient, interoperable, and extensible. Xinyu explains F3’s two major ideas: a decoupled, flexible layout that separates IO units, dictionary scope, and encoding choices; and self-decoding files that embed WebAssembly kernels so new encodings can be adopted without waiting on every engine to upgrade. He discusses why table formats and file formats should increasingly be decoupled, potential synergies between F3 and table layers (including centralizing and verifying WASM kernels), and future directions such as extending WASM beyond encodings to indexing or filtering.

Announcements 
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
  • Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
  • Your host is Tobias Macey and today I'm interviewing Xinyu Zeng about the future-proof file format

Interview
 
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what the F3 project is and the story behind it?
  • We have several widely adopted file formats (Parquet, ORC, Avro, etc.). Why do we keep creating new ones?
  • Parquet is the format with perhaps the broadest adoption. What are the challenges that such wide use poses when trying to modify or extend the specification?
  • The recent focus on vector data is perhaps the most visible change in storage requirements. What are some of the other custom types of data that might need to be supported in the file storage layer?
  • Can you describe the key design principles of the F3 format?
  • What are the engineering challenges that you faced while developing your implementation of the F3 proof-of-concept?
  • The key challenge of introducing a new format is that of adoption. What are the provisions in F3 that might simplify the adoption of the format in the broader ecosystem? (e.g. integration with compute frameworks)
  • What are some examples of features in data lake use cases that could be enabled by F3?
  • What are some of the other ideas/hypotheses that you developed and discarded in the process of your research?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on F3?
  • What do you have planned for the future of F3?

Contact Info
 
  • Personal Website

Parting Question
 
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
 
  • F3 Paper
  • Formats Evaluation Paper
  • F3 Github
  • SAL Paper
  • RisingWave
  • Tencent Cloud
  • Parquet
  • Arrow
  • Andy Pavlo
  • Wes McKinney
  • CMU Public Seminar
  • VLDB
  • ORC
  • Protocol Buffers
  • Lance
  • PAX == Partition Attributes Across
  • WASM == WebAssembly
  • DataFusion
  • DuckDB
  • DuckLake
  • Velox
  • Vortex File Format

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
 