We explore counting words across 5 terabytes of text using distributed systems. From chunking data into 128 MB blocks and performing map and reduce, to Hadoop’s disk I/O and Spark’s in-memory approach, we discuss when memory fits, when it spills, and why I/O is the real bottleneck. We’ll also cover tokenization pitfalls at block boundaries, failure resilience, data skew, and practical timelines on real clusters for building resilient, scalable text analytics pipelines.
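To make the map-and-reduce pattern concrete, here is a minimal word-count sketch in PySpark; the input and output paths are hypothetical, and a Hadoop MapReduce job would follow the same map/shuffle/reduce shape, only with intermediate results spilled to disk rather than held in memory.

```python
# Minimal word-count sketch (illustrative only; paths are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
sc = spark.sparkContext

# HDFS/S3 splits the input into ~128 MB blocks; each block feeds a map task.
lines = sc.textFile("s3://example-bucket/corpus/")

counts = (
    lines.flatMap(lambda line: line.lower().split())  # map: tokenize each line
         .map(lambda word: (word, 1))                 # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)             # reduce: sum counts per word
)

counts.saveAsTextFile("s3://example-bucket/word-counts/")  # hypothetical output path
spark.stop()
```

Tokenizing per line sidesteps the block-boundary pitfall mentioned above because the underlying text input format reassembles lines that span block boundaries before they reach the mapper.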
Note: This podcast was AI-generated, and sometimes AI can make mistakes. Please double-check any critical information.
Sponsored by Embersilk LLC
By Mike Breault