Software Daily

LinkedIn Data Platform with Carl Steinbach



LinkedIn is a social network with petabytes of data. 

To store that data, LinkedIn distributes and replicates it across a large cluster of machines running the Hadoop Distributed File System. To run calculations across such a large data set, LinkedIn needs to split the computation up using MapReduce-style jobs.
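For listeners new to the idea, here is a minimal single-process sketch of what a MapReduce-style job does: each input split is mapped independently, the intermediate pairs are grouped by key, and each group is reduced to a result. The record shape, the `industry` field, and all function names are assumptions made up for illustration. This is not LinkedIn's code or the Hadoop API, and a real cluster would run the map and reduce steps on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    # Emit one (key, 1) pair per record in a single input split.
    # "industry" is a hypothetical field chosen for this example.
    return [(record["industry"], 1) for record in split]


def shuffle(pairs):
    # Group intermediate values by key, as the framework would do
    # between the map and reduce stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Collapse each key's list of values into a single count.
    return {key: sum(values) for key, values in grouped.items()}


def run_job(splits):
    # Each split could be mapped on a separate machine; here they
    # are processed one after another in a single process.
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    return reduce_phase(shuffle(mapped))


# Two hypothetical splits, standing in for blocks of a file stored on
# different nodes of a distributed file system.
splits = [
    [{"industry": "software"}, {"industry": "finance"}],
    [{"industry": "software"}],
]
counts = run_job(splits)
print(counts)
```

Because each split is mapped without reference to the others, the same job can be spread over as many machines as there are splits, which is what makes the approach practical at petabyte scale.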

LinkedIn has been developing its data infrastructure since the early days of the Hadoop ecosystem. LinkedIn started using Hadoop in 2008, and in the last 11 years, the company has adopted streaming frameworks, distributed databases, and newer execution runtimes like Apache Spark.

With the popularization of machine learning, there are more applications for data engineering than ever before. But the tooling around data engineering is still immature, so it remains hard for developers to find data sets, clean their data, and build reliable models.

Carl Steinbach is an engineer at LinkedIn working on tools for data engineering. In today’s episode, Carl discusses the data platform inside LinkedIn and the strategies the company has developed for storing and computing over large amounts of data.

Full disclosure: LinkedIn is a sponsor of Software Engineering Daily.


By SoftwareDaily.com