Show Notes
- (1:56) Jim went over his education at Trinity College Dublin in the late 90s/early 2000s, where he got early exposure to academic research in distributed systems.
- (4:26) Jim discussed his research focused on dynamic software architecture, particularly the K-Component model that enables individual components to adapt to a changing environment.
- (5:37) Jim explained his research on collaborative reinforcement learning that enables groups of reinforcement learning agents to solve online optimization problems in dynamic systems.
- (9:03) Jim recalled his time as a Senior Consultant for MySQL.
- (9:52) Jim shared the initiatives at RISE Research Institutes of Sweden, where he has been a researcher since 2007.
- (13:16) Jim dissected his peer-to-peer systems research at RISE, including theoretical results on search algorithms and walk topologies.
- (15:30) Jim went over the challenges of building peer-to-peer live streaming systems at RISE, such as gradienTv and GLive.
- (18:18) Jim provided an overview of research activities at the Division of Software and Computer Systems at the School of Electrical Engineering and Computer Science at KTH Royal Institute of Technology.
- (19:04) Jim has taught courses on Distributed Systems and Deep Learning on Big Data at KTH Royal Institute of Technology.
- (22:20) Jim unpacked his O’Reilly article in 2017 called “Distributed TensorFlow,” which includes the deep learning hierarchy of scale.
- (29:47) Jim discussed the development of HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces its single-node in-memory metadata service with a distributed metadata service built on a NewSQL database.
- (34:17) Jim explained the decision to commercialize HopsFS and build Hopsworks, a user-friendly data science platform for Hops.
- (36:56) Jim explored the relative benefits of public research funding and VC funding.
- (41:48) Jim unpacked the key ideas in his post “Feature Store: The Missing Data Layer in ML Pipelines.”
- (47:31) Jim dissected the critical design that enables the Hopsworks feature store to refactor a monolithic end-to-end ML pipeline into separate feature engineering and model training pipelines.
- (52:49) Jim explained why data warehouses are insufficient for machine learning pipelines and why a feature store is needed instead.
- (57:59) Jim discussed prioritizing the product roadmap for the Hopsworks platform.
- (01:00:25) Jim hinted at what’s on the 2021 roadmap for Hopsworks.
- (01:03:22) Jim recalled the challenges of getting early customers for Hopsworks.
- (01:04:30) Jim reflected on the differences and similarities between being a professor and being a founder.
- (01:07:00) Jim discussed worrying trends in the European Tech ecosystem and the role that Logical Clocks will play in the long run.
- (01:13:37) Closing segment.
Jim’s Contact Info
- Logical Clocks
- Twitter
- LinkedIn
- Google Scholar
- Medium
- ACM Profile
- GitHub
Mentioned Content
Research Papers
- “The K-Component Architecture Meta-Model for Self-Adaptive Software” (2001)
- “Dynamic Software Evolution and The K-Component Model” (2001)
- “Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing” (2005)
- “Building Autonomic Systems Using Collaborative Reinforcement Learning” (2006)
- “Improving ICE Service Selection in a P2P System using the Gradient Topology” (2007)
- “gradienTv: Market-Based P2P Live Media Streaming on the Gradient Overlay” (2010)
- “GLive: The Gradient Overlay as a Market Maker for Mesh-Based P2P Live Streaming” (2011)
- “HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases” (2016)
- “Scaling HDFS to More Than 1 Million Operations Per Second with HopsFS” (2017)
- “Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata” (2017)
- “Implicit Provenance for Machine Learning Artifacts” (2020)
- “Time Travel and Provenance for Machine Learning Pipelines” (2020)
- “Maggy: Scalable Asynchronous Parallel Hyperparameter Search” (2020)
Articles
- “Distributed TensorFlow” (2017)
- “Reflections on AWS’s S3 Architectural Flaws” (2017)
- “Meet Michelangelo: Uber’s Machine Learning Platform” (2017)
- “Feature Store: The Missing Data Layer in ML Pipelines” (2018)
- “What Is Wrong With European Tech Companies?” (2019)
- “ROI of Feature Stores” (2020)
- “MLOps With A Feature Store” (2020)
- “ML Engineer Guide: Feature Store vs. Data Warehouse” (2020)
- “Unifying Single-Host and Distributed Machine Learning with Maggy” (2020)
- “How We Secure Your Data With Hopsworks” (2020)
- “One Function Is All You Need For ML Experiments” (2020)
- “Hopsworks: World’s Only Cloud-Native Feature Store, now available on AWS and Azure” (2020)
- “Hopsworks 2.0: The Next Generation Platform for Data-Intensive AI with a Feature Store” (2020)
- “Hopsworks Feature Store API 2.0, a new paradigm” (2020)
- “Swedish startup Logical Clocks takes a crack at scaling MySQL backend for live recommendations” (2021)
Projects
- Apache Hudi (by Uber)
- Delta Lake (by Databricks)
- Apache Iceberg (by Netflix)
- MLflow (by Databricks)
- Apache Flink (by The Apache Foundation)
People
- Leslie Lamport (The Father of Distributed Computing)
- Jeff Dean (Creator of MapReduce and TensorFlow, Lead of Google AI)
- Richard Sutton (The Father of Reinforcement Learning — who wrote “The Bitter Lesson”)
Programming Books
- C++ programming books (by Scott Meyers)
- “Effective Java” (by Joshua Bloch)
- “Programming Erlang” (by Joe Armstrong)
- “Concepts, Techniques, and Models of Computer Programming” (by Peter Van Roy and Seif Haridi)