In this episode:
• Is Query Redundant?: Linda introduces a provocative paper suggesting a core part of the Transformer attention mechanism, the Query matrix, might be unnecessary. Professor Norris expresses his trademark skepticism about simplifying such a fundamental component.
• The Usual Suspects: Q, K, and V: Linda provides a quick, intuitive refresher on the roles of the Query, Key, and Value matrices in self-attention. Professor Norris helps frame it with an analogy, emphasizing why each component has traditionally been considered essential (a minimal code sketch follows this list).
• Disappearing Queries and Basis Transformations: Linda explains the paper's core theoretical claim that the Query matrix can be mathematically absorbed into other components through a change of basis (see the second sketch below). Professor Norris probes the 'simplifying assumptions,' such as the absence of Layer Normalization, required for the proof to hold.
• Putting It to the Test: The discussion moves to the empirical results, where models trained without Query matrices perform surprisingly well. Linda details the crucial hyperparameter adjustments, which Professor Norris identifies as the key to bridging the gap between theory and practice.
• So, Is Query Really All You Don't Need?: The hosts debate the broader implications for parameter efficiency and our understanding of the Transformer architecture. They conclude by questioning whether this simplification is an artifact of smaller models or a fundamental insight that will reshape future designs.
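
For listeners who want the Q, K, V refresher in concrete form, here is a minimal single-head self-attention sketch in NumPy. The toy dimensions and variable names are ours, purely for illustration, and are not drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8           # illustrative toy sizes

X = rng.normal(size=(seq_len, d_model))      # token representations
W_Q = rng.normal(size=(d_model, d_head))     # Query: what each token is looking for
W_K = rng.normal(size=(d_model, d_head))     # Key: what each token offers to be matched on
W_V = rng.normal(size=(d_model, d_head))     # Value: what each token contributes when attended to

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_head)           # query-key similarity logits

# Row-wise softmax turns logits into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V                         # attention-weighted mix of values
```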
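
And here is a sketch of the basis-transformation argument from the third segment: since the attention logits only ever see the product W_Q @ W_K.T, the Query matrix can be folded into a re-based Key matrix. This toy version assumes, as the episode says the paper does, purely linear projections with no Layer Normalization; the square-matrix setup and names are our illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 4, 8                     # square projections so the identity can play Query

X = rng.normal(size=(seq_len, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

# Standard attention logits: (X W_Q)(X W_K)^T = X (W_Q W_K^T) X^T
scores_qk = (X @ W_Q) @ (X @ W_K).T

# Fold W_Q into a re-based Key matrix and use the identity as the Query.
W_K_folded = W_K @ W_Q.T              # the change of basis absorbs W_Q
scores_folded = X @ (X @ W_K_folded).T

assert np.allclose(scores_qk, scores_folded)  # identical logits, one matrix fewer
```

The assert passes because both expressions reduce to X @ W_Q @ W_K.T @ X.T, which is the intuition behind calling the Query matrix mathematically redundant in this simplified setting.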