AI Post Transformers

By mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, pr... more

· Technology

Download on the App Store

Download on the App Store

Get it on Google Play

FAQs about AI Post Transformers:

How many episodes does AI Post Transformers have?

The podcast currently has 567 episodes available.

AI Post Transformers episodes:

October 26, 2025 LithOS: Operating System for Efficient GPU Machine Learning
This 2025 CMU paper introduces LithOS, a novel operating system designed to improve the efficiency and utilization of Graphics Processing Units (GPUs) for machine learning (ML) workloads in data centers. The authors argue that current GPU management solutions, such as NVIDIA's MPS and MIG, are too coarse-grained, leading to low utilization and high latency in multi-tenant environments. LithOS proposes a transparent, OS-level approach featuring a TPC Scheduler for fine-grained resource control, a Kernel Atomizer that breaks up monolithic kernels to reduce head-of-line blocking, and mechanisms for hardware right-sizing and transparent power management (DVFS). Evaluation results demonstrate that LithOS significantly reduces tail latencies (up to 13× compared to MPS) and improves aggregate throughput in both inference-only and hybrid inference/training scenarios while achieving substantial capacity and energy savings. Overall, the work establishes a foundation for developing true operating systems for GPUs to address the growing efficiency crisis in ML infrastructure. Source: https://www.cs.cmu.edu/~dskarlat/publications/lithos_sosp25.pdf
...more
19min
October 26, 2025 LLMs Learning from Verbal Feedback Without Scalar Rewards
The September 25, 2025 collaboration between Sea AI Lab, SUTD, NUS, NTU and University of Waterloo paper proposes an alternative to traditional Reinforcement Learning (RL) for Large Language Models (LLMs) by introducing the Feedback-Conditional Policy (FCP), which learns directly from rich verbal feedback instead of compressing it into scalar rewards. The authors argue that scalarization leads to information loss, ambiguity, and imbalanced reward scales, hindering effective learning from natural language critiques. FCP reframes learning as a conditional generation problem, approximating the feedback-conditional posterior through maximum likelihood training on offline data and then using an online bootstrapping stage conditioned on positive feedback to refine the policy. This approach, which draws inspiration from text-to-image generation's ability to combine mixed captions (as shown in the accompanying image), allows LLMs to leverage their inherent linguistic priors for better control and performance, matching or surpassing scalar-based RL methods on reasoning tasks. Source: https://arxiv.org/pdf/2509.22638
...more
16min
October 26, 2025 Lp-Reg: Low-Probability Tokens Sustain RL Exploration
The October 3, 2025 paper by Tencent introduces a reinforcement learning technique called Low-probability Regularization (Lp-Reg) designed to overcome the exploration collapse bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. The authors identify that performance plateaus because training systematically eliminates crucial, low-probability tokens, termed reasoning sparks, which are necessary for diverse reasoning paths. Previous methods relying on overall policy entropy fail because they indiscriminately amplify both these valuable sparks and irrelevant noise tokens. Lp-Reg addresses this by constructing a less-noisy proxy distribution that filters out irrelevant tokens and regularizes the policy to preserve the valuable low-probability sparks, leading to stable on-policy training and achieving state-of-the-art accuracy on mathematical reasoning benchmarks. Source: https://arxiv.org/pdf/2510.03222
...more
19min
October 26, 2025 MASA: Meta-Awareness via Self-Alignment Reinforcement Learning
The September 26, 2025 paper introduces a novel reinforcement learning framework called Meta-Awareness via Self-Alignment (MASA), designed to enhance the reasoning capabilities and efficiency of large language models (LLMs) by improving their meta-awareness, or the ability to know "how to think." MASA works by creating parallel rollouts for both solution paths and meta-predictions (like predicted length and difficulty) and rewarding the alignment between these self-generated signals, thus avoiding reliance on external training sources. A more efficient variant, MASA-efficient, leverages these meta-predictions for predictive gating and early cutoff during training, substantially reducing computation time. Experimental results show that MASA significantly improves accuracy and generalization across mathematical, logical, scientific, and coding benchmarks while accelerating the training process by over 1.28 times compared to the GRPO baseline. Source: https://arxiv.org/pdf/2510.03259
...more
13min
October 26, 2025 Open-o3 Video: Spatio-Temporal Grounded Reasoning
The October 25, 2025 Bytedance paper introduces Open-o3 Video, a novel framework developed by researchers from Peking University and ByteDance, aimed at advancing video reasoning by incorporating explicit spatio-temporal evidence. Unlike prior models that only generate textual rationales, Open-o3 Video explicitly highlights key timestamps and bounding boxes to ground its answers in visual observations. To achieve this, the authors curate two new datasets, STGR-CoT-30k and STGR-RL-36k, and utilize a two-stage training strategy involving supervised fine-tuning and Group Sequence Policy Optimization (GSPO) with specialized rewards. This approach, which includes adaptive temporal proximity and temporal gating mechanisms, significantly improves performance on the V-STAR benchmark and other video understanding tasks, making video reasoning more accurate and verifiable. Source: https://arxiv.org/pdf/2510.20579
...more
19min
October 26, 2025 Ring-linear: Efficient Hybrid Architecture for Long-Context Reasoning
This October 23, 2025 technical report from the Ling Team introduces the Ring-linear model series, specifically Ring-mini-linear-2.0 and Ring-flash-linear-2.0, which utilize a hybrid attention architecture combining linear and softmax attention mechanisms to enhance efficiency in long-context reasoning. The paper explains how this architecture, featuring Mixture-of-Experts (MoE) and advanced FP8 training optimization through kernels like LingHe, significantly reduces inference costs and improves training throughput. A major focus is on systematic training-inference alignment to achieve stable reinforcement learning (RL) training, addressing disparities in components like the KV Cache and RMSNorm that often lead to RL collapse in long-context models. Finally, the report presents benchmark results demonstrating that the Ring-linear models maintain state-of-the-art performance across various complex reasoning tasks compared to similar-scale counterparts. Source: https://arxiv.org/pdf/2510.19338
...more
15min
October 26, 2025 STAR: Sub-Entry Sharing TLB for Multi-Instance GPU Efficiency
These April 29, 2024 paper provides an overview of the challenges associated with using NVIDIA's Multi-Instance GPU (MIG) technology, specifically focusing on the address translation mechanism in the A100 GPU. The papers reveal, primarily through reverse-engineering efforts, that the L2 and L3 Translation Lookaside Buffers (TLBs) utilize a compression design where each entry comprises 16 sub-entries to enhance memory capacity management. A major problem arises because the L3 TLB is shared across all isolated MIG instances, causing contention that results in frequent evictions and low utilization of these sub-entries. To mitigate this performance degradation, the sources propose STAR, a novel hardware solution that dynamically enables the sharing of TLB sub-entries among different base addresses to improve overall efficiency. Source: https://arxiv.org/pdf/2404.18361
...more
18min
October 26, 2025 Strata: Efficient Hierarchical Context Caching for LLM Serving
The August 26, 2025 collaboration between Stanford, NVIDIA, Shanghai Jiao Tong University, University of Michigan, University of Colorado Boulder, Carnegie Mellon University introduces Strata, a hierarchical context caching framework designed to improve the performance of serving Large Language Models (LLMs) with long context windows. The core problem Strata addresses is that while caching key-value (KV) states is essential for efficiency, transferring large, fragmented cached contexts from slower memory tiers (like CPU memory) back to the GPU creates severe I/O bottlenecks and performance stalls. It also describes why paged attention creates data fragmentation when offloading even though its goal is to address memory fragmentation. That is paged attention becomes an issue when using offloading due to large contexts. Strata overcomes these issues through two main innovations: GPU-assisted I/O to mitigate data fragmentation and achieve high bandwidth utilization, and cache-aware request scheduling to intelligently form balanced batches and overlap unavoidable I/O stalls with complementary tasks. The evaluation shows that Strata significantly reduces the Time-To-First-Token (TTFT) and increases throughput compared to state-of-the-art serving systems like vLLM + LMCache and TensorRT-LLM on long-context benchmarks. Source: https://arxiv.org/html/2508.18572v1
...more
17min
October 26, 2025 Structural Understanding of LLM Overthinking
The October 10, 2025 paper from the University of Michigan and Google DeepMind concerning the phenomenon of "overthinking" in Large Language Models (LLMs) that utilize chain-of-thought (CoT) reasoning. The authors introduce a systematic analyzer called TRACE to structurally examine an LLM's thought process, decomposing it into sub-thoughts and progression graphs to move beyond superficial, length-based metrics of overthinking. Benchmarking across various tasks reveals that "thinking models" often waste significant computational resources on simple queries without notable accuracy gains, operating five to twenty times slower than non-thinking counterparts. The study identifies two primary overthinking patterns—Explorer (characterized by over-exploration and backtracking) and Late Landing (marked by excessive self-verification)—and proposes a utility-based redefinition of overthinking focused on diminishing marginal returns of subsequent thoughts. Source: https://arxiv.org/pdf/2510.07880
...more
18min
October 26, 2025 Stuck in the Matrix: LLM Spatial Reasoning
The October 23 2025 research paper probes the spatial reasoning capabilities of Large Language Models (LLMs) when processing text-based inputs, specifically focusing on how performance degrades as task complexity increases. Using a suite of five grid-based tasks—including quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding—the authors tested four models: GPT-4o, GPT-4.1, and two variants of Claude 3.7. The key finding is that while models achieve moderate success on smaller grids, their accuracy rapidly deteriorates as grid dimensions scale up, demonstrating a significant gap between linguistic and robust spatial representation in their architectures. Notably, the Anthropic models consistently outperformed the OpenAI variants, though all models exhibited weaknesses, such as frequent miscounting, mathematical errors, and difficulty maintaining board state in complex scenarios. The study concludes by emphasizing the fragility of LLM spatial reasoning at scale and suggesting future work on improving text-based spatial data representation and mathematical capabilities. Source: https://arxiv.org/pdf/2510.20198
...more
14min

FAQs about AI Post Transformers:

How many episodes does AI Post Transformers have?

The podcast currently has 567 episodes available.