The paper presents Llama 3, a new family of foundation language models developed by Meta, comprising models with 8B and 70B parameters and a flagship 405B-parameter model. These models natively support multilinguality, coding, reasoning, and tool use, and the 405B model can process a context window of up to 128K tokens.
The development of Llama 3 focuses on optimizing three levers: data, scale, and managing complexity:
- Pre-training: The models were pre-trained on a massive corpus of 15.6 trillion tokens, substantially larger and of higher quality than the data used for Llama 2.
- Post-training: The models underwent rigorous alignment using supervised finetuning (SFT), rejection sampling, and direct preference optimization (DPO) to better follow instructions and ensure helpfulness and harmlessness.
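The DPO step mentioned above trains the policy directly on preference pairs, without a separate reward model. A minimal illustrative sketch of the per-pair DPO loss is shown below; this is a simplified scalar version for clarity, not the paper's implementation, and the function name `dpo_loss` is hypothetical:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct preference optimization (DPO) loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference
    model. beta controls how far the policy may drift from the reference.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response (relative to the reference) than it favors the rejected one.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the scaled margin; the loss shrinks as the
    # policy assigns relatively higher likelihood to the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

In practice these log-probabilities come from batched forward passes over whole responses, but the scalar form makes the objective's behavior easy to see: at zero margin the loss equals log 2, and it decreases monotonically as the policy separates chosen from rejected responses.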
Extensive empirical and human evaluations demonstrate that the flagship 405B model performs on par with leading closed-source models such as GPT-4 across a wide variety of tasks, while the 8B and 70B models deliver best-in-class performance among models of similar size.
The paper also highlights robust safety measures, including the release of Llama Guard 3 for system-level input and output safety. Finally, the authors detail ongoing, unreleased experiments integrating image, video, and speech capabilities into Llama 3 using a compositional approach, which has shown competitive results against state-of-the-art multimodal models.