The paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" introduces a streamlined and highly effective method for aligning large language models (LLMs) with human preferences, offering a simpler alternative to traditional Reinforcement Learning from Human Feedback (RLHF).
The Problem with Traditional RLHF
Standard RLHF pipelines are complex, computationally expensive, and often unstable. They typically involve a multi-stage process: first, training a separate reward model on human preference data, and then using a reinforcement learning algorithm (such as Proximal Policy Optimization, or PPO) to fine-tune the LLM to maximize that learned reward. This requires training multiple models and continuously sampling from the language model during the training loop.
The DPO Solution
To bypass these challenges, the authors propose Direct Preference Optimization (DPO), a method that entirely eliminates the need for an explicit reward model and reinforcement learning. The core insight of DPO is a mathematical change of variables that allows the preference loss to be defined directly as a function of the language model's policy. In essence, the language model acts as its own implicit reward model.
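The change of variables can be stated concisely. In the paper, the implicit reward implied by a policy $\pi_\theta$ relative to a frozen reference policy $\pi_{\mathrm{ref}}$ is

$$r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},$$

where $\beta$ controls how strongly the policy is penalized for drifting from the reference. Substituting this reward into the Bradley-Terry preference model yields the DPO loss over preference pairs $(x, y_w, y_l)$, with $y_w$ preferred over $y_l$:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$$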
How DPO Works
Rather than relying on a complex RL training loop, DPO solves the standard RLHF problem using a simple binary cross-entropy (classification) loss. Given a static dataset of human preferences, the DPO update directly increases the relative log probability of preferred responses compared to dispreferred ones. It also incorporates a dynamic importance weight to prevent the model from degenerating, effectively maintaining a balance between maximizing the reward and not drifting too far from the original reference model.
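The classification-style loss described above can be sketched in a few lines. The following is a minimal illustration for a single preference pair, assuming each argument is the summed token log-probability of a full response under the trainable policy or the frozen reference model; the function name and signature are illustrative, not from the paper's code.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Binary cross-entropy form of the DPO objective for one pair."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): small when the chosen response's implicit
    # reward exceeds the rejected one's, large otherwise.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, both implicit rewards are zero and the loss is log 2; as the policy assigns relatively more probability to the preferred response than the reference does, the loss falls toward zero.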
Key Results
The authors demonstrate that DPO is stable, computationally lightweight, and highly performant. Across various experiments, including controlled sentiment generation, summarization (Reddit TL;DR), and single-turn dialogue (Anthropic HH), DPO matched or outperformed existing methods like PPO-based RLHF. Specifically, DPO achieves a more efficient frontier for maximizing reward while minimizing the KL-divergence (deviation) from the reference policy, doing so without the extensive hyperparameter tuning required by PPO.