Here is a short summary of the paper "ZEPHYR: Direct Distillation of LM Alignment":
The Problem: While smaller, open-source large language models (LLMs) have become significantly more capable through distilled supervised fine-tuning (dSFT), they often struggle with "intent alignment": they do not behave in accordance with human preferences or respond well to natural prompts when compared to proprietary models or models trained with costly human feedback.
The Solution: The researchers introduce distilled direct preference optimization (dDPO), a highly efficient method that aligns a small open LLM entirely through distillation, without requiring any human annotation or any sampling during fine-tuning.
The methodology consists of three main steps:
- Distilled Supervised Fine-Tuning (dSFT): Initial training using a large-scale dataset of instructions and responses (UltraChat).
- AI Feedback (AIF) Collection: Gathering responses from an ensemble of different language models and using a powerful teacher model (like GPT-4) to score and rank these responses to create preference data (UltraFeedback).
- dDPO: Optimizing the dSFT model using this static AI preference data to maximize the likelihood of ranking the preferred responses over the rejected ones.
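The dDPO step above optimizes the standard DPO objective on the static AI preference pairs. The following is a minimal sketch of the per-pair loss, not the paper's implementation: the function name and scalar arguments are illustrative, and in practice each log-probability is the sum of token log-probabilities for a full response under the policy being trained or under the frozen dSFT reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    logp_*     : log-prob of the response under the policy being trained
    ref_logp_* : log-prob under the frozen dSFT reference model
    beta       : strength of the implicit KL penalty toward the reference
    """
    # Reward margin: how much more the policy (relative to the reference)
    # favors the preferred response over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin: minimizing this maximizes
    # the likelihood of ranking the preferred response over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; training drives it toward zero by increasing the policy's relative preference for the chosen responses across the dataset. Because the preference data is static, no sampling from the policy is needed during optimization.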
The Results: By applying this method to the Mistral-7B base model, the authors created ZEPHYR-7B, which sets a new state of the art on chat benchmarks (MT-Bench and AlpacaEval) among 7B-parameter models. Remarkably, ZEPHYR-7B achieves conversational performance comparable to, and in some cases surpassing, much larger 70B-parameter models trained with human feedback, such as LLAMA2-CHAT-70B. Furthermore, the entire training process can be completed in just a few hours.