AI Post Transformers

Agentic Aggregation for Long-Horizon AI Tasks


This episode explores a Princeton paper on whether multiple long-running, tool-using AI agent trajectories can be combined more effectively by an “aggregator agent” that selectively inspects the full traces than by simple answer voting or compressed summaries. It explains why aggregation gets much harder for long-horizon agentic tasks such as web research, navigation, and software repair, where useful evidence is scattered across search queries, tool calls, observations, and partial plans rather than concentrated in a neat final answer. The discussion situates the work against self-consistency, repeated sampling, ReAct, and Tree of Thoughts, arguing that the real novelty is not the parallel rollouts themselves but how to reason over archived trajectories after the runs are complete. The episode is relevant to anyone facing a practical bottleneck in scaling AI performance at inference time: where extra compute should be spent, and how to recover the one crucial clue buried inside a pile of messy agent logs.
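To make the contrast in the description concrete, here is a minimal Python sketch. It is not the paper's method: the `Step` and `Trajectory` types, the `majority_vote` and `find_steps` helpers, and the three example runs are all hypothetical, and a plain keyword search stands in for the LLM-driven aggregator agent. It only shows how voting over final answers can miss evidence that sits in the middle of one trace.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    kind: str  # hypothetical labels: "search", "tool_call", "observation", "plan"
    text: str


@dataclass
class Trajectory:
    steps: List[Step]
    final_answer: Optional[str]  # None if the run ended without an answer


def majority_vote(trajectories: List[Trajectory]) -> Optional[str]:
    """Answer-voting baseline: uses only final answers and ignores the traces."""
    answers = [t.final_answer for t in trajectories if t.final_answer is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def find_steps(trajectories: List[Trajectory], keyword: str) -> List[Tuple[int, Step]]:
    """Toy stand-in for selective inspection: return (run index, step) pairs
    whose text mentions the keyword, wherever they occur in a trace."""
    needle = keyword.lower()
    return [
        (i, step)
        for i, t in enumerate(trajectories)
        for step in t.steps
        if needle in step.text.lower()
    ]


# Invented example: two runs agree on an answer, one run holds the key evidence.
runs = [
    Trajectory([Step("search", "company founding year"),
                Step("observation", "blog post says 1987")], "1987"),
    Trajectory([Step("search", "company history"),
                Step("observation", "forum thread says 1987")], "1987"),
    Trajectory([Step("search", "company registration"),
                Step("observation", "official archive record: founded 1989")], "1989"),
]

voted = majority_vote(runs)          # picks the most common final answer
hits = find_steps(runs, "archive")   # surfaces the step buried in the third run
```

In this invented example the vote returns the answer shared by two runs, while the search surfaces the single observation in the third run that an aggregator would want to weigh. A real aggregator would decide what to inspect with a model rather than a fixed keyword.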
Sources:
1. Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks — Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen, 2026
http://arxiv.org/abs/2604.11753
2. Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou, 2023
https://scholar.google.com/scholar?q=Self-Consistency+Improves+Chain+of+Thought+Reasoning+in+Language+Models
3. Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas Griffiths, Yuan Cao, Karthik Narasimhan, 2023
https://scholar.google.com/scholar?q=Tree+of+Thoughts:+Deliberate+Problem+Solving+with+Large+Language+Models
4. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling — Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini, 2024
https://scholar.google.com/scholar?q=Large+Language+Monkeys:+Scaling+Inference+Compute+with+Repeated+Sampling
5. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar, 2024
https://scholar.google.com/scholar?q=Scaling+LLM+Test-Time+Compute+Optimally+can+be+More+Effective+than+Scaling+Model+Parameters
6. ReAct: Synergizing Reasoning and Acting in Language Models — Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2023
https://scholar.google.com/scholar?q=ReAct:+Synergizing+Reasoning+and+Acting+in+Language+Models
7. WebArena: A Realistic Web Environment for Building Autonomous Agents — Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, et al., 2024
https://scholar.google.com/scholar?q=WebArena:+A+Realistic+Web+Environment+for+Building+Autonomous+Agents
8. GAIA: a benchmark for General AI Assistants — Grégoire Mialon and collaborators, 2023
https://scholar.google.com/scholar?q=GAIA:+a+benchmark+for+General+AI+Assistants
9. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, et al., 2024
https://scholar.google.com/scholar?q=SWE-bench:+Can+Language+Models+Resolve+Real-World+GitHub+Issues?
10. Toolformer: Language Models Can Teach Themselves to Use Tools — Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom, 2023
https://scholar.google.com/scholar?q=Toolformer:+Language+Models+Can+Teach+Themselves+to+Use+Tools
11. MRKL Systems: A Modular, Neuro-Symbolic Architecture That Combines Large Language Models, External Knowledge Sources and Discrete Reasoning — Ehud Karpas, Omri Abend, Yonatan Belinkov, et al., 2022
https://scholar.google.com/scholar?q=MRKL+Systems:+A+Modular,+Neuro-Symbolic+Architecture+That+Combines+Large+Language+Models,+External+Knowledge+Sources+and+Discrete+Reasoning
12. Gorilla: Large Language Model Connected with Massive APIs — Patil, Zhang, Wang, et al., 2023
https://scholar.google.com/scholar?q=Gorilla:+Large+Language+Model+Connected+with+Massive+APIs
13. Best-of-N Test-Time Scaling — Charlie Snell, et al., 2025
https://scholar.google.com/scholar?q=Best-of-N+Test-Time+Scaling
14. Inference-Time Scaling for Generalist Reward Modeling / Search-based test-time scaling works cited as Brown et al. 2024, Welleck et al. 2024, Muennighoff et al. 2025, Zhao et al. 2025 — Various, 2024-2025
https://scholar.google.com/scholar?q=Inference-Time+Scaling+for+Generalist+Reward+Modeling+/+Search-based+test-time+scaling+works+cited+as+Brown+et+al.+2024,+Welleck+et+al.+2024,+Muennighoff+et+al.+2025,+Zhao+et+al.+2025
15. BrowseComp — Jason Wei, et al., 2025
https://scholar.google.com/scholar?q=BrowseComp
16. HLE (Humanity's Last Exam) — Phan, et al., 2025
https://scholar.google.com/scholar?q=HLE
17. WebDancer or WebWalker-style web navigation/agent benchmarks and newer deep research benchmarks such as DeepResearch Bench — Various, 2024-2026
https://scholar.google.com/scholar?q=WebDancer+or+WebWalker-style+web+navigation/agent+benchmarks+and+newer+deep+research+benchmarks+such+as+DeepResearch+Bench
18. Reflexion: Language Agents with Verbal Reinforcement Learning — Noah Shinn, Federico Cassano, et al., 2023
https://scholar.google.com/scholar?q=Reflexion:+Language+Agents+with+Verbal+Reinforcement+Learning
19. Language Agent Tree Search / Planning with MCTS-style LLM agents — Various, 2023-2025
https://scholar.google.com/scholar?q=Language+Agent+Tree+Search+/+Planning+with+MCTS-style+LLM+agents
20. iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference — approx. 2025 multi-agent debate authors, 2025
https://scholar.google.com/scholar?q=iMAD:+Intelligent+Multi-Agent+Debate+for+Efficient+and+Accurate+LLM+Inference
21. GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion — approx. 2024/2025 multi-agent debate authors, 2024/2025
https://scholar.google.com/scholar?q=GroupDebate:+Enhancing+the+Efficiency+of+Multi-Agent+Debate+Using+Group+Discussion
22. Improving Multi-Agent Debate with Sparse Communication Topology — approx. 2024/2025 multi-agent debate authors, 2024/2025
https://scholar.google.com/scholar?q=Improving+Multi-Agent+Debate+with+Sparse+Communication+Topology
23. VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation — approx. 2025 verification/safety authors, 2025
https://scholar.google.com/scholar?q=VeriGuard:+Enhancing+LLM+Agent+Safety+via+Verified+Code+Generation
24. Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems — approx. 2025 agent verification authors, 2025
https://scholar.google.com/scholar?q=Verifiability-First+Agents:+Provable+Observability+and+Lightweight+Audit+Agents+for+Controlling+Autonomous+LLM+Systems
25. AI Post Transformers: DeepResearch Arena: Benchmarking LLMs' Research Abilities — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/deepresearch-arena-benchmarking-llms-research-abilities/
26. AI Post Transformers: Experimental Comparison of Agentic and Enhanced RAG — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-14-experimental-comparison-of-agentic-and-e-37d8bc.mp3
27. AI Post Transformers: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/adaptive-test-time-scaling-with-world-models-for-visual-spatial-reasoning/
28. AI Post Transformers: Generalist Reward Modeling with Inference-Time Scaling — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/generalist-reward-modeling-with-inference-time-scaling/
29. AI Post Transformers: Bloom: an open source tool for automated behavioral evaluations — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/bloom-an-open-source-tool-for-automated-behavioral-evaluations/
Interactive Visualization: Agentic Aggregation for Long-Horizon AI Tasks

AI Post Transformers, by mcgrof