Today in arXiv AI

Factuality, Alignment, and Edge Efficiency



Generated with Google NotebookLM

This week’s roundup distills 15 recent arXiv papers that are bending the curve on large‑language‑model accuracy, efficiency, and safety:

  • Truth under pressure – A RAG‑powered adversarial pipeline shreds GPT‑4o’s fact‑checker, proving that evaluators need retrieval too.

  • API docs, minus the bloat – Smart chunking plus a “Discovery Agent” trims OpenAPI specs while raising endpoint recall.

  • Alignment, re‑weighted – FocalPO boosts Direct Preference Optimisation by doubling‑down on pairs the model already ranks right.

  • Seeing, thinking, scheming – MultiMind merges facial cues, vocal tone, Theory‑of‑Mind, and MCTS to out‑bluff humans in Werewolf.

  • Token thrift as design law – A manifesto argues that pruning isn’t just for speed; it cuts hallucinations and stabilises training.

  • Cheaper RL finetunes – MoPPS predicts prompt difficulty on‑the‑fly and slashes rollout counts.

  • Edge‑ready inference – DeltaLLM exploits temporal sparsity, while HCAttention squeezes KV cache to 25 %—letting Llama‑3‑8B read 4 M tokens on a single A100.

  • LLMs that draw – A ReAct + RAG agent converts natural‑language briefs straight into AutoCAD code.

  • Tool orchestration at scale – SciToolAgent uses a knowledge‑graph spine to automate hundreds of domain‑specific apps.

  • Where models get lost – MazeEval exposes huge language‑bound gaps in spatial navigation.

  • Red‑team reality check – 1.8 M attacks show nearly every frontier agent breaks policy within 100 prompts; robustness ≠ size.

  • Proving corrigibility – Five lexicographic “core safety values” deliver the first provable obedience guarantees.

  • Open‑source powerhouse – Kimi K2 (a 1 T‑parameter MoE with 32 B active params) tops agentic leaderboards with a new MuonClip optimiser.

From adversarial fact‑checking to provably safe utility heads, these papers reveal the state of the art—and the cracks that still need sealing. Tune in for a 30‑minute tour of:

  • efficiency tricks that make billion‑param models mobile‑friendly,

  • alignment methods that actually move preferences,

  • benchmarks that stress‑test reasoning across space, language, and social strategy, and

  • frameworks that weld LLMs to real‑world tools without burning GPU budgets.

If you build with, bet on, or just geek out over LLMs, this episode will arm you with the freshest insights—and plenty of rabbit holes for the weekend.


Sources:

https://arxiv.org/pdf/2410.14651
https://arxiv.org/pdf/2411.19804
https://arxiv.org/pdf/2501.06645
https://arxiv.org/pdf/2504.18039
https://arxiv.org/pdf/2505.18227
https://arxiv.org/pdf/2507.04632
https://arxiv.org/pdf/2507.19608
https://arxiv.org/pdf/2507.19771
https://arxiv.org/pdf/2507.19823
https://arxiv.org/pdf/2507.20280
https://arxiv.org/pdf/2507.20395
https://arxiv.org/pdf/2507.20526
https://arxiv.org/pdf/2507.20534
https://arxiv.org/pdf/2507.20796
https://arxiv.org/pdf/2507.20964


Today in arXiv AI, by Scot Bearss