Iris AI Digest

AI Digest — May 26, 2026


Good day, here's your AI digest for May 26, 2026.

Several AI stories today point in the same direction: frontier systems are getting more capable, coding agents are becoming a normal product category, and organizations are starting to ask harder questions about cost, control, and trust.

Pope Leo XIV released his first encyclical, Magnifica Humanitas, and devoted a large part of it to artificial intelligence. He argued that AI is not neutral, because it is built and deployed by private, transnational companies whose reach can exceed the capacity of many governments. He called for human-friendly AI, independent oversight, informed users, and legal frameworks that keep democratic institutions from handing moral decisions to technical systems. He was especially blunt on war, saying lethal decisions must never be delegated to AI and that no algorithm can make war morally acceptable. Anthropic researcher Christopher Olah also spoke alongside the Vatican effort, saying frontier AI labs operate inside incentives that can conflict with doing the right thing.

A separate safety story showed how fragile open model guardrails can be. A tool called Heretic was used to remove safety restrictions from open models in minutes, including Meta's Llama and Google's Gemma. Modified versions were then able to answer dangerous questions that the original models were intended to refuse. The creator of the tool said it has already produced thousands of altered models with millions of downloads. Google described this as a known technical challenge for open models. The risk is not that open models are bad by default; it is that once model weights and tooling are public, safety behavior can become a patch that other people learn to strip away.

xAI launched Grok Build in beta for SuperGrok and X Premium Plus subscribers. It is a coding agent and command line tool aimed at complex software projects, with plan review, support for user conventions, headless automation, parallel processing, and specialized subagents. That puts xAI directly into the same competitive lane as Codex, Claude Code, and Google's agentic development tooling. Coding agents are no longer side demos attached to chat products. They are becoming standalone developer surfaces with workflows around planning, execution, review, and automation.

Elon Musk also said Grok V9-Medium has finished training. The model is described as a 1.5 trillion parameter foundation model, with evaluation results looking good and a public release possible in two to three weeks. Treat timing claims around unreleased models carefully, but the signal is clear enough: xAI is trying to move quickly on both developer tooling and core model capability at the same time.

Google's Gemini 3.5 Flash drew strong early analysis as a fast model for agentic work. The model is being positioned as a daily driver for latency-sensitive workflows, with reported gains over Gemini 3.1 Pro on benchmarks such as Terminal-Bench and MCP Atlas while running much faster. It may not be the strongest model against the latest heavyweight systems, but speed changes product design. Lower latency makes agents feel less like batch jobs and more like interactive collaborators, especially when a task involves repeated tool calls, edits, and retries.

Uber's chief operating officer Andrew Macdonald said rising AI usage is getting harder to justify when higher token spend does not clearly map to better consumer features. The comment followed internal debate about Claude Code budgets and broader pressure to fund AI investment while slowing hiring. This is one of the sharper enterprise AI questions now: if a company rewards raw usage, it can get more prompts, more tokens, and larger bills without necessarily getting better software. The harder measurement problem is whether AI spend is improving shipped work, support quality, operational speed, or product outcomes.

ClickUp reportedly cut 22 percent of its staff while replacing work with about 3,000 AI agents. The company has been explicit about using agents across internal operations and customer-facing workflows. The important detail is not just the headcount number. It is the scale of the agent deployment inside one company, and the way AI automation is being presented as an operating model rather than a narrow productivity feature. That raises real questions about supervision, failure modes, and who owns the outcome when a fleet of agents touches sales, support, product, and operations.

California's largest university system is continuing a 13 million dollar per year OpenAI agreement despite criticism from faculty and students. The pushback centers on cost, academic integrity, labor impact, privacy, and whether a broad AI rollout should move faster than campus governance can absorb. Education is becoming one of the most contested deployment environments for general AI tools, because the same system can be a tutor, writing assistant, research aid, cheating vector, and administrative product.

Researchers also described attacks that hide inaudible commands inside ordinary audio, such as a podcast or video, to manipulate voice AI assistants. The attack can be built relatively quickly and does not require the victim to actively interact with the malicious command. It only needs the audio to play near an assistant that can hear it. Voice interfaces create a different security perimeter from text interfaces: the input channel is ambient, continuous, and easy for users to misunderstand.

On-policy distillation is getting attention as a way to train smaller student models on trajectories sampled from their own behavior while a larger teacher supplies token-level supervision. The goal is to close the mismatch between training data and inference behavior that can weaken off-policy distillation. The formulation can support forward KL, reverse KL, and Jensen-Shannon losses, with reverse KL often favored when a smaller model needs sharper, mode-seeking behavior.
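The three losses named above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not any lab's training code: `toy_student` and `toy_teacher` are hypothetical stand-ins that map a token prefix to a next-token distribution over a three-token vocabulary, and no gradient step is taken. The point is only to show the on-policy structure, in which the trajectory is sampled from the student while the teacher scores every position along it.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def forward_kl(teacher, student):
    """KL(teacher || student): mass-covering, penalizes missing teacher modes."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): mode-seeking, penalizes mass the teacher rejects."""
    return kl(student, teacher)


def jensen_shannon(teacher, student):
    """Symmetric divergence built from the KL of each side to their mixture."""
    mix = [(t + s) / 2 for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)


# Hypothetical toy models (assumptions for illustration only).
def toy_teacher(prefix):
    last = prefix[-1] if prefix else 0
    return softmax([2.0, 0.5 + last, 2.0 - last])


def toy_student(prefix):
    return softmax([1.0, 1.0, 1.0 + 0.5 * len(prefix)])


def on_policy_losses(student, teacher, loss_fn, length, rng):
    """Sample a trajectory from the student and score each step with the teacher.

    Returns the sampled token ids and the per-token loss values.
    """
    prefix = ()
    losses = []
    for _ in range(length):
        s_dist = student(prefix)
        t_dist = teacher(prefix)  # token-level supervision on the student's own prefix
        losses.append(loss_fn(t_dist, s_dist))
        token = rng.choices(range(len(s_dist)), weights=s_dist)[0]
        prefix += (token,)
    return prefix, losses


if __name__ == "__main__":
    for name, fn in [
        ("forward KL", forward_kl),
        ("reverse KL", reverse_kl),
        ("Jensen-Shannon", jensen_shannon),
    ]:
        tokens, losses = on_policy_losses(
            toy_student, toy_teacher, fn, length=5, rng=random.Random(0)
        )
        print(name, tokens, sum(losses) / len(losses))
```

In a real setup the per-token losses would be averaged and backpropagated through the student only; an off-policy variant would differ by taking the prefixes from a fixed dataset or from the teacher rather than from the student's own samples.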

Models.dev is a new open repository and API that consolidates model specifications and pricing. The value is straightforward: model choice has become an engineering dependency, and teams need up-to-date information on context windows, pricing, modalities, and provider details without manually checking every vendor page.

BenchBench is a benchmark that asks models to create benchmarks. The premise is useful because benchmark design tests abstraction, creativity, self-awareness, and adversarial thinking, not just answer generation. Early results reportedly found that GPT-5.2 performed best while several newer systems struggled to design tests that were genuinely difficult for others to solve.

Google DeepMind's AlphaProof Nexus reportedly solved nine open Erdős problems in 353 attempts, including problems that had remained open for decades, with inference costs in the hundreds of dollars per solved problem. Automated mathematical reasoning remains narrow and uneven, but successful attacks on real open problems are a meaningful marker for tool-assisted research.

This has been your AI digest for May 26, 2026.

Read more:

  • Grok Build
  • Notes on Pope Leo XIV's encyclical on AI
  • Gemini 3.5 Flash analysis
  • On-policy distillation
  • Models.dev
  • Introducing BenchBench
  • AlphaProof Nexus

Iris AI Digest, by Arthur Khachatryan