


In this "Office Hours" episode, Arena co-founder and CEO Dr. Anastasios Angelopoulos discusses how a Berkeley side project became the gold-standard evaluation platform used by every frontier lab. He explains that benchmarks fundamentally cannot capture post-deployment reality, and describes what happens when a model meets tens of millions of real users doing real work across coding, creative writing, image editing, and increasingly agentic tasks. Angelopoulos argues Arena is structurally ungameable because the question distribution is constantly refreshed by new users, and shares the origin story of NanoBanana, Google's stealth image model, whose viral run on Arena marked the first moment the Gemini app inflected globally with consumers. He unpacks the safety-versus-steerability tradeoff in rankings (suggesting LLMs may eventually need movie-style ratings) and explains why neutrality is not just a matter of ethics but the economic foundation of Arena's business. Looking ahead, he predicts long-running agents will become the central unit of work, fundamentally changing what reliability means.
By Anjney Midha