March 24, 2026

UX and AI Digest Episode 2

26 minutes

Evaluating AI Agents, Claude's Computer Access & Prompt-Only Enterprise Software

🔬 EVA — A Framework for Evaluating Voice Agents

I hadn't realised we lacked a proper evaluation framework for voice agents — this one from Hugging Face caught my attention
What I like: it combines two dimensions I've always thought should go together — accuracy (task completion, faithfulness, speech fidelity) and experience (conciseness, conversational flow, turn-taking)
My question: is the "experience" side actually measured with real end users, or just by the designers?
This connects to a three-step evaluation model I keep coming back to: define your ingredients, evaluate internally, then validate with users — and compare the gap
I'll dedicate a full episode to this, but the short version is: if you want to elicit trust or satisfaction, you need to know which product attributes actually produce those outcomes
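
A rough sketch of that three-step model in Python, for anyone who wants to try it on their own agent. Everything here is an assumption for illustration: the attribute names are borrowed loosely from the accuracy and experience dimensions above, the 1-5 rating scale is invented, and the ratings are made up. It is not EVA's actual scoring method.

```python
from statistics import mean

# Step 1: define the ingredients (hypothetical attribute names).
ATTRIBUTES = ["task_completion", "faithfulness", "conciseness", "turn_taking"]


def attribute_means(ratings):
    """Average each attribute over a list of rating dicts (1-5 scale assumed)."""
    return {attr: mean(r[attr] for r in ratings) for attr in ATTRIBUTES}


def evaluation_gap(internal_ratings, user_ratings):
    """User mean minus internal mean per attribute.

    A negative value means the team rated the attribute higher than users did.
    """
    internal = attribute_means(internal_ratings)
    users = attribute_means(user_ratings)
    return {attr: round(users[attr] - internal[attr], 2) for attr in ATTRIBUTES}


# Step 2: internal evaluation (made-up ratings from the design team).
internal_ratings = [
    {"task_completion": 5, "faithfulness": 4, "conciseness": 4, "turn_taking": 5},
    {"task_completion": 4, "faithfulness": 4, "conciseness": 5, "turn_taking": 4},
]

# Step 3: validation with users (made-up ratings from end users).
user_ratings = [
    {"task_completion": 4, "faithfulness": 4, "conciseness": 3, "turn_taking": 2},
    {"task_completion": 5, "faithfulness": 4, "conciseness": 2, "turn_taking": 3},
]

# Compare the gap, largest overestimate first.
gap = evaluation_gap(internal_ratings, user_ratings)
for attr, delta in sorted(gap.items(), key=lambda kv: kv[1]):
    print(f"{attr}: {delta:+.2f}")
```

With these invented numbers, the team and the users agree on the accuracy attributes, while the experience attributes come out two points lower for users. That kind of gap is the thing worth looking for.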

🤖 Claude + Cowork — AI With Access to Your Computer

Cowork now lets you authorise Claude to access your files and folders so it can act on your behalf even when you're away
I'm genuinely torn — amazed by the technology, but uncomfortable with the direction
My concern isn't the capability itself but the pattern: LLMs arrive, and suddenly we open the gates to everything (recording, transcription, computer access) as if these things naturally belong together
My rule of thumb: if you're unsure whether your data is being used to improve the product, assume it is
I'd love to see more push for private, self-hosted LLMs — but the honest tension is that commercial ones will keep winning on convenience because they have more data to train on
It's not even apples to apples — and that's what makes this hard

🖥️ Aragon — What If Enterprise Software Was Just a Prompt?

Startup Aragon raised $12M at a $100M valuation to replace enterprise tools like Salesforce, Jira, and Tableau with a single LLM interface
Their thesis: buttons and menus are dead, and future business will be done by prompt
My honest reaction: I get why this is being explored — we're mapping the edges of a new territory and seeing what sticks
But one modality for everything? I'm not convinced — when I was building my own website, I actually wanted both: LLM for generation, drag-and-drop for fine-tuning — and that product barely exists yet
Users have 10+ years of muscle memory with their tools — strip that away and you're not simplifying, you're adding friction
Nielsen's heuristics exist for a reason: people need control, exit doors, and multiple ways to accomplish a task

By Jeremy