Evaluating AI Agents, Claude's Computer Access & Prompt-Only Enterprise Software
🔬 EVA: A Framework for Evaluating Voice Agents
- I hadn't realised we lacked a proper evaluation framework for voice agents; this one from Hugging Face caught my attention
- What I like: it combines two dimensions I've always thought should go together, accuracy (task completion, faithfulness, speech fidelity) and experience (conciseness, conversational flow, turn-taking)
- My question: is the "experience" side actually measured with real end users, or just by the designers?
- This connects to a three-step evaluation model I keep coming back to: define your ingredients, evaluate internally, then validate with users, and compare the gap
- I'll dedicate a full episode to this, but the short version is: if you want to elicit trust or satisfaction, you need to know which product attributes actually produce those outcomes
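To make the "compare the gap" step concrete, here is a minimal Python sketch. It is my own illustration, not part of EVA: the attribute names, the 1-5 rating scale, and the `attribute_gaps` helper are all assumptions. It averages internal ratings and user ratings per attribute and reports the difference.

```python
from statistics import mean


def attribute_gaps(internal, users):
    """Return (mean user score - mean internal score) per shared attribute.

    Both arguments map an attribute name to a list of ratings.
    The 1-5 scale and the attribute names are hypothetical.
    """
    shared = [name for name in internal if name in users]
    return {
        name: round(mean(users[name]) - mean(internal[name]), 2)
        for name in shared
    }


# Step 1: define the ingredients (illustrative attributes only).
# Step 2: the team rates them internally.
internal_scores = {"task_completion": [4, 5], "conciseness": [4, 4]}
# Step 3: real users rate the same attributes.
user_scores = {"task_completion": [4, 4], "conciseness": [2, 3]}

gaps = attribute_gaps(internal_scores, user_scores)
# A negative gap means users rated the attribute lower than the team did.
worst = min(gaps, key=gaps.get)
print(gaps, worst)
```

A large negative gap flags an attribute the team believes is fine but users do not, which is exactly the kind of mismatch the third step is meant to surface.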
🤖 Claude + Cowork: AI With Access to Your Computer
- Cowork now lets you authorise Claude to access your files and folders so it can act on your behalf even when you're away
- I'm genuinely torn: amazed by the technology, but uncomfortable with the direction
- My concern isn't the capability itself, it's the pattern: LLMs arrive, and suddenly we open the gates to everything (recording, transcription, computer access) as if these things naturally belong together
- My rule of thumb: always assume your data is being used to improve the product; if you have doubts, assume yes
- I'd love to see more push for private, self-hosted LLMs, but the honest tension is that commercial ones will keep winning on convenience because they have more data to train on
- It's not even apples to apples, and that's what makes this hard
🖥️ Aragon: What If Enterprise Software Was Just a Prompt?
- Startup Aragon raised $12M at a $100M valuation to replace enterprise tools like Salesforce, Jira, and Tableau with a single LLM interface
- Their thesis: buttons and menus are dead, future business is done by prompt
- My honest reaction: I get why this is being explored. We're mapping the edges of a new territory and seeing what sticks
- But one modality for everything? I'm not convinced. When I was building my own website, I actually wanted both: LLM for generation, drag-and-drop for fine-tuning. That product barely exists yet
- Users have 10+ years of muscle memory with their tools; strip that away and you're not simplifying, you're adding friction
- Nielsen's heuristics exist for a reason: people need control, exit doors, and multiple ways to accomplish a task