Generative AI Group Podcast

Week of 2025-09-28



Alex: Hello and welcome to The Generative AI Group Digest for the week of 28 Sep 2025!
Maya: We're Alex and Maya.
Alex: First up this week — a long thread about MCPs, background agents and how to give agents a runnable “config.” Ananth asked directly, “If I want to provide a MCP like config which any background agents can take as a prompt and run it, is there a way to do it?” That question touched off a lot of useful back-and-forth.
Maya: Right — folks offered a couple of concrete directions. Vrushank pointed to A2A work like lastmile-ai/mcp-agent; SaiVignanMalyala shared MCP Pointer, a Node.js MCP server plus a Chrome extension that turns DOM selections into structured context for agents like Claude Code, Cursor, and Windsurf. And Sid and alchemist reminded us of the security angle: strong auth, sandboxing, and monitoring matter.
Alex: For a non-technical listener: MCP, the Model Context Protocol, is just a standard way to describe what tools an agent can call and what context it needs — think of it as a plug-and-play instruction packet for autonomous helpers. The big insight here is that “configs” can be treated like tool responses, but you only get enforcement if you control the agent runtime. If you don’t control the agent, it can ignore your config.
Maya: Why it matters: as agent tooling spreads, being able to hand off a reliable, auditable config to background agents will let businesses automate workflows safely. Non-obvious takeaway — build your own MCP server if you need guarantees, but make that server simple and heavily instrumented: auth, sandboxing, and behavioral monitoring. Practical idea: wrap your config as a structured tool response, host a local MCP server (look at lastmile-ai/mcp-agent), and pair it with a browser extension like MCP Pointer for UI-driven tasks.
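A minimal sketch of that practical idea, assuming the FastMCP helper from the official MCP Python SDK (the `mcp` package); the server name, tool name, and config fields are illustrative, not from the thread:

```python
# Minimal sketch: expose a runnable "config" as a structured tool response from a
# local MCP server, and log every call for auditing. Names here are illustrative.
import json
import logging
from mcp.server.fastmcp import FastMCP

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("config-server")

mcp = FastMCP("background-agent-config")

# The "config" is just structured data the background agent pulls as a tool response.
AGENT_CONFIG = {
    "goal": "summarize competitor pricing pages weekly",
    "allowed_tools": ["web_search", "extract_table"],
    "output_schema": {"summary": "str", "sources": "list[str]"},
}

@mcp.tool()
def get_agent_config(task: str) -> str:
    """Return the runnable config for a named background task."""
    log.info("config requested for task=%s", task)  # instrument every call
    return json.dumps(AGENT_CONFIG)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, so a local agent can attach
```

Pairing a narrow server like this with request-level auth and sampled output review is the instrumentation the thread kept coming back to.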
Alex: Next major thread — deep research agents. Aankit and others were asking how to build a v1 deep research agent. People shared lots of resources: OpenAI’s Deep Research API and cookbook, LangChain’s open_deep_research repo, langchain-ai/deepagents, Together AI’s open deep research, and NVIDIA’s UDR research.
Maya: The simple playbook from the discussion is great: start narrow. AD recommended building a niche agent from scratch — implement a web_search tool using SearxNG or Tavily, add RAG, then let the agent dynamically choose between RAG and live web search for each sub-query. Shan Shah and ~Ishita flagged the LangChain and blog resources — handy starting points.
Alex: For listeners: deep research agents aren’t magic — they’re orchestrations of search, memory, and iterative planning. Why it matters: these agents let researchers and analysts work faster on long, multi-step investigations. Non-obvious takeaway — don’t over-abstract at v1; a focused vertical agent that knows its domain will outperform a generalist. Practical idea: prototype an agent that uses Tavily or SearxNG for web results, store extracted facts as structured JSON in vectors, and add a small planner that chooses between RAG and live search.
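A minimal sketch of that small planner, where the routing heuristic, the Evidence record, and the rag_search/web_search callables are placeholders for your own stack:

```python
# Sketch of a tiny planner that routes each sub-query to RAG or live web search.
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str              # "rag" or "web"
    text: str
    url: str | None = None

FRESHNESS_HINTS = ("today", "latest", "this week", "price", "release")

def plan_route(sub_query: str) -> str:
    """Crude routing rule: fresh-looking queries go to the web, the rest to RAG."""
    q = sub_query.lower()
    return "web" if any(hint in q for hint in FRESHNESS_HINTS) else "rag"

def answer_sub_query(sub_query: str, rag_search, web_search) -> list[Evidence]:
    route = plan_route(sub_query)
    if route == "web":
        hits = web_search(sub_query)   # e.g. a Tavily or SearxNG wrapper
        return [Evidence("web", h["content"], h.get("url")) for h in hits]
    hits = rag_search(sub_query)       # your vector-store retriever
    return [Evidence("rag", h["text"]) for h in hits]
```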
Maya: Relatedly, people shared benchmarks and datasets — deep_research_bench and papers on evaluation — use those to measure progress.
Alex: Another practical cluster this week: realtime news and search APIs. Luv Singh asked which API to use for realtime news search and got a chorus of answers: Tavily for true realtime, Google-like data; Perplexity for LLM-formatted answers; and the new Perplexity Search API post as a helpful read on design tradeoffs.
Maya: Quick rule of thumb from the thread: if freshness is critical, use Tavily and build a pipeline to format the raw results; if you want immediate, polished LLM answers out of the box, Perplexity is convenient but may lag slightly on freshness. Sankalp’s pointer to Perplexity’s post on architecting their Search API is good for builders who need to understand the design choices.
Alex: Why it matters: agents and RAG systems depend on the freshness and structure of sources. Non-obvious tip — use a two-step pipeline: Tavily for retrieval + a lightweight LLM pass to normalize and format into your app’s schema.
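A hedged sketch of that two-step pipeline, assuming the tavily-python client and an OpenAI-compatible chat client; the model name and output schema are illustrative, and it assumes the model returns bare JSON:

```python
# Step 1: Tavily retrieval for freshness. Step 2: a lightweight LLM pass that
# normalizes raw results into the app's own schema.
import json
from tavily import TavilyClient
from openai import OpenAI

tavily = TavilyClient(api_key="TAVILY_API_KEY")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

NORMALIZE_PROMPT = (
    "Rewrite these raw search results as JSON with keys "
    "'headline', 'one_line_summary', 'url', 'published_hint'. Return a JSON list only."
)

def fresh_news(query: str) -> list[dict]:
    raw = tavily.search(query, topic="news", max_results=5)
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": NORMALIZE_PROMPT},
            {"role": "user", "content": json.dumps(raw["results"])},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```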
Maya: We should talk about coding agents and benchmarks — big discussion there. Swanand asked about decent benchmarks for coding agents like Amp Code, Claude Code, and Junie. People pointed to Terminalbench, ARE from Meta+HuggingFace, and the ARE/PersonaHub work for personas in evals.
Alex: The core debate was: how much of a result is the model and how much is the scaffolding? Nirant argued harnesses are pulling ahead as models plateau locally — meaning the system around the model matters a lot. Practical takeaway: when you compare agents, test end-to-end tasks that include planning, tool use and test invocation, not just single-shot code output.
Maya: For builders: use Terminalbench and ARE-style evaluations, include time/distraction dimensions, and use PersonaHub for realistic user personas in your tests. Non-obvious idea: evaluate the “agent harness” separately — measure planning quality, tool selection accuracy, and test-run success rate.
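A minimal sketch of scoring the harness separately from the model; the TaskTrace fields and metrics are illustrative, not taken from Terminalbench or ARE:

```python
# For each end-to-end task, record whether the plan was executed, whether the right
# tools were picked, and whether the generated tests ran; then aggregate.
from dataclasses import dataclass, field

@dataclass
class TaskTrace:
    task_id: str
    expected_tools: set[str]
    used_tools: set[str]
    plan_steps: int
    plan_steps_executed: int
    tests_passed: bool

@dataclass
class HarnessReport:
    traces: list[TaskTrace] = field(default_factory=list)

    def tool_selection_accuracy(self) -> float:
        hits = sum(
            len(t.expected_tools & t.used_tools) / max(len(t.expected_tools), 1)
            for t in self.traces
        )
        return hits / max(len(self.traces), 1)

    def plan_completion_rate(self) -> float:
        return sum(
            t.plan_steps_executed / max(t.plan_steps, 1) for t in self.traces
        ) / max(len(self.traces), 1)

    def test_run_success_rate(self) -> float:
        return sum(t.tests_passed for t in self.traces) / max(len(self.traces), 1)
```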
Alex: On the model and infra front, Qwen and quantization came up a lot. Prayank asked about FP8 vs 6-bit quantized models. The thread clarified: FP8 (e.g., e4m3) is an 8-bit floating-point format with native support on H100 and some Blackwell/Hopper hardware; it takes more memory than extreme quantization but gives slightly better performance. MLX community releases often use INT8 or INT6 — different formats from FP8.
Maya: Practical takeaway: if you have hardware with native FP8 support (H100 et al.), try the FP8 variant for a performance win. If you’re on CPU or GPUs without FP8 support, MLX int8/6-bit quantized versions might be the better pragmatic choice. And for compliance or latency reasons, lots of teams still self-host on non‑Chinese clouds or use US providers (Groq) — Shan Shah and others call this out.
Alex: Proactive brief features — OpenAI’s Pulse and Huxe — popped up too. People loved the convenience but worried about precision, noise, and control. Ojasvi showed you can actually ask ChatGPT to generate a prompt that mimics Pulse — neat hack.
Maya: Why this matters: proactive, personalized briefs are a powerful UX, but they must be editable — users need to tell the system what to prioritize. Non-obvious takeaway — you can prototype a “daily brief” by building a scheduled job that runs a personalized prompt through your LLM and surfaces a short digest; let users tune topics and mute feeds to avoid notification fatigue.
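A quick way to prototype that scheduled brief; the schedule library, the model name, and the topic/mute lists are assumptions for illustration:

```python
# Scheduled job: run a personalized prompt through an LLM once a day and surface
# a short digest; users tune topics and mute feeds to cut notification fatigue.
import time
import schedule
from openai import OpenAI

llm = OpenAI()
user_topics = ["agent frameworks", "RAG evaluation"]   # user-editable
muted_feeds = ["celebrity news"]                        # user-editable

def daily_brief() -> None:
    prompt = (
        f"Write a 5-bullet morning brief on: {', '.join(user_topics)}. "
        f"Skip anything about: {', '.join(muted_feeds)}. One sentence per bullet."
    )
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)

schedule.every().day.at("07:30").do(daily_brief)

while True:
    schedule.run_pending()
    time.sleep(60)
```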
Alex: A recurring engineering problem in the thread was RAG consistency and span queries. Sangeetha asked why inline citation counts don’t match the number of sources in a Pydantic structured output. G Kuppuram suggested guardrails; AD and others recommended extracting to structured tables for queries that need full-document spans.
Maya: Practical approach: for span queries or “count-like” queries, pre-extract facts into structured records (or a table) and store those records as vector payloads. Use vector filters to find candidates, then reconstruct and validate citations with a final LLM pass. Non-obvious tip — use strict Pydantic schemas with a verification step: if citation counts mismatch, trigger a deterministic recheck or fall back to a filtered retrieval.
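A hedged sketch of that strict-schema-plus-verification step; the citation-marker format and field names are assumptions:

```python
# Pydantic model whose validator checks that inline citation markers like [1], [2]
# line up with the sources list, so a mismatch can trigger a deterministic recheck
# instead of being silently accepted.
import re
from pydantic import BaseModel, model_validator

class CitedAnswer(BaseModel):
    answer: str          # prose containing markers like [1], [2]
    sources: list[str]   # one URL or document id per citation

    @model_validator(mode="after")
    def citations_match_sources(self) -> "CitedAnswer":
        markers = set(re.findall(r"\[(\d+)\]", self.answer))
        if len(markers) != len(self.sources):
            # In a pipeline, catch this and fall back to filtered retrieval
            # or force the model to re-extract.
            raise ValueError(
                f"{len(markers)} citation markers vs {len(self.sources)} sources"
            )
        return self
```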
Alex: One tiny but useful thread: image editing chains. Multiple people (Sagar, Balaji) observed that iterative, multi-turn editing, where each output becomes the next input, tends to degrade image quality. The blunt lesson: avoid long chains of iterative edits when possible.
Maya: Practical idea: if you must do iterative edits, preserve an unedited high-quality source and reapply edits against it rather than chaining compressed outputs. Or keep checkpoints and re-run edits from the best checkpoint to avoid cumulative artifacts.
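A tiny sketch of that reapply-from-source approach; apply_edit stands in for whatever image-edit call you use and is purely a placeholder:

```python
# Keep the untouched original, accumulate edit instructions, and re-render each
# version from the original instead of chaining compressed outputs.
from pathlib import Path

ORIGINAL = Path("photo_original.png")
edits: list[str] = []  # e.g. ["remove background", "warm the lighting"]

def render_version(apply_edit, dst: Path) -> Path:
    """Apply all accumulated edits in one pass against the pristine source."""
    instruction = "; ".join(edits)
    apply_edit(src=ORIGINAL, instruction=instruction, dst=dst)
    return dst

# Usage: append a new instruction, then re-render from the original each time.
# edits.append("crop to 16:9"); render_version(my_edit_fn, Path("v3.png"))
```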
Alex: Okay — listener tips. My tip: if you’re giving agents configs for background work, ship a small MCP server locally and use strong auth and monitoring. That gives you both interoperability and control. Maya, how would you apply that?
Maya: I’d use the MCP server to expose a narrow set of tools only — DOM selection via MCP Pointer for UI tasks, a search tool (Tavily) and a structured extraction tool. I’d log every tool call and sample outputs daily for drift. My tip: for fast prototyping of a deep research agent, start with Langchain’s open_deep_research and a single web_search tool using Tavily or SearxNG; focus on one domain and add RAG selectively. Alex, would you try that?
Alex: Absolutely — I’d start with the Langchain repo, wire in Tavily, and build a small evaluation suite using Deep Research bench to measure retrieval quality and answer faithfulness.
Maya: One more quick tip — if you care about accurate citations in RAG, enforce a Pydantic schema and add a validation step that checks citation counts; if mismatched, force the model to re-extract or return “insufficient citations.” Would you add that to a production pipeline?
Alex: Yep — and tie that to alerts so you can fix prompt templates or the retriever quickly.
Maya: That wraps our round-up for the week. Thanks to everyone in the thread — Ananth, Vrushank, Sid, SaiVignanMalyala, Shan Shah, ~Ishita, AD, Nirant K, Sangeetha, Aankit Roy and many others for the pointers and resources.
Alex: See you next week. Keep experimenting, log everything, and be kind to your models — and your users.
Maya: Bye for now from The Generative AI Group Digest.