Generative AI Group Podcast

Week of 2025-11-02


Alex: Hello and welcome to The Generative AI Group Digest for the week of 02 November 2025!
Maya: We're Alex and Maya.
Alex: All right, let’s jump in. This week’s chat was packed — voice agents, Indic OCR, agentic AI for research, new chips, diffusion LLMs, fine-tuning advice, prompt tooling, and more. Ready to unpack?
Maya: Let’s do it. We’ll take you through the biggest, most useful surprises and what to try this week.
Alex: First up, voice and real-time voice agents. The thread had a lot on TTS and real-time stacks: Sarvam supporting both edge and cloud, a shout-out to Bulbul as a nice-sounding Indic TTS, and folks troubleshooting agents that cut people off in live conversations.
Maya: Right — Chaitanya shared a LiveKit-based config where the agent sometimes interrupts users or starts speaking early. That’s a super practical pain point for anyone building voice agents. The quick checklist from the chat: try different STT models, adjust endpointing and interruption thresholds, tune VAD (voice activity detection), and test with different TTS engines like Sonic-2/3.
Alex: And notice the product-level lesson: even great voice models can feel “chatty” — Sumanth said one model sounded like a running commentary, not a customer assistant. So beyond accuracy, tune style and brevity in your TTS voice or prompt.
Maya: Practical idea: A/B your stack. Swap STT models (Deepgram Flux vs. Nova), try a different VAD (Silero), turn preemptive_generation off, and tune min/max endpointing_delay until the interruptions stop. If you're targeting Indic languages, test Bulbul and Sarvam on-device for latency and naturalness.
Alex: Why this matters: real-time UX failures — cutting people off, long-winded responses — kill product adoption faster than raw model quality. The non-obvious takeaway: endpointing and VAD settings can be as important as model choice. Treat the voice stack holistically.
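[Show notes] A minimal sketch of the knobs discussed above, assuming the livekit-agents Python SDK with the Deepgram, Cartesia, Silero, and OpenAI plugins. The model strings and delay values are placeholders, and parameter names can shift between SDK versions, so treat this as illustrative rather than a drop-in config:

    # Illustrative LiveKit AgentSession wiring; verify names against your installed version.
    # Normally you would build this inside an agent entrypoint and start it on a room.
    from livekit.agents import AgentSession
    from livekit.plugins import cartesia, deepgram, openai, silero

    session = AgentSession(
        stt=deepgram.STT(model="nova-3"),     # swap the model string to A/B Flux vs. Nova
        llm=openai.LLM(model="gpt-4o-mini"),  # placeholder; any LLM plugin works here
        tts=cartesia.TTS(model="sonic-2"),    # try sonic-3 as a second arm of the test
        vad=silero.VAD.load(),                # voice activity detection
        preemptive_generation=False,          # the chat suggested turning this off first
        min_endpointing_delay=0.6,            # wait a bit longer before treating silence as end of turn
        max_endpointing_delay=4.0,            # but cap how long the agent will hold back
        allow_interruptions=True,
    )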
Maya: Next topic, OCR and Indic languages. Ashish asked about extracting English, Bengali, and code-mixed text from form fields. There were a few concrete names: olmOCR 2 from AllenAI (a good baseline), Chandra (which had Apple Silicon install issues for some), and plenty of suggestions to compare them in the Hugging Face playgrounds.
Alex: Also useful: Magvit2 / OmniTokenizer for thinking about image tokens and how images get tokenized for vision models. For dataset work, Argilla and augmentools were recommended for synthetic data and labeling.
Maya: If you're building this: don't trust a single demo. Start with a small, representative set of scanned forms, run olmOCR 2 and Chandra on it, compare field-level accuracy, and measure where the errors happen: Bengali script, code-mixed lines, printed vs. handwritten. If you need a deployed API, look for vendors that wrap these models; otherwise host a trimmed model yourself for faster inference.
Alex: Non-obvious tip: do field-level metrics (is the “name” field correct?) rather than just raw word error rate — that aligns better with how you’ll upsert to Excel or a database. And keep a small human-checked validation set to guard against synthetic-data drift.
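[Show notes] A small, library-free sketch of the field-level scoring idea: compare each extracted field against a human-checked gold value instead of reporting one global word error rate. The field names and the normalization step are placeholders:

    # Field-level OCR accuracy: score each form field against a gold value.
    from collections import defaultdict

    def normalize(value: str) -> str:
        # Placeholder normalization; add script-specific cleanup for Bengali as needed.
        return " ".join(value.strip().lower().split())

    def field_accuracy(predictions: list[dict], gold: list[dict]) -> dict:
        # predictions/gold hold one dict per form, e.g. {"name": "...", "dob": "..."}.
        correct, total = defaultdict(int), defaultdict(int)
        for pred, ref in zip(predictions, gold):
            for field, ref_value in ref.items():
                total[field] += 1
                if normalize(pred.get(field, "")) == normalize(ref_value):
                    correct[field] += 1
        return {field: correct[field] / total[field] for field in total}

    # Per-field scores show whether errors cluster in names, dates, or handwritten fields.
    print(field_accuracy(
        [{"name": "Anita  Sen", "dob": "1990-01-12"}],
        [{"name": "Anita Sen", "dob": "1990-01-12"}],
    ))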
Maya: Speaking of synthetic data, that leads us to dataset creation and fine-tuning. Ambika asked about generating Q&A pairs with GPTs to fine-tune a smaller model. Folks suggested Argilla for annotation workflows, and Cheril gave a great checklist: use multiple embedding models to compare the distribution of synthetic vs. real samples, maintain diversity (maximal marginal relevance, MMR, helps here), and do manual checks on small subsets.
Alex: On the infrastructure side: for small-data fine-tuning, use LoRA or adapters rather than full fine-tuning. Tools recommended were Tinker (waitlisted), LLaMA-Factory, vLLM for serving, torchtune for training, and hosted options like RunPod if you don't want to manage hardware.
Maya: Non-obvious takeaway: with only a few thousand samples it’s often better to start with prompt engineering and LoRA. And always validate synthetic data by embedding-distance checks and a human spot-check — synthetic data can nudge a model in unexpected ways if it’s too homogeneous.
Alex: The fine-tuning how-to boiled down to practical steps: generate modest synthetic data, run embedding distribution comparisons, add diversity, LoRA the model, and if latency is your constraint, focus on smaller models plus a good serving stack rather than assuming fine-tuning will fix latency.
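[Show notes] A rough sketch of the embedding-distance gate described above, using sentence-transformers as one example embedding model. The model name and the 0.8 threshold are assumptions to tune per project, and you would repeat the check with two or three different embedding models as Cheril suggested:

    # Compare synthetic vs. real samples in embedding space before fine-tuning.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    real_samples = ["How do I reset my UPI PIN?", "My card payment failed twice."]
    synthetic_samples = ["How can I change my UPI PIN?", "Why did my card transaction fail?"]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # run again with other embedding models
    real = model.encode(real_samples, normalize_embeddings=True)
    synth = model.encode(synthetic_samples, normalize_embeddings=True)

    # Cosine similarity between the two centroids: a crude but cheap drift signal.
    real_c, synth_c = real.mean(axis=0), synth.mean(axis=0)
    centroid_sim = float(np.dot(real_c, synth_c) / (np.linalg.norm(real_c) * np.linalg.norm(synth_c)))

    # Mean pairwise similarity inside the synthetic set: very high values mean low diversity.
    synth_self_sim = float(np.mean(synth @ synth.T))

    if centroid_sim < 0.8:
        print("Synthetic data drifts from the real distribution: augment, prune, or hand-edit.")
    print(f"centroid similarity={centroid_sim:.3f}, synthetic self-similarity={synth_self_sim:.3f}")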
Maya: Big theme number four — agentic AI and automating research. Paras Chopra’s thread about building AI scientists and exploring cross-domain idea connections drew a lot of attention. Paras said something catchy: “So instead of building a system, build a system that builds that system.” That really stuck with people.
Alex: There’s also the wider industry timeline conversation — Sam Altman’s team publicly setting internal goals like an “automated AI research intern by September 2026” and a “true automated AI researcher by March 2028.” That’s a big signal: labs are moving aggressively to automate parts of research workflows.
Maya: Why it matters: if research tasks with clear, verifiable objectives can be automated sooner, labs — and researchers — need to pick problems that are either hard to verify (human judgement required) or impossible to automate easily. Paras’s non-obvious point: vagueness in natural language can be a strength when you want flexible, generalizable reasoning rather than brittle formalization.
Alex: Practical ideas for teams: design experiments with short, verifiable feedback loops (so an “intern” can evaluate), and think about modular automation — discovery, hypothesis generation, experiment execution, and write-up as separate phases you can automate incrementally. And start capturing metadata so running “reflection” cycles — nightly consolidation — becomes possible.
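[Show notes] The modular-automation idea sketched as a hypothetical scaffold: each phase is a separate, swappable function and the acceptance criterion is a short, machine-checkable objective. Everything here is an assumption about structure, not a description of anyone's actual system:

    # Hypothetical research loop with a verifiable feedback step; phase functions are stubs.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Hypothesis:
        description: str
        config: dict

    def generate_hypotheses(notes: str) -> list[Hypothesis]:
        # Phase 1: discovery / hypothesis generation (could be an LLM call).
        return [Hypothesis("lower learning rate helps", {"lr": 1e-4})]

    def run_experiment(h: Hypothesis) -> float:
        # Phase 2: experiment execution, returning a single verifiable metric.
        return 0.92  # stub standing in for, say, validation accuracy

    def write_up(h: Hypothesis, score: float) -> str:
        # Phase 3: write-up plus metadata capture, so nightly "reflection" passes are possible.
        return f"{h.description}: score={score:.3f}, config={h.config}"

    def research_loop(notes: str, accept: Callable[[float], bool]) -> list[str]:
        reports = []
        for h in generate_hypotheses(notes):
            score = run_experiment(h)
            if accept(score):  # the short, verifiable objective
                reports.append(write_up(h, score))
        return reports

    print(research_loop("lab notes ...", accept=lambda s: s > 0.9))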
Maya: Next up, diffusion-based language models and compression tricks. Mohamed Yasser pointed to Mercury: “Ultra-Fast Language Models Based on Diffusion,” and folks discussed the Gemini diffusion previews. On a related note, Glyph by Z.ai was mentioned: it renders long text into images so a VLM can pack several text tokens into each visual token, with a reported 3–4× token compression in tests.
Alex: Short explanation: diffusion LLMs generate text by iteratively denoising a whole sequence in parallel, rather than predicting one token at a time the way autoregressive models do. And rendering text into images so you can lean on a vision model is a clever engineering hack to squeeze more context into fewer tokens. Try Mercury's API free tier (some folks said there were free token allowances), or experiment with token-compression approaches like Glyph if you're dealing with very long context windows.
Maya: Non-obvious takeaway: these approaches change tradeoffs — latency, hardware needs, and robustness. If you’re handling huge documents, explore image-as-text representations and test whether downstream QA or retrieval quality holds up.
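[Show notes] A toy illustration of the render-text-as-an-image idea using Pillow. This is not Glyph's actual pipeline; the font, page size, and layout are arbitrary choices. The point is to produce an image you can hand to a VLM and then compare downstream QA or retrieval quality against the plain-text baseline:

    # Toy text-to-image rendering for experimenting with image-as-context.
    import textwrap
    from PIL import Image, ImageDraw, ImageFont

    def render_text_page(text: str, width: int = 1024, chars_per_line: int = 100) -> Image.Image:
        font = ImageFont.load_default()  # swap in a real TTF, ideally one covering Indic scripts
        lines = textwrap.wrap(text, width=chars_per_line)
        line_height = 16
        img = Image.new("RGB", (width, line_height * (len(lines) + 2)), "white")
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines):
            draw.text((10, 10 + i * line_height), line, fill="black", font=font)
        return img

    page = render_text_page("A very long document that would blow past your context window. " * 50)
    page.save("page_000.png")  # feed this to your VLM and compare answers against the raw text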
Alex: Big hardware note: a couple of new chip stories came up — an analogue RRAM chip claim (huge throughput/energy gains) and Extropic’s probabilistic TSU idea. These are promising but early; several people cautioned about hype and lack of reproducible demos.
Maya: The takeaway: watch hardware, but plan for software changes. New compute paradigms usually need co-designed algorithms. If you care about deployment timelines, don’t count on magically faster chips next month — but keep an eye on the research for two to three years out.
Alex: Two quick product/tool items we should highlight: prompt management and plan-mode workflows. People recommended Portkey and Humanloop for storing prompts outside code, or just keeping prompts in files under git. Also, Claude Code's plan mode (toggled with Shift+Tab) helps make implementation steps explicit. And Suyash's tip: feed the LLM your evolving answers as part of the initial prompt to avoid repetitive question cycles.
Maya: That “ask it to ask questions” trick is neat — it forces the model to generate a structured plan and follow-up prompts instead of running a back-and-forth where context gets lost.
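[Show notes] The "prompts outside code" idea as a tiny sketch: each system prompt lives in a git-versioned YAML file with a version tag and a one-line changelog. Portkey and Humanloop give you this with a UI; the field names here are just one hypothetical convention:

    # Keep system prompts in versioned files (e.g. prompts/support_agent.yaml) instead of code.
    import yaml  # pip install pyyaml

    EXAMPLE_FILE_CONTENTS = """
    version: 2025-11-02-a
    changelog: shortened persona, removed the follow-up question loop
    system: |
      You are a concise support agent. Answer in at most two sentences.
    """

    def load_prompt(text: str) -> dict:
        return yaml.safe_load(text)

    prompt = load_prompt(EXAMPLE_FILE_CONTENTS)
    print(prompt["version"], "->", prompt["system"].strip())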
Alex: Very quick note on courts and ethics — there’s a move in India to replace stenographers with AI in many lower courts. People raised the right concerns: AI mistakes matter when lives are affected, but others noted the recordings add traceability and could speed access to justice. This is a reminder to pair automation with human review and audit trails.
Maya: Okay, listener tips time. Quick, actionable tips from each of us — one each, and ask the other how they’d apply it.
Alex: My tip: if you're building a voice agent, run a small grid test with two different STT models, the same TTS voice, and three endpointing-delay settings. Collect objective metrics (turn-taking errors per 100 calls) and a short user-satisfaction score. How would you apply that, Maya?
Maya: I’d run that on a small pilot with real users, then use the best-voted stack as the control and iterate. I’d also measure how often agents “monologue” and trim prompts accordingly.
Maya: My tip — for small fine-tunes, generate 200–1,000 synthetic samples, but before you train: run embedding-distance tests using three embedding models (as Cheril suggested). If synthetic samples cluster away from real data, improve diversity or prune. Alex, where would you use this?
Alex: I’d use it before committing to a LoRA run — it’s a cheap gate that saves compute and prevents a finetune from drifting. If distributions match, proceed; if not, augment or hand-edit.
Maya: One more tiny tip since we touched on prompts — store system prompts in a repo or Portkey, and version changes with each experiment. Treat prompts like code.
Alex: Agreed — and add a short changelog so you can roll back to the prompt that actually worked.
Maya: All right, that’s our digest. Thanks to everyone who contributed to the chat this week — Paras Chopra, Mohamed Yasser, Ashish (tp53), Ambika, Chaitanya, and dozens more — you all packed this with practical threads.
Alex: See you next week. Keep experimenting, validate with small tests, and don’t forget to measure real user impact.
Maya: Bye for now — stay curious and build responsibly.