Generative AI Group Podcast

Week of 2025-12-07



Alex: Hello and welcome to The Generative AI Group Digest for the week of 07 December 2025!
Maya: We're Alex and Maya.
Alex: Alright, let’s jump in. First topic — using NotebookLM as a learning and summarization hack. Mohamed Yasser shared a neat workflow: "Back up chat history -> add it as source in notebooklm -> generate deck with some custom prompt." That got a lot of reactions.
Maya: That simple pipeline really captures the modern learner's loop. Why is that useful for non-technical listeners?
Alex: Because it turns scattered conversations and notes into a coherent study deck — like turning a messy notebook into a guided course. People in the chat — Luv and DJ — also talked about dropping whitepapers into NotebookLM to build a mental map before diving into details. It’s about getting the overview first.
Maya: Practical idea: back up your chat logs or docs, ingest them into NotebookLM, then ask for a 5-slide executive deck or a 10-minute explainer. Use a custom prompt to enforce the tone and audience level. Tools to try: NotebookLM itself and vLLM if you need faster local inference.
Alex: Non-obvious takeaway: don’t just dump everything — chunk thoughtfully. Several folks asked about NotebookLM’s chunking and retrieval strategy, so expect you might need to pre-chunk long threads or add metadata to improve retrieval.
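Alex: If you do pre-chunk, the pass can be tiny. Here's a rough Python sketch; the chunk sizes and the metadata header format are our own arbitrary choices, not anything NotebookLM prescribes:

```python
# Rough sketch: split a long chat export into overlapping chunks,
# each prefixed with lightweight positional metadata, before
# uploading them as separate sources. Sizes are just starting points.

def chunk_chat_log(text: str, chunk_chars: int = 4000, overlap: int = 400):
    """Yield (header, chunk) pairs with simple positional metadata."""
    step = chunk_chars - overlap
    for i, start in enumerate(range(0, len(text), step)):
        chunk = text[start:start + chunk_chars]
        if not chunk.strip():
            continue
        header = f"[source: chat-export | part {i + 1} | chars {start}-{start + len(chunk)}]"
        yield header, chunk

with open("chat_export.txt", encoding="utf-8") as f:
    log = f.read()

for header, chunk in chunk_chat_log(log):
    print(header)  # in practice, write each chunk to its own file
```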
Maya: Nice. Next up — math, formalization, and LLMs. There was a lively thread: Paras Chopra said "Math is software," people brought up Project Euler, and others recommended Lean for formal proofs.
Alex: Right — big insight: LLMs are great at explanations, but niche or pure-math reasoning still trips them up. Rⁿ called out that "even the best reasoning models do very poorly at niche problems." That’s why projects like Lean (a theorem prover) and formal methods still matter.
Maya: In plain terms, Lean or Coq are tools that let you write math so a computer can check every step — it’s formal, provable math. So if you need iron-clad correctness (proofs, safety-critical math), combine LLMs for intuition and human-verified formal tools for the final step.
Alex: Practical idea: use an LLM to draft a proof sketch, then translate the core lemmas into Lean for verification. For learning, use Project Euler-style problems to build the chain-of-thought with small steps, not big leaps.
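Alex: And to make "computer-checked" concrete, here's a toy lemma in Lean 4, just an illustration rather than anything from the thread. If any step were wrong, the file simply would not compile:

```lean
-- A toy machine-checked lemma in Lean 4: the kernel verifies every
-- inference, so there is no room for a plausible-but-wrong leap.
theorem double_mono (a b : Nat) (h : a ≤ b) : a + a ≤ b + b :=
  Nat.add_le_add h h

-- Definitional facts are checked by direct computation.
example : 2 + 2 = 4 := rfl
```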
Maya: Third big theme: agents, tools, and orchestration. There was debate around MCP (the Model Context Protocol), with DJ noting Anthropic moving away from MCP, and many people discussing different orchestration patterns.
Alex: The key point here is that one abstraction rarely fits all. Rajesh RS and others pointed out MCPs can be handy for some use cases — especially when you want to stitch existing APIs — but they’re not the only approach. Manan mentioned building workflows with a few subagents for retrieval and code context gathering.
Maya: For listeners: an "agent" is a system that can decide on its own to call tools, fetch web pages, or hit APIs. Non-obvious takeaway: start small — one or two specialized agents or subagents often beat a huge, monolithic orchestrator. Use frameworks like Langflow or Portkey if you want less plumbing, and keep logs for debugging.
Alex: And if you're exploring agent research, swyx’s "Agent Labs" thesis came up — think about shipping agent-focused dev tools instead of more model-centric offerings. Practical idea: prototype a single-task agent, log its failures, and iterate — don't try to build a universal assistant on day one.
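Alex: A minimal version of that "one task, log everything" loop fits in a screen of Python. In this sketch, call_llm and the tool registry are placeholders you'd swap for your real stack:

```python
import json
import time

# Placeholder model call: wire this to your actual LLM client.
# Returning a canned answer keeps the sketch runnable end to end.
def call_llm(prompt: str) -> dict:
    return {"answer": "stub answer"}

# Hypothetical tool registry for one narrow task.
TOOLS = {"search_docs": lambda query: f"top hits for {query!r}"}

def run_single_task_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # The model either calls a tool or returns a final answer.
        decision = call_llm("\n".join(history))
        # Log every step so failures are easy to replay later.
        with open("agent_log.jsonl", "a") as log:
            log.write(json.dumps({"ts": time.time(), "task": task,
                                  "decision": decision}) + "\n")
        if "answer" in decision:
            return decision["answer"]
        tool = TOOLS.get(decision.get("tool", ""))
        observation = tool(decision.get("args", "")) if tool else "error: unknown tool"
        history.append(f"Observation: {observation}")
    return "gave up after max_steps"

print(run_single_task_agent("find our refund policy"))
```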
Maya: Speaking of logging and iteration — evals for RAG pipelines were a huge thread. Marmik asked how people curate evals, and the group had solid, pragmatic answers: start with a V1 benchmark, dogfood it, and iterate with production logs.
Alex: Tools named included Arize Phoenix, Promptfoo with its YAML-based config approach, Groq's OpenBench, and simple CSV pass/fail setups. The big insight: evaluation tooling should reduce friction, not increase it. Many recommended starting simple — an SME-created gold set, plus product logs to expand coverage.
Maya: Practical idea: create a 200–300 question V1 benchmark across top user intents, run nightly checks, and add failing prod examples to the benchmark — avoid relying just on synthetic data because it can overfit.
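Maya: And the zero-framework version of that harness really is small. Here, ask_model is a placeholder for whatever system you're testing, and the substring pass rule is a naive starting point you'd refine:

```python
import csv

def ask_model(question: str) -> str:
    # Placeholder: call your RAG system or model here.
    return ""

def run_benchmark(path: str = "gold_set.csv") -> None:
    # Expects a CSV with columns: question, expected.
    passed = failed = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            answer = ask_model(row["question"])
            if row["expected"].lower() in answer.lower():  # naive pass rule
                passed += 1
            else:
                failed += 1
                print(f"FAIL: {row['question']!r} -> {answer!r}")
    print(f"{passed} passed, {failed} failed")

run_benchmark()
```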
Alex: Next important topic — model behavior and grounding, especially that Gemini OCR issue. Maruti Agarwal reported Gemini sometimes threw a "recitation error" with finish_reason=4 when OCRing legal pages — probably because the model matched copyrighted boilerplate.
Maya: That example shows how provider models can hit policy blocks when text resembles copyrighted corpora. Abhishek suggested IBM Granite for OCR as an alternate. Practical workarounds: preprocess PDFs page-by-page, try alternate OCR like Granite, or ask the model to paraphrase rather than reproduce verbatim. Also log and retry — sometimes switching the model or paraphrasing helps.
Alex: And note: vLLM is used as an execution layer in some pipelines, so you can combine better OCR outputs with LLMs for downstream understanding.
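Alex: The retry logic itself is simple to sketch. The finish_reason check below mirrors what Maruti described; extract_page is a placeholder, not the real Gemini SDK call:

```python
# Sketch of a paraphrase-on-recitation fallback for page-by-page OCR.
VERBATIM_PROMPT = "Transcribe the text on this page exactly."
PARAPHRASE_PROMPT = ("Restate this page's content in your own words, "
                     "preserving all names, dates, and figures.")

def extract_page(page_bytes: bytes, prompt: str) -> tuple[str, int]:
    # Placeholder: call your vision model here; return (text, finish_reason).
    return "", 0

def ocr_with_fallback(page_bytes: bytes) -> str:
    text, finish_reason = extract_page(page_bytes, VERBATIM_PROMPT)
    if finish_reason == 4:  # recitation block, as reported for Gemini
        text, _ = extract_page(page_bytes, PARAPHRASE_PROMPT)
    return text
```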
Maya: Moving on — medical AI and long-term memory. Bharat Shetty shared NOHARM, a physician-validated medical benchmark, and highlighted Google's Titans long-term memory work with a "surprise metric."
Alex: The surprise metric is fun to explain: if new input is expected, the model doesn't bother storing it in long-term memory; if it's unexpected or anomalous, the "surprise" is high and the model decides to remember it. That’s a neat signal for continual learning.
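Alex: As a toy illustration, and this is our simplification rather than the actual Titans mechanism, you can treat surprise as "how unlikely the model found this input" and only write to memory above a threshold:

```python
import math

# Toy surprise-gated memory. Titans itself uses a gradient-based
# surprise signal; negative log-likelihood is our stand-in here.
memory: list[str] = []

def surprise(prob_under_model: float) -> float:
    """Negative log-likelihood: rare events score high."""
    return -math.log(max(prob_under_model, 1e-12))

def maybe_remember(event: str, prob_under_model: float,
                   threshold: float = 3.0) -> None:
    if surprise(prob_under_model) > threshold:
        memory.append(event)  # unexpected, so worth storing

maybe_remember("routine checkup, no changes", prob_under_model=0.6)    # skipped
maybe_remember("rare adverse drug reaction", prob_under_model=0.01)    # stored
print(memory)  # ['rare adverse drug reaction']
```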
Maya: Why this matters: for clinical systems, the cost of a wrong recommendation is high. Use physician-validated benchmarks like NOHARM to measure harm, and consider memory architectures (Titans/MAC) for personalization and safe recall.
Alex: Practical idea: if you're building medical assistants, involve clinicians early, run your model on NOHARM-like cases, and evaluate both beneficial and harmful action rates — you can’t just chase average accuracy.
Maya: Last big theme — open models, compute, and the India research question. There were multiple announcements: Mistral-3, K2 from MBZUAI, DeepSeek mentions, and wide chat about whether Indian startups can do fundamental research.
Alex: The broad takeaway: the open-model ecosystem is accelerating. K2 positions itself as a fully inspectable 70B model with public weights and training code — great for researchers and companies who want transparency. But hosting, fine-tuning, and inference economics still matter — MoE and other architectures introduce operational complexity.
Maya: On India and research, SARVESH and others debated why more startups don’t pursue deep fundamental research. Several practical points came up: you need talent concentration, patient funding, and institutional support. Non-obvious takeaway: startups can still contribute by building rigorous, mathematically-grounded components — you don't need to invent a whole new architecture to do meaningful foundational work.
Alex: Practical idea: if you’re a founder in India, partner with global research labs, hire internationally for niche expertise, and start with open-weight models like K2 or Mistral for reproducible research.
Maya: Okay — time for our listener tips. Alex, you first.
Alex: Tip: If you’re launching an agent or RAG system, build a V1 benchmark of 200–300 real user questions and start logging production failures immediately. Use a simple toolchain — CSV or Promptfoo — then iterate. Maya, how would you apply that to your projects?
Maya: I’d apply it to my knowledge-base search: create a 250-item gold set across the top customer intents, run nightly checks, and add failing queries to the gold set weekly. I’d also use Promptfoo for automating the checks. My tip: if you want to learn a new domain quickly, back up your chats and docs and feed them into NotebookLM to generate a short deck and a Q&A — that’s Mohamed Yasser and Luv’s approach. Alex, how would you use that?
Alex: I’d use it to onboard new hires: dump onboarding docs and past support chats into NotebookLM, generate a 10-slide onboarding deck plus a 30-question quiz, and use that to speed up ramp time.
Maya: One more quick micro-tip — for sensitive OCR or legal docs, try IBM Granite's Docling demo, or run the PDF through a deterministic OCR engine before asking a large model; that helps avoid recitation errors. Then ask for a paraphrase in the prompt.
Alex: Great. That wraps our picks for the week.
Maya: Thanks to everyone who shared in the chat — Mohamed Yasser, Paras Chopra, Bharat Shetty, Maruti Agarwal, marmik pandya, DJ, Nirant K and many more — you made this episode possible.
Alex: See you next week. Goodbye from Alex.
Maya: Goodbye from Maya. Have a great week!