Alex: Hello and welcome to The Generative AI Group Digest for the week of 05 Oct 2025!
Maya: We're Alex and Maya.
Alex: Big week in the group — folks wrestling with messy PDFs, image generation headaches, new long-context model behavior, Sora's buzz, and some interesting benchmarks and tooling ideas. Ready to dive in?
Maya: Let’s do it.
Alex: First up — parsing really complex PDFs and extracting tables. Sainath kicked this off saying he’s tried a bunch of tools — ChatGPT, Gemini, Claude, Amazon Textract — and kept hitting problems: cells spanning pages, 6‑column layouts with messy boundaries, rows that stretch multiple pages, and mixed content in the same cell.
Maya: That list is a nightmare for any automatic extractor. When a cell can contain citations, descriptive text and table bits, naive OCR and table parsers break down fast. So what helps in practice?
Alex: A few practical patterns the group surfaced. One: separate layout detection from text extraction. Use a layout-aware detector first — LayoutParser, Detectron2-based table detectors, or Google Document AI — to find bounding boxes for cells, headers, and footers. Two: treat multi-page or split cells explicitly: stitch fragments using anchors like column headers, x‑coordinates, and dates. Three: use an LLM for structured reconstruction but feed it cleaned, chunked inputs — Gemini file API + Gemini 2.5 Flash or Claude with structured output can help, but only after the bounding boxes and cell fragments are pre-assembled.
Maya: G Kuppuram suggested a "dynamic boxing with an agentic approach" — basically let a model propose boxes, then have a verifier agent adjust them. That makes sense for messy layouts where a single pass won’t cut it.
Alex: Exactly. Non-obvious takeaways: generate a confidence score per cell and human‑review low‑confidence ones; create a “table fusion” step that merges overlapping fragments; and if your end goal is Excel, normalize the output into a strict JSON schema (row/col/cell_text) so downstream code can validate and repair.
Maya: Practical idea: run OCR on the page images, run a segmentation model for table structure, then use a small LLM prompt to merge fragments into final cells. If you can, synthesize a tiny labeled dataset from a few representative PDFs and fine-tune or calibrate your detector — that pays off fast.
Alex: Good tip. It’s worth quoting Sainath here, because he basically spelled out the failure modes: "Cells spanning across multiple pages… 6-column layouts with messy boundaries… Rows that stretch 2–3 pages… Mixed refs, citations, and descriptive text in the same cell." It’s a useful checklist for reproducing the problem.
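Show notes: a minimal sketch of the stitch-and-review step from Maya's pipeline, assuming a layout/OCR pass has already produced text fragments tagged with page, row, column, and confidence. The Fragment class, the sample values, and the 0.8 review threshold are illustrative, not any particular tool's output format.

```python
from dataclasses import dataclass
import json

# Illustrative fragment format: roughly what a layout/OCR step might hand you.
# Real detectors (LayoutParser, Document AI, Textract) each have their own schemas.
@dataclass
class Fragment:
    page: int
    row: int           # row index assigned by the layout step
    col: int           # column index from x-coordinate clustering
    text: str
    confidence: float  # OCR/detector confidence, 0..1

def stitch_cells(fragments, review_threshold=0.8):
    """Merge fragments that share (row, col) across pages into single cells,
    and flag low-confidence cells for human review."""
    cells, review_queue = {}, []
    for f in sorted(fragments, key=lambda f: (f.row, f.col, f.page)):
        key = (f.row, f.col)
        if key in cells:
            # Cell continued on a later page: append text, keep the worst confidence.
            cells[key]["cell_text"] += " " + f.text
            cells[key]["confidence"] = min(cells[key]["confidence"], f.confidence)
        else:
            cells[key] = {"row": f.row, "col": f.col,
                          "cell_text": f.text, "confidence": f.confidence}
    for cell in cells.values():
        if cell["confidence"] < review_threshold:
            review_queue.append(cell)
    return list(cells.values()), review_queue

# The merged row/col/cell_text JSON is what you would hand to an LLM pass or to Excel.
cells, review = stitch_cells([
    Fragment(1, 0, 0, "Compound A", 0.95),
    Fragment(1, 0, 1, "Reduces expression of", 0.91),
    Fragment(2, 0, 1, "TP53 in vitro [12]", 0.62),  # cell continued on the next page
])
print(json.dumps(cells, indent=2))
print(f"{len(review)} cell(s) routed to human review")
```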
Maya: Topic two: image generation and prompts. Yashwardhan asked whether MidJourney is bad or if prompts are the issue. There’s a theme: some models don’t adhere to prompts well, others need more detailed or structured prompts.
Alex: Right. People recommended trying different backends — Nano Banana, Seedream, Veo3 — and different prompt styles. Chaitanya and Meera called out JSON prompts for Veo3 and video generation; Chaitanya shared a JSON structure they use for Veo3, and others found detailed textual prompts worked better for image editing.
Maya: A neat trick Shreyas mentioned: use a fast critique model like Gemini Flash to evaluate the first generation, have it output what to correct, and pass that back to the image model (Nano Banana). That iterative loop — generate, critique, refine — often beats one-shot prompting.
Alex: Non-obvious takeaway: structured prompts (JSON/XML) help when you need consistent, repeatable edits or when automating pipelines. For ad‑hoc creative work, long, vivid textual prompts often work better. Also watch for model updates — Nano Banana added better aspect ratio support this week, which matters for production pipelines.
Maya: Practical idea: build a small prompt‑rewriter agent. User enters a short idea; use an LLM (Claude or Gemini) to expand it into a detailed scene + camera + lighting + style JSON; then run the image model. For editing, always include a precise mask or semantic hint for what to change.
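Show notes: a rough sketch of that expand, generate, critique loop. The call_llm, critique_image, and generate_image functions are placeholders for whichever SDKs you actually use, and the JSON keys (scene, camera, lighting, style, must_preserve) are just one plausible spec.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your text-model call (e.g., a Claude or Gemini SDK)."""
    raise NotImplementedError

def critique_image(image: bytes, spec: dict) -> str:
    """Placeholder for a fast vision model (e.g., Gemini Flash) that inspects
    the generated image and lists precise corrections."""
    raise NotImplementedError

def generate_image(spec: dict) -> bytes:
    """Placeholder for the image-model call (Nano Banana, MidJourney, etc.)."""
    raise NotImplementedError

def expand_prompt(short_idea: str) -> dict:
    """Expand a short user idea into a structured scene spec."""
    raw = call_llm(
        "Expand this idea into JSON with keys: scene, camera, lighting, style, "
        "must_preserve. Return JSON only.\nIdea: " + short_idea)
    return json.loads(raw)

def generate_with_critique(short_idea: str, rounds: int = 2) -> bytes:
    spec = expand_prompt(short_idea)
    image = generate_image(spec)
    for _ in range(rounds):
        corrections = critique_image(image, spec)  # generate -> critique -> refine
        spec["corrections"] = corrections          # fold the critique back into the spec
        image = generate_image(spec)
    return image
```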
Alex: Next: deep research workflows and which LLMs actually help. tp53(ashish) asked which models deliver high‑quality knowledge work. Somya Sinha said Perplexity Labs and Manus worked well for first passes, and many people recommended a two‑step approach: process PDFs into summaries, then feed those summaries into a reasoning model.
Maya: There were some benchmarks too — tp53 shared the new APEX productivity benchmark and the key finding summarized by Brendan Foody: GPT‑5, Grok 4, and Gemini 2.5 Flash were top performers on the domains tested.
Alex: The practical pattern here is important: start with retrieval and synthesis, then do focused reasoning. For deep literature review, don’t expect a single LLM to solve everything. Break the task: extract facts and citations first, curate a golden set, then do cross‑paper synthesis with a stronger model. Use Perplexity or Manus for discovery, then a high‑capability model for synthesis.
Maya: Non-obvious tip: run multiple models and "vibe select" — have a compact scorer or meta‑agent that picks the best draft across models. People in the chat actually combine outputs and pick the best, rather than trusting one run.
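Show notes: a tiny sketch of that "run several models, let a scorer pick" pattern. The model and scorer callables are stand-ins for real API calls; the scorer could be a cheap LLM prompted to rate each draft against the task.

```python
def best_draft(task: str, model_fns, scorer_fn):
    """Run the same task through several models, then let a scorer pick the winner.
    model_fns: callables that take the task prompt and return a draft string.
    scorer_fn: callable that returns a numeric score for (task, draft)."""
    drafts = [fn(task) for fn in model_fns]
    scored = [(scorer_fn(task, draft), draft) for draft in drafts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]  # highest-scoring draft wins
```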
Alex: Big model behavior and tools next: Sonnet 4.5, context awareness, and file editing. Several folks noted Sonnet 4.5 can manage very long tasks and is "aware of its own context window" — Pranav Hari quoted the cognition.ai post saying Sonnet 4.5 proactively summarizes as it reaches limits.
Maya: That changes how you build agents. If a model knows it's nearing its limit and can compress context proactively, you can run longer sessions or workflows without losing coherence. Anthropic added memory tools too — short and long term — which people are experimenting with.
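Show notes: a hand-rolled sketch of what proactive compression looks like on the agent side, for models or stacks that don't do it for you. The token estimate here is a crude word count (use your real tokenizer), and summarize_fn is a placeholder for an LLM summarization call.

```python
def maybe_compress(messages, summarize_fn, budget_tokens=150_000, keep_recent=10):
    """If the running transcript approaches the context budget, summarize the
    older turns and keep only the summary plus the most recent messages."""
    est_tokens = sum(len(m["content"].split()) for m in messages) * 1.3
    if est_tokens < budget_tokens * 0.8:  # still comfortably under budget
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize_fn(older)  # e.g., one LLM call over the older turns
    return [{"role": "system", "content": "Summary of earlier work: " + summary}] + recent
```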
Alex: The group also talked about slide generation from papers. tp53(ashish) compared Sonnet 4.5 and Opus 4.1: Sonnet was decent on content, but Opus produced slicker visuals. ChatGPT was noted to output .pptx quickly using python‑pptx. So mix-and-match is valuable: use one model for content, another for visual polish or direct pptx generation via libraries.
Maya: Practical idea: chunk the paper into sections, have an LLM produce slide text per section, then use python‑pptx or html2pptx to assemble slides programmatically. If you need nicer visuals, send the content to a model that can return layout suggestions and assets.
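Show notes: a minimal python-pptx sketch of the assembly step. The sections list stands in for whatever per-section slide text your LLM pass returns; the bullets here are generic placeholders.

```python
# pip install python-pptx
from pptx import Presentation

# In practice these dicts would come from an LLM pass over each paper section;
# they are hard-coded here only to keep the sketch self-contained.
sections = [
    {"title": "Method", "bullets": ["Key point 1", "Key point 2"]},
    {"title": "Results", "bullets": ["Key finding 1", "Key finding 2"]},
]

prs = Presentation()
layout = prs.slide_layouts[1]  # built-in "Title and Content" layout
for section in sections:
    slide = prs.slides.add_slide(layout)
    slide.shapes.title.text = section["title"]
    body = slide.placeholders[1].text_frame
    body.text = section["bullets"][0]
    for bullet in section["bullets"][1:]:
        body.add_paragraph().text = bullet
prs.save("paper_slides.pptx")
```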
Alex: Sora 2 and the social AI video buzz. Lots of excitement — invites, cameos, remixing, monetization for rightsholders. tp53 pointed to Sam Altman’s Sora update about monetizing rightsholder content.
Maya: The upside is huge for creators and marketers: quick, personalized video content and remix culture. The downside is obvious — deepfakes and consent. tp53 warned that as AI video gets harder to distinguish, deepfake risks amplify.
Alex: Non-obvious takeaway: if you experiment with Sora, treat it like any new media channel — test small, think about provenance and rights, and consider adding watermarks or metadata to generated content. For brands, cameo-style monetization could be interesting, but plan legal and consent controls up front.
Maya: Another cross-cutting theme: structured outputs, tool calls, and agents. Karthik Sashidhar and others noted open‑source models often lag in reliable JSON or function-call outputs. People recommended models like Qwen, Gemma, and DeepSeek, plus tooling like BAML, for structured-output work.
Alex: The engineering pattern is: enforce a schema, validate model outputs, and retry with stricter prompts or a constrained function-call API. Use OpenAI function calling, Anthropic’s tool interfaces, or model‑specific JSON modes. Sushanth Bodapati proposed stateless atomic tool servers behind an agent gateway — useful for building auditable tool ecosystems.
Maya: Practical idea: when reliability matters, add a validator step: JSON schema validation, plus a reformat request to the model only when the validator fails. That catches bad outputs before they turn into downstream failures.
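Show notes: a small sketch of that validate-then-retry step using the jsonschema package, with call_model as a placeholder for whichever model or function-calling API you use. If your provider has a native JSON or function-calling mode, reach for that first and keep the validator as a backstop.

```python
# pip install jsonschema
import json
from jsonschema import validate, ValidationError

# Strict row/col/cell_text schema, matching the table-extraction example earlier.
ROW_SCHEMA = {
    "type": "object",
    "properties": {
        "row": {"type": "integer"},
        "col": {"type": "integer"},
        "cell_text": {"type": "string"},
    },
    "required": ["row", "col", "cell_text"],
}

def call_model(prompt: str) -> str:
    """Placeholder for your actual model or function-calling API."""
    raise NotImplementedError

def get_valid_rows(prompt: str, max_retries: int = 2) -> list:
    """Ask for JSON, validate it, and only re-prompt when validation fails."""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            rows = json.loads(raw)
            for row in rows:
                validate(instance=row, schema=ROW_SCHEMA)
            return rows
        except (json.JSONDecodeError, ValidationError) as err:
            # Tighten the prompt with the concrete validation error and retry.
            prompt += (f"\nYour last output was invalid ({err}). "
                       "Return ONLY valid JSON matching the schema.")
    raise RuntimeError("Model never produced schema-valid output")
```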
Alex: Before we close, listener tips — quick and actionable. Maya, you go first.
Maya: Tip — if you’re extracting complex tables from PDFs, do a two-step pipeline: (1) detect table structure and cell boxes with a layout model (LayoutParser, Google Document AI or Detectron2), (2) stitch text fragments and run a schema‑constrained LLM pass to produce row/col JSON. For low‑confidence cells, surface a small human review batch. Alex, how would you apply that?
Alex: I’d prototype with LayoutParser + Tesseract for layout and text, then use Claude Sonnet or Gemini 2.5 Flash to reconstruct tables into JSON. I’d log confidence and route any cell under 80% confidence for manual check.
Alex: My tip — when doing creative image edits with MidJourney or Nano Banana, build a small "prompt expansion + critique" loop: feed a short user prompt to an LLM (Claude or Gemini) to produce a detailed prompt or JSON, generate the image, then ask a fast model (Gemini Flash) to list precise corrections and iterate. Maya, how would you use that?
Maya: I’d use it for product imagery — one pass to produce several variants, then the critique loop to correct color or composition. For editing, I’d include masks and a "must preserve" list in the JSON.
Alex: That’s a wrap. Thanks to everyone who contributed to the chat: Sainath, tp53(ashish), Somya Sinha, Chaitanya, Pratyush, Pranav Hari, and many others who flagged and fleshed out the threads we covered.
Maya: See you next week — same group, more AI adventures. Goodbye for the week!
Alex: Bye everyone!