How can pre-computing and reusing Key-Value (KV) caches accelerate inference for Retrieval-Augmented Generation and other long-context LLM tasks?
The provided sources all identify the same core problem: high latency in Large Language Model (LLM) inference caused by repeatedly processing long, recurring contexts during prefill. They converge on a unified solution: avoid this redundant computation by pre-computing, storing, and reusing the Key-Value (KV) caches of recurring text segments (referred to as chunks, documents, or prompt modules). Each source then contributes a distinct perspective on *how* to implement this reuse effectively, addressing the specific challenges that arise from it.
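As a concrete illustration of the basic idea, the sketch below pre-computes the KV cache of a shared document once and reuses it to answer multiple questions, assuming a HuggingFace-style causal LM. The model name `gpt2`, the `answer` helper, and greedy decoding are placeholders rather than details from the sources, and the sketch covers only the simplest case in which the reused chunk is an exact prefix of every prompt; reusing caches for non-prefix or multiple independent chunks requires the additional techniques the sources describe.

```python
# Minimal sketch of prefix KV-cache reuse, assuming a HuggingFace-style causal LM.
# Placeholders (not from the sources): model "gpt2", the answer() helper, greedy decoding.
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

document = "Retrieved document text that many different queries will share. "
doc_ids = tok(document, return_tensors="pt").input_ids

# 1. Pre-compute (prefill) the KV cache for the shared document once and store it.
with torch.no_grad():
    doc_kv = model(doc_ids, use_cache=True).past_key_values

def answer(question: str, max_new_tokens: int = 20) -> str:
    """Answer a question about the cached document without re-prefilling it."""
    # 2. Reuse a copy of the stored cache so the pre-computed entries stay intact.
    past = copy.deepcopy(doc_kv)
    next_input = tok(question, return_tensors="pt").input_ids
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Only the new tokens are fed; attention over the document comes
            # from the cached keys/values, so its prefill cost is skipped.
            out = model(next_input, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_tok)
            next_input = next_tok
    return tok.decode(torch.cat(generated, dim=1)[0], skip_special_tokens=True)

print(answer("Question: what does the document describe?"))
```

Copying the stored cache before each request keeps the pre-computed entries immutable, so a single prefill of the document can serve many queries; real systems additionally handle cache storage, transfer, and eviction, which this sketch omits.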
Sources:
https://arxiv.org/html/2502.15734v1
https://arxiv.org/html/2412.15605v1
https://arxiv.org/html/2502.16002v1
https://arxiv.org/html/2310.07240v6
https://arxiv.org/pdf/2404.12457
https://openreview.net/pdf?id=x7NbaU8RSU
https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Paper-Conference.pdf
https://www.cs.princeton.edu/~ravian/COS597_F24/papers/cacheblend.pdf