Hal Turing and Dr. Ada Shannon return to the CARTRIDGE compression system with a mechanistic lens, covering Maurizio Diaz's paper "Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations" (arXiv 2508.17032), presented at the NeurIPS 2025 Workshop on Mechanistic Interpretability. Building on the original CARTRIDGE episode from November 10th, 2025 and the follow-up from February 6th, 2026, this episode asks the question those earlier discussions left open: what structure does the optimizer actually induce in a trained CARTRIDGE? The hosts ground the discussion in the memory-scaling problem driving the entire field: KV caches grow linearly with context length and now routinely dwarf model weights at the 128K-to-million-token scales of current frontier models. They trace how techniques like PagedAttention, Grouped Query Attention, and token eviction address the symptoms without shrinking the underlying representation.
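To make that scaling pressure concrete, here is a back-of-the-envelope sizing in Python. The configuration (32 layers, 8 grouped KV heads, head dimension 128, fp16, an 8B-parameter model) is a hypothetical Llama-style setup chosen for illustration; none of the numbers come from the episode.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values: two tensors per layer, each seq_len x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 8B-parameter model served in fp16 with GQA (8 KV heads).
weights_gb = 8e9 * 2 / 1e9  # ~16 GB of weights
cache_gb = kv_cache_bytes(seq_len=1_000_000, n_layers=32,
                          n_kv_heads=8, head_dim=128) / 1e9

print(f"weights ~ {weights_gb:.0f} GB, 1M-token KV cache ~ {cache_gb:.0f} GB")
# ~16 GB of weights versus ~131 GB of cache at a million tokens: the cache,
# not the model, dominates memory, and it grows with every generated token.
```

Even with GQA already shrinking the cache by the head-grouping factor, the linear term dominates at long contexts, which is why paging and eviction manage the symptom rather than remove it.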
Diaz's central finding is a clean functional division between key and value vectors inside a trained CARTRIDGE. Keys converge to stable retrieval routers: low-rank, consistent structures that steer attention toward the right stored content across diverse queries. Values carry the compressed semantic payload. The hosts connect this directly to how CARTRIDGE's Self-Study training pipeline works: because the cache is optimized against synthetic question-answer traces that the model generates over its own content, the training signal explicitly selects for routing behavior, making the key-as-router outcome a predictable consequence of the objective rather than an accident. Diaz uses Singular Value Decomposition to quantify this structure layer by layer, separating the geometric properties of key matrices from those of value matrices across training checkpoints.
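The SVD diagnostic itself is simple to sketch. The snippet below contrasts a router-like key matrix, built deliberately from only eight directions, with a full-spectrum value matrix, using an effective-rank measure of the kind such an analysis relies on. The shapes, the 90% energy threshold, and the synthetic matrices are all illustrative assumptions rather than the paper's data:

```python
import numpy as np

def effective_rank(M: np.ndarray, energy: float = 0.90) -> int:
    """Smallest k whose top-k singular values hold `energy` of the squared spectral mass."""
    s = np.linalg.svd(M, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy)) + 1

rng = np.random.default_rng(0)
n_slots, d = 512, 128

# Router-like keys: routing needs only a few stable attention directions,
# so K is constructed from an 8-dimensional subspace.
K = rng.normal(size=(n_slots, 8)) @ rng.normal(size=(8, d))
# Payload-like values: dense content with a nearly flat spectrum.
V = rng.normal(size=(n_slots, d))

print("effective rank of K:", effective_rank(K))  # ~8: low-rank router
print("effective rank of V:", effective_rank(V))  # near d: full-spectrum payload
```

Run per layer on real checkpoint matrices, the same measurement would show how quickly the key spectra collapse relative to the value spectra as Self-Study training proceeds.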
Two downstream findings from the key-router property shape the second half of the discussion. Because keys are stable and low-rank, they transfer across tasks with minimal degradation, a result with direct implications for multi-task serving: a single shared key structure could route to task-specific value sets without independent CARTRIDGE training per deployment. The paper's Sampled Chunk Initialization method exploits this stability to warm-start CARTRIDGE training, accelerating convergence by initializing the learnable KV pairs from a small representative sample of the content rather than from random weights (sketched below). Hal and Ada close by discussing what the key-as-router framing implies for KV-cache compression research more broadly: if the routing function is separable and transferable, compression schemes that conflate keys and values may be discarding structure that has real serving-efficiency value.
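As a rough illustration of what Sampled Chunk Initialization amounts to, the single-layer sketch below seeds a cartridge's trainable KV slots from real prefill states over a few randomly sampled chunks instead of random noise. `prefill_kv` is an assumed helper standing in for a forward pass that returns a chunk's key/value states; none of the names, shapes, or hyperparameters come from the paper.

```python
import torch

def sampled_chunk_init(prefill_kv, tokens: torch.Tensor,
                       n_chunks: int = 8, chunk_len: int = 64):
    """Warm-start one layer of a cartridge from sampled corpus chunks."""
    g = torch.Generator().manual_seed(0)
    starts = torch.randint(0, tokens.numel() - chunk_len, (n_chunks,), generator=g)
    ks, vs = [], []
    for s in starts.tolist():
        # Assumed helper: run the frozen model over one chunk and return its
        # key and value states, each of shape [chunk_len, head_dim].
        k, v = prefill_kv(tokens[s : s + chunk_len])
        ks.append(k)
        vs.append(v)
    # The concatenated real states become the trainable parameters that the
    # Self-Study objective then refines, replacing a random initialization.
    return (torch.nn.Parameter(torch.cat(ks, dim=0)),
            torch.nn.Parameter(torch.cat(vs, dim=0)))
```

Because the sampled keys already sit near the low-rank routing subspace the optimizer would otherwise have to discover from scratch, a warm start of this kind is a plausible mechanism for the faster convergence described above.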
Sources:
1. Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations — Maurizio Diaz, 2025
http://arxiv.org/abs/2508.17032
2. CARTRIDGES: Learning to Pack Long Contexts into KV Caches — Zhihao Zhang, Aditya Desai, Amir Gholami, Michael W. Mahoney, Kurt Keutzer, et al. (Berkeley / ICSI), 2025
https://scholar.google.com/scholar?q=CARTRIDGES:+Learning+to+Pack+Long+Contexts+into+KV+Caches
3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, Ying Sheng, Tianyi Zhou, et al., 2023
https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
4. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time — Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava, 2023
https://scholar.google.com/scholar?q=Scissorhands:+Exploiting+the+Persistence+of+Importance+Hypothesis+for+LLM+KV+Cache+Compression+at+Test+Time
5. Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, 2023
https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
6. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
7. Lost in the Middle: How Language Models Use Long Contexts — Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2023
https://scholar.google.com/scholar?q=Lost+in+the+Middle:+How+Language+Models+Use+Long+Contexts
8. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023
https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
9. Extending Context Window of Large Language Models via Positional Interpolation — Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian, 2023
https://scholar.google.com/scholar?q=Extending+Context+Window+of+Large+Language+Models+via+Positional+Interpolation
10. Constitutional AI: Harmlessness from AI Feedback — Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. (Anthropic), 2022
https://scholar.google.com/scholar?q=Constitutional+AI:+Harmlessness+from+AI+Feedback
11. Distilling the Knowledge in a Neural Network — Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015
https://scholar.google.com/scholar?q=Distilling+the+Knowledge+in+a+Neural+Network
12. Compressing Context to Enhance Inference Efficiency of Large Language Models — Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin, 2023
https://scholar.google.com/scholar?q=Compressing+Context+to+Enhance+Inference+Efficiency+of+Large+Language+Models
13. AutoCompressors: Adapting Language Models to Summarize Arbitrary Contexts into Summary Vectors — Alexis Chevalier, Alexander Wettig, Anirudh Anand, Danqi Chen, 2023
https://scholar.google.com/scholar?q=AutoCompressors:+Adapting+Language+Models+to+Summarize+Arbitrary+Contexts+into+Summary+Vectors
14. A Mathematical Framework for Transformer Circuits — Nelson Elhage, Neel Nanda, Catherine Olsson, et al. (Anthropic), 2021
https://scholar.google.com/scholar?q=A+Mathematical+Framework+for+Transformer+Circuits
15. Toy Models of Superposition — Nelson Elhage, Tristan Hume, Catherine Olsson, et al. (Anthropic), 2022
https://scholar.google.com/scholar?q=Toy+Models+of+Superposition
16. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Trenton Bricken, Adly Templeton, Joshua Batson, et al. (Anthropic), 2023
https://scholar.google.com/scholar?q=Towards+Monosemanticity:+Decomposing+Language+Models+with+Dictionary+Learning
17. In-context Learning and Induction Heads — Catherine Olsson, Nelson Elhage, Neel Nanda, et al. (Anthropic), 2022
https://scholar.google.com/scholar?q=In-context+Learning+and+Induction+Heads
18. Gist Tokens: Compressing Prompts into Tokens for Long-Context Language Models — Mu et al., 2023
https://scholar.google.com/scholar?q=Gist+Tokens:+Compressing+Prompts+into+Tokens+for+Long-Context+Language+Models
19. The Power of Scale for Parameter-Efficient Prompt Tuning — Lester et al., 2021
https://scholar.google.com/scholar?q=The+Power+of+Scale+for+Parameter-Efficient+Prompt+Tuning
20. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Li and Liang, 2021
https://scholar.google.com/scholar?q=Prefix-Tuning:+Optimizing+Continuous+Prompts+for+Generation
21. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling — Cai et al., 2024
https://scholar.google.com/scholar?q=PyramidKV:+Dynamic+KV+Cache+Compression+based+on+Pyramidal+Information+Funneling
22. Tokasaurus: A High-Throughput LLM Serving Engine with Grouped-Sparse Attention — Lenz et al., 2025
https://scholar.google.com/scholar?q=Tokasaurus:+A+High-Throughput+LLM+Serving+Engine+with+Grouped-Sparse+Attention
23. Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders — 2024-2025
https://scholar.google.com/scholar?q=Unlocking+the+Address+Book:+Dissecting+the+Sparse+Semantic+Structure+of+LLM+Key-Value+Caches+via+Sparse+Autoencoders
24. SCBench: A KV Cache-Centric Analysis of Long-Context Methods — 2024-2025
https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods
25. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression — Microsoft Research, 2024
https://scholar.google.com/scholar?q=LLMLingua-2:+Data+Distillation+for+Efficient+and+Faithful+Task-Agnostic+Prompt+Compression
26. AI Post Transformers: CARTRIDGE: Efficient In-Context Learning via Distillation — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/CARTRIDGE-Efficient-In-Context-Learning-via-Distillation-e3aous4
27. AI Post Transformers: Context Distillation for Language Models — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Context-Distillation-for-Language-Models-e3aouen
28. AI Post Transformers: Advancements in Efficient KV Cache Quantization and Management — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Advancements-in-Efficient-KV-Cache-Quantization-and-Management-e3fk9kr
29. AI Post Transformers: Architectural Migration to Multi-head Latent Attention — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Architectural-Migration-to-Multi-head-Latent-Attention-e39jbmq