
Sign up to save your podcasts
Or


In this episode, we explore an innovative approach to overcoming the notorious energy and latency bottlenecks plaguing modern Large Language Models (LLMs).
The core of generative LLMs, powered by Transformer networks, relies on the self-attention mechanism, which frequently accesses and updates the large Key-Value (KV) cache. On traditional Graphical Processing Units (GPUs), loading this KV-cache from High Bandwidth Memory (HBM) to SRAM is a major bottleneck, consuming substantial energy and causing latency.
We delve into a novel Analog In-Memory Computing (IMC) architecture designed specifically to perform the attention computation far more efficiently.
Key Breakthroughs and Results:
The proposed architecture marks a significant step toward ultra-fast, low-power generative Transformers and demonstrates the promise of IMC with volatile, low-power memory for attention-based neural networks.
By GenAI Level UPIn this episode, we explore an innovative approach to overcoming the notorious energy and latency bottlenecks plaguing modern Large Language Models (LLMs).
The core of generative LLMs, powered by Transformer networks, relies on the self-attention mechanism, which frequently accesses and updates the large Key-Value (KV) cache. On traditional Graphical Processing Units (GPUs), loading this KV-cache from High Bandwidth Memory (HBM) to SRAM is a major bottleneck, consuming substantial energy and causing latency.
We delve into a novel Analog In-Memory Computing (IMC) architecture designed specifically to perform the attention computation far more efficiently.
Key Breakthroughs and Results:
The proposed architecture marks a significant step toward ultra-fast, low-power generative Transformers and demonstrates the promise of IMC with volatile, low-power memory for attention-based neural networks.