June 22, 2026

Engineering Persistence: How MLX-Engine v1.8.5 Solves the KV Cache Rewind Problem

43 minutes

Welcome back to Neural Intel. Today, we are going deep into the weeds of mlx-engine v1.8.5, the MIT-licensed inference backend for LM Studio.

Neural Signal Check: For the Architect and the Researcher, the real story isn't just "faster tokens." It's how MLX-Engine now manages the unified memory architecture by offloading local attention layers to a specialized disk-writer backend.

In this episode, we discuss:

The Rewind Challenge: Why "nifty tricks" in Gemma 4 and Qwen 3.5 make arbitrary rewinding hard and how mlx-engine circumvents this.

Disk Cache Architecture: How the engine uses a single scratch file in /tmp with serialized safetensors blobs to manage cache records.

Boundary Strategy: Why 256 tokens is the "Goldilocks" zone for balancing disk efficiency and recomputation.

Continuous Batching: The implementation for vision model (VLM) requests that allows for serious concurrent agentic workloads.

LRU Store Logic: How the system determines which "stale" conversation tokens to evict and which to keep resident in memory.
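To make the boundary and eviction ideas above concrete, here is a minimal Python sketch. It is not mlx-engine's code: the names `rewind_plan` and `LRUStore` are hypothetical, and the only detail taken from these notes is the 256-token boundary. It assumes checkpoints exist only at multiples of the boundary, so a rewind restores the nearest earlier checkpoint and recomputes the remainder, and that the store drops the least recently used record once it exceeds a fixed capacity.

```python
from collections import OrderedDict

BOUNDARY = 256  # checkpoint spacing in tokens, as stated in the episode notes


def rewind_plan(target_len: int, boundary: int = BOUNDARY) -> tuple[int, int]:
    """Return (checkpoint_len, tokens_to_recompute) for a rewind to target_len.

    Assumption: cache checkpoints exist only at multiples of `boundary`.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    checkpoint = (target_len // boundary) * boundary
    return checkpoint, target_len - checkpoint


class LRUStore:
    """Hypothetical in-memory store that evicts the least recently used record."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._records: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str):
        """Return the record for key (marking it recently used), or None."""
        if key not in self._records:
            return None
        self._records.move_to_end(key)
        return self._records[key]

    def put(self, key: str, record: object) -> list[str]:
        """Insert or refresh a record; return the keys evicted to make room."""
        self._records[key] = record
        self._records.move_to_end(key)
        evicted = []
        while len(self._records) > self.capacity:
            old_key, _ = self._records.popitem(last=False)
            evicted.append(old_key)
        return evicted
```

Under these assumptions, rewinding to token 1000 restores the checkpoint at 768 and recomputes 232 tokens. A smaller boundary would shrink that recomputation but produce more, smaller cache records, which is the trade-off the "Goldilocks" framing refers to.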

Follow us on X: @neuralintelorg

Visit our website: neuralintel.org

Engage with us: What’s your take on using disk-backed caches versus increasing raw unified memory? Let us know in the comments below!

Support the Show:

By Neuralintel.org
