AI Tinkerers - "One-Shot"

Dynamic LLM Inference: Tomasz Kolinko's Effort Engine



Discover a groundbreaking approach to optimizing Large Language Models with Tomasz Kolinko, a true OG tinkerer and entrepreneur. In this One-Shot interview, Tomasz unveils his 'Effort Engine,' a novel algorithm that dynamically selects which computations are performed during LLM inference, allowing for significant speed improvements while maintaining surprising output quality. Learn how this method goes beyond traditional quantization by dynamically managing computations and even enabling partial model loading to save VRAM.
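To give a flavor of the idea (this is an illustrative sketch, not Kolinko's actual bucketed implementation), "effort" can be pictured as performing only the largest-magnitude fraction of the multiplications in each matrix-vector product and skipping the rest:

```python
import numpy as np

def effort_matvec(W, x, effort=0.5):
    """Approximate W @ x using only the top `effort` fraction of
    weight-input products, chosen by magnitude.

    Illustrative only: the real Effort Engine pre-sorts weights so
    this selection is cheap at inference time.
    """
    scores = np.abs(W) * np.abs(x)            # magnitude of each product
    k = max(1, int(effort * scores.size))     # number of products to keep
    thresh = np.partition(scores.ravel(), -k)[-k]
    mask = scores >= thresh                   # keep only dominant products
    return (W * mask) @ x

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
x = rng.normal(size=64)

full = W @ x
approx = effort_matvec(W, x, effort=0.5)      # ~50% of the multiplications
rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)
```

Because the largest products dominate the sum, a sizeable fraction of multiplications can be skipped while the result stays close to the full computation, which is the intuition behind the speed/quality trade-off discussed in the episode.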

Tomasz shares his unique benchmarking techniques, including the use of Kullback-Leibler divergence and heat maps, offering a new lens to understand how models behave under reduced 'effort.' This conversation provides practical insights into the underlying mechanics of AI models and offers a fully open-source project for practitioners to experiment with.
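The KL-divergence benchmark he describes can be sketched as comparing the next-token distribution from a full-effort pass against the one from a reduced-effort pass (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far the reduced-effort distribution q
    diverges from the full-effort distribution p."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token logits from a full pass vs. a reduced-effort pass.
full_logits = np.array([2.0, 1.0, 0.5, -1.0])
reduced_logits = np.array([1.9, 1.1, 0.4, -0.8])

kl = kl_divergence(softmax(full_logits), softmax(reduced_logits))
```

A KL divergence near zero means the reduced-effort model assigns nearly the same probabilities as the full model; plotting this value across effort levels and layers yields the kind of heat maps discussed later in the episode.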

💡 Resources:

• Tomasz Kolinko's Effort Engine - https://kolinko.github.io/effort/about.html

• The Basics - https://kolinko.github.io/effort/equations.html

• AI Tinkerers - https://aitinkerers.org

• One-Shot Podcast - https://one-shot.aitinkerers.org/

Social Media Tags: @AITinkerers @kolinko

👍 Like this video if you found it valuable, and subscribe to AI Tinkerers One-Shot for more conversations with innovators building the future of AI!

00:00:00 Introduction

00:01:07 Welcome Tomasz Kolinko

00:02:11 Introducing Effort Engine

00:03:10 Dynamic Inference Explained

00:05:56 How the Algorithm Works

00:08:07 Speed vs. Quality Trade-offs

00:11:37 Dynamic Weight Loading & VRAM

00:15:24 Effort Engine Demo

00:26:01 Model Breakdown Observations

00:29:49 Architecture & Benchmarks

00:32:17 Kullback-Leibler Divergence

00:39:22 Heat Map Visualization

00:41:07 Community & Future Work


AI Tinkerers - "One-Shot" by Joe Heitzeberg