


Discover a groundbreaking approach to optimizing Large Language Models with Tomasz Kolinko, a true OG tinkerer and entrepreneur. In this One-Shot interview, Tomasz unveils his 'Effort Engine,' a novel algorithm that dynamically selects which computations are performed during LLM inference, allowing significant speed improvements while maintaining surprisingly high output quality. Learn how this method goes beyond traditional quantization by dynamically managing computations and even enabling partial model loading to save VRAM.
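To give a flavor of the idea, here is a minimal NumPy sketch of magnitude-based computation skipping in a matrix-vector product. This is an illustrative assumption, not Tomasz's actual implementation (which, per the project page, pre-sorts weights so that skipped multiplications are never performed or even loaded); the toy below computes every product just to show which ones a given effort level would keep.

```python
import numpy as np

def effort_matvec(W, x, effort=0.5):
    """Toy 'effort'-style approximate mat-vec: keep only the largest products.

    Scores every individual product |W[i, j] * x[j]| and sums only the top
    `effort` fraction, skipping the rest. The function name and scoring rule
    are hypothetical, chosen to illustrate the trade-off.
    """
    products = W * x                         # every W[i, j] * x[j], broadcast over rows
    k = max(1, int(effort * products.size))  # how many products the budget allows
    thresh = np.partition(np.abs(products).ravel(), -k)[-k]  # k-th largest magnitude
    kept = np.where(np.abs(products) >= thresh, products, 0.0)
    return kept.sum(axis=1)

# At effort=1.0 this equals W @ x exactly; lowering effort trades accuracy
# for (in a real kernel) proportionally less compute and weight loading.
rng = np.random.default_rng(0)
W, x = rng.standard_normal((8, 64)), rng.standard_normal(64)
print(np.abs(effort_matvec(W, x, effort=0.3) - W @ x).max())
```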
Tomasz shares his unique benchmarking techniques, including the use of Kullback-Leibler divergence and heat maps, offering a new lens to understand how models behave under reduced 'effort.' This conversation provides practical insights into the underlying mechanics of AI models and offers a fully open-source project for practitioners to experiment with.
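For readers who want to try the benchmarking approach themselves, here is a minimal sketch of the Kullback-Leibler comparison, assuming P is the full-effort model's next-token distribution and Q the reduced-effort one; the helper names are hypothetical, not taken from the Effort Engine codebase.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

def kl_divergence(p_logits, q_logits, eps=1e-12):
    """KL(P || Q) in nats between two next-token distributions, from raw logits."""
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Averaging this per-token divergence over a test text gives one scalar quality score per effort setting, which is the kind of full-model-vs-reduced-effort comparison discussed in the episode.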
💡 Resources:
• Tomasz Kolinko's GitHub - https://kolinko.github.io/effort/about.html
• The Basics - https://kolinko.github.io/effort/equations.html
• AI Tinkerers - https://aitinkerers.org
• One-Shot Podcast - https://one-shot.aitinkerers.org/
Social Media Tags: @AITinkerers @kolinko
👍 Like this video if you found it valuable, and subscribe to AI Tinkerers One-Shot for more conversations with innovators building the future of AI!
00:00 Introduction
00:01:07 Welcome Tomasz Kolinko
00:02:11 Introducing Effort Engine
00:03:10 Dynamic Inference Explained
00:05:56 How the Algorithm Works
00:08:07 Speed vs. Quality Trade-offs
00:11:37 Dynamic Weight Loading & VRAM
00:15:24 Effort Engine Demo
00:26:01 Model Breakdown Observations
00:29:49 Architecture & Benchmarks
00:32:17 Kullback-Leibler Divergence
00:39:22 Heat Map Visualization
00:41:07 Community & Future Work
By Joe Heitzeberg