


I put out a blog prize to answer a couple of big questions I have about AI. The goal is really to find someone to hire as a co-researcher. I have more questions of this variety, but I omitted them from that post’s list, because they don’t make it easy to judge submission quality. So I thought I’d post them here:
Five hyperscalers own 70+% of global AI compute, and much of that is actually reserved for the three-member set of OpenAI/Anthropic/GDM. How worried should we be that AI use cases which are not building up to the singularity and the robot factories (aka normal people being more empowered, understanding the world better, being entertained, etc.) are not the highest ROI use of compute in the world? And given how valuable compute will be (its opportunity cost increases in tandem with the quality of the AI models that run on it), will normal people basically get priced out of the benefits of AI? If we should be worried about this, how concretely should some kind of universal basic income/compute redistribution work? If we shouldn’t be worried, what is the frame of this question missing?
Data is arguably the main way that AI models have been getting better over the last few years. But I remain confused about what concretely these improvements have consisted of. To ask some sharper questions:
Clearly Anthropic (and now also OpenAI and GDM) have cracked something about making competent long horizon coding agents. What is it? Is it just stacking up more and more RL coding environments? Or is there something more particular behind this breakthrough?
Are models even getting more sample efficient (aka they learn more from each training sample), or have we just changed/expanded/improved the data input? This question matters because it tells us how fast deep learning progress will be in domains that actually do require sample efficiency (for example, robotics).
Models are very sample efficient in context, and the information in context can be used much more flexibly. But the attention “fast” weights consume a huge amount of memory in order to accommodate this faster learning. Why is there this memory/sample efficiency tradeoff?
If you look at the size of the KV cache for Llama 3 70B, it’s 320 KB / token, or about 2.6 million bits. If you just divide the number of bits it takes to store Llama 3 weights by the number of tokens it was pre-trained on, then you get 0.075 bits / token. So there’s a 35 million fold difference in the number of bits stored per token.
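The arithmetic behind those figures can be reproduced with a short sketch. The configuration values below are assumptions taken from the publicly reported Llama 3 70B setup (80 layers, 8 KV heads under grouped-query attention, head dimension 128, 16-bit values, roughly 15 trillion pretraining tokens), not numbers stated in this post, and the function names are purely illustrative.

```python
# Back-of-the-envelope comparison of KV cache vs. weight storage per token.
# All constants are ASSUMED from the publicly reported Llama 3 70B configuration.
N_LAYERS = 80            # transformer layers (assumption)
N_KV_HEADS = 8           # key/value heads with grouped-query attention (assumption)
HEAD_DIM = 128           # dimension per attention head (assumption)
BYTES_PER_VALUE = 2      # 16-bit cache entries (assumption)
N_PARAMS = 70e9          # parameter count (assumption)
BITS_PER_PARAM = 16      # 16-bit weights (assumption)
PRETRAIN_TOKENS = 15e12  # approximate pretraining tokens (assumption)


def kv_cache_bytes_per_token() -> int:
    """Bytes of KV cache written per token: one key and one value vector
    per KV head, per layer."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def weight_bits_per_token() -> float:
    """Total bits in the weights divided by the number of pretraining tokens."""
    return N_PARAMS * BITS_PER_PARAM / PRETRAIN_TOKENS


def storage_ratio() -> float:
    """How many times more bits the KV cache stores per token than the weights do."""
    return kv_cache_bytes_per_token() * 8 / weight_bits_per_token()


if __name__ == "__main__":
    print(f"KV cache: {kv_cache_bytes_per_token() / 1024:.0f} KiB per token")
    print(f"Weights:  {weight_bits_per_token():.3f} bits per token")
    print(f"Ratio:    {storage_ratio() / 1e6:.0f} million fold")
```

Under these assumptions the result lands on the same figures as above; a different cache precision or token count would shift the ratio proportionally.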
Let’s put frontier lab compute into 3 buckets: pretraining, RL generation, and inference. RL generation and inference look like very similar workloads. The big difference, of course, is that the model learns as a result of RL generation, but it doesn’t (at least currently) from inference. At the same time, the model actually does useful work during inference, but not during RL generation. Many people have pointed out it’s really weird that there’s a distinction between training and inference, and that in the limit it shouldn’t exist. How practically will these two workloads be merged? At a high level, one can imagine hiring an AI instance for a month-long work trial, getting it to do actual useful work for you during that time, and then sending a report card back to the model company. In fact, in a few years, maybe the only way that AI can continue to make progress is through this kind of on-the-job learning, because models will already have saturated anything that can be learned from contrived shorter-horizon RL environments.
Does something Y2K-like happen when most of the tokens on the internet (and presumably the ones future models will be trained on) are generated by other AIs? Has the relative value of pre-2023 internet datasets increased in any noticeable way?
I wrote this in my continual learning blog post last June. Is this correct? Why might there not be a winner take all dynamic from continual learning?
“Even if there isn’t a software only singularity (with models rapidly building smarter and smarter successor systems), we might still see something that looks like a broadly deployed intelligence explosion. AIs will be getting broadly deployed through the economy, doing different jobs and learning while doing them in the way humans can. But unlike humans, these models can amalgamate their learnings across all their copies. So one AI is basically learning how to do every single job in the world. An AI that is capable of online learning might functionally become a superintelligence quite rapidly without any further algorithmic progress”
A lot of economic analysis about the impact of AGI focuses on human demand - will the economy shrink because our demands can be fulfilled much more cheaply, will it grow because AI will create new varieties of products, or maybe because the relational sector will grow? But all these analyses take as a given that the only demand that matters is the one originating from humans. How do we model the machine-only economy, where the demand originates from the AIs themselves? And once we add this consideration to our economic analysis of the future, what changes?
By Dwarkesh Patel