The Nonlinear Library

LW - The surprising parameter efficiency of vision models by beren



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The surprising parameter efficiency of vision models, published by beren on April 8, 2023 on LessWrong.
Crossposted from my personal blog.
Epistemic status: This is a short post meant to highlight something I do not yet understand and therefore a potential issue with my models. I would also be interested to hear if anybody else has a good model of this.
Why do vision (and audio) models work so well despite being so small? State of the art models like Stable Diffusion and Midjourney work exceptionally well, generating near-photorealistic art and images and giving users a fair degree of control over their generations. I would estimate with a fair degree of confidence that the capabilities of these models surpass the mental imagery abilities of almost all humans (they definitely surpass mine and those of a number of people I have talked to). However, these models are also super small in terms of parameters: the original Stable Diffusion has only 890M parameters.
In terms of dataset size, image models are roughly at parity with humans. The Stable Diffusion training set contains 2 billion images. Assuming that you see 10 images per second for every waking second, and that you are awake 18 hours a day, you observe about 230 million images per year, and so receive the same data input as Stable Diffusion after roughly 10 years. Of course, the images you see are much more redundant, and these are highly aggressive assumptions, but a human lifetime landing in the same OOM as a SOTA image model's dataset is not insane. By contrast, the hundreds of billions to trillions of tokens fed to LLMs are orders of magnitude beyond what any human could ever experience.
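The arithmetic above can be checked with a quick back-of-the-envelope calculation; the figures (10 images per second, 18 waking hours, a 2-billion-image dataset) are the post's own assumptions, not measured values:

```python
# Back-of-the-envelope check of the image-throughput estimate above.
# Assumptions from the post: 10 images/second, 18 waking hours/day.
images_per_second = 10
waking_hours_per_day = 18
seconds_per_day = waking_hours_per_day * 3600

images_per_year = images_per_second * seconds_per_day * 365
# ~236 million images/year, matching the post's ~230M figure.

stable_diffusion_dataset = 2_000_000_000  # ~2 billion images
years_to_match = stable_diffusion_dataset / images_per_year
# ~8.5 years, i.e. about a decade, as the post claims.
```

Even with these deliberately aggressive assumptions, the conclusion only needs to hold at the order-of-magnitude level, and it does.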
A similar surprising smallness occurs in audio models. OpenAI's Whisper can do almost flawless audio transcription (including multilingual translation!) with just 1.6B parameters.
Let's contrast this with the brain. Previously, I estimated that we should expect the visual cortex to have on the order of 100B parameters, if not more. The auditory cortex should be of roughly the same order of magnitude, though slightly smaller than the visual cortex. That is two orders of magnitude larger than state of the art DL models in these modalities.
This contrasts with state of the art language models, which appear to be approximately equal to the brain in both parameter count and ability. Small (1-10B) language models are clearly inferior to the brain at producing valid text and completions, as well as at standard question-answering and factual-recall tasks. Human parity in factual knowledge is reached somewhere between GPT-2 and GPT-3. Human language abilities are still not entirely surpassed by GPT-3 (175B parameters) or GPT-4 (presumably significantly larger). This puts large language models within approximately the same order of magnitude as the human linguistic cortex.
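The order-of-magnitude gaps discussed above can be made concrete with a small sketch; the parameter counts are the rough figures quoted in this post, not authoritative measurements:

```python
import math

# Rough parameter counts quoted in the post (assumptions, not exact).
params = {
    "stable_diffusion": 890e6,   # original Stable Diffusion
    "whisper": 1.6e9,            # OpenAI Whisper
    "visual_cortex": 100e9,      # post's estimate for the brain
    "gpt3": 175e9,               # GPT-3
}

def oom_gap(model: float, brain: float) -> float:
    """Order-of-magnitude gap between a model and a brain region."""
    return math.log10(brain / model)

# Visual cortex vs Stable Diffusion: ~2 OOM, per the post's claim.
gap = oom_gap(params["stable_diffusion"], params["visual_cortex"])
```

By the same measure, GPT-3 sits within a fraction of an OOM of the 100B-scale estimate, which is the "same order of magnitude" claim for linguistic cortex.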
What could be the reasons for this discrepancy? Off the top of my head I can think of several, listed below and ranked by rough intuitive plausibility; it would be interesting to investigate these further. If anybody has ideas or evidence either way, please send me a message.
1.) The visual cortex vs image models is not a fair comparison. The brain does lots of things image generation models cannot, such as parsing and rendering very complex visual scenes, dealing with saccades and input from two eyes, and, crucially, handling video data and moving stimuli. We haven't fully cracked video yet, and it is plausible that doing so would require existing vision models to scale by an OOM or two.
2.) There are specific inefficiencies in the brain's processing of images that image models skip, and which do not apply to language models. One very obvious example is convolutions. While CNNs have convolutional filters which are applied to all tiles of the image individually, the brain cannot do this an...
The Nonlinear Library, by The Nonlinear Fund