


We argue that there are certain properties of language that our current large language models (LLMs) don't learn. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-Test. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory experiences, and the sensory-deprived processing capabilities of LLMs. In support of our hypothesis, (1) deliberate reasoning (Chain-of-Thought), (2) few-shot examples, and (3) a stronger LLM from the same model family (LLaMA 2 13B -> LLaMA 2 70B) do not trivially improve H-Test performance.
We therefore draw a connection to the philosophical case of Mary, who learns about the world in a sensory-deprived environment. Our experiments show that some of the strongest proprietary LLMs stay near the random-chance baseline accuracy of 50%, highlighting the limitations of knowledge acquired in the absence [...]
---
Outline:
(01:16) Key Findings on H-Test
(03:14) Acknowledgments and Links
---
Linkpost URL:
https://arxiv.org/abs/2402.11349
Narrated by TYPE III AUDIO.
By LessWrong
