Subtitle: You can just draw more samples.
I recently got to 50% accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go to the next section). I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs.
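To make the sample-and-select idea concrete, here is a minimal sketch of the selection step only. It is not my actual pipeline: the function names (`run_candidate`, `select_output`), the convention that each candidate defines a function called `transform`, and the majority vote over surviving programs are all assumptions made for illustration. Sampling from GPT-4o is left out entirely, so the candidates are just a list of source strings.

```python
from collections import Counter


def run_candidate(source, grid):
    """Run one candidate program on a grid; return None if it fails in any way.

    Assumption: each candidate is Python source that defines `transform(grid)`.
    Note that `exec` on model-written code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        # Pass a copy so a candidate cannot mutate the caller's grid.
        return namespace["transform"]([row[:] for row in grid])
    except Exception:
        return None


def select_output(candidates, examples, test_input):
    """Keep candidates that reproduce every example, then vote on the test output."""
    passing = [
        src for src in candidates
        if all(run_candidate(src, inp) == out for inp, out in examples)
    ]
    outputs = [run_candidate(src, test_input) for src in passing]
    outputs = [o for o in outputs if o is not None]
    if not outputs:
        return None
    # Hypothetical tie-breaking rule: the most common output among survivors wins.
    winner = Counter(repr(o) for o in outputs).most_common(1)[0][0]
    return next(o for o in outputs if repr(o) == winner)


# Toy puzzle where the hidden rule is "mirror each row".
candidates = [
    "def transform(grid):\n    return [row[::-1] for row in grid]",
    "def transform(grid):\n    return grid",          # wrong rule
    "def transform(grid):\n    return grid[0][99]",   # crashes
    "def transform(grid):\n    return [list(reversed(row)) for row in grid]",
]
examples = [([[1, 0], [2, 0]], [[0, 1], [0, 2]]), ([[3, 4]], [[4, 3]])]
prediction = select_output(candidates, examples, [[5, 6, 7]])
```

In this toy case the identity program and the crashing program are filtered out by the examples, and the two mirroring programs agree on the test output. With thousands of samples per problem, the same filter is what turns a low per-sample success rate into a usable answer.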
[This post is on a pretty different topic than the usual posts on our substack. So regular readers should be warned!]
The additional approaches and tweaks are:
---
Outline:
(03:05) What is ARC-AGI?
(06:22) My method
(10:52) What are the returns to more sampling?
(12:14) What are the returns to better prompting and code fixing?
(17:03) Qualitative analysis
(20:21) Caveats
(22:00) Predictions
(23:45) What it means about current LLMs
(27:13) What ARC-AGI tells us about AGI
(31:16) Appendix: A bunch of tricks used in my solutions
(38:40) Appendix: results for the train set
(39:13) Appendix: Returns to revision samples
---