
This academic paper introduces PROXYTHINKER, a novel inference-time method designed to enhance the visual reasoning abilities of large vision-language models (LVLMs). Unlike computationally expensive fine-tuning approaches such as reinforcement fine-tuning (RFT), PROXYTHINKER lets a larger model inherit reasoning skills from smaller, pre-trained reasoning models at decoding time. It achieves this by adjusting the large model's output distribution using the difference between a small RFT expert's output and a small base model's output. The paper demonstrates that this training-free technique significantly improves performance on various visual reasoning benchmarks while remaining computationally efficient.
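A minimal sketch of that adjustment is shown below. It assumes the combination is a per-token logit shift of the form large + (expert - base), applied at each decoding step, and that all three models share a tokenizer; the model names are placeholders, and text-only causal LMs stand in for the LVLMs the paper targets. This is an illustration of the general idea, not the paper's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model names; the paper uses vision-language models.
tokenizer = AutoTokenizer.from_pretrained("large-base-model")
large = AutoModelForCausalLM.from_pretrained("large-base-model").to(device).eval()
expert = AutoModelForCausalLM.from_pretrained("small-rft-expert").to(device).eval()
base = AutoModelForCausalLM.from_pretrained("small-base-model").to(device).eval()

@torch.no_grad()
def proxy_decode(prompt: str, max_new_tokens: int = 128) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        # Next-token logits from each model on the same context.
        l_large = large(ids).logits[:, -1, :]
        l_expert = expert(ids).logits[:, -1, :]
        l_base = base(ids).logits[:, -1, :]
        # Shift the large model's logits by the expert-minus-base delta,
        # steering decoding toward the RFT expert's reasoning behavior.
        adjusted = l_large + (l_expert - l_base)
        next_id = adjusted.argmax(dim=-1, keepdim=True)  # greedy, for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Because only forward passes are needed, no gradients or fine-tuning of the large model are involved, which is what makes the approach training-free.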