Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale for inferring the answer. However, existing CoT studies are mostly confined to the language modality and rely on LLMs, which are hard to deploy. To elicit CoT reasoning in multimodal settings, a possible solution is to fine-tune small language models by fusing vision and language features to perform CoT reasoning.
2023: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, G. Karypis, Alexander J. Smola
https://arxiv.org/pdf/2302.00923v1.pdf
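
The abstract mentions fusing vision and language features but does not say how. As a rough illustration only, the sketch below shows one common option, an element-wise sigmoid gate that mixes the two feature streams. The function name `gated_fusion`, the per-dimension gate weights `w_text` and `w_vision`, and the premise that vision features are already aligned to the text tokens and share their dimension are all assumptions made for this example, not details taken from the text above.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_feats, vision_feats, w_text, w_vision):
    """Mix text and vision features with an element-wise gate.

    Hypothetical helper for illustration. Assumes both inputs are lists
    of per-token vectors with identical shape, and that w_text and
    w_vision hold one gate weight per feature dimension (a simplification
    of a full learned projection matrix).
    """
    if len(text_feats) != len(vision_feats):
        raise ValueError("text and vision features must have the same length")
    fused = []
    for t_vec, v_vec in zip(text_feats, vision_feats):
        row = []
        for t, v, wt, wv in zip(t_vec, v_vec, w_text, w_vision):
            # The gate decides how much of the vision feature to let through.
            gate = sigmoid(wt * t + wv * v)
            row.append((1.0 - gate) * t + gate * v)
        fused.append(row)
    return fused


# With all gate weights at zero the gate is 0.5, so the result is the average.
example = gated_fusion([[1.0, 2.0]], [[3.0, 4.0]], [0.0, 0.0], [0.0, 0.0])
print(example)  # [[2.0, 3.0]]
```

In a fine-tuning setup of the kind the abstract describes, the gate weights would be learned jointly with the language model; here they are fixed inputs so the mixing behavior is easy to inspect.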