This research presents a novel training method for Vision Language Models (VLMs) focused on improving their ability to both assign scores to images and provide natural language explanations for those scores. By leveraging an existing image scoring dataset and an instruction-tuned VLM, the approach utilizes self-training without requiring additional external data or models. A key innovation is the creation of a dataset using Direct Preference Optimization (DPO) to enhance the alignment between predicted scores and generated text justifications. Through an iterative process of training the VLM on two self-generated datasets and then merging the resulting models, the system demonstrably improves both the accuracy of image scoring and the consistency of its accompanying explanations.
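The core of DPO is a dataset of preference pairs. A minimal sketch of how score-text alignment could drive pair construction, assuming the summarized setup (all names and the selection rule here are hypothetical, not taken from the paper): for each image, sample several (score, explanation) outputs from the VLM, treat the explanation whose predicted score lies closest to the ground-truth score as "chosen" and the farthest as "rejected".

```python
# Hypothetical sketch: build a DPO preference pair from a model's own
# sampled outputs for one image. The selection rule (closest predicted
# score = chosen, farthest = rejected) is an illustrative assumption.

def build_dpo_pair(gt_score, samples):
    """samples: list of (predicted_score, explanation) tuples
    sampled from the VLM for the same image."""
    # Rank samples by how close their predicted score is to ground truth.
    ranked = sorted(samples, key=lambda s: abs(s[0] - gt_score))
    chosen, rejected = ranked[0], ranked[-1]
    return {"chosen": chosen[1], "rejected": rejected[1]}

samples = [
    (7.0, "Sharp focus and pleasing lighting."),
    (3.0, "Blurry and underexposed."),
    (5.0, "Average composition."),
]
pair = build_dpo_pair(6.5, samples)
# chosen: "Sharp focus and pleasing lighting."
# rejected: "Blurry and underexposed."
```

Pairs built this way would then be fed to a standard DPO trainer, pushing the model toward explanations whose implied quality matches the score it predicts.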