


The paper introduces Chameleon, a family of early-fusion, token-based mixed-modal foundation models developed by FAIR at Meta. Unlike traditional multimodal models that rely on separate, modality-specific encoders and decoders, Chameleon utilizes a unified transformer architecture that represents both images and text as discrete tokens from the very beginning. This early-fusion approach allows the model to seamlessly understand, reason over, and generate arbitrary sequences of interleaved text and images.
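To make the early-fusion idea concrete, here is a minimal sketch of how text and images could be flattened into one discrete token stream over a shared vocabulary, so a single transformer can consume and emit interleaved sequences. The tokenizer interfaces, vocabulary size, and begin/end-of-image sentinels below are illustrative assumptions, not Chameleon's actual components.

```python
def build_interleaved_sequence(text_tokenizer, image_tokenizer, segments):
    """Flatten ordered ("text", str) / ("image", image) segments into one
    token ID sequence drawn from a single shared vocabulary (sketch only)."""
    TEXT_VOCAB = 65_536                      # assumed text vocabulary size
    BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1    # assumed image-boundary sentinels
    IMAGE_OFFSET = TEXT_VOCAB + 2            # image codes placed after text IDs

    tokens = []
    for kind, content in segments:
        if kind == "text":
            tokens.extend(text_tokenizer.encode(content))
        else:
            # A VQ-style image tokenizer yields a fixed-length grid of
            # discrete codes; shift them so they do not collide with text IDs.
            codes = image_tokenizer.encode(content)
            tokens.append(BOI)
            tokens.extend(IMAGE_OFFSET + c for c in codes)
            tokens.append(EOI)
    return tokens
```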
To achieve stable training across approximately 10 trillion tokens of mixed-modal data, the researchers developed key architectural and optimization innovations, such as query-key normalization (QK-Norm) and revised layer norm placements. These adjustments were crucial for preventing the training divergence that typically occurs when training multiple modalities of varying entropy together.
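As a rough illustration of the QK-Norm idea, the sketch below applies LayerNorm to the per-head query and key projections before attention, which bounds the attention logits. Shapes, hyperparameters, and module layout are assumptions for illustration and do not reproduce the paper's exact parameterization or norm placement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with query-key normalization (illustrative sketch)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        # QK-Norm: normalize each head's query and key vectors.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim)
        v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # the QK-Norm step
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, hd)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, d))
```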
Extensive evaluations demonstrate that Chameleon achieves state-of-the-art performance on vision-language tasks, such as image captioning and visual question answering, while maintaining highly competitive performance on text-only benchmarks compared to models like Mixtral 8x7B and Gemini Pro. Most notably, large-scale human evaluations reveal that Chameleon substantially outperforms strong baselines like Gemini Pro and GPT-4V in generating open-ended, long-form mixed-modal documents.
By Yun Wu