Chameleon: Mixed-Modal Early-Fusion Foundation Models
Will Chameleon be Meta Llama 4?
Meta proposes “Chameleon: Mixed-Modal Early-Fusion Foundation Models”, a unified approach that represents both images and text as sequences of discrete tokens. No encoders or connectors.
Implementation:
1️⃣ Trained two tokenizers: an image tokenizer that encodes a 512 × 512 image into 1,024 tokens from a codebook of 8,192, and a BPE tokenizer with a vocabulary of 65,536, which includes the 8,192 image codebook tokens (see the token-layout sketch below).
2️⃣ Uses a decoder-only architecture based on Llama 2, but incorporates query-key normalization and a reordering of the layer norms to stabilize training in the mixed-modal setting (see the attention-block sketch below).
3️⃣ Pretraining stage 1 (80%): unsupervised training on text-only data (Llama 2, CodeLlama ⇒ 2.9T tokens), text-image pairs (1.4B pairs / 1.5T tokens), and interleaved text/image data (400B tokens).
4️⃣ Pretraining stage 2 (20%): halved the first-stage data and mixed in higher-quality data and instruction data.
5️⃣ Fine-tuned on ~1.8 million samples, including ~100k vision samples.
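The early-fusion idea in 1️⃣ boils down to mapping both modalities into one shared ID space that a single decoder models autoregressively. Here is a minimal Python sketch of that layout; the exact ID ranges, the BOI/EOI special tokens, and the helper functions (encode_text, encode_image) are illustrative assumptions, not the paper's actual tokenizers.

```python
# Minimal sketch of Chameleon-style early fusion: both modalities become
# discrete token IDs in one shared vocabulary, so a single decoder-only
# transformer can model the interleaved sequence.
# The ID layout, special tokens, and helpers below are assumptions for
# illustration, not the exact scheme from the paper.

TEXT_VOCAB_SIZE = 65_536 - 8_192       # BPE text/code tokens (assumed split)
IMAGE_CODEBOOK_SIZE = 8_192            # VQ codebook entries for 512x512 images
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE   # image codes mapped after the text IDs

BOI, EOI = 0, 1  # hypothetical "begin/end of image" special tokens


def encode_text(text: str) -> list[int]:
    """Stand-in for the BPE tokenizer: returns text token IDs."""
    # A real implementation would call a trained BPE tokenizer here.
    return [2 + (ord(c) % 1000) for c in text]


def encode_image(image) -> list[int]:
    """Stand-in for the VQ image tokenizer: 512x512 image -> 1,024 codes."""
    # A real implementation would run a learned vector-quantizing encoder.
    return [i % IMAGE_CODEBOOK_SIZE for i in range(1024)]


def build_mixed_modal_sequence(segments) -> list[int]:
    """Interleave text and image segments into one flat ID sequence."""
    ids: list[int] = []
    for kind, payload in segments:
        if kind == "text":
            ids.extend(encode_text(payload))
        elif kind == "image":
            ids.append(BOI)
            ids.extend(IMAGE_TOKEN_OFFSET + c for c in encode_image(payload))
            ids.append(EOI)
    return ids


sequence = build_mixed_modal_sequence(
    [("text", "A photo of a chameleon:"), ("image", None), ("text", "It changes color.")]
)
print(len(sequence))  # text tokens + 1,024 image tokens + 2 special tokens
```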
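For 2️⃣, a hedged PyTorch sketch of a decoder block with query-key normalization and the reordered norm placement (normalizing each sub-layer's output before the residual add). Using LayerNorm/GELU instead of Llama 2's RMSNorm/SwiGLU and omitting rotary embeddings are simplifications, not the released architecture.

```python
# Sketch of the two training-stability tweaks summarized above:
# (a) QK-norm: normalization applied to queries and keys inside attention,
# (b) reordered norms: each sub-layer's output is normalized before the
#     residual add, instead of the usual pre-norm-only placement.
# Dimensions and the choice of LayerNorm/GELU are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormSelfAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Query-key normalization over the head dimension.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (batch, heads, tokens, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # QK-norm
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))


class ChameleonStyleBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = QKNormSelfAttention(dim, n_heads)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.ffn_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reordered norms: normalize the sub-layer output, then add the residual.
        x = x + self.attn_norm(self.attn(x))
        x = x + self.ffn_norm(self.ffn(x))
        return x


block = ChameleonStyleBlock()
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```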
Insights:
🔗 Previous MLLMs (IDEFICS, GPT-4V, Flamingo) used encoders and connectors for multimodality, which limited their ability to generate multimodal documents (image + text outputs).
🦎 Chameleon can both understand and generate text and images as sequences of discrete tokens (see the decoding sketch after this list).
📚 Chameleon-34B was trained for 2.1 epochs over the full training dataset, for a total of 9.2T tokens.
🔧 Code data improved performance on text-only reasoning tasks.
⚖️ Maintaining stable training was challenging when scaling Chameleon models above 8B parameters and 1T tokens.
🚀 The last 20% of pre-training with high-quality data significantly boosted performance.
🏆 Chameleon-34B outperforms Llama2-70B and approaches Mixtral 8x7B/Gemini-Pro on GSM8K, MATH, and MMLU.
📊 Chameleon-34B outperforms Flamingo-80B and IDEFICS-80B on MS-COCO and matches them on Flickr30k.
🎯 Chameleon-34B achieves a 60.4% win rate against Gemini-Pro and a 51.6% win rate against GPT-4V.
⚖️ Balanced modality datasets are important for fine-tuning and alignment.
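As referenced above, here is a short sketch of the generation side: a single stream of predicted token IDs is split back into text and image segments. The BOI/EOI markers, ID offsets, and decoder stubs are hypothetical and mirror the tokenization sketch earlier in the post.

```python
# Minimal sketch (pure Python, hypothetical helpers) of routing one generated
# ID stream back to the two modalities: IDs between the assumed BOI/EOI
# markers go to an image decoder, everything else to the BPE detokenizer.
# The ID layout is an assumption, not the paper's exact scheme.

TEXT_VOCAB_SIZE = 65_536 - 8_192
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE
BOI, EOI = 0, 1  # hypothetical begin/end-of-image special tokens


def detokenize_text(ids: list[int]) -> str:
    # Stand-in for the BPE detokenizer.
    return f"<{len(ids)} text tokens>"


def decode_image_codes(codes: list[int]) -> str:
    # Stand-in for the VQ decoder mapping 1,024 codes back to a 512x512 image.
    return f"<image from {len(codes)} codes>"


def decode_mixed_modal(ids: list[int]) -> list[tuple[str, str]]:
    """Split one generated ID stream into ('text', ...) / ('image', ...) parts."""
    outputs, text_buf, image_buf = [], [], []
    in_image = False
    for tok in ids:
        if tok == BOI:
            if text_buf:
                outputs.append(("text", detokenize_text(text_buf)))
                text_buf = []
            in_image = True
        elif tok == EOI:
            outputs.append(("image", decode_image_codes(image_buf)))
            image_buf, in_image = [], False
        elif in_image:
            image_buf.append(tok - IMAGE_TOKEN_OFFSET)
        else:
            text_buf.append(tok)
    if text_buf:
        outputs.append(("text", detokenize_text(text_buf)))
    return outputs


# Example: 3 text tokens, then an "image" of 4 codes, then 2 more text tokens.
stream = [5, 6, 7, BOI, IMAGE_TOKEN_OFFSET + 1, IMAGE_TOKEN_OFFSET + 2,
          IMAGE_TOKEN_OFFSET + 3, IMAGE_TOKEN_OFFSET + 4, EOI, 8, 9]
print(decode_mixed_modal(stream))
```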
Paper: https://lnkd.in/e_eH3fZS
Note: With its native multimodal tokens, Chameleon looks closer to OpenAI's GPT-4o than to Uni-MoE (shared yesterday).