Chameleon: Mixed-Modal Early-Fusion Foundation Models

Will Chameleon be Meta Llama 4?

Meta proposes “Chameleon: Mixed-Modal Early-Fusion Foundation Models”, a unified approach that represents both images and text as fully token-based sequences. No encoders or connectors.

Implementation:

1️⃣ Trained two tokenizers: an image tokenizer that encodes a 512 × 512 image into 1024 tokens from an 8192-entry codebook, and a BPE text tokenizer with a vocabulary of 65,536 that includes the 8192 image codebook tokens (see the token-layout sketch below).

2️⃣ Uses a decoder-only architecture based on Llama 2, but adds query-key normalization and reorders the layer norms to stabilize training in the mixed-modal setting (see the attention sketch below).

3️⃣ Pretraining stage 1 (80% of training): unsupervised training on text-only data (Llama 2 and CodeLlama data ⇒ 2.9T tokens), text-image pairs (1.4B pairs / 1.5T tokens), and interleaved text/image data (400B tokens).

4️⃣ Pretraining stage 2 (20%): the weight of the stage-1 data is halved and higher-quality data plus instruction data are mixed in (see the mixture sketch below).

5️⃣ Fine-tuned on ~1.8 million samples, including ~100k vision samples.
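
For intuition, here is a minimal sketch of what “fully token-based” means in practice: BPE text ids and discrete image codes end up in one flat sequence over a shared vocabulary. The offset scheme, special-token ids, and placeholder ids below are assumptions for illustration, not Chameleon’s actual tokenizer layout.

```python
# Minimal sketch of Chameleon-style early fusion: text and image become one
# token stream over a shared vocabulary. The tokenizer layout, special-token
# ids, and offset scheme below are illustrative assumptions.

TEXT_VOCAB_SIZE = 65_536         # BPE vocabulary, includes the image codes
IMAGE_CODEBOOK_SIZE = 8_192      # discrete image codebook entries
IMAGE_TOKENS_PER_IMAGE = 1_024   # one 512x512 image -> 1024 codes

# Assume image codes occupy the last 8192 ids of the shared vocabulary.
IMAGE_CODE_OFFSET = TEXT_VOCAB_SIZE - IMAGE_CODEBOOK_SIZE
BOI_ID, EOI_ID = 3, 4            # hypothetical begin/end-of-image markers


def encode_mixed_modal(text_ids_before, image_codes, text_ids_after):
    """Flatten already-tokenized text and VQ image codes into one sequence."""
    assert len(image_codes) == IMAGE_TOKENS_PER_IMAGE
    assert all(0 <= c < IMAGE_CODEBOOK_SIZE for c in image_codes)
    return (
        list(text_ids_before)
        + [BOI_ID]
        + [IMAGE_CODE_OFFSET + c for c in image_codes]  # remap into shared vocab
        + [EOI_ID]
        + list(text_ids_after)
    )


# Example: a short caption, then an image, then a follow-up question.
sequence = encode_mixed_modal(
    text_ids_before=[101, 7, 42],                  # placeholder BPE ids
    image_codes=list(range(IMAGE_TOKENS_PER_IMAGE)),  # placeholder codes
    text_ids_after=[55, 9],
)
print(len(sequence))  # 3 + 1 + 1024 + 1 + 2 = 1031
```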
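
The query-key normalization idea can be sketched as a LayerNorm applied to the queries and keys of each head before the attention scores are computed, which keeps the attention logits bounded. The PyTorch module below is only an illustrative sketch of that idea (it does not show the layer-norm reordering), not Chameleon’s actual code.

```python
# Sketch of query-key normalization inside one causal attention block,
# assuming PyTorch 2.x; dimensions and placement are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Normalize queries and keys per head before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)       # query-key normalization
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(b, t, d)
        return self.out(y)


x = torch.randn(2, 16, 512)                          # (batch, seq, dim)
print(QKNormAttention(dim=512, n_heads=8)(x).shape)  # torch.Size([2, 16, 512])
```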
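
The two-stage pretraining mixture can be pictured as a weighted sampler over data sources, where stage 2 halves the stage-1 weights and adds new sources. Only the stage-1 token counts come from the paper; the sizes of the added stage-2 sources below are hypothetical placeholders.

```python
# Toy sketch of a two-stage data mixture as a weighted source sampler.
import random

# Stage 1: the large-scale mix (token counts from the paper).
stage1_mix = {
    "text_only": 2.9e12,          # ~2.9T tokens (Llama 2 + CodeLlama data)
    "text_image_pairs": 1.5e12,   # ~1.5T tokens from 1.4B pairs
    "interleaved": 4.0e11,        # ~400B tokens
}

# Stage 2: halve the weight of every stage-1 source, then add
# higher-quality and instruction data (sizes here are hypothetical).
stage2_mix = {name: w * 0.5 for name, w in stage1_mix.items()}
stage2_mix.update({"high_quality": 1.0e11, "instruction": 5.0e10})


def sample_source(mix: dict[str, float]) -> str:
    """Pick the next data source proportionally to its weight."""
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]


print(sample_source(stage2_mix))
```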
