Chameleon: Mixed-Modal Early-Fusion Foundation Models

Will Chameleon be Meta Llama 4?

Meta proposes “Chameleon: Mixed-Modal Early-Fusion Foundation Models”, a unified approach that represents both images and text as fully token-based sequences. No encoders or connectors.

Implementation:

1️⃣ Trained two tokenizers: an image tokenizer that encodes a 512 × 512 image into 1024 tokens from an 8192-entry codebook, and a BPE text tokenizer with a vocabulary of 65,536 that includes the 8192 image codebook tokens (see the token-layout sketch below).

2️⃣ Uses a decoder-only architecture based on Llama 2, but adds query-key normalization and reorders the layer norms to stabilize training in the mixed-modal setting (see the attention sketch below).

3️⃣ Pretraining stage 1 (80% of training): unsupervised training on text-only data (Llama 2 and CodeLlama data ⇒ 2.9T tokens), text-image pairs (1.4B pairs / 1.5T tokens), and interleaved text/image data (400B tokens).

4️⃣ Pretraining stage 2 (20%): the weight of the stage-1 data is halved and higher-quality data plus instruction data are mixed in (see the mixture sketch below).

5️⃣ Fine-tuned on ~1.8 million samples, including ~100k vision samples.
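
For intuition, here is a minimal sketch of what “fully token-based” means in practice: BPE text ids and discrete image codes end up in one flat sequence over a shared vocabulary. The offset scheme, special-token ids, and placeholder ids below are assumptions for illustration, not Chameleon’s actual tokenizer layout.

```python
# Minimal sketch of Chameleon-style early fusion: text and image become one
# token stream over a shared vocabulary. The tokenizer layout, special-token
# ids, and offset scheme below are illustrative assumptions.

TEXT_VOCAB_SIZE = 65_536         # BPE vocabulary, includes the image codes
IMAGE_CODEBOOK_SIZE = 8_192      # discrete image codebook entries
IMAGE_TOKENS_PER_IMAGE = 1_024   # one 512x512 image -> 1024 codes

# Assume image codes occupy the last 8192 ids of the shared vocabulary.
IMAGE_CODE_OFFSET = TEXT_VOCAB_SIZE - IMAGE_CODEBOOK_SIZE
BOI_ID, EOI_ID = 3, 4            # hypothetical begin/end-of-image markers


def encode_mixed_modal(text_ids_before, image_codes, text_ids_after):
    """Flatten already-tokenized text and VQ image codes into one sequence."""
    assert len(image_codes) == IMAGE_TOKENS_PER_IMAGE
    assert all(0 <= c < IMAGE_CODEBOOK_SIZE for c in image_codes)
    return (
        list(text_ids_before)
        + [BOI_ID]
        + [IMAGE_CODE_OFFSET + c for c in image_codes]  # remap into shared vocab
        + [EOI_ID]
        + list(text_ids_after)
    )


# Example: a short caption, then an image, then a follow-up question.
sequence = encode_mixed_modal(
    text_ids_before=[101, 7, 42],                  # placeholder BPE ids
    image_codes=list(range(IMAGE_TOKENS_PER_IMAGE)),  # placeholder codes
    text_ids_after=[55, 9],
)
print(len(sequence))  # 3 + 1 + 1024 + 1 + 2 = 1031
```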
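
The query-key normalization idea can be sketched as a LayerNorm applied to the queries and keys of each head before the attention scores are computed, which keeps the attention logits bounded. The PyTorch module below is only an illustrative sketch of that idea (it does not show the layer-norm reordering), not Chameleon’s actual code.

```python
# Sketch of query-key normalization inside one causal attention block,
# assuming PyTorch 2.x; dimensions and placement are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Normalize queries and keys per head before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)       # query-key normalization
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(b, t, d)
        return self.out(y)


x = torch.randn(2, 16, 512)                          # (batch, seq, dim)
print(QKNormAttention(dim=512, n_heads=8)(x).shape)  # torch.Size([2, 16, 512])
```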
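
The two-stage pretraining mixture can be pictured as a weighted sampler over data sources, where stage 2 halves the stage-1 weights and adds new sources. Only the stage-1 token counts come from the paper; the sizes of the added stage-2 sources below are hypothetical placeholders.

```python
# Toy sketch of a two-stage data mixture as a weighted source sampler.
import random

# Stage 1: the large-scale mix (token counts from the paper).
stage1_mix = {
    "text_only": 2.9e12,          # ~2.9T tokens (Llama 2 + CodeLlama data)
    "text_image_pairs": 1.5e12,   # ~1.5T tokens from 1.4B pairs
    "interleaved": 4.0e11,        # ~400B tokens
}

# Stage 2: halve the weight of every stage-1 source, then add
# higher-quality and instruction data (sizes here are hypothetical).
stage2_mix = {name: w * 0.5 for name, w in stage1_mix.items()}
stage2_mix.update({"high_quality": 1.0e11, "instruction": 5.0e10})


def sample_source(mix: dict[str, float]) -> str:
    """Pick the next data source proportionally to its weight."""
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]


print(sample_source(stage2_mix))
```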
