Chameleon: Mixed-Modal Early-Fusion Foundation Models

Will Chameleon be Meta Llama 4?

Meta proposes "Chameleon: Mixed-Modal Early-Fusion Foundation Models", a unified approach with fully token-based representations of both images and text. No encoders or connectors.

Implementation:

1️⃣ Trained two tokenizers: an image tokenizer that encodes a 512 × 512 image into 1,024 tokens from an 8,192-entry codebook, and a BPE tokenizer with a vocabulary of 65,536 that includes the 8,192 image codebook tokens.

2️⃣ Uses a decoder-only architecture based on Llama 2, but incorporates query-key normalization and a reordering of layer norms to stabilize training in the mixed-modal setting.

3️⃣ Pretraining stage 1 (80% of training): unsupervised training on text-only data (Llama 2 and CodeLlama data ⇒ 2.9T tokens), text-image pairs (1.4B pairs / 1.5T tokens), and interleaved text/image data (400B tokens).

4️⃣ Pretraining stage 2 (20%): halved the stage-1 dataset and mixed in higher-quality data and instruction data.

5️⃣ Fine-tuned on ~1.8 million samples, of which ~100k are vision samples.
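To make the token arithmetic concrete, here is a minimal sketch of the unified vocabulary described above. The layout (image codes occupying the first 8,192 ids, text BPE ids after them) and all names are illustrative assumptions, not Meta's actual implementation:

```python
# Hypothetical sketch of Chameleon's unified mixed-modal vocabulary.
# Assumption: image codebook ids come first, text BPE ids follow.
IMAGE_CODEBOOK_SIZE = 8_192      # discrete image codes
TOTAL_VOCAB = 65_536             # full BPE vocab, image codes included
TEXT_VOCAB = TOTAL_VOCAB - IMAGE_CODEBOOK_SIZE

IMG_SIDE = 512                   # input image resolution (pixels)
TOKENS_PER_IMAGE = 1_024         # codes emitted per 512x512 image
# 1,024 codes on a 32x32 grid => each code covers 512/32 = 16 px per side
PATCH_PX = IMG_SIDE // int(TOKENS_PER_IMAGE ** 0.5)

def encode_interleaved(text_ids, image_codes):
    """Flatten one (text, image) pair into a single token stream.

    text_ids: BPE ids in [IMAGE_CODEBOOK_SIZE, TOTAL_VOCAB)
    image_codes: codebook indices in [0, IMAGE_CODEBOOK_SIZE)
    """
    assert len(image_codes) == TOKENS_PER_IMAGE
    assert all(0 <= c < IMAGE_CODEBOOK_SIZE for c in image_codes)
    # Image codes live in the same vocabulary as text ids, so the
    # decoder sees one undifferentiated sequence: early fusion,
    # with no separate image encoder or connector module.
    return list(text_ids) + list(image_codes)

stream = encode_interleaved(range(8_192, 8_202), [0] * 1_024)
print(len(stream))   # 10 text tokens + 1,024 image tokens = 1034
print(PATCH_PX)      # 16
```

Note the compression this buys: one token per 16 × 16 pixel patch, so an entire image costs about as much context as a long paragraph of text.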
