# Chameleon: Mixed-Modal Early-Fusion Foundation Models

Will Chameleon be [Meta](https://www.linkedin.com/company/meta/) Llama 4?

Meta proposes “Chameleon: Mixed-Modal Early-Fusion Foundation Models”, a unified approach that represents both images and text entirely as tokens. No separate encoders or connectors.\
\
Implementation:\
1️⃣ Trained two tokenizers: an image tokenizer that encodes a 512 × 512 image into 1024 tokens drawn from a codebook of 8192 entries, and a BPE tokenizer with a vocabulary of 65,536 that includes the 8192 image codebook tokens.\
2️⃣ Uses a decoder architecture based on Llama 2, adding query-key normalization and reordering the layer norms to stabilize training in the mixed-modal setting.\
3️⃣ Pretraining stage 1 (80%): unsupervised training on text-only data (Llama 2 and CodeLlama data, 2.9T tokens), text-image data (1.4B pairs, 1.5T tokens), and interleaved text/image data (400B tokens).\
4️⃣ Pretraining stage 2 (20%): the weight of the first-stage data was halved, and higher-quality data and instruction data were mixed in.\
5️⃣ Fine-tuned on \~1.8 million samples with \~100k vision samples.
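
The tokenizer numbers in step 1 imply a few derived quantities, and the shared vocabulary means a mixed-modal document is just one flat sequence of token ids. The sketch below works through that arithmetic in Python. It assumes the 1024 image tokens form a square grid, and it places the image codes at the top of the vocabulary; that offset, along with the helper names `image_code_to_token_id`, `is_image_token`, and `interleave`, is hypothetical and chosen only for illustration, not taken from the paper.

```python
import math

# Numbers quoted in the post.
IMAGE_SIDE_PX = 512          # input images are 512 x 512
TOKENS_PER_IMAGE = 1024      # tokens produced per image
IMAGE_CODEBOOK_SIZE = 8192   # entries in the image codebook
VOCAB_SIZE = 65_536          # BPE vocabulary, image codes included

# ASSUMPTION (illustration only): image codes occupy the last 8192 ids.
IMAGE_TOKEN_OFFSET = VOCAB_SIZE - IMAGE_CODEBOOK_SIZE


def image_code_to_token_id(code: int) -> int:
    """Map an image codebook index to an id in the shared vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return IMAGE_TOKEN_OFFSET + code


def is_image_token(token_id: int) -> bool:
    """True if the id falls in the (assumed) image range of the vocabulary."""
    return IMAGE_TOKEN_OFFSET <= token_id < VOCAB_SIZE


def interleave(text_ids: list[int], image_codes: list[int]) -> list[int]:
    """Build one flat sequence: text ids followed by image token ids."""
    return list(text_ids) + [image_code_to_token_id(c) for c in image_codes]


# Derived quantities, assuming a square token grid.
grid_side = math.isqrt(TOKENS_PER_IMAGE)                # 32 tokens per side
pixels_per_token_side = IMAGE_SIDE_PX // grid_side      # 16 x 16 pixels per token
bits_per_image_token = math.log2(IMAGE_CODEBOOK_SIZE)   # 13.0 bits per token
non_image_vocab = VOCAB_SIZE - IMAGE_CODEBOOK_SIZE      # 57344 remaining ids

if __name__ == "__main__":
    print(grid_side, pixels_per_token_side, bits_per_image_token, non_image_vocab)
    print(interleave([1, 2, 3], [0, 8191]))
```

Because text and image tokens share one id space, the same decoder can consume and emit either modality; a check like `is_image_token` is all that is needed to tell them apart in a generated sequence.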

![](/files/iy4aiJPYKvM5r1TZU7cn)\
\
Insights:\
🔗 Previous MLLMs (IDEFICS, GPT-4V, Flamingo) used encoders and connectors for multimodality, which limited their ability to generate multimodal documents (image + text outputs).\
🦎 Chameleon can understand and generate both text and images using discrete tokens.\
📚 Chameleon-34B was trained for 2.1 epochs over the full training dataset, for a total of 9.2T tokens.\
🔧 Code data improved performance on text-only reasoning tasks.\
⚖️ Maintaining stable training was challenging when scaling the Chameleon models above 8B parameters and 1T tokens.\
🚀 The last 20% of pre-training with high-quality data significantly boosted performance.\
🏆 Chameleon-34B outperforms Llama2-70B and approaches Mixtral 8x7B/Gemini-Pro on GSM8K, MATH, and MMLU.\
📊 Chameleon-34B outperforms Flamingo-80B and IDEFICS-80B on MS-COCO and matches them on Flickr30k.\
🎯 Chameleon-34B achieves a 60.4% win rate against Gemini-Pro and a 51.6% win rate against GPT-4V.\
⚖️ Balanced modality datasets are important for fine-tuning and alignment.\
\
Paper: <https://lnkd.in/e_eH3fZS>\
\
Note: With its native multimodal tokens, Chameleon looks closer to [OpenAI](https://www.linkedin.com/company/openai/) GPT-4o than to Uni-MoE (shared yesterday).

