Snowflake Arctic: 128-Expert MoE
Snowflake has released Arctic, a new open-source large language model built on a novel "Dense-MoE hybrid transformer" architecture that combines a dense transformer with 128 smaller expert models. This Mixture of Experts (MoE) approach (a minimal code sketch follows the list below) is claimed to provide several benefits:
Training efficiency: By routing each token through only a few of the many small "expert" models, rather than activating one large dense model in full, training becomes more computationally efficient and less expensive. The article states that Arctic's training cost was under $2 million, far lower than estimates for models like GPT-4 (around $60 million).
Model performance: Despite each expert being relatively small, routing across 128 of them allows Arctic to achieve high performance on enterprise tasks like coding, SQL generation, and instruction following, which Snowflake calls "enterprise intelligence".
Scalability: Having many smaller expert models makes it easier to scale up the overall model size and capabilities by adding more experts, compared to scaling up a single large model.
Specialization: Each expert can potentially specialize in specific tasks or domains, allowing the overall model to handle a diverse set of tasks effectively.
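To make the routing idea concrete, here is a minimal, illustrative sketch of a top-k gated MoE layer in PyTorch. It is not Arctic's actual implementation; the sizes (hidden=512, num_experts=8, top_k=2) are placeholders chosen for readability, whereas Arctic uses 128 experts. The sketch only shows the core mechanism: a router scores each token, the top-k experts are selected, and their outputs are mixed with softmax-normalized weights.

```python
# Illustrative top-k gated Mixture-of-Experts layer (not Arctic's code).
# Sizes are placeholders; Arctic uses 128 experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(hidden, num_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden). Route each token to its top-k experts and
        # mix their outputs with softmax-normalized router weights.
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)             # (tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                    # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out


if __name__ == "__main__":
    layer = MoELayer(hidden=512, num_experts=8, top_k=2)
    tokens = torch.randn(16, 512)
    print(layer(tokens).shape)  # torch.Size([16, 512])
```

Because only the selected experts run for each token, total parameter count can grow (more experts) without a proportional increase in per-token compute, which is the basis of the efficiency, scalability, and specialization claims above.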
The key claimed innovation is that Snowflake's Dense-MoE hybrid architecture reduces the communication overhead between experts during training, largely by overlapping that communication with computation on the dense path. This overhead has been a major inefficiency in traditional Mixture of Experts training, and hiding it is what makes training a very large MoE model like Arctic's 128-expert configuration cost-effective.
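The following is a conceptual, single-device sketch of a Dense-MoE hybrid block: a dense attention-plus-FFN path combined with a residual MoE path (it reuses the hypothetical MoELayer from the sketch above, and layer sizes are again placeholders). In a real distributed training setup the benefit comes from overlapping the experts' all-to-all communication with the dense computation; this sketch only shows the dataflow, not that overlap, and it is not Arctic's actual implementation.

```python
# Conceptual Dense-MoE hybrid block: dense transformer path plus a residual
# MoE path. Single-device sketch of the dataflow only; the communication/
# computation overlap happens in the distributed implementation.
import torch
import torch.nn as nn


class DenseMoEHybridBlock(nn.Module):
    def __init__(self, hidden: int, num_experts: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.dense_ffn = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )
        self.moe = MoELayer(hidden, num_experts)  # residual MoE branch (defined above)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        h = self.norm2(x)
        b, s, d = h.shape
        # Dense FFN output plus the residual MoE output; in distributed
        # training the MoE all-to-all can be hidden behind the dense work.
        moe_out = self.moe(h.reshape(b * s, d)).reshape(b, s, d)
        return x + self.dense_ffn(h) + moe_out
```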