Nemotron-4 340B (NVIDIA)


NVIDIA just released a 340B dense LLM matching the original OpenAI GPT-4's performance for chat applications and synthetic data generation.

🧮 340B parameters with a 4k context window
3️⃣ Base, Reward and Instruct models released
🔢 Trained on 9 trillion tokens in 2 phases
🌎 Trained on English, 50+ natural languages and 40+ programming languages
🧠 Requires 16x H100 in bf16 and ~8x H100 in int4
🥇 Base model achieves 81.1 MMLU; 90.53 HellaSwag; 85.44 BBH
🧬 Used SFT, DPO, and RPO for post-training
🔓 Commercially usable, but under a custom license
🤗 Available on Hugging Face

Model: https://lnkd.in/eHAwvtNK
Technical Report: https://lnkd.in/ex9J5Dbp
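A quick back-of-envelope check on those GPU counts (rough arithmetic, not an official sizing guide): in bf16 the weights alone are about 680 GB, which already exceeds the 640 GB of combined HBM on 8x 80 GB H100s, hence 16 GPUs; in int4 the weights shrink to roughly 170 GB, which fits on 8 GPUs with room left for the KV cache and activations.

```python
# Back-of-envelope weight memory for a 340B-parameter model.
# Weights only; KV cache, activations and runtime overhead come on top.
PARAMS = 340e9
H100_HBM_GB = 80  # per-GPU memory on an 80 GB H100

for precision, bytes_per_param in [("bf16", 2.0), ("int4", 0.5)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    min_gpus = weights_gb / H100_HBM_GB
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> "
          f"at least {min_gpus:.1f} H100s before overhead")
```

Running it gives ~680 GB (8.5 GPUs) for bf16 and ~170 GB (2.1 GPUs) for int4, consistent with the quoted 16x and ~8x H100 figures once cache and runtime overhead are accounted for.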

Nemotron-4 340B

Beats Mixtral 8x22B, Claude Sonnet, Llama 3 70B, Qwen 2 and competes with GPT-4

Releases Base, Instruct and Reward models

  • Trained on 9T tokens

  • 8T tokens of pre-training + 1T of continued pre-training for higher quality

  • Instruct model trained on 98% synthetic data! LFG!

  • Trained on English, 50+ natural languages and 40+ coding languages

  • June 2023 training cut-off

  • Apache 2.0-like custom license - explicitly allows synthetic data generation and commercial use

  • GPT-style decoder-only architecture with grouped-query attention (GQA) & RoPE embeddings (a minimal sketch of both follows this list)

  • Uses the 10K-sample, human-annotated HelpSteer2 dataset to train the Reward model - also released!

  • Run inference on 8x H100
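Since the post calls out GQA and RoPE, here is a minimal PyTorch sketch of what those two components do, with toy dimensions that have nothing to do with the actual Nemotron-4 layer sizes: grouped-query attention lets several query heads share one key/value head (shrinking the KV cache), and RoPE encodes token positions by rotating the query/key vectors.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for x of shape (batch, heads, seq, head_dim)."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def grouped_query_attention(q, k, v):
    """q: (b, n_q_heads, t, d); k, v: (b, n_kv_heads, t, d), n_q_heads % n_kv_heads == 0."""
    _, hq, t, d = q.shape
    group = hq // k.shape[1]
    # Each KV head is shared by `group` query heads: expand K and V along the head dim.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    q, k = rope(q), rope(k)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5                            # (b, hq, t, t)
    causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return scores.softmax(dim=-1) @ v                                        # (b, hq, t, d)

# Toy shapes: 8 query heads sharing 2 KV heads (the real model is vastly larger).
b, t, d = 1, 8, 16
q, k, v = torch.randn(b, 8, t, d), torch.randn(b, 2, t, d), torch.randn(b, 2, t, d)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 8, 16])
```

With 8 query heads sharing 2 KV heads, the KV cache is a quarter of the size it would be under standard multi-head attention, which matters at a 4k context and 340B scale.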

Bonus: They release a 340B reward model, too - it ranks #1 on the RewardBench

Congratulations to NVIDIA for open-sourcing it; it is such a brilliant model!

What does "Release Base, Instruct and Reward Model" mean?

"Release Base, Instruct and Reward model" refers to a potential approach for developing and training Large Language Models (LLMs). Here's a breakdown of each term:

1. Base model:

  • The base model is the pre-trained LLM that serves as the starting point. Base models are trained on massive datasets of text and code, giving them a broad understanding of language.

  • Examples of "Release Base" could include models like GPT-3 from OpenAI or Jurassic-1 Jumbo from AI21 Labs.

2. Instruct model:

  • This stage fine-tunes the pre-trained LLM on instructions and example responses, so the model learns to perform specific tasks like writing different creative text formats, translating languages, or answering questions in an informative way.

  • Supervised fine-tuning (SFT) on instruction-response pairs and preference-based methods such as RLHF or DPO are the usual techniques; Nemotron-4 340B Instruct used SFT, DPO and RPO.

3. Reward model:

  • This is a separate model that evaluates the performance of the LLM on specific tasks. It provides "rewards" (positive feedback) when the LLM generates outputs that meet the desired criteria and "penalties" (negative feedback) for outputs that are incorrect or irrelevant.

  • This feedback loop guides the LLM's learning and helps it improve its performance on the instructed tasks.
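To make the "rewards and penalties" idea concrete, here is a minimal sketch of one common way to train a reward model: a pairwise (Bradley-Terry style) loss that pushes the score of a preferred response above the score of a rejected one. The tiny network and random "embeddings" below are placeholders so the snippet runs anywhere; Nemotron-4 340B Reward is a full LLM with a scoring head trained on HelpSteer2 annotations, so treat this as an illustration of the general idea rather than NVIDIA's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy stand-in: maps a fixed-size response embedding to a scalar score.
    A real reward model is an LLM backbone with a scoring head on the final token."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

# Toy preference data: (embedding of chosen response, embedding of rejected response).
dim, n_pairs = 32, 256
chosen = torch.randn(n_pairs, dim)
rejected = torch.randn(n_pairs, dim) - 0.5  # shift so the two groups are separable

model = TinyRewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    r_chosen, r_rejected = model(chosen), model(rejected)
    # Bradley-Terry pairwise loss: maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final pairwise loss: {loss.item():.3f}")
```

The learned scalar score is exactly the "reward" the main text describes: higher for outputs that match human preferences, lower for ones that don't.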

Putting it Together:

The "Release Base, Instruct and Reward model" approach suggests a development cycle for LLMs:

  1. Start with a pre-trained LLM (the Base model).

  2. Fine-tune it using specific instructions (Instruct).

  3. Continuously evaluate and improve the LLM using a reward model (Reward).

This approach allows for creating LLMs tailored for specific applications while leveraging the capabilities of pre-trained models.
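As an illustration of step 2 (instruction fine-tuning), here is a minimal supervised fine-tuning sketch using the Hugging Face transformers library. The small model name (distilgpt2) and the two made-up instruction/response pairs are placeholders chosen only so the snippet actually runs; fine-tuning a 340B-parameter model requires model-parallel infrastructure far beyond anything shown here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "distilgpt2" is a tiny placeholder so the sketch runs on a laptop;
# swap in a real base model (and multi-GPU tooling) for anything serious.
name = "distilgpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Made-up instruction/response pairs purely for illustration.
examples = [
    ("Translate to French: Hello, world.", "Bonjour, le monde."),
    ("Summarize: Cats sleep a lot.", "Cats spend much of the day sleeping."),
]

opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(2):
    for instruction, response in examples:
        text = f"### Instruction:\n{instruction}\n### Response:\n{response}{tok.eos_token}"
        batch = tok(text, return_tensors="pt")
        # Standard causal-LM objective: with labels == input_ids the model learns to
        # continue the instruction prompt with the desired response. Real SFT pipelines
        # usually mask the prompt tokens (label -100) so loss covers only the response.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        opt.step()
        opt.zero_grad()

print("final loss:", out.loss.item())
```

Step 3 then closes the loop: responses sampled from the fine-tuned model are scored by a reward model (or compared pairwise, as in DPO/RPO) and the policy is nudged toward higher-scoring outputs.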

Here are some additional points to consider:

  • This is a conceptual framework, and the specific implementation details can vary.

  • There are ongoing research efforts in areas like self-rewarding LLMs, where the model can learn to improve itself without the need for a separate reward model.

  • The effectiveness of this approach depends on the quality of the pre-trained LLM, the clarity of the instructions, and the design of the reward model.

In short, NVIDIA released all three pieces: the Base model, the instruction-tuned Instruct model, and the Reward model used to align it.
