4. Training the Model at Scale

The main challenge in training LLMs is their sheer scale. When a model has hundreds of billions of parameters and is trained on trillions of tokens, the computational cost is enormous, and training is practically impossible without techniques that reduce memory use and speed up the process.

Here are three popular training techniques:

a. Mixed Precision Training

Mixed precision training uses both 32-bit and 16-bit floating-point numbers during model training: 16-bit numbers wherever possible, and 32-bit numbers only where the extra precision is needed. (Note: NVIDIA provides comprehensive documentation on this topic.)
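To make the "only when you have to" part concrete, here is a minimal sketch using only the Python standard library. It rounds values to 16-bit precision with `struct` and shows two cases where 16 bits are not enough. The helper name `to_fp16`, the numbers, and the loss-scale value are assumptions chosen for illustration, and loss scaling is a common companion technique rather than something described above.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# 1) A small weight update is lost when the weight is stored in 16 bits.
weight = 1.0
update = 1e-4
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))  # rounds back to 1.0
master_result = weight + update  # higher-precision "master" copy keeps the update

# 2) A tiny gradient underflows to zero in 16 bits unless it is scaled first.
grad = 1e-8
loss_scale = 1024.0                       # hypothetical scale factor
unscaled_fp16 = to_fp16(grad)             # underflows to 0.0
scaled_fp16 = to_fp16(grad * loss_scale)  # survives as a small nonzero value
recovered = scaled_fp16 / loss_scale      # unscale again in higher precision

print(fp16_result, master_result)
print(unscaled_fp16, recovered)
```

This is why mixed precision setups typically keep a higher-precision copy of the weights for the update step while running most of the arithmetic in 16 bits.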

b. 3D Parallelism

This is a combination of three parallelization strategies:

(i) Pipeline Parallelism: This distributes the transformer layers across multiple GPUs, placing adjacent layers on the same GPU to reduce the amount of cross-GPU communication.

(ii) Model Parallelism: This decomposes the matrix multiplications that make up the model into smaller matrix multiplications and distributes them across multiple GPUs.

(iii) Data Parallelism: This distributes the training data across multiple GPUs.
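As a rough illustration of the model parallelism idea, the sketch below splits a weight matrix into column blocks, multiplies each block separately (as if each lived on its own GPU), and concatenates the partial results. It is plain Python with no real devices involved; the helper names `matmul` and `split_columns` and the tiny matrices are assumptions made for the example.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Split matrix m into `parts` column blocks of equal width."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

x = [[1.0, 2.0],
     [3.0, 4.0]]                # activations: 2 x 2
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]      # weights: 2 x 4

# Each simulated "GPU" holds one column block of w and computes a partial output.
shards = split_columns(w, parts=2)
partials = [matmul(x, shard) for shard in shards]

# Concatenating the partial outputs column-wise recovers the full product.
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]

print(combined)
print(combined == matmul(x, w))
```

In a real system the concatenation step is a communication operation between GPUs, which is where much of the engineering effort goes.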

One challenge with parallelization is redundancy: model parameters and optimizer states need to be copied across multiple GPUs, so part of each GPU's precious memory is devoted to storing information that also exists in other places. This is where the Zero Redundancy Optimizer (ZeRO) is helpful.

c. Zero Redundancy Optimizer (ZeRO)

ZeRO reduces this redundancy by partitioning the optimizer states, gradients, and parameters across GPUs instead of keeping a full copy of each on every GPU.
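A back-of-the-envelope calculation shows how much this partitioning can save. The sketch below assumes the accounting commonly used for mixed-precision training with an Adam-style optimizer: 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for optimizer states (a 32-bit weight copy plus two 32-bit moment estimates). The function name `zero_memory_gb` and the example model size are hypothetical, and activations and buffers are ignored.

```python
def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for model states under ZeRO stages 0-3."""
    params = 2 * num_params   # 16-bit parameters
    grads = 2 * num_params    # 16-bit gradients
    optim = 12 * num_params   # 32-bit weight copy + two 32-bit optimizer moments
    if stage >= 1:
        optim /= num_gpus     # stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus    # stage 3: also partition parameters
    return (params + grads + optim) / 1e9

# Hypothetical example: a 7.5-billion-parameter model on 64 GPUs.
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

Under these assumptions, the per-GPU footprint for model states drops from about 120 GB with no partitioning to under 2 GB when all three kinds of state are partitioned.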

The above is only a basic explanation of these three training techniques. These techniques and many more are implemented by the DeepSpeed Python library. DeepSpeed isn't the only library available; others include Colossal-AI and Alpa.

Training Stability

Another consideration when training these massive models is training stability, which means keeping the training process running smoothly. Here are the main strategies:

i. Checkpointing: This takes a snapshot of model artifacts so that training can resume from that point. Suppose your training loss has been going down nicely, but after a week of training it suddenly spikes, the run blows up, and you don't know what happened. Checkpointing lets you go back to a point when everything was fine, debug what went wrong, and adjust the learning rate or other hyperparameters to try to avoid the spike.

ii. Weight Decay: This is a regularization strategy that penalizes large parameter values. There are two ways to do it: add a penalty term to the objective function, as in standard regularization, or change the parameter update rule.

iii. Gradient Clipping: This rescales the gradient of the objective function if its norm exceeds a pre-specified value, which helps avoid the exploding gradient problem that can blow up the training process.
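The three strategies above can be sketched together in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any particular library implements them: the function names `clip_by_global_norm` and `sgd_step`, the two-element weight list, and all numeric values are hypothetical, and the checkpoint is kept in memory where a real run would write it to disk.

```python
import copy
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; also return the raw norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads, norm

def sgd_step(weights, grads, lr, weight_decay):
    """SGD update with weight decay applied directly in the update rule."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]

state = {"step": 100, "weights": [3.0, -4.0]}
checkpoint = copy.deepcopy(state)   # snapshot of the training state

# A bad batch produces a huge gradient; clipping limits the size of the update.
raw_grads = [30.0, -40.0]
clipped, raw_norm = clip_by_global_norm(raw_grads, max_norm=1.0)
state["weights"] = sgd_step(state["weights"], clipped, lr=0.1, weight_decay=0.01)
state["step"] += 1

# If the loss spikes anyway, roll back to the snapshot and retry with new settings.
restored = copy.deepcopy(checkpoint)

print(raw_norm, clipped)
print(state, restored)
```

Adding `weight_decay * w` to the gradient before the update would correspond to the penalty-term form of weight decay instead. For plain SGD the two forms coincide, while for adaptive optimizers such as Adam they generally do not.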

Regular Parameters vs. Hyperparameters:

Regular Parameters: These are the millions or billions of weights within the LLM's architecture that get adjusted during training based on the data. They determine how the model transforms input data into outputs.

Hyperparameters: These are external settings that define the training process itself. They influence how the model learns from the data and affect its overall performance on tasks. They are crucial for guiding the model toward optimal performance but, unlike regular parameters, they are not learned during training. Hyperparameters are also not specific to LLMs.

Here's a breakdown of common hyperparameters in LLM training and some typical choices for their values:

(i) Batch Size: This is the number of training examples the model processes in a single iteration. A larger batch size can improve efficiency but might hurt generalization, while a smaller batch size can be slower but may generalize better. The batch size can be static or dynamic. Static batch sizes are usually large, for example on the order of 16 million tokens. As an example of a dynamic batch size, when training GPT-3 the batch size was gradually increased from 32,000 tokens to 3.2 million tokens.

(ii) Learning Rate: This determines the step size the model takes when updating its weights during training. A high learning rate can speed up learning but also cause instability, while a low learning rate can make training slow. Like the batch size, it can be static or dynamic, and dynamic learning rates are much more common for these models. A common strategy is to increase the learning rate linearly until it reaches a specified maximum value, then reduce it via cosine decay until it is about 10% of its maximum value.

(iii) Optimizer: This is the algorithm that determines how the model updates its weights based on the errors it makes during training. Different optimizers can have varying impacts on convergence speed and performance. Adam-based optimizers are the most commonly used for large language models. Adam, short for “Adaptive Moment Estimation” and introduced by Diederik P. Kingma and Jimmy Ba, is an iterative optimization algorithm used to minimize the loss function during the training of neural networks. It can be viewed as a combination of RMSprop and stochastic gradient descent with momentum.

(iv) Dropout Rate: Dropout randomly drops a fraction of neurons during training, which helps prevent the model from overfitting to the training data. The dropout rate is that fraction. Typical values are between 0.2 and 0.5, following the original dropout paper by Geoffrey Hinton et al.

(v) Number of Training Epochs: This specifies how many times the model iterates through the entire training dataset. More epochs can lead to better performance but also take longer to compute.
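Several of the settings above are easier to follow as code. The sketch below shows a warmup-then-cosine learning rate schedule that ends at 10% of the maximum, a linear batch size ramp, and a single Adam update for one scalar weight. All function names, default values, and step counts are assumptions for illustration, and the batch size ramp is only one plausible way to "gradually increase" the batch size.

```python
import math

def learning_rate(step, max_lr, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay down to min_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = max_lr * min_ratio
    return min_lr + (max_lr - min_lr) * cosine

def batch_size_tokens(step, start, end, ramp_steps):
    """Linear ramp of the batch size (in tokens) from start to end."""
    if step >= ramp_steps:
        return end
    return int(start + (end - start) * step / ramp_steps)

def adam_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * g * g    # RMSprop-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical settings: 100 warmup steps out of 1,000 total.
for step in (0, 99, 550, 1000):
    print(step, learning_rate(step, max_lr=3e-4, warmup_steps=100, total_steps=1000))

print(batch_size_tokens(500, start=32_000, end=3_200_000, ramp_steps=1000))
print(adam_step(w=1.0, g=2.0, m=0.0, v=0.0, t=1, lr=0.1))
```

Note that on the very first Adam step the update has a magnitude of roughly `lr` regardless of the gradient's scale, which is one reason a warmup phase pairs naturally with this optimizer.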

Tuning Hyperparameters: Finding the optimal set of hyperparameters is crucial for getting the best performance out of an LLM. This process often involves experimentation and evaluation using techniques like grid search or random search.
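As a minimal sketch of random search, the code below samples hyperparameter settings at random and keeps the one with the lowest score. The `validation_loss` function is a made-up stand-in for an expensive train-and-evaluate run, and the search ranges are assumptions chosen only so the example runs quickly.

```python
import math
import random

def validation_loss(lr, dropout):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (math.log10(lr) + 3.5) ** 2 + (dropout - 0.1) ** 2

def random_search(num_trials, seed=0):
    """Sample configurations at random and return the best trial plus all trials."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, -2),   # sample the learning rate on a log scale
            "dropout": rng.uniform(0.0, 0.5),
        }
        trials.append((validation_loss(**config), config))
    best = min(trials, key=lambda trial: trial[0])
    return best, trials

best, trials = random_search(num_trials=20)
print(best)
```

Grid search would replace the random sampling with a fixed set of values per hyperparameter and try every combination. For LLMs, each trial is so expensive that such searches are usually run on much smaller models first.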

Why Hyperparameters Matter:

Performance Impact: The right hyperparameters can significantly improve the LLM's ability to learn complex patterns and perform well on tasks like text generation, translation, or question answering.

Resource Optimization: Hyperparameters can also influence the efficiency of training. Tuning them can help reduce training time and resource consumption.

Overall, hyperparameters are essential tools for guiding the training process of LLMs and unlocking their full potential. By carefully selecting and tuning these settings, developers can create powerful and effective language models.
