3. Model Architecture
In this step we need to define the architecture of the language model. As far as Large Language Models are concerned, Transformers have emerged as the state-of-the-art architecture.
A Transformer is a neural network architecture that relies entirely on ‘attention mechanisms’ to map ‘inputs’ to ‘outputs’. What do we mean by an ‘attention mechanism’? We can define it as something that learns dependencies between different elements of a sequence based on ‘position’ and ‘content’.
This is based on the intuition that, when it comes to language, context matters. Let's look at a couple of examples. In the sentence “I hit the baseball with a bat”, the appearance of ‘baseball’ implies that ‘bat’ is probably a baseball bat and not a nocturnal mammal. This is an example of the ‘content’ of the context of the word ‘bat’: ‘bat’ exists in the larger context of the sentence, and the content is the words making up that context.
The content of the context determines what word is going to come next and what a word means. But content isn't enough; the positioning of the words also matters. To see that, consider another example: “I hit the bat with a baseball”. In this sentence there is more ambiguity about what ‘bat’ means. It could still mean a baseball bat, but people don't really hit baseball bats with baseballs. They hit baseballs with baseball bats.
One might reasonably think ‘bat’ here means the nocturnal mammal. An attention mechanism captures both these aspects of language: it uses both the content of the sequence and the position of each element in the sequence to help infer what the next word should be.
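To make the idea concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The dimensions and the random weight matrices are made up for illustration; a real model learns these projections, and positional information is injected separately (see the position embedding section later).

```python
import torch
import torch.nn.functional as F

# Toy example: a sequence of 5 tokens, each represented by an 8-dimensional vector.
seq_len, d_model = 5, 8
x = torch.randn(seq_len, d_model)  # token representations (the "content")

# Learned projections would normally produce queries, keys and values;
# here we use random weight matrices just to illustrate the shapes.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention weights: how much each token "looks at" every other token,
# based on the content of the sequence.
scores = Q @ K.T / d_model**0.5       # (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)   # each row sums to 1
output = weights @ V                  # new, context-aware representations
print(weights.shape, output.shape)    # torch.Size([5, 5]) torch.Size([5, 8])
```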
At first it might seem that Transformers are a single, tightly constrained architecture. In fact, we have an incredible amount of freedom and choice as developers when making a Transformer model. At a high level there are three types of Transformers, which follow from the two modules that exist in the Transformer architecture: the encoder and the decoder.
We can have an encoder by itself that can be the architecture.
We can have a decoder by itself that's another architecture.
We can have the encoder and decoder working together as the third architecture.
1. Encoder only Transformer
The Encoder only Transformer translates tokens into a semantically meaningful representation. These are typically good for text classification tasks, or if you're just trying to generate an embedding for some text.
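As a minimal sketch (PyTorch, with arbitrary layer sizes chosen only for illustration), an encoder-only stack can turn a tokenized sequence into a single text embedding, for example by mean-pooling the output:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64  # arbitrary sizes for illustration
embed = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

token_ids = torch.randint(0, vocab_size, (1, 10))  # a batch with one 10-token sequence
hidden = encoder(embed(token_ids))                 # (1, 10, 64) contextual representations
text_embedding = hidden.mean(dim=1)                # (1, 64) pooled embedding for classification/retrieval
```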
2. Decoder only Transformer
The Decoder only Transformer is similar to an encoder in that it translates text into a semantically meaningful internal representation, but a decoder is trying to predict the next word, i.e. to predict future tokens. For this, decoders do not allow self-attention with future elements, which makes them great for text generation tasks.
The difference between the encoder's self-attention mechanism and the decoder's self-attention mechanism is that, with the encoder, any part of the sequence can interact with any other part of the sequence. If you look at the weight matrices that generate these internal representations in the encoder, you will notice that none of the weights are forced to zero.
The decoder, on the other hand, uses ‘masked self-attention’: any weight that would connect a token to a token in the future is set to zero. It doesn't make sense for the decoder to see into the future if it's trying to predict the future; that would be circular, since being able to predict the future by already knowing it defeats the purpose of prediction. A sketch of this mask follows below.
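Here is a minimal sketch of that causal mask in PyTorch (toy score values): positions above the diagonal, i.e. connections to future tokens, are blocked before the softmax so their attention weights come out as exactly zero.

```python
import torch
import torch.nn.functional as F

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores (toy values)

# Causal mask: True above the diagonal = "this token would look into the future".
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float("-inf"))

weights = F.softmax(masked_scores, dim=-1)
print(weights)  # upper-triangular entries are 0: no attention to future tokens
```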
3. Both Encoder and Decoder Transformer
We can combine the encoder and decoder together to create a third choice of model architecture. This was actually the original design of the Transformer model. What you can do with an encoder-decoder model that you can't do with the others is ‘cross-attention’. Instead of being restricted to self-attention in the encoder or masked self-attention in the decoder, the encoder-decoder model also allows cross-attention: the embeddings from the encoder form one sequence, the internal embeddings of the decoder form another sequence, and an attention weight matrix connects the two. Because the encoder's representations can communicate with the decoder's representations, this enables tasks such as translation, which was the original application of the Transformer model.
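A hedged sketch of cross-attention using PyTorch's nn.MultiheadAttention (sequence lengths and dimensions are made up): the queries come from the decoder's representations while the keys and values come from the encoder's output, so the two sequences can even have different lengths.

```python
import torch
import torch.nn as nn

d_model = 64
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

encoder_out = torch.randn(1, 12, d_model)    # e.g. 12 source-language tokens
decoder_hidden = torch.randn(1, 7, d_model)  # e.g. 7 target-language tokens generated so far

# Queries from the decoder, keys/values from the encoder.
out, attn_weights = cross_attn(query=decoder_hidden, key=encoder_out, value=encoder_out)
print(attn_weights.shape)  # (1, 7, 12): each decoder position attends over all encoder positions
```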
While we do have three options to choose from when making a Transformer, by far the most popular is the decoder-only architecture, where you use only the decoder part of the Transformer to do the language modeling. This is also called ‘causal language modeling’, which basically means that given a sequence of text you want to predict future text.
Beyond just this high-level choice of model architecture, there are actually a lot of other design choices and details that one needs to take into consideration.
a. Residual Connections
‘Residual Connections’ are connections in your model architecture that allow intermediate training values to bypass various hidden layers. What this looks like is: you have some input, and instead of strictly feeding it into your hidden layer, you allow it to go both into the hidden layer and around it. Then you aggregate the original input and the output of the hidden layer in some way to generate the input for the next layer.
And of course, there are many different ways to do this, given all the different details that can go into a hidden layer. You can have the input and the output of the hidden layer be added together and then have an activation applied to the sum.
You can have the input and the output of the hidden layer be added, then apply some kind of normalization, and then apply the activation. Or you can have the original input and the output of the hidden layer simply be added together.
You have a tremendous amount of flexibility and design choice when it comes to these Residual Connections.
In the original Transformer architecture, the input bypasses the multi-headed attention layer and is then added and normalized together with that layer's output (the ‘Add & Norm’ step), and the same pattern is repeated after each subsequent sub-layer in the encoder and decoder.
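A rough sketch of the ‘Add & Norm’ pattern in PyTorch. The sublayer here is just a stand-in feed-forward block (not the actual multi-headed attention layer), and the sizes are arbitrary; the point is only to show where the residual addition and the normalization happen.

```python
import torch
import torch.nn as nn

d_model = 64
# Stand-in sublayer; in the real architecture this could be multi-headed attention or a feed-forward block.
sublayer = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
norm = nn.LayerNorm(d_model)

x = torch.randn(1, 10, d_model)  # input to the layer

# Post-layer normalization (original Transformer): residual "Add" followed by "Norm".
out_post = norm(x + sublayer(x))

# Pre-layer normalization variant: normalize first, then add the residual.
out_pre = x + sublayer(norm(x))
```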
b. Layer Normalization
‘Layer Normalization’ is rescaling values between layers based on their mean and standard deviation. When it comes to layer normalization, there are two considerations to make.
One is where you normalize. There are generally two options here: you can normalize before the layer, also called ‘pre-layer normalization’, or you can normalize after the layer, also called ‘post-layer normalization’.
How you Normalize
Another consideration is how you normalize. One of the most common ways is via ‘Layer Norm’: you take your input x, subtract its mean, and divide by the standard deviation (the square root of the variance plus a small epsilon for numerical stability), then multiply by a gain factor and optionally add a bias term. In other words, y = gain * (x − mean) / sqrt(variance + ε) + bias.
An alternative to this is the Root Mean Square Norm, or RMSNorm, which is very similar: it just drops the mean term from the numerator and replaces the denominator with the root mean square of the input (and it typically has no bias term).
While you have a few different options for layer normalization, based on the survey of Large Language Models mentioned earlier, pre-layer normalization combined with the vanilla Layer Norm approach seems to be the most common.
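A minimal sketch of the two normalizations in plain PyTorch, just to make the difference concrete (the tensor sizes are arbitrary):

```python
import torch

def layer_norm(x, gain, bias, eps=1e-5):
    # Vanilla LayerNorm: subtract the mean, divide by the standard deviation.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gain * (x - mean) / torch.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-5):
    # RMSNorm: no mean subtraction and no bias; divide by the root mean square.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gain * x / rms

d_model = 8
x = torch.randn(2, d_model)
gain, bias = torch.ones(d_model), torch.zeros(d_model)
print(layer_norm(x, gain, bias).shape, rms_norm(x, gain).shape)
```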
c. Activation Functions
These are ‘non-linear functions’ that we can include in the model which, in principle, allow it to capture complex mappings between inputs and outputs; they introduce ‘non-linearities’ into the model. There are several common choices for Large Language Models, such as GeLU, ReLU (Rectified Linear Unit), Swish, SwiGLU and GeGLU. There are more, but GLU variants seem to be the most common for Large Language Models. (The Gated Linear Unit (GLU), introduced in 2016, is a crucial activation function that influenced these later variants; OpenAI's GPT models use GeLU.)
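A short sketch of a few of these activations in PyTorch. SwiGLU isn't a built-in module, so it's written out here in the gated form commonly used; the exact formulation (hidden sizes, biases) varies between papers, so treat this as an assumption-laden illustration rather than a canonical definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 16)

relu_out = F.relu(x)  # ReLU: max(0, x)
gelu_out = F.gelu(x)  # GeLU: smooth variant of ReLU used in many LLMs
silu_out = F.silu(x)  # Swish/SiLU: x * sigmoid(x)

class SwiGLU(nn.Module):
    """Gated activation: Swish(x W) elementwise-multiplied by a second projection (x V)."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.w = nn.Linear(d_in, d_hidden, bias=False)
        self.v = nn.Linear(d_in, d_hidden, bias=False)

    def forward(self, x):
        return F.silu(self.w(x)) * self.v(x)

print(SwiGLU(16, 32)(x).shape)  # (4, 32)
```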
d. Position Embeddings
Another design choice is how we perform position embeddings. Position embeddings capture information about token positions.
The way this was done in the original Transformer paper was using sine and cosine basis functions, which added a unique value to each token position to represent its position.
In the original Transformer architecture, you had your tokenized input, and the positional encodings were simply added to that tokenized input for both the encoder input and the decoder input.
More recently, ‘Relative Positional Encodings’ are being used instead of just adding some fixed positional encoding before the input is passed into the model. The idea with relative positional encodings is to bake the positional information into the attention mechanism itself.
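A minimal sketch of the original sinusoidal (absolute) positional encodings in PyTorch, which are simply added to the token embeddings before the first layer (sequence length and model dimension are arbitrary here):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine encodings in the style of the original Transformer paper."""
    position = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

token_embeddings = torch.randn(10, 64)  # 10 tokens, model dimension 64
x = token_embeddings + sinusoidal_positional_encoding(10, 64)  # position info added to the input
```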
e. How large is the model going to be?
One last consideration is how large the model is going to be. This is important because if a model is too big or trained too long it can ‘overfit’; on the other hand, if a model is too small or not trained long enough it can underperform.
Both of these are relative to the training data: there is a relationship between the number of parameters, the number of computations (or training time), and the size of the training data set. There's a paper called "An empirical analysis of compute-optimal large language model training" by Jordan Hoffmann et al. where they analyze optimal compute considerations for Large Language Models.
From that paper, you can see that a 400 million parameter model should undergo approximately 2 × 10^19 floating-point operations and be trained on about 8 billion tokens. A 1 billion parameter model should then have roughly 10 times as many floating-point operations and be trained on about 20 billion tokens, and so on and so forth.
The takeaway from this is that you should have about 20 training tokens per model parameter. It's not going to be very precise, but it's a good rule of thumb. Also, for every 10x increase in model parameters there's about a 100x increase in floating-point operations.
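A back-of-the-envelope sketch of those rules of thumb in Python, using the common approximation that training FLOPs ≈ 6 × parameters × tokens (an assumption, not something stated in the paper excerpt above):

```python
def compute_budget(n_params, tokens_per_param=20):
    """Rough Chinchilla-style rule of thumb: ~20 training tokens per model parameter."""
    n_tokens = tokens_per_param * n_params
    flops = 6 * n_params * n_tokens  # common approximation for total training compute
    return n_tokens, flops

for n_params in (400e6, 1e9, 10e9):
    n_tokens, flops = compute_budget(n_params)
    print(f"{n_params:,.0f} params -> {n_tokens:,.0f} tokens, ~{flops:.2e} FLOPs")
```

For 400 million parameters this gives 8 billion tokens and roughly 2 × 10^19 FLOPs, in line with the figures quoted above, and a 10x jump in parameters gives a 100x jump in FLOPs.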