2. Data Curation

This is the most important, and perhaps most tedious, part of the training process. The quality of any model is driven by the quality of its data, so it is critical to prepare the training data with great care, especially if you are investing millions of dollars in the model.

Large language models require very large training data sets. To get a sense of the scale: we are talking about roughly a trillion words of text, or in other words about a million novels or a billion news articles.

Going through a trillion words of text and ensuring data quality is a tremendous undertaking. But where do we even get all this text?

The most common place is the internet. The internet consists of web pages, Wikipedia, forums, books, scientific articles, code bases, you name it. Post-GPT, there is a lot more controversy around this, particularly regarding copyright law. The risk of scraping the web yourself is that you might grab data you do not have the rights to use, and using it in a model for potentially commercial purposes could cause trouble down the line.

Alternatively, there are many public data sets out there. One of the most popular is Common Crawl, a huge corpus of text from the internet. There are also more refined versions, such as the Colossal Clean Crawled Corpus (also called C4), and Falcon RefinedWeb, which was used to train Falcon 180B, mentioned earlier.

Another popular data set is The Pile, which brings together a wide variety of diverse data sources into one training data set.

And then there is Hugging Face, which has emerged as a major player in the generative AI and large language model space and hosts a ton of open-access data sets on its platform.

Other sources are private data sets, such as FinPile, which was used to train BloombergGPT. The key upside of private data sources is that you own the rights to the data and no one else has it, which can give you a strategic advantage if you are building a model for a business application, or any other application where there is competition and other players are also building their own large language models.

Finally, you can use an LLM to generate the training data. This is also called 'synthetic data'. A notable example comes from the Alpaca model by researchers at Stanford, who trained an LLM (Alpaca) on structured, instruction-style text generated by GPT-3.
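As a rough illustration of the idea (not the actual Alpaca pipeline), the sketch below assumes access to an LLM API via the `openai` Python client and a small, hypothetical list of seed instructions; it collects instruction-response pairs that could form a synthetic training set.

```python
# Minimal sketch of synthetic data generation, assuming the `openai` Python
# client and a hypothetical list of seed instructions. Illustrative only;
# this is not the actual Alpaca data pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

seed_instructions = [
    "Explain what tokenization means in one sentence.",
    "Write a haiku about data curation.",
]

synthetic_pairs = []
for instruction in seed_instructions:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any instruction-following model would do
        messages=[{"role": "user", "content": instruction}],
    )
    synthetic_pairs.append(
        {"instruction": instruction, "output": response.choices[0].message.content}
    )

# `synthetic_pairs` can now be written out as structured training examples.
```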

One aspect of a good training data set seems to be data set diversity. A diverse data set translates to a model that can perform well on a wide variety of tasks; essentially, it translates into a good general-purpose model.

Listed here are a few different models and the composition of their training data sets.

  • GPT-3 is mainly web pages, but also includes some books

  • Gopher is also mainly web pages, but with a larger share of books, plus some code

  • LLaMA is mainly web pages, but also includes books, code, and scientific articles

  • PaLM is mainly built on conversational data, but it is also trained on web pages, books, and code

How you curate your training data set is going to determine the types of tasks the LLM will be good at.

Diversity is an important consideration in the composition of your training data set: even something as small as adding an additional 3% of code to the training data has a quantifiable effect on the downstream model.
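To make composition concrete, it is often expressed as sampling weights over data sources. The sketch below is purely illustrative: the source names and weights are hypothetical and do not correspond to GPT-3, Gopher, LLaMA, or PaLM.

```python
# Illustrative only: hypothetical mixture weights for sampling training
# documents from different sources. Not the proportions of any real model.
import random

MIXTURE = {
    "web_pages": 0.70,
    "books": 0.15,
    "code": 0.10,
    "scientific_articles": 0.05,
}

def sample_source(weights=MIXTURE) -> str:
    """Pick which source the next training document should come from,
    proportionally to the mixture weights."""
    sources = list(weights)
    probs = list(weights.values())
    return random.choices(sources, weights=probs, k=1)[0]

# Example: decide the sources of the next five documents.
print([sample_source() for _ in range(5)])
```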

One other important question is: how do we prepare the data?

The quality of a model depends on the quality of the training data, so we need to be aware of, and sensitive to, the text that is used to train an LLM.

There are four key data preparation steps:

a. Quality Filtering:

This means removing text that is not helpful to the large language model. This could be random gibberish from some corner of the internet, toxic language or hate speech found on a forum, or things that are objectively false, like '2 + 2 = 5' (which you will see in the book 1984); while that text exists out there, it is not a true statement.

There are two types of quality filtering:

(i) The first is classifier-based. This is where you take a small, high-quality data set and use it to train a text classification model that automatically scores text as good or bad, low quality or high quality. That precludes the need for a human to read a trillion words of text to assess its quality; the work can be offloaded to the classifier.

(ii) The second type is heuristic-based. This is using various rules of thumb to filter the text: removing specific words (such as explicit text), removing a word if it repeats more than two times in a sentence, or using various statistical properties of the text to do the filtering.

And of course you can also do a combination of the two. You can use the classifier-based method to distill down your data set and then apply some heuristics on top, or vice versa: use heuristics to distill down the data set and then apply your classifier. There is no one-size-fits-all recipe for quality filtering; rather, there is a menu of different options and approaches one can take. A minimal sketch combining both approaches is shown below.
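The sketch assumes a small hand-labeled set of good and bad example documents and uses scikit-learn for the classifier; it is illustrative, not a production-grade filter.

```python
# Illustrative quality filter combining heuristics with a simple classifier.
# Assumes a small hand-labeled set of example documents (1 = keep, 0 = discard).
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def passes_heuristics(doc: str) -> bool:
    """Cheap rules of thumb: drop very short documents and documents where a
    single word makes up an unusually large share of the text."""
    words = re.findall(r"\w+", doc.lower())
    if len(words) < 50:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) < 0.2

# Tiny labeled sample (hypothetical) used to train the quality classifier.
labeled_docs = ["A well-written encyclopedia entry ...", "buy pills now click click click"]
labels = [1, 0]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(labeled_docs, labels)

def keep(doc: str) -> bool:
    """Apply the heuristics first, then the learned quality score."""
    return passes_heuristics(doc) and classifier.predict([doc])[0] == 1
```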

b. De-duplication

De-duplication means removing multiple instances of the same or very similar text. This is important because duplicate text can bias the model and disrupt training. For example, if the same web page exists on two different domains, one copy may end up in the training data set and the other in the testing data set, which makes it hard to get a fair assessment of model performance during training.
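A minimal sketch of exact de-duplication, via hashing a normalized form of each document, is shown below; detecting near-duplicates (for example with MinHash) is more involved and not shown.

```python
# Minimal exact de-duplication: hash a normalized form of each document and
# keep only the first occurrence. Near-duplicates would need fuzzier methods
# such as MinHash, which are not shown here.
import hashlib
import re

def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences
    do not hide duplicates."""
    return re.sub(r"\s+", " ", doc.lower()).strip()

def deduplicate(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Same page.", "  same   PAGE. ", "A different page."]
print(list(deduplicate(docs)))  # the second copy is dropped
```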

c. Privacy Redaction

Privacy redaction is essential, especially for text scraped from the internet, since it might include sensitive or confidential information. It is important to remove this text because if sensitive information makes its way into the training data set, it could be inadvertently learned by the language model and exposed in unexpected ways.
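Below is a minimal, illustrative sketch of pattern-based redaction for a couple of common identifiers (email addresses and US-style phone numbers); real pipelines typically rely on dedicated PII-detection tooling and cover far more categories.

```python
# Illustrative pattern-based redaction of a few common identifiers.
# Real pipelines use dedicated PII-detection tooling and many more patterns.
import re

PII_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with placeholder tokens."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567 for details."))
# -> "Contact [EMAIL] or [PHONE] for details."
```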

d. Tokenization

Tokenization is essentially translating text into numbers. This is important because artificial neural networks (ANNs) do not understand text directly; they only understand numbers. Anything you feed into a neural network must be in numerical form.

There are many ways to do this mapping. One of the most popular is the byte pair encoding (BPE) algorithm, which takes a corpus of text and derives from it an efficient sub-word vocabulary: it figures out the best choice of sub-words or character sequences from which the entire corpus can be represented. For example, in the string 'efficient sub-word vocabulary', the word 'efficient' might get mapped to an integer and exist in the vocabulary as a whole word; 'sub' together with the dash might get its own integer; 'word' its own integer; 'vocab' its own integer; and 'ulary' its own integer. So the string 'efficient sub-word vocabulary' might be translated into five tokens, each with its own numerical representation, such as 1, 2, 3, 4, and 5.

There are several existing Python libraries that implement this algorithm, so you do not have to do it from scratch. For example, there is the SentencePiece library, and there is the Tokenizers library from Hugging Face.
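As a minimal sketch (assuming the Hugging Face `tokenizers` package is installed and a local `corpus.txt` file is available to train on), training and applying a small BPE tokenizer might look like this:

```python
# Minimal BPE tokenizer sketch using the Hugging Face `tokenizers` library.
# Assumes a local text file `corpus.txt` to train on.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

encoding = tokenizer.encode("efficient sub-word vocabulary")
print(encoding.tokens)  # the learned sub-word pieces
print(encoding.ids)     # their integer IDs
```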
