5. Evaluating the Model

Model Evaluation

Once you have trained your model, possibly after spending millions of dollars and weeks of your time, you must test how it performs. Building a model is really just the starting point, because you still have to find out what the LLM actually does: how does it respond in the context of the desired use case or application?

Model evaluation is important.

There are many benchmark datasets for model evaluation. Here we will look at the Open LLM Leaderboard, a public LLM benchmark hosted on Hugging Face's model platform that is continually updated with new models.

Hugging Face LLM Leaderboard

The Hugging Face LLM Leaderboard evaluates models on four key benchmarks from the EleutherAI Language Model Evaluation Harness, a unified framework for testing generative language models on a large number of different evaluation tasks.

i. ARC (AI2 Reasoning Challenge) (25-shot) - a set of grade-school science questions.

ii. HellaSwag (10-shot) - a test of common sense inference, which is easy for humans (~95%) but challenging for SOTA models.

iii. MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.

iv. TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online.

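These benchmarks consider zero-shot and few-shot tasks. Zero-shot refers to the ability to complete a task without being shown any worked examples in the prompt. Few-shot means a handful of worked examples (25 for ARC, 10 for HellaSwag, 5 for MMLU) are included in the prompt ahead of the actual question.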

These are only four of many possible benchmark datasets, but the evaluation strategies used for them port easily to other benchmarks.

Let's start with just ARC, HellaSwag, and MMLU, which are multiple-choice tasks.

ARC and MMLU are essentially grade-school-style exam questions on subjects like math, history, common knowledge, etc. Each item is a question with a multiple-choice response: A, B, C or D. An example is "Which technology was developed most recently?"

A) Cell phone.

B) Microwave.

C) Refrigerator.

D) Airplane.

HellaSwag is a little bit different. These are specifically questions that computers tend to struggle with. An example is as follows:

"A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She..."

A) Rinses the bucket off with soap and blow dries the dog's head.

B) Uses a hose to keep it from getting soapy.

C) Gets the dog wet then it runs away again.

D) Gets into a bathtub with a dog.

This is a very strange question, but humans intuitively tend to do very well on these tasks and computers do not.

Because these are multiple-choice tasks, you might expect evaluating model performance on them to be straightforward. There is one hiccup: Large Language Models are typically text generation models, so they take some input text and output more text. They're not classifiers. They don't generate responses like A, B, C or D, or Class 1, Class 2, Class 3, Class 4. They just generate text completions, so you need a small trick to get them to perform multiple-choice tasks, and that trick is the prompt template.

For example, take the question "Which technology was developed most recently?" Instead of just passing the question and the choices to the Large Language Model and hoping it figures out that it should answer A, B, C or D, you can wrap the question in a prompt template that lists the choices and ends with an answer slot, and additionally prepend a few worked examples (few-shot examples) in the same format.

The language model then picks up from the pattern that it should return just a single token: one of the four answer letters.
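As an illustration of what such a template might look like, here is a minimal sketch in Python. The exact wording of the template ("Question:", "Answer:") and the helper names `format_question` and `build_prompt` are assumptions made for this example, not the format used by the leaderboard. The single few-shot example is also invented for illustration; a real 25-shot or 5-shot prompt would draw its examples from the benchmark's own data.

```python
LETTERS = "ABCD"

def format_question(question, choices, answer=None):
    """Render one multiple-choice item.

    If `answer` is given, the item is a solved few-shot example;
    otherwise the answer slot is left empty for the model to fill.
    """
    lines = [f"Question: {question}"]
    lines += [f"{letter}) {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(few_shot_examples, question, choices):
    """Prepend solved examples to the unsolved target question."""
    shots = [format_question(q, c, a) for q, c, a in few_shot_examples]
    return "\n\n".join(shots + [format_question(question, choices)])

# Hypothetical few-shot example (not taken from any benchmark).
few_shot_examples = [
    ("Which planet is closest to the Sun?",
     ["Venus", "Mercury", "Earth", "Mars"],
     "B"),
]

prompt = build_prompt(
    few_shot_examples,
    "Which technology was developed most recently?",
    ["Cell phone", "Microwave", "Refrigerator", "Airplane"],
)
print(prompt)
```

Because every solved example ends with "Answer:" followed by a single letter, the most natural continuation of the final, empty "Answer:" slot is also a single letter.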

If you pass this prompt to the model, you get a probability distribution over every possible next token, typically tens of thousands of tokens.

From that distribution you pick out just the four tokens associated with A, B, C and D, see which one is most likely, and take that to be the model's predicted answer.
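The selection step can be sketched as follows. This is a simplified illustration that assumes you have already obtained the model's next-token scores (logits) as a dictionary keyed by token text. The function names and the logit values below are made up for the example; with a real model you would look up the token IDs for the four letters in the model's vocabulary instead.

```python
import math

def predict_choice(next_token_logits, letters=("A", "B", "C", "D")):
    """Pick the most likely answer letter, ignoring all other tokens."""
    # Keep only the scores of the four answer tokens.
    scores = {letter: next_token_logits[letter] for letter in letters}
    # Renormalize over just these four with a numerically stable softmax.
    max_score = max(scores.values())
    exp_scores = {k: math.exp(v - max_score) for k, v in scores.items()}
    total = sum(exp_scores.values())
    probs = {k: v / total for k, v in exp_scores.items()}
    best = max(probs, key=probs.get)
    return best, probs

def accuracy(predictions, gold_answers):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Made-up logits: other tokens may score higher, but they are ignored.
example_logits = {"A": 2.1, "B": 0.3, "C": -1.0, "D": 1.4, "The": 3.5, "Cell": 2.8}
best, probs = predict_choice(example_logits)
print(best)
```

Restricting the comparison to the four letters means the model is scored as if it were a four-way classifier, even when some unrelated token has the highest overall probability.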

Creating a prompt template is an extra step, but with it you can evaluate a Large Language Model on these multiple-choice tasks in a relatively straightforward way.

Things get trickier with open-ended tasks such as TruthfulQA. For TruthfulQA and other open-ended tasks, where there isn't a single right answer but rather a wide range of possible right answers, there are a few different evaluation strategies we can take.

The first is Human Evaluation: a person scores the completion based on some ground truth, some guidelines, or both. While this is the most labor-intensive approach, it may provide the highest-quality assessment of model completions.

Another strategy is to use NLP metrics. This tries to quantify completion quality using metrics such as perplexity, BLEU score, ROUGE score, etc., in other words using the statistical properties of the completion as a proxy for its quality. While this is a lot less labor intensive, it's not always clear how a completion's statistical properties map to its quality.

The third approach, which might capture the best of both worlds, is to use an auxiliary fine-tuned model to rate the quality of the completions. This was used in the TruthfulQA paper, where the authors created an auxiliary model called GPT-judge that takes model completions and classifies each one as either 'Truthful' or 'Not Truthful'. That helps reduce the burden of Human Evaluation when evaluating model outputs.
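To make the NLP-metrics strategy a little more concrete, here is a small sketch of two such measures. Perplexity is computed from per-token log-probabilities as the exponential of the negative mean log-probability. The `unigram_f1` function is a simplified word-overlap score in the spirit of ROUGE-1, not the official ROUGE implementation. Both function names and the example inputs are assumptions made for illustration.

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def unigram_f1(candidate, reference):
    """Word-overlap F1 between a completion and a reference answer.

    A simplified stand-in for ROUGE-1: lowercases, splits on
    whitespace, and counts overlapping words with clipping.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up inputs: four tokens, each assigned probability 0.25 by the model.
ppl = perplexity([math.log(0.25)] * 4)
score = unigram_f1("the cell phone", "the cell phone is the most recent")
print(ppl, score)
```

A high overlap score only says the completion shares words with the reference. A completion can share most of its words with a reference and still be false, which is the mapping problem described above.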

Now that you've created your Large Language Model from scratch, what do you do next?

Often this isn't the end of the story, as the name 'Base Model' might suggest. Base Models are typically a starting point, not the final solution: something for you to build a more practical application on top of. There are generally two directions for this.

One is Prompt Engineering, which means feeding carefully designed inputs into the language model and harvesting its completions for some particular use case.

The other direction is Model Fine-Tuning, where you take the pre-trained model and adapt it for a particular use case. Prompt Engineering and Model Fine-Tuning both have their pros and cons.
