Benchmarking of Large Language Models
How can you evaluate the performance of an LLM? Are NLP accuracy metrics such as F1 score, precision, recall, BLEU, and ROUGE sufficient, or do you need to consolidate technical metrics with societal considerations such as fairness and bias?
Essentially, when measuring an LLM, we need to ask questions like:
- Is this LLM accurate enough to be helpful? 
- Is it harmless? 
- Is it honest, without being overly defensive? 
Such fine-grained evaluation demands a new measurement framework.
Two of the most promising ideas in LLM benchmarking are Holistic Evaluation of Language Models (HELM) and Fine-grained Language Model Evaluation based on Alignment Skill Sets (FLASK).
This article discusses these ideas along with EleutherAI Language Model Evaluation Harness.
With the increasing economic and societal impact of LLMs, we must measure their performance and risk-benefit trade-offs. Transparency is the vital first step towards these two goals. The LLM community lacks a standardized transparency practice: many LLMs exist, but they are not compared on a unified standard. Even when LLMs are evaluated, the full range of societal considerations (e.g., fairness, commonsense knowledge, capability to generate disinformation) has not been unified with technical considerations (e.g., F1 score, robustness, uncertainty estimation).
LLMs are ultimately NLP models, so common natural language processing (NLP) metrics such as BLEU, METEOR, ROUGE, CIDEr, SPICE, and perplexity are good starting points, especially for pre-trained models.
- Perplexity: It measures how well the probability distribution predicted by the LLM matches a test sample. It is the exponential of the average negative log-likelihood per token, so lower is better. It is widely used to evaluate language modeling quality, especially for pre-trained models. 
- BLEU (BiLingual Evaluation Understudy): It is a metric for machine translation. BLEU is the metric used in the seminal Transformer paper. 
- METEOR: It is an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. 
- ROUGE: It stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translations. 
- CIDEr/SPICE: These are used for image captioning tasks. 
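To make two of these metrics concrete, here is a minimal sketch in plain Python. It assumes you already have per-token log-probabilities from a model, and it uses whitespace tokenization. The function names `perplexity` and `unigram_overlap` are illustrative, not part of any library. The overlap function is a deliberate simplification: its recall resembles ROUGE-1 and its precision resembles unigram BLEU, but real BLEU also uses higher-order n-grams and a brevity penalty.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) over clipped unigram matches.

    Simplified: whitespace tokens, lowercased, single reference.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity of about 4.
    print(perplexity([math.log(0.25)] * 4))
    print(unigram_overlap("the cat sat", "the cat sat on the mat"))
```

For real evaluations, an established implementation of BLEU or ROUGE is preferable, since tokenization and smoothing choices change the scores.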
Evaluating a full-fledged LLM is task-specific and contextual. A summarization task for a medical use case needs to be evaluated differently from one for a consumer support use case. Therefore, LLM evaluation requires unifying task-specific and domain-specific benchmarks.
Hugging Face LLM Leaderboard
The Hugging Face LLM Leaderboard evaluates models on four key benchmarks from the EleutherAI Language Model Evaluation Harness, a unified framework for testing generative language models on a large number of evaluation tasks.
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions. 
- HellaSwag (10-shot) - a test of common sense inference, which is easy for humans (~95%) but challenging for SOTA models. 
- MMLU (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. 
- TruthfulQA (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online. 
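These benchmarks cover zero-shot and few-shot tasks. Zero-shot means the model must complete a task without any worked examples in the prompt. In a k-shot setting, k worked examples are included in the prompt (for example, 25 for the AI2 Reasoning Challenge above).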
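The difference between zero-shot and few-shot prompts can be sketched as follows. This is an illustrative helper only: the function name `build_prompt` and the "Question:/Answer:" template are assumptions, and the actual prompt formats used by the Evaluation Harness differ per task.

```python
def build_prompt(question, examples=(), k=0):
    """Build a k-shot prompt: k solved examples followed by the target question.

    `examples` is a sequence of (question, answer) pairs; k=0 gives zero-shot.
    """
    if k > len(examples):
        raise ValueError(f"asked for {k} shots but only {len(examples)} examples given")
    # Each demonstration shows the question together with its answer.
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The target question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        ("What gas do plants absorb?", "Carbon dioxide"),
        ("What force pulls objects to Earth?", "Gravity"),
    ]
    print(build_prompt("What is the boiling point of water in Celsius?"))
    print("---")
    print(build_prompt("What is the boiling point of water in Celsius?", demos, k=2))
```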
Other LLM benchmarks include:
- TriviaQA
- BoolQ
- SIQA
- OpenBookQA
- GLUE
- BIG-bench, and others
Human Alignment 
Human alignment is another major consideration for LLM evaluation. It requires LLM agents to be honest, harmless, and helpful for a human user. LLM evaluation requires awareness of the task, domain, difficulty level, and underlying societal context. An agent proficient in question answering might perform poorly at code generation. Also, an agent might have a propensity to produce toxic content despite being highly accurate. This calls for multi-metric evaluation. There are two widely used frameworks for this type of multi-metric evaluation: FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets) and HELM (Holistic Evaluation of Language Models).
FLASK defines four primary abilities, divided into 12 fine-grained skills, to evaluate the performance of language models comprehensively: Logical Thinking (Logical Correctness, Logical Robustness, Logical Efficiency), Background Knowledge (Factuality, Commonsense Understanding), Problem Handling (Comprehension, Insightfulness, Completeness, Metacognition), and User Alignment (Conciseness, Readability, Harmlessness). The FLASK paper (https://arxiv.org/pdf/2307.10928.pdf) includes a radar (spider) chart that compares seven LLMs across these skills: three open-source models (LLaMA-2, Vicuna, Alpaca) and four proprietary models (Bard, Claude, GPT-3.5, and GPT-4 as the oracle).
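The FLASK taxonomy above maps naturally onto a small data structure. The sketch below encodes the four abilities and 12 skills exactly as listed. Rolling the skill scores up into one average per ability is an assumption made here for illustration, not a procedure taken from the paper, and the name `ability_scores` is hypothetical.

```python
from statistics import mean

# The four abilities and their 12 skills, as listed in the text above.
FLASK_SKILLS = {
    "Logical Thinking": [
        "Logical Correctness", "Logical Robustness", "Logical Efficiency",
    ],
    "Background Knowledge": ["Factuality", "Commonsense Understanding"],
    "Problem Handling": [
        "Comprehension", "Insightfulness", "Completeness", "Metacognition",
    ],
    "User Alignment": ["Conciseness", "Readability", "Harmlessness"],
}


def ability_scores(skill_scores):
    """Average per-skill scores into per-ability scores.

    Skills missing from `skill_scores` are skipped; abilities with no
    scored skill are left out of the result. (Averaging is an assumption.)
    """
    result = {}
    for ability, skills in FLASK_SKILLS.items():
        scored = [skill_scores[s] for s in skills if s in skill_scores]
        if scored:
            result[ability] = mean(scored)
    return result


if __name__ == "__main__":
    # Hypothetical scores for a single model response.
    scores = {"Factuality": 4, "Commonsense Understanding": 2, "Readability": 5}
    print(ability_scores(scores))
```

Keeping the per-skill scores alongside any aggregate preserves the fine-grained view that motivates FLASK in the first place.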
Holistic Evaluation of Language Models (HELM) adopts a top-down approach: it explicitly specifies the scenarios and metrics to be evaluated and works through the underlying structure. A scenario is defined by a tuple of (task, domain, language). Tasks include question answering, summarization, information retrieval, and toxicity detection. Domains could be news or books, while the language could be English or Spanish. Alongside scenarios, there are seven metrics covering technical and societal considerations: accuracy, robustness, calibration, fairness, bias, toxicity, and efficiency.
The HELM paper illustrates with a diagram how HELM improves the granularity of LLM evaluation. First, the scenario is chosen based on task, domain, user, time, and language. Then the metrics are applied along several dimensions: input perturbations such as typos (robustness) and gender or dialect changes (fairness), and output measures such as accuracy (ROUGE, F1, exact match), toxicity, and efficiency (denoised).
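The scenario tuple and the idea of measuring robustness through input perturbation can be sketched as below. This is a toy illustration, not HELM's implementation: the names `Scenario`, `add_typo`, `exact_match_accuracy`, and `evaluate` are hypothetical, the "model" is any callable from prompt to text, and the only perturbation is swapping two adjacent characters.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A scenario as a (task, domain, language) tuple."""
    task: str
    domain: str
    language: str


def add_typo(text, rng):
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def exact_match_accuracy(model, examples, perturb=None):
    """Fraction of (prompt, reference) pairs the model answers exactly."""
    hits = 0
    for prompt, reference in examples:
        if perturb is not None:
            prompt = perturb(prompt)
        hits += model(prompt).strip() == reference
    return hits / len(examples)


def evaluate(model, scenario, examples, seed=0):
    """Report accuracy on clean inputs and on typo-perturbed inputs."""
    rng = random.Random(seed)
    return {
        "scenario": scenario,
        "accuracy": exact_match_accuracy(model, examples),
        "robustness": exact_match_accuracy(
            model, examples, perturb=lambda t: add_typo(t, rng)
        ),
    }


if __name__ == "__main__":
    # A brittle toy "model": an exact lookup table that fails on any typo.
    answers = {"capital of France?": "Paris"}
    toy_model = lambda prompt: answers.get(prompt, "unknown")
    scenario = Scenario("question answering", "news", "English")
    print(evaluate(toy_model, scenario, list(answers.items())))
```

Reporting the clean and perturbed numbers side by side is the point: a model can look accurate while being fragile to small input changes.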
Multi-metric evaluation is an important design consideration for LLM evaluation, and this is where HELM improves granularity with rigorous evaluation. A diagram in the HELM paper shows how HELM brings more metrics into the evaluation. Previously, a scenario such as Natural Questions was evaluated only for accuracy, while with HELM it is evaluated for accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
LLMs are being adopted rapidly and are considered a key building block for applications with significant public impact. Therefore, it is important to standardize evaluation metrics that unify technical considerations with societal ramifications.
