LLM Benchmarks Master List
| Seq | Category | Benchmark | Performance Description |
| --- | --- | --- | --- |
| 1 | Audio | UniAudio | Audio Generation |
| 2 | Audio | MusicGen | Audio Generation |
| 3 | Audio | MusicLM | Audio Generation |
| 4 | Code Generation | Codex HumanEval Python Programming Test | Codex HumanEval is a Python coding benchmark for evaluating the code generation capabilities of LLMs. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, some comparable to simple software interview questions. |
| 5 | Code Generation | Codex P@1 (0-Shot) | Pass@1 (0-shot) measures the fraction of problems for which the model's first generated solution passes the problem's unit tests, with no example solutions given in the prompt (a minimal pass@1 sketch appears after the table). |
| 6 | Code Generation | HumanEval | Code generation: 164 hand-written Python problems checked against unit tests. |
| 7 | Code Generation | SWE-Bench | Code generation and software engineering: models must resolve real GitHub issues by producing patches that pass the repository's tests. |
| 8 | General Agents | AgentBench | General Agents |
| 9 | General Agents | Voyager | General Agents |
| 10 | General Reasoning | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | General Reasoning |
| 11 | General Reasoning | GPQA: A Graduate-Level Google-Proof Q&A Benchmark | General Reasoning |
| 12 | Image | HEIM (Holistic Evaluation of Text-to-Image Models) | Image Computer Vision and Image Generation. |
| 13 | Image | MVDream | Image Computer Vision, Image Generation. |
| 14 | Image | VisIT-Bench | Image Computer Vision, Instruction-Following. |
| 15 | Image | EditVal | Image Computer Vision, Editing. |
| 16 | Image | ControlNet | Image Computer Vision, Editing. |
| 17 | Image | Instruct-NeRF2NeRF | Image Computer Vision, Editing. |
| 18 | Image | Skoltech3D | 3D Reconstruction From Images. |
| 19 | Image | RealFusion | 3D Reconstruction From Images. |
| 20 | Mathematical Reasoning | GSM8K (Grade School Math 8K) | GSM8K is a dataset of 8,500 grade-school math word problems designed to be challenging for language models to solve (an answer-extraction sketch appears after the table). |
| 21 | Mathematical Reasoning | MATH | Mathematical Reasoning |
| 22 | Mathematical Reasoning | PlanBench | Planning and reasoning about actions and change. |
| 23 | Moral Reasoning | MoCa | Moral Reasoning |
| 24 | Other | Multiple-choice segment of the American Bar Exam | The multiple-choice section (MBE) of the US bar examination, used to gauge legal knowledge and reasoning. |
| 25 | Other | GRE Reading and Writing Exam | The verbal reasoning and analytical writing sections of the GRE, the exam taken by college students applying to graduate school. |
| 26 | Other | GRE Quantitative Reasoning (vs. median applicant) | The quantitative section of the GRE; model performance is reported relative to the median human applicant. |
| 27 | Other | Constitution (Safe and harmless) | Measures how hard it is to elicit offensive or dangerous output from the model. This is normally an internal red-teaming evaluation that scores models on a large, representative set of harmful prompts using an automated and transparent process. |
| 28 | Other | HELM (Holistic Evaluation of Language Models) | A framework that evaluates language models across a broad set of scenarios and metrics, including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. |
| 29 | Other | FLASK (Fine-Grained Language Model Evaluation Based on Alignment Skill Sets) | A fine-grained evaluation protocol that scores model responses against a set of alignment skills (such as logical robustness, factuality, and insightfulness) rather than a single overall score. |
| 30 | Other | Perplexity | Perplexity measures how well the model's predicted probability distribution matches a test sample, i.e., how well it predicts each next word given the preceding context. It is the exponential of the average negative log-likelihood per token; the lower the perplexity, the better the model's predictions (a minimal perplexity sketch appears after the table). |
| 31 | Other | EleutherAI Language Model Evaluation Harness | A unified open-source framework for testing generative language models on a large number of different evaluation tasks. |
| 32 | Other | BLEU (BiLingual Evaluation Understudy) | A metric for machine translation based on modified n-gram precision against reference translations; BLEU is the metric reported in the seminal Transformer paper, "Attention Is All You Need" (a toy BLEU computation appears after the table). |
| 33 | Other | METEOR | An automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. |
| 34 | Other | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) | A set of metrics for evaluating automatic text summarization as well as machine translation, based on n-gram and subsequence overlap with reference texts (a toy ROUGE-N sketch appears after the table). |
| 35 | Other | CIDEr / SPICE | Metrics for image captioning: CIDEr scores consensus with reference captions using TF-IDF-weighted n-grams, while SPICE compares semantic propositional (scene-graph) content. |
| 36 | Other | Hugging Face Open LLM Leaderboard | The Hugging Face Open LLM Leaderboard evaluates models on four key benchmarks from the EleutherAI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks. |
| 37 | Other | F1 Score (F-measure) | The F1 score is used to evaluate a model's performance on classification tasks. It is the harmonic mean of precision and recall, weighting both equally (see the F1 sketch after the table). |
| 38 | Reasoning | Visual Commonsense Reasoning (VCR) | Visual Reasoning |
| 39 | Reasoning | BigToM | Causal Reasoning |
| 40 | Reasoning | Tübingen Cause-Effect Pairs | Causal Reasoning |
| 41 | Reinforcement Learning from Human Feedback | RLAIF | Reinforcement Learning from AI Feedback, which replaces human preference labels with AI-generated ones as an alternative to RLHF. |
| 42 | Robotics | PaLM-E | Robotics |
| 43 | Robotics | RT-2 | Robotics |
| 44 | Task-Specific Agents | MLAgentBench | Task-Specific Agents |
| 45 | Text | GLUE (General Language Understanding Evaluation) | The GLUE benchmark measures the general language understanding performance of Natural Language Processing (NLP) models across a range of tasks. |
| 46 | Text | BoolQ | Yes/no question answering over short passages, drawn from natural search queries. |
| 47 | Text | SIQA | Social Interaction QA: commonsense reasoning about people's actions, motivations, and reactions. |
| 48 | Text | OpenBookQA | Elementary-level science questions that require combining a small "open book" of facts with broader common knowledge. |
| 49 | Text | Big-Bench | BIG-bench (Beyond the Imitation Game): a large, collaborative collection of diverse tasks probing language model capabilities. |
| 50 | Text | AI2 Reasoning Challenge (25-shot) | A set of grade-school science questions. |
| 51 | Text | HellaSwag (10-shot) | A test of commonsense inference that is easy for humans (~95%) but challenging for state-of-the-art models. |
| 52 | Text | MMLU (5-shot) | A test of a text model's multitask accuracy, covering 57 tasks including elementary mathematics, US history, computer science, law, and more (a few-shot prompting sketch appears after the table). |
| 53 | Text | TriviaQA (5-Shot) | A large factual question-answering dataset; the 5-shot setting is a popular way to compare LLMs on factual recall and is a good measure of overall capability. |
| 54 | Text | QuALITY (5-Shot) | A multiple-choice question-answering benchmark over long passages, testing long-document comprehension; in the 5-shot setting the model sees only five worked examples before being evaluated. |
| 55 | Text | RACE-H (5-Shot) | The high-school (harder) portion of the RACE reading-comprehension benchmark, which requires reasoning over multiple sentences; in the 5-shot setting the model learns from only five examples. |
| 56 | Text | ARC-Challenge (5-Shot) | A benchmark for evaluating the reasoning capabilities of LLMs, introduced in early 2018 in the paper "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge." The ARC dataset contains 7,787 genuine grade-school, multiple-choice science questions; the Challenge set is the subset that cannot be answered by simple retrieval or word co-occurrence. |
| 57 | Text | MMLU | Measuring Massive Multitask Language Understanding is a multiple-choice benchmark covering a wide range of subjects (STEM, the humanities, the social sciences, and more) across 57 tasks. |
| 58 | Text | MMLU (5-Shot CoT) | A variant of the MMLU benchmark in which the model is given only five examples of each task before being evaluated. This few-shot setting is a more challenging test of the model's ability to learn and generalize; CoT (Chain of Thought) means the model is prompted to reason step by step before answering. |
| 59 | Text | TruthfulQA (0-shot) | Factuality and truthfulness: a test measuring a model's propensity to reproduce falsehoods commonly found online. |
| 60 | Text | HaluEval | Factuality and Truthfulness. |
| 61 | Video | UCF101 | Video Computer Vision and Video Generation |
| 62 | General language/reasoning | Arena Hard | A set of hard prompts drawn from real Chatbot Arena conversations, scored with an LLM judge; tests general language understanding and reasoning across varied tasks. |
| 63 | General language/reasoning | AlpacaEval 2.0 LC | An automatic instruction-following evaluation that reports a length-controlled (LC) win rate against a reference model, as judged by an LLM, rewarding coherence, factuality, and task completion. |
| 64 | Multi-turn dialogue | MT-Bench (GPT-4-Turbo) | A multi-turn question set scored by a strong LLM judge (here GPT-4-Turbo). Despite the abbreviation, MT stands for multi-turn rather than machine translation; the benchmark evaluates conversational and instruction-following quality. |
| 65 | Programming/coding | MBPP | MBPP (Mostly Basic Python Problems) tests coding ability on roughly 1,000 crowd-sourced, entry-level Python programming problems. |
| 66 | Instruction following and understanding | IFEval (with Prompt-Strict-Acc and Instruction-Strict-Acc metrics) | Evaluates the model's ability to follow verifiable instructions precisely, testing understanding of and adherence to prompts and instructions (see the strict-accuracy sketch after the table). |
| 67 | Open-ended text generation quality | TFEval (with Distractor F1 and On-topic F1 metrics) | Evaluates open-ended text generation, with Distractor F1 measuring factual accuracy and On-topic F1 assessing relevance to the given topic or prompt. |
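Entries 4-6 describe HumanEval-style evaluation and the pass@1 metric. A minimal sketch of how such an evaluation works (not the official harness; the sandboxing here is naive and the example problem is hypothetical): candidate programs are executed against the problem's unit tests, and pass@k is estimated with the standard unbiased estimator 1 - C(n-c, k)/C(n, k) over n samples of which c pass.

```python
import math

def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    given n sampled solutions of which c passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def passes_unit_tests(candidate_code: str, test_code: str) -> bool:
    """Run a candidate solution against assert-based tests.
    (Illustrative only; real harnesses sandbox this execution.)"""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the asserts
        return True
    except Exception:
        return False

# Hypothetical problem: one set of tests, three sampled completions.
problem_tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
samples = [
    "def add(a, b):\n    return a + b",      # correct
    "def add(a, b):\n    return a - b",      # wrong
    "def add(a, b):\n    return a + b + 0",  # correct
]
n = len(samples)
c = sum(passes_unit_tests(s, problem_tests) for s in samples)
print("pass@1 estimate:", estimate_pass_at_k(n, c, k=1))  # ~0.667
```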
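Entry 20 (GSM8K): reference solutions end with a line of the form `#### <answer>`, so accuracy is typically computed by extracting a final number from the model's output and comparing it with the reference. A rough sketch; the regex-based extraction below is an assumption, and real harnesses differ in the details.

```python
import re

def extract_final_number(text: str):
    """Pull the final answer from a solution. GSM8K reference solutions
    end with a line like '#### 42'; for free-form model output we fall
    back to the last number that appears anywhere in the text."""
    marked = re.findall(r"####\s*(-?\d[\d,]*(?:\.\d+)?)", text)
    numbers = marked or re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_accuracy(predictions, references) -> float:
    """Exact-match accuracy on the extracted final answers."""
    hits = sum(
        extract_final_number(p) == extract_final_number(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

preds = ["Each box holds 6 eggs, 6 * 3 = 18. The answer is 18."]
refs = ["6 * 3 = 18\n#### 18"]
print(gsm8k_accuracy(preds, refs))  # 1.0
```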
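Entry 30 defines perplexity. A minimal sketch, assuming per-token natural-log probabilities are already available from the model: perplexity is the exponential of the average negative log-likelihood.

```python
import math

def perplexity(token_logprobs) -> float:
    """Perplexity = exp(average negative log-likelihood per token).
    `token_logprobs` holds the model's natural-log probability of each
    observed token given its preceding context."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probs for a 4-token continuation.
print(perplexity([-0.105, -2.303, -0.693, -1.609]))  # ~3.2
```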
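Entry 32 (BLEU): a toy sentence-level BLEU that computes modified n-gram precisions against a single reference and applies a brevity penalty. Production evaluations use corpus-level BLEU with smoothing (e.g., sacreBLEU); this is only an illustration of the idea.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. Single reference,
    no smoothing; returns 0.0 if any n-gram precision is zero."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the quick brown fox jumps over the lazy dog",
           "the quick brown fox jumped over the lazy dog"))  # ~0.60
```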
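Entry 34 (ROUGE): a toy ROUGE-N recall computation, counting overlapping n-grams relative to the reference summary. Standard toolkits also report precision and an F-measure, plus variants such as ROUGE-L.

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Toy ROUGE-N recall: overlapping n-grams divided by the number of
    n-grams in the reference text."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat was found under the bed",
                     "the cat was under the bed", n=2))  # 0.8
```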
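Entry 37 (F1): the score is the harmonic mean of precision and recall, not their simple average. A minimal sketch over hypothetical classification counts.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.
    precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical output: 8 true positives, 2 false positives, 4 false negatives.
print(f1_score(tp=8, fp=2, fn=4))  # precision 0.8, recall ~0.667 -> F1 ~0.727
```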
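Entry 52 and the other k-shot rows: in few-shot evaluation, the prompt contains k worked examples followed by the test question, and accuracy is the share of questions whose predicted choice matches the gold answer. A schematic sketch; the prompt template and letter-based answer extraction below are simplifying assumptions (many harnesses compare per-choice log-likelihoods instead).

```python
def build_few_shot_prompt(examples, question, choices):
    """Format k solved examples followed by the test question, MMLU-style."""
    parts = []
    for ex in examples:  # each example: {"question", "choices", "answer"}
        lines = [ex["question"]] + [
            f"{letter}. {text}" for letter, text in zip("ABCD", ex["choices"])
        ]
        parts.append("\n".join(lines) + f"\nAnswer: {ex['answer']}")
    test = [question] + [
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    ] + ["Answer:"]
    return "\n\n".join(parts + ["\n".join(test)])

def few_shot_accuracy(model, dataset, k=5):
    """Accuracy = fraction of questions whose predicted letter matches the
    gold answer. `model` is any callable mapping a prompt string to a letter."""
    shots, test_items = dataset[:k], dataset[k:]
    correct = 0
    for item in test_items:
        prompt = build_few_shot_prompt(shots, item["question"], item["choices"])
        correct += model(prompt).strip().upper().startswith(item["answer"])
    return correct / len(test_items)

# Toy usage with a stub "model" that always answers "A".
toy_data = [
    {"question": f"Question {i}?", "choices": ["w", "x", "y", "z"], "answer": "A"}
    for i in range(8)
]
print(few_shot_accuracy(lambda prompt: "A", toy_data, k=5))  # 1.0
```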
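Entry 66 (IFEval): each prompt carries one or more automatically verifiable instructions, and the strict metrics aggregate the verification results in two ways. A minimal sketch, assuming the per-instruction pass/fail results have already been computed.

```python
def ifeval_strict_accuracies(results):
    """`results[i]` holds one boolean per verifiable instruction in prompt i
    (True = the instruction was followed, as checked programmatically).

    Prompt-level strict accuracy: share of prompts where every instruction passed.
    Instruction-level strict accuracy: share of individual instructions passed."""
    prompt_acc = sum(all(r) for r in results) / len(results)
    flat = [ok for r in results for ok in r]
    instruction_acc = sum(flat) / len(flat)
    return prompt_acc, instruction_acc

# Hypothetical verification results for three prompts.
print(ifeval_strict_accuracies([[True, True], [True, False], [True]]))  # (~0.667, 0.8)
```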