LLM Benchmarks Master List
| Seq | Category | Benchmark | Performance Description |
| --- | --- | --- | --- |
| 1 | Audio | UniAudio | Audio Generation |
| 2 | Audio | MusicGen | Audio Generation |
| 3 | Audio | MusicLM | Audio Generation |
| 4 | Code Generation | Codex HumanEval Python Programming Test | Codex HumanEval is a Python coding benchmark for evaluating the code generation capabilities of LLMs. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, some comparable to simple software interview questions. |
| 5 | Code Generation | Codex P@1 (0-Shot) | Pass@1 (0-shot) measures the fraction of problems for which the model's first generated solution passes the problem's unit tests, with no example solutions given in the prompt (a minimal pass@1 sketch appears after the table). |
| 6 | Code Generation | HumanEval | Code generation: 164 hand-written Python problems checked against unit tests. |
| 7 | Code Generation | SWE-Bench | Code generation and software engineering: models must resolve real GitHub issues by producing patches that pass the repository's tests. |
| 8 | General Agents | AgentBench | General Agents |
| 9 | General Agents | Voyager | General Agents |
| 10 | General Reasoning | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | General Reasoning |
| 11 | General Reasoning | GPQA: A Graduate-Level Google-Proof Q&A Benchmark | General Reasoning |
| 12 | Image | HEIM (Holistic Evaluation of Text-to-Image Models) | Image Computer Vision and Image Generation. |
| 13 | Image | MVDream | Image Computer Vision, Image Generation. |
| 14 | Image | VisIT-Bench | Image Computer Vision, Instruction-Following. |
| 15 | Image | EditVal | Image Computer Vision, Editing. |
| 16 | Image | ControlNet | Image Computer Vision, Editing. |
| 17 | Image | Instruct-NeRF2NeRF | Image Computer Vision, Editing. |
| 18 | Image | Skoltech3D | 3D Reconstruction From Images. |
| 19 | Image | RealFusion | 3D Reconstruction From Images. |
| 20 | Mathematical Reasoning | GSM8K (Grade School Math 8K) | GSM8K is a dataset of 8,500 grade-school math word problems designed to be challenging for language models to solve (an answer-extraction sketch appears after the table). |
| 21 | Mathematical Reasoning | MATH | Mathematical Reasoning |
| 22 | Mathematical Reasoning | PlanBench | Planning and reasoning about actions and change. |
| 23 | Moral Reasoning | MoCa | Moral Reasoning |
| 24 | Other | Multiple-choice segment of the American Bar Exam | The multiple-choice section (MBE) of the US bar examination, used to gauge legal knowledge and reasoning. |
| 25 | Other | GRE Reading and Writing Exam | The verbal reasoning and analytical writing sections of the GRE, the exam taken by college students applying to graduate school. |
| 26 | Other | GRE Quantitative Reasoning (vs. median applicant) | The quantitative section of the GRE; model performance is reported relative to the median human applicant. |
| 27 | Other | Constitution (Safe and harmless) | Measures how hard it is to elicit offensive or dangerous output from the model. This is normally an internal red-teaming evaluation that scores models on a large, representative set of harmful prompts using an automated and transparent process. |
| 28 | Other | HELM (Holistic Evaluation of Language Models) | A framework that evaluates language models across a broad set of scenarios and metrics, including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. |
| 29 | Other | FLASK (Fine-Grained Language Model Evaluation Based on Alignment Skill Sets) | A fine-grained evaluation protocol that scores model responses against a set of alignment skills (such as logical robustness, factuality, and insightfulness) rather than a single overall score. |
| 30 | Other | Perplexity | Perplexity measures how well the model's predicted probability distribution matches a test sample, i.e., how well it predicts each next word given the preceding context. It is the exponential of the average negative log-likelihood per token; the lower the perplexity, the better the model's predictions (a minimal perplexity sketch appears after the table). |
| 31 | Other | EleutherAI Language Model Evaluation Harness | A unified open-source framework for testing generative language models on a large number of different evaluation tasks. |
| 32 | Other | BLEU (BiLingual Evaluation Understudy) | A metric for machine translation based on modified n-gram precision against reference translations; BLEU is the metric reported in the seminal Transformer paper, "Attention Is All You Need" (a toy BLEU computation appears after the table). |
| 33 | Other | METEOR | An automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. |
| 34 | Other | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) | A set of metrics for evaluating automatic text summarization as well as machine translation, based on n-gram and subsequence overlap with reference texts (a toy ROUGE-N sketch appears after the table). |
| 35 | Other | CIDEr / SPICE | Metrics for image captioning: CIDEr scores consensus with reference captions using TF-IDF-weighted n-grams, while SPICE compares semantic propositional (scene-graph) content. |
| 36 | Other | Hugging Face Open LLM Leaderboard | The Hugging Face Open LLM Leaderboard evaluates models on four key benchmarks from the EleutherAI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks. |
| 37 | Other | F1 Score (F-measure) | The F1 score is used to evaluate a model's performance on classification tasks. It is the harmonic mean of precision and recall, weighting both equally (see the F1 sketch after the table). |
| 38 | Reasoning | Visual Commonsense Reasoning (VCR) | Visual Reasoning |
| 39 | Reasoning | BigToM | Causal Reasoning |
| 40 | Reasoning | Tübingen Cause-Effect Pairs | Causal Reasoning |
| 41 | Reinforcement Learning from Human Feedback | RLAIF | Reinforcement Learning from AI Feedback, which replaces human preference labels with AI-generated ones as an alternative to RLHF. |
| 42 | Robotics | PaLM-E | Robotics |
| 43 | Robotics | RT-2 | Robotics |
| 44 | Task-Specific Agents | MLAgentBench | Task-Specific Agents |
| 45 | Text | GLUE (General Language Understanding Evaluation) | The GLUE benchmark measures the general language understanding performance of Natural Language Processing (NLP) models across a range of tasks. |
| 46 | Text | BoolQ | Yes/no question answering over short passages, drawn from natural search queries. |
| 47 | Text | SIQA | Social Interaction QA: commonsense reasoning about people's actions, motivations, and reactions. |
| 48 | Text | OpenBookQA | Elementary-level science questions that require combining a small "open book" of facts with broader common knowledge. |
| 49 | Text | Big-Bench | BIG-bench (Beyond the Imitation Game): a large, collaborative collection of diverse tasks probing language model capabilities. |
| 50 | Text | AI2 Reasoning Challenge (25-shot) | A set of grade-school science questions. |
| 51 | Text | HellaSwag (10-shot) | A test of commonsense inference that is easy for humans (~95%) but challenging for state-of-the-art models. |
| 52 | Text | MMLU (5-shot) | A test of a text model's multitask accuracy, covering 57 tasks including elementary mathematics, US history, computer science, law, and more (a few-shot prompting sketch appears after the table). |
| 53 | Text | TriviaQA (5-Shot) | A large factual question-answering dataset; the 5-shot setting is a popular way to compare LLMs on factual recall and is a good measure of overall capability. |
| 54 | Text | QuALITY (5-Shot) | A multiple-choice question-answering benchmark over long passages, testing long-document comprehension; in the 5-shot setting the model sees only five worked examples before being evaluated. |
| 55 | Text | RACE-H (5-Shot) | The high-school (harder) portion of the RACE reading-comprehension benchmark, which requires reasoning over multiple sentences; in the 5-shot setting the model learns from only five examples. |
| 56 | Text | ARC-Challenge (5-Shot) | A benchmark for evaluating the reasoning capabilities of LLMs, introduced in early 2018 in the paper "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge." The ARC dataset contains 7,787 genuine grade-school, multiple-choice science questions; the Challenge set is the subset that cannot be answered by simple retrieval or word co-occurrence. |
| 57 | Text | MMLU | Measuring Massive Multitask Language Understanding is a multiple-choice benchmark covering a wide range of subjects (STEM, the humanities, the social sciences, and more) across 57 tasks. |
| 58 | Text | MMLU (5-Shot CoT) | A variant of the MMLU benchmark in which the model is given only five examples of each task before being evaluated. This few-shot setting is a more challenging test of the model's ability to learn and generalize; CoT (Chain of Thought) means the model is prompted to reason step by step before answering. |
| 59 | Text | TruthfulQA (0-shot) | Factuality and truthfulness: a test measuring a model's propensity to reproduce falsehoods commonly found online. |
| 60 | Text | HaluEval | Factuality and Truthfulness. |
| 61 | Video | UCF101 | Video Computer Vision and Video Generation |
| 62 | General language/reasoning | Arena Hard | A set of hard prompts drawn from real Chatbot Arena conversations, scored with an LLM judge; tests general language understanding and reasoning across varied tasks. |
| 63 | General language/reasoning | AlpacaEval 2.0 LC | An automatic instruction-following evaluation that reports a length-controlled (LC) win rate against a reference model, as judged by an LLM, rewarding coherence, factuality, and task completion. |
| 64 | Multi-turn dialogue | MT-Bench (GPT-4-Turbo) | A multi-turn question set scored by a strong LLM judge (here GPT-4-Turbo). Despite the abbreviation, MT stands for multi-turn rather than machine translation; the benchmark evaluates conversational and instruction-following quality. |
| 65 | Programming/coding | MBPP | MBPP (Mostly Basic Python Problems) tests coding ability on roughly 1,000 crowd-sourced, entry-level Python programming problems. |
| 66 | Instruction following and understanding | IFEval (with Prompt-Strict-Acc and Instruction-Strict-Acc metrics) | Evaluates the model's ability to follow verifiable instructions precisely, testing understanding of and adherence to prompts and instructions (see the strict-accuracy sketch after the table). |
| 67 | Open-ended text generation quality | TFEval (with Distractor F1 and On-topic F1 metrics) | Evaluates open-ended text generation, with Distractor F1 measuring factual accuracy and On-topic F1 assessing relevance to the given topic or prompt. |
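Entries 4-6 describe HumanEval-style evaluation and the pass@1 metric. A minimal sketch of how such an evaluation works (not the official harness; the sandboxing here is naive and the example problem is hypothetical): candidate programs are executed against the problem's unit tests, and pass@k is estimated with the standard unbiased estimator 1 - C(n-c, k)/C(n, k) over n samples of which c pass.

```python
import math

def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    given n sampled solutions of which c passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def passes_unit_tests(candidate_code: str, test_code: str) -> bool:
    """Run a candidate solution against assert-based tests.
    (Illustrative only; real harnesses sandbox this execution.)"""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the asserts
        return True
    except Exception:
        return False

# Hypothetical problem: one set of tests, three sampled completions.
problem_tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
samples = [
    "def add(a, b):\n    return a + b",      # correct
    "def add(a, b):\n    return a - b",      # wrong
    "def add(a, b):\n    return a + b + 0",  # correct
]
n = len(samples)
c = sum(passes_unit_tests(s, problem_tests) for s in samples)
print("pass@1 estimate:", estimate_pass_at_k(n, c, k=1))  # ~0.667
```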
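Entry 20 (GSM8K): reference solutions end with a line of the form `#### <answer>`, so accuracy is typically computed by extracting a final number from the model's output and comparing it with the reference. A rough sketch; the regex-based extraction below is an assumption, and real harnesses differ in the details.

```python
import re

def extract_final_number(text: str):
    """Pull the final answer from a solution. GSM8K reference solutions
    end with a line like '#### 42'; for free-form model output we fall
    back to the last number that appears anywhere in the text."""
    marked = re.findall(r"####\s*(-?\d[\d,]*(?:\.\d+)?)", text)
    numbers = marked or re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_accuracy(predictions, references) -> float:
    """Exact-match accuracy on the extracted final answers."""
    hits = sum(
        extract_final_number(p) == extract_final_number(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

preds = ["Each box holds 6 eggs, 6 * 3 = 18. The answer is 18."]
refs = ["6 * 3 = 18\n#### 18"]
print(gsm8k_accuracy(preds, refs))  # 1.0
```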
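Entry 30 defines perplexity. A minimal sketch, assuming per-token natural-log probabilities are already available from the model: perplexity is the exponential of the average negative log-likelihood.

```python
import math

def perplexity(token_logprobs) -> float:
    """Perplexity = exp(average negative log-likelihood per token).
    `token_logprobs` holds the model's natural-log probability of each
    observed token given its preceding context."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probs for a 4-token continuation.
print(perplexity([-0.105, -2.303, -0.693, -1.609]))  # ~3.2
```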
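Entry 32 (BLEU): a toy sentence-level BLEU that computes modified n-gram precisions against a single reference and applies a brevity penalty. Production evaluations use corpus-level BLEU with smoothing (e.g., sacreBLEU); this is only an illustration of the idea.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. Single reference,
    no smoothing; returns 0.0 if any n-gram precision is zero."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the quick brown fox jumps over the lazy dog",
           "the quick brown fox jumped over the lazy dog"))  # ~0.60
```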
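Entry 34 (ROUGE): a toy ROUGE-N recall computation, counting overlapping n-grams relative to the reference summary. Standard toolkits also report precision and an F-measure, plus variants such as ROUGE-L.

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Toy ROUGE-N recall: overlapping n-grams divided by the number of
    n-grams in the reference text."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat was found under the bed",
                     "the cat was under the bed", n=2))  # 0.8
```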
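Entry 37 (F1): the score is the harmonic mean of precision and recall, not their simple average. A minimal sketch over hypothetical classification counts.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.
    precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical output: 8 true positives, 2 false positives, 4 false negatives.
print(f1_score(tp=8, fp=2, fn=4))  # precision 0.8, recall ~0.667 -> F1 ~0.727
```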
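Entry 52 and the other k-shot rows: in few-shot evaluation, the prompt contains k worked examples followed by the test question, and accuracy is the share of questions whose predicted choice matches the gold answer. A schematic sketch; the prompt template and letter-based answer extraction below are simplifying assumptions (many harnesses compare per-choice log-likelihoods instead).

```python
def build_few_shot_prompt(examples, question, choices):
    """Format k solved examples followed by the test question, MMLU-style."""
    parts = []
    for ex in examples:  # each example: {"question", "choices", "answer"}
        lines = [ex["question"]] + [
            f"{letter}. {text}" for letter, text in zip("ABCD", ex["choices"])
        ]
        parts.append("\n".join(lines) + f"\nAnswer: {ex['answer']}")
    test = [question] + [
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    ] + ["Answer:"]
    return "\n\n".join(parts + ["\n".join(test)])

def few_shot_accuracy(model, dataset, k=5):
    """Accuracy = fraction of questions whose predicted letter matches the
    gold answer. `model` is any callable mapping a prompt string to a letter."""
    shots, test_items = dataset[:k], dataset[k:]
    correct = 0
    for item in test_items:
        prompt = build_few_shot_prompt(shots, item["question"], item["choices"])
        correct += model(prompt).strip().upper().startswith(item["answer"])
    return correct / len(test_items)

# Toy usage with a stub "model" that always answers "A".
toy_data = [
    {"question": f"Question {i}?", "choices": ["w", "x", "y", "z"], "answer": "A"}
    for i in range(8)
]
print(few_shot_accuracy(lambda prompt: "A", toy_data, k=5))  # 1.0
```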
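Entry 66 (IFEval): each prompt carries one or more automatically verifiable instructions, and the strict metrics aggregate the verification results in two ways. A minimal sketch, assuming the per-instruction pass/fail results have already been computed.

```python
def ifeval_strict_accuracies(results):
    """`results[i]` holds one boolean per verifiable instruction in prompt i
    (True = the instruction was followed, as checked programmatically).

    Prompt-level strict accuracy: share of prompts where every instruction passed.
    Instruction-level strict accuracy: share of individual instructions passed."""
    prompt_acc = sum(all(r) for r in results) / len(results)
    flat = [ok for r in results for ok in r]
    instruction_acc = sum(flat) / len(flat)
    return prompt_acc, instruction_acc

# Hypothetical verification results for three prompts.
print(ifeval_strict_accuracies([[True, True], [True, False], [True]]))  # (~0.667, 0.8)
```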