# LLM Benchmarks Master List

| Seq | Category | Benchmark | Description |
|---|---|---|---|
| 1 | Audio | UniAudio | Audio generation. |
| 2 | Audio | MusicGen | Audio generation. |
| 3 | Audio | MusicLM | Audio generation. |
| 4 | Code Generation | Codex HumanEval (Python programming test) | A benchmark for evaluating the code-generation capabilities of LLMs. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, some comparable to simple software-interview questions. |
| 5 | Code Generation | Codex pass@1 (0-shot) | A code-generation metric: the percentage of problems for which the model's first generated solution passes the unit tests, with no examples of the specific problem in the prompt. |
| 6 | Code Generation | HumanEval | Code generation. |
| 7 | Code Generation | SWE-bench | Code generation; evaluates whether models can resolve real GitHub issues drawn from popular open-source Python repositories. |
| 8 | General Agents | AgentBench | Evaluates LLMs acting as agents across a range of interactive environments. |
| 9 | General Agents | Voyager | General agents; an LLM-powered agent evaluated on open-ended exploration in Minecraft. |
| 10 | General Reasoning | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | General reasoning over college-level, multimodal questions spanning many disciplines. |
| 11 | General Reasoning | GPQA: A Graduate-Level Google-Proof Q&A Benchmark | General reasoning; graduate-level multiple-choice questions written so that the answers are hard to find by searching the web. |
| 12 | Image | HEIM (Holistic Evaluation of Text-to-Image Models) | Computer vision; image generation. |
| 13 | Image | MVDream | Computer vision; image generation. |
| 14 | Image | VisIT-Bench | Computer vision; instruction following. |
| 15 | Image | EditVal | Computer vision; image editing. |
| 16 | Image | ControlNet | Computer vision; image editing. |
| 17 | Image | Instruct-NeRF2NeRF | Computer vision; image editing. |
| 18 | Image | Skoltech3D | 3D reconstruction from images. |
| 19 | Image | RealFusion | 3D reconstruction from images. |
| 20 | Mathematical Reasoning | GSM8K (grade-school math problems) | A dataset of 8,500 grade-school math word problems designed to be challenging for language models to solve. |
| 21 | Mathematical Reasoning | MATH | Mathematical reasoning over competition-level mathematics problems. |
| 22 | Mathematical Reasoning | PlanBench | Reasoning about plans and actions. |
| 23 | Moral Reasoning | MoCa | Moral reasoning. |
| 24 | Other | Multiple-choice section of the American bar exam | The multiple-choice segment of the US bar examination. |
| 25 | Other | GRE Reading and Writing | Sections of the exam taken by college students applying to graduate school. |
| 26 | Other | GRE Quantitative Reasoning | Quantitative section of the GRE; model performance is typically compared with the median human applicant. |
| 27 | Other | Constitution (safe and harmless) | Measures how hard it is to elicit offensive or dangerous output. (Normally an internal red-teaming evaluation that scores models on a large representative set of harmful prompts using an automated and transparent process.) |
| 28 | Other | HELM (Holistic Evaluation of Language Models) | Evaluates language models across a broad set of scenarios and metrics, including accuracy, robustness, and fairness. |
| 29 | Other | FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets) | Scores model responses on fine-grained alignment skill sets rather than a single aggregate score. |
| 30 | Other | Perplexity | Measures how well a model predicts the next word from the preceding context, i.e., how well the model's probability distribution matches a test sample. The lower the perplexity, the better the model's predictions. |
| 31 | Other | EleutherAI Language Model Evaluation Harness | A unified framework for testing generative language models on a large number of different evaluation tasks. |
| 32 | Other | BLEU (BiLingual Evaluation Understudy) | A metric for machine translation; BLEU was the metric used in the seminal Transformer paper. |
| 33 | Other | METEOR | An automatic machine-translation metric based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. |
| 34 | Other | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) | A set of metrics for evaluating automatic summarization of texts as well as machine translation. |
| 35 | Other | CIDEr / SPICE | Metrics used for image-captioning tasks. |
| 36 | Other | Hugging Face Open LLM Leaderboard | Evaluates models on four key benchmarks run with the EleutherAI Language Model Evaluation Harness, a unified framework for testing generative language models on a large number of different evaluation tasks. |
| 37 | Other | F1 Score | A metric for classification tasks: the harmonic mean of precision and recall, weighting each equally. |
| 38 | Reasoning | Visual Commonsense Reasoning (VCR) | Visual reasoning. |
| 39 | Reasoning | BigToM | Theory-of-mind (social) reasoning. |
| 40 | Reasoning | Tübingen Cause-Effect Pairs | Causal reasoning. |
| 41 | Reinforcement Learning from Human Feedback | RLAIF | Reinforcement learning from AI feedback: the human preference labels of RLHF are replaced with feedback from an AI model. |
| 42 | Robotics | PaLM-E | Robotics. |
| 43 | Robotics | RT-2 | Robotics. |
| 44 | Task-Specific Agents | MLAgentBench | Task-specific agents; evaluates agents on machine-learning experimentation tasks. |
| 45 | Text | GLUE (General Language Understanding Evaluation) | Measures the general language-understanding performance of NLP models across a range of tasks. |
| 46 | Text | BoolQ | Yes/no question answering. |
| 47 | Text | SIQA | Social IQa: commonsense reasoning about social interactions. |
| 48 | Text | OpenBookQA | Open-book, elementary-level science question answering. |
| 49 | Text | BIG-bench | A large, collaborative suite of diverse tasks for probing language-model capabilities. |
| 50 | Text | AI2 Reasoning Challenge (25-shot) | A set of grade-school science questions. |
| 51 | Text | HellaSwag (10-shot) | A test of commonsense inference that is easy for humans (~95%) but challenging for state-of-the-art models. |
| 52 | Text | MMLU (5-shot) | Measures a text model's multitask accuracy across 57 tasks, including elementary mathematics, US history, computer science, law, and more. |
| 53 | Text | TriviaQA (5-shot) | A benchmark dataset for factual question answering; a popular and challenging way to compare the overall capabilities of LLMs. |
| 54 | Text | QuALITY (5-shot) | A multiple-choice question-answering benchmark over long documents, testing long-context reading comprehension from only five in-context examples. |
| 55 | Text | RACE-H (5-shot) | The high-school (harder) subset of the RACE reading-comprehension dataset; requires reasoning over multiple sentences, with only five in-context examples. |
| 56 | Text | ARC-Challenge (5-shot) | Evaluates the reasoning capabilities of LLMs. Introduced in early 2018 in the paper "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge," the dataset contains 7,787 genuine grade-school-level, multiple-choice science questions that require reasoning over multiple sentences to find the correct answer. |
| 57 | Text | MMLU | Measuring Massive Multitask Language Understanding: a multiple-choice benchmark covering a wide range of subjects, including STEM, the humanities, the social sciences, and more. |
| 58 | Text | MMLU (5-shot CoT) | A variant of MMLU in which the model sees only 5 examples of each task (the few-shot setting) and is prompted to reason step by step (CoT = chain of thought). |
| 59 | Text | TruthfulQA (0-shot) | Factuality and truthfulness; measures a model's propensity to reproduce falsehoods commonly found online. |
| 60 | Text | HaluEval | Factuality and truthfulness; hallucination evaluation. |
| 61 | Video | UCF101 | Video computer vision and video generation. |
| 62 | General language/reasoning | Arena-Hard | Tests general language understanding and reasoning abilities across various tasks. |
| 63 | General language/reasoning | AlpacaEval 2.0 LC | Instruction-following evaluation using a length-controlled (LC) win rate against a reference model, as judged by an LLM. |
| 64 | Multi-turn dialogue | MT-Bench (GPT-4-Turbo judge) | A multi-turn question set scored by an LLM judge (here GPT-4-Turbo); despite the abbreviation, "MT" stands for multi-turn, not machine translation. |
| 65 | Programming/coding | MBPP | Tests coding and programming ability on a set of Mostly Basic Python Problems. |
| 66 | Instruction following and understanding | IFEval (Prompt-Strict-Acc and Instruction-Strict-Acc metrics) | Evaluates the model's ability to follow instructions precisely, testing understanding of and adherence to prompts. |
| 67 | Open-ended text generation quality | TFEval (Distractor F1 and On-topic F1 metrics) | Evaluates open-ended text generation: Distractor F1 measures factual accuracy and On-topic F1 assesses relevance to the given topic or prompt. |
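## Worked metric examples

Several of the metrics above are simple enough to compute directly. The sketches below are illustrative Python, not official scoring scripts; all function names and example numbers are our own.

The pass@1 metric (row 5) is usually reported via the unbiased pass@k estimator from the Codex paper: generate `n` samples per problem, count the `c` that pass the unit tests, and estimate the probability that at least one of `k` drawn samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c of them correct.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples, 37 passing -> pass@1 = 0.185
print(pass_at_k(n=200, c=37, k=1))
```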
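Perplexity (row 30) is the exponential of the average negative log-likelihood per token: a model with perplexity P is, on average, as uncertain as if it were choosing uniformly among P tokens at each step. A minimal sketch, assuming per-token log-probabilities are already available from a model:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-likelihood per token; lower is better."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 10))  # 4.0
```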
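BLEU (row 32) and ROUGE (row 34) both count n-gram overlap between a candidate text and a reference; BLEU is precision-oriented (with a brevity penalty), ROUGE is recall-oriented. A toy unigram-only version of each, far simpler than the real corpus-level metrics:

```python
from collections import Counter
import math

def bleu1(candidate: list[str], reference: list[str]) -> float:
    """Toy sentence-level BLEU: clipped unigram precision times the brevity
    penalty. Real BLEU combines 1- to 4-gram precisions over a whole corpus."""
    ref_counts = Counter(reference)
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(candidate).items())
    precision = clipped / len(candidate)
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

def rouge1_recall(candidate: list[str], reference: list[str]) -> float:
    """Toy ROUGE-1 recall: fraction of reference unigrams found in the candidate."""
    cand_counts = Counter(candidate)
    overlap = sum(min(n, cand_counts[w]) for w, n in Counter(reference).items())
    return overlap / len(reference)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(bleu1(cand, ref))          # 5/6 ~= 0.833
print(rouge1_recall(cand, ref))  # 5/6 ~= 0.833
```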
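The F1 score (row 37) is the harmonic mean of precision and recall, which penalizes a large gap between the two more than a simple arithmetic average would:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from raw classification counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: precision = 0.8, recall ~= 0.667 -> F1 ~= 0.727
print(f1_score(tp=8, fp=2, fn=4))
```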
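Many rows report k-shot results (e.g., MMLU 5-shot in row 52, HellaSwag 10-shot in row 51): the model sees k worked examples in its prompt before the test question. A minimal sketch of how such a prompt is assembled; the exact template varies by benchmark, and this one is purely illustrative:

```python
def build_k_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Concatenate k (question, answer) exemplars, then the target question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

demo = [("2 + 2 = ?", "4"), ("3 * 3 = ?", "9")]  # k = 2 here; MMLU uses k = 5
print(build_k_shot_prompt(demo, "7 - 5 = ?"))
```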
