ARC-Challenge (5-Shot)

ARC-Challenge (5-Shot) is a benchmark used to evaluate the reasoning capabilities of large language models (LLMs). The ARC dataset was introduced in the 2018 paper "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge" and contains 7,787 genuine grade-school, multiple-choice science questions. The dataset is partitioned into an Easy Set and a Challenge Set: the Challenge Set consists of 2,590 questions that were answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm, so they typically require reasoning beyond simple fact lookup.
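
The Challenge split is available on the Hugging Face Hub. A minimal sketch of loading and inspecting it with the `datasets` library, assuming the `ai2_arc` dataset name and `ARC-Challenge` configuration, might look like this:

```python
from datasets import load_dataset

# Load the Challenge configuration of the ARC dataset from the Hugging Face Hub.
arc = load_dataset("ai2_arc", "ARC-Challenge")

example = arc["test"][0]
print(example["question"])   # question text
print(example["choices"])    # {"text": [...], "label": ["A", "B", ...]}
print(example["answerKey"])  # label of the correct choice, e.g. "B"
```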

In a 5-Shot evaluation, the model is given five example question-answer pairs from the ARC dataset in its prompt before being asked to answer a new question. This assesses the model's ability to pick up the task format from a small number of in-context examples and generalize to new problems.
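
The exact prompt template varies between evaluation harnesses, but a rough sketch of assembling a 5-shot prompt from a list of ARC-style records (field names assumed to match the Hugging Face `ai2_arc` schema) could look like the following:

```python
def build_few_shot_prompt(few_shot_examples, target, k=5):
    """Concatenate k solved examples followed by the unanswered target question.

    Both arguments are assumed to be ARC-style records with "question",
    "choices" ({"text", "label"}), and "answerKey" fields.
    """
    def format_question(ex):
        choices = "\n".join(
            f"{label}. {text}"
            for label, text in zip(ex["choices"]["label"], ex["choices"]["text"])
        )
        return f"Question: {ex['question']}\n{choices}\nAnswer:"

    # Solved demonstrations: question, choices, and the correct label.
    blocks = [f"{format_question(ex)} {ex['answerKey']}" for ex in few_shot_examples[:k]]

    # Target question, left unanswered for the model to complete.
    blocks.append(format_question(target))
    return "\n\n".join(blocks)
```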

The ARC-Challenge (5-Shot) benchmark remains a demanding test for LLMs and is commonly used to compare the performance of different models. For example, the Hugging Face Open LLM Leaderboard includes ARC-Challenge among the datasets it averages to rank open-source LLMs.

Here is an example of an ARC-Challenge question:

Question: What is the best way to separate a mixture of sand and water?

Answers:

A. Filter the mixture through a paper towel.
B. Let the mixture settle and then pour off the water.
C. Add salt to the mixture and then boil the water away.
D. All of the above.

To answer this question correctly, the model needs to understand the different properties of sand and water, as well as the different methods of separating mixtures.
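
In practice, evaluation harnesses often do not ask the model to generate an answer freely. Instead, they compare the likelihood the model assigns to each answer choice given the prompt and pick the highest-scoring one. Below is a rough sketch of that approach with a Hugging Face causal LM; the model name is only a placeholder, and production harnesses handle details such as tokenization boundaries and length normalization more carefully.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM on the Hub works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def score_choice(prompt, choice_text):
    """Sum of log-probabilities of the answer tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + choice_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs at each position for predicting the *next* token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    answer_tokens = full_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(answer_positions, answer_tokens))

question = "What is the best way to separate a mixture of sand and water?"
choices = {
    "A": "Filter the mixture through a paper towel.",
    "B": "Let the mixture settle and then pour off the water.",
    "C": "Add salt to the mixture and then boil the water away.",
    "D": "All of the above.",
}
prompt = f"Question: {question}\nAnswer:"
prediction = max(choices, key=lambda label: score_choice(prompt, choices[label]))
print(prediction)
```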

The ARC-Challenge (5-Shot) benchmark is a valuable tool for evaluating and comparing the reasoning capabilities of LLMs, and a useful resource for researchers and developers working to improve them.
