Are We Done with MMLU? (2024)

Aryo Pradipta Gema1, Joshua Ong Jun Leang1, Giwon Hong1, Alessio Devoto2,
Alberto Carlo Maria Mancino2,3, Rohit Saxena1, Xuanli He4, Yu Zhao1, Xiaotang Du1,
Mohammad Reza Ghasemi Madani5, Claire Barale1, Robert McHardy6, Joshua Harris7,
Jean Kaddour4, Emile van Krieken1, Pasquale Minervini1
1University of Edinburgh  2Sapienza University of Rome  3Polytechnic University of Bari
4University College London  5University of Trento  6AssemblyAI  7UK Health Security Agency
{first.last, j.j.l.ong, p.minervini}@ed.ac.uk
alessio.devoto@uniroma1.it  alberto.mancino@poliba.it
mr.ghasemimadani@unitn.it  joshua.harris@ukhsa.gov.uk
{xuanli.he, jean.kaddour.20, robert.mchardy.20}@ucl.ac.uk

Abstract

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU’s error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation: https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.

1 Introduction

The advent of transformer-based Large Language Models (LLMs) [1, 2, 3, 4, 5, 6, 7, 8] marked a significant advancement in generative models, enabling interaction with computing devices through natural language. This advancement rendered many earlier benchmarks and leaderboards obsolete [9, 10], leading to the compilation of more challenging and comprehensive tests. Among these benchmarks, Massive Multitask Language Understanding (MMLU) [11] has gained significant popularity: it assesses both the breadth and depth of language understanding capabilities of current LLMs across a diverse range of subjects, including mathematics, history, computer science, logic, law, etc.

However, the reliability of benchmarking results is only as robust as the quality of the dataset used. We find that, despite its popularity, MMLU suffers from numerous errors that can mislead evaluation and model comparison. These errors, which range from simple parsing and scraping mistakes to more complex issues related to context, interpretation, and dataset quality, compromise the reliability of MMLU as a benchmark. For example, we find that 57% of the analysed instances in the Virology subset contain errors, including the suggestion to send the American army to West Africa to prevent outbreaks of Ebola (see Fig. 1).

Therefore, in this study, we manually analyse the MMLU dataset using a novel error taxonomy to construct MMLU-Redux: 14 human experts manually assessed and re-annotated 3,000 questions across 30 subsets of MMLU. After our manual re-annotation effort, we study how the errors in MMLU impact LLM evaluation. First, we re-evaluate leading LLMs on MMLU-Redux and find that their performance metrics change notably, altering their rankings. Furthermore, we analyse the errors both quantitatively and qualitatively to understand how they impact LLM evaluation.

While MMLU-Redux provides a significant stepping stone towards correcting MMLU, reviewing the remaining 12,908 questions in MMLU is still a significant effort, highlighting the need for an automated approach. MMLU-Redux can also be leveraged as a strong benchmark for automatic error detection in NLP datasets, which would help scale up the review of benchmark datasets. Therefore, we study whether LLMs can help with error detection, using prompting techniques (i.e., In-Context Learning [12] and Chain of Thought (CoT) [13]), Retrieval-Augmented Generation (RAG) [14], and fine-tuning. We believe MMLU-Redux underscores the need for closely studying and reassessing the benchmarks used for evaluating NLP models. MMLU-Redux is available at https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.

2 What is wrong with MMLU?

[Figure 1: An example of an erroneous question from the MMLU Virology subset.]

The MMLU dataset has become a popular choice for evaluating the performance of NLP systems owing to its extensive coverage of many subjects, collected from freely available online sources with the help of graduate and undergraduate students [11]. Despite the manual effort, MMLU still contains errors which are difficult to trace due to its under-documented annotation procedure.

We identify numerous errors in the MMLU questions, ranging from simple parsing mistakes – where the ground truth label differs from the correct answer in the original source (e.g., the answer from the source is B, but it is annotated as C in MMLU) – to more complex issues such as missing context.These errors appear randomly, and due to the lack of comprehensive documentation, it is difficult to trace back and identify the root causes of these problems.

This randomness and lack of traceability highlight the need for a standardised categorisation of errors to improve the reliability and accuracy of the MMLU dataset. By systematically identifying and categorising these errors, we can enhance the dataset’s quality and provide more robust benchmarks for evaluating NLP systems. Our approach involved developing a hierarchical taxonomy of errors, which we used to develop MMLU-Redux: a manual annotation of 30 subsets of MMLU, each containing 100 randomly selected samples (Section 3.1).

2.1 Error Detection Taxonomy

[Figure 2: The proposed hierarchical taxonomy for categorising MMLU errors.]

We develop a hierarchical taxonomy to classify the various errors identified in MMLU into specific error types. Figure 2 illustrates our taxonomy for categorising MMLU errors, while Figure 4 provides examples of each error category. We categorise errors into two primary groups: samples with errors in the clarity of the questions (Type 1, Question Assessment) and samples with errors in the ground truth answer (Type 2, Ground Truth Verification).

(1a) Bad Question Clarity: The question is poorly presented in terms of various aspects, such as clarity, grammar, and sufficiency of information. For instance, a question that refers to a previous question which is not included.

(1b) Bad Options Clarity: The options are unclear, similar, or irrelevant to the question. Most errors in this category stem from incorrect parsing of the options from the original source. For example, a single option might be incorrectly split into two separate options.

(2a) No Correct Answer: None of the options correctly answer the question. This error can, for example, arise when the ground-truth option is omitted to reduce the number of options from five to four.

(2b) Multiple Correct Answers: More than one option can be selected as the answer to the question. For example, the options contain a synonym of the ground truth label.

(2c) Wrong Ground Truth: The correct answer differs from the ground truth provided in MMLU. This type of error occurs when the annotated label differs from the correct label, which may be caused by a mistake during manual annotation.

To create MMLU-Redux, we followed the proposed taxonomy sequentially when annotating each question. We aim to ensure comprehensive coverage of the different error types for further experiments.
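To make the taxonomy concrete, the following is a minimal sketch of how the annotation labels could be represented programmatically; the class and field names are illustrative only and are not the schema of the released dataset.

```python
# A minimal sketch of the taxonomy as an annotation schema; names are illustrative only.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ErrorType(Enum):
    OK = "ok"                                             # no error found
    BAD_QUESTION_CLARITY = "bad_question_clarity"         # Type 1a
    BAD_OPTIONS_CLARITY = "bad_options_clarity"           # Type 1b
    NO_CORRECT_ANSWER = "no_correct_answer"               # Type 2a
    MULTIPLE_CORRECT_ANSWERS = "multiple_correct_answers" # Type 2b
    WRONG_GROUNDTRUTH = "wrong_groundtruth"               # Type 2c

@dataclass
class Annotation:
    question_id: str
    error_type: ErrorType
    # Annotators may suggest a correction, but it does not replace the MMLU label.
    suggested_answer: Optional[str] = None
```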

2.2 Heterogeneity of Errors in MMLU Subsets

During our annotation process using the taxonomy, we observe that the types of errors found can vary substantially across subsets. Some subsets predominantly suffer from ambiguous questions, while others are mainly impacted by incorrect ground truth labels. These variations could have important implications for interpreting MMLU results in specific areas and for our ability to address these issues effectively. For instance, we have identified notable irregularities in certain subsets (a complete list of the reviewed subsets and their corresponding errors can be found in Appendix D):

Virology

– Incorrect ground-truth labels are particularly prevalent within the Virology subset. Many of the incorrect labels are for relatively simple questions, such as identifying the description of a pandemic, which suggests that the errors may stem from problems parsing the original sources (in most cases the Human Virology textbook’s student resources).

College Chemistry

– The questions were sourced from textbooks and standardised college-level exams. We identified erroneous questions resulting from simple parsing errors and unknown causes. For example, questions spanning multiple lines in the original source were often parsed incorrectly, leading to a part of the question being presented as the first MMLU choice (Option A) and the exclusion of Option D. Furthermore, there were questions with ground truth labels that did not match the answers provided in the source, with no apparent cause for the discrepancy.

Professional Law

– A major issue is the lack of specificity regarding jurisdictions. The benchmark does not clearly distinguish between different jurisdictions despite focusing on U.S. law.

Formal Logic

– The dataset contains a significant number of questions with incorrect answers. These are primarily sourced from the ‘Oxford Learning Link’ website. Inaccuracies are not because of invalid scraping: for example, one question states that (F ∧ L) ∧ ¬C is correct, but F ∧ L ∧ ¬C is not, even though these two formulas are clearly equivalent.

Global Facts

– Almost all questions required consulting external sources to be validated; a large portion of these sources are reports from ourworldindata.org (18 cases) and pewresearch.org (15 cases). For several questions, multiple sources provided conflicting answers: for example, on the perceived corruption of political parties in 2013 Ethiopia, ourworldindata.org seems to confirm the answer in MMLU, while the Global Corruption Barometer from Transparency International provides conflicting evidence (see data at Our World in Data and the Global Corruption Barometer).

Machine Learning

– Most questions were sourced from exam papers, assignments or online quizzes. About 30% of the questions require expert knowledge and reasoning. The main issue of this subset is the clarity of the questions and options: e.g., some quiz questions are based on past knowledge, and the descriptions in the questions may be vague or inapplicable today.

Econometrics

– The majority of the questions are correct, but some questions contain unverifiable references, e.g., ‘Consider again the VAR model of equation 16’, where equation 16 cannot be found within the question.

The above irregularities showcase some of the error patterns present in MMLU. We want to highlight one type of error that is especially challenging to catch, namely unspecified context that is needed to properly answer the question. For instance, the Professional Law subset introduces a bias by assuming that the questions relate to US jurisdiction. This pattern also appears in the Professional Accounting subset, which assumes US accounting practice. These questions may also become outdated if the law or practice changes. In general, we find several subjects to be US- and Western-centric: e.g., the Virology subset contains a question about “the Latino population”, implicitly referring to the Latino community in the US, and the Human Aging subset discusses an unspecified survey of older adults.

[Figure 3: Proportion of error types identified in each annotated MMLU subset.]
[Figure 4: Examples of each error category in the taxonomy.]

3 MMLU-Redux: A Correct MMLU Subset

In this section, we propose MMLU-Redux, a manually annotated subset of MMLU, to quantify the errors present in the original dataset. MMLU-Redux serves two main purposes: 1) to measure the prevalence and types of errors in MMLU; and 2) to explore the feasibility of automatically fixing MMLU by leveraging the annotated error types. We find that 1) the proportion of errors in MMLU is non-negligible, highlighting the need for a correct subset; and 2) fixing MMLU automatically proves to be a challenging task, despite the availability of annotated error types.

We create MMLU-Redux by manually labelling a subset of MMLU questions with their corresponding error types. To this end, we follow the taxonomy introduced in Section 2.1. For more accurate annotations, we confirmed the error detection by locating each sample’s original source wherever it was available. However, at present, the correct answers suggested by the annotators are not used to replace the existing MMLU labels.
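The released annotations can be inspected directly; the snippet below is a minimal sketch of loading one subset with the Hugging Face datasets library, where the config name and the error-type column are assumptions based on the description above rather than a documented schema.

```python
# A minimal sketch of loading MMLU-Redux; the config name and column name are assumptions.
from datasets import load_dataset

virology = load_dataset("edinburgh-dawg/mmlu-redux", "virology", split="test")

# Keep only the instances annotated as error-free before scoring a model.
clean = virology.filter(lambda example: example["error_type"] == "ok")
print(len(virology), len(clean))
```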

In the following, we analyse the error statistics of MMLU-Redux and use MMLU-Redux to re-evaluate the performance of LLMs. Furthermore, in Section 4, we explore the possibility of using MMLU-Redux to improve the overall quality of MMLU by automatically fixing the identified errors.

3.1 Analysis of MMLU-Redux

We present the percentage of error types in Fig. 3, with detailed numbers available in Appendix A. In our analysis, we find that more than 9% of the examples are incorrect, suggesting a substantial presence of errors in MMLU. In particular, we find that more than 57% of the examples in Virology contain errors, where 30% of the examples have a wrong ground truth and 15% are unclear questions. Moreover, we also observe significant error percentages in other disciplines: more than 20% of the examples in Logical Fallacies and College Chemistry are wrong, and more than 10% of the examples in Professional Law, Business Ethics, Formal Logic, Human Aging, Global Facts, Machine Learning, Miscellaneous and Public Relations are wrong. Such error proportions could lead to inaccurate comparisons and invalid rankings of LLMs.

[Figure 5: Model performance on erroneous versus correct instances for the seven subjects with the most errors.]

To improve our understanding of how these errors impact the performance of models on MMLU, we compare the performance between erroneous instances and correct instances across the seven subjects identified as having the most errors (Virology, Logical Fallacies, College Chemistry, Professional Law, Business Ethics, Formal Logic, and Human Aging) in Fig. 5.

Although the general trend indicates a performance decline among erroneous instances, we also observed cases where performance was similar or even higher in erroneous instances (Professional Law and Formal Logic). Considering that erroneous instances should intrinsically be unable to yield correct answers, this may serve as evidence of memorisation, suggesting that these MMLU instances were learned during the models’ pretraining processes. For more detailed results including all the models and subjects, refer to https://huggingface.co/spaces/edinburgh-dawg/MMLU-Redux-EDA.

3.2 Re-Evaluating the State-of-the-Art LLMs

To assess how the corrected dataset impacts the performance of existing state-of-the-art LLMs, we re-evaluate them on the five subjects with the highest number of errors (Business Ethics, Virology, Professional Law, College Chemistry, and Logical Fallacies) in MMLU-Redux.

In Table 1, we compare the performance of models when using all instances of MMLU-Redux to their performance when using only correct instances without errors, to see whether this changes the rankings. The results clearly demonstrate that, at least for subjects with a high number of detected errors, these issues are significant enough to affect the results, leading to changes in model rankings. For example, in the Virology subset, Palmyra X v3 ranked 4th in overall performance when considering all instances and ranked 1st when only correct instances were used. This indicates that errors in MMLU are a critical issue, considering that MMLU is used as an important benchmark for evaluating model performance.

Model | Business Ethics | Virology | Professional Law | College Chemistry | Logical Fallacies
Claude 3 Opus | 0.86 (1) → 0.95 (2) | 0.54 (9) → 0.88 (6) | 0.69 (4) → 0.72 (3) | 0.60 (2) → 0.72 (1) | 0.90 (1) → 0.96 (3)
GPT-4o | 0.85 (2) → 0.96 (1) | 0.56 (1) → 0.91 (3) | 0.70 (3) → 0.70 (4) | 0.61 (1) → 0.71 (2) | 0.89 (5) → 0.99 (1)
Gemini 1.5 Pro | 0.80 (7) → 0.88 (9) | 0.55 (7) → 0.84 (10) | 0.67 (7) → 0.70 (5) | 0.58 (6) → 0.69 (6) | 0.88 (7) → 0.96 (5)
GPT-4 (0613) | 0.79 (8) → 0.93 (3) | 0.56 (2) → 0.88 (7) | 0.71 (1) → 0.74 (1) | 0.55 (9) → 0.68 (7) | 0.89 (6) → 0.96 (4)
Llama 3 70b | 0.83 (3) → 0.93 (4) | 0.55 (8) → 0.91 (4) | 0.55 (10) → 0.56 (10) | 0.56 (8) → 0.67 (8) | 0.85 (10) → 0.96 (6)
Gemini 1.5 Flash | 0.82 (6) → 0.91 (8) | 0.53 (10) → 0.88 (9) | 0.58 (9) → 0.57 (9) | 0.60 (3) → 0.71 (5) | 0.86 (9) → 0.92 (10)
Palmyra X v3 | 0.83 (4) → 0.91 (6) | 0.56 (4) → 0.93 (1) | 0.68 (5) → 0.65 (7) | 0.59 (4) → 0.71 (3) | 0.90 (2) → 0.95 (7)
PaLM 2 | 0.83 (5) → 0.91 (7) | 0.56 (5) → 0.93 (2) | 0.68 (6) → 0.65 (8) | 0.59 (5) → 0.71 (4) | 0.90 (3) → 0.95 (8)
GPT-4 Turbo (1106) | 0.78 (9) → 0.91 (5) | 0.56 (3) → 0.88 (8) | 0.71 (2) → 0.73 (2) | 0.47 (10) → 0.60 (10) | 0.86 (8) → 0.93 (9)
Mixtral 8x22b | 0.74 (10) → 0.84 (10) | 0.56 (6) → 0.91 (5) | 0.59 (8) → 0.66 (6) | 0.57 (7) → 0.64 (9) | 0.90 (4) → 0.99 (2)

4 Can We Fix the MMLU Dataset Automatically?

After presenting evidence of the numerous errors in the MMLU dataset, we explore the following approaches to detect these errors automatically (the code for these experiments is available at https://github.com/aryopg/mmlu-redux):

Zero-Shot prompting

We provide the model with a straightforward instruction to classify questions into “ok” or “not ok” without introducing any demonstrations. The prompt can be found in Appendix B.
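The exact prompts are listed in Appendix B; the snippet below is only a schematic of the zero-shot call, assuming an OpenAI-style chat API and a placeholder instruction rather than the actual prompt wording.

```python
# A schematic of the zero-shot error-detection call; the instruction text is a placeholder.
from openai import OpenAI

client = OpenAI()

def classify_question(question: str, choices: list[str], answer: str) -> str:
    prompt = (
        "Decide whether the following multiple-choice question and its ground-truth "
        "answer are well-formed. Reply with exactly 'ok' or 'not ok'.\n\n"
        f"Question: {question}\nChoices: {choices}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # one of the evaluated models
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```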

Few-shot prompting

We provide the model with two examples for each error type to guide its classification decisions.

Chain of Thought (CoT) prompting [13]

We encourage the model to generate reasoning steps before producing the final answer, in both zero-shot and few-shot settings. The prompt format can be found in Appendix B.

Retrieval-augmented prompting (RAG)

We retrieve the 5 most relevant paragraphs from Wikipedia and MS MARCO and append them as context for zero-shot and CoT prompting.

Instruction fine-tuning

Finally, we fine-tune the Llama 3 (8B-Instruct) model [8] using curated data and evaluate its performance on MMLU-Redux. Detailed information is provided in Appendix E.

In the following, we introduce these error detection strategies and their evaluation results in detail.

4.1 Error Detection Experiments

To evaluate the performance of large language models (LLMs) in detecting errors in the MMLU-Redux dataset, we conduct experiments with four state-of-the-art models: OpenAI’s GPT-4 Turbo and GPT-4o, Anthropic’s Claude 3 Opus, and Meta’s Llama-3-70B [8]. For each model, we test both standard prompting and Chain of Thought (CoT) prompting methods [13]. Details about the prompts are provided in Appendix B.

Model | Method | Recall | F1 Score | F2 Score
GPT-4 Turbo | Zero-Shot | 22.81 | 22.70 | 23.90
GPT-4 Turbo | Zero-Shot CoT | 27.97 | 22.97 | 23.47
GPT-4 Turbo | Few-shot | 26.55 | 27.93 | 27.97
GPT-4 Turbo | Few-shot CoT | 46.68 | 31.68 | 36.58
GPT-4o | Zero-Shot | 28.92 | 15.99 | 28.05
GPT-4o | Zero-Shot CoT | 19.85 | 21.41 | 19.98
GPT-4o | Few-shot | 37.40 | 24.93 | 31.06
GPT-4o | Few-shot CoT | 38.47 | 29.26 | 31.36
Claude 3 Opus | Zero-Shot | 27.11 | 27.00 | 27.65
Claude 3 Opus | Zero-Shot CoT | 44.87 | 34.68 | 38.19
Claude 3 Opus | Few-shot | 38.63 | 29.45 | 34.89
Claude 3 Opus | Few-shot CoT | 48.85 | 24.03 | 40.29
Llama3-70B | Zero-Shot | 10.06 | 8.15 | 9.46
Llama3-70B | Zero-Shot CoT | 10.74 | 8.15 | 10.17
Llama3-70B | Few-shot | 17.82 | 18.58 | 17.60
Llama3-70B | Few-shot CoT | 24.87 | 23.16 | 23.10

We consider “not ok” as the positive class and “ok” as the negative class for the calculation of Recall, F1, and F2 scores. Based on Table 2, the Few-shot CoT setting consistently outperforms the other settings across all models, suggesting that providing a small set of labelled examples along with step-by-step reasoning instructions can improve error detection performance. However, even the best-performing model, Claude 3 Opus, only achieves an F2 score of 40.29, highlighting the difficulty of this task.
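As a reminder, the F2 score is the F-beta measure with β = 2, which weights recall more heavily than precision:

$$F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}, \qquad F_2 = \frac{5\,P\,R}{4\,P + R},$$

where P and R denote precision and recall computed with “not ok” as the positive class.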

Furthermore, we use retrieval-augmented prompting (RAG) to investigate the impact of external knowledge on error detection. We use BM25 to retrieve relevant paragraphs from the enwiki-paragraphs and msmarco-v1-passage corpora provided by Pyserini [15]. We retrieve the top 5 relevant paragraphs from the knowledge base using the question as the query. We then use these paragraphs as additional context in the prompt to classify the question and answer choices.
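A rough sketch of this retrieval step is shown below, using Pyserini’s prebuilt indexes named above; the example query and the downstream prompt assembly are placeholders.

```python
# A minimal sketch of the BM25 retrieval step with Pyserini's prebuilt indexes.
from pyserini.search.lucene import LuceneSearcher

def retrieve_paragraphs(question: str, index_name: str = "msmarco-v1-passage", k: int = 5) -> list[str]:
    """Return the stored text of the top-k BM25 hits for a question."""
    searcher = LuceneSearcher.from_prebuilt_index(index_name)
    hits = searcher.search(question, k=k)
    # The exact format of the stored document depends on the index (plain text vs. JSON).
    return [searcher.doc(hit.docid).raw() for hit in hits]

# The retrieved paragraphs are then prepended as context to the error-detection prompt.
context = "\n\n".join(retrieve_paragraphs("Which disease is caused by the Ebola virus?"))
```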

Model | Index | Method | Recall | F1 Score | F2 Score
GPT-4 Turbo | Wikipedia | Zero-Shot | 57.00 | 27.47 | 36.87
GPT-4 Turbo | Wikipedia | Zero-Shot CoT | 46.63 | 31.52 | 36.13
GPT-4 Turbo | MS MARCO | Zero-Shot | 57.11 | 26.34 | 35.02
GPT-4 Turbo | MS MARCO | Zero-Shot CoT | 37.20 | 26.07 | 29.64
GPT-4o | Wikipedia | Zero-Shot | 35.01 | 27.25 | 29.79
GPT-4o | Wikipedia | Zero-Shot CoT | 33.67 | 26.90 | 28.75
GPT-4o | MS MARCO | Zero-Shot | 29.97 | 28.07 | 28.31
GPT-4o | MS MARCO | Zero-Shot CoT | 28.65 | 24.14 | 25.30
Llama3-70B | Wikipedia | Zero-Shot | 22.78 | 20.41 | 21.00
Llama3-70B | Wikipedia | Zero-Shot CoT | 10.12 | 12.61 | 10.90
Llama3-70B | MS MARCO | Zero-Shot | 28.45 | 22.39 | 24.67
Llama3-70B | MS MARCO | Zero-Shot CoT | 10.18 | 14.00 | 11.40
Claude 3 Opus | Wikipedia | Zero-Shot | 82.61 | 28.72 | 41.92
Claude 3 Opus | MS MARCO | Zero-Shot | 83.91 | 28.09 | 41.27

Based on Table 3, the Claude 3 Opus model with the zero-shot method on the MS MARCO index achieves the highest Recall of 83.91. Claude 3 Opus achieves an F2 score of 41.92 with the zero-shot method on the Wikipedia index. The GPT-4 models perform relatively worse than Claude, with the GPT-4o model showing lower scores than GPT-4 Turbo. Comparing the retrieval indexes, Wikipedia generally outperforms MS MARCO for the GPT-4o model, while the results are mixed for the GPT-4 Turbo and Llama3-70B models. The RAG approach outperforms the few-shot CoT setting mentioned in the previous analysis, indicating that incorporating retrieved information can enhance error detection performance.

Based on the results presented in Tables 2 and 3, we can conclude that automatic error detection in the MMLU dataset remains a challenging task, despite the availability of annotated error types. Claude 3 Opus demonstrates the highest performance in terms of Recall, F1, and F2 scores compared to the other models, and it performs best when combined with RAG, indicating its potential for identifying errors more effectively. However, even the best-performing model and method combination using RAG still achieves relatively low scores, suggesting that the overall reliability of the models in detecting errors across the diverse range of subjects in MMLU is still limited. Detailed performance across all subjects can be found in Appendix C.

5 Related Work

Benchmark Issues

While benchmarks often enable methodological progress, they can also be counterproductive when labelling mistakes and annotation artefacts exist. For example, Beyer et al. [16] show how annotation artefacts within the popular ImageNet benchmark [17] led to likely overstated performance gains that did not necessarily transfer to other datasets and tasks. In NLP, similar issues have been found in summarisation [18, 19] and natural language inference [20, 21, 22, 23] benchmarks.

Benchmark issues can arise from biases in the framing of the task [24]; noisy annotations [25]; web-crawls [26, 27, 28]; automated labelling processes such as crowdsourcing [29] — where annotators may be optimising for their own incentives; human errors [30], e.g., lack of expertise on a given topic; or programmatic weak supervision [31, 32, 7]; and personal [33] or group-level [34] annotator biases.

MMLU Issues and MMLU-Pro.

The broad adoption of the MMLU benchmark for LLM evaluations [2, 1, 5, 6, 8] means identifying issues or improvements is crucial for ensuring its continued applicability. Recent studies have identified issues with labelling errors and ambiguous questions in similar benchmarks, such as MedQA [35]. Concurrent work developing the MMLU-Pro [36] benchmark also identifies a number of issues within a filtered and augmented subset of the original MMLU dataset and re-annotates parts of the dataset for inclusion in the new MMLU-Pro evaluation. However, there is currently limited literature categorising and quantifying these types of issues across the original MMLU dataset to help inform our understanding of previous results.

6 Conclusion

We analyse the Massive Multitask Language Understanding (MMLU) benchmark, driven by the necessity for rigorous evaluation of its reliability. Our analysis of 30 different MMLU subjects using a hierarchical taxonomy reveals that a significant portion of MMLU instances contain inaccuracies that could lead to misleading evaluation results. For example, 57% of the instances in the Virology subset and 26% in the Logical Fallacies subset were found to be inaccurate. To this end, we introduce MMLU-Redux, a thoroughly reviewed subset of the MMLU dataset [11] comprising 3,000 questions spanning the 30 MMLU subjects we analysed. The re-evaluation of LLMs using MMLU-Redux shows a significant variation in performance metrics and shifts in model rankings for several subsets, emphasising the impact that dataset quality can have on the evaluation of LLMs. Furthermore, we analyse whether it is possible to identify errors automatically; although Claude 3 Opus seems to produce the most accurate results on this task (41.9% F2 score when using retrieval-augmented generation), it is still insufficient to produce a high-quality dataset. By making MMLU-Redux publicly available, we invite the community to contribute to the endeavour of building a reliable dataset for properly evaluating the next generation of LLMs.

Limitations and Implications

Although our analysis uses a significant subset of 3,000 questions, the accuracy of this analysis, both quantitatively and qualitatively, can be further improved with additional annotation of the remaining 12,908 questions in MMLU. Properly evaluating the wide variety of subjects covered in MMLU would likely require a larger variety of experts. Therefore, we open up MMLU-Redux for additional annotation, both of additional MMLU subjects and of the already reviewed subsets. However, we acknowledge that the taxonomy we introduced to classify errors might still be prone to annotators’ personal biases. MMLU-Redux is available at https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.

References

  • OpenAI [2023]OpenAI.GPT-4 technical report.CoRR, abs/2303.08774, 2023.
  • Anil etal. [2023a]Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, TimothyP. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, PaulRonald Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, and etal.Gemini: A family of highly capable multimodal models.CoRR, abs/2312.11805, 2023a.
  • Anthropic [2023]Anthropic.Anthropic. model card and evaluations for claude models., 2023.URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.
  • Anil etal. [2023b]Rohan Anil, AndrewM. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, JonathanH. Clark, LaurentEl Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, YiTay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, GustavoHernández Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham, JanA. Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, ChristopherA. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vladimir Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, and etal.Palm 2 technical report.CoRR, abs/2305.10403, 2023b.
  • Touvron etal. [2023]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov,and Thomas Scialom.Llama 2: Open foundation and fine-tuned chat models.CoRR, abs/2307.09288, 2023.
  • Anthropic [2024]AIAnthropic.The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 2024.
  • Kaddour etal. [2023]Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy.Challenges and applications of large language models.CoRR, abs/2307.10169, 2023.doi: 10.48550/ARXIV.2307.10169.URL https://doi.org/10.48550/arXiv.2307.10169.
  • AI@Meta [2024]AI@Meta.Llama 3 model card.2024.URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  • Laskar etal. [2023]Md. TahmidRahman Laskar, M.Saiful Bari, Mizanur Rahman, MdAmranHossen Bhuiyan, Shafiq Joty, and JimmyXiangji Huang.A systematic study and comprehensive evaluation of chatgpt on benchmark datasets.In ACL (Findings), pages 431–469. Association for Computational Linguistics, 2023.
  • Shen etal. [2023]Chenhui Shen, Liying Cheng, Yang You, and Lidong Bing.Are large language models good evaluators for abstractive summarization?CoRR, abs/2305.13091, 2023.
  • Hendrycks etal. [2021]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.Measuring massive multitask language understanding.In ICLR. OpenReview.net, 2021.
  • Brown etal. [2020]TomB. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, DanielM. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.Language models are few-shot learners.CoRR, abs/2005.14165, 2020.
  • Wei etal. [2022]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, EdChi, QuocV Le, Denny Zhou, etal.Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022.
  • Lewis etal. [2020]Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, etal.Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Lin etal. [2021a]Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira.Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations.In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362, 2021a.
  • Beyer etal. [2020]Lucas Beyer, OlivierJ. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron vanden Oord.Are we done with imagenet?CoRR, abs/2006.07159, 2020.
  • Russakovsky etal. [2015]Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, etal.Imagenet large scale visual recognition challenge.International journal of computer vision, 115:211–252, 2015.
  • Tejaswin etal. [2021]Priyam Tejaswin, Dhruv Naik, and Pengfei Liu.How well do you know your summarization datasets?In ACL/IJCNLP (Findings), volume ACL/IJCNLP 2021 of Findings of ACL, pages 3436–3449. Association for Computational Linguistics, 2021.
  • Bhandari etal. [2020]Manik Bhandari, PranavNarayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig.Re-evaluating evaluation in text summarization.In EMNLP (1), pages 9347–9359. Association for Computational Linguistics, 2020.
  • Gururangan etal. [2018]Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, SamuelR. Bowman, and NoahA. Smith.Annotation artifacts in natural language inference data.In NAACL-HLT (2), pages 107–112. Association for Computational Linguistics, 2018.
  • Poliak etal. [2018]Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and BenjaminVan Durme.Hypothesis only baselines in natural language inference.In *SEM@NAACL-HLT, pages 180–191. Association for Computational Linguistics, 2018.
  • Stacey etal. [2020]Joe Stacey, Pasquale Minervini, Haim Dubossarsky, Sebastian Riedel, and Tim Rocktäschel.Avoiding the hypothesis-only bias in natural language inference via ensemble adversarial training.In EMNLP (1), pages 8281–8291. Association for Computational Linguistics, 2020.
  • Wu etal. [2022]Yuxiang Wu, Matt Gardner, Pontus Stenetorp, and Pradeep Dasigi.Generating data to mitigate spurious correlations in natural language inference datasets.In ACL (1), pages 2660–2676. Association for Computational Linguistics, 2022.
  • Schwartz etal. [2017]Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and NoahA. Smith.The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task.In CoNLL, pages 15–25. Association for Computational Linguistics, 2017.
  • Chen etal. [2016]Danqi Chen, Jason Bolton, and ChristopherD. Manning.A thorough examination of the cnn/daily mail reading comprehension task.In ACL (1). The Association for Computer Linguistics, 2016.
  • Raffel etal. [2023]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
  • Lee etal. [2021]Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini.Deduplicating training data makes language models better.arXiv preprint arXiv:2107.06499, 2021.
  • Kaddour [2023]Jean Kaddour.The minipile challenge for data-efficient language models.CoRR, abs/2304.08442, 2023.doi: 10.48550/ARXIV.2304.08442.URL https://doi.org/10.48550/arXiv.2304.08442.
  • Yuen etal. [2011]Man-Ching Yuen, Irwin King, and Kwong-Sak Leung.A survey of crowdsourcing systems.In SocialCom/PASSAT, pages 766–773. IEEE Computer Society, 2011.
  • Peterson etal. [2019]JoshuaC. Peterson, RuairidhM. Battleday, ThomasL. Griffiths, and Olga Russakovsky.Human uncertainty makes classification more robust.In ICCV, pages 9616–9625. IEEE, 2019.
  • Zhang etal. [2022]Jieyu Zhang, Cheng-Yu Hsieh, Yue Yu, Chao Zhang, and Alexander Ratner.A survey on programmatic weak supervision.CoRR, abs/2202.05433, 2022.
  • Goswami etal. [2021]Mononito Goswami, Benedikt Boecking, and Artur Dubrawski.Weak supervision for affordable modeling of electrocardiogram data.In AMIA. AMIA, 2021.
  • Geva etal. [2019]Mor Geva, Yoav Goldberg, and Jonathan Berant.Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets.In EMNLP/IJCNLP (1), pages 1161–1166. Association for Computational Linguistics, 2019.
  • Liu etal. [2022]Haochen Liu, Joseph Thekinen, Sinem Mollaoglu, DaTang, JiYang, Youlong Cheng, Hui Liu, and Jiliang Tang.Toward annotator group bias in crowdsourcing.In ACL (1), pages 1797–1806. Association for Computational Linguistics, 2022.
  • Saab etal. [2024]Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, JuanmaZambrano Chaves, Szu-Yeu Hu, Mike Schaekermann, Aishwarya Kamath, Yong Cheng, David G.T. Barrett, Cathy Cheung, Basil Mustafa, Anil Palepu, Daniel McDuff, LeHou, Tomer Golany, Luyang Liu, Jean-Baptiste Alayrac, Neil Houlsby, Nenad Tomasev, Jan Freyberg, Charles Lau, Jonas Kemp, Jeremy Lai, Shekoofeh Azizi, Kimberly Kanada, SiWai Man, Kavita Kulkarni, Ruoxi Sun, Siamak Shakeri, Luheng He, Benjamin Caine, Albert Webson, Natasha Latysheva, Melvin Johnson, PhilipAndrew Mansfield, Jian Lu, Ehud Rivlin, Jesper Anderson, Bradley Green, Renee Wong, Jonathan Krause, Jonathon Shlens, Ewa Dominowska, S.M.Ali Eslami, Katherine Chou, Claire Cui, Oriol Vinyals, Koray Kavukcuoglu, James Manyika, Jeff Dean, Demis Hassabis, Yossi Matias, DaleR. Webster, JoelleK. Barral, Greg Corrado, Christopher Semturs, S.Sara Mahdavi, Juraj Gottweis, Alan Karthikesalingam, and VivekNatarajan.Capabilities of gemini models in medicine.CoRR, abs/2404.18416, 2024.doi: 10.48550/ARXIV.2404.18416.URL https://doi.org/10.48550/arXiv.2404.18416.
  • Wang etal. [2024]Yubo Wang, Xueguang Ma, GeZhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen.Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024.
  • Clark etal. [2018]Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457v1, 2018.
  • Mihaylov etal. [2018]Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal.Can a suit of armor conduct electricity? a new dataset for open book question answering.In EMNLP, 2018.
  • Amini etal. [2019]Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi.MathQA: Towards interpretable math word problem solving with operation-based formalisms.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.doi: 10.18653/v1/N19-1245.URL https://aclanthology.org/N19-1245.
  • Jin etal. [2021]DiJin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits.What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021.
  • Lin etal. [2021b]Stephanie Lin, Jacob Hilton, and Owain Evans.Truthfulqa: Measuring how models mimic human falsehoods, 2021b.
  • Loshchilov and Hutter [2018]Ilya Loshchilov and Frank Hutter.Decoupled weight decay regularization.In International Conference on Learning Representations, 2018.
  • Hu etal. [2021]EdwardJ Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, Weizhu Chen, etal.Lora: Low-rank adaptation of large language models.In International Conference on Learning Representations, 2021.

Checklist

  1. For all authors...

     (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] We claim to develop a dataset for studying the errors present in MMLU, which we present.

     (b) Did you describe the limitations of your work? [Yes] See Section Limitations and Implications.

     (c) Did you discuss any potential negative societal impacts of your work? [No] See Section Limitations and Implications.

     (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results...

     (a) Did you state the full set of assumptions of all theoretical results? [N/A]

     (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g. for benchmarks)...

     (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 3.

     (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] We provide details for all proposed methods. For prompting strategies, we show prompts in Appendix B. For fine-tuning, we provide details about the training dataset in Appendix E.

     (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Rerunning experiments multiple times to obtain error bars would have exceeded our funding capabilities.

     (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We provide details about our resources in Appendix E.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

     (a) If your work uses existing assets, did you cite the creators? [Yes] Our work is based on the MMLU dataset, which we cite in Section 1.

     (b) Did you mention the license of the assets? [Yes] The license of the dataset is available on the dataset URL (CC-BY 4.0).

     (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] MMLU-Redux is provided via a URL in the Abstract.

     (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes] Our dataset is based on MMLU, which has an MIT license. The annotation work for MMLU-Redux was done by the authors of this paper, who all gave consent to use the data.

     (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] The MMLU data does not use personal data as it is based on publicly available test questions, and so neither does MMLU-Redux.

  5. If you used crowdsourcing or conducted research with human subjects...

     (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

     (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

     (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A MMLU-Redux Error Type Statistics

In this section, we give the exact statistics of the error types found in MMLU-Redux. Table 4 contains an overview. We include the total number of questions in each subject, but note that we annotated a subset of 100 questions for each.

Subset | Dataset Size | OK | BQC (Bad Question Clarity) | BOC (Bad Options Clarity) | NCA (No Correct Answer) | MCA (Multiple Correct Answers) | WG (Wrong Ground Truth)
Virology166431424433
Logical fallacies16374142433
College chemistry10075202021
Professional law1,53482410112
Business ethics10086140000
Formal logic1268710291
Human aging22388120000
College medicine1739620101
High school macroeconomics3908829010
Global facts1008811406
Philosophy31110000000
Machine learning1128935012
Miscellaneous7839041410
Astronomy1529422101
Public relations1109130033
Professional accounting2829401500
Conceptual physics2359500014
College computer science1009700210
Econometrics1149730000
High school physics1519700102
High school statistics2169800002
Electrical engineering1459810001
Clinical knowledge2659910000
Anatomy1359900001
High school chemistry2039910000
High school mathematics2709900001
College mathematics1009910000
College physics10210000000
High school US history20410000000
High school geography19810000000
Total7,2639131113

Appendix B Prompting Methods

Below, we provide the prompts used for both standard prompting and Chain of Thought (CoT) prompting methods.

Throughout the evaluation, we used the test split of the MMLU-Redux loaded with the specified configuration. The hyperparameters include a temperature of 0.0, top_p of 1, frequency_penalty and presence_penalty of 0, and max_tokens of 600 for both the standard prompting and Chain of Thought (CoT) prompting methods to ensure consistency and deterministic results. The default random seed was used.
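Concretely, these settings correspond to request arguments of roughly the following form, shown for an OpenAI-style chat completion call; the model name and prompt text are placeholders, not the exact configuration used.

```python
# Decoding hyperparameters listed above, expressed as OpenAI-style request arguments;
# the model name and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()
generation_kwargs = dict(
    temperature=0.0,          # deterministic outputs
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    max_tokens=600,
)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "<error-detection prompt from this appendix>"}],
    **generation_kwargs,
)
print(response.choices[0].message.content)
```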

Appendix C Details on automatic error detection

C.1 Detailed Results on Error Detection Experiments for MMLU-Redux

Dataset | Models | Zero-Shot | Zero-Shot CoT | Few-Shot | Few-Shot CoT
College Chemistry | GPT-4-Turbo | 52.94 | 17.14 | 55.74 | 48.98
College Chemistry | GPT-4o | 42.86 | 30.77 | 19.35 | 30.77
College Chemistry | Claude-3-Opus | 40.00 | 36.36 | 24.24 | 32.26
College Chemistry | Llama-3-70B | 0.00 | 0.00 | 13.00 | 19.00
College Mathematics | GPT-4-Turbo | 0.00 | 20.00 | 0.00 | 0.00
College Mathematics | GPT-4o | 0.00 | 21.43 | 0.00 | 21.43
College Mathematics | Claude-3-Opus | 9.52 | 0.00 | 0.00 | 32.26
College Mathematics | Llama-3-70B | 0.00 | 0.00 | 0.00 | 0.00
Econometrics | GPT-4-Turbo | 0.00 | 8.70 | 28.57 | 20.69
Econometrics | GPT-4o | 0.00 | 29.41 | 40.00 | 29.41
Econometrics | Claude-3-Opus | 15.38 | 0.00 | 46.15 | 0.00
Econometrics | Llama-3-70B | 0.00 | 0.00 | 57.00 | 55.00
Formal Logic | GPT-4-Turbo | 23.33 | 22.22 | 26.09 | 0.00
Formal Logic | GPT-4o | 33.33 | 0.00 | 0.00 | 0.00
Formal Logic | Claude-3-Opus | 26.67 | 55.81 | 0.00 | 0.00
Formal Logic | Llama-3-70B | 0.00 | 0.00 | 0.00 | 0.00
Global Facts | GPT-4-Turbo | 19.35 | 30.00 | 20.00 | 44.44
Global Facts | GPT-4o | 16.67 | 28.57 | 21.05 | 28.57
Global Facts | Claude-3-Opus | 28.57 | 32.26 | 38.09 | 37.50
Global Facts | Llama-3-70B | 0.00 | 0.00 | 22.00 | 32.00
High School Physics | GPT-4-Turbo | 9.52 | 17.39 | 23.53 | 16.67
High School Physics | GPT-4o | 19.05 | 26.09 | 30.77 | 26.09
High School Physics | Claude-3-Opus | 8.33 | 21.05 | 28.57 | 33.33
High School Physics | Llama-3-70B | 0.00 | 0.00 | 18.00 | 22.00
Machine Learning | GPT-4-Turbo | 20.83 | 9.52 | 30.77 | 35.71
Machine Learning | GPT-4o | 40.00 | 20.00 | 28.57 | 20.00
Machine Learning | Claude-3-Opus | 33.33 | 23.08 | 31.58 | 46.15
Machine Learning | Llama-3-70B | 0.00 | 0.00 | 0.00 | 14.00
Professional Law | GPT-4-Turbo | 25.00 | 17.14 | 21.43 | 22.86
Professional Law | GPT-4o | 8.00 | 23.53 | 8.00 | 23.53
Professional Law | Claude-3-Opus | 29.79 | 32.43 | 16.00 | 23.53
Professional Law | Llama-3-70B | 11.00 | 9.00 | 8.00 | 9.00
Public Relations | GPT-4-Turbo | 0.00 | 20.00 | 31.58 | 52.17
Public Relations | GPT-4o | 0.00 | 31.58 | 25.00 | 31.58
Public Relations | Claude-3-Opus | 0.00 | 80.77 | 38.09 | 35.29
Public Relations | Llama-3-70B | 17.00 | 17.00 | 15.00 | 14.00
Virology | GPT-4-Turbo | 76.00 | 73.12 | 81.19 | 75.27
Virology | GPT-4o | 0.00 | 81.19 | 76.59 | 81.19
Virology | Claude-3-Opus | 78.43 | 25.00 | 71.74 | 0.00
Virology | Llama-3-70B | 54.00 | 56.00 | 52.00 | 67.00

Dataset | Models | Zero-Shot | Zero-Shot CoT | Few-Shot | Few-Shot CoT
College Chemistry | GPT-4-Turbo | 45.69 | 26.55 | 50.29 | 48.39
College Chemistry | GPT-4o | 48.39 | 18.69 | 30.61 | 22.94
College Chemistry | Claude-3-Opus | 40.00 | 50.85 | 35.09 | 27.27
College Chemistry | Llama-3-70B | 0.00 | 0.00 | 0.00 | 0.00
College Mathematics | GPT-4-Turbo | 0.00 | 0.00 | 0.00 | 0.00
College Mathematics | GPT-4o | 0.00 | 0.00 | 0.00 | 0.00
College Mathematics | Claude-3-Opus | 6.17 | 0.00 | 0.00 | 0.00
College Mathematics | Llama-3-70B | 0.00 | 0.00 | 0.00 | 10.64
Econometrics | GPT-4-Turbo | 6.13 | 15.63 | 10.49 | 39.47
Econometrics | GPT-4o | 13.33 | 0.00 | 32.26 | 50.00
Econometrics | Claude-3-Opus | 10.53 | 40.00 | 34.88 | 60.00
Econometrics | Llama-3-70B | 0.00 | 0.00 | 9.52 | 14.15
Formal Logic | GPT-4-Turbo | 17.41 | 22.73 | 26.32 | 20.83
Formal Logic | GPT-4o | 35.09 | 8.47 | 30.30 | 24.19
Formal Logic | Claude-3-Opus | 24.69 | 32.79 | 40.54 | 31.75
Formal Logic | Llama-3-70B | 0.00 | 0.00 | 0.00 | 0.00
Global Facts | GPT-4-Turbo | 17.05 | 26.79 | 25.00 | 37.04
Global Facts | GPT-4o | 16.67 | 9.09 | 25.00 | 17.86
Global Facts | Claude-3-Opus | 31.25 | 37.31 | 41.67 | 48.39
Global Facts | Llama-3-70B | 0.00 | 0.00 | 62.50 | 75.00
High School Physics | GPT-4-Turbo | 6.29 | 31.25 | 6.99 | 30.30
High School Physics | GPT-4o | 13.33 | 23.81 | 23.26 | 38.46
High School Physics | Claude-3-Opus | 5.75 | 35.71 | 21.28 | 38.46
High School Physics | Llama-3-70B | 31.25 | 26.32 | 18.52 | 27.27
Machine Learning | GPT-4-Turbo | 15.72 | 9.26 | 19.69 | 40.98
Machine Learning | GPT-4o | 37.31 | 20.00 | 43.48 | 33.09
Machine Learning | Claude-3-Opus | 27.03 | 47.62 | 34.88 | 39.06
Machine Learning | Llama-3-70B | 12.82 | 12.82 | 41.67 | 57.47
Professional Law | GPT-4-Turbo | 21.74 | 16.85 | 23.08 | 22.47
Professional Law | GPT-4o | 10.87 | 6.25 | 10.87 | 18.29
Professional Law | Claude-3-Opus | 26.12 | 32.97 | 21.74 | 29.41
Professional Law | Llama-3-70B | 43.65 | 45.45 | 25.00 | 27.78
Public Relations | GPT-4-Turbo | 25.97 | 21.28 | 36.23 | 60.00
Public Relations | GPT-4o | 18.87 | 44.44 | 27.03 | 32.61
Public Relations | Claude-3-Opus | 20.55 | 28.30 | 35.09 | 49.18
Public Relations | Llama-3-70B | 0.00 | 10.64 | 12.50 | 12.20
Virology | GPT-4-Turbo | 82.97 | 64.39 | 81.63 | 66.29
Virology | GPT-4o | 86.67 | 69.03 | 87.80 | 75.37
Virology | Claude-3-Opus | 84.39 | 76.36 | 83.76 | 79.42
Virology | Llama-3-70B | 6.85 | 6.49 | 6.25 | 6.49

Dataset | Models | Zero-Shot | Zero-Shot CoT | Few-Shot | Few-Shot CoT
College Chemistry | GPT-4-Turbo | 41.86 | 24.00 | 47.22 | 48.00
College Chemistry | GPT-4o | 52.94 | 16.00 | 50.00 | 20.00
College Chemistry | Claude-3-Opus | 40.00 | 48.00 | 50.00 | 24.00
College Chemistry | Llama-3-70B | 12.00 | 8.00 | 8.00 | 12.00
College Mathematics | GPT-4-Turbo | 0.00 | 0.00 | 0.00 | 0.00
College Mathematics | GPT-4o | 0.00 | 0.00 | 0.00 | 0.00
College Mathematics | Claude-3-Opus | 5.00 | 0.00 | 0.00 | 0.00
College Mathematics | Llama-3-70B | 0.00 | 0.00 | 0.00 | 0.00
Econometrics | GPT-4-Turbo | 5.00 | 33.33 | 8.57 | 100.00
Econometrics | GPT-4o | 11.11 | 0.00 | 28.57 | 100.00
Econometrics | Claude-3-Opus | 8.70 | 66.67 | 30.00 | 100.00
Econometrics | Llama-3-70B | 0.00 | 0.00 | 0.00 | 100.00
Formal Logic | GPT-4-Turbo | 14.89 | 23.08 | 23.33 | 23.08
Formal Logic | GPT-4o | 36.36 | 7.69 | 40.00 | 23.08
Formal Logic | Claude-3-Opus | 23.53 | 30.77 | 50.00 | 30.77
Formal Logic | Llama-3-70B | 0.00 | 0.00 | 0.00 | 0.00
Global Facts | GPT-4-Turbo | 15.79 | 25.00 | 23.53 | 33.33
Global Facts | GPT-4o | 16.67 | 8.33 | 28.57 | 16.67
Global Facts | Claude-3-Opus | 33.33 | 41.67 | 44.44 | 50.00
Global Facts | Llama-3-70B | 0.00 | 0.00 | 0.00 | 25.00
High School Physics | GPT-4-Turbo | 5.13 | 66.67 | 5.71 | 66.67
High School Physics | GPT-4o | 11.11 | 33.33 | 20.00 | 66.67
High School Physics | Claude-3-Opus | 4.76 | 66.67 | 18.18 | 66.67
High School Physics | Llama-3-70B | 33.33 | 33.33 | 66.67 | 33.33
Machine Learning | GPT-4-Turbo | 13.51 | 9.09 | 17.24 | 45.45
Machine Learning | GPT-4o | 35.71 | 18.18 | 66.67 | 36.36
Machine Learning | Claude-3-Opus | 24.00 | 54.55 | 37.54 | 5.45
Machine Learning | Llama-3-70B | 0.00 | 9.09 | 0.00 | 9.09
Professional Law | GPT-4-Turbo | 20.00 | 16.67 | 21.43 | 22.22
Professional Law | GPT-4o | 14.29 | 5.56 | 14.29 | 16.67
Professional Law | Claude-3-Opus | 24.14 | 33.33 | 28.57 | 27.78
Professional Law | Llama-3-70B | 5.56 | 5.56 | 5.56 | 5.56
Public Relations | GPT-4-Turbo | 23.53 | 22.22 | 33.33 | 66.67
Public Relations | GPT-4o | 18.18 | 44.44 | 28.57 | 33.33
Public Relations | Claude-3-Opus | 18.75 | 33.33 | 33.33 | 66.67
Public Relations | Llama-3-70B | 11.11 | 11.11 | 11.11 | 11.11
Virology | GPT-4-Turbo | 88.37 | 59.65 | 85.11 | 61.40
Virology | GPT-4o | 92.86 | 64.91 | 97.30 | 71.93
Virology | Claude-3-Opus | 88.89 | 73.68 | 94.29 | 77.19
Virology | Llama-3-70B | 38.60 | 40.35 | 36.84 | 52.63

Appendix D Heterogeneity of Errors in MMLU Subsets – Full List

Here, we provide an extensive list of qualitative observations from the manually validated MMLU subsets.

Professional Law

– There are several contextual limitations that pose challenges for accurate question answering. One major issue is the lack of specificity regarding jurisdictions. The benchmark does not clearly distinguish between different jurisdictions, despite focusing on U.S. law. Additionally, the inherently interpretative nature of legal principles means that there is often no definitive ground truth, making it difficult to provide unequivocal answers.

Professional Accounting

– These questions mainly come from FAR CPA Exams, and are of high quality. There are minor issues in scraping where numeric answers are not converted properly and significance is lost, e.g., where the given correct answer is $242,000, while the correct computed answer is $241,843. Furthermore, like in professional law, all questions assume U.S. professional accounting practice, even though this is rarely specified.

Human Aging

– The questions are mostly correct, except for some questions containing underspecified information, e.g., “In this chapter’s Senior View, Dr. Shealy advises you to”.

Global Facts

– Almost all questions required consulting external sources to be answered, such as ourworldindata.org (18 cases) and pewresearch.org (15 cases); in a few cases, multiple sources provided conflicting answers to the same question – for example, on the perceived corruption of political parties in 2013 Ethiopia, ourworldindata.org confirms the answer in MMLU, while other sources such as the Global Corruption Barometer from Transparency International provided conflicting answers.

Virology

– Incorrect ground-truth labels are particularly prevalent within the Virology subset. Many of the incorrect labels are for relatively simple questions, such as identifying the description of a pandemic, which suggests that the errors may stem from problems parsing the original datasets (in most cases the Human Virology textbook’s student resources). In addition, there is a range of issues relating to question and option clarity where the necessary context required to answer the question is missing, such as which family of diseases is being referred to.

Business Ethics

– This subset includes several unclear questions where respondents must identify multiple correct statements from a list of statements. MMLU provides only the last statement of the list as the question instead of including the entire question.

Philosophy

– Several samples in this subset cannot be explicitly found in an external source.

Public Relations

– This subset includes various errors, ranging from multiple correct answers and wrong ground truth to bad question clarity.

Anatomy

– This is almost completely correct.

College Chemistry

– The questions were sourced from textbooks (e.g., Chechik: Electron Paramagnetic Resonance, Hore: Nuclear Magnetic Resonance 2e) and standardised college-level exams (e.g., GRE Chemistry Practice Test). We identified erroneous questions resulting from simple parsing errors and unknown causes. For example, questions spanning multiple lines in the original source were often parsed incorrectly, leading to a part of the question being presented as the first MMLU choice (Option A) and the exclusion of Option D. Additionally, some questions originally included option E, which was the correct answer, but this option was omitted from the MMLU choices to fit into the 4-choices question style. Furthermore, there were questions with ground truth labels that did not match the answers provided in the source, with no apparent cause for the discrepancy.

College Medicine

– The questions were mostly sourced from textbooks (e.g., Maughan & Gleeson: The Biochemical Basis of Sports Performance 2e) and standardised college-level medical exam (e.g., MCAT). The majority of question-answer pairs are of good quality. However, questions that were sourced from Maughan & Gleeson: The Biochemical Basis of Sports Performance 2e can also be found in the Clinical Knowledge subject.

Clinical Knowledge

– The questions were mostly sourced from textbooks (e.g., Maughan & Gleeson: The Biochemical Basis of Sports Performance 2e, Cox & Roper: Clinical Skills, Endacott, Jevon & Cooper: Clinical Nursing Skills Core and Advanced). The majority of question-answer pairs are of good quality. One specific question was annotated very well by adding the time when the question was asked (i.e., “Which of the following statements is true about informal carers (as of 2020)?”) which correctly indicates the ever-changing nature of these questions. However, questions sourced from Maughan & Gleeson: The Biochemical Basis of Sports Performance 2e can also be found in the College Medicine subject.

Formal Logic

– The dataset contains a significant number of questions with incorrect answers. These are primarily sourced from the ‘Oxford Learning Link’ website. Inaccuracies are not because of invalid scraping: for example, one question states that (F ∧ L) ∧ ¬C is correct, but F ∧ L ∧ ¬C is not, even though these two formulas are clearly equivalent.

Logical fallacies

– Some questions discuss fallacies, e.g., the solid slope fallacy, which come from a set of flashcards obtained from the ‘Quizlet’ website, but otherwise do not return any hits on Google. There is also a large set of unclear questions involving an argument with a logical fallacy, but without any question. For instance, one “question” is “All things that are spoiled are inedible. Timothy is spoiled. So, Timothy is inedible.” The original source of this question also had the instruction “Select the fallacy-type which best describes the reasoning contained in each of the passages below.”, but this was lost when adding it to the MMLU dataset.

Machine Learning

– Most questions were sourced from exam papers, assignments or online quizzes. About 30% of the questions require expert knowledge and reasoning. The main issues of this subset are bad question clarity and bad options clarity: e.g., some quiz questions are based on past knowledge, and the descriptions in the questions may be vague or inapplicable today.

Electrical Engineering

– All the questions are sourced from the ‘Electrical4U’ website. However, 2 answers have been incorrectly extracted from the website.

College Mathematics

– The majority of the questions are from GRE papers, but they have been modified to have 4 options instead of 5. The majority of the questions require expert knowledge and reasoning. However, there is one question that was incorrectly adjusted when changing from 5 options to 4.

College Computer Science

– The majority of the questions are from a Computer Science GRE practice exam and online boards with practice questions.

High School Mathematics

– The majority of the questions are extracted from AP Mathematics and require expert knowledge and reasoning.

High School Statistics

– The majority of the questions are from AP Statistics and can be directly obtained from the crackap.com website. However, two answers were incorrectly extracted.

Miscellaneous

– The questions are direct facts regarding celebrities, movies, and global knowledge, which can be directly extracted from the internet. However, there were some incorrect answers and vague questions that could be difficult to answer accurately. For instance: ‘How many cups of coffee were consumed in the United States in the past week (as of 2013)?’

Econometrics

– The majority of the questions are correct, but some questions contain unverifiable references. (E.g., ’Consider again the VAR model of equation 16,’ but equation 16 cannot be found within the question.)

High School Chemistry

– This is almost completely correct.

High School Physics

– This subset is almost completely correct. The majority of the questions are from AP Statistics and can be directly obtained from the ‘crackap’ website.

College Physics

– Some of the questions (approximately 20%) were duplicated.

High School Geography

– This is almost completely correct.

Elementary Mathematics

– This is almost completely correct. There are only a few questions (mostly basic equations) with wrong ground-truth answers.

Conceptual Physics

– This is almost completely correct, with only a few questions having wrong ground truth answers due to issues with data scraping from unreliable sources.

Astronomy

– Most questions and their corresponding answers are correct. A unique source was not identified, but some can be found on ‘Quizlet.’ Two questions are unclear because they are missing essential details. Additionally, two question options are confusing due to incorrect parsing of the power of 10 (e.g., 10^16 is incorrectly shown as 1016). One question lacks a correct answer because it is outdated, highlighting the importance of keeping questions updated with recent discoveries in this field.

Appendix E Error Detection via Fine-tuning

To validate our fine-tuning strategy for error detection, we developed LabelChaos, a dataset designed to mirror the error distribution of the original MMLU. This dataset serves as a benchmark for fine-tuning models, which are subsequently evaluated on MMLU-Redux.

To create LabelChaos, we selected and merged six manually labelled datasets. We chose datasets annotated by humans [37, 38, 39, 40, 41] to avoid the potential inaccuracies associated with automated labelling procedures. After standardising these datasets to align with the format of MMLU, we generate a corrupted version of each by introducing specific types of corruption, as categorised in our taxonomy (Fig. 3). These corruptions were carefully applied to replicate the quality and distribution characteristics of the MMLU dataset. The final dataset comprises approximately 264,000 samples, similar to those in MMLU. Below, we provide an overview of the LabelChaos subsets and the corruption procedures used for each type of error; a minimal code sketch of two of these corruptions follows the list.

  • Wrong Ground Truth: The correct answer label is replaced by randomly selecting from the incorrect answers.

  • Poor Question Clarity: An LLM-based corruption is introduced by prompting GPT-3.5 Turbo to modify the question, increasing its difficulty. Few-shot examples from MMLU illustrating the “poor question clarity” error type are used.

  • No Correct Answers: The correct answer is replaced with ‘all options listed’ or ‘all of the above’.

  • Unclear Options: Since most instances of unclear options in the original MMLU stem from parsing errors, we simulate these by introducing parsing errors into the choices, such as splitting strings at incorrect characters.

  • Multiple Correct Answers: GPT-3.5 Turbo is used to generate a new option semantically identical to the correct one. One of the incorrect options is then replaced with this newly generated option.

  • correct: The original, uncorrupted dataset.
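The snippet below is a minimal sketch, not the authors’ implementation, of how two of these corruptions could be applied to an MMLU-format example (a dict with a question, a list of choices, and an answer index); all function names are illustrative.

```python
# A hedged sketch of two LabelChaos-style corruptions on an MMLU-format example;
# function names and the example layout are illustrative.
import random

def corrupt_wrong_ground_truth(example: dict) -> dict:
    """Replace the ground-truth index with a randomly chosen incorrect option."""
    wrong = [i for i in range(len(example["choices"])) if i != example["answer"]]
    return {**example, "answer": random.choice(wrong)}

def corrupt_unclear_options(example: dict) -> dict:
    """Simulate a parsing error by splitting one option into two fragments."""
    choices = list(example["choices"])
    idx = random.randrange(len(choices))
    cut = max(1, len(choices[idx]) // 2)
    choices[idx : idx + 1] = [choices[idx][:cut], choices[idx][cut:]]
    # Truncate to the original number of options, mimicking a dropped final choice;
    # a full implementation would also need to re-map the answer index.
    return {**example, "choices": choices[: len(example["choices"])]}
```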

We fine-tune the Llama-3 (8B-Instruct) model [8] using the LabelChaos datasets. To balance the distribution in MMLU-Redux, where most instances are labelled as "correct", we adjusted the label distribution to: 0.1 (Wrong Ground Truth), 0.1 (Poor Question Clarity), 0.1 (No Correct Answers), 0.1 (Unclear Options), 0.1 (Multiple Correct Answers), and 0.5 (correct). For consistency, we used the previously described CoT prompt as input, with outputs labelled as either “ok” or “not ok”. The training involves 2,048 steps with a batch size of 64, using the AdamW optimizer [42] with a learning rate of 2 × 10^-4 and no weight decay. Due to computational constraints, we apply LoRA [43], with a rank of 16, to all models. All experiments were conducted on a single Nvidia A100 (40GB) GPU. We present the results of the fine-tuning method in Table 8.
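A minimal sketch of this configuration with the Hugging Face transformers and peft libraries is shown below; the checkpoint identifier and LoRA target modules are assumptions, and the data loading and training loop are omitted.

```python
# A sketch of the LoRA fine-tuning setup described above; the checkpoint name and
# target modules are assumptions, and the training loop itself is omitted.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA with rank 16, applied via adapters rather than updating all weights.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# AdamW with a learning rate of 2e-4 and no weight decay, matching the reported setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.0)
```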

Methods | Recall | F1 Score | F2 Score
GPT-4 Turbo (Few-shot CoT) | 46.68 | 31.60 | 36.58
GPT-4o (Few-shot CoT) | 38.47 | 29.26 | 31.36
Claude 3 Opus (Few-shot CoT) | 48.85 | 24.03 | 40.29
Llama3-70B (Few-shot CoT) | 24.87 | 23.16 | 23.10
Llama3-8B (Fine-tuning) | 56.58 | 34.06 | 44.75