Successful language model evals
Everybody uses evaluation benchmarks (“evals”), but I think they deserve more attention than they are currently getting. Evals are incentives for the research community, and breakthroughs are often closely linked to a huge performance jump on some eval. In fact, I’d argue that a key job of the team lead is to dictate what eval to optimize.
What is the definition of a successful eval? I’d say that if an eval is used in breakthrough papers and trusted within the community, then it’s clearly successful.
Here are some of the successful evals of the past five years:
GLUE/SuperGLUE was used by basically all NLP papers in the pre-LLM era (BERT, T5, etc.).
MMLU is used by almost all LLM papers. It’s the favorite eval of DeepMind and Google.
GSM8K spurred LLMs for reasoning, and is used in every paper on chain-of-thought.
MATH is also used in most LLM papers.
HumanEval is the classic eval for LLMs for coding.
Obviously this isn’t a comprehensive list—there are other good evals like HellaSwag, SQuAD, etc.
I made two evals that are somewhat popular. MGSM is used in OpenAI’s simple evals, Claude, and Gemini. BBH was used in Claude, Gemini, and Llama. I think they’re decent but not among the best.
One common thing among the successful evals is that a big paper claims some victory using the eval. GLUE was promoted by BERT. MMLU was promoted by Gopher, Chinchilla, and Flan-PaLM. Chain-of-thought prompting claimed a breakthrough on GSM8K. The prowess of Minerva was shown on MATH. HumanEval was attempted by Codex and others.
Going one level deeper, a good score on the eval must mean something significant and easily understandable. For instance, achieving superhuman performance is very understandable. Solving grade-school level math problems is also something people can easily grasp the significance of.
It’s easier to mess up an eval than to make a good one. Most unsuccessful evals make at least one of the mistakes below.
If an eval doesn’t have enough examples, it will be noisy and a bad user experience for researchers. For example, someone might run the eval over the course of model training and see that it fluctuates wildly from checkpoint to checkpoint. This makes the eval painful to work with, and researchers won’t like using it. It’s good to have at least 1,000 examples for your eval, and perhaps more if it’s a multiple-choice eval. Even though GPQA is a good eval, the fact that scores fluctuate depending on the prompt makes it hard to use.
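To make the noise point concrete, here is a rough back-of-the-envelope sketch (my own illustration with assumed numbers, not from any particular eval): if each example is treated as an independent pass/fail, the standard error of a measured accuracy shrinks only with the square root of the eval size, which is why a few hundred examples can produce checkpoint-to-checkpoint swings of several points.

```python
# Back-of-the-envelope noise estimate for an eval score (illustrative only).
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an observed accuracy p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

for n in [100, 250, 1000, 4000]:
    se = accuracy_standard_error(0.7, n)  # assume the model scores ~70%
    # +/- 2 standard errors is roughly a 95% interval around the measured score.
    print(f"n={n:5d}  ~95% interval: +/- {2 * se:.1%}")
```

At 100 examples the interval is around +/- 9 points, while at 1,000 examples it shrinks to about +/- 3 points, which matches the rule of thumb above.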
Evals should be high quality. If there are a lot of mistakes in your eval, people won’t trust it. For example, I used Natural Questions (NQ) for a long time. But GPT-4 crossed a threshold where, if it got a test example wrong, it was more likely that the eval’s ground-truth answer was incorrect than that the model had actually erred. So I stopped using NQ.
If your eval is too complicated, people will struggle to understand it and will simply use it less. I think the first version of HELM was a great effort, but it had way too many metrics and subsets. It’s critical to have a single-number metric; I can’t think of any great evals that don’t have one.
If your eval takes too much work to run, it won’t gain traction even if everything else is good. BIG-Bench is one of my favorite evals, but it is a great pain to run. There were both log-prob evals and generation evals, which required different infra. There were way too many subsets, and some of them had too many examples, so running the whole thing took a long time. I believe that’s why BIG-Bench didn’t gain much traction, even though it provided a lot of signal.
If an eval is not on a meaningful task, AI researchers won’t deeply care about it. For example, in BIG-Bench Hard we had tasks like recommending movies or closing parentheses properly. These tasks were challenging and trended well with model size, but doing well on them didn’t allow for making a substantive conclusion about the intelligence of the model. Successful evals often measure things central to intelligence, like language understanding, exam problems, or math.
The grading in your eval should be as correct as possible. If someone is debugging why their model got graded incorrectly, and they disagree with the grading, that’s a quick way for them to write off your eval immediately. It’s worth spending the time to minimize errors due to answer parsing, or to have the best autograder prompt possible.
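To illustrate how much the parsing step matters, here is a minimal sketch of a grader for a GSM8K-style numeric eval (a hypothetical example I wrote for this post, not the actual grading code of any eval mentioned here): extract the last number in the response and normalize it before comparing, so that formatting differences alone don’t get graded as wrong.

```python
# Hypothetical numeric-answer grader (illustrative, not any eval's real code).
import re

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(response: str) -> str | None:
    """Return the last number mentioned in the response, or None if there is none."""
    matches = _NUMBER.findall(response)
    return matches[-1] if matches else None

def normalize(num: str) -> float:
    """Strip thousands separators so '1,250' and '1250' compare equal."""
    return float(num.replace(",", ""))

def grade(response: str, ground_truth: str) -> bool:
    predicted = extract_final_number(response)
    if predicted is None:
        return False
    return abs(normalize(predicted) - normalize(ground_truth)) < 1e-6

assert grade("... so she pays $1,250 in total.", "1250")  # formatting differs, still correct
assert not grade("I think the answer is 17.", "18")       # genuinely wrong answer
```

Even a simple grader like this has failure modes (units, ranges, answers buried mid-sentence), which is exactly why disagreements over grading erode trust so quickly.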
For the eval to stand the test of time, performance must not become saturated too quickly. For example, GLUE/SuperGLUE got saturated so quickly that it became hard to show big gains, and people stopped using them. Language models also got good at tasks like summarization and translation faster than we could develop good evals for them, and so we stopped measuring those tasks.
Funny enough, it seems like most of the great evals have atrocious names. GSM8K didn’t need the “8K”, and HumanEval doesn’t actually use humans for evaluation (it’s called HumanEval because the problems were created by humans). MATH was too generic, so people started calling it “Hendrycks-math”, which I suppose is a clever way to get people to name an eval after you.
If you want your eval to be successful, you should help people use it. For instance, when I make an eval, I usually offer to run it for other people on their models. If their model does well, they’ll like the eval and promote it. HELM does a great job of trying to evaluate other people’s models for them, and publicizing the results.
It also helps if you can create incentives for people to use your eval. One of the best incentives for people is what their manager values. So it can pay off to get buy-in on your eval from managers within your lab or company, who will ask their reports to run it. When I created MGSM at Google, I collaborated with Dipanjan Das, who was on a different team than me. I worked with him because he’s a fun guy (not to promote the eval), but I think he liked it and it gained some popularity in his team.
LLMs have made evaluations substantially harder. LLMs are massively multi-task and give long responses. Right now there is no single eval that adequately evaluates LLMs. The current popular evals still use very simple grading (either multiple choice, checking a number, or running unit tests). And even those have problems, like deciding on the prompt or parsing the answer. It would be nice if the community standardized on a single prompt format, like zero-shot chain-of-thought. I know it’s not a perfect solution for many reasons, but I think it’s a reasonable price to pay to get everyone on the same page.
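Here is a sketch of the kind of standardization I mean (the exact template wording and the answer marker are my own assumptions, not an agreed-upon convention): one shared zero-shot chain-of-thought prompt and one shared answer-extraction rule, applied identically to every task, so that numbers reported in different papers are at least comparable.

```python
# One hypothetical shared zero-shot chain-of-thought protocol (illustrative).
ZERO_SHOT_COT_TEMPLATE = (
    "{question}\n\n"
    "Let's think step by step, and end your response with "
    "\"The answer is: <answer>\"."
)

def build_prompt(question: str) -> str:
    return ZERO_SHOT_COT_TEMPLATE.format(question=question)

def extract_answer(response: str) -> str | None:
    """Pull out whatever follows the agreed-upon answer marker."""
    marker = "The answer is:"
    if marker not in response:
        return None
    return response.split(marker)[-1].strip().rstrip(".")

print(build_prompt("A train travels 60 miles in 1.5 hours. What is its average speed in mph?"))
print(extract_answer("The distance is 60 and the time is 1.5, so 60 / 1.5 = 40. The answer is: 40 mph"))
```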
One new thrust has been human pairwise ratings of models, such as LMSYS. The generality of these evals is a double-edged sword. They’re powerful because you can get a single-number metric for how good a language model is on a general set of prompts, and noise at the sample level can be averaged out over a large number of samples. The dangerous side of pairwise evals is that you aren’t exactly sure what you’re measuring: for example, it’s not totally clear how much things like feel and style are weighted compared to correctness.
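For intuition on how many noisy pairwise votes collapse into a single number, here is a minimal Elo-style sketch (my rough approximation of the general idea, not LMSYS’s actual methodology, which uses a more careful statistical fit):

```python
# Toy rating aggregation from pairwise human votes (illustrative only).
from collections import defaultdict

def elo_ratings(battles, k=32.0, base=400.0, init=1000.0, passes=20):
    """battles: list of (winner_model, loser_model) tuples from human votes."""
    ratings = defaultdict(lambda: init)
    for _ in range(passes):  # repeated passes over the votes let the ratings settle
        for winner, loser in battles:
            # Expected win probability for the winner under the current ratings.
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / base))
            ratings[winner] += k * (1.0 - expected)
            ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c"),
         ("model_a", "model_b"), ("model_c", "model_b")]
print(elo_ratings(votes))  # one scalar per model, aggregated over many noisy votes
```

The single number is convenient, but nothing in the aggregation tells you whether a vote was won on correctness or on style, which is the ambiguity described above.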
It also became somewhat trendy to do model-generated evaluations. While I tend to find model-generated evals to be finicky, it’s possible to do them well and they can be useful for quick experiments and seeing large jumps in performance. But creating a great eval that stands the test of time takes a lot of care, and I wouldn’t want to risk anything with synthetic evaluations.
An obvious statement is that the topic of the eval dictates how many people will care about it. It’s possible to create a very high-quality domain-specific eval (e.g., legal or medical), and in those cases it’s most important to tailor the eval to what is valued by the experts in that domain. However, it’s important to set the right expectations with yourself about how popular the eval will become. I once made a histopathology image benchmark, and unsurprisingly it has not gained any traction outside medical image analysis and only got 40 citations. That being said, it’s also possible that a domain-specific eval you create can gain more traction once more people realize its importance. For instance, OpenAI invested heavily in LLMs for writing code, and I believe many more people became interested in LLMs for coding after the success of things like Codex and GitHub Copilot.
An increasingly important issue with evals is test set contamination. After a good eval is created, its examples tend to propagate to various places on the internet, like arXiv papers, ChatGPT examples, or Reddit. One solution is to keep the test set hidden, but that approach introduces a lot of friction. Chris Manning had a good suggestion: give an eval both a public test set and a private test set, and monitor whether any models score substantially differently on the two. This approach balances the low friction of testing on the public test set with the trustworthiness of the private test set.
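Here is a small sketch of how that monitoring could look (the function, model names, scores, and threshold are all hypothetical): score each model on both splits and flag any model whose public-set score sits suspiciously far above its private-set score.

```python
# Hypothetical contamination check across a public and a private test split.
def flag_possible_contamination(public_scores, private_scores, gap_threshold=0.05):
    """Each argument maps model name -> accuracy on that split."""
    flagged = []
    for model, public_acc in public_scores.items():
        gap = public_acc - private_scores.get(model, 0.0)
        if gap > gap_threshold:  # public score much higher than private score
            flagged.append((model, round(gap, 3)))
    return flagged

public = {"model_a": 0.81, "model_b": 0.74}   # made-up numbers
private = {"model_a": 0.72, "model_b": 0.73}
print(flag_possible_contamination(public, private))  # [('model_a', 0.09)]
```

A gap like this isn’t proof of contamination, but it’s a cheap signal that the public split may have leaked into a model’s training data.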
A final thing I have noticed is that the eval you care about says a lot about your identity. A room full of PhDs will likely be interested in the ability of language models to reason about math, code, and physics. Conversely, I have seen user-facing evals like LMSYS treated as the gold standard by engineers who come from software or product backgrounds. Though I care about both, my personal bent is towards intelligence, since I believe intelligence is the fundamental driver of how AI will interact with humans.
We as a community should invest in evals a bit more, even though it can be painful and is usually not rewarded as much as modeling work. At the end of the day, good evals (with proper buy-in) are the objective function for AI researchers, and they are a powerful way to make an impact on the field.