Common arguments regarding emergent abilities

This blog post doesn’t represent the positions of my employer (past, present, or future).

I’ll review some common arguments that come up when discussing emergent abilities of large language models. Last year we wrote a position paper that defined emergent abilities as “abilities that are not present in small language models but are present in large language models.” We showed that emergent abilities are widely prevalent, and that they are notable for several reasons:

  1. Emergence is not easily predicted by extrapolating scaling curves from smaller models.

  2. Emergent abilities are not explicitly specified by the trainer of the language model (next word prediction “only”).

  3. Since we haven’t tested all possible tasks that exist, we don’t know the full range of abilities that have emerged.

  4. Further scaling can be expected to elicit more emergent abilities.

Since GPT-4, some have argued that emergence is overstated, or even a “mirage”. I don’t think these arguments convincingly debunk the phenomenon of emergence, but they are worth discussing, and it’s good to examine scientific phenomena with a skeptical eye. I’ll try to restate them in their strongest form and then explain my thinking on them.

Emergence depends on the evaluation metric

Argument: Emergent abilities often occur for “hard” evaluation metrics, such as exact match or multiple-choice accuracy, which don’t award credit for partially correct answers. For instance, multi-step arithmetic requires getting every step right; failing even one step results in a wrong answer. If you take the same task but use a “soft” evaluation metric, such as the log-probability of the correct target, you might find that performance improves smoothly with scale, without any significant jumps.

Evidence for this has been shown in multiple papers. The BIG-Bench paper showed that the log-probability of targets improves smoothly across scales (Fig. 9 in “Breakthrough behavior is sensitive to details of task specification”), and it has also been shown that a metric like Token Edit Distance on addition or multiplication appears to improve smoothly, rather than emergently as it does under exact match.

Response: While there is evidence that some tasks that appear emergent under exact match improve smoothly under another metric, I don’t think this rebuts the significance of emergence, since metrics like exact match are what we ultimately want to optimize for many tasks. Consider asking ChatGPT what 15 + 23 is: you want the answer to be 38, and nothing else. Maybe 37 is closer to 38 than -2.591, but assigning partial credit to that answer seems unhelpful for testing the ability to do the task, and how to assign it would be arbitrary. Focusing on metrics that best measure the behavior we care about is important because benchmarks are essentially an “optimization function” for researchers.
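To make the distinction concrete, here is a minimal sketch (in Python, with invented outputs) contrasting exact match with a “soft” partial-credit metric based on edit distance. Nothing here is from the paper; the scoring functions and example predictions are purely illustrative.

```python
def exact_match(prediction: str, target: str) -> float:
    """1.0 only if the prediction is exactly the target, else 0.0."""
    return 1.0 if prediction.strip() == target.strip() else 0.0

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def soft_score(prediction: str, target: str) -> float:
    """Partial credit: 1 minus the normalized edit distance (higher is better)."""
    d = edit_distance(prediction.strip(), target.strip())
    return 1.0 - d / max(len(prediction.strip()), len(target.strip()), 1)

target = "38"
for pred in ["38", "37", "-2.591"]:
    print(pred, exact_match(pred, target), round(soft_score(pred, target), 2))

# Exact match gives credit only to "38"; the soft metric hands out partial credit
# ("37" scores 0.5) whose amount depends on an arbitrary choice of string distance.
```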

It’s important to note, however, that finding a “surrogate” metric that improves smoothly is very significant if it gives more information and enables us to predict a more-important emergent metric. I haven’t seen any substantial evidence that exact-match or multiple-choice performance can be predicted using smooth surrogate metrics, though. In our paper, we showed that cross-entropy loss improved even for small model scales where downstream metrics are close to random and did not improve, indicating that improvements in the log-likelihood of the target sequence can be masked by such downstream metrics. But this analysis did not enable us to predict emergent performance by using only smaller models.

It’s currently an open question whether surrogate metrics could predict emergence on metrics like exact match or multiple-choice accuracy. For instance, given accuracy and cross-entropy loss for a bunch of small models, could you predict the cross-entropy loss for a large model, and then map that to emergent exact-match performance? One might expect that if there is a smooth scaling curve on surrogates, emergence on the downstream metric will eventually occur, but this relationship has not been studied well enough to know how accurately you could predict when that emergence would happen.
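To make the open question concrete, here is a hypothetical sketch of the first half of that pipeline: fitting a smooth scaling law to the cross-entropy loss of small models and extrapolating it to a larger model. All of the model sizes and numbers below are synthetic, not taken from any real model family, and the unsolved step is exactly the missing mapping from the extrapolated loss to downstream exact-match accuracy.

```python
import numpy as np

# Synthetic small-model results: loss improves smoothly, exact match stays ~random.
params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])   # model sizes (made up)
xent   = np.array([3.2, 2.9, 2.6, 2.35, 2.1])   # cross-entropy loss (made up)
exact  = np.array([0.0, 0.0, 0.01, 0.01, 0.02]) # exact-match accuracy (made up)

# The surrogate metric follows a clean power law in these invented numbers,
# so a linear fit in log-log space extrapolates cleanly to a 10x larger model...
slope, intercept = np.polyfit(np.log10(params), np.log10(xent), 1)
pred_xent = 10 ** (intercept + slope * np.log10(1e11))
print(f"extrapolated cross-entropy at 1e11 params: {pred_xent:.2f}")

# ...but nothing in the near-random exact-match numbers tells you whether accuracy
# at 1e11 params will be 3% or 60%. Turning the predicted loss into a prediction
# of downstream accuracy is the part that remains an open question.
```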

Finally, I want to emphasize that showing smoothness on some metrics for some tasks doesn’t mean this occurs for all tasks. Two examples from the paper are below.

Here, cross-entropy loss is slightly smoother for modified arithmetic, but for IPA transliterate there is still a large kink in cross-entropy loss that breaks the trend and is hard to predict:

Here we can pull multiple metrics available in BIG-Bench that award some partial credit, and we see that the performance still sharply increases at the same threshold:


Emergence is an artifact of the scaling curve plot

Argument [1] [2]: Scaling plots for emergence use a log-scaled x-axis, and if you were to use a linear x-axis scale, the shape of the plot would be smooth.

Response: It's still possible to view emergence on a linear x-axis scale. I plotted Figure 2A from our emergence paper below, and you'll still see the same emergent spike from 7B to 13B (albeit in a less readable way).

In addition to the fact that emergence is still visible on a linear scale, it’s justified to use a log-scaled x-axis by default, since the model sizes we train grow exponentially. For example, the PaLM model sizes are 8B → 62B → 540B (roughly a factor of 8x per step), and LaMDA model sizes go up by 2x. So a log scale is appropriate for conveying how we scale models in practice (and this has been done in the literature for many years).
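As a quick illustration, here is a small matplotlib sketch that plots the same synthetic emergence curve on both a log-scaled and a linear x-axis. The numbers are invented (not Figure 2A itself); the point is only that changing the axis scaling does not remove the jump.

```python
import matplotlib.pyplot as plt

model_params = [4e8, 2e9, 7e9, 13e9, 70e9, 540e9]    # exponentially spaced sizes (illustrative)
accuracy     = [0.01, 0.01, 0.02, 0.35, 0.55, 0.70]  # hypothetical accuracy on an emergent task

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
for ax, xscale in zip(axes, ["log", "linear"]):
    ax.plot(model_params, accuracy, marker="o")
    ax.set_xscale(xscale)
    ax.set_xlabel(f"Model parameters ({xscale} scale)")
    ax.set_ylabel("Accuracy")
plt.tight_layout()
plt.show()

# On the linear axis the smaller models bunch up near the origin, but the jump
# between ~7B and ~13B parameters is still there; the axis choice doesn't create it.
```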

Argument: The paper implicitly claims that we should be able to fit linear-ish curves to plots that have log-x and linear-y axes. Why shouldn't we fit exponentials or some other curves?

Response: It makes sense to also plot a log-x, log-y scaling curve, using error rate instead of accuracy on the log-y axis (since accuracy is often 0, and log(0) is negative infinity). However, the shape of the curve stays the same even when you do this.
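Continuing the sketch above (same invented numbers), the log-log version with error rate looks like this; the flat region and the sharp drop at the same threshold are preserved.

```python
import matplotlib.pyplot as plt

model_params = [4e8, 2e9, 7e9, 13e9, 70e9, 540e9]    # illustrative sizes
accuracy     = [0.01, 0.01, 0.02, 0.35, 0.55, 0.70]  # hypothetical accuracies
error_rate   = [1 - a for a in accuracy]

plt.plot(model_params, error_rate, marker="o")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Model parameters (log scale)")
plt.ylabel("Error rate (log scale)")
plt.show()

# Error rate sits near 1.0 for small models and then drops sharply at the same
# point where accuracy jumps, so the log-log view doesn't smooth the curve away.
```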


Emergence is an artifact of not enough model datapoints on the x-axis

Argument [1]: There's a sense in which this definition of emergence (behavior of larger models can't be predicted from smaller ones) has to be too strong: if you sampled the x-axis (number of parameters) densely enough, surely the improvement in accuracy would be continuous or smooth? For example, it seems unlikely that a 1,000,000-parameter model would have 50% (random) accuracy and a model with 1,000,001 parameters would have 90% accuracy.

Response: While this is a reasonable point in theory, we don't have such fine-grained model sizes in practice. But even assuming that we did, and that the improvement in accuracy would look smooth if you zoomed in enough, I still think there’s a notable phenomenon: performance is flat for models below a certain parameter threshold and starts increasing above it, and extrapolating from the flat points wouldn’t enable us to predict the later increase.
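Here is a toy sketch of that point (the sigmoid and all numbers are made up): the curve is perfectly smooth when the parameter axis is sampled densely, yet a fit to the flat, below-threshold region still badly underpredicts performance at larger scales.

```python
import numpy as np

log_params = np.linspace(6, 12, 200)  # 1e6 .. 1e12 parameters, densely sampled
accuracy = 0.25 + 0.65 / (1 + np.exp(-4 * (log_params - 10)))  # smooth, but flat below ~1e10

# Fit a line to the "small model" regime only (below 1e9 parameters)...
small = log_params < 9
slope, intercept = np.polyfit(log_params[small], accuracy[small], 1)
extrapolated = intercept + slope * 11
actual = 0.25 + 0.65 / (1 + np.exp(-4 * (11 - 10)))

print(f"extrapolated accuracy at 1e11 params: {extrapolated:.2f}")  # stays near random
print(f"actual accuracy at 1e11 params:       {actual:.2f}")        # already ~0.89

# Even though the underlying curve is continuous everywhere, the flat region
# carries almost no information about where or how fast the rise will happen.
```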

Note that this definition is true in an uninteresting way for most tasks for small enough N (e.g., models with one or two parameters would have random performance), and so as Tal Linzen suggested it could be good to specify a particular threshold for N, though I don’t think many people are making this quibble. The overall point is that while some behaviors are very predictable (e.g., GPT-4's loss on some evaluations can be predicted using models trained with 1,000x less compute), other behaviors are not predictable even with 2x less compute. The difference between these two types of behavior is night and day.


A final point

While it’s generally good to be skeptical, there seems to be an overwhelming amount of evidence of emergent abilities that (for me) makes it a convincing phenomenon and framing. Even if some emergent abilities are a result of noise, many other instances are very solid. Consider the plots below from the U-shaped scaling and GPT-4 papers: performance actually decreases for several model scales, until it suddenly spikes up. This is a great example of emergence, and I doubt that changing the metric or visualization would make this appear smooth or predictable.

Another popular example of emergence which also underscores qualitative changes in the model is chain-of-thought prompting, for which performance is worse than answering directly for small models, but much better than answering directly for large models. Intuitively, this is because small models can’t produce extended chains of reasoning and end up confusing themselves, while larger models can reason in a more-reliable fashion.
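For readers unfamiliar with the setup, here is a minimal illustration of the two prompting styles being compared; the exemplar is loosely adapted from the well-known tennis-ball example in the chain-of-thought paper, and the exact strings are just for illustration.

```python
exemplar_q = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
              "How many tennis balls does he have now?")
test_q = "A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many apples do they have?"

# Direct prompting: the exemplar shows only the final answer.
direct_prompt = f"Q: {exemplar_q}\nA: The answer is 11.\n\nQ: {test_q}\nA:"

# Chain-of-thought prompting: the exemplar also spells out the intermediate reasoning.
cot_prompt = (
    f"Q: {exemplar_q}\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {test_q}\nA:"
)

# Small models tend to derail while generating the intermediate steps, so direct
# prompting scores higher; above some scale, the reasoning steps start to help instead.
```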

Overall, I’m glad that the idea of emergent abilities is being discussed more and that people are questioning it. I’m especially excited about work that would enable us to predict emergent behavior, since emergent phenomena include risks as well as abilities. I’d love to discuss more with you on twitter or at the next conference!

Thanks Tatsunori Hashimoto, Percy Liang, and Rishi Bommasani for helpful discussions (and any critiques on this blog should go to me, and not them).
