Calibrating the Mosaic Evaluation Gauntlet
Tessa is a research scientist at Databricks working on retrieval-augmented generation. Previously she was at the New York Times using computer vision for sports journalism. She has a master's degree in Computer Science from Brown University and has worked in data science roles at Meta, SpaceX, Tinder and Snapchat.
A good benchmark is one that clearly shows which models are better and which are worse. The Databricks Mosaic Research team is dedicated to finding great measurement tools that allow researchers to evaluate experiments. The Mosaic Evaluation Gauntlet is our set of benchmarks for evaluating the quality of models and is composed of 39 publicly available benchmarks split across 6 core competencies: language understanding, reading comprehension, symbolic problem solving, world knowledge, commonsense, and programming. In order to prioritize the metrics that are most useful for research tasks across model scales, we tested the benchmarks using a series of increasingly advanced models.
Slides: https://docs.google.com/presentation/d/1fIxyJ_vizTQJ2lKcMbmhTtJ8S7O2tQhB/edit?usp=sharing&ouid=108999209131560878693&rtpof=true&sd=true
Theresa Barton [00:00:10]: So this talk will be about calibration of evaluation datasets for large language models at Databricks. I guess it's kind of a cliche at this point that large language models are difficult to evaluate. There are 5000 evaluation datasets on Hugging Face. We have gathered a set of 39 evaluation datasets for our models from this 5000. Even so, when we evaluate large language models, sometimes the difference between two models is trivial and we don't know whether the result is significant. We also don't really know what to do when two scores disagree. So you can see in this radar plot that ChatGPT is getting a high score in math, but a low score in coding. And given our current evaluation frameworks, we don't totally know what to make of that type of discrepancy.
Theresa Barton [00:01:09]: Which model to pick? That is what this work is trying to address. We decided to make a controversial assumption: that the evaluations that show that larger models are better are good evaluations, and the ones that do not follow the Chinchilla scaling law are perhaps poor evaluations. The reason we decided to make this kind of controversial assumption was that we train a lot of models at small scales and then deploy those changes at large scale. And we really care about evaluation datasets that can evaluate for breaking changes that hurt a model's ability to scale. So the experiment we did was to train five models with compute from 1e21 to 5e23 FLOPs. So you can think like GPT-2 or Llama 2 size compute. And we evaluated our benchmarks against these models with one, three, five, or ten shots. And shots is the number of examples you get per question.
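To make the idea of "shots" concrete, here is a minimal sketch of how a k-shot prompt could be assembled. The function name `build_k_shot_prompt`, the `Question:`/`Answer:` template, and the toy arithmetic examples are all assumptions made for illustration; the talk does not say how the Gauntlet formats its prompts.

```python
import random


def build_k_shot_prompt(examples, target_question, k, seed=0):
    """Prepend k solved examples to the target question.

    examples: list of (question, answer) pairs to draw the shots from.
    The prompt template here is hypothetical, not the Gauntlet's own.
    """
    rng = random.Random(seed)  # fixed seed so the chosen shots are reproducible
    shots = rng.sample(examples, k) if k > 0 else []
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The target question comes last, with its answer left blank for the model.
    parts.append(f"Question: {target_question}\nAnswer:")
    return "\n\n".join(parts)


# Toy pool of solved examples (made up for this sketch).
pool = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8"), ("7 - 4 = ?", "3")]
prompt = build_k_shot_prompt(pool, "6 + 1 = ?", k=2)
print(prompt)
```

With `k=0` the function returns only the target question, which corresponds to the zero-shot setting.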
Theresa Barton [00:02:28]: We decided to keep the benchmarks that gave higher scores to larger models and obeyed the Chinchilla scaling law. We prioritized keeping benchmarks that ranked the models monotonically from smallest to largest. So here's an example of a two-shot type of problem from the BIG-bench evaluation dataset. So when you do k-shot reasoning, this activates in-context learning, which is an important capability of large language models. You essentially get several examples and their answers along with the target question. And this actually causes some problems, but we'll see that later on. So we divided the resulting benchmarks into four classes and threw away the worst class. So this is the best class of benchmarks at this model scale.
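The monotonic-ranking criterion described above can be sketched as a simple check over scores ordered by training compute. This is an illustrative reading of the criterion, not the team's actual code: the function name, the optional `tolerance` parameter, the intermediate compute values, and all scores below are made up.

```python
def is_monotonic_in_scale(scores_by_flops, tolerance=0.0):
    """Return True if the score never drops by more than `tolerance`
    when moving from a smaller model to the next larger one.

    scores_by_flops: dict mapping training compute (FLOPs) to a benchmark score.
    """
    # Sort by compute so the scores run from the smallest to the largest model.
    ordered = [score for _, score in sorted(scores_by_flops.items())]
    return all(b >= a - tolerance for a, b in zip(ordered, ordered[1:]))


# Hypothetical scores for five models between 1e21 and 5e23 FLOPs.
well_behaved = {1e21: 0.31, 5e21: 0.38, 2e22: 0.45, 1e23: 0.52, 5e23: 0.60}
poorly_behaved = {1e21: 0.30, 5e21: 0.28, 2e22: 0.27, 1e23: 0.25, 5e23: 0.24}

print(is_monotonic_in_scale(well_behaved))
print(is_monotonic_in_scale(poorly_behaved))
```

A small nonzero `tolerance` would let a benchmark pass despite tiny dips between adjacent model sizes; whether that is desirable is a judgment call the talk does not address.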
Theresa Barton [00:03:25]: This includes many popular benchmarks like HellaSwag and Winogrande, as well as LAMBADA. There was not much in common between these well-performing benchmarks, other than the fact that they were generally large and gathered from the wild and not synthetically generated. So the way to interpret this plot is the different colors indicate how many shots there are for the benchmark. The x axis is the scale and the y axis is the score. And as you can see, in many cases, the zero-shot, the red line, is doing worse. And then the more shots you add, the better the models generally do. Except for the case of LAMBADA, which is kind of a mystery to me. If anyone has an idea about why that would be the case, why zero-shot is better than multi-shot, it'd be fascinating for me to hear.
Theresa Barton [00:04:23]: And when I looked through the data, there wasn't much logic to what benchmarks scaled the best. For instance, here's a question from HellaSwag about ice fishing. Most people you ask would not know the answer to this question because it's actually derived from a specific YouTube video caption. But this is just the nature of gathering data from the Internet. Perhaps the best questions to ask your model are not ones that you yourself know. So the next category of benchmarks, which includes many mathematical benchmarks, are, at this scale, noise-level benchmarks. So the gray dotted line indicates random guessing.
Theresa Barton [00:05:08]: And for models between 1e21 and 5e23 FLOPs, these models did not do better than random guessing. And this includes the very popular MMLU benchmark, which is kind of the gold standard for frontier models. When people report this score for models that are small, they are essentially reporting a random noise score, which is interesting. So we decided to keep most of these benchmarks because we expect that they'll be useful for larger models. The more interesting category is poorly behaved benchmarks, which did not monotonically improve with model scale. Some of them actually got worse with larger models, which is kind of fascinating, but we looked into it and tried to make sense of it. So you can see here that for BIG-bench math QA, the larger the model, the worse the score. And if you're looking very carefully, you can see that some of these models are doing worse than random, which is kind of a mystery.
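One way to decide whether a multiple-choice score is distinguishable from random guessing is to compare it with the chance rate using a normal approximation to the binomial. This is a generic statistical sketch, not the method used in the talk; the function names, the 1.96 threshold, and the example numbers are assumptions.

```python
import math


def chance_z_score(accuracy, n_questions, n_choices):
    """z-score of an observed accuracy against uniform random guessing."""
    p0 = 1.0 / n_choices  # expected accuracy when guessing at random
    standard_error = math.sqrt(p0 * (1.0 - p0) / n_questions)
    return (accuracy - p0) / standard_error


def is_noise_level(accuracy, n_questions, n_choices, z_threshold=1.96):
    """True if the score cannot be told apart from random guessing
    at roughly the 95% level (two-sided normal approximation)."""
    return abs(chance_z_score(accuracy, n_questions, n_choices)) < z_threshold


# Hypothetical four-way multiple-choice benchmark with 1000 questions.
print(is_noise_level(0.26, n_questions=1000, n_choices=4))
print(is_noise_level(0.40, n_questions=1000, n_choices=4))
```

Because the check is two-sided, a score that falls far below chance is also flagged as not noise-level, which is one way to surface the worse-than-random cases mentioned above.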
Theresa Barton [00:06:14]: So why would a benchmark scale badly? We tried to look into this. One very reasonable assumption is that if the answers are one of A, B, C, or D, it's kind of like when you're taking the SAT and you see a lot of A's in a row as an answer: you start to change your result based upon what you've seen before. So this happens with few-shot learning, where if a benchmark has mostly A's as the answer, and the model sees in its sample of, like, three questions that A is the correct answer for each one, it will be biased towards that answer itself. That's part of in-context learning. So this is a kind of novel finding that we had. Another reason the benchmark may scale poorly is that the questions are really bad. So I don't want to pick on any benchmarks, but the social interaction QA benchmark had a lot of social interaction questions that were mysterious to me.
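The answer-letter imbalance described above can be measured directly from a benchmark's gold labels. The sketch below is illustrative only: the function names and the toy label list are assumptions, and it does not reproduce the team's analysis.

```python
from collections import Counter


def answer_label_distribution(gold_labels):
    """Fraction of questions whose correct answer is each letter."""
    counts = Counter(gold_labels)
    total = len(gold_labels)
    return {label: count / total for label, count in counts.items()}


def majority_label_share(gold_labels):
    """Share of the most common correct answer; 1 / n_choices means balanced."""
    return max(answer_label_distribution(gold_labels).values())


# Toy gold labels for a four-way multiple-choice benchmark (made up).
labels = ["A", "A", "B", "A", "C", "A", "A", "D", "A", "A"]
print(answer_label_distribution(labels))
print(majority_label_share(labels))
```

A majority share far above 1 / n_choices suggests that randomly sampled few-shot examples will often all share the same answer letter, which is the situation the speaker describes.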
Theresa Barton [00:07:24]: For instance, if someone dried up a paper, lit it on fire and blew it away, how did they feel afterward? And I think that you can't really look at the evaluation dataset and decide beforehand whether it's good or bad. However, if it's scaling poorly, then maybe it's a bad evaluation dataset. I think the main problem is that this benchmark was generated via programmatic heuristics. It's not gathered from the wild. Another reason, which I've recently realized, that a benchmark might scale badly is contamination. And I think a lot of people view contamination as training on the test set, as cheating. But there's actually another perspective on contamination, which is that you may have a leaked benchmark in the wild along with answers from an inferior model. So if the model is seeing training data with an exact leaked benchmark with bad answers, that's going to severely interfere with the correct answer.
Theresa Barton [00:08:32]: So contamination is a worse problem, I think, than people realize, because they always assume that contamination will mean my score is better, but actually your score could be worse because of contamination. The last category of benchmarks that we categorized were the ones that are well behaved for certain k-shot settings. I don't have much analysis as to why certain benchmarks would scale better with different k-shot settings, other than that these models are very small and they're very bad at following directions. It's possible that they need to see several examples ahead of time, or nothing. So we decided to only run these benchmarks at the correct k-shot settings and just run with it. So I think this was a kind of scientific study of the best evaluation metrics we have these days. I don't know what evaluations will look like in the future for large language models, even a couple of years in the future. I think things are getting very interactive, and I believe that long-context evaluations will become increasingly important.
Theresa Barton [00:10:04]: As you can see, here are two different models, DBRX Instruct and Llama 2 70B, going head to head, and all the evaluations we looked at were multiple-choice questions like SAT questions. So hopefully in the future, we're going towards evaluation metrics that are more like SAT writing, or that evaluate both long-context understanding and long-context generation. And those are kind of separate tasks. Long-context understanding is like the needle-in-the-haystack eval that we see on Twitter with, like, the green squares, and we're looking into that one as well, but there's evidence that that one is pretty biased.