Everything We've Been Taught About ML is Wrong
Emmanuel Ameisen is a Research Engineer at Anthropic. Previously, he worked as an ML Engineer at Stripe, where he built new models, improved existing ones, and led MLOps efforts on the Radar team. Prior to that, he led an AI education program where he oversaw over a hundred ML projects. He's also the author of Building Machine Learning Powered Applications.
Machine learning has long been guided by a set of well-established principles around model design and evaluation. But recent progress in language models is challenging many of the rules that have shaped ML practice for years.
Okay, next up, this is going to be awesome. We have Emmanuel from Anthropic with a very spicy title for his talk: it's "Everything We've Been Taught About ML Is Wrong." So let's see what he has to teach us about what's new in the world of ML. Hey, Emmanuel.

Hello. Thanks for the intro. I think everybody can see my slides, so I'll just get started.

You look good. Take it away.

Perfect, thank you. So, you know, the spicy title is just clickbait; the takes are much more nuanced. But I think there's a lot of truth to it. The inspiration for this talk is that I've been working in ML for almost 10 years now, and I've found that, now that we've moved into the world of LLMs, a lot of my fundamental intuitions are wrong.
So I'll tell you a bit about why I think that's the case. First: who am I, and why should you maybe listen to me? I'm Emmanuel. I currently work at Anthropic, although I just started very recently, as a research engineer helping to improve the LLMs here.

Before that, I was at Stripe for a long time, where I worked on building models, improving models, and a lot of the ML machinery around them, like automatically evaluating and deploying models. And before that, I worked in ML education, where I developed a lot of these teachings about ML that I would pass on to people transitioning into the field.

I mentored over a hundred data scientists who transitioned into industry roles, so I had a lot of strong opinions about what the good and bad ideas are when you're doing an ML project. And over the last couple of years, I've found that all of these strong opinions are outdated.
So what are we going to talk about? Here are some of the ML fundamentals that I used to teach.

The first one is: start with a simple model. This makes intuitive sense. If you're going to do any ML task, you should probably take the simplest possible approach, try it, and if it works, great, move on. If it doesn't, maybe it's a very hard task, but you should definitely not start with a super complicated model. That might be logistic regression if you're doing tabular classification, like the fraud prediction we did at Stripe; that's a great baseline. A bag of words if you're doing NLP, or maybe transfer learning or a ResNet if you're doing anything in computer vision.

What else should you do? You should probably use relevant training data. That seems like a truism, but to take the same example: if you want to predict fraud, you should take a dataset of credit card transactions, good and bad, and train on that. If you instead took a dataset of, I don't know, patient care information from a hospital, you wouldn't expect to end up with a good fraud classifier. Spam classifiers can't be trained on a review dataset.

Don't use synthetic data. This may have been a spicier take of mine, but I found that in practice it was very hard to get synthetic data, specifically model-generated data, working. In the recent history of ML it has not often gone well. People have regularly proposed using GANs to make ML models better, but in practice that always seems to be hard and finicky, and your model is usually very good at detecting that you're passing it fake data and overfitting onto some synthetic feature.

And finally, beware of automated metrics. It's easy to fool yourself into thinking you have a great model without looking at the data yourself. The most naive example is people reporting 99% accuracy on a problem where there are 0.1% positives and the rest negatives: you would get 99.9% accuracy by just always saying no.
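To make that failure mode concrete, here is a minimal sketch with scikit-learn; the data is synthetic, generated only to match the 0.1%-positive setup described above.

```python
# A "classifier" that always predicts the negative class on a dataset with
# roughly 0.1% positives looks great on accuracy and useless on recall.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)  # ~0.1% positives
y_pred = np.zeros_like(y_true)                      # always say "no"

print("accuracy:", accuracy_score(y_true, y_pred))  # ~0.999, looks impressive
print("recall:  ", recall_score(y_true, y_pred))    # 0.0, catches no fraud at all
```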
So looking at the data is always useful. And finally, the converse of "use relevant training data": use your model only on the task it was trained on. That makes sense. If you're training a fraud classifier, you're not going to use it to predict whether a review is positive or not.

Cool. So now I'm going to go through each of those and explain why I think they're categorically wrong now.

Start with a simple model. I was so convinced by this that I wrote a blog post about it. My recommendation was always: just try the simplest, stupidest thing you can think of. Logistic regression is usually great for anything that isn't NLP or computer vision, just because it's the fastest way to get an initial result, and the fastest way to decide whether it's worth investing more time and where that time is worth investing. But now, if I want to write a regular expression, I just use an LLM instead. If I want to do something as simple as wrapping every element of a list in parentheses, I use an LLM. If I want to reformat something, I use an LLM. So we've gone from "use a simple model to do a complex task" to "use an incredibly complex model to do a simple task."
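As one concrete illustration of how small that "simple task" can be, here is a minimal sketch; it assumes the Anthropic Python SDK and uses a placeholder model name, neither of which comes from the talk.

```python
# Deliberately over-powered: asking an LLM to wrap each element of a list in
# parentheses. Assumes the Anthropic Python SDK and an API key in the environment.
import anthropic

client = anthropic.Anthropic()
items = ["apples", "bananas", "cherries"]

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder model name
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": "Wrap each item in parentheses, one per line, and return "
                   "nothing else:\n" + "\n".join(items),
    }],
)
print(response.content[0].text)

# The "simple model" alternative is, of course: ["(" + item + ")" for item in items]
```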
So why has that happened? Well, I think the meaning of simplicity in this context has changed. Simplicity used to mean a model that's simple in terms of how hard it is to build, but the real goal is to increase iteration speed, so you should just take the shortest path to results. That used to be correlated with taking the simplest model; it isn't anymore.

Why is that? Let's take an example. What's the best way to estimate whether you can classify some text, which was something we would commonly do at Insight, my education job? Well, in 2015, you did logistic regression on a bag of words. It took you half a day and gave you a pretty good baseline, and you could tell whether the task was possible, whether the data was good, and so on. In 2019, largish pretrained models were starting to be released, so you'd fine-tune BERT. That would take a couple of hours and, again, give you a really good baseline: faster than logistic regression, with a slightly more complex model. Today, you just use an LLM and ask it to do the thing. If it does the thing, you're much better off, and it takes you two minutes.

As astute observers will note, these models are getting increasingly complex. Logistic regression might have a thousand parameters, depending on the dimension of your word2vec embeddings; BERT has about 110 million; LLMs commonly have over a hundred billion, although of course that depends.

So, start with a simple model? No. Start with a competent model is, I think, the new version of this. And as with all these hot takes, there are exceptions: if you're doing something that isn't NLP-related, or something you cannot easily ask an LLM to do, maybe this doesn't apply.
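For reference, the 2015-era baseline described above is only a few lines with scikit-learn; the toy reviews and labels below are invented for illustration.

```python
# The 2015 recipe: bag of words plus logistic regression as a text-classification
# baseline. Toy data stands in for whatever labeled text you actually have.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great food and friendly staff",
    "cold fries and rude service",
    "loved the ambiance, will return",
    "never coming back here",
]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative

baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["the staff was lovely"]))

# The 2019 version swaps this for a fine-tuned BERT; the current version is a
# single zero-shot prompt to an LLM, like the earlier sketch.
```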
And in those cases, of course, you might still want to start with a simple baseline.

Cool. So, use relevant data. If this were live, I would do a poll, a show of hands. If I asked you which dataset you should use if you wanted to, for example, generate realistic-looking Yelp reviews (and I should point out that there is a Yelp Open Dataset with a bunch of Yelp reviews in it, and also that this is a trick question), I'd let you think about it for a moment. The answer is: just use all of the internet, of course. Why would you only use the Yelp dataset? If you use the whole internet, you will do much better than if you only train on the Yelp data. Training on the Yelp data only gets you so far; the rest of the internet gives you so many more linguistic abilities and reasoning abilities that you'll perform much better on the task.

This is a chart from the original GPT-3 paper, for example, which shows that if you take a big model and prompt it, even zero-shot, it will do better on certain tasks than a fine-tuned state-of-the-art model. And certainly that large model was trained on all the data the state-of-the-art model was trained on; it has probably seen that training set. The only difference between them is that the model is much larger and has been trained on all the other data as well.

So it's no longer true that you should just train on relevant data. If you have a model that's large enough, you should just train on all the data, because it will make you better at all the things. Which, I would think, is a ridiculous proposition.
Or at least it would have been five years ago: if you had been training some spam classifier and told me, "well, I'm also going to train on sentiment analysis," I would have told you that was a terrible idea and that you probably weren't cut out for machine learning. But is this surprising? If you think about it, it's an extension of transfer learning.

If you've been doing machine learning for a while, transfer learning on ImageNet was the first similar example: you could train a computer vision model on your own data, or you could take one trained on a massive dataset, ImageNet, fine-tune it a little on your data, and get much better results.

LLM zero-shot or one-shot in-context learning is an extension of that. In 2015, you would train on your task's data. In 2019, you would train on a huge dataset and then fine-tune on your data. And now we can just train on an absolutely staggeringly large dataset, and maybe we don't need to fine-tune at all. I put a little asterisk here because I think fine-tuning LLMs still provides benefits, but these models certainly give you much better performance than your old model, without any fine-tuning. So, use relevant data? No. Just use all the data; siphon it all in.
Again, there are caveats: you want to use all the good data, not the bad data, and only data you actually have the right to use.

Cool. The next one is: use human-made training data. By default, this is what you do in every machine learning project. If you're doing fraud classification, which I was doing recently, you're using credit card transactions that humans actually made to train your model. If you're classifying reviews for sentiment, you're using real reviews that humans wrote.

Model-generated data just has a history of working poorly. GANs to generate fraud data get proposed to me every couple of years, and to my knowledge they have never worked in any practical application. SMOTE is a similar technique that I found very brittle in practice. Distillation is another approach that works sometimes, but is also challenging to get working in practice.

So model-generated data used to be a bad idea if you were on a time crunch. It could work, but it would require a lot of effort. And certainly you would use a different model, not the same model, because having a model generate its own training data would have been absurd in the old ML worldview. You'd be told there's no additional information being added to the system, so this is a snake eating its own tail: if the model you're training generates some data and then you train on the data it generated, you're just going to mode-collapse, and it won't be interesting.

Or is it? Anthropic famously released Constitutional AI, which is essentially a version of this: we take a trained model and ask it to read a constitution, then ask it to use that constitution to rate whether answers are good or not. We train a model on those ratings, then train the original model against it, and the original model gets better. That whole process is just: a trained model generates more data, we train on that data, and the model improves. Which, again, would have been absolutely wild to think of before, but is now a very reasonable thing that gives you a real improvement in how good the model is.
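Constitutional AI itself is a full training pipeline, but as a hedged sketch of what model-generated training data can look like in a more everyday setting, something like the following works; the Anthropic SDK call, prompt, and label scheme are illustrative assumptions, not the setup from the talk.

```python
# Have a large model label raw text, spot-check a sample of its labels, then
# train a small, cheap "student" model on them.
import anthropic
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

client = anthropic.Anthropic()

def llm_label(text: str) -> int:
    reply = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model name
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Is this support message about billing? Answer yes or no.\n\n" + text,
        }],
    )
    return 1 if "yes" in reply.content[0].text.lower() else 0

unlabeled = [
    "I was charged twice for my last order",
    "How do I reset my password?",
    "Please refund the duplicate payment",
    "The app crashes when I open settings",
]
labels = [llm_label(t) for t in unlabeled]  # in practice: thousands of rows, plus human spot checks

student = make_pipeline(TfidfVectorizer(), LogisticRegression())
student.fit(unlabeled, labels)  # the small model you actually deploy
```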
So maybe humans don't need to make the training data for much longer. Maybe we can use model-made training data for a lot of things in the future. Certainly not everything, but the fact that we can use it at all is a huge change.

All right. Well, surely if we're using models to generate data, we still need humans to look at the outputs and verify that they're actually good. We can't really trust the models to judge themselves; that would also be wild. The thing is, evaluating generative outputs is hard. If you're evaluating summarization or translation, having a human in the loop has been essentially the only thing that worked for years, because the common automated metrics, like ROUGE and BLEU and others, are often very pessimistic: they punish you aggressively for deviating from whatever the reference label is. If you're translating a sentence from one language to another, you have a gold translation your model is trying to match, but if your model, say, perfectly paraphrases that translation, many of these metrics will give it a score of zero, even though the translation is correct.

So the gold standard has been to ask humans to rate outputs. As I said, having a model rate its own outputs would seem absurd. You'd assume it would just say, "my output is great, I love my output." It has been trained to produce that output; according to it, it's the maximum-likelihood output it could produce.

It turns out, somehow, that this is not as insane as we thought. This is from the second paper in the sources below, the MACHIAVELLI paper, which is about ethical behavior in LLMs and is interesting in its own right. But it also reports a fascinating pattern: the researchers wrote a set of gold labels, then measured human crowd workers against GPT-4 (or, I think, a small set of heuristics on top of GPT-4). And GPT-4 turns out to be better than the human graders; it agrees more with the researchers' gold ratings than the human raters do. So it turns out that if you want to evaluate a model thoroughly, you can just use a model, and in some cases, not all, it will do a better job than humans. Which is also something that would have gotten you laughed out of any ML 101 course a few years ago.
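Here is a hedged sketch of what model-graded evaluation can look like in practice; the rubric, JSON format, and API call are illustrative assumptions rather than the setup from the paper.

```python
# Ask one model to grade another model's output against a rubric and return a
# machine-readable verdict. Assumes the Anthropic Python SDK.
import json
import anthropic

client = anthropic.Anthropic()

def judge(question: str, answer: str) -> dict:
    prompt = (
        "You are grading an assistant's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Rate the answer for accuracy and helpfulness from 1 to 5. Reply with "
        'JSON only, for example {"accuracy": 4, "helpfulness": 5, "notes": "..."}.'
    )
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.content[0].text)  # may need more robust parsing in practice

print(judge("When was the first moon landing?", "July 1969."))
```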
By the way, if you're not convinced by this and you say, "well, I'd still like to use human raters": there's another recent paper that estimated that about 33 to 46% of crowd workers, I believe on Mechanical Turk, used LLMs when completing a task. So it might be the case that even when you think you're getting human ratings, you're actually getting LLM ratings, which maybe also explains why the gap between human and LLM raters has been shrinking.

Cool. So, use humans to judge results? Nah, just use models to judge results. Use models to train your models, use models to evaluate your models: models all the way down.

The next one I think is also interesting. In the history of machine learning, you mostly had narrow models, models trained for a purpose.
You train a fraud classifier to classify fraud. You train a spam classifier to classify spam emails, and so on. They don't do anything else. To my knowledge, no one at Stripe ever tried to use our fraud classifiers to write poems. That's just not a thing these models do. They're not built to do it; they don't even have the capacity to try. The architecture doesn't allow it.

And yet people find new use cases for large language models that go beyond token prediction every day. Yes, you can use them to write emails, but you can also use them to read emails. You can use them to do math, to write code, to simulate code execution, to help you summarize documents, and so on. You can use them to generate training data or to evaluate model outputs. Their abilities are just much more general than their training objective would suggest.

So ML models now have overhangs, which is not something they used to have, meaning they have a set of capabilities that isn't immediately apparent. Those capabilities can be unexpected. There have been many surprises in the abilities of post-GPT-3 models, let's say, that weren't apparent while those models were being trained; it was when they were evaluated, by external or internal parties, that people found a bunch of interesting emergent abilities. They also have unexplored capabilities, which I think is slightly different. One, they can do things you might not expect; but two, even if you spend a while trying to tease out these capabilities, there's currently no way to exhaustively list everything such a model can do. So once you release it, there will potentially be unexplored capabilities that somebody might discover. Some of that might be great and fun and creative, and some of it, of course, might pose safety risks, the kind of risks that companies like Anthropic are concerned about.

And that, again, is not something you ever had before. If you open-sourced your fraud classifier, to stick with the example I've been using, people could use it to classify fraud, but there was no risk they would use it to, say, ask how to make a bomb. So the fact that these models have these overhangs is completely new, and I think a little bit mind-bending as well. It used to be that we used a tool only for its purpose; now we have this general tool, and we can teach it its purpose by prompting.

All right, so that was a whirlwind tour of everything that I think has changed quite a bit in the last three to four years.
Luckily for us, and for our sanity, some things have remained the same, and I'll end the talk with a few of those.

It is still the case that the most important thing you can do as somebody who uses machine learning, whether you're building large language models or using them in your product, is to focus on solving a useful problem, not playing with cool tools. I think the number one thing ML people end up doing that is maybe not the best use of their time is trying to build a fancy model, or getting excited about new architectures. A lot of the time, spending more effort thinking about how to solve the actual problem is more useful. And I think that's even more true in the LLM world, where you might get a lot more value from cleaning up the data you pass into your prompt as examples (better data has huge effects, as has always been the case), or from prompting better, as opposed to trying something much more complex.

Models aren't magic. Yet. And I say "yet" in a cheeky way; I don't think they ever actually will be, but some of their behavior can feel like magic to us. They still hallucinate, as the panelists were just discussing, they still make mistakes, and they still have biases. So any good and responsible ML practitioner builds around that and, in some cases, almost treats the model adversarially: I know this model will sometimes be wrong, so I need to build my product around that assumption. If you're doing anything that involves, for example, providing advice to a user, you might want to verify that the model is actually outputting something true before returning the response. That's just as true now as it was before.
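As a hedged sketch of building around that assumption, you can gate the model's answer behind a simple check before returning it; the grounding check below is just one illustrative guard, with `llm_call` standing in for whatever client function you actually use.

```python
# Assume the model will sometimes be wrong: draft an answer, run a crude check
# that it is grounded in the provided context, and fall back if the check fails.
def answer_with_guard(question: str, context: str, llm_call) -> str:
    draft = llm_call(
        "Answer the question using only the context below, and quote the "
        "sentence you relied on.\n\n"
        f"Context: {context}\n\nQuestion: {question}"
    )
    # Grounding check: did the draft actually quote a sentence from the context?
    quoted = any(
        sentence.strip() and sentence.strip() in draft
        for sentence in context.split(".")
    )
    return draft if quoted else "Sorry, I couldn't find a reliable answer to that."
```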
Engineering skills are always key. They're certainly key if you're building these models; at this point these models are basically built on supercomputers, and that's just a very hard engineering problem. But they're also key in deploying them. It used to be that you were deploying the model itself; now maybe you're deploying your prompting server. And in the panel we were just hearing about how much work goes into carefully crafting prompts, getting results, post-processing them, and detecting latency spikes and addressing them in various ways. Engineering skills have felt more and more important every year of my ML career, and I think that remains true.

MLOps is still as big a thing as it used to be. People have tried to call it LLMOps; I think we can just stay with the MLOps name, not to revive the flame war. That's a lot of the challenge of deploying these models, and it certainly hasn't changed with LLMs. If anything, handling breaking version changes in LLMs is something that's going to get thought about quite a bit; it's a new MLOps challenge. And monitoring: you always need to do it. Many ML products have failed silently, and only three to six months later did somebody realize the model was doing something completely wrong. That's just as true with LLMs.

Cool, that's about it. A whirlwind tour; I'm glad we got to cover it all in just a little over the allotted 20 minutes. Thanks for listening. You can reach out to me at @mlpowered on Twitter, or on LinkedIn if you have questions. I'm also at emmanuel@anthropic.com, so you can email me there. But yeah, that's most of it.
Awesome, that was amazing, Emmanuel. I thought that was really insightful, and so many things blew my mind in terms of how we thought about things in the past and how to think about them differently. Especially the one about models judging models; that blew my mind a little bit, not what I expected.

There's a little bit of a delay, so I want to give people an opportunity to ask questions if they have them. A couple are already in the chat. One from Marwa: she asked, how would you take a generalized model and let it evaluate itself on a specialized task?
Not sure if you want to tackle that.

Yeah, I think that's a good question. For what it's worth, I can actually go back to that slide, because that's essentially what they do there. They take a pretty general model, GPT-4, kind of the definition of a general model in my opinion. And the task they're running, if I remember correctly, is asking various models to be deceptive or harmful in various ways, and then asking crowd workers and this model labeler: here's the question, here's the answer; was this deceptive? Was this related to manipulation, betrayal, physical harm, and so on? And the model just outputs the answer.

So that's how they do it in that case, and I think that's generally how you can do it in any case. The in-context learning ability of these models means you can take a general model and prompt it: "I'm going to show you an example of somebody having a conversation, and you need to tell me whether the response is deceptive," or positive, or whatever your task is. Have that as your prompt. There are a lot of tricks to prompt engineering, but you could of course make your model even more specific by adding examples, saying things like: if the person said "I swear I'm not lying to you," that might be a little sketchy; if the person said "I want to build a bomb," that's violence-related; and so on. With the ability of these large models to learn in context, you can take a general model and have it do your specific task very well.
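For illustration, a hedged sketch of the kind of few-shot judge prompt described here; the category and examples are invented, not taken from the paper.

```python
# Turn a general model into a specialized evaluator purely by prompting:
# describe the rating task, show a few labeled examples, then append the new case.
FEW_SHOT_JUDGE_PROMPT = """You will be shown a response from a conversation.
Label it DECEPTIVE or NOT_DECEPTIVE. Answer with only the label.

Response: "I swear I'm not lying to you, just send the payment first."
Label: DECEPTIVE

Response: "I double-checked the invoice and attached the corrected copy."
Label: NOT_DECEPTIVE

Response: "{response}"
Label:"""

def build_judge_prompt(response_text: str) -> str:
    return FEW_SHOT_JUDGE_PROMPT.format(response=response_text)

# The resulting string is what you would send to a general-purpose model.
print(build_judge_prompt("Trust me, nobody will ever find out."))
```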
That's awesome. Marwa, I hope that gave you a little bit of clarity. Let's see, any other questions coming in? There's one from Ri. He asked, and this is a philosophical question, or at least it starts as one: how can we trust humans for specialized domains or languages that need expertise to even understand the context? How would QA even work?

Hmm, that's funny, it's a bit of a chicken-or-egg question. I think you can trust humans. Well, "can you trust humans" is a broader question, but I'll answer the more localized version. For many tasks where specific knowledge is required, let's say a medical task, or coding, the answer won't surprise you: a lot of labeling services let you select for people who have that expertise. So you might say, these samples are medical advice, I need doctors to look them over, or lawyers, or engineers, or whatever. In that sense, you can trust humans.

Although I will say, if you haven't spent much time labeling data yourself: I'd consider myself an expert in, let's say, engineering or ML, and I've labeled examples of LLM outputs just for my own edification, and it's extremely hard. You have to context-switch, read some random request and then some completion, and try to think carefully through bugs. Even if you're good at it, and I feel I'm pretty good at the domain, you'll make mistakes, especially if you're trying to label many examples and giving yourself, say, a minute or two per example. It's very hard, especially if the task is complex.

So that's my theory for why these models are outperforming humans in some cases: they may be slightly worse than the human experts, but they're so much more robust. They don't get tired, and they can get through as many examples as you want. I think that's likely what's happening: even an expert human rater is going to have a hard time rating many examples. It takes a lot of mental toil and it's actually quite challenging, whereas a model just cranks away.

Yeah, I think that makes a lot of sense.
Well, cool, I think those were the questions I'm seeing so far. What's the best way for people to ask you questions or check out your material in the future? I think you had something on the last slide.

Yeah: Twitter, if you like it, @mlpowered. LinkedIn works too, and, I should put it on the slides, but it's just my name, emmanuel@anthropic.com, if you have any questions you'd like to send by email. Any of those is completely fine.

Awesome. Thank you so much, Emmanuel. This was awesome. Have a great rest of the day.

Yeah, thanks so much. Of course.