Solving the Last Mile Problem of Foundation Models with Data-Centric AI
Alex Ratner is the Co-founder and CEO of Snorkel AI and an Assistant Professor of Computer Science at the University of Washington.
Prior to Snorkel AI and UW, he completed his Ph.D. in CS advised by Christopher Ré at Stanford, where he started and led the Snorkel open-source project, and where his research focused on applying data management and statistical learning techniques to emerging machine learning workflows such as creating and managing training data and applying this to real-world problems in medicine, knowledge base construction, and more. Previously, he earned his A.B. in Physics from Harvard University.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Today, large language or "foundation" models (FMs) represent one of the most powerful new ways to build AI models; however, they still struggle to achieve production-level accuracy out of the box on complex, high-value, and/or dynamic use cases, often "hallucinating" facts, propagating data biases, and misclassifying domain-specific edge cases. This "last mile" problem is always the hardest part of shipping real AI applications, especially in the enterprise, and while FMs provide powerful foundations, they do not "build the house."
Intro
For anybody who is wondering and wants to go deeper down the rabbit hole with Alex: we had an incredible conversation when he came on the podcast, about two months ago. Time is going a little slow for me these days, so that's probably wrong and might need a fact check, but it was a few months ago, and we talked all things MLOps and foundation models. I know you've got more for us right now, so I'm going to hand it over to you and give you the stage.
That's awesome. I'm super excited. Yeah, it was a really fun chat. The effective time scales are very interesting in the AI world these days because everything changes so fast.
And I guess one segue into what I'll chat about today: some things change very rapidly, other things don't. Can you see my screen now? Okay, great. I don't know how we're doing chat or Q&A here; I'm just seeing my presentation full screen. So Demetrios, if you want to jump in, or if you're doing live Q&A, feel free to make this interactive.
Totally, I will. I'll jump in if something comes up and there's a great question in the chat; otherwise I'll leave it to you and we can keep the questions for the last five minutes.
Awesome, sounds great. Just to do a time check: up until 12:10, right? Yeah, we started about five minutes late, so 12:15. Awesome, such generosity today, this is great. Cool, so I'll jump in. As a quick intro, I'm Alex, one of the co-founders and CEO at Snorkel AI.
I'm also an affiliate assistant professor at the University of Washington. At both places, and back when I was working on the Snorkel project at Stanford with the team, everything has really revolved around this idea of data-centric AI: the idea that a lot of the development that is most critical and highest leverage for building AI is around manipulating, curating, labeling, slicing, and sampling data, more broadly developing data, even more so than picking the right architecture or tuning the knobs in the right way. I'll talk at a high level about how that has been accelerated by the rise of foundation models.
And I'll share some broader thoughts as well. If we get into the weeds, I can also share some of the work out of my co-founder Chris's lab at Stanford and the Snorkel research team on ways of automating prompt engineering, in part via weak supervision.
So that's an in-the-weeds teaser if we get to it and anyone wants to remind me, or I remember. Otherwise, I'm going to go through a high-level tour today. Thank you all for taking the time to be here and watch, and hopefully ask some interesting and hard questions, either during or at the end, to make it interesting.
Solving the Last Mile Problem of Foundation Models with Data-Centric AI
Okay, quick outline. I'll start with, and Demetrios, I have to apologize here because I'm opening with some shots fired: I'm going to switch right away to saying "foundation model," which I get is a kind of aggressive move given the name of the conference, but it's critical to how we do things.
I like it.
Yeah, I've got to make things a little interesting, right? So I'll start with that, and I'll probably flip back and forth between FMs and LLMs throughout without realizing it, but I do want to start on that note and introduce why you'll hear me, in some instances, saying foundation model.
I'll share this high-level idea that a lot of the future is going to be around, let's call it, GPT-U rather than GPT-X. I'll then dive into how you make these more customized, domain-specific, and performant foundation models: what is the development, the data-centric development, that you do?
I'll use some of our work at Snorkel as a case study, although I'm not going to be overly focused on that; it will just be an illustration of the system building we're doing around these ideas. Then, if I get time, I can talk about some other interesting challenges in the foundation model space that are really orthogonal to our work at Snorkel, just picked up from some of the customers we work with, plus some general thoughts. I may or may not get to that, but that's the high-level outline, and I'll jump in now.
So maybe this is a little underwhelming as a shots-fired slide, with my hastily Googled image of a house and its foundations; Demetrios, you tell me, but I don't think it's the most aggressive-looking slide. Still, I do want to introduce why we say foundation model. My co-founder Chris and a bunch of our Stanford colleagues have anchored on this term with the center there, and there are really three reasons.
The first one is a really exciting one, which is that we've moved beyond language to, really, any data type. Any data type that has some underlying graph structure admits the exact same kinds of self-supervision or autoregressive methods, beyond just a sequence of tokens.
We're seeing that with image and multimodal data, and we're going to see it, or are already seeing it, with everything from databases to genomics. So this is a boom that we're going to continue seeing far beyond the classic language models. That's reason number one.
The second reason: I often hear "gen AI" or "generative AI" as a synonym for large language models, but really all AI application types, not just generative but also discriminative or predictive, which is where a lot of the classical and high-value workloads still live, are, in our view, built on top of foundation models. If we get time we can talk more about that, but that's point number two. And the third point, where most of my talk will dive in, is the idea behind the name: the basic, simple, but very critical metaphor that foundation models are foundations, and you still need to build the specific house or building for your specific setting and needs on top.
We'll talk about how you do that coming up. Also, there's a leaf blower outside, which happens more often during talks than I think would occur by random chance. If you have any problems with my audio, let me know; otherwise, we should be fine.
You're sounding all right. I didn't even notice it until you said it.
Awesome. Okay, I did, but it sounds like we're good. So, diving in, I'm going to start by opening the aperture and, I guess, firing some more shots, although this is mostly about open-source community love and predictions of where the practical realities of AI use cases and the market are going to take us. But I'll say it a little more declaratively just to make this talk interesting.
I'll start by saying that the entire field owes the GPT line of work a massive debt, and I'm a huge fan.
I hope GPT-X continues to increment, because it's been driving the field forward and it's a truly incredible innovation. But I do think that most usage in the enterprise, and even in individual and consumer settings, is going to be a lot more around, let's call it, GPT-U rather than GPT-X.
So why is that? Three high-level points to get us all thinking. Number one, we've seen a lot of evidence recently, both from first principles and from demonstrations that have been accelerating of late, around the lack of defensibility of closed-API models, and, very correlated with that, this boom of open-source foundation model innovation.
Number two, the more durable moat, again no real surprise if you think about it from first principles, is around private data and domain-specific knowledge, whether that lives in a person, an enterprise, or an organization.
We're seeing some examples of the power of actually leveraging that, and I think it's going to spread. And the third point, which will be the segue into the rest of the talk and a lot of what we do at Snorkel, is the idea that the last mile, all of the hard work to go from foundation to house for specific use cases in specific settings, is really where the vast majority of the effort goes, and where the vast majority of the value and differentiation is going to be captured. I'll talk about how that's done and what's exciting there.
So just to go through those first couple of points at a high level: there's been a proliferation of awesome projects, and apparently the sunglasses-logo thing has been taking off with these recent ones.
For folks who haven't been living and breathing foundation-model Twitter every five minutes like a healthy person: one recent, very interesting project, which came out right around when GPT-4 dropped and so didn't get noticed by everyone, was the Alpaca project out of Stanford.
At a high level, what they did is spend, I think, a couple hundred bucks on ChatGPT API calls, and use that, along with a self-instruct method from a team at UW, to train a 7-billion-parameter model, the open-source Meta LLaMA model, to, and the evaluation isn't fully finished, very similar levels of skills and outputs as ChatGPT. So basically, with a very cheap set of queries, they were able to come close in many ways to cloning ChatGPT. And then we've seen a bunch of follow-on work: Databricks cloned the cloning method with their Dolly model, and now there's Dolly v2, where they put in some of their own crowdsourced data. There's an interesting Koala project out of Berkeley where they show they could get the same or even better performance as Alpaca with just very careful curation of the datasets. The high-level point here is that it doesn't really seem like closed-API models are going to be that highly defensible if they expose any kind of sufficiently cheap API, and the open source continues to advance in terms of supporting these models, which shows no signs of slowing down. That's quite exciting.
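To make that recipe concrete, here is a minimal sketch of the Alpaca-style idea: query a cheap teacher API for responses to a set of instructions, then use the resulting pairs to fine-tune a small open model. This is an illustration only, not the actual Alpaca or Self-Instruct pipeline (which also generates new instructions, filters, and deduplicates); it assumes the OpenAI Python client and a hypothetical seed_instructions.txt file.

```python
# Sketch: distill (instruction, response) pairs from a teacher model via a cheap API.
# Assumes the v1-style OpenAI Python client and a hypothetical seed_instructions.txt.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def distill_instruction_data(seed_path: str, out_path: str,
                             model: str = "gpt-3.5-turbo") -> None:
    with open(seed_path) as f:
        instructions = [line.strip() for line in f if line.strip()]

    with open(out_path, "w") as out:
        for instruction in instructions:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": instruction}],
            )
            pair = {"instruction": instruction,
                    "output": resp.choices[0].message.content}
            out.write(json.dumps(pair) + "\n")


distill_instruction_data("seed_instructions.txt", "distilled_pairs.jsonl")
```

The resulting JSONL is the kind of dataset you would then feed to a supervised fine-tuning step on a small open model.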
So where do we actually have more durability? One first-principles answer is that it arises out of private data distributions and private knowledge. We saw one exciting example recently with BloombergGPT: they took a bunch of private financial data and were able to train their own model that performed better on that specific domain. So, first, I think there's going to be a flattening of the space in terms of general-purpose web data and general conversational or otherwise generic task-style foundation models.
Number two, we're going to see this family tree really start to bloom with domain-specific models, many of which will be leveraging private data and specialized knowledge to be better foundations for those specific areas; that gets at the GPT-U idea. And then the third thing, which I'll talk more about, is again the building of the house on top.
I in no way mean to pick on the Bloomberg paper, but it's interesting, and at the risk of coming off a little cynical, very few people these days actually open the papers on these projects. If you look, even at the blog post the BloombergGPT team released, you'll note what you see in many of these evaluations that are done thoroughly on proper held-out datasets: it does better in a relative sense, which is a very exciting achievement and proof of the power of domain-specific data. But if you look at the finance-specific tasks, accuracy is in the sixties. For most applications, that is nowhere near production-deployable accuracy.
So again, that gets at this idea: even with all this development, both closed and open source, even with this specialization and leveraging of domain-specific data and expertise, you're still only building the foundations, and you still have a lot of work to go, what in AI over decades has always been the hardest part, the last mile, as any data scientist or ML expert who has shipped real things to production knows.
So that's what I'll talk about next. I'll pause for a second: any questions, comments, or pushback on this high-level framing before I get into the last-mile, building-the-house part?
We're good so far.
Awesome. Okay, so let's get into it. What I'm going to share is our perspective that a lot of building the house really revolves around these high-level ideas of data-centric versus model-centric AI. I'll share that in terms of the high-level ideas, and there's a ton of innovation in this space and in the community, so that will be with my academic-community hat on. Then I'll also give an example of how we're supporting it in production settings with our platform at Snorkel, the company, which we call Snorkel Flow.
And here, partly because I'm stealing slides and partly because it's just a relevant setting, I'll talk about enterprises. This could be a commercial enterprise, a government agency, or an open-source organization: anything where there's a collection of non-public, non-generic data and real production use cases. So for the purposes of this talk, think of "enterprise" as any kind of organization or setting like that. One way I think it's helpful to frame this, at least collapsing it into two dimensions, is to think of one dimension where you're basically asking: how bespoke or unique is your data?
For the ML folks in the audience, this is just old transfer-learning intuition: if you train a giant model on web data, it's going to work better on data closer to what it trained on, versus very bespoke, different data inside a bank or a hospital or a government agency, for example.
Then think about the x-axis: what is the accuracy requirement, or accuracy-proxy metric, whatever you choose, before you can actually use this thing? We see a lot of very exciting use cases, especially generative ones, where no one even measures the accuracy or really knows how to.
If I'm trying to generate some marketing copy or some cool images for a copilot-style application, I don't necessarily even know how to measure what accuracy means, and I don't need it to be that good, because it's just a starting point; it's part of a human-in-the-loop process, so there's high tolerance for failure. Versus a lot of the AI applications that are actually shipped in production have to be 90, 95, 99% accurate before they can even be shipped.
So a lot of the really exciting momentum and demos we're seeing are in the lower-left quadrant, where you're testing on data very similar to what GPT-X or other foundation models were trained on, and you either have low accuracy requirements because there's a high degree of failure tolerance, say in a human-in-the-loop system, or you don't even have an established way of measuring an accuracy-style metric. And a lot of where the house building is both hardest and most valuable is in the upper-right quadrant, where you have non-standard data, in a bank, a government agency, a hospital system, most places in the world, most enterprises for sure, and where you actually have to hit high levels of accuracy before you can ship something.
I'll give an example of this with some of our internal data, using a variety of foundation models: some early results with GPT-4, and also smaller models like BERT and CLIP. There's more data here, and we'll be releasing some of the latest GPT-4 numbers from our experience soon, but at a very high level we're comparing the out-of-the-box performance, so zero-shot (and the few-shot numbers are pretty similar), of GPT-4 or one of these models, across a range of problems: information extraction for a large pharma company, an open-source case study we released on classifying legal clauses, and image and chat tasks as well.
And if you look at the gap, you have to fine-tune on tens or even hundreds of thousands of labeled data points to even start to approach production-level accuracy. I'll note here, just to be precise, that GPT-4 doesn't have a fine-tuning interface, so in these case studies the fine-tuning is actually of GPT-3; it's fine-tuning a less powerful model, but it gets a significant leap in quality. If you're curious for more data, you can look at the legal-clause case study, which we're updating soon; it includes, in between the 59 and the 83, all kinds of few-shot techniques and advanced prompting techniques. Nothing really approaches good old-fashioned fine-tuning on labeled data. The main point, before we get into methods, is that you still have to do a lot of work to build the house on top of the foundations.
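For a sense of what that "good old-fashioned fine-tuning" step looks like in practice, here is a minimal sketch using Hugging Face transformers. The dataset (LexGLUE's LEDGAR legal-clause config) and the small DistilBERT backbone are stand-ins chosen for illustration, not the exact setup behind the numbers above.

```python
# Sketch: fine-tune a small pretrained model on labeled legal-clause data.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("lex_glue", "ledgar")  # ~100-way contract-clause classification
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)


dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=100)

args = TrainingArguments(output_dir="clause-clf",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

trainer = Trainer(model=model, args=args, train_dataset=dataset["train"])
trainer.train()
print(trainer.evaluate(eval_dataset=dataset["validation"]))
```

The point is not this particular recipe but the shape of the work: production accuracy still tends to come from labeled, task-specific data rather than from prompting alone.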
There's also plenty of work out in the public domain making the same point; I'm highlighting one of my favorites here because, honestly, I just grabbed one. There's a lot of this work, and there will be a wave coming out once folks have had enough time to evaluate GPT-4 and other models. This was an evaluation of ChatGPT, where the conclusion was that ChatGPT was, again, an incredibly impressive generalist, hence the bet that models like this are going to be the foundations for all AI development; that's certainly the bet we're taking. But it was on average 25% worse than a specialized model that was specifically trained for the given task, across 25 NLP tasks. And by the way, that specialist model was often a minuscule fraction of the size, and therefore cost, of ChatGPT. So we're seeing the same results out in the open source, just a little behind, because these large benchmarks take time to run.
So how do we build that house? How do we get from that baseline, out-of-the-box performance, which for many real AI use cases in that upper-right quadrant is nowhere near good enough to ship to production?
How do we go from that generalist, jack-of-all-trades foundation to a specialist, an expert, that's deployable in production? I already previewed the perspective here, but at a very high level, what we've been working on out of Stanford, UW, and Snorkel, the company, over the last eight years or so is exploring this idea of what we call data-centric development.
This is the idea of flipping the classical workflow on its head. In the way ML and AI are still often taught in most intro classes, the data comes from somewhere else: it's janitorial work, exogenous to your process as a data scientist, not your job.
I also think of this as the Kaggle era of machine learning, where your machine learning journey starts when you download your dataset from Kaggle, all nicely labeled, curated, and collected, and then you start tweaking your model architectures. Data-centric development, in its extreme, inverts that: the model is now fairly standardized, may not even change, or may be automatically configured in your process, and most of your data science workflow, your journey, is really about iterating on the data: labeling it, sampling it, curating it, augmenting it, and so on. It's not always that extreme, but one thing I'll point out is that the wave of foundation models has made it much more extreme than it ever has been.
Think about it from the perspective of a practitioner: if you find out that your foundation-model-based application is messing up on some patient population, or some subset of the satellite images you're analyzing, or some subset of legal documents, more than ever before you can't just go and tweak the model architecture.
You can't go and tune, by hand, some of the trillion parameters. You effectively have to go to the data, whether that's labeling, prompting, et cetera; it's all these data-centric interfaces. So in our view, the rise of foundation models has accelerated, and in some ways completed, this shift from model-centric to data-centric development.
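As a tiny, hypothetical illustration of what "going to the data" looks like, here is a sketch of slice-level error analysis with pandas. The column names and numbers are made up; the pattern, measure where the model fails and then label or curate data for that slice, is the data-centric move being described.

```python
# Sketch: find the data slices where a foundation-model-based app is failing.
import pandas as pd

df = pd.DataFrame({
    "doc_type":  ["loan", "loan", "lease", "lease", "lease"],
    "label":     [1, 0, 1, 1, 0],
    "predicted": [1, 0, 0, 0, 0],
})

df["correct"] = df["label"] == df["predicted"]
slice_report = df.groupby("doc_type")["correct"].agg(["mean", "count"])
print(slice_report)  # e.g. lease documents at ~33% accuracy -> go label/curate lease data
```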
So that's a high-level thought I'll leave you with. One of our favorite examples here is the GPT-X family itself. I'll note that the advancement that got at least a large chunk of the world going crazy over these advances was really the delta between GPT-3 and GPT-3.5, and that delta was all about human supervision.
A lot of folks are familiar with the term RLHF, but I think it's helpful, and you see this in recent work, to separate the input from humans from the mechanism by which the model is updated, which is the RL part. The input was just labels: in this case, labels in the form of rankings or orderings of model responses. Then there was further labeling in terms of the thumbs up, thumbs down, and now there's even further labeling and response generation being paid for yet again.
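To make the "rankings as labels" point concrete, here is a small illustrative sketch (not OpenAI's code) of how a preference ranking is commonly turned into a training signal for a reward model: a pairwise, Bradley-Terry-style loss that pushes the preferred response's score above the rejected one's.

```python
# Sketch: pairwise ranking loss for a reward model trained on human preference labels.
import torch
import torch.nn.functional as F


def reward_ranking_loss(score_chosen: torch.Tensor,
                        score_rejected: torch.Tensor) -> torch.Tensor:
    # score_* are scalar reward-model outputs for the two responses, shape (batch,).
    return -F.logsigmoid(score_chosen - score_rejected).mean()


# Toy check: the loss is small when preferred responses already score higher.
chosen = torch.tensor([2.0, 1.5])
rejected = torch.tensor([0.5, 1.0])
print(reward_ranking_loss(chosen, rejected))
```

The mechanism that then updates the language model (the RL part) is separate; the human input itself is just these ranking labels.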
So the delta was not really about the model architecture; the delta was all about the data and the supervision. That's one really great example of data-centric development. Our idea, the central concept behind both Snorkel the academic work and the data-centric concept, is that there is a critical layer between these base foundation models and your applications. And here I'll be orthogonal to the point I raised earlier: it doesn't matter whether you start with closed APIs like OpenAI's (always a fun sentence to say) or open-source models, which, as I pitched, I think are going to become even more prevalent. Whatever you start with, you have this layer of your stack where you have to do development to fine-tune or adapt the models for your applications, and that development is done primarily via data.
And this is where a lot of the challenge comes in, because manual annotation is extremely difficult. A lot of it is difficult because it's just costly and slow, and for most non-trivial, complex settings, and certainly most enterprises, you can't just outsource it. It takes huge amounts of in-house effort, often from very highly paid and very busy subject matter experts: a clinician, a lawyer, a network technician, an underwriter, et cetera. And it's also very brittle, because every time something changes, you don't really have any good way of modifying manual labels. So here's where I'll segue into Snorkel Flow, which is our system for developing foundation models using data-centric AI.
And again, I'll just quickly pause on the high-level material I covered so far. Any questions, any comments? Otherwise I'll keep going.
Yeah, there are a few questions that came through. One is: when does GPT-U need more than 32,000 tokens of context?
Great question. I think that's mostly orthogonal. The extension of the context window length, a lot of that work has been pushed forward by some great work by Tri Dao and others in my co-founder Chris's lab; just advertising, if anyone is hiring for academic positions, I believe Tri is on the market, and he's amazing. His work on FlashAttention has been behind a lot of these advances in context window length. But I see a lot of that as somewhat orthogonal. It's extremely exciting, and there's a ton of possibilities that get opened up when you open up the context window.
For instance, I think there are ways to basically unify fine-tuning and prompting by putting all of the labeled data into the context window, and you can obviously handle larger contexts, more complex documents, et cetera. But you can still get by with a shorter context window.
It just puts greater emphasis on fine-tuning and on various kinds of chunking and compression schemes. So I see those as orthogonal to a lot of what I'm talking about today, although very exciting.
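As one hypothetical example of a "chunking scheme" (not a Snorkel or OpenAI recipe), a long document can be split into overlapping token windows, processed per chunk, and the results aggregated afterwards:

```python
# Sketch: split a long document into overlapping token windows for a short-context model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64):
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = [ids[i:i + max_tokens] for i in range(0, len(ids), step)]
    return [tokenizer.decode(chunk) for chunk in chunks]


# Each chunk can be classified or extracted from separately; document-level results
# are then merged (e.g. by majority vote or score averaging).
```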
There's another one that came through that I wanted to read; it's from Pradeep: "We can use the expensive large language models with last-mile prompt engineering until we get enough training data to train or fine-tune GPT-U." Maybe that wasn't so much a question; now that I read it again, I think it was a statement, but maybe you have something you want to say about it.
I mean, it really just depends on your use cases, right? I think one of the key things in characterizing a use case, and there are various ways you could talk about this, is the failure tolerance of the use case: how accurate do you need to be to go to production? Some of these things are on a sliding scale, meaning you get better results and better ROI as you improve the accuracy. In that case, yes, starting with a zero-shot or prompting-based technique that gets you to that 60% level and then gradually tuning and developing it with data-centric AI to go higher is a very viable strategy.
In other settings, 60%, or, as I mentioned before, not even knowing how to measure the accuracy for some of these generative use cases, is good enough. It's just a copilot use case, meant to assist a creative process, and there's always going to be a human editing the final result.
And then in other settings, and as you could guess, that's a lot of where we operate, getting into the sixties just isn't good enough for anything. You can't ship that model, so you really are blocked until you can get that data development done to reach a production level of accuracy.
So I guess I'd say I agree with the statement in certain use-case settings. In others, you're blocked until you can do that fine-tuning or downstream development, and in still others you're fine, because it either works really well out of the box, since it's a very generic or standard data task, or you don't care about the accuracy as much because of the use-case setting. That goes back to the quad chart I shared, or at least that's one way of thinking about it.
Awesome. So I'll push on, because I'm almost at time, so I'll just give a little preview; I definitely won't get to the last section, but I'll give a view of the very high-level loop that we support in Snorkel Flow.
I'll note that over the last many years we've anchored the description of Snorkel Flow on developing training data for training models from scratch. Actually, a lot of our workloads have been using it to fine-tune, let's call them, medium-sized foundation models like BERT for many years now.
Obviously a lot of the world is moving towards, and we've been heavily invested for the last year or two in, building on top of foundation models. So the basic workflow in Snorkel Flow today starts with some kind of base foundation model.
It could be closed source or open source; despite the bets I threw out at the start of the presentation, we're completely orthogonal to that, and you bring whatever you want to start with. The process then starts by defining the specific task you want to accomplish.
Let's say, and a lot of our customers are still focused on predictive tasks, where a lot of the value lies, I want to classify these contracts with very high accuracy. So you take your base foundation model, which was trained on, say, web data, and you apply it, or Snorkel Flow automatically applies it, to your data and your task, so you can see how it does to start.
And that's when the guided error analysis starts. So the first step is: apply your foundation model to the data. If you think about it, how else could you actually inspect how your foundation model is doing? You can't go and poke around in the model weights, at least not practically today. You have to apply it to data to even see how it's doing.
So that's that first step. Then you start this data-centric loop, which begins with discovering error modes in your base foundation model via guided error analysis. I'll skip over the details there, but that's a lot of the work we've done on both the academic and commercial sides. Then your goal is to correct those errors, and to do that as rapidly as possible.
This is where, if you've heard me give a talk before or seen other Snorkel materials, I'll refer you to those. If not, a lot of the acceleration we get, and a lot of what our academic work has been about, is these radically more efficient, programmatic ways of doing this correction, this corrective labeling. And as I teased at the beginning, I'm basically at time, so I won't go in depth, but I'll just note that this idea of programmatic labeling, or, on the academic side, what we've often called weak supervision, is a powerful way to unify all different types of input.
It could be a heuristic, a knowledge base, clusters in embedding space, or manual labels, and it could also be a prompt. There's some nice work out of the Stanford lab, and also the Snorkel research team, on how prompts can actually be viewed as programmatic sources of supervision or labeling and automatically combined and modeled.
So you don't need to find one perfect prompt; you just dump them all in, along with any other sources of information, and it all gets combined by Snorkel Flow. I won't have time to go into the how, but there's a lot out there in the academic literature.
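For a flavor of what programmatic labeling looks like with the open-source snorkel library (a simplified sketch with made-up heuristics and data, not a Snorkel Flow workflow): labeling functions emit noisy votes, a label model combines them into probabilistic training labels, and a prompt's output could be wrapped as just another labeling function in the same way.

```python
# Sketch: combine noisy labeling heuristics with snorkel's LabelModel.
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_INSURANCE, INSURANCE = -1, 0, 1


@labeling_function()
def lf_keyword(x):
    # Simple heuristic: the word "insurance" suggests an insurance clause.
    return INSURANCE if "insurance" in x.text.lower() else ABSTAIN


@labeling_function()
def lf_policy_number(x):
    # A prompt calling a foundation model could be wrapped the same way.
    return INSURANCE if "policy no." in x.text.lower() else NOT_INSURANCE


df_train = pd.DataFrame({"text": [
    "The insurance policy no. 123 covers the premises.",
    "This agreement terminates on December 31.",
]})

applier = PandasLFApplier([lf_keyword, lf_policy_number])
L_train = applier.apply(df_train)           # (n_examples, n_labeling_functions) votes

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=42)
probs = label_model.predict_proba(L_train)  # probabilistic labels for fine-tuning
```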
And then the last step, and I'll end here. There are basically two paths, or three as I have on the slide. You can always take the data you've labeled and just export it, but the two main paths, and I think this is broader than just Snorkel Flow, are: either you go back and update your foundation model with this corrected and augmented labeled data, and the standard way to get to production accuracies is still fine-tuning, or, and we see this increasingly with our customers, you distill it into a smaller model that is specialized for the task.
I'll quickly note there's a case study we have up on the open-source LEDGAR dataset, a contract classification problem with a hundred classes. The net result: you start with an ensemble of foundation models, as we did in this case, you do this data-centric development, and you get not only a 41-accuracy-point boost above the baseline (this was against the GPT-3 baseline), but we then distilled it into a smaller model that was 1,400 times smaller.
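For readers curious what the distillation step can look like, here is a minimal sketch (an assumption-laden illustration, not the case-study code): a large fine-tuned "teacher" produces soft label distributions, and a much smaller "student" is trained to match them.

```python
# Sketch: knowledge-distillation loss, matching a small student to a large teacher.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student distributions, scaled by T^2.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2


# In practice this is usually mixed with ordinary cross-entropy on the labeled data:
# loss = alpha * distillation_loss(s, t) + (1 - alpha) * F.cross_entropy(s, labels)
```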
So this is a lot of where we think things are heading: starting with foundation models as your foundations, building the house on top via data-centric development, and, here's where I lose the house-building metaphor, also then distilling into smaller models for production.
So let me cut it there; I know we're at time, and I don't know if there's time for questions, but at least we already covered a couple. Thank you all for the time today.
Dude, amazing. On questions: one thing I would say is, for everybody looking to continue chatting with Alex, jump into Slack.
Alex, I think you're in Slack; if not, I'm going to send you the invite right now. Go to the conference community channel, tag Alex in there, and ask him about all of this stuff. I want to give a huge thank you to Snorkel while you're on the stream with me, because you all sponsored this event, and I am so grateful for that.
I also want to mention to anyone out there that Snorkel is having a conference, a virtual conference, too. We'll drop a link to that in the chat; it's coming up soon, I think. Or am I speaking out of turn, Alex? Was that a secret? Did I just blow it?
I'm not even sure; I don't think it's a secret. I didn't even know we were giving out socks and stuff and sponsoring, so let's assume it's not, at least for this group; otherwise, it's a super-secret exclusive announcement. Yeah, and thanks for bringing it up: we'll be running another version of our Future of Data-Centric AI conference.
That's it. Yeah, we've had some exciting speakers before. Last time we had a bunch of academic folks, and I think the incoming CISO of the CIA gave a talk: academic, federal, industry, all open, largely tilted towards academic-type material. This year it's all about this intersection of foundation models and data-centric development methods.
So if anything I said today piqued your interest, please consider showing up. And Demetrios, thank you so much for all the time today.
It's a pleasure, man. Thank you for joining us, and I'll drop all those links into the chat. Yay.