MLOps Community

DevTools for Language Models: Unlocking the Future of AI-Driven Applications

Posted Apr 11, 2023 | Views 3.2K
# LLM in Production
# Large Language Models
# DevTools
# AI-Driven Applications
Diego Oppenheimer
Co-founder @ Guardrails AI

Diego Oppenheimer is a serial entrepreneur, product developer, and investor with an extensive background in all things data. Currently, he is a Partner at Factory, a venture fund specialized in AI investments, as well as a co-founder at Guardrails AI. Previously he was an executive vice president at DataRobot, Founder and CEO at Algorithmia (acquired by DataRobot), and shipped some of Microsoft’s most used data analysis products, including Excel, PowerBI, and SQL Server.

Diego is active in AI/ML communities as a founding member and strategic advisor for the AI Infrastructure Alliance and MLops.Community, and works with leaders to define AI industry standards and best practices. Diego holds a Bachelor's degree in Information Systems and a Master's degree in Business Intelligence and Data Analytics from Carnegie Mellon University.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


In this talk, we explore the thriving ecosystem of tools and technologies emerging around large language models (LLMs) such as GPT-3. As the LLM landscape enters the "Holy $#@!" phase of exponential growth, a surge of developers is building remarkable product experiences on top of these models, giving rise to a rich collection of DevTools. We delve into the current state of LLM DevTools, their significance, and future prospects.

We also examine the challenges and opportunities involved in building intelligent features using LLMs, discussing the role of experimentation, prompting, knowledge retrieval, and vector databases. Moreover, we consider the next set of challenges faced by teams looking to scale their LLM features, such as data labeling, fine-tuning, monitoring, observability, and testing.

Drawing parallels with previous waves of machine learning DevTools, we predict the trajectory of this rapidly maturing market and the potential impact on the broader AI landscape. Join us in this exciting discussion to learn about the future of AI-driven applications and the tools that will enable their success.



Great. Hey everyone, thanks for being here today. Super excited to be chatting today. So I'm Diego Oppenheimer. I'm a managing partner at Factory. We're a venture fund that's specialized in AI investments. I also happen to help a couple of different AI-based startups with product.

And so I've been pretty deep in the ecosystem. One of the things that's really interesting is this tremendous popularity, this kind of Cambrian explosion of use cases and interpretations of how we can actually build AI applications. I think the quote-unquote ChatGPT revolution has become fascinating in terms of the inspiration it's given to developers, hackers, and the people who have been inspired to go build for what we can start thinking about as software 2.0, or 3.0 if you wanna call it that, in terms of how it gets powered by AI.

So in today's talk, I'm gonna do a little bit of a survey of the environment and the tools that are out there. I can't go in depth, there's so much going on, but I wanna talk at a high level about how people are building these applications and what kind of tooling they use. So let's get into it. Great.

Foundational Models vs LLMs

So let's start and get into it, because I think a lot of people still have a little bit of this question about foundational models versus large language models. The first thing to say is that these foundational models are trained on extremely large amounts of data.

If we think about foundational models as broad, general-purpose models that are really aimed at capturing a lot of capabilities and knowledge, the "large" part usually comes because they have a billion-plus parameters. And they're designed and architected so that they can be adapted and fine-tuned to specific data.

Examples like GPT-4, CLIP, and DALL-E fall into this. So think of foundational models as the generic category, or superset. Inside that, there's obviously language, there's image, and there are multimodal models. Large language models are designed specifically for understanding language.

You can see the BERTs and LLaMAs and GPTs in this category; they're trained on massive data. And you can see the tasks that exist in terms of classification, generation, summarization, and more. I'm gonna go back and forth between foundational models and large language models a little bit, but this can give you that framing of how to think about them.

So let's go a little bit into the history: how did we actually get here from a research perspective? Back in the day, it started with the big bang: very sad, no models, emptiness, nothing. The first iteration, or not the first, but the first big push in terms of machine learning, was in self-supervision.

We started seeing models that had 125 to 350 million parameters. Their capabilities were to replicate what they had seen and spit it out. And the datasets used to train these were really small web corpora or book dumps.

The trend ended up going to bigger models, right? This is when we started getting into the large models: how do we get to that hundred billion parameters? A lot of evidence showed that the bigger the model, the more capabilities came out of it. So we started seeing these 1 to 100 billion parameter models. Task-less text generation was really the first theme that came out of this. And these were really trained on all of the web. So now we're getting to these instruction-tuned, massive models that can actually follow tasks and instructions.

They use heavily curated datasets. They're getting ginormous, right? 10 to 200 billion parameters. They can generalize to tasks. They can also listen to feedback, in the sense that you can work with these models by instructing them to get to results and providing context.

And we're gonna talk about some of the tools that allow you to do that. The more important part, from a data perspective, is that these are heavily curated datasets and labeled data at web scale with human feedback, and the cost and the scale at which this is being done is quite impressive. For the newest models, there's a lot of work going into being able to do that.

So as the size and data quality increase, you get more generalization and more in-context behaviors, but at a much, much higher cost. And so we'll talk a little bit about what that looks like as well.

Phase One

Cool. One of the interesting things about early-stage development of platforms is that when the first platforms show up, people start with very basic wrappers around them. And there's actually a parallel to this in software and technology in general, right?

So when the first microprocessors came out, there were essentially wrappers around them: single-board computers. When the first operating systems came out, the first applications were really wrappers around utilities. If you think about Unix utilities, and Norton as one of the first applications, it was really OS capabilities with a wrapper around them. On the internet, we got these wrappers around Unix and network utilities. And what we're seeing today in the generative AI world is really these wrappers around not just LLMs but foundational models in general.

So we're in this first phase, which is pretty natural: thin wrappers around these foundational models, and there's an explosion around it. And the core thing that's happened is that the capabilities got to that holy shit moment, right, where it really feels like magic.

In my personal experience, having worked with these foundational models in a bunch of different contexts, the way I like to put it is that it feels like you're Neo in The Matrix, right? You're just accessing content. You're able to do things at a whole different speed and productivity level.

And so that holy shit moment of using ChatGPT for the first time has really driven a ton of Discord servers and hackers, market maps are being made, and an unbelievable number of people are building what are today the basic wrappers around these foundational models.

The Ecosystem

If history proves true, this is the start of a whole industry that will be building deeper and deeper applications as the models get better, as our tools get better, and as we understand how to build these applications in a better way.

The ecosystem is completely thriving. I cheated here a little bit and stole this from the folks over at FirstMark, so I wanna give them credit. But the data landscape and the AI and ML landscape is just growing. And what's happened, especially in that MLOps category, is that we're now seeing LLM-specific tools coming out: everything from the most basic convenience wrappers around OpenAI APIs, to complex tools for orchestration and multi-step reasoning, to very simple databases for prompt templates. And so what we're gonna see here is that as developers continue to experiment with LLMs, a thriving ecosystem is gonna crop up to support that.

These tools are designed to enable developers to iterate quickly and build amazing features on top of these LLMs. And so we're gonna dive a little bit into a couple of categories of the tools that are emerging, so you understand what we're seeing.

Building With LLMs

First of all, let's start with what it takes to build an LLM-based application, to give a little bit of that workflow so we generally understand it. It's fairly easy today: grab an API around an LLM, plug in your experimentation and prompt tooling, and potentially, if you need it (and I'll explain when you do and don't need it), a vector database and data integration, and you've got that V1 product.

And to be clear, most of the stuff that we're seeing today, most of this explosion, is really around these first four boxes: we're getting to that first version of the product, and we're still in that holy shit moment around it. As we continue experimenting with these LLMs, this thriving ecosystem keeps growing, and we're gonna see these steps play out. The first is experimentation and prompting. Then we go into knowledge retrieval and fine-tuning.

In the experimentation phase, developers are really tinkering with the prompts. These LLMs have a really interesting API, which is natural language, and there are some specific ways of working with that API that require certain tooling, or at least not require, but we want an abstraction in some tooling to make it a little bit easier.

And this might involve chaining prompts. Next is knowledge retrieval, which involves providing relevant context to the model so it can improve its accuracy. With these generalized models, if we can narrow things down and provide context to them, that allows you to improve accuracy, and they also behave better and run cheaper.

And finally, fine-tuning is really where we go into the second version, with what I call "snipe" datasets: highly curated, specific datasets that show examples to these models, so that you can fine-tune on them to improve your model accuracy and reduce your inference latency, which is really what matters when you're thinking about production use cases.

So let's jump in a little bit into what these places are. As I mentioned, version one is really where all the action is today. And obviously there's a ton of people working on much, much deeper stuff, so I don't wanna take that away.

But if you go look, and I was looking at this last night, one of my favorite newsletters (no relationship to them) is Ben's Bites, an aggregator of a bunch of links and new tooling, right? Hundreds and hundreds of applications a week are popping up, essentially building in this V1 world, which are really wrappers around these LLMs.

For all intents and purposes, that's the moment that blows our minds. Getting to the next step doesn't blow our minds as much, even though there's a lot of complexity in getting to that version two of applications. So let's start jumping into some of the actual tooling here.


So the first one is really what I'd call experimentation and prompting. To get the desired output from an LLM, developers often need to experiment with different prompts and chains of prompts, and this can be quite complex and time consuming. Fortunately, what we have right now is emerging tool sets like LangChain and LlamaIndex that help to jumpstart and manage this experimentation process. Because these APIs are natural language, mastering them requires that experimentation between single and chained prompts.

And so these tools have really come out to help us connect the data sources, right? Provide context, grab a bunch of data, and provide indices to be able to run through that data.

Coordinate chains of calls and provide other core abstractions, like agents, that allow you to build the applications of the future. So again, the core of this tool set is about being able to iterate very quickly through that experimentation and orchestration process and abstract away some of those things.
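As a rough illustration of what "chaining prompts" means, here is a minimal sketch in plain Python. It does not use LangChain or LlamaIndex; `fake_llm`, `run_chain`, and the two templates are hypothetical names invented for this example, and `fake_llm` is a stand-in for whatever model API you would actually call.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call, so the sketch runs offline."""
    return f"[model reply to: {prompt}]"


def run_chain(templates: List[str], user_input: str, llm: Callable[[str], str]) -> str:
    """Run prompts in sequence, feeding each model output into the next prompt."""
    current = user_input
    for template in templates:
        prompt = template.format(input=current)  # fill the {input} slot
        current = llm(prompt)
    return current


# Two-step chain: summarize first, then translate the summary.
chain = [
    "Summarize the following text in one sentence: {input}",
    "Translate this summary into French: {input}",
]
result = run_chain(chain, "LLM DevTools are growing quickly.", fake_llm)
print(result)
```

Orchestration libraries add a lot on top of this (data connectors, indices, agents), but the core loop of templating a prompt, calling the model, and passing the result forward is roughly this shape.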

And some of these tools have gotten extremely popular, if you just track them by their GitHub stars. That doesn't really tell you how much they're being implemented; it just shows interest. But it's really fascinating to see, and I think it's fair to say that they've inspired a whole revolution of hackers to go build applications off of these large language models.

So then let's talk about knowledge retrieval and vector databases. One of my favorite descriptions of a large language model is that it's the smartest goldfish you've ever met, right? What that means is that they don't really have memory at this point.

To be able to have a memory, or understand what you're doing over time, you have to provide context and kind of guide the model through what you want. And the best way of guiding it is to pass in relevant content, so it can frame things in a way that it understands.

And one of the ways of finding relevant context, and doing it in a cheap and efficient way, especially because of the way embeddings work, is using vector databases. Vector databases are very popular right now.

There's tons of funding going around them, but the real reason is that they're really efficient for vector similarity searches, which is your retrieval. They're really effective at storing billions of embeddings, which is what you wanna do: you convert documents and different contexts into embeddings to be able to feed them into these LLMs.

And they have really efficient indexing capabilities. So when you start thinking about retrieving similar documents, vector databases are this core component of giving context and then helping to improve output quality.

So if you're building an application and what you're trying to do is give, as I said, the smartest goldfish memory and understanding over time, these vector databases tend to be a core component.
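To make the retrieval step concrete, here is a small sketch of similarity search over embeddings, under heavy simplifying assumptions: the "embedding" is a toy word count over a fixed vocabulary (a real system would use a learned embedding model), the index is an in-memory dictionary rather than a vector database, and `embed`, `cosine`, `top_k`, `VOCAB`, and `DOCS` are all hypothetical names for this example.

```python
import math
from typing import Dict, List, Tuple


def embed(text: str, vocab: List[str]) -> List[float]:
    """Toy embedding: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[float, str]]:
    """Brute-force nearest neighbours; vector databases do this with efficient indexes."""
    scored = [(cosine(query_vec, vec), doc) for doc, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


VOCAB = ["refund", "shipping", "password", "reset", "order"]
DOCS = [
    "refund policy for a cancelled order",
    "how to reset your password",
    "shipping times for each order",
]
index = {doc: embed(doc, VOCAB) for doc in DOCS}

query = "i forgot my password and need a reset"
hits = top_k(embed(query, VOCAB), index, k=1)
context = hits[0][1]

# The retrieved document becomes context in the prompt sent to the model.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

The brute-force scan here is linear in the number of documents; the point of a vector database is to keep this lookup fast when the index holds millions or billions of embeddings.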

Version Two

And so learning how to use them and actually implement them becomes particularly important. So then let's go into what this version two is. We talked about building the product and the thin wrapper layer around the foundational model. But that's not all that you can do today. There are actually a lot of things you can do in terms of making these models more accurate and also faster and cheaper to run.

And that usually comes down to this fine-tuning step, and to do fine-tuning, you really need super high quality data. I think it's really interesting, and you'll probably see a couple of talks on this; my good friend Alex, I'm sure, will be double-clicking on it. But this is really playing out in terms of the data-centric AI movement that started a couple of years ago, where the core quality of the data and the manicuring of datasets is now proven to be one of the most efficient ways of reducing latency and increasing accuracy, through this fine-tuning process, for these LLM-based applications.

And so while they generalize really well, getting them really customized, or at least working in a better way, is about getting datasets that are very curated and specific to the task, and then having fine-tuning APIs to be able to do that.

In this process, there are a couple of different patterns. Fine-tuning is the one that allows you to improve performance when accuracy is critical.

Smaller Models

Another case is distilling these models into smaller versions that run in a cheaper way but don't lose that accuracy. You just don't need a lot of the other properties that these large models potentially have. So you can think about this distillation process as one where I get a smaller, still large, language model that has the same accuracy for those tasks, but is much cheaper and much faster to run.
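The talk does not say how the distillation is done. One common formulation trains the small "student" model to match the large "teacher" model's temperature-softened output distribution, using a KL-divergence loss. The sketch below shows only that loss on hand-written logits; the function names, the example logits, and the temperature value are assumptions for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) between the two softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over three classes for a single input.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # prefers a different class

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close, loss_far)
```

A training loop would minimize this loss (often combined with the ordinary task loss) over many inputs, so the student inherits the teacher's behavior on the tasks you care about without needing the teacher's size.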

So as these use cases come up where latency matters, where accuracy matters, and where cost, efficiency, and the unit economics of running these models are an important factor, you have to start thinking about these techniques to be able to boil those down.

So finally we get into monitoring, observability, and testing. The first thing is, let's be clear, we are still working with probabilistic workflows. And it's really interesting because, as we look at the general population of software developers and people who are interacting with these models, for a lot of them it's the first time that they're encountering probabilistic workflows.

Probabilistic Workflows

People who work in data science and machine learning are very comfortable with this. They understand that there are confidence scores and how these models produce results.

But the general population is used to, "when I do the exact same thing over and over again, I should get the exact same result." And that's just not necessarily true in a probabilistic world. On top of that, when we're trying to assess performance for large language models, we really have to assess quality via the user interactions, right?
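One way to see why the same prompt can produce different results is to look at how a next token might be sampled. The sketch below assumes temperature-based sampling over a made-up set of next-token scores; `NEXT_TOKEN_LOGITS` and `sample_token` are hypothetical, and real model APIs expose this differently.

```python
import math
import random
from typing import Dict


def sample_token(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick the next token: greedy at temperature 0, otherwise sample from a softmax."""
    tokens = list(logits)
    if temperature <= 0:
        return max(tokens, key=lambda t: logits[t])  # deterministic choice
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)
    weights = [math.exp(z - peak) for z in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for the token that follows some fixed prompt.
NEXT_TOKEN_LOGITS = {"Paris": 2.0, "London": 1.5, "Rome": 1.0}
rng = random.Random(0)  # seeded only so the demo is repeatable

greedy_outputs = {sample_token(NEXT_TOKEN_LOGITS, 0.0, rng) for _ in range(200)}
sampled_outputs = {sample_token(NEXT_TOKEN_LOGITS, 1.0, rng) for _ in range(200)}

print("temperature 0:", greedy_outputs)
print("temperature 1:", sampled_outputs)
```

With greedy decoding the identical input always yields the same token; with sampling, repeated calls spread across several tokens. That is the behavior that surprises developers who expect a function-like, deterministic API.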

We have to understand what the user's doing, what they're actually clicking or commenting on. And so how we measure performance is a really interesting question in the large language model world. What is good? How do we understand how good the generated content is?

We talk about this problem with hallucinations and overconfidence in the results. As we start thinking about monitoring, observability, and testing, we're gonna really have to think: okay, how do we observe the user interactions, what users do afterwards, and what their results are?

So we need to be able to feed that back in to assess performance. We're gonna really have to double down on building A/B testing around the full workflows for product analytics, because it's about the actual result and the chain: it's not just about one model and testing the model, it's the entire workflow, and A/B testing that entire workflow to understand if it's working or not.

As these models get larger and larger, there's also a giant swell around open source models. We're seeing a lot of these generalized models coming out. How do we compare them? Just the eye test of running the same prompt against a couple of different models isn't really gonna do the trick. Open source efforts like HELM over at Stanford are providing comparison frameworks on a task-by-task basis. So there might be room here for thinking about frameworks for full workflows and testing those against each other. And some of these tools are starting to pop up, things like HoneyHive, where they're giving you the ability to test and iterate through those workflows.
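As a sketch of what A/B testing a full workflow could look like, the example below compares two workflow variants by how often users accepted the final output, using a standard two-proportion z-test. The counts are invented for illustration, "accepted" is an assumed quality signal, and this is not how any particular tool mentioned in the talk works.

```python
import math


def two_proportion_z(accepted_a: int, total_a: int, accepted_b: int, total_b: int) -> float:
    """z-statistic for the difference in acceptance rate between variants B and A."""
    rate_a = accepted_a / total_a
    rate_b = accepted_b / total_b
    pooled = (accepted_a + accepted_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0
    return (rate_b - rate_a) / std_err


# Hypothetical logs: how often users kept the output of each end-to-end workflow.
# Variant A: single prompt. Variant B: retrieval plus a two-step prompt chain.
z = two_proportion_z(accepted_a=420, total_a=1000, accepted_b=480, total_b=1000)

print(f"z = {z:.2f}")
if abs(z) > 1.96:  # roughly the 5% significance threshold, two-sided
    print("The difference in acceptance rate is unlikely to be noise.")
else:
    print("Not enough evidence that the variants differ.")
```

The unit being compared is the whole workflow (prompts, retrieval, model choice), not a single model call, which matches the point that the chain as a whole is what produces the user-visible result.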


And finally, what's the performance impact? We need to be able to look at that, because performance directly impacts the UX. If we think about how fast or slow these large language models are, the very large ones might take a second and a half to produce a response. If we're doing hundreds and thousands of these inferences every day, not only is the cost really big, but the cost to the UX and the experience that the user has is really affected.

So how do we measure the full user experience of the application and understand, hey, when should we be using a cheaper, faster, lower latency model, because we're affecting the user experience in that way? And finally, probably one of the most interesting pieces of working with these LLMs is that they can generate plausible but incorrect information, which really poses a threat to what I call low affordability use cases, like medical diagnosis and financial decisions.

So think about what high affordability versus low affordability looks like. It's affordability to be wrong, right? That's the framing that I use to think about some of these use cases. If you think about your GitHub Copilot, if you think about the suggestions in your email, if you think about generating an image, these are all high affordability use cases for being wrong.

It's not gonna really affect anybody. Sure, you can ignore the suggestion. This is why we see this explosion of copilot-type workflows, because they're just enhancing. There's a human there verifying it. It's really about getting you there faster.

But once we start going into low affordability use cases, things like medical diagnosis and financial decisions, things that this technology should be applied to because it has the opportunity of truly revolutionizing those industries, that's where we have to start thinking about, okay, what do we need to do that?


So we need to start guardrailing these models. How do we ensure safety, accuracy, and reliability? Teams at companies like Microsoft are actually doing this. And it's really interesting, because it's really gonna be at the core of how we explore the future of using these LLMs.
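As a rough illustration of guardrailing, the sketch below checks a model's raw text output against a small schema and a couple of rules before anything downstream is allowed to use it. The field names, the allowed labels, and `validate_output` are hypothetical, and this is plain Python rather than the API of any particular guardrails library.

```python
import json
from typing import List

ALLOWED_LABELS = {"approve", "reject", "escalate"}  # hypothetical rule set


def validate_output(raw: str) -> List[str]:
    """Return a list of rule violations; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    errors = []
    # Schema and rule check for the label field.
    if not isinstance(data.get("label"), str):
        errors.append("label must be a string")
    elif data["label"] not in ALLOWED_LABELS:
        errors.append("label is not an allowed value")
    # Schema and range check for the confidence field.
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        errors.append("confidence must be a number")
    elif not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be between 0 and 1")
    return errors


print(validate_output('{"label": "approve", "confidence": 0.92}'))
print(validate_output("Sure! I think you should approve."))
print(validate_output('{"label": "maybe", "confidence": 1.7}'))
```

When validation fails, an application might re-prompt the model with the error list, fall back to a default, or route the case to a human; which of these is appropriate depends on how costly a wrong answer is.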

So tools that are starting to emerge that define rules, schemas, and heuristics for LLM outputs are gonna be crucial to building trust in those systems, so that we can start building for and trusting these, what I call, low affordability to be wrong scenarios. So finally, to wrap up here, I'm gonna leave you with a couple of predictions.

The first one is that there's a ton of developer tools, and usually iteration cycles are what define the winning developer experiences. If we think about when we originally started working with deep learning, there was a launch of libraries and frameworks like Caffe and TensorFlow.

And as we iterated through it, new ones emerged, like PyTorch, that gained popularity very quickly. So it's really interesting to see. And I think it's too early right now to say which of these frameworks that have become extremely popular today will be the winners.


The good thing is, I think we're gonna see the iteration cycles move really quickly, and these new libraries are enabling that iteration cycle. The second one, and again this is probably not too crazy to think, is this: if we think about traditional ML, something I spent a bunch of time in, what I'd call the previous generation of MLOps, every single step of building an ML application was hard, right?

I had to build a data warehouse and capture the data. I had to hire and train an ML expert and label that data. I had to go build a model, and I had to build an inference endpoint, and then I had a version one of the product. As for the level of difficulty, the way I compare it is: originally, before AWS and before cloud, just to build a web application you had to go stand up servers.

There was a whole lot of manual stuff to go do. Now this is out of the box, and I can just start, right? It's extremely easy to use an LLM API. It's literally interacting with it in English, or any language, right? Prompt design is fairly straightforward. Then you can get to complexity, but it's actually really easy.

And then, while obviously there's complexity around dealing with the orchestration and the data integration, we can get very quickly to these version one products. So I think we're gonna continue to see an even deeper explosion of these really interesting version one products as these large models become bigger and have more capabilities, and they're gonna continue to inspire this revolution around the future of software.

Final Prediction

The final prediction, and I stole this from Alex, so I'll give him credit: you know, I think we're gonna see this "GPT-You." What this means is the distillation of open source models that are gonna be highly contextualized to your organization.

So the ability to grab these models that are trained and fine-tuned on your data, on your organization's data, that are specific to your business, I think is really gonna explode. And what we're gonna see is LLMOps tooling that is going to help achieve that. The core principle here is that data is the most durable moat.

The last mile is always where the value is generated in these workflows. And so at the core of what we're gonna be seeing here is, okay, what is the next version of these tools that are really gonna help you tune and specialize these large language models specifically to your organization's data, to your personal data? So that's the concept of GPT-You. This has been super fun; hopefully it was useful. Happy to chat with anybody later. But thanks, Demetrios.

Dude. Killer. And since I ended up burning a lot of time at the beginning, we were supposed to do a whole Q&A session with you right now.

But if anyone wants to talk with Diego and keep the conversation going, or ask him any questions, feel free to jump into the Slack link that I added into the chat. That's the MLOps Community Slack. You can @ Diego Oppenheimer and ask him questions. We set up a conference channel that you can hit up.

And that's that. I have one question for you though, Diego. And it's not about LLMOps, if we're gonna call it that. It's more that you've been in the game for a while. I'm not gonna say you're an old hat, but you've been doing this for a while, right?

What's different about this time around? Do you feel like there's actually some legs here, or are we gonna find ourselves in another winter?

No, I'm so bullish about what's going on right now. I was actually thinking about this last night, and I was like, I can't imagine a more exciting time to be involved in ML than right now.

For some people it's almost dizzying, just keeping up. But the level of innovation that's happening, and the speed, I don't think we've ever seen it in the history of software. It's so exciting, and an entire generation of developers and hackers and software artists are being inspired to go build, and people are just building, and the speed at which they're building is awesome. I legitimately could not be more excited. I think we will look back on this as a crucial time in our existence as humanity.

Oh dude. Killer. I'm excited. I'm very excited. And just talking to you, listening to you, makes me even more excited.

Some people were commenting on your visuals, and I think it's worth noting that you made all of these visuals with AI.

Yeah. I tried to be almost fully AI-powered for the entire talk, so I generated the outline, created the speaker notes, and generated the visuals, all using four or five different foundational models from different sources.
