Treating Prompt Engineering More Like Code
Maxime Beauchemin is the founder and CEO of Preset. Original creator of Apache Superset. Max has worked at the leading edge of data and analytics his entire career, helping shape the discipline in influential roles at data-dependent companies like Yahoo!, Lyft, Airbnb, Facebook, and Ubisoft.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Promptimize is an innovative tool designed to scientifically evaluate the effectiveness of prompts. Discover the advantages of open-sourcing the tool and its relevance, drawing parallels with test suites in software engineering. Uncover the increasing interest in this domain and the necessity for transparent interactions with language models. Delve into the world of prompt optimization, deterministic evaluation, and the unique challenges in AI prompt engineering.
My name is Maxime Beauchemin. I'm French Canadian, but I go by Max — it's just easier. I'm CEO and founder now at a startup called Preset. We offer a managed service around Apache Superset. If I'm going to say something non-commercial about the stuff we're building: Apache Superset is a very, very strong open source contender in the business intelligence space.
I call it data consumption, data exploration, dashboarding. Superset is a very strong open source competitor to Tableau and Looker and all the business intelligence tools out there. So if you haven't checked out open source BI or Apache Superset in a while, you should take a look.
It's becoming really solid and really competitive — better than those commercial vendor tools in a lot of ways. At Preset we offer a freemium type service on top of it, so you can get started very quickly and try it. So I encourage people to take a look at that.
And then, the way I prefer my coffee: I do espresso with oat milk and a little honey. That's been my jam lately.
Welcome back to the MLOps Community Podcast. I am your host, Demetrios, flying solo. Today I'm talking with Maxime. You may know him because he came on the podcast about a year ago, and he is also the creator of a little tool that some of you use called Airflow.
That tool is very widely used — go and check out the last podcast that we did with him. He talked to us about why he did not create a whole company around Airflow, like Astronomer did with managed Airflow, basically. He went off and did his other project instead, which, as he mentioned in the intro, is around Apache Superset.
He created Preset and he's been killing it with that. But today he's here to talk to us about Promptimize, a little tool that he created for internal use. And then he said, you know what, this is actually really useful — maybe I should just open source it to the world. So we went through what exactly Promptimize is, why he decided to open source it, why he open sources basically every tool that he creates, and also how it can help.
And I absolutely loved how Promptimize can help you. If you are trying to figure out, in a very scientific way, whether your prompts are useful and how useful they are when you change prompts or turn different dials, Promptimize can help you view that, see that, and create logs.
He calls these prompt suites, and he draws a parallel with test suites in regular software engineering:
the same way you have all these test cases, you have prompt cases that you can line your code — or your different prompts — up against, and you can evaluate models. You really get your own evaluation and benchmark metrics for when you are deciding whether you want to use a new model, or whether, for a certain use case, one type of prompt is going to do better than another.
It's a space that is absolutely blowing up right now. I know there are a ton of other companies popping up in this field because there's pain around it. We don't have clear ways of interacting with the models that are, as Max was saying, super scientific and can give us clear feedback. So Max dives into it, and I think y'all are gonna enjoy it.
Let's get into it. Before we jump in, one ask from my side, just one: share this with one other friend if you liked it. We are here — like, subscribe, and leave a review, all that fun stuff. But really, share this with one other friend, because it would be great to get all of the cool stuff that we talked about today with Max out to the world.
All right, let's jump into this conversation with Max, the creator, CEO, and just all-around awesome guy at Preset.
So Maxime, great to have you back. The first time that you were here, you absolutely blew my mind with so much cool stuff. We talked about Airflow and Preset and why you went off and did Preset. I think that was the coolest thing, and it is forever etched in my brain. The TL;DR for people who didn't see that first episode —
and I highly recommend everyone goes and watches it — is that you kind of just said, yeah, I have more fun doing stuff with Preset, and I really think that's a better challenge that I want to sink my teeth into. I love that. It reminded me to work on things that you enjoy and do things that you like.
It doesn't necessarily mean you have to do things because of whatever the external circumstances are; really follow what you actually like. So I'm excited to talk to you today about your new little thing that you've been getting into and really liking. I really want to talk about this whole — it's Promptimize, right?
Am I saying that correctly? Yeah, I think so. I mean, really, I think we should probably go back a little bit to the beginning of how that came to be and what set the stage for this. The premise is: everyone is trying to bring AI into the products and the software that they're building today. Yeah. And then, how do we do this, right? We have this new beast, the LLMs, and you can ask them anything and they can answer anything.
There's a lot of power to that. Yeah. But how we channel this power, how we bring it into our products, is very, very challenging. Yeah, so true. And you mentioned before we hit record that the whole API, and the way that we interact with the API, is a whole new paradigm too, because you can kind of just say whatever you want.
You're speaking to the API in English, in a way, and then good luck getting out something that is very consistent each time. So there are a lot of big question marks that I think people are still exploring to this day. A hundred percent. Yeah. If you think about your normal REST API, the general APIs that we've had access to for building products so far, they're very structured, super well documented.
There's a really clear schema for the input, a clear schema for the output, and it's super deterministic, right? You ask the same question five times, you get the same answer five times. With these LLMs and the GPT-type interfaces, you're free to construct your prompt however you want. Now, there are some limitations as to how long the prompt can be —
how much context you can pass — and you really gotta use that smartly. That's one of the challenges there. But it's: provide whatever context you want, or whatever you feel is relevant, in the structure that you feel is going to be most intelligible to the AI, and then get the result that you asked for.
And then you can ask for some structure in the response. It's possible to say, return using the following format, or, in less than 200 words. So there's a lot you can do in the prompt to shape the output, but it's still extremely probabilistic and an open-field kind of problem.
You're like, where do I even start with this thing? And how do I measure whether my prompt — is this prompt better than the previous prompt I had? Anecdotally it might be, but it might not perform well on some specific sub-use cases. So the whole idea of Promptimize is to bring some of that test-driven development mentality, to be able to say, hey, if I switch my prompt a little bit, if I ask in a slightly different way, or if I use GPT-3.5 Turbo instead of GPT-4, how is it going to do on my use cases, on my test set?
Dude, you said something there that you kind of glossed over, but it is very important to highlight: the way we have been doing things when we're hitting APIs has been very well documented. But now, when you are basically given infinite possibilities for how you're going to interact with the API, how can you even think about documenting all the different edge cases?
It's so hard to think about. You get companies like OpenAI coming out with, hey, here are some prompt guidelines, here's how we think, here are the best ways we've seen to use it. But it's not like you can just say, all right, cool. I've noticed this when I'm playing with the tools: sometimes I think, is it because the model isn't there yet?
Or is it because my prompt isn't good enough? And it's probably an intricate combination of the two, right? And without a scientific method around your prompt, or a rigorous method for trying different versions of your prompt, you're not gonna be able to really know, right?
That's the whole point there: okay, how do I — so let's say, in my specific case, scratching my own itch, we were working on text-to-SQL, right? It's an interesting problem when you think about text-to-SQL. Let me try to explain what I mean by that.
The general idea: you have a large database with a bunch of models and tables inside it. And then you want to ask it a text question, a human-language question, as in, in which countries are my sales growing fastest this year, right? You want to ask all these generic questions, and you want the
AI in this case to generate SQL that answers the question with a result set. So this is a pretty open-ended problem, right? There's an infinite number of questions people could ask, and there are different database schemas. It's pretty open-ended, but it's nicely deterministic in the sense that, for a given question, you're able to say, how did the AI perform?
Did it write SQL that answers the question accurately, yes or no? I think that's the perfect kind of problem — well, not necessarily the perfect one, but a good kind of problem for, say, a product builder doing some prompt engineering, where you can assess: is it answering the question well or not?
Because it's black and white — yes, it's not necessarily perfectly black and white, but it is fairly deterministic. You can say it succeeded, or it mostly succeeded, or it completely failed, right? Did the SQL run? Did it have the answer in it? As opposed to, write a 200-word essay or a thousand-word essay on Napoleon's conquests
in Europe — there it's a lot harder to evaluate whether it did well or not, right? You need a human interpreter, an expert, to read that stuff. Whereas with the SQL, you can a little bit more say it did well or not — it's a little more black or white. Yeah, that's a good use case.
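To make that concrete, here is a minimal sketch, not Promptimize's actual API, of what a deterministic eval for a text-to-SQL prompt case could look like: run the generated SQL against a small fixture database and check the result. The fixture path and the expected rows are made up for illustration.

```python
import sqlite3

def eval_text_to_sql(generated_sql: str, expected_rows: list) -> float:
    """Score a generated SQL answer: 0.0 if it doesn't run, 1.0 if it runs
    and returns the expected result, 0.5 if it runs but the result differs."""
    conn = sqlite3.connect("fixtures/sales.db")  # hypothetical fixture database
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # the SQL didn't even run
    finally:
        conn.close()
    return 1.0 if rows == expected_rows else 0.5

# One "prompt case" pairs a natural-language question with an eval like this.
question = "In which countries are my sales growing fastest this year?"
# generated_sql would be whatever the LLM returned for `question`
```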
And then the first thing I was like, okay, for this kind of problem, I need to build a bank of test cases that are deterministic, right? The first thing I gotta do is find a test set of, say, a hundred, five hundred, a thousand different prompts over a varied set of databases that have different naming conventions, different
levels of complexity — some might be normalized in different ways. So I need to find this test set so I can actually evaluate whether the AI is doing well or not, and then whether my prompt is working, whether it is doing what it should be doing, so I can iterate on it. Oh, I love that.
Yeah, yeah. I see exactly what you're saying. So basically you were like, okay, if we can figure out all of these different edge cases, then it's going to be a much more robust system when I am passing it into the LLM, right? Right — so if I have a test set of, say, a thousand different prompts with a thousand corresponding queries that I believe are right, that answer the question well, now I can go and iterate on my prompt, right?
I can generate a prompt that's like, okay, given these table schemas, given this database structure, and given these constraints, can you answer the following question: what are my sales by department, year over year growth? And then I can iterate on my prompt.
I can run this test suite systematically and get a report that says, oh, this version of your prompt is 72% accurate. And I can point it at a different model — maybe a different GPT model, or the latest thing Facebook is coming out with, or Databricks, or one of these open source models.
And I can run the same test suite with the same prompt and say, which one performs better? Or even more interestingly, where does this one fail and this one succeed, and where is the opposite true, and try to understand why, and iterate on that prompt and the choice of model.
So the first thing that comes to my mind is, how does this interact with a vector database? Because vector databases have become the unsung hero of prompting and the LLM movement, right? Right. So what's interesting is that it works a little bit the same way that you think of testing libraries.
If you think of this solution that I created, that I call Promptimize, it's really just a place where you can express your test cases — I call them prompt cases — with an input prompt — I love that — and an eval function, right? So you'd say, okay, for this input question that the user may ask, here's the eval function.
You provide this testing library with your own prompt engineering toolkit, right? So you might use LangChain, if you're familiar with that — a comprehensive, super useful little Python library to interact with OpenAI and craft good, intricate prompts, right?
So you have these templates, you can do chained prompts, you can create agents — there's a lot you can do with LangChain. My little library, all it does is allow you to express your test cases a little bit like pytest, right? One of these very simple libraries where you just write your unit tests and then you're able to run your suites of tests.
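For a sense of what "prompt cases as code" might look like, here is a self-contained sketch in the spirit of what Max describes — not Promptimize's exact classes, just the shape of the idea: an input prompt paired with an eval callable, collected into a suite you can run. The `ask_model` argument stands in for whatever LLM client you use.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PromptCase:
    """One test: an input prompt plus a function that scores the model's answer."""
    user_input: str
    evaluate: Callable[[str], float]  # returns a score between 0.0 and 1.0

def contains_any(response: str, words: List[str]) -> float:
    """A trivial eval: full credit if the response mentions any expected word."""
    return 1.0 if any(w.lower() in response.lower() for w in words) else 0.0

# A tiny suite, pytest-style: just a list of cases defined as code.
suite = [
    PromptCase("Say hello to the user.", lambda r: contains_any(r, ["hi", "hello"])),
    PromptCase("What is 2 + 2?", lambda r: contains_any(r, ["4", "four"])),
]

def run_suite(ask_model: Callable[[str], str]) -> float:
    """Run every case through the model and report the average score."""
    scores = [case.evaluate(ask_model(case.user_input)) for case in suite]
    return sum(scores) / len(scores)
```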
So my library is not a prompt engineering library per se, as much as a prompt engineering testing toolkit and evaluation toolkit. Oh, I get it. Yeah, okay. Now it makes complete sense where you're saying, all right, this one is 75% useful, or, is this prompt any good?
But then when you point it at a different model, it holds up much better or much worse, et cetera, et cetera. And so you can effectively know these things in a much clearer way than if you're just evaluating by looking at the answers. That's really hard to scale: if you're saying, okay, now I want to do these 200, how can I evaluate all of them?
Well, if you set up a Promptimize test suite, then you can know that right away. Yeah, so it brings that mindset of — if you're familiar with hyperparameter tuning in ML, I'm certain you are, and the audience is — really often you want to train a whole multidimensional matrix of models and then compare how they're performing against one another, right?
So you can think of that same mindset: I'm gonna try different prompts, maybe against different models and with different parameters, right? You can do that easily with Promptimize, because it's all test cases defined as code. So if you wanted to say, I want to run this test suite against these five different models with these five different values for temperature, with some variation on the prompt itself — it might get a little expensive on the API cost.
If you've got a thousand tests and you do a big multivariate matrix of tests — but it enables you to do that and to produce these... That's what VC money's for. What was that, sorry? That's what VC money's for. That's it, exactly. But yeah, so you can produce these reports for each test, for each prompt suite that you run.
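The hyperparameter-tuning analogy could look something like this sketch: because the cases are code, you can loop the same suite over a grid of models and temperatures and collect one report per combination. The model names and the `run_suite` body are placeholders, not a real client or the library's API.

```python
import itertools

models = ["gpt-4", "gpt-3.5-turbo", "some-open-source-model"]  # illustrative names
temperatures = [0.0, 0.5, 1.0]

def run_suite(model: str, temperature: float) -> dict:
    """Placeholder: run every prompt case against one (model, temperature) pair
    and return summary stats. In practice this is where the API calls happen,
    so a big grid on a large suite is where the cost adds up."""
    # ... call the LLM for each case, score it with its eval function ...
    return {"model": model, "temperature": temperature, "success_rate": 0.0}

reports = [run_suite(m, t) for m, t in itertools.product(models, temperatures)]
best = max(reports, key=lambda r: r["success_rate"])
print(f"best combination: {best['model']} @ temperature {best['temperature']}")
```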
And you'd be able to say, okay, this combination of parameters is the one that performs best. Here's where maybe another model really succeeded and this one failed, right? Give me some examples — compare these two reports and tell me how, say, GPT-5, when it comes out,
performs against GPT-4, right? What is it better at, what is it worse at? What is the cost, right? We log some of the cost and performance parameters. Which one is faster, which one is more expensive, which one is more accurate — this sort of thing, right? So it allows you to bring a more deterministic, scientific method
to something that's super mushy, right? We talked about how these APIs have no structure. It's an open playground, you can do whatever the heck you want. But then you're like, okay, if I write IMPORTANT in caps in my prompt, does that really affect anything? Does it change anything?
What does it change? Yes — oh, I love it. And I completely see the vision of, hey, let's make this much more scientific so we can at least measure one against the other, because right now we're not quite sure what each word is affecting when it comes down to it. Is it actually helping my prompt?
How much is it helping the prompt, or is it hindering it? And with this, effectively, you can see that in a very clear way. You can have a full report of, for each test case, how it performed versus the other one. You can get the summary result. You can push your reports into a database and run SQL against it to get really intricate.
So it really brings rigor to testing your prompts, producing reports, and then allowing you to evolve and iterate on that. Now, you talked about vector databases, and I dodged the question a little bit — I'd love to get into that too. But the first thing is: should you use a vector database?
Yeah, probably — if your context window is limited for the kind of use cases you have. But how would you — let's say you're like, all right, I'm gonna do the work. I'm gonna set up my embeddings and I'm gonna hook up Pinecone or whatever it is, right?
And then push my documents and do some proximity search on these vectors. Then you do all this work and you're like, wait, does that perform better than the simple prompts I had before? And in what way, right? Where does it perform better than the other one? Well, Promptimize would allow you to compare how
your prompt that uses that vector database might be performing against your prompt that doesn't. And as you iterate on these embeddings and these document formats, you can measure whether you're doing better or not. Man, I love this, dude. And so it's worth saying, for everyone, that this is fully open sourced.
I guess you just have open source in your blood, huh? You can't release something without making it open source, can you? I mean, let's go over it: Airflow? Fully open source. Then you have Superset. Yeah, fully open source. And now Promptimize, fully open sourced. Yeah — let me think about this.
With Promptimize, I feel like what's more interesting is the idea; the scope of the project is much more limited, the same way you would think of a testing library. And I'm not saying those are unambitious — there's a lot of virtue to them — but the problem space of a little toolkit to run some tests and produce reports is much simpler than the Airflow problem space.
And quite frankly, as a founder now, that's probably the kind of thing I can actually afford: this bet is working on a small little problem. But yeah, the whole thing with open source is, what am I gonna do — sell this thing, or just keep it for myself? It's a lot more fun to build something relevant,
come on the show, talk about it, and see something come back — see some people using it, build a small community of people who get value out of it. I get a kick out of that. That's why I do it. Yeah, I fully understand. I'm a huge proponent of open source. I think that's how we drive the industry forward, because the masses creating things and hacking on things is always going to be better than just a few people in a room doing it and then coming out a few years later and saying, look what we have.
It could hit or miss, and sometimes it hits. But I really believe in the open source idea and having the community — obviously I'm a community guy, so I love when things are open sourced. But how do you feel, if we take the tangent, about the whole question of: should we open source the full power of AI?
Right. There's been a lot of recent debate on that. I forget what the stat was, but it's something like more than half of the ML practitioners, the specialists in the space, think there's a 20% chance or greater that AI will destroy humanity.
And then clearly, open source in an area like that — people compare it to nuclear technology and ask, should we open source nuclear technology? I don't know. I haven't really formed an opinion on that, as a pro-open-source person, but I certainly think there's some potential damage there.
There are some lines. I think if you take it down the negative spiral, then yeah, it's true. Personally, I'm like — there are open source models out there already. They're not as good. It's pretty obvious that what's out there on the open source market is not as good as ChatGPT or GPT-4.
That's pretty clear right now. Whether or not that's gonna change by the time we release this episode, that's yet to be seen. I think it's coming up pretty fast, and there are a lot of people trying to create things that are just as good. I am a huge proponent of open source, so I'm gonna say yes. And in terms of the way people use it, I think you can still use ChatGPT in a very nefarious way.
It's just that, for some reason, people think that the overlords who oversee ChatGPT and OpenAI can fix things when they break, or when they see, oh, there's a loophole that someone discovered, now they're gonna fix it. But potentially a lot of harm can be done before that loophole is discovered.
So, yeah — there's danger everywhere, not just with open source. One scenario that we dodged somewhat is — well, it's not obvious — but if OpenAI had been called Closed AI, and they found this amazing technology, then instead of creating ChatGPT and making it accessible to the world with guardrails, they could have said,
hey, we've got this secret internal technology, and we're gonna hire a bunch of people who are twice as productive given our proprietary tools, right? And we're just going to use our own technology to better the cause of this company and not the world. Instead, they took the other decision of getting the animal outside the box and forcing everybody else to catch up.
Yeah, which is kind of interesting. Well, I mean, you can look at Google — they kind of did that, right? They had the technology behind closed doors and they didn't really release it to the rest of the world the way OpenAI did, until they had to. And I guess the thing goes back to giving it to the masses and creating a community around it, whether it's open source or not.
They saw how many different applications could be built with this. I know a friend of mine who worked at Google, and his main job was figuring out how they could use these large language models — which use cases they could use them for internally on different Google products. And this was in like 2018.
So that was his whole job: talking with the researchers, being like, oh, you've got this, okay, this does that — okay, cool, let me go talk to product managers and figure out if we can inject this feature into the product. But then, when OpenAI came out with the ChatGPT revolution, all of a sudden we see people hacking on top of it.
And you see there are a million different use cases that people hadn't thought about. Right. Yeah, it can assist pretty much anyone doing anything, is what we're realizing. Anything that is encodable in a language-like structure, it can be useful in assisting people with, which is basically an infinite set of use cases, right?
It's just insane where this can be applied, even for you and me. I don't know if you've had the same experience since December, but for me it's like, I'm throwing different things at it. Now it's becoming almost a first reflex to go, oh, let's see if an LLM or ChatGPT can help me with this.
And sometimes it kind of fails, and sometimes it's like, oh my God, it's actually really, really great at assisting me with all sorts of things. Which has been kind of crazy to experience — oh, let's see how good it is at writing SQL, or answering this question, or debugging code.
Yeah, ideating on complex concepts. Or, teach me Spanish — anything like that. Yeah, I do like that generalizability of it. But getting back to Promptimize, I think that's one of the things that is fascinating, because you talked for a minute about this test-driven design that we traditionally have in software, and now you've come out with a different approach. You're calling it — it's like prompt-driven design, I think? Or what is it called? I loved this idea.
Yeah, so it's test-driven development as applied to prompts — prompt cases, I think. Yeah. So where we had test cases, we have prompt cases. It's interesting, because there are a lot of parallels, and I invite people to check out the blog post talking about all this, where I try to draw parallels
between test-driven development and unit testing, and prompt engineering. Then there's that hyperparameter tuning angle too. So it's not exactly the same. In the blog post, I list out the similarities and the differences between a prompt case and a test case.
One thing is that a test case is clearly black or white, zero or one. It's a boolean: did you succeed at the test? And if the test fails, you break the build, you don't let people merge into master, right? What happens when you fail the test is very much determined. A prompt case is more of a gradient from zero to one:
it could be, did you 50% succeed at the test, or 75% succeed? And on failure, you don't expect perfection either, right? All you want is for your prompts to perform better than the previous prompts before you release them into production. So there are some parallels, some differences, but you need some reporting, some sort of accounting of how your tests are succeeding and failing. So yeah, an interesting set of parallels, but a lot of differences too.
You need some sort of, uh, accounting on like, how are your tests succeeding? Failing. So, so yeah. Interesting, interesting set of parallels. But a lot of differences too. And I'm assuming that you're also using things like the temperature and you're, you're tracking everything, right? Like everything that goes into the prompt is being tracked.
Yeah, pretty much. And you can, you can augment these, the reports and what, uh, right. So, so you derive a class that, uh, is expected to take some input and generate a prompt, uh, and you attach, you attach evaluation function to, to this. I've get a pretty good example, uh, on, on the repo to under example. Um, Build a little toolkit that does, um, that generates prompt to, um, to write simple python functions, right?
So a prompt might be like, generate function that, uh, gives the end number of the Fibonacci suite or, uh, find the com like, write a function that finds the, the lowest common denominator of two numbers. And then, uh, so for each one of these, text prompt, there'll be an eval function that assesses whether, um, it, it might assess like, if I pass these numbers, do I get this number out?
And all the mechanics of like getting this function par into an interpreter is, uh, is off the library in some ways. Like that's your own logic that you bring to the library. Um, but the prompt, aise prompt cases would be maybe a hundred simple, uh, function prompts with little eval function that makes sure that it's running well.
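Here is a rough sketch of what one of those function-writing evals could look like: execute the model's code in its own namespace and check a few known values. This is the kind of "your own logic" plumbing Max describes bringing to the library, not code from the promptimize repo; the function name `is_prime` is only what the hypothetical prompt asked for.

```python
def eval_is_prime_function(generated_code: str) -> float:
    """Execute the model's code and check an is_prime(n) function on known values.
    Returns the fraction of checks that pass, or 0.0 if the code doesn't run."""
    namespace: dict = {}
    try:
        # Fine for a local test harness; don't exec untrusted code in production.
        exec(generated_code, namespace)
        is_prime = namespace["is_prime"]  # assumes the prompt asked for this exact name
        checks = [is_prime(2), is_prime(13), not is_prime(1), not is_prime(15)]
    except Exception:
        return 0.0
    return sum(bool(c) for c in checks) / len(checks)

prompt = "Write a Python function is_prime(n) that returns True if n is prime."
# generated_code would be the LLM's response to `prompt`
```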
And then, when you derive that class, it's pretty easy for you to log more things. By default we're gonna log all of the OpenAI-type parameters, like temperature, how long it took to answer, how many tokens were used in the prompt, right? We keep track of all of these statistics and put them in a blob, but you can augment that blob with whatever is relevant to your prompt, right?
So you can log your own type of information that you want to keep track of in your reports. Man, here's an interesting thought — something I've been thinking about is the following, and you tell me what you think about this idea. Everything is changing so fast, right?
In terms of AI development, LLMs — we talked about vector databases being a key part of the solution, so that's new and emerging; the prompts are getting longer, so the limitations are changing; the costs are getting cheaper; and now all these open source models are emerging with different constraints. Everything is changing really fast.
What I've been thinking is, maybe the only thing that is valuable in a world that's changing constantly is your test set, right? The assertions that you're using to evaluate this — it's the only anchor in a kind of — oh wow, yeah —
in a storm, right? You're on this boat, the technology is changing very quickly, and it's really hard to evaluate what you should use, how you should do things, which model you should use. And maybe the most valuable thing in those times, where things are changing so fast, is the anchor that tests how well you're doing, right?
Your prompt cases, your test cases. Well, I love this, because — like we were just talking about earlier — you can do so many different things with these models, and that's part of the magic of it, that generalizability. But for the most part, for most use cases, it's not like you, in your company, are asking for five completely different things, right?
Maybe you have your one use case that you're using and heavily optimizing for. And then you may find another use case, and you may find that OpenAI is not the right API for that — you can actually do it with an open source model, whatever it may be. And so you're figuring that out. But
in these different use cases, if you can create a test suite that is highly optimized for that specific use case, then you can swap out the models so much more easily, because you know what the best model is for that use case. Or, again, what you're talking about — let's say Anthropic says, all right, now you can just throw a whole book at it.
You can throw the Bible at it and not even worry, it doesn't matter, we'll give you back whatever you need — and you can analyze: is it worth it? Is it more expensive? Is it faster? Is it slower? All that stuff that you really want to know if you're trying to build a battle-hardened use case as opposed to just a Twitter demo.
It's super useful, and I really like that idea: if you build out your test suite, you can know really quickly, and you don't have to spend a ton of time evaluating which model is best for your use case. Yeah — maybe this test library, your collection of prompt cases, becomes your single most important asset in a world where everything else is evolving too fast, right?
It's the only thing that doesn't erode over time, that you don't have to rewrite next week when the next model comes in. One thing we haven't talked about — we talked a lot about prompt engineering, and I think a huge portion of the key is always going to be in the prompt itself — but there's a lot in terms of potentially fine-tuning these models, right?
So I would take a base model with a set of weights, and then, instead of passing a lot of context in the prompt — which is limited, though arguably growing, the prompt limits are growing — maybe the real solution is to train these models further down the line instead of trying to say,
here's the whole Bible. It's like, how about I just train you once on reading this book, really understanding it, and then that's compiled into your weights, into the neural net. So it could well be that the solution weeks or months or years ahead will be all in the direction of custom fine-tuning, custom training.
You pick a base model, and then you're still in a world where you need to assess these prompts and how they're doing, right? So that library is still useful regardless of your approach in the end. And don't you feel like, though, if you are trying to figure out these new models and where they excel and where they don't, you have these prompt cases, but you also want to add new prompt cases, because there are new capabilities?
Certainly, I think you're right. So, I talked about it as a super important asset. If you're solving, call it a subset — you take these generalized models and a more limited problem, like writing Python functions or doing text-to-SQL — then that test library, you can totally augment it to be better and more representative of your actual user cases.
And what I realized is that AI is actually really good at generating these prompt cases. So one question is, should they be defined dynamically — should AI evaluate the AI? For me, I like to have these test cases defined as code, static and very deterministic, because we're trying to create a
more deterministic outcome in this super probabilistic world. If it's like, oh, this AI is gonna evaluate this AI, then you're still in that realm of, I don't really know what's happening. So I like the idea of writing those as code, right? Putting them in a repository and evolving this code base, this library of prompt cases.
What I realized is you can take, say, ten examples — you do a few-shot example of, here are ten test cases, now can you generate a hundred more? And it will do that pretty well too, so the AI can help you come up with this library. The example I had for the Python use case — a toolkit that asks the AI to generate
small, deterministic, simple Python functions — I wrote like four or five: can you write a function that determines whether a number is prime or not? So I wrote four or five, and I was like, here are four or five, can you generate a dozen more?
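That bootstrapping trick — few-shotting the model with existing cases and asking it to write more — could look roughly like this, assuming the pre-1.0 openai Python client (with OPENAI_API_KEY set in the environment); the seed cases and the prompt wording are made up for illustration.

```python
import openai  # assumes the pre-1.0 client, i.e. openai.ChatCompletion

seed_cases = """
- prompt: "Write a function is_prime(n) returning True if n is prime."
  checks: ["is_prime(13) is True", "is_prime(15) is False"]
- prompt: "Write a function nth_fibonacci(n) returning the nth Fibonacci number."
  checks: ["nth_fibonacci(10) == 55"]
"""

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Here are a few prompt cases for simple Python functions, each with "
            "deterministic checks:\n" + seed_cases +
            "\nGenerate a dozen more in exactly the same YAML format."
        ),
    }],
)
# Review the output by hand, then commit the good ones as code in the repo.
print(response["choices"][0]["message"]["content"])
```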
And it did a really, really good job at that, including the evaluation methods. I generated Promptimize code using Promptimize code as the examples, and created these really great prompt cases that I then put in the repository and was able to use in my test suite. So — it's kind of changing gears, but I do want to mention, because we've been dancing around it for the last 30 minutes: with what you've created, you can monitor (a) speed and (b) cost. Those are some of the biggest things people talked about when we did the LLMs in production survey that we sent out to a ton of people.
It was basically speed, cost, and trust in output — how can I know that it's... yeah, accuracy in quotation marks. Exactly, air quotes, because it's a weird kind of accuracy. And so, effectively, with this test suite, this prompt suite, you can help measure all of this. And you mentioned before that you can kind of bring up a dashboard and see it.
Let's talk about that a little bit. Yeah. So what the library does at its core is run your prompt cases for you. I'd say it offers a little bit of a DSL for how you define a prompt case. It's something as simple as: what's the input, and what are the eval functions for that input, right?
So that's a way to express it in a very nice, syntactic-sugar way, the same way you have these test suites. You can very easily represent your test cases as code, and then it will run them for you, run the eval functions, and produce a report.
And that report — it's a little CLI application, so you're like, promptimize run: find all the tests in this folder, run all the tests, and put the output report in some sort of temp file, some sort of YAML file somewhere. So it will output a YAML file for each prompt case with a bunch of statistics in it.
These statistics aren't really even statistics, just logging, essentially, of how each case did. You can add some categorization of your prompt cases, right? So you could have complex functions, simple functions, in Python — you come up with your own categorization of things.
You come up with your own weights: you can say this test is five times more important than the rest, but this one is less important. There are weights, there are categories — don't fuck this one up, whatever you do; this one is like a hundred times more important if it fails. You could add more semantics eventually, too.
You could have a category that says, if this fails, never release — never publish the library. Call it the kill switch. But yeah, so it produces this big YAML report, and then given these YAML reports you can, with the CLI, compile some high-level statistics, things like, mm-hmm,
what's the percentage of success, with the weights? But better, you can push all these statistics to a database if you want to. It's pretty easy to just push these YAML reports with: what is the prompt case? What was the input prompt? What was the output prompt?
Did it succeed or fail? How much did it cost to run? How long did it take to run on OpenAI? How many tokens did it use? All of these are in a YAML report. You can put it in a database if you want to do more complex analytics on it. But I think with the CLI, you can write promptimize report,
pick a YAML file as input, and ask for which statistics you want out. Right now it will just compile and tell you, this test suite ran better — the success rate is higher than this other one. I think there's more work to be done there. So right now it pretty much just logs into YAML,
but it could do more things, like diffing two outputs, right? Maybe you run it this week for the prompt with, say, GPT-4 and GPT-3.5. Now you want to say, give me examples where one succeeded and the other one failed, or vice versa, to start understanding, oh, it does much better at this kind of test, and here's why.
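Since the reports are plain YAML, the "push it to a database and diff two runs" idea can be sketched in a few lines. The report file names and field names here (case_key, model, score) are assumptions about what a report might contain, not the actual Promptimize schema.

```python
import sqlite3
import yaml  # pip install pyyaml

def load_report(path: str, conn: sqlite3.Connection) -> None:
    """Flatten one YAML report (assumed to be a list of per-case entries) into a table."""
    with open(path) as f:
        report = yaml.safe_load(f)
    for case in report:
        conn.execute(
            "INSERT INTO results (run, case_key, model, score) VALUES (?, ?, ?, ?)",
            (path, case.get("case_key"), case.get("model"), case.get("score")),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (run TEXT, case_key TEXT, model TEXT, score REAL)")
load_report("report_gpt4.yaml", conn)   # hypothetical report files
load_report("report_gpt35.yaml", conn)

# Cases where one run succeeded and the other failed (or scored differently).
diff = conn.execute("""
    SELECT a.case_key, a.score AS gpt4_score, b.score AS gpt35_score
    FROM results a JOIN results b ON a.case_key = b.case_key
    WHERE a.run = 'report_gpt4.yaml' AND b.run = 'report_gpt35.yaml'
      AND a.score != b.score
""").fetchall()
```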
Yeah. Well, we see it all over social media, right? There are all these benchmarks — this model performed better than GPT on X amount of questions or whatever. But for most people, that doesn't really mean anything, because you want to test it against your own use case. So it's very much like what you're saying: hey, we can basically give you your own benchmarks, and then you can figure out what works for you.
Yeah, it's a methodology for you to create your own benchmark, really, right? For the things that you're interested in. And clearly, different people are building different products and thinking of different ways to leverage AI, right? But the story is always similar. In our case, we want to help people do their own analysis of data and explore their data more on their own.
So our use cases are around generating queries and making sense of data. But in general, you always have: okay, my product has some context — in our case, that might be the user's database and the user's database schema — and the user has a question. And if we take this user question, what they want, along with the context that's in our app, we can output something that is useful in the context of that app.
And I think for every screen in every product, there's potentially a place where AI can be leveraged. Mm-hmm. And the question is, how do you evaluate how well it's doing? You definitely don't want to be shipping a product with an untested AI. You don't know whether it's gonna hallucinate things, or make shit up, or just have biases. Yeah.
Or say you want to upgrade — hey, let's upgrade to the newest version of GPT. Maybe it performs better at certain things but worse at others. If you're gonna ship product, you need a handle on every API that you use. And this one API that we started using now, the LLM, just has no structure.
So maybe that's the way to put some structure and evaluation on top of it. Yeah. And it's funny too — another thing from the product perspective, which people pointed out in this report that we did. I'm just bringing it up because my head has literally been buried in this data for the last three months.
And so I feel like I've come to some conclusions. They're probably completely biased, just from reading the responses — because I don't know anything about doing reports. I did not get that consultant training; I never went to McKinsey for my report-writing skills. And I left every single question as an open freeform text box, because I didn't want to assume anything.
Since it's so new, I didn't wanna push anyone down any path. That made it really hard to write a report from, because you have to just go through all this data. I can't create graphs or anything; there's no "ten percent of people chose this." So — a little bit of a tangent there — but getting back to it: one thing people were noting is, yeah, I can add AI to my product, but at the end of the day, is this AI feature going to add enough value to the end user that we can charge more and cover what it costs us to implement this AI feature?
It's uncharted territory in a lot of ways, right? People are using your product in a certain way today — they're kind of trained to do that. Then you want to add a new set of tools and shortcuts that enable them to do much more work, maybe work they wouldn't do otherwise, yeah,
or things that they would do somewhere else. Then you're assuming this is gonna be useful — and how do you measure whether it is? It's a tough one. I think the only way to know is to add a feature flag in your product, bake it in, do a private beta, launch it, then do some user interviews and look at it and mix it with
user feedback from humans — sit down with them and be like, oh, so you've used this feature quite a bit, do you like it? And there are emerging usage patterns, too, that you can't predict. Like, to your point earlier, they launched ChatGPT, they didn't know what to use it for, and then all these use cases are popping out
in the wild as people try to figure out how to use this thing, and they find intricate ways to get value that you and I could never think about — a dentist using ChatGPT, or a neurosurgeon asking questions that we could never come up with. But yeah, so there, you do user research.
You can also log a bunch of data. For us, we're gonna be logging: okay, what's the user prompt? What's the SQL generated? You know, you add a little thumbs up and thumbs down and try to get some — call it more quantitative — metrics on that. And you can sit down and interview people and be like, hey, is this
AI-assist feature that we've built useful to you? Your more normal kind of user-research approach to it. Yeah. And I think the interesting piece there is, from your perspective: all right, now you see that people actually like it — how much more are they willing to pay for it?
Right? Because if it's not more than your API calls are costing, you're not actually generating an ROI on it. Yeah. So there, in terms of the perceived value of the feature — that's gonna be really specific to every product, and whatever you want to gate or charge people more for versus what they get out of it, it's kind of tough to evaluate.
But yeah, you probably need to put it in some sort of open beta, get some feedback, try to figure it out before you even think about pricing and packaging or ROI. Yeah. But I think everyone's intuition is that there should be ROI here. And there's a pretty big feeling in the industry that we can't afford to miss that boat.
I would say every PM who's not thinking about this right now is probably, potentially, missing the boat. It would be unreasonable not to think about how LLMs can augment your products today. Yeah, a hundred percent. And so a friend of mine, when I was talking to him about this very thing, he was like, yeah, at the end of the day, you're just going to have to tally up the
cost of using LLMs into COGS, your cost of goods sold, basically. And if the math works out, then it works out. But if it doesn't, then you gotta figure out how to make it work. But it's cheap though, right? I mean, maybe if you use GPT-4 with the longest prompt possible, or if you run your own infrastructure, it's expensive.
Yeah, but the stuff is like pennies a prompt, right? So unless you're doing crazy chains or recursive prompting, or very, very large inputs and outputs — I mean, don't quote me on this, but it'd be hard to run up a big bill on GPT-3.5 Turbo, right?
It's pennies per call. So if it's a user interaction — like the user clicks a button and gets a prompt — at a penny a call or sub-penny a call, that seems like pretty much a commodity. Yeah, it's hard to run up a gigantic bill.
I guess, but I wouldn't put it past some people and teams to run up fat bills. OpenAI is gonna be making some money off that API, that's for sure. Yep. So there is one kind of elephant in the room that I want to get to, which I've heard other people talk about — and it seems like you're very bullish on prompts.
But I do know people who feel like prompts are an artifact, and hopefully not the end state of how we interact with LLMs. I cannot think of another way that we would do it, but it feels like what we've been talking about — English, or speaking, or language in general —
written language just has a lot of downsides that code and the deterministic ways we interact with computers don't have. That's not to say we're not going to go back to everything being code after X amount of time, once we realize this is so hard.
But when it comes to prompts and interacting with large language models through prompts, is that the end state? Have you thought about what's potentially next? I mean, the way I think about this question is maybe as specialized models versus generalized models. These LLMs are proving the power of generalized models and the power of unsupervised training
using text and language, and that's a new API that we're trying to tame and control with prompts and prompt engineering. If you look at more traditional machine learning models, you have these very specific use cases, right? You prepare a data set — a classic use case:
given the features of a flower, tell me what kind of flower it is. And you train this model on the specific use case of, given these six input parameters, predict which kind of flower it is. In some ways, that's where we came from — they call them specialized models.
These specialized models, I think, are still extremely powerful — whether they're neural nets, trees, forests, linear regression, whatever they are, they're useful, and they have a very clear API, right? When you think about those, they have a very clear API.
Now we have this new world of generalized AI — not AGI, but generalized AI that's just good with language. It's a new beast for us to tame, and clearly prompt engineering is the way to do that, though I don't think people realize just how much you can do with prompts, right?
One thing that I discovered is you can ask for structured output. You can say, hey, I'm about to ask you a question to generate SQL, but I would like your output to be a JSON blob with one key that is the SQL query, another key that is your confidence level from zero to one on whether this SQL answers the question,
and another key that is some text feedback on how to improve my prompt to get a better answer from you — like asking-for-clarification type input. And you can ask the AI, through a prompt, to provide structure in its answer. That's been working very well for me with GPT-4, right?
With few-shot examples: when I ask you a question like this, I want you to answer like this, formatted as JSON with those keys. Here are three examples; now here's the question, you tell me — provide a similarly formatted JSON blob. So you can ask for structure in the answer, which is great.
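As a sketch of that structured-output trick, again assuming the pre-1.0 openai Python client with OPENAI_API_KEY set: ask for a JSON blob with fixed keys, give a few-shot example, then parse and sanity-check the response. The key names mirror the ones Max mentions; the table schemas and prompt wording are illustrative.

```python
import json
import openai  # assumes the pre-1.0 client interface

system = (
    "You generate SQL. Always answer with a JSON object with exactly these keys: "
    '"sql" (the query), "confidence" (0 to 1), "feedback" (how to improve the prompt).'
)
few_shot = [
    {"role": "user", "content": "Question: total sales this year. Tables: sales(date, amount)"},
    {"role": "assistant", "content": json.dumps({
        "sql": "SELECT SUM(amount) FROM sales WHERE date >= '2023-01-01'",
        "confidence": 0.9,
        "feedback": "Specify the fiscal year if it differs from the calendar year.",
    })},
]
question = {"role": "user", "content": "Question: net sales by country. Tables: sales(country, amount)"}

response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,
    messages=[{"role": "system", "content": system}] + few_shot + [question],
)
# May raise if the model strays from JSON; in practice you'd catch and score that as a failure.
answer = json.loads(response["choices"][0]["message"]["content"])
assert {"sql", "confidence", "feedback"} <= set(answer)
```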
Then, how do you structure the question? How do you best provide the context to get the AI to help with your use case? That's the other big part of prompt engineering. But the interface for LLMs is gonna be language — it's just gotta be. Then there's a question, maybe a question going back to you: which real-life use cases, what percentage of real-life use cases, are gonna be solved by specialized models versus generalized models?
I think everyone's perspective on that has changed over the past 12 months, from "it's all gonna be specialized" to "it's all gonna be generalized." There's probably a medium there somewhere. Yeah, exactly — that's the funny part, right? It feels like every day that we move forward, we're going deeper into this.
Yeah, it's generalized, but they're smaller models. We went to generalized, but now we're saying, yeah, it's probably gonna be a lot of generalized models. And to your point earlier about fine-tuning: you can fine-tune a large language model that has a certain ability, and then add a little boost on top for your specific use case.
And then you have other use cases that you can fine-tune for. It's probably a lot of those fine-tuned models, as opposed to one big one that you're trying to fine-tune with everything. Right, right. So yeah, I talked to some folks about this specific topic, because I was interested in — well, instead of doing a lot of prompt engineering — let me get a little deeper into the text-to-SQL challenge that we're trying to solve.
With text-to-SQL, you can have a database that has tens of thousands of tables in it, right? And then people might ask a question about the data. The challenge becomes: there's a lot of private context here, which is all the schemas inside the user's database.
And clearly that doesn't fit in a prompt, right? I cannot say, here are 10,000 tables — the schemas of 10,000 tables with their simple data usage statistics, descriptions, and comments — now can you generate SQL that answers the simple question, what are my net sales this year?
So with a vector database, we try to do document retrieval to best construct the prompt — the prompt that is most likely to have the context for the AI to provide the answer. Then the other approach is, can we just fine-tune a model? Can I just train the model? Can I create a corpus out of those 10,000 tables, throw my dbt pipelines, my Airflow DAGs, my Notion pages about my data, and whatever else into this corpus, and fine-tune a model that will have the full context of all my data catalog and more, right?
Data catalog, data pipelines. Um, then I was like, okay, can I do that? I spoke to someone who works on GPT4All, which is one of the open source models. Probably a great person to talk to, and I can share the contact info. But they were saying that if you try to take a model that's fully trained, so you bring the model topology and the weights, and then you throw your own corpus at it, you're like, oh, well, it should learn the stuff on top that I'm trying to teach it.
But what this person was talking about is that the sequencing of information is really important. So if you train a model with a bunch of information... at OpenAI, they have this sequence of training, and the SQL teaching portion of it, the code teaching and the multiple-language teaching, is fed in a very specific sequence.
Um, so if you take that model, already trained all the way, and then you're like, okay, now here's a bunch of random stuff on top, it can really confuse the model. Um, so that means you would have to go back, fit your training data in the right sequence, and then train from scratch.
And then that's hundreds of thousands of dollars of GPU cost. Um, and otherwise you deal with these issues of, kinda, memory scrambling, I forget what they're called officially. But if you try to take a model that's fully trained and say, by the way, here's a bunch of extra information that is relevant to me,
it might forget things. It might kinda just mess up the weights and the neural nets. So we're looking for some improvements in those areas too: can you fully train a model and then specialize it, tune it later in the process, without confusing it?
Mm-hmm. Yeah, I think there are so many open questions on that, like, which is better and for what use cases is it better? And so it makes complete sense. Here's the thing too, like for me, if I'm like, okay, well, should I build and fine tune a model for each one of my customers at Preset? Yeah. You want isolation too.
You don't want one customer's data trained into another customer's model. Yeah. Compliance, there you go. Now we're getting to compliance and privacy. But then that means I need to fine tune a model for each one of my customers, and serve a different model for each one of my customers.
Yep. In an air-gapped way. It's rough, right? Like if you wanna do that today... um, I think you can do fine tuning at scale on OpenAI, but fine tuning the way it works might not be the way that you want it to work, and it's not quite there yet, right? So maybe a year or two from now it becomes really easy to fork models and fine tune models in different ways.
Like today, it just makes a lot of sense to do prompt engineering. If you're a small and mighty organization moving fast, you probably just wanna spend some cycles doing some prompt engineering and then hitting the OpenAI API. That's probably the most reasonable approach. If you start getting into training your own, fine tuning your own models, um, you're entering potentially a world of trouble.
Yes, a world of trouble. A world of increasing investment to get to value; you have to do a lot more engineering. Yeah, totally. It's the classic engineer's dilemma, right? Like, we wanna sink our teeth into things because it's like, ooh, that's fun, let's add a little more complexity onto it.
And really, you could have done everything from a much easier standpoint. And so, yeah. And then the thing I will say, right, like I said, there's a timing issue of, hey, do I want to try to build an iPhone before touchscreens are good enough and chips are cheap enough? So there's a timing thing: if you try to do the right thing too early, it's very expensive and costly. Whereas if you wanna invest in fine tuning, maybe the best way to do it is to keep monitoring the advances in that domain.
And once the right API, the right cost comes out, that's when you start investing in that. But I think we're entering... we're in the era, it might be a short era, where prompt engineering and vector databases is the name of the game. Right now, if you want to get value out of LLMs, that is really where you should spend your cycles, from my own personal perspective. Fine tuning...
I think we might enter the era of fine tuning in a month, or in six months, or in 12 months, who knows? Yeah, it's hard to gauge that, right? I know people are excited about, what is it, the QLoRAs that are showing promise. But it is true, like, hey, build for what is here today. And I love what you said, which is such great wisdom.
It's like, look at where you are and take a cold hard look in the mirror and say, if you are this small and scrappy startup and you've gotta move fast, go with what is going to be the easiest, the lowest hanging fruit. You don't need to overcomplicate things just because you read some research paper. Yeah.
I mean, it might be more fun too, but to me... I don't know, I'm super pragmatic. So entering the whole, like, oh, I'm gonna figure out how to load up a bunch of weights and get access to a bunch of GPUs and rack up these bills, to try to fine tune my stuff, which I might just throw away after the first iteration and start over...
Uh, it seems prohibitive in this phase. It's much more the pragmatic approach of, I'm trying to fix myself a meal, so what's in the fridge? I'm not gonna go to the store, I'm not gonna go to a specialized grocery shop or order stuff online. I'm just really trying to cook something with what's in my kitchen.
Yeah, yeah, for sure. I mean, if it's possible, right? Because I know there are people sitting at companies where, whether it's because of GDPR issues or company policy or regulatory issues, they just can't say, all right, now I'm gonna hit the OpenAI API. And so they gotta figure out workarounds.
But I will say this, you mentioned something before that I thought was hilarious, because I've been meaning to say it for a while now, and it's, uh, what did you call it? Like random scrambling or something? Oh, these terms, I don't know. Yeah, I don't know what the terms are, but I gotta give it to the researchers, because every time I hear a new term that comes out of the LLM or machine learning researcher community, I'm like, man, that is such a good term.
First of all, we've got hallucinating, right? Yeah, that is epic. And I created a shirt, I'll have to show you sometime, it says, I hallucinate more than ChatGPT. I thought you would like that one. Yeah, I thought you would. But then the other term that I absolutely love, and I'm trying to figure out if there's a meme or a shirt to create around it, is the whole idea of catastrophic forgetting.
And it's like, oh my God, can you make something sound so epic? I mean, catastrophic forgetting. Yeah, gimme a break. Yeah, that's right. That's the term for when you teach it new things and it forgets something that's more important. Like, you would train a new language on top and it would forget Spanish, or scramble it with some other language. It's gone forever. Yeah, yeah, exactly. It can't recall it anymore because of the sequence of events, or something like that. But you know, another term I've been using to describe LLMs is amnesiac. It's related to this catastrophic forgetting, but these models... I figured this out early on when they came out, I was like, how do these things work?
They're immutable models, right? They have no context; they're trained once. Anything you take, like GPT-3.5 Turbo, right, it is trained once, it is completely immutable, it is trained as of the fall of 2021, and it knows nothing about what happened after. Yeah. And then it's like, well, but it recalls... I'm having a conversation with it and it remembers what we're talking about, cuz I say, tell me more about that.
And it knows. But that's just because at every request you pass the context; the chatbot receives the context of the current session. It has no recall if you go past that prompt window, though. So if you have a long enough conversation and you ask GPT, hey, what did I first talk about when we started this session?
Yeah. And it's outside the token window, it's like, I don't know, there's no way for me to know. Um, of course, now they're looking to train and fine tune their models with the chat sessions that we're having today outside the API; at least at OpenAI, they're doing that.
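A toy illustration of that statelessness, with assumed details rather than anything from the conversation: the client re-sends the session history on every call, and anything trimmed to fit the token budget is simply gone for the model. The word-count token estimate and the tiny budget below are deliberately crude so the truncation is visible.

```python
# The "memory" of a chat session lives entirely in what the client re-sends.
# Token counting here is a naive word count, and the window is tiny on purpose.

MAX_CONTEXT_TOKENS = 25  # unrealistically small, so truncation happens quickly

history = []  # list of {"role": ..., "content": ...} turns kept by the client

def approx_tokens(messages) -> int:
    return sum(len(m["content"].split()) for m in messages)

def build_request(user_message: str):
    history.append({"role": "user", "content": user_message})
    # Drop the oldest turns until the conversation fits the window again.
    while approx_tokens(history) > MAX_CONTEXT_TOKENS and len(history) > 1:
        dropped = history.pop(0)
        print(f"(dropped from context: {dropped['content'][:40]!r})")
    return history  # this is what would be sent to the LLM API on every call

build_request("Let's talk about text-to-SQL for a dashboarding tool.")
build_request("Now tell me more about fine tuning versus prompt engineering.")
build_request("By the way, what did I first ask you about in this session?")
# By the third call the first question has already fallen out of the context,
# so the model has no way to answer it.
```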
But these models are amnesiac; they have no recall, no memories. They have to bolt that on top through context and prompting. Yeah. That's actually something that, when we had Matt on here from Databricks, he was talking about how much of a crapshoot it is when you do decide, hey, I'm gonna go and train my own model. Because you are training for 40 days or 60 days, or however long, you're spending a ton of cash, and then you're kind of left with whatever comes out.
You can't just go and say, all right, now we've got this base version and we're gonna make it a hundred percent better. You have a model that comes out, and there's not a lot that you can do if it's a shitty model. Yeah. I think you can load it back in memory, right? You can extract the topology of the network and then the weights, and then you can fine tune, I guess is what they call it.
Yeah, you can feed it more corpus, but then I think the sequencing matters, and there are these issues with catastrophic forgetting and just scrambling, catastrophic scrambling, memory scrambling. So it's likely to really overweight the stuff it just saw. I mean, like you and I, right?
Like, what we last heard about, the last input we got. The recency bias. Yeah, recency bias. So we will have that in training, and people are looking to kinda fix that. Cuz otherwise it's true, like, to Matt's point, you have to go back and retrain the whole thing on a cluster of GPUs for six days or for 40 days. It reminds me of the world of very slow data pipelines on MapReduce, yeah. It's like fire and forget: you run the job, it's gonna take 24 hours, and you just hope it's not gonna fail towards the end. Yeah. If it does, you're like, ugh. The cognitive weight of that, right? You start many of these jobs cuz they take forever, then you forget what you had done. Like if you're training six versions of a model at once, then you've forgotten why you trained model one in the first place by the time it's ready.
Um, so yeah, for me, I don't like to iterate on these super slow loops. Anything that takes a lot of compute and time is hard to work on, man. It's just less fun. Whereas prompt engineering, like, exactly, you get direct feedback. Yep. And you can use Promptimize to get your prompt suite.
I love that term. And really go with this, uh, the TDD, or, what is it... it's not TDD, you were calling it... Oh, test-driven development. I guess it's like prompt cases; it could be prompt-driven development, or like test-prompt development, I dunno. Um, but to me, the reason why I come out to talk about these things is to share my experience working with them and share some things I learned along the way.
I'm not sure if it's... to me, Promptimize is less about the tool, the little Python toolkit and CLI, and more about the idea of, how do we bring a more deterministic approach to interacting with these LLMs? And to me, the blog post is really... if people are interested in the topic, go read the blog post.
It's much more structured than I can be in free form like this; maybe less entertaining, but more structured. Um, and the big idea is in the blog post, and then Promptimize is just a reference implementation of how to do it. If the toolkit works for you, use it.
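To show the shape of that idea, and not Promptimize's actual API, here's a bare-bones sketch of a "prompt suite": each case pairs a prompt with a deterministic assertion, and a runner reports a pass rate you can watch as you tweak your prompts. The PromptCase class, the fake_llm stand-in, and the example cases are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptCase:
    """One test case: a prompt plus a deterministic check on the response."""
    prompt: str
    check: Callable[[str], bool]

def fake_llm(prompt: str) -> str:
    # Stand-in for a real completion call (e.g. an HTTP request to an LLM API).
    return "SELECT sum(net_amount) FROM fact_sales"

suite = [
    PromptCase("Write SQL for total net sales", lambda r: "sum" in r.lower()),
    PromptCase("Write SQL for total net sales", lambda r: "fact_sales" in r),
]

def run_suite(cases):
    """Run every case and report a score, like a test suite for prompts."""
    passed = sum(1 for case in cases if case.check(fake_llm(case.prompt)))
    print(f"{passed}/{len(cases)} prompt cases passed")

run_suite(suite)  # prints: 2/2 prompt cases passed
```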
But if you want to write your own test suite and reporting suite, you can do that as well. Excellent, man. Well, it's been a blast, an absolute blast talking to you. I knew it was gonna be fun, and I'm happy that we did this. I'm super excited for what you created with Promptimize, and it's true, you're doing it so that you can just share your learnings, and you recognize that this is probably valuable.
So get it out there, and thank you for doing that. Yeah, it's great to share; it enables these kinds of conversations, right? I was able to write a blog post, create a little toolkit, and that gives me an opportunity to connect with people and talk about this topic from experience, get input from others, and compare notes.
So it's been super exciting too. I gotta find a community of, like, the community of people that are building... if you have any input on that... like, people trying to wiggle and jam AI into the product they're building. I don't know if those communities exist, but if anyone knows of one, put it in the show notes. I'm looking to get closer to other people that are trying to figure out how to bring the power of AI and LLMs inside the product they're building.
Yeah, like the AI product community. Yeah. Yep, yep, I love it. I mean, by the time we release this episode, your talk from the LLMs in Production conference that we're about to have next week, and this will date when we're actually having this conversation, that'll be out too. So if anybody wants more of you, Maxime, they can go find that.
We'll link to that too in the show notes. So, uh, this has been awesome, man. I really appreciate you coming on. Love to connect. Uh, yeah, I don't know what we're gonna talk about next. Last time we talked about data modeling and how it cuts across ML and data engineering, or analytics in general.
Now we're talking about prompt engineering, so who knows what it's gonna be about next time? Yeah, exactly. And that wasn't that long ago, that was literally like a year, nine months ago, or yeah, a year probably. So it's funny how things move. Yeah.