Taming AI Product Development Through Test-driven Prompt Engineering
Maxime Beauchemin is the founder and CEO of Preset and the original creator of Apache Superset. Max has worked at the leading edge of data and analytics his entire career, helping shape the discipline in influential roles at data-dependent companies like Yahoo!, Lyft, Airbnb, Facebook, and Ubisoft.
At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
It’s clear that test-driven development plays a pivotal role in prompt engineering, potentially even more so than in traditional software engineering. By embracing TDD, product builders can effectively address the unique challenges presented by AI systems and create reliable, predictable, and high-performing products that harness the power of AI.
You wanna try and upload them and see if you can share it that way, the old-fashioned way, if it's bricking your computer.
Maxime, we're waiting for you to share your screen.
Let's see if it bricks your computer.
I'm looking at you, man.
Let's see what we got today with Mr. Maxime.
When's this screen share going?
So it bricked it. Bricked it out. Man, I gotta bring you on here cause I just see you suffering backstage and I'm feeling bad for you. Do you want to gimme the link and I'll share it? Yeah, I put it in the private chat. I was gonna go get my guitar. I was like... you play? Yeah. Gimme the link.
I'll play guitar while... oh, it's already in here, isn't it? Yeah, I just put it in there. Maybe I was bricked when I tried to upload it or send it to you, but you should be able to share it. Hopefully. Yes, I got it right here. I've got it. I will bring it on stage. And the fun is over with the guitar, or... dude, grab yours.
Let's rock on. Test-driven prompt engineering, that's the talk. Here's the thing: trying to jam online is really hard. You get a few milliseconds too much lag, and then you start slowing down, you know? I'll go a little bit slower to make up for you.
Yeah. And I will kind of protect your audience from my guitar playing. There's a better thing for all of us. Awesome, dude. Well, we've got this here. You're just gonna have to make a sign when you want me to change to the next slide. Yeah, that'll go pretty fast, cuz this is a lightning talk.
So I'm gonna... yeah. The lightning talk was supposed to start, what, two minutes ago? Where are we? Yes, that means I've got seven minutes left, so I'm gonna blaze through this. Yeah, don't worry, you can take the whole ten. We're all good. We accounted for a little bit of cushion, so that's nice.
Yeah. I'm gonna say, for people who wanna know more, there's a blog post, so we'll hit a slide that talks about the blog post that's a bit of a companion to this talk. There's probably an hour's worth of content that I'm gonna try to cram into ten minutes, so it's pretty dense.
There's a lot to talk about, but I'm just gonna brush the surface. And then there's the blog post, and there's the podcast episode that we recorded just last week, if I'm right about my timeline. Alright, so we're talking about test-driven prompt engineering today.
Maybe let's jump into a slide or two. I think the next slide is an intro about me. You did my intro already, so we can go through that very quickly. But I wanted to point out that my professional life's mission has been to work on open source, and lately I've been really pushing on Apache Superset.
I wanna say a few words about that. Superset is an open-source business intelligence platform, so it's very competitive in the space with, you know, Tableau, Looker, and all these BI tools. Open source in that area, through Superset, has gotten really, really good recently, since we've been working
at it. I appreciate that the community's been working real hard. So I encourage people to take a look at Apache Superset if you haven't done that in a while, and you can try the latest, greatest version at Preset for free. I encourage people to take a look at that and to switch to open source.
Like, why keep paying for these vendor tools? All right, so let's jump into the topic of the day, which is test-driven prompt engineering. I'm gonna start with just a bit of context. I think this is pretty well established, given this conference and what we're talking about, but everyone right now is looking to bring AI, the power of AI, into the products they're building, right?
And that's a big part of what this very conference is about. If you talk to anyone who's a PM, a product builder, a startup person, everyone is trying to figure out: how do we harness the power of LLMs and bring that into the context of the product or application we're building? Prompt engineering is very new, right?
It's a new field that's pretty fresh, and LLMs are just weird freaking animals we gotta deal with, right? As product builders, we've been used to proper APIs with schemas for input and output, things that are very deterministic. And all of a sudden we have this weird animal that's very probabilistic and has an infinite API surface,
that has no input or output schema. Unless, I guess, in this infinite API space, you can define some sub-areas of API surface where you can ask for a schema and kind of dictate a schema in some way. And this stuff is evolving very, very quickly, right? The models are evolving quickly.
The limitations around the models, the prompt window sizes, all these things are evolving very quickly. All right, so next slide. This talk, and it's a 10-minute talk, so it's very quick, is about advocating for a test-driven approach to prompt engineering: trying to make this stuff that's very probabilistic more deterministic.
I want to introduce Promptimize, which is my take on an open-source toolkit to define these test cases, or prompt cases, to evaluate your prompts, and to bring this idea of test-driven development to prompt engineering. And then I want to talk a little bit about our text-to-SQL use case at Preset.
The reason I went and built Promptimize and got involved in this space is because we wanted to build text-to-SQL type features inside Superset, inside Preset's Superset. So I'll talk about that. This is a blog post; it looks like it's not rendering the image, but here it is. You can go to preset.io/blog.
There are a bunch of blog posts, but the one we're referring to here is called "Mastering AI-Powered Product Development: Introducing Promptimize for Test-Driven Prompt Engineering." Quite a mouthful. But this is what we're talking about today, and there's a lot more structure, examples, pointers, and rich text there.
So I encourage you to maybe pop that blog post open if you're interested, or if you're curious to learn more after this talk. All right, next slide. So, our use case, just to set context: why are we getting involved in this? Superset is a full BI platform, but it has a SQL IDE as part of it,
and we wanted to bring a few AI-assist type features to it. The first one, you see at the top, there's a little blurb where you can do text-to-SQL: you type natural language, we'll generate SQL and show you some results, and then we'll also recommend charts. The recommended-chart engine is also an AI use case that we want to integrate into our product.
So I'll refer to that example use case to put things in context. All right, next slide.
All right, so very quickly: I think people are familiar with prompt engineering, but I think there's been confusion between what I call prompt crafting and prompt engineering. Prompt crafting is what you do when you interact every day with ChatGPT or your favorite LLM and you get smart about crafting a nice prompt. Prompt engineering, to me, at least in the context of this talk, is when you want to
use an LLM as an API and bring that inside your product. Then you have to do some prompt engineering and do things properly there, not just craft around and ask it to answer your question. So prompt engineering, to me, is adding the proper context, probably from your product, your application.
It's specifying an answer format, right? Saying: hey, LLM, I don't want you to write an essay. I want you to return a JSON blob with, you know, your confidence score and some SQL, and maybe some hints as to how I could improve my prompt. So you can define a spec, specify the format that you want. It can be bringing structure to its output, limiting scope, setting guardrails,
and measuring success, which is what we're gonna talk about today. The idea behind measuring success is that there's so much you can do when you write these prompts, right? You have this natural language interface. You can say: hey, I'm gonna provide some sample data, or I won't, or just five rows of sample data.
I will specify which SQL dialect I might want, or I will put something in all caps, like IMPORTANT: don't forget to do this. There's so much you can do there, and it will all influence the result. And since LLMs are so probabilistic and potentially unpredictable, I think it becomes even more important to have that systematic way of measuring success.
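To make that concrete, here's a minimal sketch of the kind of format-constrained prompt being described. The template wording and the helper names (`build_prompt`, `parse_response`) are hypothetical illustrations, not Promptimize's API or the exact prompt used at Preset:

```python
import json

# Hypothetical prompt template: instead of free-form prose, we dictate an
# output schema -- a JSON blob with a SQL string and a confidence score.
PROMPT_TEMPLATE = """
You are a SQL assistant. Given the schema below, answer the question.

Schema:
{schema}

Question: {question}

IMPORTANT: respond with ONLY a JSON object of the form
{{"sql": "<your SQL query>", "confidence": <float between 0 and 1>}}
"""

def build_prompt(schema: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(schema=schema, question=question)

def parse_response(raw: str) -> dict:
    # Because we dictated an output schema, parsing becomes mechanical;
    # a malformed response is itself a measurable test failure.
    return json.loads(raw)
```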
Here I'm gonna go really quickly on this slide. It's just pointing out that the blog post talks about the parallels between test-driven development, TDD, and prompt engineering, applying some of those concepts and ideas and translating between the two. A lot of it is applicable, right?
It's intricately similar; there are a lot of transferable concepts. There are also things that are very, very different, and that's what I talk about in the blog post. For instance, with a normal TDD testing library, either you're right or you're wrong. In the case of an LLM, of prompt engineering, it might be 50% right,
or 80% right. So we're not assuming that everything is boolean when doing this. If you're interested in the parallels and differences between test-driven development and how it applies to prompt engineering, there's more in the blog post about that.
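As a tiny sketch of that difference (the `graded_score` helper is made up for illustration, not from any library):

```python
# Classic TDD assertion: binary pass/fail.
assert (2 + 2) == 4

# A prompt "test" instead returns a graded score in [0, 1], because an
# LLM answer can be 50% or 80% right rather than simply right or wrong.
def graded_score(response: str, required_words: list[str]) -> float:
    found = sum(w.lower() in response.lower() for w in required_words)
    return found / len(required_words)

print(graded_score("SELECT name FROM users", ["select", "users", "limit"]))  # 2/3
```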
Now, what does the prompt iteration lifecycle look like? Here I'm taking a page from the TDD book: first, you define your use case and the desired AI behavior. What's the input, what's the context, and what's the output you want from the AI?
Then, before writing your first prompt, you define your test cases. You define a suite of tests like: if I ask this question, given this database schema, I expect the following SQL, or SQL that contains these columns or these tables. And then we enter the iteration loop, where you run your tests and evaluate the results, right?
Is my prompt performing? You refine the tests, perhaps, and refine the prompt, and then you loop. You stay in that loop until you reach the point of: I can measure and know for a fact that this AI is helpful, my prompt is successful, and it matches the success criteria to bring it to my product.
And then, and only then, is when you would bring this prompt into production, right? Similarly, once you bring your prompt into production, you'll probably want to improve it. There's a lot you can do; you wanna do some more engineering and get it to be more accurate.
You wanna be able to measure that your new, fancier prompt, one that's using a different model, perhaps, or a fancier vector database, or just a slightly fancier prompt-generation technique, is actually working better than your previous prompt, right?
And you want performance metrics: how long does it take, and how much does it cost to run my test suite against, say, the OpenAI API? And you wanna loop over that. What I'm talking about today is really this iteration loop and building this suite of prompt cases, sketched below.
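Here's a minimal, library-free sketch of that "define the cases first, then iterate on the prompt" workflow. Everything in it, including the canned `call_llm` stub, is hypothetical scaffolding rather than Promptimize itself (which is introduced next):

```python
# Prompt cases written BEFORE the prompt exists: schema + question in,
# expectations out. These pin down the behavior we want from the AI.
CASES = [
    {
        "schema": "CREATE TABLE orders (id INT, user_id INT, amount DECIMAL);",
        "question": "total order amount per user",
        "expected": ["orders", "sum(amount)", "group by"],
    },
]

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM call, so the sketch runs as-is.
    return "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id"

def score(sql: str, expected: list[str]) -> float:
    # Partial credit: fraction of expected fragments found in the SQL.
    return sum(1 for frag in expected if frag in sql.lower()) / len(expected)

def run_suite(prompt_template: str) -> float:
    scores = [
        score(call_llm(prompt_template.format(**case)), case["expected"])
        for case in CASES
    ]
    return sum(scores) / len(scores)  # refine the prompt, re-run, compare

print(run_suite("Schema: {schema}\nQuestion: {question}\nReturn only SQL."))
```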
All right, I'm waiting for the slide... yeah, so, introducing Promptimize. There's a GitHub repo. The general idea: Promptimize is a testing library. I'm sure you're all familiar, at least the engineers in the room, with the testing libraries in Python; there's pytest, right? And then there are some specialized libraries too in, say, JavaScript, like Enzyme for React, or React Testing Library.
So Promptimize is really a bit of a testing library and toolkit: a way to express tests, a way to run tests, and a way to evaluate results from tests. That's what it's about. In the next slide, I'll talk a little bit about some of the core features of Promptimize. It's a way to define prompt cases as code, attach evaluation functions to those prompt cases, and potentially generate prompt variations dynamically.
Since you express your prompt cases, your tests, as code, you can generate some prompts dynamically in some cases. You can execute and rank across different engines, so it's a prompt test runner. And then it reports on performance: it produces a report with the detail of each prompt, whether it succeeded or not, what the input was, what the output was, and compiles results on top of that.
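Because prompt cases are plain Python objects, you can generate variations programmatically. A sketch of that idea follows; the `PromptCase` constructor and `evals.any_word` helper are as I understand them from the project's examples, but treat the details as illustrative in such an early project:

```python
from promptimize.prompt_cases import PromptCase
from promptimize import evals

# One prompt case per SQL dialect, generated from a single template --
# possible precisely because prompt cases are just code.
dialects = ["postgresql", "mysql", "sqlite"]
prompt_cases = [
    PromptCase(
        f"In {dialect}, write a query that counts the rows in table t",
        lambda x: evals.any_word(x.response, ["count", "COUNT"]),
    )
    for dialect in dialects
]
```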
Here's an example we ship, if you're interested in seeing the mechanics of it and how it works. If you pip install promptimize, you will get these examples, so you can play with them.
This is an example where we craft a prompt that generates a Python function. This specific example is: you give a prompt like "write a function that tests if a number is a prime number and returns a boolean," and then we attach some evaluator functions to it. In this case: is two prime, is four prime, is seven prime?
What Promptimize does is let you express these prompts in a specific way, then run them, evaluate them, and produce reports. Next slide.
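A rough sketch of what that prime-number example boils down to; the exact version ships with `pip install promptimize` under its examples and may differ in detail:

```python
from promptimize.prompt_cases import PromptCase

# Prompt the LLM for a Python function, then grade the response.
# A stronger evaluator could exec() the generated code and literally
# check is_prime(2), is_prime(4), and is_prime(7); here a simple
# substring check stands in for that.
prime_case = PromptCase(
    "Write a Python function that tests if a number is a prime number "
    "and returns a boolean.",
    lambda x: 1.0 if ("def" in x.response and "return" in x.response) else 0.0,
)
```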
So here we have a fancy little CLI, promptimize run, and this shows some of the options. It's a smart test runner: it will find your tests, your prompt cases, and run them for you. There are also some features that allow you to do human review.
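In practice, that looks something like the snippet below; flags and output evolve quickly in an early project, so consider this a sketch rather than exact CLI documentation:

```bash
pip install promptimize

# Point the runner at a folder (or module) containing your prompt cases;
# it discovers them, runs them against the configured LLM, scores each
# one, and writes out the results.
promptimize run ./examples
```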
So if you wanna force a yes or a no in some cases, you can do that. There's a bunch of little fancy features around running things. And then, next slide, a very quick, high-level overview of some of the reporting functionality. On the right side, you can see an example of a prompt case run for a single test.
We produce a big YAML file, a big kind of report file, that has all the details of everything that ran in the background. So for each test... call it a suite run, right? Maybe you have 600 tests that evaluate my text-to-SQL. I go and run those 600 tests,
and I will have the atomic detail of each test that was performed; it produces this big YAML file with all the details for that suite. Using this, you can also run report commands where you can see the percentage of success, and the percentage of success per category.
And eventually, since you have these YAML files, you can ask questions of them: how long did it take on average? What are my p99, p95, and p50 of how long it took to execute the prompts? What's the average number of tokens? So you can build on
these reports and build statistics. Some features we don't have yet, that are intricate and interesting: you could take two test runs and diff which tests succeeded and failed between the versions, so you can understand that this version is performing better overall but actually performing worse in certain cases.
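As a sketch of what "building on these reports" could look like, here are a few lines computing latency percentiles from a report file. The file name and YAML keys (`prompt_cases`, `duration_seconds`) are made up for illustration; the real report's structure will differ:

```python
import statistics
import yaml  # pip install pyyaml

with open("suite_run.yaml") as f:
    report = yaml.safe_load(f)

# Hypothetical structure: one entry per prompt case, each with a latency.
latencies = sorted(case["duration_seconds"] for case in report["prompt_cases"])

def percentile(values: list[float], p: float) -> float:
    # Nearest-rank percentile over the sorted latencies.
    idx = min(len(values) - 1, round(p / 100 * (len(values) - 1)))
    return values[idx]

print("p50:", percentile(latencies, 50))
print("p95:", percentile(latencies, 95))
print("p99:", percentile(latencies, 99))
print("mean:", statistics.mean(latencies))
```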
So it's all about producing that and having a more systematic approach to evaluating whether your prompts are working well or not, whether they're more expensive, or whether they take more time. Next slide. This is just another pointer to the blog post; there's so much more information there.
And then there's the GitHub, and you can pip install promptimize and play with it. For me, it's an ambitious project in some ways, but it's pretty contained. What's really interesting here, to me, is the approach, the reference implementation. We use Promptimize at Preset to test our prompts, but it's still a very, very early project.
So if you wanna get involved, it's a good time to influence the future of the project and shape its direction, since it's super early on in this field and in this specific project and community. And I believe that was my last slide. Or there might be another one. I think it says "That's all, folks." That is all, folks.
That is awesome. I threw the blog in there, and I also let people know in the chat that I am a huge fan of this. I mean, we talked about it for like an hour and a half last week, and it's just so cool to see the idea of being able to really put some numbers around your models, how the models are working, how your prompts are working with the models, and your specific use case.
So I love it. We're gonna be dropping that podcast episode probably in a couple of weeks, so if anybody does not watch our podcast, this is a great moment for me to plug it right here. Scan that QR code and you will see Max and me chopping it up in a bit. Dude, Max, awesome. Thank you so much.
I'm gonna kick you off now cuz we're like 10 minutes behind, and I've been telling everybody I'm gonna be like a clock today. You might as well call me Ringo, because I'm keeping the time so tight. But I am not, man. I have lost that title. That's all I wanna say, dude. Thank you though, Max.
And I'm gonna be in San Francisco in two weeks, so hopefully we can meet up if you're around. Yeah, let's try to see if we can meet for a drink or something. And then, I believe there's a Slack too. So I'm gonna go into Slack and take questions and talk to people for a little while.
I'll be active on the Slack if you wanna ping me directly, and I'll be scanning the channel. Happy to connect with people and even get on a Zoom with whoever's interested. Here we go. I don't wanna distract from the main stream; like, this is live, you should probably be here, giving full attention to this livestream right here.
That's it. There are some questions coming through in the chat that I just dropped into our chat, because we're streaming on a different platform, so you can go through there and see. And I'll have the Slack open and the livestream. Here we go. Alright, man.
All right. Thank you. That was a pleasure. See you, dude. It was great.