Beyond Guess-and-Check: Towards AI-assisted Prompt Engineering
Alex Cabrera is a Ph.D. candidate at Carnegie Mellon University. He works on human-centered AI, specifically in applying techniques from HCI and visualization to help people better understand and improve their AI systems. He is supported by an NSF Graduate Research Fellowship and has spent time at Apple AI/ML, Microsoft Research, and Google.
I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!
"I'll tip you $100". "Don't be lazy". Have you caught yourself adding these phrases to your prompts? Prompt engineering is central to developing modern AI systems, but it often devolves into an ad-hoc process that requires tribal knowledge and countless iterations of guess-and-check. We believe prompt engineering can be improved significantly and explore how we can use AI itself to guide prompt writing. Learn how intelligent "prompt editors" and synthetic data generation can supercharge your AI development workflow.
AI in Production
Slides: https://docs.google.com/presentation/d/1vAeeKF0BKQJRyYGr-FVvJWJD7EX1zEez/edit?usp=drive_link&ouid=112799246631496397138&rtpof=true&sd=true
Adam Becker [00:00:05]: Alex. We're stoked to hear from you. I'm going to come back in ten minutes, and so the floor is yours.
Alex Cabrera [00:00:12]: Awesome. Thanks so much. Yeah, thank you guys for having me. It was a lot of fun doing this last time, so I was excited to get invited back for this. My name is Alex. I'm a PhD candidate at Carnegie Mellon. I'm defending my thesis next month, so I can't say that for too much longer. And I've been building developer tools for the last five or six years.
Alex Cabrera [00:00:35]: What I talked about last time, what I've been spending kind of the last few years on, is a platform called Zeno, which is this kind of end-to-end evaluation platform that lets you do more of this fine-grained evaluation and exploration of benchmarks. It's a lot of fun. If you're interested in evaluation, I encourage you to take a look. But I'm actually going to be talking about something quite different today, kind of a newer project, on doing AI-assisted prompt editing and prompt engineering. We'll see, live demo gods; hopefully they're smiling down on me. So something that we saw again and again when we were talking to users who are using Zeno, who are doing deeper evaluation of their models, is that while the idea of having a very deep end-to-end evaluation pipeline is great, what a lot of the dirty work ends up being day to day is prompt engineering in these oftentimes larger chains. I think the whole LLM community has kind of gathered around this paradigm of these essentially data chains or text chains, where you have prompts linked up to different modules, like function calls or vector databases, chained together. But at the end of the day, you can't get away from having to write prompts.
Alex Cabrera [00:01:51]: You might have a super sophisticated chain that has broken down the tasks into little prompts, and maybe whole agentic, graph-based logic. At the end of the day, each one of those nodes has to be some sort of prompt, some sort of model, oftentimes a prompt to OpenAI or some local model. And when we talked to people, they said, hey, this is kind of an ad hoc process where we do a lot of guess and check. We usually start with some initial prompt, have a couple of example test cases, see if they're failing, or maybe we get a new test case in production, add it to our test set, and use that to update our prompt, in this super tight loop. And when you ask really good prompt engineers how they got good at prompt engineering, it's a lot of just practice, a lot of just trying it over and over again and learning the nuances of exactly how these models perform. You've probably seen blog posts about how to get good at prompting that are like, hey, just tell the model you're going to tip it $1,000. Or just tell it not to be lazy. Or, just saying "return JSON" doesn't work?
Alex Cabrera [00:02:53]: How about you put it in all caps and put a bunch of exclamation marks? And that got us thinking. We saw this pattern again and again when talking to people. Is this kind of it? Is this just what prompt engineering is, and it's never going to get easier? You just have to go through this painful process of learning? But we took a look at what the existing tools are in some of these other creative processes. So, for example, when you're writing text: sure, writing is hard and it takes a lot of time to get good at writing, but you have intelligent tools like Grammarly to help you out.
Alex Cabrera [00:03:27]: Same with code: programming is hard, takes a lot of domain knowledge and a lot of skill. But we have tools now like Copilot that make it a lot easier. The same can't really be said for prompt editing, for prompt engineering. It kind of lives in text editors, sometimes in playgrounds, but there's nothing that helps you. There's no AI assistance that says, here are good heuristics, here are common ways of improving your prompts, and guides you in this process. So that's kind of the premise that we went for. We've been working on this only for a few weeks, so it's a little rough around the edges, but I'd love to get your feedback on this broader vision of what it would mean to have AI assistance in the prompt engineering process, instead of just having it for other processes like code or text editing.
Alex Cabrera [00:04:12]: So I'll try to show a quick, brief demo. So here we're just going to create a new prompt and start with something simple. This could be like a custom GPT if you wanted, or in this case, we're just going to treat it as one of the nodes in a longer pipeline. So I have one very simple prompt here that we're going to use, which just tries to extract numeric scores from text movie reviews, so we can call it movie score extraction. So we'll start with this simple prompt, click next, and what we do first is generate a bunch of potential example inputs. Oftentimes this won't be very useful if you have very domain-specific data, and you can paste in your own, but at least it gives you some heuristic, like, okay, at least I know that these are good examples. Maybe they'll be useful in our case. We're going to be like, hey, these are good enough for this use case of potential movie reviews we might want to extract data from. So you click next, and you get into the main interface.
Alex Cabrera [00:05:13]: So what you'll see is that we automatically ran a model on this input, so in this case we're using just GPT-3.5 Turbo. But you could change which model you're using, and you get automatic outputs. So this looks interesting. And immediately maybe I start thinking, hey, I want to do this in like a data pipeline. I don't need all this text, I just need the number, so I can say something like "do not return anything but the number." And I can save this prompt. It gets saved in the background. I get new outputs, which seem to be working pretty well in this somewhat simple task.
Alex Cabrera [00:05:51]: Say maybe some of these examples weren't working or there was something interesting. You could then go in and update your prompt. What's interesting though, you'll see on the left here we have these things, these suggestions that have popped up, and these are what we call editors that give you, it's kind of like Grammarly, it might say like, hey, there's a grammatical error here. Hey, you might want to reword this to be more clear. These are what we consider similar concepts, but for prompts. So these are common heuristics that given your prompts and your inputs and outputs, we can try to guide you towards better prompts. So in this example, when we hover over, don't use negation. Oftentimes it's just better to tell your model what it should do instead of what it shouldn't do.
Alex Cabrera [00:06:29]: So we can see it actually underlines the area that it detected, where you're saying "do not return anything but the number." And we can actually go ahead and click apply, and it'll actually rewrite that part of the sentence with this new update, and then we can save that and run it. So we have other ones. Maybe you want to actually output JSON, so we can be like, okay, "return JSON instead of a number and use the key output," and we can save this. Same thing happens. We get the new outputs, the new JSON, and we get new suggestions. So we have another one, single-instruction sentences, which says that you should maybe use instructions that are single sentences instead of compounding them. So we could apply this if we wanted to, and it split it into two different sentences, so we can save and run.
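A minimal sketch of what a "don't use negation" editor could look like. The talk does not describe how the real editor detects negation or rewrites it, so the phrase list, the `Suggestion` type, and both function names are hypothetical; the rewrite text is supplied by hand here, where the demo generates it.

```python
import re
from dataclasses import dataclass

# Hypothetical phrase list: match a negating phrase through the end of its sentence.
NEGATION_PATTERN = re.compile(
    r"\b(?:do not|don't|never|avoid)\b[^.!?]*[.!?]?", re.IGNORECASE
)


@dataclass
class Suggestion:
    start: int    # character offset where the flagged span begins
    end: int      # character offset just past the flagged span
    text: str     # the flagged text, i.e. what the UI would underline
    message: str  # the advice shown on hover


def find_negations(prompt: str) -> list[Suggestion]:
    """Flag spans that tell the model what not to do."""
    return [
        Suggestion(m.start(), m.end(), m.group(0),
                   "Tell the model what it should do instead of what it shouldn't.")
        for m in NEGATION_PATTERN.finditer(prompt)
    ]


def apply_rewrite(prompt: str, suggestion: Suggestion, replacement: str) -> str:
    """Replace only the flagged span, leaving the rest of the prompt untouched."""
    return prompt[:suggestion.start] + replacement + prompt[suggestion.end:]


prompt = "Extract the score from the review. Do not return anything but the number."
flagged = find_negations(prompt)
print(flagged[0].text)
print(apply_rewrite(prompt, flagged[0], "Return only the number."))
```

Keeping the character offsets on each suggestion is what makes both the underline and the one-click "apply" possible without re-parsing the prompt.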
Alex Cabrera [00:07:25]: So this is early. Our editors aren't great right now, but hopefully you see that, over time, we can start learning patterns of what good prompts are. We have other ones, like shorten. This one's new, and I don't know how well it works, but ideally it can keep the same meaning of your prompt, but shorten it to save you money, lower the token cost for each inference run, et cetera. We also have an ability to generate similar instances. So as you already saw, we kind of saturated our test set super quickly. We're not seeing any cases where it's actually going to fail, so we could add additional ones and say something like, "This movie was awful, I never want to see it again." And this is interesting.
Alex Cabrera [00:08:08]: It gave it an output of zero, but maybe we actually want it to return null. So we add "if no number is present, return null," and we can save and see if this actually worked. Luckily it did. Maybe when we add more of these test cases, we can also generate data automatically. This hasn't been working great. This part is a little rough around the edges, but you can imagine generating similar instances, similar reviews that don't have numbers or that reflect this template. So you can start growing this test set. We do also have the ability.
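The zero-versus-null case above can be captured as a small regression check. This is a sketch under assumptions: it supposes the model was asked to return JSON with the key `output` (as in the demo), and the helper names and the hard-coded raw outputs are illustrative rather than taken from the tool.

```python
import json


def parse_output(raw_output: str):
    """Return (ok, value); ok is False unless the text is a JSON object with an "output" key."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, None
    if not isinstance(data, dict) or "output" not in data:
        return False, None
    return True, data["output"]


def failing_cases(cases):
    """cases: list of (review, raw_model_output, expected_value) tuples."""
    failures = []
    for review, raw_output, expected in cases:
        ok, value = parse_output(raw_output)
        if not ok or value != expected:
            failures.append(review)
    return failures


# Raw outputs are written by hand here; in practice they would come from the model.
cases = [
    ("Solid thriller, 8/10.", '{"output": 8}', 8),
    ("This movie was awful, I never want to see it again.", '{"output": 0}', None),
]
print(failing_cases(cases))  # the second review fails until the prompt returns null
```

JSON `null` parses to Python `None`, so once the prompt is updated and the model emits `{"output": null}`, the second case passes.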
Alex Cabrera [00:08:49]: As someone who cares deeply about evaluation. You can then add labels if you wanted to and start calculating more formal metrics and start doing comparison across models. So you can go kind of deeper down the more robust evaluation and checking from this initial testing phase. Yeah, so that's all I got. This is very early. I actually just broke it in production, but it is out there if you want to test it. I would love any feedback on what works, what doesn't, what are maybe interesting ideas for editors or ways of generating test cases? Do you want other models? Very curious to see kind of what people think about this. Cool.
Alex Cabrera [00:09:33]: And yeah, it's available at endoftext.app. I'll fix it right after this, in the next 15 minutes or so. The prompt isn't updating right now.
Adam Becker [00:09:44]: I'm going to try to keep you for five more minutes. So maybe in the next 20 minutes. Perfect. Okay, I think we got at least a couple of questions from the audience. Let me start at the top. Kay says, and Apurva upvotes: love the idea. How would you keep it updated? It feels like a lot of these prompt rules keep changing as the underlying models are being updated.
Alex Cabrera [00:10:12]: Good question. So if I'm interpreting this correctly, it's how do you make sure that the prompts are still good when the underlying models are changing? It's a great question if I understand that correctly.
Adam Becker [00:10:26]: Maybe you understood it correctly. I took it as like, how do you even keep the suggestions up to date, given that the way the prompts are processed is different, but that might be similar.
Alex Cabrera [00:10:39]: Interesting. So the way we have it set up now is every time you save a model, it gets saved in the background as like a new version, and then all the suggestions get updated dynamically with that model and the inputs. And then some editors are actually input based. So we have one that checks your prompt, and if you tell it to do something, it actually checks the outputs and sees if it matches that. So you can actually refresh. And if you have new inputs, it'll actually update the suggestions as well. If the underlying models are changing, that's hard. Ideally, maybe we should have a rerun for the current model.
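A sketch of an output-based editor of the kind described here: it reads an instruction out of the prompt and checks whether the outputs follow it. The specific rule (a prompt that mentions JSON should produce JSON objects) and the function name are assumptions chosen for illustration; the talk does not say which instructions the real editor checks.

```python
import json


def outputs_violating_json_instruction(prompt: str, outputs: list[str]) -> list[int]:
    """Return indices of outputs that are not JSON objects, if the prompt asks for JSON."""
    if "json" not in prompt.lower():
        return []  # the prompt makes no JSON request, so there is nothing to check
    failing = []
    for index, output in enumerate(outputs):
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            failing.append(index)
            continue
        if not isinstance(parsed, dict):
            failing.append(index)  # valid JSON, but a bare value rather than an object
    return failing


prompt = "Return JSON with the key output."
outputs = ['{"output": 8}', "The score is 8.", "8"]
print(outputs_violating_json_instruction(prompt, outputs))
```

Because the check depends on outputs as well as the prompt, it has to be rerun whenever new inputs are added, which matches the refresh behavior described above.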
Alex Cabrera [00:11:18]: You can't really rerun it unless you change the prompt. Yeah, actually something we ran into is even if you have temperature at zero, with OpenAI, they use a mixture of experts model, so it's not deterministic. So that's something actually we were interested in. Maybe we track how many times the output changes for the same model and give you some count of how, I don't know, unreliable it is.
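One way the "count how often the output changes" idea could be sketched: call the model several times on the same input and summarize the disagreement. The `model` argument is any callable from input text to output text; the stub below stands in for a real API call, and the returned field names are hypothetical.

```python
from collections import Counter
from typing import Callable


def output_instability(model: Callable[[str], str], model_input: str, runs: int = 5) -> dict:
    """Run the same input several times and report how much the outputs disagree."""
    outputs = [model(model_input) for _ in range(runs)]
    counts = Counter(outputs)
    most_common_output, most_common_count = counts.most_common(1)[0]
    return {
        "distinct_outputs": len(counts),        # 1 means every run agreed
        "agreement": most_common_count / runs,  # share of runs giving the modal output
        "most_common": most_common_output,
    }


# Stub model that returns a scripted sequence, imitating a nondeterministic API.
scripted = iter(["8", "8", "7", "8"])
report = output_instability(lambda _text: next(scripted), "Great movie, 8/10.", runs=4)
print(report)
```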
Adam Becker [00:11:42]: Yeah. Like the sensitivity of it or.
Alex Cabrera [00:11:44]: Yeah.
Adam Becker [00:11:45]: So a few people are asking. Mo is asking, is this tool accessible to test? I think the answer is yes. That might have been before he put up the slide. Mohammed is asking: nice idea. There is some research about using EA to evolve prompts automatically. EA, I imagine, means evolutionary algorithms. It would be a nice feature. That's actually a little bit closer to what I had in mind you might do.
Adam Becker [00:12:12]: Have you considered doing things like that? Are you thinking about integrating?
Alex Cabrera [00:12:16]: Yeah, we've looked at. I did some literature review. Like, I know there's a DeepMind paper that's kind of interesting that uses some sort of genetic algorithm to mutate the prompts, test them, et cetera. I find those very interesting. I think they mismatch a little bit with the reality of people doing prompt engineering. They almost always rely on having a really big test set that has ground truth labels with a good metric that you can actually test your evolving prompts against. And that's almost never the case for people in production creating these models, unless you're like super sophisticated and have huge test sets. Most people actually have an idea, they implement the model and they don't have any test cases and are actually doing this manual, quick iteration of refining your mental model of what you want your prompt to do in the system.
Alex Cabrera [00:13:03]: So I think eventually the idea is, as you gather more test cases and have more iterations of your prompt, you have more data to do more sophisticated evolution of your prompts, and algorithms to optimize them. So that's the dream. But I think we want to meet people where they really are today, doing prompt engineering, and be like, hey, we can get you off the ground, started with some more or less simple heuristics on how to improve your models. In the long term, maybe once we have more and more data, we can start doing more intelligent optimization.
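To make the dependency on labeled data concrete, here is a deliberately simple mutate-and-select loop. It is a hill climb, far simpler than the genetic algorithm in the paper mentioned above, and not that paper's method. The candidate mutations, the stub model, and all function names are assumptions for illustration. The point is that `accuracy` cannot be computed without ground-truth labels.

```python
import random


def accuracy(prompt, model, labeled_cases):
    """Fraction of labeled cases where the model's output equals the label."""
    correct = sum(1 for text, label in labeled_cases if model(prompt, text) == label)
    return correct / len(labeled_cases)


def hill_climb(seed_prompt, mutations, score, generations=5, rng=None):
    """Append a random candidate instruction each round; keep it only if the score improves."""
    rng = rng if rng is not None else random.Random(0)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(generations):
        candidate = f"{best} {rng.choice(mutations)}"
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Stub model: answers tersely only when the prompt asks for it.
def stub_model(prompt, text):
    return "8" if "only the number" in prompt else "The score is 8."


labeled_cases = [("Great movie, 8/10.", "8")]
best, best_score = hill_climb(
    "Extract the score.",
    ["Return only the number."],
    lambda p: accuracy(p, stub_model, labeled_cases),
)
print(best, best_score)
```

With one labeled example the score is either 0 or 1, which illustrates the mismatch described above: the search is only as informative as the test set behind it.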
Adam Becker [00:13:30]: Yeah, it sounds like you're wearing your product hat.
Alex Cabrera [00:13:34]: Yeah. So, Zeno, the system we've been working on for a while, I think it's very useful for doing this very deep evaluation. When you have large labeled test sets, you can do quantitative analysis and ablations. But this is something that we ran into all the time: people just don't have large evaluation sets, or they don't have metrics that actually correlate with the quality they care about. It's more about this creative process of you iterating and seeing what the output is and refining what you want the model to do. So that also influenced a lot of the design decisions.
Adam Becker [00:14:05]: Regarding the metrics, Andrew here is asking, can you have it harmonize all of the scores such that it has a meta score where it knows that, say, like a four out of five is roughly like an eight out of ten.
Alex Cabrera [00:14:20]: If this is for the specific model, like the prompt that we were optimizing, probably you could have it do whatever you wanted, really, as long as you define your spec well enough, and ideally you have enough test cases that test the balance. If you meant the metric: right now, me and my collaborator, who's also Alex, have a debate. Basically, I'm arguing that we should have no metrics or labels by default, and it's just human judgment as you iterate. But there are some analyses for which you actually need an evaluation metric. Right now, we just support one evaluation metric, which is character overlap, chrF, between the label and the output. Yeah.
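For readers unfamiliar with chrF, a simplified sketch of the idea: average character n-gram precision and recall over several n-gram lengths, then combine them into an F-score that weights recall more heavily (beta = 2). This version drops spaces and skips n-gram lengths longer than either string; it is an illustration and is not guaranteed to reproduce the numbers from any particular library or from the tool itself.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(output: str, label: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF between a model output and its label, in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        output_ngrams = char_ngrams(output, n)
        label_ngrams = char_ngrams(label, n)
        if not output_ngrams or not label_ngrams:
            continue  # one string is shorter than n characters
        overlap = sum((output_ngrams & label_ngrams).values())  # clipped matches
        precisions.append(overlap / sum(output_ngrams.values()))
        recalls.append(overlap / sum(label_ngrams.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(chrf("8/10", "8/10"))         # identical strings score 1.0
print(chrf("8 out of 10", "8/10"))  # partial character overlap scores between 0 and 1
```

A character-level metric gives partial credit for near misses, which also means it can score a wrong number highly when the surrounding text matches.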
Adam Becker [00:15:03]: Okay, I'm going to go through quickly a couple of other ones. One, Vihan is asking: Alex, I imagine this will be very helpful to have AI suggestions on creating better prompts. And me and a friend have fine-tuned a ChatGPT instance created in the playground to do exactly this. However, from a business standpoint, I see this being a really costly solution for enterprise users, where the primary LLM, if I may, is a RAG-based implementation and prompts refer to data in the attached databases for outputs. What are your thoughts?
Alex Cabrera [00:15:35]: I don't think I totally got the issue. Is it saying that if you have a ton of inputs it gets really expensive, or is it something else? I think if you have a RAG pipeline, you usually have the individual prompts that are like, okay, take the top five results, put them in the prompt to actually synthesize an output. So that's a node, that's the LLM that you would optimize. And there you don't have to optimize every possible RAG result. You would have a couple of examples that already have the output scaffolded in, and you're changing things like how you should respond, basically the text around the context that you're putting in. Yeah. So I see this as kind of like, I think people have a lot of ways of building their own pipelines, and it could be just raw OpenAI calls and they're passing strings along.
Alex Cabrera [00:16:24]: Maybe you're using LangChain or something. And I think that's a great place for code tools to do that orchestration. At the lower level, you can't get lower than the prompt, which is this text function: text in, text out. And that's the thing that is being optimized in this case.
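A sketch of the RAG synthesis node described above, treated as a plain text-in, text-out function. The retrieved passages are fixed ("scaffolded in") so that only the surrounding instructions change between iterations. The template layout and the name `build_rag_prompt` are assumptions, not the layout of any particular framework.

```python
def build_rag_prompt(instructions: str, passages: list[str], question: str, top_k: int = 5) -> str:
    """Assemble the text sent to the model: instructions, then top-k context, then the question."""
    context = "\n".join(
        f"[{rank}] {passage}" for rank, passage in enumerate(passages[:top_k], start=1)
    )
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}"


# A fixed example: the retrieval results are frozen, so only the instructions vary.
passages = ["The film runs 142 minutes.", "It was released in 1994."]
question = "How long is the film?"

version_a = build_rag_prompt("Answer using the context.", passages, question)
version_b = build_rag_prompt("Answer in one sentence, citing a passage number.", passages, question)
print(version_b)
```

Because the context block is identical across versions, any change in the model's answer can be attributed to the instruction text, which is the part being edited.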
Adam Becker [00:16:41]: I don't know why, a few people have questions about that. I don't think we're going to be able to tackle them all. I'm going to put the link to the chat here for you, so that you have it.
Alex Cabrera [00:16:56]: Okay, great.
Adam Becker [00:16:57]: So Amit had a question and [email protected] had a question and Apurva had another question. Awesome to hear from you, Alex. This was a very lively, you've fomented the chat in a way that is wonderful, fantastic. I hope that you get as much out of these questions as we all did, seeing what you're building. And best of luck.
Alex Cabrera [00:17:27]: Awesome. Yeah. Thank you so much. It's at endoftext.app, and it's very early. I'd love any and all feedback. Please email me. There's a little help button everywhere that you can use to send it. So I really appreciate it.
Alex Cabrera [00:17:38]: Thank you.