Decoding Prompt Version History
Shreya Shankar is a Ph.D. student in data management for machine learning.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
LLMs are all the rage nowadays, but their accuracy hinges on the art of crafting effective prompts. Developers (and even automated systems) engage in an extensive process of refining prompts, often undergoing dozens or even hundreds of iterations. In this talk, we explore prompt version history as a rich source of correctness criteria for LLM pipelines. We present a taxonomy of prompt edits, developed from analyzing hundreds of prompt versions across many different LLM pipelines, and we discuss many potential downstream applications of this taxonomy. Finally, we demonstrate one such application: automated generation of assertions for LLM responses.
Decoding Prompt Version History
AI in Production
Demetrios [00:00:05]: Next up, we get the pleasure of talking with Shreya Shankar. What's up?
Shreya Shankar [00:00:14]: Can you hear me?
Demetrios [00:00:15]: I can hear you loud and clear. And I've got to tell people about the last time you came to one of these conferences. What happened was, I think you froze. So I don't know if you're hearing me.
Shreya Shankar [00:00:29]: I'm hearing you, but go ahead.
Demetrios [00:00:31]: All right, cool. Well, I'll tell people about how, a few minutes before you were about to come onto the panel, you were like, hey, I actually have to get in a car right now. I don't think I can make this panel. I was like, I don't care. Do it in the car. It's all good. So you, like an absolute gangster, did this panel from the back of an Uber.
Shreya Shankar [00:00:55]: It was a great time.
Demetrios [00:00:59]: It was awesome. So you've been up to quite a bit since we last spoke.
Shreya Shankar [00:01:06]: Yeah.
Demetrios [00:01:06]: Are you going to tell us about it?
Shreya Shankar [00:01:08]: I'm going to tell you about a little bit of it. Still exploring a lot. The ecosystem is moving so fast that maybe tomorrow I'll be doing something else.
Demetrios [00:01:16]: Oh, my gosh. Tell me about it. That is so wild. So just to give people context, I think the first time that we chatted when you came on the podcast, you were doing stuff in databases. And more recently you've been doing stuff with evaluation and making sure that hallucinations don't happen. And you put out a paper called Spade, which I'll drop in the chat right now in case anyone wants to have a read. I loved it. I loved it so much.
Demetrios [00:01:47]: I made a video out of it, actually.
Shreya Shankar [00:01:49]: I loved your hat.
Demetrios [00:01:50]: That was the highlight. Excellent. Well, I'm going to pull up your screen right now and I'll let you get rocking.
Shreya Shankar [00:02:02]: Great. Thanks, Demetrios, for having me. Excited to be here today for a lightning talk. I love lightning talks. I'll be talking about our work on prompt version histories. I'm a PhD student at Berkeley, and I'm super interested in ML and AI engineers: how do they use their tools? What are their pain points, and how can we build better tools for them? LLM pipelines in particular are quite exciting, and everybody has a lot of thoughts on what it's like to build them. On one hand, it's super easy, because all you need to do is write a string to a model and get back some magic response. Awesome. But it's also really hard. So much is out of your control as a developer. If you're trying to do this at scale, across many, many different prompts, are the responses going to be correct and reliable? Who knows? Change a prompt a little bit, or have the LLM provider change the model from underneath your feet, and you might get completely different responses with no way of knowing. So let's take a step back. In the research world, I do both HCI and data management research, and I ask, okay, this is a crazy ecosystem.
Shreya Shankar [00:03:18]: Crazy. How are AI engineers even writing prompts? Is there anything we could learn from this? Across models and tasks, is prompt engineering similar? Are there similar ways people edit prompts? I don't know. So, to get into why I thought this could be interesting: I saw people doing prompt engineering in the wild, and I noticed that valuable information is often hidden in these version histories.
Shreya Shankar [00:03:47]: So say somebody's trying to write a pipeline to summarize a document. Their first prompt might look like this: summarize the document, return your answer in Markdown. Then, after seeing some responses, they might add an instruction: if there is sensitive information, do not include it in the summary. As a researcher, when somebody adds guidelines like that, I'm like, that's interesting, because now I know a little bit more about what the developer cares about. I know that this is a common failure mode in the LLM: it's outputting sensitive information. And you can apply that logic over various prompt versions.
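To make the running example concrete, here is a rough sketch of the kind of version pair being described; the exact prompt wording is illustrative, not copied from any real pipeline:

    import difflib

    # Hypothetical first version of the summarizer prompt from the running example.
    prompt_v1 = (
        "Summarize the following document. "
        "Return your answer in Markdown.\n\n"
        "{document}"
    )

    # Hypothetical second version: the developer has added an exclusion
    # instruction after noticing summaries that leak sensitive information.
    prompt_v2 = (
        "Summarize the following document. "
        "Return your answer in Markdown. "
        "If the document contains sensitive information, do not include it "
        "in the summary.\n\n"
        "{document}"
    )

    # The raw diff shows what changed; the interesting part is what the change
    # implies: sensitive information is a failure mode this developer cares about.
    for line in difflib.unified_diff(
        prompt_v1.splitlines(), prompt_v2.splitlines(), lineterm=""
    ):
        print(line)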
Shreya Shankar [00:04:31]: So maybe the LLM doesn't know what sensitive information means. But if you say race, ethnicity, gender, then the LLM is better at listening to this instruction and so forth. You can have a professional tone. The bottom line here is that prompt edits can tell us what care about, as well as what mistakes LlM make. And that intersection of like, what do people about? And what are LlMs bad at?
Shreya Shankar [00:04:57]: That's where we can have really great tools.
Shreya Shankar [00:04:59]: We can write prompts for people. We can write assertions for people. We can do all sorts of automation there. So as part of a bigger research project, we reach out to Lang chain. They have a product called Langsmith. We asked them, do you have version history for prompts? Can we mine them? Can we analyze them? So we collected 19 llm pipelines across different domains, engineering, conversational AI, bargaining, et cetera. And we sat there and we annotated every single diff these tasks, and we found that there were a few patterns. So people might just modify a few phrases of the prompt.
Shreya Shankar [00:05:38]: Rarely are people rewriting the whole prompt from scratch. People might have inclusion instructions, is what we call them. Always include the source and maybe rephrase that just a little bit. Sometimes rephrasing does not semantically modify the prompt. Who knows? We kind of did this over a bunch of prompts, and we came up with this taxonomy of how people edit prompts. Broadly speaking, there's like a structural category and a content based category. The structural categories were more focused towards what format should the response be in? Maybe clarifying that a little bit further. And then all the content based prompt deltas or edits involved some semantic change to the prompt.
Shreya Shankar [00:06:25]: So maybe that's including a new instruction to exclude something, including some qualitative criteria that the human cares about, et cetera. And I have a bunch of edits here, I'm going to skip over that. But we came up with all of these categories, and in our research paper that we put the spade paper that Demetrius linked, we applied this to generating assertions. But you could also imagine applying this taxonomy to build tools to rewrite prompts. To be more clear, if you know what kinds of instructions humans care about, or LLMs are really bad at following, then maybe we can steer prompts in those directions. One thing I'm really excited about is thinking about prompt version control, right? Raw text diffs are not very meaningful, but if we have this larger categorization, we can hopefully prevent ourselves from trying ideas we've tried in the past, or ideas or collaborators have tried in the past. But in the rest of this talk, I'll just focus specifically on generating the assertions based on the taxonomy. Okay, so why assertions for LLMs? I really hope I don't have to convince you guys.
Shreya Shankar [00:07:34]: Everybody knows LLMs have reliability issues. When deployed at scale, they hallucinate, don't follow instructions. It might say something undesirable, who knows? And to put this in perspective, or a concrete example of the same document summarizer pipeline, you might have a prompt like this and you're feeding it in some medical document. To summarize, maybe the first response is just fine summarizing it, passing all of the things that the human cares about. The second response might include my name, might include some human sensitive information, and you can imagine all sorts of combinations of errors occurring. And the famous one, I'm sorry, but as a language model trained by OpenAI, that one could obviously be checked for insertion. So the point is, developers need to detect these issues with assertions. And why automatically write them? Well, one, it requires coding experience to write, which not everyone have, and then two, it's quite tedious, right? If somebody is prompt engineering to try to make that first prompt on the left hand side of the screen, the stocking, the summarizer prompt.
Shreya Shankar [00:08:42]: They also don't want to be burdened with coming up with prompts for every single assertion. Thinking about what evaluation means, maybe we can kind of get them a large step of the way there. So how do we synthesize assertions with the taxonomy? First, we formulate the higher level problem, generate assertion functions that cover failures, and then also don't generate hundreds and hundreds of functions, just a few. If you're interested in that kind of optimization, check out the paper. I'm just going to talk about the taxonomy part here, how we use that taxonomy to generate candidate assertion. So we take in a prompt or prompt delta, and then we ask an LLM hey, based on this taxonomy that we've created, come up with as many criteria, assertion criteria as possible, and then tag it with the category that it belongs in. So is it exclusion instruction, qualitative criteria, inclusion extraction, count based instruction, et cetera, and also include a link to some source. We found that when people saw the source highlighted for why the assertion criteria was generated, it was easier to trust that.
Shreya Shankar [00:09:50]: Of course then once these criteria are extracted, then we use code generation to create a Python function to check for that given the prompt and response, and we allow there to be a call to ask an LLM, a yes or no question, or the assertions can be just basic python code. And in the paper we talk about how we mine this for rerouce redundancies in this and ensure correctness with an optimization based solution. So with linechain we kind of deployed a very early version of this in November. These numbers are probably a little bit off because probably more, and it was quite exciting. People more or less liked the idea and somebody even deployed it somewhere, which I thought was really fun. But right now what we're looking at is how to deploy the spade system in an interactive system, interactive prompt engineering system, so that these assertions can be generated for you as you edit your prompt, which is exciting. But with that I'm going to stop because I know I'm probably running over time. And thanks everyone for having me watch.