MLOps Community

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines // Arnav Singhvi // AI in Production Talk

Posted Mar 07, 2024 | Views 851
# DSPy
# self-refining
# constraints
# assertions
SPEAKERS
Arnav Singhvi
Research Scientist Intern @ Databricks

Recently graduated undergrad in CS at UC Berkeley. Researcher with the Stanford NLP Group, contributing to the DSPy framework.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Chaining language model (LM) calls as composable modules is fueling a powerful way of programming. However, ensuring that LMs adhere to important constraints remains a key challenge, one often addressed with heuristic “prompt engineering”. We introduce LM Assertions, a new programming construct for expressing computational constraints that LMs should satisfy. We integrate our constructs into the recent DSPy programming model for LMs, and present new strategies that allow DSPy to compile programs with arbitrary LM Assertions into systems that are more reliable and more accurate. In DSPy, LM Assertions can be integrated at compile time, via automatic prompt optimization, and/or at inference time, via automatic self-refinement and backtracking.

TRANSCRIPT

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

AI in Production

Slides: https://docs.google.com/presentation/d/1q94DYIy5vQ1tvlq7NFtdFIZYMIexBAbFFBAhEzBldlA/edit?usp=drive_link

Demetrios [00:00:05]: And who better to give it to us? And Mister Arnav, you've had your hands full recently. I can imagine. DSPy has been blowing up, my man.

Arnav Singhvi [00:00:18]: Yes, it definitely has. And I think it's a great opportunity for us to go from this shift of prompt engineering to learning how to program language models and use them in our pipeline. So happy to see how that goes.

Demetrios [00:00:33]: Well, I'm a fan ever since I talked with Omar and I wrapped my head around it, and it's cool to see the love and support that it's been getting all over the place. And so I thought, who better to bring us on home than you, talking about what you do best?

Arnav Singhvi [00:00:54]: Yep.

Demetrios [00:00:54]: Do you want to share your screen or anything before we get started?

Arnav Singhvi [00:00:57]: Let me get that up. And can everyone see the screen?

Demetrios [00:01:02]: Oh, yeah.

Arnav Singhvi [00:01:04]: Great.

Demetrios [00:01:06]: All right, man, I'll talk to you in 20 minutes.

Arnav Singhvi [00:01:08]: All right, thanks, Demetrios. Everyone, hope you've had a great day of all these amazing talks at the conference, and I'll be happy to take you all home with DSPy Assertions. So, my name is Arnav, I'm part of the DSPy team as a lead, and I'll be going over DSPy as a quick briefer and then getting into more of the details of this talk, which is DSPy Assertions. Let's get started. So, with the advent of language models, as we all know, prompting has become a very interesting part of interacting with language models, and in fact, of how we prompt them. And this has led to the emergence of prompting techniques such as chain of thought, using step-by-step logical reasoning to come to a deductive answer, or things like retrieval-augmented generation, where we give the model an external source of information and allow it to make its generation based off that, and even more nuanced techniques like ReAct agents, where the language model observes a user's interactions and makes its decisions based off this accumulation of input. Now, where these techniques truly become effective is when we stack these language models and techniques together in multi-stage prompting pipelines. For example, on the left, we can see Baleen, which utilizes a retriever and a language model and generates queries iteratively, eventually producing the information that leads to its final prediction.

Arnav Singhvi [00:02:34]: Similarly, pipelines like DIN-SQL chain together hand-prompted LLMs to decompose text and turn it into SQL queries. RARR does a similar thing to Baleen, except it applies post-generation, self-introspective refinement to the outputs. Now, these diagrams look great, and these techniques are very modular in principle, but when it comes to implementation, it's not always that easy. Often we take our complex tasks and have to break them down into subtasks. Then we have to identify, okay, what are the best models, what is the best data to tune the prompts on, what are the best prompting techniques? And as we break this down for each subtask, we then have to figure out how we can tie them all together, and how we can tweak all of it to ensure a very cohesive process. Again, conceptually it sounds pretty clear to say I want to add a five-shot chain of thought with retrieval-augmented generation using hard negative example passages. But once we get down to a finer level of implementation, it gets quite tricky. Now, open source frameworks thrive on string templates and essentially realize these pipelines by producing very, very lengthy and often hand-wavy string templates to accomplish the task. For instance, we see this almost 607-character input of a SQLite prompt for taking a question and generating its syntactically correct SQL query.

Arnav Singhvi [00:04:00]: Now, as users, we see these great outputs coming from the model, but when we dive deep into understanding how this works, we come across repositories or last-page appendices of papers and find these crazy long prompts, and we don't know what to do with them. And this is widespread amongst open source frameworks. So how do we truly solve this and apply prompting at a scalable level? We introduce DSPy. DSPy's core philosophy is programming, not prompting, models. We believe in the language model as a tool or an input within your entire programmatic prompting pipeline. You can now take various modules or various layers and connect them together, and not have to deal with manually tweaking a language model to get it to do what you want. Instead, we replace that with our three pillars of DSPy, the first being signatures. Signatures remove the manual tweaking of prompts by directly declaring what your inputs are and what your outputs are.

Arnav Singhvi [00:05:02]: Say you'd like to do a QA task. Well, you simply state that the question is the input and the answer is the output. If you'd like to do a summarization task, you simply state that you will be entering a long document and what you'd like is a summary. The second pillar of DSPy lies in modules, such as dspy.ChainOfThought and dspy.ReAct. Our ideology is that we would like to make prompting techniques modular: any prompting technique that emerges can be implemented in its own class in DSPy, we can develop a layer for that module, and then when we create large prompting pipelines, we can simply stack these together. So now we erase this business of trying to figure out where we can integrate which prompting techniques, what works best with others, and this complex chain of understanding how prompting techniques fit together, and instead work with them in a modular fashion. And we'll take a look at this in upcoming examples.
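
For reference, here is a minimal sketch of what those two pillars look like in code. It uses DSPy's inline string signatures and the ChainOfThought module; treat it as illustrative rather than the exact code from the slides.

```python
import dspy

# Signatures: declare only what goes in and what comes out.
qa = dspy.Predict("question -> answer")
summarize = dspy.Predict("document -> summary")

# Modules wrap prompting techniques; swapping techniques is a one-line change.
qa_with_reasoning = dspy.ChainOfThought("question -> answer")

# (Assumes a language model has already been configured via dspy.settings.configure.)
prediction = qa_with_reasoning(question="What castle did David Gregory inherit?")
print(prediction.answer)
```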

Arnav Singhvi [00:06:02]: Lastly, the name of the game with language models is performance, and how we can enhance performance, that is the third pillar of DSPy, which is optimizers. Optimizers reduce the need for manual prompt engineering and try to identify how you can get your prompt to its best performance, by simply asking you to input a DSPy program, maybe a small set of examples, and a metric you'd like to maximize. The optimizers then internally do the work for you and produce an automatically optimized prompt, which you can apply during inference to obtain enhanced performance. Let's get a little concrete. Let's look at an example of a multi-hop question answering pipeline that works on the HotPotQA dataset, which takes in an input question and will output an answer for complex multi-hop questions. On the right you can see an example question: how many storeys are in the castle that David Gregory inherited? We can see here that our pipeline, in its diagram form, includes a language model call and a retrieval model call, and it does this over a few hops to ultimately get the final answer. How do we represent this in DSPy? Well, we mirror a very similar approach to PyTorch (which is where we get the "Py" in DSPy), where we have an initialization method in which we define the layers that we are going to include in our pipeline. So we can see here that we have one layer that is generating a query, which takes in a set of context and a question.

Arnav Singhvi [00:07:27]: We have another layer that retrieves the set of information. And our final layer is the answer generation layer, which takes in all this context and then produces the answer to the question. In our forward pass, we simply state this logic in its iterative Python workflow fashion. We show that we have a context, and over a for loop we build up that context with the retrieved passages. And finally, once we have reached the maximum number of hops, we generate the final answer. Now under the hood, what does this mean? What does dspy.ChainOfThought over this context and question to produce a query actually do? Well, on the left you can see that it simply takes in your signature, applies a very basic instruction of what the language model should do, provides it a corresponding format, and then fills in your inputs based on what you provide. So for example, if you have a set of passages related to Gary Zukav and the question is which award did Gary Zukav's first book receive, it will now apply chain-of-thought, step-by-step reasoning and think of how to generate a query relevant to this question, which we can see here on the right as it generates: what award did Gary Zukav's first book, The Dancing Wu Li Masters, receive?
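
A rough sketch of the multi-hop program being described, in the PyTorch-like style DSPy uses (module and field names follow the talk; the hop count and retrieval depth are illustrative defaults):

```python
import dspy

class MultiHopQA(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        # The three layers described above: query generation, retrieval, answer generation.
        self.generate_query = dspy.ChainOfThought("context, question -> query")
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for _ in range(self.max_hops):
            # Each hop: generate a query from what we know so far, then retrieve more passages.
            query = self.generate_query(context=context, question=question).query
            context += self.retrieve(query).passages
        # After the final hop, answer using the accumulated context.
        answer = self.generate_answer(context=context, question=question).answer
        return dspy.Prediction(context=context, answer=answer)
```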

Arnav Singhvi [00:08:43]: Now moving on, let's get deeper into the DSPy compiler, also known as the DSPy optimizers. As I mentioned earlier, all this takes in is your DSPy-declared program, a set of a few examples, and your metric. Now think about it like this: your compiler is a teacher that takes this program and runs it programmatically on the examples. Given your metric, it determines which are the good examples and which are the bad examples, and based on what the good examples are, it retains that as good behavior. That good behavior is then propagated down into an improved DSPy program, and it can be passed on to a variety of student DSPy programs. So now much of this work is being done in the compilation phase, and your inference phase is simply using the automatically optimized prompt. Now let's take a look at what this means.
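
In code, that compile-then-infer workflow looks roughly like the sketch below, reusing the MultiHopQA sketch above. The training examples and optimizer settings are placeholders; `answer_exact_match` is the metric named in the talk.

```python
import dspy
from dspy.evaluate import answer_exact_match
from dspy.teleprompt import BootstrapFewShot

# A handful of labeled examples; the compiler bootstraps demonstrations from them.
trainset = [
    dspy.Example(question="Which award did Gary Zukav's first book receive?",
                 answer="National Book Award").with_inputs("question"),
    # ... more examples ...
]

optimizer = BootstrapFewShot(metric=answer_exact_match, max_bootstrapped_demos=2)
compiled_program = optimizer.compile(MultiHopQA(), trainset=trainset)

# Inference now simply reuses the automatically optimized few-shot prompt.
result = compiled_program(question="How many storeys are in the castle that David Gregory inherited?")
```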

Arnav Singhvi [00:09:36]: So with our multi-hop program, if we were to take a few examples that have questions and answers, apply bootstrap few-shot optimization, and then compile our multi-hop program, we would see a prompt something like the one on the left. So we still have our instructions of context, question, and query. We still have that same format, but now we have two examples of what that format looks like with actual content. And this content was built out when the optimizer did its compilation, went over the examples, and found good examples that produced queries that were correct according to the answer exact-match metric. What this means is we can move away from these very complexly derived or manually hand-waved templates and adopt these automatically optimized few-shot templates, and use them during inference without having to worry about how they were generated. We believe that since we've provided a metric and we have a set of examples, we will provide demonstrations to the language model that are easy to learn from and reproduce. So what this really means is you can take a more complex or larger model, have it compile and teach good behavior, and that behavior can be propagated down to smaller models to learn from. For instance, if you take GPT-4 and have GPT-4 produce these nice generations that are based on our metric, we can now have GPT-3.5, which may not do the same thing on its own.

Arnav Singhvi [00:11:02]: It will now learn from this behavior and include that in its generations, thereby optimizing performance. But don't just take my word for it; we can actually look at the DSPy paper and see that these gains actually exist, and not just on large, highly capable models. We see them with just GPT-3.5 and Llama 2 13B chat on the metrics of answer exact match and passage retrieval recall. We can see that from basic vanilla predict programs, we can go from a 34.3% answer metric all the way to 54.7% in the multi-hop setup. And we can see similar gains with Llama 2, where it goes from 27.5% to 50%. So we definitely see the gains when we use compilation strategies from DSPy. Now, where we can get really concrete on performance is enabling DSPy programmers to define constraints on how language models should behave.

Arnav Singhvi [00:12:00]: Currently, when you interact with any language model, you'll often find yourself saying, don't say this, or output it like this. But we need to be more advanced than that. We need to have language models understand how they can self-critique their generations, or apply some kind of backtracking to ensure these constraints are met. But how can we do this in a programmatic fashion within DSPy? That's where we introduce DSPy Assertions, where we guide language model behavior through programmatic constructs. DSPy Assertions focus on providing instruction hints and incorporating your rules as a programmer to allow the language model to adhere to your guidelines. Additionally, since we integrate techniques like backtracking and self-correction, DSPy programmers can use assertions to refine their outputs, evaluate these over tasks, and improve performance. Lastly, this can be incredibly helpful for simple debugging: just as assertions in Python help you understand where your program is going wrong, you can similarly use assertions in DSPy to see where your model is performing poorly and where it needs improvement.

Arnav Singhvi [00:13:10]: So assertions in DSPy, again, are a very simple construct, where we define two kinds of assertions. We have dspy.Assert, which is a strictly enforced constraint. What this means is that over a certain number of attempts, if it does not pass the function or the constraint you have asserted, it will fail, let you know that, and halt any further execution. So we see here in the dspy.Assert statement, we provide some kind of validation function, which can be defined in Python. We take in the language model outputs and we return a boolean. If this is true, it continues forward. If it's false, that means the statement has been triggered. What it means to be triggered is that we will now send an instruction message to the language model, and this includes your past outputs and then your instructional feedback for refinement.
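
As a small illustration of that construct (the task and validation function here are made up for the example; `dspy.Assert` itself is the documented API):

```python
import dspy

def is_one_sentence(text: str) -> bool:
    # Arbitrary validation logic, written as plain Python, returning a boolean.
    return text.count(".") <= 1

class OneSentenceQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        answer = self.generate(question=question).answer
        # Hard constraint: if this still fails after the allotted backtracking
        # attempts, execution halts and the error points at this assertion.
        dspy.Assert(is_one_sentence(answer), "Answer in a single sentence.")
        return dspy.Prediction(answer=answer)
```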

Arnav Singhvi [00:13:56]: Now, this was the assert case, where if it doesn't pass the validation function after two attempts, or however many backtracking attempts you allow, it's going to fail your execution, halt, and let you know this is where the language model is going wrong. However, if you would like to conduct a full end-to-end evaluation and you feel like you've tested your language model well enough to understand where it's going wrong and where it's going right, you can use dspy.Suggest statements, where it will continue the execution, and if it does not pass the constraint, it will leave it unresolved while logging the failure, but it will let you know that it has applied best-effort refinement. So this can be game changing, for the simple fact that you can use dspy.Suggest statements within your program evaluation and get enhanced performance on any of the constraints you apply. Let's take a closer look at what this means. Going back to the multi-hop program that we defined, let's introduce two suggest statements. The first one is that the query should be concise. So we have a simple Pythonic check that the length of the query should be less than 100, and we simply state the feedback as "the query should be less than 100 characters."

Arnav Singhvi [00:15:05]: Okay, the next statement: the query should be different from previous ones. This is for obvious reasons; we don't want to give the retriever the same query and get the same passages, so let's have them be distinct. We store the queries in a list, and we determine that the new query should be distinct from these past queries. Now let's say your query fails: the length of the query was not less than 100 characters. This will trigger a backtracking attempt within DSPy under the hood, modifying its signature internally to dynamically add in your past query and the instruction we provided here, that the query should be less than 100 characters. The module will then reattempt with this new signature, given this information about where it went wrong, and try to improve on that in its next generation. So let's take a look at how this works.
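
In code, those two suggestions sit inside the forward pass, roughly like this (building on the MultiHopQA sketch above; the wording of the feedback messages follows the talk):

```python
import dspy

class MultiHopWithSuggestions(MultiHopQA):
    def forward(self, question):
        context, prev_queries = [], [question]
        for _ in range(self.max_hops):
            query = self.generate_query(context=context, question=question).query

            # Soft constraints: a failure triggers a backtracking retry with feedback,
            # but execution continues (and the failure is logged) if it stays unresolved.
            dspy.Suggest(len(query) <= 100,
                         "Query should be short and less than 100 characters.")
            dspy.Suggest(query not in prev_queries,
                         "Query should be distinct from previous ones: " + "; ".join(prev_queries))

            prev_queries.append(query)
            context += self.retrieve(query).passages

        answer = self.generate_answer(context=context, question=question).answer
        return dspy.Prediction(context=context, answer=answer)
```

To actually turn the backtracking on, the DSPy assertions API wraps the program, for example with `assert_transform_module(MultiHopWithSuggestions(), backtrack_handler)` from `dspy.primitives.assertions` (or the equivalent `activate_assertions()` helper); the exact entry point may vary across versions.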

Arnav Singhvi [00:15:54]: For instance, let's say we have the context of John Laub, who's an American criminologist and distinguished university professor in the Department of Criminology and Criminal Justice at the University of Maryland, College Park. So our question was: the place where John Laub is an American criminologist and distinguished university professor in the Department of Criminology and Criminal Justice was founded in what year? Now, our past query took all of that wording and just used it as the query, which isn't really helpful, because if we just look closely at the context, we know that John Laub is at the University of Maryland. So why not just search that instead of all the wordy verbiage that we had from the question? And that's what we do when we include DSPy Assertions. Now let's take advantage of the DSPy compilation and optimizers while including assertions. If we were to do a vanilla compilation, we would definitely see enhanced performance with the optimizers, but this is without including assertions. Whereas if we include compilation with assertions, we have a teacher that has assertions, but we don't have the student with assertions. So what this means is your compiled model is asserting good behavior, and it's hoping that the student can effectively learn from that teacher.

Arnav Singhvi [00:17:06]: This may not always work in all cases, so we can even give assertions to the student. So this works on both ends, where the teacher has assertions during compilation time, and when the student tries to implement that same behavior and fails, it gets a few more attempts to better correct its behavior. Now, for the evaluation of DSPy Assertions, we tested this with the GPT-3.5-turbo model, and we used the ColBERTv2 retrieval model, which is a search index over a Wikipedia abstracts dump. We tested on the HotPotQA dataset, which is a set of questions and answers that are more complex and require some kind of multi-hop reasoning to come to a conclusion. We tested this on 200 train, 300 dev, and eventually 500 test examples. The metrics we tested on were defined in two categories: the first being intrinsic, which simply means how well the language model's outputs adhered to the assertions we constrained it with, whereas extrinsic metrics capture more downstream performance, such as checking for answer exact match, passage retrieval, etcetera. When we tested DSPy Assertions, we extended beyond simple QA tasks and introduced new complex tasks such as long-form QA, where we generate long-form responses to questions that include citations in a specified format for every one to two sentences. Another one we included was quiz generation, where we take HotPotQA questions and generate answer choices for those questions in a JSON format.

Arnav Singhvi [00:18:37]: And then we included tweet generation, which takes your questions and generates a captivating tweet that effectively answers the question. So looking at what long-form QA looks like: it's very similar to the multi-hop program, except now, instead of generating an answer from the context and question, we are generating a cited paragraph. And where we can apply our DSPy Assertions lies in the checks for whether every one to two sentences indeed has a citation following that text, and whether each of those citations is a faithful citation. So this introduced a new nuance of using LLMs as a judge of more intangible metrics like faithfulness, engagement, and plausibility. As you can see here, we saw almost a 13% to 14% increase in citation faithfulness simply by including assertions, and we saw gains across citation recall and citation precision. And on the extrinsic metric, which was the downstream task performance (simply whether or not the long-form generation has the answer), we saw gains as well.
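
A sketch of what those checks could look like as suggestions. The helper below is a hypothetical stand-in for the citation-format check; in the actual work the faithfulness check is itself an LLM-as-a-judge call rather than a string check, which is only hinted at in the comments.

```python
import re
import dspy

def has_enough_citations(paragraph: str) -> bool:
    # Crude format check for illustration: at least one "[n]"-style citation
    # for every two sentences in the paragraph.
    sentences = [s for s in re.split(r"(?<=\.)\s+", paragraph.strip()) if s]
    citations = len(re.findall(r"\[\d+\]", paragraph))
    return citations * 2 >= len(sentences)

# Inside the long-form QA module's forward pass (sketch; names are illustrative):
#   paragraph = self.generate_cited_paragraph(context=context, question=question).paragraph
#   dspy.Suggest(has_enough_citations(paragraph),
#                "Every 1-2 sentences should include a citation in the format [source].")
#   dspy.Suggest(is_faithful(paragraph, context),   # is_faithful = an LLM-judge metric, not shown
#                "Every cited sentence must be supported by its referenced context.")
```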

Arnav Singhvi [00:19:40]: Let's take a look at what this meant. So, an example of long-form generation without the assertions: we saw a predicted paragraph, with some text that I've truncated here, that didn't particularly follow the format of keeping our citation after a sentence but before the period of the sentence. Additionally, we saw that some of the text it mentioned wasn't really faithful to the context. For instance, here it says there was no mention of this player playing for this team or any specific club in the context; therefore the text is not faithful to that context. When we include assertions, we now see that all of our citations for every one to two sentences are maintained in the exact format we wanted, and additionally, all the text is deemed to be faithful to the context it referenced. Next up is the quiz generation task. Now, it's very common to see that language models are not really good at producing answer choices, or any form of information, in JSON format. But with assertions we saw almost a 60% gain simply by stating that the format of the answer choices should be JSON, which is great.

Arnav Singhvi [00:20:48]: Additionally, another intrinsic metric was that the answer choices should include the correct answer, obviously, because this is a question-answer quiz format. Next up was a plausibility check: beyond just the correct answer choice, were the other answer choices deemed to be plausible distractors, essentially good enough that good test takers could not easily identify the correct and incorrect answers. All of this was checked with a downstream metric check of validity, and we saw almost 40% to 45% gains when including assertions. Now, looking at a closer example of what this means, we can see that even without assertions, sometimes the model would generate the outputs in valid JSON format while including the correct answer. However, if we look closely, the model deemed that it didn't have plausible distractors in its generations, and the reason is that the university mentioned here is Exon, the abbreviation for the University of Exeter, based in the United Kingdom. Anyone who knew this was the University of Exeter could easily identify that none of these other abbreviations are real abbreviations of actual colleges, so clearly Exon should be the correct answer. When we applied our assertions and said to make the distractors actually plausible, it complied, and it put in real abbreviations of real universities in the United Kingdom: Oxford University, Cambridge University, the London School of Economics, and the University of Exeter.
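
Sketched as a DSPy module, the intrinsic checks just described could look like this (the signature and field names are illustrative; the plausibility check would be an LLM-as-a-judge call and is only hinted at in a comment):

```python
import json
import dspy

def parses_as_json(text: str) -> bool:
    # Format check: the generated answer choices should be valid JSON.
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

class QuizChoiceGeneration(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_choices = dspy.ChainOfThought("question, correct_answer -> answer_choices")

    def forward(self, question, correct_answer):
        choices = self.generate_choices(question=question,
                                        correct_answer=correct_answer).answer_choices
        dspy.Suggest(parses_as_json(choices),
                     "The answer choices should be formatted as valid JSON.")
        dspy.Suggest(correct_answer in choices,
                     "The answer choices should include the correct answer: " + correct_answer)
        # A third suggestion would ask an LLM judge whether the remaining
        # choices are plausible distractors for a competent test taker.
        return dspy.Prediction(choices=choices)
```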

Arnav Singhvi [00:22:20]: So we can actually see the gains working here, not just in basic formatting checks, but in actual intangible checks of things like plausibility. Lastly, we looked at the task of tweet generation, where, as we can see here, it again mirrors a very similar format to multi-hop generation, except instead of generating a paragraph or an answer, it's generating a tweet. And this was a much more complex task simply because of the sheer number of constraints we wanted it to follow. We had things like user and post constraints: if you go to ChatGPT and ask it to generate a tweet, more often than not it includes hashtags, but hashtags are kind of old. I don't really see people using hashtags in tweets, so let's not use them, and let's use that as a constraint to remove any hashtag phrases. Similarly, we can have platform-imposed constraints. Let's say you don't have Twitter Blue or the premium version of Twitter, and you only want your tweet to be under 280 characters; we can simply impose that as well.

Arnav Singhvi [00:23:18]: Of course we want the tweet to have the correct answer, or else it's not really a valid tweet, and we want the tweet to be both engaging and faithful to the context it references. So we saw significant gains, going from 21% of tweets not including hashtags to 71%. So it was really, really good at understanding user and post constraints and not including hashtags going forward. We even saw varying gains across its engagement levels, going from almost 1 to 90.7%, its faithfulness levels going from 63% to 75%, and its overall quality increasing by a good 14 points. Now, if we take a look at an example without assertions: since we have so many constraints, it's natural that some of these tweets will follow some constraints and won't follow others. For instance, we see here that it's within the length of 280 characters and it does have the correct answer to this question of when Mount Vesuvius erupted, which was 79 AD. But we can clearly see that it has hashtags, and we can also infer that it's not really engaging and just simply gives a statement answering the question; moreover, it's deemed to not be faithful to its referenced context. So how can we improve this? We simply include our assertions, and we find, lo and behold, a very, very engaging-looking tweet that does not have hashtags anymore and is now deemed to be faithful to its context.
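
Put together, the tweet constraints described above translate into suggestions along these lines (a sketch; the signature and field names are illustrative, and the engagement and faithfulness checks would again be LLM-judged):

```python
import dspy

class TweetGeneration(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_tweet = dspy.ChainOfThought("question, context -> tweet")

    def forward(self, question, context, answer):
        tweet = self.generate_tweet(question=question, context=context).tweet

        # User/post constraint: no hashtags.
        dspy.Suggest("#" not in tweet, "Do not include any hashtags in the tweet.")
        # Platform-imposed constraint: standard tweet length.
        dspy.Suggest(len(tweet) <= 280, "Keep the tweet within 280 characters.")
        # Correctness constraint: the tweet must contain the right answer.
        dspy.Suggest(answer in tweet, "The tweet must include the correct answer: " + answer)
        # Engagement and faithfulness are suggested the same way, but the boolean
        # comes from an LLM-as-a-judge metric rather than a string check.

        return dspy.Prediction(tweet=tweet)
```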

Arnav Singhvi [00:24:38]: You can even see, compared side by side, that this tweet is a lot more dynamic and a lot more captivating, while maintaining its validity of having the correct answer right here. So the main takeaway from DSPy Assertions is the idea of moving forward and no longer asking your prompts to do this or do that, but instead using programmatic constraints while you're building out your DSPy programs. The entire principle and philosophy of DSPy is that we don't want to manually tweak our models; rather, we want to build out robust, systematic pipelines for prompting them. Constraints play a very big role in this, where you can simply assert your constraints in a programmatic fashion and have them improve your performance. This adds modular depth, because we are using complex pipelines while still staying true to the constraints we want to maintain. And lastly, we enhance performance by using assertions alongside DSPy optimizations, and we see greater performance across these complex tasks. So I hope everyone has enjoyed this talk and learned a lot more about DSPy and DSPy Assertions. Please take a look at the DSPy repository and learn more about how you can take your prompts and your manual templates and turn them into exciting DSPy programs. Thank you.

Demetrios [00:25:53]: Oh yeah, my hat goes off to you man. I love what you all are doing and I got so many questions. I'm going to let some questions trickle through here from the gallery, but if I understood it correctly when I was talking to Omar last, the idea is you don't want to be prompt first, right? You don't want to have, because prompts are an area where it's almost like a bottleneck, or it's a single point of failure that you can have. And so to create a more robust system, you want to try and get around the prompts. What I saw you doing, though, like, in those assertions, feels like there were prompts.

Arnav Singhvi [00:26:40]: Right, right. Yeah. So that's actually a very good point about DSPy Assertions. And the idea is the feedback we're giving to the language model is not really tuned to be something like "behave in this way" or "do this thing." It's simply imposing what your constraint is. If it included hashtags, you're just telling your model to not include hashtags. Now, naturally, this opens up the question of, well, okay, how well could you prompt it to really follow your constraints? And this leads into new constructs that we have in DSPy, like signature optimizers, where you're now going to take your instruction, or take your signature, and then use that as the hyperparameter to optimize over. So let's say, you know, the feedback I gave for my constraint wasn't giving me great performance.

Arnav Singhvi [00:27:30]: I can now optimize to get a better instruction that serves that cause.

Demetrios [00:27:36]: Okay. I'm still, I think I kind of understand, but I guess this is what.

Arnav Singhvi [00:27:45]: Yeah, the idea is you don't want to tell your model, these are, like, ten different things I want you to follow: do this, do that, do that. It's kind of on an as-needed basis. If it failed a validation check, if you found out programmatically, while running the program, that something went wrong, you just want to inform it of where it went wrong and what the past outputs were, and then have it work its magic under the hood.

Demetrios [00:28:10]: Okay. Okay. So there's a question coming through here about how the assertions work under the hood.

Arnav Singhvi [00:28:20]: Yeah. So essentially it's taking whatever your signature was originally in the program and just adding two more components to it, those components being your past outputs and the feedback message you provide. And the feedback message, like we just discussed, can be a little bit prompt-heavy in one sense, but most of the work is being done by the fact that it can see its past output and see where it went wrong, and it will understand how to fix that, essentially. Like, we're not trying to do more than that. We're just trying to give it an informed response about what to fix.
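
Conceptually, the mechanism he is describing amounts to extending the original signature with those two extra input fields. The sketch below is an illustration of that idea using class-based DSPy signatures, not DSPy's literal internal code:

```python
import dspy

class GenerateQuery(dspy.Signature):
    """Generate a search query for the question."""
    context = dspy.InputField()
    question = dspy.InputField()
    query = dspy.OutputField()

class GenerateQueryWithFeedback(dspy.Signature):
    """Same task, but the LM also sees its failed attempt and your feedback."""
    context = dspy.InputField()
    question = dspy.InputField()
    past_query = dspy.InputField(desc="previous query that failed the constraint")
    feedback = dspy.InputField(desc="e.g. 'Query should be less than 100 characters'")
    query = dspy.OutputField()
```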

Demetrios [00:28:58]: Yeah. So this is a great question that just came through from Alon, asking: it feels like this approach adds a lot of LLM calls to the pipeline; do you consider latency in your research results? And I guess from my side, do you also consider cost, because of all the LLM calls and the costs they add up to?

Arnav Singhvi [00:29:25]: So this is definitely a great question. To the point of the cost: the idea is, when we compile with larger models and we run inference with the smaller models, your compilation is only going to the extent of finding some number of good examples. Let's say you want two few-shot examples to be included in your prompt; by the time you've found your best two out of the ten, that's all you have to run on that large model, and the rest of the cost goes to the small model, which is what you eventually wanted to test. So we're not imposing a great cost difference there. In terms of latency, there definitely are a lot more calls, but we offer multi-threaded settings where it won't be as bad as otherwise. But yeah, not too much exploration there yet.

Demetrios [00:30:16]: And I think one of the things that definitely blew my mind as I was looking into this and talking with Omar was the compilers and how you can go through all of that. Can you break that down a little bit? Just, like, the way that it fans out, almost, if I understand it correctly. And I'm probably doing a really bad job at explaining it; that's why I got you here.

Arnav Singhvi [00:30:44]: Yeah, yeah. So I guess I can quickly share the slides again, so it gives people a little bit of a cleaner visual. Yes, the idea is, this is the basic construct of the compiler, where we're taking in a program and essentially running this program on the set of examples. And what I mean is, when we have these generations in the few-shot prompt, these were some form of training examples that were run during compilation, and it was deemed that this process of retrieving a context, performing some reasoning, and then generating this query was the correct or optimized approach. So going forward, when we use this bootstrapped program, it will always use these examples in the prompt.

Arnav Singhvi [00:31:35]: So that way, any other data that's run through this program is learning how to do this context-question reasoning to produce a query in exactly that kind of fashion. So that kind of breaks down what it means to use the compiler and then use the improved DSPy program, if that makes sense.

Demetrios [00:31:53]: Okay, but there are different types of compilers, right? Or am I...

Arnav Singhvi [00:31:57]: Yes, yes. So in this presentation I did not include that, but there's a wide variety of compilers: we can do bootstrapping of the few-shot examples, we can do bootstrap fine-tuning, we can do things like signature optimization. So yeah, there is a wide variety of those kinds of compilation strategies.

Demetrios [00:32:16]: Okay, fascinating. All right, so there's another question that came through here from Gonzalo. How token-heavy is the end-to-end use of this? You mentioned SQL query creation starting off your talk; it would be great to see an example of DSPy creating appropriate SQL.

Arnav Singhvi [00:32:36]: Yes, yes. I'm not sure if that's in the works yet, but that would definitely be a good avenue to test out. I mean, we encourage anyone who would like to do that to try it out and learn for themselves. But yeah, it is a bit token-heavy in that sense, and we are making a good number of calls to the language model, but this is not an all-or-nothing kind of case. If you are not using metrics like LLM-as-a-judge, or you would like to use distilled versions of the DSPy optimizations, you can use DSPy's BootstrapFewShot and not do a random search over an X number of candidates. Naturally this is going to lead to a little bit of a dip in performance, but, you know, that's the tradeoff for not making all those calls.

Demetrios [00:33:24]: Yeah. And do the compilers optimize the retry prompt for when the assertion or suggestion fails?

Arnav Singhvi [00:33:32]: Yes. So there is a mix of both counterexamples and correct examples. If the suggestion fails, it has some past information and then a fix to that, and it includes that in the prompt as well, or it can simply just give it the correct generation. So it's a mix of that, and that happens in the third scenario I mentioned, where both the teacher has assertions and the student has assertions.

Demetrios [00:33:58]: Okay, so going back to that teacher-student scenario. Yeah, I don't know how I feel about the naming convention on this. It sounds a little naughty. And it reminds me of that teacher-student problem that happened where they got married and then it was illegal, but we can talk about that offline. So anyway, going back to that whole scenario, the teacher and the student... Hold on, let me see if I'm reading this question correctly. Wouldn't the examples that work in these scenarios vary among different models? So it works for the teacher, but then for the student it potentially doesn't work.

Arnav Singhvi [00:34:58]: Right, right. So the idea is, if you were to test it on the student without any influence from the teacher, it very well would behave as you expect, with no improvement. But given that we've given it a prompt where it can see examples of how it's supposed to work, we can in some sense see that enhancement even from the student model, if that kind of makes sense. For example, GPT-4 behaves one way, and GPT-3.5 doesn't behave as well as GPT-4 on its own. But if we show GPT-3.5 how to behave, it can essentially learn and behave in that fashion, if that makes sense.

Demetrios [00:35:37]: So the few-shot examples that work for the teacher would not work for the student if different models are used, or... you would just teach it how. So you would say... Yeah. That's why you call it the teacher. All right, now it all makes fucking sense.

Arnav Singhvi [00:35:55]: That's where it comes from.

Demetrios [00:35:56]: Okay, forgive me for being a little slow here. We've only got a few more questions left, man, and you've been very kind with your time. So, have you thought about leveraging NVIDIA RTX locally to increase speed and cut down costs? I guess that's not quite a question; that's more just a consideration. Yeah, a consideration. What if the pass/fail criteria are more nuanced or varied? How are the pass/fail criteria evaluation metrics generated?

Arnav Singhvi [00:36:29]: Yeah, so I believe this is in relation to assertions. It can vary in terms of the complexity of what you are validating. For things that are very easy to check, like length or formatting, it's a simple Python function you could write. For things that are intangible, we use LLMs as a judge, and that can be a lot more variable, like citation faithfulness: if an LLM is judging it, we would maybe want a stronger or larger model doing that evaluation. It's not going to be perfect in all scenarios, and there are obvious discrepancies in where that can lie. But I would always say the idea of building out a metric should be a program, whether in programming-language checks or validation metrics.

Arnav Singhvi [00:37:19]: All of that should be programming. And anytime you're interacting with your language model, that should be through DSPy's robust design format, so you're not really focused on worrying too much about how well the model is doing in its metric determination; you're trying to build out a pipeline that effectively communicates your goals and the constraints of the task.

Demetrios [00:37:44]: Yeah, I think this was a huge unlock for me when looking at this, too. It's like, I think one thing Omar said was pipelines, not prompts.

Arnav Singhvi [00:37:52]: Yes. Yeah.

Demetrios [00:37:54]: So incredible. Well, dude, you brought us home. Now we are going to say good night. Thank you, Arnav. This was excellent, man. Everybody can continue the conversation with you, reach out to you on LinkedIn and Twitter, all that fun stuff. We'll drop the links in the chat.

Demetrios [00:38:15]: I'll talk to you later, dude.

Arnav Singhvi [00:38:17]: All right. Yep. Thanks, everyone. Have a great day.
