Boosting LLMs: Performance, Scaling, and Structured Outputs
Tom Sabo is an advisory solutions architect at SAS who, since 2005, has been immersed in the field of text analytics and AI as it applies to government challenges. He presents work internationally on diverse topics including deep learning applied to adverse health event assessments, counter human trafficking solutions, and combining text analytics and generative AI for public sector solutions including food safety applications, disease surveillance, and regulations analysis.
Tom developed much of the SAS strategy for bringing innovative work from the public health sector to related health and life sciences sectors. He is also deeply involved in pioneering work combining SAS text analytics solutions with large language models across many government and commercial use cases. He holds a bachelor’s degree in cognitive science and a master’s in computer science, both from the University of Virginia.
Matt is CTO and co-founder at Fuzzy Labs, a consultancy dedicated to using MLOps to help technical teams get the most out of AI and ML. He enjoys AI, bio-inspired computing, and functional programming.
Vaibhav is a software engineer with over 9 years of experience productizing research. At Microsoft, he worked on real-time 3D reconstruction for HoloLens. At Google, he led performance optimization on ARCore and Face ID. Now he's bringing that same experience to improving the quality and speed of generative AI technology.
Ben was the machine learning lead for Splice Machine, leading the development of their MLOps platform and Feature Store. He is now a founding software engineer at Galileo (rungalileo.io) focused on building data discovery and data quality tooling for machine learning teams. Ben also works as an adjunct professor at Washington University in St. Louis teaching concepts in cloud computing and big data analytics.
Bending the Rules: How to Use Information Extraction Models to Improve the Performance of Large Language Models Generative AI and Large Language Models (LLMs) are revolutionizing technology and redefining what's possible with AI. Harnessing the power of these transformative technologies requires careful curation of data to perform in both cost-effective and accurate ways. Information extraction models including linguistic rules and other traditional text analytics approaches can be used to curate data and aid in training, fine-tuning, and prompt-tuning, as well as evaluating the results generated by LLMs. By combining linguistic rule-based models with LLMs through this multi-modal approach to AI, we can help to improve the quality and accuracy of LLMs and enable them to perform better on various tasks while cutting costs. We will demonstrate this innovation with a real-world example in public comment analysis.
Scaling Large Language Models in Production Open source models have made running your own LLM accessible to many people. It's pretty straightforward to set up a model like Mistral with a vector database and build your own RAG application. But making it scale to high traffic demands is another story. LLM inference itself is slow, and GPUs are expensive, so we can't simply throw hardware at the problem. Once you add things like guardrails to your application, latencies compound.
BAML: Beating OpenAI's Structured Outputs We created a new programming language that allows us to help developers using LLMs get higher quality results out of any model. For example, in many scenarios, we can match GPT-4o performance with GPT-4o-mini using BAML. We'll discuss some of the algorithms that BAML uses, how they improve the accuracy of models, and why function calling is good and bad.
Ben Epstein [00:00:06]: We are live. Thank you everybody for joining us today. We had a couple of technical issues, but we're good. We've got Matt, Vaibhav, and Tom all with us today talking about LLMs in production, how to boost performance, how to scale them efficiently. So it's going to be a really good session for anybody starting to build LLMs and put them into production, or if you've started and you're looking for some new tips and tricks to get the maximal performance out of them, both in cost and actual behavior. So we have a lot of talks. We're going to jump right into it this time. We're going to start with Matt and then we'll have some questions and run to the other speakers as well.
Ben Epstein [00:00:43]: So Matt, whenever you're ready, feel free to kick it off.
Matt Squire [00:00:48]: Thanks very much, Ben. So yeah, I'm pleased to be here at this mini summit. It's a very exciting set of topics. I'm going to start off by talking about a case study that we worked on earlier this year where we dealt with a number of challenges with scaling large language models. So I'll talk about what we did, how we resolved them, and what we've learned so far, because this is very much going to be a story of the journey so far for me. We as an industry are so early in figuring out how to effectively deploy, productionize, and scale large language models, and more broadly generative AI, that a lot of what I'm going to say is that this is how we do things right now. That might change next year, and it might change the year after, but at least it's a story about progress with the challenge, let's say. I am the CTO and co-founder of a company called Fuzzy Labs. We're based in the UK and we specialize in machine learning operations, so we help our customers to productionize AI, including large language models. So yeah, I'll talk through a real customer case study.
Matt Squire [00:02:02]: As I say, I'm going to ask this question of what we really mean by scaling as well. And then I'll talk about benchmarking performance, how we can get a baseline that allows us to be principled as we improve things, as we optimize what we've got; a little bit about the results of the experiments we ran on this project; and then what we've learned so far. And then perhaps there'll be a little bit of time for questions and discussion afterwards. So the customer I won't name, but they're not Microsoft, despite what this little GIF might suggest. The reason I've chosen it, Steve Ballmer chanting "developers, developers, developers," is that for our customer, developers were very important. It's enough to say that they are a tech company.
Matt Squire [00:02:50]: They build hardware and they build software platforms for developers to use, when they use that hardware, to build things of their own. So because developers were important, building good relations with those developers was important, and their tech documentation was actually not in a great state. It was difficult to navigate. Their developers often struggled with the intricacies of this customer's products. So we started with a business problem, not a technical problem, right. The business problem this customer wanted to solve was: how do I make it easier for developers to understand our documentation, understand our products, and adopt our products, ultimately growing their developer community? We were tasked with building a large language model to solve this problem, or rather building a product with a large language model to solve it. They wanted a self-hosted stack. The main reason for wanting it to be self-hosted was that they did not want to share their data with other parties.
Matt Squire [00:03:53]: Lots of proprietary stuff there. Not all of their documentation is open to the public. They wanted this in house, but also they wanted to learn. They wanted to learn how to build these things themselves. Large language models weren't a skill they had in house at the time, so they wanted us to take them on that journey too. So I'll speak briefly about the software architecture that we deployed for them. And I share this mostly to frame a discussion around scaling. So I wouldn't worry too much about all of the specifics here.
Matt Squire [00:04:26]: But at the left hand side, we had the idea that a developer here, illustrated by a very capable dog, wants to ask a question, like for example, how do I run a fast Fourier transform on some data? So what the developer is asking is: using your software libraries on your hardware platform, how do I do this thing? And that can only be answered by the customer's documentation. So that query goes over to an orchestration service that's running on Python (well, of course it is), and we use FastAPI to just provide an API there. Nothing too clever there. But what it's going to do is go out to a vector database to get data that provides a context with which we can answer the question. So we're doing retrieval-augmented generation, or RAG, here, which for those familiar with the idea will be no surprise. For those not familiar, just really briefly, it is a very simple concept. The idea is that I have a question. I want data that might answer that question. So I will find data which is related to my question.
Matt Squire [00:05:31]: I'll combine the two things, question and data, and I'll ask the large language model, given the data that you see here, and given the user's question, answer that question. And that's all we're doing here. So a couple of other things to point out. We have guardrails. This was a customer for whom professional reputation was very important, and they did not want to release something to the public and have people ask questions that were inappropriate, questions about, for instance, their stock price, they don't want that. Or questions about competing products, but they also don't want people saying, I don't know, how do I build an explosive using your platform? We definitely don't want that. So the guardrails were doing a couple of things. One of the things they did was check whether the question was on topic, and we actually trained a custom model for that.
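For illustration, here is a minimal sketch of that retrieve-combine-ask flow. The documentation snippets, the /ask endpoint, and the call_llm function are made-up stand-ins, not the project's real API, and the guardrails topic check described above would sit in front of this in the real system.

```python
# Minimal sketch of the RAG flow described above. The documentation snippets,
# the /ask endpoint, and call_llm() are illustrative stand-ins; the guardrails
# topic check in the real system would run before retrieval.
import numpy as np
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

DOCS = [
    "To run an FFT on the device, call fft.run(signal) after initialising the DSP.",
    "The SDK requires firmware 2.1 or later.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(DOCS, normalize_embeddings=True)
app = FastAPI()

def call_llm(prompt: str) -> str:
    """Placeholder for the model server (e.g. Mistral 7B Instruct)."""
    raise NotImplementedError

@app.post("/ask")
def ask(question: str) -> dict:
    # Retrieve: embed the question and take the closest documentation chunks.
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_vectors @ q_vec)[::-1][:2]
    context = "\n".join(DOCS[i] for i in top)
    # Augment and generate: combine context and question into one prompt.
    prompt = f"Using only this documentation:\n{context}\n\nAnswer: {question}"
    return {"answer": call_llm(prompt)}
```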
Matt Squire [00:06:23]: So there's a second model kind of hidden here. We won't talk about scaling that, but just to note, we're not just scaling a large language model in the big picture, we're also scaling a couple of other things here. It did other things like, you know, strip out proprietary information or prevent questions that might be harmful, or these sorts of things. On the right, we have our model server. That is also a Python web service with an API, as is the guardrails service, I should say, and it's hosting Mistral 7B Instruct, which is what we were using at the time. And it's running on a GPU instance. So we have a GPU which we can use for inference. We didn't do anything else with the model, no fine tuning, nothing crazy like that. We're just deploying an existing model to an environment, which happens to be AWS as well.
Matt Squire [00:07:14]: So that's all we need to know in the big picture about scaling. But what does scaling really mean? Why do we care about this? Well, for me it's all about the user experience. We don't scale for the sake of scaling. Scaling is an interesting engineering problem, and so we might do that in some contexts, but certainly here we do it because we care about the users. The users want fast response times. They want to ask a question and get a reply back as quickly as possible, despite the fact that there is an enormous amount of inferencing machinery happening to bring that answer back to them. User doesn't really care about that, they just want the answer. We also need to support hundreds and thousands potentially of concurrent user sessions.
Matt Squire [00:07:58]: So we've got response time. We've also got concurrency, or scaling horizontally if you like. Also, GPUs are expensive. So from a business perspective, we want to minimize the total number of GPUs that we need to provision to support hundreds and thousands of concurrent users and fast response times. But wait, actually I should have paused a little bit longer here. As the most famous Donald in the world said, that is Donald Knuth, premature optimization is the root of all evil. So this is Donald just saying, hang on, hang on.
Matt Squire [00:08:35]: Yes, we want to optimize for the user experience, but don't just go and optimize things. We need to know what we're optimizing. We need some kind of baseline. We need to understand how the system performs currently before we start to change things around. And yeah, because you get into this situation where you say, well, I've improved the latency here, but how do you know that that particular component of the system was the biggest bottleneck? And how do you know you've not introduced some new bottleneck somewhere else? All of this should be familiar from a web application engineering perspective. It's not particularly specific to AI, machine learning, or indeed to large language models. It's what we already know. But perhaps the way we measure it is different.
Matt Squire [00:09:27]: And the tools we use to resolve the problem, the tools we use to optimize our stack, are going to be different, as we'll see. So we took Donald Knuth's advice. We didn't prematurely optimize things. Instead, we started out by benchmarking the performance of what we had. So we had built a system that was initially, let's say, a proof of concept. It served to demonstrate to the customer and to us as developers that the large language model could answer the questions we wanted to answer, and that it could do it in a reasonably robust and reliable way. We then pulled out a tool called Locust. This is a Python based tool that is designed to simulate user traffic scenarios.
Matt Squire [00:10:11]: So the idea is you provide it a number of different scenarios that you want to test. Scenarios can be differentiated by, for example, the number of concurrent users or the gaps between requests, because it's going to simulate a bunch of different requests from different users. That's how this is going to work. So our dev testing scenario is the starting point. That's really just representing an absolutely extreme version of what we as developers might do. In reality, there were five developers on this team, and they were not all constantly hammering the system every minute because they had other things to do, like write code. So the dev testing scenario is massively over-specified for what it's actually representing, but it's a starting point.
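As a rough illustration of what one of those Locust scenarios can look like (the /ask endpoint and the question list here are invented, not the project's real API):

```python
# Rough sketch of a Locust user class for the scenarios described above.
# The /ask endpoint and the questions are illustrative.
from locust import HttpUser, task, between

QUESTIONS = [
    "How do I run a fast Fourier transform on some data?",
    "Which firmware versions does the SDK support?",
]

class DeveloperUser(HttpUser):
    # Simulated developers pause between questions; tune this per scenario.
    wait_time = between(5, 15)

    @task
    def ask_question(self):
        for question in QUESTIONS:
            self.client.post("/ask", json={"question": question}, name="/ask")
```

Each scenario then just varies the user count, for example locust -f loadtest.py --headless -u 20 -r 5 --host http://localhost:8000 for the "typical day" case, then -u 30 and -u 50 for the strain and front-page-of-Reddit cases.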
Matt Squire [00:10:55]: Then we say, okay, well, what does a typical day look like? We kind of take a bit of an educated guess here because we haven't actually put it out into the real world to find out, but based on what the customer thinks, we feel like 20 concurrent users is a good starting point. And if we can support that and maintain good latency, we'll define what good latency is. And if we can maintain good reliability, then that's good. Then, okay, what happens with that typical day under strain? What happens if we just go 50% up, we add ten more concurrent users? What happens then? That's followed by what happens when we hit the front page of Reddit. 50 concurrent users. Actually, that's probably an underestimate, but let's go with it. What happens at 50 users? And then finally, how far do we have to push it before it fails? Now, actually, as it happens, when we first start running this, that failure point is way, way lower than you think it might be. Could be five users.
Matt Squire [00:11:55]: And we never get to any of these other scenarios. But pushing a system to the breaking point, you learn a lot from doing that. And again, this isn't specific to large language models, right. This is generally true for software engineering. We want to push it to its breaking point, ultimately to find out what breaks, how it breaks, and how we fix it. We also measured three different metrics. One was latency. So by that I mean the total time it takes in seconds to get a response from the system.
Matt Squire [00:12:24]: Now, you might say, well, from the whole system or from that large language model server? Well, it depends what you're interested in. And actually, as it happens, we tested the model server in isolation, we tested the guardrails server in isolation, but we also tested the entire system as a whole. And those are three different sets of experiments that tell you different things about the system. Throughput was the second metric. So how many tokens are we outputting per unit time, per second? That seems like a good metric when we're dealing with large language models, and it then translates into what the user experiences in terms of how long it takes overall to generate a response. And the next one is request rate: how many user requests are served successfully per minute, or per unit time, whatever we care about. In addition to these metrics, we want to be principled, we want to be scientific about how we're doing.
Matt Squire [00:13:30]: So we need to know what test did we run? What software release version were we testing against? What environment were we testing? What is the git commit that corresponds to that, that version that we're testing against? Dataset version, and anything else that might be relevant. We're going to record all of that, because if we're running benchmarking, then we are also going to be changing things. We're going to run a test, find out what happens, form a hypothesis about where we need to intervene in the system, and we're going to test that hypothesis by running another set of tests. We need to record all of this other information so that we can look back effectively. It's your experiment tracker, right? You can look back at what you've done in the past and understand whether you're moving in a progressive direction with your, your optimization. So what were the results? Well, you know, numbers might be different nowadays. This was a few months ago, and we would probably do a few things differently if we started this project again today. But, you know, these are the numbers we got at the time.
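A small sketch of what that bookkeeping can look like: computing the three metrics from raw per-request logs and writing them out next to the run metadata. The field names, dataset tag, and results file are illustrative, not the project's actual tracking setup.

```python
# Sketch: compute latency / throughput / request rate from raw request logs
# and store them next to the run metadata, so every benchmark run is
# reproducible and comparable. Field names and file name are illustrative.
import json
import statistics
import subprocess
import time

def summarise(requests: list[dict], wall_clock_s: float) -> dict:
    """Each request dict: {'latency_s': float, 'tokens_out': int, 'ok': bool}."""
    latencies = [r["latency_s"] for r in requests if r["ok"]]
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[18],
        "tokens_per_s": sum(r["tokens_out"] for r in requests if r["ok"]) / wall_clock_s,
        "requests_per_min": 60 * len(latencies) / wall_clock_s,
    }

def record_run(scenario: str, metrics: dict, path: str = "benchmarks.jsonl") -> None:
    run = {
        "timestamp": time.time(),
        "scenario": scenario,                  # e.g. "typical_day"
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "dataset_version": "docs-2024-03",     # however you version your data
        **metrics,
    }
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")
```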
Matt Squire [00:14:39]: So, initially, with just our Hugging Face pipeline for model inference plus FastAPI, we got very high latency, and it grows very quickly. Essentially, this is unusable for more than one user, and even the one user isn't getting a great experience right now. Fair enough. We're not too surprised by that, but at least it tells us where we are. So the question is, can we speed up inference? Well, that's where we pulled in a tool called vLLM. So, the team behind vLLM started with the observation that inference is mostly bottlenecked by GPU memory. And more specifically, it's because when you're doing inference in a transformer model, a lot of what you're doing is these key-value lookups. So, if you look into the details of how a transformer architecture works, you'll understand more of what that is.
Matt Squire [00:15:38]: I won't try to explain that here, but that's the operation that's taking up most of the time when we're doing inference. So the idea behind vLLM, the key innovation, no pun intended, is this thing called PagedAttention. Think of it as something like virtual memory for a large language model. Effectively, what they claim is that they get a 24 times throughput improvement over plain Hugging Face pipelines. And, okay, plain Hugging Face pipelines aren't actually the most optimal thing in the world. You have Hugging Face TGI, the Text Generation Inference server; that's better. But even so, vLLM gives a two and a half times improvement over HF TGI. So that's pretty cool.
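For context, this is roughly what vLLM's offline API looks like for a model such as Mistral 7B Instruct; the model name and sampling settings here are just an example, and in a deployment like the one described you would more likely run vLLM's OpenAI-compatible server and point the FastAPI orchestrator at it.

```python
# Sketch: swapping a plain Hugging Face pipeline for vLLM's engine.
# Model name and sampling parameters are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = ["[INST] How do I run an FFT with your SDK? [/INST]"]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```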
Matt Squire [00:16:25]: And you can read more about how vLLM works with the link below. They released a white paper explaining all the details. This is pretty cool. It's extremely cool. The metaphor with virtual memory is a little bit tenuous, because how virtual memory works in traditional computing is rather different from their meaning of the word. But never mind. Other than that, the idea works. And what we found when we replaced these pipelines with vLLM was improved latency and a slower growth rate in latency as we ramp up the number of users.
Matt Squire [00:17:01]: So we can see that the after values here are all looking green; we're pretty good. We also did that without increasing the compute cost. So the number of GPUs we need remains the same. That's part of the result as well, because we're saving money ultimately in doing that. Okay, so we can see a path whereby we can improve latency and improve throughput on a per user level. Let's call that vertical scaling. But what we're still doing here is inference on one individual server, and the next thing we can do is start to scale it out horizontally, which means more servers. So more servers to handle multiples more concurrent users.
Matt Squire [00:17:55]: But GPUs are expensive, so we want to make the most of them. And of course we don't want to scale beyond what we need as we're doing this. So the other thing we ended up implementing on this project was a tool called Ray Serve. This is a monster of a framework. It's built by a company called Anyscale. It's used by OpenAI, which itself is a massive endorsement, but it's also used by Uber, LinkedIn, and so on and so on. The idea is that it gives you an abstracted way to deploy your models to many servers, but it also gives you auto scaling, so it can figure out when and how to scale up and down the number of servers that you're using, and it can interface with the auto scaling capabilities that you have in your cloud provider to do that for you. That's very cool.
Matt Squire [00:18:47]: And the final thing it can do is GPU sharing. So it kind of abstracts the idea of a GPU in some sense. So coming back to that guardrails service where I said, hey, there's another model here that I'm not going to talk too much about. Well, the thing is that needs a GPU to run as well. And ideally so does the vector database, at least because we need to compute some embeddings, so that can be accelerated by a GPU as well. So this ability to share GPUs suddenly unlocks something, because you can run multiple different things on the same GPU with that allocation. And the final thing which is important here is that it plays nicely with vLLM. So we're able to join the two things up so we can get that horizontal scaling and the vertical scaling together.
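A minimal sketch of a Ray Serve deployment with autoscaling and fractional GPU allocation, with the actual inference left as a placeholder for the vLLM-backed model server:

```python
# Sketch: a Ray Serve deployment that autoscales and shares a GPU.
# The generate() body is a placeholder for the vLLM-backed model server.
from ray import serve
from starlette.requests import Request

@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
    ray_actor_options={"num_gpus": 0.5},  # e.g. share one GPU with the guardrails model
)
class LLMServer:
    def __init__(self):
        # Load the model once per replica (e.g. a vLLM engine) here.
        self.model = None

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"answer": self.generate(payload["question"])}

    def generate(self, question: str) -> str:
        raise NotImplementedError  # placeholder for real inference

serve.run(LLMServer.bind(), route_prefix="/ask")
```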
Matt Squire [00:19:40]: That was a very quick run through of the project: the background, the things we learned, the experiments we ran, and what worked for us. And it's important to say it's just what worked for us. This is one data point, one experience, and it's a project that we did at the beginning of the year, and we all know how fast technology moves. There's probably a few things we'd do differently, but the key takeaways here are: always benchmark. You must benchmark performance before even starting to optimize something. That's the most important lesson across software engineering, not just for large language models and MLOps. Secondly is to think about latency.
Matt Squire [00:20:24]: So we found that vLLM works really well there. Thirdly, to think about scaling, and we had great results from Ray Serve there. And necessity, that's the final point I want to get across, which is: if you are self-hosting a large language model, and there are many good reasons for doing it and many good reasons for not doing it, but if that's what you're doing, and you're not just building a toy but building something for production, then you will encounter these challenges, and you might solve them in different ways. This is one approach, but fundamentally these are things that you'll want to look out for. And starting with benchmarking is exactly the place to begin. So on my final slide, I'll just say you can learn about Fuzzy Labs here at the URL. And I want to just say a huge thanks to the team from Fuzzy Labs who made all of what I've talked about happen.
Matt Squire [00:21:20]: And, yeah, I also have a newsletter, so you can subscribe to that if you like that sort of thing. Thanks very much for listening. Any questions?
Ben Epstein [00:21:31]: That was really cool. That's awesome. I didn't know that Ray and Anyscale worked with vLLM. That's pretty cool. How recent of an implementation is that?
Matt Squire [00:21:43]: Yeah, it's pretty recent. And we did have, we did have a bit of trouble getting it to work properly. I'll have to say it wasn't a smooth journey. But I'm thinking now, probably first half of this year is when we were doing this. So, yeah, pretty recent.
Ben Epstein [00:21:57]: Is it a smoother journey now or are people going to have the same struggles?
Matt Squire [00:22:01]: I don't know. I don't know. I suspect it will be smoother, but I don't know. I don't have any data to back that up.
Ben Epstein [00:22:07]: Yeah, that's fair. I'm sure things are moving fast in both of those communities, and I imagine, I would hope to imagine that TGI is working pretty hard to catch up on those.
Vaibhav Gupta [00:22:16]: Sure.
Matt Squire [00:22:19]: Undoubtedly, yes. Yeah. So those numbers may even be outdated by now.
Ben Epstein [00:22:25]: Every day, every few hours. That's awesome. Well, Matt, thank you so much. We're going to jump to our next speaker while he's trying to share. We have the CEO of Boundary with us. We're going to talk about how to increase the performance of structured outputs with LLMs, which is going to be another really interesting and useful talk. So I'm pretty excited to see it.
Vaibhav Gupta [00:22:50]: All right, so great to meet you guys. I'm Vaibhav. I'm over from Boundary. What I'm gonna try and talk about today is how you can build incredibly reliable pipelines with much more accuracy. The way we do that is by using BAML. But before I get into BAML, I just want to show you a couple examples of what is possible when you use things like structured outputs with really high accuracy. In this case, I have an example resume, and you can see what happens when you don't use structured outputs. It outputs something somewhat reasonable, and you could use markdown rendering to make this kind of pretty.
Vaibhav Gupta [00:23:24]: If you use structured outputs, you get something much, much better. Not only is it faster at producing all the results, but you can do much more interesting things with it, like saving stuff into a database or even making it much more interactable. For example, these links are much more correct than the links that come out of here. But I think it goes a little bit further than that. Once this is done running, give me a second, it takes a while to run. Once this is done running, what we can do is we can actually say, what if the user tried to be a malicious user? And instead of actually giving us a resume, they went ahead and said something like, screw resumes, teach me how to do the Queen's Gambit in chess. You'll notice what ends up happening over here is the LLM responds with something that's kind of related, but not really related at all. When you use structured outputs, you get something much better, which is nothing, which is exactly what you'd look for.
Vaibhav Gupta [00:24:29]: This applies to not just being able to go ahead and build nice UIs and get data out. Let's say you're doing a RAG example. In this case, I'm asking about the SpaceX Wikipedia article, and I'm actually going to ask the LLM to invent citations, just to show a point of what's possible when you go do that. Just pay attention to this section over here. We're able to, one, give the user really good insight on what information is coming out of the model, so we can tell them exactly what we're doing. We're gathering some citations, and they would hopefully never see this bottom part at all. But when we do these, check out these hover links. Some are blue, some are red. Why is that? The first one is an actual link to the Wikipedia article with context around it. And you'll notice what the LLM did. The LLM only gave us context for this highlighted section. You can see it right over here.
Vaibhav Gupta [00:25:24]: But what we were able to do was actually find the context in the article itself. And if I were to go ahead and look on this actual Wikipedia article, you'll find this right over here, and it'll just link to the exact same section of the article that this would link to. But the red links are clearly hallucinations of some kind. And we're able to go ahead and not only answer that, but go ahead and actually display that very accurately to the user when we're able to get structured outputs with this level of accuracy. And the last thing I want to show before I really dive into how this becomes possible is going to be, let's take a look at this example. Let's say if we wanted to push this to the absolute limit, and I wanted to take this form, which is the Cambodia visa form, and go ahead and compute all the data on this. Let's try this. This is a real demo, so it might fail, but I think it'll be kind of fun.
Vaibhav Gupta [00:26:16]: So as this runs, what this is doing is actually generating code to go and describe the data model that represents this data. And right over here you can actually see this pulling out all the information that I want, and it's actually streaming it in too. And for example, see these checkboxes? They turn into booleans, which are again much more database friendly and much, much more accurate. And we can do this using BAML for any sort of content: PDFs, DOCX, CSVs, audio, video, plain text if you want, even just websites, for example. If you take a look at this other example, a very traditional one, which is an invoice, once again, you can go ahead and just quickly describe the data model that describes this data, and then you can actually go ahead and run that code to produce highly accurate, highly useful systems. So I'll pause there and just really quickly recap exactly how any of this is possible. The core technology behind doing all this is what we call BAML. BAML's innovation is this idea that we all want to be able to write code instead of writing plain English, telling models "do not hallucinate" or these other phrases that don't really work, and just make it a lot more interactable.
Vaibhav Gupta [00:27:40]: Having code allows us to do a couple of things. One, it makes the systems much easier to read and reason about as they change over time and become more complex. And two, it also simultaneously makes the problem a lot simpler for the model to think about as well. And I'll give you a couple examples, but let's go straight into VS Code to really show how this works. So when I go over here, I'm just going to show you a very, very brief example of what it looks like to go ahead and get really high quality structured data output. What I would love to do is go ahead and define some function that takes in either a string or some image, and it'll dump some resume data model out. Now, instead of writing JSON schema or something else, I want to write classes. So we write classes like this. And just like in website development, the first thing that I did when I first learned how to make websites, whether it be HTML, Next.js, or anything else, was open the browser to see what the website looks like.
Vaibhav Gupta [00:28:42]: And then the most important thing immediately after that was figuring out how hot reload works. Because without hot reload, I really, really can't iterate in any reasonable way. So we wanted to bring that same process over to LLMs. When you want to get really high quality structured outputs, you need a hot reload loop. In this case, what we're allowing you to do is actually see the entire prompt that gets fed into the model and everything coming out of it. Not only are images able to be rendered, but audio files, video files, anything else. And as you make your systems more complicated and you have multiple test cases, you can see exactly what each test case would look like in this model. If you wanted to do something slightly fancier, like, say, if statements or for loops, we can actually say something like this: if it's greater than ten, include those. And you'll notice that we have this resume coming in. But on the other hand, if I make this very small, it won't have that section. Now why does this matter? It matters because in order to get really, really high quality data coming out of the model, you need to be able to see exactly what you're sending to the model. By that I mean you would never really ship a website with a CSS change without seeing what the CSS change or the div change did on the actual website. And similarly, you should be able to see the entire request without any abstractions composed. So in this case it actually means a raw web request. In here I released my API key; please don't steal it.
Vaibhav Gupta [00:30:18]: But I will deactivate it after this talk. And you'll see over here, this is the full text that gets sent to the model. But secondarily, you need another hot reload loop which actually allows you to go ahead and run these tests directly in VS Code. Rather than having to jump between UIs to go and do any testing, you want to run them in VS Code or in the CLI. Those are really the only two options that give you a really fast loop. So let's talk about what BAML does to make your life a lot easier, and how we're able to achieve better structured outputs even using super tiny models. In this case we did something really trivial. We took the data that came out of the model and we processed it into this actual data model that represents your resume data type. Now, if I wanted experience to instead be something a little bit more sophisticated, let's take a look and see what that does.
Vaibhav Gupta [00:31:08]: One, you'll notice that my prompt automatically changed to mimic my data model. But two, when I go run this, you'll find once again I'm now able to go get this data coming out of it. But to do something more complicated than just this, let's try something else. We'll tell the LLM to not use any quotation marks when it outputs data out of the model, and keep a note of how many tokens are being used. We're using 108 tokens today, so let's see if we can get that lower. So you're noticing we're still streaming all the data out, even though this is technically totally unparseable garbage as JSON, and we came down from 108 tokens to 83 tokens. That's a saving of 25 tokens, just straight up: almost 25% cheaper and 25% faster, because we're just using way fewer tokens. What BAML does is, when you write code using BAML, we're able to generate a bunch of algorithms behind the scenes that run on the output of the model to turn the data that the model gives you into something much more practical.
Vaibhav Gupta [00:32:18]: And we use no LLMs for doing any of this; all of this happens completely locally on your own machine in milliseconds. What this basically unlocks is the ability for super, super tiny models to go ahead and provide really high accuracy, the same quality you'd get from the bigger models. Because what we noticed when we ran a bunch of data sets, and I'll show you that in a second as well, is the fact that GPT-4o, while it's really good, is really matched on a lot of performance by lower tier models, even GPT-3.5 and Claude Haiku. But the problem with those smaller models is that they have a really hard time giving you exactly the format of output that you need, for formats that are very strict, like JSON, which requires quotes and doesn't allow comments or trailing commas. And what that effectively leads to is an extremely faulty system that fails not because the data from the model is incorrect, but because the way that you're able to handle the data is failing. And that's the innovation that BAML provides: whenever you go and ask the models to give you a resume data model, more often than not, it will almost always give you a resume data model. What's interesting about our technique is, one, it requires no modification to the model at all, so you're able to use this with any existing model today.
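As a toy illustration of the general idea, coercing a model's loosely formatted output into a typed object so it never has to emit strict JSON, and emphatically not BAML's actual parsing algorithm, something as simple as this already shows why the approach saves tokens:

```python
# Toy illustration only (not BAML's parser): map a model's loosely formatted
# "key: value" output onto a typed object, so the model never has to emit
# strict JSON with quotes, braces, and commas.
from dataclasses import dataclass, fields

@dataclass
class Resume:
    name: str = ""
    email: str = ""
    experience_years: int = 0

def coerce(raw: str) -> Resume:
    """Match 'key: value' lines to Resume fields, casting types loosely."""
    known = {f.name: f.type for f in fields(Resume)}
    values = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip chain-of-thought or commentary lines
        key, _, value = line.partition(":")
        key = key.strip().lower().replace(" ", "_")
        if key in known:
            value = value.strip().strip('",')
            values[key] = int(value) if known[key] is int else value
    return Resume(**values)

print(coerce("Here are the key details\nName: Ada Lovelace\nExperience years: 12"))
```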
Vaibhav Gupta [00:33:53]: And two, it's much faster, because it doesn't slow down the hot loop of token generation the way constrained generation does. Let me show you an example before I dive back into the code in a second. So let's take a look at exactly what this means really quickly on the actual benchmarks. When we ran this on the function calling benchmarks, we ran into something really, really interesting. Once this runs, here's the main difference we found when we ran this on all the models and compared it against those models' built-in function calling and even OpenAI structured outputs. The most surprising thing we found is that even when we ran the Berkeley function calling dataset on just GPT-3.5 Turbo, BAML is significantly more accurate on the oldest model that is out there today, and even on par with, or in many cases better than, newer models that came out without BAML. So models not using BAML just end up performing way worse than models that use BAML. And this is across the board on all test cases.
Vaibhav Gupta [00:35:09]: If you do it on a cost basis, you'll find once again, BAML is just a lot cheaper. And BAML ends up being a lot faster than almost every other technique except the technique on the very left, which has significantly worse accuracy, but it's much faster than any of the function calling techniques that exist. The reason this ends up being true is because what BAML is able to do, if you go back here, is actually reduce the number of tokens that go into the model. So if you go ahead and actually look at the tokens that we're sending to the model, compared to JSON schema or any other format, the format that BAML uses to represent the prompts is a lot smaller. And then lastly, the format that BAML uses to parse the data out of the model is also way smaller. As we saw earlier, we just don't need a lot of superfluous tokens like quotation marks and commas, and we're able to still get the same data out. Now, a lot of you may have more complex problems, like, say, financial data or something else that may require some techniques like chain of thought, where you have the model do reasoning before it produces some data. Doing something like that with BAML is also much easier.
Vaibhav Gupta [00:36:22]: So I'm going to show you a prompt that I use for chain of thought very often. And it almost surprises me how good the models are at figuring out what I need to do. So let's say this: before answering, outline some key details, for example. And this prompt is effectively total nonsense. It really shouldn't work. But I'll show you what happens when I go run this. You're noticing that the model basically outlined everything ahead of time, was able to do some chain of thought, and then produced the actual JSON.
Vaibhav Gupta [00:37:13]: And once again, as we were watching, it kind of pulled out all the data for me automatically. Let's run a slightly more complex test case and just show what's able to happen. And this test case that I'm running is right over here. It's an actual image of my resume plumbed into here, and we'll watch the model actually go ahead and produce some results. And right over here you can see the same thing once again. The model was able to go ahead and output a lot of key details, and then after that structure it into the actual data model that I wanted. And all this happened without any quotation marks, so we probably reduced token usage by another 20% when producing these results. Because we don't use any LLMs to convert this string into the output that you want in your code base or in your type system, we're able to go ahead and do this much more frequently, almost on every token that comes out, even while streaming.
Vaibhav Gupta [00:38:09]: Now, while all this is true, I think the most important part to figure out is, if you did want to use something like BAML, which allows you to go and get this level of accuracy with almost any model of your choice, how could you do that? The way that you do this today is you can use BAML with any other language of your choice. While BAML is its own programming language that you can use in VS Code, using our VS Code extension, it plugs into TypeScript, Ruby, Python, Java, Go, Rust. So you can get the benefits of our algorithms in any place of your choosing. But I want to give you a taste of what it feels like when you go ahead and write that in, let's say, Python code. So all you have to do to use this extract resume function that we have is, from the BAML client, import b, and then you can do b dot extract resume. You'll watch it autocomplete as I go down here, and resume is going to be autocompleting almost everything. This is going to be a list of experiences. Let's take a look at what happens when I go change resume's list of experiences to be a list of strings.
Vaibhav Gupta [00:39:16]: You'll notice that this, oops, why did that change? That's awkward. In theory, it should change to be the exact same type that we have over here. There we go. Oh well, I guess that was slightly off. There's probably a bug somewhere that we need to go fix, but we keep our type systems completely in sync with your codebase. So as you go ahead and type, you never have to think about types between BAML and Python not lining up. I'll pause, for I guess that's the main gist of everything. What we really are able to do is go ahead and drive extremely highly accurate systems from LLMs and give you a really good hot loop in your pipeline, so that whenever you're trying to go ahead and write LLM apps, you're not struggling to see what is the actual request that is being sent to the model, what is the work that is actually happening. And more importantly, if you wanted to have really highly accurate systems, what you have to do is describe data models that are able to go ahead and describe the goals of what you're achieving.
Vaibhav Gupta [00:40:31]: And then you go ahead and describe test cases and just run them really quickly. I think that's the general gist of BAML. I hope you guys have fun, and hopefully you're able to go ahead and build some of the demos that I showed earlier from our example repo.
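For reference, calling a BAML-defined function from Python looks roughly like this. It assumes a BAML function named ExtractResume returning a Resume class with name and experience fields has been defined and the client code generated; treat the exact names as placeholders.

```python
# Rough sketch of calling a BAML function from generated Python bindings.
# Assumes a BAML function ExtractResume(resume: string) -> Resume and a
# Resume class with name and experience fields; names are placeholders.
from baml_client import b

resume = b.ExtractResume("Ada Lovelace. Analyst. 12 years of experience ...")
print(resume.name)           # typed fields, kept in sync with the BAML classes
for job in resume.experience:
    print(job)
```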
Ben Epstein [00:40:50]: That was unbelievably cool. That was really sweet. I am stoked to try this. I will be using this in the next week. This is really cool. Combined with Matt's talk, we now have scalable fast LLMs with cheaper and higher quality structured outputs. That's pretty sweet. How long have you been working on this?
Vaibhav Gupta [00:41:13]: We've been working on BAML for probably over a year now. We wrote a whole compiler in Rust for it. That's why it's a lot faster in a lot of ways too. So, over a year now.
Ben Epstein [00:41:24]: And this is the open source project under Boundary ML? I shared the URL. I'm assuming that was the right one.
Vaibhav Gupta [00:41:28]: Yes, yes. So if you go to BoundaryML on GitHub, that's our open source project, BAML. All this that you saw is completely free to use. Anyone can use it at any point. I think we have a growing community, so join our Discord if you ever need anything. What I'll toss out is, if some prompt is not ever working for you, if you post it on our Discord, I will give you a working version of the prompt, usually within 15 minutes.
Ben Epstein [00:41:52]: That's sick. That's very cool. I love it. Um, presumably the parsers are just incredibly sophisticated regexes that you've been working on.
Vaibhav Gupta [00:41:59]: It's a little bit more than regex. It's a lot of dynamic programming, actually. So my background's in algorithm design. I did a lot of algorithms for Google and DSHA for the last ten years, and I made parts of their systems go from like 24 hours to less than five minutes. I worked on core Python for a while, and I just love making things that make systems faster.
Ben Epstein [00:42:21]: That's awesome. I am super stoked to try this out. That's really fun. Awesome. Thanks so much. I shared your link.
Vaibhav Gupta [00:42:30]: Thanks for having me.
Ben Epstein [00:42:31]: Yeah, of course. We'll jump to the next one, which I think will also tie in to everything we've been using. I think at the end of this session you can use all three of these tools together to really one-up your systems. Thanks. We're going to jump to Tom now. How you doing?
Tom Sabo [00:42:46]: I'm doing okay.
Ben Epstein [00:42:48]: All right.
Tom Sabo [00:42:49]: We had some technical things this morning. So, happy to have landed.
Ben Epstein [00:42:54]: Yeah, yeah. A few technical issues, but we're here, we're doing it and we're going to let you jump right in because we're going to push time a little bit. So feel free to introduce yourself and jump in whenever you're ready.
Tom Sabo [00:43:05]: Fantastic. I'll try to wrap up a little early in case there are any additional questions, too. So, hey, folks, I'm Tom Sabo, and thanks for the introduction. Really appreciate it. I've been working in text analytics and natural language processing since before anybody knew about text analytics and natural language processing, so probably about 19 years, and I've been at SAS for almost 14 years now. So I'm excited to share with you some of the capabilities we have explored around combining traditional techniques in natural language parsing with large language models. We just heard a little bit about how you can use something like regex, through a language, to improve the output and the performance of large language models. We're doing something pretty similar, where we have been exploring techniques for using information extraction models, which is essentially text analytics, to improve the performance of large language models.
Tom Sabo [00:44:00]: So the idea is you can narrow down the set of data that you end up sending to a large language model and subsequently get much more focused answers from it. So we are super excited to apply this and show you how you can do this on your end too, using some of the open source techniques available to you. I'll just mention once: most of everything that I've done here, we've done on top of SAS. We've made calls out to large language models; we've used OpenAI, we've used Llama on Hugging Face, through an LLM-agnostic approach. Just to give you a little bit of a primer on natural language processing, for those of you who haven't been doing this for a long, long time: NLP is largely based around the whole proposition of getting computers to understand strings as something other than bits and bytes. So it involves tokenization, it involves parsing. All these techniques end up going into the embeddings which we feed into large language models today. Largely, my work has been in using traditional techniques in entity and fact extraction.
Tom Sabo [00:45:17]: And a lot of these techniques are available; you can do entity extraction through a bunch of different models. Entity extraction is essentially identifying the people, places, and things of interest to an organization. So, for instance, you could use entities to identify, in the case of public comments, a comment that has a recommendation, or what that recommendation was, or who supplied that recommendation, let's say. And custom entity extraction is fantastic. It lets you focus on anything that your organization might find particularly interesting in a set of text data. I'll be talking about public comment analysis in my examples, because I work primarily in government, and government organizations are very interested in the public feedback related to any proposed regulation they put out there. But this can be applicable to consumer complaint analysis, customer call center requests, and everything that's related to commercial orgs, too. So like I said, we use information extraction before
Tom Sabo [00:46:15]: we have large language models vectorize and create embeddings, but we also use it primarily for creating taxonomies and categories of data. Each of these, again, can be sort of filtered down to a large language model to get summaries of the different key aspects of the data that I care about. And then applying linguistic concepts, again, can be used to create a subset of data from a larger set of data to help your large language model focus in on a correct answer. LLMs are smart, but they're not as smart as people necessarily give them credit for. So when you can do things like, let's say, engineer a prompt on one side and simultaneously give it the right data against that prompt on the other side, it can give you much more focused answers with fewer hallucinations. So I'll talk specifically about public comment analysis, because some regulations can get tens of thousands of these public comment responses. And many government organizations are largely resorting to going through these manually and attempting to identify all of the different subsets of the public comments in a way that addresses all of the different aspects of commentary brought up by the public. So if you're looking at a regulation dealing with your health data privacy, wow, there are a lot of different stakeholders involved in that who will write in with many different opinions.
Tom Sabo [00:47:40]: And by law, government organizations have to respond to all of the salient points brought up by the public or be subject to lawsuit. So that's super challenging for organizations doing this manually. So we've worked with orgs who are doing this manually and have basically introduced a combination of text analytics and large language models. In some cases, they may be interested in one of, let's say, 200 different aspects of a regulation. And so what we do is use text analytics to essentially filter: okay, here's a comment, it belongs in this particular bucket. But even going deeper than that, we can tokenize by statement. So let's say 1,000 letters that the public write in with might have, let's say, 30,000 unique statements.
Tom Sabo [00:48:27]: And those are the things which we want to bucket as, hey, this statement deals with this particular aspect of the regulation. Once we have those buckets, large language models work so well in terms of summarizing: this is what the public had to say. And both of those steps were formerly very manual processes that organizations had to go through. So we've seen incredible time-to-value improvement, basically going from 4,500 hours to closer to 600 hours for one particular regulation that we were looking at related to Health and Human Services. And I'll show you how some of these visualizations look. I really like to show off ideas related to how you visualize the results of large language models. You can reduce hallucinations, but largely I want to show you a little bit around using text analytics to help you validate the results coming out of large language models. And in this case, text analytics could be term extraction.
Tom Sabo [00:49:26]: It could be focused term extraction based on terminology of interest to your organization. So when the large language model makes an assertion, you can go in and explore some of those. I'm also going to talk a little bit about how this goes somewhat beyond RAG. What I've seen is that RAG works very well at grabbing several different documents and constructing an answer from the documents it gathers. We're doing something a bit different, where the text analytics process on the front end might look over tens of thousands of statements, put, let's say, 2,000 of those statements together into essentially one large document, and then ascertain a response from that very large document, where, let's say, those 2,000 statements are particularly pertinent to answering a question. So it's almost like doing a mini RAG where you're grabbing things from 2,000 different places, because these could come from any number of attachment letters rather than just a few requisite search results. And then the idea is, really, because you're giving better data to the LLM, it's going to help minimize the hallucinations. But beyond that, you can then use visual interfaces to validate the veracity of any responses it delivers. And then also, we've seen some processes where we address bias and toxicity in models.
Tom Sabo [00:50:51]: This is a way you can use this process for those too. So, just going through it at a high level before I go into the use case example: let's say I'm pulling in data from, like, 1,000 regulation comment letters. This could also be customer interactions too. We're identifying your terminology dealing with certain aspects of, let's say, your product or solution. I would use my information extraction model, in this case a SAS one, to identify, within my comment letters, maybe I've gotten, let's say, 20,000 or 50,000 unique statements out of those, well, which ones of those have an actual recommendation for me in there? Essentially, we want to treat those differently, because those can be really powerful in terms of actual value to the end user or to the government agencies required to adapt their regulation for the public.
Tom Sabo [00:51:46]: Then I can use a generative AI model, either in batch or on the fly, to create a summary of the public's needs around particular areas. It might be just the negative recommendations related to your proposed regulation, or the positive ones, or it might deal with a certain aspect of the regulation. And simultaneously I can take the text analytics results into a visualization where I can drill down, explore the results, explore what the large language model says, and trace it back to all of the requisite statements from many different locations that pertain to the large language model's response. Let me go ahead and show you this in action with one particular example that we did with a proposed EPA regulation. So we looked at an EPA regulation that considers transition of primacy on carbon capture and storage for what are called Class VI carbon storage wells. EPA is currently overseeing them. The regulation proposes transferring control over to the state of Louisiana.
Tom Sabo [00:52:50]: Class VI wells are used to capture byproducts of the oil and gas industry and store them underground rather than releasing them into the air. So there are environmental benefits. However, the folks writing in had a lot to say in terms of concerns related to carbon capture in general, and also to transferring control of that over to the state of Louisiana. To assess what's contained in there, and there were over 10,000 unique statements in here, we wanted to assess general negative themes, general positive themes, and themes surrounding environmental justice. We're going to look specifically at the negative themes here. So I can show off to you how we applied the text analytics with large language models for summarization, and then also how we can subsequently employ traceability. So, up front:
Tom Sabo [00:53:44]: This is a visualization that we have in SAS. I applied a set of rules to identify recommendations. The benefit of using information extraction models is that they run really, really fast; you can score millions of documents in seconds or minutes. So basically I can flag statements as containing a recommendation using essentially a query language. This looks a little bit like regex, but it gives you a lot more power around it. That said, you could use regular expressions for this kind of thing too, like looking just for statements that, say, contain account numbers or whatnot, and that's what you feed to your LLM. This rule picks out statements that have terminology like "please include," or "in order to determine," or "it would be helpful if," that kind of thing.
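As a rough open-source analogue of that filtering step (the SAS query language itself is richer than this), a plain regex pass over the tokenized statements might look like the following; the cue list, sample statements, and prompt are illustrative:

```python
# Rough regex analogue of the rule described above: flag statements that
# contain recommendation-style phrasing, and send only those to the LLM.
# The cue list, sample statements, and prompt are illustrative.
import re

RECOMMENDATION_CUES = re.compile(
    r"(please include|in order to determine|it would be helpful if|"
    r"we recommend|should consider)",
    re.IGNORECASE,
)

def filter_recommendations(statements: list[str]) -> list[str]:
    return [s for s in statements if RECOMMENDATION_CUES.search(s)]

statements = [
    "It would be helpful if EPA retained oversight of Class VI wells.",
    "Thank you for the opportunity to comment.",
]
recommendations = filter_recommendations(statements)
# Only this small subset (here 1 of 2 statements) goes into the LLM prompt.
prompt = "Summarize the public's recommendations:\n" + "\n".join(recommendations)
```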
Tom Sabo [00:54:39]: Once we do that, I stand up a visualization that has a lot of access to the different underlying statements, and I can explore through some visualizations, like: show me everything related to negativity around this particular regulation that the public had to say. On the next page, you'll see the large language model summary in a couple of places. To generate this, I used a pretty generic prompt and then fed in only those negative recommendations having to do with that regulation. So it was essentially 5% of my data. The large language model comes back with pretty good summaries of what the public had to say. But then also, with the power of text analytics, we can drill down into both who wrote it and some of the key themes. So, for instance, the large language model mentions that a main concern from the public is the risks of carbon dioxide. Well, I have the ability then to drill down into terminology related to that, identified from looking at terms like safety, carbon waste, carbon dioxide, and unproven carbon capture.
Tom Sabo [00:55:56]: And I can also identify who all is writing in about that: Earthjustice, the Environmental Defense Fund, the Sierra Club. Who am I most likely to get sued by? Maybe I should focus on those responses. And then down at the bottom, we can identify all the individual statements. The same thing applies for statements about, say, inadequate plans, or consider environmental justice, for instance: the Sierra Club and the NAACP write in about that, the concern that essentially Louisiana is not going to actually oversee these wells, and so on and so forth. I'm going to stop here because there's probably a few more minutes for questions. I'm just going to highlight that we've applied this kind of technique to data all over the place in terms of adverse event assessment: drugs, vaccines, vehicles. We've looked at public policy. We've looked at police narratives and crime patterns. Just an FYI, there's a great data set out there published by the city of Dallas, essentially 40,000 crime events that you can download and use. Just go to opendallasdata.gov.
Tom Sabo [00:57:09]: It's a fantastic data set if you're doing any work with police organizations. And then consumer complaint analysis, research analysis: there are so many opportunities to blend these kinds of capabilities in with many large language model oriented projects, to filter down the data that you're sending to LLMs, especially if you're using one of those smaller LLMs that's offline and you have limited power on your end. This can get you results so much faster. I'll stop there, since there's just a couple minutes left, if folks have additional questions.
Ben Epstein [00:57:48]: Thanks so much, Tom. That was great timing as well. Really interesting presentation. Maybe I'll ask a quick question here before we wrap: I'm assuming that QR code is for more information. Is there anywhere else people can go to get more information and to learn about some more of these techniques?
Tom Sabo [00:58:05]: Yeah, absolutely. The QR code actually will link you to my profile, so reach out and shoot me a message. I would love to chat with you more about it. And I can point you to some of the resources, like the organization we work with, Southern States Energy Board. They've already started publishing some of this online. There's a couple of online presentations. There's a webinar out there, so. Yep, happy to share more resources.
Vaibhav Gupta [00:58:24]: Awesome.
Ben Epstein [00:58:25]: Super cool. I'll quickly ask if anybody here has questions for each other. I mean, these were all really cool talks. They all are kind of very interrelated. So I'm curious if anybody here has questions for other folks or other speakers.
Tom Sabo [00:58:40]: Yeah, for the second presenter, the one about BAML. I'm just very interested in connecting with you at some point, in terms of seeing how you use regular expressions to help the large language model focus, and to generate fewer tokens or use fewer tokens too. I feel like these kinds of strategies, what you talked about and what I talked about, can go hand in hand.
Vaibhav Gupta [00:59:00]: Yeah, I mean, we're pretty agnostic to actually using, like, models or algorithms under the hood. It can really be anything. What we really go for is providing a really good developer experience, so that on average, whenever you use a prompt, it just works, without you having to do any real prompt engineering.
Ben Epstein [00:59:18]: How does that work when, I mean, you kind of added a section around, like, if you need the model to do more reasoning, but when you say it just works, you're really specifically talking about just works in terms of getting it into the expected JSON format, rather than just works in terms of getting the model to reason about the problem correctly.
Vaibhav Gupta [00:59:35]: Right. It turns out that they go hand in hand, and we have a little bit more advanced syntax to help you get the reasoning right as well. So the way I always frame the problem, it's kind of like security. For example, whenever you deploy an application, you need some level of security. Like, you don't want your passwords to be written in some text file that everyone else can read. But by the time you've built it into an enterprise, you need a lot more levels of security that you add by adding more software. So the benefit of having a language that we support is that, by default, we give you a system that works well enough, but then we let you write more code using our syntax, in a slightly more ergonomic and algorithmically useful way, to go ahead and add the level of refinement you need. For example, one of our customers uses us to read 18-plus-page bank statements.
Vaibhav Gupta [01:00:23]: And their parsing is not off by a single penny. Out of 18 pages, not a single cent is off. How would you do that in any other way? And that's kind of what we want to enable for almost anyone in the world.
Ben Epstein [01:00:37]: Yeah, that's very sick. I wish, I would love to see a combination of the three of these things being used together. Get off of OpenAI, deploy your own model, leverage this level of JSON conformity and reasoning, and then reduce the amount you actually even have to send. It's a great synergy. I love it. I'm very into it. Okay, I think we're going to call it. We're just over time.
Ben Epstein [01:01:02]: Thank you all three for coming on. These were really some of my favorite talks we've had in a long time. Thanks, everybody, for coming on and listening.
Vaibhav Gupta [01:01:11]: Thanks, Ben. It was really fun and great to hear from you guys. Tom and Matt, it was really fun.
Tom Sabo [01:01:16]: Yeah, thanks for having us.