Controlled and Compliant AI Applications
Daniel Whitenack (aka Data Dan) is a Ph.D.-trained data scientist building a tool called Prediction Guard, which wraps state-of-the-art models under the hood to provide a delightful developer experience (without having to worry about which model to call). He has more than ten years of experience developing and deploying machine learning models at scale. Daniel co-hosts the Practical AI podcast, has spoken at conferences around the world (Applied Machine Learning Days, O’Reilly AI, QCon AI, GopherCon, KubeCon, and more), and occasionally teaches data science at Purdue University.
You can’t build robust systems with inconsistent, unstructured text output from LLMs. Moreover, LLM integrations scare corporate lawyers, finance departments, and security professionals due to hallucinations, cost, lack of compliance (e.g., HIPAA), leaked IP/PII, and “injection” vulnerabilities. This talk will cover some practical methodologies for getting consistent, structured output from compliant AI systems. These systems, driven by open access models and various kinds of LLM wrappers, can help you delight customers AND navigate the increasing restrictions on "GPT" models.
Daniel, thank you so much for joining us. Did you catch any of the prompt injection game? I did, and I was on the edge of my seat. That was awesome. It's pretty cool; I'm going to have to look at it more later. But we are very excited for your talk. Thank you for your patience.
And I think, without further ado, do you already have your slides? Yeah, I can share my screen. Let me go ahead.
Okay, here we go. You see those? Yes, we're seeing your screen; it should be the title slide. Hold on, I'll share the actual window instead. Let me try that. Yes, that is better. Awesome. Cool. Thanks so much. Take it away.
Awesome. Well, thank you for having me. This is super exciting. Today I'm going to talk about controlled and compliant AI applications.
Okay, there we go. So I am Daniel Whitenack. I've spent time as a data scientist in industry for quite a while. I've built up data science and data teams at a couple of different startups and at an international NGO called SIL International, and I've done consulting at a bunch of different places.
Now I'm the founder of a company called Prediction Guard. I also co-host a podcast called Practical AI. If you're into podcasting things, we've done a crossover with the MLOps Community podcast, which is one of my favorites, so check that one out. In addition to all of those things, one of the main things I want to do is integrate the latest large language models into my applications.
That's why everyone's here. I've listened to a few of the talks; everyone's trying and accomplishing some amazing things, so I want to join in. I want to integrate the latest large language models. But there are a few problems that I've run into personally. I've tried a bunch of different things with these models, and one of the things I've found is that it's great when you put something into ChatGPT and you get this sort of text-vomit output.
It's amazing, it's a magical experience, and you immediately think, oh, this could solve so many of my problems. But the fact is that the output is also unstructured text, and it's fairly inconsistent, and I can't really build robust systems with inconsistent, unstructured text-blob output.
So if I want to do something like sentiment analysis, for example, and I don't want to use a traditional sentiment analysis model, I could construct a prompt like this, put it into a large language model, and get out something that looks like the bottom right. I do get a sentiment tag, in this case a wrong one, I think.
But then I also get all of these other things that continue on, and you might say, oh, Daniel, that's not a problem, you can put in a stop token, or you can strip out some of that extra text. But I don't want to do that. Part of the reason I want to use large language models is because I want this magical output that's just right.
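To make the problem concrete, here is a minimal sketch of that kind of sentiment prompt, assuming the legacy OpenAI completions endpoint; the model name and the example text are illustrative, and the point is that the raw completion still has to be parsed and trimmed before anything downstream can use it.

```python
import openai  # assumes the legacy completions-style API referenced in this talk

prompt = """Classify the sentiment of the following text as POS, NEG, or NEU.

Text: The candle smells amazing, but shipping took forever.
Sentiment:"""

response = openai.Completion.create(
    model="text-davinci-003",  # illustrative model choice
    prompt=prompt,
    max_tokens=50,
)

raw = response["choices"][0]["text"]
# `raw` might look like " NEG\n\nText: I love this shop!\nSentiment: POS ..." --
# the label we wanted plus extra continuation text we'd have to strip or regex out.
print(raw)
```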
So it's annoying that I get this text-blob output. Also, large language models generally scare all of these sorts of people, like lawyers, finance departments, and security professionals, due to a lot of things people have already talked about at this conference: hallucinations, costs.
Most of these language model APIs charge by the token, so you're racking up costs, especially for use cases like data extraction. There's a lack of compliance in these systems. There's a fear of leaking IP and personal information into commercial APIs with maybe suspect terms and conditions around how that data will be used.
And I loved the setup to my talk just now, this game about injection vulnerabilities. That's another thing that's on people's minds. So generally you have a situation like this: you're in a company, and over time you've created all of this great infrastructure that's unit tested.
It's compliant if you need to be compliant with something like HIPAA or SOC 2 Type 2. It's structured, it's secure. You're interacting with a customer, maybe there's sensitive data there, and all is good. Then you bring in something like OpenAI, and to be clear, I have nothing against OpenAI.
It's an amazing API and system and set of models. But you're going to start making people like legal, security, and finance really unhappy. In the best case, your customer is still happy, because they're getting these kind of magical, large-language-model-driven experiences, but everybody else might be having some issues.
In the worst case, you start hallucinating incorrect information out to your customers, and all of a sudden your customers are not happy with you either. So what I want to discuss today is the current reality of dealing with inconsistent, unstructured results: how are people dealing with that right now?
Then, how are people maintaining compliance? What are the various things people are trying? And then I want to introduce my own opinion about a path forward through all of this complication. It wouldn't be a talk at LLMs in Production if I didn't provide some opinions and hot takes.
So you'll get some of that towards the end. All right, let's start by talking about this unstructured text-vomit output and how people are dealing with it. I would say the major trend in how people are dealing with this unstructured output is some type of wrapper around large language models.
This could be your own code that you've implemented around large language models, involving some regex, keyword spotting, or filtering. It could be a framework, and I'll show a couple of those in a second. It could be new query languages or even schemas, and custom decoding.
Just to give an example of how some of this is done, I'd like to highlight great work by Shreya Rajpal on Guardrails. I think she has talked at this conference before. There's amazing work here around using these XML-like specs to wrap large language model calls and guard those calls to check for specifically structured output.
So you're not getting random text blobs; you're getting nice, structured JSON out. Similar projects include guidance from Microsoft, and a few others. This idea that you would wrap a large language model call, validate the output, and re-ask for things you didn't get is one approach people are taking here.
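As a rough sketch of the wrap-validate-re-ask pattern these frameworks implement (this is not the actual Guardrails or guidance API; `call_llm` and the key check here are hypothetical stand-ins):

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever completion endpoint you use."""
    raise NotImplementedError

def get_structured_output(prompt: str, required_keys: set, max_retries: int = 3) -> dict:
    """Ask for JSON, validate it against a simple schema, and re-ask on failure."""
    ask = prompt + "\n\nRespond ONLY with a JSON object containing the keys: " + ", ".join(sorted(required_keys))
    for _ in range(max_retries):
        raw = call_llm(ask)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            ask = prompt + "\n\nYour last answer was not valid JSON. Respond only with valid JSON."
            continue
        if required_keys.issubset(parsed.keys()):
            return parsed
        missing = required_keys - parsed.keys()
        ask = prompt + "\n\nYour last answer was missing the keys: " + ", ".join(sorted(missing)) + ". Respond only with valid JSON."
    raise ValueError("Could not get valid structured output from the model")
```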
One thing I'd like to highlight about these is that they're amazing; they work great. I think both guidance and Guardrails work great with OpenAI. But in my experience, these frameworks don't work the same way across all models, especially if you're using open-access models.
There's some extra custom integration or wrapping that you need to do to make an open-access model, let's say Falcon, work with some of these systems. But this is one approach that people are taking. Another one I'd like to highlight, from Matt Rickard, is ReLLM, or regex LLM.
I think this is a really cool approach because, as opposed to wrapping a large language model, validating the output, and then re-asking, it actually modifies the way tokens are generated as you stream them out of the model. There's ReLLM, and another version by him called ParserLLM that does a similar thing with context-free grammars. You look at your regex pattern or your context-free grammar,
you figure out which next tokens would be valid, and then you mask the model's output for the next token generation based on that regex or that grammar. So this is really cool; this is another approach. I think in Matt's own words when he tweeted it out: we had a problem, we introduced regex, and now we have another problem.
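Here is a toy sketch of that token-masking idea (not the actual ReLLM or ParserLLM code); it assumes you have access to the model's vocabulary and next-token logits, and it uses the third-party `regex` package's partial matching to keep only tokens that could still complete the pattern.

```python
import regex  # third-party `regex` package, which supports partial matching

def allowed_token_ids(prefix: str, pattern: str, vocab: dict[int, str]) -> list[int]:
    """Return token ids whose text, appended to `prefix`, can still satisfy `pattern`."""
    compiled = regex.compile(pattern)
    allowed = []
    for token_id, token_text in vocab.items():
        candidate = prefix + token_text
        # partial=True accepts strings that could still be extended into a full match
        if compiled.fullmatch(candidate, partial=True):
            allowed.append(token_id)
    return allowed

# At each decoding step, you would mask the model's next-token logits so only ids in
# allowed_token_ids(generated_so_far, r"(POSITIVE|NEGATIVE|NEUTRAL)", vocab)
# can be sampled, and stop once the generated text fully matches the pattern.
```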
So this comes with its own problem of having to deal with regex or context-free-grammar type stuff. Okay, now let's talk about hallucinations. People are trying a lot of different things here, but the ones I'd like to highlight are formulating multiple prompts, doing some type of consistency check, or actually using large language models to evaluate large language models.
When I say self-consistency checks, or consistency checks, I mean something like this: instead of just providing a prompt to a model and getting output, or even making that prompt fancy with something like chain-of-thought prompting,
you actually input your prompt multiple times to the model with a higher temperature, so you get some variability, and then you take a majority vote over the outputs to get some self-consistency. That's one methodology to rein in some of this variation in the output.
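A minimal sketch of that majority-vote idea, assuming a hypothetical `call_llm(prompt, temperature)` completion function and answers that are short, comparable strings (a class label, a number, and so on):

```python
from collections import Counter

def self_consistent_answer(prompt, call_llm, n_samples=5, temperature=0.7):
    """Sample the model several times at a higher temperature and majority-vote the answers."""
    answers = [call_llm(prompt, temperature=temperature).strip() for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    if count <= n_samples // 2:
        # No clear majority: treat the result as inconsistent rather than guessing.
        raise ValueError("Inconsistent results across samples")
    return best
```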
That does have some other issues, like: how do you compare outputs that aren't just single values, numbers, or classes, but free text? Maybe you could compare them with embeddings or other things. Another technique I wanted to highlight is language models evaluating language model output.
This, for example, is from Inspired Cognition's Critique, a package and platform from the cool people over at CMU, Graham Neubig and company. You can use a model that's been trained for this; in this case, a BART model that's been fine-tuned to estimate the factuality of a reference text against a source text, and to estimate the factual consistency between the two.
So you could do that sort of check. Critique also implements toxicity checks and other things, but those are examples of language models evaluating output. There are also some good docs in the LlamaIndex documentation, from Jerry and company, who have been thinking more and more about evaluating the output of a retrieval chain and comparing it to the retrieved context to determine the quality of the output.
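A hedged sketch of the language-model-evaluating-language-model idea, using a judge prompt rather than the actual Critique client (whose API isn't reproduced here); `call_llm` is again a hypothetical completion function, and a fine-tuned evaluator model could be swapped in for the judge.

```python
def is_factually_consistent(source: str, claim: str, call_llm) -> bool:
    """Use a second model as a judge of whether `claim` is supported by `source`."""
    judge_prompt = (
        "You are checking factual consistency.\n\n"
        f"Source:\n{source}\n\n"
        f"Claim:\n{claim}\n\n"
        "Is every statement in the claim supported by the source? Answer YES or NO."
    )
    verdict = call_llm(judge_prompt).strip().upper()
    return verdict.startswith("YES")
```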
In terms of compliance, there are a lot of things people are doing. One is ignoring the risk and just charging ahead anyway, which is a setup for disaster. Some are putting a full stop on LLM usage, or not allowing certain data to be passed to LLMs. That, of course, is restrictive, and it might even be another kind of liability for your company, because you're not getting the advantage of using LLMs and the market is leaving you behind.
And then there's hosting private models. I know that's been discussed here. With that, of course, you have to learn a bit more about using GPUs and how to do hosting in your own infrastructure and scale it up, which has its own challenges. So here's my opinion about what we should do going forward, based on these observations.
I've been trying to build large language model applications for clients and for my own use cases, and it becomes exceedingly complicated, because using all of what I just talked about means I have multiple open source projects, multiple large language models evaluating large language models, and all sorts of different queries and specs that I have to know, from regex to XML to other things,
like special query languages such as LMQL, the language model query language. I also need to figure out model hosting and private model hosting, and I have to do that for each model I want to integrate, because all of those things work differently for different models. This starts to become more than a single human can manage, I think.
So my opinion on how to move forward in this climate is to create a standardized API for both open and closed models. Open models are then hosted in a compliant manner and integrated into that standardized API. Then things like checks for consistency, factuality, and toxicity, and the structuring of output
and output types, are brought together inside that standardized API in a way that's familiar to devs. They don't have to learn a new specification or a new query language. They just have to know how to make an API call and which type they want on the output.
I'm going to share what that actually looks like in practice. Of course, my opinionated take on this is what's represented in the system we've built at Prediction Guard. So I'm going to show you how that works out and hopefully help you understand why I think this is really useful and can produce a lot of acceleration as you try to build actual large language model applications.
All right, let's see. I'm not sure why that slide isn't loading, so let me escape here. I don't know if you're seeing the prompt now, but here's the prompt I'm going to be using throughout the rest of these examples.
It's an Instagram post that I've pulled from my wife's company, about candles. We're going to use it and make some queries about this post using different large language models. So let me share my screen again, and then we'll continue on. To do this, I'm using LangChain to create the prompt template,
but everything else I'm going to show is with the Prediction Guard client, which you can install with pip install predictionguard. All right, we'll just show it this way and hopefully you'll be able to see some of it. Some of you might be familiar with the OpenAI API.
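The setup for the examples that follow looks roughly like this. The post text is a stand-in (not the actual Instagram post from the talk), and the import name and environment variable for the access token are assumptions; only the `pip install predictionguard` step and the use of a LangChain prompt template come from the talk itself.

```python
import os
import predictionguard as pg                  # pip install predictionguard (import name assumed)
from langchain.prompts import PromptTemplate  # only used to build the prompt text

os.environ["PREDICTIONGUARD_TOKEN"] = "<your access token>"  # assumed env var name

# Stand-in for the real Instagram post used in the demo
post = (
    "New candle drop! Two new scents join the shop today: "
    "'Cabin Weekend' ($24) and 'Harvest Spice' ($22). Link in bio!"
)

template = PromptTemplate(
    input_variables=["context", "question"],
    template="Context: {context}\n\nQuestion: {question}\n\nAnswer:",
)
prompt = template.format(context=post, question="What kind of post is this?")
```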
The OpenAI API looks similar to this: you call the completion create endpoint, select a model, and give it a prompt. It works almost the same way with Prediction Guard, actually in this case exactly the same way. So we can ask what kind of post this is, and then we get this text output down here.
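For reference, the legacy OpenAI completion call being compared against looks roughly like this, using the `prompt` built in the setup sketch above (the model name is illustrative):

```python
import openai

response = openai.Completion.create(
    model="text-davinci-003",  # illustrative model name
    prompt=prompt,
)
print(response["choices"][0]["text"])  # free-form text we would otherwise have to parse
```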
Okay, no dice on that slide; I'll share these slides afterwards so people can see them a little bigger. The first thing we want to do, as I mentioned, is create a standardized API for both open and closed models, where the open models can be hosted in a compliant way, even with something like HIPAA compliance.
So here we have our Camel-5B model, which is hosted in this compliant way, and we can use the exact same OpenAI-like API to query it: what kind of post is this? We get back some text vomit out of this model again, although it looks okay, maybe. Now we want to go to the next phase: we can use both open and closed models, with the open models hosted in a compliant way with Prediction Guard.
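A sketch of the equivalent Prediction Guard call against the compliantly hosted Camel-5B model, continuing from the setup sketch above; the exact shape of the client call is an assumption based on the "OpenAI-like" description in the talk.

```python
import predictionguard as pg  # reusing `prompt` from the setup sketch above

result = pg.Completion.create(
    model="Camel-5B",  # an open-access model hosted behind the same standardized API
    prompt=prompt,
)
print(result["choices"][0]["text"])  # still free-form text at this point
```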
Now we want to start enforcing some structure on our output. So here I'm saying: I don't want just any sort of output; I actually want a certain type of output. I want categorical output, one of two categories. I want to know if this post is a product announcement or a job posting, and now my output is going to be only one of those two categories.
Here you can see my output has been modified, and I get just one of those two categories: this is a product announcement. I can do a similar thing with other types, like an integer type. I could ask how many new products are mentioned and enforce an integer type on the output, and now I get my output as two.
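Sketches of those two typed calls, continuing from the setup sketch; the `output` parameter and its `type`/`categories` fields are assumptions reconstructed from the description in the talk, not a documented API reference.

```python
import predictionguard as pg  # reusing `template` and `post` from the setup sketch above

# Categorical output: constrain the answer to one of two labels.
result = pg.Completion.create(
    model="Camel-5B",
    prompt=template.format(context=post, question="What kind of post is this?"),
    output={"type": "categorical", "categories": ["product announcement", "job posting"]},
)
# -> "product announcement"

# Integer output: enforce a whole-number answer.
result = pg.Completion.create(
    model="Camel-5B",
    prompt=template.format(context=post, question="How many new products are mentioned?"),
    output={"type": "integer"},
)
# -> 2
```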
There are two new products mentioned in this post. Right now we implement integer, float, categorical, JSON, and custom JSON formatted output, all via this OpenAI-like API. Now, just because we can structure the output doesn't mean it's factual or consistent.
So the next thing we can do is pull a couple of levers, flip a couple of switches, and say: hey, I want this output to be an integer, but I also want consistency and I want factuality. Just by setting these two elements on my output, I can ensure that when I send this off to Prediction Guard, it pings my model multiple times and checks for consistency.
It also uses a different large language model to check the factual consistency between the output and the input. And here I still get a success: Camel was able to tell me how many new products are mentioned, with factuality and consistency checked. I could try to fool it and ask how many giraffes are mentioned, and with those same checks in place I get an error that says "inconsistent results."
That way I can avoid that sort of hallucination. The last one I'll show here is JSON-formatted output. So rather than just doing consistency and factuality checks and type enforcement, we can actually start to get structured output.
I can say: list out the product names and prices, and make that JSON type. We get the text output, but it's also automatically parsed and available as a dictionary in Python, or as a JSON blob in the REST API response.
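And a sketch of the JSON-typed call, with the same caveat that the `output` parameter is an assumption and the example result reflects the stand-in post from the setup sketch:

```python
import predictionguard as pg  # reusing `template` and `post` from the setup sketch above

result = pg.Completion.create(
    model="Camel-5B",
    prompt=template.format(context=post, question="List out the product names and prices."),
    output={"type": "json"},
)
# The completion text is parsed for you and comes back as a Python dict
# (or a JSON blob from the REST API), e.g. something like:
# {"products": [{"name": "Cabin Weekend", "price": 24}, {"name": "Harvest Spice", "price": 22}]}
```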
So this is how we're envisioning this element of control and compliance with large language models. We'd have this easy-to-parse output configuration for developers, with types they're familiar with: integer, float, Boolean, categorical, JSON, and custom JSON. We'd also have flip-of-a-switch checks for factuality,
toxicity, and consistency. That's our vision of this. As we move to the future, of course, we have a lot on the roadmap. We'd like to make this sort of wrapping, and this consistent API with access to all of this functionality, available for people's fine-tunes of LLMs as well.
We've also got more presets of structure and type on the way, along with more model integrations. I'd like to give a special thanks to a lot of people who have inspired this work, and even helped out, working on great frameworks that have really led the way on this front; you can find some of them here.
So thank you again. These are all the places you can find me on the interwebs and the social medias. Here's a link to the podcast, and a link to Prediction Guard if you want to check that out. And of course, I'll be hanging out in the event link and on the MLOps Community Slack if you want to reach out directly.
Awesome, thank you so much, Daniel. Let's definitely drop some of those links in the chat; maybe we'll send you over there after, if you want to share that with folks as well. Sure. Yeah, thanks so much for joining us. Let's give it a minute to see if people have some questions.
I think that prompt engineering game absorbed a lot of people, but hopefully they have some good questions. One question I have, though: as you were doing this work, were there things that surprised you, that you weren't expecting? Yeah.
Great question. I think the thing that surprised me was this: I knew that a lot of smart people were working on individual pieces of this solution, so you can assemble that into something fairly reasonable for a single model. But then as soon as you figure out that the model doesn't work for the use case you need in your next project, you have to repeat everything again,
because, despite them being really good frameworks with a lot of functionality, they all work slightly differently for different models. So you end up in this loop where you still have to do a lot of this language-model wrapping and custom plumbing around these things
to get a consistent API and consistent output from project to project. Got it, yeah, that's great. And a note from Sterling: "Great talk." Awesome, thanks, Sterling. Good vibes. Cool. This is being recorded, so we'll send that out to folks, and then we'll also share the slides.
I think maybe Pauline got those from you, or we'll get them from you. Yes, and I'm just pasting them into the chat now. Awesome, thank you so much. Well, awesome, Daniel, we'll let you go on with your day. We really appreciate you joining us. Yeah, thanks so much. See you all.