Building the Next Generation of Reliable AI // Shreya Rajpal // AI in Production Keynote
Shreya Rajpal is the creator and maintainer of Guardrails AI, an open-source platform developed to ensure increased safety, reliability, and robustness of large language models in real-world applications. Her expertise spans a decade in the field of machine learning and AI. Most recently, she was the founding engineer at Predibase, where she led the ML infrastructure team. In earlier roles, she was part of the cross-functional ML team within Apple's Special Projects Group and developed computer vision models for autonomous driving perception systems at Drive.ai.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
In this talk, Shreya will share a candid look back at a year dedicated to developing reliable AI tools in the open-source community. The talk will explore which tools and techniques have proven effective and which ones have not, providing valuable insights from real-world experiences. Additionally, Shreya will offer predictions on the future of AI tooling, identifying emerging trends and potential breakthroughs. This presentation is designed for anyone interested in the practical aspects of AI development and the evolving landscape of open-source technology, offering both reflections on past lessons and forward-looking perspectives.
AI in Production 2024
Building the Next Generation of Reliable AI
Slides: https://docs.google.com/presentation/d/1N_vOt7CwBVxW3FulQpuiTZjZ8VDxk7_4GBkM8f_vlsI/edit?usp=drive_link
Demetrios [00:00:00]: Now it is talk number two. I'm ready. I am so ready for Shreya. Where you at? Shreya, there you are.
Shreya Rajpal [00:00:08]: Hey. Hi Demetrios. Long time.
Demetrios [00:00:11]: I'm so excited for this. A little birdie told me that you have something special to tell us about in this session, but I'm not going to be a spoiler. I am going to let you get after it and go ahead and talk. For those that do not know Shreya, she's a friend of the community. She's doing awesome work when it comes to guardrails. Check out that open source project if you have not seen it. I'll drop a link in the chat. And Shreya, the floor is yours. Do you want to share your screen or?
Shreya Rajpal [00:00:42]: Yeah, yeah let's do it.
Demetrios [00:00:46]: All right Shreya the floor is yours.
Shreya Rajpal [00:00:49]: Yeah. Thank you, Demetrios. That's a really hard act to follow, but I'll try to do my best. All right. Hey everyone, thank you for your time here. My name is Shreya. I am the CEO of Guardrails AI, and my talk is about building the next generation of reliable AI. So a lot of this really comes from being out in the field building reliable technology for AI systems for the past year or so.
Shreya Rajpal [00:01:14]: So the purpose of this talk is to cover the set of problems that people who are really serious about building production-ready AI systems care about as far as reliability goes. What are the methods and techniques that they look for? What are the challenges with some of them, and how can we build towards a vision of the future that accounts for a lot of the challenges we have seen over the last year? Awesome. So before we get started, a little bit about me. I'm the CEO and co-founder of Guardrails AI, but in the past I've spent a decade working in machine learning. So I started out doing a lot of research in classical AI, deep learning, infrastructure. Worked in ML for self-driving cars for a number of years, was the tech lead for ML infra.
Shreya Rajpal [00:02:01]: I've kind of worked across the full stack of research, applied AI, ML infrastructure, and now I'm really working on how do we make generative AI work like any other piece of infrastructure. Awesome. So we've all kind of seen this over the past year. There's been this amazing fascination with generative AI technology and a lot of excitement about what it can really do for us. So can AI treat mental illness? Will generative AI completely transform sales? Will sales just fundamentally look really different from what it has in the past? Even software engineering: how do the coding workflows that we participate in day to day change? So there's this kind of Pandora's box that has been opened, and for the last year, all of us have been really playing around with this technology and understanding where this Pandora's box will impact what I do day to day. But at the same time, the reality of adoption for generative AI looks very different from what it has for previous generations. So, for example, ChatGPT here had the fastest path to 100 million users compared to even TikTok or Instagram, Facebook, et cetera.
Shreya Rajpal [00:03:14]: But if you look at one-month retention for ChatGPT or some of the other AI-first companies, compared to one-month retention for other, more traditional software companies, that graph looks a little bit different. So AI-first companies tend to have a little lower retention overall. And a lot of this really comes from the question of where are the really valuable use cases in AI, right? So what's really fascinating to me is there's about, like, 200 billion in value that AI will need to create, and will need to create fast, to justify all of the GPU spend. So Nvidia is really pumping out those GPUs at this unprecedented rate. And here's some analyst estimates about GPU spending, how much GPU utilization is expected to grow in the next couple of years. How do we create enough value out of the AI systems that we're building today to kind of justify that spending? Right. So the end consumers of a lot of these technologies will be places like very traditional enterprises, like your Walmart or your GM, et cetera. So how do they get value out of generative AI outside of just the really fantastic ChatGPT and GitHub Copilot? So another way to phrase the same question is: today, what is the problem that is holding enterprises back from being able to use generative AI effectively? And the answer to this is what, over the last year, we have been working on and seeing.
Shreya Rajpal [00:04:44]: And so this talk is basically a compilation of all of those amazing trends and insights. So some enterprises have adopted generative AI early, and here's some examples of the type of issues that they run into. So this was pretty well publicized. A Chevy dealership built an AI-first chatbot, and then Chris here convinced that chatbot to sell them an $80,000 truck for like a dollar. No takesies backsies, so it's legally binding. Another example here.
Shreya Rajpal [00:05:19]: So a lot of these are just from this week, where new lawyers are fined every day for citing cases that AI made up and that don't actually exist in the real world. The key issue is that enterprises are fundamentally very risk averse, and by using generative AI they open themselves to a lot of risk that is uncontrolled. So the natural question, if we want to create that huge amount of value that gen AI offers, is how do we really control a lot of that risk? And the answer is to really dig deep into reliable AI and reliable AI tooling. So that ends up being a very necessary piece of infrastructure in order to make enterprises that are fundamentally very conservative take on generative AI and create a host of really awesome products. What is reliable AI? I think there's a lot of talk around this, and it's related to a lot of similar ideas around responsible AI, AI governance, et cetera, including guardrails, like the philosophical idea of AI models needing guardrails. But the core idea for reliable AI essentially ends up being this: you have this really powerful system, and this same system can write SQL queries for you. It can write a poem about the 49ers losing the Super Bowl. It can write you an article about the modern data stack.
Shreya Rajpal [00:06:49]: It can do a host of other things, right? But when an enterprise adopts a gen AI technology in their own use case, they're using it for one single objective. I want to optimize my customer support team, or I want to optimize my data parsing or structured data extraction. So there's a single objective that I'm solving, and I just want my AI system to be constrained and answer this one problem really well, which is why can't I log into my account, or why are my transactions not working, or how do I change my password, et cetera, right? Not write a poem about the 49ers, not write a SQL query, et cetera. So what are the methodologies that enterprises and organizations use today to make AI systems reliable? And this is the really interesting part, where I basically go through a set of technologies that we see across the board (RAG, fine-tuning, better models, et cetera), and then talk about what works for those technologies, what makes enterprises and organizations nervous about those technologies, and how we can do better. Before I go into that: I'm going to be rating these technologies on a set of evaluation criteria, so that there's a holistic picture of how we can build something that's bigger than the sum of its parts by combining a bunch of these technologies together. So, efficacy of any methodology: how effectively is my risk reduced? Cost: of implementation, of runtime, of running it in production. Latency: is my application meaningfully slower by adopting this technology compared to if I wasn't? Customizability: I have a set of very diverse use cases. How well does this work for my specific use case versus the generic use case? Controllability:
Shreya Rajpal [00:08:45]: How much control does this give me over the model output? Am I just kind of happy with the output, or do I get guarantees? Or is it just like light suggestions? And then finally, ease of use. Great. So with that set up, let's dig into the first technology, which most people here would be familiar with: RAG, or retrieval augmented generation. So the idea for retrieval augmented generation is pretty straightforward. Instead of just asking ChatGPT a question straight away, why don't we first provide it some data from our internal data source that we know to be true and we know to be reliable, and we'll pass it the question that we want to ask with all of this extra data and tell the AI system, hey, I have all of this extra context. Given this context, can you tell me how I solve problem A or how I do this task, et cetera? So it's a way of grounding the response of the AI system in something that you can trust. So that's the high level idea.
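The retrieve-then-prompt flow described here can be sketched in a few lines. This is a minimal illustration, not anything shown in the talk: word overlap stands in for vector similarity, and the names `score`, `retrieve`, and `build_prompt` are hypothetical.

```python
from collections import Counter


def score(query: str, doc: str) -> int:
    """Count shared words between query and doc (toy stand-in for vector similarity)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum((q & d).values())


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents with the highest overlap score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]


def build_prompt(query: str, docs: list, k: int = 2) -> str:
    """Prepend the retrieved, trusted context to the user's question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
```

The resulting string would be sent to the model as the prompt. Nothing in this flow checks that the answer actually respects the context, which is the limitation the ratings that follow are about.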
Shreya Rajpal [00:09:46]: Efficacy: this might be controversial, but it's not great as far as reliable AI is concerned. It does give you answers. It's almost the standard. I actually don't talk to many people who are building technologies that are not built on RAG. But as far as reliability is concerned, it's not very effective in giving you a very reliable AI outcome. Cost is pretty okay. There's some cost of vector DBs associated, and your prompt length increases, but overall, not too bad.
Shreya Rajpal [00:10:20]: Latency is great. Customizability is great. You can fix it and configure it however you want. Controllability, again, not great. At the end of the day, you're just asking your LLM to respect the context that you provided it, which might not necessarily be respected. Ease of use is awesome. And a big part of this is why we see RAG as the ubiquitous solution everywhere where there's meaningful generative AI technology built out. If I were to summarize the key challenge for RAG as far as reliability is concerned, and unlocking this huge amount of value for generative AI use cases: RAG is effectively prompting on steroids. You just really overpower your prompt and tell it to fix a lot of the problems for you.
Shreya Rajpal [00:11:09]: But prompting doesn't offer any guarantees. Awesome. LLM self-evaluation: if we want to build reliable tooling and we have a problem with, you know, maybe getting weird responses or incorrect responses or responses that are just toxic or have hate speech in them, why don't we just ask the LLM to evaluate itself and tell me if it's good or bad? So that very, very simple idea is the idea behind LLM self-evaluation. And there's a host of derivatives of this method around assertions and self-checking, reflection, et cetera. Efficacy leaves something to be desired. Cost, latency, and controllability are all pretty poor for this method. There's a lot of research publications that talk about how it's very hard to trust the outcome of these methods because they have known biases, et cetera. Ease of use is really great, right? If you can write an English language prompt, you can ask the LLM to evaluate itself. The key challenge as far as LLM self-evaluation is concerned is, and I'm not going to attempt to try reading this, but who will guard the guards? How do we trust the LLM's own evaluation of itself? Isn't that just stacking two things that we can't trust on top of each other? Is that a sufficient lift for enterprises to be able to adopt this technology? RLxF: so RLHF, RLAIF, or just model fine-tuning, so really fine-tuning models so that they work for your AI applications or your use cases.
Shreya Rajpal [00:12:52]: Efficacy is fantastic. Cost, controllability, and ease of use, not so much. If you're getting pre-tuned, pre-fine-tuned models that you're using off the shelf, then that's great. But typically, in order to do this yourself, you would need a data set. You would need to have a lot of spend on AI model training, and my background was all in training really large models: really challenging, time consuming, cost intensive tasks to do. Customizability, not great, right? For every criterion that you need to enforce, you need to build this up from scratch. And the key challenge here is that typically, as an enterprise, as an organization, I have a whole custom list of criteria that I want my LLM to follow. So I will have my own communication guarantees for my own use case. I will have some requirements for the industry I work in.
Shreya Rajpal [00:13:43]: Like if I'm a consumer facing tech company versus if I'm a financial institution or a bank, I will have a very different set of concerns. So how do you really customize fine-tuning for all of these sets of concerns and use cases and stakeholders, et cetera? Okay, and then the final one. I'm obviously biased. The name of my company is Guardrails, but guardrails is also this technique for adding checks and verifications on the inputs and outputs of the models to very explicitly check for the set of risks that you care about. So technically, in how it functions, it works as a layer that surrounds your LLM. Again, like I said, I'm biased, but I've tried to be really objective here and talk through how guardrails compares to a lot of these technologies and what the challenges are today with this technology. Efficacy is great, cost is great, controllability is great because you can really mix and match and configure a lot of things. Latency and customizability leave something to be desired. Fundamentally, you're adding this computation after you get the LLM output back, and that computation will never end up being free, right? It can be optimized for sure, but it's never free.
Shreya Rajpal [00:14:51]: Ease of use is probably the biggest challenge here, from what we've seen over the past year. So, for example, if you were to encode every set of risks into an input/output guardrail, that is a lot of work that needs to be done for every single risk, right? And that ends up being the critical challenge here, which is: as a practitioner, as somebody building gen AI technology, am I supposed to build a guardrail for every type of risk we want to guard against? And that is the crux of the issue that we've seen over the past year, and what we try to address now. All right, so if I were to summarize the key challenges in AI reliability methodologies, the first is that instructions aren't guaranteed. So, as somebody who really worked on structured data generation from OpenAI at a time when that wasn't a thing: "Return only JSON. Do not include a preamble to your answer." I've seen variations of this, where a collaborator of mine literally has in their prompt, "somebody will die if your output doesn't look this way." And even then there's no guarantee, right? It's just a suggestion or an instruction.
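Since a "return only JSON" instruction is just a suggestion, the output has to be checked in code before anything downstream relies on it. A minimal sketch using only Python's standard `json` module follows; the function name `parse_structured` and the required-keys check are illustrative assumptions, not the approach used by any particular library.

```python
import json


def parse_structured(raw: str, required_keys: set) -> dict:
    """Parse model output as a JSON object, raising ValueError if the contract is broken."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Covers preambles like "Sure! Here is the JSON: ..." and truncated output.
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

A caller can catch the `ValueError` and decide what to do next: retry the request, repair the output, or refuse to proceed.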
Shreya Rajpal [00:16:04]: At the end of the day, customizability is a necessity. I think this is something that a lot of people miss, and that's a very important nuance of AI reliability, which is like what you care about really changes depending on who you are. So maybe something as simple as profanity, right? Some of us can agree, most of us can agree that profanity is bad, it's offensive. But I genuinely personally have had more than ten conversations with organizations, enterprises, et cetera, that were specifically looking for models that can generate profanity, because that was the kind of use case. Authenticity was what they were going after. A polite tone can be professional, but if you're building a more social application, it can appear distant. Financial advice. If you are just building something that doesn't have any regulatory constraints, as smaller companies typically don't, then financial advice is helpful, right? It's a nice little companion that's telling you how to split up your savings, et cetera.
Shreya Rajpal [00:17:04]: But if you are a financial institution, so if you are Morgan, if you are Robinhood, if you're a big company that wants to put out a gen AI product into production, financial advice is really risky, and there's regulation against being able to give financial advice. And finally, the ability to measure. So every time I put any system into production, I really need to understand how the system works for the set of risks that I care about. Right? Financial advice: how many times am I close to giving financial advice? How many times are my mitigations working? All of that ends up being really critical in order to get enterprises, get organizations, to a point where they can push these AI-first products out into the market. All right, so looking ahead from all of these amazing learnings that we've had, how do we address these challenges in an open and collaborative manner, so that we're able to have the fastest possible acceleration to solving the problem of AI reliability and a lot of these key challenges? So that is the problem that we're working on, and that is what we're really excited to announce, which is the Guardrails Hub, which is live, I think, as of like 20 minutes ago today. So that's been really exciting. And the key promise that the Guardrails Hub offers is that it is an open source platform with very high quality implementations of guardrails for a lot of common use cases.
Shreya Rajpal [00:18:31]: So, for example, we talked about, am I supposed to build a guardrail for every single use case, every single risk that I want to capture? Right? So the Guardrails Hub is this open source platform that just comes with a lot of those technologies pre-implemented for you. There are plug and play guardrails that can be configured and customized. So a lot of them you can use off the shelf and just get started with directly. But a lot of them can be customized and can tell you, hey, for this type of risk, what is the best solution out there? What does the state of the art solution look like? So a lot of the benefits of open source technology that we've seen in the past with software, and in machine learning with Hugging Face: how do we bring some of those same benefits to open source AI reliability? And then finally, there's templates and recipes for ease of use that tell you, for your use case, for your application, how do you really solve the set of AI reliability challenges? So it's fully open source and it comes with batteries included. So 50-plus validators implemented across a bunch of different use cases, a bunch of different risks that you want to capture. Input guarding as well as output guarding, so you can check for PII, profanity, jailbreaking attempts. Is this conversation veering off course from what I intended this conversation to be? Output guardrails for a host of risks around profanity, etiquette, tone, data leakage, et cetera. We talked about latency as a key concern that a lot of organizations have.
Shreya Rajpal [00:20:00]: So Guardrails comes with streaming built in. So any guardrails that you mix and match and apply to your system come with streaming. So you can stream-validate any output that you get, which really helps with latency. Orchestration of all of the guardrails you configure. And each guardrail is configurable, customizable, and fully open source, so that you can really adopt it for your own use case. I have a few quick examples of how you would practically install and set this up. So we've really worked on making it as easy to use as a lot of other gen AI technology, as simple as calling your model endpoint, for example. So installing a guardrail is a simple guardrails hub install and then the URI of the guardrail. So here we have our anti-hallucination guardrails.
Shreya Rajpal [00:20:50]: And then setting it up is: create a guard that uses that specific guardrail that you want to add to your system. For example, if you didn't care about hallucination, but you did care about only talking about a certain set of topics, let's say only talking about the weather and music, and never ever venturing into talking about politics, you can install the on-topic guardrails with guardrails hub install and this URI, and set it up and use it by creating a guard that uses the restrict-to-topic guardrail, specifies only talk about food and cars, never talk about politics, and then validate if that works for your use case. You can mix and match multiple guardrails, so PII and profanity can separately be installed. And then you can create a guard that uses many guardrails, and this will all be streamed. So many guardrails as part of a guard will be streamed, and then the guard will orchestrate them so they run in pretty low latency. Awesome. We're really excited about this. We hope that this is like a leap forward for reliable AI tooling.
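The idea of a guard that chains several independently configured guardrails can be illustrated without any library. The sketch below is a toy, not the Guardrails AI API: `ToyGuard`, `banned_words`, and `max_length` are hypothetical names, and substring matching is a crude stand-in for the topic classifiers a real on-topic guardrail would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A validator returns None when the text passes, or a failure message.
Validator = Callable[[str], Optional[str]]


def banned_words(words) -> Validator:
    """Fail when any banned word appears (crude stand-in for a topic classifier)."""
    lowered = {w.lower() for w in words}

    def check(text: str) -> Optional[str]:
        hits = sorted(w for w in lowered if w in text.lower())
        return f"banned words: {hits}" if hits else None

    return check


def max_length(limit: int) -> Validator:
    """Fail when the text is longer than `limit` characters."""

    def check(text: str) -> Optional[str]:
        return f"too long: {len(text)} > {limit}" if len(text) > limit else None

    return check


@dataclass
class ToyGuard:
    validators: list = field(default_factory=list)

    def use(self, validator: Validator) -> "ToyGuard":
        """Add a validator and return self so calls can be chained."""
        self.validators.append(validator)
        return self

    def validate(self, text: str) -> list:
        """Run every validator and collect the failure messages."""
        return [msg for v in self.validators if (msg := v(text)) is not None]
```

Usage would look like `ToyGuard().use(banned_words(["politics"])).use(max_length(200)).validate(llm_output)`, where an empty list means every check passed.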
Shreya Rajpal [00:21:54]: By really being this collaborative platform for building a lot of AI technology. You can stay in touch with a lot of the open source projects, as well as our website, by the link in bio, Guardrails AI. And then the Guardrails Hub can be found at hub.guardrailsai.com. And like I said, live as of I think 30 minutes ago now. And yeah, really, thank you for your time.
Demetrios [00:22:18]: So cool. Shreya, I'm so excited for this. This is awesome. I talk about this all the time, but I think it's hilarious when you brought up that example of somebody buying a car for one dollar and it was legally binding, no takesies backsies. So that kind of stuff you don't want happening, right? That is not what you want to have your AI use case be. And there's another few things on here. I mean, the chat has been blowing up, so that is awesome. We're going to ask like two or three questions and then keep it moving, because we've got a jam packed afternoon or morning or wherever you are.
Demetrios [00:23:00]: But first question, while people are throwing their questions in the chat. We had been discussing, I think before the actual start, what do we got to do? Like, guardrails helps models not hallucinate and it helps them stay on track. Is there any hallucinations-as-a-service type thing that you've heard about? Because I need to get me some of that. That's what I'm looking.
Shreya Rajpal [00:23:33]: I mean, you say that as a joke, but that genuinely is the really fascinating thing. Right. For Morgan, building a chatbot for customer support, hallucinations are necessarily bad. Like, they do not want it. But if you're using ChatGPT to brainstorm or write amazing new music about AI that's never been written before, hallucinations are a good thing. They're almost like creativity. It's a proxy for creativity, right? So that's a really interesting thing about guardrails, and it kind of goes back to customizability for me, which is that not every person interprets correctness the same way.
Shreya Rajpal [00:24:11]: Like, hallucinations can be good for someone and bad for someone. And as a user, you should kind of have the ability to kind of mix and match and really apply what you want and you need for your use case.
Demetrios [00:24:22]: Yeah, I did like that idea of like, yeah, I'm one of those use cases where I would love a model to curse, not care if it is saying bad words to me. It's all right.
Shreya Rajpal [00:24:35]: Maybe offline I can share some jailbreak prompts with you. That'll get you there.
Demetrios [00:24:40]: That's what I'm talking about. That's what I'm looking for. All right, cool. So another question that came through that I thought was fascinating, and let me see if I'm understanding it correctly. It's like, for different use cases, you can apply different guardrails. And so, for example, I think someone was asking a question about the healthcare industry, and you can kind of, like, niche it down by use case in the healthcare industry, what you're doing with it. And then you almost have these little building blocks that you can say, I want two or three of these guardrails on top of my model. So there's one that's very broad that says, don't talk about anything that's not healthcare related, and then more that are specific.
Demetrios [00:25:23]: Is that how we understand it?
Shreya Rajpal [00:25:25]: Yeah, absolutely. We've seen this, for example, we work with financial institutions, and we've seen this with financial institutions where they care about, again, they're enterprises. So hallucinations are always bad for them. So they always, like, hallucinations are a very broad, sweeping thing. Right. But they care about very specific things where my enterprise offers these three products. So it's not even ask me a question about finance. It's ask me a question about these three products and these three products only.
Shreya Rajpal [00:25:54]: And my response will be about these three products only. I won't talk about, if you come and ask me my opinion on the 2024 general election, I won't even give, not even a neutral response. I don't want to give any response because anything reflects poorly on me. So you can really whittle it down from the broadest correctness like hallucination type things to really specific. This is my use case. This is who I am as an organization, and this is what I care about. And configure and mix and match and play with it.
Demetrios [00:26:19]: Yeah, that's cool. That is very cool. So last question for you, and then we're going to keep rocking. Are there different options for bias guardrails? Sometimes you want certain biases, and other times maybe not so much. So it seems like it might mitigate some of the debate around bias making its way into LLMs. And I guess just me to tag on to that: do you have a temperature control on how much guardrail you can put on? Or is it you slap it on and it's on or off?
Shreya Rajpal [00:26:47]: Yeah. So every guardrail is configurable. For a lot of the types of guardrails around bias, hallucinations, toxicity, staying on topic, there's a threshold, because all of these guardrails under the hood are machine learning models. That's the only way to attack this problem. You can't have a rules engine for a lot of these things. So for these model-based guardrails, you can set and configure your sensitivity to them. You can pretty much configure, for example, like toxicity, I think, or profanity.
Shreya Rajpal [00:27:17]: I'm forgetting which one. But you can define what are the types of profanity you care about. Is it talking about nationalities or is it talking about race, et cetera. So you can really configure what is the type of toxicity that you're sensitive to, and then filter that out. So it's all configurable. And I guess it goes back to our underlying learning, which is that customizability, again, is really key.
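The configurable sensitivity described here amounts to comparing a model's score against a threshold the user chooses. A minimal sketch, assuming a scoring function that returns a value between 0 and 1: `make_threshold_validator` is a hypothetical helper, and `toy_toxicity_score` is a word-list stand-in for the machine learning classifier a real guardrail would use.

```python
def make_threshold_validator(score_fn, threshold: float):
    """Build a validator that passes text whose score is below the threshold."""

    def validate(text: str) -> dict:
        s = score_fn(text)
        return {"passed": s < threshold, "score": s}

    return validate


def toy_toxicity_score(text: str) -> float:
    """Fraction of words found in a flagged list (toy stand-in for a classifier)."""
    flagged = {"idiot", "stupid"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in flagged for w in words) / len(words) if words else 0.0
```

The same scorer can then back a strict guardrail (a low threshold) for one application and a lenient one (a high threshold) for another.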
Demetrios [00:27:45]: Yeah. Going back to that control. And there was somebody that wrote, I think it was Nina in the chat, and I thought it was hilarious how they were saying, there's so many broken hearts on this slide. It's making me depressed. Something along those lines. So, hopefully not anymore. Hopefully. Just go check out guardrails hub and no more broken hearts the day after Valentine's Day.
Demetrios [00:28:12]: So, we've got Holden coming up next. I want to ask these two questions because they're so awesome. So, Holden, thank you for your patience. I know you're in the background, and I've got you coming. The questions that I want to ask are coming through in the chat, and they're: if we use LLMs for guardrails, the UX slows down. Have you done any performance analysis on regex rules versus lookups versus LLM-based?
Shreya Rajpal [00:28:43]: Yeah, I think LLM-based is by far the slowest. So some guardrails use LLMs under the hood. And that's also why we try to build, like, model-specific guardrails. Right. We have hallucination guardrails that we're releasing. There's one hallucination guardrail that uses LLMs under the hood, and it is performant, but it is so slow. But the models that work really well, are fast, and are competitive with these models are really task-specific, very precise, smaller models. And that's kind of like the agenda forward.
Shreya Rajpal [00:29:16]: Right, which is that you get the sophistication of LLM-based guardrails but without the slowdown from that. We have some benchmarks that we plan to share that aren't public yet: rule-based, almost next to no latency, especially with streaming. Model-based guardrails, also very low latency, especially with streaming and the orchestration that we have. LLM-based guardrails add like a little bit of lift, which is also why we mix and match and only use that as a fallback.
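The "use the LLM only as a fallback" ordering can be sketched as a chain of tiers, cheapest first, where each tier either returns a verdict or defers to the next one. Everything below is an illustrative assumption: the tier functions are toy stand-ins (a regex, a length heuristic in place of a small model, a keyword check in place of an LLM call), and `run_tiers` is a hypothetical helper.

```python
import re

SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def rule_tier(text: str):
    """Fast regex tier: fail on an SSN-like pattern, otherwise defer (None)."""
    return False if SSN_LIKE.search(text) else None


def small_model_tier(text: str):
    """Toy stand-in for a small classifier that is only confident on short inputs."""
    return True if len(text.split()) <= 5 else None


def llm_tier(text: str):
    """Toy stand-in for a slow LLM-based check; it always returns a verdict."""
    return "secret" not in text.lower()


TIERS = [("rules", rule_tier), ("small_model", small_model_tier), ("llm", llm_tier)]


def run_tiers(text: str, tiers=TIERS, default: bool = True):
    """Return (verdict, deciding tier name), trying the cheapest tiers first."""
    for name, check in tiers:
        verdict = check(text)
        if verdict is not None:
            return verdict, name
    return default, "default"
```

With this ordering, the slow tier only runs on inputs the cheaper tiers could not decide.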
Demetrios [00:29:46]: Excellent. All right, we're going to keep it moving. And Shreya, thank you so much. Also, huge announcement from you today. Right? You raised boatload of that.
Shreya Rajpal [00:30:00]: I haven't read it yet, but I think the TechCrunch article popped out just as I was getting into the green room for this talk. So, yeah, go check out the article. Yeah. Thank you so much for your time, Demetrios. Really appreciate it.
Demetrios [00:30:11]: Yeah, likewise. This is great. Thanks so much. And anybody that wants to continue talking to Shreya, go ahead, hit her up on LinkedIn and on Twitter where she is very active.