MLOps Community

Rags to Riches With Kubernetes

Posted Jul 26, 2024 | Views 194
# Generative AI
# RAG
# LLMs
# Google
Anu Reddy
Software Engineer @ Google

Anu is a senior software engineer working to make Google Kubernetes Engine the best destination to run AI/ML workloads!

SUMMARY

The rapid advancement of AI and generative AI is transforming industries and empowering new applications. However, the widespread adoption of these technologies necessitates robust protection of the sensitive data that fuels them. Retrieval Augmented Generation (RAG) is a popular technique to enhance Large Language Models (LLMs) by grounding their responses in factual data, thereby improving the quality, accuracy, and reliability of AI-generated outputs. This lightning talk will explore practical techniques to safeguard sensitive data and ensure trustworthy AI-driven applications. We will demonstrate how to filter out sensitive or toxic responses from LLMs using Sensitive Data Protection (SDP) and NLP. We will also showcase how to leverage Google-standard authentication with Identity-Aware Proxy to control access so users can seamlessly connect to your LLM front end and Jupyter notebooks.

TRANSCRIPT

Dan Baker [00:00:09]: Perfect. Take it away, Anu.

Anu Reddy [00:00:11]: Great, thanks, Dan. Hey everyone, I'm virtual. It's been nice to meet you virtually. I'm Anu. I'm a software engineer at Google. Welcome to my session about how we built a sample RAG application on top of Kubernetes, and I'm just going to talk a little bit about how you can productionize such an application tailored to your own business needs in a more secure way. So first, some background for people who haven't heard of RAG, which is probably a small minority, on why we even need a technique like RAG. So LLMs are limited by the data on which they're trained, and this effectively means that if you query them about a private data set or some domain-specific data set, they're not very good at responding and they're going to hallucinate or just give up, which is not ideal.

Anu Reddy [00:01:03]: So enter RAG, a cheaper alternative to fine-tuning LLMs. It's a technique to optimize the output of the LLM by injecting relevant information into the prompt context. And how is this done? So, in this diagram we see the user prompts the LLM. There's an external knowledge base with documents from your private or domain-specific data that is consulted for context related to the user prompt. The most relevant documents are then included in the user prompt, and now we can query the LLM with this augmented prompt for a more meaningful response, hopefully. So I'm going to briefly go over some architectural principles for RAG that make it work well with the Kubernetes infra, just to highlight why we chose to build this on GKE. So first you want to optimize for experimentation and being able to iterate quickly as a developer with loosely coupled components, so containerized services that are portable and immutable, and being able to expose and integrate services through well-defined contracts enables this. Ideally, you want to leverage open source frameworks like LangChain, Ray, Jupyter, open models, and open infrastructure like K8s to make the application more extensible and flexible.
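The retrieve-then-augment flow described here can be sketched in a few lines of Python. This is an illustrative toy, not the talk's actual implementation: the in-memory `docs` list and word-overlap scoring stand in for a real embedding model and vector database.

```python
def retrieve(query, docs, k=2):
    """Toy retrieval: rank documents by word overlap with the query.
    A real RAG stack would use vector embeddings and a vector DB."""
    query_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment_prompt(query, docs):
    """Inject the most relevant documents into the prompt context."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Squid Game is a Korean series released in 2021.",
    "Die Hard is a 1988 action film.",
    "Kubernetes orchestrates containerized workloads.",
]
prompt = augment_prompt("When was Squid Game made?", docs)
# The augmented prompt now carries the relevant document as context,
# ready to send to the LLM.
```

The LLM call itself is unchanged; only the prompt it receives is enriched, which is why this is cheaper than fine-tuning.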

Anu Reddy [00:02:24]: And you'd want to have mixed retrieval systems or multiple data sources, ideally. Finally, you always want to design for security and safety at each step, so you want to ensure all your endpoints are secure. You want to ensure your responses are moderated, minimize hallucinations, and also secure your data itself. So this slide sort of highlights some of the key value adds of building this app on top of GKE, and it touches on some of the principles we just discussed. So there's good native support in GKE for OSS frameworks like JAX, Jupyter, and Ray. You can do distributed training through Kueue and other mechanisms. There's seamless scalability via Autopilot and optimized service startup times. You can get flexible compute infra with native GPU and TPU support and various storage solution integrations like Cloud SQL and GCS.

Anu Reddy [00:03:20]: On top of that, there's integrations with GCP products like Identity-Aware Proxy for securing your endpoints, and integration with APIs like SDP or NLP for text moderation. And we'll talk about some of these in the next few slides. So I just wanted to give an overview of what the RAG stack on GKE looks like. At the center we have GKE, which handles serving the inference requests via a front-end application. We also have an embeddings generation service that runs on GKE, which generates embeddings for your private datasets to store in a vector database. And we're using a Jupyter notebook and Ray to actually do this vector embedding generation. We're also using Cloud SQL here as our vector database and Cloud Storage as the data source. So this front-end user interface is really where everything comes together.

Anu Reddy [00:04:16]: And to support production use cases, we need to consider how we secure access to all exposed services like the front end, and how to secure the data itself, whether that's from the user, the source data set, or from the LLM. And so this is where you can leverage capabilities of real-time data protection to identify and filter sensitive or otherwise harmful information. So one example is Sensitive Data Protection, which is a Google Cloud offering. You can discover, deep inspect, and de-identify sensitive data of over 150 different types, such as PII, credit card numbers, et cetera. In the context of RAG specifically, SDP is critical in your data preparation step to ensure you're not ingesting sensitive data that could be exposed to other users. And it can also act as the last line of defense to ensure that the sensitive data isn't exposed. But adding safety to your RAG application is more than just filtering out text.
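As a rough illustration of what an SDP de-identification template does, here is a local Python stand-in that redacts a couple of infotypes with regexes. The real Sensitive Data Protection service inspects over 150 infotypes server-side (via the `google-cloud-dlp` client); the patterns and infotype names below are deliberately simplified placeholders.

```python
import re

# Simplified stand-ins for SDP infotype detectors. The real service
# ships production-grade detectors for 150+ types (names, cards, ...).
INFOTYPE_PATTERNS = {
    "CREDIT_CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),  # bare years only, for the demo
}

def deidentify(text, infotypes):
    """Replace each match of the requested infotypes with a
    [REDACTED:*] token, mirroring SDP's replace-with-infotype
    transformation."""
    for name in infotypes:
        text = INFOTYPE_PATTERNS[name].sub(f"[REDACTED:{name}]", text)
    return text

response = "Squid Game was made in 2021."
print(deidentify(response, ["DATE"]))
# -> Squid Game was made in [REDACTED:DATE].
```

The same call can sit at three points in the pipeline, as the talk notes: on the user prompt, on the model response, or on documents at data-preparation time before embeddings are generated.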

Anu Reddy [00:05:19]: In many cases, you also need to understand and classify the sentiment behind the data or behind your prompts and make a decision as to whether we accept or reject it. And so here you can take advantage of a different API, the Google Cloud Natural Language API, to do some of this filtering. So I'm going to have a quick demo to see what this looks like. We're going to see how to use these SDP and NLP filters and how they react to responses from the LLM with different prompts, and I hope this video is big enough to play. Actually, how do I play this? Is that working? Yeah, I think so. Okay, great. So this is just our front end for our RAG application. It's using a very public data set about movies and TV shows, not a private one.

Anu Reddy [00:06:11]: Here we're asking it who worked with Robert De Niro on a movie, and I'm using the SDP protection template for names to filter out the name of the person, so it filters out their name. A better example is I ask it when was Squid Game made, and it says it was made in 2021. If I go and use the date filter template to filter out the date, it redacts the date so I no longer see it. So that's SDP. And then now if we ask it a more drastic question, like what movie will show blowing up a building, without any filters enabled, it should give me a response. It says, appears you're asking about a movie that depicts the act of blowing up a building. Says it's Die Hard, directed by XYZ. So now I'm going to use the Cloud NLP filter to moderate the text.

Anu Reddy [00:07:03]: So I'm going to give it a very restrictive value of 40 and ask it the same question again. And it should do some sentiment analysis behind both the prompt and the response to say it's deemed inappropriate for display based on blowing up a building. And then if I allow it to be a higher threshold, less restrictive, and ask it the same question, it should still give me the Die Hard response back in a second. Yeah, so this is just a quick example. Oh, I lost the. Sorry. Yeah, just a quick example of what I was touching on with the real-time data protection.

Anu Reddy [00:07:47]: And just to recap, the SDP filters allow you to discover, deep inspect, and de-identify any sensitive data, so you can redact things like credit card info, PII, people's names, date of birth, et cetera. For RAG applications, you can apply these filters either at the prompt input or model response stages, which is what I showed, or you can actually even apply it at the data preparation stage and build confidence of not ingesting any PII into your vector DB when you generate the embeddings. And the Cloud Natural Language API allows you to classify and score potentially harmful content. So it sort of provides a safety attribute like Toxic for disrespectful or rude content. And for RAG applications, you can make decisions to entirely block or show prompts or responses based on a confidence score. So that's SDP and NLP. I'm going to switch gears a little bit. We touched on another topic previously, which was how to secure access to your exposed endpoints.
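The block-or-show decision in the demo reduces to a threshold comparison over the moderation scores. A minimal sketch, assuming the moderation API has already returned per-category confidences; the category names and scores below are made up for illustration (the real Cloud Natural Language moderation response carries categories such as Toxic with confidence values).

```python
def should_block(moderation_scores, threshold):
    """Block the text if any safety attribute's confidence (0-100 scale,
    to match the demo's slider) meets or exceeds the threshold. A lower
    threshold is more restrictive, as with the demo's value of 40."""
    flagged = {cat: s for cat, s in moderation_scores.items() if s >= threshold}
    return (len(flagged) > 0), flagged

# Hypothetical scores for the "blowing up a building" response.
scores = {"Toxic": 12, "Violent": 55, "Insult": 3}

blocked, why = should_block(scores, threshold=40)        # restrictive
# blocked is True: "Violent" (55) >= 40, so the response is suppressed
blocked_lenient, _ = should_block(scores, threshold=80)  # permissive
# blocked_lenient is False: no category reaches 80, the answer is shown
```

This mirrors the demo: at threshold 40 the Die Hard answer is deemed inappropriate for display; at a higher threshold the same prompt gets its answer back.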

Anu Reddy [00:08:52]: Here we have the front end endpoint that's completely exposed to the public Internet, and you probably don't want that. You want to secure your RAG systems from unauthorized access because you don't want just anybody querying your LLM, which has access to your private data. To do this, you need some sort of ingress control, and Identity-Aware Proxy, or IAP, is one of the ways to do this. IAP controls access to your applications and resources via standard Google auth, and GKE integrates pretty well with it, so it gives you centralized control that can be applied across your projects. You can do user- or group-based authentication, and it has some sophisticated features like cert requirements, et cetera. For RAG applications, IAP can control which users can access your front end endpoint, or if you have any Jupyter notebooks or Ray dashboard endpoints that you're using to do your embeddings generation, those can be gated. Users or principals can be authorized to access a service.
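On GKE, gating a Service behind IAP is typically done with a `BackendConfig` resource attached to the Service by annotation. A minimal sketch, assuming an OAuth client secret has already been stored in the cluster as `iap-oauth-secret`; all resource names, ports, and selectors here are placeholders, not the talk's actual manifests.

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: frontend-iap
spec:
  iap:
    enabled: true
    oauthclientCredentials:
      secretName: iap-oauth-secret
---
apiVersion: v1
kind: Service
metadata:
  name: rag-frontend
  annotations:
    # Attach the IAP-enabled BackendConfig to this Service's backend.
    cloud.google.com/backend-config: '{"default": "frontend-iap"}'
spec:
  type: ClusterIP
  selector:
    app: rag-frontend
  ports:
    - port: 80
      targetPort: 8080
```

With this in place, the load balancer backend for `rag-frontend` requires Google sign-in, and which users or groups get through is then governed by IAM policy on the IAP-secured resource.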

Anu Reddy [00:09:51]: It integrates well with Google Cloud Load Balancer so you can have a distributed global front end, and it prevents unwanted users from accessing your RAG application or any of your private data. Yep, that's all I have. Thank you for listening. And I have a QR code here, just some questions based on the talk and some other questions about RAG, so please fill out the short survey, it'll help us a lot. Thanks again.

Dan Baker [00:10:19]: Thanks so much. Anu.

