MLOps Community

MLOps for GenAI Applications

Posted Aug 27, 2024 | Views 181
# GenAI Applications
# RAG
# CI/CD Pipeline
speakers
Harcharan Kabbay
Lead Machine Learning Engineer @ World Wide Technology

Harcharan is an AI and machine learning expert with a robust background in Kubernetes, DevOps, and automation. He specializes in MLOps, facilitating the adoption of industry best practices and platform provisioning automation. With extensive experience in developing and optimizing ML and data engineering pipelines, Harcharan excels at integrating RAG-based applications into production environments. His expertise in building scalable, automated AI systems has empowered the organization to enhance decision-making and problem-solving capabilities through advanced machine-learning techniques.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

The discussion begins with a brief overview of the Retrieval-Augmented Generation (RAG) framework, highlighting its significance in enhancing AI capabilities by combining retrieval mechanisms with generative models.

The podcast further explores the integration of MLOps, focusing on best practices for embedding the RAG framework into a CI/CD pipeline. This includes ensuring robust monitoring, effective version control, and automated deployment processes that maintain the agility and efficiency of AI applications.

A significant portion of the conversation is dedicated to the importance of automation in platform provisioning, emphasizing tools like Terraform. The discussion extends to application design, covering essential elements such as key vaults, configurations, and strategies for seamless promotion across different environments (development, testing, and production). We'll also address how to enhance the security posture of applications through network firewalls, key rotation, and other measures.

Let's talk about the power of Kubernetes and related tools to aid a good application design.

The podcast highlights the principles of good application design, including proper observability and eliminating single points of failure. It also covers strategies to reduce development time by creating reusable GitHub repository templates per application type, as well as pull request templates, thereby minimizing human errors and streamlining the development process.

TRANSCRIPT

Harcharan Kabbay [00:00:00]: I'm Harcharan Kabbay, I go by Harry. I work with World Wide Technology as a lead ML engineer. I'm usually more of a tea fan, but whenever I take coffee, I usually go with black coffee, whatever regular brand is there. I don't have a particular affinity to anything.

Demetrios [00:00:19]: What is up, MLOps Community? We are back with another episode of the MLOps Community podcast. As usual, I am your host, Demetrios. We're gonna get right into it with Harry because this is an episode all about reliability. If you are trying to do anything in production, you know how important it is. He breaks down all the ways that things can go awry and how he goes about stopping them. I will give you a little hint. He talks a lot, a lot, a lot in this conversation about templatizing things and also processes. Let's get right into it, dude.

Demetrios [00:01:02]: I gotta start with this. What do you have against local LLMs?

Harcharan Kabbay [00:01:07]: It's not like, you know, a hard stance, it's more about the value out of that. Local LLMs are good for experimentation. But the thing is, people are more excited that, you know, they can run those things like on CI/CD, like you can spin it up like this and that. But I'm thinking about the bigger picture. That is, how do you get these things operationalized? How do you get value out of those? Right? And I have experience working on Kubernetes, like a more hardcore guy, part of the team who actually took it all the way up to writing our own operators. So I understand how these things work. And when I think from that angle, I feel like, okay, local testing is good up to a point, but even then, I don't want to make it like, you know, everyone is doing it on their own.

Harcharan Kabbay [00:01:56]: You have no control. And a little bit of control, or at least the manner in which you are accessing it, makes a difference. If we can deploy it like an API kind of thing right from the beginning, it makes more sense to me.

Demetrios [00:02:11]: So it's almost like you're afraid that people are creating bad habits getting used to these local LLMs. Almost like we did with Jupyter notebooks, where you do things and all of a sudden it becomes standard practice to do everything in your Jupyter notebook. And then you throw it over the fence and you say, okay, now figure this out, because I got something cool working on my local drive and now it's an ops problem.

Harcharan Kabbay [00:02:39]: Yeah, you said it more accurately.

Demetrios [00:02:42]: Well, one thing that I think you've got a lot of insight into is really operationalizing this from the CI/CD perspective. And so let's take the classic 2024 use case of RAG. I don't care if it's naive RAG, graph RAG, advanced RAG. I'm going to make a RAG rap song and put all of these different types of RAG into it. But you want to be able to make sure that it is battle tested when you are doing it. And CI/CD and testing, integration tests, all the different kinds of tests that you can have, is a great way of doing that. I know you've got some cool things to say about that, so I'd love to hear your perspective.

Harcharan Kabbay [00:03:30]: Sure. So already a lot of stuff has been talked about RAG. I'm not going to go into that. What it is, people already understand. But one thing that has changed with the coming of the LLMs is, like, you know, people start thinking in APIs, that there's some API model that you can hit and get the results back, right. So when it comes to RAG or operationalizing RAG, you have to think about it in a microservices way. You have your embeddings or vector stores stored somewhere, could be on premises or in the cloud. Then you have the LLM. Most of the companies, I think, are using the Lamborghinis out there, like GPT-4 or other things.

Harcharan Kabbay [00:04:15]: I mean there have been a lot of recent claims around that, but how you run those things on-prem is still questionable. But the thing is, most of the data scientists or the ML engineers, they used to call those roles, but now it's more like AI engineers or API assemblers, they write that orchestration engine. So basically now you are making calls: you get a query from a user, you do search, do retrieval, get some results back, send it to the LLM with your augmentation in the prompt, and you get a result back. You have different hops there. So before we jump onto CI/CD, I'd really put some emphasis on resilient architecture. Like all of these could be a single point of failure. So you have to think about how you would not fail on any of these as a single point, right? Like LLMs, how you can have a pool of LLMs and overcome the 429s or the rate limit errors. Same thing for the databases.

Harcharan Kabbay [00:05:21]: Like if an instance is unavailable, set those up in a cluster format, right? And then you start thinking of these as deployments. And then your CI/CD comes into the picture. Right. Now the orchestration engine that we talked about, which is responsible for that glue thing, like you're tying together different pieces. What I have seen and been working on is more like a Kubernetes setup. And in Kubernetes you have the different concepts like Deployments or StatefulSets. These are more like stateless apps, so a simple Deployment could work, but with multiple replicas in there.
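
The "pool of LLMs" idea for getting past 429s could look roughly like the sketch below. It is only an illustration, not the setup used at WWT: `RateLimitError`, `call_with_pool`, and the fake backends are hypothetical names, and a real orchestrator would wrap actual HTTP clients instead.

```python
import time
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from an LLM endpoint."""


def call_with_pool(
    prompt: str,
    backends: Sequence[Callable[[str], str]],
    max_rounds: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each backend in order; on a rate limit, move on to the next one.

    If every backend in the pool is throttled, back off exponentially and
    go around the pool again, up to max_rounds times in total.
    """
    for round_no in range(max_rounds):
        for backend in backends:
            try:
                return backend(prompt)
            except RateLimitError:
                continue  # this deployment is throttled, try the next one
        if round_no < max_rounds - 1:
            sleep(base_delay * (2 ** round_no))
    raise RuntimeError("all LLM backends are rate limited")


# Fake backends standing in for two deployments of the same model.
def throttled_backend(prompt: str) -> str:
    raise RateLimitError()


def healthy_backend(prompt: str) -> str:
    return f"answer to: {prompt}"


print(call_with_pool("hello", [throttled_backend, healthy_backend]))
```

The same shape applies to the other hops (search, database): no single endpoint should be able to take the whole request path down.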

Harcharan Kabbay [00:06:03]: But then how do you bind everything? Like people are developing that, or the ML engineers are doing their development work, how does that get to Kubernetes, with some CI/CD around that, right? Argo CD is one of the ways you can do that. There are other options as well, but I like it, it's super easy to consume. You can set it up like, you know, keep listening to this GitHub repo. Whenever you see updates, you can perform a sync onto the other side. So that also provides you the option to roll back if you want to. But if you take one step back, like when you do the core development itself, right? How can you make sure that you are following some kind of standards here? Like you containerize your code, right? So that way you keep track of all your libraries. You have the image version. Now, for that image version, or creating the image, you'll not be running Docker Compose.

Harcharan Kabbay [00:06:59]: If you are operationalizing these things, you'll have some automation, like set up some Jenkins kind of job that, you know, listens to your GitHub activities. And whenever you have a merge, it's going to run and create an image for you. And then you'll need some image store, like, you know, where you're keeping your artifacts. We don't keep those in GitHub, right? GitHub is not meant for those things. So there are bits and pieces of automation here and there. But the bigger picture is once you have the Argo CD setup. So the container is just the code. Now you have to also think about that.

Harcharan Kabbay [00:07:36]: You will not be running it just directly in production. You'll have those dev and test environments, or maybe stage or pre-prod, however the companies are using them. But then you'll not be creating separate images either, right? So with the image you'll have two pieces of configuration. One that is non-confidential, right? Like the URLs that you're hitting, or some parameters that you are tuning, right? But then you have the secrets as well, like API keys, and you'll not be keeping those in the code. For those kind of things, in Kubernetes you have the concept of Secrets. You can use Secrets, right? But then you can enhance it further, like you think about external secrets, stored in HashiCorp or, you know, Azure Key Vault. You can set up those operators that keep on syncing those things for you and make sure that you have the latest secret. That way you can implement more segregation of duties as well. Like, you know, I have the engineers who are just responsible for rotating the keys. They do that in Key Vault.

Harcharan Kabbay [00:08:40]: It automatically gets synced in there and applications are consuming those. The other configuration we talked about is non-confidential, like API endpoints, or I want to tweak the temperature a little bit here and there. I don't want to put it into the code because that way I'll be recreating my images, right. I don't want to do that. So those kind of things you can manage via ConfigMaps in Kubernetes. So that way, when you have these things, it's super easy to update. Most of the time when you are testing, you may not be touching your images or regenerating the images. You are just changing your ConfigMaps, which hold all your configurations, how you are tweaking it, right? For example, in your retriever you have been, say, considering the top ten results.

Harcharan Kabbay [00:09:28]: Now someone says, okay, let us try it with 15 results, right? If you have that saved in your ConfigMaps, all you need to do is just update there, sync it up, restart your pod and you're done.
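
On the application side, keeping tunables out of the image can be as simple as reading them from the environment at startup. The sketch below assumes the ConfigMap and the Secret are both exposed to the container as environment variables; the variable names (`RETRIEVER_TOP_K`, `LLM_TEMPERATURE`, `SEARCH_ENDPOINT`, `LLM_API_KEY`) and the defaults are made up for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RagSettings:
    search_endpoint: str
    top_k: int
    temperature: float
    llm_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> RagSettings:
    """Build settings from environment variables.

    Non-confidential values would come from a ConfigMap and the API key
    from a Secret; the code does not care which, it only reads variables.
    """
    return RagSettings(
        search_endpoint=env.get("SEARCH_ENDPOINT", "http://localhost:8080"),
        top_k=int(env.get("RETRIEVER_TOP_K", "10")),
        temperature=float(env.get("LLM_TEMPERATURE", "0.0")),
        llm_api_key=env["LLM_API_KEY"],  # no default: fail fast if the Secret is missing
    )


# Simulate the environment after someone changed top-k from 10 to 15.
settings = load_settings({"LLM_API_KEY": "dummy-key", "RETRIEVER_TOP_K": "15"})
print(settings.top_k, settings.temperature)
```

Because the values are read once at startup, changing the ConfigMap only takes effect after the pod restarts, which matches the "update, sync, restart" flow described above.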

Demetrios [00:09:39]: Hopefully people are going from ten to five, not ten to 15, because that's going to just kill your output. But I guess every use case is different. Now I think about Kubernetes being such a beast and you're able. So it's on one hand it's a beast, but you want that, you want that capability and you want that power. And so from my side I'm thinking about, okay, how have you been seeing differences from the LLM side and just grabbing these commoditized models and having a different rag architecture versus I'm doing some kind of recommender system. I have the model that I created. Am I baking it into the docker image? Because you don't need to have that config map necessarily or it's not as important. So it's almost like there's these differences, but at the end of the day it's not so different.

Harcharan Kabbay [00:10:54]: Yeah, I mean even if you go to the primitive ML models, right, once you start thinking like, you know, I want to serve these as APIs, not just, you know, batch inference that is running like a cron job. I tried something with KServe, or KFServing, it used to be called that, right? It's more like a serverless kind of thing, and it's super cool. It has a pre-processing step, then your main thing, and then you can do post-processing. And similar to KServe, there's Ray Serve out there that I'm gonna play around with a little bit. But yeah, with those kind of models, you said it accurately there, you don't need to bake it into a Docker image. You can create those apps using some inference API setup, and that can also reduce some work because you don't want to create a DNS entry for each and every app you deploy.

Harcharan Kabbay [00:11:52]: So I think the knowledge of Kubernetes, like you said, it's a beast. There are a lot of possibilities, so knowing what you need makes a big difference. You don't want to apply the same rule of thumb everywhere, right? You'll be creating a lot of work for yourself.

Demetrios [00:12:11]: Yeah, that's such a great point. And I find it fascinating how people with different backgrounds coming into the ML space reach for the tools that they know, and they are trying to figure out how to make this tool that they're very familiar with work with this new paradigm that they're now getting themselves into. So if you talk to a data engineer, it's almost like, yeah, well, Airflow is what I go with, and it's not necessarily the best. It's kind of a pain in the ass sometimes. But then maybe if they're progressive, they'll use some more new-age type of data engineering tool, like a Mage or a Prefect or something like that. But then if you talk to the DevOps folks, it's like, yeah, Kubernetes all the way. We're using Argo Workflows or Argo CD also. Or maybe they're saying we'll use a bit of Kubeflow, if they somehow got conned into having to use Kubeflow, which, pour out a little bit of drink for everyone out there that is using Kubeflow and having to suffer through it.

Demetrios [00:13:31]: But that is, it's always interesting to me on how you approach these different sets of problems by knowing what your background is and having the comfort of saying, I'm really skilled in kubernetes, so I'm going to make Argo CD work with it. And you also understand inherently what is important. And for you, it's probably really easy to just reach for some kind of infrastructure as code, I imagine you're going for terraform right off the bat. And whereas if you tell a data scientist that it's almost like infrastructure is what. Where are we going with this?

Harcharan Kabbay [00:14:12]: You're right, you're right. I think everyone has a piece to play and it's teamwork. It's not one man's responsibility to operationalize. We need that testing to happen, but you should not be so carried away by running it locally that you forget about the bigger picture. Right. You should be keeping those things in mind, right? Yeah, yeah. We'll touch upon Terraform.

Demetrios [00:14:38]: So the other cool part that you were saying and specifically on that, to your point, you having the DevOps background and chops, of course you're going to be looking at something from the bigger picture and saying this is great, but how do we get it into production? How do we make sure that when it's in production it is checking all of the boxes? Because it's not just like we can have it in production and you can cover the output side of things. You also have to have that security and you have to make sure that you have no vulnerabilities and you have to, there's so many different pieces that you're seeing from your vantage point that potentially others who don't have this type of background aren't looking at or aren't even thinking about.

Harcharan Kabbay [00:15:33]: You're right. I think that's related to what I was thinking about at that point. Right. So you mentioned Kubeflow. I helped to operationalize Kubeflow and, you know, coming from that Kubernetes background, it helped me a lot. But then we have to think about the security and other aspects, right. For example, tiny little things here and there.

Harcharan Kabbay [00:16:00]: You don't want to share the database passwords or other things with everyone. Right? Are there ways that you can make it available if someone needs that, right. But then you can implement some of those things right from the beginning. Like in Kubeflow, when you create the notebook, you can create configs, like PodDefaults. I wrote something about that a couple of weeks back. So you can set those up. Basically, by matching the label of your pod, it can make certain things available to it, right.

Harcharan Kabbay [00:16:36]: For example, you say credentials, you define your own tag. It'll make those credentials available because you defined it that way in the PodDefaults. That's the power of PodDefaults. To get it done in plain Kubernetes you have to be thinking about a mutating webhook, you'll be setting up your own server. But the thing is you may not need to over-engineer if there are some tools available. I'd rather save my time and use those tools rather than, you know, showing off my knowledge there. Because it also creates a lot of LCM for me if something gets updated in future.

Harcharan Kabbay [00:17:11]: Right? It's unpleasant. I don't want to take on all that work. Right. Unnecessary work. I'd rather be spending my time on something newer or cooler.

Demetrios [00:17:20]: Yeah, yeah, totally. You're. You're not trying to give yourself a job, exactly. Or give yourself more work, really, the whole idea is, hey, let's automate as much as possible so I don't have as much work where I can have less work.

Harcharan Kabbay [00:17:37]: So you mentioned about terraform earlier. If you want, we can dive a little bit into that one.

Demetrios [00:17:43]: Would love to.

Harcharan Kabbay [00:17:45]: Perfect. That's the topic that is very close to my heart.

Demetrios [00:17:49]: Even after all the shit went down with Hashicorp and the. For those who don't know, can you school us a little bit? I think what happened was Hashicorp, who is the lead maintainer or basically the company behind Terraform, they changed the open source license for terraform and people who were using it after a certain version had to start paying a license or an enterprise fee or something along those lines. And it just made the community that was built around terraform revolt. And people were pissed. So pissed that they did a hard fork. Anyways, I digress. Talk to me about terraform.

Harcharan Kabbay [00:18:32]: Yeah, I mean Terraform or, you know, Ansible or any other thing. The point is that no one enjoys creating these things manually. Like, you know, when something got released on Azure, like Azure OpenAI models, right, everyone was super excited and, you know, jumped onto that and started doing that. One of my friends and mentors from my current organization told me, like, you know, five years back, when I was learning how to use Ansible, to avoid logging into servers if you don't need to, if you have the configuration in there, right. So when you start thinking about those things, those are very helpful. When you are operationalizing, you'll think about, okay, if I need to go there and create a storage account, like, you know, now a new member is onboarded, I assign that work to him.

Harcharan Kabbay [00:19:30]: I'll be passing on several things. One, this is the naming convention. This is how you should create a resource group. You know, these are the tags you need to put on. And these are the network firewall rules you need to enable, right. Or any additional configuration, like you can think about maybe ten to twelve points that you need to be specific about, and that is an example of one resource. Now think about OpenAI or any search or, you know, database instance there. So when all this configuration is multiplied it gets out of control.

Harcharan Kabbay [00:20:05]: So that is where Terraform is super handy. I feel like for anything you are provisioning in the cloud it's super useful. You just set it up in a correct way and you can have good version control. That way everyone is responsible. Like I published a post some time back about how I created a modular approach, having each of those resources in its own module and then creating an inventory for each of those resources. For example, for resource groups, create a CSV. So it is going to loop through that and read it, and you can have the parameters however you want to apply them, like the set of configs that you want to apply. Now if I need to create those resources, like I want to go from ten to 20 resource groups now, right? I was on ten.

Harcharan Kabbay [00:20:56]: So I can ask anyone to just add those to that CSV. Create a PR, someone reviews it. They are not changing the configuration, they are just adding to the inventory file. The configuration is going to loop through that and create those for me. Everything is defined in the set of rules that I want to apply on top of those. Super easy. And the other thing I learned over the years is, like, when you go through the GitHub route, you create a PR, you need to educate a little bit.
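
The inventory pattern described here is implemented in Terraform in Harry's setup. To keep the examples on this page in one language, the sketch below shows the same idea in Python instead: the rules live in one place, and contributors only add rows to a CSV. The naming pattern, default tags, and column names are assumptions made up for this illustration.

```python
import csv
import io
import re
from typing import Dict, List

# Hypothetical shared rules applied to every row of the inventory.
NAME_PATTERN = re.compile(r"^rg-[a-z0-9]+-(dev|test|prod)$")
DEFAULT_TAGS = {"managed_by": "iac"}


def plan_resource_groups(inventory_csv: str) -> List[Dict]:
    """Turn inventory rows into resource group definitions, validating names."""
    planned = []
    for row in csv.DictReader(io.StringIO(inventory_csv)):
        name = row["name"].strip()
        if not NAME_PATTERN.match(name):
            raise ValueError(f"name violates convention: {name!r}")
        planned.append(
            {
                "name": name,
                "location": row["location"].strip(),
                "tags": {**DEFAULT_TAGS, "owner": row["owner"].strip()},
            }
        )
    return planned


# A PR that adds a resource group only touches this inventory.
INVENTORY = """name,location,owner
rg-search-dev,eastus,team-a
rg-search-prod,eastus,team-a
"""

for rg in plan_resource_groups(INVENTORY):
    print(rg["name"], rg["tags"])
```

Reviewers then only need to check the new rows, since the configuration logic itself did not change.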

Harcharan Kabbay [00:21:24]: Like, you know, people should not be approving it just for the sake of approving. I get it, and I do that, but little things make a big difference. If you put in something like a template for a PR, right, it gives you any number of options. Then you're asking people, okay, just review that, you know, make sure I haven't added anything confidential, that I did my due diligence to test certain things. I think everyone feels responsible, and they make sure, you know, that the code we are pushing is reviewed. It is definitely better than what you'd get if you assigned it to one person, right? And you're doing a QA on top of that.

Demetrios [00:22:05]: Wait, say that again. So basically just giving almost like a checkbox, checkboxes that people can look at and say did it go through all these things?

Harcharan Kabbay [00:22:15]: And it's more like a reminders list. Like, you know, you can create your own template. I'm a big fan of templatizing. Yeah, I actually listed that as something that we should talk a little bit about. You can templatize different things, like in operationalizing or helping the teams out there.

Demetrios [00:22:33]: I tried to templatize my dating with my wife that didn't go over that well. I think that's taken it a little too far, man.

Harcharan Kabbay [00:22:43]: You have to separate things.

Demetrios [00:22:46]: So anyways, yeah, so you templatize the PR in a way that you're getting, like, basically you're getting everyone to look at it, not just one person to own it.

Harcharan Kabbay [00:23:05]: That's the value of PRs. Right. If you and I are collaborating on a project, you developed something, you said, okay, you know, go ahead and make these changes. I do that. I create a PR, but I don't want to create too much work for you. Like, there are certain things that I should be double checking on my behalf. Right. So that is what the PR template is for, as far as I'm concerned.

Harcharan Kabbay [00:23:27]: So when I create a PR, it's going to prompt me with all those things. Make sure your code is following, you know, the standard camel case or other things, naming convention, whatever you have. Make sure you don't add any confidential information in there. Right. I'm going to do that. Or maybe, you know, ensure that you tested these things locally. Then I submit that PR and then you're going to review. When you review, it's not just, you know, checking that.

Harcharan Kabbay [00:23:52]: Okay, I got a PR, I just clicked that. You're going to verify the things that you are supposed to verify. So that way it's not just that I dump the code, but it's a shared responsibility when we push those changes to the next level.
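
One way to make such a reminders list enforceable is a small check, run in CI, that reads the PR description and reports any boxes left unticked. This is a sketch under assumptions: the template text and the `unchecked_items` helper are invented for illustration, and fetching the PR body from GitHub is left out.

```python
import re
from typing import List

# Hypothetical PR template; the items mirror the reminders mentioned above.
PR_TEMPLATE = """\
## Author checklist
- [ ] Code follows the agreed naming convention
- [ ] No confidential information (keys, passwords) is included
- [ ] Changes were tested locally
"""

# Matches markdown task-list items such as "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^[ \t]*[-*] \[( |x|X)\] (.+)$", re.MULTILINE)


def unchecked_items(pr_body: str) -> List[str]:
    """Return the text of every checklist item that is still unticked."""
    return [text.strip() for mark, text in CHECKBOX.findall(pr_body) if mark == " "]


# The author ticked the first item only.
body = PR_TEMPLATE.replace("- [ ] Code", "- [x] Code")
for item in unchecked_items(body):
    print("still open:", item)
```

A check like this only confirms that the boxes were ticked. It does not replace the reviewer actually verifying what they are supposed to verify.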

Demetrios [00:24:06]: Yeah, I like the idea of the shared responsibility and getting everyone on board, but my mind instantly goes to, what do you do in a retro or what do you do when shit hits the fan? Everybody probably does the Spider man meme and just points at somebody else.

Harcharan Kabbay [00:24:27]: It depends from organization to organization. The one that I'm currently part of is really super cool. Very high trust score.

Demetrios [00:24:36]: Yeah.

Harcharan Kabbay [00:24:37]: And, you know, there are honest mistakes that happen. So we should not be, like, you know, going to the blame game. It's more like, okay, what are the lessons learned out of that? First thing, it's not gonna happen overnight. Right. You have to shift your design patterns from people-oriented design to more process-oriented, so that you have things properly documented. Wherever you need to automate, you automate. But it should not be like that.

Harcharan Kabbay [00:25:08]: I automated. Now no one knows because I don't have a README. Right. It's all in my mind. So there's a very thin gap between a subject matter expert and a blocker because, you know, once you start working as a subject matter expert, you cross a line where you're blocking everyone else, because you created a dependency. So I'm more towards advocating: open it up, you know, get the things out there properly documented, have knowledge sharing sessions, involve others, mentor others. One, you should be learning, there should be people in your team or your company that you should be learning from. Then there should be people that you are mentoring, you should be spreading that knowledge out, and there should be someone who should be double checking your facts.

Harcharan Kabbay [00:26:02]: Right? Like, I have something on LLMs done. I can ask you, okay, what do you think about that? And you say, okay, that doesn't make sense at all, dude. Like, we need those things. If it is just all me, there is no learning.

Demetrios [00:26:19]: So this idea of making sure that you don't become the subject matter expert you're talking to principally in the code base, it's not necessarily how we might think about it in the ML world of, oh, you have the subject matter expert, that is one of the stakeholders in this use case, and you want to get them to label some data or do something. Tell you about the use case. It's more you saying, if I am the only one that knows how this application runs, I am the subject matter expert on this application and I, all of the questions are going to have to come through me and that's going to no doubt slow things down.

Harcharan Kabbay [00:27:02]: Yeah, yeah, you are right. So it's not, see, you need subject matter experts, but it's more about how you put those things into operations. You should not be blocking others so that people don't have any clue how this thing is running. Right. I mean, the way you implement the thing, how you write code with proper comments or READMEs, that is very necessary. So you need to set up those things to be more process oriented, not people oriented. That was the point I was trying to make.

Demetrios [00:27:32]: Yeah, well, it's funny you say that because I've been on this kick recently on how DevOps specifically, right, talks about people, process and technology. And I make the point that a lot of people forget that in that group of three, technology comes last. But people tend to put technology first, especially when it comes to MLOps or the LLM craze that's happening. It's very much about, here's the tools that you can use to solve your problems, not necessarily about, here are the types of people you need and what your team should look like, what kind of experience they should have. Here's the processes that you should be thinking about. And then last but not least, you can go and dip into the technology section. But you're saying, yeah, people shouldn't be first.

Demetrios [00:28:33]: There should be processes first.

Harcharan Kabbay [00:28:36]: I think people have a role. I'm not saying that. For anything with process, you need people. Right. But what I'm saying is you should not be stuck with, okay, there is someone who knows about it, but we don't know how that thing works. There's no knowledge around that. Right. So why the designs are poor, in my opinion, is that people keep most of the things in their mind and, you know, they implement things or they get the things working. But you have to think about it, you know, you don't want to be stuck with that.

Harcharan Kabbay [00:29:07]: Like I'll be supporting all the tickets around that thing. I have to open it up so that I can get onto the newer things and let someone, you know, learn and help with the day-to-day tasks or the support generated by those things.

Demetrios [00:29:22]: All right, real quick, I want to tell you about our virtual conference that's coming up on September 12. This time we are going against the grain and we are doing it all about data engineering for ML and AI. You're not going to hear RAG talks, but you are going to hear very valuable talks. We've got some incredible guests and speakers lined up. You know how we do it for these virtual conferences. It's going to be a blast. Check it out. Right now you can go to home.mlops.community and register.

Demetrios [00:29:56]: Let's get back into the show. There's a fascinating piece to me changing gears now when it comes to the idea of monitoring and how you would approach monitoring being more of a DevOps person. But then in this LLM ML world, you need to be monitoring so many different things than just the system. Right? And the logs and traces are great, but what other things are you monitoring? And especially with ML, you're going to be monitoring the predictions and how they're doing and the business metrics that they're going against. But you are potentially going to be monitoring the data and the data flows. And so having somebody with a DevOps background, I imagine when you're starting to think about data monitoring, you're trying to fit it into this box of. All right, I'm coming from DevOps. I'm looking at logs and traces now.

Demetrios [00:30:57]: I should. Are there logs and traces for like the data monitoring? Talk to me about that and how you approach that problem or tackle those problems, too.

Harcharan Kabbay [00:31:06]: Sure. We have a really strong team of reliability engineering here at WWT. So when I, when I think about logging and monitoring, I would say the, you know, the, the next version or how we coin these, this term these days is observability. You don't want to, you know, digging into logs or doing monitoring, looking at the things. You should be aware of the accurate things coming out of that. Right. I'll talk a little bit like a before rather than thing, right. The permitted MN models like you run pipelines, may it be in any MlOps platform, right? But if you think about you are actually replacing the business processes, someone was you doing their work manually, they were maintaining things in excel.

Harcharan Kabbay [00:31:56]: And here you come with the knowledge of AI, ML, data science, and you said, okay, I understand your business problem. Let me create a model for you. Okay, sure, I replaced whatever they were doing, but at the time we replaced it, we took a sign-off on that: okay, I commit that my model will be delivering this much accuracy. Now, there was some inherited knowledge that you replaced with your model. Over the years, people are going to rely on your model more and more. If you do not have observability, like, you know, this is the limit I committed to, but you are not measuring yourself, you are marching towards a failure because you're not keeping an eye on the data drift. If the underlying data is changing, there's no way that you are monitoring it.

Harcharan Kabbay [00:32:45]: Someone is going to point it out at some point, and that will be unpleasant, because when these things are caught by a monitor or an auditor, you have no time to fix them. Yeah. So I played around a little bit in Kubeflow, like how you can emit those extra metrics. The thing is, if there's a packaged model, there are a lot of things already out there. But if you have an understanding of Prometheus exporters, or you're creating your own model, it could be a simple linear regression model. You have your training pipeline, you can think about, okay, let me emit these metrics. Whenever I am running the training, right, I want to emit these metrics. Like, for example, the runtime, this is my environment, the number of rows I processed, and these are the metrics coming out of that, right? And you can easily route these things to Grafana dashboards or a similar kind of UI, and set up alerts.
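A minimal sketch of the kind of training metrics described here (runtime, environment, rows processed, an accuracy-style number), rendered in the Prometheus text exposition format. It is hand-rolled with the standard library only; a real pipeline would more likely use a Prometheus client library or a Pushgateway. The metric names, the `environment` label, and the stand-in "training" step are all assumptions made for illustration.

```python
import time


def render_metrics(metrics: dict[str, float], labels: dict[str, str]) -> str:
    """Render metrics as lines of the Prometheus text exposition format."""
    label_str = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    lines = [f"{name}{{{label_str}}} {value}" for name, value in metrics.items()]
    return "\n".join(lines) + "\n"


def train_and_report(rows: list[tuple[float, float]], environment: str) -> str:
    """Hypothetical training step that reports runtime, row count, and error."""
    started = time.perf_counter()
    # Stand-in for real training: predict the mean of y for every row.
    mean_y = sum(y for _, y in rows) / len(rows)
    mae = sum(abs(y - mean_y) for _, y in rows) / len(rows)
    runtime = time.perf_counter() - started
    return render_metrics(
        {
            "training_runtime_seconds": runtime,
            "training_rows_processed": float(len(rows)),
            "training_mean_absolute_error": mae,
        },
        {"environment": environment},
    )


if __name__ == "__main__":
    print(train_and_report([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], "dev"))
```

Text in this shape can be scraped or pushed and then charted and alerted on in Grafana, which is the routing described above.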

Harcharan Kabbay [00:33:47]: On top of that, you can see things beforehand, when the accuracy actually dips down or is not what you promised. But at the same time, you already touched upon all the important parts, like you talked about the data pipelines, right? So if you have something based on the spread of the data already, right? So rather than relying on just the model, because the model is the last stage where you're running those things, but the data change has already happened, or it is happening over weeks, right? So if you keep an eye on that in a way, like, you know, visually getting things out of that, creating some cool visualization on top of that, or maybe some logic where you are saying, okay, it is going beyond the standard deviation, you know, the range that is permissible, I should be alerted, I should start working on that. So that way it is super important. Now, when it comes to LLMs, I would say it's a whole free range of metrics that you now need to think about. You cannot say, okay, I have this number, I'm good. Now you talk about RAG, right? You take a step back, you are doing basically three things here. You are getting the retrieved results back, and then you have the augmented input, right? That is, you take your prompt, you put your retrieved results in there, and you send it to the LLM for the final response. So if you say, I'll just get the final response evaluated, that is half the picture.
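The standard-deviation check on the data mentioned in this answer could look roughly like the sketch below. It is deliberately crude: it compares the mean of a recent batch of one feature against a baseline and alerts when the shift exceeds a permitted number of baseline standard deviations. The function name, the `max_sigmas` threshold, and the sample numbers are assumptions; real drift monitoring often uses distribution-level tests instead.

```python
from statistics import mean, stdev


def drift_alert(baseline: list[float], current: list[float], max_sigmas: float = 2.0) -> bool:
    """Return True when the current mean leaves the permitted band around the baseline mean."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)  # sample standard deviation of the baseline
    return abs(mean(current) - base_mean) > max_sigmas * base_std


if __name__ == "__main__":
    baseline = [10.0, 11.0, 9.0, 10.0, 10.0]
    print(drift_alert(baseline, [10.2, 9.9, 10.1]))   # small shift, inside the band
    print(drift_alert(baseline, [14.0, 15.0, 13.0]))  # large shift, outside the band
```

Running a check like this on the incoming data, before the model stage, is what lets the alert fire while the change is still happening rather than after accuracy has already dropped.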

Harcharan Kabbay [00:35:22]: Because if your retrieved results are not up to that quality, you will not get a good final response, right? So now that evaluation has to happen at that step as well. And then you have to think about, okay, how do I do that? Like, you provided me a list of questions, I run it, I get the results from my vector store, but how do I evaluate if it is actually accurate or getting the right things back? So you can think about, you know, different approaches. LLMs can be used to generate question-answer pairs as well. People have been trying that. You provide a document, you say, give me a list of questions and answers out of that. You can use those queries to then do a search and see if your document pieces show up there, right? And that is the piece on the retrieval side. Then you can see the order in which they are coming up. Do you need a good reranker over there? But you need to measure those things at separate points and see how they differ.
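One way to sketch the retrieval-side evaluation described here: take questions generated from known document pieces, run each through the search step, and check whether the source piece comes back and at which rank. The metric choice (hit rate at k and mean reciprocal rank) and every name below are assumptions for illustration; `search` stands for whatever function queries the vector store.

```python
from typing import Callable


def evaluate_retrieval(
    cases: list[tuple[str, str]],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Compute hit rate@k and mean reciprocal rank for (question, expected_chunk_id) pairs."""
    hits = 0
    reciprocal_ranks = []
    for question, expected_id in cases:
        results = search(question)[:k]
        if expected_id in results:
            hits += 1
            # Rank is 1-based, so the top result contributes 1.0.
            reciprocal_ranks.append(1.0 / (results.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate": hits / len(cases),
        "mrr": sum(reciprocal_ranks) / len(cases),
    }


if __name__ == "__main__":
    # Hypothetical canned search results standing in for a vector store.
    canned = {"q1": ["a", "x"], "q2": ["x", "b"], "q3": ["x", "y"]}
    print(evaluate_retrieval([("q1", "a"), ("q2", "b"), ("q3", "c")], lambda q: canned[q]))
```

A low hit rate points at the retriever or the chunking; a good hit rate with a low reciprocal rank is the case where a reranker is worth trying.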

Harcharan Kabbay [00:36:20]: So that goes back to our point where we were talking about CI/CD as well, how you deploy these, separating things into ConfigMaps and Secrets out there. If you don't have it set up properly, then you'll be doing a lot of work regenerating those images. So think about it, I might be playing around with these parameters. I need these API endpoints. Like, instead of getting just the final response back, my image or the application that I'm designing will have another endpoint that will just give me the search results. So that way it'll be easy for me to do the evaluation of those retrieved results as well as the final results if I want to do that. And the other piece that comes to my mind is the inference latency. So when you use LLMs or any APIs, right, you have to think about, is it batch processing or is it real time? So in batch you can think about a pub/sub model, you have something like a queue mechanism, you know, your applications are publishing to that and then, you know, slow and steady, you are picking from that.
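The design idea of a separate search endpoint, with parameters kept out of the image, can be sketched as two plain functions. Everything here is hypothetical: the `RAG_TOP_K` variable name, the tiny in-memory corpus, and the keyword-overlap scoring all stand in for a real vector store and web framework.

```python
import os
from typing import Callable

# Tunable read from the environment (for example, populated from a ConfigMap),
# so changing it does not require rebuilding the image. RAG_TOP_K is a made-up name.
TOP_K = int(os.environ.get("RAG_TOP_K", "3"))

# Hypothetical in-memory corpus standing in for a vector store.
DOCUMENTS = {
    "doc-drift": "data drift monitoring",
    "doc-rag": "retrieval augmented generation",
    "doc-limits": "kubernetes requests and limits",
}


def search(query: str, top_k: int = TOP_K) -> list[str]:
    """Search endpoint: return only the ids of the retrieved documents."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), doc_id) for doc_id, text in DOCUMENTS.items()]
    # Keep documents with at least one matching word, best score first.
    ranked = sorted((item for item in scored if item[0] > 0), key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


def answer(query: str, generate: Callable[[str], str]) -> dict:
    """Answer endpoint: retrieve, build the augmented prompt, then call the LLM."""
    retrieved = search(query)
    context = "\n".join(DOCUMENTS[doc_id] for doc_id in retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return {"retrieved": retrieved, "response": generate(prompt)}
```

Because `answer` is built on top of `search`, the retrieval step can be evaluated through its own endpoint without paying for an LLM call.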

Harcharan Kabbay [00:37:31]: You keep your compute footprint not huge, but you keep it up to a level where you keep on reading from that queue without creating too much of a backlog. But when it comes to real time, like most of these chatbots, whatever we are doing with the LLMs, these are all real time. You don't want your users to be waiting minutes to get a response back. It doesn't make any sense, right? So in those you should be capturing, or rather measuring, the inference time lag there. How much time does it take? And that is where the things we discussed come in, like, you know, resilient architecture. If your client is doing a retry, like I hit an LLM, it gave me a rate limit, what do I do next, right? How can I reduce that time? Like if I know my limits already, I can beef up the rate limits for the LLMs, but it is not always possible. You have to think about creating a pool. So there are different products available there. One of the things that I explored a little bit is Azure API Management.

Harcharan Kabbay [00:38:35]: I'm going to write something on that.
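A minimal sketch of the client-side half of this: retry on a rate-limit response with exponential backoff, and measure the total latency the user actually experienced. `RateLimitError` is a hypothetical exception standing in for whatever the LLM client raises; the attempt count and delays are assumptions.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical error raised by a client when the LLM API rate-limits a call."""


def call_with_retry(
    call: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, float]:
    """Call the LLM, backing off on rate limits; return (result, total seconds)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    started = time.perf_counter()
    for attempt in range(max_attempts):
        try:
            result = call()
            return result, time.perf_counter() - started
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

The returned latency includes the time spent waiting between retries, which is the number that matters for a real-time chatbot. Pooling several model deployments behind a gateway is a separate, server-side way to attack the same problem.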

Demetrios [00:38:37]: Very same proxy type thing.

Harcharan Kabbay [00:38:39]: Yeah. The thing is you can capture all those metrics out there, and without measuring I think you cannot improve, right? I mean you need to measure first and then you are in good shape to improve upon it. You can see how it makes a difference. You onboarded a new model, is it taking more time or less time, right? And then in observability, since you mentioned the DevOps background already, resource utilization is also very important because you don't want to throw in all the resources available. I remember one of my data scientists asked me, can you just provide me 24 CPUs and 48 GB of RAM to run this thing? I said, man, why do you need that? So they had a hard time understanding requests versus limits when it comes to Kubernetes, right? So they said, okay, I want everything as a request, like from the beginning.

Demetrios [00:39:39]: So they just wanted it from the start, even though they weren't going to use it all the time. They were like, this potentially could need this amount, so we might as well request it and make sure that we have that capability. Yeah, yeah. What'd you tell them? Did you school them? Were you like, maybe there's a little something called serverless?

Harcharan Kabbay [00:40:04]: Yeah, I told him, you know, the request is more like whatever bare minimum you need to get it started, and then it can go all the way up to the limits. And you know, this may be good enough for you to run, like up to maybe 8 GB of RAM is good. Right. We'll find out more as we do that. But the thing is, when you do the development, you're not going to do it with all the data available. Right. You'll be doing it with less data because it's development. Now we logically have the next step, that is your test environment or the UAT, where you'll be potentially running some kind of regression testing or load testing there.
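To make the request-versus-limit distinction concrete, here is a sketch of the `resources` section of a Kubernetes container spec, written as a Python dict and dumped to JSON. The field names `requests` and `limits` are the real Kubernetes ones; the specific numbers, the container name, and the image are made up for illustration.

```python
import json

# Hypothetical numbers for a development notebook; tune them from observed usage.
RESOURCES = {
    # Requests: what the scheduler reserves for the container when placing the pod.
    "requests": {"cpu": "1", "memory": "2Gi"},
    # Limits: the ceiling. CPU above it is throttled; memory above it gets the
    # container OOM-killed.
    "limits": {"cpu": "4", "memory": "8Gi"},
}


def container_spec(name: str, image: str, resources: dict) -> dict:
    """Build the container entry of a pod spec as a plain dict."""
    return {"name": name, "image": image, "resources": resources}


if __name__ == "__main__":
    spec = container_spec("notebook", "example.invalid/notebook:dev", RESOURCES)
    print(json.dumps(spec, indent=2))
```

Setting requests equal to limits, as the data scientist asked, would reserve the full amount on the node whether or not the notebook ever uses it.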

Harcharan Kabbay [00:40:40]: Right. So that is where, if you have the proper observability set up, you keep an eye. Okay, I see with this many users per minute I'm getting up to this level. Okay, what do I need to do? I see a couple of crashes, I need to bump up. If it is hitting, you know, OOMs, out-of-memory errors, I need to bump up my memory. If I see it's crashing due to CPU limits, I need to increase that. And I personally have found, like, you know, not every app is similar. Like if you are just going with the chatbot, you might need more CPUs, as compared to if you are doing some summarization, where you might need more memory.

Harcharan Kabbay [00:41:16]: So there's not one, like, you know, rule of thumb. I would say it's more like clear observability. If you have things in place, go through the set of processes, like I'm supposed to do UAT before I go to production, and I realize, okay, this has a little bit different need. I need to do that. So we talked about logging earlier, right? So you could be having a lot of logging solutions already for your standard out or standard error. Right. Like, you know, Datadog, Splunk, or anything else.

Harcharan Kabbay [00:41:53]: Right. But the problem with the IT side, from what I have observed so far, like I worked with six or seven different teams during my transition from a database administrator onto this side: we keep on dumping a lot of information, but do we actually look at that information? Never.

Demetrios [00:42:12]: That's the classic story. Yeah, we may need it later, so we might grab it.

Harcharan Kabbay [00:42:21]: I prefer more quality of logging. Like whenever you are designing, right. You have different modules, like in the modular approach, you start thinking about, okay, what do I need from this? Like when this module runs, right, what do I need as a mandatory thing or a nice-to-have thing? You can keep those things enabled or disabled. Again, going back to the configuration thing, you can have a flag there where I say, you know, different logging levels, like minimal or, you know, standard or extended logging, whatever. And then all of my things get enabled or disabled based on that. But when you create those metrics, think about the logging not just as dumping that data. Maybe create a nice shape that you can easily create dashboards from, like JSON format, super easy to read later on. But once you roll these things out, and after a couple of years someone says, okay, how do we get this information, man, you're writing it in here now.

Harcharan Kabbay [00:43:21]: It's a big thing. It's like a lot of LCM. It's generated. I have to go back and look at the code, how I wrote maybe some other things I do not recall at all. Right. It's more like these little things make a big difference if you bring it into your process. Like, you know, like these eight considerations when you write a new app. So I'll mention just one thing, then I'll wait for your inputs here.

Harcharan Kabbay [00:43:46]: I talked about templatizing, so that's something I experimented with a little bit. Like, you know, these applications have a broad design. We touched upon Knative Serving or KServe. You could be running with a Django server or you could be running with FastAPI, right? You can actually create those repository templates as well. Like create a skeleton with all those things. I usually do that. So I ask people, you know, if you have any suggestion, update that template. If they have to create a new repository, just create a clone. So that way they get the bare minimum things, or even the instructions in the README that make sure these things are covered.

Harcharan Kabbay [00:44:25]: That has helped me a little bit in making sure the five mandatory things are already there. And then I get inputs, and people feel like, okay, now it is easy for me. If I realize while coding that, okay, this thing is missing, we should make it available, they create a pull request against that template.

Demetrios [00:44:43]: Yeah. So then five years later you don't go back and say, how did I structure this? What exactly was going on here? What was I thinking? And you save yourself a lot of time, which makes a ton of sense. I mean, you said a lot there, man. There are so many threads that I want to pull on. And one, just going back to the beginning of what you were saying, on being sure that if you set certain SLAs or SLOs around models, you better damn well keep them. And the more products that are being built with that service or that model, the more dependencies you're getting on top of your service. And so it becomes more critical. Now, if you are not having the visibility into when the model isn't working, you're setting yourself up for a bit of pain and hurt.

Demetrios [00:45:50]: And so I just wanted to say that because it's like you gave your word, or you said this model is going to have this type of accuracy, or it's going to have whatever it is that you agree on, you have to make sure that it does that. And like you said, if you don't have metrics around it and you're not tracking it and observing it, then you don't actually know. And that's where problems can arise.

Harcharan Kabbay [00:46:18]: And the other thing you have to think about, you know, you lose your stakeholders' confidence if they report some issue and you're not aware of it. Like, things can happen. Models may not perform, data has changed, but you should at least be the first to be aware. And you know, it should be the other way around. Okay, I'm setting up this meeting. I realize it's changing a little bit. We need to work on that.

Demetrios [00:46:43]: Yeah, get out in front of it, because if they're coming to you and saying, why isn't this working? It may be okay if they do it once, but if it's two times or three times, then it's like, I'm not going to build with that service. It's not reliable. Yeah. And then what'd you do all that work for, to get that model out into production? Why did you even do that if people then later aren't going to want to build with that service? So that was one thing that I found fascinating. Just pulling on another thread that you were mentioning in your last answer, it was really fascinating when you were talking about this data scientist who wanted the max allocation right away. And your approach to it, if I understood correctly, was saying, let's try and give you the most basic allocation possible, and then when we start to progress it through the different stages into production, we're going to be increasingly trying to break it or trying to stress test it, and then we're going to see what comes up. And whatever comes up, we're going to deal with that when it comes up. Is that how you look at it on that road to production?

Harcharan Kabbay [00:48:05]: So the example that I gave you was more around a Jupyter notebook, like where people are running. And in Kubeflow, you can create your own notebook servers and you can also request your resources. So the point that I was trying to get there is if you're coming from a purely data science background, right? You don't want to run with lower resources, or you say you don't understand the difference between request and limits, right? So it was that kind of scenario that I need all the servers. So differentiation of story, responsibility, like when you are developing, right, you are not supposed to, or you're not required to get all the production data and run everything. Like, you know, in the models, you can take a sample size, right? There are different ways you can do that. So it's more like education that you don't need to do that here. In this environment, it is more for development. But then when it comes to request versus limit is more like limit is your hard limit.

Harcharan Kabbay [00:49:06]: Request is the minimum thing you can start with, right? You don't need to match them, because if you go with matching the request with the limit, you're reserving all your resources and then you don't even need those, right? But with the request and limit, then you can check it from your Grafana dashboards or whatever observability you have. How much are you reaching? Like, you know, are you actually touching the limits or not at all? Yes. In terms of moving it to production, when we are developing, we are not doing load testing at that point. It doesn't make much sense to me, honestly, to do a load test in a Jupyter notebook, right? I mean, you should be running an app, man. That's what the app is needed for, right? So it's more like education and segregating. Okay, we do it at this stage.

Harcharan Kabbay [00:49:52]: And at that, I would expect your notebook is shaped like an application at that point. It has all those, you know, eight or ten things that we discussed standardizing, like proper logging or, you know, API endpoints, proper naming convention there. Then give it all the data and the resources. And when we do that testing, we have a better understanding how much resources we need, where it is hitting the limit. But at that point as well, we don't need the blind testing. You should have some, something in my, in your mind, right, that I want to reach, like these number of user transactions per minute. I have to make it working for that. Now you can break any app if you want to, right? But what value do you get out of that? Like, if you say, okay, give me an app and I'll, you know, break it.

Harcharan Kabbay [00:50:38]: But what do you get out of that? Your business goal is to make it happen for this many users and I'll provide you the resources for that. And for like day to day monitoring, getting patterns, anomaly detection, then you have proper observability. You'll be probably reviewing that every week or every month and you say, okay, I see a spike in traffic. We need to revamp our resources, right? So everything goes a lot.

Demetrios [00:51:04]: And are you, when you're having that transition from whatever it was, whether it's LLM locally being run to now, it needs to be converted into an app or a Jupyter notebook that now needs to be more in the shape of an app. Have you found success trying to put that process on to the data scientists or what does it look like? What does the translation process look like?

Harcharan Kabbay [00:51:35]: Yeah, so in the beginning it was a little bit of a struggle, but I touched upon something like templatizing. Sometimes people try to push back because there's too much unknown. If you provide them a scaffolding, okay, this is a skeleton, these are the files, then slowly and surely, with a combination of knowledge-sharing sessions, people start feeling comfortable. And then pairing, like, you know, you hook up a person with someone who already knows that, and you slowly spread that knowledge out. So it's not purely a technical solution, it's more about timing and then the way we work. When you start on a greenfield project, it is easy because you are implementing something new, but if it is something already implemented and you want them to make changes, you may have more challenges there.

Demetrios [00:52:29]: Yeah, it comes back to that people, process and technology idea. It's straight people and processes. No amount of technology is really going to help you on this. Now, going back to the question I was going to ask you before, I remembered what I wanted to talk about: it was really around the different areas where things can fail. And specifically we were talking about this RAG development and you were saying, do you have replicas of your vector databases? Do you have the LLM proxies? Do you have all the, like, make sure your secrets aren't on GitHub? What other ways have you seen? Or what other vectors are you thinking about to make sure that your architecture is as robust and reliable as possible?

Harcharan Kabbay [00:53:25]: That's a really great question. And especially with the LLMs and more and more cloud vendors out there providing all these models as APIs. Right, like now, you're not limiting your traffic to within your own data centers, you're making a lot of hops here and there, right? So again, you know, it's my personal opinion based on the knowledge that I have with different things: I feel like security is one of the common things which gets overlooked. If you're creating new resources and your process is more, like, people oriented, you still rely on manual work. There are chances there could be human error, there could be some misses here and there, right. And that can lead to some big problem. But when it comes to the things themselves, like what you have, for example, there's a newer model that comes out and you want to start using that, right. I think you need to virtually or logically segregate the tasks.

Harcharan Kabbay [00:54:29]: Like experimentation is one thing, but when you take it out to, like, a next step, like dev, test, prod or wherever you're moving, you need to have proper evaluation. If you're not doing that, the risk of failure is high because you don't do the due diligence there, because it's a big change in itself and all the pieces in this workflow are important. I know we took the example of RAG. We have the retrieval engine. Suppose, you know, tomorrow you want to switch to a new vector database. How do you test the quality of the search results that you are getting? One thing good and bad about the LLMs or anything that is coming across is people feel like, okay, all I need to do is pip install, import this and start using it, right. But behind the scenes, these are some API endpoints, or maybe not so huge, but still another thing in your landscape, like a module that is providing some functionality.

Harcharan Kabbay [00:55:34]: And if you're not doing a QA for each of those components and you start looking at just the final response, I think the risk of failure is very high there. But even if you go back to the classic ML example we took, like you're not keeping an eye on your data, you're so relaxed now, okay, my models are running fine, but, you know, there's a data drift you're not aware of. Without proper observability, it'll take a lot of time for you, because what happens over time is people start focusing more on the final outcome. I think it makes sense for the business, but if you are the one who is implementing or who is researching, right, you should be aware of each and every step and know how to do QA of each of those hops. Otherwise, you know, you're bound to get into an unpleasant situation where you say, okay, now I don't know, man, I have to go back to square one and start analyzing each of those things. And you should have an incident response plan. Like if something fails, right? How are you going to respond to that? Generative AI is relatively new. I know a lot of people have been experimenting, but if you roll out this app, right, you may be seeing a different kind of situation as compared to a regular app or a regular model. Right.

Harcharan Kabbay [00:57:00]: Users don't see the responses, or your service is unavailable. Your LLM starts behaving, like, you know, differently. Like it starts giving out information that it is not supposed to, or some error messages.

Demetrios [00:57:12]: Yeah. People start asking questions that you never thought they were going to ask. There's so many different ways that it can not do what it is.

Harcharan Kabbay [00:57:20]: That's a great point actually. One thing about that, right, like you use LLMs for asking those questions. You may consider having some kind of topic modeling on those questions as well. Like you know what people are talking about so that you are aware 100%.

Demetrios [00:57:35]: That's one of the most important pieces.

Harcharan Kabbay [00:57:38]: When it comes to just one more thing. Demetrios, before it skips my mind, that is about, you know, we talked about the libraries like people are pulling and using it, but if you see the vulnerability around that, right. You may already remember something that happened around some zero day.

Demetrios [00:58:02]: I know what you're talking about, but I can't remember exactly what you're talking about. But yeah, there was a vulnerability.

Harcharan Kabbay [00:58:08]: Yeah. So I mean, if you don't have proper vulnerability scanning for these libraries, right, you don't know what you're implementing. Right. No one remembers over time. So I think right from the beginning you need to have all the bells and whistles when it comes to getting things to production. For experimentation, it might be okay, but I've seen situations where people start playing around and they later realize, okay, this was a license thing, you're not supposed to use it commercially, right? Now it's an ugly situation because the developer was not aware and he created something whose use by the organization is questionable. So those kinds of things someone has to watch, if a process is there.

Harcharan Kabbay [00:58:46]: I know I keep talking about process. I mentioned it like 20 times, but I can't stress it enough.
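As one small, automatable piece of that process, the sketch below lists the declared license of every installed Python package via the standard library's `importlib.metadata` and flags anything outside an allowlist for human review. The allowlist is a made-up policy, the `License` metadata field is free text and often missing or inconsistent, and this is not a vulnerability scanner; dedicated scanning tools are still needed for that.

```python
from importlib import metadata

# Hypothetical policy: license strings that need no further review.
ALLOWED = {"MIT", "BSD-3-Clause", "Apache-2.0"}


def installed_licenses() -> dict[str, str]:
    """Map each installed distribution name to its declared license string."""
    licenses = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            licenses[name] = dist.metadata.get("License") or "UNKNOWN"
    return licenses


def flag_for_review(licenses: dict[str, str], allowed: set[str]) -> list[str]:
    """Return package names whose declared license is not on the allowlist."""
    return sorted(name for name, lic in licenses.items() if lic not in allowed)


if __name__ == "__main__":
    for package in flag_for_review(installed_licenses(), ALLOWED):
        print("needs license review:", package)
```

Run in CI against the repository template, a check like this surfaces the licensing question before the code reaches production rather than years later.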

Demetrios [00:58:53]: But it's another vector to be thinking about. And I appreciate that because it's not just again, it doesn't just take into account how many database replicas do I have. That's a whole different type of vulnerability that you could potentially get yourself into some hot water for because you didn't check the licensing on that open source tool that you were using. And so being aware of all these different ways that things can go wrong is quite useful. And I wonder about just since we, you mentioned how you want to have an incident response plan and you also were talking about how, yeah, to bring down a service is always going to be possible if you throw enough crap at it. And what did you accomplish there by bringing it down? It's almost like, so what?

Harcharan Kabbay [00:59:53]: Good job.

Demetrios [00:59:54]: Can we get back to work now? I think about stuff like we had on here, someone who is into chaos engineering, and I imagine, coming from DevOps, you've probably heard of the term. For those who haven't, it's very much thinking about how can we try to really stress test, or how can we have a hypothesis about, if something hits the fan, what will happen from here? And then can we test our hypothesis and see if our system is reliable or not, if we blow up something, or we start sending ten x the amount of traffic, or we start doing something like that. Have you thought about that in the machine learning world? And if so, how?

Harcharan Kabbay [01:00:48]: It's a great point. It is totally applicable, because when you are talking about resilient architecture, it cannot be complete without talking about chaos engineering. But in terms of the predictive models, like before the LLM world, we have been relying more on regular regression testing, because most of those things were in a passive inference, like a cron job fashion, and we started opening those slowly and surely to more and more APIs, like with KServe, etcetera. But it is important because as you are bringing more and more AI/ML into all these technologies, or the existing solutions, you have to start thinking along those lines. I personally have not spent too much time on it, but that is on my radar. Maybe it could be a topic that we'll discuss in some future podcast.

Demetrios [01:01:43]: I would like that. Yeah, because I often think about it as how can you do this type of chaos engineering with the data as opposed to the service? So what happens? And it's the same theory of, I have a hypothesis and I'm going to see if my system is robust enough to be able to handle this. And so instead of saying what happens if we get ten x the traffic, it's what happens if all of a sudden this database goes down, or if I all of a sudden start getting ten x the data going through, are we able to handle those types of situations? And so it's bringing that same mentality, but doing it a little bit more, I guess, on the data engineering side of things, where you want to just stress test your system and make sure that everything is doing what it says it's going to do.

Harcharan Kabbay [01:02:38]: But one thing I do like to mention here is, you know, availability of any resources in that whole landscape, right? Like a common point of failure, as we talked about. So when we talk about observability, right, we usually rely on, I have logging and monitoring in place, right? I have a logging solution. But there are ways that you can set up some health checks around that. Like even if you create a FastAPI app, right, you can create a health check endpoint on top of that, and through that you can hit your databases and get a health check. So if you have seen some of those official Docker images, they always come with a health check, right? But when you create one yourself, most of the time those things are overlooked. So I'm going to say that template thing again, bear with me: create something that makes sure you have these health checks, and an example Dockerfile which has those things incorporated. But the point here is you need all those things right from the beginning so that you can reduce the risk.
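A framework-agnostic sketch of such a health check: one function runs a check per dependency (database, vector store, LLM gateway) and returns an HTTP-style status code plus a per-dependency report. It could back a `/health` route in FastAPI or any other server; the dependency names and the check functions are hypothetical placeholders.

```python
from typing import Callable


def run_health_checks(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict[str, str]]:
    """Run every dependency check; return (status_code, per-dependency report)."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failing"
        except Exception as exc:  # a crashing check counts as an unhealthy dependency
            report[name] = f"error: {exc}"
    # 200 only when every dependency is healthy, 503 otherwise.
    status_code = 200 if all(value == "ok" for value in report.values()) else 503
    return status_code, report


if __name__ == "__main__":
    def database_ping() -> bool:
        # Placeholder: a real check might run a trivial query against the database.
        return True

    def vector_store_ping() -> bool:
        raise ConnectionError("vector store unreachable")

    print(run_health_checks({"database": database_ping, "vector_store": vector_store_ping}))
```

Returning 503 when any dependency fails lets an orchestrator or a container health check stop routing traffic to the instance, and lets the calling app show a "we are experiencing issues" message instead of timing out.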

Harcharan Kabbay [01:03:45]: The risk is still going to be there. Like, you know, we are all from some data science and machine learning background. You can minimize the probability of something, but you cannot reduce it to zero, right. It can still happen. It's more like, if you are aware, okay, how can I be the first one to know, because my team implemented it? If I have the proper things in place, I might be able to have the checks in there. Like, you know, before making a database call, or I know my UI app has checks and balances in there. If my underlying API is not performing, it'll say, okay, we are experiencing some issues, right.

Harcharan Kabbay [01:04:23]: I'm not going to let the traffic through. Like, I'm not going to get you to the next hop, right? Those are kind of the common development patterns. But the thing is, if you don't have proper health check monitoring in place, then you'll be doing some over-engineering or getting into issues where you are spending too much time on troubleshooting.

Demetrios [01:04:42]: Dude, this has been an awesome conversation. I really appreciate you coming on here.

Harcharan Kabbay [01:04:48]: Thank you.
