Making Sense of LLMOps
Maria is an MLOps Tech Lead at Ahold Delhaize, bridging the gap between data scientists, infra, and IT teams at different brands and focusing on standardization of ML model deployments across all the brands of Ahold Delhaize.
Maria believes that a model only starts living when it is in production. For this reason, she has focused on MLOps for the last seven years. Together with her colleague Başak, she started Marvelous MLOps to share MLOps knowledge with other ML professionals.
Senior Machine Learning Engineer with 5+ years of experience across diverse industries including banking, retail, and travel.
Lots of companies are investing time and money in LLMs, and some even have customer-facing applications, but what about some common sense? Impact assessment | Risk assessment | Maturity assessment.
AI in Production
Slides: https://docs.google.com/presentation/d/156zWsu39LKjUPBKgMcCrpGawBQKYdDtS/edit?usp=drive_link&ouid=112799246631496397138&rtpof=true&sd=true
Demetrios [00:00:05]: The marvelous MLOps duo that I'm about to bring onto the stage. Where are you at? Başak and Maria. Hello, you all.
Başak Tuğçe Eskili [00:00:15]: Hello.
Demetrios [00:00:16]: Hey, how you doing?
Başak Tuğçe Eskili [00:00:18]: Great, great. Yeah. Amazing conference. Looking forward to more talks.
Demetrios [00:00:24]: Yes, I am looking forward to your talk. So you all have 20 minutes on the clock. I'm excited for what you are about to present. I'm also very excited for all of your growth that you've had over the last year. When we first talked, you were like, yeah, I think we might start posting more on LinkedIn or we might start writing a blog. It looks like that has worked out quite well for you.
Başak Tuğçe Eskili [00:00:53]: Yeah, indeed. It's been a wonderful journey so far.
Demetrios [00:00:59]: Excellent. Well, are you all going to share your screen? I'll pull it up onto the stage once you share it. If everything goes well with that, I am going to jump off the screen and share the screen now with all of you. I'll see you in a bit.
Maria Vechtomova [00:01:18]: I hope people are seeing the right screen. Perfect.
Demetrios [00:01:26]: All right.
Maria Vechtomova [00:01:27]: All right. Thank you. Good morning. Good afternoon. Good evening. I know there are people from different time zones in the audience. We are Maria and Başak. Thank you for the great introduction.
Maria Vechtomova [00:01:36]: Demetrios. We work in different organizations now. Maria is working at Ahold Delhaize. I am [email protected], and today we're going to show you an overview of what's happening in the market regarding LLMOps and what actually makes sense to do, starting with the fact that every company wants to do something with LLMs. It's either top down, from executive-level managers as a company objective, or bottom up, from data scientists' initiatives. What we have seen in the market is that there are three main applications. There are Q&A systems. There are chatbots, which are the most common, and there are agents, which actually take actions on top of the user query, such as sending emails or creating a web UI. I've also seen recruitment.
Maria Vechtomova [00:02:29]: Agents are being used in the recruitment process to select candidates or eliminate some candidates. So these are the three main applications we have seen organizations working on. Another overview is the three different ways to use LLMs. The first one, which is the most common, is the foundation model, which means using an endpoint, for example from Cohere, with additional enhancements such as prompt engineering or a RAG implementation. The second one is using a pretrained model with, again, some enhancements such as fine-tuning, prompt engineering, or RAG implementations. And the last one is training from scratch, again with some additional RAG implementation or reinforcement learning from human feedback. But it's very clear that the first two are the most popular, the foundation models and pretrained models. Now I will show you some diagrams that you have probably seen many times.
Maria Vechtomova [00:03:32]: Also, the previous talk was about RAG systems. I'll be sharing these diagrams to remind you what those systems actually look like. This is a RAG system, which uses a vector database to get some similar context and enrich the prompt. So we get the prompt from the user and then we create an embedding of the prompt. We send it to the vector database, we get some similar context, and we use this similar context to enrich the prompt and send it to our LLM to get a better response. This is the naive RAG that we've seen implemented at organizations. If you want to make it more advanced, more robust, and more accurate, you can implement some advanced techniques, either before retrieval or after retrieval. Before retrieval, we see semantic query routing, query rewriting, and query expansion as different techniques to apply. In the post-retrieval part, we see re-ranking of the retrieved similar context, we see summarization, and we see different techniques to actually combine the different contexts retrieved from the vector database.
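Editor's sketch: the naive RAG flow described above (embed the user prompt, look up similar context, enrich the prompt) can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the speakers' implementation: the bag-of-words `embed` function stands in for a real embedding model, `ToyVectorStore` stands in for a vector database, and the documents and prompt template are hypothetical. The final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real system calls an embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical helper)."""

    def __init__(self, documents):
        self.items = [(doc, embed(doc)) for doc in documents]

    def search(self, query_vector: Counter, k: int = 2):
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def build_enriched_prompt(question: str, store: ToyVectorStore, k: int = 2) -> str:
    """Retrieve similar context and prepend it to the user question."""
    context = store.search(embed(question), k=k)
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


# Hypothetical documents, for illustration only.
store = ToyVectorStore([
    "Returns are accepted within 30 days of delivery.",
    "Our support desk is open on weekdays from 9 to 17.",
    "Gift cards cannot be exchanged for cash.",
])
prompt = build_enriched_prompt("When are returns accepted?", store, k=1)
print(prompt)  # this enriched prompt would then be sent to the LLM
```

The pre-retrieval and post-retrieval techniques mentioned in the talk would wrap around `store.search`: query rewriting or expansion before it, re-ranking or summarization after it.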
Maria Vechtomova [00:04:41]: And the rest is very similar to naive RAG. The retrieved contexts are used to enrich the prompt, and the LLM uses this context to generate a better response. So this is an advanced version of the RAG system. Another enhancement of the LLM use cases, when there is a pretrained model being used, is fine-tuning. The benefit of fine-tuning is that the standard available models may not be customized or trained for a specific use case. So with fine-tuning you actually adapt the pretrained model to your specific task. In addition to that, you might also bring in your domain knowledge using your own dataset and make the pretrained model more robust and more customized. In fine-tuning, we also see different techniques from the supervision perspective and from the parameter training perspective.
Maria Vechtomova [00:05:41]: We see self-supervised fine-tuning. We see supervised fine-tuning. We see reinforcement learning when it comes to supervision, and in the parameter training part we also see different techniques. We either train all parameters, which is very costly, or we use transfer learning, which has been out there for a very long time. The most common one we see in parameter training is parameter-efficient fine-tuning. So these were the overviews of some advancements in the LLM use cases. Having shown some techniques, and given that there has been so much research in LLMs, so many tools developed, and so many frameworks and libraries released, it's actually very easy to create an LLM application with literally ten lines of code, or you can apply the advanced techniques to bring your LLM application closer to the research level. This is the overview of the trends happening in the LLM field and in organizations.
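Editor's sketch: to make the cost difference between training all parameters and parameter-efficient fine-tuning concrete, the arithmetic below compares the two for a single weight matrix. The talk does not name a specific method; LoRA (low-rank adaptation) is used here as one widely used example, and the layer size and rank are hypothetical values chosen only for illustration.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters with a LoRA-style adapter.

    The original matrix stays frozen; only two small matrices are trained,
    B (d_out x rank) and A (rank x d_in), whose product is added to it.
    """
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
d_in = d_out = 4096
rank = 8

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA adapter:     {lora:,} trainable parameters")
print(f"share trained:    {lora / full:.2%}")
```

With these assumed numbers the adapter trains well under one percent of the parameters of the layer, which is why this family of techniques is much cheaper than updating every weight.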
Maria Vechtomova [00:06:55]: So I will show you some examples of applications announced by big companies or big names, starting with my favorite, Zalando, because I'm their dedicated customer. Zalando is an e-commerce website, and they released a fashion assistant to help customers navigate through Zalando's large assortment within a chatbot. So you ask certain questions, you give some preferences, and this assistant helps you find the right product. Another one is from Booking.com. It's still being developed, but they released a beta version. The AI Trip Planner helps travelers find their destinations and accommodations via chatting. So you type where you want to go or what type of holiday you are looking for, how many people you are, and the AI Trip Planner helps you find the right selection of accommodations. So these are examples of improving the user or customer journey on a certain platform.
Maria Vechtomova [00:08:03]: This is another example of, again, using chat to help customers navigate and understand a platform, which is from Palantir. This is not a knowledge-based system, but it's actually a chatbot. If you see an error or get stuck in the platform, you ask in the chat and you get an answer and a solution. So again, this is an example of improving a user journey, but we have also seen in the news that things can go wrong. So I'll show you this example. This is a grocery supermarket from New Zealand, Pak'nSave. The intention was great: they wanted to help customers use their leftovers. So it asks users to enter various ingredients and it generates a recipe.
Maria Vechtomova [00:08:57]: But you don't have control over what users are giving to this application. So when users actually started giving inedible ingredients to this app, it still generated some recipes, which were not really edible. Another example of what can go wrong is from DPD, which is a delivery firm, a shipment company. In the DPD chat, when users asked for recommendations for better firms, the DPD chatbot actually started to criticize DPD itself in a very negative way. And this is not what you want your chatbot to do. So you don't always have control over your LLM application. Therefore we say that before starting an LLM project you should conduct an impact and risk assessment, and after you've done your POC you should conduct your maturity assessment. And these are very common.
Maria Vechtomova [00:10:00]: These best practices already exist in the data science use case domain, but Maria will actually deep dive into each assessment to show you what is different and necessary specifically for LLM use cases.
Başak Tuğçe Eskili [00:10:16]: Thank you, Başak. So I think the most important thing is to start with the impact assessment. You need to identify the business problem, actually find the business users, and understand: do you really need an LLM for this specific use case? Because very often you don't need an LLM. You may have a simpler solution that works very well for your business objectives. Often what we see is that data scientists are eager to try all these new cool techniques and they start a project which is not aligned with the business objectives, which is doomed to fail because there is no business owner from the company that will actually support your use case. Estimate the cost before you start. It's very important to know how much it will cost and how long it will take to build the project, and obviously what the impact of it would be. Because if the costs are higher than the impact, then it doesn't make much sense to start the project in the first place. So after the impact assessment is done, you also need to do the risk assessment.
Başak Tuğçe Eskili [00:11:21]: Looking at the use cases that we see, pretty much every corporate company currently has an internal knowledge base use case that is LLM based. But some are trying to do something customer facing, as we have seen in the examples, and the risks of those applications are much higher because you can end up in the news in a bad way, as some companies do. Preventing hallucination is also very important, especially for customer-facing applications, and so are privacy concerns. What if you need to have PII data as part of the chat application? How do you handle that? How do you handle the bias? Because all the models are biased, people are biased. The data that models are trained on is biased. So how do you handle that properly? A data security breach is another problem that may occur for customer-facing applications. When you have an LLM system, it's most likely that you have a RAG application and you actually depend on some external third-party API. And what if that goes down? What kind of impact will it have on your system? Lack of human oversight.
Başak Tuğçe Eskili [00:12:36]: It's crucial to have someone monitoring the output coming from the system, and also to watch for misuse of AI. It can be that people don't use the chatbot, or whatever application you have, in the way it is intended to be used, by providing it some other information, and you don't want that either. So you need to understand the risks before you actually start with your POC, and we don't see that happening that often. So after you actually have something in place, a POC, and it is working successfully, you want to understand how mature you are, and what we've seen is that people actually think they are more mature than they actually are. That is the case for standard ML applications, but also for LLM applications. A couple of years ago, Başak and I created the MLOps maturity assessment. We actually mentioned it in the podcast for the MLOps Community back then as well, and we released it, and we conducted the maturity assessment within Ahold Delhaize. That's a company that has many different brands and many different applications.
Başak Tuğçe Eskili [00:13:55]: And we did it on the project level. And I think that's the difference from the maturity assessments that are proposed by Google or by Microsoft. There, there is a maturity level for MLOps that looks at the organization and how teams are organized, but it's not very actionable. You don't know how to change it. So when we look at the MLOps maturity assessment, we have seven main points, which are documentation, traceability and reproducibility, code quality, monitoring, A/B testing, feature store, and explainability. The first four are the most important. Documentation: you need to know whether KPIs and business goals are documented, and whether a software architecture design is in place.
Başak Tuğçe Eskili [00:14:44]: The ML model choice is clearly defined: why did you choose that model? Traceability and reproducibility: I think that's the whole core of MLOps. That's for any machine learning model deployment. You need to know what code produced the model, what infrastructure and environment were used, what model artifacts were produced, and what data was used for model training. Code quality: best practices ensure that there is a pull request process and that there are CI/CD pipelines that actually run automated unit and integration tests. And finally, there is monitoring in place. These four pieces are also really important for LLM applications. However, for LLM applications we have extra questions that are also crucial to go through. So Başak, if you go to the next slide. So there are LLMOps-specific questions.
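Editor's sketch: the traceability questions above (what code produced the model, what environment was used, what artifacts were produced, what data was used) can be answered by storing a small record with every training run. The sketch below is one possible shape for such a record, not the speakers' tooling. The class name, field names, commit hash, and byte strings are hypothetical; in practice the record would usually be attached to a run in a model registry.

```python
import hashlib
import json
import platform
import sys
from dataclasses import asdict, dataclass


def sha256_bytes(data: bytes) -> str:
    """Content hash used to pin down exactly which data or artifact was involved."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class TrainingRunRecord:
    """Hypothetical record linking a model to its code, environment, data, and artifact."""

    git_commit: str
    python_version: str
    platform_info: str
    training_data_sha256: str
    model_artifact_sha256: str


def make_record(git_commit: str, training_data: bytes, model_artifact: bytes) -> TrainingRunRecord:
    return TrainingRunRecord(
        git_commit=git_commit,
        python_version=sys.version.split()[0],
        platform_info=platform.platform(),
        training_data_sha256=sha256_bytes(training_data),
        model_artifact_sha256=sha256_bytes(model_artifact),
    )


# Placeholder inputs: a made-up commit hash and in-memory bytes instead of real files.
record = make_record(
    git_commit="abc1234",
    training_data=b"feature,label\n1.0,0\n2.0,1\n",
    model_artifact=b"serialized-model-bytes",
)
print(json.dumps(asdict(record), indent=2))
```

Because the hashes depend only on content, two runs with the same data and the same resulting artifact produce the same fingerprints, which is what makes a run reproducible and auditable later.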
Başak Tuğçe Eskili [00:15:44]: For example, regarding the foundation model API: for any API call we can look up which endpoint and version were used, the structure of the request and response, token usage, costs, latency, and what prompt and response were generated. Then for a RAG application you have three pieces: generating, storing, and retrieving embeddings. When generating embeddings, you can look up what model was used to actually generate the embeddings, the computational and retrieval latency, how documents are parsed, how long the chunks are, and what the strategy was for creating those chunks. Storing embeddings: when storing embeddings, typically in some vector database, we can look up how that database is updated with new documents, what metadata is saved with the chunks, and which document and part of the document each chunk comes from. And when retrieving embeddings from the vector database, we can look up how many chunks are retrieved, the strategy for combining chunks, metadata filtering, what metadata was actually retrieved, and what similarity algorithm was used. Then there is also the important piece of prompt engineering: that there is a strategy present for integrating content into the user query, and also for query enrichment before it's sent to a foundation model.
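Editor's sketch: the foundation model API questions above (endpoint, version, token usage, cost, latency, prompt, and response) suggest wrapping every call so that these fields are captured in one place. The following is a minimal sketch under stated assumptions, not a real provider SDK: `fake_model` stands in for an actual client, and the endpoint URL, model version string, and price per 1,000 tokens are made-up values.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class LLMCallTrace:
    """Hypothetical per-call record covering the fields listed in the talk."""

    endpoint: str
    model_version: str
    prompt: str
    response: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    cost_usd: float


def traced_call(
    call_model: Callable[[str], Dict],
    endpoint: str,
    model_version: str,
    prompt: str,
    usd_per_1k_tokens: float,
) -> LLMCallTrace:
    """Call the model once and return a trace that can be logged or stored."""
    start = time.perf_counter()
    result = call_model(prompt)
    latency_s = time.perf_counter() - start
    total_tokens = result["prompt_tokens"] + result["completion_tokens"]
    return LLMCallTrace(
        endpoint=endpoint,
        model_version=model_version,
        prompt=prompt,
        response=result["text"],
        prompt_tokens=result["prompt_tokens"],
        completion_tokens=result["completion_tokens"],
        latency_s=latency_s,
        cost_usd=total_tokens / 1000 * usd_per_1k_tokens,
    )


def fake_model(prompt: str) -> Dict:
    """Stand-in for a real API client; counts whitespace-separated words as tokens."""
    return {"text": "Stub answer.", "prompt_tokens": len(prompt.split()), "completion_tokens": 2}


trace = traced_call(
    fake_model,
    endpoint="https://llm.example.invalid/v1/chat",  # placeholder URL
    model_version="demo-model-2024-01",              # placeholder version
    prompt="What is the return policy?",
    usd_per_1k_tokens=0.5,                           # made-up price
)
print(trace)
```

The same pattern extends to the RAG questions: the chunking strategy, the embedding model name, and the retrieved chunk identifiers can be added as extra fields on the trace.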
Başak Tuğçe Eskili [00:17:22]: Prompt engineering is a huge topic in LLM research, and probably everyone has heard about OpenAI's guide to prompt engineering. That's a great resource that gives you some ideas on how you can do that better. And fine-tuning: model fine-tuning is actually very similar to the process of training in MLOps. So you need to know what code was used for model fine-tuning, what infrastructure was used, what artifact was produced, and what training data was used. Also the retraining strategy: how often do you fine-tune your model, and what does the automation of the whole process look like? And methodology: how do you actually fine-tune? There are multiple ways, and for some you will need to have some infrastructure in place. And coming back to the infrastructure point, I think there are some core pieces in MLOps infrastructure, so it's actually quite simple. You need version control, you need to have CI/CD, you need to have some orchestration.
Başak Tuğçe Eskili [00:18:33]: I think Airflow is probably the most widely used. You need to have something for a model registry, so you have SageMaker, Vertex AI registries, MLflow, a container registry. You need to have something for compute and serving, and there are end-to-end tools for that as well. Evaluation and monitoring: for monitoring you have some standard monitoring tools that are also used for software engineering. And feature stores, like Tecton, the Databricks feature store, SageMaker, Hopsworks, and Feast. So those components are the core components for MLOps, and we will also need them for LLMOps.
Başak Tuğçe Eskili [00:19:14]: But for LLMOps we also have very specific extra components, like foundation model API providers such as Amazon Bedrock, Databricks, which has released a foundation model API that is quite interesting in how they implemented it, Vertex AI, and Azure OpenAI. You have vector databases, which weren't a core piece of MLOps before, but they are very useful not just for LLMs but for other applications like recommender systems. So I think they will stick around for standard MLOps as well. And there are also some prompt engineering UIs, which are very handy. MLflow released one recently. So we will see different changes in the whole infrastructure. Obviously compute and serving will also have different requirements compared to standard MLOps if you need to do some fine-tuning and model serving. But this is still the core piece of infrastructure. So we actually created a list of recommended resources. I believe it will be coming to our GitHub and also to LinkedIn soon. But here are some interesting resources we've seen, like a large language model survey that came out recently.
Başak Tuğçe Eskili [00:20:42]: Eugene Yan has a really nice blog and she writes a lot about LLMs. There is a RAG for LLMs survey, and reinforcement learning from human feedback from Hugging Face. A lot of research, a lot of people working on it. So if you have any questions you can send us an email, and you can connect with us on different social media. You can scan the QR code for that. And thank you very much. We love cats, so we have a lot of cats in our presentation.
Demetrios [00:21:16]: Somebody called out specifically that they love the memes. So the meme game is on point. That is for sure. You are professional memers. So there are some questions coming through the chat, and I think we have a minute to answer them, if you all are open for it. Let's see, let's see here. Lauren's asking: in your opinion, what is an ideal scenario for evaluating whether to use LLMs and the cost that incurs? Who should we be talking or shoot? All right, words are hard, so give me a sec, let me try this one again. What's an ideal scenario for evaluating whether to use LLMs and the cost?
Demetrios [00:22:15]: E.g., who should be talking to who? What are some common challenges in these discussions? So I think this is an organizational-level question.
Başak Tuğçe Eskili [00:22:26]: Yeah, that's a good question. I think it depends on the application and what makes sense. We had an interesting use case regarding recipes. I think recipes in general are now a very complicated topic because of one company ending up in the news in a bad way because of it. But I think we ended up doing a POC, not with LLMs, just with OCR. You scan a recipe so the system understands what kind of ingredients you need, and it just suggests the ones based on some recommender system and the similarity to you, and adds them to the basket. There is no LLM needed for that, right? You can do it with just OCR.
Maria Vechtomova [00:23:14]: Yeah, I think it should start with actually defining some KPIs. Improving user experience may not always be measurable, but with a chatbot as a sales agent, for example, or a chatbot as a knowledge base, what exactly do you want to get out of it? Do you actually want to improve a certain experience, or do you want to increase sales? So defining those, and then also, at the POC level, checking whether it's actually bringing that value, would be very helpful.
Demetrios [00:23:47]: No, I just want to tell my shareholders that I'm using AI. That's all I want to do.
Maria Vechtomova [00:23:53]: Or you want to actually please your executive-level managers so that they can say we have gen AI?
Demetrios [00:24:00]: Yes, exactly. That makes the stock go up to the moon. So are there any particular grounding techniques you could recommend if those exist?
Başak Tuğçe Eskili [00:24:15]: Yeah, that's a good one. So I think we don't have the problem because, well, depending on the industry, obviously, but for food retailers, I think it's quite hard to come up with a solid business use case for LLMs. If there was one, I think management would be much more eager to do something with it. But now they basically just do things on paper, I would say. So we have a gen AI lab, but there is no use case per se that can generate serious value. So I think that's, yeah, that's quite a complicated situation.
Demetrios [00:25:03]: Got it. I think that is it. There are a few more questions in the chat, but I'm going to keep it rocking. Başak and Maria, I encourage everyone who is not already following you on LinkedIn to follow you and enjoy all the incredible memes and the amazing posts that you have. Thank you so much for joining us. This is awesome.
Maria Vechtomova [00:25:25]: Thank you.
Başak Tuğçe Eskili [00:25:26]: It was a pleasure to be here.