Charting LLMOps Odyssey: Challenges and Adaptations
Yinxi Zhang is a Staff Data Scientist at Databricks, where she works with customers from various verticals to build end-to-end AI solutions. Prior to joining Databricks, Yinxi worked as an ML specialist in the energy industry for 7 years. She holds a Ph.D. in Electrical Engineering from the University of Houston. Yinxi is a former marathon runner and is now a happy yogi.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
In this presentation, Yinxi Zhang navigates the iterative development of Large Language Model (LLM) applications and the intricacies of LLMOps design. She emphasizes the importance of anchoring LLM development in practical business use cases and a deep understanding of one's own data. Continuous Integration and Continuous Deployment (CI/CD) should be a core component of LLM pipeline deployment, just as in Machine Learning Operations (MLOps). However, the unique challenges posed by LLMs include data security, API governance, the need for GPU infrastructure for inference, integration with external vector databases, and the absence of clear evaluation rubrics. The audience is invited to join as Yinxi illuminates strategies to overcome these challenges and make strategic adaptations, including reference architectures for the seamless productionization of RAG applications on the Databricks Lakehouse platform.
AI in Production
Charting LLMOps Odyssey: Challenges and Adaptations
Demetrios [00:00:00]: We're going to jump to the next talk because we've got Yinxi around here somewhere, I believe. Calling Yinxi to the stage. Where are you at? Hello?
Yinxi Zhang [00:00:14]: Hey, I'm here.
Demetrios [00:00:15]: You are awesome. So, one thing, I imagine you want to share your screen, right?
Yinxi Zhang [00:00:23]: Yes.
Demetrios [00:00:24]: All right. You want to get that rocking. And in the meantime, I am going to encourage people. You're actually currently working at Databricks, right? And I'm going to encourage everyone to check out the Databricks booth that we set up and give a huge shout out and thanks to Databricks for sponsoring this conference, because without their love and support, I don't know what we would do. I did it earlier, but I'm going to go ahead and do another little product placement right here. This is.
Demetrios [00:01:02]: Mmm.
Yinxi Zhang [00:01:05]: I think I'm okay.
Demetrios [00:01:07]: There we go. We got it. Nice. Excellent. So I will be back in about 20 minutes, and for anyone that has questions, this is going to be another long talk, not a lightning talk, so we'll be able to ask questions at the end of it. Feel free to jump in the chat with your questions. All right, I'll see you in 20 minutes.
Yinxi Zhang [00:01:33]: All right. Thank you so much for the introduction, and hello, everybody, and welcome to this session. We'll talk about LLMOps, what challenges LLMs bring compared to MLOps, and how we make adaptations. First of all, who am I? A little bit of self-introduction. My name is Yinxi. I'm a staff data scientist at Databricks. I've been with the company for three years. I work with our customers to deliver and deploy GenAI products on the Databricks platform.
Yinxi Zhang [00:02:10]: All right, throughout this talk, we want to. Let's see. Okay, so throughout this talk, we want to talk about different generative AI design patterns and deployment strategies. We want to emphasize and shed light on LLMOps, which basically tries to assist and automate the iterative development of GenAI applications. We want to walk you through each of the components in the LLMOps loop, including understanding the business use case, data exploration and analysis, pipeline prototyping, and how you package and evaluate the model and use CI/CD tools to deploy it into production. In the end, what are the different serving options you have when it comes to LLMs? All right, without further ado, let's jump into it. This diagram here basically shows the four different patterns of GenAI system and application design.
Yinxi Zhang [00:03:20]: It goes through different maturity levels. At the beginning, you want to start from basic, off-the-shelf LLMs and do prompt engineering to craft and specialize the prompt so that it gives you the expected behavior and good answers. As you mature, you can start to develop and design a RAG system, which combines the LLM with the customer's external, additional data. Up to this stage, with prompt engineering and RAG, we are not touching the foundation models, right? We are just providing additional information to the model but not changing any of the model parameters. Moving forward, fine-tuning and pretraining basically leverage your own data to update the model weights, either fine-tuning an existing model or training from scratch. So usually at the beginning you would do prompt engineering and only move up to the next level if, first of all, you have sufficient data, and secondly, the current results are not satisfactory. All right, so with a regular machine learning pipeline, the machine learning workflow includes three key assets. You have your code in the code repository, then you have your data.
Yinxi Zhang [00:04:56]: In most cases it is tabular, and in addition you will have your machine learning models. All three key assets need to be developed in the development environment, tested in the staging environment, and finally deployed to production. So from dev to production, you can think of it as going from a more open environment to a more closed environment, and you would want to restrict and limit human interactions with the prod environment and delegate a lot of the control to CI/CD tools or service principals. That's for MLOps. So how do LLMs make a difference? Right. The first thing that comes to mind, of course, is the change in terms of models. Instead of having a regular scikit-learn or XGBoost model, now you have large language models.
Yinxi Zhang [00:06:01]: The popular generative models usually have billions of parameters, and even smaller classification models like BERT have hundreds of millions of parameters. So how do you interact with those large language models? You can either have a locally downloaded version or interact through LLM APIs from providers like OpenAI, Cohere, or MosaicML. So the interaction with models is different compared with conventional ML pipelines, and it usually requires GPU infrastructure. And then let's think about data. For conventional machine learning models, it is usually tabular data, but for LLMs, your input data is raw text. And actually, even before the input to the LLM, at the beginning you might start from unstructured data in document formats like PDF files and image files, and you need to parse it into raw text. We mentioned that prompt engineering and RAG all depend heavily on prompts.
Yinxi Zhang [00:07:17]: So your prompt is a big key addition to your raw data. In the case of RAG, the external documents and knowledge base are converted into embeddings and stored in a vector database. So raw text, unstructured input, prompts, and embeddings are additional formats of data you will need to use in a GenAI system. That makes GenAI Ops, or LLMOps, more challenging than regular MLOps. Throughout the remainder of the talk, we'll try to explain the LLMOps lifecycle, and I will walk you through the key components in the lifecycle step by step. Usually you would start from the development stage, right? And at the development stage, we start from a business problem and try to understand this business problem with existing data. Do data analysis and discovery, and then start to prototype your pipeline code, either prompt engineering or RAG. Once your code is ready, we want to think about how we package this pipeline or this model to prepare for deployment, and we do evaluation and model validation before we actually deploy it to production with CI/CD tools, which help with unit testing, integration testing, and automated deployment. Let's first start with business understanding.
Yinxi Zhang [00:09:20]: It is really important and critical to ground LLMs in real-life problems. Before you just jump into the world of LLMs and start experimenting with different types of foundation models, maybe you should ask at the beginning: what is the problem I'm trying to solve? Do I have sufficient data, and can I define quantified success criteria? Because only something that solves a real problem can be deployed into production and generate real returns, right? And also, the use case will decide what type of tool stack you can use. Are you thinking about a chatbot, where you would need generative models like ChatGPT or Llama 2 or Mistral? But if you only need something like PII de-identification, it is basically a classification task, and you don't need those large generative models. A model like BERT is suitable and more appropriate for those classification use cases. You also want to think about how the user interacts with the solution. Is a real-time response required? If so, what are the throughput and latency SLAs? Those are the things to define at the design stage. And do a little bit of analysis of your query types. If this is, say, a chatbot, what kind of questions do people usually send to it? How long are they, and what does the context generally look like? And once you build this chatbot, what does the user interaction look like? Is it going to be a UI, and how does it integrate with the rest of the existing system? These are all things you want to consider from the business design point of view. Then the next step is data exploration and analysis. Incorporating proprietary data into LLM-powered applications provides additional context to the LLM, makes it more performant, and reduces its hallucinations. So understanding the data available for your LLM is critical.
Yinxi Zhang [00:12:03]: This is very similar to the conventional machine learning world. You spend a lot of time profiling your tabular data, creating visualizations, and trying to understand the insights of the data before jumping into conclusions and model design. We should do the same in the case of LLM use cases as well. Profile your existing text, profile your query examples, and understand the size and nature of your document collection. If your data has specific domain language, keep that in mind, and when searching for models, pay additional attention to models that have been trained on similar data. That might give you a little bit of an edge compared with an off-the-shelf model trained on general public data. How frequently the documents are updated is important in the design as well.
Yinxi Zhang [00:13:13]: So leverage your data, and we'll get into more details of how data plays an important role in the RAG section in later slides. All right, then comes the most fun and exciting part: prototyping your pipelines and code. With the time limit in this talk, we'll only cover the prompt engineering and RAG examples. So prompt engineering is basically articulating your language so that you communicate better with LLMs and they do what you want them to do. For the moms and wives in the audience, I would use this analogy: think of talking to LLMs like telling your kids or husband to do the laundry. It might not work that well if you just say, honey, could you please do the laundry? Of course, sometimes they will understand it, but in most cases you need to provide more clear, concise, and step-by-step instructions, right? We'll say, honey, could you please collect all the dirty clothes and socks in the house or on the floor, put them in the laundry basket, bring it to the laundry room, and put it in the washer.
Yinxi Zhang [00:14:47]: Once it's done, make sure you transfer them to the dryer and dry them. And we also need to provide a few examples, right? You would tell them, hey, could you please separate whites from colors and use different detergents for different fabrics? Sometimes this works pretty well, but you probably need to provide additional information, like telling them how to behave, right? If you don't know how to deal with one of the clothes, bring it to me and ask me. So, same thing when interacting with LLMs: provide them step-by-step instructions with thoughts, give them examples, and ask them to say 'I don't know' when they don't have relevant information, to reduce hallucinations. Unlike humans, LLMs probably don't need all that emotional prep and encouragement, but they are more prone to hacking. In the case of doing laundry, you should be really alert if they say, hey, can I just bring this to the laundry store, and can I use your credit card? That's a similar sign to prompt hacking that you should be aware of. All right, all the jokes aside, a couple more callouts about prompt engineering.
Yinxi Zhang [00:16:23]: Prompts are model specific. The same prompt used with different models generally produces different results. Models also tend to have different syntax for guarding your prompts. If you have used the Llama 2 models and their variants before, basically the instruction is bracketed by the all-uppercase INST characters, which stand for instruct, and ChatGPT has a different prompt syntax than Llama 2. So just pay attention to the required best practices for the different models you're experimenting with.
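For reference, the Llama 2 chat prompt format being described looks roughly like the sketch below. This is a simplified illustration, not from the talk; check the model card for the exact template, and note that ChatGPT-style APIs take a list of role/content messages rather than one formatted string.

```python
# Roughly the Llama 2 chat format described above (simplified; consult the model
# card for the exact template). The system and user messages here are made up.
llama2_prompt = (
    "<s>[INST] <<SYS>>\n"
    "You are a helpful assistant. If you don't know the answer, say so.\n"
    "<</SYS>>\n\n"
    "How do I reset my password? [/INST]"
)
print(llama2_prompt)
```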
Yinxi Zhang [00:17:32]: The recommended best practice for prompt engineering is to go from simple to complex. You want to use tools like MLflow to track your prompt queries and also check your responses. This is similar to the hyperparameter tuning we would do in conventional machine learning, right? You want to keep track of all the experimental trials and then figure out the best hyperparameters. In this case, you want to figure out the best prompts. To streamline and modularize the process, use prompt templates; frameworks like LangChain and LlamaIndex all have prompt templates. A more advanced use case would be to automate the design process using packages like the one linked here, DSPy, developed at Stanford.
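Before moving on to RAG, here is a minimal sketch of the prompt-template-plus-tracking workflow just described. It is an illustration rather than code from the talk: the model name and the `my_llm` callable are placeholders for whichever LLM you are experimenting with.

```python
# Minimal sketch: track prompt-engineering trials with MLflow and a LangChain
# prompt template. `my_llm` is any callable that maps a prompt string to text.
import mlflow
from langchain.prompts import PromptTemplate

template = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using only the context below. "
        "If the answer is not in the context, say \"I don't know.\"\n\n"
        "Context: {context}\n\nQuestion: {question}\nAnswer:"
    ),
)

def run_trial(my_llm, question: str, context: str) -> str:
    """Format the prompt, call the model, and log the trial to MLflow."""
    prompt = template.format(context=context, question=question)
    response = my_llm(prompt)
    with mlflow.start_run(run_name="prompt-trial"):
        mlflow.log_param("model", "llama-2-7b-chat")  # hypothetical model choice
        mlflow.log_text(template.template, "prompt_template.txt")
        mlflow.log_text(prompt, "prompt.txt")
        mlflow.log_text(response, "response.txt")
    return response
```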
Yinxi Zhang [00:18:35]: And now we want to talk about RAG, right? RAG is basically a dynamic solution to the limitations of off-the-shelf foundation models. Foundation models, first of all, are trained on public data; hopefully they haven't seen your proprietary data before, right? So if you can provide that additional context to the LLM, it kind of reduces that limitation. The other limitation of foundation models is that they are trained on historical data before a cutoff snapshot. So, same thing: if your data is constantly evolving, providing that new information as context to the LLM definitely helps. From the user's point of view, the interaction with LLMs in a RAG setup still looks the same. The user asks a question using a prompt and gets a generated answer. But under the hood, the system actually does a few additional steps. It will first embed the user query, and then this is where the vector database plays a role.
Yinxi Zhang [00:19:47]: It will go into the vector database and do a similarity search to find the relevant documents, which have a closer cosine similarity score with the query, and retrieve those relevant documents. It uses them to augment the context, generates a new prompt with the retrieved documents together with the original query, and then sends that to the LLM to generate the final answer. So if you have source documents, you will need a constantly refreshing data pipeline that generates embeddings from those source documents and stores them in the vector database. The use of a vector database adds an additional component to LLM systems and also requires us to pay attention to embedding refresh and to model packaging for LLM pipelines. The benefits of RAG include, of course, providing an external knowledge base to foundation models so that they have up-to-date and more context-relevant responses, which reduces hallucination and gives the system an edge in understanding domain-specific questions. This is also more cost efficient compared with fine-tuning or pretraining a foundation model. Without changing the model parameters, you are still able to generate relevant results just by providing an external knowledge base.
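To make the retrieve-then-generate flow described above concrete, here is a minimal sketch. The `embed`, `vector_db`, and `llm` arguments are stand-ins for whatever embedding model, vector store client, and LLM a given system uses; they are not specific Databricks APIs.

```python
# Minimal sketch of the RAG flow described above.
def rag_answer(question: str, embed, vector_db, llm, k: int = 3) -> str:
    query_vector = embed(question)                          # 1. embed the user query
    docs = vector_db.similarity_search(query_vector, k=k)   # 2. retrieve the top-k closest chunks
    context = "\n\n".join(doc.text for doc in docs)         # 3. augment the context
    prompt = (
        "Use the context below to answer the question. "
        "If the context is not relevant, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)                                      # 4. generate the final answer
```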
Yinxi Zhang [00:22:19]: All right, and then let's talk about model packaging, right? We mentioned all the different components that are unique to LLMs, and that brings new challenges for LLM model packaging. Instead of packaging just one regular model, you might want to package a model chain which includes the prompts, the model itself, the integration with the vector database, et cetera. In the case of pretraining or fine-tuning, you will have a customized model, and if you have additional preprocessing and postprocessing logic, then you need to package that as a customized model as well. Before we talk about solutions for model packaging, I do want to call out the trade-offs between using third-party APIs and self-hosted models. With the time limit here, I will gloss over the latter two aspects, which are predictable and stable behavior and vendor lock-in, and really highlight data security and privacy. I have been working with customers in verticals like healthcare and finance. They have really strict data security review systems and governance. Using third-party APIs, which involves sending data to external servers, is a big risk and concern. So they would prefer to host their own models in their own secure environment to get full access to the model and avoid data leakage.
Yinxi Zhang [00:23:50]: Then, coming back to model packaging: tools like MLflow have really nice integrations with different frameworks and can help streamline the process of packaging LLM pipelines and preparing for model deployment. MLflow now supports built-in flavors like PyTorch, TensorFlow, Hugging Face Transformers, OpenAI, and LangChain. The integration with LlamaIndex is on the roadmap and will probably be released soon. And I provided the two simple APIs you can use to log and track your LangChain chain. In terms of deployment, you can use the MLflow deployments API to deploy your LLM to cloud providers like Azure, AWS, or Databricks.
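As a hedged sketch of those two APIs, logging a LangChain chain with MLflow and reloading it for serving might look like this. The `chain` object and the query key are placeholders, and the exact `predict()` input format depends on the chain and the MLflow version.

```python
# Hedged sketch: package a LangChain chain as an MLflow model, then reload it
# behind the generic pyfunc interface for deployment. `chain` is assumed to be
# a LangChain chain you have already built.
import mlflow

with mlflow.start_run():
    model_info = mlflow.langchain.log_model(
        lc_model=chain,
        artifact_path="rag_chain",
    )

loaded = mlflow.pyfunc.load_model(model_info.model_uri)
# answer = loaded.predict([{"query": "What is our refund policy?"}])  # input keys depend on the chain
```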
Yinxi Zhang [00:25:35]: All right, evaluation and model validation. I want to be conscious of time here, so I'll probably gloss over this section. Evaluation is a very challenging and evolving domain right now. The field has a consensus that it currently lacks evaluation rubrics and also high-quality evaluation datasets. In the current landscape, we have a few popular benchmark datasets. The diagram on the right side of the slide is from Mosaic's model gauntlet. It has also been a popular choice to use LLMs as evaluators, and Databricks has a really nice blog post about it. We evaluated different model choices and different evaluation rubric configurations to test the performance of using an LLM as judge. Of course, the ideal solution is to have human feedback in the evaluation. So when designing LLM systems, always keep in mind to provide an interface where human users can give feedback when they see the UI and the results. It's going to be beneficial.
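As a rough sketch of LLM-as-judge evaluation with MLflow's GenAI metrics (which come up again in the Q&A): the tiny dataset below is hypothetical, and the exact metric names and judge-model URI vary by MLflow version and provider.

```python
# Hedged sketch of LLM-as-judge evaluation on a static prediction table with
# mlflow.evaluate(). The data, metric choice, and judge model are illustrative.
import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "predictions": ["MLflow is an open source platform for managing the ML lifecycle."],
    }
)

results = mlflow.evaluate(
    data=eval_data,
    predictions="predictions",
    model_type="question-answering",  # adds built-in QA metrics
    extra_metrics=[
        mlflow.metrics.genai.answer_relevance(model="openai:/gpt-4"),  # LLM judge metric
    ],
)
print(results.metrics)
```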
Yinxi Zhang [00:26:45]: All right, next we are going to talk about CI/CD. Again, I want to be mindful of time here, so we'll just gloss over the CI/CD content. Basically, a typical CI/CD pipeline has these steps: developers build their source code and commit changes, then run the tests, deploy the code to the staging environment and monitor the performance to make sure everything behaves as expected with stable performance, and then deploy it into production. For tests, you want to have a lot of unit tests and then do an integration test at the end, a final end-to-end test from the beginning of data ingestion to the final user-facing API or user interface. The benefit, of course, is that automated tests help catch regression errors at an early stage and reduce friction in the model and pipeline iteration stages. And the last component in the LLMOps system we want to cover here is serving. There are different serving options. If you don't require a real-time response, batch and offline serving is a good choice. Spark has a really nice and neat feature called pandas UDFs, which can help distribute the serving across multi-GPU clusters.
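As a rough sketch of the pandas UDF pattern just mentioned: `load_model()` is a hypothetical helper that returns a text-generation callable, and the column names are illustrative.

```python
# Sketch of distributed batch inference with a Spark pandas UDF. Using the
# iterator variant lets each partition (and its GPU) load the model once.
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def generate_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # hypothetical helper: load the LLM once per partition
    for prompts in batches:
        yield prompts.map(lambda p: model(p))

# result_df = prompts_df.withColumn("answer", generate_udf("prompt"))
```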
Yinxi Zhang [00:28:23]: And in the case of real-time serving, you need to think about how to package your API calls and pre- and post-processing logic. All right, and we have a reference architecture on the Lakehouse platform for RAG. So basically, on the Databricks Lakehouse platform, we recommend saving your objects and assets into Unity Catalog, and then you have the different components of the pipeline. This is the production environment, where you will have your vector search database updates; that's a refreshing data pipeline. Then the updated vector database will be used in the RAG pipeline as well. So you'll have pipeline development and evaluation. Once that's done, it will go to pipeline validation. The generated model will be logged using MLflow to the MLflow tracking server, and then the packaged MLflow model file will be used for deployment.
Yinxi Zhang [00:29:57]: And you can add a monitoring component to it and create a serving endpoint on the platform too. All right, the key takeaways I want to highlight: for application design, always start from a business use case. Start simple, but plan for scaling. Remember, it's going to be iterative. Your model development does not stop the moment the model is first deployed into production; you want to keep monitoring the performance and iterate on the system constantly. And then understand the challenges introduced by LLMs and find adaptations. In terms of data security and privacy, consider using centralized API governance and payload encryption; one example of that is Databricks Foundation Models. And then think about how to use vector databases and GPUs for serving. And for evaluation,
Yinxi Zhang [00:31:18]: try to collect human feedback starting at the beginning. All right, thank you all for staying with me. I'll take a look at the event page and see whether there are any questions I need to answer.
Demetrios [00:31:38]: You don't even have to, I've got you covered on this. I think there is one question that came through that I specifically want to ask because I think about it quite a bit, and it comes to when you're using LLMs to judge the output and evaluate: what kind of extra cost can you expect to incur?
Yinxi Zhang [00:32:03]: Right? That's a great question. So I don't know whether you have seen the paper and articles about the RAG triad. Basically, you have your input queries, your retrieved context, and the final answers. Those are the three vertices of your triangle. And across the three vertices you have LLMs as judges to evaluate the context-to-query relevance, the final-answer-to-query relevance, and also the groundedness: whether the final answer has any hallucinations or any information that does not exist in the context, right? So with this triad, it basically means that if you are using an LLM as judge, for one single call, instead of just generating the result once, you are generating results three additional times because you are evaluating all of that information.
Yinxi Zhang [00:33:13]: And yes, it will definitely increase the cost. But on the other hand, we need to think about the opportunity cost here, right? If you don't use LLMs as judge, what are the alternatives? The alternative would be to curate an evaluation dataset and ask humans, and they have to be humans with domain knowledge, usually domain experts, to evaluate that. And what about the cost comparison between those two? Yes, I agree, there will definitely be additional costs to using LLMs as judges, and the justification is whether it's worth it compared with other evaluation solutions.
Demetrios [00:33:56]: Yeah, so basically, if you're going to do evaluation with an LLM, that's great, and it's going to cost you more. Yes, but is it going to cost you less than doing it with a human that actually knows and understands the answers?
Yinxi Zhang [00:34:13]: Yeah.
Demetrios [00:34:14]: Okay, cool. So the other piece on this, I guess, is using an LLM to evaluate versus fine-tuning a model. And I guess what I'm seeing in this question, and realizing, is that just because you fine-tune a model doesn't mean you can forget about evaluation.
Yinxi Zhang [00:34:41]: Exactly. Evaluation has to be everywhere, right? In all kinds of GenAI application design, you need to have a proper way to evaluate your performance, and only with quantified evaluation metrics can you make a decision: whether I want to move forward, whether I want to actually deploy this.
Demetrios [00:35:07]: Seen. So this is a great one from Linus, asking about knowledge-based chatbots: do you get better performance with just having knowledge documents, or with examples of Q&A from human-to-human queries?
Yinxi Zhang [00:35:26]: Oh, that's a good point. So I'm trying to rephrase and make sure I understand it correctly, right? There are two options here: either using human-curated examples as few-shot learning, or using RAG to provide additional context. Is that correct?
Demetrios [00:35:48]: Yeah, and I think it's even more so. Even if you're doing RAG, what context are you providing? Are you providing just random pieces of a knowledge base, or are you providing question and answer pairs?
Yinxi Zhang [00:36:02]: Okay, yeah, those are all great discussion points. I actually believe LLMs are similar to any machine learning models: the more you understand the data and your documents, the better, right? So I've been working with customers and users, and I find they spend a lot of time just working with domain experts to iteratively refine the prompts by providing examples and doing prompt engineering. And then we also experimented with RAG. And the key to improving RAG performance is actually how well you parse your data, how well you parse your documents. You have to create context chunks that make sense, that actually provide useful information, so the LLM can use them as context, right? I think with highly skilled humans curating the prompts, doing prompt engineering will achieve even better, or at least on-par, results compared with RAG.
Yinxi Zhang [00:37:10]: But if you design your RAG using that domain knowledge, you will get better or on-par results as well. It's all about how well you understand your documents and understand your questions, rather than which technique is more superior.
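As a rough illustration of the chunking point above, here is a minimal fixed-size splitter. The chunk size and overlap are arbitrary placeholders; in practice you would usually split on document structure (sections, paragraphs) so each chunk stays self-contained.

```python
# Minimal illustration of chunking parsed text before embedding it. Values are
# arbitrary; tune them (or split on document structure) for your own documents.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```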
Demetrios [00:37:29]: Excellent. So this has been great. There are a few more questions that are coming through in the chat, but I realized we are a little bit overtime and so I'm going to keep us moving because that is what I do. That is my one job today.
Yinxi Zhang [00:37:44]: Sure.
Demetrios [00:37:44]: I want to thank you so much for coming on here and breaking this whole thing down for us. For all these other questions, I would just recommend jumping in the chat, and so I'll let you answer them in the chat on the platform.
Yinxi Zhang [00:38:00]: Sure. Thank you so much. I do have a quick thought on the LLM-as-a-judge part. Now MLflow has evaluation metrics for GenAI, and feel free to try it out. You can use OpenAI or your external models as the LLM judge and evaluate your model performance for use cases like summarization or question answering.
Demetrios [00:38:30]: There we go. There it is. So perfect. I love it. And yeah, I'm a fan of MLflow anyway; this is just another reason to be one. All right, we will be seeing you soon, and we're going to just keep it cruising. Thanks so much.