MLOps Community

Open Model and Its Curation Over Kubernetes

Posted Jul 26, 2024 | Views 56
# Generative AI
# Kubernetes
# Google Cloud
speaker
Cindy Xing
Software Engineering Manager @ Google

With over 18 years in the software industry, Cindy is a problem solver and domain expert in building large-scale enterprise and cloud products for customers. In 2021, Cindy contributed to the Azure OpenAI service as one of its early engineers. She is also an active member, architect, and user group co-chair in the Kubernetes, CNCF, and LF Edge/Akraino communities.

Currently, Cindy works in the Google Kubernetes Engine group. Her team's mission is to onboard GenAI customers (developers, data scientists, ML engineers, etc.) to GKE by addressing friction points. One of the goals is to curate open models and empower customers to make the right choice.

SUMMARY

Open Generative AI (GenAI) models are transforming the AI landscape. But which one is right for your project? What quality metrics should you use to evaluate your own trained model? For application developers and AI practitioners enhancing their applications with GenAI, it's critical to choose and evaluate a model that meets both quality and performance requirements. This talk examines customer scenarios and discusses the model selection process. We explore the current landscape of open models and the mechanisms for measuring model quality, and share insights from Google's experience. Join us to learn about model metrics and how to measure them.

TRANSCRIPT

CINDY XING [00:00:10]: Thank you so much for staying with me this late. I know this is the last session. My name is Cindy Xing, and today's topic is open models and their curation on Kubernetes. To start, I'd like to give a quick intro of myself. I've been working in the industry for over 15 years at many large enterprise companies like Microsoft, Meta, Huawei, and Google. In the past, when I worked on Azure, I helped build the OpenAI service and Copilot, and now I'm working on the Google Kubernetes Engine team, helping customers onboard GenAI workloads to GKE. As for my life, I have three kids, so obviously I'm a busy working mom.

CINDY XING [00:01:05]: I also have a German Shepherd, you can see him here. So that's about me. For today's agenda, I'd like to walk through open GenAI, then talk about open models, and then what quality means for an open model. Last but not least, I want to spend some time on my experience building open models on GKE. Before I really start the content, please give me a quick show of hands. I want to know how many of you are DevOps engineers. Now help me understand how many of you are machine learning engineers or data scientists. Okay, one last question: how many of you have used Kubernetes before? Awesome.

CINDY XING [00:02:03]: Great. Okay. In September 2023, the Linux Foundation ran a user survey and received 249 valid responses to a handful of questions. The respondents are from all kinds of companies in the United States and Canada, ranging from startups to medium-sized and large enterprises. All of them are very familiar with GenAI, or you could even say extremely familiar with GenAI.

CINDY XING [00:02:45]: As you can see from the survey, first of all, over 60% of respondents say their companies or organizations plan to significantly invest in GenAI. The second theme is that they all want to adopt open GenAI technology, because it's publicly available, it's innovative, it encourages collaboration, and it's easy to integrate. But on the other hand, there are a bunch of factors people really care about: for example, openness, quality (performance, accuracy, consistency), neutrality, and security. Next, I want to share some detailed information. First of all, as you can see from the left chart, only 9% of people say they want to use proprietary technologies. Most people want either purely open source or a combination of open source plus their proprietary technologies. The second thing I want to really call out is security. I know many of you are data scientists and machine learning engineers.

CINDY XING [00:04:12]: You build the machine learning model, but we just heard about safety and security. Think about it: your model will be deployed and used by real people, so how you can make sure it's secure and safe is very important. Obviously you can see from the survey that people really care about security, even above cost and all the other areas. Next, I'd like to talk about open models. I'm pretty sure many of you are already familiar with this, so I'll be really quick. An open model, as you can see, is a pre-trained large model whose weight files are publicly available.

CINDY XING [00:04:58]: An open model is not completely the same as an open source model, because the training code, the data sets used, and the architecture or algorithm can be closed. Open models can be used for training, fine-tuning, and inference, and some models may come with terms or license requirements. If you go to Hugging Face, you can find over 70,000 open models. I bucketize them into different categories. As you can see, the majority of them, as you've heard all day, are large language models. There are some other models, like diffusion models, which can be used for text-to-image or image generation.
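
As a rough illustration of what "weight files are publicly available" means in practice, here is a minimal sketch (not from the talk) that pulls an open model's files from Hugging Face with the huggingface_hub library; the model ID and file patterns are placeholder examples.

```python
# Minimal sketch (not from the talk): download an open model's weight files
# from Hugging Face so they can be used for fine-tuning or inference.
from huggingface_hub import snapshot_download

# Example model ID only; gated models (e.g. Llama) also need token="hf_..."
local_dir = snapshot_download(
    repo_id="google/gemma-2b",                               # hypothetical open model
    allow_patterns=["*.safetensors", "*.json", "*.model"],   # weights + config only
)
print("Model files downloaded to:", local_dir)
```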

CINDY XING [00:05:54]: And then recently there's a newer technique called mixture of experts. People are using it to make training, or even inference, faster. Last but not least are the multimodal models, for example PaliGemma from Google, which people can use to generate language and work with audio, video, and all those kinds of data. Before I talk about open model quality, I'd like all of us to walk through what the GenAI ecosystem is. When you talk about an ecosystem, there are people involved: who are the people in this ecosystem, and what roles and responsibilities are they playing? Secondly, what is the tech stack involved in the GenAI ecosystem? And then, how are people doing it, right? As you can see, there are all kinds of roles. Data scientists and researchers pick a domain or business problem, figure out what data sets are available, and decide what kind of architecture or algorithm they can use to build the model. Machine learning engineers try to improve the model or deploy it to infrastructure. Then from an infrastructure admin's perspective, they're managing the infrastructure: how much GPU and TPU capacity do I have, and what limits and controls do I want to set so my engineers can use it?

CINDY XING [00:07:44]: Then the other group: I believe in and outside this conference room there are many application developers who are eager to adopt GenAI and machine learning models. So there are a lot of people involved. Then if you look at the tech stack, I think there are three big areas. First of all, the applications; we all build software, we are software engineers. The second one is the assets, like your data and the model you're building. And the third one is the infrastructure.

CINDY XING [00:08:24]: After talking about the who and the tech stack, I'd like to list a bunch of the open source tools and frameworks available. The first one is model as a service: where can you find or download those models? Hugging Face is a publicly available website and service; you can download models there for free. Kaggle is another effort Google started, and you can also find models there. Obviously, I believe there was another talk on Metaflow in this room. Metaflow, Kubeflow, and Airflow can help you build a pipeline where you get your data, train on it, build the machine learning model, and do inference. Ray is another infrastructure where you can orchestrate your training and inference together.
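
To give a flavor of what orchestrating work with Ray looks like, here is a minimal, hedged sketch using Ray's task API; the "preprocessing" workload is a toy stand-in, not anything from the talk.

```python
# Minimal sketch: fan out work across a Ray cluster.
# The "preprocessing" here is a toy stand-in for a real training/inference step.
import ray

ray.init()  # connects to a local or existing Ray cluster

@ray.remote
def preprocess(batch):
    # Placeholder for tokenization, feature extraction, etc.
    return [text.lower() for text in batch]

batches = [["Hello Ray"], ["Open Models on GKE"]]
futures = [preprocess.remote(b) for b in batches]
print(ray.get(futures))  # gather results from the cluster
```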

CINDY XING [00:09:25]: I'm pretty sure many of you are familiar with Jupyter notebooks, which you use to start the whole flow, right? And alongside those there are TensorFlow, LangChain, and PyTorch; I'm pretty sure you're all super familiar with those tools. The next thing I want to call out is the inference servers available. There are a bunch of publicly available inference servers like vLLM, TGI from Hugging Face, or Triton from NVIDIA. Then, from an API perspective, because earlier I talked about application developers: if they want to build applications, there are REST APIs available for them to use when they talk to the inference endpoint. For example, the OpenAI API, which is pretty widely adopted. And recently there's another standard called the Open Inference Protocol; a group of people is trying to build a standard for others to follow. Then the next one is the infrastructure.
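
As an illustration of that OpenAI-compatible REST API pattern, here is a hedged sketch of an application developer calling a self-hosted inference endpoint (for example one served by vLLM) with the openai Python client; the base URL and model name are placeholders, not anything from the talk.

```python
# Sketch: call a self-hosted, OpenAI-compatible inference endpoint
# (e.g. a vLLM server) the same way you would call the OpenAI API.
from openai import OpenAI

client = OpenAI(
    base_url="http://my-inference-endpoint:8000/v1",  # placeholder endpoint URL
    api_key="not-needed-for-local",                   # many self-hosted servers ignore this
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",      # placeholder model name
    messages=[{"role": "user", "content": "Summarize why open models matter."}],
)
print(response.choices[0].message.content)
```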

CINDY XING [00:10:39]: As you can see, Kubernetes has become a widely adopted application orchestrator. The second one I want to call out is Slurm. It used to be widely adopted by the HPC (high-performance computing) folks; now Slurm is also being used for large machine learning training. Then from an industry perspective, obviously you can see the top three cloud providers building managed machine learning services for the world. From Google we have Vertex AI, Amazon is building SageMaker, and Azure has Azure ML and the Azure OpenAI service. From an infrastructure perspective, we have GKE, EKS, and AKS. And then a bunch of companies are creating their own open models, like NVIDIA recently.

CINDY XING [00:11:45]: Last week they released a super large model with 340 billion parameters or even more. All those companies have been building their own foundation models as well. So, to rephrase the whole flow in the AI/ML area: researchers and data scientists use data to train models, infrastructure admins control and manage their resources, machine learning engineers deploy models to machine learning endpoints and manage those endpoints, and then application developers adopt them. I spent quite some time laying out the story. Now let's talk about AI quality, especially GenAI quality. What do we mean? In my mind, I'm listing the areas below.

CINDY XING [00:12:49]: If we categorize them into bigger buckets, I would say there are two big things. One is the model quality itself. I'm pretty sure as data scientists and machine learning engineers you care a lot about the quality of your model. Through today's conference I've heard a lot about how you can ensure your data set quality, and how you can make sure your model is accurate, consistent, without bias, and transparent for customers. That's the first bucket, what I mean by model quality. The second kind of quality, I would say, is from an operations perspective: how can you operate your inference endpoint in a secure, performant, and reliable way? And on the other hand, from an application developer's perspective, you want your application to be reliable, performant, and secure as well. For today's talk, I'm focusing more on the second part, because you already know how to ensure the quality of the model itself.

CINDY XING [00:14:13]: This slide captures Google's thoughts on what the security risks can be from a GenAI perspective. I'm not going to dive into detail, but obviously you can see, for example, prompt injection: people can jailbreak your prompt and create a lot of security risk. And your model can be stolen; how can you prevent that from happening? Those are the potential risks when we talk about GenAI. So next, what about my experience at Google enabling open models on Google Kubernetes infrastructure? Let's think about a user story. You're an application developer or a machine learning engineer. First, you go to Hugging Face and find an open model. Secondly, you want to deploy it to Kubernetes, but there are things you wouldn't know: would it work? What kind of hardware, memory, and GPU should I use? And after that, while I'm running it, how much is it going to cost?

CINDY XING [00:15:34]: And then, how can I keep tracking or monitoring the health of the endpoint? Last but not least, when application developers run inference or prediction against the endpoint, how can I make sure it's secure, that there's no jailbreak in the input? Those are the things we enabled at Google. Here are the screenshots: you can go to Hugging Face, there will be a Deploy menu, and you'll see Google Cloud. When you click that, it directs you to a Google website with a one-click button. You can deploy this open model to Vertex AI, which is a managed backend, or to your own GKE Kubernetes cluster. Imagine: on Hugging Face there are over 70,000 open models. It's definitely a big challenge to figure out which ones will work, which won't, and what kind of compute configuration to use.

CINDY XING [00:16:47]: So for the past few months, what my team has done is build an automation pipeline where we are able to pre-compute things. We basically curated roughly 400 top open models for all of our customers. As you can see, if you pick Llama 3 or any model we've verified, there will be a green tag showing you that this is a model you can trust. The second thing is that we pre-computed the compute settings you can trust, so you can be sure your model is going to work. The other one is that certain models require you to provide your access token.

CINDY XING [00:17:47]: From a Kubernetes perspective, we automatically create the secret and allow you to provision it. From a transparency perspective, after your deployment you can see the cost, and you can monitor your model and see its health, whether it's working. You can also use the sample code to try it out or validate against your endpoint. So again, as I mentioned, we built the automation pipeline, and we are able to do all of those things. The other thing I want to mention is that Google has open-sourced a benchmark where you can automatically evaluate your model's inference performance. So I think I got caught; basically, I'm running out of time now.
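
To make the token-handling step concrete, here is a hedged sketch (not Google's actual pipeline code) that creates a Kubernetes Secret holding a Hugging Face access token using the official kubernetes Python client; the secret name, namespace, and environment variable are placeholders.

```python
# Sketch: store a Hugging Face access token as a Kubernetes Secret so the
# inference Deployment can reference it. Names are placeholders, not Google's pipeline.
import os
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="hf-access-token"),
    string_data={"token": os.environ["HF_TOKEN"]},  # token comes from the environment
)
client.CoreV1Api().create_namespaced_secret(namespace="default", body=secret)
print("Secret 'hf-access-token' created in namespace 'default'")
```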

CINDY XING [00:18:45]: I just need two more minutes to finish the whole thing. I want to touch on the security part I mentioned earlier. As you can see, across the whole stack, data, model, application, and infrastructure, how are you going to think about security? In fact, if you find the deck later on, I included a bunch of links in it. From a Google perspective, to address some of the security concerns, Google hosts all the inference server containers on Google Cloud. Before we host a container, we do vulnerability checks against all the code and the Dockerfile, then we build it and make sure it follows the license and is secure. Then we store it, so that when people deploy their inference endpoints, the container is already verified; we guarantee it's secure. The other one is access checks.

CINDY XING [00:19:57]: Not everybody can access the models in your projects. Obviously, in Google there are a bunch of security guards around your Jupyter notebook; similarly, we do vulnerability scans and make sure there's no insecure code running on it. Last but not least, I really want to call your attention to a capability Google is building called Model Armor. I'm pretty sure you all care about prompt injection, especially around your input and output; people can hijack your prompt or output. With Model Armor in Google, you can pick and choose the policies you want.

CINDY XING [00:20:50]: For example, toxicity, or whatever information you really care about. Once you configure it, Model Armor checks the prompt and the output and makes sure your application doesn't show sensitive data or something inappropriate. With that said, any questions? The other thing is I'd encourage you to take our survey. In fact, we spent some effort preparing a bunch of questions, and I would appreciate it if you could take the survey and let us know how we can further help and support you. Yeah, thank you.

