Vertex AI Has Some Limits and I Reached All of Them
Adam Hjerpe is a Senior Data Scientist at ICA Gruppen.
Deploying a time series model in Vertex AI may initially appear straightforward, but unique attributes in the use case can present challenges. Catch Adam Hjerpe's session to gain insights into the challenges faced during deployment, how they were overcome, and the valuable lessons learned throughout the process. Ensure you are well informed about the limitations before embarking on your own deployment journey.
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/
Adam Hjerpe [00:00:02]: Yeah. My name is Adam. I work as a senior data scientist at ICA Gruppen. Previously I worked on building different reporting solutions for a finance and insurance company, and I have also built classification models at ICA Gruppen. Together with John, Oscar and a few others, we mostly build recommendation systems. Today I will talk about time series modeling, a use case that we migrated from Spark to GCP, and a few lessons I would have liked to learn before I migrated. First I will explain what a time series is, say a bit about the use case and how we utilize the Vertex AI platform to productionize models, go through the learnings in this case and the solution, and then give a short summary of what we learned. So to start with, a time series is a repeated set of measurements over time.
Adam Hjerpe [00:01:06]: One example could be that you have a tracking device and you're monitoring the number of steps you take. The input data used to forecast the number of steps you take each day could be what day of the week it is, whether you are working or on holiday, whether it is rainy, and maybe how much you slept the day before. With that type of data, you can build a time series model to forecast the number of steps you would take. Another example, and quite a flexible model, is Prophet, which makes it very easy to integrate these types of data patterns. Here we see the historical number of events on the Facebook site. We can note, for example, the drops around New Year's Eve and around the holidays, and you see a trend increase during 2016; I think that was when they launched it generally.
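To make the Prophet example concrete, here is a minimal sketch of fitting a daily series with holiday effects and forecasting ahead; the file name, column names and holiday country are illustrative assumptions, not details from the talk.

```python
# Minimal Prophet sketch: fit a daily series with weekly/yearly seasonality
# and holiday effects, then forecast 90 days ahead.
import pandas as pd
from prophet import Prophet

# Assumed input: a dataframe with a date column "ds" and a value column "y",
# e.g. daily event counts or daily step counts (placeholder file name).
df = pd.read_csv("daily_events.csv", parse_dates=["ds"])

m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.add_country_holidays(country_name="SE")  # picks up e.g. New Year's Eve dips
m.fit(df)

future = m.make_future_dataframe(periods=90)   # 90 days ahead
forecast = m.predict(future)                   # yhat plus trend/seasonal terms
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```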
Adam Hjerpe [00:02:18]: The color here just highlights on which type of day you see a level increase, and the number of events is forecast differently depending on which day it is. In general, you would like to find a set of drivers for your time series model. One could be different trend factors: is the population increasing, are you taking market share, are other companies maybe going bankrupt, or similar things. You of course have seasonality, and in Sweden you can have different holidays. You also have level shifts, or external factors, which could be regulatory changes or government subsidies in different areas. And the most important part is that what you try to minimize, in general, is the random variation. So when we build this type of model, we build a model whose forecast is denoted as y hat in the slide, and then we take the difference between the actuals and the forecast. If that difference is indistinguishable from noise, then we know that we have found most of the patterns in the data.
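As a concrete illustration of the "residuals should be indistinguishable from noise" idea, here is one common way to check it, a Ljung-Box test on the difference between actuals and forecast; the chosen lag and threshold are illustrative assumptions, not the speaker's setup.

```python
# Hedged sketch: check whether the residuals y - y_hat look like white noise
# using a Ljung-Box test (one common choice among several).
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

def residuals_look_like_noise(y: np.ndarray, y_hat: np.ndarray, lags: int = 12) -> bool:
    residuals = y - y_hat
    result = acorr_ljungbox(residuals, lags=[lags], return_df=True)
    # A large p-value means no significant autocorrelation is left, i.e. the
    # model has captured most of the structure (trend, seasonality, level shifts).
    return float(result["lb_pvalue"].iloc[0]) > 0.05
```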
Adam Hjerpe [00:03:44]: If you see a clear pattern, say in this number-of-steps example that your steps increase every Tuesday, maybe one explanation can be that you go to padel each Tuesday, so having that factor accounted for is something you would like to add. Briefly, about the data we have: we want to do sales forecasts across different levels. We have historical sales, calendar effects, maybe promotions; we can have economic indicators and some type of measurement of online activity. If we have a very accurate forecast, we can improve operational and strategic processes such as financial planning, promotion planning, risk management and a lot of other things. In this use case we are building thousands of models, and that's why we used Spark before, to have that parallelization component. The reason we implement so many models is that we want to model at each specific level.
Adam Hjerpe [00:04:48]: For example, you could build a total sales model, and then if you want the online and the offline sales, you could just apply a scaling factor. But to build accurate forecasts, it's normally better to model at each level directly. Our stakeholders have different levels at which they want accurate forecasts, and that's why we have quite many models. A short overview of Vertex Pipelines: they let you orchestrate machine learning workflows in a serverless manner. You can build Vertex pipelines using either the Kubeflow Pipelines SDK or something called TensorFlow Extended. We are using the Kubeflow one.
Adam Hjerpe [00:05:35]: In general you have a pipeline, illustrated here. In the pipeline you have different components, and the components do the actual work. It could be that you fetch the data, that you train a model, or that you compute a metric. Components should be self-contained and have only a single responsibility, and then you orchestrate all of them as a pipeline. We also have artifacts, and those can be many things: for example, a model that you have trained, a dataset, or a set of metrics. Generally you can pass data between components using artifacts. You can also pass data using parameters, but those are for simpler types of data; artifacts can hold more complex data.
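For readers who haven't seen Vertex pipelines before, here is a minimal sketch of what components, artifacts and parameters look like with the Kubeflow Pipelines (KFP v2) SDK; the component names and contents are illustrative, not the production pipeline from the talk.

```python
# Minimal KFP v2 sketch of the component/artifact idea: each component has a
# single responsibility and passes data on as artifacts or parameters.
from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.10")
def make_training_data(out_data: Output[Dataset]):
    # Single responsibility: produce the training dataset artifact.
    with open(out_data.path, "w") as f:
        f.write("date,sales\n2024-01-01,100\n")

@dsl.component(base_image="python:3.10")
def train_model(data: Input[Dataset], level: str):
    # Single responsibility: train one model for one forecasting level.
    print(f"training level={level} on {data.path}")

@dsl.pipeline(name="forecast-training-pipeline")
def pipeline(level: str = "total-sales"):      # parameters carry simple values
    data_task = make_training_data()
    train_model(data=data_task.outputs["out_data"], level=level)  # artifacts carry data
```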
Adam Hjerpe [00:06:35]: If you have, like I'm citing GCP, if you have terabytes of data and you're using Tensorflow, then that could be the case. Otherwise they suggest you to start using the Kubeflow implementation. I will show like short video of how the use case, the training pipeline, the first one we built. So we will look at the training and the predictions, and then we will also show these parallelization parts where you can put out compute into parallelized components. So first here is the training pipeline and the prediction pipeline, and here you have a set of those artifact custom metrics. You create the training data, the validation data, and then you extract that to storage, GCP storage, in order for the model to be able to obtain the data. This is the parallelization loop, so it will run across each of the different levels that I told you about. First you look up is there an existing model, you train a new one.
Adam Hjerpe [00:07:47]: You import actual metrics from the one you have in production, if you had one. And then you go to this update model guardrail which will compare the metrics from the previous model with the new one. And then after that is done, the model is passed because we are predicting at each time. So you get the future data, you do the extract that to cloud storage, you look up the champion model and then you do the actual predictions. And we write that to bigquery to be shared with the business or through reporting after that step. And here we have 50 steps. I think it was around ten parallelizations. And then we tried to scale it, but it didn't work so good initially.
Adam Hjerpe [00:08:38]: So first attempt we tried to send data using something called other artifacts I told you about. One good thing about the artifact is that you have the lineage. So each artifact you can look at which component marked in blue has created artifact, or there are other components that are utilizing it. And then in this example we compute the distribution of the data and we compute the divergence. If the data has changed a metric for that. This is very simple to use. You can build fast, you can send it to the next component, etcetera, but apparently you can only do 200. And if you have worked in cloud before, you should probably read up on limits beforehand.
Adam Hjerpe [00:09:27]: I didn't do that. I was surprised. But that doesn't work here because we have around 1000 models. So the number of artifacts is way over how many you are allowed to use in a pipeline. So then instead we got rid of the different artifacts and instead try to use custom training jobs directly instead and minimizing the number of artifacts. Just to show you the upside to using custom training jobs or managed training jobs from Vertex AI is that you get this model versioning. So you see you have the different types of models here to your left, and then there is an alias for the default one. And when we go into this update component I showed you, it will update the default one close to automatic if it succeeds in having better error metrics.
Adam Hjerpe [00:10:34]: And you also can list the metrics associated with each model and you can compute whichever metrics you would like to include. However, the number of create, read, update and delete requests in the custom training jobs is 600. Since we are parallelizing and we have no state between the parallelized parts, we break that sometimes. So that didn't work either. And we also have more reads because at each step we also looked at the best model and then we train a new one and then we update. So you get around three operations at each step and then the third attempt and that didn't go either. Right. So now we are not using the, the managed model registry, instead we are persisting the models on the cloud storage and we set like a number so we can see which one is the best one, which is the default one.
Adam Hjerpe [00:11:40]: But we do that more manually. It's more work, but we tried that part. But what happens is that first we have the program, we send it out to all these parallelization steps and then we just try to write it to the bigquery table. And since we have around thousand there, we unfortunately broke a limit too. And that is that the number of inserts can only be 25 per every 10 seconds. And since we don't have any control of who is writing when that broke, sometimes. Also there was a solution and maybe many of you who have worked with cloud APIs and know this type of technique, but it's called fan in, fan out. And it's like the fan out is when you parallelize, you take the program, you put it into different containers and you do the compute.
Adam Hjerpe [00:12:35]: But instead of each work writing individually, you fan everything into one single place. In that way you can control the rate limit. You are writing data, right? So the fan out, you distribute the work and then the fan in, you combine the results into a final outcome. And all the, all results are then written into a bigquery at one step. And that is, yeah, that works of course, just an overview to how the final solution is or how it looks in, in the vertex AI, the fan out is what you get in the graphical interface and it's called DSL parallel four. It's a control flow operator. It looks very similar to a regular for loop. And then the fan in, you collect all of the parallelized works into one component, and then we write all the results to bigquery.
Adam Hjerpe [00:13:42]: So the summary I would say is prioritize the planning for GCP, like the service integration, what data do you have, do it in an effective way. And also how you train the models read up on like rdna limits and plan ahead. And also the fan in and fan out method works. So you can do, you can write to the model registry as long as you do a fan in step in beforehand. You can use all the failed attempts also in this way I can also mention that we find out recently that you can write also too, as long as you partition a big query table, you have the limit on each partition, so you can actually write it in that way too. And comparing to spark, I would say that the vertex AI pipelines offers an accessible way how to parallelize compute also. Thank you.
Q1 [00:14:48]: It was a very nice presentation. I actually have two questions, so I'm going to ask the first one, and if anybody else has a question, I'll pass it on and ask mine afterwards. The first question is: you explained how the Vertex AI platform provides an interesting abstraction over the pipeline. I was interested in understanding more what each component corresponds to: an action, a script, what else?
Adam Hjerpe [00:15:19]: Right. Normally I package each component in one script; I think that is probably neater. But technically you can write a component inside the pipeline file and just have a list of components living in the same place where you define the pipeline. It's a matter of taste, and you write it as code. Then when you run a command called compile, I don't remember exactly, it will compile all of the components that are used by a pipeline, and you get a JSON file normally, or a YAML file, and that contains the instructions. So it's a runnable program.
Adam Hjerpe [00:16:09]: So it has simply fetched all the logic that are living in the components. You can also package components, I think it's called as docker containers. And then you can push them up to artifact registry and then you can share that type of logic across the organization. So normally we have, or how we have used now is that we have components, and then you have vertex components, BQ components, etcetera. Just to keep like the mindset of having one purpose for each component. And then if other teams would like to use them in the future, we would have the possibility to push them.
Q1 [00:16:50]: Up when developing, moving, doing this big migration to Google Cloud platform. Have you considered van der Lockin and how you addressed it?
Adam Hjerpe [00:17:01]: I haven't been part of those discussions. I think GCP was on the table for a long time, longer than I've been at ICA. We have other cloud providers as well, and those are utilized a bit depending on what capability they are offering. For example, the LLM space. Maybe Microsoft is most liked, but I think as the DS team, what we have said so far is we should use GCP in general. I think the vertex pipelines, they're using the, utilizing the tube flow DSL. So those should be portable, I would guess. I don't have a better answer than that.
Q1 [00:17:50]: Okay, thank you anyway, was a very good answer. Thank you.
Q2 [00:17:54]: When you say you have a thousand models and you're running them in the pipeline, you then choose a champion model and move on to the next step. I was wondering, when you're running those models and you end up with two champion models, how does the pipeline handle it in this case?
Adam Hjerpe [00:18:12]: Case, could you elaborate just on where do you have two, just an example.
Q2 [00:18:19]: If, when the pipeline, how do they value for those, which model is champion model? Yeah, I want to know this part first. And then the common question is like when they are valued for this kind of champion model, and then when they come out like a two champion model, then what's the next step, how you handle those kind of situations?
Adam Hjerpe [00:18:46]: Good question. We have that for each of the models. We are only doing at one level at a time. So there can only be one champion coming out. We will not have two different champions qualifying for the level. So say you are building for online sales, then you can only have one champion models. But if you were to have competing model types within the online, then I think you would simply pick the best one of those competing ones and then check the one that won over that initial evaluation step.
Q3 [00:19:31]: So are you trying to tune some hyper parameters through this? Hello? So are you trying to tune some hyper parameter through this process? Like, and how, how frequently are you doing this? And is it like a repeated process that you do between them, or is it like, okay, once for one model, you're finalized, this is a champion model. And then you just stay with it for some time. Like, how do you know that? Okay, there is a better model available for the next time.
Adam Hjerpe [00:19:57]: And in the time or first the hyper parameter part. That is done now through a more manual notebook procedure. So we have like an associated notebook that will run the hyper parameters, and then we write them to a config. And then at each retraining step, you are choosing the best hyper parameters. But in the time series world, it's always better to include more data. So that's why we are retraining. But the hyper parameters are fixed. You get better estimates for each of the regressors.
Adam Hjerpe [00:20:36]: In general, we thought about maybe having the hyper parameter part also automated. But since we are building, we have evaluated or checked in with the business. Do you think the forecasts are reasonable looking far ahead? Does it meet your expectations and those things? If the hyper parameters were to be updated at each time also the forecast would vary a lot. And now it's word to word. We have arranged meeting yearly to look at the full forecast and we have a few error metrics also, so we get an alert if the performance will degrade.
Q4 [00:21:19]: All right, thanks so much for the talk. So interesting to see your iterative process on this. I just had a question on your final solution there on the fan in fan out strategy, because you showed how you store the models as kind of artifacts in the vertex pipeline. But in the end I was just curious how you actually, after training those models, how do you say them? And then put them into the write to bigquery component. So to say, at the very end, did you store me some GCS packet or did you still utilize the model registry or anything else?
Adam Hjerpe [00:21:53]: Didn't quite get that, yeah, good question. We are writing to CGS, so we get a lot of small files, but there is a tool so you can download small files using threads provided by GCP. So then you speed up that process and each of the files contains the predictions and it's at a monthly level. So data is not super big. So I think that's a feasible. But if the files were to be very big, I think you would examine other storage services.
Q4 [00:22:28]: Thank you.
Q3 [00:22:30]: How often do you retrain these models? And when you do, do you overwrite the original files or the GCS models that you have stored?
Adam Hjerpe [00:22:41]: So how often we are retraining and when we retrain, what do we do with the old model? Yeah, exactly. The models are retrained each month. The main stakeholder is from finance, so it's in align with accounting at a monthly pace. And the old models we write to CGS version them by date. And also we changed the winning one always gets replaced by default name. So we have the same type of methodology as the registry, but I think given the fan in it, it would be better just to populate them in the registry and have GCP handle those parts also.