An Overview of Common ML Serving Architectures // Rebecca Taylor // DE4AI
Rebecca has been working for Lidl e-commerce for 2 years, first as a Senior MLOps engineer and then as the tech lead of Personalization. Before this she spent 5 years working as an electronic engineer with a focus on signal processing for predictive maintenance, and then an additional 5 years as an ML engineering and data science consultant. She has a PhD in Bayesian Statistics and an undergraduate degree in engineering.
There is often a disconnect between what is taught about model serving and what is actually standard practice in industry. Your deployment design is often severely impacted by the unique data and platform setup of your company as well as financial constraints. Here I discuss some of these constraints as well as how to build designs that can fit within them.
Adam Becker [00:00:09]: Rebecca, are you already here? Testing?
Rebecca Taylor [00:00:12]: Hello.
Adam Becker [00:00:13]: Hello. How you doing?
Rebecca Taylor [00:00:16]: Challenges here. Like, having some challenges.
Adam Becker [00:00:19]: It's okay. We'll overcome them. Is this it? Did we do it?
Rebecca Taylor [00:00:23]: Yes. Okay, you can share.
Adam Becker [00:00:26]: Okay. Hi, everybody. I'm Rebecca Taylor. And today. Okay, no, sorry, Rebecca, you go.
Rebecca Taylor [00:00:31]: Okay. So you can share it full screen. Yes. Yeah. So basically, yeah, as you said, I'm Rebecca, and I'm a tech lead at Lidl, so, like, Lidl e-commerce. And, yeah, like, we have many models in production and many different architectures for them and different use cases. So I'm, like, specifically on the personalization side.
Rebecca Taylor [00:00:57]: So, basically, in general, looking at how can we make the user experience better using data? So today I'm going to just speak about some considerations when it comes to choosing an ML architecture, and some basic examples of what you'll currently see. So you can go to the next slide. Okay. So, as we all know, it's getting so much easier to, like, build actual models. I mean, ten years ago, there probably weren't even 1% of the packages that are out there at the moment. And there are also so many, like, resources and boot camps everywhere. And, you know, even on the data engineering side, there's so much out there that's available. But I think there's maybe a bit of a gap in terms of what actually is there in industry, and what's the current state of industry and the.
Rebecca Taylor [00:01:46]: The kind of constraints that you'll run into. I mean, you can build these insane models, but to actually have them living in production and signed off and making money for the company, what that actually looks like, at least in Germany, is not necessarily what you might have in mind, or as flexible as what we would like it to be. Okay, so you can move to the next slide. So, basically, some of the things that will really influence what you end up building, I'll talk through some of them, but one of the biggest ones I've seen practically is the team structure. So different companies have different structures. And I was in consulting for about five years as, like, an ML engineer, data science, data engineer consultant. And basically what I saw is that there's such a diverse way that companies set up their team structure. One of the common ones is that you have kind of, like, the data scientists all sitting together building their nice little MVPs and kind of having these proofs of concept, and also doing a lot of maybe analytics and things, but then they kind of hand over their models to, like, an ops team, or, like, maybe MLOps, maybe DevOps.
Rebecca Taylor [00:02:58]: It depends on the maturity of the company, if they even have an MLOps team. So sometimes their models are actually not even used as is, and are completely transformed by another team. Right. So you get a notebook, you know, they hand over a notebook, and this model gets completely kind of redone and productionized and deployed. In other cases, the data scientist is responsible for kind of giving a model artifact and maybe some config files or whatever, and then there's, like, slightly less done on the operational side. In some cases, the model that you make will be used basically just, like, in a non-production environment for a while to test how well it works, and then it gets rewritten in a totally different language and deployed. So that's some of the things you really have to consider. I think the stack is becoming slightly more coherent.
Rebecca Taylor [00:03:47]: Well, slightly. And I think one of the things that I've seen a lot more lately is kind of a product structure. So you have kind of, like, a mixed team where you have, like, in one team, like, data scientist, data engineer, ML engineer, even some devs, business analysts, and they'll deliver, like, end-to-end systems. So this is what we have for part of the business in Lidl at the moment, which is quite cool, because what ends up happening is we're responsible for basically, like, parts of the front end all the way through back end, data engineering, ML engineering, feature stores. Well, not, like, the infrastructure as much, but, like, what ends up in the feature store, you know, pieces of the data lake that we own, and the whole tutti. So even, you know, the monitoring, alerting. So, like, as a team, we're responsible to deliver stuff.
Rebecca Taylor [00:04:38]: And, yeah, that makes the collaboration quite interesting. The challenge there is, when you have a lot of models in production, then who maintains them, while you're the same people that are building the new models? So what often happens is you also have, like, an ops team that's, like, gonna be on standby, and they'll also be briefed on how to handle these models and stuff. But basically, if you don't have, like, those kinds of mature setups, then you actually really have to consider carefully what you put into production and what your design is, because if it's not gonna be able to be maintained properly, and if there isn't correct alerting, and also, like, change: as your model, if you want to retrain and do significant changes, you might not get the budget to do that. Right, because there just isn't that option. So, yeah, and oftentimes, like, the ideal model never makes it into production just because of these types of team constraints. And then another constraint that we often run into is that there are, like, predefined tools and tech that you're allowed to use, right? So you might be super experienced with a certain tool and you join a new company and you realize, nope, you're not allowed to use that. There are specific things that you're only allowed to use. That can also be, for example, from a location point of view, cloud services: you might want to be building the coolest new gen AI model, or using it in some way, but it's only available on GCP or something, or AWS and not Azure, just for example.
Rebecca Taylor [00:06:08]: Also, from an integrations point of view, there are certain things that work well together and others that maybe don't work as well together. And if you have bad luck in terms of the architecture decisions of your predecessors, you might be forced into combinations of tools and tech that actually are difficult to work with. And then a lot of time is spent kind of gluing things together. Right. So, yeah, I think that's something that also could be surprising if you're new to the industry, or new to a different role, and realizing that. I mean, this is actually maybe a tip for if you're going to choose a new role: really ask questions in those interviews, right? So, yes, they're interviewing you, but you can also interview them in terms of, like, the tech stack, and ask detailed questions about that. Like, I mean, do they have an in-house ML platform that they're maintaining and building? Are they using something that's already, like, tried and tested, existing? Like, are they using, like, a more flexible type, like, for example, ZenML, or are they, like, locked into, you know, something else, right, like Azure ML, for example? Yeah. Those decisions can significantly impact, like, your designs and what you end up being able to build. Also the industry, and business requirements.
Rebecca Taylor [00:07:24]: So that's super important. Like, I mean, in some cases, yes, you can build a nice data science model that pulls features from, like, you know, some cloud storage, and, you know, does a prediction and sends it back or whatever, and that will work for you. Right. But in other cases, pulling from storage, for example, is just not going to cut it. You're going to need much lower latency solutions, and you'll need to be putting caches in place. And often, like, on our side, basically, we have quite advanced caching systems on our website as well, which means that you'll be caching things on your side, potentially in your model APIs, but there are also front end caches that you have to consider, and you also have to look at what those kinds of systems do. So this is something that, like, can really constrain your design, especially if you want, like, near real-time features. Some of those features aren't available in the way that you think they are from certain systems.
Rebecca Taylor [00:08:22]: So, you know, if you're in a situation where the actual full design of how events are happening in your front end (this is just in, like, an e-commerce or maybe banking example), if that isn't set up in quite an advanced fashion, like, maybe you don't have a proper Kafka solution or something going and you can't access the events in time, it can really constrain your design, and you have to get quite creative about how to still deliver some good business value, even, you know, with these constraints in mind. So, okay, you can move to the next one. So here are just some really basic patterns. So, yeah, this is the example of: you have a model, you want to perform inference on this model, and what are some of the ways this will look? Right? So the most kind of easy, obvious one is a batch scoring job. So basically your input is, like, data, multiple tables or whatever, and your output is also a table, right. So that's the easiest.
Rebecca Taylor [00:09:25]: That's nice. You can have an interface where you are just putting your things into a table and downstream gets to consume that and serve it and do whatever they want with it. Or maybe it's used for some forecasting or some pricing models or predictions or something, you know, reports. I mean that's the one that like, it's easy and there's normally no issues then in the case where it's also batch scoring again. And often you'll see batch scoring in like marketing solutions, right? They're like, here's all our clients. Or you know, let's run some batch job and do some, you know, prediction or clustering or something like that. Same. So in like, you know, banking for clients, they, they might have some like for example, if you're looking for patterns of fraud or things like that, sometimes they'll do general screening batch jobs that will just run and then there can be analytics done on top of that.
Rebecca Taylor [00:10:17]: So if you're, like, in that space, that's quite nice. But then the one step up from that is normally when you're also doing the serving side in terms of the results. So it can still be, like, a batch scoring job, so the results are written to a database, but you also expose that database: basically, the information in the database is then accessible by an API. So you'll basically just have some GET request, and you can then pull the model's results, either for one instance or one client or one whatever, or potentially even for a batch of them, but it's available by API, and then it's easy to consume from other services. Then the slightly more tricky side is when you need to have prediction on the fly. So the prediction part is real time. Not the features, but the actual prediction happens in real time. And this is very common when there's any type of data science happening in websites.
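A sketch of that "batch scoring plus serving API" step up, assuming FastAPI and a Mongo collection that the batch job has already filled; the connection string, database, and collection names are placeholders:

```python
from fastapi import FastAPI, HTTPException
from pymongo import MongoClient

app = FastAPI()
# Placeholder connection string, database, and collection names.
predictions = MongoClient("mongodb://localhost:27017")["ml"]["predictions"]

@app.get("/predictions/{customer_id}")
def get_prediction(customer_id: str):
    # No model runs at request time: the batch job already wrote one
    # document per customer, so this is a pure lookup.
    doc = predictions.find_one({"_id": customer_id}, {"_id": 0})
    if doc is None:
        raise HTTPException(status_code=404, detail="No prediction for this customer")
    return doc
```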
Rebecca Taylor [00:11:16]: The person will be clicking and doing things, and based on what the current state is of this person, of this customer, you then want to make some prediction, right? And the nice case is where everything that you need for this prediction, so all the features (not necessarily, like, preprocessed features, but at least the raw information for the features), is available, you know, from the front end or in that layer that's calling the API. Then you can just have your model in the API and just, you know, run predict and get an answer. Right. That's the easy one; this is the nice one, right? But the problem comes in when this isn't the case, and you have additional sort of historical or slower-moving features, or features that you have to get from somewhere else. And that means that you get some of the features in your payload, but the rest you have to get from somewhere else. And when you have to get it from somewhere else, there are typically data engineering jobs that will get it to the place you need to read it from. You'll probably have something like Redis or some caching layer as well. Or, I mean, maybe you're using something like Mongo, which also has a bit of caching built in. This is where you have to also worry about cost a lot, and basically tune this sort of online feature store, right, to handle what loads you need and what latency you require.
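A sketch of that on-the-fly case, where part of the feature vector arrives in the request payload and the slower-moving, historical part is looked up from a Redis-style cache that a data engineering job keeps filled; all names, keys, and the feature layout are assumptions:

```python
import json

import joblib
import redis
from fastapi import FastAPI

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379)  # online feature store / cache
model = joblib.load("models/realtime_model.pkl")  # hypothetical artifact

@app.post("/predict")
def predict(payload: dict):
    # Real-time features come straight from the caller (current session state)...
    session = [payload["clicks_this_session"], payload["cart_value"]]
    # ...while historical features were precomputed by a data engineering
    # job and written to the cache, keyed by customer.
    raw = cache.get(f"features:{payload['customer_id']}")
    hist = json.loads(raw) if raw else {"order_count": 0, "avg_basket_value": 0.0}
    row = session + [hist["order_count"], hist["avg_basket_value"]]
    return {"prediction": float(model.predict([row])[0])}
```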
Rebecca Taylor [00:12:38]: So that's sort of the harder one. And it gets harder as, you know, you move into, as I put in the last point, really hardcore streaming stuff. So if you have, like, a full-on real-time transport that's used in a prediction sense as well. That's not as common, you know, really, like, used in industry. I think you get most of the value already from the first points. Okay, we can go to the next one. Wow, this is really small. I'm looking at it on my phone.
Rebecca Taylor [00:13:07]: So basically this is just, like, a batch scoring example with a little bit more detail. So this is the simplest case, basically, where you'll often have multiple data sources. I mean, I just showed two. And then you have some transformations and mapping jobs and things like that, and then combination of the sources. And then in our case we're using Unity Catalog for our data lakehouse thing. But yeah, so you have some tables there, or maybe one table, and then you have a batch scoring job that basically has access to your ML model, and it just runs predict, basically, over all of those entries. And obviously these jobs, all these blue jobs, are data engineering jobs, and they will then be optimized as well. And the clusters that you're using to run them on, if you're using clusters, will be tuned and customized for the actual code that they're running, to save costs.
Rebecca Taylor [00:13:57]: And then, because often this is a lot of data that you're working with, right, these jobs, if you write them badly, can take 2 hours; if you write them well, they can take ten minutes or something. You know, if you optimize the cost, there will be huge differences. Right. Also, use things like spot instances. If you're in Databricks, please, you know, do that where you can, if you're doing a batch scoring job, for example. And then, yeah, so once you're there, then typically, for future analytics and for traceability in future, you'll write these predictions to a historic table.
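On a Databricks-style setup, that scoring job might look roughly like the sketch below, assuming an MLflow-registered model wrapped as a Spark UDF; the model URI and Unity Catalog table names are hypothetical:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()

# Wrap the registered model as a Spark UDF so predict runs distributed
# over all entries (hypothetical model URI).
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/personalization/Production")

features = spark.read.table("catalog.personalization.customer_features")
feature_cols = ["recency_days", "order_count", "avg_basket_value"]

scored = features.withColumn("prediction", predict_udf(struct(*feature_cols)))
scored.write.mode("overwrite").saveAsTable("catalog.personalization.customer_scores")
```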
Rebecca Taylor [00:14:31]: I mean, you can also have different data modeling ways of looking at things, right? So you could also have one table, but when you pull the data from it, you can just pull the latest. But often storage is cheap and compute is expensive. So we kind of just have a historic table, and then we have a current-state latest table, and then in our case we have a job that writes that to Mongo, and then you have a prediction collection in Mongo that you can look up. And obviously there are all the partitioning things that can come into that. It could be multiple tables; you can handle it how it needs to be handled. But this is just an overview of what often is used. Okay. And then next one. So this is now the serving part of a solution like this.
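Before the serving part, here is a sketch of that "historic table, latest table, push to Mongo" tail end of the pipeline; the table names, keys, and connection string are all assumptions:

```python
from pymongo import MongoClient, UpdateOne
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
scores = spark.read.table("catalog.personalization.customer_scores")

# Storage is cheap and compute is expensive: append every run to a historic
# table for traceability, and keep a small current-state table for serving.
scores.write.mode("append").saveAsTable("catalog.personalization.scores_history")
scores.write.mode("overwrite").saveAsTable("catalog.personalization.scores_latest")

# A follow-up job upserts the latest state into the Mongo prediction
# collection that the serving API reads from.
collection = MongoClient("mongodb://localhost:27017")["ml"]["predictions"]
ops = [
    UpdateOne({"_id": row["customer_id"]},
              {"$set": {"prediction": row["prediction"]}},
              upsert=True)
    for row in scores.select("customer_id", "prediction").toPandas().to_dict("records")
]
if ops:
    collection.bulk_write(ops)
```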
Rebecca Taylor [00:15:11]: So this is basically an example where this API will have a model inside the API, but it will also have data that is being pulled from predictions that have already been made. This is also a common use case in something like e-commerce or banking, where you have customers, and you have old customers and new customers. In some cases you have people that you have lots of data about: historic data, purchase data or whatever. And in other cases you don't; you only have the real-time features from their current clicks or whatever, or maybe just some information about them that's available from the front end. So you'll actually have the model in the previous slide making these batch scoring predictions, but you'll have a different model that hopefully does something very similar, just with less data. And it will then make a prediction for everyone that you don't have the historic features for. So basically, if it's an existing user, you pull the features that you have already precomputed. Well, not the features.
Rebecca Taylor [00:16:22]: You pull the predictions that you've already made directly. In that case, you can end the loop and return with a lower latency. Well, not necessarily lower, because, yes, it will be. Am I coming to an end?
Adam Becker [00:16:35]: Yes.
Rebecca Taylor [00:16:36]: Okay. Okay, I'm basically done. Okay. But in the other case, you then still have to pull the features and make the prediction, and then you can return. So just in closing, maybe go to the last bit. Yeah, I'll just summarize by saying that, basically, you know, you have to really think about many things when you're doing your design.
Rebecca Taylor [00:16:58]: Keep things as simple as you can, build on that, make sure you're getting value and speak to people that have done stuff before.
Adam Becker [00:17:06]: Wonderful. Rebecca, thank you very much for this. I love these types of presentations, to just give me, like, a sense of context and overview about, like, all the different things. Even just to make me feel less crazy for picking something that starts out a little bit more basic before going to just, like, a full-on or kind of, like, more heavyweight solution. So thank you very much for that.