Lessons Learned from Doing MLOps within E-commerce
Marcus is a seasoned consultant specializing in Data, Machine Learning (ML), and Artificial Intelligence (AI), dedicated to developing data-driven products that enhance organizational value. He combines hands-on technical expertise with proven team leadership abilities.
Marcus Svensson provides a firsthand account of his MLOps journey within a prominent retailer. Explore the challenges he encountered and the strategies that worked to navigate them.
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/
Marcus Svensson [00:00:03]: So fun to see so many faces here as well. I've been to a couple of MLOps meetups, but this one, I think, sets the record for attendees, at least. I will be talking about some lessons that I've learned over the years doing machine learning operations, especially within the context of e-commerce. And I've been doing that for a while. My name is Marcus, as on the screen, if you missed that, and I've been doing data, machine learning, ML, everything data. You can extrapolate or just label it AI, and I've done that as well, keeping up with the hype. In e-commerce specifically, there are a lot of things you can do with machine learning, and one of the large areas is search and discovery, with the purpose of helping the consumer find what they want as fast and seamlessly as possible. Under that umbrella you have a lot of things, but one of them is recommender systems.
Marcus Svensson [00:01:10]: We have ranking of products, also known as learning to rank in the ML space. And we have search, when you free-text search for something; if you want to get fancy these days, you can use semantic search, so truly understanding the user intent, and use the most hyped vector databases out there to do that for you. And what do you want to accomplish with this? Well, we want to track these user experience improvements somehow, so we take an experimental, iterative approach with A/B testing and measure various KPIs that we hope track what we are interested in. We want increased conversion rate, the fraction of users that end up purchasing anything, and we want to increase the average order value; if we have a lot of products, perhaps help users find complementary products they are also looking for. And if you combine these two, you get the revenue per session.
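To make that last bit of arithmetic concrete, here is a toy illustration of how the two KPIs combine into revenue per session (the numbers are made up for illustration, not figures from the talk):

```python
# Toy numbers, purely illustrative -- not figures from the talk.
conversion_rate = 0.02      # 2% of sessions end in a purchase
average_order_value = 80.0  # average basket size, in your currency

revenue_per_session = conversion_rate * average_order_value
print(revenue_per_session)  # 1.6 per session

# Lifting either factor (or both) compounds into revenue per session:
print(0.022 * 85.0)  # 1.87 -- roughly a 17% lift from two modest wins
```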
Marcus Svensson [00:02:12]: So, maximize how much you get from everyone visiting your site. And this is very general across e-commerce. If we deep-dive a little bit into these areas: for recommender systems, you have a couple of different ones, based on different types of machine learning. There is "you might also like", which you have probably seen at some point when shopping, and which is highly personalized to you based on your behavior. We have "similar items" and "also bought with", the classics that you usually see under the product itself. If you're looking at the red dress, you will most likely see other red dresses there, and under "also bought with" you will see complementary shoes for your red dress.
Marcus Svensson [00:03:01]: On the ranking front, we had a product catalog of tens of thousands of products. So if you went to the dress page, for example, you would see perhaps a thousand different dresses, and the user will only see the first 20. So which 20 should we show them? That becomes a ranking problem over a large catalog. And the third one, the search functionality of a site, is of course very important. There you have retrieval and ranking of user queries and results, and as I mentioned before, we can also try to make it very clever. One other use case that we spent a lot of time and effort on was personalized emails. They were not technically on the site, but you get the idea.
Marcus Svensson [00:03:50]: We had to send personalized product recommendations to more than 100,000 customers, so that was also a challenge. Some of these use cases are particularly challenging in terms of MLOps, and those are highlighted here, because they are highly personalized. Some of them are not user-specific, perhaps region-specific, but the yellow ones are individual to each user, and that can sometimes become problematic. And if you have banged your head against the wall many times trying to maintain high availability, you have probably noticed that it gets increasingly cost-, time- and energy-intensive the higher the availability you want to achieve. On the y-axis here we have the cost, or complexity, or time, whichever you like, and on the x-axis we have availability in percent, and there is an exponential increase.
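As a rough sketch of why that curve explodes (my numbers here, not the slide from the talk), each extra "nine" of availability shrinks your yearly downtime budget by a factor of ten:

```python
# Each extra "nine" cuts the allowed downtime by 10x, while the
# engineering cost to meet it grows roughly exponentially.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} uptime -> {downtime_min:8.1f} min/year of downtime")

# 99.000% uptime ->   5256.0 min/year of downtime (~3.7 days)
# 99.900% uptime ->    525.6 min/year of downtime (~8.8 hours)
# 99.990% uptime ->     52.6 min/year of downtime
# 99.999% uptime ->      5.3 min/year of downtime
```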
Marcus Svensson [00:04:54]: It's also the reason why Google Cloud, AWS and Azure themselves cannot guarantee 100% uptime on their services: extreme events make this problem very hard. A lot of the systems where we faced these problems are API-based, serving user requests one by one, and personalized, so it becomes quite problematic to make it all work from a machine learning ops perspective. And to make things a bit more complex, in the land of e-commerce the traffic is also highly fluctuating. Here we see the visits per second as a probability distribution, where most of the time it's pretty calm, pretty chill; you're on the left-hand side, no problems.
Marcus Svensson [00:05:50]: And then sometimes you have campaign going out, maybe campaign email to a couple of tens of thousands of people that click it at the same time when they get the notification in their phone, you have a headache, then you have a headache during Black Friday. That's outside of the plot. Now you have a lot of alvedo one and trio during that day if you're in this space. So it's just quite tricky to keep your APIs up sometimes. But that's why we have mlops here, right? That's why we are here, to learn how to deal with that. And we then develop a unified mlops platform. Sounds so nice to handle all our problems. And that can be tricky sometimes because if we think about these white boxes and blue boxes here, where we have the mlops as the blue box whose job is to secure stability and availability, and the white boxes are the AI, the cool stuff that brings direct business value here in conversion rate, average order value, revenue, possessions.
Marcus Svensson [00:06:58]: As we talked about before, what we faced and a lot of teams faced is that you have to prioritize a lot. Where are you going to spend your time? Do you want to improve the infrastructure or the mlops? Or should we spend time improving the actual algorithms that we can able test and measure? Bring, hopefully if we do a good job, bring results. Right. So during the years we have made a lot of trade offs here and made some bad decisions that you will learn about later. And we call it mlops, but it's more like ML DevOps. So we have traditional mlops terms, let's say, that are more tailored towards machine learning itself. So we have the pipelines, right? Continuous retraining model serving, whatever that is. If it's online, live serving, if it's batch offline, you have experiment tracking for the a bit test we talked about before, model versioning data model like drift monitoring, that's a big field model versioning.
Marcus Svensson [00:08:03]: But then you have a lot of other stuff as well that you have to do. You have to have scalable APIs that's not really ML specific. You have to have error monitoring and logging and handle that in a good way. We need to have proper infrastructure as code. Ideally we need proper DevOps, Ci CD pipelines, version control in general, a lot of compliance and security. I mean, a lot of things that maybe you wouldn't call mlops, but if you're a small medium sized company, aka if you're maybe a couple of years back, that falls under your responsibility as well. So you have to think about that. So it's a lot of ports that you have to do.
Marcus Svensson [00:08:47]: And once again you can choose if you want to spend time there, if you want to spend time doing the fancy AI stuff that you can show to the stakeholders that, oh, holy shit, we have pushed out there conversion rate here. So I need a big salary raise and to give a little context on this presentation, then I was in a quiet startup scale up e commerce company, so we grew quite a lot. And the data team, aka me, grew under the commercial organization. That's where it started with, or I wasn't the first. It sounds like it was just me, but I will not take all the credits. We were like a couple of people here doing. Initially it usually starts with bi analytics, like just the basic, and then it grew into data science. You want to have all these fancy machine learning use cases.
Marcus Svensson [00:09:38]: And with that comes mlops and Google Cloud. And back, I mean, we started the ML like Google Cloud journey even before I joined in, back in 2018, before Vertex AI even was a product, we used AI platform back in the days. If you're a real og, you remember AI platform, that was the shit. But then on the other side of the organization, quite far away, was the tech organization that were responsible for a lot of tech systems, but more primarily the e commerce platform than the website itself. So you had front end developers doing front end stuff, back end developers making sure that the backend works, and DevOps, an infra box here responsible for site reliability. I like the boxes because it looks like it's teams, right? But this was one guy, so he probably likes this presentation and we then have to do some type of communication with each other. But initially was quite sparse. These types of two, like ports of the organization was like way farther away than this, right? It was two different parts of the office.
Marcus Svensson [00:11:01]: And yeah, communication wasn't that flowy. Let's say we did our thing and they did their thing. I mean, we were on different clouds as well. You can just sit here. But what we did way too late. Then the communication port was that, of course the DevOps guys knows DevOps. And for far too long I did all the stuff that's in the middle here, trying to balance my shitty knowledge of infrastructure as code. I didn't know much about that.
Marcus Svensson [00:11:33]: It turned out so. So CSS had to google what that meant. But this guy Rockstar and I took way too long to realize that he should or should, but with some Kennil boulevard, I convinced him that he could help me a lot. And I think that this systems level thinking and organizational, let's call it organizational level optimization thinking, or whatever term we want to use. If I look back, I could have taken a holistic view way earlier to realize, well, I don't have to. We have this competence in house. It's quite far away. I don't know the person, but if we open up a bit here, he can probably help me with a lot of stuff and it can become two way relationship, even though it's probably mostly one way in this slide.
Marcus Svensson [00:12:27]: But that was nice. And through this, then, if the size of this boxes represent time spent, me and my colleague who were doing machine learning looks like also a team here. Me and my colleague who did machine learning stuff could do more of the actually algorithmic improvements and focus a bit more time on the experimental stuff, the iterative stuff, trying the new cool machine learning algorithms to figure out if we can improve their business KPI's and adjust a little bit less time on the, let's call it more DevOps, ports of mlops. And I think that's the key takeaway from me here, that not only look at your closest context, your closest team, but zoom out a little bit, try to figure out on an organizational level if you really bring the most value, if you do the thing that you're best at. So find ways to take the dev out of ML DevOps. Thank you.
Q1 [00:13:31]: Yeah, one question from the back of the room, which I did not fully get, which is: what is the difference between machine learning as a service and MLOps, in the way that you taught us about MLOps? Is the difference the fact that with ML as a service you do not have access to the code itself, while with MLOps you are the one developing it?
Marcus Svensson [00:13:56]: My interpretation: ML as a service is more of a commercial term to denote the fact that you purchase machine learning in some way. AutoML, for example, that you can leverage, or BigQuery ML and that kind of stuff, where you don't actually build anything yourself. MLOps is for me a broader term that covers the infrastructure and tooling required to scale your machine learning models efficiently. Does that answer the question? Or, what is ML as a service for you?
Q1 [00:14:32]: I know what MLOps should be, but the fact is, in that case, you are training the model and, let's say, managing your own model. While if you do ML as a service, you are agnostic, or you can be agnostic, about the model, so you can just take it, use it and give it back. In either case you need infrastructure, but in the MLOps case you need to develop the infrastructure yourself in order to use the model on your own and serve it yourself. Could that be the key point?
Marcus Svensson [00:15:00]: Perhaps. I mean, we were consumers of machine learning models in both senses: we did our in-house machine learning model training and evaluation, but we also used machine learning as a service, AutoML for example, and other services, to try other types of machine learning as well, since you can A/B test fairly easily in an e-commerce setting.
Q2 [00:15:26]: I've been personally interested in these kinds of higher-level systems questions, and I was wondering where your journey is going next. Do you see that maybe other parts of the organization could need some sort of speed networking, or just learning about each other's problems?
Marcus Svensson [00:15:42]: Yeah, I mean, I think that's not only about machine learning ops or data science; it's a constant problem you have when you grow an organization. So, some tips that you could take away to improve cross-team collaboration, just to take the first steps: you can have random lunches, for example. You don't get to pick; you don't even get to run the random number generator yourself, because then you will cheat. You have to do truly random lunches with colleagues across the company. Then you get insights into completely different departments, and depending on what type of field you're in, you have to really explain what you're doing, like to a five-year-old if you're into MLOps. That's one thing that you can do.
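Purely as a toy sketch of that idea (nothing from the talk), the pairing really can be a few lines, with the machine doing the shuffle so nobody cheats:

```python
import random

def random_lunch_pairs(people: list) -> list:
    """Shuffle everyone and pair off neighbours; an odd person out
    joins the last pair as a trio. Nobody picks their own partner."""
    shuffled = people[:]
    random.shuffle(shuffled)
    pairs = [tuple(shuffled[i:i + 2]) for i in range(0, len(shuffled) - 1, 2)]
    if len(shuffled) % 2 and pairs:
        pairs[-1] = pairs[-1] + (shuffled[-1],)
    return pairs
```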
Q1 [00:16:33]: For example, I'm curious to know, why did you choose Google Cloud in the beginning? Is there a reason behind that?
Marcus Svensson [00:16:40]: Versus AWS? Yeah, it's a good question. Why Google? I've gotten that many times. Why Google Cloud? I blame my predecessor, because I joined in 2020, and the Google Cloud person joined in 2018, so I blame him. So we had Google Cloud on the data side, AWS on the dev side hosting the site, and we had Power BI, so we had a little bit of Azure as well. We did a little bit of Pokémon, "catch 'em all", there and took all three. We thought that was the best approach to cloud choice.
Q3 [00:17:16]: Could you exemplify a bit more around the conversion rate optimization? You talked about using machine learning methods to increase the conversion rate, and A/B testing. In what way did you do that?
Marcus Svensson [00:17:31]: A benchmark conversion rate for e-commerce fashion is around 2%, so 98% of the people who visit your fashion site will not buy. And how do you improve on that 2% when you have 50,000 products, like we had? At its core it's search and discovery and user experience: better ranking on all the pages, better free-text search, better recommender systems. Was there any particular part of it that you were more interested in?
Q3 [00:18:09]: Yeah, more like if you have an example of how you develop the machine learning, like the specific model for those kind of cases.
Marcus Svensson [00:18:16]: Yeah, I mean, I would like to reference back to the previous talk here. I think you did a great job in it. Talking about the iterative process of machine learning development. But in the beginning we had nothing right. So we just started with the most naive model, which in the racking case is like the most sold items the past sometime and benchmarked that. So a B tested versus that and then tried more and more complex methodologies to have more clever ways to rank the items and see if it improved, click through rates, for example, and then eventually conversion rate as well.
Q3 [00:18:55]: So did the ML model itself suggest what you should A/B test?
Marcus Svensson [00:19:02]: The ML model suggests the sorting of the products, order of the products, and you as or we as an organization, we a b test to make like to split the traffic 50 50 versus the naive version and the model version, and then find the machine learning model that outperforms the naive version and then the machine learning model because the defect is standard. And then introduce a more complex or another type of machine learning model layer.
Q3 [00:19:30]: You use the A B testing to test if the model was better than it was before.
Marcus Svensson [00:19:34]: Yes. Okay, thank you. And that's very common and very effective way to do it as well. You can do ABcDFG testing at the same time if you want to spice it up. You're talking about model tracking and model versioning. So is that something in house developer platform or any open source platform like MS flow or something? When we started doing that, it was very primitive. So that was, I think the first version was name suffix in the name of it. So high quality version control there on the model.
Marcus Svensson [00:20:13]: But these days do I have any top of mind model versioning? I mean, Vertex AI I'm sure has model versioning built in as a service you can try. I don't think we used any automatic model versioning and deploy like automatic flow from tagging and deploy for the models for the naming. So I don't have a better answer than Google would give you, unfortunately. Thanks.