
Technical Debt in ML Systems

Posted May 06, 2024 | Views 240
# ML Systems
# Technical Debt
# King
Francesca Carminati
AI/ML Engineer @ King

Francesca has a background in pure mathematics and has held the positions of data scientist and AI/ML engineer over the course of her career, most recently at King and Peltarion. She has collaborated with various companies across diverse industries, contributing to multiple use cases with a focus on language models and computer vision. Currently, Francesca develops tools for ML engineers and data scientists as part of King's ML Platform.


Maintaining machine learning systems can be difficult and costly because they often end up with a large amount of technical debt. In this presentation we discuss why ML systems are more likely to accumulate this type of debt, and three sources of technical debt in ML systems.



Francesca Carminati [00:00:02]: My name is Francesca and I am an AI/ML engineer at King. I am part of King's ML Platform, and I develop tools for data scientists and ML engineers to support them in the different phases of the ML lifecycle. You might have heard of King, but I'm going to give a brief introduction anyway. King is a leading interactive entertainment company with the mission of making the world playful. King was founded in Sweden 20 years ago, and we have game studios in Stockholm, London, Barcelona, Malmö, and Berlin, and offices in San Francisco, Chicago, New York, Los Angeles, Dublin, and Malta. We have been part of Activision Blizzard since 2016, and we have recently been acquired by Microsoft. Over the years, King has developed over 200 games, and we design games that have broad appeal and that allow people to play for a moment, move on with their day, and pick up their game later.

Francesca Carminati [00:01:13]: Also, our games are synchronized across platforms, which allows players to switch seamlessly between devices and platforms and continue the game whenever they want, from wherever they left off. For us, this is what encapsulates the idea of bite-sized entertainment. We make our games available for free, but our players can purchase virtual items priced according to the entertainment value they provide, and we also have social features embedded in the games. A last couple of numbers: among our top games we have Candy Crush Saga, Candy Crush Soda Saga, and Farm Heroes Saga. We have more than 200 million players every month, and in Candy Crush Saga alone we have more than 50,000 levels. So this is roughly the agenda of my presentation today. I'm going to talk about what technical debt is in a software system, why ML systems are particularly prone to tech debt, and a few sources of technical debt in ML systems. I'm going to conclude with the references for my presentation and some final thoughts. To start with, what is technical debt in a software system? Technical debt refers to the accumulated cost and consequences of choosing a quick and easy solution over a more robust but time-consuming one.

Francesca Carminati [00:02:45]: What happens when you have a lot of technical debt in a system is that you get to a point where you need to do a lot of refactoring, maintenance becomes really hard, and it may be impossible to add a new feature or do an iteration before you actually fix the problems caused by the technical debt. To be fair, every system needs maintenance and refactoring at some point, but technical debt refers specifically to the problems you caused because you were prioritizing speed over a more stable solution. And what is tech debt in particular for ML systems? The first article about this topic was published in 2015 and was called "Hidden Technical Debt in Machine Learning Systems." At the time it went somewhat unnoticed, because this is really not a fashionable topic, and also because there was a lot of work at the time on generative adversarial networks, so it flew a bit under the radar compared to more exciting topics. However, this topic has recently seen renewed interest: there have been more articles published about refactoring in ML systems, code duplication, and so on.

Francesca Carminati [00:04:15]: And the authors seem to agree that we have gotten to the point where we really need to worry about technical debt, because building a new ML system is now actually faster and cheaper than maintaining and iterating on an existing one. Another problem is that one of the main requirements of machine learning systems is their ability to iterate, and the speed at which they iterate. If you have a lot of technical debt, you end up focusing all your effort on fixing the technical debt before you can proceed with a new iteration, which means you are missing one of the basic requirements of these kinds of systems. So why are these systems in particular prone to technical debt? Well, the first reason is that they have the same kinds of problems that you might see in any traditional software system. But the biggest reason is that in ML, debt tends to develop at the system level. What I mean is that what helps a lot in containing tech debt is having a system composed of well-encapsulated components: every single module in your system should be neatly encapsulated and have very clear responsibilities.

Francesca Carminati [00:05:45]: So that basically if you, if there are any errors or problems in one of your components, the errors that do not spread. But this is actually really, really hard to do with ML because complex model tend to erode boundaries. And an example is, let's say you have ML system and you want to change your model code, and so you have a component which refers to your, specifically to your model code and you want to change the feature of this model. This is also going to impact, maybe your data collection is going to impact, your data validation is going to impact maybe your post processing. And so you see how basically it's really hard, you're going to have a correction cascade, so a modification in one component is just going to spread and you have very strong entanglement between all these modules. And another reason of technical deb in ML system, or why they are so prone to technical debt is that we still lack time tested abstraction that can answer to this boundaries erosion. So there has been like different work that basically try to set best practices or design pattern to kind of give common answer to the same problem. But we still haven't seen patterns that basically stood the test of time and can help like combat in fighting this system wide technical debt.

Francesca Carminati [00:07:11]: The last reason is that ML is very iterative and experimental in nature. As you might know, there are a lot of iterations and all of them need to be very fast, which means that speed becomes the most important thing, and that leads to an accumulation of errors. So now I'm going to talk about a few sources of technical debt. It would be impossible to cover all of them in 30 minutes, so I'm just going to talk about the three most important ones that have been cited in different works. These are code duplication, configuration debt, and undeclared consumers. The first one is very easy to understand.

Francesca Carminati [00:08:02]: Code replication basically means repeating the same code in different places of your code base. And basically that's a problem because if you want to change something in your code, then you need to change it in multiple places. For example, this happens when you have a pre processing that you apply in your training pipeline, and then you just copy paste it for your inference pipeline. A subsequent iteration, you change your pre processing. And now you also need to change your pre processing, the inference pipeline as well. And you need to make sure that these are exactly the same. And this of course is a source of error. You need to propagate your entire system and create very hard to find training and inference queue.

Francesca Carminati [00:08:56]: And code abbreviation is actually the biggest reason of refactoring in MN system by far. So basically deleting duplicated code and just refactoring and solving this kind of problem. And the most afflicted component in the MS system of code duplication is actually the model code, which is really interesting if you consider that this is actually the smallest part of an MS system. So typically an MS system is composed by a number of modules like your code base, that refers to infrastructure, data collection, data validation, monitoring and so on. The model code, the model specifically, the code specifically to the model is usually five to 10% of the whole code basis. And this is where the most code duplication happens. And this is due to the fact that this is also the part of your system which is most affected by the iteration, where most changes and where speed is actually like are very important. You really need to be fast in these iterations.

Francesca Carminati [00:10:03]: And a solution to code application would be generalizing code or like creating packages that have a more general use and that you can reuse in different parts of your code basis. But again, generalization is an effort that contrary to speed, the second source of depth is actually configuration depth. So as you of course know, MS system have a very wide range of configurable options. This is true for any system. But for MS system in particular, you might have a lot of settings related to your features, your data threshold for preprocessing, post processing, a set of algorithm specific parameters and so on. And so the collection of all of these settings basically is what is called like the configuration of your system. And this is where actually DeP can accumulate a lot in MS system. And the configuration refactoring is the second most common, the second most prevalent refactoring in MS systems.

Francesca Carminati [00:11:18]: For example, this can happen when you try to with a model in production. And to start with, you basically harcode maybe some of the parameters of your architecture in your code, and therefore you have maybe no configuration files where all the parameters of your architecture are collected. And that means that in the next iteration you basically need to dig deep into your code in order to try to understand how many layers your network had in the first place. So having no configuration files where you collect all these settings is one of the first reason of why this configuration, this depth can happen. But another thing that could happen is that maybe in your first iteration you go with a former model of a specific architecture and you do have a configuration file where you collect all of the, all your architecture settings. However, in your next iteration you decide to go maybe for an xgboost model. So now this time all the settings changes, but you have no time to refactor all of your configuration file. So what you do, you just basically just have them in the same file and just keep the old ones there because you have maybe no time of remapping this setting.

Francesca Carminati [00:12:46]: So you're gonna basically end up maybe with redundant settings or duplicated settings or obsolete settings or even playing wrong settings, and you end up not knowing where your yes, what basically what settings are still relevant in your configuration and which one are not. So this is actually a very boring reason for technical depth. However, it's very easy to fix. You just need to make sure that basically all of your tunable configuration of your system are in one place and basically just add, basically just add the configuration in your code reviews and yeah, basically it should be also very easy to see what changed from one iteration to the other just looking at the configuration files of your system. Yes, basically you should see what changes from one iteration of your system to the other iteration, just looking at the configuration if possible. And the last reason of tech debt is what we call underpillar consumers. So undeclared consumers basically means other system that depends on your system that you are not aware of. So what happens very often in that your system maybe makes prediction widely available.

Francesca Carminati [00:14:08]: So maybe it drops them in a bucket or in a table or a directory. And then there are other systems that basically reuse your prediction, but no one is keeping track of just how many system depends on the output. And these other systems are basically the underclared consumers. This can happen, for example, if maybe your system basically generates prediction, and then another team in another organization decide to reuse this prediction as input for another model to solve like a slightly different use cases. Or maybe they use their raw predictions and then they apply a different post processing just for a slightly, maybe a tangential use cases. But then you basically are just creating dependencies that you are not aware of. And if you change something in your model, or basically you're just gonna have a cascade of error downstream. So this actually requires kind of collaboration with security engineers in order to set up rules in order to prevent access and make this dependency known, basically declared.

Francesca Carminati [00:15:24]: To conclude like three important points. Point technical debt in MS system hinders iteration speed and make maintenance really hard, because at some point you're gonna need to focus all your effort in refactoring or fixing problem due to prioritizing speed instead of more robust solutions. Also, ML systems are more likely to be prone to technical debt for different reasons, because they have the same problem of traditional software system, but also that tend to happen in a system wide level, because it's really hard to create encapsulated model with very specific responsibility, because complex models tend to erode these boundaries and create a correction cascade, and also because of the iterative nature of ML. And the third point is that some of the source of tech debt are code duplication, where you're just repeating your code in different part of your code base configuration there, which refers to how you store all the tunable settings of your system, and then undeclared consumers, which is basically other systems that depend on the output of your system that are not, that you are not aware of. However, something that helps a lot with techdep is just awareness, just knowing where problems can can happen and just including configuration in your code reviews helps a lot in just solving this problem at the very beginning, so lack of awareness just creates more technical depth. And I am going to leave you with my reference for this presentation. I also added the original. The fourth reference refers to the original work about technical debt of 2015 by Scaly and other authors.

Francesca Carminati [00:17:24]: And yes, I thank you very much for your attention.
