
Uber's Michelangelo: Strategic AI Overhaul and Impact

Posted Jun 07, 2024 | Views 615
# Uber
# Michelangelo 2.0
# Generative AI
# Deep Learning
Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

Uber's Michelangelo platform has evolved significantly through three major phases, growing from basic ML predictions to sophisticated uses in deep learning and generative AI. Initially, Michelangelo 1.0 faced several challenges, such as a lack of deep learning support and inadequate project tiering. To address these issues, Michelangelo 2.0, and subsequently 3.0, introduced improvements like support for PyTorch, enhanced model training, and integration of new technologies like Nvidia's Triton and Kubernetes. The platform now includes advanced features such as a GenAI gateway, robust compliance guardrails, and a system for monitoring model performance to streamline and secure AI operations at Uber.

TRANSCRIPT

Join us at our first in-person conference on June 25, all about AI Quality: https://www.aiqualityconference.com/

Demetrios [00:00:01]: What's going on, MLOps community? I am your host, Demetrios, and this is another podcast episode. This time we're doing something a little different, because I just read the Uber blog post all around the evolution of their Michelangelo platform. And since they've released this in-depth blog post documenting what they've done from 2016 until now, I thought, why don't I break it down? We'll do an episode where I go through some of my key learnings and takeaways, and I'm doing this because two out of the four authors of the newest post are going to be speaking at the AI Quality Conference on June 25 in San Francisco, so I thought it was only right. I will also be triangulating this post with a few other resources: namely, the original Michelangelo post that came out back in 2019, some podcast episodes that we've had with people who created the original Michelangelo, primarily Mike from Tecton, and a few podcasts that we've had with Melissa and Michael, who work on Uber's internal data and AI education. And I will also mention that one of the writers of this blog post already gave a talk at the latest AI in Production virtual conference that we had. So there are all kinds of resources that I'll be citing and talking about, and all the links are in the show notes, including the main blog post, which I highly encourage you to go read and soak up. But here's going to be my breakdown of it. And before we start, let's give a shout out to the amazing engineering culture Uber has, to make these learnings and what they've done over the years public.

Demetrios [00:02:04]: I mean, we've seen so many of the Uber team go on to start companies, and it's been cool to see them in this space. It's been really nice. I mean, shout out to Aparna at Arize, Piero at Predibase, Kevin and Mike at Tecton. All these folks were part of Michelangelo's story, and they've since come on the podcast and shared what they've learned. And so, speaking of sharing learnings, as a random side note, I just want to shout out some other company blogs that I find just as good, maybe even better, than Uber's engineering blog. And I love how they talk about what they build slash show how the sauce is made. So these blogs that are fascinating to me, and always high quality, I would say, are: DoorDash, shout out to Hien, friend of the pod, and I just love that guy.

Demetrios [00:03:07]: I love hanging out with that guy. Instacart: Sahil came on the pod too. That was great. Airbnb does a really good job of this. Nobody from Airbnb has ever been on the pod, which is fascinating to me, but maybe they've never heard of us. I don't know. If anybody is a listener and you're working at Airbnb on the ML platform, or doing stuff with machine learning or AI at Airbnb, hit me up. I'd love to chat.

Demetrios [00:03:31]: LinkedIn does a great job with their blog, Spotify too, and my favorite, what I think is the most underrated educational engineering blog out there, is the Nubank blog. Absolutely. I've said it many times, and I will say it again: the Nubank engineering blog is a hidden gem.

Demetrios [00:03:52]: All right, let's take a minute to thank our sponsors of this episode: Weights & Biases. Elevate your machine learning skills with Weights & Biases' free courses, including the latest additions, Enterprise Model Management and LLM Engineering: Structured Outputs.

Demetrios [00:04:10]: Choose your own adventure with the wide offering of free courses, ranging from traditional MLOps, LLMs, and CI/CD to data management and cutting-edge tools. That's not all: you get to learn from industry experts like Jason Liu, Jonathan Frank, Shreya Shankar, and more. All those people I will 100% vouch for; they are incredible friends of the pod. Enroll now to build better models, better and faster, and get your education game on. Check the link in the description to start your journey. Now let's get back into the show.

Demetrios [00:04:50]: All right, let's set the scene for the blog post. Uber started building their internal ML tool in 2016. This internal ML tool was called Michelangelo. They say they've been through three distinct phases when it comes to this internal tool. From 2016 to 2019, that was the foundational phase of predictive ML and tabular data. From 2019 to 2023, aka last year, that was a progressive shift to deep learning. And from 2023 onwards, that is the venture into generative AI. This is the platform evolution that we're going to be talking about: how Michelangelo went from just predictive machine learning to now supporting gen AI and deep learning, what that looks like, and what some of their design principles have been along the way.

Demetrios [00:05:57]: From the start of Uber's journey in AI and ML, real time has been at the core of the Uber product experience. Some use cases that have leveraged classical ML have been rider ETA, fraud detection, search, driver matching, and dynamic pricing, and for Uber Eats, it's been ranking systems, intelligent photo carousels, recommender systems, and upsells at checkout. In the original phase, from 2016 to 2019, the inspiration for creating Michelangelo was that machine learning was happening in a very ad hoc way at Uber. Let me read you a quick excerpt from that original blog post: "Before Michelangelo, we faced a number of challenges with building and deploying machine learning models at Uber related to the size and scale of our operations. While data scientists were using a wide variety of tools to create predictive models (R, scikit-learn, custom algorithms, et cetera), separate engineering teams were also building bespoke one-off systems to use these models in production. As a result, the impact of ML at Uber was limited to what a few data scientists and engineers could build in a short time frame with mostly open source tools. Specifically, there were no systems in place to build reliable, uniform, and reproducible pipelines for creating and managing training and prediction data at scale."

Demetrios [00:07:44]: Another way of saying this: there was no established path to production, nor, once something was in production, any answer to "what are we doing with it?" And you'll see that this idea of trying to unify the system, so that there's not so much ad hoc action happening, comes up again and again in the blog post. They really care about building for flexibility. They want to give you templates and make your life as easy as possible, unless you need to go deeper down the rabbit hole, and then you have that customizability also with the Michelangelo platform, but they're not going to let that be the default mode. They want to make it as easy as possible. In the newest blog post, they show that the tech they were using during this time, when they created Michelangelo 1.0, was TensorFlow, Spark, and XGBoost, and they also called out how they were using Cassandra and MLlib. This 1.0 iteration of the platform is where the concept of a feature store was introduced. In a podcast with Matt Bleifer, one of the founding engineers at Tecton, he talked about how, at the time of the Michelangelo blog post being released, he was working at Twitter, and he was at an offsite trying to design Twitter's system. And I think it was the recommendation system that they had for the feed.

Demetrios [00:09:29]: And they read the Michelangelo blog post and they were like, huh, there's a thing called a feature store. Wow, maybe we should try that. And so they implemented it at Twitter, which I thought was a nice little fun fact. And as a cool flex, in the newest blog on the Michelangelo evolution, they talk about how the Uber feature store today hosts over 20,000 features that anyone at Uber can use when they're building models, which is incredible. 20,000 features, and they still have a feature store. It's still in use. But there were some challenges when it comes to Michelangelo 1.0. Remember, this gets us up to 2019. Then they started to see that things needed to change, and so they created basically Michelangelo 2.0.
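To make the feature store idea concrete before moving on, here is a minimal sketch of the pattern being described: one shared catalog of reusable features that any team can register to and read from. None of these class or method names are Uber's actual API; they are hypothetical, for illustration only.

```python
# Hypothetical sketch of the shared-feature-store pattern (not Uber's API).
from dataclasses import dataclass


@dataclass
class FeatureView:
    name: str          # e.g. "driver_trips_last_7d"
    owner_team: str    # the team that maintains the feature pipeline
    entities: list     # join keys, e.g. ["driver_id"]


class FeatureStore:
    """One shared catalog instead of per-team, ad hoc feature pipelines."""

    def __init__(self):
        self._catalog = {}

    def register(self, view: FeatureView) -> None:
        # Registering once makes the feature discoverable by every team.
        self._catalog[view.name] = view

    def get_training_rows(self, feature_names, entity_keys):
        # A real store would do point-in-time-correct joins against offline
        # tables; this just shows the call shape a model builder would use.
        return [(self._catalog[n].name, entity_keys) for n in feature_names]


store = FeatureStore()
store.register(FeatureView("driver_trips_last_7d", "marketplace", ["driver_id"]))
print(store.get_training_rows(["driver_trips_last_7d"], {"driver_id": 42}))
```

The design point is that a feature built once, like that hypothetical `driver_trips_last_7d`, becomes reusable by any of the 20,000-plus features' consumers, rather than each team rebuilding its own pipeline.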

Demetrios [00:10:25]: The first iteration, Michelangelo 1.0, fell over on a few different fronts. For one, there was a lack of quality definitions and project tiering. And project tiering is a huge one that's going to come up over and over again, and I think it is fascinating to really reflect on, because the idea there is that if you do not have project tiering, you do not know which projects are most important. And so I like that they specifically call out that each project in Michelangelo 1.0 was treated the same, regardless of impact on the business. So theoretically, you could have a model that was driving millions of dollars in revenue per day getting the same attention, support, and SLAs as a model that still had not clearly defined what its ROI was. That is the whole reason they said, okay, we need to figure out project tiering and which projects are the highest value for us, so that we can give them priority. And another issue with the Michelangelo 1.0 version was the lack of deep learning support. And that is a huge theme in the newest blog post, basically front and center.

Demetrios [00:11:49]: They talk about it a ton. The lack of deep learning support meant that something had to change within the Uber Michelangelo platform, because, as they specifically called out, they had the data for these deep learning models, but they did not have the developer experience. So things were getting fragmented, and it wasn't fun. And I remember that Mike Del Balso, the founder of Tecton and one of the creators of Michelangelo 1.0, came on a podcast, and he referenced that when he was leaving Uber, I think it was around 2019, maybe 2018, one of the big questions they routinely would go over was this idea of many different models versus one bigger deep learning model, especially for city-specific use cases. So I think, if I remember correctly, it was Mike saying: you know what we would ask ourselves? Should we train one generalized big model that understands each city, or should we train city-specific models? So you have a Detroit model, you have a Denver model, you have a New York model, that type of thing. Ultimately, those were the questions being asked as he was leaving, and what we see from this blog post is that they eventually said, we gotta allow people to try with deep learning.

Demetrios [00:13:27]: We gotta make it easy for them. Because what was happening is that the data scientists were going and trying to use deep learning, and they were doing, again, this whole ad hoc style of: all right, well, if you're not going to support me with the Michelangelo platform, then I'm just going to go and bolt on some ways to do it myself. So the third downfall of Michelangelo 1.0 was the lack of support for collaborative model development. They mentioned this fragmented developer experience, with some teams trying to support the deep learning use cases using the ad hoc tools they had to work with. And they said, you know what? The whole reason we created Michelangelo was to have that unifying experience. We need to bring deep learning front and center. Now, this gets us into phase 2.0 and the architecture design principles behind 2.0. And again, phase 2.0 lasted from 2019 to 2023.

Demetrios [00:14:35]: What happens here is that by 2019, they realized: you know what, we've got some serious ROI and revenue being driven by our machine learning projects. Let's double down, let's encourage people to do more with this, and let's optimize our use cases. You know what's better than a little bit of revenue? A lot of revenue being generated from ML use cases. So teams were encouraged to optimize their different ML use cases. And this means that the data scientists and the teams working on AI and ML initiatives began to use more advanced techniques like deep learning. So the platform had to evolve with these teams, because they were asking for more deep learning support. And the platform team set out to create an architecture that allows for a bit of Lego-style plug and play. Some of the tools that users have at their disposal are internally built, but they can also grab best-in-class open source tools and bring them into the Michelangelo platform.

Demetrios [00:15:56]: So now, if you look at the diagram of what tech they're using in this phase, they started to open it up and add support for PyTorch, PyTorch Lightning, and Ray. And let's read real fast the design principles of Michelangelo 2.0. Define project tiering: that is the number one design principle. Focus on high-impact use cases to maximize Uber's ML impact, and provide self-service to long-tail ML use cases so that they can leverage the power of the platform. And one way that they did this, as we'll see in a minute, is they created a unified GUI. They called it MA Studio.
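Since project tiering is design principle number one, here is a rough sketch of what a tiering policy can look like in practice. The tiers, SLAs, and retraining cadences below are made up for illustration; the blog post does not publish Uber's actual scheme.

```python
# Hypothetical project-tiering policy: support follows the tier.
from enum import Enum


class Tier(Enum):
    TIER_1 = 1  # e.g. revenue-critical: pricing, ETA
    TIER_2 = 2  # important, but degradable
    TIER_3 = 3  # experimental / long-tail, self-service

# Made-up policy table; real values would come from the business.
SUPPORT_POLICY = {
    Tier.TIER_1: {"on_call": True, "availability": "99.99%", "retrain": "daily"},
    Tier.TIER_2: {"on_call": True, "availability": "99.9%", "retrain": "weekly"},
    Tier.TIER_3: {"on_call": False, "availability": "best-effort", "retrain": "manual"},
}


def support_for(tier: Tier) -> dict:
    # Resource questions ("who gets GPU quota first?") key off the tier
    # instead of being argued project by project.
    return SUPPORT_POLICY[tier]


print(support_for(Tier.TIER_1))
```

The point of encoding it this way is exactly what the episode keeps coming back to: the million-dollars-a-day model and the undefined-ROI experiment stop getting the same SLAs by default.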

Demetrios [00:16:48]: They also invested a ton into empowering anyone at the company to learn ML and how they can incorporate it into their jobs. For more on that, we did a whole podcast with Melissa and Michael, as I mentioned, on what they do specifically to get people into the understanding and mindset of how to leverage data science, machine learning, and AI at Uber. It's almost like they have a mini university inside Uber to try and get people doing more with data science, machine learning, and AI. I thought that was super cool. The second design principle from the blog is monolithic versus plug-and-play: the architecture will support plug and play of different components, but the managed solution will only support a subset of them, for the best user experience; bring your own components for advanced use cases. Again, going back to trying to make it easy and templatized, right out of the box, for anyone to come and use.

Demetrios [00:18:03]: But if you're super advanced and you want this really in-depth capability, then you can bring your own. The third design principle is API/code-driven versus UI-driven. Take the API-first principle and leverage the UI for visualization and fast iteration, and support model iteration as code for version control and code reviews, including changes made in the UI. The fourth design principle is the build versus buy decision: leverage best-in-class offerings from open source or the cloud, or build in house. Open source solutions may be prioritized over proprietary solutions, and be cautious about the cost of capacity for cloud solutions. And the fifth design principle was that they wanted to codify best practices, like safe model deployment, model retraining, and feature monitoring, inside the platform itself.

Demetrios [00:19:08]: So I encourage you to either read the blog or just go directly to that specific diagram, which we can leave in the show notes, if you are curious about how they set it up, because they have the whole online/offline split and then this control plane, which, honestly, is a little bit confusing to look at. But you spend a few minutes, you look at it, and you kind of understand what's going on. They also explain some of the ML quality considerations they started to grapple with when creating Michelangelo 2.0. And they have another diagram that is really cool, showing the different phases of the ML lifecycle and the quality considerations they wanted to be thinking about in each phase. So when they're training models, you've got code review and test coverage, plus reliability and hardware cost. On the actual model itself, they're looking at things like accuracy, freshness, and reproducibility; those are some of the quality metrics they're taking into account. And for serving the model, they're thinking about latency, availability, and cost.

Demetrios [00:20:29]: So those are the quality metrics that they're looking at, and these are just a few things they thought about when looking into the quality of the system. They ended up launching a whole framework for measuring and monitoring key metrics. They called it MES, or the Model Excellence Score, which, granted, is not the name I would have gone with, but it is easy to judge. So, to tackle the reproducibility piece, they created what they called model iteration as code. You all have heard of infrastructure as code; well, they've got model iteration as code, and this did a bunch of things: it created an ML repo, it managed dependencies by using immutable Docker builds, and it created continuous integration and delivery pipelines. And all of this was to tackle the reproducibility piece, to make things more reproducible.
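As a rough illustration of the model-iteration-as-code idea, here is a minimal sketch: every change to a model lives in a versioned, reviewable, immutable spec, instead of clicks in a UI. The field names and values are hypothetical, not Uber's actual schema.

```python
# Hypothetical "model iteration as code" spec (not Uber's schema).
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen = immutable, echoing the immutable Docker builds
class ModelIteration:
    model_name: str
    git_commit: str        # ties the run to reviewed, version-controlled code
    docker_image: str      # an immutable build pins every dependency
    features: tuple        # feature store features, referenced by name
    hyperparams: dict = field(default_factory=dict)

    def run_id(self) -> str:
        # Reproducibility: the same spec always identifies the same run.
        return f"{self.model_name}@{self.git_commit[:8]}"


iteration = ModelIteration(
    model_name="eats_ranker",
    git_commit="4f2a9c1d00aa",
    docker_image="registry.internal/eats-ranker:4f2a9c1",
    features=("user_orders_last_30d", "restaurant_rating"),
    hyperparams={"lr": 1e-3, "layers": 4},
)
print(iteration.run_id())  # -> eats_ranker@4f2a9c1d
```

Because the spec is just code, it goes through the same version control, code review, and CI/CD pipelines as everything else, which is the whole reproducibility play.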

Demetrios [00:21:35]: So that's cool. But by far the meat of this article is what I'm about to explain next, and that is the steps they took to make deep learning what they called a first-class citizen within the platform. The first thing they mention is that they brought deep learning support to feature transformations. Next up, they talk about the model training upgrades, and a big one is switching from Spark to Ray, mainly because deep learning workloads have lots of ways they can fail on Spark. And funny enough, the researchers I know training foundational models, not at Uber but just out there in the community, don't really talk that kindly about Ray. So I wonder what the feedback is on Ray now that it's been a few years.
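For anyone who hasn't touched Ray, here is a minimal sketch, emphatically not Uber's setup, of why it maps well to deep learning workloads: parallel tasks are plain Python functions rather than JVM executor jobs.

```python
# Minimal Ray sketch (illustrative only): fan out work from plain Python.
import ray

ray.init()  # starts a local cluster for illustration


@ray.remote  # a real training task would request resources, e.g. num_gpus=1
def train_shard(shard_id: int) -> float:
    # Stand-in for a training step on one data shard; returns a fake loss.
    return 1.0 / (shard_id + 1)


# Launch shards in parallel and gather the results.
losses = ray.get([train_shard.remote(i) for i in range(4)])
print(sum(losses) / len(losses))
```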

Demetrios [00:22:38]: On the serving side, in this blog post about Michelangelo 2.0, they call out serving latency as the most important thing for them, because of use cases like driver ETA, or a lot of those I mentioned before, like the Uber Eats feed ranking system. You've got lots of different use cases that need to be very fast; as soon as someone opens the app, they've got to get that prediction. So they switched from Neuropod, which I had never even heard of before this blog post, to Nvidia's Triton serving, because it works better for serving deep learning, and you can also serve TensorFlow and PyTorch straight out of the box with Triton. And last but not least, I've got to call this out, because when I read it, I was amazed. Honestly, how have they not done this earlier? But okay, right on. Over the three- or four-year time period from 2019 to 2023, bringing Michelangelo 2.0 into existence, they made the switch from Mesos to Kubernetes, and it's just like, what? They weren't on Kubernetes before? Wow.
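Quick aside on that Triton piece before the Kubernetes story continues: here is a sketch of what calling a Triton-served model looks like with the standard tritonclient library, assuming a server is already running. The model name, tensor names, and shapes are all made up for illustration.

```python
# Sketch of a Triton inference call; "eta_model" and the tensors are invented.
import numpy as np
import tritonclient.http as httpclient

# Assumes a Triton Inference Server is running locally on its default port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Triton serves TensorFlow and PyTorch models behind one protocol,
# which is part of why it fits a multi-framework platform.
infer_input = httpclient.InferInput("input__0", shape=[1, 16], datatype="FP32")
infer_input.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(model_name="eta_model", inputs=[infer_input])
print(result.as_numpy("output__0"))
```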

Demetrios [00:24:01]: It's taken them a long time to get off Mesos. The reason they did that, specifically, is for their GPU resource management. Michelangelo 1.0 was using Mesos, and when they upgraded to 2.0, they got rid of it and switched to Kubernetes. And of course, they mention how they have thought long and hard about how CPU and GPU resources should be shared among teams, so that if somebody is not saturating a GPU or CPU, other teams can come in and utilize that GPU to the max. Because, I guess, GPUs are expensive or something, I don't know. And last but not least, the era of GenAI is upon us. In this final part of the blog, they go into what they've done since last year to bring LLMs into the fold and have GenAI workloads be a first-class citizen on the third iteration of the Michelangelo platform. The first piece is that they created the capability to use both external LLM APIs and internal LLMs.

Demetrios [00:25:23]: And to do that, they had to create what they're calling a GenAI gateway. And that's basically just: which model are we going to use? They leverage the larger external APIs for tasks that are more general, and then they use internal models, probably some open source, off the shelf, honestly probably Llama 3 or Mistral, and they fine-tune them with all this rich data that Uber has. They definitely called out a few times that, wow, we've got a ton of data and we want to use it; we want to train models on all this great data that we've got. So they're probably fine-tuning those internal open source models, but then, when they need to, they will use the external models for more complex tasks, is what it sounded like, and more generalized knowledge. If they need that generalized knowledge, then they'll go out there. And so they created this GenAI gateway, and it gives a unified experience.

Demetrios [00:26:30]: So you just make the call, and then, boom, the GenAI gateway will figure out which model it should be going to. In this GenAI gateway experience, there were some key factors they called out, because it's not just routing to the right model. They also said, we want to have logging and auditing so we can track and audit models, because maybe some government agency is going to knock on our door and say, what are you doing with AI? And we can show them exactly what we're doing. They also set up cost guardrails and attribution, so they can know how much they're spending and attribute it to different use cases and projects. And they get alerted to overspending when it happens, which makes me think: okay, what happened, so that they had to put that guardrail into place? There has to be some kind of story there. Like, oops, I didn't realize it was costing us that much to make all those LLM calls, or whatever it may be. Imagine you pushed a new feature, and millions and millions of people are using it, and now you just made a whole lot of LLM calls and gave OpenAI a whole lot of your money. I want to know what the story is behind that. Or maybe they thought of it preemptively; they probably did, because they're smart engineers, right? They had to have thought of that before pushing anything to prod. But what's the story there? As you would expect, they also put safety and policy guardrails into place, presumably because they don't want to end up on the front page of internet news.
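Pulling those gateway ideas together, here is a hypothetical sketch of what routing, PII redaction, logging, and cost attribution can look like behind a single entry point. Nothing here is Uber's actual implementation; the routing rule, prices, function names, and redaction regex are all illustrative stand-ins.

```python
# Hypothetical GenAI gateway sketch: route, redact, log, attribute cost.
import re
import time

COST_PER_1K_TOKENS = {"external": 0.03, "internal": 0.002}  # made-up prices
audit_log = []  # a real system would use a proper logging/metrics pipeline


def redact_pii(text: str) -> str:
    # Placeholder redactor: a real one would catch phones, names, addresses.
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[REDACTED_EMAIL]", text)


def call_external_llm(prompt: str) -> str:
    return f"[external answer to: {prompt[:40]}]"  # stand-in for an API call


def call_internal_llm(prompt: str) -> str:
    return f"[internal answer to: {prompt[:40]}]"  # stand-in for a hosted model


def gateway(prompt: str, use_case: str, needs_world_knowledge: bool) -> str:
    # Toy routing rule: general-knowledge tasks go out, the rest stay internal.
    route = "external" if needs_world_knowledge else "internal"

    # Guardrail from the post: redact PII before anything leaves the building.
    if route == "external":
        prompt = redact_pii(prompt)

    start = time.time()
    call = call_external_llm if route == "external" else call_internal_llm
    answer = call(prompt)

    # Logging/auditing plus per-use-case cost attribution, as described above.
    tokens = len(prompt.split()) + len(answer.split())
    audit_log.append({
        "use_case": use_case,
        "route": route,
        "latency_s": round(time.time() - start, 4),
        "est_cost": tokens / 1000 * COST_PER_1K_TOKENS[route],
    })
    return answer


print(gateway("Summarize trip notes for rider@example.com", "support_summaries", True))
print(audit_log)
```

An overspend alert then becomes trivial to bolt on: sum `est_cost` per use case over a window and page someone when it crosses a budget.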

Demetrios [00:28:16]: All right, last but not least, Michelangelo 3.0. They made sure that when people are using GenAI capabilities, PII, or personally identifiable information, is redacted. So specifically, when any data is being sent to an external LLM, all the PII gets redacted. And it seems like they've built a very robust evaluation framework to easily be able to compare models. When a new model comes out, they can just throw in their use-case-specific evaluation frameworks and datasets and get a pretty good approximation of whether it's better or worse than what they're working with. So let's conclude with that. Let's wrap it up. That's the blog breakdown.
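One last sketch before the wrap-up: a minimal version of that use-case-specific evaluation loop, where a candidate model only has to beat the incumbent on the use case's own dataset. The scoring rule and data below are toy examples, not Uber's framework.

```python
# Toy eval harness: compare a candidate model against the incumbent.
def evaluate(model_fn, dataset) -> float:
    # Toy score: fraction of cases where the output contains the expected
    # answer; real frameworks would use task-specific metrics or judges.
    hits = sum(1 for prompt, expected in dataset if expected in model_fn(prompt))
    return hits / len(dataset)


eval_set = [("capital of France?", "Paris"), ("2 + 2 =", "4")]

incumbent_score = evaluate(lambda p: "Paris is the capital. 4.", eval_set)
candidate_score = evaluate(lambda p: "I think Paris. Also 4.", eval_set)

# A new model only has to beat the current one on the use case's own data.
print("ship candidate" if candidate_score >= incumbent_score else "keep incumbent")
```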

Demetrios [00:29:11]: It was something I felt like I wanted to do specifically because the original blog post, the Michelangelo 1.0 one, "Meet Michelangelo," was so influential in the MLOps community. And also, I wanted to do this because two of the four authors of the newest post are going to be speaking at the AI Quality Conference, and so I felt like it was only right. They spent all this time and hard work creating this blog post; it's my ode to them, a little bit of a hat tip. And as a quick reminder, only a few tickets are left if you want to come to the AI Quality Conference on June 25 in San Francisco. We might sell out by next week, which is absolutely incredible to me. It's mind-blowing. Huge shout out to everyone that is coming and supporting the conference.

Demetrios [00:30:05]: It is a lot of fun, and we haven't even started yet. As I was reading through this blog post, and specifically the evolution of Uber's Michelangelo, I was in awe at all the customizations and how advanced they seemed to be. But it also reminded me that we have this vision of companies who have been doing AI since the mid-2010s as being so advanced, so far ahead of the pack. And then you read something like this and it reminds you: oh yeah, Uber only just managed to migrate off Mesos. Like, how far ahead of the pack were they if they were on Mesos until last year? Come on. And so what it reaffirms in my mind is that a lot of infrastructure from these big companies that started doing ML in the early days is like a snapshot in time. They had to make certain tech decisions and choices because there was nothing else out there. And if you compare that to the tooling options we have today: think about the tooling options they had in 2016 for ML workloads versus the tooling options that were out there in 2020.

Demetrios [00:31:35]: And that's another snapshot in time, versus the tooling options that are out there today. You see how much the field has evolved, and it is so much easier to get an ML platform up and running today than when these folks set out to do it in 2016. They were basically creating everything. But because they had to create everything, there's almost this legacy that was left behind. And so they have these legacy systems, and when you start your journey that early, the platform team tasked with the upkeep of the platform, in this case the Michelangelo platform, but let's generalize it to any AI platform out there, is constantly going to have these conversations: do we invest in upgrading our system right now? Is there enough demand? Is there enough pull? Is there enough of a reason for us to do that? And I'll forever remember a conversation with Anush from Pinterest, when he was talking about the conversations his team has about the machine learning ads platform at Pinterest. They also were one of those companies that found success early in ML, around 2015-2016, but he is tasked with keeping that ads platform up to date.

Demetrios [00:33:22]: His team has conversations around the ROI of upgrading from an open source tool that no longer has a maintainer or community support around it, but that is a major part of their system. So should they become the main maintainer of this open source tool that people are moving away from, or should they just work to move away from it themselves? Those conversations are fascinating to me, because you're always going to have to evolve your platform, and so making the decision on what parts to evolve versus what parts to leave, those are fun decisions and deep decisions, because you don't want to move off of something if it is still serving you and only a few people are actually going to benefit from that move. So I will leave you with what I think is arguably the most important sentence in the whole evolution-of-Michelangelo blog post. And thanks for joining. If you liked this episode, I would love to get feedback, because it is something new that I'm trying, and so I might do it more. Maybe not, though. Just let me know what you think. It would mean the world to me, and share it with a friend if you found it useful.

Demetrios [00:34:57]: Let's get into the most important phrase of the whole blog post, which is: not all ML projects are created equal. Having a clear ML tiering system can effectively guide the allocation of resources and support. We'll talk to you later. This has been another edition of the MLOps Community podcast.

