ML Platform at Uber - Past, Present, and Future
Kai Wang is the lead product manager of Uber's AI platform team, managing Uber's end-to-end ML platform called Michelangelo. Today, 100% of Uber's most business-critical ML use cases, including GenAI use cases, are managed and served on Michelangelo, driving both Uber's top-line and bottom-line business metrics. Kai has 12 years of engineering and product management experience in high tech, with an EE PhD degree from the State University of New York - Buffalo.
I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.
I am now building Deep Matter, a startup still in stealth mode...
I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.
For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.
I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.
I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.
I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!
Michelangelo is Uber's internal end-to-end ML platform that powers all business-critical ML use cases at Uber, such as Rides ETA, Eats ETD, Eats Homefeed Ranking, Fraud detection, and more recently LLM-based customer service bots. In this talk, I will discuss how Michelangelo has been evolving to continuously improve Uber's ML developer experience and our next steps. I will also briefly share learnings from our 9-year journey of building such a large-scale ML system to drive business impact for a large-size tech company like Uber.
AI in Production
Adam Becker [00:00:06]: Okay, Kai, let's see. Can you. We're good now on audio, right? You can hear us?
Kai Wang [00:00:11]: Sorry, some technical issues around my end.
Adam Becker [00:00:13]: No worries. No worries, man. Okay, well, I'm going to leave the stage for you. You have a talk about Michelangelo, I think I. It might have been the first time that Michelangelo had the blog published about them that it caught my attention. This was now a bunch of years ago now, right? How long is it?
Kai Wang [00:00:35]: At least six years ago. Six or seven years ago. Yeah. We're writing another blog which will be published next month.
Adam Becker [00:00:43]: Nice. Yeah. This is a machine learning platform at Uber and I've been following it. I've been trying to follow it as closely as I can, certainly through the blogs. And then whoever gets to leave it.
Kai Wang [00:00:56]: I try to interrogate them.
Adam Becker [00:00:59]: I'll be back in 30 minutes.
Kai Wang [00:01:02]: Cool.
Adam Becker [00:01:02]: And we'll see if we have questions in the chat. And if so, I'll be back a couple of minutes before that to ask you and Kai, thank you very much. The floor is yours.
Kai Wang [00:01:11]: All right, thank you so much. Hey, folks, my name is Kai. I'm the lead product manager for the AI platform team here at Uber. I manage Uber's internal machine learning platform called Michelangelo. And today I'd like to use this opportunity to give an overview of this machine learning platform, why we built it, how it has been evolving, and also what's next for us. So Uber started this machine learning journey back in.
Adam Becker [00:01:43]: Hi, I want to interrupt you. Sorry for 1 second. I think we're hearing this audio. Maybe it's like the scratching from the mic. Oh, that might be what that sounds.
Kai Wang [00:01:53]: Maybe let me take my headset off.
Adam Becker [00:01:56]: Yeah, let's try.
Kai Wang [00:01:57]: Can you hear me?
Adam Becker [00:02:05]: Yes, I think now it's good. I'll come, I'll interrupt otherwise. Okay, now you're good.
Kai Wang [00:02:11]: Thanks. Just let me know. Sorry for all the troubles. Okay. Yes. So Uber started its machine learning journey back in late 2014, early 2015, when a few teams like the Maps team, the Risk team, and the Safety team started exploring the possibility of replacing rule-based systems with machine learning. And fast forward nine years. Eight or nine years.
Kai Wang [00:02:37]: Now, Uber has fully embraced machine learning. All the lines of business have incorporated machine learning into the core user flow. So virtually every single button click our users make in the Uber app involves machine learning behind the scenes. Let's take the rider app here as an example. When a user logs in, we use machine learning to authenticate the user and also prevent account takeover. Then once the user starts typing an address, we use machine learning for autocompletion and also to rank the results. And once the destination is specified, machine learning is heavily used for maps, routing, ETA prediction, pricing, and also for recommending the right product for you. And we also use machine learning for rider-driver matching and for safety measures.
Kai Wang [00:03:31]: And this goes all the way to post-trip payment fraud detection and also chargeback prevention, all the way to customer support. And on the Eats side it's a similar story, with a focus on recommendations, personalization, and also the Eats ETD, the estimated time of delivery. So every month there are more than 20,000 models being trained at Uber, and at any given moment we have more than 5,000 models serving in production, making 10 million real-time predictions per second at peak. And all this happens on Uber's internal machine learning platform, Michelangelo. So before I jump into the details of Michelangelo here, let's take a detour. Let's take a look at the so-called machine learning lifecycle. So machine learning is a highly iterative process. You try different model architectures and parameters, and train and evaluate the model until a certain performance level is met.
Kai Wang [00:04:30]: Then the model is deployed to production. Then we need to periodically retrain the model with fresh data to maintain the performance. And this whole process requires different user personas such as data engineers, data scientists, applied scientists, and machine learning engineers to closely collaborate to successfully launch a machine learning project. Actually the process looks more like this. So we need systems, infrastructure support, and pipelines to streamline this process and enable collaboration. So prior to Michelangelo, each team had to build their own one-off systems or infrastructure to support their machine learning needs. These systems were unsharable, hard to manage, and impossible to scale. Our ML developers ended up spending 70% of their time building and debugging these pipelines and infra, only 20% of their time doing what they are really good at, building and deploying models, and the remaining 10% praying that their pipelines would work and crying when they didn't.
Kai Wang [00:05:35]: So this is where Michelangelo came into play. Michelangelo serves as the centralized platform that takes care of all the underlying infra and system complexities by providing an easy-to-use abstraction layer, so that our ML developers can spend most of their time building and shipping models and the remaining 5% of their time blaming Michelangelo for pipeline failures, which is a much better user experience. Just joking. So when we first started Michelangelo back in 2015, 2016-ish, we had two completely different tech stacks for classical machine learning and deep learning. For classical ML, we had a purely UI-driven solution with great ease of use, but it had limitations such as no code review and no versioning. Also, the UI tools were built by different sub-teams with different UI/UX designs, so the user experience was quite fragmented. On the other hand, for deep learning we only had limited support, like Jupyter notebook templates and Docker containers, at that moment.
Kai Wang [00:06:42]: So it was quite ad hoc. Starting in late 2019, deep learning was really taking off at Uber. So to streamline deep learning development and drive DL adoption, we started Project Canvas to support deep learning as a first-class citizen in Michelangelo, which provided great flexibility for building customized, complex deep learning models. However, it was purely code and configuration driven, without UI support. To further enhance the machine learning developer experience, in 2022 we re-architected and revamped our machine learning platform tech stack to have a unified UI and code experience for both classical ML and DL developers. Users can choose either the UI or the code-driven way depending on the model they're using, and all the code and config changes are reviewed and versioned in the ML model repo for better knowledge sharing across Uber. So let's take a closer look at Canvas. Canvas is inspired by the mature software development process.
Kai Wang [00:07:47]: Project Canvas applies software engineering principles and best practices to machine learning. In Canvas, the source of truth is code stored in the ML model repo. This includes our ML application framework, dev and test tools, user code, and shared models and libraries. By storing all user applications in the same repo, we enable change tracking, code review, unit testing, and most importantly, we allow experts at Uber to develop reusable models and make sweeping improvements. Our application framework abstracts away the details of the underlying infra and lets users build their application by simply specifying dependencies, training configs, and model code. Users can run the application locally with sample data for debugging and testing. After that, they can submit a job to remote clusters for full training without any code change. Then the trained model is automatically uploaded and deployed for serving, and auto-retraining can be easily set up.
Kai Wang [00:08:59]: This is a sample code snippet for building a Canvas model training pipeline. You can see that all the different steps like feature prep, transform, trainer, and evaluator can be easily configured in a YAML file with a configuration-driven approach, and within the trainer, users can define their own model by referencing the Python code they wrote, so it's quite a plug-and-play fashion. Then we also launched what we call Michelangelo Studio back in 2022. Michelangelo Studio is a brand new UI for Michelangelo. It consolidated all the existing fragmented UIs into one seamless user flow to cover the whole ML lifecycle end to end. So within MA Studio, we divided the whole ML lifecycle into five different phases. You start with preparing and analyzing your data. Then once your training data is ready, you move on to train your model and evaluate your model.
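The slide with the YAML snippet is not reproduced in this transcript. As a rough illustration of the configuration-driven, plug-and-play idea described here, the following Python sketch runs named steps in the order given by a config dictionary (standing in for the YAML file). Every name in it (`STEP_REGISTRY`, `register`, `run_pipeline`, the step names, and the config keys) is hypothetical and is not Canvas's actual API.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Step = Callable[[State, Dict[str, Any]], State]

# Hypothetical registry: maps a step name used in the config to user code.
STEP_REGISTRY: Dict[str, Step] = {}


def register(name: str) -> Callable[[Step], Step]:
    """Decorator that makes a function addressable by name from the config."""
    def decorator(fn: Step) -> Step:
        STEP_REGISTRY[name] = fn
        return fn
    return decorator


@register("feature_prep")
def feature_prep(state: State, params: Dict[str, Any]) -> State:
    # Drop rows that have no label.
    rows = [r for r in state["rows"] if r["label"] is not None]
    return {**state, "rows": rows}


@register("transform")
def transform(state: State, params: Dict[str, Any]) -> State:
    # Scale the single feature by a configurable factor.
    scale = params.get("scale", 1.0)
    rows = [{**r, "x": r["x"] * scale} for r in state["rows"]]
    return {**state, "rows": rows}


@register("mean_trainer")
def mean_trainer(state: State, params: Dict[str, Any]) -> State:
    # Toy "model": always predicts the mean label of the training rows.
    labels = [r["label"] for r in state["rows"]]
    return {**state, "model": sum(labels) / len(labels)}


@register("evaluator")
def evaluator(state: State, params: Dict[str, Any]) -> State:
    # Mean absolute error of the constant prediction.
    errors = [abs(r["label"] - state["model"]) for r in state["rows"]]
    return {**state, "mae": sum(errors) / len(errors)}


# Stand-in for the YAML file: swapping the trainer means changing one name.
PIPELINE_CONFIG: Dict[str, List[Dict[str, Any]]] = {
    "steps": [
        {"name": "feature_prep"},
        {"name": "transform", "params": {"scale": 2.0}},
        {"name": "mean_trainer"},
        {"name": "evaluator"},
    ]
}


def run_pipeline(config: Dict[str, Any], rows: List[Dict[str, Any]]) -> State:
    """Run each configured step in order, threading the state through."""
    state: State = {"rows": rows}
    for step in config["steps"]:
        state = STEP_REGISTRY[step["name"]](state, step.get("params", {}))
    return state


sample_rows = [
    {"x": 1.0, "label": 10.0},
    {"x": 2.0, "label": 14.0},
    {"x": 3.0, "label": None},
]
result = run_pipeline(PIPELINE_CONFIG, sample_rows)
print(result["model"], result["mae"])
```

The point of the design is that the config only names steps and parameters, so the same pipeline definition could be run on local sample data or handed to a remote runner without changing the user's code.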
Kai Wang [00:10:05]: And once your model is trained, you can deploy the model and start making predictions. We support both batch prediction and online prediction. And once your model is deployed, you can set up your online testing and also set up your retraining pipelines. And at the end, you should always monitor your model performance when the model is in production and debug issues when things happen. The other thing provided by the UI is that users can simply build some tree-based models by filling in a few parameters in the UI without writing any code. And we also enable the code review process from the UI, so users can actually submit a code review just by clicking one button in the UI, and we automatically generate the code in the backend for our users. The other good thing about Michelangelo Studio is the rich visualizations for users to evaluate model performance, monitor models in production, and also for debugging purposes. So at Uber, for any of your machine learning needs, there are only two tools you need.
Kai Wang [00:11:13]: One is MA Studio, the other one is Canvas. As I just mentioned, for standardized machine learning needs like tree-based model training and evaluation, you can go all the way in the UI without writing any code. But for more advanced use cases, such as when you want to build a deep learning pipeline or build very complex retraining workflows, you will use the code-driven way, which is Canvas. But no matter where you build your pipelines, either from the UI or from code, all the pipelines can be managed and run from the UI. And you can also view and analyze the results in the UI and, at the end, monitor model performance. And then lastly, the recent breakthroughs in the generative AI space, particularly in large language models, have the potential to bring about significant transformations to the tech industry. At Uber, multiple teams are trying to leverage generative AI to enhance our end-user experience, such as enabling a conversational shopping experience in our Uber Eats app and LLM-powered customer service chatbots, as well as increasing the productivity of Uber employees through programming copilots and conversational knowledge search. So, to facilitate and drive the adoption of generative AI at Uber, we've built essential components within Michelangelo to support capabilities such as prompt engineering, RAG, LLM fine-tuning and serving, all the way to launch and support.
Kai Wang [00:12:49]: And we also built a GenAI gateway equipped with functionalities like PII redaction, safety and policy guardrails, and logging and auditing for our users to securely and efficiently access both external and internal LLMs. We will share our progress on GenAI adoption at Uber in the near future. So, in summary, the evolution of AI/ML at Uber can be roughly categorized into three different phases. We first started with predictive ML for tabular data use cases like trip ETA, risk, and pricing, using simple tree-based models like random forest and XGBoost. The second phase ran from 2019 to 2022, when we pushed for deep learning adoption and model iteration as code in the ML model repo, which resulted in rapid growth of DL projects. So now more than 60% of our machine learning projects at Uber are actually powered by deep learning. And then, as I just mentioned, the third phase started last year as part of the new wave of generative AI. Throughout the whole journey, Michelangelo has been instrumental and essential for accelerating the AI advancement at Uber.
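The talk does not show how the GenAI gateway is implemented. To make the three functions it names concrete (PII redaction, policy guardrails, and logging for audit), here is a minimal Python sketch of a wrapper that sits in front of a model call. The `Gateway` class, the regular expressions, the blocked-term list, and `echo_backend` are all assumptions made for illustration; real PII detection is considerably more involved than two regexes.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


@dataclass
class Gateway:
    """Hypothetical gateway: redact, check policy, call the model, log."""
    backend: Callable[[str], str]
    blocked_terms: List[str] = field(default_factory=list)
    audit_log: List[Dict[str, str]] = field(default_factory=list)

    def complete(self, user: str, prompt: str) -> str:
        safe_prompt = redact_pii(prompt)
        lowered = safe_prompt.lower()
        if any(term in lowered for term in self.blocked_terms):
            self._log(user, safe_prompt, "blocked")
            return "Request blocked by policy."
        # Redact the response too, in case the model echoes sensitive data.
        response = redact_pii(self.backend(safe_prompt))
        self._log(user, safe_prompt, "ok")
        return response

    def _log(self, user: str, prompt: str, status: str) -> None:
        # Only the redacted prompt is ever written to the audit log.
        self.audit_log.append({"user": user, "prompt": prompt, "status": status})


def echo_backend(prompt: str) -> str:
    """Stand-in for an external or internal LLM endpoint."""
    return f"echo: {prompt}"


gateway = Gateway(backend=echo_backend, blocked_terms=["password dump"])
reply = gateway.complete(
    "alice",
    "Email jane.doe@example.com or call +1 415-555-0100 about my trip.",
)
print(reply)
```

Putting these checks in one shared entry point, rather than in each application, is what lets every caller get the same redaction and audit behavior regardless of which model sits behind the gateway.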
Kai Wang [00:14:05]: So lastly, along the way, we've learned some lessons the hard way. I want to share some of the lessons with you folks here. First of all, having a centralized machine learning platform and a centralized machine learning platform team can drastically enhance the ML dev experience and efficiency for a large company. How we do this is to drive the standardization of machine learning and also reduce duplicated efforts across the company, which in the end will drive machine learning quality. Secondly, let the developers choose what they want to use. Some developers at Uber, especially the applied scientists and data scientists, prefer the UI way for model iterations. Our machine learning engineers actually prefer the code way for model development. So we provide both and let them choose. Also, a lot of users are very satisfied with the predefined templates provided by the Michelangelo team, so they will stick to those for their model iterations.
Kai Wang [00:15:11]: But a lot of power users and advanced users actually want to directly access the infra-layer components, for example Ray distributed computing, to build their customized pipelines. So we also provide an SDK for them to do that. Then thirdly, design the platform architecture in a modular way so that each component can be easily replaced. In this way you can quickly incorporate the latest, cutting-edge technology into your platform without much change to the code base. And fourthly, deep learning is really expensive. So do not apply deep learning blindly. Use it only when it actually aligns with your problem statement and your needs. And lastly, when you have so many machine learning projects, like at Uber, we have more than 500 different machine learning projects for different purposes.
Kai Wang [00:16:08]: Having a machine learning tiering system can actually guide you on where to put most of your resources and support so that you drive the most business outcome. So that's all I have for today. Thank you for listening. I'm happy to take any questions, if there are any.
Adam Becker [00:16:29]: Nice. Kai, thank you very much. This is fascinating. We'll wait for some questions to be trickling in in the chat. Until then, maybe I could start with a couple. So, can you say a little bit more about the tiering system? What goes into it? Is it just. I mean, I imagine that obviously business value goes in there, too, but effort, what are some other dimensions, and how do you guys actually quantify these things?
Kai Wang [00:16:56]: Yeah, very good question. So the most important factor here, as you mentioned, as you can guess, is the business impact, right. How much this machine learning project can actually move the needle for our business metrics. And on the other side, very importantly, we also look at the resources from our product teams. Do they have machine learning expertise? Do they actually train the model with first-tier data? Basically, is the model quality good? If the model quality is bad, we cannot guarantee the SLA for serving this model in production. So those are also factors we take into consideration. The third one is, is this an experimental project, just someone from one team? Maybe this is a tier-one team, but this might be a tier-three project because someone is experimenting with new ideas.
Adam Becker [00:17:51]: I see.
Kai Wang [00:17:52]: That's also something we need to take into consideration.
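Kai names three inputs to the tiering decision: business impact, the owning team's ML expertise and model quality, and whether the project is experimental. The talk does not say how these are combined, so the rule below is purely a hypothetical sketch in Python of what such a decision function could look like. The 1 to 3 impact scale, the field names, and the thresholds are invented for illustration and are not Uber's actual criteria.

```python
from dataclasses import dataclass


@dataclass
class MLProject:
    name: str
    business_impact: int          # hypothetical scale: 1 (low) to 3 (high)
    team_has_ml_expertise: bool
    model_quality_ok: bool
    experimental: bool


def assign_tier(project: MLProject) -> int:
    """Return a tier from 1 (most support) to 3 (least); rules are illustrative."""
    # An experiment stays in the lowest tier even if a tier-one team owns it.
    if project.experimental:
        return 3
    # Top tier needs high impact plus a team and model that can meet an SLA.
    if (
        project.business_impact == 3
        and project.team_has_ml_expertise
        and project.model_quality_ok
    ):
        return 1
    if project.business_impact >= 2 and project.model_quality_ok:
        return 2
    return 3


eta = MLProject("rides_eta", 3, True, True, False)
prototype = MLProject("new_idea", 3, True, True, True)
print(assign_tier(eta), assign_tier(prototype))
```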
Adam Becker [00:17:56]: Do you also have a sense of. So you're saying that most of the projects now, like 60% or so, are using deep neural networks, but at the same time, you're introducing this caveat, and you're saying these are not always the best tools for the job, really only use them when it's necessary. In even, let's say, like, traditional tabular kind of data sets, have you guys found some distinction between using deep neural networks and just, like, the traditional XGBoost, or do you have some other intuition that you could share with us about when deep learning is the way to go?
Kai Wang [00:18:33]: Yeah. First of all, let me clarify something. I think I misspoke. It's 60% of our tier-one projects, and we have about 40 to 50 tier-one projects. For the lower tiers, most of them are still on XGBoost.
Adam Becker [00:18:47]: Yeah.
Kai Wang [00:18:48]: So what we did, actually, from 2021 all the way to mid-2022, was run an experiment of replacing XGBoost with deep learning for all the tier-one projects and some of the tier-two ones. And at the end of the day, we looked at the performance and also the cost of training, serving, and maintaining that model, and calculated the ROI. And it turns out that for 40% of the tier-one projects it actually doesn't make sense to adopt deep learning. So that's where we dropped it. So what are some of the top use cases here? We use deep learning for our computer vision use cases, for sure. Deep learning is a very good weapon there. Our NLP use cases, and even some of our tabular data use cases, for example pricing, where our current model is actually a deep learning model, and our Eats recommendation engine, which basically recommends restaurants and dishes to our users.
Kai Wang [00:19:46]: That's powered by deep learning. I think these are the top use cases where deep learning can really shine.
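The ROI comparison Kai describes (the gain from switching a project to deep learning weighed against the extra cost of training, serving, and maintaining the model) can be written as a small calculation. The Python sketch below is only an illustration of that reasoning: the formula is the standard return-on-investment ratio, and the project names and dollar figures are made up, not Uber data.

```python
def roi(incremental_value: float, incremental_cost: float) -> float:
    """Standard ROI ratio: net gain divided by the extra cost incurred."""
    if incremental_cost <= 0:
        raise ValueError("incremental_cost must be positive")
    return (incremental_value - incremental_cost) / incremental_cost


def should_adopt_deep_learning(incremental_value: float, incremental_cost: float) -> bool:
    """Adopt only when the added value more than pays for the added cost."""
    return roi(incremental_value, incremental_cost) > 0


# Hypothetical yearly figures: (value gained over the tree-based model,
# extra cost of training, serving, and maintaining the deep learning model).
candidates = {
    "project_a": (300.0, 100.0),
    "project_b": (50.0, 100.0),
}
decisions = {
    name: should_adopt_deep_learning(value, cost)
    for name, (value, cost) in candidates.items()
}
print(decisions)
```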
Adam Becker [00:19:54]: Is it like in multimodal situations, too, that you tend to default to deep learning?
Kai Wang [00:20:01]: No, we don't have a multimodal project yet. We are working on that. But all these are either purely computer vision or they're usually pure NLP at this moment.
Adam Becker [00:20:15]: Got it. One thing that's always been interesting to me is sort of like the mandate for the target persona that is going to use, let's say, a platform like Michelangelo. So when you guys built this, was the idea to help accelerate data scientists, and then has it ever begun to just kind of increase in scope? Where now you're saying, well, maybe analysts, maybe even software developers can start to play around? Well, now that we have this here, how about business people? To what extent have you managed to just constrain the scope of your target audience?
Kai Wang [00:20:50]: So, all the way until last year, before the GenAI breakthrough, our main target user personas were machine learning engineers, applied scientists, and data scientists, and some of the data engineers who really understand machine learning, who can really write code and do all that stuff. But since GenAI came in, it has really lowered the bar for building machine learning based applications. So now we are also extending Michelangelo's existing capabilities to serve our product operations and to serve our salespeople, who can just use the UI to build some marketing materials. So we are still in the process of building those capabilities.
Adam Becker [00:21:36]: Oh, wow. Okay, that's very interesting. You could just see how the spaces continue to expand here, even just by virtue of this technology. Right? Like, this technology just lowers the bar, and that places immediate consequences for you. Right now, more people want to use it, but at the same time, you still need to do model control and version control. You have to manage all of these different things, and all the pipelines have to be in place. It's just that now you have even more demand for it across the organization, and now you're getting pulled into.
Kai Wang [00:22:07]: Yep. And we need to build an even higher level of abstraction for those sorts of non-tech users.
Adam Becker [00:22:14]: Yeah, yeah. Very cool. Let's see. Do we have any other questions from the audience? I'm not seeing anything yet. You said you're working on the next blog post, right? For Michelangelo. So please share that around when you get a chance. Otherwise, it's what is like ML at Uber. You guys had like, a very good blog post.
Adam Becker [00:22:36]: Blog space for, right?
Kai Wang [00:22:39]: Yeah. So it will be just part of an will be specifically focused on the ML platform.
Adam Becker [00:22:48]: Awesome. Okay, Kai, thank you very much. Please stick around in the chat. And will people find you on slack? Are you on our slack?
Kai Wang [00:22:56]: I'm not sure. I tried to join. I got rejected or something, but that's fine. Find me on LinkedIn.
Adam Becker [00:23:01]: I know some people. I'll talk to them. Cool. Kai, thank you very much. Yeah, we'll chat soon. Adios.
Kai Wang [00:23:11]: Thank you so much.