In mid-November, the Apply(ops) 23 conference, organized by Tecton and Demetrios Brinkmann, took place. I usually write a one-page summary for my teammates to share learnings and good references, which I also post on LinkedIn. This time, I decided to write an article instead, as I found the slides and talks full of good tips.
As the event was hosted by Tecton, there were live demos of their products and client feedback. I will not include these in my recap (though the content was valuable) because I want to keep my summary focused more on learning than on product demonstrations.
Speaker: Min Cai, Distinguished Engineer in platform engineering at Uber. From his LinkedIn profile, he seems to be the tech lead behind Michelangelo (the internal ML platform) and Horovod (a distributed deep learning training framework that works with TensorFlow, PyTorch, etc.).
The talk mainly focused on the origin and direction of the internal ML platform, Michelangelo. This platform seems to have inspired many companies, including Ubisoft (I know, I was there), to build their own ML stacks. Uber began working with ML in 2015 because it needed a system capable of making complex decisions in real time (Uber has around 137 million monthly active users in 10,000 cities), interacting with both real-world elements (like drivers, restaurants, shops, traffic) and digital elements. This quote captures their vision of machine learning.
They are closer to an Alphabet-like company than Netflix, but they still see that a centralized AI/ML platform can accelerate ML adoption.
They shared some example use cases, such as earner onboarding (setting up an account for a driver, restaurant, etc.), ride recommendation (vehicles and trips), and restaurant/food recommendation. They presented some numbers related to ML projects, including:
In the first section of the talk, Min shared the evolution of the platform using a very informative diagram (the Y-axis represents the number of use cases).
From the voiceover and the diagram, it seems that:
Min also presented more details about the workflow on Michelangelo, focusing on the concepts of Canvas and Studio. There is an illustration showing what a canvas is:
The canvas is the framework of an ML application. It connects the different parts of the platform and organizes the code that will run. This includes many configuration files and SDKs for using the platform. The Studio offers a more integrated UI/UX approach to doing ML on Michelangelo. It connects various UI components and the canvas. There’s an overview of the Studio’s flow with a look at the UI, which seems very simple (I'm not sure how flexible it is to use).
Finally, they talked about their vision for GenAI. It was not very clear (as is the case across the whole company), but Min pointed out some interesting things in their slides:
If you want to learn more about Uber’s vision, Min gave a great talk at the Scale 2023 conference, which covers most of this talk’s content plus some extras.
Speaker: Dr. Rebecca Taylor, currently the tech lead for personalization at Lidl, discussed how to manage machine learning in a multi-cloud environment.
A multi-cloud environment is one where you use more than one cloud provider. It’s different from a hybrid cloud, which combines on-premises infrastructure with a cloud service. The hybrid cloud was popular about 10 years ago (😓 …).
The multi-cloud approach seems to be gaining popularity:
The need for a multi-cloud approach can be due to historical and technical reasons (like in ML deployment, accessing specific services). There are pros and cons to going multi-cloud, which I’ve summarized in this table.
With a multi-cloud mindset, it’s important to have good abstraction layers and to keep some considerations in mind when moving from one path to another:
Rebecca also took some time to list different technologies that can be interesting in a multi-cloud setup:
She concluded her talk with enterprise tech, highlighting that Databricks works well in a multi-cloud setup, along with Tecton as an all-in-one feature store (which works with Databricks, SnK, etc.) and ZenML Cloud.
Speakers:
HelloFresh, a leader in the food/meal kit delivery industry, has a strong machine learning (ML) culture and openly shares it with the world. Erik Wildman (former Product Director at HelloFresh) gave an impressive presentation about their vision of MLOps a few months ago.
Returning to the conference, they seem to have a strong culture and shared some impressive figures about their ML achievements:
Notes: Their platform for MLOps is built upon Databricks (with AWS) and Tecton for the feature store. It mainly supports offline (batch) predictions, but they are also developing live inference use cases with SageMaker and Databricks endpoints (abstracted behind a layer for the user, who should not have to worry about that).
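To make the idea of "endpoints abstracted behind a layer" concrete, here is a minimal sketch of what such a layer could look like. It is not HelloFresh's implementation: the class names, the model names (`churn`, `ranking`) and the fake backends are all hypothetical, and the backends return dummy values instead of calling SageMaker or Databricks.

```python
from typing import Dict, List, Protocol


class InferenceBackend(Protocol):
    """Hypothetical interface every serving backend would implement."""

    def predict(self, features: List[float]) -> float: ...


class FakeSageMakerBackend:
    """Stand-in only: a real backend would call a SageMaker endpoint."""

    def predict(self, features: List[float]) -> float:
        return sum(features)  # dummy scoring logic


class FakeDatabricksBackend:
    """Stand-in only: a real backend would call a Databricks endpoint."""

    def predict(self, features: List[float]) -> float:
        return max(features)  # dummy scoring logic


class ModelClient:
    """The layer the user sees: pick a model by name, never a backend."""

    def __init__(self, routes: Dict[str, InferenceBackend]) -> None:
        self._routes = routes

    def predict(self, model_name: str, features: List[float]) -> float:
        try:
            backend = self._routes[model_name]
        except KeyError:
            raise ValueError(f"unknown model: {model_name}") from None
        return backend.predict(features)


# The platform team owns the routing; users only know model names.
client = ModelClient({
    "churn": FakeSageMakerBackend(),
    "ranking": FakeDatabricksBackend(),
})
churn_score = client.predict("churn", [0.5, 0.25])
ranking_score = client.predict("ranking", [0.5, 0.25])
```

With this kind of design, moving a model from one serving technology to another only changes the routing table, not the user's code.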
The presentation emphasized the importance of building an MLOps platform in a company. It included a clear illustration of moving to production.
The reasons for adopting MLOps and a centralized solution are:
They shared insights about their approach in choosing elements for their stack.
They also outlined different layers in their stack, catering to various roles:
Some interesting concepts were defined in the presentation:
Notes: The definition of the MLOps blueprint for a model, composed of a dataset reader, some metadata, and a model definition, is not new, but it is good to see it represented this way (and there is room to add new components).
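As an illustration of the blueprint idea, here is a small sketch of how the three components (dataset reader, metadata, model definition) could be grouped in code. This is my own interpretation, not what was shown on the slides: all names (`ModelBlueprint`, `InMemoryReader`, `extras`, the `basket_size` feature) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Protocol

Row = Dict[str, float]


class DatasetReader(Protocol):
    """Anything that can return rows of features."""

    def read(self) -> List[Row]: ...


@dataclass
class InMemoryReader:
    """Toy reader; a real one would read from a table or feature store."""

    rows: List[Row]

    def read(self) -> List[Row]:
        return list(self.rows)


@dataclass
class ModelBlueprint:
    """Groups the three components mentioned in the note above."""

    reader: DatasetReader            # where the data comes from
    metadata: Dict[str, str]         # owner, version, etc.
    model_fn: Callable[[Row], float]  # the model definition
    # Room for new components without changing the three core fields.
    extras: Dict[str, Any] = field(default_factory=dict)

    def run_batch(self) -> List[float]:
        """Score every row returned by the reader (offline prediction)."""
        return [self.model_fn(row) for row in self.reader.read()]


blueprint = ModelBlueprint(
    reader=InMemoryReader([{"basket_size": 3.0}, {"basket_size": 5.0}]),
    metadata={"owner": "team-a", "version": "0.1"},
    model_fn=lambda row: 2.0 * row["basket_size"],
)
predictions = blueprint.run_batch()
```

The `extras` field is one simple way to leave "room to add new components" while keeping the core of the blueprint stable.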
Finally, they shared some key takeaways:
Disclaimer: This is a discussion between Ali Ghodsi, co-founder and CEO of Databricks, and Mike Del Balso, co-founder and CEO of Tecton, so they are selling solutions that can be used to operate ML projects.
I collected a few interesting points from this discussion that I think are worth mentioning:
This chat was super interesting, but keep in mind the disclaimer at the beginning of the section.
Speaker: Aayush Mudgal, Senior Machine Learning Engineer at Pinterest (also a startup mentor in adtech and edtech)
This presentation focused on strategies to develop ML systems like the recommender system at Pinterest. ML touches many aspects of the application, affecting millions of users.
The billions of Pins pose challenges in the context of the recommender system. Aayush presented a timeline of their recommender system deployment.
They started using their first model in 2014 (four years after Pinterest’s release), followed in 2017 by their first boosted trees and logistic regression. These models had to be converted to C++ for deployment.
After 2017, they explored using deep learning to address the problem that boosted tree and logistic regression models are hard to train (as they don’t support incremental training). They decided to switch to deep learning but had to work extensively on their stack to manage these models.
In 2020, they deployed their first multi-task model, followed in 2021/2022 by their attention/transformer-based model built on TensorFlow. In 2022, they released PinnerFormer (a transformer + multitask model, predicting user interactions with Pins), and also introduced a new environment based on PyTorch (called MLEnv).
He also highlighted some pain points for an ML platform, like the difficulty in upgrading software and hardware, and the need for multiple expertise areas (like TensorFlow and PyTorch) to support users.
Their current platform insights include:
Statistics around their platform are impressive:
He also shared some challenges at Pinterest (that can slow down ML deployment):
Speakers: This panel was moderated by Mihir Mathur from Tecton and included:
The panel focused on 4 topics: the scope of their recommendation systems, the tools used, the feedback loop of these systems, and the challenges in operating them.
==Scope of their recommendation systems==
==Machine Learning tools==
👆 If you’re curious to learn more about Riot's tech stack and the models they use (which are not typical in the MLOps landscape), Ian gave a great presentation at Data Council a few months ago. 👇
==Feedback loop and challenges==
This conference was full of knowledge, and I really wanted to focus on the essential learnings with some related resources. Most impressive were the statistics about ML platform usage in companies. I summarized them in this table (with some extrapolations):
The number of models built and used is really impressive (if the numbers are accurate). I think this illustrates, in a way, how ML is democratized in these companies, but it also raises the question of whether all these jobs are really necessary (I guess they are, but in my job I don’t see these kinds of numbers 😁).