
Lessons from Studying FAANG ML Systems

Posted Jun 21, 2022 | Views 481
# ML Platform
# ML Efforts
# Duo Security
# Duo.com
SPEAKERS
Ernest Chan
Senior Data Scientist @ Duo Security - a Cisco Systems business unit

Ernest is a Data Scientist at Duo Security. As part of the core team that built Duo's first ML-powered product, Duo Trust Monitor, he faced many (frustrating) MLOps problems first-hand. That led him to advocate for an ML infrastructure team to make it easier to deliver ML products at Duo. Prior to Duo, Ernest worked at an EdTech company, building data science products for higher-ed. Ernest is passionate about MLOps and using ML for social good.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

Vishnu Rachakonda
Data Scientist @ Firsthand

Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.

SUMMARY

Large tech companies invest in ML platforms to accelerate their ML efforts. Become better prepared to solve your own MLOps problems by learning from their technology and design decisions.

Tune in to learn about ML platform components, capabilities, and design considerations.

TRANSCRIPT

0:00 Demetrios

Vishnu! Look at that shirt man, that's pretty sweet.

0:03 Vishnu

Isn't it the coolest shirt? I got it from my best friend in the whole wide world and the coolest community in the world.

0:09 Demetrios

For those that are just listening, Vishnu is showing his “If in doubt, log it out” shirt that you can get on the MLOps community website, because I was crazy enough to try and create a merchandise shop that our accountants tell us is a horrible idea and has only been bleeding money since its inception. But anyway, we're here today to talk with Ernest Chan. Who is this guy?

0:39 Vishnu

Ernest is a great, great, great communicator, blogger, and a data scientist at Duo Security, where he helped start and run their ML infrastructure and platform team. He’s a really cool guy who's written what I call the “big tech ML blog posts” where he goes through all of their different ML systems and synthesizes lessons on model deployment, model serving, and how to build an ML platform. What did you think of the podcast?

1:08 Demetrios

Well, that actually was one of my big takeaways – how he was able to learn all of that, and then bring it back to Duo and implement it. When we talked through that with him and the challenges that he faced when he was implementing it, those were huge. What about you? What kind of takeaways did you have?

1:27 Vishnu

I really enjoyed how he communicated about super complicated software engineering concepts and machine learning engineering concepts, effortlessly. He really was able to toggle between talking about resource utilization and then what kind of models should be built for particular use cases. I learned a lot about that from a maturity standpoint, as a machine learning engineer myself, so it was a great conversation. Listen all the way through for all the gems and onto the podcast.

1:56 Demetrios

There we go. And if you want to buy one of them shirts that Vishnu has on, you go to MLOps.community and find all of our merchandise. Peace out. intro music

2:08 Vishnu

Hello, everyone. Welcome to another episode of MLOps Coffee Sessions. As usual, I'm Vishnu and I have Demetrios here with me. How you doing Demetrios?

2:19 Demetrios

I am doing amazing today. Absolutely amazing. I took an ice bath this morning. So things are good.

2:25 Vishnu

Ah, okay. That'll do it. That'll do it. Today, we have Ernest from Duo Security joining us. Thank you so much for joining us, Ernest. It’s a real pleasure.

2:34 Ernest

Of course! Thanks for having me. I've been listening to a lot of the podcast. I’m a fan of what all you guys do. So you've been kind of blasting through my ear for the past several days. It's cool to do it live now.

2:49 Demetrios

That puts a lot of pressure on us, man.

2:52 Vishnu

I hope you can deal with the sound of my voice and how it's gonna devil you at night and during the day – at all times. chuckles But what we have you on here to discuss today is not the sound of my or Demetrios’ voice, but ML platforms. You have put together some of the most thoughtful content about MLOps and ML infrastructure recently, and in particular, parsing lessons from big tech ML systems. I've really enjoyed reading your blog posts, the community has really enjoyed reading them, and I wanted to start by asking you – what led you to write this series of posts and take a case study approach?

3:35 Ernest

Yeah. It was kind of a surprise that it took off, so thanks for the kind words and for enjoying it. It's kind of my first blog post after the first one, which was kind of a practice one. But basically, a little context – Duo's core business is multi-factor authentication services for other companies. And then several years ago, we started to branch out. One of the products from that branching out is a threat detection product called Duo Trust Monitor, which I helped build. After launching that for general availability, I was looking at the list of things the product manager had in mind for future ML projects. I was basically thinking like, “We're going to have to start from scratch for a lot of this,” because we didn't build a lot of reusable components in the first iteration. We could reuse the workflow orchestration piece, but a lot of the other parts were custom.

So then I advocated for an ML infrastructure team that started small and I was very interested in what other companies do. There are a lot of players – companies in the space – that are kind of ahead of us in the infrastructure and platform journey. They're really good about publishing what they do, and I thought that diving in could be helpful to learn. My girlfriend says that sometimes I have an obsessive personality for certain things. So I really dove in and I thought that the research would be useful for other people and then I wrote the post.

5:10 Vishnu

Absolutely. It certainly has been, and a couple of thoughts there. Number one, it's a tale as old as time – a data scientist or ML engineer gets frustrated by the process at a given company and then says, “How can I do this better?” Right. Kudos to you for doing that. I thought one of the coolest things about the post – to anybody who hasn't read it, we’ll include it in the show links – is that it really dives into a number of different big tech companies. A lot of times, we'll have one-off discussions about Uber, about Etsy, about Google, about Facebook – but there are 12, 13, 14, sometimes 15 companies that you researched and that you included in your analysis, which I find to be a real strength. With that in mind, and with that comprehensiveness being a strength of your analysis, walk us through the five components of an ML platform that you noticed. Is there anything that you would add to that, since it's been about like six to eight months since that first post?

6:08 Ernest

Yeah, of course. So I’ll start with the first part. And yeah, I definitely wanted to cover a lot of different companies to kind of find commonalities. I didn't actually cover the super comprehensive platforms like the Facebooks or Googles, partly because they feel even more out of reach than what other people have, and I didn't know if it'd be as useful for those of us with less mature platforms. But yeah, the main components are – there's a feature store, which kind of feeds into both model training and serving. Then there is workflow orchestration, which helps you orchestrate your model training pipelines. The output of your model training pipelines goes into a model registry, and then there's a serving system that can serve your model online or in batch. Once something is online, of course, you want some sort of monitoring. For ML models, usually, you want some specialized model quality monitoring, because you have to worry about data metrics on top of your operational metrics. So there definitely are more components in platforms, but I wanted to show the more fundamental components so that it's sort of a more manageable system and it's also easy to learn.

I think one way to think about these components, that could be useful for some of the software engineers listening, is to compare these components to standard software engineering tools. So workflow orchestration is kind of like CI – you can create pipelines or DAGs for testing and building your code, then the result is an artifact. For code, it might be an artifact in Artifactory or a Docker image in a container registry, or a tarball in S3. But for ML, it's a model artifact that goes to the model registry. Then your model serving system is kind of like your standard stateless service, if it's an online serving system. The caveat is that it uses much more CPU and memory than your standard IO-bound service. Monitoring is similar to your observability tools, except that you need to monitor model metrics along with your operational metrics.

I think one of the most interesting components is the feature store because it's different from a standard database. It's used offline to create the artifacts you deploy. There are also strict requirements for parity between online and offline. And it's also kind of a point of collaboration between teams. So it's got features of data warehouses, online databases you use for real time traffic, and so on. But these analogies aren’t perfect. I just wanted to note that there are a lot of similarities with the software tools we already know and we shouldn't think of them as completely novel things. We can borrow from some of the lessons of these really robust software components as we’re on this journey.
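To make that CI-and-artifact analogy concrete, here is a minimal sketch (not from the episode) of the flow from training pipeline to model registry to serving, using MLflow – the registry Ernest mentions using at Duo later in the conversation. The experiment, metric, and registered-model names are hypothetical.

```python
# Minimal sketch of "training pipeline -> model registry -> serving", using
# MLflow as the registry. Experiment, metric, and model names are hypothetical.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("sqlite:///mlflow.db")   # local backend that supports the registry
mlflow.set_experiment("trust-monitor-demo")

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)            # a model metric, not just an ops metric
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="trust-monitor-demo",
    )

# Serving resolves a named, versioned model instead of a hard-coded file path.
served = mlflow.pyfunc.load_model("models:/trust-monitor-demo/1")
print(served.predict(X_test[:5]))
```

The registry plays the same role as the artifact store in the CI analogy: the serving system pulls a named, versioned artifact rather than whatever file the training job happened to write.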

9:15 Vishnu

That's a really great point. I wanted to quickly comment on that. I think a lot of ML engineers and data scientists – this is a thesis of mine that I've been harping on for a while – walk into the field, starting with model.fit, right? You start with the models, and then you figure out all the other infrastructure. And you realize how that other infrastructure has parallels to existing software engineering and where it's different. And I think what you pointed out there about feature stores in particular took me a while to understand. You know, “What is the difference between a data warehouse/database that serves all the other core data functions of a company and a feature store, which appears to just be for machine learning engineers?” There are a lot of great blog posts about it. I think Logical Clocks has a great one – I would recommend checking it out. But I think that's a great point. And I know Demetrios had something else that he wanted to ask you about. So I'm gonna kick it to him.

10:08 Demetrios

Yes, yes. I was looking at the blog and just thinking about the limitations that you highlighted and what some of these different lessons would be that you wouldn't necessarily advocate for everyone to go out and use. So you talk about – there are some great things that you see as patterns, but what are some of these anti-patterns? Or patterns that companies are using that you wouldn't necessarily say others need? And I almost like… I guess I'll leave it at that. And then I'll ask you a follow-up question. chuckles

10:50 Ernest

Okay, so – the anti-patterns. Yeah, I don't know if I have a great answer to this one. I guess another way to think of it is like, in terms of those existing components, what might be some things that you don't need to start out with, or maybe for a long time. One thing I've been thinking about is – I think workflow orchestration for ML workloads might not be as critical. For data workloads, I think it could be very critical with your ETL/ELT stuff. But I think a lot of times data scientists don't necessarily need complex DAGs. If a DAG can be turned into a sequence of steps, then at that point you can just collapse everything and put it into a single job. It's not the best design, but I think a lot of times data scientists just need to run something – run a job on a schedule – instead of having to manage Airflow.
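As a rough illustration of "collapse the DAG into a single job": the steps run in sequence inside one script, and a plain scheduler such as cron owns the cadence. This is a hypothetical sketch – all function names and the path in the cron example are made up.

```python
# A single scheduled job instead of an orchestrated DAG: run the steps in
# sequence and let a plain scheduler (cron, a cloud scheduled job, etc.) own
# the cadence. All function names here are hypothetical placeholders.

def extract_features():
    print("pulling raw events and computing features...")

def train_model():
    print("fitting the model on the latest features...")

def evaluate_and_publish():
    print("checking metrics and pushing the artifact to storage...")

def run_daily_job():
    # The "DAG" is just a linear sequence, so ordering is handled by call order.
    extract_features()
    train_model()
    evaluate_and_publish()

if __name__ == "__main__":
    run_daily_job()

# Example crontab entry (assuming the script lives at /opt/ml/daily_job.py):
#   0 3 * * *  python /opt/ml/daily_job.py
```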

11:44 Vishnu

I definitely felt that way.

11:46 Ernest

Yeah, I think it's easy to start and say, “Oh, I want these branching scenarios and fan-outs.” But a lot of times, if you can put in one job, it's a little simpler. Kind of related to your podcast with Skylar, I think one great place to start is with “How do you package and serve a model?” If we break down ML systems into – there's a development phase and then a production phase. Usually the development phase is really well known and easy to do, either locally or on a single remote machine. But then it's the integration with your application that becomes much harder. I don’t know if I answered your question completely.
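A minimal sketch of that "package and serve a model" starting point, assuming a binary classifier serialized with joblib and served behind FastAPI; the file name, payload schema, and endpoint are hypothetical rather than anything Ernest described.

```python
# Minimal "package and serve" starting point: load a serialized model once at
# startup and expose a stateless prediction endpoint. Assumes a scikit-learn
# binary classifier saved with joblib; path and payload schema are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.joblib")  # artifact produced by the training job

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    score = float(model.predict_proba([req.features])[0][1])
    return {"score": score}

# Run locally with: uvicorn serve:app --port 8080  (assuming this file is serve.py)
```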

12:28 Vishnu

I think it did. What do you think, Demetrios?

12:33 Demetrios

For sure, yeah. And it's exactly that – sometimes what's overkill and what is necessary, especially for those people that are just trying to realize the business value right away?

12:48 Ernest

Right. Yeah, I see. cross-talk I was just gonna say it's kind of related to – Vishnu, you've talked about before that it's important to invest early in platforms, yet there is like a sweet spot in “When do you start doing that? When is it too late? When is it too early?” And there are risks to both.

13:12 Vishnu

I completely agree. It's a delicate balance of supporting the breadth required for a platform to be effective and going into the depth required to actually create solutions that create business value from a modeling and model generation standpoint, in terms of actually creating models and putting them into production and keeping them going. I think one of the things that you summarized well in the blog post is this idea that what platforms are about is – how do data scientists repeatedly create value? And I thought that your quote right there was actually the most succinct distillation of “Oh, yeah. This is why we should create a platform. Because we're doing this once and we're doing it a million other times and we want them to repeatedly deliver value.” As you thought about that focus for why platforms exist, what level of maturity, or completion, did you observe in these companies’ efforts (Etsy, Uber, Spotify, etc)? And did you feel like anything was still missing or maybe half baked? Or things that you felt that could have been done better?

14:24 Ernest

The latter part is a little hard because usually they don't write about it. Usually, you write about what you have and then the other stuff is like “maybe you have it, maybe you don't”. But yeah, can you repeat the first part of the question again?

14:41 Vishnu

What level of maturity did you observe? You referred to Skylar – where Skylar said “Start with the serving components, start with how the entire interface will be used, and then work backwards.” How sophisticated did you find that outside-in approach at these companies? How fully formed was it across the entirety of that process?

15:07 Ernest

Yeah, I think most of it is pretty fully formed. From my perspective, they all have really strong model serving capabilities, which makes sense because that's actually how you deliver value. Model monitoring seems very mature, but there are some that are still working on it or don't have all the capabilities. I think it's easy to put that to the side for a little bit, as long as you can monitor your metrics in some coarse-grained way and your application isn't a mission-critical application. If you get bad recommendations and it’s not the end of the world, then it's not as important to be super sophisticated with your model monitoring. The feature store is definitely probably the most impressive – it seems like most of the companies have a very sophisticated feature store that they’ve built in-house specifically for their needs. And it's probably the thing that's hardest for smaller companies to stand up by themselves.

16:17 Demetrios

So out of these different ones that you've studied, which platform or which architecture did you admire the most?

16:27 Ernest

I guess there's different… There are some architectures that are very impressive because of their specific capabilities. I really like (I think it's PayPal and Uber) who have very sophisticated systems around Shadow Mode. And that's very useful for delivering value. But other than that, what's kind of impressive is the concerns that they abstract away from the data scientist in a way that provides a great user experience and makes it so they don't have to think about things. So I think there are some things, like Intuit’s feature of “once you have a model, then we have the service that helps you basically scale-test it and helps you determine the resource requirements.” And that seems super useful. I think more about it in terms of features that would be a great user experience, rather than a specific platform, because a lot of them have really great architectures and features already.

17:40 Vishnu

Yeah, there's a lot of impressive engineering that goes into building these platforms. I sort of have a really big picture question here, which is – we had Jacopo from Coveo on to talk about this idea of MLOps at reasonable scale, which is the scale that most companies have – most companies aren't Pinterest and Etsy and Uber with oodles and oodles of engineers and dollars to pay for an ML platform like this to be built. So with that, we have the scenario in our company a lot where the team is about five to ten data scientists, there are about double-digit models that are about to come into production, or already are in production (maybe 10 or 15) and the company is starting to grow – might be a little bit sort of Series B, C – and there is a push from the engineering team for the data scientists to get their shit together, so the data science team hires an ML engineer. To that ML engineer, or to that company, based on your research around how to build ML platforms that look good and do good things for big tech companies – what advice would you give them?

18:51 Ernest

There's definitely, in that scenario, it seems like… I've met companies before where they accumulate so much tech debt that they want someone to come in and fix all their problems, which at that point is a little too late, because your data scientists are probably disgruntled and the role will be less appealing to candidates – nobody wants to hear how many times you say “tech debt.” But at that point, it's the easiest to make the case for staffing because you desperately need the help. But I feel like I have to say it depends. But I think most of the time you start with, “How do you package and deploy your model?” And then go from there and think about “How do you make that more repeatable?”

I actually kind of ran into a similar problem recently when I was working with a nonprofit, and I'm helping them build their first ML system. Some constraints are – as a nonprofit, cost is a concern. So I thought we’d use serverless as much as possible, both for the cost and for the maintenance benefits. Also, their data science team doesn't really have anyone with strong engineering skills. So I'm not going to build something really complex and expect them to maintain it. I'm not going to try to stand up a feature store or have them deploy things in Kubernetes. What I did was create an architecture for that system, and then set up the code repo in a way that they can easily test, build, and deploy the model and the code around it. So they can pretty much self-service deploy the model. And because there's some structure around it, the data scientists can do what they need to do to improve that model.

But going back to your question, I think before thinking about platforms, it's important to think about, “Do we have a solid data platform, or at least a repeatable way to get the data that you need for analytics in general and modeling?” That's kind of a mistake we made at Duo – we thought we had a solid data warehouse ETL platform and it turned out we didn't. We had to rebuild it and that kind of took a lot of time. I think after you have a good data platform, then you can focus on ML infrastructure.

I’m making the distinction between ML infrastructure and ML platforms – the main difference between them is that ML infrastructure is about unlocking new capabilities and it's less focused on repeatability. So it's really easy to build useful ML infrastructure that's not necessarily repeatable. You want to design it with repeatability in mind, but maybe that's not the first goal, because you just want to make it possible to do something. I think after that, when you have some infrastructure, then you can kind of redesign things in a way that it becomes a platform. And I guess, similar to what some other guests have said, sometimes when you do it manually or more custom the first time, you have a better understanding of your requirements and then it's easier to generalize it.
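Going back to the nonprofit example a few paragraphs up – serverless for cost, a repo set up for self-service deploys – the serving side might look roughly like the Lambda-style handler below. This is a hypothetical sketch, not the architecture Ernest actually built; the bucket, key, and payload shape are made up.

```python
# Sketch of the serverless idea: a Lambda-style handler that lazily loads a
# model artifact from object storage and scores incoming records. Bucket, key,
# and payload shape are hypothetical stand-ins.
import json
import boto3
import joblib

_model = None

def _load_model():
    global _model
    if _model is None:
        s3 = boto3.client("s3")
        s3.download_file("example-ml-artifacts", "models/latest/model.joblib",
                         "/tmp/model.joblib")
        _model = joblib.load("/tmp/model.joblib")
    return _model

def handler(event, context):
    model = _load_model()
    features = event["features"]          # e.g. [0.2, 1.7, ...]
    score = float(model.predict_proba([features])[0][1])
    return {"statusCode": 200, "body": json.dumps({"score": score})}
```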

22:03 Demetrios

There's something cool that I am thinking about with what you did. Again, this is a bit more big picture, but… you saw that we need some kind of infrastructure and we need to aim towards having a platform. So you had this initiative that you championed inside of the company. How was that process? Because we talk a lot about how MLOps is more than just what tools you decide to build, or if you're going to do it in-house, or build versus buy – all that kind of stuff. What were some of the value props that you went to managers with or the upper management and said, “Look, this is important because of XYZ.” And how did you get buy-in?

22:50 Ernest

So I was lucky, or am lucky, to have a very supportive manager. But part of the conversation – it started out with our one-on-ones, where I basically said, “We want to build these things. The product manager wants these things. But here's the state of the system we helped build. Here are the parts of it that are reasonable and the parts that are going to be pretty hard to reuse across projects.” Then from that, I wrote a proposal – it's kind of like a vision document, in a way, about “This is the type of platform and infrastructure we can build. These are the core tenets and how we might do it.” Then I shared it with my manager. She told me it was way too long, which was correct, and helped me edit it down. Actually, at that point, she was already kind of invested. And then I shared it out with the rest of the team. We managed to move an engineer who was working more on the analytics platform to help out in this area, and then we kind of started from there. The first thing we did was basically meet with the data scientists involved – the previous data scientists – to figure out what the pain points were, and what the most high-impact, low-effort things were that we could do at first to demonstrate value with building out infrastructure for our systems.

24:22 Demetrios

Alright. So let's change gears real fast and let's jump into the other article that we both thought was great – around deployment. Specifically, maybe you can go over just different common model deployment patterns that you saw, and you have learned about, and potentially are using?

24:44 Ernest

Sure, yeah. The most basic one that I saw is, I guess it counts as batch. Yes, it's batch, but in a way where you enumerate all the possible inputs and you just put it in a lookup table. I think that's like one specific type of batch. But yeah, in terms of batch – there's just batch where you process some batch of data, or you enumerate all the inputs and put your outputs and predictions in some set of tables, so all you need to do to serve your model is to look it up in the database. And then there's, of course, the stateless service – usually, it's a pull-based model where clients send requests to the service and they get a prediction back. But some companies also have sort of a push-based model where the service consumes events through some stream and then pushes the prediction somewhere.

But yeah, I recently wrote a blog post which is more generally about architectures I've seen in industry for serving a huge number of models. Part of the motivation was – at Duo, we train per-customer models, but in a batch cycle, which leads to a huge number of models, and I was thinking, “How do we serve this in real time? It seems really hard.” So I wanted to dig in and see what different companies have done. Different companies have published their own architectures, but a bunch of them are surprisingly similar in how they handle this case. So that's what the newer blog post is about, and I can dig more into it if you'd like.
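The first pattern Ernest describes – batch scoring plus a lookup table – reduces serving to a key lookup. A small hypothetical sketch, with SQLite standing in for whatever online store would actually be used:

```python
# Sketch of the "batch + lookup table" pattern: a batch job precomputes
# predictions for every entity, and serving is just a key lookup.
# SQLite stands in for the real online store; the scoring function is a dummy.
import sqlite3

def batch_score(entity_ids):
    # Hypothetical stand-in for running the real model over a batch of inputs.
    return {eid: 0.01 * (hash(eid) % 100) for eid in entity_ids}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (entity_id TEXT PRIMARY KEY, score REAL)")

# Batch side: enumerate inputs, score them, write them out.
scores = batch_score(["user-1", "user-2", "user-3"])
conn.executemany("INSERT INTO predictions VALUES (?, ?)", scores.items())
conn.commit()

# Serving side: a prediction request becomes a database lookup.
row = conn.execute(
    "SELECT score FROM predictions WHERE entity_id = ?", ("user-2",)
).fetchone()
print(row[0])
```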

26:24 Demetrios

When you deconstructed all of these different ways of doing it, and you were looking at how to apply this to what you're doing at Duo, what were some main takeaways that you had? And it seems like I'm getting the sense of like, the way that you learn, or a pattern from you, Ernest, and it feels like you just go out, you try and learn about all the best and the brightest in their fields, and then you synthesize it down to something that you can put out as a digestible piece of content. I imagine that is so that you can take it back and use it while at Duo. What were some of these huge learnings that you took from writing this blog post?

27:08 Ernest

Yeah, that is a big part of how I learn. I guess one issue is that you usually get just a snapshot point in time, and you don't learn as much about how they prioritized what to build first, in some cases. But yeah, at Duo, I think the first thing that I sort of realized was thinking about the distinction between infrastructure and platforms – one of them being repeatability. We have to kind of start with the infrastructure side and then, from there, think about how to turn it into a platform. But then one big lesson, from both what I've read and from talking with data scientists, is that we don't have great visibility into what happens in production. This might be the case for a lot of different companies – at Duo, most employees don't have access to production for security purposes. So I can’t go and look at things in production. Basically, we have these huge pipelines running in production and we want to get more visibility into the model metrics and into the artifacts that are produced. Early on, a quick win is getting access to those.

So one of the things we did was to set up a model registry – that's not something that we had. We had workflow orchestration, we had a model training pipeline, and the way we serve a model is in batch. So, out of those five core components, the main things we didn't have were the feature store, a robust, repeatable model quality monitoring solution, and a model registry. And it seemed like the best first place to start was with the model registry. That alone provided a lot of visibility into the metrics for a model, because not only could we give data scientists access to the UI, but we could pipe those metrics – we use MLflow – into our data warehouse, so they can actually do in-depth trend analysis on these models. Since we train models every single day, for every customer, you do need some analytic tools to be able to really dig into how performance changes.

The other thing we did wasn't even really part of the platform. It was just to set up some simple S3 bucket monitoring – so we copy artifacts from one place to another place that data scientists can actually see, and we can provide scoped access and audited access to see those artifacts in production. So that's another one. Then after that, we kind of reevaluated and we saw that there was kind of a lack of standards in how we write and read tables from our pipelines in general. That, coupled with the need to decouple or split our monolithic pipeline, led us to adopt Delta Lake as a data format. It's still in progress, but it's going to help us a lot in making sure that different processes can write and read from the same tables. It's also a standard data format that we can extract metrics and insights from, put into a data warehouse, and use as an interface between parts of our system.
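The "pipe registry metrics into the warehouse" step might look roughly like the sketch below: pull run metrics out of MLflow as a DataFrame and land them in an analytics table. This is a hypothetical illustration, not Duo's pipeline; the experiment name, metric name, and CSV sink are assumptions.

```python
# Sketch of piping registry/tracking metrics toward a warehouse: pull run
# metrics out of MLflow as a DataFrame so analysts can do trend analysis over
# daily, per-customer models. Experiment and metric names are hypothetical.
import mlflow
import pandas as pd

runs: pd.DataFrame = mlflow.search_runs(experiment_names=["trust-monitor-demo"])

# Keep just the columns you'd land in a warehouse table (assumes runs exist
# and the "test_auc" metric was logged).
metrics_table = runs[["run_id", "start_time", "metrics.test_auc"]].rename(
    columns={"metrics.test_auc": "test_auc"}
)

# Stand-in for the warehouse load step (the real sink might be Redshift,
# Snowflake, etc.); writing CSV keeps the sketch self-contained.
metrics_table.to_csv("model_metrics_snapshot.csv", index=False)
print(metrics_table.head())
```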

30:47 Demetrios

A quick one for you and then I see Vishnu wants to ask a question. As you're talking about – you were reading all these blog posts and then you had the ‘point in time’ idea of what they did, but you didn't get how they prioritize things, as you just talked about, “Oh, well, this seemed like a low hanging fruit. This seemed like a low hanging fruit.” Was that how you were prioritizing? It was like, “Well, this is like an easy win. Let's go for this first.” And then you just start adding easy wins and easy wins and building from there?

31:21 Ernest

It's kind of a combination of… prioritization is like, “Once we have enumerated the pain points for all the different parts of the data stack's lifecycle, what's the combination of high impact plus low effort?” We also have to take into account “In what order do we sequence these improvements? Does doing one first make the rest a lot easier?” So those are all the different factors. We don't always want to pick the easiest wins – it's like some sort of sorting metric that we came up with in our heads to decide what to work on, somewhat arbitrarily.

One thing that I gleaned from all the research is that all these platforms are kind of different, because they really tried to prioritize what's useful for their companies. It seems to me that at Intuit, they put a lot of emphasis on privacy, compliance, and reliability. So that's where the scale testing comes in. I didn't write about this, but they also have sophisticated ways to ensure compliance on the data that is used by the models. But there are other companies where it's more about super high-scale, low-latency serving, and that's what they prioritize. So I saw these differences and kind of understood that I can't just follow what they did. I have to kind of follow their approach, really understand the requirements of our system, and use that as prioritization.

33:01 Demetrios

So good.

33:02 Vishnu

That makes a ton of sense. One question I have is – can you just explain what Delta Lake is and how that solved the problem that you were facing with this metrics analysis?

33:13 Ernest

Yeah, sure. So Delta Lake – basically, a common pattern is you would process data with a processing engine like Spark and you dump it to object storage like S3. And your data table isn't like one file, it's split across a ton of Parquet files. What Delta Lake provides – one of the main things it provides – is ACID transactions on those tables. So you can transactionally update a single table, and if it fails, it's not that you've updated certain files while the rest are not updated, or appended some parts to the table but missed other parts. One of the issues with using Parquet by itself is that you don't get these transactional guarantees. So say I want to have one place in my S3 storage that says “This is a table of features,” and I have some pipelines that write my features to it. If a write fails halfway, you're kind of screwed, because Spark will first delete all the files and then write.

At that point, it's kind of hard to know that it failed. And if you have consumers that are reading from that table at the same time, they're going to get inconsistent data. They might see parts of the new data, parts of the old data, or none of it, if they're reading at exactly the wrong time. So having this abstraction of tables on top of S3 makes it easier to think about the artifacts that we persist not just as a list of Parquet files, but as tables that you can update over time and delete and put some tools around. Also, in terms of the consumers, it's easier to know that I'm getting a good snapshot of this table and to have decoupled writers and consumers of the same table – which is hard to do with Spark in the native way that it does writing and reading.
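A small sketch of that pattern with the delta-spark package (an assumption about setup, not necessarily how Duo runs it): the writer commits a new table version atomically, and readers always see a consistent snapshot.

```python
# Sketch of the Delta Lake pattern: writers update a feature table
# transactionally, so readers either see the old snapshot or the new one,
# never a half-written mix of Parquet files. Paths and the delta-spark setup
# are assumptions (pip install delta-spark).
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("feature-table-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

features = spark.createDataFrame(
    [("user-1", 0.4), ("user-2", 0.9)], ["entity_id", "login_velocity"]
)

# Writer: the overwrite is atomic - it commits a new table version or fails whole.
features.write.format("delta").mode("overwrite").save("/tmp/feature_table")

# Reader: always gets a consistent snapshot, even if a writer is mid-commit.
spark.read.format("delta").load("/tmp/feature_table").show()
```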

35:24 Vishnu

It's a great explanation. That's a really great explanation. Related to that, because you're such a good explainer, I wanted to ask you more about model rollouts and Shadow Mode. It's something that you highlighted as a sort of design pattern in your model deployment survey blog post. I get asked this question a fair bit, where people say, “Hey, Vishnu. If I'm building an ML system and suddenly something like COVID happens, and all the data that I had before is completely skewed in practice. What do I do to solve that?” I think Shadow Mode is a solution. Do you agree? And can you explain how these companies that are operating at massive scale implement it?

36:07 Ernest

Yeah. That's actually going back to what might be missing from the blog post. I think I would have highlighted Shadow Mode more since it's such a powerful technique. Because it's kind of part of the serving system and the monitoring system, but you also have other components that enable Shadow Mode and make it easy to do. In terms of the massive data drift problem, I'm not sure if it would help the existing model unless you turn it off. But I think it could help in the general case of testing changes really quickly. One thing that I liked – I think it was PayPal’s system – not only do they make it really easy to do Shadow Mode, but you kind of schedule a time in. So you don't have to max out your compute. You can say, “I have these many slots for doing Shadow Mode.” And then a team can go in and say “I want to shadow this model for this time,” and they can just get the results for that time. So that seems like a really great feature.

Another great feature is just thinking about, “What model am I shadowing?” So, you're not just deploying a single model, but you're trying to pair it up with things to compare with. And I think that is kind of a shift I had early on thinking about, “How do I do Shadow Mode?” But yeah, in general, with Shadow Mode – I kind of talked about the user experience part, but it's really important to make sure that your shadow deployment doesn't have as much priority on resources as your production deployment, and that it's done in a way where it doesn't affect your production traffic. That's a little hard sometimes, because a lot of times it may access the same data, and you might have to route traffic to both the production model and several shadow models, which is going to impose more load on your system. But doing it in a way where it's least interfering would be useful.
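A toy sketch of that request path: the production model answers the caller, shadow candidates score the same input on a best-effort basis, and only logs are kept for comparison. The model classes and logging sink are hypothetical stand-ins.

```python
# Sketch of a shadow-mode request path: the production model serves the
# request, and shadow candidates score the same input purely for offline
# comparison. Model objects and the logging sink are hypothetical stand-ins.
import logging

logger = logging.getLogger("shadow")
logging.basicConfig(level=logging.INFO)

class DummyModel:
    def __init__(self, name: str, bias: float):
        self.name, self.bias = name, bias

    def predict(self, features: list[float]) -> float:
        return min(1.0, sum(features) / len(features) + self.bias)

production_model = DummyModel("prod-v3", 0.0)
shadow_models = [DummyModel("candidate-v4", 0.05)]

def handle_request(features: list[float]) -> float:
    # Only the production prediction is returned to the caller.
    prod_score = production_model.predict(features)

    # Shadow scoring must never break or slow down the real response.
    for shadow in shadow_models:
        try:
            shadow_score = shadow.predict(features)
            logger.info("shadow=%s prod=%.3f shadow_score=%.3f",
                        shadow.name, prod_score, shadow_score)
        except Exception:
            logger.exception("shadow model %s failed", shadow.name)

    return prod_score

print(handle_request([0.2, 0.6, 0.9]))
```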

38:27 Vishnu

One of the things I noticed in your blog post, in your explanation just now, and throughout this entire podcast is – you have a real ability to understand the engineering challenges posed by machine learning in a very granular way. For example, I noticed in the serving posts that you had a really nice sort of tangent on resource utilization and I thought that that was really useful because I think a lot of machine learning blog posts tend to gloss over the particulars of the engineering challenges and you end up having to figure that out yourself. When you're looking at your AWS instance, or your resource utilization, you're like, “What the heck is going on here?” There aren’t really those call-outs when we're talking about high level things like platform and some of the engineering challenges there. So with that observation, I want to ask you – are you an ML engineer or a data scientist? How do you think about your career?

39:29 Ernest

Um… kind of in between? I’m a data scientist by title but I think I've been doing ML engineering stuff for a while. I think I heard on a recent podcast that there's going to be a post called “The Rise of the ML Platform Engineer” at some point. chuckles

39:50 Demetrios

Hopefully. chuckles

39:54 Ernest

I think that kind of summarizes maybe my last year of experience – infrastructure or platform engineering. But I've always kind of leaned more towards the engineering side. I've done a little bit of research and prototyping, as well. But I think I enjoy the engineering side a bit more.

40:15 Demetrios

There's something that I wanted to ask you about with the platform that you have now and it's mainly around like the trade-offs that you can run into when you're trying to have something that is flexible – a platform that can be used and molded in two different ways by the end users or something that's simple. And I feel like I've heard about how you have to choose one of these. I'm wondering how you look at that when you're trying to take into consideration things like, “Should we go the simple route or should we go more flexible?”

40:52 Ernest

Is one of those related to user experience? Or is the assumption that the user experience can be the same for either one? I would think that usually the simpler approach is easier to build a good user experience around, but what are your thoughts?

41:08 Demetrios

Yeah, it's mainly around like, we can create something that is very opinionated and simple, or we can create something less opinionated and flexible, because our engineers or our data scientists want to have that ability to tweak things.

41:24 Ernest

Right, yeah. I think one interesting… It kind of reminds me of Skylar's post of like “Data is wicked, you have to provide like different levels of control, or else no one's really satisfied.” And I think that's something that we've seen from these posts – some of these companies start with purely optimizing for production needs (mainly scale, reliability, availability – that sort of thing) and then they kind of realize, “Wait, data scientists don't want to write in this language, or it's really cumbersome to use this.” And this is an option, to use services that deploy on this super high scale thing, but eventually, they'll go towards “We're gonna support Python. We’ll support your Python models. Maybe as an engineer, I'm not happy because it's not very efficient, but it helps increase our business speed, which is what's important.”

So I think having a good user experience for different users at different levels of engineering maturity is really helpful. But in general, I would say because the space moves so quickly, you probably don't want to standardize on a specific technology or framework unless it's super well-established. There's probably not too much risk in saying “One of the models we’ll support is XGBoost and the other is PyTorch.” Because they've been around for a while and many people use them. But if you're going to say, “I'm standardizing on this new framework that just came out.” And it has one blog post and ten users – then that's probably not a great idea.

43:15 Vishnu

I totally agree. Totally agree. As an early adopter myself, I have to deal with that tension. And with that, it’s kind of… cross-talk Go ahead.

43:25 Ernest

Oh, sorry to interrupt. There's this really good PowerPoint by an engineer called something like, “Choose Boring Technologies”. It's been around for a while. But that's kind of part of the thinking. It's like, “I choose MySQL because it hasn't broken – it hasn't lost any data for the last 10 years. Maybe it's not the best and shiniest object, but that's part of the technology choice.”

43:53 Vishnu

Boring over sexy.

43:54 Demetrios

cross-talk It’s funny because one of the community members, back in the day, I remember Flavio – he talked about how he wanted to create a boring conference. And he's like, “We'll just have a conference about boring technology. None of this cutting edge stuff! We're just going to talk about the stuff that actually works and you can have no problem sleeping at night if you add it into your stack or you're using it.” Sweet, man.

44:19 Ernest

That would be kind of cool. Like if they really dug into, “It's all technologies we've used 10 years ago, but here's the details of how it works really well and why it works really well.” At least I think I would geek out about that.

44:35 Demetrios

Yeah. Right? So, sweet, man. I appreciate you coming on here. I appreciate your wisdom and your ability to take all of this information that is a lot to read and a lot of time that you’ve obviously spent on studying up on these different patterns, and then distill it into a blog article, and distill it into your mind and regurgitate it for us. It's been more than helpful. I cannot thank you enough. Ernest, this was awesome. I am actually a Duo user. I remembered that I do use Duo. Mainly for my Coinbase two-factor authentication. So yeah. Sweet. But, thanks again, man. This was awesome.

45:18 Ernest

Yeah, thank you so much. I really appreciate the work you guys do around the podcast community. And thanks so much for having me on. outro music
