MLOps Community

Continuous Deployment of Critical ML Applications

Posted Mar 10
# Deployment
# Pipelines
Emmanuel Ameisen
Senior ML Engineer @ Stripe

Emmanuel Ameisen has worked for years as a Data Scientist and ML Engineer. He is currently an ML Engineer at Stripe, where he worked on helping improve model iteration velocity. Previously, he led Insight Data Science's AI program where he oversaw more than a hundred machine learning projects. Before that, he implemented and deployed predictive analytics and machine learning solutions for Local Motion and Zipcar. Emmanuel holds graduate degrees in artificial intelligence, computer engineering, and management from three of France’s top schools.


Finding an ML model that solves a business problem can feel like winning the lottery, but it can also be a curse. Once a model is embedded at the core of an application and used by real users, the real work begins. That's when you need to make sure that it works for everyone, that it keeps working every day, and that it can improve as time goes on. Just like building a model is all about data work, keeping a model alive and healthy is all about developing operational excellence.

First, you need to monitor your model and its predictions and detect when it is not performing as expected for some types of users. Then, you'll have to devise ways to detect drift, and how quickly your models get stale. Once you know how your model is doing and can detect when it isn't performing, you have to find ways to fix the specific issues you identify. Last but definitely not least, you will now be faced with the task of deploying a new model to replace the old one, without disrupting the day of all the users that depend on it.
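The drift detection step above is often done with a simple distribution comparison between training data and live traffic. Here is a minimal sketch using the Population Stability Index; the function and the 0.1 / 0.25 rule-of-thumb cutoffs are illustrative, not from the talk:

```python
# Minimal drift check with the Population Stability Index (PSI).
# Bins are fixed from the training ("expected") sample; production
# ("actual") data is compared against them.
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between two 1-D samples."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clip so out-of-range production values land in the outer bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_frac = np.maximum(e_counts / e_counts.sum(), 1e-6)
    a_frac = np.maximum(a_counts / a_counts.sum(), 1e-6)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)
same_dist = rng.normal(0.0, 1.0, 10_000)   # no drift
shifted = rng.normal(0.8, 1.0, 10_000)     # mean has drifted

assert psi(train_scores, same_dist) < 0.1   # common "stable" cutoff
assert psi(train_scores, shifted) > 0.25    # common "significant drift" cutoff
```

Running a check like this on a schedule is one concrete way to learn "how quickly your models get stale" rather than guessing.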

A lot of the topics covered are active areas of work around the industry and haven't been formalized yet, but they are crucial to making sure your ML work actually delivers value. While there aren't any textbook answers, there is no shortage of lessons to learn.


All that glitters is not gold, and from our last MLOps coffee session, it turns out engineers are as susceptible to Shiny Object Syndrome as the rest of the population. Over-engineering, over-complicating, and the constant urge to move to the next shiny tool once something gets standardized - all traits our recent guest Emmanuel Ameisen from Stripe calls out as serious challenges when it comes to streamlining the continuous deployment of mission-critical ML applications. With hundreds of ML projects behind him, he’s pretty much seen it all - including the human barriers that get in the way of a successful project.

Grab a cup of coffee, tea, or hot cocoa (there’s no hot drink discrimination here) and listen in because the whole session is riddled with truly helpful gems. But if you’ve only got a moment, here are our top three takeaways.

ML Engineers Like It...Complicated?

You know you're an engineer at heart if you want to over-complicate something that could be done in a much simpler way.

Machine learning is a fast-evolving field and we’re progressively tending towards simpler solutions. Ten years ago, you might have had to train your own model and maybe five years ago pre-trained embeddings that worked well became available for download. And we’re still improving on that process to make things easier for everyone.

But while systems have improved, ML engineers as a group haven’t changed that much. We still like taking the complicated, build-it-from-the-ground-up route. Or as Emmanuel puts it, ‘As the fancy stuff becomes normal, it becomes boring. And when it becomes boring, we want the new fancy stuff.’

That personal quality might be one of the pivotal factors driving the rapid evolution of ML. But that doesn’t mean it isn’t better practice to focus on your use case, KPIs, and stakeholders to find tools and systems that make the most sense for what you’re doing.

Operational Excellence & Regular Maintenance of Models

We could just as easily call this section 'Where the Real Work Begins'. As Emmanuel says, “You develop operational excellence by exercising it.” He walked us through how most teams release models. They’ll have a new use case, they’ll think of the model, make the model, be happy with it, and release it. Sometime later, they’ll decide to release an updated version and run into a series of problems: the code used to train the model is way out of date, release criteria might as well be non-existent, and no one seems to know where the data is. It’s almost like performing ML archaeology to figure it out.

Developing a system of regular engagement with the model prevents you from having to clean up big problems later, as the data rots or the assumptions you built the pipeline on stop being relevant. The cracks that occur in the system are much smaller and easier to manage on a two-week maintenance rotation than after letting the model sit and rot for a year or more before thinking about it again.

It seems as though there's an assumption that models are deployed and then there's no need for maintenance. If anything, ML is the opposite of static and must be updated more often than traditional software because it depends on both code and data - and data changes.

Advice for Iteration

Automation is probably safer, too, as it prevents oddly specific tribal knowledge from disappearing forever.

The main model Emmanuel works on at Stripe decides whether any given transaction is allowed or blocked. For something like this, it’s not enough to be good on average. It has to be good across the board. You can imagine there’s a multitude of high-consequence potential failures here. He argues that automation makes things safer because, when you have so many things to think about, specific pieces of information get distilled into the team's knowledge. Automation safeguards against the consequences of someone leaving the team, or of needing some of that niche knowledge later when the person who had it is no longer available.
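The "good across the board, not just on average" requirement can be expressed as a per-slice release gate. A toy sketch; the slice keys and the recall floor are invented for illustration, not Stripe's actual criteria:

```python
# A model can look fine on aggregate metrics while failing one slice of
# traffic. This gate computes recall per slice and flags any slice below
# a floor. Slice names and the 0.8 floor are illustrative.
from collections import defaultdict

def recall_by_slice(records, min_recall=0.8):
    """records: (slice_key, y_true, y_pred) triples; recall on positives."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for key, y_true, y_pred in records:
        if y_true == 1:
            if y_pred == 1:
                tp[key] += 1
            else:
                fn[key] += 1
    report = {k: tp[k] / (tp[k] + fn[k]) for k in set(tp) | set(fn)}
    failing = [k for k, r in report.items() if r < min_recall]
    return report, failing

records = (
    [("card_eu", 1, 1)] * 90 + [("card_eu", 1, 0)] * 10 +   # 0.90 recall
    [("card_us", 1, 1)] * 50 + [("card_us", 1, 0)] * 50     # 0.50 recall
)
report, failing = recall_by_slice(records)
assert failing == ["card_us"]  # blocked despite a decent 0.70 average
```

A release that fails any slice gets held back, even if the headline metric improved.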

Automation might sound a bit dark and scary in the first iteration, while you’re still learning the ropes of how things are done, but in the long run, once you’ve done a few laps around the block, automation might be your best friend.

To ease yourself and your team into it, Emmanuel says ‘Suggest before you automate.’ Loop a human into the process. Write your automation, suggest the value you want to automate for, and do that for a few cycles. Afterwards, if the person in the loop consistently says they’re not changing anything, then you’ll feel more comfortable automating.
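A minimal sketch of that suggest-then-automate loop, assuming the value being suggested is numeric (such as a decision threshold); the four-cycle agreement window is an arbitrary choice, not something from the talk:

```python
# "Suggest before you automate": log the automated suggestion next to the
# human's final choice each cycle; only flip to full automation once the
# human has stopped overriding for a whole window of cycles.

def ready_to_automate(history, window=4, tolerance=0.0):
    """history: list of (suggested, chosen) pairs, oldest first."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(abs(s - c) <= tolerance for s, c in recent)

# Simulated cycles: the reviewer overrides once early on, then stops.
history = [(0.70, 0.65), (0.68, 0.68), (0.71, 0.71),
           (0.69, 0.69), (0.70, 0.70)]

assert ready_to_automate(history)          # last 4 cycles: no overrides
assert not ready_to_automate(history[:3])  # too few cycles to trust yet
```

The same shape works for anything suggestible: a threshold, a feature list, a rollout percentage.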

Bonus: Breaking into the Field

Likely related to an engineer’s propensity to love all that is complicated, people tend to think, "I must do the most complicated thing I can do to get hired in ML."

But the truth flows in the other direction. Being able to show actual progress and a completed project, even if it’s something simpler, may be the game-changer. It shows you’re not only able to learn but to learn progressively. And sometimes, the most important feature of any project is simply that it’s done.

Building Machine Learning Powered Applications: Going from Idea to Product Book

Looking for that simple project to take on? Emmanuel’s book, Building Machine Learning Powered Applications: Going from Idea to Product, might be the thing you’re missing. You’ll get the most out of it with a hands-on approach, as it takes you through building an example ML-driven application. It covers the tools, best practices, challenges, and solutions for each step of the way, and it translates to building real-world ML projects.




speaks French



What's happening, Adam? How are you doing, man?



Very good, very good. How are you?



I am trying very hard not to make the joke about what your last name means. laughs That is how I’m doing right now. People are sick of it. It's just basically “insert random animal” and “random country” and say that’s what your last name means. laughs If it is an animal that has to do with the sea or the ocean, even better. But anyway, we're here today. We just got off this podcast with Emmanuel Ameisen. Who – Wow. He blew me away. I don't know about you.



I think “wowzers” is the word. Yeah, definitely.



All right. So some top takeaways from you, and then I'll give you mine.



Yeah, he obviously wrote the book on machine learning powered applications – wrote a book – which is very favorably reviewed. He obviously knows his stuff. And he's at that end of the MLOps space where he's worked in really quite mature organizations. That's quite interesting. Lots of interesting takes on how to go from the clunky, hard, “how do you get value out of the model” process to flying, and the hard yards to put in to go from that starting point to where we kind of want to be, which is really quite cool. He’s got quite a pragmatic approach to things, actually, like how to align your metrics to get value, and how to do things properly. I think those were the big ones for me.



Yeah. It's funny that you mentioned that because it was very much like, “Okay, this isn't really the zero to one phase – this is more like the one to infinity phase” that he is trying to fine-tune right now at Stripe. He's doing an incredible job. And I think probably one of the best quotes that I've heard in a while was (before we actually started recording), he mentioned, “You develop operational excellence by exercising it.” Then I asked him to go into it, and he went deep into what that means and how he implements it. I just loved it, man.

So let's jump into this. Before we do, I will mention there's a little bit of an aside. Before we jump into this, there are all kinds of things I want to actually announce. One is that we've got all kinds of cool swag on our shop – on the website. Go check it out. Adam’s baby has been wearing it. It's our best model. chuckles That's incredible to see. We've got baby clothes, but we also have big kids’ clothes and big humans’ clothes.

The next thing I will mention is that we're looking for people to help us take some of these podcasts and choose where the best parts of the podcast are and then we can create snippets out of them and basically make a highlight reel for those who don't have time to listen to the whole podcast and just want the ‘quick and dirty’, ‘best of’ podcast. So if you are up for telling us your favorite snippets of the podcast episodes, get in touch with me, because we would love to have you help out. And that's it. Let's jump into this conversation with Emmanuel.

intro music



Yeah, man. Let's talk about your book – Building Machine Learning Powered Applications. What inspired you to write this?



Yeah. I used to work at a company called Insight Data Science. I joined it after being a data scientist myself for a bit. And that company – the whole goal was machine learning education, and more specifically, professional education, so teaching people how to actually do machine learning in a corporate setting to actually deliver value and then get hired for it. And so, the way it worked is we would lead projects with people that wanted to get hired as machine learning engineers and data scientists – oftentimes applied projects in partnership with companies. And, you know, we'd build a machine learning application to do text classification, email classification for support tickets, computer vision, or even reinforcement learning.

We kind of touched a broad range of applications. And I started seeing that the failure modes of all of those applications, the ways in which things would go wrong, actually had a lot more in common than I thought. Initially, I thought that maybe every machine learning application was its special jewel. And then I realized, “No, success criteria and failure criteria are pretty consistent across the board, at least in some ways.” That felt interesting. So that's why I wanted to write about that. Before I started writing, I saw that there wasn't much being written about it, so that was further motivation. I was like, “Oh, this is an interesting topic and also I can't find resources on it. So I'm going to try it.”



I always find it fascinating when you talk to people that have written books because it's such an undertaking – a document that size. How did you find the writing process? How did you go about starting, as well? Did you start writing and then find a publisher? And what would be the next book you'd write?



Oh, man. Okay, how did I start writing – I'll start there. Because I feel like I didn't really… Well, I always wanted to write a book eventually – at some point – but I didn't want it then. What happened was, I was actually getting pretty frustrated with NLP projects, and how 1) they all tended to look the same (the successful ones) and 2) everybody always wanted to overengineer them. This was like three or four years ago and it was wonderful because for most NLP projects (natural language processing projects) you can do stuff that's pretty simple and get just amazing value immediately. But everybody was excited about incredibly complex architectures.

I wrote this blog post (I don’t really remember what it was called) something like “How to solve 95% of NLP problems,” and it was a tutorial of like, “You just do this, and then you do this. And if that doesn't work, you do that. And then that doesn't work just do that.” It was based on literally dozens and dozens of NLP projects, and just seeing them succeed and fail. And that blog post just took off. I think now it probably has like half a million reads or something. People just really liked it. And O'Reilly, the technical publisher actually reached out to me and they were like, “Hey, we love your blog posts. Do you want to write a book?” So that's how I wrote – that's how we got started with writing. Yeah.



Just before you tell us what your next book will be – do you think it's changed since then? Do people still want to overengineer NLP?



I think yes and no. So I think what's cool about ML is that it is a field that's evolving fast. And so the kind of easy, “Hey, just do this. Don't do the complicated thing, just do this,” solution gets more elaborate as time goes on. Maybe initially, to give you an example, think like 10 years ago, you'd say, “Oh, you want to do review classification or something. Just do something called TF-IDF. Do counts on words. Have some version for each sentence, count the words, count their occurrences, and then you train a classifier. And that's fine. That'll work.” Then maybe five years later, papers like Word2Vec and stuff like that came out, where you could actually download pre-trained embeddings that were really good. And it was like, “Oh, don't train your own model, just download these embeddings and use them as-is. It will take you like an hour. It's easy.”
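The baseline described here (word counts plus a plain classifier) takes a few lines with scikit-learn; the toy reviews below are invented for illustration:

```python
# The "simple baseline first" route: TF-IDF features plus a linear
# classifier, instead of a custom deep architecture.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, loved it", "terrible, waste of money",
         "loved the quality", "terrible quality, money wasted",
         "great value, would buy again", "waste of time, awful"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive review

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

assert clf.predict(["loved it, great quality"])[0] == 1
assert clf.predict(["terrible waste"])[0] == 0
```

On a real dataset, this kind of pipeline is often the hour-long baseline worth beating before reaching for anything heavier.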

We're not exactly there yet, but I would say we're slowly getting to the point where maybe that version is like, “Oh, just make an API call to some service and they’ll have deep models that you can just kind of use.” So that's changed. But what I feel like has always remained true is that machine learning engineers are incredibly good at self-sabotaging. Whatever the simple, rational thing that you could do to solve the problem, they're like, “No, no, no. I'm gonna build my own 16-layer thing. First, I’m going to buy a supercomputer, then I'm gonna train on it for like three months and do this thing.” And you look at it like, “It's been a year. What have you produced?” And they’re like “Nothing, but I burned $100,000 of cloud credits.” I think, somehow, we're still always excited about the fancy stuff. So as the fancy stuff becomes normal, it becomes boring, like, “Oh, I don't care about this anymore. I want the new fancy stuff.” So that hasn't changed.



Do you think that's intentional? I have a bit of a view on that. Years ago, I used to do talks about… I used to call it something I probably can't say on a family-friendly podcast – but it was about this idea of all CV-driven development, that was the other one, where you get people solving a problem who have seen it. Do you think it's intentional? Or do you think it's actually just a natural component of the world we work in, where you've got these interesting things that are technically complicated, and it's easy to go into a complex solution?



I think it's a bit of both. There's definitely resume-driven development – for sure. I think, for a while (and still now) there was this perception – again, I worked with people that wanted to get a job in machine learning and so one of the things that maybe they believe is they’re like, “Well if I want to get a job as a machine learning engineer, I have to find the most complicated ML solution and implement it. I'm not gonna get hired because I took some pre-trained model and used it for something super useful. I'm gonna get hired because I invented a new type of machine learning model (or whatever).” And that's not true.

In fact, what happens is that, people that will be hired and will be successful are the ones that were like, “Hey, here's what I achieved. It doesn't really matter how I do it, but here’s what I achieved.” So I think that resume-driven development is a part of it. I think the other part of it is that this stuff’s just cool. A lot of folks that work in this industry (me included) just probably like nerding out on this stuff. If I'm just doing something for fun, yeah, for sure I'll go for something that's overkill and complicated, because it's fun. So I think that's a natural tendency that when you're trying to deliver on impactful and important projects you have to fight. You need to really ask yourself, “How can I make this simpler?” Rather than “How can I make this fancier?”



So basically – there ain't no shame in using a pre-trained model. Really, the idea is getting down to what metrics… How are you moving the needle? And how can you prove that you're moving the needle? Which leads nicely into some questions that I wanted to ask you about. As you're now working at Stripe and you're looking at how your machine learning team is moving the needle – how do you look at KPIs? How do you attribute things back to the machine learning team? What does that look like when you interact with stakeholders so that you have something to bring them?



Yeah. Hitting me with the easy questions, huh? Okay.



We've already done… This is like year two of the podcast. So we already got all the easy ones out of the way. chuckles You didn't realize. You should have been on here a year ago, I would have given you the real softballs.



Tough, tough. Well, I think there are a few ways to tackle your question. One thing is – it's always easier, especially in machine learning and engineering in general, but especially machine learning, if you can tie whatever you're working on to some number that the company you work for cares about. In a lot of cases, that number is some version of revenue. Sometimes it can be other things – it could be cost-related, it could be security-related. I work on a fraud team so it's not always obvious to say, “We've improved our fraud prevention – how many dollars does that make us?” It's not always straightforward. But anyway, I think being able to tie it to that number is important.

One of my mentors, when I was just getting started, had this saying that it was really cool, which was basically, “Write the press release before you start doing the work.” And I think that that helps a lot in those situations where, before you're gonna pitch some new model or some work to your manager, or try working on something just write the email you're gonna send to the company as you're done with the work. Sometimes, you'll realize that that email sucks. It'll be like, “Well, I spent six months building the system. Now, I guess, we’re kind of slightly better at doing this thing that some people care about. Maybe.” If that's the email – then just don't work on it. So I found that that's really helpful. If you can write the email in advance and be like, “Wow, this would be really cool.” Then that kind of motivates that project and helps make sure that you will have something to go to stakeholders with. You've already written it, so you just need to execute it.



Not like, “Yeah, we spent six weeks and turns out this is hard.”



Yeah. chuckles That’s right. Well, that's the other thing, though. The other way to handle your question is to just say, “Just reduce the iteration cycle.” Maybe that's one of the ones that you got out of the way with two years of running the podcast and many machine learning folks coming up. But, if there's one thing that's been a constant (I feel like in my career) is that the shortest iteration cycle wins. And making the iteration cycle shorter will make you win. As much as you can say, “We wasted our time in the last six weeks.” That's not a great look. But if you're “Hey, I ran an experiment. Took me five hours. It wasn't good.” That's fine. Who cares?



So how do you shorten the iteration cycle? What are some tricks that you’ve found to do that?



I'm a big fan of automating everything you can. It's hard – especially, what I've found is that it gets harder as companies get more mature, when you may have a bunch of business processes or a bunch of different outcomes that you care about that aren't just, “Hey, is this model good?” but also looking at different slices of traffic, different key users – that sort of stuff. But as much as you can, automating away every single step is really what matters. You want to get to the world where… There are two different iteration cycles that I often think of. One is the experimentation cycle. So you want to get into a world where, if you want to try a new feature for this important model or a new model for this thing, you can get your answers really quickly. So for that, you want to automate: How do you gather data? How do you generate your features from the data? How do you train the model? How do you generate evaluation metrics?

Oftentimes, when you start on teams, these are like 12 different things that you have to do and maybe have a checklist that's like, “And then download this thing. And then you go there and you pull this branch, and you do that thing.” And, of course, it takes forever and if you do one thing wrong, you have to redo the whole thing. So just kind of automating the glue there and the system. And then one thing that we worked on a lot last year is the iteration cycle of deploying your models, which I think is often maybe not taken care of as early as it should. So how do you make it so that if you have a new cool model, it can just be in production and you're not worried about it without you spending weeks of work to get there, or you’re not getting woken up at 3 AM when you deployed the model and it was terrible? So I think those are the two loops that you kind of shorten.
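The "automating the glue" idea amounts to making each checklist step feed the next with no manual hand-offs, so the whole experiment is one call. A toy sketch, where every stage is a stand-in for the real warehouse pull, feature job, trainer, and evaluation:

```python
# One callable experiment loop replacing a 12-step manual checklist.
# Every stage below is a deliberately trivial stand-in.

def gather_data(seed):
    # Stand-in for pulling labeled examples from the warehouse.
    return [(x, int(x % 3 == 0)) for x in range(seed, seed + 100)]

def build_features(rows):
    # Stand-in feature job: derive two features per example.
    return [((x % 3, x % 5), y) for x, y in rows]

def train(featured):
    # Stand-in "model": majority label per value of the first feature.
    from collections import Counter
    votes = {}
    for (f0, _), y in featured:
        votes.setdefault(f0, Counter())[y] += 1
    return {f0: c.most_common(1)[0][0] for f0, c in votes.items()}

def evaluate(model, featured):
    hits = sum(model.get(f0, 0) == y for (f0, _), y in featured)
    return hits / len(featured)

def run_experiment(seed=0):
    rows = gather_data(seed)
    featured = build_features(rows)
    model = train(featured[:80])       # train split
    return evaluate(model, featured[80:])  # holdout split

assert run_experiment() == 1.0  # the toy label is fully determined by x % 3
```

Once the whole loop is one function, trying a new feature or model is an edit plus a rerun, not a checklist.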



So that leads to another interesting question that I think we had. I completely agree, I think that is usually where the big wins lie. And they're the ones that scale, right? They're the things that remove all the barriers to scale and things like that, especially if you're starting out. It can be scary, though, to do it with critical stuff. I think it's interesting working in finance, because there are a lot of tie-overs, like in the energy industry we are tied into national critical infrastructure. Basically, we get fined out the wazoo if anything breaks, so automating things and iterating things can get scary. So do you have advice for how you tackle that other than “Leave it alone and let it creep over in the corner.”? chuckles



Yeah, the first option is just to never touch it. “We deployed this model eight years ago.” “The person that deployed it left the company and we haven't touched it. If we ever have to change it, we'll probably just go bankrupt because we'll break everything.” Yeah, that's one option. laughs I actually think that, in many ways, automation will make it safer. So I agree with you. We have like the main model that I work on, which is the model that powers what's called Stripe Radar. Basically, it decides for every transaction on Stripe, whether we allow it or block it. So you can imagine that messing the deployment of that model up is pretty bad – potentially, absolutely tragic.

So we have to be pretty careful not just to make sure that we deploy a good model, but to make sure that maybe if the model is good on average – what if it's good on average, but it decides to block all payments to every podcast provider on Stripe? And it's a small enough slice that we didn't notice it. That would be really, really bad. So there are a lot of failure modes. But I think automating a lot of the things ends up making it a lot safer because what happens when you have a lot of failure modes and a lot of things to think about is that they get distilled in the team's knowledge. And it's like, “Oh, Adam knows about this particular thing. Make sure that you ask him because you have to run like this one analysis – I don't really remember – it's based on this query. It's kind of broken. But last time we didn't run it, something went wrong.” And then like, “Oh, and this thing happens.” And you have a bunch of weird tribal knowledge and if any of it goes wrong or goes stale for any reason, everything goes to hell. So I think automation is safer.

The other thing I'll say is, you asked, “How do you do it?” One thing that we found really helpful is – before you automate, you suggest. That's kind of the motto. And that works for machine learning, too. Let's say that when you deploy machine learning models, usually you deploy them with a threshold – it depends – with classifiers. You might say, like for fraud, “Anything above this score is positive or negative.” Maybe you want to select a threshold automatically, but you're not too sure. So initially, you write your automation and all you do is – you have a human in the loop and you suggest that value to them. And you do that over a few cycles. And once a few cycles go by and they say “Yeah – actually, I never change it. It's always good.” Then you can feel more comfortable automating.



Yeah, I completely agree – for some of the big critical stuff like that, where it's possible. When I went to insurance, that was kind of the only approach. People weren't confident in taking their hands off the controls. And I kind of got to the point where I thought “Maybe that's the only way to do this, actually. Let's not think about full automation just yet.” I find it really interesting, actually. Going off-piste a bit here – but it's striking how the way you're talking about doing this stuff, and especially about failure modes, aligns with the chat that we had with Mohamed Elgendy a few weeks back. He talked about really similar stuff, and that idea of failure modes is really interesting to hear again. Demetrios and I spoke about building confidence around testing – but he said that actually identifying failure modes and working through them is certainly something to check out. When there are similar lines of thought, that's obviously the right approach.



Yeah, I feel like there's a common thread in operational work where, to succeed, you just obsess about the potential failures and reducing their likelihood.



So, can we dive in for a moment to what your actual infra looks like at Stripe and how things are going there? Like the nitty-gritty of it. I guess we can maybe start off with you giving us a little bit of a background of what you're working with. But really – you've been at Stripe for a while and I imagine you, as you mentioned, you've been iterating quite a bit. So how has the iteration looked from the infrastructure side? What have you seen, as you mentioned in the beginning, when you got there? Potentially, there were 13 steps on data collection. I imagine you've cut that down a little bit or you automated it so it's not as painful. What were the low-hanging fruits when you iterated on some of the stuff in your specific use case?



Yeah. So we started from a pretty privileged place, I would say. Because Stripe is certainly a larger company than the median company, I guess, statistically speaking. So what that means is there's a lot of infra. So, we have compute infrastructure and also batch compute infrastructure, meaning we have teams that handle orchestration tools and infrastructure tools like Airflow. We have teams that handle a model training and serving service where you can train a model and with one function call have that model be an API that you can call live. So when you deploy your model this kind of happens magically for you.

We have a feature team – a feature computation team – that has a framework where you can define your features offline and then once you've defined them once offline, they're available to you online. These are pretty gnarly problems that I think I'm pretty happy that we don't have to solve on our team. So that's always been a tremendous help. I would say that… For context, the last year we actually spent focusing quite a bit on reducing both of those feedback loops –like prototyping and deployment – and almost all of it on both sides ended up being in these layers of glue that I mentioned. Let me give you an example. Let's say you have a new feature, you've tried it, it helps the model a lot, and you want to deploy it. So first you’ve got a notebook somewhere, trained your model with a new feature, you saw it was better.

Then you'll have to get that code and merge it to our production code without breaking our production training code, and that code will then train a model. Once you have that model, you'll have to then score it in a bunch of tasks against our current production model to compare it against, like we talked about, there's kind of a bunch of different slices of the world that you would want to compare. So you use, let's say, ad hoc Airflow jobs to score millions and millions and millions of charges – expensive jobs over time. Then, once you've scored all these, you're going to have to run your analysis on it. So you have a notebook, you have some queries, you have some SQL, and you do analysis.

Then what happens is – at Stripe, we don't deploy for this particular model. We actually customize the model for large users. And the way we do so is mainly by (at least initially) customizing the actioning threshold. So for hundreds and hundreds of users, we decide on a different actioning threshold. That means that you have to do your analysis for this and you're trying to figure out which threshold. And then finally, you think, “Oh, I'm ready,” and you do something called a shadow deployment, which we can go into a bit, it's a very useful process – where, again, we can rely on the infrastructure there that we have. And then you monitor this manually. Then if this looks good, you say, “Okay, let's slowly ramp up traffic to production.”

And then once you deploy, for a few weeks or a month, you'll want to also keep an eye on it because maybe there's something wrong, so you take a look at performance. Basically, that's the process. I'll stop in just a short bit, but what I want to say is – everything we did last year was just automating all of this. It's kind of unglamorous work in many ways, where you're just going in the guts of your system, and you're like, “Okay, this job does this thing. That job does this thing. And then a human does this. Can we systematize all of it, make those processes just directly connect to each other and define and code what the human was doing?” When they were looking at two curves and were like, “Yeah, it looks good to me.” So a lot of it was just doing that.
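The shadow deployment step in the process above can be sketched as scoring each charge with both models while only the production decision takes effect; the models, field names, and agreement gate below are illustrative, not Stripe's:

```python
# Shadow deployment sketch: the candidate model sees live traffic, but its
# decisions are only logged, never acted on. Once logged agreement looks
# acceptable, traffic can be ramped to the candidate.

def prod_model(charge):
    return charge["risk"] > 0.9    # current blocking rule

def candidate_model(charge):
    return charge["risk"] > 0.85   # proposed, slightly stricter rule

def handle(charge, shadow_log):
    blocked = prod_model(charge)   # only this decision takes effect
    shadow_log.append((blocked, candidate_model(charge)))
    return blocked

def agreement(shadow_log):
    return sum(a == b for a, b in shadow_log) / len(shadow_log)

log = []
traffic = [{"risk": r / 100} for r in range(100)]
for charge in traffic:
    handle(charge, log)

# The two rules disagree only for risk in (0.85, 0.90]: 5 of 100 charges.
assert agreement(log) == 0.95
```

The disagreements, not the agreement rate alone, are what you then inspect by slice before ramping up.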



It's such valuable work, though. I totally… I think it's one of the things where people who try to bake machine learning into their product or platform don't actually understand how complicated and kind of unknowable that path is. Because that path you've described there sounds right for Stripe and what you're doing, but it'd be completely different at the next organization. Right? Because it's context-specific. People come in with their own thoughts and ideas, and the common knowledge and consensus are changing so rapidly. Did you have to do any of that path-trailblazing-type stuff yourself? Or was it quite a natural fit for where you were going at Stripe?



Sorry, what do you mean, exactly?



Well, I mean, how much of it was quite apparent – “These are the steps we'd have to take to deploy it this way, and these are the tests” – and how much of it was actually going out on a limb and trying things?



Oh. Yeah. chuckles A lot of it, I think, you learn by trying to deploy maybe without doing one of these steps. You know? You're like, “Okay. Well, we're gonna just deploy a new machine learning model.” And then, you get to this point where you’re like, “Well, do we know that we didn't completely break this specific…? Are we just rolling the dice here?” And you're like, “Okay. Well, I guess we should do this one analysis.” And then you do it and you’re like “Okay, I'm pretty confident with this.” And then like, “Well, okay. So we know that we haven't broken this user,” but then you're like, “What about this other use case? Is this something that we've taken into account?” And I think two things.

1) One of the reasons I joined Stripe and one of the reasons I really enjoy working on this is – this is a problem you only have if you're really successful. So it's a good problem to have. We have this problem because there are a bunch of different users using our APIs every day, in a variety of ways. And so, that's a great problem to have, that we have to think about all of these different use cases and we have all these business processes.

The other thing that I feel like I've become more of a zealot about after doing this work is, basically, the work that we ended up doing this year was encoding our business expectations in code, essentially, for lack of better phrasing. So, what that means is, usually, there's a lot of arcane stuff that goes on in machine learning and it's like, “Oh, you train this model and then whatever, like you do some stuff, you do some analysis – whatever that means – and then you deploy.” Instead, it was like, “No. When we have this model, here's the contract we have.

The contract we have is that we won't change the rate of actioning of charges in this specific country – whatever it is – by more than this rate. If it's more than this rate, then we'll change it.” And then we also always will move towards having the same (this is just an example) false-positive rate and a higher recall. We’ll never trade false-positive rates down – we’ll always improve recall, that sort of stuff. And that's really the key there for us.
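Encoding that kind of contract in code could look something like this sketch – the metric structure, country keys, and tolerance values are all invented for illustration, not Stripe's actual release criteria:

```python
# Hypothetical release gate: the "contract" becomes explicit checks
# comparing a candidate model's metrics to the production model's,
# sliced by country. An empty failure list means the candidate can ship.

MAX_ACTION_RATE_DELTA = 0.02  # made-up cap on the per-country actioning-rate shift

def release_checks(prod_metrics, candidate_metrics):
    """Return a list of human-readable contract violations."""
    failures = []
    for country, prod in prod_metrics.items():
        cand = candidate_metrics[country]
        # 1) The actioning rate must not shift by more than the agreed delta.
        if abs(cand["action_rate"] - prod["action_rate"]) > MAX_ACTION_RATE_DELTA:
            failures.append(f"{country}: actioning rate moved too much")
        # 2) Never trade the false-positive rate down...
        if cand["fpr"] > prod["fpr"]:
            failures.append(f"{country}: false-positive rate got worse")
        # 3) ...and recall should improve, or at least hold.
        if cand["recall"] < prod["recall"]:
            failures.append(f"{country}: recall regressed")
    return failures
```

The point of this shape is that the checks replace the human squinting at two curves: a scheduled job can run them on every candidate model and block the deploy when the list is non-empty.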



So, a few things. chuckles First one – awesome vocab word that you said, “zealot”. I haven't heard that word in a while and it made me think, “Oh, I need to incorporate that into my speech more.” The other thing is, you mentioned this phrase before we got on here, which was already a quote, and I was like, “Oh, man, we gotta get you saying that when we're on the podcast because I want to make a t-shirt out of it.” And it goes, “You develop operational excellence by exercising it.” That, what you've just said, is basically that – if I'm not mistaken. But can we go deeper into that quote of yours?



Yeah. So chuckles Okay. Essentially, maybe the clearest example I have for this is, again, releasing models. Here's how most machine learning teams I've been on will go about releasing models. They'll have a new use case, they'll think of a model, they'll make the model, they'll be happy with it, and they'll release it. Eventually, sometime between weeks and years later, somebody will say, “Oh, we should release a new version of this model.” And then in comes the problem, right? The code they used to train the model is way out of date – doesn't work anymore. The data – nobody knows where it is. The release criteria, again, aren't in existence.

So you're just kind of looking at what the model does today, like, “Well, it looks similar.” And so you end up with this problem – again, I've had multiple times in my career where you're kind of doing this reverse archaeology to figure it out. That's because, again, you haven't exercised your release pipeline – you did it once. But what ends up happening is, even if you do this exercise twice – you release the model once and then you did it again – what I found is that, as long as there's enough time between releases of this model, your code will rot. Just because code naturally rots. So, whatever you're building, even if it's very smart and very fancy, and you're like, “Oh, I've mathematically solved how to set whatever parameters for this model,” it’s like, “Yeah, sure. But all of the assumptions that you had about the world broke.” When you created this pipeline, maybe your company only operated in one country. Now it operates in 12 countries. So all of your assumptions are wrong, all the distributions are different, all this stuff.

So really, the only way that you can have your models retrained and re-released – which for many applications is a gigantic boost in performance that you would be foolish to leave on the ground – is to have that pipeline (that operational work of releasing a model) be at least as frequent as, essentially, the data demands you to be. Basically, once you get to that point where, maybe automatically, you're going through this whole process every week or every two weeks, the issues that happen at a two-week granularity are small enough that you can fix them. It'll be like, “Oh, we changed this one thing, so I'm gonna change this thing.” And then it's fine. So your pipeline would only ‘break’ in terms of small cracks and you'll fix the small cracks.

It's kind of like cleaning regularly or doing housework regularly – you fix a small thing here and there. But if you leave it for a year, you come back to it and it's like a haunted mansion – the ceilings falling on your face and you just give up. You burn it all down and you build a new house. So I think it's all about doing that frequently enough so that you don't end up in the haunted mansion scenario. You can just regularly exercise the work and regularly just touch it up. It becomes almost like a Marie Kondo zen-like thing rather than a nightmare.



Oh, that's so good, man. That's brilliant to think about. Though, do you ever feel like you get the ‘death of 1000 cuts’ because you're continually doing all this small work? You're like, “How can I get out of this painful patching up these little cracks?” Or is that just part of it?



Yeah, I was gonna kind of similarly ask – do you ever struggle with the balance of that work versus doing new stuff that isn't that kind of productionization piece? Is that balance hard to find?



Yeah. I mean, I think some of it is… maybe you have to be the kind of person that likes tending to your garden, you know? Like trimming the weeds and all that stuff.



cross-talk Yeah, the constant gardener.



chuckles I think what happens is – everything is always like a prioritization exercise. So, for this stuff – to take the example of our platform – essentially, we document this pretty well. You just released a model – what went well? What went wrong? Did anything break? And then I'll just stack-rank it – we’ll be like, “Okay. Well, this thing – there's this bug. You have a 1 in 10 chance that this kind of annoying thing happens, but it's not dangerous. It's kind of annoying. It would take a week to solve. We have more important stuff to do.” You know? So some of that stuff, we’ll just decide not to do.

And then some of the stuff is like, “No. This could cause a big, big issue.” In some cases, we actually get the wrong performance metrics, like, “No, we have to solve this.” And so you prioritize that above doing new work. But again, you can always bring it back to “Is this worth more than me building a new model for this other thing?” Then you make your decision that way. One thing I'll say – and maybe I'm just becoming superstitious – but when we were doing all this foundational work, four times – four times – I found a small thing and was like, “Oh, this is kind of annoying. It’s a small thing. It’s not worth our time to fix it.” And the same thing, two months later, ended up causing just a huge issue. Every time.

So I've become kind of way more of a stickler, being like “Yeah, yeah. It's a small thing.” But if I can, I'll just do it. I don't know if that's rational. It's just that I kind of got just unlucky breaks four times in a row on small things becoming a big thing. So I guess you develop your own heuristics.



Yeah. laughs



There’s something about being in the mind space at the time and thinking “Right, I'm thinking about this now. So let's just fix it.” As opposed to cross-talk thinking.



Also, along the lines of the way that you do reproducibility, you mentioned how much of a headache it is for a lot of people out there, and all the teams that you had been on previously, where you have to figure out like, “Alright, I want to go and get a better model. Let's see. What data was this trained on? What code was this trained on? Where did we get this data? How did we clean it?” All that stuff that goes into trying to reproduce the same results – how do you go about that now?



Yeah. I can maybe go a little bit more into the Stripe tooling around it because I think it's really good. We also have some blog posts on our engineering blog about it. I think there are a few things that you kind of want: 1) You want your data generation to be something that you can rerun. Honestly, that can be as simple as – if you are going to train a model, it has to be attached to a Spark job, ideally, that runs in a scheduled manner. Again, because if you just say, “Hey, I wrote the Spark job two years ago. Just rerun it,” then I guarantee you that if you try, it'll break – the assumptions are all messed up. And so, if you can have something that generates your training set every week – even if you don't use it, it's fine – and that will alert you if it breaks, then that's really nice.

So that's something that I think is accessible for most companies. It doesn’t have to be Spark, but have some data job that just keeps running, keeps making your training set and, ideally, you do some simple testing on it. Nothing crazy – but just test that it's not all empty or doesn't have only one kind of label. 2) The other thing that we have that helps with this is – we do the same thing for our training workflows. Our training workflows are essentially… think of them as similar to a scikit-learn model. It sits on top of the data generation job and basically, everything is defined either in Python or in a JSON config, telling it things like, “You take the data. You filter out these three rows. The label is defined this way. You take this column and you multiply it by three (and like whatever). You train the model with these parameters, with this many months of data for training and this many months of data for test.” And you have the same workflow for model evaluation where it's like, “Cool, you take the results of the previous workflow and just do a bunch of stuff on it.” As we get to more complicated use cases, we extend this by just adding more, essentially, scheduled jobs.
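A toy version of such a config-driven workflow might look like this – the config schema, column names, and split logic are all made up, just to show the idea of declaring filters, the label, transforms, and train/test windows in data rather than in ad hoc code, with the cheap sanity checks mentioned above built in:

```python
import json
from datetime import date

# Hypothetical config (not any real Stripe schema): declarative filters,
# a label definition, a column transform, and the train/test windows.
CONFIG = json.loads("""
{
  "filters": [{"column": "amount", "op": ">", "value": 0}],
  "label": {"column": "disputed"},
  "transforms": [{"column": "amount", "multiply_by": 3}],
  "train_months": 2,
  "test_months": 1
}
""")

def prepare(rows, config, today):
    """Interpret the config over a list of dict rows and return (train, test)."""
    # Apply declarative row filters.
    for f in config["filters"]:
        if f["op"] == ">":
            rows = [r for r in rows if r[f["column"]] > f["value"]]
    # Apply simple column transforms.
    for t in config["transforms"]:
        for r in rows:
            r[t["column"]] *= t["multiply_by"]
    # Split by recency: the newest months become the test set.
    def months_ago(r):
        d = r["date"]
        return (today.year - d.year) * 12 + (today.month - d.month)
    test = [r for r in rows if months_ago(r) < config["test_months"]]
    train = [r for r in rows
             if config["test_months"] <= months_ago(r)
             < config["test_months"] + config["train_months"]]
    # Cheap sanity checks: splits not empty, more than one label value.
    labels = {r[config["label"]["column"]] for r in train}
    assert train and test, "empty split - the data job is broken"
    assert len(labels) > 1, "only one label value in training data"
    return train, test
```

Because the whole definition lives in the config, the same job can be rescheduled weekly and the assertions turn silent data rot into a loud alert.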

For us, we'll have scheduled jobs that do things like, “Yes, we've trained this model. We've tested it. It looks good in terms of general performance. But now we want to take the actual binary, score – let's say – a bunch of French charges, compare how it does to the previous model, and verify that in France this specific condition is met.” And then we do a bunch of things. But essentially, what I'm trying to say is that all of this is just scheduled and it runs whether we deploy models or not. And if it breaks, we know it.

So extending that to all models – this is something that you can build at a platform level, where you just say for every data scientist in the company, “You want to ship a model? You have to have this job that creates your data. You just write the job and then your model training just keeps running.” That is super helpful. Because then you just have it by default. It's maybe like 50% more work when you do your initial model. But then you kind of get away from this huge problem scot-free after that.



Yes, interesting approach there. I suppose I've not thought of it like that. But you’re kind of artificially creating users for your pipelines so that you can then just treat them like you would any other software product, right? You just go, “Well, the user raises a bug. Need to go fix it.” Which is quite cool. As opposed to going, “This thing works. Put it on the shelf and let it collect dust and fall over the next time I need it.” Yeah, it's quite an interesting idea, though. Maybe it's just my lack of reading, but I’ve not thought of that.



Yeah. I haven't seen much writing about it either, but yeah. I think it protects you from exactly what you called out. Right now, I think one of the biggest risks of ML is the stakeholder that made the ML model oftentimes is not around, and nobody understands anything when it breaks. So that’s kind of like a forcing function. Because maybe you built the model and the pipelines, but now, usually the ownership is at a team level. So it's like, yeah, but the team now owns the pipeline that trains and evaluates this model, and so they keep owning it as long as the models are in production.



Yeah, because I always think that the tricky bit comes from things like the static artifact of the dataset that trains the model can easily get forgotten. Not so much the model itself, which might be fine, but the actual artifacts that hang off of it. Yeah, that's cool, that’s cool. I like that.



That's incredible. This has been so insightful, man. And I just want to call out everyone who has not read your book – go and get that. You can buy it from O'Reilly Media or Amazon. And I think they probably have that thing where you do the 30 days free trial with O'Reilly – I know they do that a lot. So if you want to go read the book in those free 30 days chuckle if you're really cheap, go do it. Get after it. The book is called Building Machine Learning Powered Applications, by Emmanuel Ameisen – if I got that right with my poor French accent. I appreciate you coming on here so much, man. It's been really enlightening. There are so many key takeaways that I've had from this chat. Just thinking about… chuckles

From the beginning, you were dropping bombs – machine learning engineers are extremely good at self-sabotage. We all know that. And they’re trying to overcomplicate things. I think that's just an engineering thing in general, but with all the different shiny tools and new frameworks and PyTorch Lightning, or whatever the latest PyTorch thing that just came out yesterday – I can't remember the name. But we all want to try it, right? And maybe we don't need to for our use case. Like, let's bring it back to the KPIs and let's bring it back to the stakeholders and actually move the needle. Then – this was a great one from your mentor – “Write the press release before you start doing the work.”

And make sure the press release doesn't say, as Adam mentioned, “We worked on this for six weeks and it turns out it was really hard. laughs So we gave up.” Also – this was key for me – when we were talking about the KPIs and how to interact with stakeholders, “Tie whatever you're working on to some number the company cares about. Figure out what number that is – it doesn't have to be revenue, specifically, but figure out what the company cares about. Tie whatever you're working on to that and be able to show that you can move the needle on that.”

We're also going to link to something in the blog post about your internal tooling at Stripe because I'm sure there are a lot of people that want to go a lot deeper into what you mentioned in this call. Last but not least, and maybe this is all doom and gloom – Adam and I created a fake startup for the MLOps tooling sector while we were listening to you called Haunted Mansion, which is, in effect, doing exactly what you talked about. chuckles



One of the best analogies I've ever heard. I loved it when it was mentioned. I'm sorry, I love that.



chuckles The cobwebs and all of that, that's what we're trying to avoid. “The stakeholder who built the model is not around when it breaks,” and I think we've all had to deal with that. We all know that feeling. Last but not least, “You develop operational excellence by exercising it.” Man. So many good quotes and so many key takeaways here. This was incredible. Thank you so much.



Thank you. This is really fun. Thanks for having me.
