
Platform Thinking: A Lemonade Case Study

Posted Feb 15, 2022 | Views 1.4K
# Case Study
# Interview
# Lemonade
# Lemonade.com
SPEAKERS
Orr Shilon
ML Engineering Team Lead @ Lemonade

Orr is an ML Engineering Team Lead at Lemonade, currently working on an ML platform that empowers data scientists to manage the ML lifecycle from research to development and monitoring.

Previously, Orr worked at Twiggle on semantic search, at Varonis on data governance, and at Intel. He holds a B.Sc. in Computer Science and Psychology from Tel Aviv University.

Orr also enjoys trail running and sometimes races competitively.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

Vishnu Rachakonda
Data Scientist @ Firsthand

Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.

SUMMARY

This episode is the epitome of why people listen to our podcast. It’s a complete discussion of the technical, organizational, and cultural challenges of building a high-velocity machine learning platform that impacts core business outcomes.

Orr tells us about the focus on automation and platform thinking that’s uniquely allowed Lemonade’s engineers to make long-term investments that have paid off in terms of efficiency. He tells us the crazy story of how the entire data science team of 20+ people was supported by only 2 ML engineers at one point, demonstrating the leverage their technical strategy has given engineers.

TRANSCRIPT

0:00 Demetrios That was awesome, man. We just talked with Orr – Orr Shilon, for those who want the full name – he's one of the ML platform engineers at Lemonade. What do you think of the conversation, Vishnu? 0:14 Vishnu I thought it was great. It's always rare, I think, to find the kind of person who is both in the weeds as a talented individual contributor, but also has to think strategically about company and business needs. Usually, those kinds of people are either plucked very quickly to the top or are off doing their own thing nowadays. And it was cool to see someone like Orr, who has stuck around at Lemonade for a while and has really pushed the kinds of ML platform infrastructure that I think a lot of companies would love to have. It was a great conversation. 0:55 Demetrios That's true. When he was explaining the platform that they have, and basically, the conversation that we have coming up for you right now is centered around how Lemonade created their platform and what it looks like, and what are some things that they keep in mind – this platform thinking mentality. When he was talking about it, I just kept thinking, “Wow, this is pretty advanced! This is cool. I hope this becomes the standard, as opposed to the exception.” So hopefully, when everyone's listening to it, you get some good feedback, or some good nuggets of wisdom that you can bring into your job. I wanted to mention something else, man. There are some things to call out. At some point, he was with one other platform engineer, serving 20 data scientists – that's pretty incredible. Now, the team has since grown to four, but 2 to 20 – one platform engineer for every 10 data scientists – that's a pretty big one. Then they have a Slack bot that they can train different models with? They have a Slack bot that they can push models to production with? That's pretty wild, too. I've never heard of that. 2:09 Vishnu Yeah, the emphasis on automation, platform thinking, and efficiency was pretty impressive to see. Because you don't get that level of leverage (in terms of two people being able to support that many people) very often, unless you've made long-term investments in the efficiency and productivity of those people. The fact that they were able to do that shows that they have really done that for a long period of time and with a lot of vision. I think everyone listening will learn a lot about how if you make the right kind of investments up front, over time, they can really pay off in terms of efficiency. 2:50 Demetrios There's one thing that I wanted to point out. We were chatting on Slack, while he was talking about this – he goes over the five different pillars of their ML platform. The first one he talked about was how they built their own feature store and it reminded me of when Jesse was on here, like a week or two ago, and Jesse was talking about this conundrum that a lot of these companies who started doing ML early have. And it was exactly the narrative that Jesse laid down for us. If you haven't listened to the conversation with Jesse, go back and listen to it. It was a fascinating one. Not a lot of people checked it out. I'm surprised it didn't get as many views or listens as I thought it would, because the conversation with Jesse was just incredible. But I think we must have messed up on the title or something. People just weren't clicking on it. It's called “Scaling Biotech” so people probably thought, “Well, I'm not in healthcare. 
I'm not doing anything with biotech, so it doesn't apply to me.” Jesse had amazing wisdom. So I highly encourage you to go listen to that. But the premise of it, and what Jesse talked about, was how these companies start building their own tools internally, because there is nothing on the market. And they all start doing that around the same time, because they start seeing the need and there are these bottlenecks that are happening. So everyone's kind of building the same internal tools – say like a feature store, we can use that because that's like the canonical example. But it's not just for feature stores. A lot of them, like monitoring or deployment platforms, we can use any of these. But what happens is, Uber's building their Michelangelo and they have a feature store, and then they spin it out and it becomes Tecton. And Orr was talking about how they were building a feature store and it's the same thing. They have this problem now because Tecton wasn't out before they started their whole journey on that feature store. But now it's so customized – their feature store at Lemonade is so customized to their needs, that he was saying “I have a really hard time seeing us go out and buy something off the market because we have something that is so customized to what we do. We're just going to keep doing it.” And Jesse was talking about that too. He called it exactly how it was. And that's what made me really appreciate the conversation with Jesse more after hearing Orr talk about this. But anyways, I think we've talked enough about what you're going to hear and we’ll just get into the conversation with Orr right now. Any last thoughts from you, Vishnu? 5:21 Vishnu Real quick, do you want to just go through his bio and read it out? I'm happy to do it. Just so that everybody gets a sense of what he does? 5:28 Demetrios Yeah, that's probably a good idea. 5:29 Vishnu Yeah, let me do that real quick. So, Orr Shilon is an ML Engineering Team Lead at Lemonade, who’s currently developing a unified ML platform. Trust me, it's really cool. This team's work aims to increase development velocity, improve accuracy, and promote visibility into machine learning at Lemonade. Previously, Orr worked at Twiggle on semantic search, at Varonis, and at Intel. He holds a B.Sc. in computer science and psychology from Tel Aviv University. Orr also enjoys trail running and sometimes races competitively. 5:57 Demetrios Last thing before we jump into the full conversation – we're looking for people to help us edit these videos. Basically, you don't have to know anything about editing, you just have to tell us where the nuggets are – where the gems are – in these conversations so that we can trim it down and get like a 10-minute highlight reel for people to watch on our MLOps Clips secondary channel. If you're interested in that – because we can't hire a producer, you're probably thinking “Well, get somebody that's actually a pro at this.” The problem is that producers, podcast producers or video producers, don't understand machine learning and so they don't know what's a gem and what's just us talking and it's not that cool. So if there's somebody out there who is knowledgeable in this space, and passionate, and a listener, who wants to help out, we would love some volunteers. That's it. Let's get into the conversation. [Intro Music] 6:52 Vishnu Hey, Demetrios. How's it going?

6:54 Demetrios What’s up, man? I'm stoked to be here. As always, we've got another incredible guest. What's going on, Orr? 7:00 Orr Hey, man. Hey, guys. I'm doing great. Thanks for having me. 7:04 Vishnu It's a pleasure to have you on, Orr. We're really excited to talk about the ML platform at Lemonade. I'm excited to hear more about what the company does, how you guys have built the team you have, and get to learn from you in terms of what this really fast-growing and hot company, in a very interesting industry, is doing from an ML platform standpoint. So to kind of kick things off, can you tell us a little bit about what Lemonade does, and what your ML platform team at Lemonade is supporting in terms of use cases? 7:40 Orr Sure. So, I'll start off with Lemonade a little bit. Lemonade is a full-stack insurance carrier. And I guess we're sort of powered by AI and social good. But we do basically four kinds of insurance. The first is P&C insurance for homeowners and renters. The second would be pet health insurance. The third is life insurance. And the fourth is – we recently launched car insurance. We do this mostly in the United States, but in several countries in Europe as well, depending on which kind of insurance. Like I mentioned before, we're a full-stack insurance carrier, which means we're not an aggregator – we actually are the underlying insurer. 8:23 Vishnu Got it. It's funny, I'm actually a Lemonade customer. [laughs] I’m here in New York City. 8:30 Demetrios So you’re highly biased. [laughs] 8:32 Vishnu Yeah, definitely. [cross-talk] 8:34 Orr That's great to hear. [laughs] 8:36 Vishnu I'm pretty happy with the product. With that in mind – so, you guys are a full-stack insurance carrier, you have all these different lines. Where does machine learning fit into what Lemonade does? Why do you need machine learning? 8:50 Orr I would say sort of along this line, with tension between two different things. The first would be improving the product – we want to improve our users’ lives by improving the product. That would be something like – maybe I'm doing intent prediction for a chatbot for customer service before you actually reach a customer. You could maybe handle something before reaching a representative. Or maybe doing other things automatically for customers. Then the second big chunk would be trying to improve our business with things like maybe predicting lifetime value for users – things like that – which, you know, can be done in any business. 9:41 Vishnu That makes perfect sense. I think it's always helpful for us to set the context for why machine learning is a part of the business. And with that in mind, I have seen from the outside a little bit of how this business has grown and also just how ML engineering at Lemonade has grown – from the job postings you guys have, from the descriptions of your platform that I see online, and talks. Can you tell us a little bit about what the end-to-end process (going from data to model in production) looks like at Lemonade on your ML platform? 10:18 Orr Yeah. The platform sort of provides what we call “point-in-time data”. Basically it provides data on our data warehouse directly in Snowflake. And if researchers want to do something like exploratory data analysis, they can do that directly on the data warehouse on these giant dimension tables with hundreds of features that have already been created for them, or they can do it on raw data. I think at this point, we have maybe 1,500 features in our feature store. 
So we're at the point where, hopefully, researchers won't have to create features for new models, or will maybe have to create only a few of them. They'll start there. They'll do their modeling in a notebook. Then, the most important step for us is that we translate modeling code – both for training and for inference – into our internal platform's framework, which kind of democratizes training, where anyone can kind of train anyone else's model. We have a Slack bot to be able to train models with cloud resources. So we can kind of run training there, and configure periodic training as well. 11:48 Vishnu That's super cool. 11:49 Orr Yeah. Then finally, they'll use that same Slack bot to deploy a service to production since we usually do online inference. Then that service will be exposed to developers, who will integrate with it at Lemonade. Finally, we use a third party for monitoring. We use Aporia, which I think has been on here at least once or twice. I think what we've learned over time is that we really want our data scientists to manually configure monitors for their machine learning models. We don't want anything automatic out-of-the-box so we don't get this “alert fatigue”. It's something that needs to be constantly tuned all the time. They'll configure their data drift monitoring, concept drift monitoring, and performance drift monitoring, with the help of these very specific alerts per feature, in order to be able to monitor something that matters. We want high precision, even if the recall isn't that great. 13:01 Demetrios Can we double-click on that real fast? Because I'm wondering how much of this stuff you can recycle, such as the features, for example. Then, when it comes to the monitoring – how much of that has to be custom or bespoke every time? Or how much can you recycle? 13:19 Orr Like I mentioned before, we hopefully recycle a lot of features. We have dozens of models running in production and each one of them has dozens of features. So we definitely recycle features. Then, we actually have several kinds of monitoring. We'll monitor features separately from models. We’ll monitor maybe the amount of null values. We'll monitor if features are equal between training and inference time – this is something the feature store will do completely separately. Also, if a model even uses the feature. Then when it comes to model monitoring, or like the features within models – we had a time where we tried to do auto-monitoring to take this burden off of data scientists. And we kind of reached the conclusion, at least at this point in the platform, that we're not able to provide that service in a very good way. So we're at the point where data scientists will manually configure monitors per feature for a model and that's part of our process of reaching production. They'll get these alerts in Slack and whatever. But that's a big part of the process of reaching production. We do these iterations once in a while where we go over the monitors and make sure that everything's up to par. 14:52 Vishnu How did you guys decide on a Slack bot as being the right way to have this model training process work? 14:58 Demetrios Yeah, actually the Slack bot is really intriguing, isn't it? That's probably the first time I've heard about a Slack bot being brought into the whole MLOps equation or into just the training and machine learning equation in general. 15:13 Orr I think I'm lucky enough to be working at an organization where automation is the top priority. 
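Orr describes data scientists hand-tuning monitors per feature rather than relying on automatic, out-of-the-box alerts. As a rough illustration only (this is a generic sketch in Python, not Aporia's actual API; the monitor fields, thresholds, and channel names are invented), a per-feature, precision-first monitor configuration could look something like this:

```python
from dataclasses import dataclass

@dataclass
class FeatureMonitor:
    """One hand-tuned monitor for a single feature of a deployed model."""
    feature: str
    check: str           # e.g. "data_drift", "null_rate", "concept_drift"
    threshold: float     # alert only past this point: high precision, modest recall
    slack_channel: str   # where the owning data scientist gets tagged

# Hypothetical monitor set for one model; every entry is chosen and tuned by hand,
# and revisited periodically, rather than generated automatically.
LTV_MODEL_MONITORS = [
    FeatureMonitor("policy_premium", check="data_drift", threshold=0.25,
                   slack_channel="#ml-alerts-ltv"),
    FeatureMonitor("customer_age", check="null_rate", threshold=0.02,
                   slack_channel="#ml-alerts-ltv"),
    FeatureMonitor("ltv_prediction", check="concept_drift", threshold=0.30,
                   slack_channel="#ml-alerts-ltv"),
]
```

The point of the sketch is the workflow, not the schema: each monitor is an explicit, reviewed decision, which is how the team avoids the alert fatigue Orr mentions.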
So this Slack bot – there's a Lemonade Slack bot with blog posts about it. It's called Cooper, like from Sheldon Cooper. And it's really easy to integrate with this platform – the Slack bot platform. It's kind of standard Lemonade practice to do things with a Slack bot. So we kind of have our own service and our own commands that integrate with this platform and do a bunch of different things. Maybe we'll have commands to bring up a SageMaker notebook, or take it back down. We'll have commands to maybe remind people that their notebooks are up so we don't spend a lot of money. Then we'll have commands for managing the machine learning lifecycle – for training models with cloud resources, for deploying them, for taking things down. I guess several other small pieces as well. 16:15 Vishnu You said something really interesting there, that I would like to dive deeper into, which is – automation is a top priority at Lemonade and that you're fortunate to work at an organization that does that. Why do you think automation is so important to Lemonade in particular? What is it about the company that makes you guys want to really prioritize that? And why do you feel like you're lucky, in that context? 16:42 Orr I think we prioritize automation from a business perspective. One of the top metrics that we try to look at is, maybe, the number of customers that we have per employee, or the amount of ARR (we call it IFP) per employee. That's a very important metric that we look at. So automation is a big part of the company. And I think I'm lucky to work at a place like that because, I guess, it's a priority for everyone to automate things, which is really nice. Then we have cool things like Slack bots which deploy machine learning models or train machine learning models. 17:27 Vishnu That's really interesting to hear that you guys actually apply almost business-level metrics to what is usually considered just a technical imperative of automation. I think a lot of times when we talk about automation mindsets and such at different talks and podcasts that we've been in, it was mostly like, “Well, how do you automate more so that the engineering team can be more efficient internally and can get rid of bottlenecks?” But to hear that it's such a business focus that trickles down into everything you guys do – I think that's a very unique approach that clearly yields some pretty interesting results, like a Slack bot that serves what seems like a lot of customers in the company. 18:11 Demetrios Yeah. I also want to note, when you're looking at that automation side of things, what are some points where it's gone wrong? Not “gone wrong,” per se, but just where you tried to automate something and it shouldn't have been automated? You mentioned the monitoring before, where you tried to take that to automation but then you had to dial it back. Has there been a point where the Slack bot was trying to be implemented and you realized, “Whoa. Actually, it's not the best use case for a Slack bot. We need people to be doing this.”? 18:48 Orr I don't think I've seen that. I guess maybe not all Slack commands that ended up being implemented are utilized 100%. I don't think I've seen metrics on it. But I don't think I have a really good story there. I think a great story would be like something we talked about – about trying to automate machine learning monitoring, and kind of failing there and resorting to having low precision, high recall, and having dozens of alerts a day without anyone being interested in them. 
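Cooper itself is internal and its commands aren't public, but the pattern Orr describes (a chat command that kicks off a cloud training job) can be sketched with Slack's open-source Bolt for Python SDK. The `/train-model` command and the `launch_training_job` helper below are hypothetical stand-ins, not Lemonade's real interface:

```python
import os
from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])

def launch_training_job(model_name: str, requested_by: str) -> str:
    """Hypothetical hook that starts a cloud training job (e.g. a SageMaker or
    Kubernetes job) and returns an identifier the requester can follow up on."""
    raise NotImplementedError("wire this to your training infrastructure")

@app.command("/train-model")
def handle_train(ack, respond, command):
    # Acknowledge the slash command right away so Slack doesn't time out.
    ack()
    model_name = command.get("text", "").strip()
    job_id = launch_training_job(model_name, requested_by=command["user_name"])
    respond(f"Started training `{model_name}` with cloud resources (job `{job_id}`).")

if __name__ == "__main__":
    # Runs Bolt's built-in HTTP listener for the slash-command endpoint.
    app.start(port=3000)
```

The appeal of this shape is that training and deployment become visible, auditable chat actions anyone on the team can trigger, which is the "democratized training" Orr refers to.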
19:26 Demetrios Ah. And that's how you knew “This isn't working.” [cross-talk] 19:29 Orr Yeah. I mean, I knew that it wasn't working for us to configure monitors automatically. 19:38 Demetrios Yeah. Okay, last question about the Slack bot and then we can move on. But it's just so fascinating to me. Is there someone that is looking at metrics on which commands are being used with the Slack bot? Do you have a team that just babysits the Slack bot and decides what to put in there? 19:56 Orr I think there's maybe a person. He definitely doesn't babysit the Slack bot. It's a platform. Technically, they don't even know the commands that we've written to integrate with this platform. I said I'm lucky enough to work at an organization that has automation as the key – platforms is something else that's really big there. Decentralizing development in every sort of way is also a big priority. 20:28 Vishnu That is fascinating to hear. I think, with that in mind, I want to ask a little bit more about what tools you guys use in the context of your ML development stack. You mentioned Snowflake. You mentioned notebooks. Are you using any enhancements on top of notebooks? Any kind of managed service? You mentioned Aporia. We’d just love to hear what your tool universe looks like. 20:51 Orr I think you guys have seen this blog post by Ernest Chan, where there are like five main building blocks? So maybe I'll describe each building block that we have. 21:04 Demetrios That’d be great. 21:05 Orr So, we have an internally implemented feature store. We kind of started before Tecton was public – or before they exited “stealth mode”. The feature store is so customized towards our specific needs that I'm having a really hard time seeing how we can use any third-party tool, even though people are developing things that are much better and much more comprehensive. We have some super-specific needs on our very specific data and use cases. So we have an internally-implemented feature store. It's implemented in Python. There are different contexts that it runs in. It reads streams with AWS Lambda, and it serves real-time features from a Kubernetes service with the FastAPI framework. We recently ported all of our code to an asynchronous event loop, which has been very successful for us in terms of model-serving latency. The online feature store is backed by DynamoDB, and offline, like I said, it's backed by Snowflake. Then we also run different ETLs over Kubernetes, as well. The workflow management that we use is Airflow, which also does our periodic training if we need it, and then it'll also manage the different ETLs if we want to do batch ingestion into the feature store. We use MLflow, both for experiment tracking and as a model repository. And we're very heavy on infrastructure as code. So we don't use the MLflow feature where there's a production model, maybe like a blue-green deployment, or however they do it there. We actually literally take the model ID and stick it in our code, like in Git. Everything is versioned in Git at Lemonade. Like all provisioning code is versioned in Git. So like I said, we're very heavy on platforms and decentralizing development. 23:39 Vishnu So I’m… I'm sorry, go ahead. 23:42 Orr I was going to keep describing different building blocks, but I'd love to hear a question. 23:47 Vishnu Well, I mean, I'm just sitting here as an engineer, and I'm kind of like – Well, you have all these different sort of nicely configured – or architected – building blocks. 
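One concrete detail worth unpacking is "take the model ID and stick it in our code, like in Git": instead of MLflow stage promotion, the chosen run ID is committed and reviewed like any other code change. A minimal sketch of that pattern with MLflow's public Python API and FastAPI follows; the run ID, route, and feature payload are placeholders, and it assumes a model whose `predict` returns a 1-D array:

```python
import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI

# "Promotion" is a code change: this reviewed, versioned run ID lives in Git.
MODEL_RUN_ID = "0123456789abcdef"  # placeholder

# Load exactly the pinned model version at service start-up.
model = mlflow.pyfunc.load_model(f"runs:/{MODEL_RUN_ID}/model")

app = FastAPI()

@app.post("/predict")
def predict(features: dict):
    # Online inference: score one feature payload with the pinned model version.
    frame = pd.DataFrame([features])
    prediction = model.predict(frame)
    return {"model_run_id": MODEL_RUN_ID, "prediction": prediction.tolist()}
```

The trade-off is deliberate: rolling a model forward or back is a Git revert, so the serving code, provisioning, and model version all share one audit trail.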
It's pretty clear to me how you guys have solved all the different natural friction points that a lot of other companies, including my companies, have faced in building and productionizing machine learning models, and then also monitoring and maintaining them in production. So what are the areas of friction that still exist? I'm curious. In terms of maybe your tool selection, or just in terms of putting more ML models into production? 24:28 Orr I think maybe the biggest challenge that we face in general is like a “people” challenge. The market is very hot for employees at the moment and I think we try to solve problems where we want to enable data scientists from all different backgrounds to be able to use this platform. That's still our biggest challenge – enabling people that have only ever delivered notebooks to reach production, and at the same time, enabling data scientists who have been software engineers for 15 years and are very opinionated on frameworks and would like to do everything (they want to know what's going on under the hood). The interfaces that we provide are really the biggest challenge – it's not the underlying code. It's really deciding on those interfaces and keeping them current and having them work for everyone from the least experienced to the most experienced. 25:37 Demetrios Wow, I love that. And I love hearing about how you're serving this whole spectrum – from the data scientist who has been only a data scientist and loves Jupyter notebooks and doesn't want anything to do with anything else and then the data scientist who is transitioning from being in the software engineering world, and is very opinionated. I'm thinking about – when you're opinionated like that – or you as the platform person, you have to ultimately make some decisions and you have to be opinionated about some things. How does your choice – how does that look? And are you serving the different users of the platform? Or is it something where you just say, “Alright, we can't let this happen anymore because we've seen the downstream effects.”? 26:29 Orr I think we're kind of at the point where we've gone full circle. We kind of started with something that was very open, and then there's the second version of our model-serving framework – which is the main area where data scientists work, the model training and serving framework – where they’ll sort of implement an interface, basically, to fit a model, to predict on the model, and give a list of features. That framework was very open in the beginning, where you could make all these different decisions. Then the second version was kind of closed, where it was really good for 80% of the use cases, but did not work very well for about 20%. We've kind of gone full circle in the way that we can sort of allow both at this point. But there's nothing in the middle. Some people will make all of these decisions about where their features come from, and specific queries, and they'll maybe write custom queries to bring their features from Snowflake in a custom way. But they'll kind of have to take responsibility for that. While others will just provide a list of features and it all happens for them. 27:45 Vishnu I think your statement about how you think about your platform work really is the clearest articulation I've seen of this sort of customer mindset that has to happen for internal platforms to work. 
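The interface Orr alludes to (fit a model, predict with it, declare a feature list) is internal to Lemonade, so purely as an illustration of the shape of such a contract, a hypothetical version might look like the abstract base class below. The method names and types are invented:

```python
from abc import ABC, abstractmethod
from typing import List
import pandas as pd

class ModelInterface(ABC):
    """Hypothetical contract a data scientist implements once; the platform can then
    wrap it with Slack-bot training, a highly available service, monitoring, CI/CD,
    and out-of-the-box batch inference."""

    @abstractmethod
    def features(self) -> List[str]:
        """Names of the feature-store features this model consumes."""

    @abstractmethod
    def fit(self, training_data: pd.DataFrame) -> None:
        """Train on point-in-time training data built from the declared features."""

    @abstractmethod
    def predict(self, features: pd.DataFrame) -> pd.Series:
        """Score a batch of rows; usable for both online and batch inference."""
```

The "full circle" Orr describes maps onto how strictly data scientists stay inside this contract: most just declare features and implement the two methods, while the opinionated minority bypass parts of it and own the consequences.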
You mentioned, “The hardest problem I have is not picking the tools, it's not setting up the architecture – it's really figuring out what my customers (in the form of data scientists) need, and it's designing those interfaces thoughtfully enough that I'm serving all their needs without making my work impossible to do.” And I think that is the central challenge that we talk about so much on this podcast. It's kind of fun to hear that framed really elegantly by you in your example. 28:28 Demetrios And… Vishnu, sorry to interrupt there. But another point that I wanted to add was that it sounds like, or what I understood as well, is that, if you make a really cool platform that someone can have a great time using and these data scientists enjoy using, you're going to be able to attract better talent because they enjoy using the platform – they enjoy the problems that they're working on. Is that another piece of it? Or did I just make that up in my head? [laughs] 28:55 Orr I think that's a big piece of it – having people. I like to talk about ownership a lot. I think it's quite obvious from the way that I've described the platform that we're not a “throw it over the fence” kind of team, where data scientists have ownership end-to-end. This attracts a certain type of data scientist, but I think people that have worked on teams that were “throw it over the fence” have experienced the frustrations on both sides – both with different ownership models and lagging delivery. And I think that's something that's very important to us, having this clear ownership. There are obviously gray areas in the middle, but there's clear engineering ownership and clear data science ownership. Data scientists will write the code that runs in production and that's so important to us. 30:02 Demetrios Do they also get the call at three in the morning if something goes wrong? 30:07 Orr They're tagged in Slack, yeah. They’re auto-tagged on their models. 30:15 Vishnu [laughs] Cooper's letting them know. 30:16 Orr Yeah, exactly. I mean, we're definitely auto-tagged on more things, but they do get tagged on different types of alerts on their models, whether it be applicative alerts – just, like, exceptions – or data and concept drift, null features, things like that. 30:36 Vishnu Going back to something you said before, about the “people” challenge and hiring in general in this global job market that we now are experiencing – can you tell us, as a team lead, what parts of the hiring process are so hard right now? Is it finding qualified candidates for this type of work, in terms of ML platform and the combination of software, data, and machine learning? Or is it closing on candidates? What are some of the challenges you're facing? 31:05 Orr So like you said, I'll speak only to challenges on specifically machine learning platform engineers. 31:12 Vishnu Sure. 31:15 Orr I think we're having more trouble closing qualified candidates. The candidate pipeline is definitely not what it was two years ago. I see much, much fewer candidates and we have to approach them more than they would approach us. Lemonade’s engineering brand is quite good. I also think this is like a very interesting role in general – building a machine learning platform. And yet, we're still having to approach candidates at this point in time, whereas before, we didn't have to at all. 31:54 Demetrios I have a theory about that. Just before we move on. 31:58 Orr I'd love to hear it. 
31:59 Demetrios What I think it is, is that there are so many new startups that are just getting an influx of cash – the amount of VC money that's going into all of this stuff that has to do with machine learning – and now these machine learning engineers are being brought on to all of these different startups and maybe they're the first engineer for the machine learning platform, or they're being tasked with a lot of responsibility. For them, for all of this different talent that's out there, there's a lot of opportunity, right? So the reason, I think, if you go back a notch as to why there are so many job openings for machine learning engineers, it's because of that – the amount of VC money that's going into all of these startups that have anything to do with machine learning, or using machine learning in their product – in the core piece of the product – has just gone up drastically. So that's why it feels like there's not enough talent out there. But that's my theory. 33:01 Orr No, I completely agree. Like, two years ago, there were maybe 5 tech unicorns in Tel Aviv, and now there are 50. 33:08 Vishnu Wow! I mean, Tel Aviv – I know Tel Aviv is a boomtown. It's really cool to see. But I did not know the scale was that massive. 33:19 Orr Yeah. There's a lot of VC money being poured in at the moment. I don't know what's going on specifically, now-now. But in the past year, there was. 33:30 Demetrios Yeah, you're seeing the repercussions of it, now. You're seeing people that have taken jobs maybe three months ago, or five months ago, because they got the VC money a year ago and now they’ve finally found someone to take that machine learning engineering position. So it's harder to find those people. But, Vishnu, I know you had a question and then I cut you off. Orr, sorry – I cut you off with that theory. Tell us more, if you can remember what you were talking about. [laughs] 33:55 Vishnu Go ahead, Orr. 33:55 Orr I was gonna continue with the building blocks, but this conversation is more interesting. 34:05 Vishnu I think one last question I had before maybe we can jump back to the building blocks piece is – you've described a really interesting process, in terms of what ML looks like at Lemonade. How many people do you have on your team right now? And how many people do you support? What does that org chart and headcount look like? 34:28 Orr We're currently four people on our team and we support just over 20 data scientists. I think one of the things that I'm most proud of is that there were several months where we were two people supporting over 20 data scientists. I'm really proud of the engineering-to-data-scientist ratio that we've had there. We still kind of managed to compartmentalize all this from the organization. It's a big part of the platform thinking at Lemonade in the way that even our DevOps organization has exposed building blocks for us to use and the simplicity of us being able to get up and running with anything open source even within a medium- to large-size organization at this point. 35:24 Vishnu First off, I'm glad to hear, number one, what you're proud of – because I think that that is always an interesting lesson in terms of what – you’re the team lead and a leader on your platform and your company – and it's always interesting to hear what leaders celebrate. That tells you about their values. So that's cool. And number two – two people supporting 20 data scientists for a company of Lemonade's size and customer base, that's pretty remarkable. 
And it does speak to the quality of the vision behind your engineering and operations internally. That's really awesome to hear. 36:02 Orr Yeah, man. We're really kind of standing on the shoulders of giants. Both cloud computing in general and then Lemonade's infrastructure on top of that. 36:15 Vishnu I'm glad that people are getting to hear about this through this podcast, because I work in the healthcare sector – I work at an early-stage startup – and I think, for us, in the industry that we're in (healthcare), we deal with a lot of administrative bloat in the US healthcare system by design. There's not as much emphasis on efficiency and not as much understanding, almost philosophically, of the power of leveraging tools and infrastructure to make one person the equivalent of two people five years ago. I think industries and companies like yours are really showing the way to people like mine. Having been a first machine learning engineer hire or a first data scientist hire, I look to companies like Lemonade, or Pinterest, or companies that are a little bit further along in industries where that level of efficiency and infrastructure leverage, I guess you could call it, is prized. So thanks for sharing that. 37:24 Demetrios Infrastructure leverage – that's a new one. I like that. Might have to coin that. 37:30 Vishnu I don't know if that's quite the word, or the verbiage that I want to go with, but we'll keep that there for now. 37:36 Demetrios Well, let's talk for a minute about the other building blocks that we kind of cut off and derailed. We went on a little bit of a tangent. So we got to the first two, right? The feature store, and then MLflow. But there were three more that you mentioned. 37:53 Orr I also mentioned Airflow as training orchestration. Then I've kind of mentioned monitoring beforehand, for which we use Aporia. The final one is model serving. We have this internal model training and serving framework, which I also sort of ended up discussing – the interfaces there are some of the most important things that we handle. There's a set of methods that, if they're implemented, then the platform will guarantee a bunch of things, like a highly available service, and several different types of monitoring. I mentioned before that we use applicative monitoring, generic monitoring of features, and then we have specific monitoring for things like data drift and concept drift. We'll get alerts on those things. CI/CD – you know, all these engineering guarantees – if we implement this set of interfaces. We also get batch inference out of the box, which is kind of nice if people just want to go and do batch inference on this model, even though it's an online model. 39:13 Demetrios There's something that has been coming up quite a bit recently with the MLOps community, not only in Slack, but also from the people that we have on. It's all centered around testing and how to do testing for ML. Specifically, “What kind of things do you look at?” How have you guys cracked that nut? 39:32 Orr I would say that's still one of the biggest challenges that we have ahead of us. On the data side, we do testing. Since we have a feature store, we're able to do testing of feature unification. Features, even if they're generated from different sources, we'll still be able to test that they're correct in inference versus training. But on the model side, I would say we're still at a manual phase in this process. It's still something that's done in notebooks. 40:11 Demetrios Yeah, that's a really classic one. 
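The "feature unification" test Orr mentions checks that the value a feature had when the online path served it matches what the offline, training-time pipeline reconstructs for the same moment. A rough sketch of that parity check is below; the two lookup functions are hypothetical stand-ins for Lemonade's Snowflake-backed offline store and the logged online values, not real APIs:

```python
import math
from typing import List

def offline_feature_value(feature: str, entity_id: str, as_of: str) -> float:
    """Hypothetical: value the training pipeline reconstructs for this point in time."""
    raise NotImplementedError("wire this to the offline (warehouse) feature store")

def logged_online_value(feature: str, entity_id: str, at: str) -> float:
    """Hypothetical: value the real-time serving path actually used, taken from logs."""
    raise NotImplementedError("wire this to inference logs / the online store")

def check_feature_parity(feature: str, entity_ids: List[str], at: str,
                         tolerance: float = 1e-6) -> List[str]:
    """Return entities whose online and offline values disagree beyond the tolerance."""
    mismatches = []
    for entity_id in entity_ids:
        offline = offline_feature_value(feature, entity_id, as_of=at)
        online = logged_online_value(feature, entity_id, at=at)
        if not math.isclose(offline, online, abs_tol=tolerance):
            mismatches.append(entity_id)
    return mismatches
```

Because the feature store owns both code paths, this kind of check can run generically for every feature, which is why Orr can say the data side is tested even while model-level testing still lives in notebooks.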
I feel like that's probably why – we put out a few videos on testing recently, and they've gotten a ton of traffic. And I think it's just because most people are hitting that bottleneck right now and they're saying, “How do I do this? What are some best practices? Who can I learn from?” There's not a lot of literature out there when it comes to testing and some people, depending on which space you're in, depending on your use case, you're going to do testing differently. You need to think about different things and keep certain things in mind when you test. Not to mention, there's a ton of different kinds of tests that you can do and which ones you focus on. So, yeah. 40:57 Orr I’m of the opinion that, just like in monitoring where we thought we could do it automatically and we found out that it's something that – at least at this point in our platform – we have to have data scientists work on this as part of the process. I think we're still at the point where it's the same for testing. Like, we can't auto-test something. I don't know what we could auto-test to have someone feel safer deploying a new model to production. It's still something that people have to take responsibility for, at least currently, on our team. 41:40 Vishnu So, I kind of want to zoom out here from the technology and go back to the big picture. We've talked about some of the wins that you guys have experienced in terms of efficiency. We talked about some of the lessons you've had in terms of automation, and its upsides and downsides with monitoring. We talked a little bit about what the future looks like in the sense that interfaces continue to be a challenge that you are trying to think through from a platform vision. I want to go back to the mission statement that is in your bio – you, as the team lead at Lemonade, and the ML platform team at large, focus on development velocity, improving accuracy, and promoting visibility through machine learning. How did you guys get such a clear mandate? Can you talk us through, historically, what that looked like? Was it sort of the CTO kind of saying “This is the way I want the ML platform to work,” or was it a little bit more dynamic? How did you get to such a clear vision that's then translated into all these results? 42:51 Orr I definitely would say that it's dynamic. I want to say that the head of data science at Lemonade, I think he made a good decision by bringing in engineering quite early in the data science lifecycle at Lemonade, and bringing in dedicated engineering quite early. There was a time, obviously, where Lemonade ran models within the service, and features were sent to it automatically, like our first version of running machine learning at Lemonade. Like, I was the third person on the data science team. I think that's quite early to bring in engineering and I think that's a decision that paid off with building the team in general – to start with platform thinking early within the process. These are goals that have developed along the way. Development velocity is like the biggest thing – it's very easy to state. Then, improving accuracy is something that a platform can allow. And then visibility, both internally in the team, and externally are priorities for us. We want data scientists to be able to see what other people have done and we want model training to be democratized so that anyone can train anyone else's model with different hyperparameters. 
Then externally to the organization is also something that's been understood along the way – that we kind of need to explain what we're doing organizationally and the platform is maybe a place to start. 44:46 Demetrios I want to tell you, just real fast – there's a quick funny story I have about the head of data science at Lemonade. I can't remember his name right now, for the life of me. 44:57 Orr In English, it's Nathaniel, but it's not in Hebrew. 45:01 Demetrios Yeah, that's it. So, back in my Dotscience days, when I was in sales, I reached out to him and tried to sell him the Dotscience platform. And I think I remember just asking something like, “Hey, we do this X, Y and Z with Dotscience. You want more information?” And he was like, “Yeah.” I was like, “Oh, my God! Lemonade! The guy said, ‘Yeah.’ This is amazing!” And then he ghosted me for the rest of my time at Dotscience. So whenever I look at Lemonade, I always remember that. [laughs] 45:28 Orr [laughs] Maybe he hired me at the same time. 45:33 Demetrios [laughs] That's probably it. Yeah, it was like 2019 – was it? 45:37 Orr Yeah, that's when I started. 45:41 Demetrios [laughs]There we go, man.

45:42 Vishnu [laughs] “Sorry, Demetrios. I have Orr now.” 45:45 Demetrios Yeah. I mean, he made the better choice, to be honest. Dotscience went out of business. But maybe they wouldn't have gone out of business if Lemonade was a customer. Who knows? 45:57 Vishnu I think a lot of people – going back to what you just said about you coming in early – I think a lot of companies struggle to embrace platform thinking early because they're not sure if they're ready for the expense and the investment. It's definitely one of those things that pays off long term, but you have to be committed to it. You can't pull the plug early. I see that at the company where I'm at now – where I have to advocate a lot for really wanting to invest our limited time, energy, and resources into building a data platform that'll be really good in a year, and not focus on generating a bunch of one-off reports right now – about different analytics or insights questions. So it's good to hear about a story where this does work and I'm definitely gonna send this podcast to my boss and say, “Hey, look what happens.” [cross-talk]

46:47 Demetrios It kind of feels like this was in the culture, though, of Lemonade – already, before you got there, Orr. 46:55 Orr I agree. Platform thinking is something that's kind of big at Lemonade in general. But it is a gamble, right? Premature optimization is one of the evils of engineering. 47:10 Vishnu Yeah, that's true. I want to close with a quick question about your talk at ApplyCon, which I highly recommend. We're gonna put it in the show notes, so everybody should check it out. It's just 10 minutes. Great overview of how to engineer for real business problems. You had a quote, which is verbatim “If you're making an online prediction, consider making the business point in time part of your machine learning platform.” Can you quickly tell us what ‘business point in time’ is and what it's allowed you to do? Why do you think other people should adopt it? 47:45 Orr Yeah. I’ll start with the fact that it really depends on the product. There will be companies with products that this is completely irrelevant for, and then companies where it may hit a spot where “Oh, this is perfect!” The ‘business point in time’ is basically when you make most of your predictions. So, we'll make a lot of predictions at Lemonade during these specific times during our business flow. Maybe when a user creates a quote, or when they purchase a policy, or when they make a claim, since we're an insurance company – I think it's quite easy to understand this flow. Then we kind of want to know how data looks during this specific point in time, because that's when we're making predictions. So this is a training data notion. And we found that this open way – where data scientists can say, for each data point, when they want the data from, and have it be totally open, and open to issues just as much as it's open to anything – is just not needed; creating training data for these very specific points in time is enough. People don't need that at our company. They want to know how data looked when someone purchased a policy. And that's all they want to know, in most cases. So providing data this way both allows us to test it very well – to make sure that it's unified between training and inference – and then also to just minimize the amount of mistakes and the amount of engineering that goes into making decisions. Because it's kind of only done once – people create data for this point in time. 49:28 Demetrios So I couldn't help but notice something there – when you talked about this. It felt like you got a little bit passionate when you started talking about the way that data scientists want their data, or the different ways, or the rainbow of choices that they have. What's behind that? Why are you so passionate about that? Have you seen it go the wrong way? 49:53 Orr I think this is just how it sort of worked out organizationally at Lemonade. This is what the customers have wanted and they were very adamant about it. But, I mean, I could totally see why some customers would want to be able to choose data from any time. It just kind of depends when you're making predictions. You have to look at what's going on in most of the use cases. 50:23 Demetrios Amazing. Man, this has been – it blew my expectations out of the water. I knew we were gonna have a great chat, but I didn't realize it was gonna be this good. I want to thank you for coming on here and just blowing my mind. Vishnu, as always, a great time hanging out with you, too. And that's about it. You got any final words, Vishnu? 
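"Business point in time" boils down to building each training row exactly as the data looked at the moment a prediction is made (quote created, policy purchased, claim filed). A small pandas sketch of that kind of point-in-time join is below; the column names and values are invented purely for illustration:

```python
import pandas as pd

# Prediction events: one row per "business point in time" (e.g. a policy purchase).
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2021-03-01", "2021-06-15", "2021-04-10"]),
})

# Feature history: every value a feature has taken, with the time it became valid.
feature_history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2021-01-05", "2021-05-20", "2021-02-01"]),
    "claims_last_year": [0, 1, 2],
})

# For each event, take the latest feature value known at or before the event time,
# so training data matches what the online path would have seen at prediction time.
training_rows = pd.merge_asof(
    events.sort_values("event_time"),
    feature_history.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training_rows)
```

Restricting training-data creation to these known business moments is what lets the platform verify training/inference consistency once, instead of re-validating an arbitrary time window for every new model.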
50:43 Vishnu We're gonna take a second afterwards and think about it and come up with lots and lots of quotes, which we have from here. But, I think your emphasis on platform thinking, the lessons you shared with us, and the engineering quality at Lemonade really stand out, Orr. Thanks a lot for joining us. 51:00 Orr Thank you guys for having me. I really enjoyed it. 51:03 Demetrios Your team is hiring, right? 51:05 Orr Oh, yeah. 51:06 Vishnu [laughs] 51:06 Demetrios Always. [laughs] There we go. If anybody is… is it Israel only? Or anywhere? 51:14 Orr For my team, we're hiring only in Tel Aviv. 51:17 Demetrios Okay, Tel Aviv. There are quite a few people in Israel in the community. So if you want to go work with Orr and get some of this incredible way of looking at ML platforms, hit him up. You're in the community Slack, Orr, or I imagine people can just get ahold of you on LinkedIn and all that good stuff.

[outro music]

