The Evolution of Lyft's Feature Store // Devon Mittow // DE4AI
Staff Data Engineer with 10+ years of experience in data for retail, fintech, social media, and transportation. Tech driver for the Central Data Engineering team and ML Feature Store at Lyft.
A brief overview of how Lyft's ML feature store has changed and evolved alongside the business.
Adam Becker [00:00:05]: We're gonna talk about the evolution of the feature store at Lyft. I remember, maybe it was 2020 or 2021, it was one of the first blog posts I remember about feature stores and feature platforms, and it was at Lyft, and it was a fascinating blog post. It really set my mind onto the topic. Very curious to see what you guys have been up to.
Devon Mittow [00:00:32]: Yeah, I'm excited to share it with you guys. So here I'm going to talk a little bit. Oh, shoot. Sorry, one sec.
Adam Becker [00:00:45]: Is that the mother duck coming?
Devon Mittow [00:00:48]: Yeah. Okay, so I'm going to talk a little bit about the evolution of the feature service here at Lyft. I'm Devon. I'm a staff data engineer. I've been here for about four years, working on the feature service for the last two. I'm going to give you a brief overview of what our feature platform looks like today, some of the problems that we've had with this mature feature service, some of the solutions that we've implemented to address those problems, and some conclusions that will hopefully be useful to you going forward. I think some of these themes will sound really familiar considering what was just discussed on the panel.
Devon Mittow [00:01:24]: This is a quick overview of what our feature service does. Basically, users are able to configure a feature as a JSON and SQL file in a Git repo. Those configs are integrated with Amundsen, our data discovery platform that was developed here at Lyft and is open source, so people can search for those feature assets. The configs are then picked up by Airflow and used to construct Airflow DAGs, which execute the feature SQL and publish that data to both the offline data warehouse and the online data store, which we refer to as DS Features. It's really just a DynamoDB table that's suitable for online, low-latency key-value fetches. The customers that we have are various ML models, the marketing and comms platform, model-based driver incentives, fraud detection, and dispatch. Some of the most critical tier-zero services running here at Lyft are all customers and are very highly integrated with DS Features. So let's double-click a little bit on what an ML feature actually looks like.
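For illustration, a feature definition in this style might look something like the pair of files below. The schema and field names here are hypothetical, made up for this writeup rather than taken from Lyft's actual format:

```json
{
  "feature_name": "driver_weekly_ride_count",
  "owner": "driver-incentives-team@example.com",
  "upstream_dependencies": ["warehouse.rides.completed_rides"],
  "publish_to": ["offline_warehouse", "ds_features"],
  "quality_checks": [
    {"name": "row_count_min_1000", "blocking": true},
    {"name": "null_rate_under_1pct", "blocking": false}
  ]
}
```

```sql
-- driver_weekly_ride_count.sql: the feature's business logic
SELECT
  driver_id,
  COUNT(*) AS ride_count
FROM warehouse.rides.completed_rides
WHERE completed_at >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY driver_id
```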
Devon Mittow [00:02:39]: As I mentioned, it's a JSON and a SQL file, right? So really, anybody who's capable of writing SQL, configuring a JSON file in a text editor, and creating a Git PR is able to configure an ML feature here at Lyft. Looking a little more specifically at the critical aspects of that JSON config: we have the ownership information, we have the upstream data dependencies if you need to wait for anything, and also any data quality checks that you need to integrate into that Airflow DAG to make sure the data is well formed before publishing it to those different data stores. This leads to a division of responsibilities between the platform and feature owners. Feature owners are responsible for everything that's in these files: the SQL, the business logic, picking a good data source, and configuring ownership and alerting. The feature owner is responsible for defining and writing good data quality checks, making sure that the blocking ones are blocking and the non-blocking ones are not blocking. All of that is the feature owner's responsibility. The platform should just be responsible for online/offline data sync, making sure the data goes where it's supposed to be, orchestration, making sure the Airflow DAGs run as expected, and making sure that data is fetched within appropriate SLAs for whatever service is calling.
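To make the orchestration half concrete, here's a minimal sketch of config-driven DAG generation in Airflow 2.x. The repo layout, helper bodies, and config keys match the hypothetical example above and are not Lyft's actual implementation:

```python
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator

CONFIG_DIR = Path("features")  # hypothetical layout: features/<name>.json + features/<name>.sql

def run_feature_sql(sql_path, **_):
    """Placeholder: read the feature's SQL and submit it to the query engine."""
    sql = Path(sql_path).read_text()
    ...  # e.g. hand `sql` off to the warehouse

def publish(feature_name, target, **_):
    """Placeholder: sync results to the offline warehouse or the online store."""
    ...

# Build one DAG per feature config found in the repo.
for config_path in CONFIG_DIR.glob("*.json"):
    cfg = json.loads(config_path.read_text())
    with DAG(
        dag_id=f"feature__{cfg['feature_name']}",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        compute = PythonOperator(
            task_id="run_sql",
            python_callable=run_feature_sql,
            op_kwargs={"sql_path": str(config_path.with_suffix(".sql"))},
        )
        # Fan out to each configured destination (offline warehouse, online store).
        for target in cfg["publish_to"]:
            compute >> PythonOperator(
                task_id=f"publish_{target}",
                python_callable=publish,
                op_kwargs={"feature_name": cfg["feature_name"], "target": target},
            )
    globals()[dag.dag_id] = dag  # register the generated DAG with Airflow
```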
Devon Mittow [00:03:58]: Right? So there are 2,400 features currently defined on the platform, and the platform team is not capable of taking on those feature ownership and data ownership responsibilities at that scale. But because it's so easy to create a feature on our platform, lots of different personas end up creating features. The feature platform was initially created, I think, in 2017, as you mentioned, Adam. So it's been around for a long time: 2,400 features. Who created them? In an ideal world, it's a data engineer or a data scientist who has a lot of skill in the data area, who knows what best practices are and how to be a good data owner. But in reality, software engineers end up creating them. In some cases analysts end up creating them, people who understand their business case and are using features to build marketing audiences for targeted on-platform marketing messages.
Devon Mittow [00:04:54]: All these different personas can end up creating features because there's such a low barrier to entry. So ideally, that feature creator would understand the DE best practices that most data engineers know like the back of their hand, right? Data contracts: making sure that your upstream data source has commitments for landing time, quality, and availability, and that you understand how to resolve and follow up on any issues. How to write a good data quality check: what kinds of things you should be checking for, and whether or not that kind of check should be blocking. Also, what to do when that check fails; a data quality check isn't worth anything if no one does anything when it fails. What are the next steps you can take to address and resolve a data issue? And also just general on-call knowledge. Some of these feature ownership responsibilities will result in a page coming from a failed Airflow DAG or task, and a data engineer who's very familiar with Airflow will know what to do with that: how to open it up, where the logs are, and how to follow up on resolving the issue.
Devon Mittow [00:05:56]: All those feature owners may not be capable of or have all the skills or tools that they need to really step up and fulfill their data ownership responsibilities. And even in the case where we do have a feature, well, so I guess when I got to the platform, we also had an issue where that ownership configuration field in the JSON file was optional. So only 55% of the features that we had on our platform even had an owner listed. And so if there's not even an owner listed, how can we even possibly expect those feature owners to live up to their responsibilities? Right? So those people didn't, even when they did have the owner configured, they didn't necessarily have the tools and skills they needed to live up to those responsibilities, and it resulted in a degradation of data quality and trust in the data. So where people might have been able to reuse a feature that was already defined on the platform, they didn't know if it was healthy, they didn't know if it was landing on time or if it's well owned. I and so it became an issue, and it also became an issue for us as a platform team. As we go through various migrations, evolutions of the platform, we want to change things. Going through all those rollouts where the platform changes will have an impact on the feature owners, requires some partnership and communication with those feature owners.
Devon Mittow [00:07:12]: And so if we don't have that good ownership information, and if those people aren't sort of aware of how their feature data is operating, it really slowed us down in terms of our iterations on the platform itself. So what did we do about it? The first step was to improve the ownership coverage. That ended up being a lot of slack work. We really started from the read logs. So who's reading this data? Who's fetching this feature? Who will care if we turn it off? And then if you threaten to turn it off, then that will motivate them to step up and become sign up for the ownership responsibilities. But in a lot of cases, there was nobody there. Nobody would care if we turn it off. So we did turn it off.
Devon Mittow [00:07:54]: This is being mentioned in the panel before. We want to detect features that are not being used and proactively deprecate them so that we keep our space limited to those features that are just providing business value. We also improved our observability tools. We made the feature data more accessible through Amundsen, our data discovery platform. We added ownership and tiering and stuff so that's well exposed so people can understand who owns what feature asset. And then we also just built some dashboards that can make that process of troubleshooting, debugging and understanding feature health a lot easier and smoother. Right. And so basically, the conclusion of this little mini story is that you should definitely try to build good data governance into your platform from the beginning.
Devon Mittow [00:08:37]: Because if you have that good ownership information and are maintaining it and are trying to make sure that it's staying in alignment with your organizational structure as it changes, you're going to be in a better spot. You're not going to have to go searching for through slack to get people to sign up to own new things when they don't really want to. Deprecating features should be as easy as creating them. It is a good thing that it's very easy to create a feature, but we also just need to make sure that those rails to deprecate them once they're no longer useful are greased and that it's very easy to do so, so that you can keep that set of features well groomed and also just make it easy to do the right thing. So I can't choose the correct data source for these other people, these other feature owners, but I can make it easy for them to access all the metadata about datasets so that they can make an informed decision quickly and live up to those feature ownership responsibilities without a ton of like background or scaling up. And that it becomes obvious what they need to do as long as they understand their business use case and what they're trying to do with their data. So that's all I got. Thank you so much, Adam Zerny.
Devon Mittow [00:09:47]: That was just my quick little lightning talk there.
Adam Becker [00:09:50]: That was excellent. And it was so in tune with the last panel discussion, I can't even imagine. It's fascinating, actually. Can you go back a few slides? Yes. Actually, no, to that JSON file? Yes. Okay. Give me one second, I want to look at this. In this case, this essentially would solve some of the issues that we were talking about earlier.
Adam Becker [00:10:24]: If imagining that you can actually implement this properly and get full coverage, which is like being able to tell which models are dependent on which other models, which features are related on which other features, and then the moment that you see some features never being relied on by anybody, then you would imagine that at least you have more confidence in being able to turn it off and to discard it.
Devon Mittow [00:10:48]: Yeah, that's definitely the approach that we took. I didn't talk as much in this platform about that kind of observability layer, but because everything in this config is ownership config that has to be manually updated by the feature owner. So I think we all understand the limitations of having ownership metadata like that because people leave, people change people. Who knows if that email lists that you put in there is even still valid? And maybe that team got reorganized out of existence. Right. So you do need to have that automated observability into who's fetching the data. The way that we have it at Lyft is we basically, we can't fire an analytical event on every read activity, but on a sample of like one 1000 of fetches to the DS features online data store, we just fire an analytical event that tells us what service is calling that and who's reading it. So we have some signal that tells us who's actually fetching that data.
Devon Mittow [00:11:42]: And that's kind of the baseline of our visibility into that read activity, because it's just a dynamo database, but it does go through our SDK, and you have to refer to the feature registry to get the feature by name. So sort of in this DS features layer, we have an opportunity to capture a little bit of information about these read calls that are happening. Yeah.
Adam Becker [00:12:03]: Okay, so that's interesting. I wasn't even thinking about that. So there are at least a couple of different ways you can go about it. The first one is to just be very explicit about who is consuming what based on what they're telling you: they sign up, they register, they say, this is my model, and it's consuming this feature or the output from another model. That's when they promise you what they're going to do with it. But then you're saying there might be situations where this is insufficient.
Adam Becker [00:12:28]: You still want to see based on that model, who's the one that's reading from it, and you're saying you've wrapped up the way to even make calls in the first place so that you have some aspect of observability.
Devon Mittow [00:12:39]: Absolutely. Because I think that, yeah, Lyft, you know, what's the average tenure of a software engineer in this industry, right? Like two years or so, something like that. This feature service has been around for seven years. So there's obviously relying on the knowledge of an individual person to say what the dependency structure is, is going to decay over time as your platform matures.
Adam Becker [00:13:02]: Devon, thank you very much for coming and sharing all of the incredible things you guys are up to.