MLOps Community

How Feature Stores Work: Enabling Data Scientists to Write Petabyte-Scale Data Pipelines for AI/ML

Posted Sep 17, 2024 | Views 307
# Feature Store
# Petabyte-Scale
# Featureform
Simba Khadder
Founder & CEO @ Featureform

Simba Khadder is the Founder & CEO of Featureform. After leaving Google, Simba founded his first company, TritonML. His startup grew quickly and Simba and his team built ML infrastructure that handled over 100M monthly active users. He instilled his learnings into Featureform’s virtual feature store. Featureform turns your existing infrastructure into a Feature Store. He’s also an avid surfer, a mixed martial artist, a published astrophysicist for his work on finding Planet 9, and he ran the SF marathon in basketball shoes.

SUMMARY

The term "Feature Store" often conjures a simplistic idea of a storage place for features. However, in reality, feature stores are powerful frameworks and orchestrators for defining, managing, and deploying data pipelines at scale. This session is designed to demystify feature stores, outlining the three distinct types and their roles within a broader ML ecosystem. We’ll explore how feature stores empower data scientists to build and manage their own data pipelines, even at petabyte scale, while efficiently processing streaming data, and maintaining versioning and lineage. Join Simba Khadder, founder and CEO of Featureform, as he moves beyond concepts and marketing talk to deliver real-world, applicable examples. This session will demonstrate how feature stores can be leveraged to define, manage, and deploy scalable data pipelines for AI/ML, offering a practical blueprint for integrating feature stores into ML workflows. We’ll also dive into the internals of feature stores to reveal how they achieve scalability, ensuring participants leave with actionable insights. You’ll gain a solid grasp of feature stores, equipped to drive meaningful enhancements in your ML platforms and projects.

TRANSCRIPT

Demetrios [00:00:07]: Simba, I got something. I gotta show you this real fast. Look what I got here. Oh, look at that. So that's pretty cool. This is my favorite because it's one of the only ones I have that actually fits in all of my cup holders around the house and all that stuff. So if anybody wants to get one of these, maybe you've got to go by Simba's booth in the sponsors channel. So click on the left-hand sidebar and see what Simba's got on offer there.

Demetrios [00:00:39]: Now, I know you've got a talk for us, man. I am excited for it. I'm ready for it. You've got to share your screen, and I'll throw it up here on the stage, and then we'll get rocking.

Simba Khadder [00:00:49]: Sounds good.

Demetrios [00:00:53]: Oh, boom, here we go. I'll be back in like 20 minutes.

Simba Khadder [00:00:57]: Peace. Thanks, dude. Awesome. Hey, everyone, thanks for joining today. Today we're going to be talking about feature stores. We're going to go very broad, from what they are, what they do, and why people care about them, to maybe a little under the hood, where we can understand how they work and how they enable data scientists to build data pipelines at scale. So, feature stores. Remember how we learned machine learning? I don't know if you all took that Coursera class, Andrew Ng's or whatever, but you're given these perfect CSVs, you grab the columns you need, the features are there for you. And a lot of your work was just building simple models, regressions, trees, et cetera, on top of those nice, perfect features.

Simba Khadder [00:02:00]: So I'm here to tell you that those perfect CSVs do not exist in reality. In reality, a lot more time is spent taking data that's a mess, as Demetrios showed in his video, out of the data swamp and into these perfect places where we can actually use it. So you heard it here first: Andrew Ng lied to you. There's no such thing as a perfect CSV to work off of. That's a lot of the work in data science. If you're in a big company, this is maybe going to look slightly familiar. It's a little more messy.

Simba Khadder [00:02:38]: There are a lot of ad hoc processes to get models into production. It's almost a little terrifying how much duct tape is used to hold together some production models that we use every day. So let's break it down. Let's talk about what solutions the market has come up with, how people are actually solving these problems. The agenda for this talk: I'm going to start with the actual data problem in machine learning. I'll get into feature stores, how they work, and the different abstractions and methodologies that different companies have taken to attempt to solve this data problem. And then, at the end, I'm going to get into some of the internals, talk about the data engineering behind the scenes that's abstracted away from you, and how feature stores enable data scientists to build these scalable pipelines. So, let's begin.

Simba Khadder [00:03:41]: What do feature stores do? Feature stores do five things. They facilitate deployment of features and feature pipelines. They enhance collaboration. They organize experimentation, so there's an organizational value to it; it is MLOps, of course. They increase the reliability of your features and your feature pipelines. And finally, they preserve compliance. So why is deploying features hard? Why is this a problem that needs to get solved? Well, when we're building features, often we're working in a notebook. Maybe we're working with pandas or some sample of data.

Simba Khadder [00:04:25]: Maybe we're hitting queries directly on the database. We're building these DAGs to build our features, to build our training sets that we finally use to train the model. In this environment, everything is nice. It's static. I have this notebook, and if you run through it, in theory, it will do the same thing again. It's how a lot of data science is done, in this iterative, really interactive fashion. And it kind of looks like this: you see a lot of pandas, you see a lot of DuckDB, you see a lot of Polars, you see a lot of these sorts of frameworks, Ibis and other things.
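
To make that experimentation paradigm concrete, here is a minimal sketch of the kind of notebook-style feature engineering being described, using pandas on a small made-up transactions table; the data and column names are hypothetical, not from the talk:

```python
import pandas as pd

# Hypothetical in-notebook data; in practice this might come from a
# warehouse query, a CSV export, or a DuckDB/Polars equivalent.
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 25.0, 5.0, 7.5, 12.0],
    "ts": pd.to_datetime(
        ["2024-09-01", "2024-09-03", "2024-09-01", "2024-09-02", "2024-09-05"]
    ),
})

# A typical notebook feature: average transaction amount per user.
avg_amount = (
    transactions.groupby("user_id")["amount"]
    .mean()
    .rename("avg_transaction_amount")
    .reset_index()
)
print(avg_amount)
```

In a notebook this is static and reproducible, which is exactly the property that breaks down once the underlying data starts changing in production.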

Simba Khadder [00:05:07]: And we're working in that paradigm of experimentation. But that paradigm of experimentation is very different from production. In production, the data is changing. We have to maintain and keep features up to date, and oftentimes, especially for low-latency or online models like recommender systems, fraud detection, et cetera, we can't actually process the feature at request time. Those features have to be preprocessed and cached. So now we have to deal with streaming data, streaming pipelines, batch pipelines. We have to deal with situations where we're featurizing data that comes with the request.

Simba Khadder [00:05:47]: Like, if there's a comment and you're trying to see if it's spam, well, the comment comes with the request. So you have all the code and the deployment, and everything looks so different from that nice notebook that we started with. And that's just the beginning of it. There's backfill. Once you start talking about streaming, it just gets even worse. There's this chasm between the experimentation paradigm, the notebook, and production, and it can be almost insurmountable.
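
As a rough illustration of that gap, here is a hedged sketch of what online serving often ends up looking like: precomputed features pulled from a cache, combined with features derived from the request itself. The Redis key scheme and feature names are hypothetical, not from the talk:

```python
import json
import redis  # assumes redis-py is installed and a cache is reachable

r = redis.Redis(host="localhost", port=6379)

def get_model_inputs(user_id: int, comment_text: str) -> dict:
    # Precomputed features: too slow to compute at request time, so a
    # batch or streaming pipeline materializes them into the cache
    # ahead of the request.
    cached = r.get(f"features:user:{user_id}")  # hypothetical key scheme
    user_features = json.loads(cached) if cached else {}

    # Request-time features: the comment arrives with the request, so
    # these are computed inline (e.g., for a spam model).
    request_features = {"comment_length": len(comment_text)}

    return {**user_features, **request_features}
```

None of this plumbing exists in the notebook version of the same feature, which is the chasm being described.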

Simba Khadder [00:06:27]: Part of the goal of feature stores, and of Featureform, is to make that chasm easier to cross. Obviously, you could have everyone write production-grade code from the beginning so there's no separate experimentation pipeline, but that can also make things very slow. There's a reason we use notebooks, pandas, and all these tools in experimentation. So the problem to be solved is: how do we make it possible to work quickly in experimentation, but then get those features actually into production? Because we don't want to end up in situations, which maybe some of you have experienced, where you have this perfect feature in your notebook and you're like, cool, if I build this training set, my model works really well. Great.

Simba Khadder [00:07:17]: Now I need those features to actually exist in production, and that's really hard. The way I see it solved is in two ways. One is you have these unicorn data scientists who are also really good data engineers and can build and deploy their own feature pipelines. Now, writing good SQL, writing good data frame code, that's very doable; most data scientists are fully capable of writing that. Where it gets hard is: how do you tune Spark to do what you want? How do you handle back pressure? How do you handle monitoring? How do you handle incremental runs, because the data is changing? How do you update a streaming feature? There are a lot of these problems that data engineers handle; this is what they do. But for data scientists, it's a different skill set. It's a very hard skill set, and it's orthogonal.

Simba Khadder [00:08:15]: So there are some unicorns who really are exceptional at both, but it's rare, and honestly, you can't just go hire everyone to be this amazing data engineer and data scientist. The other thing I see very often, probably more often, is that the ML data scientists pass the features, sometimes literally just passing a notebook over the fence, to a data engineering team. The problem here is that even though the data engineering team is more capable of building those data pipelines for scale, for production, features are inherently experimental. You're always tweaking them, you're always changing them, whereas most data engineering is built with an engineering mindset. It's engineering versus science; there's a robustness that comes with those pipelines.

Simba Khadder [00:09:01]: And so, one, it tends to go a bit slower. And two, if you think of the incremental ROI of "hey, I have a slightly better feature for my model" versus "I have this new dashboard an exec needs," you're pretty much always going to be deprioritized. So for the complexity of the task versus the actual ROI to the business, it doesn't always make sense for data engineers to be fully focused on all these ML tasks that come in. What we want is a way to democratize this, so that data scientists themselves can build their own production-grade data pipelines without needing to be experts in Spark, because, let's face it, who really is? The other piece that comes with this is: okay, cool, now we have our features deployed. Great. Now we're good, right? I have a few questions for you.

Simba Khadder [00:09:54]: Your features are deployed, you have these pipelines, they're powering your models. What are those features? How did we build them? How do they work? What kind of peculiarities do they have? I know we all document our feature pipelines, right? Maybe this looks a little familiar: you have your features in production, but there might be some untitled notebooks along the way, maybe some table_v4 that we use in prod, maybe some shell scripts and ad hoc spark-submits that we ran to make things work. And yeah, just ignore that df_final_final in the notebook. So there's also this huge ad-hoc-ness to data engineering for ML, where we just do whatever we can to get it working. It's not just hard to manage; it can actually have really negative consequences for your model, because a lot of the time these features are deployed without proper drift monitoring. There's model monitoring, there's concept drift; there are a lot of different types of drift. One type of drift that's really common is in the feature values themselves.

Simba Khadder [00:11:17]: The value is moving over time, and it should, to a certain level; features will change. But you want to understand what's changing and why, catch things that go past a certain heuristic, and make sure you have a yellow flag to look into. The other kind of drift I see very often is where you train a model, and a feature in the training set has a specific distribution, and that distribution is very different from the feature values actually being requested in production. So maybe in training the average age of the user is 50, but in production almost all of your recommendations are going to 20-year-olds. That difference, especially if you didn't mean for it to happen and had no idea it was happening, can have negative consequences for your model's performance. So this tends to be an afterthought. It's already hard enough to get your features into production before you start worrying about monitoring, and then governance, which can be another huge blocker: I need to talk to legal, I need to get sign-off before I can even put this feature in production, if you're in a heavily locked-down space. So these are all the data problems that I see.
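
Here is a minimal sketch of the training-versus-serving distribution check described above, using a two-sample Kolmogorov-Smirnov test from SciPy; the synthetic age data and the threshold are illustrative assumptions, not from the talk:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical feature values: ages seen at training time vs. the ages
# actually showing up in production requests.
training_ages = rng.normal(loc=50, scale=10, size=10_000)
serving_ages = rng.normal(loc=22, scale=4, size=10_000)

# Two-sample KS test: a simple heuristic for "does the serving
# distribution differ from what the model was trained on?"
statistic, p_value = stats.ks_2samp(training_ages, serving_ages)

DRIFT_THRESHOLD = 0.1  # arbitrary heuristic; tune per feature
if statistic > DRIFT_THRESHOLD:
    print(f"Yellow flag: possible feature drift (KS statistic = {statistic:.3f})")
```

In practice a check like this would run on a schedule against each feature, raising the "yellow flag" Simba describes rather than blocking anything outright.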

Simba Khadder [00:12:32]: The idea is, I have a model. Models take features. Those features are just signals to feed in. As a data scientist, you're fully confident in being able to work in a notebook and get those training sets set up. But when it comes to collaborating and deploying and everything that comes with that, it starts to become very ad hoc, and this is where you start to see the cracks. I would say that, in general, the most painful piece is facilitating deployment. That's where I really see it become a complete blocker and cause lots of issues. I've seen a lot of really ad hoc processes just to get these features into production.

Simba Khadder [00:13:17]: None of this is necessarily new. It's not something I've figured out that's breaking new ground. A lot of companies have seen this and have come up with different approaches to solving it. Now, the umbrella of solutions oriented toward solving the data-for-ML problem is feature stores. So some people say, oh, we don't have a feature store. My view is, if you have features in production, if you have models in production that are being served data, then that data is getting there somehow. It's getting processed, it's getting set up, it's getting there somehow. That thing, that process, that's your feature store.

Simba Khadder [00:14:06]: I think part of the issue is actually just in the name. When people hear "feature store," I get a lot of confusion: why is everyone talking about this thing? It's just a database where features are stored; it seems like a lot of hype about a glorified cache. And some feature stores have taken the approach of truly being a store for features. Even though a lot of companies, products, and solutions have tried to rename the category, to "feature platform," "virtual feature store," a lot of different ideas, there's still this tie to "feature store," which I think pretty much everyone who works in features can agree is an imperfect term.

Simba Khadder [00:15:09]: The thing about feature stores is that the name, again, points to storage. But how do you even get the features there? A lot of the deployment problems I talked about actually come in with the transformation piece: how do I take my data sources, transform them, and get them into my inference store and into my training store? And then there's all this infrastructure underneath that I have to worry about. So one attempt at solving this is what I would call the literal feature store. In the literal feature store, as the name implies, it's very literal: you build your features elsewhere and then you store them in the feature store. There are a few products that look like this. A good example would be the Databricks Feature Store, which I'm including here.

Simba Khadder [00:15:50]: So in the Databricks Feature Store, everything is feature tables. They kind of look like this. In the feature tables, there are these features. If I click on one, there's not really much here. This is it. This is what the feature is. Because in the end, it's just a column. It's just tables and columns.

Simba Khadder [00:16:06]: That's what the literal feature store is. The value is that it unifies; it's a singular place to store features, and for serving, it can be used in prod. But the negative is that it's just storage. Compare that to something like Featureform, where the feature is not defined as a table but as a definition: the SQL or data frame logic is actually what the feature is. The orchestration, the monitoring, the materialization all come as part of it, and you start to get much richer value and metadata with that. So yeah, the problem with the literal feature store, and this matters if you're evaluating or building feature stores, is that they don't really do the transformation piece.

Simba Khadder [00:16:49]: And in my opinion, you kind of have to solve most of the problems yourself anyway if you use those. Then there's the feature platform, or physical feature store, approach, which I think is much better. The benefit is that it actually solves the issues we talked about; it really does. The problem, though, is that every product that calls itself a feature platform pushes its own compute engine, often with a custom DSL. They're all proprietary, and that comes with the usual issues of lock-in. So the idea of a virtual feature store was: what would happen if we took the feature platform idea but made it more of a plug-in architecture, so that data scientists still work in PySpark and SQL, still writing code as they would, and define those things in Featureform?

Simba Khadder [00:17:44]: The name Featureform, I came up with because I wanted Terraform for features, that declarative approach; it runs on these engines underneath the hood, and that becomes the virtual feature store. Diving under the hood a bit: Featureform itself, just quickly, since everyone always asks where it runs, runs as a set of services in Kubernetes. One interesting thing it does is that for a lot of tasks, a lot of the glue, it will actually run jobs in Kubernetes to move data around, handle some of the metadata, and handle tasks like compaction. But underneath the hood, what enables Featureform to work, what enables us to build this kind of virtual feature platform, is deeply using open formats. This kind of lakehouse architecture that's appeared, and I know there are some talks on it today, has made it possible to do this. What it does is separate out storage and compute.

Simba Khadder [00:18:45]: And specifically with Iceberg, what Featureform has been able to do is use it as pretty much an index, right? You're storing all this data in S3, but Iceberg allows us to break things down, set up indexes, and make it so that Spark, or whatever engine we use, can pick and choose which data to use. We can pick the right data files, set up the right indices, set up the right partitions; we can do all of that using Iceberg. And it's agnostic to the compute engine, so we can apply our own learnings about how to tune those indices across the different engines.
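
For a sense of what that looks like in practice, here is a hedged sketch of creating a partitioned Iceberg table from PySpark. It assumes a Spark session that has already been configured with an Iceberg catalog; the catalog, namespace, and table names are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.* settings already point "demo" at an
# Iceberg catalog backed by object storage such as S3.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.features.events (
        user_id BIGINT,
        item_id BIGINT,
        ts      TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(ts))  -- hidden partitioning: engines prune files by day
""")
```

Iceberg's metadata tracks which data files belong to which partition, which is what lets any compatible engine skip irrelevant files, the "index" behavior described above.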

Simba Khadder [00:19:27]: The other thing it's allowed us to do, because of some of these optimizations, is unify streaming and batch. What we've done with Iceberg is make incrementally adding new data look like a stream: you can process up to a point and then continue processing off of Kafka. Featureform makes it possible to unify that streaming and batch. As a data scientist, you're still just writing that SQL, you're still writing PySpark, truly, in a notebook. It looks something like this, or a more complex one like this. This is just data frame code. You're giving a little bit of metadata for Featureform to use.
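
The snippet below is an illustrative stand-in for that declarative style, not Featureform's actual API: the decorator is a toy defined inline to show the shape of "data frame code plus a little metadata," while the body is ordinary PySpark, which is the part the data scientist actually writes:

```python
import pyspark.sql.functions as F

def df_transformation(inputs, variant):
    """Toy registry decorator (hypothetical): a real framework would
    record this definition, orchestrate it, and materialize the result."""
    def wrap(fn):
        fn.metadata = {"inputs": inputs, "variant": variant}
        return fn
    return wrap

@df_transformation(inputs=["transactions"], variant="v1")
def avg_transaction_amount(transactions):
    # Plain dataframe logic: under this model, the definition itself
    # is the feature, not a table it happens to land in.
    return (
        transactions
        .groupBy("user_id")
        .agg(F.avg("amount").alias("avg_transaction_amount"))
    )
```

The design point is that the system captures the definition and its metadata, so orchestration, lineage, and materialization can be handled for you.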

Simba Khadder [00:20:13]: Everything is usable both as a data frame and back. And under the hood, we are building production-grade data pipelines, similar to what a data engineer would build themselves. We even provide parameters to tune, so if someone wants to perfectly tune their Spark job or whatever else, they can do so. And that's just the compute piece; on top of that, there's the monitoring, the governance, all these jobs that we run. We essentially provide that full end-to-end feature lifecycle, but we're running all the compute and storage via commonly used open formats so that they're interchangeable. Your data engineers won't see some weird new thing that hosts and stores data.

Simba Khadder [00:21:00]: To them, it will just look like the tools they're used to. But as a data scientist, you get all this stuff built in. And finally, it's open source; check it out, you can find it on GitHub. If you want to learn more or want to reach out, I'll be at our booth after this, and you can always reach [email protected] if you'd like to talk more about this. Awesome.

Simba Khadder [00:21:24]: I think it's question time.

Demetrios [00:21:30]: Question time. Dude, that was really good. The meme game is on point. Some of those, like, "of course we document our features, right? Like, all the time. Always." I could just hear people going...

Simba Khadder [00:21:51]: The CMO, the chief meme officer here.

Demetrios [00:21:53]: Nice. All right, very cool. So there was something you said at the end. I'll ask the first question, then I'll let people drop in questions, because it takes a little bit for the stream to catch up. When you were talking about using Iceberg and being able to use it with Kafka, I didn't quite understand what you meant by that. What's going on there?

Simba Khadder [00:22:20]: Yeah, so think of the concept of: I'm building a new feature. This feature is, let's say, a user's favorite item, per user per day. If I'm building that today and I want to train on it, I want to train on my historical data; that's the idea of backfill. So how do I write one transformation that will pretty much backfill all the historical feature values so I can train, but will also continue to update in production as new data comes in, so that when I'm using it in production, it's staying up to date? Typically, that's done very piecemeal, and it's very custom.

Simba Khadder [00:23:01]: People build and do this themselves. With Featureform, because of the abstractions we've come up with, you just write your SQL query and it'll just make it happen.
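
As a hedged example, the "favorite item per user per day" feature might be written as a single SQL query like the one below; the table and column names are hypothetical, and the point is that the same definition can backfill all of history and then keep running incrementally as new events arrive:

```python
# Hypothetical "interactions" table with (user_id, item_id, ts) columns:
# the most-interacted-with item per user per day.
FAVORITE_ITEM_SQL = """
    SELECT user_id, event_date, item_id AS favorite_item
    FROM (
        SELECT user_id,
               DATE(ts) AS event_date,
               item_id,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id, DATE(ts)
                   ORDER BY COUNT(*) DESC
               ) AS rn
        FROM interactions
        GROUP BY user_id, DATE(ts), item_id
    ) ranked
    WHERE rn = 1
"""
```

The backfill-versus-streaming split then becomes an execution detail handled underneath, rather than two pipelines the data scientist has to write.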

Demetrios [00:23:09]: Okay, cool. So the whole idea with Kafka and Iceberg is basically that you'll unify it, from the backfill up until right now.

Simba Khadder [00:23:23]: Yeah. And think of it from a data science perspective. You're just like, hey, here's how I write that query. I write the query, I register it with Featureform, and now it exists. The whole concept of, okay, I have to backfill, I have to interact with Iceberg, how do I pause that and then start a new streaming job and keep that up to date: there are all these things you have to think about that, if you're a data engineer, are what your days look like, and it's very complicated. As a data scientist using Featureform, it's just: here's my SQL query.

Simba Khadder [00:23:48]: Cool. I'm good. Like, that chasm between production and experimentation just goes away.

Demetrios [00:23:53]: Nice. Okay, cool. Now, the other piece that I'm obviously interested in: you mentioned Iceberg is there, but does it matter? Iceberg, whatever, Delta Lake, Hudi, all those.

Simba Khadder [00:24:10]: We support all three now, because we try not to force it. If some company is fully on Hudi or fully on Delta, we don't really want to push them to use Iceberg. By default, Featureform uses Iceberg, and open-source Featureform only uses Iceberg. And I would say our Iceberg implementation is probably the best. Nowadays, because of Databricks buying Tabular and the idea of unifying Iceberg and Delta, I think the separation between those two will go away. The open question is Hudi.

Simba Khadder [00:24:49]: I like Hudi; we'll kind of see what happens there. But I guess my view is that soon enough, most of what we do will be interchangeable, and we can customize based on the specific formats to get that extra performance.

Demetrios [00:25:04]: Okay. And then for the engines, like, what does that look like?

Simba Khadder [00:25:13]: Yeah, it's interesting. This separation means that every engine uses the indices we create in Iceberg slightly differently, so we actually have to tune them a little bit for each. But one thing that was really exciting happened very recently: Snowflake released something called dynamic tables for Iceberg. What it shows me is that Snowflake is also taking the approach of removing the storage layer, so you can actually store your Snowflake tables in S3 or wherever as Iceberg. And I just think that's where everything's moving. As that continues to happen, the compute engines become plug-and-play, and the focus becomes: how do I tune these indices and build all this glue and harness around it? That's what Featureform is.

Demetrios [00:26:02]: All right. Yeah. This is fascinating for me, because, in a way, I'm trying to put it into a box in my head, right? It's like you're orchestrating things, but you're not orchestrating typical data the way we would think of it, like Airflow orchestrates it; you're orchestrating more the features around it. And I think that's one of the things I've heard people talk about so much: decoupling the features from the code is so powerful.

Simba Khadder [00:26:38]: Yeah, I would describe it very similarly. Airflow is for building and scheduling DAGs; it's good at it, and there are other tools that are good at it too. The thing with features is that there are a lot of semantic problems that only exist for feature engineering. Feature drift monitoring doesn't exist outside of features. So there are all these concepts we have to plug in, and the API and abstractions and everything are really oriented toward the workflow that ML people have.

Demetrios [00:27:08]: Yeah, yeah. So, last question: what's the wildest architecture you've seen with this, whether or not people are using Featureform? What are some fun things that you've seen? Because that's always a good one.

Simba Khadder [00:27:27]: What are fun things I've seen? I think the craziest thing I've seen is, well, one, the tools you'll see at some of these old companies. You'll just learn about these SAP databases and other types of databases that I'd never even heard of before, that we had to learn and figure out. I think one of the funniest ones I ran into was at one company, and this isn't all exactly true, I'm going to change it a bit so I can tell the story. What they would do is pretty much SSH through like four servers, because someone had built enough of a backdoor to be able to get to the data, because they wanted to work in notebooks and not do everything the way data engineering wanted. So they would go and almost sneak in through four servers, download Parquet files, and bring them back. And it would take hours, because those network calls were so slow. It was kind of a mess.

Simba Khadder [00:28:24]: It was kind of funny. It's crazy the things that data scientists will do to avoid using an MLOps platform, but it's not good. I think a lot of MLOps platform people are really caught up in "it's perfect if you use it," but there's a UX problem that gets ignored. It's a product, right? A platform is a product. And so you need to think of who your end user is.

Simba Khadder [00:28:42]: And if they don't like what you have, even if it's theoretically valuable, then it's not valuable.

Demetrios [00:28:47]: Yeah. Yeah, because it's like if you send an email and the subject line doesn't entice the person to open the email, it doesn't matter how good that email is because the person's not going to open it. So I feel you.

Simba Khadder [00:29:03]: And the funny thing is, if no one uses it, you can kind of convince yourself that it's really good and it's the users who are wrong. I've seen that, too. It's kind of a strange, dangerous place to be. Yeah.

Demetrios [00:29:13]: But then, yeah, on the other side, it's also funny, the amount of work that we go through to not have to do that, to not have to use it. It reminds me of when you see the amount of work that some criminals or con men go through to create this elaborate scheme, this elaborate con, and you're like, man, if you had just focused that energy into something legit, you would have what you needed, and probably way more, and it would have been legit and legal.

Simba Khadder [00:29:47]: Yeah, I think it's very true.

Demetrios [00:29:50]: There we go, man. Well, Simba, this has been awesome. Dude, I am going to cheers you virtually. Cheers. Thanks so much. Anybody that wants to chat with Simba can head down to the booth. And we're going to keep rocking, because now we've got the roundtable session.

Demetrios [00:30:07]: Yeah, Simba, I think you're also going to go to a roundtable session, too, right? There's.

Simba Khadder [00:30:11]: Yeah, I think there's a feature store one.

Demetrios [00:30:14]: Yeah, I think there's a feature store one, or like a feature engineering one. So that would be cool to have you there and be able to ask you more questions. So now let's keep it rocking. I'll see you later, Simba. Thanks, dude.

