Unpacking 3 Types of Feature Stores

Posted Oct 01, 2024
Tags: Feature Stores, LLMs, Featureform
SPEAKERS
Simba Khadder
Founder & CEO @ Featureform

Simba Khadder is the Founder & CEO of Featureform. After leaving Google, Simba founded his first company, TritonML. His startup grew quickly and Simba and his team built ML infrastructure that handled over 100M monthly active users. He instilled his learnings into Featureform’s virtual feature store. Featureform turns your existing infrastructure into a Feature Store. He’s also an avid surfer, a mixed martial artist, a published astrophysicist for his work on finding Planet 9, and he ran the SF marathon in basketball shoes.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

Simba dives into how feature stores have evolved and how they now intersect with vector stores, especially in the world of machine learning and LLMs. He breaks down what embeddings are, how they power recommender systems, and why personalization is key to improving LLM prompts. Simba also sheds light on the difference between feature and vector stores, explaining how each plays its part in making ML workflows smoother. Plus, we get into the latest challenges and cool innovations happening in MLOps.

TRANSCRIPT

Simba Khadder [00:00:00]: Hey, I'm Simba Khadder. I'm the founder and CEO of Featureform, and I like to drink my coffee Arabic style.

Demetrios [00:00:08]: And we're back with another MLOps Community podcast. I am your host as always, Demetrios. Talking to my man Simba today, and we got into the traditional and the new age. Simba gave me a great breakdown on embeddings, because he was one of the first people I heard use the term embeddings back in, like, 2020. He told me why he was so intimate with that term because of his recommender system work back in the day, and why that spurred him to go and create a feature store of his own and start selling it. And so now he is doing the thing that one tends to do when you create something cool, or when you have a pain and then figure out how to ease that pain: you go and bring it to the market to see if anybody else has the same pain.

Demetrios [00:01:03]: He really walked through different ways of looking at feature stores. I appreciated that: how vector stores and feature stores can play together, and why they're so often confused. If you need one, do you not need the other? If you need both, how does that look? And then we talked about LLM land and how personalization is a must when it comes to LLMs. I usually look at it as variables. He gave me a little bit of a light bulb moment and told me, you know, it's just features that you're talking about. Variables are features. And so we can bring those into LLM prompts just as much as we can bring other things into LLM prompts, like the context that you get from a vector store. Enough from me. Let's hear it from Simba.

Demetrios [00:04:02]: Dude.

Demetrios [00:04:03]: So you were the first person, and I gotta give you credit for this because before LLMs, you told me, I'm pretty sure everything's just like moving. It's all gonna be embeddings. And I remember thinking, I barely know what an embedding is. And that's how I know it was before LLMs, because once LLMs came out, I think everybody understood what embeddings were. And can you explain that position? Has it changed since then?

Simba Khadder [00:04:35]: Yeah, it's an experience to have people come to you and try to explain to you what an embedding is. Like, I'll be talking about something and they'll be like, oh, there's this thing called embeddings. I'm like, oh, tell me more. I've never heard of them before. So, yeah, that's like a day in the life. Yeah, I think it's still true. I mean, part of how I got into embeddings, part of how I began to get interested in that field and in what was, at the time, more the cutting edge of deep learning, was I was doing recommender systems. And what was really kind of new then, and maybe still now. Actually, I'm still seeing some articles about it.

Simba Khadder [00:05:20]: Like the two-tower architecture. You're seeing people kind of pushing holistic understanding of items and users to the model itself and essentially turning users and items into embeddings. So there's this concept of, hey, we have this holistic view of what something is. The classic analogy in text is the king is to queen as man is to woman. You can get way more sophisticated than that. I remember I did embedding generation for an e-commerce dataset, and one really cool analogy that came up, which looks like a parallelogram in an embedding visualizer, was: Coke is to Diet Coke as Cherry Coke is to Coke Zero.

Demetrios [00:06:11]: Really? Wait, how does that work out?

Simba Khadder [00:06:14]: Well, when I think of the flavor profiles, Coke is the generic one, that's the standard. Diet Coke is the sugar-free one. Cherry Coke is like Coke standard, but there's almost a slightly different, vanilla, extra-sugary flavor to it. And Coke Zero is like Diet Coke, but has, again, that same kind of extra flavor profile. And I can kind of explain it and we could talk about it and be like, yeah, that's how it tastes. But the crazy thing is that this model was able to derive that just by looking at, oh, user X bought item Y. The first time I started seeing stuff like that, I just was like, man, it really deeply understands what these things are all by itself, by looking at really sparse data. And if you take that and maybe expand it, you kind of start to see what LLMs are.
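To make the parallelogram he's describing concrete, here's a toy sketch of the analogy arithmetic. The vectors and axes are made up purely for illustration; a real model learns hundreds of dimensions from purchase data.

```python
import numpy as np

# Made-up 2-D "embeddings": axis 0 is roughly "extra flavor",
# axis 1 is roughly "sugar-free". Real embeddings are learned, not chosen.
emb = {
    "coke":        np.array([0.0, 0.0]),
    "diet_coke":   np.array([0.0, 1.0]),
    "cherry_coke": np.array([1.0, 0.0]),
}

# "Coke is to Diet Coke as Cherry Coke is to X":
# X is approximately cherry_coke + (diet_coke - coke)
x = emb["cherry_coke"] + (emb["diet_coke"] - emb["coke"])
print(x)  # [1. 1.] -- the "extra flavor + sugar-free" corner, i.e. Coke Zero
```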

Simba Khadder [00:07:08]: Or if you take that to an extreme level, you're like, hey, if you just read random text one token after the other, a model can actually start to understand ideas and concepts and things that you wouldn't really think would be possible just by reading word to word. So this is a very long answer to your very short question, but I think the real answer is that, yes, we're always sort of at the beginning of things. There are always these S-curves: a new architecture comes out, or a new style, or, in this case, a new training technique, and that pushes us to the next level, and then it tends to plateau for a little while. And then it goes to the next thing. At the same time, though, every single time that happens, everyone's like, all right, everything we've ever done is now deprecated. I remember when deep learning became a thing and everyone was talking about it, even like 2017, 2018. It's like, yeah, everything else is deprecated now. We don't need anything now.

Simba Khadder [00:08:16]: Everything is going to be deep learning. And that's it. Feature engineering is dead. Everything else is dead. We stuff to deep learning, but to this day, probably the most common model in production is some form of a random forest, like Xgboost or something similar. And I wouldn't be surprised if that still remains to be true years into the future.

Demetrios [00:08:37]: Yeah, and there are the stories that come up consistently of folks that go through university, do a bunch of cool deep learning problems, and then they get on the job and they're like, where's all my deep learning? What's going on here? First of all, you don't have the data for deep learning. Second of all, that's not what we do here, so get used to XGBoost, baby.

Simba Khadder [00:09:01]: It's so true. And yeah, there's a big gap between how people learn about machine learning and what it actually looks like in an actual company.

Demetrios [00:09:11]: So I really like the idea that you mentioned before of this two tower architecture. I think that's cool. And I hadn't heard that before. And basically you're saying, if I'm understanding it correctly, the customer is put as one type of embedding, and then the item or whatever people buy is another type of embedding. And depending on how close these embeddings live to each other, you can recommend certain embeddings, correct?

Simba Khadder [00:09:44]: Yeah, actually, how we would do it is, if I want to recommend something to a user, there's another common strategy, which again has become re-popularized, called re-ranking, where we would use an embedding and we would do candidate generation. So let's say I'm YouTube, and YouTube actually popularized this in their paper, where they were like, hey, I have a user embedding that the model figured out all by itself, just based on behavioral data. And I will figure out a thousand or 10,000 video embeddings that are near this user embedding, and those are my candidates. And then I'll have specialized models, like one for the side view, one for the autoplay, one for whatever, where I will re-rank those 10,000 candidates to decide what to show people. And so the one thing is that, yeah, embeddings aren't just a text thing.
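A rough sketch of that two-stage pattern, candidate generation followed by re-ranking. Everything here is toy data, and the re-rank scorer is a stand-in for a trained model; at YouTube scale the nearest-neighbor step would use an ANN index rather than brute force.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_videos = 64, 100_000
video_embs = rng.normal(size=(n_videos, dim)).astype(np.float32)
user_emb = rng.normal(size=dim).astype(np.float32)

# Stage 1: candidate generation. Find ~1,000 video embeddings near the
# user embedding (brute-force dot products here for simplicity).
scores = video_embs @ user_emb
candidates = np.argpartition(-scores, 1_000)[:1_000]

# Stage 2: re-ranking. A specialized model per surface (homepage,
# autoplay, ...) scores just the candidates with richer features.
def rerank_score(user, video):
    return float(user @ video)  # stand-in for a trained ranking model

ranked = sorted(candidates, key=lambda v: rerank_score(user_emb, video_embs[v]),
                reverse=True)
to_show = ranked[:10]
```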

Simba Khadder [00:10:46]: Like, we can do embeddings on a lot of things: users, items, images. And the other thing, which I think a lot of people miss and is again getting popularized (I feel like everyone's relearning a lot of things that people in recommender systems learned a few years ago) is that you can actually use embeddings as a drop-in feature in a traditional model. So, for example, if you have a ranking model and you want to put an item in as an input, you can actually just put the whole item embedding in. You can do things like average embeddings together. Like, I can take the last ten items the user watched and just average them together and blend it. And that actually works.
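The averaging trick he mentions is literally a mean over vectors, which can then be dropped into a traditional model's input next to ordinary scalar features. Toy data again:

```python
import numpy as np

rng = np.random.default_rng(1)
last_ten_watched = rng.normal(size=(10, 64))  # embeddings of last 10 items

# Blend the user's recent history into a single "taste" vector.
user_taste = last_ten_watched.mean(axis=0)

# Use it as a drop-in feature alongside normal scalar features.
scalars = np.array([3.0, 0.25])  # e.g. sessions/day, historical click rate
model_input = np.concatenate([scalars, user_taste])  # shape (66,)
```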

Simba Khadder [00:11:32]: But the other funny thing about embeddings is it works except when it doesn't. And when it doesn't, it's quite painful experience of trying to figure out what's happening. Cause they kind of have debug actions on abstractions. Yeah, yeah, it's. Which is why, like actually boosts and things remain very popular, because they might not always have the best performance, but they always do quite well, or they almost always do quite well. And you kind of kind of start to understand what it's doing and why, so you can figure out, you know, make it not do dumb things as easily as.

Demetrios [00:12:07]: That's such a great point. It reminds me of a video I watched on windmills. The typical windmills that we've probably all seen, especially in wind farms, tend to break like ten times more than windmills that spin on the y-axis, the ones that are just spinning around in circles. For anybody that's just listening, you can think of a water bottle that has holes in it, spinning on the y-axis. The typical ones break a lot more than the ones that spin on their y-axis. And so you have to do some calculations on:

Demetrios [00:13:01]: All right, I can generate more power with the typical windmills, but it's in spurts because it's going to eventually throw me an error and it's going to break. But I can get like ten years of no breakage with these very simple windmills. However, it's not going to generate as much power. So it's almost like the tortoise and the hare. And that reminded me of it because you were saying, hey, you can have a this very powerful architecture of setting up with embeddings on embeddings type thing and using those as features, but when it breaks, you potentially are going to spend a lot of time trying to debug it and figure out where it's breaking as opposed to, hey, let's try this simple architecture that if it breaks, we don't need to spend that much time debugging it and it is not really going to break as much. Or if it does break, we can get the uptime is much higher, right?

Simba Khadder [00:14:05]: Yeah, I think that's spot on. It's just more accurate in its simplicity. You know what it's going to do, and you just have to think of your problem space. In some problem spaces, maybe having that amazing once-in-a-while recommendation or prediction is worth the bad predictions. And there are other times where, hey, I'd rather have it do pretty good every time than have it do amazing most of the time but really badly some of the time.

Demetrios [00:14:41]: Well, I feel like you wanted to go down this route, but I know there's people that ask the question frequently, how do vector stores and feature stores play together?

Simba Khadder [00:14:56]: Yeah, it's a good question, but it's almost a funny question because if they didn't both have this word store in them, I don't think anyone would ask this question.

Demetrios [00:15:07]: But they both kind of play with embeddings, right? You were saying feature stores have the embeddings, and then vector stores for sure are all about embeddings.

Simba Khadder [00:15:17]: Yeah, yeah. And they both kind of touch data. I think it's true. It's not things that I typically put right next to each other, but I remember in the early days, we released embedding support in Featureform and there was a mix of reactions. It was mostly people being like, okay, we don't understand. Very few people at that point understood what that meant and why. We open sourced something called Embedding Hub, which was a really early vector database. I think the key difference is really the value prop.

Simba Khadder [00:15:55]: I'll tell you what the key difference is today and where I think it's going. The key difference today is that a vector store is a specialized index, which makes it really quick to do nearest neighbor lookups. To do true nearest neighbor lookups, where I give you a point and you find me ten points nearby is a very, very expensive problem. There are many algorithms that give you approximate nearest neighbor lookups. That's a vector database. They typically build indices. A really common one is called HMSW, which Facebook released in the early days. There was a library I used to use a lot called Annoy, which was Spotify, right? Yeah, Spotify did it.

Simba Khadder [00:16:47]: Actually, that's Eric's, the guy who built annoy, or I believe he built it by himself. But whatever him, he's released it, or I associate with him, he's now the CEO of modal. So it's funny how all these kind of things, people work on very different things. Everyone's kind of seen and played with a lot of the same tools, but yeah, like we. So the problem with a lot of these indices is they don't actually really fit well into the traditional database models. The only one that kind of does is a specific one, which is, I believe it's IVF and uh, it's not as good as HTMSW, but it's a little easier to scale and it works better and looks more like a traditional index. But anyway, so a vector database is a way to index your vectors, your embeddings, such that you can do nearest neighbor lookups pretty quickly. And there is a give and take of how of recall.

Simba Khadder [00:17:51]: And I, that's the difference between the different algorithms. There's also a difference in speed and latency, scale, usability, ux, whatever. So that's where the vector databases fight amongst themselves. Feature stores are almost solving quite a different problem, but has a little bit of overlap. The problem that feature stores are solving is the problem of a lot of ML is inherently a specialized form of data engineer. You take data and you turn it into features put in a less ML wordy way. Like, I have raw data and I need to pull signals out of it. And so a lot of what I'm doing is just kind of iterating on the data to get the best signals that I can provide to my model for it to do a good job.

Simba Khadder [00:18:44]: Now, there's a lot of problems that come into that. Some of it is organization, like versioning, lineage governance comes up a lot, and there's a production problems, which is, okay, well, even if I build this thing in my notebook and I use Polar's or pandas, well, I need to get this in prod. And in prod, if it's like a recommender system, it can go down or bad things happen. So how do I build a enterprise grade data pipeline for all these features I built? And by the way, I'm a data scientist, I'm not a data engineer. And so what we see a lot of is data scientists going to data Eng teams and giving them notebooks, like, oh, here's my notebook, can you please put this in prod? And so there's this awkward friction between data engine data science where data engineers are like, like, these data scientists have thousands and thousands of things they're throwing our way, and we just have enough stuff to do. And in practice, like, the amount of asks that you'll get from data science versus the ROI of every specific data pipeline is actually, it just the math doesn't really math very well. And then for the data scientists, wait.

Demetrios [00:19:57]: Because of the data scientists being way more valuable, like they're, the stuff that they're asking to put in production is so valuable.

Simba Khadder [00:20:05]: Is that why actually, it's more of that. Like, hey, this might incrementally increase my model a bit.

Demetrios [00:20:11]: Oh, okay.

Simba Khadder [00:20:12]: And a lot of asks for a.

Demetrios [00:20:14]: Little bit of lift.

Simba Khadder [00:20:15]: Correct? And. But it's a data scientist's job. And that little lift does make, like, a difference to the business. But from the data engineers perspective, it's just like painful. And from the data scientist's perspective, it's like, how can they do their job? Like, it's science. It's a lot of hypothesis testing and throwing stuff away and evaluating things. If every experiment is very expensive, then you can't really do your job very well. And so where we find future stores being really valuable is making it possible for data scientists to kind of self serve, build and experiment and deploy their own feature pipelines and have us be kind of like this harness that gives them the versioning, the monitoring, the scale, the uptime, all the things that you would expect, incremental processing, stream processing, everything you'd expect.

Simba Khadder [00:21:09]: And maybe a really good data engineer could do. We make it possible for data scientists to do themselves. And so the problem space is different now, the overlap is that, well, sometimes those features are embeddings and so that's where the overlap is.

Demetrios [00:21:27]: I see. That's where they dance together, because I think the confusion comes in when you hear folks talking about how they're using vector stores for their recommender systems.

Simba Khadder [00:21:41]: Yeah. And I think what it is is I would describe it as everyone has a feature store. If you have a model in production that takes data as inputs, whatever process you take to get that data there and manage that data processing, that's your feature store. Now they could be a bunch of bash scripts. It could be a python package that you install in the dark container. Just do the processing on the fly. There's a lot of things that people do, but that whole promise space I would define as a feature store. And a lot of what I'm doing when I'm talking to prospects is trying to understand if they have models of production.

Simba Khadder [00:22:19]: It's like you have something here, help me understand, like what that looks like today so I can show you what the kind of Roi would be.

Demetrios [00:22:27]: Yeah. Do you feel like there are phases to when you need more of a bulletproof solution as opposed to those bash scripts? And. Cause I remember we had Neil Lithia on here ages ago and he was head of ML at Monzo bank. And one thing that he said, I think they had their own homegrown feature store. Right? He was saying, yeah, what I see with, if we wanted to buy something off of the shelf, that's us going and trying to leapfrog. In a way it's us saying, cool, we have nothing. Now we're going to one really quick because we can buy it off the shelf as opposed to we've already got something in there and it's working and the gradual increase of us making it better and better. And so I wonder if there's patterns that you've seen where you're like, I recognize that you're currently at this scale and you want to go and do it this way, or I recognize that your current architecture, your, the way that you set it up is this brittle, because I can only imagine that we all, in the back of our heads, we know our weak spots and we are all just expecting, even if it isn't, we all kind of have nightmares of how brittle our systems are.

Demetrios [00:23:57]: So anyway, Tangent aside, if there was a question in there, I'm sure you're going to answer it right now.

Simba Khadder [00:24:06]: Yeah, I think the question, I would define it as when do you actually need a feature store? Or how do you even decide, feature aside, like, how do you do that analysis of, okay, it's time to go of a vendor or something off the shelf versus, hey, we have our own homegrown hacky thing. We know it's hacky, but what's actual ROI of going for something else? And I actually think that this question was at the heart of, we both kind of saw like an implosion of Mlops startups because there was this, like, humongous, like, everyone had an Mlops company for like a moment in time a few years ago, and then a lot of those companies don't exist anymore. And there was this kind of era where people were buying mlops just almost experimentally. Let's just try this. Or we heard this is a good idea. I'm not really thinking of ROI. This more like, we need more shiny tools. And I actually think that LM ops kind of sort of looks like this today.

Demetrios [00:25:08]: That's what I was going to say. And I don't know if I fully agree with. A lot of those companies are not around anymore. I think a lot have maybe pivoted to LLM ops companies and, or the, the story has changed because folks feel there's a bigger market out there with the LLM use cases, which I'm not necessarily sold on, but that's a whole nother conversation, so forgive me for interrupting you.

Simba Khadder [00:25:38]: No, I think you're. I think the companies that maybe we're referencing are companies that I would say were like above the fold. So companies that most people knew about and had some. There's this like this whole, like, for every company, for every monitoring company you knew about, there was like 15 that, you know, very few of us had ever heard of. Same with feature source. For every feature store company you've heard of, for every feature form, there's 15 companies that even I didn't know about all the time until I randomly run into them in like a cycle or someone's like, oh, there's this thing that has no website, but we're thinking of using. And so, yeah, my brother has a.

Demetrios [00:26:15]: Friend who says he knows a thing or two about feature stores. But also I'll let you make your point on the question that I didn't ask before that I tried to ask but couldn't get out. But I will say that feature stores feel like it is relatively strong because you do see that folks are still coming out with new feature stores. There's, I heard about chalk the other day and I heard about what's the one that Airbnb open sourced. Right. And so the fact that there are more people coming into the space, even though it is not viewed as like, the booming space that it once was, is still a good signal. And I know that you mentioned before we hit record that, like, it's just something you do. It's not as sexy as it was in 2020, but it is still something that is very useful.

Simba Khadder [00:27:11]: Yeah, I think if you talk to at least in future store space, I don't know so much about the other ML spaces, but at least in our space, even, like, our competitors will say, like, yeah, like, all boats are kind of rising in feature stores. There's this way more. I think what it was was a lot of people didn't really fully understand the problem space, or they kind of understood, like, we need to do this ourselves and field ourselves. That might have been driven by a little bit of, like, I think good reasons in some places where, like, hey, before we bring this vendor in, let's make sure we actually understand this problem. There's probably some bad reasons where, hey, this is like the hottest thing ever and everyone who's doing this is writing blog posts. So I want to do this and not buy it, which always happens in new categories. And I think we moved past the error where it's like, hey, the ROI, the Feature store is kind of well understood now if we have data scientists doing ML and they're working with data, they're always complaining about all sorts of things related to the data. Feature store is kind of an umbrella that solves a good set of those problems, but just get one.

Simba Khadder [00:28:15]: Like, there's enough of them very well proven at this point. They have big customers. Like, are we really going to build this in house? Why? And I think the other thing we ran into is even though a lot of companies could build in house, like you mentioned, Airbnb, what ends up happening, all those companies is if they build something valuable in house, the team that builds it then leaves, starts a company and leaves the company holding a bag of, oh yeah, this thing that we built originally, that now everyone who actually wanted to drive it forward has now left, and all you're left with is like maintenance on it and it just degrades, which we've seen a lot of.

Demetrios [00:28:56]: That happened with the LinkedIn feature store, feather, I think, which is a bummer because when it came out, we interviewed the folks that worked on it, and then it feels like it didn't get picked up and it didn't get the energy that it needed to be injected into it. But you made a great point on resume driven development. And in 2020, there was so many blog posts about feature stores. That's so true that, yeah, if I'm going to be working in the mlops space, what am I going to do? Well, an easy win is I build my own feature store. And we even had Nicholas on here a few weeks ago, and he was talking about how he has bigquery as their feature store esque type solution. And so it is like that, that thing still happens. And that was the reason that I wanted to ask you the original question that I asked you and didn't let you finish and give your answer to. So now I'm gonna shut up.

Demetrios [00:29:58]: I swear I'll go on mute and I'll let you give your answer.

Simba Khadder [00:30:01]: Yeah, I also have the problem of like, who also, I'm like, oh, we can talk about all these other things too. But yeah, coming back to the main problem, which is, hey, I, I have bigquery. We have this ad hoc process in the house, and we use it as our feature store. Do we need an actual external feature store? First, I'll answer it very pragmatically of like, you own the whole MLS platform, and feature store is just one of many things that you can invest and buy. I think in practice there are four, call it modalities that ML engineer data scientists will be in. There's the data modality, which is like when they're working with data and putting data platforms in production. There's the training modality, which is experimenting, sometimes locally, sometimes in notebooks, just like building models. This is where ML Flow, this is the part of the model pipeline we're using something like ML flow or weights and biases or something.

Simba Khadder [00:31:01]: There's mall production, which is okay, now I have a small, it's in a model store. I actually need to get it in production. I need to serve it. I need to have this uptime, this latency, all that sort of stuff. GPU's all that comes in. So there's deployment, and then there is evaluation and monitoring, which is the final bit. So those are kind of the four modalities and arguably the four problem spaces, so to say them in order, data model training, model production, or model serving, rather, and then monitoring the valuation. So my first question would be, okay, talk to your user, because you're a platform team.

Simba Khadder [00:31:36]: So if you think of your platform like a product, your goal is to make your ad user happy, and your ad user is a data scientist or ML engineer or something. So what do they of those four things, what's causing them pain, and what does that pain equal? Like beyond just like I hate this. Like, is it slowing things down? Is it causing errors of production? What else? Typically, the question, the answer is almost always going to be one of monitoring the evaluation or data, at least in practice. Ive seen that those two tend to be the ones that have the most pain. And its no surprise then that when you think of mlop startups, the ones that have done quite well are companies in both of those categories, data being feature store companies, and then monitoring the valuation companies. Interestingly, in LLMs, you could almost ask the same questions, and you end up with the same place where most of the problems are data and monitoring evaluation. I come back to that later. But yeah, so that's the first bit is like, which one matters if it is data? The question becomes, what about the data? And one thing about feature frame specifically, and part of how we built it and why we built it, we call ourselves a virtual feature store, or the premise that it sits above your existing infrastructure.

Simba Khadder [00:33:01]: So the problems that we solve are much more about productionizing things and operationalizing feature pipelines. So it sits above bigquery, it sits above the vector store, it sits above these things. So it's the application layer that sits above your infra, and it makes all work together coherently, like a feature store. Gartner came out and called this concept a logical feature store. We weren't even, I came up with the name logical feature. Sort of felt kind of like other handed thing to say, where the logical feature starts and went virtual. But conceptually, it's like the logical layer, the semantic layer of features above the infra. And so if you have a data problem, the question then becomes, what kind of feature store? And there's three kinds which I can get into, but maybe I'll take a second and let you ask any follow ups you have.

Demetrios [00:33:52]: Don't even get me started on Gartner. That's all I got to say about this.

Simba Khadder [00:33:56]: Yeah, but it's an interesting business model. Gartners.

Demetrios [00:34:02]: Well, so the fascinating bit is the, the four pieces of the lifecycle. And I do like how you, you've clumped together. Like there's the data piece, which is, there's so much inside of that data piece that can be a headache. Right? Like just the data acquisition or as you were talking about earlier, you mentioned a few good things like the governance or transformation or access even can be a pain depending on if you're at a gigantic company or not. And then you have the deployment. The model training is another interesting one because that one for sure feels like there's less pain there because the tools in my eyes are very mature. And so you look at like a weights and biases or an ML flow and those tools feel very mature compared to the rest of this life cycle. And maybe that's because the training has been around almost longer than that was like the first thing to kind of pop up in the lifecycle.

Demetrios [00:35:11]: I don't know, maybe you have theories around why those companies are more mature.

Simba Khadder [00:35:15]: Yeah, they just came a little earlier and I think the problem space is a little more well understood. I think data, like you said, what does that even mean? Is very dependent on the type of company. So the solution has to be way more complex. And same with monitoring, evaluation, whereas training and serving. It's like, in the end, it's like this. Lots of matrix multiplications and GPU's is.

Demetrios [00:35:39]: Yeah, just throw more gpu's at it. You're all good. Okay, there we are. Now keep cooking because I know you wanted to say something else.

Simba Khadder [00:35:52]: Yeah, I want to maybe dive because the other thing that I think there's a lot of confusion about with feature stores is, okay, like there's, I get it, like there's this data problem, but there's all these feature stores and they all look kind of different, but they all call themselves feature stores. So kind of like understanding what's actually happening there and what you need. And one thing we've done is we've kind of broken up the feature stores into three distinct categories. The categories are interesting because they all came about at the same time. And there were just three different solutions to the problem. One solution, which was popularized by Feast, I would call the literal feature store, where you literally just store your features there. This is kind of how databricks is feature store. Sagemaker's feature store looks where you transform it and you build it somewhere else.

Simba Khadder [00:36:45]: And they would define the feature as like a table. So the artifact of the transformation is the feature. So the benefits, it's really simple. Like you just have a new sync to put your features into. The cons is everything I just said about productionizing feature pipelines, dealing with streaming data, versioning lineage, none of that stuff really gets solved by the literal feature store. So in some ways it's like a very simple add on, but in other ways it doesn't really do much with the databricks feature store. It's almost like such a lightweight thing that it's almost like a specialized sub part of the Unity catalog. So every single table in databricks is Unity catalog, but has a primary key they just put in the feature store.

Simba Khadder [00:37:36]: It's just like, yeah, it's the same thing. Like a table of a primary key is a feature. That's their approach. And then, yeah, part of the other problem they solve, I should make sure it's clear, is that there's another big problem with features, which is training and inference. At training time you need kind of batch high throughput data of all my historical feature values and then an inference time. I need to be able to just do a quick lookup like, hey, Demetrius just opened up Spotify. I need to do recommendations for him. I need to look up the features now.

Simba Khadder [00:38:06]: And so having like these precomputed cache features and stuff and like Redis or Dynamo and keeping that in sync with your training data is another problem that these things solve.

Demetrios [00:38:16]: Yeah, wait, can we, can we pause real fast right there? Because there's something fascinating for me when it comes to databricks and Sagemaker feature stores and these end to end platforms like Vertex or Azure, I think has one too, right? And it feels like what they did is almost like what Microsoft Teams did against Slack where they said, Microsoft just said, you know what you get office. Cool. You can also have teams. And so Sagemaker said, yeah, cool, you have Sagemaker. Well, you also get a feature store. And I, I would love to hear your thoughts on how that is. Like when you look at the space and you see big platforms that just kind of throw in the feature store as a part of it. Where's your head at on that?

Simba Khadder [00:39:17]: Yeah, so yeah, I'll answer in two parts. First, I'll answer from the perspective of databricks. So databricks or something like databricks, I'm just going to use databricks just because they have a feature store that's well known. So Databricks, their NRR, which for those that know is kind of like their net revenue retention, which is like, if I have x customers and they pay me a billion dollars a year, ignoring new customers, how much will those customers, the same ones, pay me next year? And so it's almost like upsell, like, how much more are you paying daily.

Demetrios [00:39:58]: Bricks band yeah, exactly.

Simba Khadder [00:40:01]: And so I believe their number and someone's going to probably fact check on these, but I'm going to say their number sort of, kind of looks like 150% even. It's 120%. Let's just say 120. If we're at like a billion revenue, that means that those customers paying a billion revenue now pay them 1.2 billion next year. If they do nothing, like they can literally do nothing and they just make 20% more dollars than they made the year before. And they are held to task on that. It's a huge revenue, it's a huge metric that companies like Snowflake and others are held to. How do you do that? Well, there's one bit which is, okay, just make people use the product more.

Simba Khadder [00:40:49]: And there's another bit which is like, okay, how do we get in more places, make it more sticky. So we're like, oh, ML, big use case, lots of money there. Let's just throw tools at it that make it useful for ML and make it stickier for ML, and we benefit from there. So that's kind of perspective is just like we need to, to get people to spend more money on databricks. So they're already on databricks. It's easy to get people to spend more money on a vendor they already have and to get a new vendor in. So we'll just do that. So then there's a question of, okay, feature stores, databricks has one.

Simba Khadder [00:41:25]: And I think this is where the space, and I think a lot of spaces have to justify the incremental value. And so feature form, we took the approach of being open source. We also took the approach of playing very nicely with Databricks feature store, for example, inside every feature, if you're using feature form in databricks, every feature that's built with feature form runs on databricks and appears in the Databricks feature store. So it's like Databricks is feature store, it's only incremental value on top. And then the question becomes, okay, well, how much is the incremental value worth to you? And if it is valuable, we charge x dollars. How much roi do you get on that? And so there's some math to be done there. But yeah, I think you're right in that a lot of companies are releasing these and taking this platform approach. There's this idea of data gravity where if they all lives in one place, it's easier to sell more things on top of it.

Simba Khadder [00:42:22]: And I think the question really becomes quality, which is AWS has a service for everything, right? But it's accepted that the AWS one is the baseline. It's like, okay, it can only get better from here, but is this good enough? And I think the same is true of databricks. But databricks, I think to their credit, still has more of a quality component than maybe like AWS services do. So there's maybe more of an argument to be made. And anyway, I think it's a very long answer to what I think of the, there's a business reason why they do it and that business reason, it's not like we're trying to be an mlops company. I don't think they really, I won't say they don't care about the quality of it, but it's, it's, it's more of a way to drive more spend of databricks. So anything they can do, they could just kind of add on. Is this a way to drive more databrick spend? And in a funny way, if we continue to do as we're doing, continue to do well, continue growing, and it becomes, we become like a de facto feature store.

Simba Khadder [00:43:22]: I don't think databricks cares. I think they're like, cool. We'll just support feature foreign because they're still driving more data to expand. And in the end, that's all databricks cares about.

Demetrios [00:43:32]: Yeah. If you look at what they did with Mosaic, that's exactly what you're talking about. They bought Mosaic and they said, now you have a place inside of databricks to do all your LLM stuff.

Simba Khadder [00:43:45]: Yeah. And Tabular was an interesting acquisition where they were just like, cool. We're just gonna, I mean, that one has a very fun story behind it. But the very short of the story is like, we can pretty much make a whole differentiation, not be differentiation anymore, by just unifying every data format or table format under one. And so, yeah, they've done very well. I mean, so not a company that's easy to compete against, a snowflake. And numbers are saying, yeah, so I.

Demetrios [00:44:15]: Derailed you when you were talking about the different types of feature stores. That was one type, which is, you called it the literal feature store. Right. Classic feature store like Feast or. I'm pretty sure that's what we had Nicholas on here, using bigquery as that. That's where they stored features.

Simba Khadder [00:44:36]: And so that's why the name gets. And that's what a lot of people think of feature store is, which is unfortunate because it's like, oh yeah, it's a feature store. It's a place where you store features. And this is where the vector store comparison comes in. It's like you store your features in the feature store into vectors in the vector store. But how about my vectors are features? Where do they go? And that's where people get confused. But I would almost argue that a feature store isn't really a feature store. It's not really a storage of features.

Simba Khadder [00:45:02]: It's just a term that got coined. And all of us are stuck using other companies, like Tecton, for example, they've gone and tried to rebrand the term. We call ourselves a virtual feature store. They call themselves a feature platform, which all are just trying to show, hey, it's not just a storage of features. And so if I get to like my next maybe category, which I call the physical feature store, some would call like a feature platform. And this is where like Tecton lids and others live. And the approach there is that you do the compute. So it's not just, it's the features defined as the pipeline, which I think is the correct definition.

Simba Khadder [00:45:40]: If I think of a feature in my head, the top five things the user clicked on, like, when I think of that, I'm thinking of it as almost like a SQL query in my head. It's not the table at the end, it's the actual definition. The physical feature stores approach is like, hey, data scientists can give us these definitions. We will build them and make them production ready for you so that you can very easily iterate on features, use them in prod, get that monitoring, get the version, and get all that, those nice things that you need. And we have this hypertuned compute framework that we think is even better than Spark. You mentioned chop. That's they mentioned, they're saying, are you tired of Spark? You use chop. And so that's our approach.

Simba Khadder [00:46:27]: It's like, okay, we are owning the feature pipeline and we are building those features better than you can yourself and better than spark can.

Demetrios [00:46:37]: So excuse me, this is what Nicholas was doing with Bigquery. They were computing. And now that you say that, I'm remembering the conversation because he was saying, before we implemented Bigquery as like the standardized spot, we had all these ad hoc jobs running because data scientists would introduce a little bit of feature compute into their code. And then when they would run it, it would run here or there or whatever. But once we introduced Bigquery and we added standardized like day and night, we would compute all of them. And you would get the most recent version, right?

Simba Khadder [00:47:11]: Yeah. And this is like super. Yes, that's like the value. And that would be, I would call kind of like in that feature platform place. Now if a company in the feature platform, physical feature store space was to go and sell to this person, they would say, yeah, we do it even better than bigquery. So we will process those features. We will read from Bigquery like a source, and then we will compute them and then put them, and essentially we will handle the data from there. The benefit is it truly might be faster, cheaper, whatever the con is typically adoption cost.

Simba Khadder [00:47:49]: All of the physical feature stores are proprietary. There's typically, because they're proprietary, you can't really tune them in the same way you could tune something like Bigquery or Spark. There's no such thing as an expert at tuning those things. It just doesn't exist yet. The con typically comes with partly datagravity. It's like I'm already in Bigquery, I already have Bigquery, or I already have databricks. Why am I going to buy another compute player just for my ML team? That's where our approach, which we call it the virtual approach, it's virtualizing a layer on top. Pretty much what would happen if you built that feature platform.

Simba Khadder [00:48:32]: But rather than actually building the compute engine, we just build plugins to Bigquery, spark, Snowflake, all these different tools. We tune those tools as well as we can to do the feature compute. So from the data scientists perspective, all they're doing is giving us like their data frame transformation or SQL transformation, whatever. We will then handle productionizing that feature on their existing data infra. And so yeah, the benefit we see is just adoption cost is way lower for companies. Like what most companies build in house kind of looks like an ad hoc version of feature form because almost no one is going to build a compute platform in house and no one's going to be like, we're going to build a better bigquery. That's what Airbnb and Uber did back in the day. But I just don't think that the math makes sense anymore, especially given that other things exist out there.

Simba Khadder [00:49:28]: So really they end up building on top of bigquery or spark or whatever. But that's what feature form is. And feature form is open core. So if you're going to do that and you don't want to pay, you can still just use feature farm to do that for you.

Demetrios [00:49:43]: So explain a little bit more about this idea of connecting things and how then you're using these engines, whatever the compute engines are, and then you're spitting out the features to the virtual feature store and then it lives there. And how is that different? I'm trying to see how that's different than what you were saying in the beginning. The literal feature store of like okay, it just lives here and these are the features.

Simba Khadder [00:50:12]: Yeah. So two points. One is you're not ever really spitting it out to the virtual feature store. It is the whole thing. Maybe put one way, it's like the literal feature store is like cache, like a database for features. The physical feature store is like a compute and storage framework for engine four features. So it's doing the compute, it's doing the storage. The virtual feature store is kind of more like an orchestrator and metadata.

Simba Khadder [00:50:42]: So whereas like in the physical feature store you would compute in Veri compute engine and then you store in their storage engine, or maybe this for Dynamo or Redis, but generically their computer storage, we do the same thing except you plug in the compute like Bigquery and you plug in the storage like Redis. And then we will make all those things work together. Click House has written quite a few articles about us and about using feature form and click House because from their perspective, similar to Databricks, they want to have more use cases of Clickhouse. And ML is a big one for about. And so a feature store just makes it easier to use Clickhouse for ML so that data scientists don't have to learn the ins and outs of using Clickhouse and optimizing click house pipelines. They can use feature form, which already has all that built in. And so now a data scientist has to think about, hey, I have this transformation, I'm going to write it, I'm going to register a feature form. Feature form will then build it and put in prod for me, keep it up to date, do stream processing, incremental processing, do point time preference, all the things you'd expect to be done.

Simba Khadder [00:51:45]: Feature form does similar to what a feature platform does, but it's being done on your data infra.

Demetrios [00:51:52]: I see, I understand now you're the connector where there's already whatever a spark cluster or you guys do Hadoop also.

Simba Khadder [00:52:04]: We don't, we don't. We do have, it's been asked for, we've been asked. We've had some interesting, because we work a lot bigger companies and enterprises. We've had some interesting asks.

Demetrios [00:52:14]: Well, it's funny. Just a random side note on the Hadoop thing. Someone came into the community the other day and they were saying, like, I've had it with Hadoop. The community. The Hadoop community is a graveyard. If you want to make any changes, your prs don't get merged until three months later maybe. And so we're watching Hadoop just slowly crumble between before our eyes right now, which is an interesting thing as technology is constantly updating. You can see that.

Demetrios [00:52:46]: But tangent aside, I get it. So you've got the connectors. Nothing new needs to be brought in as far as the storage or compute side of things. It's just that you're able to optimize whatever is already in production or whatever is already on your stack. So you're optimizing it by saying, hey, data scientist, you don't need to learn click house. You can just do what you're best at and let us work on that optimization of click house.

Simba Khadder [00:53:21]: Yes, I do. What's true. Maybe to make it like a little more clear, let's say you as a data scientist are working on like a streaming pipeline. Like, so you have like a stream of user buys and you need to build a feature rare in the literal feature store. It would be your job or your job to go to data engine to build that pipeline, inflamed current spark streaming or whatever. And then you'd have to build a production grade data engine pipeline and manage it and maintain it and have monitoring and have all the things you would need yourself. That's the literal feature store in the physical feature store and the kind of feature platform approach, you would be able to do it yourself and it would work and be great and it would handle streaming quite well and it would scale well and it would be production grade, but it would have to run on their compute. So you have to bring a new compute engine in, a new storage engine's proprietary engine in feature form.

Simba Khadder [00:54:22]: You still just write your SQL query like you would feature form will then take that SQL query and it knows it has a harness to get that SQL query in and run it in a way that is a production grade data pipeline. It's really clear when you start thinking of things like incremental processing and streaming on top of having versioning, lineage search, reuse all the other stuff you'd expect as well.

Demetrios [00:54:45]: Okay, now you did mention before there's this LLM aspect to things. And you were saying it's, when it comes to the data lifecycle, it's also very similar. Although LLMs are this new thing, you still almost have like the same data lifecycle and you still have the same painst, and I think that's why we're seeing a million evaluation companies come out. So I agree with you there. Can you go a little bit deeper on to what you've been seeing and how you've been thinking about that?

Simba Khadder [00:55:17]: Yeah, I think there's a lot there. I'm going to actually just fully focus on this rag, the concept of rag and the data part of LLMs. So rag, when we first kind of came out as a concept of was this idea of I take my documents, I chunk them, somehow I embed them, I do a nearest neighbor lookup based on embedding of the user's prompt or query, and I have some prompt template that I feed my retrieval. So if I do a nearest neighbor lookup, I get five paragraphs from all of my documents that most look like the question, and then I'll just feed that into my template. And it works pretty well. It does. It's just naive rag implementation, and a lot of people did that and they were like, this is it. This is cutting edge.

Simba Khadder [00:56:10]: I think we're going to look back at this and be like, it's so funny that we thought that this was the end all, be all. This is what rag is. I think rag, at its core is the problem of, I have a context window. It's either a set size, like n size, or the bigger I make the context window, the more expensive the query is. And just, I don't want to just put fluff in the context window. So my goal is to have the most information dense context window possible, which is really an information retrieval problem. Like how do I retrieve context, the best information that I can feed into my context window, where a lot of people think of text and nearest neighbors and etcetera. Imagine if I asked you, hey Demetrius, how should I invest my money? I want to save up for retirement.

Simba Khadder [00:57:00]: You would probably ask about my age, my income, my risk profile. You ask a lot of questions about me and about how I, you know, my situation. Those answers are not in a vector database. Like, you're not going to get someone's age, you're not going to do a vector database, look up to get their age. Those might exist in a postgres database somewhere, but if you're giving that context to a prompt, it would make sense to say, hey, I have someone who is 30 years old, they have this much money, they do this, that this, but no one thinks of it that way. Everyone thinks that you have to do nearest neighbor lookups and embeddings and Vector DBS. And I don't think there's any reason that is true, other than one, it does work quite well, naively, and two, it just. Those companies came about first, and almost all the content about rag is written by those companies or proponents of those companies.

Simba Khadder [00:57:52]: This isn't even like a hit on Vector DBS. Vector dbs are extremely powerful. Like I said earlier, I just think it's part of the puzzle. And really the problem space is, and I think honestly, where a lot of vector databases are going to go is become less a vector database and more like a context retrieval database, where I give it a prompt and maybe some other metadata, and it has all of these signals that maybe like exist in a feature store or something and it decides what to pull from the feature store and feed into the prompt. That's kind of the architecture that I think things are going to move to. We're already seeing a little bit of that happening.

Demetrios [00:58:31]: Oh, I love it. Yeah. So basically right now, that personalization aspect in the prompts, you don't see as much. I've seen people talk about you have variables and you want to have variables that you can use across prompts, and so you store those variables in some kind of a database and then you can reference them in different prompts. That's been really cool to see. I have heard of the idea of let's personalize prompts a little bit more. I feel like instinctively this works for certain use cases but not for others. But then as I was trying to think about what use cases it wouldn't work for, I couldn't think of any off the top of my head.

Demetrios [00:59:27]: So I was like, well, maybe it. That's not true. I'm just making that shit up in my head. So I don't know. Do you have any insight on if it's more relative for certain use cases versus others?

Simba Khadder [00:59:43]: I think it probably is. I think what you'll find the most valuable use cases for lms because lms are relatively slow, they're kind of expensive. You're not going to use lms for fraud detection. Most of where you see lms being used and where value is coming from is making copilot, making people better at their job, making them move faster. So we're interacting with the human almost always, and I think that's actually where LLMs do best, their augmentation tool. I'm personally a little skeptical of what can be done in the automation space. True automation, where no one touches it, because it just. I don't think you can achieve that 99% accuracy you need.

Simba Khadder [01:00:25]: I mean, you get to like 95, which sounds great, until you realize that, hey, if it's fraud, you know, like, that 5% of fraud that's getting through is not just like a.

Demetrios [01:00:35]: We can't accounts. Yeah, that's not super fun to think about.

Simba Khadder [01:00:40]: So I think that everything in some way has this level. It's not just personalization for the user, but I do think that's a big piece of it, even just like, what you're calling variables. I mean, those are just features or signals or whatever. We give these things different names, but really all it is is I have these conceptual things, and I want to pull numerical or textual data points about them so I can feed that into the model so the model can do a better job. And I mean, at its core, that is a feature. Yes.

Demetrios [01:01:11]: You're opening my mind now, because I was thinking of it just as attributes of a human, but it's exactly that. The variables I know are describing a company, and then in your prompt you just put the company name in quotations and it can grab that. But that is exactly what it is. It's a feature. So you have all these different variables, and you want to give the most context, and the most accurate context, to the prompt so the LLM can do the best job possible. And so it makes sense that you would break it down and get as descriptive as possible for all these important things, not just, hey, this is company A.

Demetrios [01:02:03]: It's like, if you have a very rich description of what company a is, you add that as the variable.

Simba Khadder [01:02:12]: Yes. And you're seeing like, graph frag and knowledge. Graphs kind of get pushed to, which is a very similar pitch of what I'm saying. Um, and the really funny thing is, is like, then you have graph rag, and they're like, you need that and not vector databases. Like, there's this weird, like, Eva or philosophy, even rag versus fine tuning versus, like, awkward. I it's like everyone's just trying to justify that they are the best way to do them, because there's no real good data yet. No one knows. I think there's very few enterprise grade LLM products in the world.

Simba Khadder [01:02:44]: I think really, the only one that you can really look at as gold standard is bit hub copilot. I think everything else is still either is still kind of in very valuable, but maybe more either consumer driven, which consumers can handle a little more, like it can be a little weird sometimes, or just very experimental. In B two B, you see a lot of things are going out that don't actually work very well. They have great demos, but they're not quite there yet. I think where we'll get to that next level is when we get to the point where it's just like, hey, here's your data. I have this, a very specialized retrieval, these very specialized prompts. Maybe those are models themselves, which I've actually seen people starting to use traditional ML models for re ranking, which is fascinating. And.

Demetrios [01:03:34]: Yeah, a little more. I didn't catch that. Yeah, it's cool.

Simba Khadder [01:03:38]: So if I have a million signals that are relevant to the prompt, the question then becomes, which ones do I feed into the prompt for this specific query and how do I properly place them in the prompt so that the model can understand it and that question and how to do that, I think that is where every Lama index s type company, I think that's where everyone kind of starts. That's where the big hard open question is. I think the vector databases will start to compete there. I think a lot of companies will start to compete in that space. And I really think that's probably one of the hardest problems to solve. And if you can solve it, well, it just becomes a no brainer to use. But I actually, similarly to feature source, I don't think there's really a perfect way to solve it. So you either get very specialized verticalized ones or you will get, and, or you will get more generic ones that give you ways to tune them and specify what you want to do, where.

Simba Khadder [01:04:51]: But really that's the, that's the problem to solve with outlines. And I think a lot of people get too caught up in like, I need a vector database and then to the nearest neighbor lookup and they forget the core problem, which is just information, like patching, like how do I pack the right info and the right density of information to achieve the best result?

Demetrios [01:05:11]: I think that's why for a lot of people, DS PI felt so interesting because it has that capability of fanning out and testing a lot of different prompts and then saying, well, here's the winner of this choice. You could see that being something that becomes more attractive and you start to see incorporated into a lot of these frameworks.

Simba Khadder [01:05:40]: Yeah, I think that's true. I think evaluation, I don't go too deep into it. But I find evaluation to be really interesting and kind of funny, similar to how recommender system evaluation used to be. Because the hard part of recommendations is most people don't give it a thumbs up or thumbs down when you give them a recommendation. Very rarely happens. And on top of that, there's almost like a certain profile of person who does that, and that's probably not the average person. So if you take both those things into account, the question becomes, well, how do I decide if this was a good recommendation? And the answer is implicit info, which is like what Google does. Like, okay, Google, if you search something, did you click, how long did you stay on that page? Did you go back and click on another page? Like you can kind of deduce if a user had value.

Simba Khadder [01:06:25]: And I think evaluation is so focused on, like, we're going to do this statistically or mathematically or even funnier. It's just like we're just going to layer LLMs on top of LLMs until you're so confused about what's happening that you can understand. And will you ever say this is good or bad? When really the answer is all that matters is the end user. And like we talked about, most of the time, the end user is literally a human. So did they find value in it? If yes, then that's good data, and if no, then no. And this is actually opening like chat GPT does, right? Like sometimes they'll give you the two prompts next to each other. Like for sure they're using that information to like decide and tune their models. So I think that's what you're going to, you're going to always see these new user experiences that are driven to teach some other more, which again, we saw in recommendations.

Simba Khadder [01:07:15]: And I think that we're going to see the same thing in LM. So I think it's so interesting coming from Rex's to lms and also having seen a lot of things in between because I just am seeing the same pattern play out.

Demetrios [01:07:27]: History rhymes for sure.
