MLOps Community

BigQuery Feature Store

Posted Aug 23, 2024 | Views 169
# BigQuery
# Feature Store
# Malt
SPEAKERS
Nicolas Mauti
Lead MLOps Engineer @ Malt

Nicolas Mauti is the go-to guy for all things related to MLOps at Malt. With a knack for turning complex problems into streamlined solutions and over a decade of experience in code, data, and ops, he is a driving force in developing and deploying machine learning models that actually work in production.

When he's not busy optimizing AI workflows, you can find him sharing his knowledge at the university. Whether it's cracking a tough data challenge or cracking a joke, Nicolas knows how to keep things interesting.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Need a feature store for your AI/ML applications but overwhelmed by the multitude of options? Think again. In this talk, Nicolas shares how they solved this issue at Malt by leveraging the tools they already had in place. From ingestion to training, Nicolas provides insights on how to transform BigQuery into an effective feature management system.

We cover how Nicolas' team designed their feature tables and addressed challenges such as monitoring, alerting, data quality, point-in-time lookups, and backfilling. If you’re looking for a simpler way to manage your features without the overhead of additional software, this talk is for you. Discover how BigQuery can handle it all!

TRANSCRIPT

[00:00:00] Nicolas: Hello, I'm Nicolas, Lead MLOps at Malt, and unfortunately I don't drink coffee. I like the smell, but I don't like the taste too much, even if I try regularly, to be honest. And so my morning routine is more about orange juice. I also have a very strange habit, and my colleague told me so: I'm a huge fan of mint syrup with water.

[00:00:35] Demetrios: What is up, good people of the world? Welcome back to another MLOps Community Podcast. As usual, I am your host, Demetrios. Today I'm talking with Nicolas. He got very transparent on how they are using BigQuery as their feature store, and what the pain points were before they implemented it as a feature store.

[00:00:57] Demetrios: And now, the benefits [00:01:00] and how exactly they have implemented the feature store in BigQuery to set them up for success. He echoed something that I have heard time and time again, which is that one of the biggest unlocks from creating a feature store is decoupling the feature creation, or feature generation, from the code creation, or just coding in general, the model creation.

[00:01:28] Demetrios: He said it very eloquently. I'll let you listen to his whole way of putting it, but basically when you're doing one of these two tasks you're in a completely different headspace, and he has seen a huge uptick in the capabilities of the team since decoupling these two tasks. That being said, this is an exciting episode for me because it's a prequel to our Data Engineering for AI and ML virtual conference that we've [00:02:00] got coming up on September 12th.

[00:02:02] Demetrios: Nicolas was a person that filled out the call for speakers. We didn't have enough room because we're full as far as speakers go. But I said, man, this is such a good talk. Just come on the podcast. Let's chop it up and talk about it in a more free-flowing form. And I got to grill him, because I had seen the talk and I knew what he was going to get into.
[00:02:23] Demetrios: So hopefully you enjoy. As always, if you like this episode, feel free to share it with one friend and let them know how they can get their feature store game on point.

[00:02:40] Demetrios: I didn't mention it to you, but dude, "BigQuery Is All You Need" is a great title. I like that title. I know this was originally meant to be a talk and I convinced you to do it as a podcast, so I appreciate you being flexible with me on this. Let's get into it. [00:03:00] Can you break down the scene?

[00:03:03] Demetrios: What is the end goal that you were trying to go for, and how did BigQuery play its part in this?

[00:03:12] Nicolas: So it's a pleasure to be part of this podcast, and for sure it was a talk at the beginning, but I think we will have a great discussion, and so it's good to take some time to talk about that.

[00:03:25] Nicolas: Thanks. So, to explain globally how this project started, because in fact it's a project to use BigQuery as a feature store in our company. How it started: we train a lot of models at Malt. Basically what we are doing is recommender systems, NLP tasks, stuff like that.

[00:03:54] Nicolas: So matching, and this kind of machine learning model. [00:04:00] For that, we train several models for recommendation. Before that, we did the feature engineering just before training the model. So basically you grab some past interactions on the platform, and then you grab some properties. We are matching freelancers with projects.

[00:04:24] Nicolas: So it's a process. And so we are grabbing some information about the freelancers, their skills, daily rate, stuff like that. And also from the project.
And so, in fact, we did that for all our models. It was really part of the training, and we didn't have any source of truth or anything like that.

[00:04:53] Nicolas: And so it was a real pain point. So, [00:05:00]

[00:05:00] Demetrios: It was a pain point because you couldn't really version it, or you couldn't reproduce it?

[00:05:06] Nicolas: Yeah, exactly. In fact, you have several problems. You have some consistency problems, because you can have two data scientists who train models and want to use the same feature, but it's not computed exactly the same way.

[00:05:20] Nicolas: And so it could create confusion, also confusion with the product team, because if you want to talk to the product team and say, okay, we are using that in our model, but this has the same name and it's not computed the same way by two data science teams, it's not very good for the project and for the understanding of our algorithm.

[00:05:40] Demetrios: So there was feature sharing amongst the team, but it wasn't actually reliable feature sharing. It was almost like, yeah, it's the same name, but it's not the same feature underneath the hood.

[00:05:52] Nicolas: Yeah. It was mainly copy-paste sharing, I would say. But for sure, it was [00:06:00] not very strict, and so it was not one source of truth.

[00:06:03] Nicolas: So we had some tables, but sometimes you have to do some computation to prepare your feature or to compute your feature, because it's not directly in the table the way you want. And so sometimes you have to do some joins or stuff like that, and, you know, the business is really complex. It's not so simple, and so sometimes you have different rules to create the same feature.

[00:06:24] Nicolas: So it was one of our problems. The second one is the efficiency.
I would say it's not very efficient to make the computation of the feature in one part of the code and the same computation in another part of the code. Let's share this computation. So, yeah.

[00:06:45] Demetrios: So, 'cause it was computing features

[00:06:48] Demetrios: at each point. So it wasn't caching those features, or it wasn't storing the features anywhere that you could pull from?

[00:06:55] Nicolas: No, it was not storing. We were not storing the features. We [00:07:00] were just computing on the fly and training the model.

[00:07:03] Demetrios: Wow. Yeah. And you were able to get the desired speed, because these recommender systems were low latency, right?

[00:07:13] Nicolas: Yeah. So here we are talking only about the training, I think. We can discuss that, but there are two parts. There is the offline serving, for training the model, and the online serving, to serve the model. To serve the model, we have a completely different approach. In fact, we are just loading the features in memory, because it's not so big.

[00:07:36] Nicolas: And so we can just do it in memory. So it's not a problem. And for that, we have a daily approach and a daily computation. So each day we recompute the features. It could take, I don't know, maybe one hour, something like this. And then we inject them in memory. So that's not the problem.

[00:07:57] Nicolas: The problem was mainly for training the model.

[00:08:00] Demetrios: Yeah. Okay. So the serving side felt like, okay, we're all right here. But as far as the training goes, you didn't have the ability to, A, get reliable feature sharing happening, and B, cut down on the excess computation. It wasn't the most optimal way of doing it, because you were computing features in various different spots, then not saving them, and going through the code and computing the features all over again.
[00:08:35] Demetrios: And so that makes a lot of sense. Was there a third pain point?

[00:08:39] Nicolas: Yeah, there is a third,

[00:08:41] Nicolas: but it's linked to the two other ones. Something that is very important, and it's really part of the feature store and how you manage your features for training your model, is what we call point-in-time retrieval.

[00:08:54] Nicolas: So, in fact, when you want to train a model, it depends on the model, but in our case, [00:09:00] we are using past interactions between freelancers and projects. And so if you want to train a reliable model, you know that, for example, one year ago a freelancer said yes to a Java project, or any language you want.

[00:09:17] Nicolas: But maybe the freelancer has different skills now. Maybe he doesn't want to do Java anymore, or maybe he said no one year ago, but now he has the skills. So you have to get the state of the freelancer, and also of the project, at the time when the interaction happened. And this is very difficult to recompute, especially if you want to do it on the fly.

[00:09:49] Nicolas: So you will basically have to do some big joins between tables that have historical data, and it's very complex. So it's very error prone. And [00:10:00] so the data scientists could make some errors doing this feature engineering, and it's also very expensive. So it's not very efficient either, because you have to scan a lot of historical tables to do that.

[00:10:14] Demetrios: And so you thought to yourself, there's got to be a better way. How did you go about figuring out the solution? Because I think the solution that you ultimately chose, which was using BigQuery as your solution, and that was all you needed, right? That is a little bit counterintuitive.
I would imagine most people would start to go and look at feature stores as their first solution, or maybe they're going to try and figure out some combination of Redis and some open source, probably something like Feast. But why did you say, let's see if we could do this all in BigQuery?

[00:10:55] Nicolas: So you are right. Redis, I think, is more for the online serving. And as [00:11:00] I said, we are doing it in memory, so we don't have this problem. But in fact, we thought about it. We checked feature stores and stuff like that. And we realized that none of them fit all our needs. We had a lot of needs.

[00:11:16] Nicolas: We didn't talk about this before, but one challenge we also had was the monitoring and the anomaly detection on the features. And if you do it on the fly, you cannot do analysis on that. And so, yeah, there were a lot of challenges.

[00:11:35] Demetrios: But the monitoring, sorry to interrupt, the monitoring on the feature creation, or was it on the data itself?

[00:11:44] Demetrios: What monitoring are you talking about?

[00:11:45] Nicolas: Yeah, about the data, and so about the features. For example, if one of your features is the daily rate of the freelancer, you want to assess two [00:12:00] things. The first one is more linked to alerting. You just want to be sure that when you train a model, the daily rates are not all null because there is an issue in the backend or I don't know where.

[00:12:12] Nicolas: And so you want to be sure that your data looks right for training the model. And also, I think you want to monitor the feature across time and be sure that your daily rate doesn't increase or decrease.
It could happen just due to inflation or something like that, but you want to be alerted about that. You want to know that you have this drift in your data, and maybe retrain your model, or check if it's a product problem, so maybe just a change on the product side or

[00:12:48] Nicolas: I don't know. So that's why you want to monitor it all.

[00:12:54] Demetrios: Yeah. Because at the end of the day, if you're getting the wrong rates and [00:13:00] suggesting to freelancers projects that they are way too expensive for, that's going to create a bad suggestion and ultimately not lead to a match. And I imagine that there's a lot of people that are on the platform.

[00:13:17] Demetrios: And if they get recommended three different jobs where the rates are way too low for them... If it's too high, I imagine they're not going to care. If it's more than their daily rate, then that's going to be awesome. It's like, well, might as well try. But if it's too low, that's going to create a bad experience.

[00:13:34] Nicolas: Yeah, for sure. In fact, neither way is great. In one way, I think you won't be able to match, because the daily rates don't fit: you propose a freelancer whose rate is too low while the daily rate on the project is too high. But the other way is not very good either, because you will contact very highly qualified [00:14:00] freelancers with a low-priced project, and so it's not good either.

[00:14:07] Demetrios: At the end of the day, one side is not going to be happy. Yeah, exactly. You got matched. All right. So then you said BigQuery. Walk me through that.

[00:14:16] Nicolas: Yeah. So BigQuery. In fact, we checked a lot of products, and we thought, okay, maybe we can use something that is already in our stack, because it was already in our stack for analytics and stuff like that.
[00:14:30] Nicolas: And so, in fact, we said, okay, let's just build some feature tables. It will be one table with all the features. And the data scientist team will have access only to this table. So you have one column per feature. And this table will be computed on a daily basis. So each day we compute the features for all our freelancers and all our projects, and just [00:15:00] historize this table.

[00:15:01] Nicolas: And so each day we compute all our features, and then we put a timestamp and store it, also in BigQuery. This way, when you train a model on the data science side, you just have to construct your training dataset with your y, what you want to predict, the ID of the freelancer, the ID of the project,

[00:15:22] Nicolas: and the date of the observation. And then you have just one join to do on this table, and you have all the data for all the dates you want. And so, that's it. To be honest, we are done. And if you want to do online serving, you mentioned Redis before, for example. So if you want to go with Redis because it's too large for memory, you can just take the computation of the day and push that into Redis.

[00:15:54] Nicolas: And then when you want to do some live serving of your application, you can [00:16:00] just ask Redis, and you have the latest data available for each entity.

[00:16:05] Demetrios: And so I imagine now you are able to version your features, because you said you basically make these computations once a day, and then you can look back in history and say, what was it on April 17th?

[00:16:20] Nicolas: So in fact it's not exactly versioning. It's just that you have a timestamp, and as you historize the value of each feature each day, you can just go back in time. Maybe I didn't understand the question well, or your argument,

[00:16:38] Demetrios: but... That was it. That was exactly it.
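
The single join Nicolas describes can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library as a stand-in for BigQuery, and every table name, column name, and value below (`freelancer_features`, `interactions`, `snapshot_date`, and so on) is a hypothetical chosen for illustration, not Malt's actual schema. Only freelancer features are shown; project features would be joined the same way.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption: the
# real tables live in BigQuery and are far larger).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE freelancer_features (
    freelancer_id INTEGER,
    snapshot_date TEXT,   -- one historized row per freelancer per day
    daily_rate REAL,
    main_skill TEXT
);
CREATE TABLE interactions (
    freelancer_id INTEGER,
    project_id INTEGER,
    observation_date TEXT,
    accepted INTEGER      -- the label y to predict
);
INSERT INTO freelancer_features VALUES
    (1, '2023-04-17', 400.0, 'java'),
    (1, '2024-04-17', 550.0, 'python');
INSERT INTO interactions VALUES
    (1, 10, '2023-04-17', 1),
    (1, 11, '2024-04-17', 0);
""")


def build_training_set(connection):
    """Join each observation to the feature snapshot taken on the same day."""
    query = """
        SELECT i.freelancer_id, i.project_id, i.observation_date,
               i.accepted, f.daily_rate, f.main_skill
        FROM interactions AS i
        JOIN freelancer_features AS f
          ON f.freelancer_id = i.freelancer_id
         AND f.snapshot_date = i.observation_date
        ORDER BY i.observation_date
    """
    return connection.execute(query).fetchall()


rows = build_training_set(conn)
for row in rows:
    print(row)
```

Because the feature table keeps one snapshot per entity per day, the point-in-time lookup reduces to an equality join on the entity ID and the date. Each interaction picks up the freelancer's state as it was on the day of the interaction, not the current state.
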
And so I'm wondering, now that you have this in play, what have you seen in the past?

[00:16:52] Demetrios: So it's like, you really explained the before to me nicely. And there's pain points. What's the [00:17:00] after, and where are the pain points?

[00:17:03] Nicolas: So what's the after? The after is really good now, because we have these features. And it works, as I said. So the data scientists want to create a model. Okay.

[00:17:13] Nicolas: They have one table. They create their dataset with the y, with the two IDs and the date, and they can just join and get the features they want. So it works very, very well, but I would say that we had a few challenges with that. The first one is: what happens when you want to add a new feature? Something that is not already part of the feature table.

[00:17:40] Nicolas: For example, the number of projects that the freelancer has already done on the platform. This could be a feature. And so

[00:17:51] Nicolas: how do you add a feature, and also how do you add all the history of this feature? Yeah. [00:18:00] So maybe you can... You have to backfill it, right? Yeah, exactly. That's exactly the term. It's backfilling. So this is backfilling. For that, we developed a homemade solution with a script. And so basically,

[00:18:14] Nicolas: you have two different ways to backfill. In fact, our data warehouse is composed of layers. You have the ingestion layers, then you have some transformation layers, and at the end you have the feature tables. And at the ingestion layer, we historize everything. So basically, if you want to add a feature, you can just go to these tables, where you have all your history, and you can do the SQL query you want, or scan the data the way you want, just to compute your feature for all the dates.
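
As a rough illustration of backfilling from a historized table, the sketch below adds a "number of projects done" feature to an existing feature table and fills it for every past snapshot date. It again uses SQLite from the Python standard library in place of BigQuery, and the tables `completed_projects` and `freelancer_features`, their columns, and the data are all assumptions made for this example. It is not the homemade script mentioned in the conversation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical historized ingestion table: one row per completed project.
CREATE TABLE completed_projects (freelancer_id INTEGER, completed_on TEXT);
-- Hypothetical feature table: one row per freelancer per snapshot date.
CREATE TABLE freelancer_features (
    freelancer_id INTEGER, snapshot_date TEXT, daily_rate REAL
);
INSERT INTO completed_projects VALUES
    (1, '2024-01-10'), (1, '2024-02-20'), (2, '2024-03-01');
INSERT INTO freelancer_features VALUES
    (1, '2024-01-01', 400.0), (1, '2024-02-01', 400.0),
    (1, '2024-03-01', 450.0), (2, '2024-03-01', 300.0);
""")


def backfill_projects_done(connection):
    """Add the new feature column and compute it for every past snapshot."""
    connection.execute(
        "ALTER TABLE freelancer_features ADD COLUMN projects_done INTEGER"
    )
    # For each snapshot row, count only the projects completed on or before
    # that snapshot date, so the history reflects what was known at the time.
    connection.execute("""
        UPDATE freelancer_features
        SET projects_done = (
            SELECT COUNT(*)
            FROM completed_projects AS c
            WHERE c.freelancer_id = freelancer_features.freelancer_id
              AND c.completed_on <= freelancer_features.snapshot_date
        )
    """)
    connection.commit()


backfill_projects_done(conn)
history = conn.execute("""
    SELECT freelancer_id, snapshot_date, projects_done
    FROM freelancer_features
    ORDER BY freelancer_id, snapshot_date
""").fetchall()
print(history)
```

The important detail is the date condition inside the subquery: without it, every historical row would receive today's count, which would leak future information into the training data.
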
[00:18:56] Nicolas: If you have the data, for sure. If it's a new feature on the [00:19:00] product, you don't have the data. But for a lot of features, we already have this data in these tables. And you can then update your feature table. For that, we just built a small script where the data scientists just have to input their query, and it will automatically play the query on the feature table and fill it, and it's also versioned.

[00:19:24] Nicolas: And you have a changelog and everything. So if you want to know what happened to this feature at this time, or when this feature was added, or stuff like that, you can go to this changelog or check the version of the feature tables.

[00:19:39] Demetrios: And is there a process? Do people have to create some kind of a PR in order to get a new feature on that table?

[00:19:48] Demetrios: Or is it a free-for-all? Anyone can throw up new features?

[00:19:53] Nicolas: No, exactly. So the process I talked about just before, we call it Upgrader [00:20:00] in our team. And so basically, you have a project on GitLab with the Upgrader, and you have a folder where you can put all your SQL scripts

[00:20:11] Nicolas: to update your feature table, and then they are played automatically. Not by the CI, but by a scheduler. We are using Airflow. And so we have this scheduler that will play all the SQL files that were not already applied on the table, to update it. So for sure, the data science team has to write this SQL query,

[00:20:36] Nicolas: then submit a PR, and when it's validated, we can apply it on the feature table.

[00:20:42] Demetrios: Oh, fascinating. And is there a way now that folks are able to go back and share more easily?
Because you mentioned the sharing was a bit of a pain point, but it was more that sharing was a pain point because two people [00:21:00] were using the same name for something that was computed differently, and it ended up giving them different results after they added that feature to their model.

[00:21:08] Demetrios: Now it seems like you've conquered the problem of having two different computations of a feature with the same name. But I am guessing that it's a lot easier because people can just reference the BigQuery table and see, okay, this feature seems like what I want. I'm going to add that to my model.

[00:21:31] Nicolas: Exactly. And maybe it's a bit of teasing, since we're still working on this, but I would like to develop something to help the data scientist team find the right features, because now you have that. And also in BigQuery, each column is a feature, and you can add a description for each column. And, you know, easily, even with some LLM or stuff like that, you can [00:22:00] maybe say, okay, I want to train this model, and it returns the features that could be interesting for this model. Or just a basic search could also be useful, just to search the features in plain text, or even with something like keyword matching, just to make the discovery easier. It could be a good feature for the data scientist team.

[00:22:25] Nicolas: Because sometimes they don't have the same term for the same feature.

[00:22:30] Demetrios: Yeah, a hundred percent. And that feature discovery is huge, because if you're going through and getting inspiration from what others are using in their models, what types of features they're using, then that can hopefully translate into you creating a better model.

[00:22:49] Nicolas: Exactly.
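
The "basic search" idea for feature discovery could look something like the following minimal sketch. It assumes the column descriptions have already been exported into a plain dictionary; the feature names, descriptions, and the `search_features` helper are hypothetical and only illustrate keyword matching over descriptions, not any tool Malt has built.

```python
# Hypothetical export of column names and their descriptions from the
# feature table (assumption: in practice these would be read from the
# warehouse's table metadata).
FEATURE_DESCRIPTIONS = {
    "daily_rate": "Daily rate requested by the freelancer, in euros",
    "projects_done": "Number of projects the freelancer has completed on the platform",
    "main_skill": "Primary skill listed on the freelancer profile",
}


def search_features(query, descriptions):
    """Rank features by how many query words appear in their name or description."""
    terms = query.lower().split()
    scored = []
    for name, text in descriptions.items():
        haystack = f"{name} {text}".lower()
        score = sum(term in haystack for term in terms)
        if score:
            scored.append((score, name))
    # Best score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


matches = search_features("completed projects", FEATURE_DESCRIPTIONS)
print(matches)
```

Substring matching is deliberately naive here. It would not handle the case Nicolas raises, where two people use different terms for the same feature; that would need synonyms or an embedding-based search on top.
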
So,

[00:22:51] Nicolas: and I think it's also a point we discussed before, but for me it is very linked to the monitoring, because when you have your [00:23:00] feature, you want to know how it behaves: do I have a lot of nulls in this feature, what are the shapes, you know, of the feature and of the data behind this feature.

[00:23:14] Nicolas: And for that, we also leverage a tool that we already had, Grafana. Basically, Grafana is plugged into BigQuery, and so in Grafana you can just explore each feature, and it will give you some descriptive statistics about the features. So, the mean, the average; if it's a categorical feature, you have the counts by category.

[00:23:40] Demetrios: But it's not giving you the usage of the features, is it? Do you have any type of lineage, like, these 10 models are using these features, or this feature is used the most, by 50 different models?

[00:23:57] Nicolas: No, that's a good point. We don't have that right [00:24:00] now.

[00:24:01] Demetrios: I get the feeling that it's probably kind of hard to implement, because you're looking at the whole data lineage as opposed to just the features themselves, right?

[00:24:13] Demetrios: You can't do that in BigQuery. You can't really know who's pulling from what features, I guess.

[00:24:19] Nicolas: Sure. So yes, in fact, we have some data lineage tools that tell us, okay, this column is popular, but we cannot know that it's popular and used by this model in particular. We just know that, okay, this column is used, but we don't know if it's used by this model or this model or this model.

[00:24:42] Demetrios: Huh. Fascinating. Yeah, because I just know, I've heard so many horror stories, and I imagine you deal with this day in, day out.
If one of these columns goes haywire, and like you were saying, you're monitoring it and you're seeing, wow, this [00:25:00] column is now not working as we anticipated it to be working.

[00:25:05] Demetrios: Something's going wrong here. What downstream effects does that have? Which models is that actually affecting? So you can know that potentially you've got to retrain those models. You've got to roll back. You've got to do something before the models that are out in the wild feel that effect.

[00:25:24] Nicolas: Exactly. And so we don't have this right now.

[00:25:29] Nicolas: So no, we don't have this at column level. We have this at table level, but it's not very interesting in this case, because all the models will use the same table. So, yeah, we have dependencies. And we talked about alerting before, but we also have automatic alerts. For that, we also leverage a tool that we already had, Great Expectations.

[00:25:58] Nicolas: And so, Great Expectations.

[00:26:01] Demetrios: What exactly, yeah, you might have just about been saying this. What exactly is it doing? Is it monitoring the data that's flowing into BigQuery?

[00:26:12] Nicolas: We are monitoring. So, in fact, we are calculating the features each day for all our entities. An entity is a freelancer or a project; it's the global name for that.

[00:26:22] Nicolas: And then when we have this table, we will run some tests on it. So Great Expectations is like unit tests on tables. And so if one of the tests fails, it will just stop the whole process. We won't train any model after that, and we'll just have a Slack alert saying, okay, this column doesn't match your expectation.

[00:26:45] Nicolas: And so, for example, this column is always null, or you have a new value in a categorical column. So your categorical column must be this or this, and you have that. And so it will block.
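
The "unit tests on tables" idea, where a failed check blocks training and raises an alert, can be sketched in plain Python. This deliberately does not use the Great Expectations API; the check functions, the `seniority` column, the allowed values, and the `train` and `alert` callbacks are all hypothetical stand-ins for illustration.

```python
class FeatureCheckError(Exception):
    """Raised when a feature column violates an expectation."""


def check_not_all_null(rows, column):
    # Catches the case where every value is missing, for example after an
    # upstream backend issue.
    if all(row[column] is None for row in rows):
        raise FeatureCheckError(f"{column}: every value is null")


def check_allowed_values(rows, column, allowed):
    # Catches a new, unexpected category appearing in a categorical column.
    unexpected = {row[column] for row in rows} - set(allowed) - {None}
    if unexpected:
        raise FeatureCheckError(f"{column}: unexpected values {sorted(unexpected)}")


def run_daily_pipeline(rows, train, alert):
    """Run the checks; train only if all pass, otherwise alert and stop."""
    try:
        check_not_all_null(rows, "daily_rate")
        check_allowed_values(rows, "seniority", {"junior", "senior"})
    except FeatureCheckError as error:
        alert(str(error))  # e.g. post the message to a chat channel
        return False
    train(rows)
    return True


good_rows = [
    {"daily_rate": 400.0, "seniority": "junior"},
    {"daily_rate": None, "seniority": "senior"},
]
bad_rows = [
    {"daily_rate": 400.0, "seniority": "expert"},
]

alerts, trained = [], []
ok_good = run_daily_pipeline(good_rows, trained.append, alerts.append)
ok_bad = run_daily_pipeline(bad_rows, trained.append, alerts.append)
print(ok_good, ok_bad, alerts)
```

The design choice worth noting is that the checks run before any training step and fail closed: a single violated expectation stops the run, which matches the behavior described in the conversation.
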
And so we are alerted for that.

[00:26:59] Demetrios: Yeah, it's [00:27:00] like, this just came in at 14 and usually it's between 0 and 1.

[00:27:06] Demetrios: What's going on here? It's way out of distribution. You might want to go check it out. Yeah, exactly. But again, I guess the interesting piece is you're just monitoring the specific columns with Great Expectations, right? Yeah. You're not actually monitoring the data that is coming in and going through this ingestion and then your transformations and whatnot.

[00:27:30] Nicolas: No, we also have a lot of monitoring about that. I mean, you know, it's more data engineering work than MLOps work, but yeah, they have some tools. Basically, they also have Great Expectations. So in fact, you have Great Expectations all the way, just to detect the issues as soon as possible, because it's easier to fix and it won't impact a lot of downstream jobs or tables or stuff like that.

[00:27:57] Nicolas: So they have Great Expectations. They [00:28:00] also have some scripts or some tables just to assess that the data in this table is coherent with, is in phase with, the data in this other table, or stuff like that. So, yeah, they have Great Expectations. They have some testing along the whole chain, but in our case, for the features, we are using Great Expectations.

[00:28:25] Demetrios: Talking about Great Expectations, it makes me realize that you're probably doing something along the lines of continuously retraining models. Am I right in assuming that? Yeah, exactly. Can you break down how that works?

[00:28:45] Nicolas: So it depends on the model, to be honest. It depends on the generation of the model and a lot of stuff.

[00:28:53] Nicolas: Also, if it's a recent model, we prefer to train it [00:29:00] by hand for the first time, just to be sure that everything looks good.
But otherwise, in most cases, we have a monthly training of our models. So, as I said before, we have some checks on the features, and if everything looks good, we will just train a new model.

[00:29:23] Nicolas: And if the metrics of the model look good, we will deploy it automatically.

[00:29:28] Demetrios: And is this all happening through GitLab?

[00:29:31] Nicolas: No, everything is done in Airflow.

[00:29:35] Demetrios: Airflow. Okay.

[00:29:36] Nicolas: We're using Airflow for the whole scheduling of all these tasks, and also for what we discussed before about the ingestion, and then all the layers of our data warehouse, and then the feature tables.

[00:29:50] Nicolas: So it's all scheduled in Airflow.

[00:29:53] Demetrios: And is it just on a time basis?

[00:29:57] Nicolas: So it's on a [00:30:00] time basis. Airflow uses DAGs, graphs, to run the tasks. So we have some daily or monthly DAGs. But if we want, we can just trigger one manually. So if we detect an issue with the model, or we want to rerun this task, or we want to fix something in a table, we can just clear the task, and it will restart it and fix the whole thing.

[00:30:26] Demetrios: Because I'm wondering, if you recognize that there's some kind of model misbehaving out in the wild, do you have the capability, A, to recognize that a model is misbehaving, and B, just to say, all right, well, let's go retrain it and figure out if that solves the problem?

[00:30:46] Nicolas: Yeah, yeah. We have the possibility to start a training manually and to just release a new model.

[00:30:54] Nicolas: Yeah, it's completely possible. For the [00:31:00] detection of the bad behavior, I would say, of the model, we also have some different monitoring. In fact, when you monitor a model, I think you have multiple ways to monitor it.
So we have business metrics. For example, for the matching, we know that, on average, the conversion is at a certain level.

[00:31:33] Nicolas: So if we see that the conversion is dropping, maybe it's a problem. I don't like this metric, because when you learn from the business that your model does not work correctly, it's a little bit too late. I prefer to detect it early, because when it comes from the business, [00:32:00] it's not very good.

[00:32:01] Demetrios: Yeah. There's some very not happy people that are talking to you or hitting you up on Slack.

[00:32:07] Nicolas: Exactly, and I can understand that, for sure. So that's why we also have some earlier monitoring. We just log everything that comes out of the model, and you can, for example, check the score and how it changes over time, and this kind of stuff.

[00:32:32] Nicolas: With that, we try to detect some issues earlier, and that's also why we added some monitoring on the features, because it's much earlier: your model is not trained yet. And so if we are able to detect some issue in the features, you can avoid issues in the output of your model and in the business.

[00:33:00] Demetrios: And are you the one that's responsible if there is some metric like latency that all of a sudden shoots up, so you're serving the model and it takes one second to serve the model out of nowhere?

[00:33:15] Nicolas: Yeah. It's also part of my job, and it's another kind of monitoring. I would call this platform monitoring.

[00:33:26] Nicolas: So we are monitoring CPU, RAM usage, and also latency. And if there is an issue with that, I can just discuss with the data scientist team how to solve it. So yeah, it's also part of the job.
And sometimes, yeah, with the latency, we can detect other issues. Sometimes, uh, behind the latency issue, you can find maybe a quality issue. [00:33:53] Nicolas: And for example, you work on too many freelancers. And so that's why the latency increased, because your pool is very [00:34:00] large. And so maybe it's a problem from your model. Uh, but you detect it through latency. So sometimes you have this kind of effect for sure. [00:34:08] Demetrios: Yeah. That's a great point that the problems, when you go to the root cause, it's not as simple as if it was just a software problem, because you have to think about, well, there might be a reason that this is happening that is inherent in the model that just got pushed out or just got updated or is going haywire for some reason. [00:34:34] Demetrios: So you have to look at it from that angle and that's why it makes sense that you would be in charge of that as opposed to someone else that isn't as well versed in the ML side of things. Yeah, I heard it put one time as the monitoring side. My friend Shuby told me, this was even four years ago, he said, I like to [00:35:00] monitor and think about monitoring on three different levels. [00:35:02] Demetrios: One is the actual accuracy of the model. Another is the data. So we're monitoring the data. And that seems like what you're doing with, like, the features and what's happening with the features and also with Great Expectations and Grafana on those columns. And then he said on the systems level, so that latency and all that. [00:35:26] Demetrios: So you have those three different ways that you're monitoring that model that is out there making predictions, because it can go wrong on one of those vectors and it can screw everything up. [00:35:39] Nicolas: Yeah, totally agree. I would add the, also the business level.
Uh, I think it's important and you have a lot of factors that could impact the business, but, uh, your model could be one of these factors. [00:35:55] Nicolas: And so, even if it's not very precise and it's not directly [00:36:00] linked to your model, I think it's very important to put this in place because at the end, it's the goal of the model. So to be honest, even if the score of the model looks weird, if the business increases and it's okay, maybe it's not a very big problem. [00:36:17] Nicolas: So you have to balance and so, but at the end, you are, uh, evaluated on the business, not on the accuracy of your model, you know? [00:36:28] Demetrios: And you don't think, how, how do you separate in your head the accuracy versus the business? [00:36:36] Nicolas: So I think it's not separating the accuracy and the business. It's more, okay, if the business goes wrong, let's look at the accuracy. [00:36:45] Nicolas: And so if the accuracy goes wrong also, if you can start to see something on the accuracy that seems to go wrong, okay, let's, let's deep dive into the model and I think the issue should be there. If the business is wrong, but the [00:37:00] accuracy is very good, maybe we can start to look elsewhere. Maybe it's a, maybe it's a model and maybe you can also invest in that, but maybe it's elsewhere. [00:37:10] Nicolas: So it's more, get a lot of different metrics at different stages to, uh, debug more efficiently and to, uh, be able to spot the right spots of the, of the issue. [00:37:25] Demetrios: Yeah, it's, it's such a funny thing that you say on, potentially, business metrics are going up, all the other metrics are going down. So in that case, now, even though that feels like a very rare, [00:37:39] Demetrios: rare occurrence, but in that case, don't touch anything. Just let it run. [00:37:44] Nicolas: Yeah.
Or maybe just investigate, but because maybe you can do better, uh, if your model is better and maybe it could be better because there, as I said, there are also other factors for the business. So maybe just the sales team that works [00:38:00] very well or there are a lot of other factors. [00:38:03] Nicolas: So I didn't say don't touch anything, but it's more, so, okay, let's have also the business metric to just say, okay, maybe my model is not as good as I think it is, but it works. So maybe it's okay. And so say, yeah, it could help you to prioritize your work and to know if you have to investigate a lot of the model or not, depending on the business also. [00:38:31] Nicolas: So I think it's, it's, it's more, it's not business versus accuracy of the model or, or output of the model. It's more, okay, uh, let's get a lot of metrics from different spots and just cross them to check where, where the problem could be. [00:38:52] Demetrios: It's looking at it more holistically and then diving in where you see some things that don't feel right. [00:38:59] Demetrios: Have [00:39:00] you done a retro on, now that you're using BigQuery as your feature store, basically looking at it and saying we've been able to alleviate all these different pain points, but I am assuming that BigQuery comes with a cost. And so you recognize we're paying this much more in hard costs that we can see from BigQuery, but we were paying before this much in people time [00:39:33] Demetrios: that they had to go and recreate a feature when they recognized that it wasn't the feature that they wanted or they had to. So it's all, it feels like there was a lot of fuzzy costs that you really couldn't account for before, but now that you have BigQuery, it's very clear what you're paying. [00:39:54] Nicolas: Yeah, exactly. [00:39:56] Nicolas: Also, we are paying only once a day when [00:40:00] we compute the feature for all the entities, whereas before.
As the featuring was done, um, when we were experimenting or training a model, uh, also locally or testing some stuff, we were paying at each execution of the training and of the featuring. So now we pay once a day and then the data scientists team can access the features whenever they want and nearly at no cost. [00:40:30] Nicolas: We, you just have the cost to get once the feature, but it's pretty okay. Okay. So you have that. Also, I think something that it's not really cost, but something very interesting is that now the data science team is able, is better able to, um, split the work between feature engineering and model training. [00:40:53] Nicolas: And so when they want to train a model, uh, they can just, okay, I will need this [00:41:00] feature. They can start to implement all these features in the feature table we have, and then they train the model and they don't mix both. And I think it's more clear in their head when they are working about, okay, now, I'm optimizing the model and now, I'm doing feature engineering because I think it's, and I did some, uh, data science before, and I trained some models in the past and so it's very, for me, it's a pretty different job. [00:41:33] Nicolas: Not a different job, it is a very different task. So you don't have to think about the same, same thing. And so I think it's much easier for them now to split this work and to have on one side the feature engineering work and maybe they will work for, um, two weeks on that. And after that they could, uh, start to work on the training of the model and just think about that, about the hyperparameters of the model, about the [00:42:00] uh, structure of the model and stuff like that. [00:42:03] Nicolas: And so, I think it's better for them also too. [00:42:06] Demetrios: It's funny because we had on here the creator of Feathr, which is a feature store that got open sourced by LinkedIn probably two years ago.
And one of the things that he said was the biggest boon of having a feature store was the fact that you could decouple the code from features and feature generation. [00:42:29] Demetrios: And that's exactly what I'm hearing you say is how you're in such different headspaces when you're thinking about what kind of features do I want? How am I going to create those features? All of that versus what's the model doing, and the overall coding of the model, and trying to figure out how to make the best model that you can. [00:42:50] Demetrios: That being said, uh, do you know the hard numbers on, like, how much money you saved because now you're not computing features ad hoc [00:43:00] in five different places when a data scientist is training a model or when five different data scientists are training models and you just compute it once a day. So you have a clear cost of what it is. [00:43:10] Demetrios: Did you go back and say, what were we spending and what are we spending? And now we can give a talk at the next FinOps conference. [00:43:21] Nicolas: Um, I don't have very precise numbers to be honest. Also because, um, it was a long project. So we started to work on that one year ago. And so the team changed and there are now more people in the teams who are using BigQuery more. [00:43:36] Nicolas: So for sure the cost increased. Um, but, um, when we started the project, uh, I, um, calculated that the featuring was, uh, several hundred bucks per month, something like this. Uh, and so, yeah, I think we saved, uh, several hundred bucks per month, I think. Demetrios: Because it's gone down to zero or a nominal price? Nicolas: No, no, it's not down to zero, but maybe it's now, uh, now maybe it's 100 per month and it was maybe, uh, three or 400 per month before. [00:43:54] Nicolas: So for sure we re, we reduced. But I think it depends. It's not applicable for other companies. [00:44:00] It depends on the volume of data that you, that you have. Also, we train bigger models.
Now, so it's not very comparable. And so, yeah. [00:44:09] Demetrios: Well, let's talk about that because when do you think this architecture or this style of doing it would fall over? [00:44:15] Demetrios: So someone out there is saying, Oh, maybe I'm going to try this and just make BigQuery my feature store. Where would you recommend to not take this approach? [00:44:28] Nicolas: I would say maybe our biggest challenge now would be if we want to do some live featuring, uh, or live computation of the features. [00:44:41] Demetrios: Um, like on the fly. [00:44:42] Nicolas: Yeah, yeah. [00:44:43] Demetrios: In flight basically, with some Flink, et cetera. [00:44:47] Nicolas: Yeah, exactly. Because I think BigQuery is not, uh, very good to just add data, uh, on the fly, uh, in it. So we are working in batch and I think for [00:45:00] that it works pretty well. But if you want to go with a full on-the-fly pipeline, maybe it's not the best way to go. [00:45:09] Nicolas: And also, as I said, BigQuery won't be enough. And it's not all you need. So, we talked a lot about training of the model and offline serving. But if you want to do online serving, as I said, in our case, we are just grabbing this data from BigQuery and putting it into memory, but, um, if your data is too big to, to be in memory, you will have to use another, uh, database like Redis, like you said before, or this kind of low latency database. [00:45:48] Nicolas: So for sure. Uh, for online serving, if it doesn't fit in memory, you will have another database. And also, if you want to, uh, have very up [00:46:00] to date data and you cannot do daily computation of the feature, uh, maybe BigQuery is not well suited for that. [00:46:08] Demetrios: Yeah. That seems like a very respectable answer and non biased. [00:46:15] Demetrios: It's like showing where it can fall over.
If you're using some kind of a use case that needs real time or very fast, fast computations, in flight computations of features, then you have to look at a different style of architecture. But if you're going with batch, this seems like a really nice way to leverage something that I imagine most people have, some kind of BigQuery type database in their stack. [00:46:49] Demetrios: And so this could be that first step until you get to a place where now your data is too big and it doesn't fit in memory. Now we have to re-architect. [00:47:00] And have you thought about that forward compatibility? So once you do hit a stage where you need to change things around, where are you going next? [00:47:15] Nicolas: Um, about the whole featuring process, or if you want to change a feature, if you want to? [00:47:22] Demetrios: So about the architecture, like when you are to evolve it because now you, you have different requirements. Where do you think, A, what do you think those requirements will be in the future? And B, how do you want to evolve it? [00:47:36] Nicolas: Yeah. Um, so I think it's a similar response as before. I think our two main challenges will be the size of the data, uh, and for sure at some point it won't fit in memory anymore. [00:47:52] Nicolas: And so for that we will have to, um, benchmark, uh, some, um, [00:48:00] low latency database and to check how we can, um, input this data and get this data at serving time. So I think this is one of, of the challenges we will have. And the second one will be the live featuring. Um, and so for that, we already have some Kafka ingestion.
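The online-serving setup discussed above, loading the daily batch of features into memory and switching to a low-latency database once the data no longer fits, can be sketched as follows. Everything here is a hypothetical illustration: the class name, the `entity_id` key, and the row-count budget are assumptions, and the rows would in practice come from a BigQuery query that is not shown.

```python
from typing import Dict, Iterable, Mapping, Optional


class InMemoryFeatures:
    """Hypothetical in-memory snapshot of a daily feature table."""

    def __init__(self, max_rows: int) -> None:
        self.max_rows = max_rows
        self._by_entity: Dict[str, Dict[str, object]] = {}

    def refresh(
        self, rows: Iterable[Mapping[str, object]], key: str = "entity_id"
    ) -> None:
        """Load a new snapshot; fail loudly if it exceeds the memory budget."""
        fresh: Dict[str, Dict[str, object]] = {}
        for row in rows:
            if len(fresh) >= self.max_rows:
                # Signal that it is time to move to a low-latency database.
                raise MemoryError("feature snapshot exceeds the in-memory budget")
            fresh[str(row[key])] = {k: v for k, v in row.items() if k != key}
        # Swap only after a complete load, so a failed refresh keeps old data.
        self._by_entity = fresh

    def get(self, entity_id: str) -> Optional[Dict[str, object]]:
        """Return the features for one entity, or None if it is unknown."""
        return self._by_entity.get(entity_id)
```

Because the swap happens only after a full load, a refresh that blows the budget leaves the previous day's features available for serving.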
[00:48:20] Nicolas: Um, and so we know that we could leverage that also, I think, to, uh, do the computation of the feature, but my main question now is, do we output this computation directly in BigQuery in our feature table, and, and we will have to do a lot of, uh, insertion in BigQuery, as I said before, and, um, yeah, so, or we will manage this data, or maybe we will do micro batch, or, I don't have the answer right now, but for sure, it's all future challenges. [00:48:56] Demetrios: Yeah. Yeah. You know it's coming, but at this point in [00:49:00] time, it sounds like you don't really need to focus on it too much. It can be something. And who knows, by the time you cross that bridge, there might be a tool out there that services your need perfectly. [00:49:11] All right, real quick, I want to tell you about our virtual conference that's coming up on September 12th. This time we are going against the grain and we are doing it all about Data Engineering for ML and AI. You're not going to hear RAG talks, but you are going to hear very valuable talks. We've got some incredible guests and speakers lined up. [00:49:34] You know how we do it for these virtual conferences. It's going to be a blast. Check it out right now. You can go to home.mlops.community and register. Let's get back into the show.
[00:49:47] Demetrios: So well, dude, I appreciate this. You're in France right now? You're not in Paris, are you? [00:49:47] Nicolas: Yeah, no, I'm in Lyon. [00:49:47] Demetrios: And there's some Olympics going on or what? [00:49:47] Nicolas: Yeah, I know. [00:49:47] Demetrios: Did you go and protest the Olympics? [00:49:47] Nicolas: No, I don't protest the Olympics, but yeah, I think it's very, very cool. Okay. Um, but yeah, so no, I, I'm in Lyon, but, uh, Malt is in Paris and so there are a lot of people in the team in Paris, so yeah. [00:49:47] Demetrios: Who had to take the week off because they said, we can't get to the office. [00:49:47] Nicolas: Yeah. [00:49:47] Demetrios: There's Olympic gates and everything up, or fences. They don't let us go through the streets. [00:49:47] Nicolas: Yeah. We are very remote based, so it's, it's not a big issue. So yeah, so I think the offices are pretty empty, uh, in, in Paris, but [00:49:47] Demetrios: Yeah, I believe it. Well, thanks dude. This is awesome that you were so transparent with me and that you taught me a ton about how to just leverage what you already have. And make [00:50:00] use of what's in your stack to the maximum capabilities. [00:50:05] Nicolas: Sure. And it was a pleasure. And so I'm happy if it was interesting for you and if you think that it will be interesting for the community. [00:50:14] Demetrios: We'll see about that. We'll let them give us some feedback. If anybody out there is still listening at the end of the episode, drop in some comments and let us know what you thought. That's all for today.
