
BigQuery Feature Store

Posted Aug 23, 2024 | Views 312
# BigQuery
# Feature Store
# Malt
SPEAKERS
Nicolas Mauti
Lead MLOps Engineer @ Malt

Nicolas Mauti is the go-to guy for all things related to MLOps at Malt. With a knack for turning complex problems into streamlined solutions and over a decade of experience in code, data, and ops, he is a driving force in developing and deploying machine learning models that actually work in production.

When he's not busy optimizing AI workflows, you can find him sharing his knowledge at the university. Whether it's cracking a tough data challenge or cracking a joke, Nicolas knows how to keep things interesting.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Need a feature store for your AI/ML applications but overwhelmed by the multitude of options? Think again. In this talk, Nicolas shares how they solved this issue at Malt by leveraging the tools they already had in place. From ingestion to training, Nicolas provides insights on how to transform BigQuery into an effective feature management system.

We cover how Nicolas' team designed their feature tables and addressed challenges such as monitoring, alerting, data quality, point-in-time lookups, and backfilling. If you’re looking for a simpler way to manage your features without the overhead of additional software, this talk is for you. Discover how BigQuery can handle it all!

TRANSCRIPT

Nicolas Mauti [00:00:01]: Hello, I'm Nicolas, Lead MLOps Engineer at Malt, and unfortunately, I don't drink coffee. I like the smell, but I don't like the taste too much, even if I try regularly, to be honest. So my morning routine is more about orange juice. I also have a very strange habit, and my colleagues tell me so: I'm a huge fan of mint syrup with water.

Demetrios [00:00:35]: What is up, good people of the world? Welcome back to another MLOps Community podcast. As usual, I am your host, Demetrios. Today I'm talking with Nicolas. He got very transparent about how they are using BigQuery as their feature store, the pain points they had before they implemented it as a feature store, and now the benefits and how exactly they implemented the feature store in BigQuery to set them up for success. He echoed something that I have heard time and time again, which is that one of the biggest unlocks from creating feature stores is decoupling feature creation and generation from the code, from model creation in general. He said it very eloquently. I'll let you listen to his whole way of putting it, but basically, when you're doing one of these two tasks, you're in a completely different headspace. And he has seen a huge uptick in the capabilities of the team since decoupling these two tasks. That being said, this is an exciting episode for me because it's a prequel to our data engineering for AI and ML virtual conference that we've got coming up on September 12.

Demetrios [00:02:02]: Nicolas was a person that filled out the call for speakers. We didn't have enough room because we're full as far as speakers go, but I said, man, this is such a good talk, just come on the podcast, let's chop it up and talk about it in a more free-flowing form. And I got to grill him because I had seen the talk and knew what he was going to get into. So hopefully you enjoy. As always, if you like this episode, feel free to share it with one friend and let them know how they can get their feature store game on point. I didn't mention it to you, but dude, "BigQuery is all you need" is a great title.

Demetrios [00:02:45]: I like that title a ton. I know this was originally meant to be a talk, and I convinced you to do it as a podcast, so I appreciate you being flexible with me on this. Let's get into it. Can you break down the scene? What is the end goal that you were trying to go for, and how did BigQuery play its part in this?

Nicolas Mauti [00:03:12]: Yeah, it's a pleasure to be part of this podcast. For sure it was a talk at the beginning, but I think we will have a great discussion, so it's good to take some time to talk about it. So, to explain globally: this all started as a project to use BigQuery as a feature store in our company. In fact, we train a lot of models at Malt. Basically what we're doing is recommender systems, NLP tasks, matching, this kind of machine learning model. And for that we train several models for recommendation. Before this project, we did the feature engineering just before training the model.

Nicolas Mauti [00:04:15]: So basically you grab some past interactions on the platform and then you grab some properties from your entities. We are matching freelancers with projects, so we grab some information about the freelancer, their skills, daily rate, stuff like that, and also about the project. In fact, we did that for all our models, it was really part of the training, and we didn't have any source of truth or anything like that. And so it was a real pain point.

Demetrios [00:05:03]: Because you couldn't really version it or you didn't understand, you couldn't reproduce it.

Nicolas Mauti [00:05:08]: Yeah, exactly. In fact, you have several problems. You have consistency problems, because you can have two data scientists training models who want to use the same feature, but it's not computed exactly the same way. That can create confusion, also confusion with the product side, because if you tell the product team, okay, we are using this in our model, and it has the same name but it's not computed the same way by two data science teams, it's not very good for the project or for the comprehension of our algorithms.

Demetrios [00:05:42]: So there was feature sharing amongst the team, but it wasn't actually reliable feature sharing. It was almost like, yeah, it's the same name, but it's not the same feature underneath the hood.

Nicolas Mauti [00:05:54]: Yeah, it was mainly copy-paste sharing, I would say. But for sure, it was not very strict, so there was not one source of truth. We had some tables, but sometimes you have to do some computation to prepare your feature or to compute your feature, because it's not directly in the table the way you want it. So sometimes you have to do some counts or things like that. The business is really complex, it's not so simple, and so sometimes you have different rules to create the same feature. So that was one of our problems. The second one is efficiency.

Nicolas Mauti [00:06:30]: I would say it's not very efficient to do the computation of a feature in one part of the code and the same computation in another part of the code. Let's share this computation. So, yeah.

Demetrios [00:06:47]: Because it was computing features at each point. So it wasn't caching those features or storing them anywhere that you could pull from.

Nicolas Mauti [00:06:57]: No, we were not storing the features. We were just computing them on the fly and training the model.

Demetrios [00:07:06]: Wow.

Nicolas Mauti [00:07:06]: Wow, wow.

Demetrios [00:07:06]: Yeah. And you were able to get the desired speed because these recommender systems were low latency, right?

Nicolas Mauti [00:07:15]: Yeah. So there we are talking only about one part, and I think we can discuss that, but there are two parts. There is offline serving for training the model and online serving to serve the model. To serve the model, we have a completely different approach. In fact, we are just loading the features in memory, because it's not so big, so we can just do it in memory and it's not a problem.

Nicolas Mauti [00:07:44]: And for that we have a daily approach, a daily computation. Each day we recompute the features, which could take, I don't know, maybe one hour, something like this, and then we load them into memory. So that's not the problem. The problem was mainly for training the model.
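
To make the online-serving setup Nicolas describes concrete, here is a minimal sketch of loading the latest daily feature snapshot into memory with the BigQuery Python client. The project, dataset, table, and column names are invented for illustration; they are not Malt's actual schema.

```python
# Minimal sketch: load the latest daily feature snapshot into memory for
# online serving. Table and column names are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT *
FROM `my_project.features.freelancer_features`
WHERE snapshot_date = (
  SELECT MAX(snapshot_date)
  FROM `my_project.features.freelancer_features`
)
"""

# In-memory lookup keyed by entity id, refreshed once a day after the batch job.
feature_cache = {
    row["freelancer_id"]: dict(row.items())
    for row in client.query(QUERY).result()
}

def get_features(freelancer_id: str) -> dict:
    """Return the latest feature vector for a freelancer, or an empty dict."""
    return feature_cache.get(freelancer_id, {})
```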

Demetrios [00:08:02]: Yeah. Okay. So the serving side felt like, okay, we're all right here. But as far as the training goes, you didn't have the ability to, a, get reliable feature sharing happening, and b, cut down on the excess computation. It wasn't the most optimal way of doing it, because you were computing features in various different spots, not saving them, and then going through the code and computing the features all over again. That makes a lot of sense. Was there a third pain point?

Nicolas Mauti [00:08:41]: Yeah, there is a third one, but it's linked to the two others. Something that is very important, and very much part of the feature store and how you manage your features for training your model, is what we call point-in-time retrieval. When you want to train a model, it depends on the model, but in our case we are using past interactions between freelancers and projects. And if you want to train a reliable model, you know that, for example, one year ago a freelancer said yes to a Java project, or any language you want. But maybe the freelancer has different skills now, maybe he doesn't want to do Java anymore. Or maybe he said no one year ago, but now he has the skills.

Nicolas Mauti [00:09:30]: So you have to get the state of the freelancer, and also of the project, at the time the interaction happened. And it's very difficult to recompute that, especially if you want to do it on the fly. You basically have to do some big joins between tables that hold historical data. It's very complex, so it's very error prone, and the data scientists could make mistakes doing this feature engineering. It's also very expensive, not very efficient, because you have to scan a lot of historical tables to do that.
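
To illustrate why this is painful, here is a rough sketch of what a point-in-time lookup against historized raw tables can look like: for each past interaction, you fetch the entity's state as of the interaction date rather than its current state. All table and column names here are invented for illustration.

```python
# Illustrative point-in-time lookup against historized tables (invented schema):
# join each interaction to the freelancer state that was valid at that date.
POINT_IN_TIME_SQL = """
SELECT
  i.freelancer_id,
  i.project_id,
  i.accepted,               -- the label: did the freelancer say yes?
  f.skills,
  f.daily_rate
FROM `raw.interactions` AS i
JOIN `raw.freelancer_history` AS f
  ON f.freelancer_id = i.freelancer_id
 AND f.valid_from <= i.interaction_date
 AND i.interaction_date < f.valid_to  -- state valid at interaction time
"""
```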

Demetrios [00:10:16]: And so you thought to yourself, there's got to be a better way.

Nicolas Mauti [00:10:21]: Yeah.

Demetrios [00:10:21]: How did you go about figuring out the solution? Because I think the solution that you ultimately chose, which was using BigQuery, and that was all you needed, right, is a little bit counterintuitive. I would imagine most people would start to go and look at feature stores as their first solution, or maybe they're going to try and figure out some combination of Redis and some open source tool, probably like Feast. But why did you say, let's see if we could do this all in BigQuery?

Nicolas Mauti [00:10:57]: So you are right. Redis, I think, is more for the online serving, and as I said, we are doing that in memory, so we don't have this problem. But in fact we thought about it, we checked feature stores and things like that, and we realized that none of them fit all our needs. We had a lot of needs. We didn't talk about this before, but one challenge we also had was monitoring and anomaly detection on the features. And if you compute them on the fly, you cannot do analysis on that.

Nicolas Mauti [00:11:34]: And so, yeah, so it was a.

Demetrios [00:11:36]: But the monitoring, sorry to interrupt: the monitoring on the feature creation, or was it on the data itself? What monitoring are you talking about?

Nicolas Mauti [00:11:48]: Yeah, about the data, and so about the features. For example, if one of your features is the daily rate of the freelancer, you want to assess two things. The first one is more linked to alerting: you just want to be sure that when you train a model, the daily rates are not all null because there is an issue in the backend or I don't know where. You want to be sure that your data looks right for training the model. And also you want to monitor the feature over time, to be sure that your daily rate doesn't increase or decrease unexpectedly.

Nicolas Mauti [00:12:32]: It could happen just due to inflation or things like that, but you want to be alerted about it, you want to know that you have this drift in your data, and maybe retrain your model or check if it's a product problem, maybe just a change on the product side, I don't know. That's why we want to monitor.
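
As a rough illustration of that kind of drift check, the sketch below computes the daily mean of one feature from the historized feature table and flags a large shift against an earlier baseline. The table name, column name, window, and threshold are all illustrative assumptions, not Malt's actual setup.

```python
# Sketch of a simple drift check on one feature from the daily snapshots.
# Names, windows, and thresholds are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()

DRIFT_SQL = """
SELECT snapshot_date, AVG(daily_rate) AS avg_daily_rate
FROM `my_project.features.freelancer_features`
WHERE snapshot_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY snapshot_date
ORDER BY snapshot_date
"""

rows = list(client.query(DRIFT_SQL).result())
baseline_rows = rows[:30]  # first month of the window as a baseline
baseline = sum(r["avg_daily_rate"] for r in baseline_rows) / len(baseline_rows)
latest = rows[-1]["avg_daily_rate"]

if abs(latest - baseline) / baseline > 0.10:  # >10% shift: worth a look
    print(f"Possible drift: mean daily_rate moved from {baseline:.1f} to {latest:.1f}")
```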

Demetrios [00:12:56]: Yeah, because at the end of the day, if you're getting the wrong rates and suggesting to freelancers projects that they are way too expensive for, that's going to create a bad suggestion and ultimately not lead to matching. And I imagine there are a lot of people on the platform, and if they get recommended three different jobs whose rates are way too low for them, that's a problem. If it's too high, I imagine they're not going to care; if it's more than their daily rate, that's going to be awesome, it's like, well, I might as well try. But if it's too low, that's going to create a bad experience.

Nicolas Mauti [00:13:36]: Yeah, for sure. In fact, both ways it's not perfect, because in one direction you won't be able to match, because the freelancer's daily rate is too high for the project. And the other direction is not very good either, because you will contact very qualified freelancers with low-priced projects. So it's not good either.

Demetrios [00:14:09]: And at the end of the day, one side is not going to be happy with it.

Nicolas Mauti [00:14:13]: Yeah, exactly.

Demetrios [00:14:15]: All right, so then you said BigQuery. Walk me through that.

Nicolas Mauti [00:14:20]: Yeah, so, BigQuery. In fact, we checked a lot of products and we thought, okay, maybe we can use something that's already in our stack, because BigQuery was already there for analytics and things like that. So we said, okay, let's just build some feature tables. It will be one table with all the features, and the data science team will have access only to this table. You have one column per feature, and this table is computed on a daily basis. So each day we compute the features for all the freelancers and all the projects and historize this table. Each day we compute all our features, put a timestamp on them, and store them in BigQuery as well.

Nicolas Mauti [00:15:12]: This way, when you train a model on the data science side, you just have to construct your training dataset with your y, what you want to predict, the ID of the freelancer, the ID of the project, and the date of the observation, and then you have just one join to do on this table and you have all the data for all the dates you want. And that's it, to be honest, we are done. And if you want to do online serving, you mentioned Redis, for example, before. If you want to go with Redis, because it's too large for memory, you can just take the computation of the day, push that into Redis, and then when you want to do some live serving in your application, you can just ask Redis and you have the latest data available for each entity.
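
A minimal sketch of that single point-in-time join, assuming a feature table keyed by entity id and snapshot date, might look like the following; the observation table, labels, and all names are illustrative, not Malt's actual schema.

```python
# Sketch: build a training set with one join against the daily feature table.
# The data scientist brings (label, entity id, observation date); the join
# picks the feature snapshot of that same day. Names are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

TRAINING_SET_SQL = """
SELECT
  obs.label,
  f.* EXCEPT (freelancer_id, snapshot_date)
FROM `my_project.ml.observations` AS obs       -- label, freelancer_id, obs_date
JOIN `my_project.features.freelancer_features` AS f
  ON f.freelancer_id = obs.freelancer_id
 AND f.snapshot_date = obs.obs_date            -- point-in-time: same day's snapshot
"""

training_df = client.query(TRAINING_SET_SQL).to_dataframe()
```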

Demetrios [00:16:07]: And so I imagine now you are able to version your features, because, basically, you make these computations once a day, and then you can look back in history and say, what was it on April 17?

Nicolas Mauti [00:16:21]: Yeah, so in fact it's not exactly versioned, it's just that you have a timestamp. And as you historize the value of each feature every day, you can just go back in time. So maybe I didn't understand the question, or your argument, well, but that was it.

Demetrios [00:16:42]: That was exactly it. And so I'm wondering, now that you have this in play, what have you seen happen? You really explained the before to me nicely, and its pain points. What's the after? And where are the pain points now?

Nicolas Mauti [00:17:06]: So what's the after? The after is really good now, because we have these features and it works as I said. The data science team wants to create a model, okay, they have one table, they create their dataset with the y, with the two IDs and the date, and they can just join and get the features they want. It has worked very, very well. But I would say we had a few challenges with that. The first one is what happens when you want to add a new feature, something that's not already part of the feature table. For example, the number of projects that the freelancer has already done on the platform, this could be a feature.

Nicolas Mauti [00:17:50]: So how do you add a feature? And also, how do you add all the history of this feature? Maybe you have to backfill it, right? Yeah, exactly, that's exactly the term, it's backfilling. So for that, we developed a homemade solution with a script. Basically you have two different ways to backfill. In fact, our data warehouse is composed of layers: you have the ingestion layers, then you have some transformation layers, and at the end you have the feature tables. And at the ingestion layer we historize everything.

Nicolas Mauti [00:18:41]: So basically if you want to add a feature, you can just get to these tables and you have all your history and you can do the SQL query you want or scan the data the way you want just to compute your feature for all the dates. If you have the data, for sure, if it's a new feature in the product, you don't have the data. But for a lot of feature, we already have this data in these tables and you can then update your feature table. And so for that we just build a small script where the data scientists just have to input their query and it will automatically place a query on the feature table and fill it. And it's also versioned and you have a changelog and everything. So if you want to know what happened to this feature at this time or when this feature was added or stuff like that, you can go to this changelog or I take the version of the featured tables and is there a process?

Demetrios [00:19:42]: Like do people have to create some kind of pr request in order to get a new feature on that table? Or is it free for all? Anyone can throw up new features.

Nicolas Mauti [00:19:56]: No. Exactly. So the process I talked about before, just before we call it upgrader in your team, basically you have a project on GitLab with upgrader and you have a folder when you can put all your SQL scripts to update your feature table, and then it's played automatically, not by the CI but by scheduler. We are using airflow. We have this scheduler that will play all the SQL files that were not already applied on the table to update it. So for sure, the data scientist team have to write this SQL query, then submit a pair, and when it's validated we can apply it on the feature table.

Demetrios [00:20:44]: Oh, fascinating. And is there a way now for folks to go back and share more easily? Because you mentioned sharing was a bit of a pain point, but it was more that sharing was a pain point because two people were using the same name for something that was computed differently, and it ended up giving them different results after they added that feature to their model. Now it seems like you've conquered the problem of having two different computations of a feature with the same name. But I'm guessing it's a lot easier because people can just reference the BigQuery table and see, okay, this feature seems like what I want, I'm going to add it to my model.

Nicolas Mauti [00:21:33]: Yeah, exactly. And maybe it's a teaser, it's not live yet and we're working on it, but I would like to develop something to help the data science team find the right features. In BigQuery, each column is a feature, and you can add a description for each column. And, you know, with an LLM or something like that, you could maybe say, okay, I want to train this model, return the features that could be interesting for this model. Or just a basic search could be useful, to search the features in plain text, or even something like keyword matching, just to make discovery easier. It could be a good feature for the data science team, because sometimes they don't use the same term for the same feature either.

Demetrios [00:22:32]: Yeah, 100%. And that feature discovery is huge because if you're going through and getting inspiration from what others are using in their models, what types of features they're using, then that can hopefully translate into you creating a better model.

Nicolas Mauti [00:22:50]: Yeah, exactly. And it's also a point we discussed before, but for me it is very linked to the monitoring, because when you have your feature, you want to know how it behaves. Do I have a lot of nulls in this feature? What is the shape of the feature and of the data behind it? For that, we also leverage a tool that we already had, Grafana. Basically Grafana is plugged into BigQuery, and in Grafana you can just explore each feature and it will give you some descriptive statistics about it: the mean, the average, and if it's a categorical feature you have the count by category and things like that.
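
The kind of descriptive-statistics query a Grafana panel could sit on top of might look like this sketch: null rate, mean, and quartiles for one numeric feature on the latest snapshot. The schema and feature name are illustrative, not Malt's actual dashboards.

```python
# Sketch of a per-feature statistics query for a monitoring dashboard.
# Table and column names are illustrative assumptions.
STATS_SQL = """
SELECT
  COUNT(*)                                AS n_rows,
  COUNTIF(daily_rate IS NULL) / COUNT(*)  AS null_rate,
  AVG(daily_rate)                         AS mean_daily_rate,
  APPROX_QUANTILES(daily_rate, 4)         AS quartiles
FROM `my_project.features.freelancer_features`
WHERE snapshot_date = CURRENT_DATE()
"""
```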

Demetrios [00:23:44]: But it's not giving you the usage of the features, is it? Do you have any type of lineage, like these ten models are using these features, or this feature is used the most, by 50 different models?

Nicolas Mauti [00:23:59]: No, that's a good point. We don't have that right now.

Demetrios [00:24:02]: No, I get the feeling that is probably kind of hard to implement, because you're looking at the whole data lineage as opposed to just the features themselves, right? You can't do that in BigQuery. You can't really know who's pulling which features, I guess.

Nicolas Mauti [00:24:21]: Sure. So, yes, in fact, we have some data lineage tools that tell us, okay, this column is popular, but we cannot know that it's popular and used by this model in particular. We just know that, okay, this column is used, but we don't know if it's used by this model or that model.

Demetrios [00:24:44]: Huh. Fascinating. Yeah. Because I just know I've heard so many horror stories and I imagine you deal with this day in, day out if one of these columns goes haywire and like you were saying, you're monitoring it and you're seeing, wow, this column is now not working as we had anticipated it to be working. Something's going wrong here. What downstream effects does that have? Which models is that actually affecting? So you can know potentially you've got to retrain those models. You've got to roll back, you've got to do something before the models that are out in the wild feel that effect.

Nicolas Mauti [00:25:27]: Exactly. So we don't have this right now, not at the column level; we have it at the table level. But it's not very interesting in this case, because all the models use the same table, so we have the dependencies anyway. And we talked about alerting before; we also have automatic alerts. For that, we leverage a tool that we already had, Great Expectations.

Demetrios [00:26:02]: Yeah, you might have just been about to say this. What exactly is it doing? Is it monitoring the data that's flowing into BigQuery?

Nicolas Mauti [00:26:14]: So in fact, we are calculating the features each day for all our entities. Entities are freelancers or projects, that's the generic name for them. And when we have this table, we run some tests on it. Great Expectations tests are like unit tests on tables. If one of the tests fails, it will just stop the whole process and we won't train any model after that, and we will get a Slack alert saying, okay, this column doesn't match your expectation. For example, this column is all null, or you have a new value in a categorical column.

Nicolas Mauti [00:26:53]: Your categorical column must be this or this, and you have something else, so it will block, and we are alerted about it.
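
As a rough sketch of what those checks can look like with Great Expectations' classic pandas API (the exact API differs between versions, and the column names and allowed values below are invented), the feature snapshot is pulled into a dataframe, a few expectations are declared, and the pipeline stops if validation fails:

```python
# Sketch of "unit tests on the feature table" with Great Expectations.
# Classic pandas API; column names and allowed values are illustrative.
import great_expectations as ge
from google.cloud import bigquery

client = bigquery.Client()
df = client.query(
    "SELECT * FROM `my_project.features.freelancer_features` "
    "WHERE snapshot_date = CURRENT_DATE()"
).to_dataframe()

gdf = ge.from_pandas(df)
gdf.expect_column_values_to_not_be_null("daily_rate")
gdf.expect_column_values_to_be_between("daily_rate", min_value=0)
gdf.expect_column_values_to_be_in_set("category", ["tech", "design", "marketing"])

results = gdf.validate()
if not results.success:
    # Stopping here means no model is trained on bad features; in practice
    # this is also where a Slack alert would be sent.
    raise RuntimeError("Feature table failed validation; stopping the pipeline")
```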

Demetrios [00:27:01]: Yeah, it's like this just came in at 14 and usually it's between zero and one. What's going on here? It's way out of distribution. You might want to go check it out.

Nicolas Mauti [00:27:12]: Yeah, exactly.

Demetrios [00:27:14]: But again, I guess the interesting piece is that you're just monitoring the specific columns with Great Expectations, right? You're not actually monitoring the data that is coming in and going through the ingestion and then your transformations and whatnot.

Nicolas Mauti [00:27:32]: We also have a lot of monitoring about that.

Demetrios [00:27:35]: Okay.

Nicolas Mauti [00:27:36]: You know, it's more data engineering work than MLOps work, but yeah, they have some tools. They also have Great Expectations. In fact, you have Great Expectations all along the way, to detect issues as soon as possible, because then it's easier to fix and it won't impact a lot of downstream jobs or tables. They also have some scripts and some tables just to assess that the data in one table is coherent, in phase, with the data in another table, things like that. So they have some tooling along the whole chain, but in our case, for the feature engineering, we are using Great Expectations.

Demetrios [00:28:28]: Talking about Great Expectations makes me realize that you're probably doing something along the lines of continuously retraining models. Am I right in assuming that?

Nicolas Mauti [00:28:42]: Yeah, exactly.

Demetrios [00:28:44]: Can you break down how that works?

Nicolas Mauti [00:28:48]: It depends on the model, to be honest. It depends on the generation of the model and a lot of things. If it's a recent model, we prefer to train it by hand the first few times, just to be sure that everything looks good. But otherwise, in most cases, we have a monthly training of our models. As I said before, we have some checks on the features, and if everything looks good, we just train a new model. And if the metrics of the model look good, we deploy it automatically.

Demetrios [00:29:31]: And is this all happening through GitLab?

Nicolas Mauti [00:29:34]: No, everything is done in Airflow.

Demetrios [00:29:38]: Airflow.

Nicolas Mauti [00:29:39]: We're using Airflow for the scheduling of all these tasks, and also for what we discussed before: the ingestion, then all the layers of our data warehouse, and then the feature tables. It's all scheduled in Airflow.

Demetrios [00:29:56]: And is it just on a time basis?

Nicolas Mauti [00:29:59]: Yeah, it's on a time basis. Airflow uses DAGs, directed acyclic graphs, to run the tasks. So we have some daily and some monthly DAGs, but if we want, we can just trigger one manually. So if we detect an issue with the model, or we want to rerun a task, or we want to fix something in a table, we can just clear the task and it will rerun and fix the whole thing.
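
A minimal sketch of that scheduling, assuming Airflow 2.x, could look like the following: a daily DAG that recomputes and validates the feature table, and a monthly DAG that retrains and conditionally deploys. DAG ids and callables are illustrative; the real pipelines are more involved.

```python
# Sketch of daily feature computation and monthly retraining in Airflow 2.x.
# DAG ids, schedules, and callables are illustrative assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def compute_features():        # recompute the daily feature snapshot in BigQuery
    ...

def validate_features():       # run the Great Expectations checks; raise on failure
    ...

def train_and_maybe_deploy():  # train the model, deploy only if metrics look good
    ...

with DAG(
    dag_id="daily_features",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as daily_dag:
    features = PythonOperator(task_id="compute_features", python_callable=compute_features)
    checks = PythonOperator(task_id="validate_features", python_callable=validate_features)
    features >> checks

with DAG(
    dag_id="monthly_retraining",
    schedule_interval="@monthly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as monthly_dag:
    PythonOperator(task_id="train_and_deploy", python_callable=train_and_maybe_deploy)
```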

Demetrios [00:30:29]: Because I'm wondering, if you recognize that there's some kind of model misbehaving out in the wild, do you have the capability, a, to recognize that a model is misbehaving, and b, to just say, all right, well, let's go retrain it and figure out if that solves the problem?

Nicolas Mauti [00:30:49]: Yeah, yeah. We have the possibility to start a training manually and just release a new model, it's completely possible. About detecting bad behavior of the model, we also have different kinds of monitoring. In fact, when you monitor a model, I think you have multiple ways to monitor it. We have business metrics, for example around the matching.

Nicolas Mauti [00:31:31]: Or the average conversion. So if we see that the conversion is dropping, maybe it's a problem. I don't like this metric, because when you learn that your model does not work correctly from the business, it's a little bit too late. I prefer to detect it earlier, because once it hits the business it's not very good.

Demetrios [00:32:04]: Yeah, there are some very unhappy people talking to you or hitting you up on Slack.

Nicolas Mauti [00:32:10]: Exactly, and I can understand that. So that's why we also have some earlier monitoring. We just log everything we return from the model, and you can, for example, check the score and see that it changed over time, this kind of stuff. With that we try to detect some issues earlier, and that's also why we added some monitoring on the features, because it's much earlier, your model is not even trained yet. And so if we are able to detect an issue in the features, we can avoid issues in the output of the model and in the business.

Demetrios [00:33:04]: Are you the one that's responsible if some metric like latency all of a sudden shoots up? You're serving the model and, out of nowhere, it takes one second to serve the model.

Nicolas Mauti [00:33:18]: Yeah, it's also part of my job, and it's another kind of monitoring, I would call it platform monitoring. We are monitoring CPU and RAM usage, and also latency. If there is an issue with that, I can discuss with the data science team how to solve it. So yeah, it's also part of the job. And sometimes with the latency we can detect other issues. Sometimes behind a latency issue you can find a quality issue: for example, you return too many freelancers, and that's why the latency increases, because your pool is very large. So maybe it's a problem with your model, but you detect it through latency. Sometimes you have this kind of effect, for sure.

Demetrios [00:34:11]: Yeah, that's a great point, that when you go to the root cause of problems, it's not as simple as if it was just a software problem, because you have to think about, well, there might be a reason this is happening that is inherent in the model that just got pushed out, or just got updated, or is going haywire for some reason. So you have to look at it from that angle. And that's why it makes sense that you would be in charge of that, as opposed to someone else that isn't as well versed in the ML side of things. I heard it put one time, on the monitoring side, my friend Shubi told me this even four years ago. He said, I like to think about monitoring on three different levels. One is the actual accuracy of the model. Another is the data.

Demetrios [00:35:14]: So we're monitoring the data, and that seems like what you're doing with the features and what's happening with them, with Great Expectations and Grafana on those columns. And then he said, the systems level, so the latency and all that. So you have those three different ways of monitoring that model that is out there making predictions, because it can go wrong on any one of those vectors and it can screw everything up.

Nicolas Mauti [00:35:42]: Yeah, totally agree. I would also add the business level, I think it's important. You have a lot of factors that can impact the business, and your model could be one of these factors. Even if it's not very precise and not directly linked to your model, I think it's very important to put this in place, because in the end, that's the goal of the model. So to be honest, even if the score of the model looks weird, if the business increases and it's okay, maybe it's not a very big problem. You have to balance.

Nicolas Mauti [00:36:23]: But in the end, you are evaluated on the business, not on the accuracy of your model, you know.

Demetrios [00:36:31]: And how do you separate in your head the accuracy versus the business?

Nicolas Mauti [00:36:39]: So I think it's not about separating the accuracy and the business. It's more, okay, if the business goes wrong, let's look at the accuracy. And if you start to see something in the accuracy that seems wrong, okay, let's dive into the model, the issue should be there. If the business is wrong but the accuracy is very good, maybe we can start to look elsewhere. Maybe it's the model and you can also investigate that, but maybe it's elsewhere. So it's more about getting a lot of different metrics at different stages to debug more efficiently and be able to spot where the issue is.

Demetrios [00:37:28]: Yeah. It's such a funny thing that you say, potentially business metrics are going up while all the other metrics are going down. Even though that feels like a very rare occurrence, in that case, don't touch anything, just let it run. Right? Yeah.

Nicolas Mauti [00:37:47]: Or maybe just investigate, because maybe you can do better if your model is better. And maybe the business could be better, because, as I said, there are also other factors for the business. Maybe it's just the sales team that works pretty well, or a lot of other factors. So I didn't say don't touch anything; it's more, okay, let's also have the business metric to say, okay, maybe my model is not as good as I think it is, but it works, so maybe it's okay. It can help you prioritize your work and know whether you have to investigate the model a lot or not, depending on the business too. So I think...

Nicolas Mauti [00:38:37]: It's not business versus the accuracy of the model, of the output of the model. It's more, okay, let's get a lot of metrics from different spots and just cross them to check where the problem could be. Yeah.

Demetrios [00:38:54]: Looking at it more holistically and then diving in where you see things that don't feel right. Have you done a retro now that you're using BigQuery as your feature store? Basically looking at it and saying, we've been able to alleviate all these different pain points, but I am assuming that BigQuery comes with a cost. So you recognize, we're paying this much more in hard cost that we can see from BigQuery, but before we were paying this much in people time, when they had to go and recreate a feature because they recognized it wasn't the feature they wanted. It feels like there were a lot of fuzzy costs that you really couldn't account for before, but now that you have BigQuery, it's very clear what you're paying.

Nicolas Mauti [00:39:58]: Yeah, exactly. Also, we pay only once a day, when we compute the features for all the entities. Whereas before, since the feature engineering was done when we were experimenting or training a model, or testing things locally, we were paying at each execution of the training and of the feature computation. Now we pay once a day, and then the data science team can access the features whenever they want at nearly no cost; you just pay once to fetch the features, and that's pretty okay. Also, something that is not really about cost but is very interesting, is that now the data science team is better able to split their work between feature engineering and model training. When they want to train a model, they can say, okay, I will need these features. They can start by implementing all these features in the feature table we have, and then they train the model, and they don't mix both.

Nicolas Mauti [00:41:14]: And I think it's more clear in their head when they are working about, okay, no, I'm optimizing the model and no, I'm doing feature engineering because I think it's, and I did some data science before and I trained some model in the past, and it's very, for me, it's pretty different job, not different job, but it's very different tasks, so you don't have to think about the same thing. And so I think it's much easier for them not to split this work and to have on one side the feature engineering work, and maybe they will work for two weeks on that, and after that they could start to work on the training of the model. And just think about that, about the parameter of the model, about the structure of the model and stuff like that. And so I think it's better for them also to this one.

Demetrios [00:42:09]: It's funny, because we had on here the creator of Feathr, which is a feature store that got open sourced by LinkedIn probably two years ago. And one of the things he said was that the biggest boon of having a feature store was the fact that you could decouple the code from features and feature generation. And that's exactly what I'm hearing you say: you're in such different headspaces when you're thinking about what kind of features do I want and how am I going to create those features, versus what's the model doing, coding the model overall and trying to figure out how to make the best model you can. That being said, do you know the hard numbers on how much money you saved? Because now you're not computing features ad hoc in five different places when a data scientist is training a model, or when five different data scientists are training models; you just compute them once a day, so you have a clear cost of what it is. Did you go back and say, what were we spending and what are we spending now? And now we can give a talk at the next FinOps conference.

Nicolas Mauti [00:43:25]: I don't have very precise numbers, to be honest. Also because it was a long project, we started to work on it one year ago. And the team changed, and there are now more people in the team using BigQuery more, so for sure the cost increased. But when we started the project, I calculated that the feature computation was several hundred bucks per month, something like this.

Nicolas Mauti [00:43:59]: But I think it depends, it's not applicable for other company. Depends about the volume of data that you have. So we train bigger model. No. So it's not very comparable.

Nicolas Mauti [00:44:35]: And so, yeah, well let's talk about.

Demetrios [00:44:37]: That, because when do you think this architecture or this style of doing it would fall over? So someone out there is saying, oh, maybe I'm gonna try this and just make bigquery my feature store. Where would you recommend to not take this approach?

Nicolas Mauti [00:44:56]: I would say maybe our biggest challenge would be if we wanted to do some live featuring, live computation of the features.

Demetrios [00:45:07]: The features, like on the fly, in flight, basically with some flink, et cetera.

Nicolas Mauti [00:45:15]: Yeah, exactly, because I think BigQuery is not very good for adding data into it on the fly. We are working in batch, and I think for that it works pretty well. But if you want to go with a fully on-the-fly pipeline, maybe it's not the best way to go. And also, as I said, BigQuery won't be enough, it's not all you need in some cases. We talked a lot about training the model and offline serving, but if you want to do online serving, in our case we are just grabbing this data from BigQuery and putting it into memory. If your data is too big to fit in memory, you will have to use another database, like Redis, like you said before, or this kind of low-latency database. So for sure, for online serving, if it doesn't fit in memory, you will need another database. And also, if you want to have very up-to-date data and you cannot do a daily computation of the features, maybe BigQuery is not what you need for that.
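
For the Redis fallback mentioned here, a minimal sketch could push the latest daily snapshot into Redis keyed by entity id, so online serving reads the freshest features with low latency. The key format, host, and schema are illustrative assumptions.

```python
# Sketch: push the latest daily feature snapshot from BigQuery into Redis
# for low-latency online serving. Names and key format are illustrative.
import json

import redis
from google.cloud import bigquery

bq = bigquery.Client()
r = redis.Redis(host="localhost", port=6379)

SNAPSHOT_SQL = """
SELECT *
FROM `my_project.features.freelancer_features`
WHERE snapshot_date = CURRENT_DATE()
"""

for row in bq.query(SNAPSHOT_SQL).result():
    key = f"features:freelancer:{row['freelancer_id']}"
    r.set(key, json.dumps(dict(row.items()), default=str))

# At serving time:
# features = json.loads(r.get(f"features:freelancer:{freelancer_id}"))
```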

Demetrios [00:46:36]: Yeah, that seems like a very respectable and unbiased answer, showing where it can fall over. So really, if you have a use case that needs real-time or very fast, in-flight computation of features, then you have to look at a different style of architecture. But if you're going with batch, this seems like a really nice way to leverage something that's already there; I imagine most people have some kind of BigQuery-type database in their stack. And so this could be that first step, until you get to a place where your data is too big, it doesn't fit in memory, and you have to re-architect. Have you thought about that forward compatibility? Once you do hit a stage where you need to change things around, where are you going next?

Nicolas Mauti [00:47:44]: About the whole feature engineering process, or if you want to change a feature?

Demetrios [00:47:50]: About the architecture, when you want to evolve it because you have new requirements. A, what do you think those requirements will be in the future? And b, how do you want to evolve?

Nicolas Mauti [00:48:05]: Yeah, I think it's the same answer as before. I think our two main challenges will be, first, the size of the data, because for sure at some point it won't fit in memory anymore, and for that we will have to benchmark some low-latency databases and check how we can load this data and fetch it at serving time. That's one of the challenges we have. The second one will be live featuring. For that we already have some Kafka ingestion, and we know we could leverage that to do the computation of the features as well. But my main question is, do we output this computation directly into BigQuery and the feature table? Then we would have to do a lot of insertions into BigQuery, as I said before.

Nicolas Mauti [00:49:11]: And yeah, so we will manage this data or maybe we will do micro batch or. I don't have the answer right now, but for sure it's our future challenges.

Demetrios [00:49:24]: Yeah, you know it's coming, but at this point in time, it sounds like you don't really need to focus on it too much. And who knows, by the time you cross that bridge there might be a tool out there that serves your need perfectly. Yeah.

Demetrios [00:50:39]: Well, thanks, dude. This is awesome that you were so transparent with me and that you taught me a ton about how to just leverage what you already have and make use of what's in your stack to the maximum capabilities. Yeah, sure.

Nicolas Mauti [00:50:57]: Yeah, sure, it was a pleasure. And I'm happy if it was interesting for you, and if you think it will be interesting for the community, we'll see about that.

Demetrios [00:51:06]: We'll let them give us some feedback. If anybody out there is still listening at the end of the episode, drop some comments and let us know what you thought. That's all for today.

