
Global Feature Store: Optimizing Locally and Scaling Globally at Delivery Hero

Posted Sep 24, 2024
# Global Feature Store
# MLOps Practices
# Delivery Hero
SPEAKERS
Sai Bharath Gottam
Senior Machine Learning Engineer @ Delivery Hero

With a passion for translating complex technical concepts into practical solutions, Sai excels at making intricate topics accessible and engaging. As a Senior Machine Learning Engineer at Delivery Hero, Sai works on cutting-edge machine learning platforms that guarantee seamless delivery experiences. Always eager to share insights and innovations, Sai is committed to making technology understandable and enjoyable for all.

Cole Bailey
ML Platform Engineering Manager @ Delivery Hero

Bridging data science and production-grade software engineering

Stephen Batifol
Developer Advocate @ Zilliz

From Android developer to Data Scientist to Machine Learning Engineer, Stephen has a wealth of software engineering experience. He believes that machine learning has a lot to learn from software engineering best practices and spends his time making ML deployments simple for other engineers. Stephen is also a founding member and organizer of the MLOps.community Meetups in Berlin.

SUMMARY

Delivery Hero innovates locally within each department to develop the most effective MLOps practices in that particular context. We also discuss our efforts to reduce redundancy and inefficiency across the company. Hear about our experiences creating multiple micro feature stores within our departments, and our goal to unify these into a Global Feature Store that is more powerful when combined.

TRANSCRIPT

Gottam Sai Bharath [00:00:00]: Yeah, I'm Gottam, working as a senior machine learning engineer at Delivery Hero. I used to drink coffee a lot in the mornings, but not anymore. That's a long story, but now I just drink a nice tea, which still helps me a bit to kick off my day.

Cole Bailey [00:00:21]: Yeah, hi. I don't drink coffee either, never really got into caffeine. But what I've been drinking recently is these kind of cold little yogurt drinks, with a nice boost in the morning. But yeah, I'm Cole. I'm an engineering manager on the logistics ML platform team at Delivery Hero.

Stephen Batifol [00:00:41]: Hello, MLOps community. We're back for another podcast. I'm your host, Stephen. Today it's not Demetrios; I was talking to Gottam and Cole, who both work at Delivery Hero. For the people that don't know, Delivery Hero started as a food delivery company, and now they operate in 70-plus countries with their own brands like Foodora and Foodpanda. Today we talked about feature stores and how to build one, but we also talked about the balance between centralized and decentralized solutions. Delivery Hero (DH) is a very interesting company, as they've also acquired a lot of different companies, so it's a very interesting use case.

Stephen Batifol [00:01:22]: We also talked about the end of free money: before, when you had a problem, you could just hire engineers, and nowadays it's a bit different. So we talked about that and the impact it has as well. We also talked about when ML platform teams don't work; sometimes having a global ML platform team might actually not work, and it might be better to start from the bottom, go up, and showcase that your ideas are working. So yes, it was an amazing episode, and as always, if you like this episode, feel free to share it with one friend so they can learn how to work on feature stores and what the end of free money means in the tech world. Thank you. So, maybe you can tell me: both of you have been at Delivery Hero for a while, right? So how is it? How has it changed since you joined, basically?

Gottam Sai Bharath [00:02:17]: Yeah, that's quite a lot. I can start, because I think I went through so many changes, and some of them are good. I've been here almost three years now, at Delivery Hero in Berlin, and I come from a machine learning platform background as well. When I moved to Delivery Hero it was a similar kind of role, but the team was sitting in Berlin and we were catering to the whole of Delivery Hero. As you might already know, we have many different companies all across the world, and we were supposed to be this global machine learning platform team who would be building frameworks and tooling for the whole company. And it actually was going quite well. There were a couple of us in the team, and we had this grand vision of building so many services, centralizing everything, and saving costs. That's how the platform thing started.

Gottam Sai Bharath [00:03:24]: And it was going well for a while. We were using GCP Vertex AI, which was also kind of new at the time, and Vertex Pipelines, aka Kubeflow, which was easy for us to build on. Data scientists could just add some steps and create pipelines; it was easy for them, not so easy on the engineering side, but anyway, it was working. We catered to the fraud team and more science teams, and we were also used by all the other central teams. Then I think there was a shift in the momentum, because being central comes with some challenges. As I said, it's a huge company with so many teams, and if there's an X team which wants one feature, and then a Y team which wants another different feature, we can't really prioritize easily which one goes first.

Gottam Sai Bharath [00:04:25]: And this model is kind of a tough thing to manage. Maybe we'll talk about it a lot more as we get into the actual topic of the day itself.

Stephen Batifol [00:04:34]: Yeah, we will for sure.

Gottam Sai Bharath [00:04:36]: So that didn't really work out, as you can expect. We've since tried to adopt a new model: right now we're trying to have individual local teams identify problems and develop solutions locally, and then build something unified and global which all the teams can use, which I guess is the topic of the day.

Gottam Sai Bharath [00:04:57]: Yeah, so since then, after our global MLP team, I'm now working as a machine learning platform engineer as well, but in a specific brand. As Cole mentioned, he's from logistics, I'm from Pandora; it's basically just how you have different orgs in a huge company. And yeah, that's a very condensed version of how the journey went for me.

Stephen Batifol [00:05:21]: Well, that's quite a journey, though, I've got to say. I remember chatting with you, I think one or two years ago, when you were part of this global team and you had this grand vision; everything was beautiful, everything was nice. But then also, sometimes it felt like you would work on something and then another team would basically do the same. And the topic of today is mostly the global feature store that you have. So you had this grand vision, a global team and everything, but then how are you doing something global again while being in Pandora? Maybe you can explain what Pandora is as well, because logistics might be clear to people, but Pandora, I'm not sure. So maybe you can start by explaining what Pandora is, and then how you're actually going to try to do a global feature store again.

Gottam Sai Bharath [00:06:20]: Sure, I'll maybe just give some context, and then Cole can take over about how everything went, how it came to life, and how the whole planning went. So Pandora is again a mini organization inside DH. We have multiple markets in APAC, Europe and of course Turkey, so we are catering to almost 18 countries, and anything related to these 18 countries' product and tech belongs to this organization called Pandora. It's just the name we are called, and we are the machine learning team specific to Pandora.

Gottam Sai Bharath [00:07:00]: Basically whatever we do is for these customers. And there are a lot of teams, and fortunately only one machine learning platform team, at least for now. Let's see if that changes or not, I don't know. For now it is us, and we basically cater to the models for personalization and recommendations. So whatever you see on the foodpanda app, so foodpanda or MXFP or Foodora, which I think is in Europe, the home screen, the product: anything to do with personalization or recommendations is coming from our team.

Gottam Sai Bharath [00:07:44]: So that's the specialization we do. And of course there are a lot of other things, like analytics, and the usual app developers as well. All of these come under Pandora tech. We also have Pandora product, which is all of our product managers, and Pandora tech, which is all of our tech. So this is our team structure. And the feature store was one of our offerings as well, but it was just for Pandora, as you can expect.

Gottam Sai Bhrath [00:08:15]: And I think that's one of when the whole thing was happening we wanted to make this kind of a centralized service, but we don't want to repeat the same mistake that happened. So that's how this idea also came to life, that the old model didn't actually work out. So we wanted to have some kind of a new model which will actually make this feature store globalized. So the lot of teams can also use it, and especially because a lot of teams are already kind of either building it or they actually want to use it, and they don't have enough people to actually build it. So it was like a perfect use case for us to just start and see if it makes sense or if we can actually make it work, because, you know, as I said, so the global is kind of tough to make it work. So, yeah, that's how the idea started. And then we start approaching all of our teams, all the other teams. We had, like, a huge thing about the vision and vision, because it's really important where we are headed.

Gottam Sai Bharath [00:09:14]: So that's how it started. And Cole can add more about what went into the planning and how we got together.

Cole Bailey [00:09:23]: Yeah. So maybe before getting too much into the global feature store, I can add a bit of context. I'm in logistics. I've been in logistics, but logistics has changed a lot over the time I've been here. And I guess we're also one of the reasons why the global grand visions didn't work out, because we have a very strong machine learning organization internally, with our own ways of working and opinions, and we were not very eager to use Vertex AI, for example. And we're one of many, right? It's not just us. So we see these kinds of micro ML platforms in different areas. And logistics, like Gottam was saying: Pandora caters to, you said 18 countries, right? So kind of more regionally specific.

Cole Bailey [00:10:04]: Logistics caters to 15, soon to be 70 countries, but only for the logistics problems.

Stephen Batifol [00:10:09]: Right.

Cole Bailey [00:10:09]: So we don't do any recommendations or personalization like Gottam does; that's all distributed elsewhere. But we do things like: when is your food going to arrive? How do we staff riders? How do we price delivery fees? The core of logistics, to make the fleets more efficient. So when it comes to the feature store, as Gottam was saying, we wanted to do things a bit differently. What naturally happened was, for example, in logistics we had invested a lot into what we call a live feature store, a real-time feature store where we have Flink pipelines that aggregate streams and serve those aggregations back to the models. Whereas what Pandora had built was a batch feature store, where you go to the data warehouse, you have a SQL query, you run that once a day or maybe once every few hours, upload that to your production caches, and serve that to the models. It was a natural complement. We wanted batch features but never found the time to build them.

Cole Bailey [00:11:00]: They also wanted live features, but never found the time to build it. So we're like, okay, how do we sit together and like combine these things? And I think the new kind of buzzword that's going around is inner sourcing. So in the past, in the hypergrowth stage, whenever we saw this logistics was just say, okay, we're going to hire a new team, we're going to build our own batch feature store, solve it with headcount. Now that doesn't work as well. The economy is what it is now. Inner sourcing is the hype. I think global feature store is one of the first examples where we're trying to really do this inner sourcing at a bigger scale, where for example, Pandora owns the batch feature store implementation, they do some work to make it scalable so that other verticals can also install it in their own info and then can also maintain it and operate it in their own departments and maybe contribute back. So if we find for our batch features, we need this kind of small tweak to what they've done, we can contribute that back and then everyone can benefit.

Cole Bailey [00:11:58]: That's the idea. So I think global features store, basically all the departments came together. It was a room of like twelve, I think departments with like seven different feature store implementations.

Stephen Batifol [00:12:08]: Wow.

Cole Bailey [00:12:09]: And we spent three days just saying, okay, what have you built? What have you built? What have you built? How could we somehow bring this together? So that was how it all really started coming together. It's still work in progress, of course.

Stephen Batifol [00:12:20]: Okay, but that's very interesting. I mean, DH is, I guess it's a bit of a different company as like the most usual tech companies, where you also have like actually different companies inside because they got acquired. And then, so then I guess you had like different tech as well. So like it's also like how on your end, like how do you make it work? You know, when like people have some very strong opinions about, no, I want to use this tech. We've always used this one. And then the others are like, no, we have to use that. You know, it's like, how do you balance that? Usually at the age.

Cole Bailey [00:12:57]: I think the real answer is you balance it at a local level. Right? So in logistics we have our opinions and we stick to them, and then other people can have other opinions and that's fine, right. But then when you look at these global feature store, like inter sourcing initiatives, then all these opinions have to be somehow resolved. It really depends, right. There's some kind of high level opinions that are more well established across departments and verticals, but then when you get into the nitty gritty details of mlops, then it really gets impossible to define every single design decision for every single department. Right. Because our needs are so different, actually. So for future store, I think what we ended up doing, right, was basically bringing all of these representatives from different departments together, having a multi day workshop to discuss and align on the most important details and then kind of nominating a small committee of people who would represent and continue making sure we're deciding on these things when we need to and try to get some POCs and MVP's working that actually scale across multiple departments.

Stephen Batifol [00:13:58]: Yeah, that's quite interesting. It really sounds like, I mean, not like to compare it to politics. You have people coming and they all gather and meet and then they talk and then you do other things. I find it quite interesting, to be honest. Like, it sounds like also it's really like a very bottom up approach, you know, like actually, you know, like someone would be like, okay, we need that. Then you build it and then actually you show that it works and then maybe other teams want to join in. It's more like, I guess before maybe, tell me if I'm wrong, but before was more like, you know, from the top to the bottom, you know, like you had like the central team, which was the one, you know, to be like, okay, we are central team. Hey, everyone has to do that.

Stephen Batifol [00:14:42]: Whereas now it sounds like more like, okay, we have this need. We're going to build it. Look, it works. Then you showcase it to other people and then they actually, oh, okay, it's cool. Is it like how it works now?

Gottam Sai Bharath [00:14:52]: I would say kinda; it's somewhere in the middle. Before, we had a lot of teams, and it was basically about figuring out what features were needed. As I said, it was kind of a pilot program, so we didn't really have so many features, but the main requirement was: we need to deploy models. That was the huge need at the time, because the teams we were catering to had no engineers, which made it easier for us, because we were the only engineers taking any decisions. The other teams were just data scientists who wanted to deploy their models, which they were not able to do before we came along. All they wanted was to deploy.

Gottam Sai Bhrath [00:15:39]: And what we had to do, we just did it. And as I said, so we were using GCP vertex pipelines, which is essentially Q flow, and just trying to figure out how to make this easier for data scientists. It was not, but then that was easiest. That could have been done. And, yeah, as I said, so the things kind of changed. And so we were still in hypergrowth at the time. This was back in 2021 and 22, and we were still, like, in this hyper growth. And then the teams were like, okay, then we got started with all these requests, and now we can't really do or prioritize one team or another.

Gottam Sai Bhrath [00:16:15]: It's kind of an uphill battle. So the teams just decided they just add an engineer and that's the easiest solution than just to wait until the platform team delivers the feature. Kind of makes sense. I understand that. Okay. Not everything can be carried, especially when we start from the scratch, and every little team has their own very little specific features. So, yeah, that's how it actually went haywire. Like, every team had starting to have their own engineers and they had their own opinion, so we couldn't come into an agreement with the other engineers.

Gottam Sai Bharath [00:16:49]: Now we are in different teams, right? As a platform team, we view things a little differently, because we have to consider all the teams, versus a specific engineering team that just wants to solve its own problem, and that's it, they don't care about anything else. All of this led to where we are right now. Take the global feature store as an example: if you think about it, it's a similar problem on a larger scale, because before it was only multiple teams.

Gottam Sai Bharath [00:17:17]: And when we started this whole discussion about feature stores, it was multiple departments across the whole company. And yeah, there were a lot of...

Stephen Batifol [00:17:24]: Again, it's impossible for everyone to agree anyway.

Gottam Sai Bhrath [00:17:29]: Yeah. So it was like I was personally scared inside that, okay. I mean, it's, it's a really amazing thing to build this and, you know, what if the same problems might happen, you know, but fortunately not, because this time there was no hyper growth. So we kind of are sure there's not going to be just a fallback that. Okay. I mean, if this doesn't work out, there's a fallback. So, no, so we have to do it, especially because there's so many benefits that we can reap. And also there's something already there.

Gottam Sai Bhrath [00:17:58]: It's all about us coming to a room, talking about how we can build this tool so that all the teams can use it, but we have to.

Stephen Batifol [00:18:06]: Actually find a solution. Right. There's no more hiring more people. So, yeah, yeah.

Gottam Sai Bhrath [00:18:10]: I mean, which makes sense. I mean, yeah, that's how it was able to come to this level. Right. And it's also not about just get everything together, pieces from Pandora, pieces from logistics, and just stitch it and, you know, just create a monster like a Frankenstein. No. So it has to be a solution. If not, I don't think all of our engineers would be agreeing to even start this. And fortunately, after long, long rfcs and discussions, right now, there is a state where I think everyone accepts and actually really goods really, really good, where it's not like the logistics has the streaming and the like, and we have the batch, and then there's something coming from the global.

Gottam Sai Bhrath [00:18:49]: So we are all trying to match this and stitch this together in a really clean way. And of course, there are some compromises and decisions which are really tough and few will take time to build. But then if you see it in the longer term, and if you see there's one global solution being used by all the company, and also the local teams have some, if they want to develop some small feature relating to them, it's in the, which makes it even more impressive that, okay, if I want to build something for Pandora, I can do it and all. Okay. I need to make sure it's gonna not gonna break. And we do have a model about how all these reviews go somewhere. Again, a huge discussion about how we must. Yeah.

Gottam Sai Bhrath [00:19:30]: So we'll review them. Who's gonna take this decision? If there's something big that's gonna change?

Stephen Batifol [00:19:35]: Who's the owner of it?

Gottam Sai Bhrath [00:19:36]: Yeah, yeah, I think this, yeah, I'm not sure. Cole mentioned they had this was this workshop for like, what, three days? Three whole days. Just to decide all this, just how many feature stores are there, what we need to build and how we'll be building. That's it. That's whole three days. And, yeah, I think it was really fruitful from all those discussions. I think, yeah, a lot of things came out, especially the whole vision and mission statement. So it's not about the short term hull, but the longer term.

Gottam Sai Bharath [00:20:04]: And, yeah, we'll see.

Cole Bailey [00:20:07]: Yeah. One interesting tidbit from those, those three, four days where you bring so many different people together was like, I found very interesting just discussions where after one full day of discussions, we were like, wait a second. We don't even agree on the basics. What is actually the definition of a feature? What is the definition of a feature story? People had just different conceptions and different implementations, and it was confusing. Just communication, just as an example, we really have to go back to basics. Let's agree on something here. This is what we mean when we say feature. Exactly this and nothing more, nothing less.

Cole Bailey [00:20:40]: Also feature store. Then we discovered there's many different ways to think about the feature store. It's like virtual feature stores, physical feature stores, this and that, which none of us were experts on, but we all had kind of wandered into one specific state of mind about it, and then we had to consolidate that together. So that was also really interesting.

Stephen Batifol [00:20:59]: Yeah, I feel like, I mean, it's usually the case when people always like feature store first. I mean, I've had this experience in the past as well. In my previous job where, you know, data scientists come and they're like, yeah, that's, I want a features tour. First they all want to features tour because Uber published an article. And then it's like, man, we all want this feature store. And then it's like, yeah, okay, but do you actually need it also, can you actually use the features? And, yeah, for us was also like the same, like, you know, okay, what's a feature? First, what's a feature store? Then what's a feature? So what's a feature that, I mean, first what's the feature store? And then what's the feature that you have?

Cole Bailey [00:21:37]: Yeah, so I think from another three days. Yeah, exactly. So basically a feature should be essentially one column or data point if you think of a table view. So that's a single feature, and that feature should be usable by multiple consumers. That's the big idea. And the feature store is the whole engine that moves data around from where the data is computed to where it's served to models. I think the way we're going with feature store, the feature store also is kind of an orchestrator. You can think of it.

Cole Bailey [00:22:12]: We have a data platform. We don't want to reinvent data platform. So the features are computed in the data platform. Because some people think my original idea is feature store is like, well, to build a feature, I need SQL files. So the definition of the feature is a SQL file. So it's like, no, no, no. Actually, that's the data platform's job. The feature store is just to move the computed feature into inference time.

Cole Bailey [00:22:33]: So the feature is more like the metadata around what does the model need to know to access this feature in both batch and online? And the actual computation of the feature is something else entirely, which we extracted out, probably.

Stephen Batifol [00:22:45]: Okay, when you're training your model, is it also like, do you also use those features for the feature store or when you train, it's just from the data platform.

Cole Bailey [00:22:56]: Yeah, we should. Exactly. The feature store should provide kind of SDK for both offline and online. So when you train the model, you can say use the offline SDK and that SDK somehow gets the features for you. It will end up fetching the features from the data platform. The storage is always going to be in our data warehouse. We're not going to invent a new storage layer when the data warehouse exists.

Stephen Batifol [00:23:17]: I hope not.

Cole Bailey [00:23:18]: So it's just like a virtual layer on top. It knows where to find the feature in the data warehouse.

Stephen Batifol [00:23:23]: You can say, okay, so now you have a feature store with some features. So have you actually seen an improvement? Delivery hero with like then using a feature store and then also like agreeing, you know, on like the definitions and like maybe how to use it and how to add new features. Have you seen, yeah, have you seen some improvement there?

Gottam Sai Bhrath [00:23:44]: Yeah, I think talking about improvements. So we know using feature store and we already seen all the models we just talked about, especially also from Pandora and logistics we can add. Cool. So we already seen a huge benefit having a feature store, hence point of having global feature store. Right. So if there was no value, I'm pretty sure it would not have happened. And yeah, so especially before we are using models without any kind of features and then even just using batch is giving so much value. And if you imagine actually using the live features, which I think we'll be doing very, very soon, it should give us a lot of a value out of it, especially driving the business for us.

Gottam Sai Bhrath [00:24:26]: It's important because when I say personalization and recommendations and when we don't have real time serving for the features and we are only doing right now, it's only on the batch as we already talked about.

Stephen Batifol [00:24:39]: What's the batch time? Was this like a couple of minutes.

Gottam Sai Bhrath [00:24:42]: Or is it, what is it now we have like, based on the model, we have different times when we generate this features, but it's not exactly real time. Right. And we're seeing all this batch features already providing so much value. We also want to build some streaming solution around it, which is in progress. And there seems to be a lot of value. I think once you go live, we'll have actual numbers. But then before there, we do see a lot of like data scientists when they propose the whole thing that, okay, we do want to use those stream features. You know what, this is the amount of money we might drive.

Gottam Sai Bhrath [00:25:23]: You know, this is the basis for even actually starting to work on this right now and especially from the global feature store standpoint. So logistics already has streaming example, right? And we don't. And the whole point of right now doing global feature stories, if we can actually reuse what logistics already have and we don't, that saves a lot of engineering effort, which just keeps adding value to how much money we can actually produce from the actual application itself.

Cole Bailey [00:25:52]: Right.

Gottam Sai Bhrath [00:25:53]: So all this together and yeah, that's where we are headed to. And right now all I can say is there is a lot of value for register. It's just going to be just increasing a lot more. And for global feature store, it's still in an MVP stage. We kind of arrived at the conclusion when it should be done. So by November maybe we'll have another forecast, I don't know, but by November is when we plan to have an MVP of the global feature store. Okay, yeah, we'll have to see how the actual value would be looking like.

Stephen Batifol [00:26:25]: Okay, but it seems, I mean it seems promising already and yeah, I have a follow up question to that. So you like, when you create the features, how do people discover the features actually, because I guess for you, you've been working on it for so long that it's obvious. But like how do you make sure that people, you know, not recreate a similar feature with like different name or the same one with a different name, you know, or like how do you, how do you make sure of that?

Gottam Sai Bhrath [00:26:49]: Yeah, I can start a bit and then. Yeah, cool, you can add. So for us, especially so we have built our feature store based on the feast, the open source and feast fortunately give some kind of a UI which is super basic but not really that amazing that, okay, something's gonna be missed. You can kind of find it, but that's the starting point, right. So we were not initially putting so much for HR discoverability, I think, which was a biggest mistake because now especially in global feature store, we are looking at that we have a way of how to create the features. We are trying to look at discoverability part because it's super important, especially when there customer base like us will be increasing so much and there are some solutions in mind. So we do are planning to use some internal product. I'm not really sure if that's gonna work out to be honest.

Gottam Sai Bhrath [00:27:45]: And yeah, or maybe we'll have something custom, especially now because this is a good part of the GFS like global feature store. Right. So there's another team who really, really needs UI and all of these features you just mentioned, right? And for example, Pandora doesn't need it, but they wanted, they can already develop it for us as well so that we don't need to. Okay, you know what, now we are done with the bash. Okay, let's work on feature discoverability. No, because there's another team already doing it and they're just going to contribute it to global feature store. And then we are just reaping the benefits. So we are not spending time.

Gottam Sai Bhrath [00:28:20]: This is how, if you see about it, like, if you think about it, any feature request, you kind of ask me right now, it will be catered by someone in this company, but everyone reaps benefit model.

Stephen Batifol [00:28:33]: I mean, that sounds amazing, but it's also like, how do you deal with them? Ego of people and stuff. Because, you know, sometimes it's like, yeah, what you built is a ui, you know, but it's like, how do you also like, deal with that? Or like, I don't know, maybe at the age these people have, like, usually, like, now they got it that, you know, you have to maybe sometimes accept things, even your way. Like, yeah, how do you also like, deal with that?

Cole Bailey [00:28:58]: Yeah, I think, I think that's a challenge, especially with global feature store because to be honest, I didn't know these people exactly months ago. Right? These are like fresh, fresh faces to me. Right? Like, I know the egos of logistics. I can, I've learned how to deal with them. But then these are like fresh egos and not saying they're bad. Everyone has a little bit of ego, right? Me too.

Stephen Batifol [00:29:15]: Exactly.

Cole Bailey [00:29:16]: And I think what worked for us in the end was just to really be pragmatic and say, okay, put a nail on discussions where it's like, ah, but optimally, you know, redis is bad for this and this reason. And maybe there's a vector DB, blah, blah, blah. It's like, okay, look around the room. Who's using redis right now? 90%. Okay, Redis works. Just use redis.

Gottam Sai Bharath [00:29:35]: Right?

Cole Bailey [00:29:35]: Let's start there. We can add more later.

Stephen Batifol [00:29:37]: Right?

Cole Bailey [00:29:37]: So he's being pragmatic and like looking at, okay, what actually works, what do we actually have evidence for? And using data ultimately to make decisions. I think this is what was most impactful, because in the room, someone had built a UI before. Someone had built a really nice catalog before. I'm similar to Gautam's style of feature source. So we're super technical. We don't really need a UI. People want to write their SQL and write really complex SQL and get their really custom feature of production and the number of features we're supporting is relatively small. We're not in the hundreds or thousands of features, but there's other departments that are really catering for ops or something like this, where the features are very, very simple, like this count divided by that count.

Cole Bailey [00:30:17]: They want to do it purely through UI, and the scale of features is much, much bigger because of that. It's a completely different target user in the end. And so they have a stronger need for UI for even on the fly computations, which we've never done for having a feature catalog. So, yeah, it's just different perspectives on what the goal of the feature store is, which also ties into the ego question, because then it's like, okay, you want ops to be happy, I want data scientists to be happy. We have to meet in the middle somehow. That's a bit difficult. But, yeah, I think ultimately we look at what works and then we just try to find, like, is there a way to cater to both needs at the same time? Is there a way to kind of compromise? And in the worst case, you have a few executives sponsoring the whole thing and they can put the hammer down and, like, make sure things don't go off the rails.

Stephen Batifol [00:31:05]: And then, yeah, you'll still like, yeah, whatever. Like, we'll discard that and we'll use redis. But I mean, it's also a good idea. I feel like sometimes, you know, people, I mean, software engineers can also, you know, be like, very, very specific about what they want. But then it's like, yeah, but you know, like, okay, maybe if you want to use this very new cool things like vertex back then and then, but the rest is like, yeah, maybe not. And then can also be a big success or failure as well, basically. Like Redis. I mean, redis just works for those.

Stephen Batifol [00:31:36]: It's like, it's kind of made for that. And maybe it's not the best thing for like 100% of it, but then if it does like 80% of the job already, you know, it can. That can be the cool part.

Gottam Sai Bhrath [00:31:47]: Yeah, but that's the best thing. So right now we just picked Redis because that's the most used. But the way we are trying to build the global feature store is like Litron. It just supports anything. Right. So it's not limited to redis. It's just one we are starting with then. Yeah, it's basically true for any component.

Gottam Sai Bhrath [00:32:06]: So we want to keep it cloud agnostic, component agnostic. So whatever you want to build in the later feature. And this also helps us to be more proactive. I mean imagine later on some organization, like some data Sanderson random department comes and say, okay, now I want to use UI. And for example they come in Pandora and they tell, okay, I need a UI. I think that's way better. Now we don't need to think too much about building from now and okay, plan it, take like what, one month, two months and then release the feature. Right now it's with the global feature store.

Gottam Sai Bhrath [00:32:42]: It's already available. Then you know, just need to deploy it and. Right. So it serves us on so many multiple levels and yeah, that's how we are trying to take it forward as a long term vision.

Stephen Batifol [00:32:54]: Yeah, no, but it's cool as well. I mean I guess it's nice also for you that you can see the difference that now you also have like the support of other teams as well. Instead of being the mean guy, kind of being the one that forsake everyone to be like, no, you have to use that. I guess that's also a good part. And yeah, you've been talking about creating features or maybe creating features of the UI, but also how do you test those? Do you have a way to, I guess also when you hire new people maybe they have to be like, they have to be on boarded obviously. Like how do you make sure that they can be on boarded successfully and then how do you also make sure that you can test those features before releasing them so you make sure that nothing's broken.

Cole Bailey [00:33:41]: I mean I can only really talk about our logistics experience with live features, which this is honestly the hardest part. Like with batch features, I'm envious because it's like live computer once and then you just upload, right upload to the cloud and you're done. But it's like easy, you know, but that's also why they have so much cool stuff that we don't have like feature catalogs and uis and things because we just don't have time for that. But like in streaming it's a nightmare because essentially you have two data sources because you need like six months of historical data to train your model in the first place and you're not going to get that out of a Kafka stream. Like it's just good docker store data in Kafka for that long for multiple reasons. Even if we did, we couldn't access it for mobile training very easily. Yeah, so then you have like some kind of mirrored data source in your data warehouse which hopefully matches Kafka. But does it really? We never know.

Cole Bailey [00:34:27]: We have to check that. So we have like multiple levels of checks and like QA, what we call like data consistency checks. Right? So we check the data sources one by one. Every data source we onboard. Is it consistent? Is it at least 99% consistent? 95% consistent. Like what's the threshold? And then for every feature we built, we also run it once in the offline mode. Then we deploy it to a shadow environment. We wait a day, we collect the logs from the online version, compare those, do a full cross join or full join, and compare the online and offline version and really check how consistent is the data.

Cole Bailey [00:35:01]: It's never 100%. It's impossible.

Stephen Batifol [00:35:03]: Yeah, of course.

Cole Bailey [00:35:03]: Also the computation engines are different, like flink and bigquery in our case, sometimes don't always match up perfectly. So, yeah, that's how we solve it. It's just a lot of writing, a lot of consistency checks as queries and joins and things like this, trying to make that as easy to rerun, reproducible, automate all the schedules and stuff like this to keep an eye on things. But, yeah, that's honestly the hardest part. And I think it's where we spent a lot of time, but we haven't scaled it, so it's super easy. Right. It's kind of a case by case basis for us. I think that's where we want to move to, though, is to find a way.

Cole Bailey [00:35:38]: How do we scale those quality checks in a more general way? Because I think that's what other departments would need if they want to use our feature setup as well. Right?

Stephen Batifol [00:35:49]: Yeah, I mean, especially as you said, for live features, it's very tricky because you can't just check it from the day of yesterday. You have to check everything at the same time. So do you also have some real time alerts, for example, if something is really wrong? Or it's more like you check at the end of the day, or it's automatically checked at the end of the day, and then you receive an alert being like, hey, by the way, look at that.

Cole Bailey [00:36:13]: Of course we have some real time alerts, but not exactly for data consistency. So data consistency we check, we're comparing to historical data, so we're usually waiting for our data warehouse to update, doing it once per day, for example. But the real time alerts are more like all the other stuff. Are the flink jobs running, checkpointing? Is it running? Is the latency good? Do we have Kafka lag, consumer lag popping up in different places? All of that stuff, make sure all the systems are running. And then for the actual data quality checks. We do that in a more offline manner.

Stephen Batifol [00:36:41]: And what about you, Gotham? How is it in the easy life part, the best part?

Gottam Sai Bhrath [00:36:46]: Well, as a thing, when it seems easy, that's where there's so many things that can go. And yeah, so yes, you can expect there's bunch of stuff we were able to build because it's not really complex as streaming. So yeah, we kind of have SQL transformations, python transformations. We have a lot of data quality, data drift checks. Is this. So? Yeah, we do track them and send them alerts whenever there's some bad data that enters and that you're transforming. So we just send them before so the bad data doesn't propagate forward. So as you know, bad data, bad model, bad prediction.

Gottam Sai Bhrath [00:37:22]: But we don't want that. So these are already taken care of. I think that's again the best part because now we can ship them to all the other departments. Right. So, yeah, so we don't really face a lot of issues as much as probably as streaming because nothing really is real time. So majority of them we either deal during the day and yeah, mostly it's happening on bigquery and we are kind of okay seeing not many issues. It's just so for us, the most difficult part right now is to tracking all these jaws because our Dax. So we use airflow for all the sequestration.

Gottam Sai Bhrath [00:38:01]: And when we started it was kind of easier because we had very few features and now we are having almost close to two to 3000 features. And you probably can imagine how messed up the dax would be looking like. And if something fails now it's super difficult to track from the task to the bigquery job to the actual data itself. Now it's getting a little difficult and yeah, so we need to think few more about, okay, how do we actually grab this in a clear way? Because before there was a lot of time to solve this. Okay, take a day or two, it's still okay. But now as we move forward, especially also when you think about global feature story, to have proper take in place that, okay, if something is failing, we need to know fast because there are some teams who are doing really critical business use cases and we don't want to spend time thinking of this black box and you kind of don't know what's failing where. And you know what? So we don't want to do that. And yeah, and also, yeah, I mean.

Stephen Batifol [00:39:05]: If you do that as we'll just be like debugging features, which is also not very fun. In my previous job I was a YAML engineer, you know when like Kubernetes cluster isn't working at one point, but you're like, you know, spend like three days in the nodes and pods and you know, checking whatever. That's also like not really fun. And so like you said some like you were in airflow and bigquery at the moment. So you have like dags running and then, so then other features. Features like computed per country or like how does it work usually or it just depends on the feature.

Gottam Sai Bhrath [00:39:37]: Yeah, it's kind of pretty flexible about how you want to design your tag. We have some conflicts so it can be per country wise. So we usually have it per country, which kind of makes sense. And everything string per country, even the models, everything is per country as you would probably expect. And yeah, so all the features are available as partition by the country as well again so that you can with them while training and serving as well. But uh, yeah, so this is how we structured it. And uh, yeah, it probably might change, not sure. But uh, yeah, we also have one other service which we probably didn't discuss.

Gottam Sai Bhrath [00:40:09]: So we do have live but not exactly like so we have an ether store which we use for redis, but then we don't really update redders like really frequently. Again, maybe a day, once per day. So it's not technically real time, but it's still live features. But as I said, so it's nothing super complex as of now. So we do really want to move to streaming. So we are working on streaming right now.

Stephen Batifol [00:40:36]: Okay.

Gottam Sai Bhrath [00:40:37]: Yeah, now that I hear about all the experiences with the streaming, it's bit.

Cole Bailey [00:40:42]: Of, I'm not sure that.

Gottam Sai Bhrath [00:40:46]: But then, you know, that's the good thing again. So since it's complex, we are getting a lot of value out of it. And since logistics already has this experience, we are going to share this experience and you know, try to at least not be way too scared because now we are not going into the unknown that there's someone already there and we'll tag along. So we are not going in line into those stuff.

Stephen Batifol [00:41:06]: So yeah, that's the good part though is that you can actually learn from like other people. And it's more like, I guess for you, like you know, when you really like part of the platform team, you were like kind of always going into the unknown, you know, and then people wait on you.

Cole Bailey [00:41:18]: Yeah, enough.

Stephen Batifol [00:41:19]: Either they wait on you or they just do it on their own, which is like very different. So like now I guess it's like for you, it's better, like just be like, yeah, actually, you know, I can just have a chat with them and then learn from them. Right?

Gottam Sai Bhrath [00:41:31]: Yeah, that's the best part. So every team has their unknowns and we are just filling in the gaps so that, you know, everyone's just reaching the goal faster.

Stephen Batifol [00:41:40]: Okay. But that's really, really cool. And so what is then the future vision for the global feature story? Is it going to be like one of those, you know, this is KCD meme and the standards. So we have like, yet another standard. So it's like yet another features tour or what's the future is like you said already, like, you know, you're going to scale it up. You're going to, like, unified it a bit. Also, like inner sourcing as well, what you mentioned in the past. But, like, is there like, other things as well that you want to work on?

Cole Bailey [00:42:12]: Yeah, I mean, I think the short term will be the XJCD meme. Right. Like, we'll have to add a new standard and hope and work to make it work. Right. So, I mean, concretely, there's already the first POC, let's say, with one specific department ongoing. Right. And so that department is also committing some engineering resources to figure out the first steps. Right.

Cole Bailey [00:42:34]: So we're kind of waiting on the sidelines because live is like a layer on top of batch. Like, get all the batch stuff right and we'll add our stuff on top is kind of our strategy. So logistics is we're involved, but I'm kind of waiting a bit. But let's see. I think we'll have to go department by department and prove that it actually works. If they have an existing feature store, we need to prove that we can replace that feature store, which is also going to be a huge challenge. This is, I think, the hardest part. What you and Gautam were talking about before.

Cole Bailey [00:43:00]: If you're starting with a team who has no engineers, has a new use case, this is honestly the easiest way to get adoption on this platform. Tooling. Even inside logistics, we have the same problem where we have people who have airflow dags that have been running for years and I wasn't even in the company when they were created. And now we've moved on and we have better tooling and it's like, okay, we should use our tooling, right? And they're like, maybe, but like, how? And like, do I have time to migrate my things? I don't know. So, yeah, I think featurestore will face the same problem where every department needs to kind of prove that it's worth it for them to onboard to this. It provides value to them. So I think what will have to happen, right, is we have to actually combine these different features. So we need a batch feature store.

Cole Bailey [00:43:44]: So I will only adopt it in logistics if you provide me batch and feature catalog and these kind of nice things. So then maybe, maybe for another department they have all that already and what they really need is live. So then they're only going to adopt once we add the live on top of it as well. So I think we'll have to go department by department based on their needs, keep adding features until it gets into that full vision. And then hopefully we start also removing the old standards. Some standards, some old standards. That's the hardest part. So that's the part I think we'll have to see how it goes.

Stephen Batifol [00:44:13]: Yeah, yeah, it's a hard one. And also, I mean, also to add to that is like, do you have, like you mentioned Google Cloud since the beginning, basically. So I guess that's what you're using. But is it like, do you also have other cloud providers because you have different companies? So then how does it work for them?

Cole Bailey [00:44:35]: Yeah, I mean, we. So the last I've heard, I think at least logistics standard is AWS for production workloads, GCP for data, off site workloads. That's kind of our standard. So we do both clouds. I know there's some departments that have used GCP for everything, but at least our data warehouse is GCP. So everyone has GCP for data warehouse with some small exceptions, but they're moving. But yeah, I don't know. I don't remember which departments were using GCP for production workloads and if they're actually moving to AWS or what the timeline would be for that because it's a complex transition, but that's another level of variance.

Cole Bailey [00:45:11]: So then if you think about, okay, we want to deploy the global feature store, it's like, well, it needs to actually be compatible with both GCP and AWS for all departments to adopt it as well.

Gottam Sai Bhrath [00:45:22]: Yeah, I can shed more light on that, that we are not going to AWS. We were one of those ones where everything was on GCP and we plan to do that and. Yeah, so, but, yeah, that shouldn't be a biggest problem for us anyways, at least for now because we are trying to be cloud agnostic. So all the confidence at least should be deployable everywhere. So it is your terraform modules or all the components are kind of compatible. So, yeah, unless there's a specific offering that only comes from one cloud provider that we have to support, which I don't think is there as of now. Until then, I think we are okay. But as I said, so whatever we are talking about.

Gottam Sai Bhrath [00:46:03]: So until we are done with the batch, which is this year, and then we move to the streaming, which is next year probably, and until we have a proper global feature established, we wouldn't really know what is going to happen. And we are positive on the outlook, like how the whole long term vision looks like. And once we have a success this year with the batch, I think we'll be on an amazing path towards that.

Stephen Batifol [00:46:30]: Cool. That sounds cool. Okay, well, I think, yeah, when I come close to, like, the end now. So do you guys have any ending thoughts maybe to, like, end on this amazing futures talk podcast?

Gottam Sai Bhrath [00:46:44]: Yeah, I'll just add a couple more because I also already, the ones I was talking was kind of ending thoughts, but yeah, so just to reiterate. So this whole initiative, I think is very relevant for, especially the companies, was structured like delivery hero. There's central teams and then there's so many local teams, and it's basically about how you strike this right balance between all the centralized and decentralized solution, but also try to support the innovation, because this is very, very critical when you see about local teams developing their own stuff, they're in silos and then there's central teams who doesn't really listen or talk much about local teams, and they build something which no one's going to use it from the local. So we want to eliminate all these boundaries and also having a nice solution, not something, as we discussed, like a Frankenstein monster, so no one can actually use it, or it's just super complex to use. So it's a super complex problem. I think that's why we are also enjoying solving this. And of course, there are a few non technical funnels, which. Yeah, I think Cole will be facing more than me, I guess.

Gottam Sai Bhrath [00:47:51]: But yeah, so this is like the old vision, and I personally am really happy to see where it's going. And yeah, I think we already talked too much about the benefits and there are pitfalls, but I think I'm really confident that we, as a, like, we, the whole team, I think whoever I met is amazing. So pretty sure we'll be boring.

Stephen Batifol [00:48:13]: Cool. What about Zuko?

Cole Bailey [00:48:15]: Yeah, I think kind of in the same direction of things. I don't know. For me, I think the biggest learning, and not just with the features store, but also the work we're doing inside logistics on other ML platform topics. Right. Is like how difficult it really is to scale platform tooling in an organization. I think, I don't know if people don't have experience with that. I think it's just something that might be quite interesting to realize is that it's not as easy as, like you said, just develop a new standard, deploy your new system, and everyone will adopt it. Right.

Cole Bailey [00:48:46]: There's a thousand different considerations people are always fighting for. Do I have enough time and capacity and priority to work on adopting your tooling, learning how it works? And usually also there's this siloing problem where you built some grand vision, but actually you didn't really know what people really needed, and so you solved maybe 70% of what they needed, but the 30% that's really important you kind of missed out on. Right. And then that's something that's really easy to overlook. So, yeah, these are the really interesting kind of challenges that I see and especially with, to have existing systems, existing status quo that you need to improve on and replace. I think that's kind of a really big challenge here for global feature store, anything like this. So, yeah, I don't have like the perfect solution to how to, how to address those things. But we're learning as we go.

Cole Bailey [00:49:35]: And yeah, it's been, it's been interesting, and I'm pretty hopeful that we'll, we'll navigate it.

Gottam Sai Bhrath [00:49:41]: Yeah, we are in this together, so.

Cole Bailey [00:49:44]: Let'S go for it.

Gottam Sai Bharath [00:49:45]: Cool.

Cole Bailey [00:49:46]: Yeah.

Stephen Batifol [00:49:46]: Well, gentlemen, it's been a pleasure talking to you. Thank you very much for joining the podcast. That was a very, very interesting one. Thank you very much. Like, even I'm hyped about like, what you're building at Dheendeh and the feature shore and everything. So that sounds really cool, too. Thank you very much.

Cole Bailey [00:50:04]: Thanks for having us.

