Real-time Feature Generation at Lyft
SPEAKERS

Rakesh Kumar is a Senior Staff Software Engineer at Lyft, specializing in building and scaling Machine Learning platforms. Rakesh has expertise in MLOps, including real-time feature generation, experimentation platforms, and deploying ML models at scale. He is passionate about sharing his knowledge and fostering a culture of innovation. This is evident in his contributions to the tech community through blog posts, conference presentations, and reviewing technical publications.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
This session delves into real-time feature generation at Lyft. Real-time feature generation is critical for Lyft where accurate up-to-the-minute marketplace data is paramount for optimal operational efficiency. We will explore how the infrastructure handles the immense challenge of processing tens of millions of events per minute to generate features that truly reflect current marketplace conditions.
Lyft has built this massive infrastructure over time, evolving from a humble start and a naive pipeline. Through lessons learned and iterative improvements, Lyft has made several trade-offs to achieve low-latency, real-time feature delivery. MLOps plays a critical role in managing the lifecycle of these real-time feature pipelines, including monitoring and deployment. We will discuss the practicalities of building and maintaining high-throughput, low-latency real-time feature generation systems that power Lyft’s dynamic marketplace and business-critical products.
TRANSCRIPT
Rakesh Kumar [00:00:00]: Some of them are like pre-Cursor as well. So when we initially introduced the pipeline, right, we were able to cut down almost 60% of the code base. Wow. So, I'm Rakesh, senior staff engineer at Lyft, and I like lattes.
Demetrios [00:00:24]: We recently had your colleague Josh on here and he was talking a lot about the data science side, the algorithms, the modeling side of things. Now you're coming from the engineering side and we get to talk to you about real time machine learning, which is everyone's favorite topic, I think, because it is so hard to get right, and you all have been getting it right for a while. Maybe we can break down first: what do you do with real time machine learning use cases at Lyft?
Rakesh Kumar [00:01:02]: So real time use case, real time machine learning, feature generation use cases. They cover most of the critical business use cases.
Demetrios [00:01:09]: And that's a mouthful.
Rakesh Kumar [00:01:11]: So I will break it down. What we do is we try to process real time data that is coming from users' devices and various services. We get that data, we process it, we transform it, we aggregate it, and then we pass it down to our machine learning models. And these machine learning models are powering some of the use cases. One good example, which was covered earlier in the podcast, is real time forecasting. For any region, any city, any venue, we want to forecast what the demand and the supply will be at any specific time. So that is one use case. The other use case is surge pricing.
Rakesh Kumar [00:02:00]: That means we generally monitor the marketplace for any city, and if there is any kind of imbalance in demand and supply, we want to adjust the prices accordingly. For example, if there are fewer drivers in that area, we want to increase the prices so that we slightly dampen the demand. Yeah. So that way we are trying to adjust all those things. These are two examples, but there are a couple of other examples as well.
Demetrios [00:02:31]: Yeah. And the hard part about this is the real time ness to it. Right. That adds a bunch of extra complexity if it's that online feature generation and then creating that model that can be making these predictions in real time.
Rakesh Kumar [00:02:49]: Right. So to cover the full context, in the feature generation part there are primarily two kinds. One is offline generation and the other is real time generation. In offline generation you have the data in an offline data store, you have time to process the data and generate the feature. You do not have that tight latency requirement, right? But in real time, what happens is that you need to process millions of data points as they arrive, which means it could be a second to sub-second latency requirement.
Rakesh Kumar [00:03:24]: Right. So as you are getting the data, you are processing all these things, and you have to provide that data as soon as possible to all these real time machine learning models so they can make their predictions. And it is very important for Lyft to process all this data in real time. You do not want to serve stale data to the model, because then your model will be drifting apart from what is actually happening in the world. Let's say your model, or your pipeline, is slow by 30 minutes. It may not accurately predict the current market condition. You will be giving prices which are stale, which are old.
Rakesh Kumar [00:04:07]: And whatever you are trying to do, it may not reflect correctly. So it's really important for us to basically process all these data really fast and just provide it to all these machine learning models.
Demetrios [00:04:20]: Yeah, you can potentially lose a lot of money for the business.
Rakesh Kumar [00:04:24]: And we had seen that as well. In terms of latency, if we degrade our latency, or we take a longer time, it affects the performance of the model. Yeah. So it is very important that we try to minimize the latency as much as possible. And it could go to seconds or sub-seconds as well.
Demetrios [00:04:51]: Well, I want to talk about some tricks that you've done to make sure that latency is as low as possible. If I am understanding it correctly, when you say real time, you're one of the few people that mean real, real real time. It is not like a five minute window real time. It's more a half a second type of real time.
Rakesh Kumar [00:05:16]: So I think that's a really good question: what is real time, right? Different people have a different definition of real time. Some people say a synchronous call is real time, some people say asynchronous processing done as the data arrives is real time. We are not talking about the synchronous case here; for this specific talk we will be talking mainly about the async processing. And there, also, there are a couple of, what do you call, use cases.
Rakesh Kumar [00:05:53]: You mentioned a five minute window, a one minute window. We do do that as well. It completely depends on the use case, right? For the forecasting use case we aggregate data over a 1 minute window and a 5 minute window as well, and then we provide it to them. In some cases we also have short term historical data and we provide that too. So it completely depends on the use case.
Rakesh Kumar [00:06:22]: Most of our use cases are based on one minute aggregated data. That's how our models are designed, and that's how the pipelines are supposed to provide that data to the models. So it is not the extreme side where it is sub-second, or where you're talking about a synchronous call; it's more about the asynchronous processing of the data.
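In batch terms, the one-minute aggregation Rakesh describes boils down to bucketing events into tumbling windows keyed by a floored timestamp. A minimal sketch with made-up events and fields (not Lyft's schema):

```python
from collections import defaultdict

# Illustrative events: (seconds_since_start, geohash, value)
events = [
    (5,  "9q8yy1", 1),
    (42, "9q8yy1", 1),
    (75, "9q8yy1", 1),   # falls into the second one-minute window
]

def one_minute_aggregate(events):
    """Bucket events into tumbling one-minute windows and sum per key."""
    buckets = defaultdict(int)
    for ts, key, value in events:
        window_start = ts - (ts % 60)        # floor the timestamp to the minute
        buckets[(window_start, key)] += value
    return dict(buckets)

print(one_minute_aggregate(events))
# {(0, '9q8yy1'): 2, (60, '9q8yy1'): 1}
```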
Demetrios [00:06:50]: Is it like the complexity and the difficulty increases exponentially if you're going from that 5 minute real time to 1 minute real time? Or is it something where you can create a system and once it's in place, you can get to a certain point where you've got that cushion, and whether it's 10 minutes or it's one minute, you're good, but you start hitting the ceiling if you're starting to talk about sub-second latency. I'm trying to gauge how different it is to be at 5 minute real time versus 1 minute real time versus millisecond real time.
Rakesh Kumar [00:07:38]: So if you go further, like if you're trying to look at sub-second latency and you are comparing with synchronous processing, it is not the same thing. There, in most cases, a synchronous call relates to one request. But our processing is based on aggregated data. That means we are getting all this data from different sources and we are trying to aggregate it based on geohashes. A geohash is about the size of a city block or so. And you can imagine we are getting this data from different cities across different nations, and then we have to process that data, go down to the city-block level and aggregate it. Coming back to your question, like where is the nice cushion? I would say it depends.
Rakesh Kumar [00:08:35]: It depends on the use case and the data volume that we are trying to process. I would say if it is a smaller window, it's much easier to aggregate because you are accumulating less data and you can emit the data faster. But if you have a longer window, that means you are accumulating that data in memory for a longer time and then you are emitting it. So your processing and, what do you call, your memory utilization increase significantly. And we had seen use cases where we tried to do a 30 minute aggregation. It was working, but it was not scalable.
Demetrios [00:09:18]: Yeah.
Rakesh Kumar [00:09:19]: And yeah, got expensive. Yeah. So one minute or five minute, those are pretty good use cases.
Demetrios [00:09:28]: Now can you walk me through a bit of the evolution of how you've tackled this specific problem, the real time problem, while being at Lyft, because I feel like you've been through different iterations.
Rakesh Kumar [00:09:46]: I will divide this evolution into two or three phases. First phase, when we were at a very early stage, we were not even using streaming engines to process this data. What we had at that time was cron job based processing. A job would get up at the top of the minute, try to get all this data from the Kinesis streams, and store it in a kind of temporary location, most likely Redis. Then the next cron job would get up, process that data and pass it on, and imagine every step is doing its processing at the top of the minute. So even though the very first cron job has finished its task within one or two seconds, the next one is not going to start immediately. It will take longer. It was working, but as we were growing, it was not a scalable solution.
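A rough sketch of what that first, cron-based phase might have looked like: one job draining a Kinesis shard into a Redis buffer at the top of the minute, and a second job picking that buffer up on its own schedule, so every hop waits for the next cron tick. Stream, key, and field names here are assumptions for illustration, not Lyft's actual code.

```python
import json
import time
import boto3
import redis

kinesis = boto3.client("kinesis")
cache = redis.Redis()

STREAM = "marketplace-events"            # assumed stream name
BUFFER_KEY = "raw-events:current-minute"  # assumed Redis key

def ingest_job():
    """Cron job #1: drain recent records from Kinesis into a Redis buffer."""
    shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
    it = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]
    records = kinesis.get_records(ShardIterator=it, Limit=1000)["Records"]
    for rec in records:
        cache.rpush(BUFFER_KEY, rec["Data"])

def aggregate_job():
    """Cron job #2: runs at the next minute boundary and aggregates the buffer."""
    raw = cache.lrange(BUFFER_KEY, 0, -1)
    cache.delete(BUFFER_KEY)
    events = [json.loads(r) for r in raw]
    # Aggregate and hand off to the next stage; each hop waits for its own
    # cron tick, so latency stacks up step by step.
    print(f"aggregated {len(events)} events at {time.time():.0f}")
```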
Rakesh Kumar [00:10:47]: Our models were also not performing well because the system inherently has latency. So what we tried to do is, okay, since our model will not perform better if there is inherent latency, we want to reduce it. So we researched what technologies exist and how we could re-architect the solution. After our research we realized that we could use streaming based solutions to process this data. There you do not need to wait for the next step to get triggered at a specific time; in stream processing, as soon as the very first operator has processed the data, it sends it to the next operator downstream and that one processes it. So it's more like, when the data is available, it gets processed. It does not have to wait.
Demetrios [00:11:37]: So initially in this you're talking about like Flink, right?
Rakesh Kumar [00:11:41]: Right. So we use Apache Beam, and Apache Beam is using Flink underneath. We initially did a POC and the initial results were great. It worked fine. We also chose the Apache Beam Python SDK mainly because we are a Python shop. Most of our services are written in Python, people are comfortable in Python, and even our models are written in Python. So we thought, okay, maybe we'll just use it. The interesting thing is that though we are using the Apache Beam Python SDK, it is doing most of the processing in Apache Flink.
Rakesh Kumar [00:12:25]: And Flink is JVM based technology.
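To make that hybrid concrete, here is roughly what a Beam Python pipeline pointed at a Flink cluster looks like: the Python SDK defines the transforms, and the Flink runner executes them on the JVM. The runner flags follow Beam's documented options; the master address and the transform itself are toy stand-ins, not Lyft's.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Beam pipeline options for submitting a Python pipeline to a Flink cluster;
# the master address here is an assumption for illustration.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",
    "--environment_type=LOOPBACK",
])

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.Create([("sf", 1), ("nyc", 1), ("sf", 1)])  # stand-in source
        | "CountPerCity" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```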
Demetrios [00:12:27]: Yeah.
Rakesh Kumar [00:12:28]: So it was a good hybrid solution, and we used it for, what do you call, real time surge pricing, or Prime Time. And it was good when we launched it for one or two regions or cities. But when we started scaling it out, we realized, okay, this is not going to scale, because the way we were processing our data was not very performant, mainly because we were doing most of the processing in one operator. That was one issue, and that was easy to fix: you divide the operator and it works, right? Then we hit the other problem. Even though we separated out the operators and it was processing, we realized that some of the nodes were at, you know, 90% CPU utilization, and some of them were not even doing anything. It was like 10%.
Rakesh Kumar [00:13:33]: And the problem is in how, like, most of our traffic is divided based on the cities. That's how our models process all these things, right? And what we were doing is that we were routing all the traffic from a specific city to one particular shard, one particular node. And some of our top cities, of course, New York, San Francisco and so on, they have almost 80% of the traffic. So now imagine that 80% is going to one node and the rest is going to the other nodes. That was a very classic hot shard issue. And we're like, oh, okay, that's interesting.
Demetrios [00:14:16]: The distribution here is not adding up.
Rakesh Kumar [00:14:20]: Yeah. So then we realized we were not using it perfectly. What we needed to do was divide the processing further. Instead of city, you divide in terms of geohashes. A geohash, again, is this city block. So as soon as we get the data, we say, okay, what is the geohash for this particular event? And then we shard it based on that. So from 300 plus cities, it translated into almost millions of geohashes. And these geohashes are almost uniformly distributed across different nodes.
Rakesh Kumar [00:15:03]: So in a way we are not running into any kind of hot shard issue; the load is uniformly distributed across different nodes. And once most of the data processing is done, most of the heavy computation is done, and this heavy computation is, you know, filtering that data, aggregating it, and anyway we wanted all this data aggregated by geohash. Once that is processed, you have a number and a geohash. So it comes down to a very small data set that you can reshard again to the region. And that's how one region can get all the relevant geohashes.
Rakesh Kumar [00:15:43]: So effectively it's small data and that does not run into any kind of hot sharding issue.
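A sketch of the two-step keying Rakesh describes: shard the heavy aggregation by geohash so no single city pins one worker, then re-key the much smaller per-geohash results back to the region. Field names and the geohash-to-region mapping are assumptions for illustration.

```python
import apache_beam as beam

GH_TO_REGION = {"9q8yy1": "sf", "dr5ru7": "nyc"}  # assumed static mapping

def to_geohash_kv(event):
    """Key each event by its geohash (city-block sized) instead of its city."""
    return (event["geohash"], 1)

def to_region_kv(gh_count, gh_to_region):
    """Re-key the small per-geohash aggregates back to their region."""
    geohash, count = gh_count
    return (gh_to_region.get(geohash, "unknown"), count)

events = [
    {"geohash": "9q8yy1", "city": "sf"},
    {"geohash": "9q8yy1", "city": "sf"},
    {"geohash": "dr5ru7", "city": "nyc"},
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        | "KeyByGeohash" >> beam.Map(to_geohash_kv)           # millions of keys, evenly spread
        | "HeavyAggregation" >> beam.CombinePerKey(sum)        # the expensive work happens here
        | "KeyByRegion" >> beam.Map(to_region_kv, GH_TO_REGION)
        | "RegionRollup" >> beam.CombinePerKey(sum)             # tiny data, no hot shard
        | beam.Map(print)
    )
```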
Demetrios [00:15:51]: Oh, those bottlenecks aren't there now. I feel like I've talked to somebody at Lyft before. It wasn't on the podcast specifically, but they were in the community, and they were mentioning how having to create different configs for each of the setups they had ended up driving them nuts. And I think they ended up going and creating a company around it. But as you're talking about this and you're saying, hey, we've got all of these different models pulling in, or we've got all these different nodes pulling in the geohashes, and then you also have the different clouds that you're on and the different regions that you're playing with, is that something that you looked at and that was affecting you also? When it comes to having to figure out the config on this model or this node or this geohash, I'm not sure how you broke it down: is it by region, and each one has these things we need to be aware of? Because are you running just thousands of models across the world, or hundreds of thousands? I don't know what that looks like and how that works.
Rakesh Kumar [00:17:13]: So for most of the real time processing in that case we use a model per region. That is like one simplistic view of the world, but it could also be further divided into sub regions. And there are a couple of use cases there. Let's say San Francisco, it's a big city. And sometimes if you're trying to process the data and give it to one model, it may not perform accurately across different parameters. So what you want to do is divide that region into sub regions, so that way you independently process all these regions.
Rakesh Kumar [00:18:02]: There are trade offs there, but if you are breaking it down into smaller regions, it's much easier to manage them. So that's how we do it. Earlier we were processing at the region level; now, in some of the use cases, we have broken it down to the sub region level, and that's how we process it. But on the pipeline side we are still working on geohashes, and it does not matter how you consume it: we can always transform that data to either the sub region level or the region level. Of course, our customers define how they want to consume the data, and it's quite easy to just transform the data and then provide it. And in a way it's much better for us to further divide that region into smaller chunks, so that way we do not run into hot shard issues either.
Demetrios [00:18:57]: I like how you mentioned there that our customers, I'm assuming that's like Josh, who I spoke with a few weeks ago, that's the data scientists. They get to define how they want that data, how often they want that data, all of that type of thing. And I think the fancy word for that these days is a data contract. You all have put into place a data contract. You have the producer of the data and then the consumer of the data, and you're able to stay accountable so that if you're changing anything with that data, Josh is in the loop. Or if Josh needs anything changed, then you're in the loop. And I wonder how you got to that point.
Rakesh Kumar [00:19:43]: So how did we get to that point. There are two strategies that we generally use. In one case it is off the shelf: you can use the pipeline and generate the feature by yourself, and in most of those cases they are very simple aggregations. But in cases where you have a very specialized need and some complicated business logic, we hold their hand and say: you provide me the contract, I will build the feature for you. And generally what happens is they create the data from the offline data stores, they come up with a query, they generate all these features in offline mode, and they test their model. If everything is looking good, they say, okay, the performance is really looking good, I want to productionize it, I want to make it real time.
Rakesh Kumar [00:20:40]: That's when we come into play, and we convert that query into a pipeline implementation. So we transform that query into a pipeline. And our contract is generally that whatever features you are able to compute using offline data, we will be able to match that data to almost 99.9%. And when we build this pipeline, we compare the real time feature with the offline one and make sure that it is close. This is also required because we have a tight SLA in terms of latency as well as quality, and we do not want to degrade on the quality. So we have another in house framework which is basically comparing the real time feature and the offline one, offline in the sense that it is the source of truth, or ground truth.
Rakesh Kumar [00:21:45]: We are comparing these two features. We are basically sampling the real time feature and the offline one and comparing how different they are. If they match, that's cool. If they're different, then there is an alert defined for that one, and it will trigger, and then we look at why they are different. But most of our features, or I would say all of the features that we generate on our real time feature platform, have that kind of observability in place.
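In spirit, the parity check Rakesh describes could look something like this: sample matching keys from the real time and offline stores, compare them within a tolerance, and alert when the match rate drops below the SLA. The threshold, tolerance, and alert hook are assumptions, not Lyft's actual framework.

```python
def feature_match_rate(realtime, offline, rel_tol=1e-3):
    """Fraction of sampled keys where the real-time feature matches offline ground truth."""
    keys = offline.keys() & realtime.keys()
    matched = 0
    for key in keys:
        rt, off = realtime[key], offline[key]
        if off == 0:
            ok = abs(rt) <= rel_tol
        else:
            ok = abs(rt - off) / abs(off) <= rel_tol
        matched += ok
    return matched / len(keys) if keys else 1.0

def check_and_alert(realtime_sample, offline_sample, sla=0.999):
    rate = feature_match_rate(realtime_sample, offline_sample)
    if rate < sla:
        # Hook into whatever paging/alerting system is in place.
        print(f"ALERT: real-time/offline feature match rate {rate:.4f} below SLA {sla}")
    return rate

# Example: one geohash's aggregate disagrees with the offline value, so the alert fires.
check_and_alert(
    realtime_sample={"9q8yy1": 42, "dr5ru7": 17},
    offline_sample={"9q8yy1": 42, "dr5ru7": 18},
)
```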
Demetrios [00:22:16]: So it's really the operationalizing of the feature pipelines.
Rakesh Kumar [00:22:23]: Yes, that's right.
Demetrios [00:22:24]: And so as data scientists, they've got it good. They can just develop these feature pipelines offline and then throw you the shitstorm of making them real time.
Rakesh Kumar [00:22:40]: Yeah, it is interesting. There are different considerations when we are transforming that offline query into a real time one. At the high level the results are the same, but when you implement it you also have to consider how these pipelines are processing the data and how you can avoid the hot sharding issues. A lot goes into designing that pipeline, and some of our in house engineers are experts in this, so as soon as they get it they know which parameters are not looking good. We have to do further processing, further divide based on the geohashes, and make it process in such a way that it will not run into any kind of hot shard issue.
Demetrios [00:23:36]: Yeah. Now the obvious next question is do you all have, are you using some type of a feature store?
Rakesh Kumar [00:23:46]: We do use a feature store, but most traditional feature stores may not work for us, mainly because they are point features. That means you have a key, let's say a user, and then you have a value. So, let's say, when this user was onboarded, you just provide a date, right? Or what is the city where this user requested a ride, something like that, it's very straightforward, right? So that is one use case, but the other use case, where most of our models work, is aggregated.
Rakesh Kumar [00:24:26]: We try to figure out whether there is an imbalance in the market, and we do not look at the individual session, we look at a holistic view of that city or region. So we want to see how many drivers are there, how many requests are coming from that city, right? And in this case we consider that city or its geohashes. We do not want just one point value, it's the whole aggregated value, and it is hierarchical. That means a city will have multiple geohashes, and there is hierarchy within those geohashes as well.
Rakesh Kumar [00:25:02]: We have GH4, GH5, GH6, and all of them are hierarchical. So a regular feature store will not work. What we did is we created a geospatial based feature store: if you provide a region, then I can provide you all the geohashes coming under that region and their aggregated values, and you can choose the level at which you want it, whether you want the region level, the GH5 level, the GH4 level. You just provide that and we provide the data to you. And it is interesting, and it works, and it was able to scale well too. As I was saying earlier, some of our models work at the region level, some at the sub region level, some at the geohash level, a GH4 level, and the source of data is the same, but they can consume it differently. They can ask for the region level and we can still provide the region. It is the same data, it's just that the accessing mechanism is slightly different, but it has the same API, and that has helped us a lot. We can generate one feature and this one feature can be shared by multiple models as well.
Demetrios [00:26:26]: How are you making these different layers of the geospatial data? What does that even look like under the hood to be able to serve up that type of offering on the API?
Rakesh Kumar [00:26:42]: Yeah, so it's based on the hierarchy. That means you have a region, then you have GH4, GH5, GH6. A region might have a couple of GH4s, GH4s may have a couple of GH5s, and so on. It's hierarchical, basically a kind of tree. And when you're selecting anything, let's say you're selecting a region, you know all the nodes under that tree, or the leaves.
Demetrios [00:27:09]: On that tree, and then it's up to the builder or the data scientist to decide how granular they want to get in this specific use case or for this specific model.
Rakesh Kumar [00:27:20]: Yeah, that's right. And as for how we store the data, we generally store it in two different ways. One is at the region level, where we store the entire thing, and then the rest we have divided to the GH6 level, and we have defined the mapping between GH6 and GH5. So whenever someone sends us a request, we get the data from the store, aggregate it for them and pass it to them. Basically, when you look at a geohash, the prefixes always match: the leading part of the geohashes is always matching. So if I give you a GH6 and you tell me, give me the GH4, you're basically getting a substring of it. You're just taking the first four characters from the GH6, and that's how you're able to do it.
Rakesh Kumar [00:28:20]: So it is a very straightforward calculation.
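Because a containing cell's geohash is the prefix of its children, rolling a GH6 aggregate up to GH5 or GH4 is just a prefix truncation plus a sum. A small sketch of that lookup, with made-up values:

```python
from collections import defaultdict

# Assumed store: aggregated values keyed by GH6 cells.
gh6_values = {
    "9q8yy1": 10,
    "9q8yy2": 5,
    "9q8yz9": 7,
    "dr5ru7": 3,
}

def rollup(gh6_values, level):
    """Aggregate GH6 values up to a coarser geohash level (e.g. 4 or 5 characters).

    A containing cell's geohash is the prefix of its children's geohashes,
    so the rollup is just truncation plus a sum.
    """
    out = defaultdict(int)
    for gh6, value in gh6_values.items():
        out[gh6[:level]] += value
    return dict(out)

print(rollup(gh6_values, 5))  # {'9q8yy': 15, '9q8yz': 7, 'dr5ru': 3}
print(rollup(gh6_values, 4))  # {'9q8y': 22, 'dr5r': 3}
```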
Demetrios [00:28:23]: Now the other thing was around the offline and online piece, which I think it would be good to go into: how it looks offline when folks are doing offline jobs versus online. I think the online part is quite clear to me right now, but I want to know what you do for offline, and also, is offline just for that ground truth? Is it for different use cases that you're servicing? What is that?
Rakesh Kumar [00:29:03]: So, offline. Most of our real time data ends up in the offline store for various reasons. One of the primary reasons is that you want to see historical data. When data scientists or ML engineers are working, they want to see what the data looks like. They want to transform it, make it into a feature, and test it out with their model. That is one example. The other example is when we want to validate the real time features, whether they are accurate or not.
Rakesh Kumar [00:29:39]: Right. That is another one as well. Sometimes we can also do backtesting. That means if models misbehaved in production, you want to go and replay and see why they misbehaved; you want to debug further. So there are a couple of use cases there, offline and online. In the offline world you get the delayed data as well: you can have all the data, and you have all the time to process it.
Rakesh Kumar [00:30:07]: But in real time processing, it's difficult if the data is delayed. We cannot wait forever for it to arrive; you do not have that much time. So what happens there is you just ignore that data, move past it, and keep processing. That's how it happens. And earlier I was telling you that we have 99.9 percent matching data. But why is there a gap? Why are we not able to match 100%? That is the reason: in real time, data may be missing or it will be delayed. Imagine a person going through a tunnel.
Rakesh Kumar [00:30:47]: Somehow they do not have a network and then the device is not able to send the data. And once they come to an area where they have a nice signal, nice connectivity, the data is synced to the server. And that's how we get a kind of delayed data. Yeah.
Demetrios [00:31:02]: So one thing that I would love to get into, though, is the actual decisions that are being made in these meetings when you have to say, okay, we want to go and support XYZ, whatever XYZ may be. Maybe it is a new technology, maybe it is a new type of architecture, or a new experiment where you're going to change the system. How are you all thinking through the trade offs on these bigger decisions? Because I can only imagine that you're having your retros. You're also recognizing that you're supporting quite a large scale and you want to continuously be moving forward. You don't want to get stuck in time. But because you have to make certain decisions and create opinions, there are things that you're going to land on that potentially can be a couple year relationship or a decade long relationship with a certain technology or a certain system.
Rakesh Kumar [00:32:22]: Right. So there are so many considerations that go into this. Definitely one of the primary ones is whether this technology is a good fit for the problem domain or not. It may be a good fit right now, but down the line it may not be. Things could change, the traffic pattern could change, or there's a new technology which can solve this problem easily, where we are required to put in less effort. As I was telling you, when we introduced stream processing at Lyft, earlier it was based on cron jobs. The cron job may have solved the problem when we were at small scale, and it was okay.
Rakesh Kumar [00:33:11]: But as we started growing and scaling the system, we realized it was not able to scale and we needed to look for other technology. So streaming was one good case where you do not need to write that much code, you write less code and it's still able to process the data in real time. So that is one. The other consideration, which is also relevant to this, is how much effort we need to put in to build or use the technology. And that is a most important one: as we try to move fast, it is very important that the operational cost or development cost is low on our side, so that we can ship things faster. The other part is the cost as well. Over time we have moved from one solution to another.
Rakesh Kumar [00:34:06]: Mainly because initially, when we launched services, it worked fine and cost wise it was fine. But as we were scaling up we realized that in certain places we were spending more and we could cut those things down. Then we realized, okay, maybe there is another alternative, or maybe there is another way of doing things, and when we looked into it, we realized yes, there is. So cost is another one as well; we try to look for, what do you call, technology which could be cheaper for us. Then there is open source as well: if something is open source and well adopted or well supported by the community, that is another consideration. One more is our ecosystem, how well that technology or framework can integrate with our ecosystem.
Rakesh Kumar [00:35:02]: If it is really hard to integrate, then it may not be a good sell for us. An example could be stream processing. We could have chosen Flink directly, but we did not use it, mainly because moving from Python to Java would have been a very huge investment for us. And at that time we were scaling and we were also trying new technology. You do not want to take on too many problems and try to solve them all; you probably solve one. And that's why we said, okay, we can choose a technology which could be a good fit with Python. Most people were familiar with Python, so we chose Apache Beam, because Apache Beam supports a Python SDK as well. So yeah, these are some of the considerations that we put in before finalizing any technology or framework.
Demetrios [00:36:00]: Can you imagine trying to sell a data scientist on using Java? No, they will not like it. It will not work at all.
Rakesh Kumar [00:36:10]: Yeah, Python is well supported, and our customers are data scientists and ML engineers. We do not want to introduce any kind of solution which will create unnecessary friction; we want to create tools which will make their life easier. And yeah, if it does not fit well in our ecosystem, even if the tool is great, it may not work well. And if there is any kind of bridge, a solution which can bridge the gap, that would be great. Then we can probably consider both: the solution which bridges the gap and the final solution.
Demetrios [00:36:53]: Then it makes sense too that you look at that bridge as how much developer time, how difficult is this going to be to integrate into our system? How often are you trying to look at what's on the market to see if you can cut costs? Is this like a quarterly thing? Is it a weekly thing? Is it a monthly or yearly thing?
Rakesh Kumar [00:37:25]: I would say it depends, case by case. I don't think we look at it every day, but we do have monitoring in place. Let's say there's a significant increase in cost: we know why it has increased, and if it has crossed certain thresholds, then we will look into why. If there is anything we can do to bring down the cost without changing the technology or framework, we would do it. If there is a kind of roadblock which does not allow us to reduce the cost, then we have to think, okay, maybe we should look for other technology. But it completely depends on the scenario: how much you are spending, what the budget is, and what our capacity is to move to a new technology.
Demetrios [00:38:19]: Yeah, we had Rohit on here probably three, four weeks ago, and he talked about how a huge boon for the folks he works with has been recognizing when they don't actually need the freshness of data that they think they need. Just going from 1 second to 10 second checkpoints, he was like, you can save a lot of money with that.
Rakesh Kumar [00:38:51]: Oh yeah. So there's another process that we have, which is onboarding. Whenever someone comes and says, you know, I want to build a real time feature, our question is: do you really need a real time feature, or do you need an offline feature? We ask all those questions based on their use case and try to determine if they really need it. If they do not need it, then we point them to the offline data stores where they can build it. Because if you're moving towards real time, it requires more effort and more money to have that solution. If you do not need it, then you are unnecessarily spending that much time and effort to have it. And that's why we vet all these requests, and only if it is genuinely needed, if it is a must-have, do we do it. Otherwise we provide the other solutions, the offline solutions.
Demetrios [00:39:52]: But why do you think that people are convinced they need some type of real time feature?
Rakesh Kumar [00:40:00]: Why are they convinced that they need it?
Demetrios [00:40:02]: Yeah, why do they come to you thinking that they need real time?
Rakesh Kumar [00:40:07]: Sometimes, you know, they do not know how much effort is required, how much money we spend on generating all these real time features. You abstract all that away.
Demetrios [00:40:20]: So they think, oh, just throw another real time feature in there.
Rakesh Kumar [00:40:24]: So they feel it is cheap. And that's why they think, okay, we can just have it; anyway, it is available, so why don't we just use it, right? But as we own the platform, we generate all this; we know how much effort and time, as well as money, we need to spend on this. And we are really conscious when we make that decision.
Demetrios [00:40:46]: And so the data scientists get to experiment with the offline stuff and say: I'm a data scientist and I discover there's this feature that is, when someone logs on to Lyft, they search for their ride, but they don't actually get the ride and they log off. We predict that X percent of the time they're gonna come back and order that ride. So I want that feature in my model. I know I'm oversimplifying the shit out of it, but just go with me here on this. I want that feature in my model. I see that offline it actually gives my model some help. Then I go to you and I say, hey, I need this feature in my model. You come and try to convince me that I don't actually need it.
Demetrios [00:41:47]: It's in the model, and you say, are you sure about that? Can we maybe do something else? Is there potential for us to just do this in batch? And eventually we settle on: yes, I need it. Then I go on to say what type of detail I want. Or I guess that would have already been solved back at the beginning, when I say what geohash I need.
Rakesh Kumar [00:42:10]: Yeah. So if I'm convinced that, okay, this is a real time use case and we are going to work on this one, and it is a very, what do you call, specialized use case where they need more support to build it, then our engineers will transform that query into a pipeline. And since these engineers know how to build a pipeline which is scalable, they will transform the whole processing. At the high level you will get the same data, it's just that the processing could be slightly different. And one of the optimization techniques that we use, and I have written a blog about it as well, is that we try to drop unnecessary data early on in the pipeline. Let's say you are getting tons of data and some of it may not be relevant to you, right? As soon as we find that out, we drop it there; we do not let the data pass through the entire pipeline. And we have done that experiment, early dropping versus late dropping, and we had seen at least a 20% increase in efficiency, and even the latency improved. That's when we made the decision: we will try to identify all this data early on, and if it is not relevant, drop it and then process the rest.
Rakesh Kumar [00:43:35]: And the next thing is, sometimes if you look at the offline queries, they do mega joins with each other, you know, tons of joins and all those things. Real time processing does not work that way. You can do it, but it will be very costly and unscalable. So what we do, even for joins, is we try to divide them along different dimensions. The high level idea is that you divide the data into smaller groups and you perform the operation; smaller data is way more manageable than bigger data. So it's more like divide and conquer.
Rakesh Kumar [00:44:14]: You are dividing the bigger data set into smaller ones, processing them, and then processing further. That way you can manage it. Otherwise it's really hard.
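Two of the optimizations just mentioned, sketched in Beam terms: drop irrelevant events at the very first operator so they never travel through the pipeline, and replace one mega join with per-key grouping of streams that have already been shrunk. Event types, field names, and the relevance predicate are assumptions for illustration.

```python
import apache_beam as beam

RELEVANT_TYPES = {"ride_requested", "ride_accepted"}  # assumed event types

def is_relevant(event):
    """Early filter: anything we will never aggregate gets dropped immediately."""
    return event["type"] in RELEVANT_TYPES

with beam.Pipeline() as p:
    demand = (
        p
        | "Demand" >> beam.Create([
            {"type": "ride_requested", "geohash": "9q8yy1"},
            {"type": "app_opened", "geohash": "9q8yy1"},      # dropped early
        ])
        | "DropIrrelevant" >> beam.Filter(is_relevant)          # as close to the source as possible
        | "KeyDemand" >> beam.Map(lambda e: (e["geohash"], 1))
        | "SumDemand" >> beam.CombinePerKey(sum)
    )
    supply = (
        p
        | "Supply" >> beam.Create([{"geohash": "9q8yy1", "drivers": 4}])
        | "KeySupply" >> beam.Map(lambda e: (e["geohash"], e["drivers"]))
    )
    # Join only the small, pre-aggregated per-geohash values instead of raw events.
    (
        {"demand": demand, "supply": supply}
        | "JoinPerGeohash" >> beam.CoGroupByKey()
        | beam.Map(print)
    )
```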
Demetrios [00:44:25]: If you had to guess how many pipelines are you upkeeping right now? Are we talking thousands, tens of thousands or hundreds?
Rakesh Kumar [00:44:38]: So definitely not tens of thousands. We are very conscious about how we are building the pipelines. There are two ways we can do it. One is where people can build the pipeline by themselves. The other one is the super specialized features. For the super specialized ones, I would say close to 30 or 40 pipelines. And these pipelines are not necessarily generating only 40 features; one pipeline itself can generate tens of features.
Rakesh Kumar [00:45:06]: And when I say a feature, it is basically for one region. But they will be processing almost millions of data points per minute or so.
Demetrios [00:45:14]: Wow.
Rakesh Kumar [00:45:15]: Yeah, so that's how it is there. I do not have a full count of how many pipelines we currently run, but it is definitely somewhere in the hundreds.
Demetrios [00:45:26]: Something that you mentioned here is that there's self serve in a way. How does that work? Are the data scientists going and trying to stand up an MVP, and then you come in a few days later or however many weeks later and you make it bulletproof?
Rakesh Kumar [00:45:44]: Yeah, so the self service ones, those are very simple use cases, where you are processing only at a smaller scale. Maybe you are processing an individual user, you're not doing heavy processing of the data, and those operations are pretty straightforward. You know, you are getting user data, you're extracting out what their ride request time is, and then storing it somewhere. Those would be the simple ones. And since they are really easy, you don't need a huge understanding of how stream processing works, so they can just do it by themselves without much help. And anyway, we have simplified the SDKs and the whole infrastructure so they can bring up the whole infra, the pipeline, by themselves.
Rakesh Kumar [00:46:28]: So that is one. The other one is where it is super specialized, where it requires very heavy lifting or heavy processing. That's when one of the specialized teams comes in and helps the data science person or ML engineer.
Demetrios [00:46:44]: Now, I know that people talk a bunch about the data getting skewed in offline/online type scenarios. What are you doing to mitigate that? How do you go about that? Is that something that you had already talked about?
Rakesh Kumar [00:47:03]: I think I did talk about that one as well. Data skewness and hot shards, these are interrelated things. And whether you are offline or online, I think the online world is more affected by it. Early on we were thinking that it would not affect us, and we built the pipeline and it was not super performant. There were some nodes which were getting way more data, they had to do way more processing, and the latency was huge. One bad thing about stream processing is that if any one node is slow, it will slow down the entire processing as well.
Rakesh Kumar [00:47:53]: It creates a barrier. If one node has not processed its data, then the whole thing does not move forward; it waits for that node to finish, and only then do the other nodes move forward. So that was a problem. And for data skewness, you have to look at how the data is distributed. If you have a good understanding of that, then you come back to the pipeline design and figure out how you can further redistribute the data to avoid the skew. In our case, region causes a huge data skew. It is like the 80/20 rule.
Rakesh Kumar [00:48:37]: That means 20% of the cities are generating almost 80% of the data. So for those big cities we have to be really cautious about it. In our case we just use the geohash to further redistribute the data, so from, you know, 300 plus cities to almost millions of geohashes. That way you are able to redistribute the data and you can avoid the data skewness as well.
Demetrios [00:49:09]: I didn't realize that that was the same thing you were calling a hot shard. Yeah, I like that word now that you're saying it.
Rakesh Kumar [00:49:19]: So with data skew, there is this concentration of data on one particular node. And the problem, what happens, is that if there's a huge concentration on one node, then that node is called a hot node, a hot shard, because that is the one which is processing most of it, while the others sit mostly idle.
Demetrios [00:49:41]: Yeah, because I guess I was confusing it with the training/serving skew that you get with the data sometimes and how you can mitigate that. But it makes a lot of sense, this hot shard and data skewness, seeing the saturation on certain nodes because there's so much more traffic in these hot cities.
Rakesh Kumar [00:50:08]: Right.
Demetrios [00:50:09]: But then if you break it down to smaller bits and you have these packages where you're just looking at blocks, a square block radius, then it's a different problem that you're dealing with.
Rakesh Kumar [00:50:22]: Exactly. And you are redistributing the whole data set across all the different nodes.
Demetrios [00:50:27]: Yeah, that's a nice trick. I like that. What other tricks do you got for me? What else have you learned over the years?
Rakesh Kumar [00:50:35]: Yeah, so earlier I mentioned that we divided this into two or three different phases. In the very first phase, the POC, we tried to scale and we ran into some roadblocks; hot sharding was one of them. We addressed that and then we started creating multiple pipelines. Then we realized that, you know, it is kind of easy to build a pipeline, but as you increase the number of pipelines, you're also increasing the cost. And we looked closely into all the pipelines and realized most of them are doing very similar processing at the beginning of the pipeline, or there are a couple of operators which are doing something similar.
Rakesh Kumar [00:51:19]: We thought it could be abstracted out and centralized into one place, so that way you do that processing only once, and that allows you to save the cost. So what we did is we defined a two stage pipeline. One stage is for preprocessing all this data: all the data filtering and building a kind of state for the user. Earlier, each pipeline was generating one feature, and if you had to generate another feature, you had to build another pipeline. What we said is, we just combine all of them into one. So now we have one pipeline which processes the data and drops unnecessary data. It also builds a kind of state of the user, like whether that user has booked a ride, whether they got a ride or not, whether they dropped off.
Rakesh Kumar [00:52:15]: So we maintain the state of the user similarly for both passenger and driver. Based on that, we know the current state of the world, and we have all this metadata stored and emitted every minute or so. The downstream processing, which is basically aggregating all this data, can get all the relevant information and then aggregate it. The aggregation logic is different for different features, so you can have different pipelines. Does that make sense? So what I was saying is, whatever the common processing is, we extract it out and centralize it, but we keep separate pipelines where the processing logic is different. And that's how we were able to cut cost and centralize the processing.
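The two-stage split, in schematic form: one shared stage filters events and reduces them to a per-user state that is emitted on a cadence, and separate downstream stages apply feature-specific aggregation to that common output. Everything here, the state fields, the emit cadence, the feature logic, is an illustrative assumption rather than Lyft's actual pipeline.

```python
from collections import defaultdict

def preprocess(raw_events):
    """Stage 1 (shared): drop noise and reduce raw events to per-user state records."""
    state = defaultdict(dict)
    for e in raw_events:
        if e["type"] not in {"ride_requested", "ride_accepted", "drop_off"}:
            continue                                   # early drop of irrelevant events
        state[e["user_id"]].update(
            role=e["role"], last_event=e["type"], geohash=e["geohash"]
        )
    # In the real system this would be emitted every minute or so; here we just return it.
    return [{"user_id": uid, **s} for uid, s in state.items()]

def demand_feature(user_states):
    """Stage 2a: one feature-specific aggregation over the shared output."""
    counts = defaultdict(int)
    for s in user_states:
        if s["role"] == "passenger" and s["last_event"] == "ride_requested":
            counts[s["geohash"]] += 1
    return dict(counts)

def active_driver_feature(user_states):
    """Stage 2b: a different feature built on the same shared preprocessing."""
    return sum(1 for s in user_states if s["role"] == "driver")

states = preprocess([
    {"user_id": 1, "role": "passenger", "type": "ride_requested", "geohash": "9q8yy1"},
    {"user_id": 2, "role": "driver", "type": "ride_accepted", "geohash": "9q8yy1"},
    {"user_id": 3, "role": "passenger", "type": "app_opened", "geohash": "9q8yy1"},  # dropped
])
print(demand_feature(states), active_driver_feature(states))
```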
Rakesh Kumar [00:53:10]: The other thing we did: writing code used to take a long time. So we came up with YAML based pipeline generation, where you do not need to write code. You write a kind of YAML configuration file and it does the processing for you. That cut down the development time as well.
Demetrios [00:53:34]: Tell me more about that.
Rakesh Kumar [00:53:36]: So what we did is, we realized that if we have to build a new feature, most of the time we are doing very similar processing. Let's say you want to figure out whether this is a driver or a passenger. That is one function, right? So what we did is we defined all these common functionalities, we provide that SDK, and then, yeah.
Demetrios [00:54:05]: Okay, I get it. You would put it into the YAML as something that people could just use; they can declare it, and it's already defined. And then, wow, okay.
Rakesh Kumar [00:54:17]: So we said, okay, you know, I do not want our data scientists to have to understand the whole thing, like how to write a pipeline. I'll define a very simple way of defining a pipeline. I will create all this general purpose functionality for you, and then you can just declare in a YAML file how you want to process. So they do not need to understand how to define windowing, how windowing works and all that. They just need to say, I need a 1 minute long window, or 2 minutes, or whatever, and define that in the YAML, and our system will read that YAML file, look at how you have defined the whole pipeline, and build the pipeline for you. Yeah, that cut down our development time.
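A toy version of that declarative idea: the feature author writes a short YAML description, and the platform maps each declared step onto a pre-built operator. The YAML schema and the operator registry below are invented for illustration; Lyft's actual format is not public.

```python
import yaml  # PyYAML

# A hypothetical feature definition a data scientist might write.
SPEC = """
feature: demand_per_geohash
window_seconds: 60
steps:
  - op: filter
    arg: ride_requested
  - op: key_by
    arg: geohash
  - op: count
"""

# Platform-provided operators; feature authors never implement these themselves.
OPERATORS = {
    "filter": lambda events, t: [e for e in events if e["type"] == t],
    "key_by": lambda events, field: [(e[field], e) for e in events],
    "count": lambda pairs, _: {k: sum(1 for kk, _e in pairs if kk == k) for k, _ in pairs},
}

def build_and_run(spec_text, events):
    """Read the declarative spec and chain the registered operators over the events."""
    spec = yaml.safe_load(spec_text)
    data = events
    for step in spec["steps"]:
        data = OPERATORS[step["op"]](data, step.get("arg"))
    return spec["feature"], data

events = [
    {"type": "ride_requested", "geohash": "9q8yy1"},
    {"type": "ride_requested", "geohash": "9q8yy1"},
    {"type": "drop_off", "geohash": "9q8yy1"},
]
print(build_and_run(SPEC, events))
# ('demand_per_geohash', {'9q8yy1': 2})
```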
Demetrios [00:55:09]: Now, you did mention that certain pipelines have five different features being created with them. You're getting like two for the price of one on certain pipelines. Is it that they have certain steps that they're doing together, like there are these certain functions, so you can bundle them up, but eventually they diverge?
Rakesh Kumar [00:55:35]: Yes, that's right. The common functionalities are kind of merged and centralized; the specialized ones still branch out, and that's how we are able to save. Earlier all these pipelines were kind of independent, but, you know, we were wasting our resources. We said, okay, let's combine the common pieces and centralize them.
Demetrios [00:55:57]: Dude, this has been awesome. Is there anything else that you want to hit on that we didn't talk about?
Rakesh Kumar [00:56:02]: I think we have covered most of it, but yeah, I would like, if you have any other questions, let me know.
Demetrios [00:56:08]: I'm just so glad you didn't say, oh, we found a way to cut down on our coding time. I thought for sure you were going to say, we now use Cursor and AI code generation. And I was like, please don't say that.
Rakesh Kumar [00:56:22]: Some of them are like pre-Cursor as well. So when we initially introduced the pipeline, right, we were able to cut down almost 60% of the code base.
Demetrios [00:56:33]: Wow.
Rakesh Kumar [00:56:34]: Because of the streaming. Streaming provides all these, what do you call, operators, functionality where you do not need to write that code. Let's say you want to join anything, you can just write one operator and it does that work for you. It hides all the implementation details from you. So it's pretty simple. Then we went further ahead and said, okay, I do not even want to define all these operators. What is the easiest way to do it? And we came up with a declarative YAML based pipeline, where for one operator you just need to declare a one line configuration. That's all.
Demetrios [00:57:14]: That's so smart.
Rakesh Kumar [00:57:15]: That pipeline definition is not more than 20 or 30 lines. The rest, the processing, is hidden by our platform; it is completely abstracted out, so they do not need to worry about it. So now you can think about it: where we had thousands and thousands of lines of code for processing one feature, we cut it down to almost 30 or 40 lines of YAML based configuration.
Demetrios [00:57:46]: That's why you get paid the big bucks. I'll tell you what, that is a genius implementation. I like it.