Building Data Infrastructure at Scale for AI/ML with Open Data Lakehouses // Vinoth Chandar // DE4AI
Vinoth Chandar is the creator and PMC chair of the Apache Hudi project, a seasoned distributed systems/database engineer, and a dedicated entrepreneur. During his time at Uber, he created Hudi, which pioneered transactional data lakes as we know them today, as a way to solve the unique speed and scale needs of Uber's massive data platform. Vinoth has deep experience with databases, distributed systems, and data systems at planet scale, strengthened through his work at Oracle, LinkedIn, Uber, and Confluent. Most recently, Vinoth founded Onehouse - a cloud-native managed lakehouse to make data lakes easier, faster, and cheaper.
Data engineers love to solve interesting new problems. Sometimes an existing off-the-shelf tool will suffice; sometimes we have to get creative and come up with new ways to build with our existing toolkit. And, perhaps most rewarding, some use cases call for us to develop something completely new that takes on a life of its own - see Apache Spark, Apache Kafka, and the entire data lakehouse category for somewhat recent examples. AI and ML engineers find themselves at these crossroads all the time. In this keynote, we will explore how a data lakehouse architecture with Apache Hudi is being used to support real-world predictive ML and vector-based AI use cases across organizations such as NielsenIQ, Notion, and Uber. We'll explore how a data lakehouse can be used to ingest data with minute-level freshness and provide a single source of truth for all of an organization's structured and unstructured data. We'll show how the lakehouse can be used for feature engineering, to generate accurate training datasets, and to produce production features. We'll further explain the role of the lakehouse for GenAI use cases, allowing organizations to operate vector generation pipelines at scale and integrate with vector databases for real-time vector serving.
Link to Presentation Deck: https://docs.google.com/presentation/d/1O5n6cjR0CfbUZA4MEhC_we7v25iaKV5nYQfxktLRoTY/edit?usp=drive_link
Demetrios [00:00:02]: Son of a batch. That was fast. And so I'm rocking the Onehouse gear, you know. I know you've got a lot of stuff for us. I'm gonna bring your talk up onto the screen, and I will be back in like 25 minutes.
Vinoth Chandar [00:00:19]: All right, perfect. Thank you. All right, hello, everyone. I think after that cool, cool video, I'm going to strike a pretty serious note here, which is: infrastructure is often invisible and taken for granted, but it is the foundation for anything of value that we ultimately build. Welcome to today's keynote, where we'll discuss the underlying data infrastructure that we all need to be successful at our ML/AI projects. To quickly introduce myself, I'm Vinoth. I currently work at a startup I founded, Onehouse, bringing truly open and, as Demetrios' shirt said, wicked fast
Vinoth Chandar [00:01:04]: data lakehouses to everyone. My background is largely in distributed systems, databases, and data lakes. I have been part of the data platform teams at Uber and LinkedIn, where we built some of the earliest lakehouse tech, Apache Hudi - we built the world's first data lakehouse - and I'm continuing to work on areas like Apache XTable to bring more interoperability to the space. And with that, let's dive into the talk. This talk is broken up into sections, each progressively going deeper into specifics. But let's start by just broadly exploring where things are today.
Vinoth Chandar [00:01:46]: This diagram is about a year old, but I think it still does a pretty good job of capturing the different stages of adoption for all these technology pieces that you hear about around AI/ML. As you can see, GenAI, for example, is in the innovator and early adopter groups, and it's rising through the ranks pretty fast. The key takeaway is that there is a huge number of supporting technologies for these different use cases. So it's going to be very important to consider which technologies you have and fully understand them, what skill sets your team has to do stuff with these technologies, and what you need to acquire and learn to build out AI products. So why does all this diligence help us? Because we've been there; we've seen other technologies go through these adoption curves. There is usually the innovation trigger, and then we all talk about it for a while - a lot of awesome conversations on what could be possible. Then we generally go through a trough when we try to actually make things work in production.
Vinoth Chandar [00:02:58]: Right? And this trough will come for AI too. It will be no different. And the main goal for today is to leave you with some thoughts on, hey, let's peek around the corner, make some decisions, and get some of them right early on, so that we can breeze through this instead of riding an emotional roller coaster, if you will. So what are some of these real challenges that we're facing on the ground? As somebody who interacts with companies building data platforms on a regular basis, I see a lot of companies struggle when they try to push some of these projects into production. The data platform basics still matter a lot: the models are only as good as your data, and investing in data quality, governance, and storage really, really matters. And the second thing is scale. Scale is the new norm.
Vinoth Chandar [00:03:55]: And there are new data volumes - data that wasn't previously consumed in a certain fashion is now all being consumed to explore new use cases. So you need to really think through scale very early in your vision today. And with most of these tools - if you start from something as simple as ingestion - you can see that most tools out in the market don't even scale to terabyte-level datasets. So scale is the new norm, and there's a bunch of people struggling with that. And I need not tell you, but you already know: these models take a lot of money to build, train, and fine-tune. So training on bad data, and not catching some of these things early, can burn a serious hole in our budgets. Imagine burning a million dollars training something and then later figuring out it wasn't the right dataset. So the data piece is going to be super, super important. And obviously there are a lot of moving parts.
Vinoth Chandar [00:04:55]: The ecosystem is still maturing, and we probably need new tools that we don't have today, which will all be built in the next 12, 18, 24 months. And finally, the billion dollar question: how can we safely and ethically train AI? This question needs to be answered before we can fly into the future and swim with the droids, if you will. So with these challenges in mind, let's zoom into just one aspect in this talk, which is the underlying data platform, and how the choices you make there can make your life simpler. This is a pretty short talk, so I'm going to cut right to the chase. These are some battle-tested design principles that I've seen companies succeed with. For one, get your data - structured or unstructured or what have you - onto cloud storage. Make that your source of truth, and make that data interoperable with any warehouse, lake engine, or ML/AI framework - anything. Remember, data is the only permanent thing. These compute frameworks will come and go - there will be new ones that evolve - and you really need this so that your data remains future-proof.
Vinoth Chandar [00:06:11]: Number two goes hand in hand: use products that are built on open source or cloud-agnostic components. Lock-in is the last thing you want in a market that is evolving this fast. Data has inertia. Imagine having large volumes of data in a system where you can't get it in front of the newest engine on the market - that's going to be a serious setback for your company. So store data in open data formats: open table formats as well as open file formats. All of these are broadly adopted and widely supported, and they help you sleep better at night. Then there's scale and cost. As I mentioned, the thing about vector searches and all these ML use cases is that you have to do a lot of analysis - go over your data, compare records against each other, and all of that. So this needs serious scale.
Vinoth Chandar [00:07:15]: So ensure that your data platform can scale to these volumes in a reasonable way - it shouldn't scale non-linearly in cost as your data scales. And businesses operate in real time, so your product should as well. Try to get data flowing through your different processes incrementally and efficiently. Use technologies like Kafka, Hudi, Spark, and Flink - all of them do a great job at getting your data processed and sent to the different stages of your pipeline pretty quickly. Principles three and four can become serious bottlenecks for your project if you don't size them up well. I know some of this can be abstract, so to help visualize it in the flesh, this is what it could look like: you can have a lot of files and data lying around in cloud storage, or applications producing data.
Vinoth Chandar [00:08:06]: Databases obviously have data. So a sample architecture could be: extract CDC from these databases, or have applications submit events to messaging stores or streaming platforms, and ingest all of this, along with file data, on top of cloud storage like I mentioned. Then you can expose this to various tools: you can do your regular BI, you can also do AI, and you can use data engineering tools to transform data in between.
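A minimal sketch of one leg of that architecture, assuming PySpark with the Hudi Spark bundle on the classpath; the Kafka topic, schema, record keys, and bucket paths below are all hypothetical:

```python
# Hypothetical pipeline: stream CDC events from Kafka into a Hudi table
# on cloud object storage, so downstream BI/AI tools can query one copy.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (SparkSession.builder
         .appName("cdc-to-lakehouse")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Shape of the (hypothetical) CDC payload.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders_cdc")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

# Upserts are keyed on order_id; updated_at breaks ties between edits.
(events.writeStream
 .format("hudi")
 .option("hoodie.table.name", "orders")
 .option("hoodie.datasource.write.recordkey.field", "order_id")
 .option("hoodie.datasource.write.precombine.field", "updated_at")
 .option("checkpointLocation", "s3://bucket/checkpoints/orders")
 .outputMode("append")
 .start("s3://bucket/lakehouse/orders"))
```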
Vinoth Chandar [00:08:56]: And we're not limited to just these tools. In the setup I just showed you, most of the tools work across a wide spectrum. In the data market, the aspirational goal for every vendor is to be good at everything, but in reality there are certain workloads that each engine or tool is well suited for. Most of these tools can directly query the lakehouse today, or have very simple reverse ETL solutions to move data from the lakehouse into them. But the data remaining in the lakehouse preserves optionality for you, lowers your cost, and keeps your data free. All that said, there is some significant work remaining, and gaps as well. These are very foundational: converging structured and unstructured data storage and supporting some of these emerging AI use cases really needs better lakehouses. A lot of the work so far has been around structured data, and extending these lakehouse benefits to unstructured data is going to be game changing. And second, the existing columnar file formats are not really well suited for storing large objects.
Vinoth Chandar [00:09:53]: We need better file formats for serving short lookups - point lookups, model serving, indexing, these kinds of things. So these are emerging, and you can see there is foundational innovation happening in the space around this as well. And finally, a lot of people are using vector databases today to build their RAG applications. This is great for use cases that require low latency, but there are plenty of use cases that are not very latency sensitive. For example, generating a sales report with an AI summary of product reviews a few times a day can just be done on the lakehouse. So these are some areas where I think there are gaps being addressed in the market, and you'll see them come to life over the next few months.
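To ground that last batch example, here is a minimal sketch of a latency-insensitive similarity search run directly on the lakehouse, assuming a Hudi table that already stores one embedding per review; the path, columns, and query vector are hypothetical:

```python
# Score every stored embedding against a query vector with Spark, with
# no online vector database in the path - fine for a few-times-a-day job.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

spark = SparkSession.builder.appName("batch-vector-search").getOrCreate()

reviews = spark.read.format("hudi").load("s3://bucket/lakehouse/review_embeddings")

# Embedding of the summary question; with unit-normalized vectors,
# the dot product below equals cosine similarity.
query_vec = [0.1, 0.3, -0.2, 0.7]

@F.udf(returnType=FloatType())
def score(vec):
    return float(sum(a * b for a, b in zip(vec, query_vec)))

top = (reviews
       .withColumn("similarity", score("embedding"))
       .orderBy(F.desc("similarity"))
       .limit(20))
top.select("review_id", "similarity").show()
```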
Vinoth Chandar [00:11:03]: Now let's look at the lakehouse through an ML and AI lens and relate it to some real-world examples - because we've been talking about the data platform in the abstract, and relating it to real-world use cases helps you appreciate it. Let's start with ML. Many of you are already familiar with these use cases. Feature engineering: we join a ton of data from different sources, transform it, and build features. Then, once these features are built, we train models against training data. Finally, we serve the models, where you feed them real-world input and make predictions - for example, at Uber, a new rider requests a trip and the model predicts an ETA, stuff like that. I've seen the data lake model do wonders for these workloads, going all the way back even to 2010, 2011, when we built all these ML-driven recommendation products you may know - recommendations of jobs, content, all of these things - on the lake, basically.
Vinoth Chandar [00:12:00]: The great thing about using a data lakehouse model for this is scalability. It comes with lakehouse economics: it brings you the cheaper storage and the most horizontally scalable compute and storage, so you can crunch a lot of data very cost-effectively and bring the best tools to your data. In a rapidly evolving space like this, where innovation is happening, you need new tools - the ability to build where there's a gap, as well as to bring in new tools and try them out on your data. This model is really great for that, because it's built on so much open source software. Then there's the centralized data repository: a single source of truth for your data that you can go back to and retrain models from. If you get some downstream model wrong, you can go rebuild it, rescore it, all of that.
Vinoth Chandar [00:12:59]: And it ensures data quality by design, because you can streamline your data architecture in a way that keeps bad data from getting into your models. A more recent case study comes from Uber, where back in the day we built a really large enterprise data lake that powered a lot of the feature engineering behind products like Uber Eats, like I mentioned. As the data changes, we need to keep the models up to date, and that's actually one of the reasons we built the Apache Hudi project: to help us build these incrementally. As the data changes, we're not rereading and rescanning the entire dataset; as our tables change, we can efficiently compute the changes to the features in an incremental fashion.
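A minimal sketch of that incremental pattern with PySpark and Hudi; the table path, checkpointed commit time, and feature logic here are hypothetical:

```python
# Instead of rescanning the whole table, read only the records that
# changed since the last commit we processed, and recompute features
# from just that slice.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-features").getOrCreate()

# Commit time checkpointed by the previous feature-engineering run.
last_commit = "20240601120000000"

changes = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", last_commit)
           .load("s3://bucket/lakehouse/trips"))

# Refresh features only for the riders whose trips actually changed.
rider_features = (changes.groupBy("rider_id")
                  .agg(F.count("*").alias("recent_trip_count"),
                       F.avg("trip_minutes").alias("avg_trip_minutes")))

rider_features.write.mode("append").parquet("s3://bucket/features/rider")
```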
Vinoth Chandar [00:14:15]: Now let's dig into the current crowd favorite, I guess, which is GenAI. I'm sure many of you are users of some of these use cases already, like content generation, and a lot of us are probably using GPT and the like. Obviously, at the core of it is a search problem, as you all know. And there are a lot of interesting applications being built using this, ranging from virtual assistants to content generation tools - I think you can even make a movie out of this technology today. So let's actually spend more time on the technical underpinnings. What does the outline of a typical GenAI app look like? Core to understanding this are these things called vector embeddings. It's a pretty simple concept: you have a lot of data, and you map each piece to a vector - a sequence of numbers in an n-dimensional space - by talking to an embedding model. Then you have lots of data with their own embeddings, and you use some similarity search to find relevant objects. Then all your prompt engineering kicks in, and you can build your application. The key component of this is a vector database, which can be used to store and serve these vectors.
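A minimal, self-contained sketch of those two steps - embed, then rank by similarity. The embed() function below is a random stand-in, just to make the flow runnable; a real application would call an actual embedding model instead:

```python
import numpy as np

def embed(texts):
    # Random stand-in for an embedding model call: one vector per text.
    # With a real model, similar texts would land near each other.
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.standard_normal((len(texts), 384))

docs = ["refund policy", "shipping times", "warranty terms"]
doc_vecs = embed(docs)            # each doc becomes a point in 384-d space

query_vec = embed(["how long does delivery take?"])[0]

# Cosine similarity of the query against every stored vector; a vector
# database does the same thing at scale with approximate-NN indexes.
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))

# The most similar document feeds the prompt (arbitrary here, since the
# stand-in embeddings are random).
print(docs[int(np.argmax(sims))])
```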
Vinoth Chandar [00:15:32]: There are, again, some very cool use cases that can come out of this if you're building something like this today. Here are some of the common challenges we've seen people run into as organizations start to play with their RAG applications and then want to productionize them at the full scale of their enterprise data. First, you need to integrate this new data type - a vector embedding is basically a new data type - with the existing stack. Embeddings should be a native part of the data stack. You need to get all the same benefits of storage/compute separation, update management, and all of that.
Vinoth Chandar [00:16:15]: There is a bunch of work going on around that, and it's something we see people ask for a lot. Number two is the single source of truth that I keep talking about. Many companies currently use a lot of different vector databases - if you have a lot of data, you have different instances of these vector databases, which is fine. But these vector databases are pretty much like your old OLTP databases: they're purpose-built systems for low-latency querying, so they're not interoperable with each other, and you can't fit all of your data into them. So many companies end up slicing and dicing, with a hodgepodge of data movement of these vectors across different vector databases. But vector embeddings need the same interoperability freedom as the other parts of your data. So the challenge is around decoupling the data pipeline components and the serving components in the architecture.
Vinoth Chandar [00:17:17]: By that, what I really mean is: there's a part where we generate embeddings from existing data - that's a pipeline problem. And the serving problem is serving these searches in real time once the application is built. These two can actually be decoupled, with the data lakehouse sitting in between. There are use cases where this is not the optimal choice, and it can add complexity, but our view is that the lakehouse can complement a downstream vector database really well. Here's a great example of freeing your data from a single-purpose warehouse into a multipurpose lakehouse: you have a lot of data sitting in your Postgres; you move it to an open data lakehouse pretty quickly using the incremental technologies I talked about, and you can serve both analytics and AI, and even have a downstream vector database to actually power your applications. This gives you a lot of flexibility and value in how you manage your data, along with some very tangible cost savings, latency reductions, and all kinds of goodness.
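A minimal sketch of that decoupling, assuming PySpark with Hudi: generate embeddings as a batch pipeline over a lakehouse table, keep the lakehouse copy as the source of truth, and push only what serving needs to a vector database. The paths, the placeholder UDF, and the sync step are all hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("embedding-pipeline").getOrCreate()

# Placeholder only: a real pipeline would call an embedding model here.
@F.udf(returnType=ArrayType(FloatType()))
def embed_udf(text):
    return [float(ord(c)) for c in (text or "")[:8]]

reviews = spark.read.format("hudi").load("s3://bucket/lakehouse/reviews")

vectors = reviews.select(
    "review_id",
    embed_udf("review_text").alias("embedding"),
    F.current_timestamp().alias("updated_at"))

# Pipeline side: the lakehouse table is the source of truth for vectors.
(vectors.write.format("hudi")
 .option("hoodie.table.name", "review_embeddings")
 .option("hoodie.datasource.write.recordkey.field", "review_id")
 .option("hoodie.datasource.write.precombine.field", "updated_at")
 .mode("append")
 .save("s3://bucket/lakehouse/review_embeddings"))

# Serving side: push only the hot slice to the vector database. The
# client call below is illustrative, not any specific product's API.
for row in vectors.limit(1000).toLocalIterator():
    pass  # vector_db.upsert(id=row.review_id, vector=row.embedding)
```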
Vinoth Chandar [00:18:06]: But there are also companies out there building things that are a little bit ahead of the curve, and this can serve as a leading indicator for the challenges we need to solve going forward. We sat down with Kaushik from NielsenIQ, where they were essentially trying to run text-search-based applications on vector databases, and they found that it didn't scale cost-effectively. Then they had this idea: hey, what if we run this directly on a data lakehouse? So they built these datasets in the data lakehouse and used both Apache Hudi and Delta Lake to run some queries. And they actually got pretty good latency numbers with much reduced cost.
Vinoth Chandar [00:19:31]: Right. And there's probably a lot of work we can do on the lakehouse front to continuously reduce this, but these indeed are some leading indicators to watch, to see how we can make these kinds of applications easier to build. And please check out the webinar if this interests you - it's up there. We've been constantly building to make these simpler to achieve as well. To that end, we packaged the architecture I just talked about into a simple managed pipeline that can automatically generate these embeddings from any of your data sources.
Vinoth Chandar [00:20:09]: And you can store them on the lakehouse and get all of this data management taken care of in a very simple way. Finally, in summary - I'm not into catchphrases, but this seems important - the takeaway is: the lakehouse is your bakehouse for AI/ML. If you have reasonable, even medium-scale data, you should consider making it the foundation of your data infrastructure. And if you're building for AI and ML, please focus on your data as well as the underlying infra. Build on cloud object storage. Build an open data architecture: make sure your data can be brought to any engine, any compute framework that you want. Deliver data faster.
Vinoth Chandar [00:21:03]: It generally helps your entire organization be more productive, because when things change, it gets reflected very quickly. And consider cost and scale very early in the pipeline: use incremental processing and scale your processing needs as you go. And with that, we're at the end of our talk. Thanks for tuning in, and let me see if we have some questions.
Demetrios [00:21:34]: Oh yeah, I'm sure there will be. I have one big one before we jump into the chat, because it's about 20 seconds behind real life. You mentioned - there was that funny GIF, or the slide, about how everybody just saying the words "AI" gives you budget today, and then tomorrow it's going to be scrutinized. Are there things that you have seen that can really save money when dealing with AI applications and data lakes?
Vinoth Chandar [00:22:09]: Yeah. A key thing that we see is the cost of vector embedding generation. Let's take a quick example: you have a note-taking application that people are constantly writing into, and every hour there are, let's say, a thousand edits to a document. You can't make a thousand embedding calls, one for each change to a data object, right? So having lakehouse technologies that are intelligent about giving you a point-in-time view of the object you want to embed is really important for saving these embedding generation costs. For example, with Apache Hudi or Onehouse, the way we do it is you can have your edits go live into an online operational database.
Vinoth Chandar [00:23:12]: Then you can run a pipeline that gives you a single copy: you can say, give me the latest state of this object after this point in time, and it gives you that aggregated copy. So you make one call versus hundreds or tens of thousands, and that saves you a lot in terms of your OpenAI costs, or your own embedding model costs, and stuff like that.
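A minimal sketch of that pattern with Hudi's incremental query; the commit time, table path, fields, and embedding client are hypothetical:

```python
# Hudi merges updates on the record key, so an incremental read returns
# one consolidated latest copy per changed document - one embedding call
# per document instead of one per edit.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("embedding-cost-saver").getOrCreate()

last_run = "20240601120000000"  # commit time of the previous embedding run

changed_docs = (spark.read.format("hudi")
                .option("hoodie.datasource.query.type", "incremental")
                .option("hoodie.datasource.read.begin.instanttime", last_run)
                .load("s3://bucket/lakehouse/documents"))

# Thousands of edits per hour collapse to one row per changed document.
for row in changed_docs.select("doc_id", "body").toLocalIterator():
    pass  # embedding = embedding_model.embed(row.body)  # hypothetical client
```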
Vinoth Chandar [00:24:00]: The other part we see is around the amount of data that you keep in an online vector database. Right now there's no good store for it, so people keep a lot of data in an online vector database, which means you need a much bigger database instance. You can actually offload a good chunk of it - the part you're not actively using - into a lakehouse. That should shrink the cluster sizes and the number of active databases you need, and can also save you a ton of cost.
Demetrios [00:24:09]: Nice. So another question coming through the chat: I'm interested in understanding the difference between Databricks' lakehouse architecture versus Onehouse. What are the pros and cons of each?
Vinoth Chandar [00:24:24]: Yeah, that's a pretty awesome question. If you look at the original lakehouse paper from Databricks, it essentially says: Spark is a unified engine, and there is this Delta Lake as the underlying storage. The angle they took on the lakehouse is that you can do BI, ML, and data science on a single system. Onehouse actually broadens this - and it starts not just from Onehouse, but from how we created Hudi at Uber back in the day. We believe in a world where the storage has multiple options as well. That's why we support Hudi,
Vinoth Chandar [00:25:07]: but we also interop with Delta and Iceberg. And at the compute level, we think the warehouses are still good at BI, maybe Flink is great at stream processing, Spark is obviously amazing at ETL and data science notebooks and whatnot, and there is Ray. So ours is a broader definition of a data lakehouse, where we view it as a clean separation of data, with purpose-built engines for each workload on top. I would refer you back to that slide about the different workload classes and the complexity in these systems - you quickly realize there is actually a much broader view that we need here. There are a lot of technical differences too, but if you have to distill it down, that's the philosophical difference between the two approaches.
Demetrios [00:26:08]: Another one coming through. Would these large-scale use cases for data lakehouses apply only to Fortune 500 enterprises? Any thoughts on open lakehouses for small or medium-sized businesses?
Vinoth Chandar [00:26:24]: Great. Great one as well. This is actually at the heart of why we founded Onehouse. This question comes from the fact that, historically, the companies that built data lakes were big companies who had a lot of data and did data science and ML ahead of the curve - Facebook, LinkedIn. So the world had data lakes before. But look at the world today: everybody's building data science, ML, and AI. So you need one. That's the first reason.
Vinoth Chandar [00:26:55]: So if your company is going to do data science and ML/AI - and right now you can see it's not restricted to just the Fortune 500; it's a common, mainstream thing that we do as an industry - then you need one. Number two, the main impediment to actually getting a data lakehouse is that you have to tinker with and integrate a lot of open source projects to get it up and running. It's not like the four clicks it takes to get up and running on a Snowflake or BigQuery. This is the big gap we saw in the market when we started Onehouse: there's this bipolar disorder in the market where you either get a warehouse with proprietary technology and ease of use, or you get scale, support for multiple use cases, open data, and all of that, but you need to go build it yourself. So this is basically where we're trying to bring that ease of use as a managed service.
Vinoth Chandar [00:28:00]: And the goal is that you should be able to pick a data lakehouse as the default data architecture. Even if you look at what execs from the large data vendors or cloud providers are saying today, they're all using different terms - some may say table format, some may say lakehouse; these are all marketing things - but at the end of the day, everybody is aligning and agreeing that this is the most robust data infrastructure, the underlying data platform choice, that you can be making today. It's just that we started with very little platformization around the data lakehouse, so it kind of feels like you need an army of really well-trained Navy SEALs to run one. And we're trying to do our bit to change that.
Demetrios [00:28:52]: Incredible. Well, thank you, man. Thank you so much. There are a few more questions that came through here, but we've got to keep rolling - I'm the timekeeper today. So I really appreciate you coming on here and doing this. Hopefully you enjoyed my video at the beginning.
Demetrios [00:29:08]: We are going to drop your slides into the description once we put this live so everybody can go and see them. If you want to share the link with me before then, I can drop it in the chat too, because a bunch of people are asking for it. So thank you, and we'll keep it cruising.
Vinoth Chandar [00:29:29]: All right, sounds good. And I'll go check out the other questions from the chat as well. All right. Thanks for having me. This is fun.