
The Daft distributed Python data engine: multimodal data curation at any scale // Jay Chia // DE4AI

Posted Sep 17, 2024 | Views 390
Jay Chia
Cofounder @ Eventual

Jay is based in San Francisco and graduated from Cornell University where he did research in deep learning and computational biology. He has worked in ML Infrastructure across biotech (Freenome) and autonomous driving (Lyft L5), building large-scale data and computing platforms for diverse industries. Jay is now a maintainer of Daft: the distributed Python Dataframe for complex data.

SUMMARY

It's 2024, but data curation for ML/AI is still incredibly hard. Daft is an open-source Python data engine that changes that paradigm by focusing on the three fundamental needs of any ML/AI data platform: (1) ETL at terabyte+ scale, with steps that require complex model batch inference or algorithms that can only be expressed in custom Python; (2) analytics involving multimodal datatypes such as images and tensors, but with SQL as the language of choice; and (3) dataloading: performant streaming transport and processing of data from cloud storage into your GPUs for model training/inference. In this talk, we explore how other tools fall short of delivering on these hard requirements for any ML/AI data platform. We then showcase a full example of using the Daft Dataframe and simple open file formats such as JSON/Parquet to build out a highly performant data platform, all in your own cloud and on your own data!

TRANSCRIPT

Demetrios [00:00:09]: What up, Jay?

Jay Chia [00:00:09]: How you doing? Good, good. Great to be here again, man.

Demetrios [00:00:15]: It's great to have you. I'm excited for your talk. I know that we're running a little behind, so I'm just gonna, like, get right into it. And we can ask all kinds of questions when you're done. So if you want to share your screen, I'll throw it up here on this stage, and then I'll get out of here.

Jay Chia [00:00:32]: Okay, sounds good.

Demetrios [00:00:33]: Yeah.

Jay Chia [00:00:33]: I'll try to leave more time for questions at the end so that everyone has the opportunity to ask them. But, yeah, so very quickly, my name is Jay. Today I'll be talking about multimodal data curation at any scale. You know, it's 2024, and somehow we as an industry just haven't yet really cracked multimodal data curation, just like data curation in general. Everyone's trying to use Spark. You have all these new tools coming out, you have all these DAG frameworks. You're like, I have BigQuery, I have Snowflake.

Jay Chia [00:01:03]: How do I even think about this? Today I'm going to present, I think, what is probably the simplest possible solution, and it works at ridiculous scales, in the hundreds of terabytes. You never really have to worry about this not scaling for your workload. And also, it's so simple: all it uses is S3 and Daft. So, a little bit of introduction before I keep going. My background: I'm a co-founder at Eventual, where we are maintaining Daft. Prior to this, I was a software lead at Freenome, building out machine learning systems for biotech. And then I was a software lead at Lyft Level 5, where I was building out machine learning data systems again, this time for images and lidar and point clouds and radar and video.

Jay Chia [00:01:50]: So I've been working with multimodal data types for a while, and weird file formats, both in bio and also in self-driving. And that kind of motivated us to build Daft, right? The idea here is: hey, we all know and love our data systems like pandas and data frames with rows and columns. Why isn't it as easy to work with machine learning data types like tensors and images and, you know, HTML and PDFs? Why isn't it as easy to work with these data types as it is with our tabular data types? And that's really the goal and vision of Daft: to make it really easy for you to run this stuff at scale. So, zooming out a little bit to set context for the talk. Where are we? This is the end-to-end ML/AI lifecycle, right? I don't think anything really has changed. The name has changed. People now call it AI instead of ML.

Jay Chia [00:02:40]: But I actually took this illustration from Character AI, and they do LLMs, right? And it's exactly the same as we've seen across the whole ML, MLOps, and now AI space, where, step one, you're crawling for data. So that means that you're going out on the Internet, pulling all the data down, dumping it somewhere. You're pulling user information, dumping it somewhere. Step two, and that's what we're talking about today: you're taking all of that raw, dirty, unstructured data, and then you're running it through data curation, or data engineering, the namesake of the conference. The goal of data curation here is to eventually have a dataset you can feed into training. Then, after training, you want to evaluate the model. Once you're done evaluating, you can use those insights to complete the flywheel: go back and start crawling for more, higher quality data, and then feed that back into data curation and keep going.

Jay Chia [00:03:38]: And that's the flywheel of machine learning and AI; that really hasn't changed all that much. And for some reason, we as an industry just really haven't cracked this data curation portion, which is the main pain point. When you hear people talking about how data is most of the problems in ML, that's what they're talking about. So, zooming into data curation, what do we really need here? It's really simple. I think it boils down to three things. Number one, you want to ETL data: take raw data, extract it, transform it, and then load it back in somewhere else. Your transformation step, maybe you're running a model, you're running some filtering, you're joining tables together. Pretty simple stuff, with the added, I guess, complexity that you're now working with all these multimodal data types, and maybe you're running models.

Jay Chia [00:04:28]: Number two, you want to run analytics. SQL is still king here. You're still going to be running SQL over your tables, trying to understand: for all my images, how many of them are just facing empty roads, how many of them have pedestrians in them? What's the average number of pedestrians in my images? These are extremely important and oftentimes overlooked in data curation. And number three, data loading. This is something, again, that we as an industry haven't really cracked. I'm going to show you today that Daft is really, really good at all three. But you need this: the final data that you have from data curation needs to eventually be loaded into training, and this needs to be extremely, extremely performant in order to make good use of

Jay Chia [00:05:11]: all those GPU dollars that you're spending. This, I think, is really the holy trinity of data curation: you need to be able to do ETL, analytics, and data loading. And so today I'm going to present to you what I think is the 80% ML stack. All you need is Parquet stored in S3, and then you have Daft, or some sort of query engine. Daft here, I think, I'm presenting as the query engine because it has a whole bunch of features that I think make it really well suited for machine learning. But the beauty is, if you have just Parquet and S3, so many tools work with it, it's compatible with so much of the existing data stack that you never have to be worried about being locked in. But you also get all the benefits that you need for machine learning already.

Jay Chia [00:05:57]: So diving a little bit deeper, Parquet in S3, what do you get? You get this infinitely scalable cloud storage. You're never going to kill Amazon by dumping all that data in. Amazon's going to be fine, don't worry. Just dump all the data in, they're good. You get the benefit of open file formats like Parquet, super well integrated with the rest of the ecosystem. Spark can read it, DuckDB can read it, pandas can read it, PyArrow can read it, and so you're never really locked in with an open file format like Parquet. And lastly, we actually recommend that people just store images as URLs in the Parquet file and store the images themselves in native formats like JPEG. And that means you can pipe that straight to your browser; the browser can read JPEG.
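
As a rough sketch of what that 80% stack looks like in practice (the bucket and path below are made-up examples, not from the talk), reading such a Parquet file out of S3 with Daft could be as simple as:

```python
import daft

# Hypothetical bucket/prefix: the Parquet files carry tabular metadata plus
# image URL columns, while the images themselves sit in S3 as plain JPEGs.
df = daft.read_parquet("s3://my-bucket/curated/dataset/")
df.show(8)  # preview the first rows and their schema
```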

Jay Chia [00:06:42]: Again, this very powerful idea of an open file format, and encodings that have been built for images. So super simple, no need to overthink it, just do that. And then on the other end you have Daft, which can read all this data really efficiently and do all the operations that you might need in order to work with this data. Now, what that gives you is, all of a sudden, you have capabilities for ETL. You can read the Parquet files, or maybe even read JSON files, which we'll show a little bit later, do some transformations and filtering, and then dump it back as Parquet. That's ETL. You need to run a model on a column? That's ETL. Daft lets you do that. You get analytics.

Jay Chia [00:07:20]: We're actually actively building in SQL right now. It's about, I think, 50% to 60% complete in Daft. But yeah, you can run analytics over your data to understand and drill down into your data distributions. And lastly, you get data loading. Daft actually works extremely well with Parquet and URLs pointing to images, to get that really efficient data loading into training. You really don't need anything else from your stack. And that's it. The end. Thanks.

Jay Chia [00:07:48]: End of my talk. Well, just kidding. But it does cover 80% of what you need. The other 20% includes things you might be hearing about: table formats, catalogs. How do you do distributed training? You have feature stores, you have search and retrieval, vector stores, DAG frameworks. I'd argue, though, that that really covers the last-mile 20%. But if you want to get started and you want an architecture that scales really well, stick with something simple: do S3 and Parquet, and stick something like Daft on top of it.

Jay Chia [00:08:20]: This is the meme. If you are building an ML platform, you might be thinking Spark, Iceberg, TFRecords. I need a feature store. Let me make sure people can do pandas. Again: Daft and S3. Daft, S3. I think that's all you need.

Jay Chia [00:08:35]: With that, let's dive into the demo. I'm going to show you how we go from a dirty set of JSON files I pulled from an open dataset, how we can clean and curate that JSON file into nice Parquet and JPEGs, which are then super usable for everything: analytics, data loading, or just ETL. So I'm going to stop sharing and share a different pane. All right, here we go. So first, what does the raw data look like? It's disgusting. It's this JSON file that I pulled from the Internet. Here's the first line; you can't really tell what it is, because JSON, yes, it is human readable, but you can't really read it.

Jay Chia [00:09:18]: The first thing I'm going to do is just use Daft to read the JSON file and show what that looks like. This is what that JSON file looks like. Every row is basically one page that's been scraped. This is the MMC4 dataset; it's a multimodal dataset. Each URL is a web page. The web page contains a list of text, a list of UTF-8 text, and also a list of images inside that web page.
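
A minimal sketch of that first read step (the file path is hypothetical, not from the demo):

```python
import daft

# Hypothetical path to the raw MMC4-style dump; one JSON record per scraped web page.
df = daft.read_json("data/mmc4_shard.jsonl")

# Prints the first rows with their (nested) schema, which is how the
# text_list / image_info columns were inspected in the demo.
df.show(8)
```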

Jay Chia [00:09:45]: And so these are all the images, including a bunch of metadata like the image name, the raw URL of the image, which text index the dataset thinks matches each image, and then a similarity matrix of these images to every text and every text to every image. And that's the main dataset. I'm just using Daft to visualize it and show you what the dataset looks like, right? And it's just the first eight rows. And so Daft is this really powerful dataframe interface. It gives you this tabular interface of rows and columns. And using this interface, we're now going to see how we can use Daft to further process this data and make it essentially ML-platform ready, and give us that golden dataset that we need at the end. All right, so now that we have this JSON file on disk, let's go over to this notebook where we're going to do what I like to call data ingestion.

Jay Chia [00:10:39]: We're going to take all that JSON file data, do some cleaning, do some filtering, download the images, and then eventually get this nice Parquet representation with just JPEG images, which is beautiful. Let's make this baby purr. First we're going to again read that JSON file that we had. This is what that looked like. No difference. First thing we're going to do, I am going to try and make this flat. Instead of having every web page be one row, I want one image per row, so that I have a dataset of images instead of a dataset of web pages. So let's go ahead and do that.

Jay Chia [00:11:20]: We can do that in Daft using this explode command, which basically explodes both the image info and the similarity matrix into rows. And so now you notice that images, instead of being a list of images, is now just a single image per row, just a struct of all this stuff. This is a little bit hard to deal with, so I'm going to go ahead and splat that struct across columns. That's very easy to do as well. First, let's count how many rows there are. So there are approximately 14,000 images in this data frame now, 14,000 rows. And then I'm going to splat those images across the data frame.
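
A sketch of that explode step, continuing from the dataframe read above (column names follow the MMC4-style schema and are assumptions, not copied verbatim from the demo):

```python
from daft import col

# Turn one-row-per-web-page into one-row-per-image by exploding the parallel
# list columns together.
df = df.explode(col("image_info"), col("similarity_matrix"))

print(df.count_rows())  # roughly 14,000 rows in the demo
```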

Jay Chia [00:11:57]: So now, instead of having this ugly struct column, let me zoom in a little bit. Instead of having this ugly struct column, we now have the image name and raw URL as just pure columns. So here we have the raw URL, the image name, matched text index, and matched similarity for each image. Cool. Super simple. Right, now we have essentially one image per row. I'm going to go ahead and show you how easy it is to go from just raw URLs like that into images in Daft.
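
A sketch of "splatting" the struct into top-level columns; the struct field names here are assumptions based on what was shown on screen, and only the image fields are kept for brevity:

```python
from daft import col

# Pull the struct's fields out into plain columns.
df = df.select(
    col("image_info").struct.get("image_name").alias("image_name"),
    col("image_info").struct.get("raw_url").alias("raw_url"),
    col("image_info").struct.get("matched_text_index").alias("matched_text_index"),
    col("image_info").struct.get("matched_sim").alias("matched_sim"),
)
df.show(8)
```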

Jay Chia [00:12:30]: In Daft, you can actually instruct it: hey, take this raw URL column which we saw before, perform a URL download on it, and then decode that into images. It's like a single line of code. Super easy. You don't have to build any Python code yourself; Daft has baked in all this functionality for I/O and decoding for you already, which is super powerful. I'm going to run this. It's actually going to error out, because, as we see here, Daft is going to tell you: 404 Not Found, right? Because this dataset is super dirty.

Jay Chia [00:13:01]: All these files in the dataset, these URLs, a lot of them are actually 404 Not Found, because they were scraped a long time ago and the Internet changes. And so it's actually fairly difficult to do this task yourself if you were trying to do this data curation. With Daft, we actually added this functionality to tell Daft, hey, if an error occurs, just make it a null value. And also, for all the images, please convert them into RGB. And so if I do this instead, we'll see that Daft will now be able to run your URL download and image decode operation across the entire dataset, showing you the first eight rows. And voila. In, I guess, a few lines of code, we've gone from image URLs to raw images.
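
A sketch of that download-and-decode step, continuing from the dataframe above; the exact argument names reflect my reading of the Daft expression API and may differ slightly from the version shown in the demo:

```python
from daft import col

# Download each raw URL and decode the bytes into an image. Many of the
# scraped URLs 404, so failures become nulls instead of errors, and every
# image is normalized to RGB.
df = df.with_column(
    "image",
    col("raw_url")
    .url.download(on_error="null")
    .image.decode(on_error="null", mode="RGB"),
)
df.show(8)
```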

Jay Chia [00:13:45]: These are the actual bytes of the image, and we decode them into an actual image in Daft. All right, let's go a step further now and do a little bit of processing. That's the whole reason we're doing data curation. We're going to resize these images into thumbnails. I'm just going to resize them to 64 by 64. Let's show the results of that. So it's going to give us a few warnings: hey, some of these images are not found. But in the end it's going to have this new image column.

Jay Chia [00:14:15]: Notice the difference here. It's RGB, but Daft actually tells you it's 64 by 64. It's a fixed-size column now. This is really powerful, because we can actually store these thumbnails in the Parquet file. That's small, it's fine. Then what we'll do is pull the full-size images out of the Parquet file so that the Parquet files aren't too heavy. So we're going to go ahead and do that. But first, let's upload these images back into our storage as JPEG.
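
The thumbnail step is a one-liner in Daft; a sketch, continuing from the dataframe above:

```python
from daft import col

# Fixed-size 64x64 thumbnails, small enough to store inline in the Parquet file.
df = df.with_column("thumbnail", col("image").image.resize(64, 64))
df.show(8)
```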

Jay Chia [00:14:42]: Very simply, in Daft: encode the images as JPEG, and do a URL upload this time. And so now I'm stuffing these images back into my storage as JPEG and receiving a URL to those images. And lastly, take out the very heavyweight image column, and I'm going to write a Parquet file now. So what this Parquet file will look like is: it will contain all of these columns, including the thumbnail, including your new image URI, but without the expensive image column. And now you have this really, really nice dataset located over here in this Parquet file, and your data is now ingested. Let's take a look at what that data looks like very quickly. Yep, here is the data, as we expect. This is your curated dataset. And notice how every single row now is an image.
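
A sketch of the encode, upload, and write-out step; the bucket and prefixes are made-up examples, and the column names continue the assumptions above:

```python
from daft import col

# Re-encode full-size images as JPEG, upload them back to object storage, and
# keep only the returned URIs.
df = df.with_column(
    "image_uri",
    col("image").image.encode("JPEG").url.upload("s3://my-bucket/curated/images/"),
)

# Drop the heavyweight decoded-image column and write the curated Parquet dataset.
df = df.exclude("image")
df.write_parquet("s3://my-bucket/curated/dataset/")
```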

Jay Chia [00:15:34]: This is the thumbnail of that image, right? Beautiful. And also a link, a URL, to your raw data if you want it, your actual ingested JPEGs. And if you then want to do data loading on this, it's very simple. You can just select a couple of columns. You can even run filters, where the matched similarity score, in this case, is more than 0.2, for example. Show a couple of rows, and you're like, okay, I'm pretty happy with this. Let me start loading this into my training pipelines.

Jay Chia [00:16:06]: All you've got to do is iterate on the rows like that and then call next. And every time you call next, you get your arrays and any other columns that you requested. This is really powerful, because it gives you this abstraction not just for data processing, not just for analytics, but also for being able to go directly into training: pipe all that data into training, extremely high throughput, directly from the cloud. So that was it for the demo. Let me dive back to my slides and let's go to the conclusions. We saw that Daft is really useful for all these operations, but let me drill down to what makes it really, really good. Number one, Daft is Python native. If you're used to doing feature engineering and stuff in SQL and you're like, oh, I just want to run this Python library.
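
A sketch of that data loading pattern; paths and column names are illustrative, following the assumptions in the earlier snippets:

```python
import daft
from daft import col

# Select only what training needs, filter on the match score, and stream rows
# straight out of cloud storage into the training loop.
dataset = (
    daft.read_parquet("s3://my-bucket/curated/dataset/")
    .where(col("matched_sim") > 0.2)
    .select("image_uri", "thumbnail", "matched_text_index")
)

rows = dataset.iter_rows()
first = next(rows)  # each call to next() yields one row as a Python dict
```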

Jay Chia [00:16:56]: I just want to run this Python model. Yeah, with Daft that's super easy, because we support all these Python operations. It gives you a first-class Python experience. Number two, which I didn't get to show today, is that Daft actually scales to terabytes of multimodal data. We're used by companies like Together AI and Character AI. And what all these companies are doing is running on terabytes of multimodal data, trying to do data curation on all of it. And with Daft, all those workloads can be run on a single machine pretty efficiently, but can also be run on a cluster of machines if you need to scale to that level.

Jay Chia [00:17:35]: It's fast. And that means, as we all know, more iterations on your data means better models. You get to run that flywheel over and over again. It handles complex types: we saw images today, we also handle embeddings, URLs, tensors, and we are actively thinking about videos and more. Then, lastly, it's cloud native. It works really, really well with S3. And that's my argument here. All you really need, I think, is S3, an open file format like Parquet, and then Daft sitting on top of it, and you have pretty much all you need for 80% of your ML workloads.

Jay Chia [00:18:11]: And so that's it. You can find us at getdaft.io, and install it with pip install getdaft. And you know, if you have a use case that we don't already cover, come chat with us. We'd love to collaborate. And lastly, we're hiring. So if you're interested in some of this work, interested in building a platform around Daft, building out really efficient data systems, feel free to get in touch. And with that, I'll hand over back to Demetrios for questions.

Demetrios [00:18:41]: Nice. So there is a big question coming through here in the chat. Let me grab it real fast. Do you know if there's a timeline for Daft being added as an additional backend for Ibis?

Jay Chia [00:18:55]: Oh, interesting. Yeah, so there's an active thread, an active PR, well, issue, on the Ibis project which is asking exactly this: when can Daft be added as a backend to Ibis? One of the easiest ways to integrate into Ibis is actually by providing SQL support, because Ibis already goes from Python dataframe syntax into their own representation, into SQL, and then into a backend engine. So that's actually what we're prioritizing, instead of putting in a whole bunch of custom work to make Ibis work specifically for Daft and building out all that custom code. We're just going to support SQL and let Ibis give us SQL. That's probably the easiest way. So that's the current approach.

Jay Chia [00:19:37]: And SQL support, we're expecting it to reach about 80% in functionality by the end of this month, and we'll be ready to unveil what we have there very soon.
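
Daft's SQL interface was still in progress at the time of this talk; the following is only a sketch of the kind of analytics query being described, using the daft.sql entry point available in later releases, with illustrative paths and column names:

```python
import daft

df = daft.read_parquet("s3://my-bucket/curated/dataset/")

# daft.sql can reference in-scope dataframes by variable name.
result = daft.sql("""
    SELECT matched_text_index, COUNT(*) AS n_images
    FROM df
    GROUP BY matched_text_index
""")
result.show()
```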

Demetrios [00:19:48]: How do you generally run Daft? Do you run it on Kubernetes if you want to make use of the distributed processing?

Jay Chia [00:19:55]: Great question. We're actively working on making this easier. Today, I run Daft primarily on my laptop, that's one. And also I spin up a big EC2 machine with plenty of cores, and I just run Daft locally. We have this concept of a local runner that's actually really performant already. You'll see it perform just as well, sometimes even better, than DuckDB and Polars. But if I want to run this in a distributed manner, I usually spin up a Ray cluster.

Jay Chia [00:20:27]: There are many ways you can do that. You can do that on Kubernetes, you can do that on just pure Amazon EC2 machines. Actually, within the team today, we're about to unveil a project in maybe a week or two that makes it very easy for people to just say, hey, from your CLI: daft up a cluster. Then we'll spin up a cluster for you, and Daft will automatically be connected to it, just for you to be able to kick the tires on Daft in a distributed context. But that's usually our recommendation: figure out a way to launch this Ray cluster and then connect Daft to it. But in about two weeks we'll have something that people can just easily run, as long as you have an AWS account.
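
A minimal sketch of the distributed setup described here: point Daft at an existing Ray cluster (the address below is a placeholder), then run the same dataframe code you would run locally:

```python
import daft

# Route Daft's execution onto a Ray cluster; call this before building dataframes.
daft.context.set_runner_ray(address="ray://<head-node-ip>:10001")

df = daft.read_parquet("s3://my-bucket/curated/dataset/")
print(df.count_rows())
```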

Demetrios [00:21:04]: Amazing. Simple is better. Yes, I like it. How about support for Flyte, Prefect, Zabl, all those other orchestrators?

Jay Chia [00:21:15]: Yes, we already have a bunch of examples with Flyte. Really, all these DAG orchestrators, the way I think about them is that their role is to orchestrate the compute and then offload the actual heavy compute to engines like Daft, engines like Spark. Flyte, for example, already has an integration with Ray. And also, Daft just runs really well on a single node. A lot of our examples with Flyte and with Prefect actually just involve having Daft run: either you ask the orchestrator for, hey, give me one really big node and I'm going to run Daft on it, or you ask the orchestrator for, hey, give me a Ray cluster, and then just connect Daft to the Ray cluster. And then now you have this amazing abstraction for compute over a single step in your DAG.

Demetrios [00:22:03]: So, kind of dovetailing on that one, why is Daft better than Spark Python or Beam Python, Flink Python, that kind of thing?

Jay Chia [00:22:13]: Yes. So, and this is a really interesting question, I've given a couple of talks on this. All of these big data frameworks have been built for enterprise settings. And in the last couple of decades, the enterprises all run on the JVM, right? It's all Java. And so if you've ever used Spark and Flink and you try to use Python with it, and you get these 2,000-line-long JVM stack traces, that's where it's coming from. There's all this inefficiency of having to jump between Java and Python: converting data types between the two, serializing data, sending it across the wire across that boundary. And as a developer experience, that's so painful. How do I just simply run a Python model over a column? There are all these hoops you have to jump through and all these error messages. As a user experience, that's horrible.

Jay Chia [00:23:03]: That's one. And number two, Spark and Flink and all these frameworks were built for a world of Hadoop and HDFS. And so the data I/O that they do is actually very slow for something like S3. And that's why they don't have things like vectorized execution. Daft, under the hood, is actually written in Rust with a Python API. And so we're able to incorporate all of these advancements in the way that we think about data, and we're able to do things like columnar data access and SIMD, all the really nice benefits that you get from being a native library. So, generally faster and easier to use.

Jay Chia [00:23:39]: That's, I guess, the argument.

Demetrios [00:23:42]: So one thing that I want to point out to everyone, too, is you've written some really cool pieces on Parquet and the Parquet file format, right? I can't remember the name; I'm trying to Google it right now. What was the name of the blog post that you wrote?

Jay Chia [00:23:57]: I think it's "Working with the Parquet File Format." Let me see. Let me find that for you as well.

Demetrios [00:24:07]: Yeah, we'll drop that in there.

Jay Chia [00:24:10]: Yeah, I'll drop a link. It's "Working with the Apache Parquet File Format," which I wrote on the Daft engineering blog.

Demetrios [00:24:16]: All right, last question before we rock and roll to our next guest. How do you see Daft fitting in with the medallion architecture, data catalogs, and the modern data stack?

Jay Chia [00:24:30]: Yes, great question. So today, Daft is very purpose-built for machine learning workloads. We've had a very strong focus on supporting multimodal data types and supporting Python. The way that I've seen the medallion architecture at all these enterprise companies is that people always have a very similar architecture where, number one, they run Spark, specifically PySpark in Python, for their bronze-to-silver and silver-to-gold data pipelines, right? So data engineering is almost always done in Python and PySpark, because it gives you that flexibility of just calling arbitrary Python. But then, once data gets to the end of the pipeline, people run SQL, so they run something like Trino, in order to be able to analyze that dataset and provide this really nice SQL abstraction to the end user. So I foresee that, today, Daft...

Jay Chia [00:25:26]: It's really, really good for that early stage where your data is dirty and you're trying to clean it and write it into Iceberg and Delta Lake. We already have really good integrations with all these data lake formats. And then we're building in SQL support so that it's also going to be good at the tail end of the equation, once you have a good dataset and you want to interrogate it. The goal, I think, would be that if you're working with weird, complex data types and you want a seamless experience, Daft eventually should be able to do everything from your bronze to your silver to your gold tier in the enterprise data stack.

Demetrios [00:26:03]: Excellent, dude. Well, thank you so much, man. I really appreciate this one. We're going to keep it cruising. This was awesome, though. And for anybody that wants to, there's a few people asking about the jobs in the chat. So reach out to Jay directly on LinkedIn. I'm sure he'll be more than happy to chat.

Demetrios [00:26:22]: Get some folks hired. I like it. So talk to you later, dude.

Jay Chia [00:26:27]: Thanks for having me.
