Python Power: How Daft Embeds Models and Revolutionizes Data Processing
SPEAKERS

Sammy is a deep learning and systems veteran, holding over a dozen publications and patents in the space. Sammy graduated from the University of California, Berkeley, where he did research in deep learning and high performance computing. He then joined DeepScale as the Chief Architect and led the development of perception technologies for autonomous vehicles. During this time, DeepScale grew rapidly and was subsequently acquired by Tesla in 2019. Staying in autonomous vehicles, Sammy joined Lyft Level 5 as a Senior Staff Software Engineer, building out core perception algorithms as well as infrastructure for machine learning and embedded systems. Level 5 was then acquired by Toyota in 2021, adopting much of his work.
Sammy is now CEO and Co-Founder at Eventual building Daft, an open source query engine that specializes in multimodal data.

At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
Sammy shares his fascinating journey in the autonomous vehicle industry, highlighting his involvement in two successful startup acquisitions by Tesla and Toyota. He emphasizes his expertise in optimizing and distilling models for efficient machine learning, which he has incorporated into his new company, Eventual. The company's open-source offering, Daft, focuses on tackling the challenges of unstructured and complex data. Sammy discusses the future of MLOps, machine learning, and data storage, particularly in relation to the retrieval and processing of unstructured data.
The Eventual team is developing Daft, an open-source query engine that aims to provide efficient data storage solutions for unstructured data, offering features like governance, schema evolution, and time travel. The conversation sheds light on the innovative developments in the field and the potential impact on various industries.
TRANSCRIPT
Hi, I'm Sammy. I'm the CEO of Eventual, and I drink espresso with a splash of almond milk.
Welcome back to another MLOps Community podcast. I am your host, Demetrios, and today I am flying solo for the once in a blue moon times that I do this. I got the pleasure of talking to Sammy Sidhu. This was an incredible podcast.
We go through all of Sammy's history. It was an incredible stint that he went through in the autonomous driving, autonomous vehicle era, and he went through actually two acquisitions of startups that got bought out by bigger companies. The first one got bought by Tesla, and then the second one got bought by Toyota.
Everything that he learned from that time, doing this very low level stuff with TinyML and optimizing, distilling these models, really making things work fast and work reliably, he put into his new company, Eventual. And Eventual has an offering that is open source, and I encourage anyone to check it out.
It's called Daft. We'll leave the link to that in the description for anyone who wants to look at it. And let's just jump into this conversation with him, because I think you all are going to love what he talks about when it comes to the future of LLMs and machine learning, and just data: how we store the data, how we query the data. Especially his specialty, which has been the unstructured data world, what he likes to refer to as the complex data world, and how you can make that much easier than just having some metadata that points to an S3 bucket. And that's really what they're trying to do at Daft, from my understanding of it. It's much faster and it's much easier and much more robust.
So let's jump into the conversation with Sammy right now. By the way, if you enjoy this, it would mean the world to me. If you send this episode to one friend so they can share in this joy with all of us, let's get into it. Here's Sammy Sidhu. Talk to you soon.
Sammy, great to have you here. Of course you didn't think you were gonna come on here without me talking about California raisins, huh? No, uh, it is true. Um, yeah. So, um, I'm Sammy. Uh, I grew up on a raisin farm and now I work in complex data. Not just any raisin farm though. Let's talk about this, because there's a lot of people listening that probably don't understand the cultural significance of the raisin farm that you grew up on. I'm pretty sure a lot of millennials and people our age had your raisins in their lunchbox at lunchtime, and it was just a little box of raisins that probably came with like 20 raisins. And it said on the box, I remember distinctly, right, it was this small box, I think it was red, and it said California Raisins on it. It is. So I grew up on a farm that was one of the producers for Sun-Maid raisins.
And the box, I think, to most people, is the one where it's the lady putting the tray of raisins out into the sun. That was it. And that's exactly what we did. We would grow grapes on a vineyard and cut them down and then make them into raisins. And the cool thing is that no matter where I go in the world, I always end up connecting with people over raisins.
I was in Korea for a conference and I mentioned it at a dinner. Everyone's like, oh, I had that all the time when I was a little kid too. So it's always something that connects with people. They monopolized the raisin industry, huh? No, it's interesting, cuz it's actually a cooperative, so there's a bunch of farms in California that get together and sell the raisins together.
Oh. So what is the process of creating raisins? I know it's dried grapes, right? It's dried grapes. But the thing with Sun-Maid raisins is that they're made by the sun. So generally what you do is you have, uh, table grapes that you grow on a vineyard. And then when they're ripe, when the sugar is at a certain level, you cut them down from the vine and you throw them on a piece of paper on the ground.
And then what you end up doing is you kind of just rotate them every few days until they turn into raisins, and then you throw them into a big hopper. So you grew up doing that. You would know when the raisin season was and when the harvest was and all that. And I imagine you were cutting stuff down. Yeah, my parents put us to work, so my summers were cutting raisins and rebuilding tractors.
So how did that take you into complex data? What the hell was that trajectory like? Um, yeah, it's a good one. Um, let's see. Growing up on a farm, I obviously liked getting my hands dirty, so I worked on a lot of tractors with my dad. Um, fixed everything around the farm, and got into electronics and computers very early cuz of that.
Um, my parents noticed this and got me off the farm because, you know, I was pretty much doing nothing at school at that point in the Central Valley. And so I moved to the Bay Area, and over here I just got hooked on computers. Uh, ended up staying local, uh, going to Berkeley. And, uh, ended up focusing a lot on, um, high performance computing, uh, low level programming stuff, and then got bit by the math bug.
I got into machine learning, and my professor, who I worked under in college, started a company. Um, and that company worked essentially on building autopilot for everyone else, and I ended up being the first hire there and eventually the CTO. And that's kind of how my journey got started.
You said autopilot for everyone else. What does that mean? Yeah. Well, at the time, when you think about autonomous driving, there's like multiple meanings. One is what we call level two, level three driving, which is stuff like, um, hands-off driving, where the car will drive itself on, like, a highway.
And then you have things that are level four, level five, which is what we called eyes-off or mind-off, which is essentially like you don't even have to pay attention or even look at the road. So what we were building for autopilot was kind of the level two, level three, which is, I can go onto a highway, it would drive itself and do its thing.
And, uh, the market we were targeting was everyone else. Uh, one thing we noticed was that around 2018, 2019, most manufacturers were already bundling a camera, uh, in their vehicles, uh, to do things like parking assist or automated cruise control. And the thing we could do, by making models that we were really good at making small, putting them in the car, and essentially running them on potato hardware, is we could actually unlock all this functionality for self-driving without actually adding any hardware for the customer. What were some challenges that you had to deal with? Um, you're essentially running these really big deep learning models on potatoes. It's really hard to do, and think about it: the computing power was essentially, let's say, a tenth or a quarter of a Raspberry Pi.
Mm. So what we had to do is get all this data that we would collect, and we would then have to transform it into something that we could train models on. Then we would train these mega models and then shrink them down to run real time on tiny hardware. How would you shrink them down? What were some techniques and tricks that you learned from years of practice, I imagine?
Yeah. Um, it's interesting stuff. Um, there's things like, you know, first is data quality. Um, if you are training the model on a lot of data where some of the labels disagree or are not consistent, the model would have to be bigger to kind of see through that. But one thing we noticed is that if you actually just get data and clean it very well, you actually need less model capacity to represent your dataset really well.
So just really clean data was one avenue. Um, the other ones were a little bit more exotic. One of the things we did, uh, I have a paper on this, it's called SqueezeNAS. I was gonna ask, I knew there was gonna be a paper somewhere in this story. Yeah. A lot of the technology wasn't there yet, and so we had to develop a lot of it ourselves.
But, uh, I had a paper called SqueezeNAS, and the idea is that, um, we used neural architecture search to essentially find the best, uh, network for the hardware. So what we did is we used a neural network to find the optimal network for the maximum accuracy at the lowest latency.
For a given hardware set. And that helped a lot. I want to know a little bit, before we move on, about... you became CTO of this company, and, mm-hmm, you were doing very much the ML stuff. You also had that low level distributed computing aspect of it, I guess.
Mm-hmm. And so when you were looking at the systems, what were some of the other pieces that you were factoring in when you were architecting these smaller models on very, very dumb, or tiny, I guess the proper word is TinyML, hardware: the half Raspberry Pi, or the tenth of a Raspberry Pi? What else were you thinking about as you were deploying these onto the cars for everyone else?
Yeah. There's a lot of different things. I would say the important thing is to think about the whole process, right? Uh, you have these models, and the thing that we were optimizing for was iteration speed. So what that means is, um, producing new models quickly depending on the errors that happened. And so when we think about that, one of the things we needed to be able to do is, uh, produce these models quickly on these large training clusters. So we were one of the early people who did distributed training. And the thing that was really interesting there is when you start making these models smaller, uh, you end up stressing the system in weird ways.
Huh. So one thing that would happen is, uh, because we're training these small models, our GPUs, um, were running them pretty efficiently, but the amount of data you had to ingest for these models, uh, increased exponentially. Because think about it: if you have a large model, the amount of GPU-to-CPU work is quite balanced.
But once you start shrinking that model, the CPU work doesn't go away, but the GPU work is now getting faster and faster. And so we ended up getting bottlenecked in all these weird places for training, and we had to kind of reinvent the wheel to make that work. Uh, and then it just kept on going.
Like we kept just running into weird issues. Like when it came to model deployment, what we noticed is that some of these platforms that we were targeting were invented, uh, like 10 years ago, with an accelerator. Oh. And we couldn't just run it on the CPU for this, you know, this hardware that you put in the car. These hardware platforms were, um, you could think of 'em as platforms made by a company called Renesas, or NXP: these companies that you've pretty much never heard of, but they have a monopoly in the car, uh, computer chip space. Yeah. Um, but the thing that sucked about them is that they're kind of designed by committee. And so they just have a bunch of random shit in there. Like they'll be like, oh, we have this one accelerator for, um, these operations, and we have this other accelerator for these other operations.
And so we ended up building a compiler to be like, okay, here's an input model, now kind of divvy up your model onto these other chips. So we had to design models that had layers that were easily mappable to multiple parts of the hardware. Damn. So you had to do all kinds of janky shit. Yeah, it was weird.
It was weird. And then I think the worst thing we had to do is, one of the vendors we worked with had a chip where the compiler was only on Windows. And that was probably like the worst month of my life. And I'm guessing there's a lot of cutting through regulation too, because... or the chips were like this because they had to go through rigorous regulation, red tape.
It is true. Yeah, that is the case. So typically when we talk about self-driving hardware, they have different levels of what we call ASIL. Mm-hmm. And the idea is like, how safe is this chip? And so a lot of these chips have failovers and redundancies such that if something breaks, it can still operate as normal.
Yeah, it makes sense. So inevitably you probably created a lot of stuff while you were there. And is that what made you see the light and realize that, hey, you know what, with this complex data stuff, there's probably a few businesses that you could create around it? I know that you did this and then you went and you had another stint in self-driving cars, right?
And then you started Daft. Uh, yeah, it's an interesting history. So I was at this company called DeepScale, and then we actually got acquired by Tesla. And so I worked through that whole acquisition process, and after Tesla, I then went to Lyft to kind of lead a lot of their self-driving efforts around perception.
And then that team got acquired by Toyota. So I worked at Toyota for a bit. And then after that, you know, after four companies of essentially building complex data systems, and being the builder of these systems and the user of them, I was like, okay, other people need the tools that I had. The tools I had were great, but once you leave, you don't really have anything.
And so what we wanted to do is kind of bring the functionality that you're familiar with from a database or data warehouse for tabular data, but to the complex data world. So you inevitably thought deeply about whether the world needs another database. What made you decide, and can you explain more about what you're creating with Daft?
Yeah, for sure. Um, I guess I'll talk a little bit about what we see as the common issue right now if you are dealing with complex data. So if you're a company dealing with complex data, what I most often saw in the systems that I initially built was kind of bridging together the best of relational databases with these kind of batch processing engines.
So imagine you are a self-driving company or a biotech company. You have a bunch of things like images, videos, 3D scans. What you would intuitively build, to kind of have an engine end to end to process these images, is to get a traditional database or data warehouse, put all the metadata in there, and when it came to the assets, uh, like the video or image, you would just have a pointer to some remote storage like S3.
So you'd put a bunch of images in S3 and have a database with all the metadata. Yeah, and what would then happen is, when I wanted to process this, I would have to do three different steps. I would first have a data analyst write a SQL query to be like, what data should I process now? Then you would run that query and get back a list of files.
And then you'd use something like Spark to process them and give you the results, where you'd either dump them to S3 or dump the results back into a SQL table. Then you would finally get another data analyst to actually process the, uh, end results and give you the analytics that you cared about. And what we saw were these really nasty systems where you had to talk to three different teams to actually do this end to end.
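To make that concrete, here is a rough sketch of the three-step pattern Sammy is describing; the table, bucket, and column names are made up for illustration, and `process_image` is a stand-in for the actual computer vision code:

```python
import sqlalchemy
from pyspark.sql import SparkSession

def process_image(path: str) -> dict:
    # Stand-in for the real work: download from S3, decode, run a model.
    return {"path": path, "ok": True}

# Step 1: a data analyst writes SQL against the metadata store to pick files.
engine = sqlalchemy.create_engine("postgresql://user:pass@host/metadata")
with engine.connect() as conn:
    rows = conn.execute(sqlalchemy.text(
        "SELECT s3_path FROM frames WHERE camera = 'front' AND blur_score < 0.2"
    ))
    paths = [row[0] for row in rows]

# Step 2: hand the file list to a Spark cluster for the heavy processing.
spark = SparkSession.builder.getOrCreate()
results = spark.sparkContext.parallelize(paths).map(process_image).collect()

# Step 3: dump the results back into a SQL table for yet another analyst
# to aggregate -- three systems, three teams, no end-to-end optimization.
```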
And I was always like, okay, first, if we're running the query across three separate systems, we can't optimize it end to end. And number two, dealing with Spark with images or videos or 3D scans was like a nightmare. Uh, Spark doesn't have the right abstractions here to represent the data natively. And what would happen is that, uh, you would have one machine take on too much memory from having all these images in memory, and it would OOM, run out of memory, and then that worker would die, and they would rebalance, and that would just knock out the rest of your cluster. Like dominoes. Dominoes, yeah. Yeah. And it always just killed me going through, like, 10,000-line log files of Java errors, finding the one line that caused my bug.
So I think I spent collectively weeks of my life on this. So painful. Yeah, it's so painful. And I always thought, hey, you know what, if we had a system that natively understood that this is a two-gigabyte file, and didn't just load it dumbly into memory, that would make my life so much easier.
So after all my stints in self-driving, I kind of talked to a few companies about, like, hey, if I work here, what do you want me to work on? And four of the companies that I talked to, uh, out of five, said, hey, we would want you to handle this whole side of unstructured data processing. And I was like, huh, this sounds like a good startup idea.
And then I paired up with someone I worked with at Lyft Level 5, and he was also very passionate about the space. And we started the company, uh, Eventual. And the first product that we worked on, or we're working on, is called Daft, which is an open source query engine for all those types of data, where it understands them very intimately.
And the idea is that you can load in all of this data and then you can query it easily, and the metadata lives with the unstructured images. That is my understanding. Is that correct? Yeah. Yeah. So it's kind of like, uh, I would say it's a distributed data frame. Um, the interface is all Python, but the entire engine is written in Rust.
And what it looks like to the user is something like a Pandas data frame, where I can say, I'm gonna read these millions of files in S3, I wanna then process them as images, I want to crop them, resize 'em, and then I want to load them into a machine learning, uh, model for, um, either inference or training.
And what it looks like to you is you're just building this query using a lazy API, and when it comes time to execute it, it will actually run it on a cluster, um, using this query plan that it develops, but that also intimately understands the data it's running on. So no more OOM errors, and very optimal queries for complex data.
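As a minimal sketch of that lazy dataframe flow (the bucket path is hypothetical, and the exact method names, `from_glob_path`, `url.download`, `image.decode`, `image.resize`, are my best recollection of Daft's API and worth double-checking against the current docs):

```python
import daft

# Build up a lazy query plan -- nothing executes yet.
df = daft.from_glob_path("s3://my-bucket/images/**/*.jpeg")  # hypothetical bucket
df = df.with_column("image", df["path"].url.download().image.decode())
df = df.with_column("thumb", df["image"].image.resize(224, 224))

# Materializing the plan is what actually reads from S3 and runs the ops,
# so the engine can optimize end to end and stream data instead of OOMing.
df.show(3)
```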
Uh, that's awesome. So then this is Daft; Eventual is what? So Eventual, we wanna be a lot more than managed Daft. Um, so what we're working on... we have plenty of users. Most of the users of Daft right now are actually in the enterprise space. So they're right now using open source Daft to do a lot of query processing of unstructured data.
Um, the question we get after "how do I run these queries on unstructured data" is, how should I store it for better retrieval? So the next, uh, product that we're working on, which will still be open source, but with a managed enterprise version, is: how do I store my data efficiently for complex data?
So think of it as a, uh, catalog for everything unstructured. And it differs from just that S3 bucket in what ways? So if you're familiar with a data warehouse or a database, you get all these amazing features like governance, um, schema evolution, time travel. Sure. But there is none of that for, for example, a bucket of images in S3.
And so what we're trying to do is give you the things that you're familiar with from BigQuery, uh, or Athena, or a lot of these data warehouses, but for unstructured data. So think of it as: I want these teams to have these permissions, and I wanna add this new column, or I wanna delete this column, or I wanna have a retention policy.
Um, these are things that we're adding on top of just plain old S3. Also much faster loading: rather than just a bunch of single images, we can actually compact them into something that's really easy to load. Oh yeah. How's that? So there's some interesting formats you can do. So what we see right now is, uh, people stuffing a bunch of images in Parquet, which isn't really the best.
It's great for tabular, but isn't necessarily the best format for images. So for a lot of this, we are developing our own, um, container formats to be able to load data really efficiently for images, or, you know, seek to certain parts of the image from cloud storage directly. Oh, nice. So inevitably, man, I'm sure everyone is asking you about how you fit into this large language model world.
Mm-hmm. And unstructured data is so hot right now. What is your thesis, or how do you look at it when it comes to, whether it's text to image or, uh, text to animation, text to video, any of that, or just straight generative AI with an OpenAI call? How are you seeing what you're doing at Eventual playing into this new paradigm of machine learning?
And if at all, cuz maybe you're like, yeah, right now we're focusing on this slice and it's not actually that important to us to get distracted with the shiny new toys. Yeah, I mean, that's a good question. I would say a lot of what we're building is compatible with the future of LLMs. So some of the, uh, use cases we're working on, um, with some of our early users are around LLMs and generative AI. So a big one is actually retrieval. And so, um, are you familiar with, like, chain of thought for LLMs? Yeah, dude. We talked about that example earlier. Yeah. There's this great paper, uh, from Alex Radner and a lot of other people that I can't remember, but...
It was about Distilling Step-by-Step, and it's basically distilling the model. Have you seen that paper? Yeah. Yeah. Okay. You know about it, but I'll explain it for the listeners in case they missed it. Distilling Step-by-Step is basically distilling the models, but when they distill through the large language model, they're asking for this chain of thought reasoning.
And so it makes it much easier for you to get that distilled model and train it with less data, because the metadata is so rich from the chain of thought reasoning, the chain of thought prompting, that you get from the large language model when you are creating that, uh, training data for the distilled model.
Yeah, it's super cool stuff. It's crazy to me how different distillation is now compared to distillation when we used to do it back in your day. Back in my day, distillation was kind of dumb. What you would do is you would train a big model, and then you would essentially use that output as the ground truth for a smaller model.
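For reference, "back in my day" distillation really was that simple; here is a minimal PyTorch sketch of the idea, with toy models and random data standing in for the real thing:

```python
import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(128, 10)  # pretend this is the big trained model
student = torch.nn.Linear(128, 10)  # the small model we actually want
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

x = torch.randn(32, 128)  # a batch of inputs
with torch.no_grad():
    # The teacher's (softened) outputs become the ground truth.
    soft_targets = F.softmax(teacher(x) / 2.0, dim=-1)

loss = F.kl_div(F.log_softmax(student(x) / 2.0, dim=-1),
                soft_targets, reduction="batchmean")
opt.zero_grad()
loss.backward()
opt.step()
```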
Wow. And that was it. And the thinking was just like, oh, maybe the big model understands the labels better than the ground truth. But anyway, um, we've come a long way from there. Come a long way, yeah. I distracted you. Tell me about why you were talking about distillation in the first place.
Oh yeah. So, um, the way I think about, uh, chain of thought reasoning is that it's quite compatible with this. So one of the use cases we're doing now is, um, being able to process data like the RedPajama dataset: load it in, filter it, and then do things like tokenization, do things like run an open source model on it.
Um, and then doing things like: I have a set of embeddings and I wanna do this hybrid search over a large dataset. Find me, uh, you know, sources of text that were published this day, have this type of style, and then, uh, have similarity closest to this. That's kinda like the first level use case.
The next level use case is actually plugging Daft straight into the LLMs. You know, if I ask the LLM, hey, give me some documents that, I don't know, are written in the style of... I don't know, what's your favorite author? Hunter Thompson. Okay, Hunter Thompson. And so the thing is, for a lot of these models, we don't really quite have things precomputed, and we don't necessarily know what the model's query looks like. And so what we can actually do is plug in Daft as one of the query sources for the LLM. So think of it like, uh, you know, we've seen all these demos of LLMs writing, uh, SQL, to then send it off to a database and retrieve it, and then parse the results. We can do the exact same thing for complex data, which is: have LLMs write these queries, search over these large, uh, corpuses on S3 or these data lakes, bring back the results, and then have the LLMs interpret them.
Interesting. Wait, say that again? I'm not sure if I fully grasped what you were saying there. Oh, well, uh, what I'm trying to say is, just how we, uh, you know, can provide a SQL engine for LLMs today to understand some of your tabular data, we could do the exact same thing for complex data, which is: I can have my data lake full of multimodal complex data and have the LLM write Daft queries to then try to understand it better.
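A hedged sketch of that loop, in the same shape as the text-to-SQL demos; `ask_llm`, the schema, and the data lake path are all invented for illustration, and a real system would not `eval` model output without sandboxing:

```python
import daft
import openai  # assumes the 2023-era openai client

def ask_llm(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

df = daft.read_parquet("s3://corpus/docs.parquet")  # hypothetical data lake

# 1. Have the LLM write the query over a known schema.
predicate_src = ask_llm(
    "Columns: author, style, published_at, text. Return one Python expression "
    "over `df` filtering for documents in the style of Hunter S. Thompson."
)
# 2. Run it against the data lake. 3. Hand the rows back for interpretation.
filtered = df.where(eval(predicate_src, {"df": df}))  # toy; sandbox in practice
print(ask_llm(f"Summarize these results: {filtered.limit(5).collect()}"))
```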
Oh, dude, that is awesome. Okay, I see what you're saying. So then these are some of the large language model use cases. I imagine four out of those five dentists that you talked to earlier, or whatever they were, when you went out to get a new job before starting Eventual, and they said, hey, we want you to work on this use case and this problem area...
They weren't necessarily doing things with LLMs back in those days when you started Eventual, and so there's a ton of other use cases. Is it very much where you see it soar? Where you're seeing Daft really take off, is it in your past-life kind of world, the self-driving car world? And I know you mentioned also drug discovery. Is that another use case area where you see it taking off?
Yeah, so the areas that we see it taking off, and the areas we're focusing on, are essentially the domains that don't get much love. And so, right now, if you're trying to process images, there's a bunch of different tools that sort of work, and so that space is kind of crowded. But what we've actually heard from a lot of our users is, if I'm trying to build queries on audio data, and like read all these different audio sources, uh, transcribe 'em, and then do all these different analytics and, uh, filtering, there's not really a good tool for that.
And so what we really lean into is making sure we work really well for these domains that don't get much love. And that seems like audio, uh, 3D assets like game assets, um, microscopy images from biotech, which are these large, uh, lossless images that are hard for Python libraries to typically interpret.
Yes. And also just weird, wacky things. Like, um, a big use case of ours is, uh, someone sets up a Kafka stream and they dump a bunch of these protobufs in S3. And then now they're like, hey, how do I just filter over a billion protobufs for things that have this in their field? And Daft works great for that.
Oh, nice. Okay. So you could potentially work incredibly well for playing around and querying a lot of MLOps Community podcasts. That's a great use case. Like, if there's a bunch of sources or URLs for the podcast, we can pull 'em in, transcribe 'em, chunk up the text, and then, um, you know, find out interesting things you said, and then maybe train a model to fine tune it on your voice and replace you.
Then I don't even have to be here. I can just have you interview the fake me. Maybe I'm not even real right now, man. Who knows? Yeah. So, uh, yeah, we'll have to do some kind of hackathon on that later, because I think that would be super fun to pull all that data. And it also sounds like you're making it really easy compared to what I had thought we would have to do.
Yeah, it's really easy. I mean, you could spin it up in Colab and get the query working, and then it's, uh, a one-line switch to go from running locally to running on a distributed cluster. Nice. Oh, that's awesome. So now, what are some things that you took from your self-driving days and all this TinyML stuff that you were doing that you're now bringing into Daft or Eventual?
Yeah, I think there's a couple of concepts that I think are really important. I think one thing that I took away early was, if you focus on traditional data systems in the tabular space, the individual rows don't really matter. It's about the analytics result, right? But in complex data and self-driving, the individual rows matter.
So it's like, if I'm trying to search for failure cases, the individual results matter, not the analytics part of it. And so it's almost like you want a system that's hybrid transactional-analytical, but you're not really doing transactions. So one thing we built into Daft early is making sure needle-in-the-haystack queries are really performant and easy to do.
Yeah, that makes a hundred percent sense. So it feels like, and this again going back to that whole idea of the self-driving cars, it feels like, I know we mentioned it beforehand, it's one of those high risk or very high... what? There's a special word for it that the EU uses to classify the ML use case. I can't remember. It's something high. Damn, I can't remember. But, um, basically it's dangerous, because there's lives that are potentially affected in very bad ways if shit goes off the rails. And so making these needle-in-haystack use cases, um, making it so you're able to find them really quickly in Daft, seems like, yeah, it's a no-brainer.
You came from that world, and now you're seeing, hey, how can I make that very useful even when it isn't this high-danger situation? Yeah, yeah. The needle-in-the-haystack type queries, we see 'em a lot now in the generative AI space as well, which is: find me this document that exhibits X, Y, Z. And you end up searching... like, you're not saying how many documents are there that exhibit X, Y, Z; you're saying, give me the documents that exhibit this.
And so we see this pattern quite often now, where you want kind of the best of both worlds. And so we're doing that. And for the whole safety thing, it's quite interesting, cuz I'm seeing a lot of similarities between the self-driving domain, and thinking about safety, and the whole LLM stuff that's happening now.
Like, I would say for any AI you develop, it's important to essentially have a safety net so it can't do harm. And the way you think about how rigorous that safety net has to be is: what can it do? So when you think about self-driving, you need a way for something to take over. And so we would have the self-driving stack, and you would have a safety layer in software to prevent things like, you know, hitting the curb or hitting a person, that sat outside of the traditional system that would be running for, you know, the end to end automation. But then you also have a safety driver who would take over at a second's notice.
So they would be, you know, hands hovering over the wheel, and they'd grab it and disengage whenever something happens. Yeah. Um, and the way that any of these AIs are developed, including, you know, the ones that we see now, is you have feedback loops. You essentially put something out there, you observe how it does, you see the failures, and then you improve it.
Whereas for self-driving, that feedback loop is very expensive, because you need a human to literally constantly monitor it. So all the feedback, all the validation you're getting, is supervised by a human. But for the whole gen AI space, it's quite interesting, because I put them in two domains. One domain is where it's okay to fail, and I see that in a lot of creative, uh, applications, where I'm doing copywriting or, uh, you know, image generation.
Yeah, yeah, exactly. Any of those, there's not really a right answer, so there's no real way to fail, and so it's okay. Right. It's a use case where, um, you can put it out there, and the feedback you're getting from users is what images they're actually clicking on and actually downloading, based on what they generate.
But then other domains, let's say legal AI, you still have that feedback loop, but the outputs have to be reviewed by lawyers. Each time they're run, though, it's gonna improve over time. So I kind of view it as analogous, but a little different. Yeah, I remembered the word. It's high stakes.
That's what it was: high stakes. Yeah. Dude, I can't believe I was totally blanking on that one. So going back to what you're saying, though, a hundred percent, I fully agree with you on that: where there are these, uh, expensive feedback loops, you also wanna make sure that there is some kind of safety net, just in case.
And it feels like, yeah, there's these use cases with generative AI where you don't need safety nets per se, because the worst that can happen is you get a deformed face on an image that you generate. Or you get a copyrighted image that you generate and you gotta figure that out and make sure that you just generate something better and you become the curator and whatnot.
Or when it's generating text, then you can add your flavor to it, and there's not really any high stakes there. But when it comes to more of this legal documentation, or if we're talking about... I mean, there's a few different high-stakes potentials here where you do need a bit of a safety net.
Yeah. No, agreed. That's the thing I feel like needs to be understood better. I see a lot of companies or individuals starting these projects, which I think are great, but, you know, for example, legal AI... I see a future where it works, where the lawyers are using it. But at the same time, I pay my lawyers a lot of money because they don't make mistakes. Sure. Otherwise, I would just do it myself. You know what I mean? Yeah, exactly. I know how to prompt this. Yeah. Whatever Harvey came out with, I can figure that out on my own. Yeah, I understand what you're saying on that.
And then coming back to what you're doing with Daft, and how you're thinking about that, how does that tie in? Yeah, so a lot of these use cases, like for example what Harvey might do, is they have these corpuses of legal documents, and they want to be able to take these documents, use image processing on them, extract out images, extract out text, and then run 'em through LLMs.
Um, those are things that Daft would be a really, really good tool for. What we see right now in the, uh, generative AI space is that a lot of people are just writing Python scripts that go over each file, extract the text manually, and then hit the OpenAI API, and they kind of dump that either in a file or in one of these vector databases, and then the process completely restarts once they switch the model or switch the kind of thing that they're doing.
Yeah. And so we wanna make that really easy to do, which is: I can build my entire pipeline just using a data frame, like I would do with Pandas. And when you're ready to commercialize it or make it production-ready, you just switch a couple flags and then you can run this on a cluster. So, um, it makes their life a lot easier, is what I would say.
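A rough sketch of what that pattern might look like; the UDF bodies are stand-ins (the real ones would OCR the file and call an LLM), the bucket is hypothetical, and the `@daft.udf` signature plus the Ray-runner switch are my best recollection of Daft's API:

```python
import daft

@daft.udf(return_dtype=daft.DataType.string())
def extract_text(paths):
    # Stand-in: real code would download each file and parse/OCR it.
    return [f"contents of {p}" for p in paths.to_pylist()]

@daft.udf(return_dtype=daft.DataType.string())
def summarize(texts):
    # Stand-in: real code would call an LLM here instead of truncating.
    return [t[:100] for t in texts.to_pylist()]

df = daft.from_glob_path("s3://legal-docs/*.pdf")  # hypothetical bucket
df = df.with_column("text", extract_text(df["path"]))
df = df.with_column("summary", summarize(df["text"]))

# The "couple of flags" for production: one call before executing moves the
# same plan from the local runner onto a Ray cluster (address is made up).
# daft.context.set_runner_ray(address="ray://head-node:10001")
df.collect()
```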
Yeah. Oh, dude, I see the vision then, and I understand that. So one thing that I also wanted to bring up, and you kind of hinted at this before, is around how you think about the needs and trade-offs when it comes to tabular data versus this unstructured or complex data: certain things that you need in tabular data and other things that you need in complex data, and how you look at the two architectures.
And if you're setting up a system right now and you only need to go down one route, maybe it's the tabular data route. Then what are you going to be building for and optimizing for? And then if you have complex data, what are some things that you're keeping in mind and optimizing for? Yeah, that's a, that's a good question.
Um, I guess talking about the axioms here, I think, is really important. So when we talk about tabular data, you typically think, um, integers, floats, um, you know, things like strings. Most of the time when you have this data, um, the queries you're running are usually aggregations. Um, so what that means is that I have, like, these files of text.
They could be large files of text, and what I'm essentially doing is going through each row and then doing things like min, max, sum, uh, these very cheap operations to compute. So if we think about what is the volume of data I have versus the volume of compute, it's actually much more data heavy. It's like, I might have gigabytes of data, but only, you know, billions of operations; those are quite balanced. Let's talk about an image. Images, and things like text for LLMs, are completely different. If I have, like, a one kilobyte, or sorry, a hundred kilobyte image, I'm processing potentially trillions of operations on it. Uhhuh, right? Or if I have, let's say, a work of Shakespeare, which is only like a megabyte, I'll be processing, you know, trillions of operations using an LLM.
And so the ratios that we think about of compute over data are completely different. So when you're building complex data systems, you have to think about it as: I will be compute bound almost every single time. Whereas if you think about traditional query engines, you're like, okay, all they're doing is optimizing, you know, reading from S3, and they don't really bother about the compute, which is fair, cuz they're gonna be bottlenecked by that.
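As a back-of-envelope version of that ratio argument, using the illustrative numbers from the conversation rather than measurements:

```python
tabular_bytes = 1e9    # ~1 GB of rows
tabular_ops   = 1e9    # ~billions of cheap ops (min/max/sum)
image_bytes   = 100e3  # one ~100 KB image
vision_ops    = 1e12   # ~trillions of ops to run a deep model on it

print(tabular_ops / tabular_bytes)  # ~1 op per byte: I/O-bound, optimize reads
print(vision_ops / image_bytes)     # ~1e7 ops per byte: compute-bound
```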
Uh, so then, okay, I love that you break it down by the axioms: you look at the fundamentals of one versus the fundamentals of the other, and what the bottlenecks are gonna be, and what you're going to encounter. And so then, as you're building systems around this, what are some things that you would say, I definitely want to have this in my toolkit, if you're dealing with one versus the other?
Yeah. Um, I would say if I was building an analytics engine, I would go for something simple, and something where failures are handled in a way where we can recover easily. So for example, if you're running things on, like, a Spark cluster and something fails, it's fine, because the amount of work you have to reproduce is not bad.
Okay. But if you talk about that in terms of a complex data cluster, that does not work very well. And so what we're building around that is, first, making sure Daft understands the data types natively, so it understands what an image is, and how expensive it is to bring into memory, how expensive it is to send it around. Uh, and the same for the various other formats as well. Also, understanding that placement and scheduling is very important, and we require really high utilization. An example is, if you take a lot of these complex data, uh, workflows and map them onto something like a Spark cluster, you only use about 20% of the hardware, okay?
And so if you're using GPUs or you know you're running LLMs or a computer vision model on this data, you're pretty much burning five times more money than you have to. And so some of the things that we're doing here is saying, okay, you know what? Our users are typically enterprise companies who wanna save money.
By porting things into Daft, they can run things in, you know, Python using data frames, but then still get really good utilization and cheap throughput, essentially, for their whole system. And so, yeah, the things we're designing around there, once again, are, um, really understanding the data we're working with, um, having query plans that are optimized not for tabular analytics but for complex data processing, and actually being, um, really tied into the hardware we're running on. So really understanding the AWS machines we run on and how to get the most out of them, built into the framework. I wanted to go down one route, but then when you said that last part, it's like, oh, uh, I thought about something else.
Something else got triggered. You have this background in TinyML; how does that play into things? Like, with Daft, what is it optimizing for, or how are you thinking about those kinds of use cases?
It's pretty interesting. Um, so I guess one thing I'll go into first is, what's the eventual goal of Daft, right? Uh, no pun intended there. Um, the way that we're building out Daft is, right now, our user interface is in Python, and all of the, uh, engine and stuff is a hybrid between Python and Rust.
As we're building more and more, we're getting to a mode where we can run completely serverless, where you can run things, uh, in Python on your laptop, but then when it executes, it can run entirely out on a cloud cluster, and then spin down as soon as it's done. And one of the difficult things with that, with, you know, models, is that most of the models right now require Python.
Yeah. Uh, so you typically use PyTorch or TensorFlow and you embed that, um, into Daft; typical workloads use Python as the glue code. But one observation we've made is that, you know, although in the beginning we wanted to make it really easy for Python, uh, to build models and run 'em in Daft, we noticed that a lot of people would just run the same models, uh, in Daft.
Like, they run the same, you know, LLM models, they run the same computer vision models. Mm-hmm. Uh, and they might change things like weights, but the models are generally the same. So one of the things we're going to in the future, which is where my background in TinyML comes in, is: how do we actually embed models in Daft where we can run them, you know, compile them down to hardware, uh, run them on CPU or GPU natively without any Python or any framework, um, and then execute that very efficiently on the cluster? And so that's one of the things that we're working on, which is, if I have something like a LLaMA model, we can actually just package that as part of Daft and run it on the cluster.
And there's, like, no work on your part you really have to do.
And then you can bring it down to running it on a potato, as you said, and have this capability just, yeah, out of the box. That works. It works. And I think it's getting really interesting, cuz if we think about the inference per dollar for a lot of these models, GPUs are usually the most efficient, right? Like, in terms of inference per dollar. But the problem is no one can get GPUs now. There's a massive shortage, which is stuff that I dealt with when I was working at Lyft. We would use all the GPUs in a single data center, and we had to learn a lot about how to do things, uh, differently so we could actually do our work.
And so we're seeing that now with this whole gen AI wave. I'm in this, uh, Slack group for AWS support, and every day people are like, can I get more GPUs? And the answer is, like, no. So we have to start thinking about alternatives. What do you do if you need to get your shit done and you can't get GPUs? You need to adapt. Yeah, yeah. You have to figure out what's the way around this, and how can we make it work without the GPUs that we have been relying on this whole time.
So what's the workaround? Yeah. So, uh, I think a big one is, I mean, have you seen all that really cool work on, uh, llama.cpp? No, wait. Is this the one where it's just super small llama models? Super small. So this guy, um, essentially got Facebook's LLaMA model and then he wrote... Yeah, he did whisper first, and then he did this one. Yeah. And you can run it on your computer, right? Like, you can run it on a CPU on whatever laptop. Yeah, exactly. And so what he did is, like, made the model smaller, so it fits better in memory, and then, yeah, wrote a hyper-optimized version purely in C++, using vectorized operations and whatnot, for this one model.
And so when you compile this model, you essentially have a single binary that does the inference for the model. And I kind of see that as, you know, the future for a lot of these LLMs, which is: you can package these standardized LLMs that are optimized for your hardware as part of your query, and use them to extract or do generation or whatever task you want.
Oh, incredible. I mean, this is making the assumption... I keep telling people that open source models are going to be good enough, and right now they're not there. But the big assumption that everybody's making is, like, open source will be there in six months. Don't worry about it. It's going there no matter what.
Yeah, I'm big on open source. Like, you know, our whole company is on open source. Yeah. I think open source will take off. I mean, I've seen a lot of analogs to what computer vision was. I started doing computer vision in like 2012. Mm-hmm. And during that time, the big bad computer vision algorithms were all in companies, uh-huh.
You had companies like Clarifai, you had Google, you had Apple, they were doing all this crazy computer vision. But then, you know, you would have these open source research papers come out, and they would be a little bit better, and then the companies would surpass 'em. And then we got to a point where people just kind of stopped caring, cuz the open source was just so good.
And now what people do is they just use them as a black box for their tasks. And that's kind of what we see now for computer vision. Yeah, yeah, yeah. I mean, I hope it is like that. Don't get me wrong. I am a huge proponent of the open source world doing its thing and optimizing it and making it free and open.
Of course. I just... we haven't seen it yet. And with a few of these attempts, I try and play around with them and I'm like, dude, this isn't six months behind. This is like two years behind. I would agree with that. It's probably gonna be more like years, like one to three years, rather than months.
Um, cuz, I mean, the moat here, once again, is a big cluster to train these models on. Yeah. Like, if you have a huge cluster where you can train these models, I think you can do it. But to get to the base level, I think the next part is then making these fine-tuned datasets of, uh, like Q&A, like OpenAI has done.
Yeah, yeah. A hundred percent. I'm excited for it, though. Uh, I'm a big proponent of, uh, open source, and I think it will take off. Awesome. Sammy, this has been fascinating, man. I love talking to you about all of this: the history of where you've come from, where you're going, what you're doing.
Last question I have for you is: over the years, you have inevitably succeeded at a few things, and you've failed at a few things. Where do you feel like you succeeded where others typically have failed, and why do you think you succeeded? Hmm, that's a good question. Um,
I feel like it's because I care about things that people told me not to care about. Oh. And I think the biggest one is, when I started my career, the advice I got from a lot of software engineers was, oh, don't worry about making this run fast, or don't worry about going low level.
Uh, it's not worth it. And the advice I got from them came from an era when, um, you know, computers just kept getting faster while you did no work, so you could write shitty software, computers would just get faster, and then your code would no longer be an issue. Or the other thing they would say is, oh yeah, computers are much cheaper than software engineers, right?
But then I would always put my optimizer hat on, and go into the assembly, or go into the low level, and actually just really understand why things were running the way that they did. And so at DeepScale that ended up paying off, and at Lyft and Toyota that ended up paying off. And now I feel like that experience is really paying off in just understanding why things are slow or why things are fast, because I did not listen to the advice people gave me at the beginning of my career. Incredible, man. This has been so cool. Thank you so much for coming on here. I think we'll end it there. Yeah, thanks for having me. It's been a great time.