MLOps Community

DuckDB is fast for analytics, but what can it do for AI? // Mehdi Ouazza // DE4AI

Posted Sep 17, 2024 | Views 453
SPEAKER
Mehdi Ouazza
Data Eng & Devrel @ MotherDuck

Mehdi (aka Mehdio) brings over 10 years of data engineering experience across various companies, and he currently leads developer relations at MotherDuck. He's a passionate contributor to the data community, sharing his insights through blog posts, YouTube videos, and social media. Mehdi has a knack for simplifying complex data engineering topics.

SUMMARY

The need for versatile and efficient search mechanisms has never been more critical. This talk will explore DuckDB's underrated search capabilities and usage within an LLM stack.

TRANSCRIPT

Link to Presentation: https://docs.google.com/presentation/d/15M8nAGBJuzECMmzF1hk67f1-LNKokNrWSaxNIu2Qwkg/edit?usp=drive_link

Demetrios [00:00:07]: So let's get Mehdi onto the stage. Where you at, dude? Oh, there he is. Look at that. How you doing, man?

Mehdi Ouazza [00:00:14]: Hello. Yeah, I'm doing great. What about you? I have my shirt, right? Look, you're doing your duties.

Demetrios [00:00:23]: Look at that shirt. Look at that. This is it.

Mehdi Ouazza [00:00:26]: Now I feel like... I think you'll represent the colors just fine, dude.

Demetrios [00:00:34]: Uh, Jono, in case you're just joining, has been looking for the elusive mother duck for the past 15 years, so maybe he'll find him by the end of this conference, maybe not, who knows? But you've got a talk for us, huh? You're coming at us from MotherDuck. Folks already saw Hannes talk; he was here talking about DuckDB and the different ways people are using DuckDB, and he gave MotherDuck a shout-out in there. I'm excited for what you've got. I'm going to throw your presentation on the screen and I'll be back in just a little bit.

Mehdi Ouazza [00:01:09]: Cool, great, awesome. Let me put it in full screen. All right, welcome everybody. So I'm here to talk about ducks, specifically DuckDB: why it's fast for analytics, but also, of course, how it can help in your AI and LLM workloads. I want to start first with this tweet, which has a bit of a hot take, from Vicki, who works at Mozilla AI: "rebranding all Python dictionaries in my codebase to serverless vector databases for funding purposes." I'm not going to say here that vector databases are not useful, but before I tell you which tool you should be using for AI or ML workloads, I just want you to zoom out, because we are overwhelmed with new tools and new technologies popping up all the time, and picking those up adds complexity. It's also sometimes just for hype. So take everything you will see at this conference, myself included, with a grain of salt.

Mehdi Ouazza [00:02:19]: And a second remark: ask yourself, do I need this, or why would I use this? Diving into DuckDB: so what is it? It's an in-process analytical database. And if you're a Python person, which I assume you are if you're at this conference, it's just a pip install. It's just a library to install; it runs within your Python process. There is no server to install, and you have, of course, multiple languages and interfaces: Java, Rust, or R if that's your thing. There is also a DuckDB CLI that I'll show you quickly afterwards. But DuckDB is really a Swiss army knife, as you can see in the picture and diagram here, because you can read from and write to multiple sources really easily.
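In code, the in-process idea looks roughly like this (a minimal sketch assuming the standard duckdb Python package and a hypothetical local events.csv file):

```python
# pip install duckdb  -- no server to set up; the engine runs inside the Python process
import duckdb

# Query a local file directly with SQL; the result is a relation that can also be
# converted to a pandas DataFrame or an Arrow table.
top_users = duckdb.sql("""
    SELECT user_id, count(*) AS n_events
    FROM 'events.csv'          -- hypothetical local file
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""")
print(top_users)
```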

Mehdi Ouazza [00:03:09]: It's just a binary, so there is no external dependency. I don't need to install anything extra to read from or write to these sources; it is all packaged within a single library or binary, so you can read a Postgres table directly from DuckDB. It has its own file format, as you can see in yellow; the file format supports ACID transactions, it includes all the metadata and all the tables, and you can have it all in a single file. And of course you can write to your common object storage as you please, or read from it: Parquet, CSV, and also table formats like Delta Lake or Iceberg. On top of that, DuckDB is also able to run in the browser. If you're familiar with WebAssembly, WebAssembly is kind of a container to run low-level, intensive applications within the browser.
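Putting those pieces together, a sketch of the "Swiss army knife" reads and writes might look like this (the bucket, paths, and connection string are made up; httpfs and postgres are the standard DuckDB extensions for object storage and Postgres):

```python
import duckdb

con = duckdb.connect("analytics.duckdb")   # single-file database with ACID transactions

# Read Parquet straight from object storage via the httpfs extension.
con.sql("INSTALL httpfs;")
con.sql("LOAD httpfs;")
con.sql("""
    CREATE OR REPLACE TABLE raw_events AS
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet')   -- illustrative path
""")

# Read a Postgres table directly via the postgres extension.
con.sql("INSTALL postgres;")
con.sql("LOAD postgres;")
con.sql("ATTACH 'dbname=shop host=localhost' AS pg (TYPE postgres);")
con.sql("CREATE OR REPLACE TABLE customers AS SELECT * FROM pg.public.customers;")

# Write results back out as Parquet for the next tool in the stack.
con.sql("COPY raw_events TO 'events_clean.parquet' (FORMAT parquet);")
```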

Mehdi Ouazza [00:04:05]: And that means that if you go to the website shell.duckdb.org and you disable your Internet connection, but don't do that now because you're all connected and following the talk, you're still going to be able to run those queries within your browser. There is no communication with a server; what's happening there is that DuckDB is running within your browser. DuckDB is pretty popular these days. We've passed 1.5 million downloads per week, and that's just the Python client. So what is MotherDuck? MotherDuck is DuckDB in the cloud, serverless, and basically anywhere you can run DuckDB, you can run MotherDuck. There is no external dependency.

Mehdi Ouazza [00:04:51]: DuckDB has an extension mechanism, and there is a MotherDuck extension: you're connected to the cloud, you can scale over there, and we offer compute, storage, and sharing. So you can create datasets and share them publicly, or within a certain scope inside your organization, and we have a dedicated UI. It's really a new paradigm: you can have, as we just said, DuckDB running on your client, as you can see in the picture, but also on the server, in the cloud. And it's not one or the other; it's really the two working in concert, compared to a standard cloud data warehouse where the client is basically just sending SQL text over the wire and not doing the actual compute. Here, you can have compute on both sides. All right, I have a small demo, and we'll see that directly.

Mehdi Ouazza [00:05:46]: Just so you can see how fast it is: I'm using the CLI on a three-gigabyte dataset in the DuckDB file format; of course, it could have been Parquet or Delta Lake. What I'm going to do is run quite a complex query here, parsing some strings and doing some group bys, running it locally. And as you can see, it takes less than a couple of seconds for 3 GB of data. Now what I'm going to do is connect to the cloud, and I just do ATTACH 'md:'. I have my token provisioned in my environment variable, so now I'm connected to the cloud and I can access this cloud database, duckdb_stats. I'm running the same query, and as you can see, I'm getting the results even faster than I did locally. So this is how easy it is to start locally with DuckDB, with the local dataset I just showed you, and then connect to the cloud afterwards.
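The cloud hop in that demo is roughly this (a sketch; it assumes a MotherDuck account, a token exported in the motherduck_token environment variable, and a cloud database named duckdb_stats with an illustrative table name):

```python
import duckdb

con = duckdb.connect()            # start locally, as in the CLI demo

# With motherduck_token set in the environment, attaching 'md:' connects to MotherDuck.
con.sql("ATTACH 'md:'")

# Cloud databases now show up alongside local ones, and the same SQL runs on cloud compute.
con.sql("""
    SELECT duckdb_version, count(*) AS downloads
    FROM duckdb_stats.main.pypi_downloads      -- illustrative table name
    GROUP BY duckdb_version
    ORDER BY downloads DESC
""").show()
```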

Mehdi Ouazza [00:06:41]: All right, let's go back. Now you're convinced that it is fast, so why an analytical database for AI workloads? Well, where is your existing data when you start an ML or LLM project? Basically, you need to fetch that data and preprocess it, right? That's the starting point, and often it's sitting in a cloud data warehouse or in object storage. And that's where DuckDB comes in handy, because it's really the Swiss army knife: you can pull everything directly with a single tool, provision this data into the DuckDB file format, and do your analytics or preprocessing transformations over there. So that's really great. And next to that, if you look at the other end of the pipeline, the inference on the ML or LLM side, again, DuckDB is a new kind of database, I'm repeating myself, but this is another graph to illustrate it: you can push that compute to the client. The same way that people are running LLMs locally with Ollama, for example, running their LLM on the edge, on the client.
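A sketch of that ingest-and-preprocess step might look like this (file paths, column names, and the cleaning logic are all illustrative):

```python
import duckdb

con = duckdb.connect("llm_project.duckdb")   # local file that will hold the clean dataset
con.sql("INSTALL httpfs;")
con.sql("LOAD httpfs;")

# Pull raw documents from object storage, clean them with SQL, and keep only what the
# downstream embedding / fine-tuning step needs.
con.sql(r"""
    CREATE OR REPLACE TABLE documents AS
    SELECT
        row_number() OVER ()                    AS id,
        trim(lower(title))                      AS title,
        regexp_replace(body, '\s+', ' ', 'g')   AS content
    FROM read_parquet('s3://my-bucket/raw_docs/*.parquet')   -- illustrative source
    WHERE body IS NOT NULL AND length(body) > 100
""")
```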

Mehdi Ouazza [00:07:50]: In the same way, you can run analytics on the client, and also some ML workloads. So storage and search are an important part of the LLM stack. Whether you're doing RAG or fine-tuning, you can see that you have search and storage of specific, you know, domain-specific datasets all over the place, and that's where DuckDB comes in. You can use it as a vector store: you can store your embeddings, and it's great because it's the same place where you did your data preprocessing in your data engineering to get your clean dataset, and now you can store your embeddings right there. Another side of things is that DuckDB can run in Python, but it's really SQL-first. There are a lot of neat features within SQL that DuckDB has, and the language has been there for years; it's the one that stands out, the common denominator across different profiles, whether ML engineer or data engineer. You can push it really far.
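As a sketch of the vector-store side (the embedding dimension, table, and column names are assumptions; array_cosine_similarity is DuckDB's built-in array distance function, and the separate vss extension can add an HNSW index on top if needed):

```python
import duckdb

con = duckdb.connect("llm_project.duckdb")

# Store embeddings next to the cleaned text as fixed-size float arrays.
con.sql("""
    CREATE TABLE IF NOT EXISTS doc_embeddings (
        id INTEGER,
        content TEXT,
        embedding FLOAT[384]      -- dimension depends on the embedding model used
    )
""")

# Nearest-neighbour lookup: query_vec would come from the same embedding model.
query_vec = [0.0] * 384           # placeholder vector
hits = con.execute("""
    SELECT id, content,
           array_cosine_similarity(embedding, ?::FLOAT[384]) AS score
    FROM doc_embeddings
    ORDER BY score DESC
    LIMIT 5
""", [query_vec]).fetchall()
```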

Mehdi Ouazza [00:08:57]: We have a blog post about that, which I invite you to read: hybrid search using DuckDB, all in SQL. You have it as an embedding vector database, and you can integrate full-text search, which DuckDB supports; combined with the embedding approach, you have a hybrid search all in SQL. Another real production use case you may know, of course, is Hugging Face. They are using the full-text search functionality from DuckDB, so they're using DuckDB behind the scenes when you search through a dataset. This is just to show you that this is not, you know, supposition; these are things where people are running production workloads.
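In rough outline, the full-text half looks like this (a sketch using DuckDB's fts extension; the table comes from the earlier sketches, and how you blend the BM25 score with the cosine similarity above, e.g. a weighted sum or reciprocal rank fusion, is up to you):

```python
import duckdb

con = duckdb.connect("llm_project.duckdb")

# Build a BM25 full-text index over the document text.
con.sql("INSTALL fts;")
con.sql("LOAD fts;")
con.sql("PRAGMA create_fts_index('doc_embeddings', 'id', 'content');")

# Keyword relevance via BM25; combine this score with the vector similarity
# to get a hybrid search, all in SQL.
con.sql("""
    SELECT id, content,
           fts_main_doc_embeddings.match_bm25(id, 'duckdb vector search') AS bm25_score
    FROM doc_embeddings
    WHERE bm25_score IS NOT NULL
    ORDER BY bm25_score DESC
    LIMIT 5
""").show()
```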

Mehdi Ouazza [00:09:40]: Another thing that I've seen, and this tweet is actually from yesterday: you can build different extensions for DuckDB, and people do. There are extensions to read Parquet and CSV, which are core to DuckDB, but you can also build your own custom extension, and people here are saying, yeah, why not build an extension to do standard linear or logistic regression? Then I'm just calling a SQL function and doing those simple ML workloads within my analytical database, where I do my transformations and where I store my embeddings. So this is really possible. A quick takeaway on what DuckDB and MotherDuck can do for AI and ML workloads: DuckDB is definitely a versatile tool for ingesting and preprocessing data, because the quality of the data is all that matters in the end for any AI or ML project. SQL for the win.
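For that kind of simple ML in SQL, DuckDB's standard regression aggregates already go a fair way even without a custom extension (a sketch; the dataset and column names are made up):

```python
import duckdb

# Ordinary least-squares fit of sales = slope * ad_spend + intercept, in plain SQL.
duckdb.sql("""
    SELECT
        regr_slope(sales, ad_spend)     AS slope,
        regr_intercept(sales, ad_spend) AS intercept,
        regr_r2(sales, ad_spend)        AS r_squared
    FROM 'campaigns.csv'                -- hypothetical dataset
""").show()
```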

Mehdi Ouazza [00:10:37]: Simplify workloads with just SQL. Believe me, when you start to have complex polyglot pipelines, SQL over here, Python over there, being able to do everything at the SQL level is pretty handy and easier to debug. You have a new paradigm in databases: it can live on the client, as I explained, and also on the server if you need to scale, so you can leverage cloud compute. Your expensive MacBook Pro is no longer just sending text over the wire to the service; it actually computes, and that reduces cloud costs and also, you know, improves latency for users. And that's how you can scale with MotherDuck, as I showed you with the CLI demo, pretty quickly and easily.

Mehdi Ouazza [00:11:22]: And finally, it can serve as a vector database to store embeddings, and we have integrations with common LLM frameworks like LlamaIndex. You can also do hybrid search with full-text search and embeddings; we have a blog about that if you want to dive in. That's it for me. You can reach out to me on LinkedIn if you like; please don't hesitate. If you're interested to know more, I'm always available for questions around data or ducks, and that would be it for me.

Demetrios [00:11:54]: And you've got an incredible YouTube channel that everybody should go check out, especially if they're looking to have a good time and enjoy this: it is a one-of-a-kind place to learn and be entertained.

Mehdi Ouazza [00:12:10]: Thank you. That's... that's a really good summary.

Demetrios [00:12:13]: Yeah, dude. Well, can you stick around for like 10-15 minutes? Because the big reveal of Jono and the mother duck is coming up after our next talk with Alex.

Mehdi Ouazza [00:12:28]: Cool. Thank you.

Demetrios [00:12:29]: Right. But I don't know, you've got a dinner to get to, you've got to go. Maybe on your phone, too, feel free to just jump on.

Mehdi Ouazza [00:12:35]: Yeah, of course. I'll do that. I'll jump on the phone. Yeah.

Demetrios [00:12:38]: All right, Mehdi, this was awesome, dude. Thank you for joining us. I'll see you in a little bit.
