Bringing Structure to Unstructured Data with an AI-First System Design
Will is currently the Co-Founder of Coactive AI, helping enterprises unlock the value of their visual data. A former Data Scientist at eBay, Will has previously held various roles as a visiting researcher. His most recent work focuses on the intersection of AI and data systems, including performance benchmarks for data-centric AI and computer vision (e.g., DataPerf @ ICML 2022, the Dollar Street dataset @ NeurIPS 2022). His previous work spans from IoT electronics to the design and performance benchmarking of deep learning in neuromorphic systems. Will holds a PhD from Northwestern University and a BS from MIT.
Today, over 80% of enterprise data is unstructured and this fraction is expected to rapidly increase with the proliferation of generative AI tools. However, doing anything meaningful with this unstructured content remains extremely challenging as traditional data systems have not adapted, and ad hoc approaches using foundation models and LLMs remain expensive to implement and difficult to scale. In this talk, I will highlight the need to create AI-powered data systems for understanding unstructured data and share key lessons we have learned when building these systems for end-to-end applications.
All right, so excited to introduce our next guest, Will Gaviria Rojas from Coactive AI. Hey Will, how's it going? Hey, is the audio okay? Yep, you look good. You're a little fuzzy, but your audio is great. Okay, awesome. Let me know if there's anything I can fix before we start. I think you're good.
All right, here are your slides. Looking forward to the talk. Awesome, thanks. Ooh, is it good? Yeah, I took you away by accident, which was not intended. All right, all good. Awesome. It looks good from my end. So, hey everyone. I just want to introduce myself: my name is Will Gaviria Rojas, and I'm one of the co-founders of Coactive AI.
Today I want to talk to you about, essentially, bringing structure to unstructured data. And really what this talk is about is some of the lessons we learned in designing an AI-first system. I think part of the reason why we're all here today is that the nature of data is changing, right? Go back in time about 10 years ago.
The way we thought about data was mostly from the tabular perspective: a row in a table, in this case for chicken and cheese at one of my favorite restaurants, and, you know, hey, I'm gonna give it five stars, right? That's kind of what we were thinking about around the big data movement. But unstructured data now dominates a lot of the data that we see, right?
We're really moving away from data to content, such as a rich text description or a rich visual image of the actual delicious chicken sandwich that, you know, I love to go get. This goes back to Bill Gates' 1996 quote about content being king. His main takeaway was that those who actually control the content are going to be the real winners in the internet revolution.
And if content is king, well, I would argue the king is already here, right? 80% of worldwide data is expected to be unstructured by 2025. Nearly 80% of people say this unstructured, user-generated content impacts real decisions. And generative AI is expected to surpass what human workers can produce by 2030.
So this is a massive explosion of unstructured content being generated that we need to make sense of. And the key to all of this is actually AI, right? Whether we're talking about, say, sentiment analysis of a sentence or object detection in an image, AI is the key. You know, our little motto is: content is king, AI is the new queen. And AI is really going to be the key to unlocking a lot of the value from this content. The main thing I want to talk about today is that marrying AI and data systems, doing AI at scale in particular, is actually quite challenging. On the right here, you see kind of what we generally see in most organizations.
We have a lot of data being generated. The structured and semi-structured data has a very clear path for how it gets consumed in traditional systems. But for unstructured data, it's kind of a mixed bag, right? We generally see people archive and store it, but what they do with it afterwards is all over the place.
There are folks that actually do nothing with the data, because it's really difficult to tackle. We see a lot of people throw human labelers at it, and people use AI-powered APIs. One of the arguments that we want to make is that we want to move towards a world where we're thinking about AI not just as a one-off thing, but as a natural part of the system that can be done at scale.
And this is actually the main focus of what we've been doing at Coactive AI: we're building a reliable, scalable, and adaptable system. In particular, we do it for unstructured content in the visual domain. But what I really want to talk to you about today are some of the lessons we learned in designing such a system over three-plus years of user research and two-plus years of building this thing. There are really three lessons. The first one, just to set the stage, is that despite all the AI hype, what we learn more and more as we talk to companies is that few companies actually do much more than just archive their unstructured data.
This, however, is changing very rapidly, and AI adoption is growing really fast, especially within the context of text. But we see that lagging in other modalities of data. The two other lessons, which are what this presentation is really about, essentially walk through the main pitfalls we see people run into when they try to tackle unstructured data.
One of them is that logical data models actually matter a lot more than you think. The other is a different way to look at embeddings: not necessarily as a means of semantic understanding, but as a way to cache compute, be cost effective, and do AI at scale. So with that said, let's get started. First, logical data models matter more than you think. But, you know, I think probably a lot of people are wondering, well, what the hell is a logical data model, right?
Well, think of AI as this kind of monolith that generates metadata: you generally have the data stored in some sort of blob store, it gets fed to some sort of foundation model that the AI team owns, and then the output gets stored somewhere else and gets consumed by product folks, BI folks, et cetera, right?
But what's missing here is that this handoff between the data folks, or the data engineering team, and the AI folks tends to be hugely important. In particular, the pitfall is to just think, "Hey, just use what we stored," right? Because in the storage layer, the physical data model that we use is generally some kind of key-value store.
There's a key and then some JSON with a value of text, right? But it turns out that these AI models are actually very finicky about their inputs. The inputs are data specific: is it text, is it audio, is it an image, is it video? And they also tend to be task specific, right?
If it's text, am I doing sentiment analysis? Am I doing summarization? Am I doing some other task? So essentially this combination of data and task means there's a large number of logical data models on the input side. And what we generally see happen is that there's an impedance mismatch between the way the data is stored and the way it gets consumed.
And usually no one explicitly owns this. The data team ends up building a system that doesn't actually capture the needs of the AI team, and the AI team ends up building a bespoke system to kind of fix this mismatch, because, you know, someone has to do it. Overall, we find that the resulting solution ends up bottlenecking the entire AI pipeline.
So, to go over the hidden complexity here a little bit: in text, for example, let's say you stored it as, again, a JSON file with some text about your review of this chicken sandwich that's awesome, right? And let's say your task is just summarization. In this case, there's no real mismatch, right?
There's no transform needed; you can just feed it to the summarization task. But if you have other tasks, say language detection, you might have to do something really simple like randomly selecting a sentence, or, for say sentiment analysis, you might want to do something more complex like key-phrase extraction, as sketched below.
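To make that concrete, here is a minimal sketch assuming a hypothetical stored review record and deliberately naive stand-ins for the transforms (not Coactive's actual pipeline): the same JSON blob ends up needing a different logical input per task.

```python
import json
import random

# Hypothetical stored record: the physical data model is just a key plus a JSON blob.
raw = '{"review_id": "r-123", "text": "The chicken sandwich was amazing. Service was quick. Five stars."}'
record = json.loads(raw)

def to_summarization_input(rec: dict) -> str:
    # Summarization can consume the raw text as-is: no transform needed.
    return rec["text"]

def to_language_detection_input(rec: dict) -> str:
    # Language detection may only need a single randomly selected sentence.
    sentences = [s.strip() for s in rec["text"].split(".") if s.strip()]
    return random.choice(sentences)

def to_sentiment_input(rec: dict) -> list[str]:
    # Sentiment might want key phrases; this naive stand-in just keeps longer words.
    return [w for w in rec["text"].split() if len(w) > 4]

print(to_summarization_input(record))
print(to_language_detection_input(record))
print(to_sentiment_input(record))
```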
But the main point here is that the physical data model is actually different from the logical data model, which is not just a key and some value; there are actually pieces of metadata, or even substructures, that are relevant here. This becomes even more obvious as you move to multimodal approaches. We may have, say, a social media post that has a comment, a background song, a video, and an image, right? And if you send this to a multimodal model, the way you might actually want to think about it logically is as a post being a single entity that contains multiple modalities of data. So the main takeaway here is that in order to overcome this mismatch, we really had to focus on building logical data models at scale.
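A minimal sketch of what that logical entity might look like, with hypothetical field names:

```python
from dataclasses import dataclass
from typing import Optional

# Logical data model for a multimodal post. The physical store may only hold a key
# and a JSON blob, but the logical entity groups every modality that a multimodal
# model would consume together.
@dataclass
class SocialPost:
    post_id: str
    comment: str                      # text modality
    image_uri: Optional[str] = None   # pointer into the blob store
    video_uri: Optional[str] = None
    audio_uri: Optional[str] = None   # e.g., the background song

post = SocialPost(
    post_id="p-001",
    comment="Best chicken sandwich in town!",
    image_uri="s3://bucket/posts/p-001/image.jpg",
    video_uri="s3://bucket/posts/p-001/clip.mp4",
    audio_uri="s3://bucket/posts/p-001/song.mp3",
)
```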
In particular, where we see this work best is when folks realize that there's a mismatch here and create data-plus-AI hybrid teams that actually work to resolve that mismatch. And what ends up happening is that not only do you resolve the AI bottleneck, but all of a sudden you also surface all these pathways for optimization.
To give an example from image processing where we saw this: there's an image, you know, this chicken sandwich image, and everyone is running the same data transform pipeline to feed it into a model. So you're doing the exact same I/O and compute three times. But if you actually have the data folks and the AI folks sit down and look at this together, it becomes very obvious that there's room for optimization: the AI folks can just consume the precomputed, pre-transformed image. That ends up leading to a lot of really awesome consequences, such as being able to utilize your GPUs more effectively.
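Here is a minimal sketch of that idea: one cached transform shared across several downstream tasks, with a hypothetical loader and hypothetical task names standing in for real models.

```python
from functools import lru_cache
import numpy as np

def load_image(uri: str) -> np.ndarray:
    # Hypothetical loader; in practice this would read from the blob store.
    return np.random.rand(512, 512, 3).astype(np.float32)

@lru_cache(maxsize=1024)
def preprocess(uri: str) -> np.ndarray:
    # The shared transform (crop + normalize) runs exactly once per image URI.
    img = load_image(uri)[:224, :224, :]
    return (img - img.mean()) / (img.std() + 1e-8)

def run_task(task_name: str, uri: str) -> None:
    tensor = preprocess(uri)  # cache hit after the first call: no repeated I/O or compute
    print(f"{task_name}: consuming shared tensor of shape {tensor.shape}")

# Three tasks reuse the same cached transform instead of repeating the pipeline.
for task in ["object_detection", "ocr", "aesthetic_scoring"]:
    run_task(task, "s3://bucket/images/chicken_sandwich.jpg")
```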
So that's lesson one. In lesson two, I want to offer a different view of embeddings, particularly from the standpoint of cost effectiveness. Once you actually start doing a lot of this work, this is how it starts: you generally begin with one foundation model, and very quickly what we see happening, in our experience, is that folks work with a foundation model for a specific task.
Maybe that foundation model is really popular, so there's a second task, and then task 25, and all of a sudden you have a lot of I/O, a lot of compute, and maybe your billing department comes knocking at the door asking, "Hey, what's going on with this bill?" When this happens, AI costs quickly grow out of control and bottleneck future foundation model applications.
And I think the pitfall here is that you really have to break up that model to understand what's happening. Something you can do to actually leverage embeddings here, and this is a vast oversimplification, is to think of these foundation models as just computation graphs. A key takeaway is that the majority of the compute happens in the featurization piece.
And when folks do task-specific things, they're generally just changing the last output layer. So if you go back with this perspective of breaking up the monolith, not thinking of it as one computational block, you see very quickly that, hey, the exact same I/O and compute happens up to this point.
And this is insane, because that means I'm doing the same compute many times over, which is very, very expensive. So if you instead take an approach where you run your data through the foundation model once and cache those embeddings, you can now serve them to all the tasks, and all of a sudden you can use these foundation models at scale.
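A minimal sketch of that pattern, with a hypothetical embed() standing in for the expensive featurizer and cheap linear heads standing in for the task-specific output layers:

```python
import numpy as np

EMBED_DIM = 768
_embedding_cache: dict[str, np.ndarray] = {}

def embed(item_id: str, content: str) -> np.ndarray:
    # Expensive featurization runs once per item; afterwards it's a cache lookup.
    if item_id not in _embedding_cache:
        rng = np.random.default_rng(abs(hash(content)) % (2**32))
        _embedding_cache[item_id] = rng.standard_normal(EMBED_DIM)
    return _embedding_cache[item_id]

class LinearHead:
    """Cheap task-specific 'last layer' on top of shared embeddings."""
    def __init__(self, num_classes: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.weights = rng.standard_normal((EMBED_DIM, num_classes))

    def predict(self, embedding: np.ndarray) -> int:
        return int(np.argmax(embedding @ self.weights))

sentiment_head = LinearHead(num_classes=3, seed=1)
topic_head = LinearHead(num_classes=10, seed=2)

e = embed("r-123", "The chicken sandwich was amazing.")
print("sentiment class:", sentiment_head.predict(e))
print("topic class:", topic_head.predict(e))  # same embedding, no re-featurization
```

The point of the sketch is that the featurization cost is paid once per item, while each additional task only adds a cheap head on top of the cached embeddings.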
That gives you benefits not only in latency and cost, but also in bandwidth and the ability to work with smaller models. Semantic search is what people usually do with embeddings, and it's super powerful, but I want to offer this different view, in which we can actually use embeddings to do AI in a cost-effective fashion. So, some parting thoughts that I want to highlight, especially within the context of this conference: I think we're really moving from data lakes to data oceans.
To give an example to illustrate this: imagine you have tabular data of just some bytes, 10 million rows of them. 10 million rows of, say, float32s are going to be about 40 megabytes in size. If we equate that to some sort of area for comparison, let's say that's Lake Tahoe, right?
When we jump to text, 10 million documents is now about three orders of magnitude bigger, roughly 40 gigabytes. And if we compare that to Lake Tahoe, this is now roughly the size of the Caspian Sea; we've gone to the largest lake in the world, right? And once you go from text to video, you go from 40 gigabytes to something on the order of terabytes, and you're no longer talking about crossing a lake, you're talking about the Pacific Ocean.
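A rough back-of-the-envelope check of those numbers; the per-document and per-video sizes below are illustrative assumptions, not measurements:

```python
ROWS = 10_000_000

tabular_bytes = ROWS * 4            # one float32 per row: 4 bytes
text_bytes    = ROWS * 4_000        # assume ~4 KB per text document
video_bytes   = ROWS * 1_000_000    # assume ~1 MB per short, compressed video

print(f"tabular: {tabular_bytes / 1e6:.0f} MB")   # 40 MB  (the 'Lake Tahoe')
print(f"text:    {text_bytes / 1e9:.0f} GB")      # 40 GB  (the 'Caspian Sea')
print(f"video:   {video_bytes / 1e12:.0f} TB")    # 10 TB  (the 'Pacific Ocean')
```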
So we need tools that not only help us tackle text, but also help us tackle visual data. I only have a few minutes left, so I'm just going to say really quickly: what we do here at Coactive is unlock the value of image and video data, focused on essentially a data-centric approach to doing this and bringing unstructured data into the world of structured data.
If that sounds exciting, we'd love to talk to you. We would love to connect, and we are hiring. So if these are the kinds of challenges you'd like to take on, feel free to shoot me an email, shoot me a message, or apply on our careers page. Thank you, everyone. Really appreciate your time.
Awesome, thank you so much, Will. Yeah, I can definitely say the Coactive team is great. I'm a big fan of Cody as well. Thanks for that chat. I think we were supposed to meet a while ago, actually, Will and I, and I missed the opportunity, so I'm glad we have this chance. Lilly, we've got to do it after this.
Yeah. And again, a huge thank you to Demetri, to you Lilly, and to the entire community. This is just an awesome event. I'm super excited to go back and watch all the recordings. Yeah, definitely, there are some good talks lined up. All right, well, have a wonderful rest of your day. I appreciated the speedy talking.
I was almost nervous to give you a time warning; I didn't think you could talk any faster. Oh, sorry, my hack is I have my timer over here. Nice, I like it. All right. Awesome. Thank you. Thanks.