MLOps Community

RAG Has Been Oversimplified

Posted Jan 23, 2024 | Views 822
# RAG
# Vector Database
# Zilliz
SPEAKERS
Yujian Tang
Developer Advocate @ Zilliz

Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience, with research papers published at conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

In the world of development, Retrieval Augmented Generation (RAG) has often been oversimplified. Despite the industry's push, the practical application of RAG reveals complexities beyond its apparent simplicity. This talk delves into the nuanced challenges and considerations developers encounter when working with RAG, providing a candid exploration of the intricacies often overlooked in the broader narrative.

TRANSCRIPT

Yujian Tang [00:00:00]: My name is Yujian. I'm a developer advocate at Zilliz. And, yeah, so I take my coffee, two shots of espresso over ice with sparkling water and a pump of caramel, and then some milk to mix.

Demetrios [00:00:17]: Welcome back to the MLOps Community podcast. I am your host, Demetrios, and today we're talking with Yujian. What a conversation. We went real deep when it comes to RAG. I know it's everyone's favorite topic, or at least it was in 2023. Let's see how 2024 shapes out. Maybe you are bored of it or tired of it by now, but I promise you, we got into some caveats when it comes to RAG. There's a million and one courses or pieces of content that have come out around RAG.

Demetrios [00:00:51]: And so how can we differentiate ourselves? Well, we talked about some optimization methods that you might want to consider when you are using RAG. We went from "what the hell is a RAG?" to "how do I optimize it?" and what do I need to be thinking about at the deepest level? How am I building my app so that I can take full advantage of the pros and cons of each design pattern, and these dense vectors and sparse vectors and all that fun stuff that all vector database people like talking about, I guess. So if you are doing anything with RAG, I would love to hear from you. To be honest, I would love to hear what you're working on. And if you decided against RAG, I would also love to hear from you, because the beauty of this is that RAGs, as we talked about with Yujian, are not for everything.

Demetrios [00:01:49]: Although you may go on Twitter and hear that, you can just throw some AI at it and have it be a rag or use a rag with it. You don't need to. And the last part, I think Eugen has some cool insight around multimodal rags and what we can do when that starts taking off and when support happens for that, more. So if you are on the market for anything vector database, check out Zilla's cloud and check out Zilla's pipelines, their new feature that they just have for early access for the community and listeners of this podcast. You can find it by clicking in the show notes. And huge thanks to Zillow's for sponsoring this episode. Of course, if you liked it, share it with just one friend. That's all I ask.

Demetrios [00:02:46]: All right, I'll see you in the conversation.

Yujian Tang [00:02:52]: You.

Demetrios [00:02:55]: Dude. Well, I'm stoked to have you on here. I want to do, like, a huge deep dive into RAG and basically where we're at right now. It's 2024. RAGs have become the hero of the LLM space. I think the QA bots are more or less the "hello worlds" that you create with LLMs. And I know you've been spending a lot of time working on RAGs and their implementations and all that fun stuff, so we should probably start with: we had you on last time, and we'll link to the last podcast that you came on in case anyone wants to basically hear it. I'm not going to say it's part one, because we're going in a totally different direction. It's not like you need to listen to that one first before you get to this one.

Demetrios [00:03:38]: But we did have a pretty insightful conversation around vector databases and the idea of how much AI has become accessible to the masses now because of things like vector databases and LLMs. Whereas before it took quite a lot of effort and a lot of engineering time to get some kind of ML model into production, and that was really the thesis for the MLOps community from like 2020 till 2022, now it feels like we're in a place where we can get something, at least something quick and dirty, up and running real quick and validate assumptions, validate different ideas that we might have. So before we go any further, the vector database/LLM/embedding design pattern is probably the one that everyone thinks about when it comes to RAG. Am I correct? And correct me if I've gotten anything mistaken there.

Yujian Tang [00:04:50]: Yeah, that sounds about right. Most of what people ask about when it comes to the RAG stack is like: what are embeddings? How do we get embeddings into vector databases? Are embedding models and LLMs the same thing? And they are not the same thing. This is something that quite often gets asked. Once people look at these different applications, if you're going to use LangChain, if you're going to use LlamaIndex and you're going to do RAG, you're going to have an embeddings model and you're going to have an LLM. And people will ask, oh, do they have to be the same? And the answer is no, they don't have to be the same. Then it's like, oh well, how do I pick one? And then it gets a lot more complicated than just the very basic "use the default, here's OpenAI, boom."

Demetrios [00:05:42]: Yeah, and embedding models. It's probably worth just a quick summary too, for the two people that are listening that have been under a rock and do not know how embeddings work, how vector stores work with embeddings, and what you need to do there.

Yujian Tang [00:05:59]: Yes. Okay. Yeah, I probably should have clarified what these pieces of the puzzle were. So, embeddings models are the models that you use to turn the data that you're going to be working with, your proprietary internal data or whatever, into vector embeddings so that you can work with it. Embeddings are numerical representations of this input data, a text or an image or something; it's a representation of this kind of data learned using a machine learning model. And embedding models are trained.

Yujian Tang [00:06:43]: Well, now we've moved far enough along into the RAG stuff that people are working on the more cutting-edge things. I think this probably came to light a few months ago: the whole idea that you need to be training your embeddings model on the specific types of data that you want to work with. You can start off by going to Hugging Face and getting a very basic embeddings model, and it will work decently fine. But when it comes to enterprise production and when it comes to working with very specific data, you're going to need to make sure that you're training your embeddings model on the right data. And then what we do with the embeddings models, what you need to do with them in order to get the embeddings out, is that after you train the model, you make sure you're only taking the output from the second-to-last layer. Traditional models will do some sort of task: image models will do some sort of segmentation, natural language processing models might do some sort of classification, some sort of prediction, something like that. But with the vector embeddings, that's not what we're looking for, actually.

Yujian Tang [00:07:54]: We just want to know what semantics the model has pulled out of the input. And so that's why, instead of taking the output of the model, what we actually do is take the output from the second-to-last layer. And then, just to touch on the LLMs and the vector database, in case there are other people who have not heard about these yet: LLMs are the large language models, right? GPT. Everybody kind of has heard of that one. There are many others, right? Like Mixtral came out. Perplexity is, I think Perplexity is coming out with another one.
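
To make that embedding step concrete, here is a minimal sketch, not from the talk, of pulling a vector out of a Hugging Face model by reading a late hidden layer instead of a task head; the model choice and mean pooling are illustrative assumptions.

```python
# Minimal sketch: turn text into a vector embedding by reading a hidden layer
# of the model rather than the output of a task head. The model name and the
# mean-pooling step are illustrative assumptions, not recommendations from the talk.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-2]     # second-to-last layer, as described above
    return hidden.mean(dim=1).squeeze(0)   # mean-pool over tokens into one vector

print(embed("An apple a day keeps the doctor away.").shape)  # torch.Size([384])
```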

Demetrios [00:08:27]: No way.

Yujian Tang [00:08:27]: I don't know. There's a huge number of LLMs now. Falcon, Llama, basically everything you can think of. Actually, I have a personal gripe on this: at this point, I don't even know what makes something a large language model anymore. How large does it have to be? 13, 20, a trillion parameters? I'm talking about billions with these first few numbers. And then I heard someone saying, oh, actually I think anything under 7 billion is a small language model. And I was like, I don't know, it seems pretty big to me. Yeah, literally last year we were like, oh, large language model with 5 billion parameters. And now it's kind of like, yeah, if you don't have more than 10 billion parameters, is it really a large language model? I don't know.

Demetrios [00:09:14]: I'm guessing you don't get the stamp of approval of being large if you're not large enough. And yeah, it is funny because a lot of people are coming through and saying, yeah, this year it's all about the small language models. And so now it's like, who's deeming what is large and what is small? Have we come to a consensus on that yet? I don't think so either.

Yujian Tang [00:09:42]: And these people are like, yeah, we're going to be talking about small language models and they're going to be talking about like 3 billion parameter models. I don't know, man. And then the last part is the vector database piece, which is where you store your vector embeddings. Once you get your vector embeddings and you basically use this as a semantic store, as a way to store the meaning behind the image and the text and the audios and the videos that you have. And a way to work with that semantic meaning in a mathematical manner.
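
As a rough illustration of that last piece, here is a small sketch of storing embeddings plus their source text in Milvus and searching them by meaning, using pymilvus's MilvusClient with Milvus Lite; the collection name, dimension, and the embed() helper (sketched earlier) are assumptions, not details from the talk.

```python
# Sketch: keep vectors and their source text together in Milvus, then search by meaning.
# Uses Milvus Lite via MilvusClient; embed() is the 384-dim helper sketched earlier
# (any embedding function of matching dimension works). Names are illustrative.
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")  # local Milvus Lite file
client.create_collection(collection_name="docs", dimension=384)

client.insert(
    collection_name="docs",
    data=[
        {"id": 1, "vector": embed("An apple a day keeps the doctor away.").tolist(),
         "text": "An apple a day keeps the doctor away."},
        {"id": 2, "vector": embed("Black holes form when massive stars collapse.").tolist(),
         "text": "Black holes form when massive stars collapse."},
    ],
)

hits = client.search(
    collection_name="docs",
    data=[embed("What keeps the doctor away?").tolist()],
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])
```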

Demetrios [00:10:09]: Yes. And now land the plane. What does it look like when you're actually building the application? The end user will, whatever, type into their Q&A bot; what does.

Yujian Tang [00:10:21]: The end user actually have to do?

Demetrios [00:10:23]: Well, I think I'm more just thinking the end user will type into the Q&A bot. Right. Like, I'm looking for more examples of this, or whenever it says that. Or tell me about, if I'm reading a book or I have a bunch of PDF documents, tell me about this paragraph more, or where it talks about this more. Right. Then behind the scenes, what happens?

Yujian Tang [00:10:56]: Yeah. Okay, so actually, the way that a user is expected to interact with the app also plays into the way that you should think about designing your app. But we can.

Demetrios [00:11:06]: Great point.

Yujian Tang [00:11:06]: We'll touch more on that in a bit. So what actually happens behind the scenes is, well, there are a few ways that you can structure your architecture, and it depends on how you would like to build it. But this is my suggested architecture: the query comes in, and it goes to the LLM for the LLM to determine what the actual question is. So perhaps you want to give it some context. You want to say, pretend that I'm, I don't know, an astrophysicist, and I'm reading this book about the cosmos and I want to know about black holes. So you give it some context, and it pulls out what the question actually is. And so when you say that, the LLM has to know that, okay, what I actually want to go look for in the database, let's say you have a database of books, is: I want to filter on the book Cosmos and I want to look for black holes. And so what it will do is then form a way to query the vector database.

Yujian Tang [00:12:04]: And so vector database queries are typically formed in a call, right? Some sort of API call, some sort of HTTP or gRPC call, something like that. So it's passing data along; it makes some sort of call to the vector database and it says, okay, let's filter on this book, Cosmos, and we're only going to look for text about black holes. Okay, so then maybe you get, I don't know, ten responses back, right? You get ten responses back, and then you send those responses back to the LLM. Remember, the first query was like, okay, we want to know about black holes from this book. And then it says, okay, here's my response, here's what it is in the vector database: here's the book name, here's the section name, the chapter name, the author name, and all this information. And then what it'll do is take that information, crunch it down, and say, okay, here you go.

Yujian Tang [00:13:00]: Black holes are, I don't know, supermassive clusters of gravity or something. I'm not really sure what they are, but then it'll return the answer to you, basically. So what goes on in the background is basically that the LLM first takes your input and interprets it, turns that into a query, queries the vector database, takes that context, and then turns that context into something that's human-readable and gives you back something that makes sense.
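
Here is a rough sketch of that flow; call_llm() and embed() are hypothetical stand-ins for whichever LLM API and embedding model you use, and the collection and field names are assumptions for illustration.

```python
# Rough sketch of the query flow described above, not a production implementation.
# call_llm() and embed() are hypothetical stand-ins; "books", "book", "text",
# "chapter", and "author" are assumed collection/field names.
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM provider's API call")

def embed(text: str) -> list[float]:
    raise NotImplementedError("replace with your embedding model")

def answer(user_query: str) -> str:
    # 1. The LLM first works out what to actually search for.
    search_text = call_llm(f"Extract the core search query from: {user_query}")
    # 2. Query the vector database, filtering on metadata (here, the book title).
    hits = client.search(
        collection_name="books",
        data=[embed(search_text)],
        filter='book == "Cosmos"',
        limit=10,
        output_fields=["text", "chapter", "author"],
    )
    context = "\n".join(hit["entity"]["text"] for hit in hits[0])
    # 3. Hand the retrieved context back to the LLM to write a readable answer.
    return call_llm(f"Using only this context:\n{context}\n\nAnswer: {user_query}")
```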

Demetrios [00:13:28]: I was really hoping you were going to give me some crazy definition of a black hole right there, but we can hopefully edit that and tack it in in post-production.

Yujian Tang [00:13:41]: I don't have any crazy facts about black holes. Let me think if there's anything I know a lot about that would be, like, fun to know about.

Demetrios [00:13:48]: I do appreciate the fun fact. This is just, like, a little bit of a side note and a tangent. Since we are going on a little tangent here: I ask speakers of the podcast to fill out a speaker form when they come on to this podcast. You filled that out. And one of the questions on there is a fun fact about you. Do you remember what you put for your fun fact? You said, I can build a fire. Is it like Boy Scouts fire, where you're just kind of in the woods and you have flint and that stuff? Or is it with a whole bottle of kerosene?

Yujian Tang [00:14:26]: Yes. So I don't know why, but that was, like, one of the first things that came to mind for that question. And I am really proud of being able to build a fire because of one specific instance, which was, like, I was down at Coronado Beach with some friends, and we brought Korean barbecue, and we brought something to cook it on, and we brought a bunch of stuff, and we had firewood and all this. And then we were having a lot of trouble starting the actual fire. Right. You've got to imagine you have three kids who spend most of their time in front of computers. Very nerdy.

Yujian Tang [00:15:05]: I come from a software background.

Demetrios [00:15:06]: Right.

Yujian Tang [00:15:07]: So you've got these three kids who spend basically, like, 20 hours a day in front of a computer, and they're just sitting there at a beach trying to figure out how to start this fire. And so we actually get the fire started. So first we figure out, like, okay, we're just going to go around and find dry leaves and stuff like that. And it finally works. Oh, my God. We poured so much kerosene on it, and we couldn't get it started. But what finally worked was we got a bunch of dry leaves, and then we got the kerosene on the big logs, and it finally caught fire. And then it wasn't the right type of fire, because we were trying to barbecue something and it was like a teepee stick fire, and so we had to rearrange it and all this kind of random shit.

Yujian Tang [00:15:45]: So it was a very fun time. It was really, like, my first time ever actually building a fire. And that was just like the one random thing that came to mind while I was filling out the form. So I was like, okay, here we go.

Demetrios [00:15:56]: So the mental picture that I have in my head of the three of you trying to build this fire, and you're used to writing unit tests and all that fun stuff, and there's no unit tests for these fires. It's like, there's no unit. Do we need more kindling? Do we need more gasoline? Let's just go get gasoline and dump it on this and see if that'll work. As you were telling this story, basically, I was thinking about the parallels in what I just did to get more context out of you on this fire. It was like I was a RAG right there. Yeah, I was a human RAG, because you said, I can build fire. But then I said, give me more context. I need more context.

Demetrios [00:16:47]: And I went to the vector database, which is you, and I got more context, and now I can plug it into the LLM, and I can say, all right, here's all the context you need about Yujian's fire capabilities. So we went on a huge tangent there, man. I want to get back on track, and for those that are listening, I promise we're going to get deep into RAG and the RAG stack. I like that word that you used. It's good. One huge question that is on my mind when it comes to RAG, and I'm sure you've heard a lot of people talk about this, is: do RAGs eliminate the ability for LLMs to hallucinate, because you give it that context?

Yujian Tang [00:17:35]: No, it doesn't. So this is something that we talk about a lot in many of our earlier presentations, and I think a lot of people talk about this when it comes to RAG stuff: RAGs can hallucinate. How do you get rid of that? Well, you should use some sort of context to inject your data, and injecting your data does reduce a certain type of hallucination. It's not going to just make things up about your data anymore, but that doesn't mean that it's not possible for the RAG to still put out the wrong answer, because in the end, the way the LLMs work is still using statistics, and it's still just like, here's the context, I'm going to predict based off the context. But if the training data was weird and the context contradicts the training data, you might get some weird results. And so an important thing to do to deal with that is basically to have your observability tools and have your evaluation tools afterwards. And I think that's part of the whole reason that there are so many tools that have recently moved from what was originally, like, MLOps into what they're now calling LLMOps.

Demetrios [00:18:50]: Yeah. And do you feel like that is a critical part of the RAG stack?

Yujian Tang [00:18:54]: Oh, for sure. If you're going to be doing this in production, you've got to know that your answers are decently right. Even humans, no human is right 100% of the time. But you've got to make sure that your RAGs are at least matching up to what people are capable of. And so, like, examples of some of the things that people talk about in this space: make sure that the context that you get is relevant.

Demetrios [00:19:26]: Right.

Yujian Tang [00:19:26]: Just because something is semantically similar doesn't mean that it's necessarily semantically relevant. And make sure that you cite your sources. Right. So this is something that I was building earlier last year using LlamaIndex: how do you build a RAG app with citations, and how do you ensure that the actual answer given doesn't deviate too far from the data that's pulled back from the vector database?

Demetrios [00:19:57]: And how did you do that?

Yujian Tang [00:19:59]: Building citations is actually very easy. What it kind of requires you to do is to realize that you're going to build an application that's going to need citations, and so you have to store the text that you're going to use along with your embeddings. So you're going to store more data, but in return, you're going to get that groundedness and that context and that relevance and that ground-truth data back. And actually, this was something that wasn't originally available in LlamaIndex, so I ended up writing a PR to make it available.
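
For anyone who wants to try that, here is a minimal sketch using LlamaIndex's citation query engine; the import paths and arguments follow recent llama-index releases and may differ from the version discussed here, so treat it as illustrative rather than the speaker's exact code.

```python
# Sketch: a citation-aware RAG query with LlamaIndex. The index keeps the source
# text alongside the embeddings so the answer can point back to it. Import paths
# and argument names follow recent llama-index releases; check your installed version.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import CitationQueryEngine

documents = SimpleDirectoryReader("./books").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,       # how many chunks to retrieve
    citation_chunk_size=512,  # how retrieved text is split into citable units
)
response = query_engine.query("What does the book say about black holes?")
print(response)                         # answer with [1], [2] style citations
for source in response.source_nodes:    # the stored text each citation points at
    print(source.node.get_text()[:200])
```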

Demetrios [00:20:38]: Nice. Yeah, it's almost that trade off of, hey, we're going to store more data, but we're going to have more trust, so it's worth it 100% of the time. Yeah. And you said something else that's interesting in there that I wanted to just dig into for a second, the context and relevance. What's the difference between those two there?

Yujian Tang [00:21:01]: Right. I think it probably makes more sense to differentiate between relevance and similarity; these are both attributes of context, I would say. But context is the additional data, or the original form of a piece of data, or the data surrounding something. So, for example, "an apple a day keeps the doctor away." You need the context that the apple you're talking about is a fruit and not, like, the Mac laptop. Right. So context is kind of like that.

Yujian Tang [00:21:38]: And then similarity and relevance is like, let's say I want to find something for an apple a day keeps a doctor away. And maybe a similar sentence to that might be like a banana a day keeps the chiropractor away. But is that.

Demetrios [00:21:57]: You just made that shit up, didn't you?

Yujian Tang [00:22:01]: But is a banana a day keeps a chiropractor away relevant to my question about what keeps a doctor away? And not really. So that's kind of like, the difference there. What would be relevant might be like, an apple a day keeps the doctor away because it makes you eat fiber. I don't know, something like that.

Demetrios [00:22:21]: Yeah, fiber. It's a big one these days. Get that gut health right, and you won't have to see the. Yeah, yeah.

Yujian Tang [00:22:29]: I just heard a podcast about it, so that's why I was thinking, wait.

Demetrios [00:22:32]: Was it the one?

Yujian Tang [00:22:34]: No, no, it was like one of the ones with Andrew Huberman.

Demetrios [00:22:38]: No way. All right, I've got to go watch that now. That's on my list. I just watched the Attia one. Okay, so now I want to just make sure that I'm getting this right. Just a little pause, because with the RAG stack, we introduced something new into it that we didn't have in the beginning when you said, oh, these are the things that you need for the RAG stack, which is the embedding model, the LLM, and the vector database. But then we just said, oh, yeah, and by the way, you might want to have some kind of an evaluation or guardrails tool on there. And funny enough, I'm sure you've been seeing some of this stuff pop up on social media, where I think it's Chevrolet.

Demetrios [00:23:19]: If you go to the Chevrolet website, they use an LLM for their customer service bot, and it's just straight plugged into GPT four. And so I've seen people talking about how why pay for OpenAI when you can just get it through Chevrolet's website? Or I've also seen people asking, are Teslas better than Chevys? And it will come back like, Teslas are known to be better than Chevys. Yes, on the Chevrolet website. It's just like, the amount of brand damage that you're doing right now is incredible. So it's worth its weight in gold to get some kind of tool that will make sure your rag is not spewing out things that should not be spewing out.

Yujian Tang [00:24:09]: Yes, it is definitely worth looking at doing something like that. And you definitely need something like that when you're going to be putting this into production. I guess since we're on the topic of tools that you can add to your RAG stack, other things to think about could be data pipelines. For example, how do you pipe new data into your vector database? And this is actually something that we have been building at Zilliz. So Zilliz recently released our own pipelines feature. And one of the things that people always ask me, I'm pretty sure I did, I think, 14 talks last year, and I got this question in at least twelve or 13 of them, which was: how do I get my data into a vector database? And it's like, oh, well, you need to use the embeddings model, but you can make it easier by building a pipeline around it or something like that. Right? So for Zilliz right now, we have an open-source model as our embeddings model in the pipeline.

Yujian Tang [00:25:13]: But that's kind of just one way to get into it, right? We want to offer people a way to get into it and to think about using pipelines, because when you're in production, you probably don't want to, I mean, you could, I guess people actually do do this in production, you could just, every day or so, execute a script that runs for ten minutes and puts your data into your vector database.

Demetrios [00:25:35]: Yeah.

Yujian Tang [00:25:36]: Hey everyone, my name is Aparna, founder of Arize, and the best way to stay up to date with MLOps is by subscribing to this podcast. But if you want, you know, a more automated version of that, like, if it starts taking you too long to go and do the script, then get a pipeline, right? And Zilliz pipelines aren't the only pipelines that are available, but this is one way that you can use a pipeline to pipe your data from some sort of data source into your database in a more, let's say, streamlined, automatic, easy, whatever kind of manner.
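
For comparison, here is what the "just run a script on a schedule" option might look like; embed() is a stand-in for your embedding model, and the folder, collection, and field names are illustrative assumptions.

```python
# Sketch of the "run a script every day or so" option described above: read new
# files, embed them, insert them into the vector database. embed() is a stand-in;
# folder, collection, and field names are illustrative.
import glob
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")

def embed(text: str) -> list[float]:
    raise NotImplementedError("replace with your embedding model")

def ingest(folder: str) -> None:
    rows = []
    for i, path in enumerate(glob.glob(f"{folder}/*.txt")):
        text = open(path, encoding="utf-8").read()
        rows.append({"id": i, "vector": embed(text), "text": text, "source": path})
    if rows:
        client.insert(collection_name="docs", data=rows)

ingest("./new_docs")  # run from cron, or replace with a managed pipeline
```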

Demetrios [00:26:10]: Yeah. So then basically, if I understand this correctly, as we build out this RAG stack in my head, we've got the LLM, we've got the vector database, we've got the embedding model. We need a way to connect all of these, and on the output, we need a way to make sure that there are guardrails and that we validate the output before the end user sees it. Is there anything else I'm missing there? I feel like probably, potentially, some kind of prompt tooling.

Yujian Tang [00:26:47]: LangChain, LlamaIndex. These are orchestration frameworks, I guess, and prompt tooling, prompt engineering kinds of frameworks also kind of fit, if the shoe fits, I guess. So, yeah, that would probably be one more thing in there. I have seen that a lot of people like to also do their own prompt engineering, and I think that is actually something that I would really like to see. This is a feature request for the LangChain and LlamaIndex people, if they're listening: I'd really like to see a way to mess with the prompts inside of the apps, which is, I guess, kind of weird because it kind of defeats the purpose of the apps themselves. The frameworks themselves are supposed to be there to give you this way to interact with the LLMs. I'm not sure, but that's something that I think, if you're working in production, you're probably going to want to tweak your own prompts.

Demetrios [00:27:39]: Yeah. You want easy access to that. You don't have to dig around, that's for sure.

Yujian Tang [00:27:44]: Yeah.

Demetrios [00:27:45]: The other piece that I am fascinated by is, I think some people have come and asked about why you would need to use RAG for some use cases. Is it not good enough just to use search, like keyword search or similarity search for that matter? Do you need the whole thing of throwing it into an LLM and getting a bunch of algorithms, or can you just do search and have it give you back something? Is the LLM over-engineering it just because of the hype?

Yujian Tang [00:28:24]: Yeah. So for this, I would invite you and other people to consider a thought experiment, which is: let's say I have semantic search and I ask the question, what ingredients are in an apple pie? When I do a similarity search for that, I'm not going to find an answer that says, you need three apples, a pound of butter, a pound of flour, whatever. What I'm going to find is things like: what are the ingredients for a blueberry pie? What are the ingredients for a pecan pie? And that's what I'm going to get back. And so if you think about it in that manner, you'll quickly recognize that you need something that's going to take your natural-language query, your question, and turn it into: how do I find the things that are going to be similar to the answers to this question?

Demetrios [00:29:27]: Yes, I see. That makes 100% sense. And it's like, we want the answer. We don't want the similarity of this question.

Yujian Tang [00:29:38]: Correct.

Demetrios [00:29:39]: I should have known that. I can't believe I didn't know that. And I'm asking you that. Now that I think about it, I kind of feel like an idiot.

Yujian Tang [00:29:48]: This is a very common question that I also get.

Demetrios [00:29:52]: Good. So one other piece that I'm wondering is, what kind of RAG optimization do you do? And what should we be thinking about? Like, let's say we've got the RAG up and running, but now we want to make it better. What are some knobs you can turn and levers you can pull?

Yujian Tang [00:30:13]: Yeah, so I think this is part of where some of the preprocessing stuff comes in. And this also comes before the embeddings model, which is thinking about how you want to chunk up your data. How do you want to split up your data? And earlier, when we were talking about the end user and how the end user is going to use the RAG app, I alluded to this, like, oh, we'll cover this later.

Demetrios [00:30:36]: Foreshadowing.

Yujian Tang [00:30:38]: Yes, foreshadowing. When you think about the way the user is going to use an app, it also makes you think about the way you want to chunk up your data. So for example, if you want your user to use your app in a conversational manner, like it's going to be a chatbot, then you probably can chunk up your data into smaller sections. It's like, oh, okay, we probably need to find very specific, very small kinds of things. But if you perhaps want your user to use your application as something that helps them complete blogs or writing, kind of like what Jasper or Copy.ai does, then maybe what you want is to give it more data in the form of bigger blocks of paragraphs. Or maybe also, not even getting to the part where we talk about what the user is going to do, let's think about what data you are working with to begin with and what it looks like. So if you have data in the format of a conversation, like a text back and forth, then you're going to want smaller chunkings, and you're probably going to want to chunk on specific characters.

Yujian Tang [00:31:49]: Maybe you have specific characters to separate new text blocks, then you're going to want to chunk on that. And then if you have a big document, then you're going to want bigger chunks. If you have maybe a QA or a lecture where there's going to be like a short and then long answers, you're going to want a way to chunk on specific characters. And possibly also you're going to want to be able to save coupled texts. And then one other thing to think about with that is how much context overlap do you need between your data chunks, right? So maybe you have a document and these paragraphs are like 500 characters, but you need the sentence before and the sentence, the last sentence of the paragraph before and the first sentence of the paragraph after. Then you're going to need some sort of chunk overlapping, right. So there's a bunch of different things to think about when it comes to how am I going to set up my data to even embed? This is like prior to building your rag app, you can build an example one. And then once you build an example one, you probably want to think like, okay, there's definitely ways I can make this better.

Yujian Tang [00:32:56]: How do I make this better? The first step you're going to want to look at is what does my data actually look like inside of the application? And then of course the eval stuff.
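
Here is a small sketch of those chunking knobs, chunk size, overlap, and which characters you split on, using LangChain's recursive splitter; the import path follows recent LangChain packaging, and the specific values are illustrative rather than recommendations from the talk.

```python
# Sketch of the chunking knobs discussed above. Chunk sizes, overlaps, and
# separators are illustrative values, not recommendations.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Long documents: bigger chunks with some overlap, so the last sentence of one
# chunk carries over into the start of the next.
doc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],
)

# Chat or Q&A style data: smaller chunks, split on the character that separates turns.
chat_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    separators=["\n"],
)

doc_chunks = doc_splitter.split_text(open("book.txt", encoding="utf-8").read())
```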

Demetrios [00:33:03]: Yeah, that is really interesting when it comes to this idea of start from the end and then go backwards and recognize how you can optimize that way. And I also look at other ways that we are interacting with LLMs, as far as the point-and-click ways that we're doing it, too. Right? Like, I think about Notion AI, and that's a great interface of: we don't have to use it like a chatbot, but we can if we want to. So if you have all of those options open, how are you then thinking about optimizing in the background? You know what I mean? Do you do both and then just sacrifice on the storage like we talked about before, where yes, you are going to store more, but you're also going to have more trust? Or is it something like you're finding that happy medium?

Yujian Tang [00:33:59]: Yeah, actually I have something to touch on that as well. So there's a couple of ways that you can even think about storing the way you store your chunks and embeddings can also affect the way that your application works. So a couple of techniques that I've talked about in some of my talks have been like, you can store larger chunks of data with only maybe one of those sentences embedded, and you can use that as the embedding. You can store the entire chunk with it. Then you get more context every time you have specific things that you're looking up. The other way is that you do the reverse. So you embed an entire paragraph and maybe you only store a sentence, and that way you can get more specific responses to a more generalized query, perhaps. So there's a couple of different ways to kind of address that as well, just like the way that you store your data.

Yujian Tang [00:34:50]: As for the trade-off there, I actually haven't seen this in production, so I'm not entirely sure exactly how the trade-offs work, but it sounds like, yeah, once again, one of the many trade-offs that we make here, like compute versus storage, or storage versus latency, or storage versus quality, or things like that. And this is just another one of those things that we'll have to optimize for when we build in production.
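
Here is a rough sketch of the first pattern just described, embed a single sentence but store the larger chunk around it; embed() is a stand-in for your embedding model, and the names and dimension are illustrative assumptions.

```python
# Sketch of "embed a sentence, store the surrounding chunk": similarity is computed
# on the embedded sentence, but retrieval hands back the whole paragraph. embed()
# is a stand-in; collection name, dimension, and field names are illustrative.
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")
client.create_collection(collection_name="windows", dimension=384)

def embed(text: str) -> list[float]:
    raise NotImplementedError("replace with your embedding model")

paragraphs = open("book.txt", encoding="utf-8").read().split("\n\n")
rows = []
for i, paragraph in enumerate(paragraphs):
    key_sentence = paragraph.split(". ")[0]   # naive pick of the sentence to embed
    rows.append({
        "id": i,
        "vector": embed(key_sentence),        # match on the sentence...
        "text": paragraph,                    # ...but return the whole paragraph
    })
client.insert(collection_name="windows", data=rows)
# The reverse pattern (embed the paragraph, store only a key sentence) swaps which
# string is embedded and which is stored.
```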

Demetrios [00:35:21]: And along those lines, what are things that you feel like RAGs are not suited for? Because it feels like you've seen a lot of square pegs trying to be fit into round holes. Right. Just because RAGs are RAGs and they're the hype these days.

Yujian Tang [00:35:43]: Yeah. What are RAGs not suited for? I mean, really, you can use a RAG for anything you want. I don't know how well it will work.

Demetrios [00:35:52]: Just throw all your data in the vector database and use it for whatever you need. We got you covered.

Yujian Tang [00:35:58]: Yeah, exactly. Throw all your data into Milvus. Use it for whatever you need. We got you covered. Someone asked me the other day, they were like, oh, I want to be able to map all my users, and I want to be able to track all my users, and I want to map them based on specific characteristics that I've predefined for them. And so I'm like, well, then you're not even looking at unstructured data. You're looking at structured data. Go use a SQL database and write an automation.

Yujian Tang [00:36:22]: That's what you should do. That's a great example of when to not to use rag. Okay, there you go. I don't get a lot of people asking me about, a lot of people just ask about rag because it's familiar, but it's also different, right? People are like, oh, this is like, we've seen this before in 2013 and 2000. And whatever the chat bots that pop up and people are like, oh, it's a chat bot. And this has that familiarity to it. But then when it comes to actually using them and building them, and it's like, how do you make these better? Then it's kind of interesting. It's kind of different.

Yujian Tang [00:36:56]: People like that. So I would say, anything that you can think of that relates back to chatbots from 2013, use RAG to replace that. Oh, actually, I have a great example of something that might not need an LLM: similarity search. So I recently talked to somebody who's building some sort of fashion AI tool, and I've built a couple of these myself for fun.

Demetrios [00:37:19]: And one of the things you got to go a little deeper into that fashion AI. I didn't take you much as a fashionable guy there. Not, no offense, but I feel like you wear a uniform of a t shirt and jeans or sweatpants maybe every day, and you're often on your free time building fashion apps. What's going on there? I need more context. I need to do the LLM thing, the rag thing. Again, I'm not necessarily a fashionable person.

Yujian Tang [00:37:54]: But it's just, like, a fun thing to mess around with. Everybody wears clothes, and a lot of people wear clothes, and they look kind of. I get it. But some people wear clothes, and they look really nice. When I was at Aws, I saw this one lady very dressed up. She had, like, this nice peaco. I don't know if you don't know the lot of the words for some of the other things that she was wearing, but I was like, wow, this person looks really put together, and you look everyone else, and everyone else is wearing a tech shirt. Like, I'm wearing a tech shirt and jeans, sweatpants, walking around, and I'm like, okay.

Yujian Tang [00:38:35]: I don't know. It was just, like, a thing that kind of popped into my mind. It would be cool, it'd be fun. So I built one.

Demetrios [00:38:42]: Now we have more context. Yeah. Give me what you were going to say. What was the takeaway? Sorry, I totally derailed that. I had to know. What is the contact?

Yujian Tang [00:38:51]: Yes. Yes. You don't need an LLM to convert your picture of address into the description that this is a picture of address if you have pictures of dresses in your vector database. So that's an example of something that's not a rag app, doesn't use an LM, still uses a vector database similarity search.

Demetrios [00:39:10]: Okay. But you're not using clip or anything like that to get the metadata.

Yujian Tang [00:39:20]: The person who built this fine tuned her own segmentation model.

Demetrios [00:39:24]: Okay, nice.

Yujian Tang [00:39:25]: I was like, wow. I personally wouldn't put that much effort into building a fashion app, but I'm glad that you did because this is cool.

Demetrios [00:39:33]: All right, so wait, what did your fashion app do?

Yujian Tang [00:39:36]: I basically found, like, an open source seg former, like, segmentation transformer thing, and it just segments out. It clips, like, your shirt and your pants and whatever, and then it matches it to other pictures. I had a bunch of pictures of Taylor Swift, a bunch of pictures of random celebrities, and just put them up there. It was like, match your outfit to Taylor Swift or match your outfit to some random celebrity and just like, threw them in there and was like, all right, here you go.

Demetrios [00:40:02]: I'm going to go see if I can match my outfit to Taylor Swift.

Yujian Tang [00:40:05]: Now.

Demetrios [00:40:05]: Give me the link for that. What is it? Is it live? All right, we're going to put this in the show notes for anybody. Okay.

Yujian Tang [00:40:14]: It's on GitHub. It's on GitHub.

Demetrios [00:40:15]: We will put that.

Yujian Tang [00:40:16]: I'll hear this fashion VDB. It's actually kind of, yeah, here we go. Let me drop in here.

Demetrios [00:40:24]: Yes. I love it. This is awesome.

Yujian Tang [00:40:27]: It's not super simple to run. You have to run like the ingestion. And then there's like two things you have to run. I think one is like, ingest, and then the other is like, oh, no. One is set up gradio fashion. The other one's gradio fashion. And then just you get a nice ui, drag and drop. See pictures of Taylor Swift compared to.

Demetrios [00:40:47]: Oh, I love it, dude. That's cool. That is really cool. And you didn't need any llm to deal with that. And you don't need to do rags at all for that.

Yujian Tang [00:40:56]: No LLM, not a rag.

Demetrios [00:40:58]: Don't need to overcomplicate things. But is it giving you a recommendation of what you should wear if you want to look?

Yujian Tang [00:41:07]: It's just like, your outfit looks kind of like this one. It's not that far along. It's not like, I see.

Demetrios [00:41:15]: All right, cool.

Yujian Tang [00:41:18]: I'm not really sure that I would be qualified to be telling anybody what they should be wearing. You need someone with some background on that. Oh, this person's a fashion designer. Okay, that makes sense.

Demetrios [00:41:29]: Right? There you go. And so if you were to make this into a rag app, I feel like you could do a few things where maybe you're recommending it. Is that recommendation on how do you make this outfit look better or look more like Taylor Swift? And then you send it to the LLM and it will give some kind of a recommendation.

Yujian Tang [00:41:53]: Yes. So first step, this could be easily a first step into launching a multimodal rag app.

Demetrios [00:41:59]: Right?

Yujian Tang [00:42:00]: Like you've got images and you want to do rag. So now you've got text, you got images. This is perfect to get into multimodal. What you could do is you could have some sort of LlM or clipvit or something like that and be like, describe the pictures to me. And now you can search the pictures. So instead of putting a picture of yourself and saying, finding one that's close to whatever Taylor Swift, you can be like, oh, show me a picture of Taylor Swift performing at a concert in a green dress and it'll look at the pictures and it'll pull like, oh, okay, here you go. Or you can give it a picture of you and be like, describe what I'm wearing in this picture, and it'll describe what you're wearing. And then you can be like, find me a picture of Taylor Swift wearing something similar to.

Yujian Tang [00:42:38]: And then it'll know a picture of Taylor Swift wearing something similar to whatever it is that you're wearing.

Demetrios [00:42:43]: And are there any considerations that we need to be thinking about when it comes to multimodal RAGs versus just the LLM RAGs?

Yujian Tang [00:42:52]: Oh, man. I think that we are just on the tip of the iceberg with multimodal rags. I think that 2024, right now, we're talking about multimodal rags. There's not a lot of these that have really been built. I haven't seen any in production yet. I think this is the year that a lot of people are going to build them. We're going to see a lot of the things that people are going to run into. I think one of the problems that we're going to run into with multimodal rags that I have no solution for, that I haven't heard any solutions for yet, is just the fact that there is a likelihood or a probability, let's say there's a probability that your LLM is also going to hallucinate on what's in the image.

Yujian Tang [00:43:30]: And so now you have two steps of hallucination. So you have a hallucination of what may be returned and a hallucination of what may be in the image. And so I think this is going to be something that is going to be kind of the next as we use the LLM more and more in an application. In LLM applications, I think one of the main issues is going to be this issue of reducing compounded hallucinations.

Demetrios [00:43:53]: Yeah. Because if it hallucinates once, then it's going to create downstream effects and it's going to be exactly less reliable. Yeah, that makes a lot of sense. And then to debug, that also is not as straightforward as where did it hallucinate? You have now two points of failure, not that single area where it can be hallucinated. All right. And any other ideas on apps that are of interest when it comes to using multimodal rags? Like what are some other use cases that you've seen or that you are thinking about?

Yujian Tang [00:44:32]: I've heard requests for video search or catalog search from people. I think video search, if you're going to do true video search, that's going to be computationally expensive. That's going to be tough.

Demetrios [00:44:43]: Why do you say that? Just because. Frame by frame.

Yujian Tang [00:44:46]: Yeah. I mean, how many frames do you have? And then it's not just like the frames, right. When you have a video file, it's not just the images. You also have the audio, and they have to match up. And so you can't individually do image audio stuff on videos. You have to do it on video files. Right. I mean, you can try to do it with image audio stuff separately, but you have to work it on the video files themselves, and they just take up more space.

Yujian Tang [00:45:16]: They're memory heavy. They have a lot of numbers, so they're naturally compute heavy. So I think that's going to be an interesting space. It's going to be kind of tough.

Demetrios [00:45:27]: And just so we're clear on that one, I feel like the multimodal can be one. Imagine we are having a chat bot that can chat with a movie, and you have different parts of this multimodal, which is one, all the frames in the scene, and then you have another, which is all of the audio that comes out of that. And then you have another, which is like the script that I imagine you could try and get. Right. So there's all those different pieces that come along with it.

Yujian Tang [00:46:02]: Yes. There's the frames, the audio, the script. You can turn the audio into images using MFTC. One thing that I really would like to see, and I think this is going to be quite tough for people to build, is like, video to video search. I want to know, hey, I see this. All right. This is going to sound really nerdy. Okay, so I'm watching this Naruto music video, right? I want to see where these clips are coming from.

Yujian Tang [00:46:37]: What episode is this from? What did I miss? Because a lot of times I'm like, I haven't watched all of the series, but sometimes I'll be like, oh, yes, this is cool. We'll watch one of these. And then I'll be like, oh, where's that from?

Demetrios [00:46:49]: Nice. Yeah. So it's almost like what we were talking about before, where you have citation. You have citation in the videos.

Yujian Tang [00:46:58]: I want to see citations for videos.

Demetrios [00:47:00]: Yeah, there you go, dude.

Yujian Tang [00:47:02]: I think we're a little bit away from that, but it would be cool.

Demetrios [00:47:05]: Yeah. I mean, a little bit away as in this year or this decade.

Yujian Tang [00:47:14]: I would be surprised if this wasn't solved in this decade. I would not be surprised if it wasn't solved this year.

Demetrios [00:47:22]: Okay, cool.

Yujian Tang [00:47:23]: Or solved. Mostly solved. Let's say usably solved.

Demetrios [00:47:27]: Usably solved. Yeah, that first iteration, dude. Well, this has been fascinating for everybody out there who wants to play around with the pipelines that you were talking about. We can just go to zillas.com slash pipelines, or Google.

Yujian Tang [00:47:46]: You can just cloud zillows.com. Sign up for a free account if you spin up one of the serverless options. The serverless option is the free option. You get immediate access to pipelines and it's in the UI. You can click around and build your own.

Demetrios [00:48:00]: Excellent. And if you do that, make sure they send out, I think you mentioned you send out a feedback form to anyone that does that. And so in that feedback form, let them know that you listen to this crazy chat that Yunjid and I just had and let them know that we sent you over that way. It would be great to hear that, man. I am still impressed that you can build fire, and I appreciate you coming on here and talking with me. This was brilliant. ā€Š ā€‹ā€Š

