Solving LLM Data Problems
Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience, with research papers published at conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.
At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
The main problems faced by LLMs, such as hallucinations, lack of domain knowledge, and outdated information, are all data problems. How do we fix these data problems? Add a layer on top of the LLM with the ability to search the data we need to use.
Introduction
Now we're gonna keep it cruising. We're gonna bring on my man Yujian. Where you at? Where you at? Here he is. Oh, how's it going, dude? All I can say is people love freaking vector databases, so you are gonna hit this outta the park. It's gonna be too easy.

Can you hear me all right? I can hear you. Can you hear me? Yep, I hear you loud and clear. So, dude, I'm gonna let you take it from here, and I will be back in another 30 minutes. I want to mention to everyone, if they have not checked it out, the Zilliz booth in the solutions tab right here. You've got the virtual booth from the people that bring you Milvus, so you know what it is. All right, I'm gonna leave you, and I'll be back in like 20, 25 minutes, man. Looking forward to this chat. All right? Okay.

I think everyone can probably see the screen now. Yeah, so welcome to this talk.
My name is Yujian. I'm gonna be talking to you about the key to scaling LLM applications. So everybody's kind of building on LLMs now, and one of the major things that we are concerned with is how do you scale these applications and how do you productionize them? Right? Some of the problems with LLMs mostly come down to data. And costs. They don't have the domain knowledge that you need. They don't have up-to-date data, right? For example, GPT-3 was trained on data up until September of 2021. Or it can just be incredibly expensive to rerun many, many queries, especially if you're gonna be running them frequently. So, a little bit about me.
My name is Yujian Tang. I am a developer advocate at Zilliz. These are my contact links; the QR code listed below will take you to my LinkedIn. So if you've got your phone out, you can scan that, connect with me, and message me. A little bit about Zilliz. Zilliz is the company behind Milvus, which is an open-source vector database. We are the world's most popular open-source vector database, and we are really, really good at scale; I'll tell you why Milvus has such good performance at scale later on. But for now, that's kind of what you need to know, right? And these are some links to go find us. Join us on Slack, come talk to us about vector databases, unstructured data, AI agents; we've got channels for all this stuff. And also check us out on GitHub. Milvus is a project that is written in Go. Okay, so the content of this talk is pretty much gonna be about large language models.
I'm gonna walk you through neural networks in general, starting from, not way back, but about 10 years ago up until now, and talk about some of the challenges that LLMs face, right, the data challenges. Then we're gonna talk about the CVP framework, which is a framework that we here at Zilliz abide by and advocate for in terms of building out production-level LLM apps. The CVP framework stands for ChatGPT, Vector database, and Prompt-as-code. Then we will go over what a vector database is, because we're gonna be talking about them but won't really cover them until near the end. We're gonna talk about what a vector database is and we're gonna talk about Milvus, and if at the end we have time, I will go over a quick demo of a scalable LLM application. So, part one. Unless you've been living under a rock, you know about ChatGPT, you know about Claude from Anthropic, you know about Bard from Google. They've been in the news like crazy, and there's been a ton of development in this space.
So let's go back 10 years, when convolutional neural networks got pretty big. The reason we're gonna talk about this is because convolutional neural networks give you this thing called local context. You can see from this very simple image, "Milvus is the world's most popular open source vector...", that these arrows are pointing to the neurons that are right next to them. And that is basically what local context is in the sense of a convolution. The next thing that came about was self-attention, or global context. I didn't finish this sentence; the one in the slide obviously just says "Milvus is the world's", and it should say "most popular open source vector database". But you can see that the way the self-attention architecture is different is that we are pointing our connections to every neuron in the next layer.
And finally we have causal attention, which is kind of the way that these statistical, stochastic generative models work, right? Every token after the first one has access to the first one; every token after the second one has access to the second one. So in this case, "Milvus is the world's most popular vector database": the word "world's" has access to "Milvus", but "Milvus" doesn't have access to the word "world's". And this is directional global context, right?
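To make the "directional" part concrete, here is a minimal sketch (mine, not from the talk) of the lower-triangular mask that implements causal attention, using the slide's sentence as the token sequence:

```python
import numpy as np

tokens = ["Milvus", "is", "the", "world's", "most", "popular"]
n = len(tokens)

# mask[i, j] is True when token i may attend to token j.
# The lower triangle means each token sees itself and everything before it,
# but nothing after it -- "world's" sees "Milvus", never the reverse.
mask = np.tril(np.ones((n, n), dtype=bool))
print(mask)
```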
So this brings us to kind of today, how LLMs are made. Obviously LLMs are not this simple-looking; they have transformers, encoders, decoders, however you want to imagine that architecture. But at their core, LLMs are still stochastic models. They fit this neural network picture, and the underlying principle of neural networks is that they are stochastic. So what happens, really, if you are working with an LLM, is the LLM is gonna generate the sentence "Milvus is the world's most popular vector...". And the way it finishes this sentence is it basically asks: given all of these former tokens, what is the most likely next word? And so in this example, "database" is the most likely next word. Obviously the downside to this is hallucination, which is what these slides are oriented around: solving the problem of hallucination as one example of vector database usage, and why vector databases are such an important part of the LLM stack.
I'm gonna just show some math here for those of you who are into the math behind these things. Basically, the way you formulate this problem that LLMs are solving, the way they're getting the statistically most likely next token, is: given some set of tokens t0, t1, all the way to tn, predict tn+1. And the way we formulate that from a probability perspective is the probability of "database" given "Milvus is the world's most popular vector", which is very similar to Bayesian statistics. If you remember that, you can think of the way these generative models generate their tokens in the form of these statistical models.
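Written out (my reconstruction of the slide's formula, consistent with the description above):

```latex
% Given tokens t_0, t_1, \dots, t_n, pick the most likely next token:
t_{n+1} = \arg\max_{t}\; P\left(t \mid t_0, t_1, \dots, t_n\right)
% e.g. P(\text{``database''} \mid \text{``Milvus is the world's most popular vector''})
```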
So, challenges with LLMs. These are very, very popular models, but they still have some of their own issues, and like I was saying earlier, one of the biggest is hallucination. For example, if we query ChatGPT with "How do I perform a query using Milvus?", it's gonna give me some code. And the code is actually gonna look like it could be right, because it looks like Python code, right? But what we'll notice, if we know how Milvus works, if we've read the documentation (this is open-source software, so the documentation is out there), is that this is actually not how you use Milvus; this is not how the connection is done. Later on we'll see how this is actually done, but for now I just want to show this example of what a hallucination is. The solution to these hallucinations is basically to inject domain knowledge into large language models.
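For contrast, here is a minimal sketch of what a real Milvus query looks like with the pymilvus client; the collection name, field name, and dimensions below are made up for illustration:

```python
from pymilvus import connections, Collection

# Connect to a running Milvus instance (these are the usual defaults,
# but treat the values as illustrative).
connections.connect(alias="default", host="localhost", port="19530")

collection = Collection("docs")    # hypothetical collection name
collection.load()                  # load the collection into memory for search

query_embedding = [0.0] * 1536     # placeholder for a real query vector

results = collection.search(
    data=[query_embedding],        # one or more query vectors
    anns_field="embedding",        # hypothetical vector field name
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,                       # return the top 5 nearest neighbors
)
```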
Really, to inject domain knowledge on top of large language models, into LLM applications. And how do we do that? We use the CVP framework. The key idea behind the CVP framework is that we can view LLM apps as a general-purpose computer: kind of like how computers work, there's a processor, there's memory, and there's code. In this CVP framework, the C is ChatGPT or any other LLM; you could actually replace the letter C with L, but we've just decided to go with C. This can be interpreted as the CPU, the processor, the main compute power behind these LLM applications. V is the vector database. This is the storage, your ROM, kind of your hard drive in a sense. So in this example, Milvus could be the vector database. And then P, prompt-as-code, is basically your interface, like the UI in a sense, the code that makes the UI, the OS, whatever; the way that you interface between the processor and the storage is through code, right?
An example of an application that uses this CVP framework is an application called OSS Chat, which, if you use it, will actually give you the correct way to use Milvus. This is a chat bot that talks to open-source software, and we've got a bunch of different open-source projects on there. It's not just Milvus; there's PyTorch, YOLO, Auto-GPT, many other open-source projects that we've put behind OSS Chat. The way this works is basically we get the documents and we generate some set of embeddings from the documents and the answers. We'll also ask ChatGPT to generate some questions that would make sense based on the documents, so you're working with both the question space and the answer space. Then we use Zilliz Cloud to host, sorry, to hold these embeddings. And when a user comes and asks a question such as "How do I perform a query using Milvus?", they should get a correct response back, because we've scraped the actual code and the actual documents that are related to the software.
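As a rough illustration of that retrieval flow (everything here is a toy stand-in: the embed() helper fakes an embedding model, and a real deployment would store the vectors in Milvus rather than a Python dict):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

# Pretend these are chunks scraped from the docs.
chunks = [
    "To connect, call connections.connect(host=..., port=...).",
    "To search, call collection.search(...) with a query vector.",
]
index = {chunk: embed(chunk) for chunk in chunks}

def retrieve(question: str, k: int = 1) -> list[str]:
    # Nearest-neighbor lookup over the stored embeddings.
    q = embed(question)
    return sorted(index, key=lambda c: np.linalg.norm(index[c] - q))[:k]

# The retrieved chunks get stuffed into the prompt, so the LLM answers
# from real documentation instead of guessing.
context = retrieve("How do I perform a query using Milvus?")
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```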
We also use this thing called GPTCache, which is another open-source library. This was built originally as part of OSS Chat, because we saw that there were some questions that were asked many times, and it is inefficient and costly to run the same questions multiple times through an LLM. Maybe many users come and say, "How do I use PyTorch to resize my images?" If you have 10 users coming and asking that, you probably want to cache that, just query your database and say, okay, here's your answer, we're just gonna return this. We don't need to query the LLM to create the answer for us, or to decompose the query if you're using something like LlamaIndex. A sketch of the caching idea is below.
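Here is a minimal sketch of that kind of semantic cache, not GPTCache's actual API; embed() and call_llm() are fake stand-ins, and the similarity threshold is an assumption:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

def call_llm(question: str) -> str:
    return f"(expensive LLM answer to: {question})"  # stand-in for a real call

cache: list[tuple[np.ndarray, str]] = []  # (question embedding, answer) pairs

def ask(question: str, threshold: float = 0.95) -> str:
    q = embed(question)
    for cached_q, cached_answer in cache:
        # Semantically similar questions embed close together, so high
        # cosine similarity means we can reuse the stored answer.
        sim = float(q @ cached_q / (np.linalg.norm(q) * np.linalg.norm(cached_q)))
        if sim >= threshold:
            return cached_answer
    answer = call_llm(question)
    cache.append((q, answer))
    return answer
```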
Okay, so how does this solve hallucinations? Why does this work? I've told you that this is supposed to work this way, but why does it work? What vector databases do is give us access to domain knowledge, and then they allow us to perform semantic search on that domain knowledge via the vector embeddings.
Vector embeddings are numbers that represent some sort of object. In the example I've shown here, the object is a word, and the example shows that if you take the word "queen", subtract the word "woman", and add the word "man", you get the word "king". I hesitate to say this is an old paper, because it's not that old, but it is a seminal paper in this space, and the advances in technology have gone far beyond what is shown here. This is a very simple example; nobody's operating on two-dimensional vectors anymore. If you're using OpenAI, I think it's 1536 dimensions; if you're using Cohere, it's 768; there are some that are 384.
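To make the queen - woman + man = king arithmetic concrete, here is a deliberately tiny 2-D toy version (the vectors are hand-picked for illustration; real embeddings are learned and, as above, have hundreds of dimensions):

```python
import numpy as np

# Toy 2-D "word vectors": first axis ~ royalty, second axis ~ maleness.
vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

# queen - woman + man should land near king.
result = vectors["queen"] - vectors["woman"] + vectors["man"]

# Exclude the input word, as analogy benchmarks usually do.
nearest = min(
    (w for w in vectors if w != "queen"),
    key=lambda w: np.linalg.norm(vectors[w] - result),
)
print(nearest)  # -> "king"
```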
So the standard size of these vectors is in the hundreds of dimensions. In practice, the way this works, the way you can operate on these vector embeddings and do math on objects like images or words, is you take your knowledge base, you run it through a deep learning model, and you get the vector embeddings out from that model. And the way that works is that you actually take the outputs from the second-to-last layer of your model in order to get the vector embeddings. A deep learning model will typically do something such as classification, or maybe some kind of recognition task, part-of-speech tagging, something like that. But instead of asking the model to give us the classification, we just pull the values from the second-to-last layer, and that is a representation of the object that we put into the network. Then we put that into a vector database such as Milvus.
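A common way to do this, shown here as a sketch with a torchvision ResNet standing in for whatever model you actually use: chop off the final classification layer so the forward pass returns the penultimate activations, i.e. the embedding, instead of class scores.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Keep everything except the last fully-connected (classification) layer.
backbone = torch.nn.Sequential(*list(model.children())[:-1])

image = torch.randn(1, 3, 224, 224)         # stand-in for a preprocessed image
with torch.no_grad():
    embedding = backbone(image).flatten(1)  # second-to-last layer output
print(embedding.shape)                      # (1, 512)
```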
Okay, so what is a vector database, and why do you need one? Here at Zilliz, we like to say that a vector database is a database purpose-built to store, index, and query large quantities of vector embeddings. And with Milvus, the focus really is on billion scale, on these large quantities of vector embeddings.
This next slide is a big text dump, but the basic idea you want to get from it is that there are actually other solutions for performing vector search or working with vector embeddings. You don't need a vector database. You can use a high-performance vector search library such as FAISS, Facebook AI Similarity Search. You can use hierarchical navigable small worlds, HNSW. You can use Annoy, Approximate Nearest Neighbors Oh Yeah, which was built by Spotify. A sketch of the library route is below.
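For instance, a minimal FAISS example (dimensions and data sizes are arbitrary; this is the exact, brute-force index, the simplest possible case):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                               # embedding dimension
xb = np.random.random((10_000, d)).astype("float32")  # vectors to search over
xq = np.random.random((5, d)).astype("float32")       # query vectors

index = faiss.IndexFlatL2(d)          # exact L2 search, no training needed
index.add(xb)
distances, ids = index.search(xq, 4)  # top-4 neighbors per query
```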
But if you want a more production-ready vector search or vector similarity application out of the box, then you probably need a vector database. Vector databases allow you to do things like filtering on your vectors, so perhaps you only want to filter on some of your metadata. They allow you to do hybrid search, so for example you can search your text along with a dense vector, maybe a sparse vector. They provide you with backups, so your data is backed up and you're not gonna lose it easily. High availability, so you don't have to worry about scaling. Maybe sharding if you're doing a lot of streaming data. Aggregation search, parallel search, lifecycle management, multi-tenancy, working on a GPU accelerator; for example, we have a GPU-accelerated integration with NVIDIA.
One of the other things we do really well is billion-scale storage. So let's walk through some vector indexes so you can understand how we're comparing these vectors and how that works. This is the one I was talking about earlier, called Annoy, which is approximate nearest neighbor search. This builds a binary tree, and the way it does it is you take two points in your space and split the space in half, and then you do it again and again and again until you have maybe three, four, five vectors in one section; how many vectors end up in one section is a hyperparameter you can control. This produces, like I said, a tree kind of index, and the way you search is you say: find me the closest region, and then the closest vector in that region.
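A minimal Annoy example (sizes and the metric are arbitrary choices for illustration):

```python
import random
from annoy import AnnoyIndex  # pip install annoy

dim = 16
index = AnnoyIndex(dim, "euclidean")
for i in range(1_000):
    index.add_item(i, [random.random() for _ in range(dim)])

# Each tree is built by recursively splitting the space as described above;
# more trees means better recall at the cost of build time and memory.
index.build(10)  # 10 trees

query = [random.random() for _ in range(dim)]
ids, dists = index.get_nns_by_vector(query, 5, include_distances=True)
```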
Then there's IVF, which looks like a Voronoi diagram. Basically, the way this works is some sort of k-means-type clustering algorithm: you throw your data in there, you cluster all your data, and you come up with some number of centroids, these areas that are each clustered around a certain centroid. And when you search, you say: let's say I want to search five centroids; maybe your data point is somewhere in here, right? You search those five centroids, and then you see which of the actual vectors is closest to your new data point.
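Here is that idea as a FAISS IVF sketch (again with arbitrary sizes); nprobe is the "search five centroids" knob:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist = 64, 100                    # dimension, number of centroids
xb = np.random.random((10_000, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)      # assigns vectors to their nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                       # the k-means step: learn the centroids
index.add(xb)

index.nprobe = 5                      # "search five centroids"
xq = np.random.random((1, d)).astype("float32")
distances, ids = index.search(xq, 4)
```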
Then HNSW. This is a really, really popular one, and it's also a somewhat complex algorithm. Basically, you're creating a graph index of all of your data points, and when you insert them into the graph, what's actually happening is you're also assigning each data point a uniform random variable. That uniform random variable just determines what layer they're gonna be in. So for example, let's say we're assigning a uniform random variable: if you get between 0 and 0.9, maybe you're in layer zero; if you get between 0.9 and 0.99, you're in layer one; if you're between 0.99 and 0.999, you're in layer two; and so on, and so on, and so on. As you can see, the layers get sparser and sparser as you go up, just because of this uniform random variable. Now, that does mean it's possible that you have 4 million data points and everything is in layer zero, but it's unlikely. And the way the search works is you start at the top layer, and then you just sparsely search your way down.
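A tiny sketch of just the layer-assignment step, mirroring the thresholds from the talk's example (the 0.9 base is the talk's number; real HNSW implementations parameterize this differently):

```python
import random

def assign_top_layer() -> int:
    """u in [0, 0.9) -> layer 0; [0.9, 0.99) -> layer 1;
    [0.99, 0.999) -> layer 2; and so on, each layer 10x sparser."""
    u = random.random()
    layer = 0
    threshold = 0.9
    while u >= threshold:
        layer += 1
        threshold += 0.9 * (0.1 ** layer)  # 0.9, 0.99, 0.999, ...
    return layer

# Roughly 90% of points land in layer 0, ~9% in layer 1, ~0.9% in layer 2.
print(sorted(assign_top_layer() for _ in range(20)))
```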
And it's really fast because it's graph search, right? Your graph is already built; you just have to find the next point. So let's talk a little bit about architecture. Why is Milvus fast? Why is Milvus good? Milvus uses a distributed-system-native backend. In this diagram, you probably don't need to pay attention to most of it; you just need to know how you interact with it: the SDK sends requests to a load balancer, and the load balancer routes all the requests. The important part is the stuff about the query nodes, the data nodes, and the index nodes. These could actually all be the same node; they're kind of like a pod in Kubernetes. But we have dedicated nodes doing different things because it basically gives better performance. So the query nodes query the database and return the data that you need. The data nodes hold the data that you need to work with in memory. And the index nodes do your indexing when you're indexing. And I'll just address this here: with Milvus, you should almost never need to re-index, because of the way the architecture is. Milvus actually saves all your data in 512-megabyte chunks instead of one whole block, and that makes querying much, much faster, because then what you do is query these 512-megabyte chunks in parallel instead of running through one entire, say, 3,000-gigabyte block.
And then when your data is in there, we save it into S3 or MinIO or Azure Blob, some sort of permanent storage. So yeah, that's vector databases; that's Milvus. Do we have questions at the end of this, or? Yeah, we can definitely get some questions going.
There are a few coming through in the chat, but it takes probably about 20 seconds for the stream to come through. And you're good with the screen? I can take it off, sharing your screen. Yes. You wanna do the demo? Let me do the demo. You see this? Yes, I can. Do you want me to ask a question now or at the end of this?

We'll prompt people to ask their questions now, and by the end of the demo, hopefully it'll be full in the chat. Yes, that'll work. So I'll just show a quick demo here. There we go. We'll say something like, "What can I build with LangChain?" I think LangChain is on here. And if you try this at home, I would go to ChatGPT and ask ChatGPT what you can build with LangChain and see what it says, because it will not tell you this. But this one tells us LangChain can help us build a variety of tools with LLMs. We can say, "Show me how to use ConversationChain in LangChain," and it should show something about ConversationChain, which I think is one of the tools in LangChain.

But yes, I suggest that you go try this with an LLM without any layer on top of it and just see what happens. See, it builds out the code, et cetera, et cetera.
Yeah, that's pretty much it for my demo here.
There we go. There we go, dude. Sweet. All right, so: what made you use Go instead of Rust? That's a good question. I don't know, actually. That's not what you're supposed to say in this! Make some shit up. Hallucinate like ChatGPT, man. Come on. Hallucinating like ChatGPT; I hallucinate more than ChatGPT. Wow. Yeah, I really don't know, though. I don't know how popular Rust was back when Milvus was written, in like 2017. I think Go was coming up and was a pretty popular language, and people were like, oh, this is really performant and also safe. Whereas Rust is actually pretty good for a lot of ML-type stuff. I would say that, well, Rust was around in 2016, 2017. Yeah, I know that, actually. I'm not really sure, though. All right. Well, we got a few more that came through here. How does Milvus compare to Vespa and Qdrant? How does Milvus compare to Vespa and Qdrant? Well, actually, oh, I have something for you for this.
We just released this thing called VectorDBBench, and, is there somewhere for me to paste this link? Yep, just throw it in the chat right here, and I'll put it into the big chat. There we go. That is awesome. I think people are gonna love it. Yes. This is a vector database benchmarking tool that includes Milvus, Zilliz Cloud, Qdrant, and Weaviate, with results for those four, and, oh, actually Elasticsearch too, so results for these five. And then it allows you to perform your own benchmarking as well. But if you download it, you can see all of the results for these five existing ones. And I will just tell you that Milvus really outperforms Qdrant when it comes to load testing, like the number of vectors you can put in there. And, spoiler alert, hold on.
And also queries per second, for vectors at large sizes and small sizes. All right, I like it. We've got a question coming through from Matteo: how can you cache semantically identical questions that are worded differently? Oh, great question. Ooh, that is a really good question. You would save them with their vector representation, and if the vector representation is close enough, we just return the closest one. It's actually just straight up: look at the vector representation. No other real tricks there. Nice. All right. So a few people are saying in the chat that they really appreciate your honesty and they loved the talk.
It was very helpful. And, dude, I'm gonna direct you over to the chat for people that are asking more questions. Feel free to keep it going. And we just recorded a podcast episode that is gonna drop real soon, hopefully next week if I can get all my ducks in a row, as they say. So I look forward to that happening. And Bradley just, or Brady, sorry, Brady just asked another question in the chat, so I'm gonna kick you off to go over there, because I've got a fireside chat to get to, and I'm gonna bring on the next guest, as I mentioned earlier. And I'll mention again: you can call me Ringo today, because I'm keeping time like a clock, baby. I'll see you later. Dude, it was great. See you all.