MLOps Community

[Exclusive] Zilliz Roundtable // Why Purpose-built Vector Databases Matter for Your Use Case

Posted Mar 15, 2024 | Views 662
# Vector Database
# MLOps
# Zilliz
# Zilliz.com
SPEAKERS
Frank Liu
Head of AI & ML @ Zilliz

Frank Liu is Head of AI & ML at Zilliz, with over eight years of industry experience in machine learning and hardware engineering. Before joining Zilliz, Frank co-founded Orion Innovations, an IoT startup based in Shanghai, and worked as an ML Software Engineer at Yahoo in San Francisco. He presents at major industry events like the Open Source Summit and writes tech content for leading publications such as Towards Data Science and DZone. His passion for ML extends beyond the workplace; in his free time, he trains ML models and experiments with unique architectures. Frank holds MS and BS degrees in Electrical Engineering from Stanford University.

Jiang Chen
Head of AI Platform and Ecosystem @ Zilliz

Jiang Chen is the Head of AI Platform and Ecosystem at Zilliz. With years of experience in data infrastructures and information retrieval, Jiang previously served as a tech lead and product manager for Search Indexing at Google. Jiang holds a Master's degree in Computer Science from the University of Michigan, Ann Arbor.

Yujian Tang
Developer Advocate @ Zilliz

Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience, with research papers published at conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.

SUMMARY

An engineering deep-dive into the world of purpose-built databases optimized for vector data. In this live session, we explore why non-purpose-built databases fall short in handling vector data effectively and discuss real-world use cases demonstrating the transformative potential of purpose-built solutions. Whether you're a developer, data scientist, or database enthusiast, this virtual roundtable offers valuable insights into harnessing the full potential of vector data for your projects.

TRANSCRIPT

Demetrios [00:00:09]: What is happening? Everyone, welcome. Welcome to another MLOps Community roundtable. Today we've got something special in the works, talking all about vector databases, and hopefully going to demystify a bit of what you're looking at when you're thinking about and testing out your vector databases. I figured, you know what? As we like to do around here, I might as well bring out our guests of honor with a little ditty. And so I brought my guitar. And if you will permit me, in this moment, I'm going to start playing that guitar, because we've got three incredible speakers, or panelists, if you want to call them that. And it's only right that we give them a warm welcome. So while I am jumping in here and making some noise, or what some people might call music, I would love to hear in the chat where you all are calling in from.

Demetrios [00:01:27]: Let's get this started. Today we're gonna talk about these vector databases. We've got three of the best out there coming and giving us their time. Frank, Yujian, and Jiang are gonna be schooling us: the Head of AI Platform and Ecosystem at Zilliz, the Head of AI & ML at Zilliz, and we've also got a Developer Advocate. It's time to start the show. And if anybody has questions about what's going on with these vector databases, drop a question in the chat. We want to hear from you.

Demetrios [00:02:41]: Ayo, let's bring out our guests of honor. Now, where are they at, Mr. Frank? There he is. For everybody that does not know, Head of AI and ML at Zilliz. I see you laughing. You were not expecting that kind of intro, were you?

Frank Liu [00:03:00]: I was not expecting that kind of intro, but I'm very pleasantly surprised, as.

Demetrios [00:03:04]: I am always when someone serenades you. We've also got Yujian. Where you at, man? Hey, there you are, the Developer Advocate at Zilliz. And last but not least, Jiang, where you at? There you are, Head of AI Platform and Ecosystem. So today, folks, we're going to be talking all about why purpose-built vector databases matter and whether you should be thinking about them for your use case. We really want to get into a deep dive on the vector database scene, because right now it feels like just about every database out there has bolted on a vector offering. And so I think there is this huge question in people's eyes as far as when do I just use the database that I've got and its vector offering, whatever my favorite database flavor is, versus when do I look at going all in on a purpose-built vector database, and why would I want to.

Demetrios [00:04:22]: What are the trade-offs that I'm thinking about? And so I'm excited. I can't believe we got all three of you on here on this call at the same time. It's an absolute privilege. So I think what we should start off with, and Frank, because I brought you up first, I'm going to direct this first question at you. When it comes to vector databases, there are always the questions of whether you need one or not, because these LLM context windows are increasing, right? And most specifically, when we talk about RAG, I think a lot of people, as Google put out their paper, were like, oh, do we need vector databases anymore? Because now, the LLM, you can throw the Bible at it, and you can throw the Quran at it, and you can throw the Tibetan Book of the Dead at it, whatever you want to throw, you can throw it all and more at it. So where are you sitting on this? How do you look at it?

Frank Liu [00:05:28]: Well, yeah, you could throw the Bible at it, you could throw the Quran at it, whatever. But I think the real question is, there's a couple of key factors here to think about. First is, do you really want to pay? If you're going to query over, let's say, the Bible, do you really want to pay a dollar per query? Right? Do you really want to say, okay, I'm going to put all of this stuff into the context window for my large language model, and I'm going to pay a dollar to ask a question? Now, if I have 100,000 questions, I have to pay $100,000 just for all of that. So that's one question. As you add more and more context, the price, what you're charged, becomes more and more expensive. And not just from a dollar perspective, also from a compute perspective as well. So large language models, they're transformer-based at the end of the day.

Frank Liu [00:06:20]: So they really are quadratic in the number of tokens in terms of compute complexity, right? In terms of runtime complexity. So it's both a factor of time as well as cost. I think the other thing to think about is, okay, you have a million tokens that you can feed your large language model, but if I'm trying to index, let's say, the San Francisco Public Library, or if I have a ton of books that I want to index, it's not all going to fit into the context window at the end of the day. Still, I think because of these two factors, irrespective of how long your context window is, irrespective of how large your language model is, you will still need vector databases to really store and retrieve all the context that you really want.
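
To make the cost argument concrete, here is a rough back-of-the-envelope comparison of stuffing a whole corpus into the context window on every query versus retrieving a handful of chunks first. All prices and token counts below are hypothetical placeholders, not any vendor's actual pricing:

```python
# Back-of-the-envelope cost comparison (all numbers are hypothetical).
PRICE_PER_1K_INPUT_TOKENS = 0.01   # assumed price, not a real model's rate
CORPUS_TOKENS = 800_000            # e.g. a large book or small document collection
RETRIEVED_TOKENS = 5 * 500         # top-5 chunks of ~500 tokens each via vector search
NUM_QUERIES = 100_000

full_context_cost = NUM_QUERIES * (CORPUS_TOKENS / 1000) * PRICE_PER_1K_INPUT_TOKENS
rag_cost = NUM_QUERIES * (RETRIEVED_TOKENS / 1000) * PRICE_PER_1K_INPUT_TOKENS

print(f"Stuff everything into context: ${full_context_cost:,.0f}")
print(f"Retrieve top chunks first:     ${rag_cost:,.0f}")
```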

Demetrios [00:07:09]: Yeah, Yujian, I see that you were, like, thinking, maybe should I step up to the...

Yujian Tang [00:07:20]: Like, the other thing that I would say with this is they still have this kind of lost-in-the-middle problem. I don't know, what is Claude 3's new cost, like $75 for a million tokens? And that's a lot. Just to kind of add on to what Frank was saying, a dollar per question might actually be underestimating it if you're going to be...

Jiang Chen [00:07:49]: Talking to the, you know...

Demetrios [00:07:51]: Yeah. And it does feel like, I often wonder about the performance, whether that is the best way to do it. So there is a clear piece around speed not being up to par, and then the other side is like, you're potentially going to burn a hole in your pocket. But whether this is actually going to give you the most relevant answer is probably the most important thing, because maybe people will pay at the end of the day and they will wait. But if it also doesn't give you the most relevant answer or the best answer that you could get, I think then it just makes no sense. So that's another piece that I would love to see more on. I haven't seen much around that. I don't know if any of you guys have.

Frank Liu [00:08:40]: I think that's an interesting question. I think a lot of the research, at least that I've seen today, when it comes to retrieval, is very much what people like to call needle in the haystack. I think one of the earliest examples is you have, you know, the Harry Potter series. They took, whatever, Harry Potter and the Chamber of Secrets, and you take this entire book, all of its tokens, and you fit that into the token window. And then right in the middle, you have a sentence like, "And Yujian is a developer advocate," or something silly like that. Right. "And Yujian loves machine learning," something like that. But to me, it's a really unfair comparison.

Frank Liu [00:09:21]: First, because how many times does the phrase "machine learning" appear in the Harry Potter series? Right? How many times does the word "Yujian" appear in the Harry Potter series? Probably never, like, it doesn't appear at all. So it's very easy for a large language model, for these attention mechanisms that are in LLMs, to pick out that subset, those phrases. And the other thing is, I think there's also recent research. I was reading a paper recently. It talks about how the holy grail of large language models is reasoning and generalization across different tasks. And if you have long context, and you add these sort of bits and pieces and you try to do reasoning across that long context, the performance becomes much worse as you increase the amount of tokens. So at the end of the day, it's still relevant.

Frank Liu [00:10:18]: Retrieval is still very relevant in my eyes, and I think it will continue to be. Now, maybe one day, I think one day, really far down the road, it doesn't really make sense to use RAG anymore. I do see that potentially happening, but I think in the short term, it doesn't matter how long these context windows are.

Demetrios [00:10:36]: Yeah. The other thing that I often think about too is how much goes on outside of just the prompt and the context window, and how many other pieces you need to be thinking about. It is not just like, oh, cool, I'm going to put all this data into a context window and then it's going to make my life easier. And really, it reminds me of that famous Google paper back in the day, where the model was just one part of the entire system. And so it still feels like this: the context window is just one part. How clean is the data? How are you ingesting that data? How are you quality controlling it? If you're getting tables from PDFs, how do you make sure that those tables are actually what's in the PDF? And there are just so many other questions that are beyond the context window. And so that's another huge piece.

Frank Liu [00:11:35]: Yeah, absolutely. I think I want to give Jiang some time to speak about this as well, because he does those pipelines and he's got tons of experience. You mentioned PDFs, tables, all these kinds of different modalities of data. But the analogy that I like to use is that the context window, I saw this somewhere, I don't remember where I saw this, but I thought it was a great analogy, right, the context window is like RAM, or it's like an L1, L2, L3 cache, and your database is like disk. Now, can you store everything in RAM? Probably not, unless you want your cost to be astronomical.

Frank Liu [00:12:11]: But it's meant to give you different layers of storage, so to speak. Right? Just as you don't want everything to be in your RAM all the time and continue to ask questions over that, you want to offload some of those capabilities to disk, to your vector database. Anyway, that's just my two cents. I think perhaps we haven't done as great of a job of educating the community on why vector search is still relevant even with these long context models that can do needle-in-a-haystack retrieval. But that's what we're doing here, right? Hopefully in the future things will change.

Demetrios [00:12:56]: Talk to us about pipelines.

Jiang Chen [00:12:57]: Yeah, so coming from a background of search indexing, I do have a lot to say about this kind of, I would say, hype that long context will take over everything. To me, that doesn't really make sense from a production perspective, because, well, there are just so many things to consider when serving user queries in the real world: you're considering the latency, you're considering the user experience, you're considering the cost. All sorts of aspects are not considered when doing this very abstract academic research on the needle-in-a-haystack experiment. It's more of a stress test of the capability of large language models rather than promoting that you should use this in your production setup; I think that makes no sense. But why? So I can give some examples or throw in some ideas here. One thing is, in the search world we always have the analogy that the serving time, like serving a user query, is just the tip of the iceberg. There's a whole amount of effort being spent at offline indexing time.

Jiang Chen [00:14:14]: We do that because a lot of the analysis and understanding of the content is really expensive. It involves running many machine learning models and doing very lengthy analysis on that content. And you simply cannot afford doing that when the user is sending a query to you, right? You are supposed to give back an answer within, like, ten milliseconds, something like that. So there's a huge principle of doing all that kind of heavy-lifting stuff offline. And I think that's one of the main principles of RAG as well. For RAG, you do retrieval to augment or to enhance the large language models. And a huge part of retrieval is that you need to understand the content and prepare all those efficient indexes for that content before the user query even comes. So I think that pattern is a great advantage or value of RAG.

Jiang Chen [00:15:23]: There are also other considerations, because by doing this offline you are saving time, not only time, but also cost. Because if you were not doing that, then every time a user query comes, you would need to do that once more for that particular query. And that's not affordable. So even though, say, understanding this whole Bible takes, say, $3, spending $3 once is a whole different story than spending it millions of times as user queries come in. So all those sorts of things were not considered in this experiment. I'm personally excited about this experiment because it's really extending the boundary of the capability of large language models. But looking back at what kind of production system we want today, I think RAG is still relevant. It's never been more relevant than right now.

Demetrios [00:16:19]: I really enjoy this idea that you're talking about, that the needle in the haystack should be looked at as more of a stress test, not like a daily-driver type thing. It's not a best practice that people are advocating for. It's just like, how far can we push this? And at least not today should we be thinking that's how we want to architect our systems and that we want to leverage that. It's just, okay, this is another tool in our arsenal. I also don't want anybody to think that it went over my head: we are using the Bible as our reference here, and some people on YouTube have said I bear a striking resemblance to a person in the Bible.

Demetrios [00:17:11]: And so let's use a different book now, if you want. We could talk about maybe, like, an encyclopedia.

Yujian Tang [00:17:19]: Encyclopedia works as well.

Demetrios [00:17:21]: Yeah, the encyclopedia or Harry Potter was great. Harry Potter's good. Harry Potter is probably more culturally relevant than the encyclopedia. Now that we got that out of the way, Frank, were you able to find the name of that paper that you referenced a minute ago? And if not, I'll let you keep looking for it.

Frank Liu [00:17:45]: Yeah, I unfortunately have not. I'm going to need a little bit more time.

Demetrios [00:17:48]: I wrote in the chat when there.

Frank Liu [00:17:50]: are so many papers coming out that it just kind of drives me crazy. And I try to get through them; these days, I only read the intro and the conclusion of papers.

Demetrios [00:18:00]: Smart man. You are a smart man. Yes, I try and understand it, and then if it's really interesting, then I'll try and dive into it. All right, there is one thing that I want to talk about. So we talked about this RAG idea, and we talked about, hey, is there still room for vector databases there? But let's move on a little bit more, because we are talking in the general setting of purpose-built vector database versus just a vector offering bolted onto the normal database that I'm using, whatever my flavor is. And before we get there, maybe can we set a bit of a scene? Because are there other things? So if you have another database that is specialized in certain areas, and then it throws on the vector aspect, that's one thing. But with vector databases, are there other types of data that you would think about storing in a vector database? And if so, why would you do that?

Yujian Tang [00:19:10]: I guess I can take this one to start with, and then Frank and Jiang can give their opinions on this. So I've been saying this for a while, because I heard this from James and I was kind of curious to look into it, but vector databases are really compute engines. And the whole name "vector database" is actually just a misnomer. It's just that it's easier to tell people it's a database because you're storing something. But the reality is, you are storing vectors in memory, but most of the data that you store is actually held in permanent storage, so like S3 or MinIO or something like that. So you can store basically any type of data that you want in a vector database.

Yujian Tang [00:19:52]: You could treat it just like a NoSQL database. And you would store things like text for RAG, or if you're going to be working with multimodal RAG, you're going to be storing images, probably links to images, videos, things like that. And you should store this stuff and you should store the metadata. Maybe it tells you, like, hey, this is when this was published.

Demetrios [00:20:12]: This is who wrote it.

Yujian Tang [00:20:13]: This is XYZ about the information that you're storing. And this will basically let you use something that was originally built as a compute engine, to do vector search on top of a permanent storage layer, as if it were a real database. And this is basically kind of how, let's say, NoSQL databases work. And the difference between the way that a purpose-built vector database works and something that you just tack onto, let's say, a NoSQL database or SQL database is in the way that it's designed. And we all kind of have this intuition that things are best used for the purposes that they are designed for. And so vector databases are designed to be able to do this high amount of compute efficiently and effectively at scale. I mean, let's just take open source, for example. If you just look at some of the open source projects, you can go in and you can look at how much work was really put into this. And I'm not going to say that that is the only indicator, how many lines of code is maybe not the only indicator of how robust or feature-rich a system is, but it is definitely something you can kind of look at and be like, okay, there's a lot of features here that kind of help you do things perhaps at an enterprise scale or at a more robust level.
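
To make the "vectors plus metadata in one place" idea concrete, here is a minimal sketch of storing text chunks with their metadata alongside the embeddings. It assumes the pymilvus MilvusClient API; the collection name, field names, and values are made up for illustration:

```python
from pymilvus import MilvusClient

# Assumes a local Milvus Lite file; point the URI at a real Milvus server in production.
client = MilvusClient(uri="./milvus_demo.db")

client.create_collection(collection_name="docs", dimension=4)  # tiny dimension just for illustration

# Each row carries the vector plus the raw text and metadata you will want back at query time.
client.insert(
    collection_name="docs",
    data=[
        {
            "id": 1,
            "vector": [0.1, 0.2, 0.3, 0.4],  # in practice, the output of your embedding model
            "text": "Employees receive 20 days of vacation.",
            "source": "hr_policy.pdf",
            "published": "2023-06-01",
        }
    ],
)
```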

Jiang Chen [00:21:38]: Yeah, I think if you think of the vector database in the ecosystem of search or information retrieval, it is supposed to be the most efficient way of retrieving information that has been offline indexed and organized. However, when you organize the information, you probably don't only want to organize, say, the IDs or the index of it. You also want to put the original content as close as possible to it, because that's what you really want to retrieve and return back to your customers or clients. So with this mindset, we do design ways of storing not only vector data, but also other kinds of unstructured data, which is the source information that the vectors are generated from. By placing it closely with the vectors and returning them all at the same time, we can help developers build more efficient applications, because they do not need to retrieve this raw data from somewhere else again, and they can achieve the best efficiency from this paradigm.

Frank Liu [00:22:55]: Yeah, sorry, I was sort of busy searching for that paper. I have not found it yet, so I apologize. I will find it.

Demetrios [00:23:07]: Yeah. If you don't find it by the end, you'd better search for it.

Frank Liu [00:23:10]: If I don't find it by the end of the session, I will build a new search. I'll build a new search. Yeah, I'll post it on Twitter and I'll tag Demetrios.

Demetrios [00:23:19]: Yeah. arXiv. You're going to build a whole new arXiv search function so that you can find it next time.

Frank Liu [00:23:25]: Yeah, I'm going to go grab all these papers, because the way I'm searching for it now, which sounds really stupid, is pure keyword-based. I'm searching, like, "reasoning capabilities as context window increases," and it's just not popping up. There are actually some diagrams in there which I remember very fondly. And if I find those images, I'll share them with you as well. Demetrios, maybe you can post them.

Demetrios [00:23:54]: Yes, please. We'll send an email out to everyone too. The other thing that I was going to mention, unless, did you have something you wanted to tack on there? I imagine you were.

Frank Liu [00:24:06]: I think Jiang and Yujian said it really well. Right, at the end of the day, a vector database is a compute engine. But the one thing that I do want to say, this is my personal opinion, and I do think Jiang disagrees with me here, so it would be interesting to just chat about it a little bit more. If you look at, look, you've got guys like MongoDB coming in, and DataStax and Elastic or whatever. I'm not trying to throw shade on them. I think these are really great systems. But what does their marketing team say? Right, they say, hey, you should just come use us, because you can store structured or semi-structured data in here.

Frank Liu [00:24:48]: You can store JSON documents in MongoDB and you can store vector data too, right? Kill two birds with one stone. Just use us. And I think they really miss the point in that these databases, at the end of the day, vector databases were built for different things. Vector databases are built from the ground up to support vector search and filtered search and hybrid sparse-dense search. It doesn't mean that vector databases don't support JSON data. I have a blog post coming up that's all about using your vector database as a pure NoSQL data store. You can just store JSON data without vectors. You can do that if you want.

Frank Liu [00:25:26]: It's like, hey, if MongoDB wants to play that game, we can play that game too. It's just that we don't, because I think we've always been, at least in my eyes, very much about accelerating the adoption of vector search and accelerating the adoption of embeddings and embedding-based retrieval. That is my two cents. I know that there are differing opinions out there, and I'm not saying that you should store JSON data in a vector database. I'm not saying that. All I'm saying is that if you don't want to manage multiple databases, if you don't want to manage multiple sources of data, start with a vector database. Right? Future-proof yourself. Don't drink the Kool-Aid, so to speak. Or what is the phrase that folks use for this kind of stuff these days?

Demetrios [00:26:14]: What are all the cool kids saying?

Frank Liu [00:26:16]: What are all the cool kids saying?

Yujian Tang [00:26:18]: Frank doesn't want to throw shade, but I'll throw shade. Try to store a million vectors in MongoDB and retrieve them, and come back to me.

Frank Liu [00:26:27]: I think, I think MongoDB is great. It's excellent, right? But at the end of the day, you're doing different things. It doesn't mean that we don't support JSON data. We just don't support it as well as MongoDB does, just like MongoDB doesn't support vector data as well as we do.

Demetrios [00:26:42]: And that's exactly it. It's like, why is there a whole market of so many different databases? Because each database has its specific flavor of things it does really well. And for different use cases, you need these different flavors, because it's about what you're optimizing for.

Jiang Chen [00:27:02]: Yeah, as a matter of fact, we are actually building this data connectivity with all sorts of connectors. So even if some of the data is more meant to be stored and managed by another purpose-built database or data management system, we do have a valid path for that data to become vectors, like being generated as vectors and then being sent to the vector database for efficient and timely retrieval.

Demetrios [00:27:37]: So there are some questions coming through in the chat that I want to ask, and then I want to get into a few ways to future-proof yourself and how to use specific features in the vector databases that can help you. But we've got one question coming through. Since we are talking about pros and cons of different vector databases, someone is asking about Milvus versus Pinecone. I think one piece is, I'll start: Milvus is open source, Pinecone is not, right? I think that's probably the biggest thing in my mind. But you all have fielded this question many more times than I have.

Demetrios [00:28:22]: So is there anything else that you can say? I'm looking at Yujian licking his lips. He's like, let me add it. Hold on, let me take myself off mute. I got other stuff to say. I was already talking about Mongo. Now Pinecone, this is the best day of my life. You've got center stage, dude.

Yujian Tang [00:28:39]: I don't want to be too inflammatory here, but go on Pinecone's website.

Demetrios [00:28:45]: See.

Yujian Tang [00:28:49]: How much data you can really store. We've got some benchmarks for this. They're open source, as is the ethos here at Zilliz. And you can really take a look at the datasets we use and the way we test the data and benchmark the data, and just look at it. You can just test the benchmarks yourself. I don't think I need to say too much about this. I'm just going to say look at the benchmarks. It's all open source.

Yujian Tang [00:29:13]: You can see exactly how it's done. All the data is out there. You can bring your own data, just benchmark it yourself.

Jiang Chen [00:29:19]: Yeah, I think on top of that, other than the open source Milvus, we also have managed Milvus, which is Zilliz Cloud. And in Zilliz Cloud we added tons of features on top, so we have even more performance. Even though Milvus is already performant, we have an even more performant core indexing engine on Zilliz Cloud. And we have this feature called Zilliz Cloud Pipelines, which provides a streamlined fashion of generating vector embeddings and then storing and retrieving them in the vector database. And we also have the data connectivity that we mentioned. We are building data connectors with all sorts of data sources in this ecosystem, and all those features, I think, provide some differentiation between us and Pinecone, if open source is not the only differentiator.

Demetrios [00:30:21]: Excellent. The next question coming through in the chat is asking: I'm vector DB curious, but very ignorant on the topic. How flexible or feasible would it be to use the same vector DB of images with annotations for data loading during training as part of the inference deployment?

Frank Liu [00:30:44]: So you're looking to use vector databases in your training loop. And I think it's actually not a common use case that we see today just across the board, but it will be very common. And we've already got, I won't necessarily name who, but we already have folks that actually use Milvus in the training loop, not just for large language models, but for other models as well. And one of the things that they use it for is, for example, the deduplication of your training data. And a really interesting way that you can use it is you can actually just say, hey, I have this model that I've trained. I'm going to use it to make sure that the items I fetch for training my new model are not duplicates of what I've already used. And there are also other really interesting ways to use your vector database in the training loop as well, which I won't get too much into the details of. But specifically when it comes to your use case, images with annotations, let's say you're trying to train a very large multimodal model. You can take some of these smaller embedding models that you have, especially if you have these multimodal models, and then you can say, hey, maybe for this particular batch, I want to train this particular expert on one particular cluster of images, or one particular set of images, right? And that's just one of the large set of ways I think you can use Milvus, and not just Milvus, other vector databases as well, inside of your training loop. I think we're going to see more and more of these sorts of use cases for vector search in the future.

Frank Liu [00:32:27]: Really ties in very nicely with MLOps.
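
As a rough illustration of the deduplication pattern Frank describes, here is a minimal sketch that keeps only training candidates whose nearest neighbor among already-used examples falls below a similarity threshold. It uses plain NumPy cosine similarity; in practice the nearest-neighbor lookup would be a vector database query, and the threshold and random vectors are made-up stand-ins:

```python
import numpy as np

def filter_near_duplicates(candidates: np.ndarray, used: np.ndarray, threshold: float = 0.95):
    """Return indices of candidate embeddings that are not near-duplicates of used embeddings."""
    # Normalize rows so dot products become cosine similarities.
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    u = used / np.linalg.norm(used, axis=1, keepdims=True)
    max_sim = (c @ u.T).max(axis=1)          # best match among already-used examples
    return np.where(max_sim < threshold)[0]  # keep only sufficiently novel items

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
used = rng.normal(size=(1000, 128))
candidates = np.vstack([used[:5] + 1e-3, rng.normal(size=(20, 128))])  # 5 near-dupes, 20 novel
print(filter_near_duplicates(candidates, used))
```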

Jiang Chen [00:32:31]: Yeah, we did see some use cases of using the vector database for retrieval and for deduplication of the images during the training phase. I don't think we have seen combining this workload with the online serving workload, because one comes in bulk and it'll probably take over some of the computing resources of your collection. So even though both of the use cases are seen among our clients, it kind of makes sense to split them into two different collections or even two different physical setups, so that you don't hurt the performance of the online serving with your offline training use cases.

Yujian Tang [00:33:17]: Yeah, we can actually see some people are already doing something similar to the question asker's question, right. But yeah, if you go look at Arize and Voxel51 and Galileo and Truera, some of these companies are doing this kind of thing where they're using a vector database to kind of show you the data quality and the way that your data is clustered. And I know I just worked on something with Jacob from Voxel51 where we used Milvus to store some data and to load some similar data, to observe some of the vector embeddings and what they looked like. And so I think this is definitely something that people are looking at doing, and in fact it's already being done by a bunch of startups.

Demetrios [00:34:13]: Great question, Burhan. I really appreciate that one because I did not think of that at all. And so, yeah, that beginner's mind came through, and who would have thought that this is actually a pattern that you're starting to see emerge and that probably will be something as we move forward. So now, before we jump into embeddings, because I feel like we can't talk about vector databases without talking about embeddings, I do want to talk for a minute about the different ways to future-proof your AI applications, especially when you are using a purpose-built vector database. What are things that we should be thinking about as we are moving through our lifecycle and how we're setting up our system, most specifically when it comes to the vector database? Are there features that we need to be keeping in mind? Is there metadata that we need to be thinking about? All of that fun stuff.

Yujian Tang [00:35:13]: There are a lot of new features coming in Milvus 2.4, like hybrid search. And by the way, I'm going to define hybrid search here. Hybrid search is when you do metadata filtering on top of your vector search, pre-filtering on your metadata for your vector search.

Demetrios [00:35:30]: Okay?

Yujian Tang [00:35:31]: That's what hybrid search is. Anybody who says otherwise is wrong. And then there's also multi-vector search coming. So that's when you're going to be able to search multiple vectors and re-rank based on how the vectors compare. So for example, a use case for this might be that you have both a semantic similarity aspect that you would like as well as maybe a visual similarity aspect that you would like, in which case you would want vectors to do the semantic similarity search, your traditional, or what we call vector embeddings right now, to do your semantic search, and then you're going to want maybe something like pixel embeddings to do your visual search. So there are definitely these use cases that we're starting to see come up for multi-vector search, and maybe you even have text and you want to compare it and re-rank based on the text, or things like that. So there are a lot of these kinds of different new use cases that are a little bit more, let's say, advanced, that are coming in Milvus 2.4.
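
Here's a minimal sketch of the metadata-pre-filter-plus-vector-search pattern Yujian describes, assuming the pymilvus MilvusClient API and reusing the hypothetical "docs" collection and fields from the earlier snippet:

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="./milvus_demo.db")

query_vector = [0.1, 0.2, 0.3, 0.4]  # in practice, embed the user's question with the same model

results = client.search(
    collection_name="docs",
    data=[query_vector],
    filter='source == "hr_policy.pdf"',   # metadata pre-filter applied before vector similarity
    limit=3,                              # top-k nearest neighbors among the filtered rows
    output_fields=["text", "source", "published"],
)

for hit in results[0]:
    print(hit["distance"], hit["entity"]["text"])
```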

Jiang Chen [00:36:25]: Yeah, this term of hybrid search is really funny. It's like everybody is using it with a different meaning, to my understanding. Well, initially I was thinking hybrid search is probably you're searching the same thing through different modalities, or like different ways of referencing things. Well, to avoid the confusion, I will just describe the way of searching directly rather than using these fancy terms. So yes, as Yujian said, in the upcoming 2.4 release we're supporting the hybrid, the mix of search with dense and sparse embeddings, because there are definitely some great features or properties of sparse embeddings. They grasp the individual concepts of lengthy documents a bit better, or in a more understandable way, than the dense embedding. So there are definitely some developers favoring sparse search, if not standalone, then combined with dense embedding retrieval. So yeah, we totally echo that requirement.

Jiang Chen [00:37:40]: We built that feature really fast and we are releasing it very shortly. There's also, while we're still enhancing the other aspects of the Milvus offering, internally we're working on providing a set of easy-to-use utility functions to generate the embeddings on the client side, so that you don't need to play with all sorts of frameworks and get confused between them if you have relatively simple requirements regarding which kind of vector embedding you want to generate. For example, if you want to generate the sparse embedding from a SPLADE model or BM25, which is more like the old-school way of representing text, or definitely the Hugging Face sentence transformers and OpenAI and other services.

Demetrios [00:38:39]: So you bring up a very good point, and I think it is the perfect segue into embeddings in general. There was a question that came up in the MLOps Community Slack, and it was kind of along the lines of, what are some good practices around generating embeddings? What are the best models? And the consensus was it's kind of just like throwing spaghetti at the wall right now. You see what works, and hopefully you get lucky. Go on to the leaderboard, grab a few of those models, and then, does it work for your use case? Cool, you're lucky. Does it not? Keep searching. Maybe you guys have seen better practices than that, because it feels like there's got to be a different way.

Demetrios [00:39:24]: Right.

Jiang Chen [00:39:24]: Well, first, I want to say this whole story of machine learning and artificial intelligence is like throwing spaghetti at the wall and seeing what works. That's how this whole ecosystem of technology is being developed, but we don't want every single developer to do that again and again. We do want to provide some insights on that. Yeah, I would like Frank to speak on that. Frank is definitely one of the great experts in this area.

Frank Liu [00:39:53]: Okay. My personal opinion. So I see a lot of folks, I talk to a lot of devs out there, and when they're building out their application that leverages LLMs or leverages generative AI, they just say, I'm using OpenAI's API, or I'm using OpenAI's embedding endpoint, which is great. I don't want to throw shade on them. I think it's a great embedding model, but there's no way that there's one embedding model that fits every use case. If you look at, I wrote this blog, and there's this great example in there. I don't remember where I got this example from. Again, I get a lot of these from different locations around the web, from different people, and I sort of repurpose them.

Frank Liu [00:40:41]: So this is a great example where one sentence is, "Let's eat Yujian," I'm going to use Yujian here, and the other sentence is, "Let's eat, comma, Yujian." And these mean two very different things, right? Now, for some applications, let's say you're talking in a legal context or you're talking about things that are a little bit fuzzy, you probably want the embeddings for these two sentences to be very close to each other. They're very related.

Frank Liu [00:41:17]: They have almost the same tokens, with the exception of some punctuation. And the person that I'm talking about in both of these, in this...

Jiang Chen [00:41:25]: case is Yujian.

Frank Liu [00:41:26]: It's the same.

Demetrios [00:41:27]: Right.

Frank Liu [00:41:28]: And if you want to talk about these two sentences from the perspective of what they actually mean, these mean two very different things. In one sentence, I'm saying, let's actually physically eat Yujian, which, let's not do that. And in the other sentence, I'm saying, hey, Yujian, let's go get lunch, or let's go out to eat. These mean two very different things. It depends on what you want to do. There's no one-size-fits-all embedding model. And if you look at a lot of the metrics out there saying, oh, embeddings aren't as good as BM25, it's because you're not using something that is tailored to your application.

Frank Liu [00:42:11]: Right, that is semantically relevant to your application. And I think, my recommendation, if you're looking to get started with vector databases or vector search, is to try different models. I know Hugging Face and sentence transformers, and there's Voyage as well, make it very easy for you to try different things. Try different models and see what works best for your application. Use the LGTM@10 test.

Frank Liu [00:42:38]: Right. Try ten different examples and see which one looks the best. Or try 20 different examples and see which one looks the best. At the end of the day, humans are still the best evaluators, right? And you should, depending on what it is that you're trying to build, evaluate your embedding model accordingly.
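
A minimal sketch of that kind of eyeball comparison, assuming the sentence-transformers library and two arbitrarily chosen open models (swap in whichever candidates you are evaluating), plus a handful of made-up queries and documents standing in for your own data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Employees receive 20 days of vacation.",
    "Expense reports are due within 30 days.",
    "The office closes early on Fridays.",
]
queries = ["how much vacation do I get?", "when do I submit expenses?"]

for model_name in ["all-MiniLM-L6-v2", "BAAI/bge-small-en-v1.5"]:  # candidate models, pick your own
    model = SentenceTransformer(model_name)
    d = model.encode(docs, normalize_embeddings=True)
    q = model.encode(queries, normalize_embeddings=True)
    print(f"\n=== {model_name} ===")
    for i, query in enumerate(queries):
        ranking = np.argsort(-(q[i] @ d.T))                  # cosine similarity on normalized vectors
        print(query, "->", [docs[j] for j in ranking[:2]])   # eyeball the top hits per query
```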

Yujian Tang [00:42:58]: I'll give a slightly different answer here.

Jiang Chen [00:43:02]: You should fine tune your embedding models.

Yujian Tang [00:43:05]: You should fine-tune your embedding models on what you're going to be doing. And actually, I talked about this in my paper with Voxel51 as well. We took some examples with CLIP ViT and the CIFAR-10 dataset, and we searched three words: Ferrari, Pony, and Mustang. And so you can see, Ferrari is going to get you cars back, Pony is going to give you animals back, but Mustang is going to give you both horses and cars back. And so, depending on what you're working on, let's say, for example, you're working on cars, you don't really want pictures of horses back. And so you'll have to fine-tune your embedding models to kind of get back the right context.
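
As a rough illustration of what a small fine-tune can look like (not the exact setup from Yujian's paper), here is a sketch using the classic sentence-transformers training loop on a handful of hypothetical (query, relevant passage) pairs; a real project would use more pairs and a held-out evaluation set:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary small base model

# Tiny made-up training set: each pair is (query, passage that should rank highly for it).
train_examples = [
    InputExample(texts=["mustang engine specs", "The Ford Mustang GT has a 5.0L V8 engine."]),
    InputExample(texts=["ferrari road test", "We drove the Ferrari 296 GTB on mountain roads."]),
    InputExample(texts=["classic muscle cars", "Muscle cars like the Camaro defined the late 1960s."]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives: other passages act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
model.save("fine-tuned-car-embedder")
```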

Jiang Chen [00:43:49]: Yeah, I think there are just so many embedding models available on the market, and unfortunately, most developers just choose, say, the OpenAI embedding, not because of its quality, but because of the name OpenAI. So that's definitely not the way I would recommend. I think, surprisingly, just with a very small dataset, you can probably test the effectiveness of the embedding model on your particular use case. So my top recommendation is that if you have the capacity for doing some testing and evaluation, definitely build your own dataset, which can speak for your use case better than any other open dataset, such as MS MARCO or BEIR, and also this public leaderboard, MTEB.

Frank Liu [00:44:48]: Right.

Jiang Chen [00:44:49]: This MTEB leaderboard has been overfitted to by many models already, so it's really hard to simply judge. I mean, it's still instructive, but we shouldn't simply judge by their ranking on this leaderboard. However, that being said, if you don't have the capacity to do any evaluation, then choosing from some big name is still a practical way of getting the job done.

Demetrios [00:45:16]: So there's something fascinating that you said there, like creating your own dataset. And I think I just saw some posts on one of the social medias talking about how a big question, an open question, is how much data is enough to be able to say, like, yeah, this is good. And especially if we're taking Yujian's recommendation and saying, all right, we'll fine-tune our embedding model. Have you seen best practices around that? What does a good amount of data look like, and how can you assure the quality of that? Also, how can you make sure that it is nice and diverse, robust? All the fun stuff in there, too?

Yujian Tang [00:46:05]: So there was a recent paper from 2023, or maybe it was 2022, I think it was 2023. It's called Neural Priming, and it comes out of...

Jiang Chen [00:46:17]: Oh, man, I don't know what school.

Yujian Tang [00:46:18]: It comes out of, but Ali Farhadi is the lead researcher for that. He is currently the CEO of AI2. And basically what they show is that if you prime a model with some images, this isn't even fine-tuning, it's like kind of fine-tuning, a very small amount of fine-tuning. With the right context, you will get some decent percentage improvement in results. And of course, this will vary from dataset to dataset. But for images, you're really only looking at 20 to 25 images to kind of skew the model towards what you're looking for.

Yujian Tang [00:46:55]: And for text, you're looking at maybe like 100 to 120 sentences.

Jiang Chen [00:46:59]: Just from my experience of working with evaluation, I think the bare minimum is definitely tens to hundreds of examples to give you a meaningful indication of how well the model performs. If you have thousands of examples, you are probably in good shape, as a general rule of thumb.
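
For readers who want to see what "evaluate on your own small dataset" can look like mechanically, here is a minimal recall@k sketch over a hypothetical hand-labeled set of queries and their relevant document IDs; the retrieval function is a stand-in for whatever embedding model plus vector search stack you are testing:

```python
from typing import Callable, Dict, List, Set

def recall_at_k(
    labeled_queries: Dict[str, Set[str]],          # query -> set of relevant doc IDs (hand labeled)
    retrieve: Callable[[str, int], List[str]],     # your search stack: (query, k) -> ranked doc IDs
    k: int = 10,
) -> float:
    """Fraction of queries where at least one relevant doc shows up in the top k results."""
    hits = 0
    for query, relevant in labeled_queries.items():
        if set(retrieve(query, k)) & relevant:
            hits += 1
    return hits / len(labeled_queries)

# Toy usage with a fake retriever; in practice this would call your embedding model + vector DB.
labeled = {"how much vacation do I get?": {"hr-001"}, "when are expenses due?": {"fin-007"}}
fake_retrieve = lambda q, k: (["hr-001", "misc-123"] if "vacation" in q else ["misc-456"])[:k]
print(recall_at_k(labeled, fake_retrieve, k=10))   # 0.5 for this toy example
```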

Demetrios [00:47:20]: So basically, to try and summarize this so that it's not throwing spaghetti at the wall, right: what we're going to do is go on the leaderboard, grab a few models, and test those out with maybe 100, if you can, 100 different types of examples. And then if you see that there's one that's performing well, go ahead, add that extra tuning. Fine-tune it with as much data as you can. But you probably don't need to be thinking about more than, like, 1,000 examples. You should probably be thinking more in the hundreds of examples, and you may be able to get away with tens of examples.

Frank Liu [00:48:04]: The data has to be really high quality, though, right? What large language models are really good at is that they have just a huge wealth of pretraining data. And if you want to use very little data, that's okay, but you just have to make sure that it's really relevant to your dataset, make sure that it's a good representation of what things are actually going to look like when you go into production.

Demetrios [00:48:33]: Yeah, that's so true. And that was another great paper, the LIMA paper, Less Is More for Alignment. I remember that one. I'll drop that one in the chat. That one I can definitely find because it's an easy name. Right, Yujian, if you have the link to the paper that you referenced, throw it over.

Demetrios [00:48:52]: So before we jump off, folks, I want to get a few cool, maybe hot takes, I don't know, we'll see, mild takes. But I think there's something that we could talk about, which has been taking Twitter by storm, right, which is the ColBERT model. And it feels like these late interaction models, maybe. Yujian, can you talk to us about what those are and why that's interesting, and what you think the reason is for it becoming more popular right now?

Frank Liu [00:49:29]: Yeah.

Jiang Chen [00:49:30]: So this ColBERT paper and the late interaction model behind it, that's really interesting. And it's catching people's eyes, because right now, people kind of realize that this paradigm of embedding model plus vector retrieval is a perfect thing for retrieving a bunch of candidates for the final result. But if you add another layer, which is called a reranker model, on top of this, to kind of refine the results and select the very best from the best candidates, then you can probably achieve better results. We did some quantifiable analysis on this. Yes, you can improve the recall numbers or other quality metrics by a few percent. So that's a meaningful difference. However, this reranker model is super expensive.

Jiang Chen [00:50:28]: You are probably doing something that you should do at offline indexing time at online query serving time, just because you're using this reranker, because they use a cross-encoder model, which needs to basically look at the whole query and all the candidate documents, rather than just computing a cosine similarity score, which is quite cheap computation-wise. Right? So to solve this problem, there's this late interaction model being proposed, which kind of combines the good things about embedding-based vector retrieval and this cross-encoder reranking. It generates a whole set of embeddings from one single query or document and then does the late interaction by computing, I would say, the similarity score between them at query time. So that's cheaper than the regular reranker, but it represents more raw information than the single dense vector embedding. So I think that's a great idea. But again, putting it into production is questionable, at least for now, because it definitely uses more space to encode this information. So that creates challenges at online retrieval time. And moreover, a lot of the effort from the community has been put into single dense vector embeddings, so the embedding model is being refined and trained over and over with better data, so that it does achieve better performance than the ColBERT embedding, which has only been trained by the academic community.
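
For readers unfamiliar with the mechanics, here is a minimal NumPy sketch of the ColBERT-style late interaction (MaxSim) scoring Jiang describes: every query token embedding is matched against its best document token embedding, and the maxima are summed. The random vectors below are stand-ins for real token embeddings from a model:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction score: sum over query tokens of the best cosine match among doc tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                          # (num_query_tokens, num_doc_tokens) cosine similarities
    return float(sim.max(axis=1).sum())    # MaxSim: best doc token per query token, then sum

# Toy usage: rank two "documents" against one "query".
rng = np.random.default_rng(0)
query = rng.normal(size=(6, 128))                             # 6 query tokens, 128-dim each
docs = [rng.normal(size=(40, 128)), rng.normal(size=(55, 128))]
print(sorted(range(len(docs)), key=lambda i: -maxsim_score(query, docs[i])))
```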

Jiang Chen [00:52:22]: However, we are seeing more and more effort on this, so it's hard to say what will happen in the future. We are looking into this, and we are also starting to think about how to integrate this pattern into the vector retrieval facilities that we are building, so that in the future, if this becomes, I would say, a popular choice among developers, we'll have a solution that pairs with it.

Yujian Tang [00:52:50]: Here's your hot take for this. Okay, forget ColBERT. The only Colbert you need to know is Stephen Colbert. What you should really be doing is using Milvus's multi-vector search with reranking to produce the same results at a better price. And it's easier.

Demetrios [00:53:05]: We all love cheaper. That is very useful to know. The cheaper the better, I think. And that's cool that you all have been inspired and taken that into account. I know we talked a little bit about it already, but I want to take the next two minutes just to shine a light on some of the cool stuff you all are doing at Milvus and Zilliz, because we talked about the pipelines, we also talked about the reranker. What other features are out there? I know we said that in the pipeline we've got a few features that are coming out, like, is it sparse embeddings and a few others, but the reranker is coming out soon?

Jiang Chen [00:54:00]: We have just supported the reranker model in Zilliz Cloud Pipelines, and for the open source Milvus we are about to support sparse and dense vector hybrid retrieval. I cannot avoid using the word hybrid, in the next release.

Demetrios [00:54:19]: Frank loves it. Frank's like, oh, we've got to get a better naming convention around this. Well, fellas, this has been awesome. I really appreciate you doing this. One thing, as we end, I would love to ask you a question that I tend to ask when it comes to vector databases. It doesn't really fit into the specific "purpose-built vector database versus catch-all database with vector support" theme, but I was talking to some data engineers two weeks ago from QuantumBlack, and they were saying how hard it is to make sure that when you put information into a vector database for your RAG, how do you ensure that that information, if it gets updated, is the one that is being retrieved when you go and ask questions? So they gave me an example of an HR policy being updated, and now all of a sudden you have less vacation time. And so if you have a chatbot that gives the old vacation time, that's not good, because then the company all of a sudden is like, hey, no, actually you only have ten days, not 20, and you could get some pissed-off employees.

Demetrios [00:55:48]: So what have you all seen as best practices around that?

Frank Liu [00:55:53]: Yeah, treat your vector database as, like, a living entity. And this is one of the great things about some of the stuff Jiang and his team have done, which...

Demetrios [00:56:06]: Is.

Frank Liu [00:56:09]: You have all this data, you have all these embeddings inside your vector database, and you create these real-time sources, these real-time syncs, and make sure that it's up to date at all times. I know it's a really simplistic answer, which it is, but I think at the end of the day, that's really your vector database. You'd want to treat it almost as you do other databases, as a single source of truth for semantic information, for semantic representations. And that shouldn't change just because approximate nearest neighbor search is different from the way you typically query other databases.

Jiang Chen [00:56:50]: Yeah, I think it all comes down to managing the data carefully, carefully crafting the business logic that identifies each piece of information. For example, if the HR policy is in some document, then once you update the policy, you have definitely updated the document, so just do, like, purging and reinserting, or better, an upsert of that single document, right. Or if it's a single text chunk, then just identify the text chunk and update it.
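
A minimal sketch of that update pattern, again assuming the pymilvus MilvusClient API (upsert and delete-by-filter are available in recent Milvus versions) and the hypothetical "docs" collection from earlier, where the primary key identifies the chunk being replaced:

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="./milvus_demo.db")

# The HR policy changed: re-embed the updated chunk and upsert it under the same primary key,
# so the stale vector and text are replaced in place.
client.upsert(
    collection_name="docs",
    data=[{
        "id": 1,                                   # same id as the original chunk
        "vector": [0.2, 0.1, 0.4, 0.3],            # embedding of the updated text
        "text": "Employees receive 10 days of vacation.",
        "source": "hr_policy.pdf",
        "published": "2024-03-01",
    }],
)

# Alternatively, purge every chunk from the old document and reinsert the re-chunked version.
client.delete(collection_name="docs", filter='source == "hr_policy.pdf"')
```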

Frank Liu [00:57:21]: Use triggers.

Demetrios [00:57:23]: Yeah, so that's awesome to think about for that use case. And I think the other one is a little bit more gray, where you're like, okay, we actually want to see how this document has evolved over time, so we don't necessarily want to do a find and replace. But it's 100% like what you're saying: data engineering has been working on these types of problems for a long time. It's typical DataOps, database stuff. And it's not like all of a sudden we should forget about all of that because now we're dealing with vectors. We can still bring some of those best practices into the mix.

Jiang Chen [00:58:07]: Yeah, of course.

Demetrios [00:58:08]: I think that is it. Our time here is sadly coming to an end. I want to give you all a huge thanks, and for everyone that joined us virtually, this was super cool. Thank you for all the questions. For anybody out there that is looking for a new piece of cool clothing, I will remind you that we have this "I hallucinate more than ChatGPT" shirt, and it is hot off the press. They're selling like hotcakes. Maybe you can get yourself one by scanning that QR code.

Demetrios [00:58:48]: Before I end this stream, I'll let all these guys go. This has been awesome. Thank you all.

