Using Vector Databases: Practical Advice for Production
Sam Partee is the CTO and Co-Founder of Arcade AI. Previously a Principal Engineer leading the Applied AI team at Redis, Sam led the effort to create the ecosystem around Redis as a vector database. He is a contributor to multiple OSS projects including LangChain, DeterminedAI, and Chapel, amongst others. While at Cray/HPE he created the SmartSim AI framework and published research on applications of AI to climate models.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
In the last LLM in Production event, Sam spoke on some of the ways they've seen people use a vector database for large language models. This included use cases like information/context retrieval, conversational memory for chatbots, and semantic caching.
These are great and make for flashy demos; however, using them in production isn't trivial. Oftentimes, the less flashy side of these use cases presents huge challenges: Advice on prompts? How do I chunk up text? What if I need HIPAA compliance? On-premise? What if I change my embedding model? Which index type? How do I do A/B tests? Which cloud platform or model API should I use? Deployment strategies? How can I inject features from my feature platform? LangChain or LlamaIndex or RelevanceAI?
This talk distills more than a year of deploying Redis for these use cases with customers into 20 minutes.
Introduction
Now, wait a minute. What is this? What's going on here? Excuse me, do you guys know who Sam Partee is? Anybody? Anybody? No, dude, they don't know you. They don't know you. What is going on, Mr. Partee? They don't know you. I'm not sure if it's my side, but it's impossible to follow you in anything.
The best part is that I'm not sharing the top-quality vector database content like you're about to bring. I think people came more for the vector store talks than they did for the random LLM improvised songs, but you know, both of them make a nice little sandwich.
Hey, well look, it is gonna be some good content. It's gonna be a little more advanced than the last one, so if you didn't see the last one, definitely go check it out; it'll fill in some blanks. There we go. But before I share your screen, I just wanna make sure that you saw this.
Excuse me. Do you guys know who Sam Partee is? Anybody? Anybody? No.
Why do they not know you, Sam? I don't care, I'm irrelevant. Yeah, well that's gonna change right now, because we are about to make you famous. Everybody's gonna be saying your name after this talk right here. All right, I'm sharing your slides. Let's hope it doesn't crash.
Let's go, baby. I hope it doesn't crash too. Right? Okay. All right. So as I said, this talk is a continuation of my previous talk, so if you didn't see part one, it lays things out in an easier way than this one will. I'll start a little further along than where I started last time and go from there.
I show this slide at almost every talk I do, but I feel like it's important to start here so that everybody gets on the same page. I'm gonna rush through it, so once again, definitely go watch my previous talk and you'll get more information. All right, so what are vector embeddings? Vector embeddings are essentially lists of numbers.
I say this a lot, and I've been using this analogy recently: think of it like a list of groceries, where every single item on the list is something you need to go get. Well, in this case it's a list of numbers, where each of those numbers means something about some piece of unstructured data: audio, text, images.
They're highly dimensional, sometimes sparse but usually dense, in the sense that each one of those items means something. If you think about how convolutional networks pick up on filters, or think about MNIST and the curve of every digit, each one of those dimensions in the list actually means something about the input data.
It's never been easier to create these embeddings, and there are APIs to do so: our friends at Hugging Face, OpenAI, Cohere (congrats on the round). So again, I'm gonna blitz through this, but if you didn't watch the last talk or need more background, go check out that MLOps Community blog.
Here's how we search through vectors: for each of those important data points, you can essentially subtract, or use something like cosine similarity, to compare the distance between them. If you use one minus the distance, you get their similarity. In this case, we can see an example of three sentences and a query sentence, each of which is turned into a semantic vector, meaning not lexical, like a BM25 search, but actually semantic.
That implies we're looking for the meaning of a sentence rather than its words. So here we're calculating the difference in meaning between all of these sentences, and we can see that "That is a very happy person" is the most similar to "That is a happy person", which makes sense: semantically, those are the most similar.
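As a quick illustration, here is a minimal sketch of that comparison in Python. It assumes the sentence-transformers library; the model name and the sentences are just examples, not the exact ones from the talk.

```python
# Embed a query and a few sentences, then rank the sentences by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedding model works

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means same direction (same meaning), 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = [
    "That is a very happy person",
    "That is a happy dog",
    "Today is a sunny day",
]
query_vec = model.encode("That is a happy person")

for sentence in sentences:
    score = cosine_similarity(query_vec, model.encode(sentence))
    print(f"{score:.3f}  {sentence}")  # "That is a very happy person" scores highest
```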
Vector databases, as I explained in my last talk, are used to perform similarity searches. You can create a search index within the database, with CRUD operations, and then, with those embedding models supplied by our friends, take unstructured data, index it inside that database, and perform searches against it.
Obviously I work at Redis, and Redis does this with the addition of RediSearch. When you combine those two things, you get a pretty powerful vector database. I'm gonna skip through this, but if you didn't know: Redis is a vector database. We have a bunch of integrations.
Shout out to Tyler Hutcherson for doing a lot of these. We have two indexing methods, an approximate one (HNSW) and FLAT, plus a few distance metrics: L2, cosine, and inner product. There's support for hybrid queries and support for JSON. I explained a lot of this in my last talk, but we're gonna get to the fun stuff now.
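To make the Redis side concrete, here is a minimal sketch of creating such an index with redis-py's search commands. The index name, field names, and 384-dimension embedding size are illustrative assumptions, not the talk's exact setup.

```python
# Define a vector index with one text field, one tag field, and an HNSW vector field.
import numpy as np
import redis
from redis.commands.search.field import TagField, TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379)

schema = (
    TextField("content"),   # the raw text chunk
    TagField("group"),      # metadata co-located with the vector (e.g. an access group)
    VectorField(
        "embedding",
        "HNSW",             # approximate index; use "FLAT" for exact search
        {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"},
    ),
)

r.ft("docs").create_index(
    schema,
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# Vectors are stored as raw float32 bytes alongside their metadata.
vec = np.random.rand(384).astype(np.float32)
r.hset("doc:1", mapping={"content": "hello world", "group": "public", "embedding": vec.tobytes()})
```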
Okay, so now here's where I'll slow down: design patterns. As I've said to a bunch of people, if you saw Chip's talk from the ML group, I covered some of this there, but I've expanded on it. These are design patterns that I've seen and deployed, for customers, in practice, or in demos, for LLM usage with vector databases.
Not all of these apply to every case, but they're things I've seen out in the field, and it's important that everybody gets on the same page with them. One interesting place to start is this Sequoia LLM survey; I'm sure most people have heard of the VC. 88% of the surveyed group believed a retrieval mechanism was essentially a key piece of the architecture of their LLM stack.
Everybody knows I'm not gonna argue for "LLMOps" or "VectorOps", Daniel Spino, but I do think that in this case the LLM stack ties a vector database and a large language model together. It's clear there's a synergy here, and you'll see it throughout the design patterns. Again, 88% believe some retrieval mechanism is necessary for their large language model stack.
The first one, and the one you see the most, is context retrieval. I showed this in my last talk, but it's the most important one, and it's kind of an overarching group as well as a design pattern. The whole point is that you have some question-and-answer loop, or a chatbot, or a recommendation system, and your goal is to retrieve contextually relevant information from within your knowledge base to supply the LLM with more information, so it has contextually relevant information at the time of generation. This is cheaper and faster than fine-tuning, and it allows real-time updates.
Very importantly, if you have a constantly or rapidly changing knowledge base or source of information, you can't retrain the model and redeploy it with enough velocity for those changes to show up in the end result. The only way to do it is to have a vector database and inject context into prompts whenever you find semantically (or otherwise) relevant information to include.
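Here is a minimal sketch of that retrieval loop against the hypothetical "docs" index from earlier; the embed() argument and the prompt template are assumptions for illustration, not the talk's exact prompts.

```python
# Retrieve the k most semantically similar chunks and inject them into the prompt.
from redis.commands.search.query import Query

def retrieve_context(r, question: str, embed, k: int = 3) -> list[str]:
    q = (
        Query(f"*=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")                     # lower cosine distance = more relevant
        .return_fields("content", "score")
        .dialect(2)
    )
    res = r.ft("docs").search(q, query_params={"vec": embed(question).tobytes()})
    return [doc.content for doc in res.docs]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```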
This also protects you in sensitive-data scenarios. Say you're a bank, for instance, and you have multiple sets of analysts, and you want to say: this group of analysts is allowed to access these particular documents, and that group of analysts is allowed to access those documents.
Well, if the model has been trained or fine-tuned on those documents, you can't guarantee that. You can't say, "I'm positive it will not reveal this particular piece of the knowledge base." But with a vector database you can. You can use things like role-based access control (RBAC), with vector databases like Redis, and say that this particular user does not have access to these particular parts of the index.
That lets you segment your data accordingly, which in turn lets you build different, more interesting application architectures while protecting your sensitive data. This is good for all types of use cases, not just question answering; that's just the most easily recognizable one.
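A minimal sketch of that access-controlled retrieval, using a tag filter on the hypothetical "group" field defined earlier; the role-to-group mapping is an assumption for illustration.

```python
# Hybrid query: filter to the groups this user may read, then run KNN within that subset.
from redis.commands.search.query import Query

def retrieve_for_user(r, question: str, embed, allowed_groups: list[str], k: int = 3):
    tag_filter = "|".join(allowed_groups)   # e.g. "analysts_emea|public"
    q = (
        Query(f"(@group:{{{tag_filter}}})=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("content", "score")
        .dialect(2)
    )
    return r.ft("docs").search(q, query_params={"vec": embed(question).tobytes()})
```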
There are a couple of examples, and honestly, many of the examples I have are on Redis Ventures (that's github.com/RedisVentures, our team repo), so go check that out. One more advanced thing on context retrieval is something I've done a lot: HyDE, hypothetical document embeddings.
Essentially, you use generation on top of context retrieval: a model first produces a hypothetical document for the query, and that document's embedding is what you use to retrieve the real context for the prompt. So there are actually two invocations of a generative model here. The first produces the hypothetical document used for retrieval, and the second is the LLM that actually answers the question with that retrieved context.
Because this takes multiple trips to an LLM, it's slow; it may take seconds. The way we've used it in the past: if you're familiar with asynchronous Python programming, you can do what's called a gather, or even just two asynchronous calls that each go off and complete some action.
The first one is: go search for context without HyDE. You can have a threshold, some similarity-score metric you're counting on, to decide whether that context is good or not. If nothing is retrieved, you go to the second asynchronous call, which is the HyDE approach.
Since HyDE takes a little longer, you wait and ask: did the first one return anything? If not, then wait for the HyDE approach to finish. That's one way we've seen to get around the problem of the context not being perfect for a specific question. I do recommend going and checking out that paper.
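A minimal sketch of racing the plain path against the slower HyDE path with asyncio; the retrieve() and hyde_retrieve() helpers and the distance threshold are assumptions.

```python
# Kick off both retrieval paths concurrently; only wait for the slower HyDE path
# if the plain lookup didn't return anything good enough.
import asyncio

DISTANCE_THRESHOLD = 0.2  # cosine distance; larger means "not similar enough"

async def get_context(question: str, retrieve, hyde_retrieve) -> list[str]:
    plain_task = asyncio.create_task(retrieve(question))      # fast: embed + KNN
    hyde_task = asyncio.create_task(hyde_retrieve(question))  # slow: an LLM writes a hypothetical doc first

    chunks, best_distance = await plain_task
    if chunks and best_distance <= DISTANCE_THRESHOLD:
        hyde_task.cancel()          # plain retrieval was good enough; skip the extra LLM call
        return chunks
    return await hyde_task          # otherwise wait for the HyDE result
```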
There's a lot more to talk about here, but I don't have that much time. Okay, feature injection, a really cool one. Shout out to Simba and co at Featureform for adding the ability to use Redis as a vector database in addition to a feature store. This is something we've done a lot of, since Redis is commonly an online feature store.
When you're able to use both in the same context (pun intended), you get the qualities of both at the same time. What do I mean by that? Well, say you have an e-commerce website, a user logs in, and that user has bought some product from your e-commerce solution.
Say the chatbot's whole job is to help the user with issues. If the user says, "I had a problem with my last order," one thing you could do is have semantic representations where you're bucketing, identifying that this is a user asking about ordering, and then doing a lookup.
But beyond that, you can set things up so a prompt can recognize when you need to go retrieve features from the feature store, and then retrieve that contextually relevant information at the same time. This allows real-time inclusion of entity information, whether it's a user or a product, so a model that beforehand had no idea the user had bought that product can now look up specific things about them and include that in the context window.
So every time the user does something, that context window may include the last ten items the user bought, or the type of user: active user, rewards user, "thanks for being a rewards user."
All of those specific things can now be injected into the prompt, because your feature store gets to be in the same loop as your vector database and your LLM. Especially with Redis, you can do this with the same infrastructure, which is really nice, because you can co-locate metadata with your vectors: your features with your vectors.
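A minimal sketch of that pattern is below. The feature-store client, feature names, and prompt are illustrative assumptions (online stores like Featureform or Feast expose similar lookups), and retrieve_context() is the helper sketched earlier.

```python
# Combine an online feature lookup with semantic retrieval in the same prompt.
def build_support_prompt(r, feature_store, user_id: str, question: str, embed) -> str:
    # 1. Entity features from the online store, keyed by the logged-in user (hypothetical client).
    features = feature_store.get_online_features(
        entity={"user_id": user_id},
        features=["last_10_orders", "rewards_tier"],
    )
    # 2. Contextually relevant docs from the vector index.
    chunks = retrieve_context(r, question, embed, k=3)

    return (
        "You are a support assistant for our store.\n"
        f"Rewards tier: {features['rewards_tier']}\n"
        f"Recent orders: {features['last_10_orders']}\n\n"
        "Relevant documentation:\n" + "\n".join(chunks) + "\n\n"
        f"Customer question: {question}"
    )
```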
Go check out Featureform on that; they can do both, it's pretty cool, and they just released it. Okay: semantic caching. I talked about this one last time too, and it's getting even more popular. The basic concept: everybody, think about typical caching, right?
You hash the input and use it as the key (you do something like CRC16 to decide what hash slot things go to), and then you can do a lookup: okay, I have a new input, is it the same as one I've already hashed? Well, in this case, going back to the product scenario, what if someone said, "Can you tell me about product X"
and the next user said, "Can you tell me about product X?" with a question mark? Technically those aren't gonna hash to the same thing, but the semantic similarity might be 99.9%. Shouldn't that return the same answer? And you can set that threshold; you can say, I only want 98% or 99% similar to be returned.
You can even caveat it and say this was a pre-written response. Even better, some people pre-compute: if you have a very set list of questions and answers, like an FAQ, and you're having a bot answer all of them, you can just go through, answer all of those questions up front, and have them cached.
The benefit is that you're not invoking an LLM, you're not incurring the cost (monetarily or computationally), and it speeds up your application. QPS-wise, it gets a lot better if you've pre-embedded all of those answers and queries. So basically you embed the query, and every time something's answered you store it in the database, so that if a new user asks it, they get that same answer back, according to some threshold.
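A minimal sketch of that check, assuming a separate "cache" index that stores question embeddings and answers; the field names and the threshold are illustrative.

```python
# Before calling the LLM, look for a previously answered question whose embedding is
# close enough to the new one; on a hit, return the cached answer and skip the LLM.
from redis.commands.search.query import Query

SIMILARITY_THRESHOLD = 0.98

def check_semantic_cache(r, question: str, embed) -> str | None:
    q = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("answer", "score")
        .dialect(2)
    )
    res = r.ft("cache").search(q, query_params={"vec": embed(question).tobytes()})
    if res.docs:
        similarity = 1 - float(res.docs[0].score)  # cosine distance -> similarity
        if similarity >= SIMILARITY_THRESHOLD:
            return res.docs[0].answer              # cache hit: no LLM call needed
    return None                                    # cache miss: go invoke the LLM

def store_in_cache(r, key: str, question: str, answer: str, embed) -> None:
    r.hset(f"cache:{key}", mapping={
        "question": question,
        "answer": answer,
        "embedding": embed(question).tobytes(),
    })
```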
This is becoming very popular, and if you want to try out kind of an alpha of this, definitely let me know, because we have some stuff coming. Another one that was actually talked about earlier is guardrails, and vector databases can be used as such. One really cool example of someone doing a lot here, not necessarily with vector databases (although it does integrate with LangChain and can do retrieval), is NeMo Guardrails from NVIDIA.
It's a really interesting way to have a Colang definition of what a bot should be able to do: it defines how the bot can express itself and defines the flow for it. These are really important for large language models, especially the very large ones that are really prone to hallucinations, because they let them be used in ways that are much more contained or strict.
You can use a vector database in a similar fashion, not necessarily through NeMo. You can say: if no context is returned, return a default answer; or if no context is returned, choose another path.
You can think of it as a downward-facing directed acyclic graph, and you can branch from there. Vector databases like Redis are so fast that you can do multiple context lookups and chain them all the way down, so you don't necessarily have to incur as much cost going to the LLM every time.
You can simply have that tree of options. Getting clever about how you pre-compute all of those embeddings, and build that directed acyclic graph of options for your model to explore when no context is retrieved, is a really interesting way to put a barrier, guardrails, around your LLM.
The way I like to explain this sometimes: it's like bowling. Sometimes LLMs get stuck in the gutter, right? I know I shoot into the gutter all the time; I'm awful at bowling. But guardrails, just as they sound, are like bumpers. You can be a ten-year-old, or you can be me slamming the ball down the alley way too hard, and it will still bounce off the bumpers and most likely hit something at the end of the lane, or whatever they call it in bowling.
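A minimal sketch of that no-context fallback, reusing the earlier hypothetical helpers; the default answer, distance threshold, and llm() callable are assumptions.

```python
# If nothing sufficiently relevant comes back from the vector database, return a
# canned answer (or branch elsewhere) instead of letting the LLM improvise.
from redis.commands.search.query import Query

DEFAULT_ANSWER = "I'm not able to help with that. Could you rephrase your question?"
MAX_DISTANCE = 0.3  # cosine distance beyond which context is considered irrelevant

def answer_with_guardrail(r, question: str, embed, llm) -> str:
    q = (
        Query("*=>[KNN 3 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("content", "score")
        .dialect(2)
    )
    res = r.ft("docs").search(q, query_params={"vec": embed(question).tobytes()})
    relevant = [d.content for d in res.docs if float(d.score) <= MAX_DISTANCE]
    if not relevant:
        return DEFAULT_ANSWER                     # bumpers: stay out of the gutter
    return llm(build_prompt(question, relevant))  # otherwise answer with the context
```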
All right, long-term memory. I talked about this one last time as well, but it's a really interesting one. Now that context lengths are a lot bigger, people immediately assume this kind of thing might be irrelevant. But even as context windows get bigger, in my experience it's not necessarily best to just slam all of the previous history into the context window.
If you put every single thing from the previous conversation into the context window, there might be tons of irrelevant information. Doing a different type of lookup, one actually based on a specific user, a specific topic, or a specific query, ends up returning much better results.
Think about things like how many tokens go into each embedding and how many pieces of context you retrieve for each prompt; I'll get into this a little later, so I don't want to say too much here. Thinking through each of those things is really important.
It's not just "oh, there's a hundred-thousand-token context window now, let's slam everything into it" and get charged a bunch for really computationally expensive calls. It's better to work smart, not hard, essentially. So that's that topic.
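A minimal sketch of vector-backed conversational memory: store each turn with its embedding, then recall only the relevant turns. The "memory" index, its session tag field, and the embed() helper are assumptions.

```python
# Store each conversation turn with its embedding; recall only the turns that are
# semantically relevant to the current message, instead of the whole history.
from redis.commands.search.query import Query

def remember(r, session_id: str, turn_id: int, text: str, embed) -> None:
    r.hset(f"memory:{session_id}:{turn_id}", mapping={
        "session": session_id,
        "text": text,
        "embedding": embed(text).tobytes(),
    })

def recall(r, session_id: str, message: str, embed, k: int = 5) -> list[str]:
    q = (
        Query(f"(@session:{{{session_id}}})=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("text")
        .dialect(2)
    )
    res = r.ft("memory").search(q, query_params={"vec": embed(message).tobytes()})
    return [d.text for d in res.docs]
```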
Common mistakes. I thought this would be an interesting one to go over. These are things I've seen people do, or talk about, that make me think they don't really care: they just wanna do a demo and then leave it. So these are common mistakes specifically for production use. The first one is pretty obvious, which is laziness.
I know it sounds funny given I was just talking about context windows, but they don't solve everything. As I said I'd get into: you see a lot of use cases where someone just takes one of these integrations, and there are a lot of defaults in them.
Defaults for things like prompt tokens, LLM tokens, specific chunk sizes, or the way tokens are parsed or taken from documents. There are a lot of assumptions being made, and people who use these tools without understanding them often don't realize how their data is being chunked up.
They've never actually inspected it inside their database, never looked at the data after it's chunked, or looked at what's actually in the prompt when a question gets answered. Thinking about all of these things is really important, and you'd be surprised how you can use worse models on the generative side and still improve results by tuning these factors.
Do something I've been doing recently, which is like k-fold, if everybody remembers that traditional machine learning exercise. You take the variables here (context window size, number of retrieved pieces of context, number of tokens per embedding, those types of variables)
and you essentially grid-search for a good combination. The k-fold part is that you take each combination and evaluate it on different chunks of your dataset. That's a really interesting approach to make sure you've reached the best values for the factors listed here, and there are obviously a lot more.
For open-ended problems this is really hard: if the generation could be any number of things (think more creative problems), it's going to be tough. But for something like I was talking about earlier, where you have an FAQ with questions and answers that are supposed to be correct, this is something you can do, and something I've done a lot.
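A minimal sketch of that grid search with k-fold evaluation; the parameter values and the evaluate() scorer are illustrative assumptions.

```python
# Grid-search retrieval hyperparameters, scoring each configuration with k-fold
# evaluation over an FAQ-style set of (question, expected answer) pairs.
from itertools import product

param_grid = {
    "chunk_size": [256, 512, 1024],  # tokens per embedded chunk
    "num_context": [1, 3, 5],        # chunks injected into each prompt
    "chunk_overlap": [0, 64],        # token overlap between adjacent chunks
}

def grid_search(qa_pairs, evaluate, n_folds: int = 5):
    folds = [qa_pairs[i::n_folds] for i in range(n_folds)]
    best = None
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        # Average answer quality across folds for this configuration.
        score = sum(evaluate(fold, **params) for fold in folds) / n_folds
        if best is None or score > best[0]:
            best = (score, params)
    return best  # (best score, best parameter combination)
```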
Okay, I don't think that question was for me. Sorry, the chat distracted me. Index management. Let me check my time... okay, I gotta get going. So: how do I set up an index if I have multiple use cases? This one comes up a ton; I'm not sure if people just don't know. Imagine I'm Shopify and I want a per-shop set of embeddings, right?
Well, think about this: there are a couple of different ways to do it. You could do one gigantic index and then have some type of metadata filter on each record for the store. The first thing you do is filter down to just the records that belong to that store, and then do a vector search. But how many total embeddings are in that index?
What's the size of those embeddings? Does your database charge per index, and how does it charge you? Does that database even support hybrid queries in the first place? Is it supporting hybrid efficiently, or is the metadata stored somewhere else, which I'll get into in a little bit?
Those specific things are actually super important. I call this index management, but it's really index architecture, I guess; that would be a better way of saying it. What I've typically done (and this is specific to Redis, because Redis does not charge per index; you just get memory)
is this: there is some overhead to having multiple indices, but if there are, let's say, under a thousand of them, it's better to have multiple small indices. As soon as it gets over something like a thousand, maybe ten thousand (it really depends on the size of your indices and a couple of other factors),
then you can move to one large index, or to specifically grouped indices that you can search asynchronously at the same time, splitting it up that way. So thinking through how your indices are grouped, and which approach you take, is really important, not only for things like QPS and recall, but also for how much you get charged and how much you're able to spend as your use case grows.
So do that ahead of time. Project your cost, test performance at mock scale, make fake schemas, make fake embeddings. Go through that stuff and really dive into it, because I've seen people get to "wow, okay, I've reached an untenable cost, because I get charged per index, and I don't know what to do; I have to change all of my backend code now." So think through all of that.
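A minimal sketch of the two layouts for a per-tenant setup. The "products" index names and the "shop" tag field are illustrative assumptions; which option wins depends on tenant count, embedding size, and how your database charges.

```python
# Option A: one shared index, with the tenant stored as a tag and filtered at query time.
# Option B: one small index per tenant, selected by name.
from redis.commands.search.query import Query

def search_shared_index(r, shop_id: str, query_vec: bytes, k: int = 5):
    q = (
        Query(f"(@shop:{{{shop_id}}})=>[KNN {k} @embedding $vec AS score]")
        .return_fields("content", "score")
        .dialect(2)
    )
    return r.ft("products").search(q, query_params={"vec": query_vec})

def search_per_shop_index(r, shop_id: str, query_vec: bytes, k: int = 5):
    q = (
        Query(f"*=>[KNN {k} @embedding $vec AS score]")
        .return_fields("content", "score")
        .dialect(2)
    )
    return r.ft(f"products_{shop_id}").search(q, query_params={"vec": query_vec})
```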
Okay, separate metadata. These two plots actually come from a recommendation system. How's it going, Demetrios, am I over time? I'm over time. Way over, but don't worry, there are so many amazing questions in the chat.
All right, well, let me just get through it; I'm gonna blitz. Separate metadata: this comes from a recommendation-system use case. On the left-hand side, the metadata is stored separately from the vectors.
That means it requires two network hops to go get. If instead you can use the KNN search itself to return those specific fields, you get a huge boost, because you're eliminating a network hop. There's a lot more here, but go check out the material; I spoke about it at GTC.
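A minimal sketch of what co-located metadata buys you: the same KNN query returns the metadata fields, so there's no second hop to a separate store. The field names are illustrative assumptions.

```python
# One round trip: the KNN search returns item metadata along with each match,
# instead of requiring a second lookup against a separate metadata store.
from redis.commands.search.query import Query

def recommend(r, query_vec: bytes, k: int = 10):
    q = (
        Query(f"*=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("title", "price", "score")  # metadata comes back with each hit
        .dialect(2)
    )
    return r.ft("products").search(q, query_params={"vec": query_vec})
```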
Architecture considerations: I'm gonna skip that one. There's a lot more here; go check out Chip's talk if you want an explanation of it, but all of these things are very important. Providers, I'm gonna skip that too. Example architectures: I wanted to get to this, okay. Example architecture one is an example question-and-answer architecture for AWS.
You have the vector database, LLMs on SageMaker, the Cohere multilingual model, Prometheus for monitoring, FastAPI for the API layer, AWS chatbots, Q&A, et cetera, et cetera. We have Terraform for it down there. These slides will be on my website, partee.io. There's also an Azure-based document intelligence architecture, a very similar thing, all in those repos.
So go take a look at that; that one has Terraform to deploy it all in one place. Shout out to Anton, thank you very much. And then there's a general on-premise one, which is really interesting. This one was actually set up by us; it's all on-prem and can be deployed anywhere.
There's also a Kubernetes version of it if you're interested. I'm sorry I didn't get to more of the talk; I talked too much, apologies. But hopefully I can do it again sometime and I'll host it again. Yes! Dude, why couldn't you talk that fast the whole time? I felt like I was talking fast.
Oh, that was classic. There are so many incredible questions in the chat, man. So anyone asking questions in the chat: I think Sam will probably go in there right now and answer them. But he's also in Slack, and we have that community-conferences channel in Slack, so tag him directly and we can start threads there, because
then more people can chime in, it's a little more organized, and Sam knows, okay, we're talking about this, and it doesn't get as chaotic. The chat is great here, but if we're really trying to have deep conversations, throw it in Slack. Anyway, Sam, for everyone, I think the main takeaway of this talk is that you know what you're talking about.
People should know you, and especially the horses and cows around my neck of the woods, they're gonna know your name next time I come with the questions, aren't they? Yeah, and again, I'm too passionate; I'll talk about it forever, so I just took too much time. But anybody who has specific questions, do tag me like Demetrios said, and I'll get to it, I promise.
And the other thing people can do: if you click on the Solutions tab on the left-hand sidebar, there's a whole Redis section where you can go super deep and see that Redis has all kinds of cool stuff on offer. That's very worthwhile; you can enter their virtual booth and check it out.
So Sam, sadly we have to wrap up. Wait, are you sharing something else on the screen that I should share? What is this? Well, I was just gonna say: go follow me on Twitter or on LinkedIn. The slides and the talk will be up once Demetrios gets to it (you never know with Demetrios, obviously), but I'll also repost them on my website, as I do for a bunch of my talks.
Go check out the cookbook on Redis Ventures; it's all up there. Take a screenshot, and hit me up on Twitter or LinkedIn or something. Yep. So luckily this time you guys are the... oh, I got you off your slides; I wanted to see your face when I say this one. You guys are the diamond sponsors.
So you're gonna be the first talks that I edit; they'll be coming out in a few days, hopefully, if we can get to it. But in the meantime, I'm gonna kick you off, and we'll see each other in two weeks when I'm in San Francisco for the LLM Avalanche meetup. Love having you.