MLOps Community

Vector Databases and Large Language Models

Posted Apr 18, 2023 | Views 2.9K
# LLM in Production
# Vector Database
# ChatGPT
# Redis
# Redis.com
# Rungalileo.io
# Snorkel.ai
# Wandb.ai
# Tecton.ai
# Petuum.com
# mckinsey.com/quantumblack
# Wallaroo.ai
# Union.ai
# Alphasignal.ai
# Bigbraindaily.com
# Turningpost.com
SPEAKER
Samuel Partee
Principal Applied AI Engineer @ Redis

Sam is a Principal Engineer who helps guide the direction of AI efforts within Redis. Sam assists customers and partners deploying Redis in ML pipelines for feature storage, search, and inference workloads. His background is in high performance computing and machine learning systems.

SUMMARY

Generative models such as ChatGPT have changed many product roadmaps. Interfaces and user experiences can now be re-imagined and often drastically simplified to something resembling a Google search bar where the input is natural language. However, some models remain behind APIs without the ability to re-train on contextually appropriate data. Even when the model weights are publicly available, re-training or fine-tuning is often expensive, requires expertise, and is ill-suited to problem domains with constant updates. How, then, can such APIs be used when the data needed to generate an accurate output was not present in the training set because it is constantly changing?

Vector embeddings represent the impression a model has of some, likely unstructured, data. When combined with a vector database or search algorithm, embeddings can be used to retrieve information that provides context for a generative model. Such embeddings, linked to specific information, can be updated in real time, providing generative models with a continually up-to-date, external body of knowledge. Suppose you wanted to make a product that could answer questions about internal company documentation as an onboarding tool for new employees. For large enterprises especially, re-training a model on this ever-changing body of knowledge would be untenable in terms of cost-to-benefit ratio. Instead, using a vector database to retrieve context for prompts allows for point-in-time correctness of generated output. This also prevents model "hallucinations," as models can be instructed to provide no answer when the vector search returns results below some confidence threshold. In this talk we will demonstrate the validity of this approach through examples. We will provide instructions, code, and other assets that are open source and available on GitHub.

TRANSCRIPT

Link to slides

Intro

Oh, there we go. Okay. I'll try to blitz through it because I know we don't have much time. But thanks to everybody that stuck around through the technical difficulties. We're talking about vector databases and large language models. I'm Sam Partee, a principal engineer at Redis. And without further ado, we'll get started.

Vector Databases and Large Language Models

Next slide. Yeah. First, let's talk about what vector embeddings are, mostly since not everybody knows what a vector database is for or even what goes inside one. Vectors commonly represent unstructured data: audio, text, or images. They represent these as a high-dimensional embedding.

And essentially, I just say this is a list of numbers, where each of those numbers represents some piece of information. And these come out of machine learning models, which you've probably heard all about today from OpenAI, Hugging Face, Cohere. It has become incredibly easy to use these APIs, actually extract embeddings from them, and use them.
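As a concrete illustration, here is a minimal sketch of extracting embeddings. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are not named in the talk; the hosted OpenAI or Cohere embedding APIs work the same way.

```python
from sentence_transformers import SentenceTransformer

# Load an embedding model (assumed model; hosted embedding APIs are drop-in alternatives).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["That is a happy person", "That is a very happy person"]

# Each sentence becomes a fixed-length list of floats, i.e. a vector embedding.
embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g. (2, 384): two sentences, 384 dimensions each
```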

We'll give an example here of similarity search and how this actually works. I'm breezing through this, but I actually wrote a blog post on this a little bit ago with Demetrios and the whole crew at the MLOps community, which covers a lot of this. So if you have any more questions about how this kind of search actually works, just let me know.

And visit that link. So, three semantic vectors: that's our search space here, represented by the three red lines in the plot you see to your right. And then one semantic vector is our query: "That is a happy person." You can imagine that any of these three were created by the code on the last screen, from a Hugging Face model or an OpenAI model or what have you.

And each of these three semantic vectors makes up our search space. When we take our query vector, "That is a happy person," what we're doing is calculating how similar they are. We're trying to find the most similar vector to that query. Well, how do we do that? We just calculate the distance. We say: how far is that vector, that list of numbers that represents the input that we put into that machine learning model?

How far is that from any of the other vectors in our search space? And we do this through a metric called cosine similarity in this case, which actually calculates the cosine distance between any number of those vectors you see represented in the plot there. And so you'll see these numbers down on the bottom right.

And "That is a very happy person" is obviously the most similar sentence to "That is a happy person," even though "that," "is," and "happy" are all words that also appear in the other sentences. "That is a very happy person" is the most semantically similar to "That is a happy person." And so that is a major advantage of these models: the ability to capture semantic representation, which will become very important later.
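To make that concrete, here is a self-contained sketch of cosine similarity over sentence embeddings; the model name is an assumption, and the sentences mirror the example on the slide.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product normalized by the vectors' magnitudes; closer to 1.0 means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = model.encode("That is a happy person")
for sentence in ["That is a very happy person",
                 "That is a happy dog",
                 "Today is a sunny day"]:
    score = cosine_similarity(query, model.encode(sentence))
    print(f"{score:.3f}  {sentence}")
```

The "very happy person" sentence scores highest even though the other sentences share individual words with the query, which is exactly the semantic behavior described above.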

So the search space that we were talking about can actually be represented inside a database called a vector database, and that's where these vector databases come in: you have all of these embeddings that you've created from any number of these APIs or models, and you put them into a vector database, which also provides the ability to build a secondary index.

And so in this case you can have an index and all of these embeddings stored in the same place, which allows you to then provide a query interface to applications. So vector databases are essentially a way to operationalize vector similarity search. This makes it easier to deploy and do things like CRUD operations when you go to production and you're actually using these in a real application.
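Conceptually, the interface is small. Here is a toy, in-memory sketch, not any particular product's API, of the operations a vector database exposes: CRUD on (id, vector, metadata) plus nearest-neighbor queries.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class TinyVectorStore:
    # Toy illustration only: a real vector database adds indexing (FLAT, HNSW),
    # persistence, filtering, and a network query interface.
    docs: dict = field(default_factory=dict)  # id -> (embedding, metadata)

    def upsert(self, doc_id, embedding, metadata):
        self.docs[doc_id] = (np.asarray(embedding, dtype=np.float32), metadata)

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def knn(self, query, k=3):
        # Return the k stored items most similar to the query vector (cosine similarity).
        q = np.asarray(query, dtype=np.float32)
        scored = [
            (float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec))), doc_id, meta)
            for doc_id, (vec, meta) in self.docs.items()
        ]
        return sorted(scored, key=lambda t: t[0], reverse=True)[:k]
```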

Uh, next slide please.

So why am I talking about this? Why is Redis even in this conversation? Because with RediSearch, Redis is a vector database. So when you have both Redis and RediSearch, you have the ability to do secondary indexing on hash or JSON documents stored inside of Redis. We have two index types: FLAT and HNSW (hierarchical navigable small world).
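As a hedged sketch of what that looks like in practice with the redis-py client (the field names, dimensions, and data here are placeholders, not the talk's code):

```python
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# Secondary index over hash documents with an HNSW vector field.
schema = (
    TextField("content"),
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE",
    }),
)
r.ft("docs").create_index(
    schema,
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# Store a document: plain fields plus the embedding as raw float32 bytes.
r.hset("doc:1", mapping={
    "content": "That is a happy person",
    "embedding": np.random.rand(384).astype(np.float32).tobytes(),
})

# KNN query: the 3 stored vectors nearest to a query vector.
query_vec = np.random.rand(384).astype(np.float32).tobytes()
q = (
    Query("*=>[KNN 3 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("content", "score")
    .dialect(2)
)
results = r.ft("docs").search(q, query_params={"vec": query_vec})
```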

And we have a bunch of integrations coming out. You probably heard from Harrison Chase earlier, or if you're going to the meetup in San Francisco later today, you'll hear even more from people like Sebastián from FastAPI, and Harrison, and Simba from Featureform, about all of the cool things they're doing in that space.

We have integrations there. Also really cool is Relevance AI, which actually allows you to basically have a GUI on top of your vector database. All of these are really interesting. But one that we're really excited about is our GPU index that's coming out with Nvidia. We're working directly with them to be able to actually put your index on GPU.

So that's a little bit of why Redis is talking about this and what we've been doing in the field. But if you want to try it out, there's an OpenAI example cookbook, and that docker run right there will tell you how to spin up an instance and try it out yourself. Next slide.

But we're here to talk about large language models, so what do vector databases have to do with this? Because "large" was not large enough. These large language models are already incredibly encompassing, you know, trained on all of Reddit and all of Wikipedia and all of these various places.

But they don't know everything, and especially they don't know everything about what you are doing: your confidential information, your proprietary documents, your rapidly updating pieces of information. So we're gonna talk about how vector databases actually solve that portion of the large language model problem.

Next slide, please.

So there are three that we'll talk about today: context retrieval, large language model memory, and large language model caching. I'll talk about each of these in the context of various use cases. But essentially, the hottest one right now is called context retrieval. You see people doing this with things like the retrievers in LangChain.

The way I like to think about this is: the vector database is a golden retriever, and the large language model is essentially someone playing fetch. Every time they want to go and get something, the golden retriever goes out and gets it for them. I say this because the operations performed by a vector database are relatively simple and straightforward, just like playing fetch.

However, the operations performed by a large language model are not. And so that's why I like that analogy: the vector database supplies the large language model with the particular piece of context it needs for a particular piece of information, and retrieves it for it. That is also relevant to a lot of the different use cases we'll go over.

But a really interesting one is actually large language model memory. It's similar in the sense that it provides a contextual database outside of the large language model, one the large language model may not have been fine-tuned on. But in this case, it also provides specific enhancements for things like chatbots, which we'll talk about.

And then the last one is a simple caching use case. But in this type of area, you can't just ask: is this the same piece of text, is this query exactly the same? Because it's not always the exact same words, but they might be the same question. And so you can imagine how vector databases might be used for that.

So we'll talk about each of these as we go down. Next slide.

Okay, so Q&A systems are really huge right now, for all types of use cases. Google Docs really didn't like my bullet points when we translated these, but you'll see a bunch of different use cases that should have been listed there at the bottom. And you'll see an architecture here that you can find on GitHub.

RedisVentures is our GitHub, so there'll be a bunch of different examples in there that you can go out and check out. Here the vector database is used as an external knowledge base, like I've been talking about. And so when you go and ask a question, what's gonna happen is that question's gonna be turned into an embedding.

And that embedding is gonna be used to search through the vector database for semantically similar context, and that context will be retrieved (that golden retriever analogy) for the next stage. You can think of it like a chain. Thanks, Harrison. The next stage will be the generation, where that context is gonna be used to inform the large language model of something that it may not know about: something that may be proprietary, something that may be confidential, or something you might not want to have put into the fine-tuning process.
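Putting those two stages together, here is a hedged, self-contained sketch of the retrieve-then-generate chain. The document list, the embedding model, and the prompt wording are illustrative placeholders, and the OpenAI chat API is used as one example of a generative model.

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # reads OPENAI_API_KEY from the environment
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder "knowledge base"; in practice these live in a vector database.
documents = [
    "New employees receive laptops on their first day.",
    "The VPN must be enabled before accessing internal dashboards.",
]
doc_embeddings = embedder.encode(documents)

def answer_question(question: str, top_k: int = 1) -> str:
    # 1. Turn the question into an embedding.
    q = embedder.encode(question)
    # 2. Retrieve the most semantically similar documents (brute-force here;
    #    a vector database performs this step at scale).
    sims = doc_embeddings @ q / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q)
    )
    context = "\n".join(documents[i] for i in np.argsort(sims)[::-1][:top_k])
    # 3. Generate an answer grounded in the retrieved context.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content

print(answer_question("When do new hires get their laptops?"))
```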

This is also cheaper and faster than fine-tuning, and it allows for real-time updates to the knowledge base. Think about it: you wouldn't want to have something on a millisecond time scale that you needed to fine-tune for. Instead, you could have an external database where that context is rapidly updated.

Say you're making trades in the stock market or something, and you really wanted the latest news to go into something that suggested which stocks to trade on. I don't know, finance isn't my thing, but that kind of thing. You would need an external knowledge base that is updated at the pace of something like the stock market.

Next use case, please. Next slide, thank you. Okay, so long-term memory. I'm gonna blitz through this.

We'll just go on to the query part. Uh, one back, one back if you could.

So just like I mentioned, for chatbots it's really useful to actually have context. On the previous page, you saw how in a Q&A system you might actually have the user ask a question and then the previous chat history be used as the context to answer that question. Well, ChatGPT Memory is a really interesting project that allows for addressing topic changes and multiple user sessions, and it addresses this problem of context length.

I know we now have 32K tokens, but not all models have those, and at the same time, even 32K tokens isn't enough in a lot of cases. This particular methodology allows you to have only the last k messages relevant to a particular message in some chat history, isolated for that particular session or use case.

So this is a really interesting way to address that problem for a chatbot-like scenario. I highly suggest checking out the project. Next slide.
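As a minimal, hedged sketch of the idea (this is not the project's API, just the retrieval pattern it describes): keep an embedding per past message, and on each new turn pull back only the k most relevant messages instead of the whole history.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
history = []  # per-session list of (embedding, message)

def remember(message: str) -> None:
    history.append((model.encode(message), message))

def recall(current_message: str, k: int = 3) -> list:
    # Return only the k past messages most relevant to the current turn,
    # rather than stuffing the entire chat history into the context window.
    q = model.encode(current_message)
    scored = sorted(
        history,
        key=lambda item: float(q @ item[0] / (np.linalg.norm(q) * np.linalg.norm(item[0]))),
        reverse=True,
    )
    return [msg for _, msg in scored[:k]]
```

In a real deployment the history would live in a vector database keyed by session, so topic changes and multiple user sessions can be handled outside the application code.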

And then lastly, this diagram was actually taken from the Zilliz team; they had just released GPTCache. It's a really interesting concept, and some people have already started implementing this with Redis, from what we've seen. GPTCache is a really cool project, though. It's essentially where you use a semantic vector database to say: if I have a query that is semantically similar enough to one I've seen before, and I have already cached the answer to that semantic query, and it clears some threshold I've decided is okay for that query to be answered with, then I can simply return that answer.

And so what this does is save on computational and monetary costs, and it speeds up your applications, because large language models are slow. Generally it's applicable to almost every single use case that employs a large language model. We've also been working with Nvidia on a version of this called the Triton response cache, which is soon to be coming out as well. It's really interesting work in saving on computational costs with caching. So definitely go and check that project out, and keep up to date with where we are on that project as well.
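Here is a minimal sketch of that semantic-cache pattern in the spirit of GPTCache (not its actual API); the model name and the threshold value are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = []        # list of (query_embedding, answer)
THRESHOLD = 0.9   # similarity above which two queries count as "the same question"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str, call_llm) -> str:
    q = model.encode(query)
    for emb, cached_answer in cache:
        if cosine(q, emb) >= THRESHOLD:
            return cached_answer      # cache hit: skip the slow, costly LLM call
    result = call_llm(query)          # cache miss: pay the LLM cost once
    cache.append((q, result))
    return result
```

Two queries worded differently but asking the same underlying question can then be served from the cache, which is exactly the scenario described above.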

One last thing. I know it's a lightning talk, and I think I'm already way too long given the technical difficulties. But if you have any other questions, hit me up at sam [email protected], or at Sam Partee on Twitter, or through our GitHub and our solutions page, for the marketing folks that are there. And so if you have any other questions, definitely let me know.

There's a ton more to talk about here, and we're gonna be giving more talks throughout the year. But thanks to Demetrios and the community folks for having me on, I appreciate it.

Thank you so much, Sam, that was awesome, and I appreciate you going at lightning speed.
