MLOps Community

Embeddings and Retrieval for LLMs: Techniques and Challenges

Posted Jun 20, 2023 | Views 734
# LLM in Production
# Embeddings and Retrieval
# Chroma
Anton Troynikov
Founder / Head of Technology @ Chroma

Anton Troynikov is the cofounder of Chroma, an open source embedding store built for AI applications. Previously, Anton worked on robotics with a focus on 3D computer vision. He does not believe AI is going to kill us all.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

Retrieval augmented generation with embeddings and LLMs has become an important workflow for AI applications.

While embedding-based retrieval is very powerful for applications like 'chat with my documents', users and developers should be aware of key limitations, and techniques to mitigate them.

Now I've got Anton to bring on the stage. And Anton, I made this little wardrobe change just for you, 'cause I felt like you would vibe with the hat. And yeah, the "I hallucinate more than ChatGPT" shirt, that's a really cool shirt. I think I might have to get a version of that myself. Maybe you guys can send me some swag.

That'll be sick there. There we go. We'll do a swag exchange. Speaking of swag, oh my god, I gotta give out swag. I see some incredible, uh, questions coming through in this chat. Gio Rio, reach out to me man. I need to give you some swag and for the other one, what do we got here? Lan? Yes, you mentioned that. Uh, YouTube stuff.

Yeah, and it did because we are keeping it all on the platform. Reach out to me on the platform. I'll get you some swag. There we go. So Anton, I think you got swag that, uh, you're gonna be contributing to the stock pile. That's right. Which is awesome. Chroma has been absolutely on a tear lately, and you are going to show us all about some of the embeddings and retrievals for LLMs.

I am gonna just let you take over, man. Good. I'll get this, uh, this commercial off of this screen right now. Great. And let you share your screen if you want. I think, um, yeah, we gotta make sure that works before I jump off. Sure. So feel free to share it. And then we are going to hop into the next talk.

Great. And one quick question, Demetrios: should I be able to see the chat from the audience? No. So you gotta go to the, I'll drop the link in the, okay. So I actually do have to go to a separate link. That's weird, but I'll take it. Yeah. I mean, it's 2023. You'd think this video stuff would be solved.

Video client technology, one of the hardest problems in computer science. It's so, so hard, man. And drop me, drop me the link. I'll, I'll make sure that I have that open as well. Um, so that we can do the QA towards the end. Um, yeah, I'm gonna, I'm gonna stick to time. I'm gonna try to get this thing back on track.

All right. Let me just click this open and then I'm just gonna make sure that I've got everything that I need. Yeah. Maybe later. Uh, and then, all right, I believe I have the chat going here. Uh, yes. Maybe, perhaps messaging disabled. I don't know. We're looking, looking. All right. I mean, if you want, I can also jump in when there are questions.

Yeah, that would be helpful. In case everything is broken, I would love to have a moderator who can see it. That's my job, man. You are in luck. That is me today. All right, man. Um, definitely vibing with the outfit here. Uh, I'm gonna kick it off. Thanks very much for hanging around.

All right. Here we go. You gotta share your screen still, though. I still don't see it. How's that? Ah, there we go. Game on. All right, excellent. Um, hi folks. Thanks for joining my talk today. As part of this conference, the LLMs in Production folks have been really putting on great events, and I'm glad to be part of part two, despite missing out on part one, having gone down with a pretty severe non-COVID respiratory virus for some reason.

So today we're gonna talk about using embeddings and retrieval for LLMs. And specifically I'm gonna go into some of the techniques and challenges that you might face if you're deploying this kind of idea in your application.

So let's get started with the basics. What's the basic idea of using retrieval in an LLM context? Rather than just relying on the model's trained knowledge, in other words the data it was trained on, we can add additional context to the model's input by pulling in information relevant to our query, using a programmable memory module, which in our case is an embedding store.

So what is an embedding? Again, for most of you this is just a recap, but an embedding model is something that takes data of any kind of modality, could be text, images, or video, and returns a set of numbers. It's a dense vector, in other words every entry in this vector is populated, and that vector is called an embedding.

You can think of each embedding like a point on a map, and it's a map of meaning. That's the beauty of embedding models. Embedding models encode semantic, that is meaningful, information, and they're trained in such a way that when they embed data, similar things are embedded close together on this map.

Now, embedding models usually work in very high dimensions, from around 500 to, say, 1,536 for OpenAI's ada-002 model, so we can't really visualize their full dimensionality. But this is what one embedding model might look like when projected down to 2D space where we can see it, and we see that things to do with cities, body parts, food, et cetera tend to lie close to one another in this map of meaning.

We can exploit that fact. This is how you typically build a retrieval-augmented generation system using an embedding store like Chroma: we exploit the fact that things that are close together in meaning tend to be close together in embedding space.

So first we embed our data set. We then take a query from our application using an LLM, embed that query, and get the embedding for it. Then we find the nearest neighbors to that query in embedding space, get the documents associated with them, and use those as additional context for the LLM to generate our final answer.
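The retrieval loop just described (embed the data set, embed the query, find nearest neighbors, pass the documents along as context) can be sketched in a few lines. This is a minimal illustration, not Chroma's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `retrieve`, `build_prompt`, and the sample documents are made up for the example.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Embed the query and return the k nearest documents."""
    q = toy_embed(query)
    return sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)[:k]

def build_prompt(query, context_docs):
    """Put the retrieved documents into the LLM input as additional context."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

DOCS = [
    "Chroma is an open source embedding store.",
    "HNSW is a graph based index for nearest neighbor search.",
    "Paris is the capital of France.",
]

question = "What is the capital of France?"
prompt = build_prompt(question, retrieve(question, DOCS, k=1))
```

In a real application the string in `prompt` would be sent to the language model, and the linear scan in `retrieve` would be replaced by the embedding store's nearest-neighbor query.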

So in practice, what this means we're doing is rather than treating the large language model as a general purpose, general knowledge machine, we're actually relying mostly on its reasoning capability. So its ability to actually draw conclusions from and transform text into one form or another. So again, by providing additional context, we're essentially giving this model programmable memory.

Now, we've made this very easy to try with Chroma, and you can go ahead and try it here. It's as simple as pip install chromadb, and these are all the commands you need to get going. We've built this deliberately to be as simple as possible, and it's very powerful.

A lot of people are using Chroma for all kinds of applications, including document-based question answering, where you can basically interact with a large corpus of text using natural language in a nonlinear way. And that's very powerful in all kinds of domains like legal, medical, and research, and even customer support or sales automation, any sort of domain like that. It's a very powerful general technology.

Another place where we've seen this sort of retrieval-augmented generation working, and this is a much more emergent category, but one I think will be very powerful in the future, is agents. Agents are basically AI-powered automatons, for want of a better word, that use embedding-based retrieval as a store for the interactions and skills that they've learned, basically as a memory for their environment and interactions. And that makes them very powerful. Very recently a research paper came out called Voyager, which created a Minecraft agent that uses Chroma as the memory layer and quickly achieves state-of-the-art performance.

These agents are fairly toy-like today, but I think in the near future they'll actually be very powerful. So all of this sounds wonderful, right? It sounds like we have this great general purpose technology and we can go ahead and deploy it right away. And the previous panel spoke about some of the infrastructure considerations and requirements for building things like this, and those are understandable.

But what I'd like to talk about in a little bit more depth today are the challenges around getting this right from the perspective of the models and the embeddings themselves. These are some common questions that we receive from developers and teams building embeddings-based retrieval with Chroma.

Top of mind, of course, is: which embedding model should I use for my data? There are plenty of text embedding models available. Some of those are specialized to particular domains like code, some are general purpose, and it can be hard to understand which model you should be using for your particular task.

The next question that we get, and this is again a very frequent question, is: how do we chunk up our data? Typically when you're using an embeddings-based retrieval approach, your data is going to be divided into chunks so that it fits into the target large language model's context window and contains the relevant information. But that can be difficult to do, so this is another challenge.

And finally, what we get asked quite often is: how do we figure out if the retrieved results we get are actually relevant? Relevant actually means two different things. First, relevant might mean relevant to the task that the user is asking the model to perform, but relevant might also be specific to the user. For example, one user might care about how fast a given vehicle will drive, while a different user cares much more about the miles per gallon that it can get.

So the things that are relevant to search for differ by user. That's a very simple example, but you can imagine many other cases like that, where you have a retrieval-augmented generation system using embeddings-based retrieval, as we've discussed here, and it needs to adapt to each user individually. Actually, that's one of the most powerful things about AI. It's very flexible. It can quickly adapt to all kinds of different use cases and users.

So here's the bad news. Today nobody really has the answers to these problems. This technology is still very early, and obviously it's getting very rapid adoption. It's an era of very broad experimentation. A lot of the best practices and tooling are being worked out right now.

The good news is that these are really important problems, so there's a lot of investment, in terms of capital and time and compute power, by a variety of groups around the world dedicated to answering them. And the fact that we have open source tooling, things like Chroma, to support that kind of investigation will accelerate how quickly we can answer these questions sufficiently well that we can build great applications on top of them.

So let's go into some of these questions in a bit more depth. The first question is: which embedding model should we use?

Now, the diagram on the right is actually four different embedding models trained in slightly different ways using the same data, and this is the sort of map of meaning that you get out of them. What you'll notice is that these maps are very close to one another in terms of layout, up to basically a rotation.

And there's some empirical research that's come out not that long ago which essentially shows that embedding models tend to learn very similar representations as long as they have similar training objectives. So that's one thing to remember: maybe the real improvement you can get for your application isn't from switching embedding models. But in case you do wanna go down this route, there are several existing benchmarks.

I've got a few listed here, and hopefully the slides are shared later so you can click on these links, but they're called BEIR, MTEB, and KILT. These are existing benchmarks for what's called the information retrieval task, which is the backbone of embeddings-based retrieval-augmented generation. It has retrieval right in the name.

They're data sets, but they're also frameworks for benchmarking retrieval systems, and this will allow you to benchmark a variety of embedding models on your own data. And so how would you go about doing that? Well, first you need to collect human feedback on the relevance of the returned results.

This can be as simple as adding a thumbs up or thumbs down button, uh, in your application so that your users are essentially labeling the relevance of your data for you. That allows you to construct your own data set specific to your particular task, and then you can obviously embed that data set with a variety of different models and then run the retrieval system and evaluate the effectiveness of that retrieval system using any of the frameworks developed for these existing benchmarks.
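As a rough illustration of that evaluation loop, the sketch below scores two candidate embedding models against thumbs-up feedback using recall@k. It does not use the BEIR, MTEB, or KILT tooling; the function names, model names, query ids, and document ids are all hypothetical, and recall@k is just one of several metrics those frameworks report.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, feedback, k):
    """Average recall@k over every query that has user feedback.

    runs:     query id -> ranked document ids returned with one embedding model
    feedback: query id -> document ids users marked with a thumbs up
    """
    scores = [recall_at_k(runs[q], feedback[q], k) for q in feedback]
    return sum(scores) / len(scores)

# Hypothetical thumbs-up labels collected from the application.
feedback = {"q1": {"d1", "d4"}, "q2": {"d2"}}

# Hypothetical rankings for the same queries under two embedding models.
runs_by_model = {
    "model_a": {"q1": ["d1", "d4", "d3"], "q2": ["d3", "d2", "d1"]},
    "model_b": {"q1": ["d3", "d1", "d2"], "q2": ["d1", "d3", "d2"]},
}

scores = {name: mean_recall_at_k(runs, feedback, k=2) for name, runs in runs_by_model.items()}
best = max(scores, key=scores.get)
```

With only a handful of labeled queries the comparison is noisy, so it is probably worth collecting feedback for a while before trusting the ranking of models.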

They are designed to be fairly straightforward to use, and you shouldn't be scared by the fact that they're developed for machine learning research. They're very relevant to production deployments of this kind of application. So the next question that we get is, of course: how do we chunk our data?

There's a few things to consider. One pitfall that I've seen a few people fall into is forgetting that the embedding model, just like the large language model, has a finite context length. It's a fixed context length, and so what typically happens if you input a document larger than the model's context length is that it will get truncated, which means the embedding will only capture the meaning in the first part of the document, before truncation.
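One way to guard against that silent truncation is to count tokens before embedding and split anything that does not fit. The sketch below is an assumption-laden illustration: `count_tokens` uses whitespace words as a stand-in for the embedding model's real tokenizer, and `MAX_TOKENS` is a made-up, deliberately tiny limit.

```python
def count_tokens(text):
    """Stand-in tokenizer: whitespace-separated words.
    A real embedding model has its own tokenizer and its counts will differ."""
    return len(text.split())

def split_to_fit(text, max_tokens):
    """Split text into consecutive windows of at most max_tokens tokens,
    so that no part of the document is dropped by truncation."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

MAX_TOKENS = 8  # hypothetical limit, kept tiny for illustration

doc = "one two three four five six seven eight nine ten eleven twelve"
pieces = [doc] if count_tokens(doc) <= MAX_TOKENS else split_to_fit(doc, MAX_TOKENS)
```

This fixed-size split keeps every word, but it ignores sentence boundaries, which is the next concern the talk raises.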

So you have to be careful with that. You have to really pay attention to make sure you're not throwing away parts of your documents. Now the next thing to consider is, of course, when you're chunking, how do we consider semantic content? What I mean by that is you probably don't wanna be dividing in the middle of sentences or even in the middle of words, because you're losing meaning.

You're losing the semantic content that allows the embeddings to actually function properly, and you're also gonna get garbled retrieval results. There are research results that show that distracting content or partial content inserted into a language model's context window really degrades its performance fairly significantly.

So you have to be careful around these parts of chunking. Now, the other thing to consider is that a lot of the types of documents and data that we use as humans, and that we'd like the model to perform operations on, already have a natural structure. Things are typically divided into chapters and sections and pages.

And because they're designed to be read by a human, they're also nicely structured to be read by the large language model, so you might consider using some of that structure in your chunking strategy. There's a few great tools built for natural language processing or for large language models in particular. The Natural Language Toolkit, NLTK, is one such thing.

It provides quite a few pieces of tooling that allow you to break things up by sentences or by sections or by paragraphs in an automatic way. And LangChain provides a few different chunking strategies which are already pre-programmed. They're just packages you can run to chunk up your data automatically.
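To make the idea of sentence-aware chunking concrete, here is a small sketch that packs whole sentences into chunks and never cuts inside one. It is not the NLTK or LangChain implementation: `split_sentences` is a naive regex stand-in for a proper sentence tokenizer, and the word budget stands in for a real token limit.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    A real sentence tokenizer handles abbreviations and other edge cases."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_by_sentence(text, max_words):
    """Pack whole sentences into chunks of at most max_words words.
    A single sentence longer than max_words becomes its own chunk."""
    chunks, current, size = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and size + n > max_words:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "Embeddings map text to vectors. Similar texts land close together. "
    "Chunking splits documents. Each chunk is embedded separately."
)
chunks = chunk_by_sentence(text, max_words=10)
```

The same packing loop works with paragraphs or sections as the unit instead of sentences, which is one way to use a document's natural structure.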

These are fairly basic ideas, and I think that there are more powerful ideas which are currently very much in an experimental mode, but which we are experimenting with, and we hope to see more of the community experiment with them too. The first of these, which is really exciting for me, is actually using the language model itself to figure out where the semantic boundaries are, where the boundaries of meaning are.

The way to do that is essentially to have the model predict the likelihood that the next token is the one in the document. When the likelihood of those tokens gets low, in other words when the perplexity gets high, this means the model is uncertain about what to predict next, which means perhaps (we haven't evaluated this yet, that's why it's experimental) that this is a semantic boundary.
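A sketch of that perplexity-based idea, assuming per-token log-probabilities are already available from some language model: tokens whose surprisal (negative log-probability) is unusually high are flagged as candidate boundaries. The log-probability values and the z-score threshold below are invented for illustration, and since the talk describes the approach as unevaluated, this is an experiment to try rather than a known-good method.

```python
import math

def surprisal_boundaries(token_logprobs, z_threshold=1.5):
    """Return indices of tokens whose surprisal (-log p) is more than
    z_threshold standard deviations above the mean surprisal."""
    surprisal = [-lp for lp in token_logprobs]
    mean = sum(surprisal) / len(surprisal)
    std = math.sqrt(sum((s - mean) ** 2 for s in surprisal) / len(surprisal))
    if std == 0:
        return []
    return [i for i, s in enumerate(surprisal) if (s - mean) / std > z_threshold]

# Hypothetical per-token log-probabilities from a language model scoring a document.
logprobs = [-0.2, -0.3, -0.1, -0.4, -5.0, -0.3, -0.2, -0.1, -4.8, -0.2]
boundaries = surprisal_boundaries(logprobs)
```

A chunker built on this would cut the document just before each flagged token, possibly after smoothing the surprisal over a short window so that a single rare word does not trigger a split.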

Such a boundary is where the previous section of meaning has wrapped up and a new one is about to begin, and you might consider using it as a chunking boundary. We can also try using informational hierarchy. Oftentimes the sort of data that we use as humans tends to have a natural hierarchy, just like it has a natural structure.

So, for example, you could first summarize each chapter in a book, by asking the model or doing it in a human way, and embed the summary of each chapter. Then when you find that your retrieval step selects a particular chapter as the nearest neighbor to the query, you can look at the paragraph embeddings of that entire chapter in a different collection and find the paragraphs inside that chapter which are relevant to the query or the task at hand.
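The two-level lookup described here might be sketched as follows: one collection of chapter summaries, a second collection of paragraphs keyed by chapter, and a search that first picks a chapter and then a paragraph inside it. `toy_embed` is again a hypothetical word-count stand-in for a real embedding model, and the chapters and paragraphs are invented.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest(query, items):
    """Return the key of the item whose text is closest to the query."""
    q = toy_embed(query)
    return max(items, key=lambda key: cosine(q, toy_embed(items[key])))

# Collection 1: one summary per chapter (written by a model or by a human).
chapter_summaries = {
    "ch1": "Installing the engine and fuel system",
    "ch2": "Maintaining the brake pads and tires",
}

# Collection 2: the paragraphs of each chapter.
chapter_paragraphs = {
    "ch1": {"p1": "Bolt the engine to the frame.", "p2": "Connect the fuel line."},
    "ch2": {"p1": "Check brake pads every six months.", "p2": "Rotate the tires every year."},
}

def hierarchical_search(query):
    """First pick the nearest chapter summary, then the nearest paragraph within it."""
    chapter = nearest(query, chapter_summaries)
    paragraph = nearest(query, chapter_paragraphs[chapter])
    return chapter, chapter_paragraphs[chapter][paragraph]

result = hierarchical_search("When should I check the brake pads?")
```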

And finally, a different approach, which is again highly experimental and which we've been trying lately: you can try to use embedding continuity. What I mean by embedding continuity is that you can look at the distances between generated vectors as you feed chunks into the embedding model. By definition, if the meaning of two pieces of text is similar, their embeddings will be close together, so you can try to find meaning boundaries by looking at the distances between the previous and next chunks as they're generated.
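A minimal sketch of the embedding-continuity idea, assuming the small text pieces have already been embedded in document order: compute the cosine distance between each consecutive pair and flag a boundary wherever it jumps above a threshold. The 2-D vectors and the 0.5 threshold are invented for illustration; real embeddings have hundreds of dimensions and the threshold would need tuning.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def continuity_boundaries(embeddings, threshold=0.5):
    """Return each index i where the distance between piece i-1 and piece i
    exceeds the threshold, i.e. where the meaning appears to shift."""
    return [
        i for i in range(1, len(embeddings))
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold
    ]

# Hypothetical 2-D embeddings of five consecutive sentences.
embeddings = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.0, 0.9]]
boundaries = continuity_boundaries(embeddings)
```

Here the first three vectors point in roughly the same direction and the last two in another, so a single boundary is reported before the fourth piece.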

It's very similar to a time series analysis, but in higher dimensions. So again, some techniques worth trying. And then a very interesting question right now for application developers, and this is one of the things you really need to figure out to make your application robust, is: is the given retrieval result for a particular query actually relevant?

The traditional approach in information retrieval here is called re-ranking. There's a great re-ranking model out from Cohere which is worth trying, and there are other re-ranking approaches, because again, information retrieval is a task with a long history. You can also add human feedback to your re-ranking model to essentially train the re-ranking a little bit more for your particular task and domain.

And of course, you can use heuristics to augment your retrieval as well: things like performing a keyword search alongside the query and then taking the best results, according to the re-ranking model, between the keyword search and the semantic search over embeddings. Or you can scope your search up front if you know something from the metadata of the query.

So if you know that the user is asking about a particular part of your documentation, you can scope the search to only that part of the documentation, so that irrelevant results from other parts don't show up. Chroma supports metadata filtering out of the box. So I think the real question here is: is there some sort of algorithmic approach?
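The heuristics mentioned above (scoping by metadata, then combining keyword hits with semantic hits and re-ranking the union) can be sketched in plain Python. This does not use Chroma's filtering API or any real re-ranking model: `semantic_score` is a toy word-overlap stand-in for embedding similarity and is reused as the re-ranker, and the documents, `section` field, and `search` signature are all hypothetical.

```python
import math
import re
from collections import Counter

DOCS = [
    {"id": "a", "text": "Install the client with pip.", "section": "setup"},
    {"id": "b", "text": "Filter query results by metadata.", "section": "querying"},
    {"id": "c", "text": "Query the collection for nearest neighbors.", "section": "querying"},
]

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def semantic_score(query, text):
    """Toy stand-in for embedding similarity: cosine over word counts."""
    a, b = Counter(tokens(query)), Counter(tokens(text))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, keyword=None, section=None, k=2):
    # 1. Scope the search using metadata known about the query.
    candidates = [d for d in DOCS if section is None or d["section"] == section]
    # 2. Collect semantic hits and keyword hits, then take their union.
    semantic = sorted(candidates, key=lambda d: semantic_score(query, d["text"]), reverse=True)[:k]
    keyword_hits = [d for d in candidates if keyword and keyword.lower() in tokens(d["text"])]
    merged = {d["id"]: d for d in semantic + keyword_hits}
    # 3. Re-rank the union (a real system would call a re-ranking model here).
    ranked = sorted(merged.values(), key=lambda d: semantic_score(query, d["text"]), reverse=True)
    return [d["id"] for d in ranked[:k]]

result = search("how do I filter results", keyword="metadata", section="querying")
```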

What I mean here by algorithmic is something that we can just run without too much tuning, which actually uses some inherent property of the data. The key insight here is that the relevance of a result depends not just on the embedding model used, but also on the distribution of the data in that particular data set.

And so at Chroma we're working on an algorithmic approach to this relevancy problem in embedding-space semantic retrieval. We hope to show you something interesting fairly soon. We'll be open sourcing that and letting people just give it a try. So that's a quick overview. I think now is a good time to open it up to Q&A.

So I'm gonna stop sharing my screen here and come back to you guys. Nice. So I can see chunking being an issue for languages where the action verb tends to come at the end of the sentence. Have you thought about that? Yeah, look, multilingual embeddings are one of those big open questions right now.

I think Google has put in the most effort to make their embeddings really, truly multilingual, and that's not surprising given Google's global reach. But consider things like chunking according to the sentence structure of the language. You can imagine non-European languages, and I mean chunking even beyond the placement of actions and verbs, down to even just the meaning of characters, right?

If we're talking about some East Asian languages, where a character often has much more meaning than an individual English word, then the chunking strategy has to be fundamentally different. This is kind of why I like the model-based approaches more, because good language models should be able to model those semantic boundaries so long as they're trained on text from those languages.

That's part of why I'm bullish on the more experimental approaches here. Makes sense. Okay, so: for chunking, I find that NLTK or LangChain quite often cut off a paragraph in the middle. GPT-4 with proper prompting actually does better with chunking. Is there a way to use an LLM to do chunking and convert to embeddings in one shot?

Yeah, so there isn't a model right now that will act as a language model and produce embeddings at the same time. I think there's definitely a future for that. I think one reason we haven't seen this yet from the API providers is that if you have access to the embeddings and the output and the input of a model, you can reverse engineer that model.

With enough API calls, you'll be able to distill it much more effectively, 'cause you know the internal state. But that aside, I've presented at least one approach each for an embedding-based chunking strategy and a model-based chunking strategy. If you have access to a local model where you can also get the probabilities of the next tokens, although I think OpenAI still provides that over the chat interface, it's worth a try with that perplexity-based approach that I mentioned.

But I don't think there is a one-shot model-based approach. These are things that you'll have to build for yourselves. If you do build it, please let us know what the results are. We're very interested in this direction. Nice. There we go. Submit a PR. PRs welcome. Look, please, genuinely, the sort of contributions that we're getting from the community right now are one of the reasons it's so gratifying to be open source.

We're developing this technology together as a community. There is nobody better placed in the world today to figure out what to do with these technologies than the people trying to build with them. The era of the machine learning researcher in the ivory tower with a billion dollars of compute, completely dominating what's possible and dominating research, is over.

You now have access to these APIs. You are equally at the cutting edge as anyone at OpenAI. So good. Yes. I love it. All right, so we've got some awesome questions coming through here. I'm gonna keep it going for another like five minutes and then we will, you were true to your word, man. You are like a clock, tick-tock, and I appreciate that, because I am not. I have been all over the place on timing today.

Now, let's see this one: not to get into a pissing contest, but what are the benefits of Chroma over pgvector? Yep. That's a really great question, and this is something that is really important to how we're building Chroma.

So, pgvector is designed primarily for a semantic search use case. It's great in the case where you have a fixed dataset that is updated infrequently and you already have your other data in Postgres. So if all you're doing is, for example, semantic search over a fixed dataset, pgvector is fine.

Now, there's two issues with pgvector. One is that the recall performance on larger data sets is not great, and it requires a lot of tuning to get right, because it uses basically a clustering and inverted index approach to do approximate nearest neighbor search. What that also means is that if you have online changes, online mutations, coming into your embedding store, because you wanna store and understand user interactions (it's kind of the beauty of AI that you get to interact with it), pgvector's performance will degrade very quickly.

But Chroma is built from the ground up, and we use an algorithm called hierarchical navigable small world graphs, HNSW, which is a mouthful no matter how many times I say it. It's a graph-based data structure. It exploits the structure of embedding space in a graph-based way to provide good computational efficiency, but it also allows us to scale and deal with mutations online in a much better way.

And so we're more suited for the sort of application-based use case where you have a lot of user interactions and your data is being updated online. I would say those are two key differences. The other part, of course, with Chroma is that a lot of these problems or challenges that I've identified in my presentation today are things that we intend to build into the product.

Our core product hypothesis is essentially that you shouldn't need an infrastructure engineer to run this, and you shouldn't need a data scientist to get this capability. All of this should be built into the product that you use, and that should be Chroma. Yeah. That goes back to that idea of the researcher in the ivory tower.

It's like, let's get this into everyone's hands and let the community run with it. Yeah. And we've seen how powerful these communities can be. So yeah, strong believer in the open source community's power in AI. And I will say that is a perfect segue into a big thank you from our side for you guys sponsoring this event.

I mean, Chroma has sponsored this event and the community, which is huge for us because it helps us create more events like this. And also it allows us to do all kinds of in-person events. So I can't thank you enough. I mean, it's a, what do they call the opposite of a vicious circle? It's a virtuous circle.

Virtuous cycle. Yes. Cycle. That's it. There we go. Yeah, the cycle is strong. So there's a few more questions coming in, and since we do have the time, I'm gonna ask them. Sure. Any advice on chunking and embedding of code? For example, embedding a whole code base to do some LLM-assisted information retrieval.

Great. Yeah. So there's a couple of things that I touched on here. One is which embedding model to use. There are a number of code-focused embedding models available. OpenAI would really love you to use ada-002 for code as well. It is trained for both. It's unclear which wins currently. Do the benchmarking and report your results. We'd love to know.

Now, in terms of chunking the code base, the great thing about code is you automatically have a lot of structure. You can chunk by function, you can chunk by class, you can chunk by file, right? So you can leverage all of that structure, and it depends on what information you wanna feed the language model. I think this is actually a great candidate. And if you're following software engineering best practices, none of your files are too long and none of your functions are too long, so it's more impetus to do things the right way.

So I would genuinely try chunking across those semantic boundaries, specifically saying, okay, we're gonna chunk by function, we're gonna chunk by file, and then maybe connecting all those functions: this function belongs to that file. And then maybe if we call that function, we'll also pull in its documentation. So this is the other piece. If you have documentation for your code base, you can tie that in to the actual definitions of the functions and return both of those to the language model together. That's one of the things that's really powerful about this kind of embeddings-based retrieval.

You can send in one query and get both kinds of results together. So that's basically my advice: try a couple of the code embedding models, and then if they're succeeding for you, if they're returning relevant results, chunk up on the basis of semantic information. Code is highly, highly structured, and there's no reason not to leverage that structure.
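For Python code, one way to chunk along those structural boundaries is the standard library's `ast` module. The sketch below emits one chunk per top-level function or class and attaches the file name and docstring as metadata, so code and documentation can be retrieved together. The chunk dictionary layout and the sample source are assumptions made for the example; other languages would need their own parser.

```python
import ast

def chunk_python_source(source, filename="<unknown>"):
    """Return one chunk per top-level function or class, with metadata
    linking the chunk back to its file and its docstring."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "file": filename,
                "doc": ast.get_docstring(node) or "",
                "code": ast.get_source_segment(source, node),
            })
    return chunks

SOURCE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

class Greeter:
    """Says hello."""
    def greet(self, name):
        return f"Hello, {name}"
'''

chunks = chunk_python_source(SOURCE, filename="example.py")
```

Each chunk's `code` would be embedded as the document, with `name`, `file`, and `doc` stored as metadata alongside it.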

Yeah, that's such a great point. I mean, take advantage of it, 'cause language is not quite like that, so you gotta love it when you can. Michael was asking: same question as I had for Sam at Redis. Any hints and tips for embedding tables, whether that's encoding them or otherwise? Yeah, so this is actually another challenge which I neglected to put in my slides, but it's becoming an increasingly interesting topic.

So, obviously a lot of the world's data today lives in these structured databases, right? We think of them as algebraic data structures: you know, your SQL tables, JSON blobs, et cetera. And as far as we've been able to figure out, the embedding models don't deal too well with tabular data directly.

So there's a couple of interesting approaches that we're trying here. The first is, grab your tabular data, give it to a language model, have it summarize it in natural language, and then embed the natural language description. And then you can add, you know, a pointer to the actual table in the metadata.

So you retrieve the natural language description, and then you're able to query the appropriate table, because now you know what you need to query. Now here's the other part. Large language models are great at what you should think of as sequence-to-sequence tasks, and one sequence-to-sequence task is taking a natural language description of the information you're looking for and transforming that into a SQL query, given the description of the columns and the structure of the table.

Right, and because there's so much SQL available online, they're pretty good at generating SQL statements. So this kind of two-stage approach might work for you. I think that trying to embed tabular data directly currently doesn't work very well.
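The two-stage approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's implementation: the summaries are hand-written where a language model would produce them, a toy word-overlap scorer (`retrieve`) stands in for embedding search, and the SQL is hard-coded where a language model would generate it from the prompt. The table, data, and helper names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)],
)

# Stage 1 index: natural-language summaries of each table (an LLM would write
# these and they would be embedded), with a pointer to the table in metadata.
summaries = [
    {"text": "orders placed by customers with order totals",
     "metadata": {"table": "orders"}},
    {"text": "warehouse locations and shelf capacity",
     "metadata": {"table": "warehouses"}},
]


def retrieve(query: str) -> dict:
    """Toy stand-in for embedding search: most words shared with the query."""
    query_words = set(query.lower().split())
    return max(summaries, key=lambda s: len(query_words & set(s["text"].split())))


def describe_table(connection: sqlite3.Connection, table: str) -> str:
    """Column description that would be placed in the text-to-SQL prompt."""
    columns = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return f"{table}(" + ", ".join(f"{c[1]} {c[2]}" for c in columns) + ")"


question = "what is the sum of order totals for each customer"
table = retrieve(question)["metadata"]["table"]
schema = describe_table(conn, table)
prompt = f"Schema: {schema}\nWrite a SQL query answering: {question}"

# Stage 2: a language model would turn `prompt` into SQL; hard-coded here.
sql = "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
rows = conn.execute(sql).fetchall()
print(prompt)
print(rows)
```

The point of the split is that the embedding step only has to find the right table, while the structured query does the actual lookup.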

So here's what I think is gonna happen in the future. I think in the future, all of these sort of hierarchical, discrete steps that you have to engineer into your application will be handled by some sort of model whose entire purpose is to figure out what information is relevant to the task, and then perform computation over that stuff.

But we're not there yet. And the way to get there is to generate as much data around these kinds of tasks as possible, so we can start training those models and start operating with them and understanding what they need. It's a fundamentally different class of model that we need here. And what's more, I mean, for me personally, one of the parts of the definition for AGI is that it's a system that understands when it doesn't know something and knows how to seek and then integrate information.

Right, and so the kinds of models that can perform information-based retrieval and synthesis, they're also on the path to AGI. So these experiments we're doing today might actually lead us there tomorrow.

So anyone that is playing the drinking game with me, take a shot. He said AGI, there we go.

That is what we're doing. But do note that I don't believe we're there yet.

Yeah, I got that. That's a strong disclaimer. Right. But there is a part of me that wonders, like, how much value does that actually bring? Or is it just over-engineering? Like a hammer-looking-for-a-nail type thing.

Which part specifically?

The using a vector database, like on the tabular data.

Right. And yeah, I think actually what you want is not necessarily to think about this as a vector database or about tabular data. The task that you're trying to perform is retrieving the relevant information for the target language model.

Right. That's what you're doing. So wherever that information lives right now is the right place to keep it, right? You just have to find a representation that allows you to retrieve it. Now, you have to preserve the fact that you can really flexibly query these things. So if that means emitting SQL from a natural language query, if it means doing basically a keyword search because the language model can figure out which entities are in your query and then search for those entities as keywords in your database, or the semantic-based vector search, these are all valid techniques. I don't think that there's any over-engineering here.

I think that this is because we have these different types of data and they're represented differently. We ultimately have to transform our natural language request, our natural language task, into something that's understandable at the data level. So it's just different ways of doing that.

Yeah. Great answer.

Uh, last one for you. And we completely ate the prompt injection game. We're gonna do that tomorrow. It's been postponed because this conversation with Anton has been incredible. If you wanted the prompt injection game that was on stage two today, we did a little bit of improvising, like a jazz band.

But that's what we're good at, running these virtual conferences. This question is from Manoj. Manoj, this question's awesome. First of all, hit me up because you deserve some swag. So we're gonna get you some Chroma swag, and anyone else whose question I asked Anton, we're gonna get you some swag too. So hit me up.

There's more questions in the chat, and I'm gonna direct you there once we're done with this. So keep the questions coming. All right, here we go. The last question for you. Manoj is asking: is there a limitation on how many different heuristics we could try to co-locate embeddings?

This is a great question. I love this question for a variety of reasons.

I thought you would.

Before I founded Chroma, I spent, you know, probably the last seven years of my career in robotics. And the number one thing that you learn about robotics is that all heuristics are brittle.

If the heuristics worked, you wouldn't need a model to actually be able to get the results that you want. And I think something very similar is true here, which is why I'm talking about a future where models actually handle a lot of the tasks we want them to perform. Now, there's another implicit part, which is: I don't think today there's a limit.

I think it's necessary to experiment and find out what works so that later we can generalize these things into actually usable tools. So for now, I think just use as many heuristics as it takes to get your application into a usable state. And then see what information you can extract from those heuristics. Now, the broader question here is this question about co-locating embeddings.

I'm not sure if this is what's meant, but there's a very interesting result, which has been folklore for a while in sort of the AI community, and there's been recent empirical research about this, where it seems that embedding models tend to learn similar representations. And when they learn different representations, it seems that an affine transform, so in other words, just a matrix multiplication between two vector spaces, is enough to transform one type of embedding from one space into another embedding space. And learning those transforms, because they're just a single matrix multiply rather than a large deep neural network, is much, much cheaper.

And what that means is if you have pairs in two different embedding spaces, it's probably possible to link those together. And that has a lot of implications for multimodality. It's got a lot of implications for per-user relevance tuning, things like that. So I hope that addresses the question.
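As a rough illustration of how cheap such a transform is to learn, here is a sketch that fits an affine map between two embedding spaces with a single least-squares solve. It assumes NumPy is available, and the paired embeddings are synthetic, generated here so that an affine relationship exists by construction. It shows the fitting procedure only and is not evidence for the empirical claim about real embedding models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired embeddings (an assumption for illustration): space B is an
# affine function of space A plus a little noise.
n, dim_a, dim_b = 200, 8, 6
A = rng.normal(size=(n, dim_a))
true_W = rng.normal(size=(dim_a, dim_b))
true_b = rng.normal(size=dim_b)
B = A @ true_W + true_b + 0.01 * rng.normal(size=(n, dim_b))

# Append a constant column so one least-squares solve learns both the
# matrix and the bias of the affine map.
A_aug = np.hstack([A, np.ones((n, 1))])
M, *_ = np.linalg.lstsq(A_aug, B, rcond=None)


def to_space_b(x: np.ndarray) -> np.ndarray:
    """Map one or more space-A embeddings into space B using the fitted M."""
    x = np.atleast_2d(x)
    return np.hstack([x, np.ones((x.shape[0], 1))]) @ M


err = np.abs(to_space_b(A) - B).max()
print("fitted map shape:", M.shape)
print("max abs error on the training pairs:", err)
```

With real models you would fit `M` on embeddings of the same items produced by both models, then check the error on held-out pairs before relying on it.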

Awesome. All right. Very, very cool, man. So, I am gonna direct you to the chat now. Yep. Now you gotta go to the chat and chat with... I still have messaging disabled. I don't know what that means. You may have to fill out your profile, and then you'll get your messages. Oh, okay. Oh, you gotta give us your email so we can send you some swag.

You could have my email. All right. You know what we're gonna do? We're gonna send you some Chroma SDRs. They're gonna chase you down and see if you wanna buy some managed Chroma. All right. I'll complete my profile. I'll hang out in the chat with the folks for a little while. Thanks, Demetrios.

Thanks for helping moderate here. Here we go. I'll be in SF in two weeks. Hopefully we can meet up. Oh, check. Yeah, send me an email. We'll get it done. All right, man. I'll see you later. That was fun.
