MLOps Community

Information Retrieval & Relevance: Vector Embeddings for Semantic Search

Posted Feb 24, 2024 | Views 1.4K
# Semantic Search
# Superlinked.com
SPEAKERS
Daniel Svonava
CEO & Co-founder @ Superlinked

Daniel is a Co-founder and CEO of Superlinked - an open-source vector compute framework for building RAG, RecSys, Search & Analytics systems with complex semi-structured data. Superlinked works with Vector Database providers to make it easier to build vector-powered software.

Previously, Daniel was an ML Tech Lead in YouTube Ads infrastructure, building systems that enable the purchase of $10B / year of YouTube Ads.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

In today's information-rich world, the ability to retrieve relevant information effectively is essential. This lecture explores the transformative power of vector embeddings, revolutionizing information retrieval by capturing semantic meaning and context. We'll delve into:

  • The fundamental concepts of vector embeddings and their role in semantic search
  • Techniques for creating meaningful vector representations of text and data
  • Algorithmic approaches for efficient vector similarity search and retrieval
  • Practical strategies for applying vector embeddings in information retrieval systems
TRANSCRIPT

Daniel Svonava [00:00:00]: So I'm Daniel Svonava. I'm one of the co-founders and the CEO of Superlinked. And black, no sugar, nothing. Straight black.

Demetrios [00:00:13]: Hey, yo. We are back with another MLOps Community podcast. What a doozy this one was, talking to Daniel. I gotta say, before we go anywhere, thank you for being so patient with me. Thank you so much. Because he explained it to me like I was five. I felt like I was inside that Reddit subreddit community. I had to ask him and re-ask him because I really wanted to make sure that I understood what exactly he is talking about when he mentions vector compute and how it can enhance these different systems that are using vectors and embeddings.

Demetrios [00:00:54]: And so he broke it down for us, and it was a little bit reminiscent of DSPy when I talked to Omar, the creator of DSPy, which, by the way, has anybody else noticed that it feels like it's absolutely blowing up? If you are using DSPy in production, hit me up because I would love to hear about your story and what you're doing. I don't know anybody using it in production, but I feel like it's soon, it's coming, and with all the attention that it's getting, it should be happening. But who knows? Especially because the creators and maintainers are university students. So I'm not sure if anyone feels comfortable using it in production yet. But if you are, tell me. Okay, side note, little tangent over. Let's get back to Daniel and the vector compute. This is something that I felt like I needed a metaphor before I understood it.

Demetrios [00:01:46]: And he broke down the metaphor very nicely. He said, you know what, think about when you are shooting photos with a DSLR and you get a RAW file. That file is not the most beautiful image, but what it has is all of the information contained in that image. So what you're able to do when it comes to tweaking that image is bring out the pieces that you think are best in that image. And the metaphor there is that when it comes to embeddings, throwing them into your vector database, with the vector compute, he wants all of the embeddings to be flattened, and he wants the vector compute to be that piece that allows you to bring out what is most important when it comes to your use case and what you are optimizing for and what you want to show people. So in the case of a recommender system, where you want to show people some For You page type stuff, you're going to preference certain features and certain parts, certain embeddings more than others, and when it comes to a RAG system, you're going to preference different embeddings. Now if that ain't a good metaphor, I don't know what is. I'll let him explain it more.

Demetrios [00:03:19]: Hopefully you all aren't throwing things at your screen or phone, whatever you're listening to this on, because it took me so long to understand what the hell was going on, and hopefully after this you too are able to understand what is going on. If you end up using Superlinked, let me know. I would love to hear what your experience is with it. If you have strong opinions about why you should or shouldn't use it, I would also love to know that. I know there are opinionated people in the community, and so that's cool too. I want to preface this episode with: it is not sponsored. I just love Daniel and what he's doing and wish him the best. Go check out superlinked.com if you get the chance.

Demetrios [00:04:04]: And also he is creating VectorHub, so that's a way for you to compare all of the different vector databases and what their features are. So that's cool because it's basically like a non-biased third party able to look at all the different vector databases. Not just the specific vector databases that are only vector databases, but everything. I mean everything from the vector database that is something like a Cassandra that bolts on a vector database, to a Qdrant, which is only for vector databases or vectors. I said vector database way too many times in that last sentence. So let's get into the conversation with Daniel. Check it out. Superlinked VectorHub, I think is what he said, dot com, and we'll leave all the links in the description for you anyway.

Demetrios [00:04:58]: And if you enjoy this, if you think that it is worth talking about it more, hit me up or share it with a friend. Let's get into the conversation, and I hope you enjoy my new song, "Prompt Templates", which you can stream anywhere, especially on Spotify right now. You went from being a recommender system, basically. What was the v1 idea? And then how did you realize that you needed to pivot?

Daniel Svonava [00:05:37]: Yeah, it was a journey. I would call it kind of incremental steps to success. Not like a full pivot, but many, many years ago we wanted to build a social graph across the Internet, right? Kind of a social login, basically. This was when COVID was just kicking off and everybody was online wanting to connect, scattered across communities. So we built various apps for professional communities with an eye towards a shared login and kind of a shared user preference model under the hood, right? I had been in YouTube Ads for six years, so user modeling is kind of the particular hammer I have to hit every nail. So that was the plan. And then we realized that, okay, all kinds of other products want to build personalized experiences, right? Understand what the user wants, give it to them quick.

Daniel Svonava [00:06:43]: Right around that time, TikTok also came out with the Monolith paper, where they kind of cemented some of their ideas on real-time personalization. And our system was structured quite similarly. So that was like the initial validation. And we kind of opened up the platform and started to onboard people beyond social, like a jobs marketplace and an ecommerce company and so on. And then we raised some money. And the second time we tried to do that, everybody was like, oh, recommender system as a service, we see that five times a year. What's so special about this one? And we had some ideas for how not to be stuck in like a software consultancy mode, building ranking models, new ones for each client, because that's typically how those companies go, right? You kind of pitch people on a KPI list and then you go building these re-ranking models based on the behavior of those specific users in that specific context.

Daniel Svonava [00:07:56]: We didn't want to do that. We wanted the product. And so we basically said, okay, in this whole recommender system stack, you typically have retrieval and ranking, right? And our goal was to basically reduce the need for ranking as much as possible, and do it by making the retrieval smarter. Right? And we did that by carefully constructing these complicated vector embeddings of both the users and the recommended items, and then having our own query language into the vector search, the vector retrieval, that would allow you to express all these different objectives that you have when you do recommender systems, and even add the behavioral feedback loops on top, without building this kind of monstrous re-ranking model.

Demetrios [00:08:49]: I'm just wondering, basically the RecSys-as-a-service feels like, when you say it, oh yeah, everybody needs recommender systems and it should be something that is important to people. But none of it took off because of this consultancy angle, or this basically white-glove angle that you have to take. And you realized, okay, if we're going to have to make custom models for everyone, or if we're going to try not to make custom models for everyone, then what can we do to make this easier for us? And presumably that's where you came up with the idea, oh wait, there's other pieces to the puzzle that are harder problems that maybe we can go after. It's like you're going a little bit lower down the stack, right?

Daniel Svonava [00:09:40]: Yeah, exactly. By kind of asking ourselves the question of, can we make the underlying retrieval smarter? Right? We spent a lot of work on the vector embedding calculation for these complex data objects, and then we realized that just that alone is an interesting problem to solve that repeats across many different use cases. And that's where we basically, a couple of years ago, started to build the vector computer, right? Kind of an as-productized-as-possible way to turn complex data into vector embeddings.

Demetrios [00:10:23]: And you told me about this before, but basically it's like in software, especially recently, we have vector, well, we have compute and storage. And one thing that has been very popular, and it's been almost like a boon to software developers, is decoupling compute and storage. And right now with vectors, we have vector storage. But we didn't really have anybody that was thinking about vector compute. And judging by how hot the vector space is, it seems like it's ripe for compute now. I guess, why is it different than just regular compute, right? That's probably the first question that you got asked by every VC that you tried to raise from on this second time around, right?

Daniel Svonava [00:11:24]: So as you're saying, there is the storage part, and I would also add to that the problem of organization, right? So building the vector index, doing all the things the database has to do: sharding, reliability, access controls, right? So there is a bunch of people working on that. And then how is the compute different? At the end of the day, you are burning the same electricity running on the same hardware. So at the end of the day the underlying operations are the same, and the name of the game is always about how do you control this thing, right? Are you building models from scratch in PyTorch? Are you running data pipelines in Spark? Are you using these kind of completely generalized computing models to solve the problem? Or is it worth thinking about an abstraction that's specific to turning data into vectors, right? And if you basically say, okay, how would that look, right, what would be the properties that you want out of such an abstraction? Then that's when you start to think about kind of the data engineering problems. How do you keep things synchronized? How do you do backfills? You think about the machine learning problems: for a complicated data object with a million different properties of different types, how do I somehow boil this together and make the vector? Right, what's the highest-level abstraction that allows me to do that but still gives me control, right? So it's not like, oh, I just turn everything into a string and then send it to a language model, because that's literally the opposite of control. Right? So how do we do this? How do we put that.

Demetrios [00:13:19]: Tell me more about this turning it into a string, because I feel like you're talking directly to me. That seems like something that a lot of people do. And then what? The biggest problem there is that we can't weight this string in any way, shape or form, right?

Daniel Svonava [00:13:39]: Basically, let's take an example. You have, let's say, a movie database. And you know all kinds of things about your movies, right? You know the movie names and the categories and the descriptions. That's the easy stuff. Then you know things like viewership, popularity, you know specific click patterns of your users if they interact with these movies, right? These are all the different signals that you have about the movie. Maybe the launch date. When was it released? Right. Now, when people do RAG and they do recommender systems and kind of retrieval, unless they train their own two-tower RecSys model, which often still doesn't eat all the data because some of it is just difficult to eat.

Daniel Svonava [00:14:32]: Basically the kind of language-model-first approach here is to stringify the whole movie. So you literally concatenate all the different properties, and then you take an embedding, and the embedding is like God knows what, right? A thousand floating point numbers, right? And then some movies kind of get clustered, and some movies are near each other that really shouldn't be. Like, you look at it, you just eyeball it, right? You didn't get to quantitative tests. You just eyeball this thing. And there is something weird. Every once in a while, something gets slotted somewhere where it doesn't make sense. And then you notice that, oh, it's picking up something in the description that sort of makes it think that it's related to this other thing, but it really shouldn't be. And that's where prompt engineering comes in, which is both my most and least favorite part of this, because people literally start to say, like, movie name, colon, movie name, comma, and then here is a description, but it's not so important.

Daniel Svonava [00:15:40]: But FYI, colon, and then the description, right? And then they just kind of tweak just how heavy-handed they should be about processing the description. And that's just not a way to build an actual system. Right?
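
To make the failure mode Daniel describes concrete, here is a minimal sketch of the stringify-everything approach; the `embed` function is a hypothetical stand-in for whatever text-embedding model you would actually call.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for any text-embedding model call
    (OpenAI, sentence-transformers, ...): one dense vector per string."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

movie = {
    "name": "Example Movie",
    "description": "A long free-text description the encoder may latch onto...",
    "release_date": "1999-03-31",
    "popularity": 0.92,
    "viewership": 1_200_000,
}

# The "language-model-first" approach: flatten every property into one string
# and embed the whole thing. The relative importance of the signals is lost,
# and the only lever left is prompt wording ("but it's not so important, FYI:").
flattened = ", ".join(f"{key}: {value}" for key, value in movie.items())
movie_vector = embed(flattened)
print(movie_vector.shape)  # e.g. (384,) of "God knows what"
```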

Demetrios [00:15:59]: Yeah.

Daniel Svonava [00:16:00]: Right. Is there any other way to kind of put the constraint over it? Like, how do you want to create this pressure for certain things to be close together and certain things to be far apart in a way that makes sense for your specific problem, right? So maybe you are all about freshness, right? Recent news, right. And so then you would want to really boost the aspect of how long the news has been out, right. And popularity, and relevance to your query, whatever it is. Right. So we spend a lot of time thinking about ways to express these different, often competing objectives, and then have the underlying system kind of navigate the trade-off, which, okay, among us engineers, means basically you're putting all these things into the embedding, right? Then you are constructing the search vector to tap into all those different properties of your movie. And it helps if the thing you index is kind of unit-normalized, not biased towards any of those signals.

Daniel Svonava [00:17:07]: And then when you search, you can create the biases or kind of different weights to different aspects of the vector space, basically.

Demetrios [00:17:17]: So I like this. I think the idea here of you saying this is how you can get more control when it comes to dealing with vectors, and especially on things like recommender systems or RAG, where you have all of this data, but you don't necessarily know which data is important, or you do know which data is important, but you don't have a way to explain to the model how, oh, this data is more important, except for through prompt engineering, and you never really know if that is working or not. And so now you're saying, if I'm understanding this correctly, with the vector compute, you're able to take all of the different things that the vector or the item is made out of. So it is made of a photo and some description and some features about how long people watch the video and the popularity score, whatever. There's metadata, there's the actual title, and there's the images. So you have all these different pieces of the puzzle, and what are you doing with all that information?

Daniel Svonava [00:18:38]: Exactly. How are you tapping into expressing your desires in the language of those different properties? Right. And how are you doing it in a way that everything has a type, so there are no random Python lambdas flying around, and you have a very clear path from whatever you cook up in your notebook to actually launching stuff? Because, yeah, I think that's going to be sort of important this year. Everybody talks about getting things to production, and not everybody is very clear on what that means. Does the responsibility of the vendor kind of stop at you being able to build a Docker container? Or is it sort of reasonable to expect that you will get help setting up all the resources in your cloud, for example, which I believe it is. If I give you a Docker container, you still have to figure out all the serving issues, right? Backfills, how do you share the workflow? How are you connecting this to the destinations? Because as we kind of established, there is the compute and the indexing or management of the vectors. Right. And so that means we work with the vector databases to make the overall offering better.

Daniel Svonava [00:20:10]: Right. And so now it also has to reflect all the sort of production implications of what the vector database wants out of the workflow. Right. Like, are you dropping a million vectors in it all in advance? And is the re-indexing parallelized? All of those considerations are part of the problem. And then you have the batch and you have the real-time kind of use cases, right?

Demetrios [00:20:39]: Yeah.

Daniel Svonava [00:20:41]: There is a lot around it all, kind of starting in a notebook and then getting it to production. People kind of gloss over it, but it's a lot of work.

Demetrios [00:20:53]: Right, yeah, we know that. Definitely know that. So what does it actually look like in practice when you're setting up a system? How do you have the vector compute deal or communicate with a vector database and all the other pieces, like the embedding model and then the LLM and everything else? If you're doing some kind of RAG, or if it's a RecSys, I can imagine it's a little bit different of a system design. So maybe you can break down those two. And how do you see the vector compute, or the vector engine, what are you calling it? I don't even know. Is it a vector engine? Is it a vector compute engine? Is there a word yet?

Daniel Svonava [00:21:33]: So we call it the vector computer, and we kind of take it to the computer place just to illustrate that there will be electricity spent on a CPU/GPU type situation. In terms of, how does this look when you just want to solve the problem? Just build a RAG system that actually returns reasonable results, right? This is what everybody wants. And the end goal here should be something like: you go to your cloud provider, anyone you choose, and these people have marketplaces, right? And that's where you already probably provisioned your vector database, unless you have one of the managed solutions, which is also fine, right? But for many people, they go to their cloud, they provision their whatever XYZ database and then it's just running there. You kind of pay through the same billing account, and in the same way you should be getting your vector compute sort of system, right? You should be pointing it on one hand at the vector database for all the vector keeping, vector indexing needs, right. And you should be pointing it at where the data comes from. Because as we said, let's say you are a marketplace with houses, you know all kinds of things about those houses. And this data lives in different systems, right? So there are data warehouses, there are databases, there is kind of the blob storage where you might have other types of features that you compute with some other system.

Daniel Svonava [00:23:28]: Typically, I like to separate this compute problem from ETL. So I like to think about it as: ETL is getting the data from outside of your system into your system, right? So you pull it from your Mailchimps and from your CRM and all of those million different places. I think the modern data stack kind of movement over the last couple of years pushed forward the idea that, okay, you should somehow pull from all these different tools, centralize it, and somehow organize it in your core infrastructure. And that could still mean a few databases, but something sane, and then compute problems, any kind of machine learning workload, should kind of stack on those core pieces of the infra and take it from there, right? So that's why I usually say ETL is kind of step one, and then any kind of pipelines, model training, all of that work is kind of on top of it. And then you mentioned one also important thing, which is kind of the model inference. So normally you have this kind of train-infer loop, right? And with the language models now, maybe that should be renamed to fine-tune-infer, right? But there is usually some such loop. And yeah, when you are building your own vector computer, you have to figure out, okay, where are we running the models, and how are we making those, let's say, inference services accessible from our pipelines. This is where the rubber hits the road: basically, we have some Kafka streams coming in, we have some Spark pipelines shoveling data around.

Daniel Svonava [00:25:24]: And somehow from those pipelines and all of those workers in those pipelines, we have to either run the models on the same node, or we have to create an inference service that we call from those pipelines. And then, like, caching and not embedding things multiple times, or actually re-embedding things when the model version changes. This is kind of the intersection of data engineering and, I guess, machine learning engineering. Right. So those are some of the things you have to think about when you build a vector computer.
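
As a rough illustration of the caching and re-embedding concern, here is a sketch that keys cached vectors by model version plus a content hash; the SQLite store and the `embed_fn` callback are illustrative choices, not Superlinked's implementation.

```python
import hashlib
import json
import sqlite3
import numpy as np

def cache_key(model_version: str, payload: dict) -> str:
    """Key vectors by model version plus a content hash: unchanged rows are never
    embedded twice, and bumping the model version forces a re-embed."""
    blob = json.dumps(payload, sort_keys=True).encode()
    return f"{model_version}:{hashlib.sha256(blob).hexdigest()}"

class EmbeddingCache:
    def __init__(self, path: str = "embeddings.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS cache (k TEXT PRIMARY KEY, v BLOB)")

    def get_or_compute(self, model_version, payload, embed_fn):
        key = cache_key(model_version, payload)
        row = self.db.execute("SELECT v FROM cache WHERE k = ?", (key,)).fetchone()
        if row is not None:
            return np.frombuffer(row[0], dtype=np.float32)
        vec = np.asarray(embed_fn(payload), dtype=np.float32)
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, vec.tobytes()))
        self.db.commit()
        return vec
```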

Demetrios [00:25:59]: Yeah, I got to tell you, man, I'm not sure how I feel about the vector computer name, but you said so many incredible things in the last five minutes, I'm going to let it slide. All right, but the vector computer, we're going to have to think about a different one, because I'm not sure it has that spiciness, that touch that I'm looking for, but whatever. Who am I to tell you how to do your thing?

Daniel Svonava [00:26:29]: You're the marketing guy. You tell me. I'm just an engineer trying to get by here.

Demetrios [00:26:37]: There's so much incredible stuff that it's doing. I wish the name could encompass that more than computer. Vector computer, but we'll figure that out. Maybe by the time this airs, we'll have a different name for it. And I'll put it in the beginning as like an addendum, however you pronounce that word. Okay. There's a lot to unpack there, though, when it comes to everything that you just said. And I'm going to try and just go through it in my head, where it's like, all right, yeah, you have the ETL that comes through where you're grabbing all this data from outside, you're throwing it into whatever, your S3 buckets, your data warehouses, all that fun stuff.

Demetrios [00:27:13]: There's nothing new there when it comes to ETL. I think everybody's fairly familiar with that. Then you have the data in your ecosystem, and from there you're going to be taking that data and throwing it at the vector computer so that the vector computer can create different. And you're not creating embeddings from this. The embedding piece is a whole different area. You're taking the data and you're creating embeddings in a different route. Can you break down how this works with embedding models and all that fun stuff?

Daniel Svonava [00:27:52]: Yeah. So you are taking everything you know about the entity that you want to process, right? If it's your user, then it's all their behavioral data, all their metadata, everything you know, right? And you are somehow either training a custom embedding model that can eat all of those different types of data that you have about your user and produce the perfect vector that encodes everything you know about them, right? So you either do a custom model, and then you bring out your PyTorch, you create all the layers that can process your features, you feature-engineer, because some of these inputs will be like variable-length click sequences and whatever, right? So you either build a custom model and then you give it some sort of objective for a downstream task and then kind of backpropagate the error, the typical kind of machine learning work that we all do. Or, let's say on the opposite extreme, you have your language model serving in some inference service in your own cloud, or you use one of the providers that can help you serve an open source model, let's say. And then, unfortunately, people stringify everything they know about the user, send it to the language model, and make the vector that way. Right?

Demetrios [00:29:18]: Those prompt templates. Yeah.

Daniel Svonava [00:29:21]: I think in an ideal scenario, you have figured out the way to reconcile those two worlds, right? Because there will be some bits of text that you have on your user, right? And so for those, using some sort of language encoder is a good idea, right, because this has been pretrained on a lot of data. It's good, right. So for those isolated bits and pieces of text, if you have some way to put some framework around it that doesn't require you to first flatten all the string bits together, but somehow keeps it separate, it's a good idea to use a language model, right? But for some of the other stuff, you might need an image embedding, you might need a graph embedding, you might need a time series embedding. Maybe you just want to encode the timestamp, but then do it in a way that when you do cosine similarity afterwards, it's like a time-delta calculation, which is like another kind of math puzzle. Long story short, you are collecting from your internal data infrastructure everything you know about an entity. And you want to have some process that runs through all of the instances of that entity and turns them into vectors that encode and compress all your organizational knowledge. Because then when you do retrieval on top of that thing, or when you do any kind of modeling on top of it, the input will be as rich as possible, right? You have the highest chance to actually make this thing work. So that's sort of the goal.
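
One way to make the "cosine similarity as a time-delta calculation" remark concrete is a hedged sketch like the one below: map a timestamp onto the unit circle over a chosen recency horizon, so the dot product between two encodings falls off smoothly with how far apart the moments are. This is only an illustration; timestamps more than a full period apart wrap around, so a real system would cap the horizon or combine several of them.

```python
import numpy as np

def encode_timestamp(ts_seconds: float, period_seconds: float) -> np.ndarray:
    """Place a timestamp on the unit circle over a chosen horizon, so the dot
    product of two encodings is cos(delta): a smooth function of how far apart
    the two moments are within that horizon."""
    angle = 2 * np.pi * (ts_seconds % period_seconds) / period_seconds
    return np.array([np.cos(angle), np.sin(angle)])

week = 7 * 24 * 3600
now = encode_timestamp(0.0, week)
one_day_later = encode_timestamp(24 * 3600, week)
print(float(now @ one_day_later))  # cos(2*pi/7) ~= 0.62; closer timestamps score higher
```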

Demetrios [00:31:05]: I'm starting to wrap my head around it. And excuse me for taking a little bit longer than expected on this, because I was too caught up with the name and I wasn't focusing on the actual important part. But the idea here then, if I'm understanding it, is you're saying being able to fine-tune an embedding model is a superpower, and it's very important when it comes to RAG. Let's just take RAG as an example, right? And so you want to make sure that that fine-tuned model is as accurate as possible. The way that you can make it more accurate is by creating the right pipelines with the data coming in. So it gets ETL'd, then it's in your data warehouse or S3 bucket, whatever. Then you're going to create a pipeline, it's going to go through the vector compute, and it's going to be used, because the vector compute is able to weigh what's important, what's not, and then you're able to fine-tune an embedding model off the back of that in the pipeline, kind of looking at it sequentially.

Daniel Svonava [00:32:14]: We are getting there, we're getting there.

Demetrios [00:32:18]: Keep it going. Then tell me where I'm wrong. Tell me what.

Daniel Svonava [00:32:22]: So it was all correct except one aspect, which is: when you make this vector for your user, think about it as a problem of just collecting all the available information but not yet putting in the biases. So you want a kind of raw vector, you want the vector that encodes everything you know about the user, but normalizes the components, so that at the inference time, or when you actually do the querying, for example here, inference time is funny, because is it the inference of the embedding model in the ingestion pipeline, or is it the inference when you actually query and you want to have your RAG system answer a question? Right, when we talk about that sort of inference, right? So we are training a model on top of the vector. We are doing retrieval on top of the vector. That's the time when you want to start playing with those biases, right? Because then you can A/B test, you can be kind of fast and nimble, you can converge to biases that hit the sweet spot for your different, for example, screens where you display the recommendations, all of that stuff, right? So you want that to happen as late as possible, definitely. And then in terms of the vector formation, how do I make a thousand float numbers that encode everything I know about my user or my document chunk or house or whatever? You want to index these vectors into your vector database, but you want to make them not opinionated at all, right? So somehow you want to normalize the strength to which all these different parts contribute, right? Do you care about freshness or do you care about relevance? Well, 50-50, that's the index, right? And then when it comes time to, okay, we have the New page, okay, so then we want to boost freshness, right? We have the For You page, okay, a little bit more relevance, right. This is a concern that happens at the final moment of just pulling the data together and putting it in front of the user, or clustering your machines in manufacturing, like, okay, in this moment, in this context, which machines will break, right? And that's the time to put in the biases. All the way up to then, you just want to take everything about your machine and boil it down to the vector, to just encode all your observations in kind of an impartial way, let's say. Yeah.

Demetrios [00:34:58]: Okay. And so then the vector computer comes in in that last moment to make sure that it weighs out everything properly.

Daniel Svonava [00:35:07]: Both in the ingestion and in the last moment. Basically, our component works with the vector database, both for the ingestion and for the retrieval. Then there's using the vectors as feature vectors. So you are training another model on top of the output of the vector computer. That model can learn its own set of biases. Right. So in that workflow you don't necessarily need the vector computer anymore. But what will happen is that the model that uses the vector as an input, instead of click sequences for users, will be much simpler.

Daniel Svonava [00:35:55]: Right. So now you have this model that you are training on labels and vectors, and, like, a floating point vector, for a language model, that's exactly what it wants to eat. Right.
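
A toy illustration of that last point, with synthetic stand-ins for the vectors and labels: once the organizational knowledge is already compressed into the vector, the downstream model trained on top of it can be as simple as a logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: one "impartial" vector per user (the output of the
# ingestion step) plus labels for some downstream task, e.g. churn prediction.
rng = np.random.default_rng(0)
user_vectors = rng.normal(size=(1000, 128))
labels = (user_vectors[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# The downstream model stays simple: the heavy lifting already happened when
# click sequences and metadata were boiled down into the feature vector.
clf = LogisticRegression(max_iter=1000).fit(user_vectors, labels)
print(f"train accuracy: {clf.score(user_vectors, labels):.2f}")
```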

Demetrios [00:36:09]: Okay, so then if I'm understanding this again, I'm just going to keep throwing it back at you to make sure I understand because I really want to wrap my head around it and I think I'm there, but I'm not sure if I'm fully there. So feel free to correct me again. The reason you're flattening it out is because you want the vector computer to do all the heavy lifting. You want all of this to be as flat as possible so that later you can add in some kind of logic in the vector computer and let that give you the control that you were talking about before.

Daniel Svonava [00:36:42]: Yes.

Demetrios [00:36:43]: All right.

Daniel Svonava [00:36:44]: It's maybe like you take a RAW photo, right? That's like a bit kind of grayish, right? It's like underexposed or whatever they say. It kind of looks boring, but it preserves the most information. Right. It preserves the most information about the scene. And then it depends. Okay, do I want this for a wedding? Do I want this for social? Right. You push the curves this way, that way. And that's kind of the ingestion and.

Daniel Svonava [00:37:12]: Okay, I'll use this meme for my next link.

Demetrios [00:37:17]: There we go. That is perfect, because now I understand. It's like, I want to capture as much information as possible and then I want to be able to do what I will with it. However I like to show that information to people, I should be able to turn the knobs in the way that I want to turn the knobs, and then it gets output as something that I'm happy with.

Daniel Svonava [00:37:41]: Yeah, that's exactly right.

Demetrios [00:37:44]: Okay. All right. So now it's starting to come together for me, where I'm seeing, yes, this is giving you more control because you're making the decisions with the vector compute, as opposed to what you would traditionally see in the RAG pipelines or even RecSys, where it's just going straight from the vector database. And maybe you've got your embedding model that throws the embeddings into the vector database, and then the retrieval, and it goes to the LLM. And maybe what goes to the LLM is right, maybe it's not. And you've also got your prompt templates. And so now what you're saying is before the embeddings go to the LLM, they're hitting the vector compute.

Demetrios [00:38:33]: And then the vector compute is turning the knobs to say this is more important, this is less important and we feed it to the LLM in the way that we want. It's like our secret recipe now.

Daniel Svonava [00:38:44]: Yeah, exactly. And very importantly, in that sort of progression that you mentioned, when at the query time you are, for example, learning or A/B testing what's actually important for my use case, you don't just use that to reorder the candidates that you already retrieved, right? This is one of the problems that exists in the space: people do that initial retrieval kind of dumb, kind of heuristic, right? Throw in some hybrid search, do just a simple embedding of the query and just kind of see what gives on the text embeddings that come back. And then they try to be clever about, okay, let's reorder this thing to match whatever objective we have. But now you are reordering 0.1% of your database, right? You have retrieved a small fraction of all the stuff you have, and now you are trying to be smart on top of that. The obvious problem with that is that there is all this other stuff in your index that maybe you didn't even surface, right? And so at the query time, when you are deciding, okay, freshness, is this important, popularity, is this important, relevance, is this important, it's important to express that in a way that it informs the underlying retrieval. Not just: I will retrieve 1000 things, and then I'll have my sort of linear-combination little model that figures out the preferences, and then I'll reorder, because that misses the point. So this retrieval-time vector computer objective is to express these different goals and then formulate the search vector such that when you go and do the nearest neighbor search in the vector database, you are already in that step boosting certain aspects and so on, and finding the globally optimal set of items that have the right trade-off between how far is the house, how big is the house, is the price history similar to houses that they clicked on before, right.

Daniel Svonava [00:40:56]: You need to be able to express how old the ad is and send it to the index, right? And then the index should be globally finding you the ten houses that already fulfill that combination of objectives, and then you just return that to the user. Right. So at the start of this conversation, we talked about how our objective for a couple of years was to eliminate the need for the re-ranking by making the underlying search smarter. So, well, this is how that works, right? You kind of express the objective, we construct the search vector in a way that that's where the boosting happens, so that when you send it to the nearest neighbor index in the vector database, it's already doing that kind of opinionated combination of factors, right. And the underlying search is cosine similarity.

Daniel Svonava [00:41:52]: I don't think you need sort of special scoring functions in the index itself, because that's slow, right? You would have to be building an index around user-defined functions, and that rarely goes well. And so you need some way to do this with vanilla nearest-neighbor search, with cosine or dot product or something. And yeah, it's possible. We have been discussing a hypothetical system so far, but the company is building it. So Superlinked is building this vector computer. Actually, by the time this goes live, we might have done the initial kind of public release already. There we go.
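
A rough sketch of that query-time idea, assuming numpy and a brute-force stand-in for the vector database's nearest-neighbour index: the objective weights are folded into the search vector before the search, so the whole index competes under the weighted objective instead of re-ranking a small retrieved slice.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_rows(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Toy flat index: [text segment | freshness (2d) | popularity (2d)],
# each segment unit-normalized at ingestion so the stored vectors stay impartial.
n, text_dim = 10_000, 32
index = np.hstack([l2_rows(rng.normal(size=(n, text_dim))),
                   l2_rows(rng.normal(size=(n, 2))),
                   l2_rows(rng.normal(size=(n, 2)))])

def search(query_text_vec: np.ndarray, w_text: float, w_fresh: float, w_pop: float, k: int = 10):
    """Fold the objective weights into the search vector *before* the
    nearest-neighbour step; in production the same weighted vector would be
    sent to the vector database's ANN index instead of this brute-force dot product."""
    fresh_target = l2_rows(np.ones((1, 2)))[0]   # toy "as fresh as possible" direction
    pop_target = l2_rows(np.ones((1, 2)))[0]
    query = np.concatenate([w_text * l2_rows(query_text_vec[None])[0],
                            w_fresh * fresh_target,
                            w_pop * pop_target])
    return np.argsort(-(index @ query))[:k]

print(search(rng.normal(size=text_dim), w_text=1.0, w_fresh=0.5, w_pop=0.2))
```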

Demetrios [00:42:45]: Just to think that, everybody go to what? Superlinked AI, or... Why do you give me that look? Everybody just listening, he just gave me the biggest mean mug ever for saying AI.

Daniel Svonava [00:43:03]: We have AI as well, I think.

Demetrios [00:43:06]: Oh, even better. So you can go to superlinked.ai and get redirected to superlinked.com. Okay, cool.

Daniel Svonava [00:43:13]: Okay. Yeah, so I'll join you after the.

Demetrios [00:43:15]: Recording, but we'll cut it out if you don't.

Daniel Svonava [00:43:21]: People should go to superlinked.com and learn about our ways of building vector computers. But yeah, I think this will be like a whole other field. Right. So hopefully we kind of covered different ideas for how you can build a vector computer of your own. And I think for some people that definitely might make sense. We have been thinking about this for a while and we are happy to chat and kind of share what we learned. And maybe, just as we come up on the wrap-up, maybe it's where...

Demetrios [00:44:05]: Don't go anywhere yet. I still got so many questions for you. Do you have a hard stop?

Daniel Svonava [00:44:08]: I do. I can go a little bit over time. Okay, I can give ten minutes. Oh, man.

Demetrios [00:44:15]: All right then. Speak fast.

Daniel Svonava [00:44:16]: Man.

Demetrios [00:44:17]: You talked about speed.

Daniel Svonava [00:44:19]: Yeah.

Demetrios [00:44:20]: What have you seen when it comes to speed? Like, does this speed things up? Does this slow things down? Because it's one more step. Is it negligible?

Daniel Svonava [00:44:31]: You have to do this step anyway. You have to turn your data into vectors. And typically the language model, which is kind of taking the brunt of these workloads right now, is the biggest possible model that you can be using to turn data into vectors. For different types of data, there are specialized models that do it much more efficiently. If you train, let's say, a graph embedding for the graph nodes to encode the structure, it's a much smaller model than a language model. So actually only using a language model where you absolutely have to, and then using specialized models, is kind of one of the things of how you make all these systems more efficient. So definitely there is a part of that being reasonable around the infrastructure, how you organize the pipelines.

Daniel Svonava [00:45:26]: Is the model colocated with the pipeline workers, or is it a separate service? You need people thinking about that type of stuff when you build this yourself; that's where the gains come from. Right? So, yeah, I think if you have a scale of kind of single millions or maybe low tens of millions of data points, your main problem will be: how do we experiment? How do we try different ingestion policies and query policies in some environment where we can eyeball or draw things in front of a user and quickly iterate? Right, that's step one. And then step two, how do we run the backfill? How do we get the service to be reliably running in our cloud, and how do we manage the lifecycle of the underlying vector database? These are kind of basic concerns, but there is a bunch of moving pieces, right. And you don't really care that much about, okay, for every model run, how many milliseconds per vector does it take? Right. It's more just kind of covering all the bases of that process and just getting good at launching things. And then when it goes to sort of 100 million plus data points, that's where you really start to care about, okay, are we serving this model as efficiently as we can? Right. And each cloud now has some managed language model serving offerings, or there are third parties.

Daniel Svonava [00:47:08]: And so then it starts to make sense. But just to give you some specific numbers: an A100, for reasonably small chunks of text with a reasonable embedding model, can make thousands of vectors per second. So this is doable, and it's not going to be with GPT-4 or with, like, Llama 2, but everybody knows the MTEB embedding leaderboard, and on that table you can find models that are one or two hundred megabytes, basically. And then, yeah, those are fast.
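
For a rough sense of that throughput claim, here is a minimal sketch assuming the sentence-transformers library and the roughly 90 MB all-MiniLM-L6-v2 model (one of the small models listed on the MTEB leaderboard); actual vectors per second will depend on hardware, batch size, and chunk length.

```python
import time
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~90 MB, 384-dim vectors
chunks = ["a reasonably small chunk of text to embed"] * 2048

start = time.time()
vectors = model.encode(chunks, batch_size=256, show_progress_bar=False)
elapsed = time.time() - start
print(f"{len(chunks) / elapsed:.0f} vectors/sec, dim={vectors.shape[1]}")
```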

Demetrios [00:47:48]: Sweet. So there is something else that I wanted to call out, which I think is worth everyone that is listening going and checking out: some of the great work that you're doing with VectorHub too, if they haven't seen it already. If anyone wants to compare vector databases, you've done a great job because you have no dog in the fight, basically. I really like that you're a vector compute company, so you don't have to worry about the vector databases. You're compatible with all vector databases, I imagine. And that gives you a nice position to be like, okay, here's what each vector database does and doesn't do. And here's a table, so that if anyone wants to see which vector databases do what, then they can go to vectorhub AI.com superlinked.

Daniel Svonava [00:48:38]: No, it's hub.

Demetrios [00:48:40]: Oh God, I got it so wrong. We'll leave it in the show.

Daniel Svonava [00:48:43]: No, it's hub.superlinked.com for the VectorHub, where we have basically contributor-provided deep dives on information retrieval with vectors. Right. So RecSys, semantic search, RAG, how to do this stuff in the browser, in production. We try to be quite practical and a little bit advanced, right? Like, not your totally basic thing, a little bit more advanced tricks and tips. And so that's the VectorHub, hub.superlinked.com. And then, yeah, we have basically partnered with the different vendors in the vector database space to try and map out the different offerings. Right. This is one of the problems that we hear when we chat with new folks coming into the space: they're trying to wrap their head around, okay, when do I use what, right? Which one is good if I have many small indices or few big indices? Or how does the pricing work with that? Or is it running bundled with my own app, or is there a managed offering, or which hybrid search features, right? So we have, for 36 or so different vendors, a list of something like 30 different features.

Daniel Svonava [00:50:10]: And then what we try to do with the vendors is not just say, okay, the database supports the feature or partially supports it, but also link to the documentation for that specific feature. Right. So when you are learning about the space and you care about certain things, it's very easy to go and see, okay, which database supports the functionality I need, and then go to the documentation to double-check: does this actually mean what I think it means? The vendors kind of send us... it's backed by GitHub, so it's a GitHub repo with JSON files for each vendor, and the vendors send us pull requests to keep the data up to date. So this has been working quite well, and it gave us a way to chat with all these people and build the relationship as well. So as you said, we don't have a horse in the race. It's just that we need databases that work well and that people are comfortable with. And for us it made total sense to put together the data set and just help everybody understand this a bit better.
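
The real schema lives in the Superlinked GitHub repo; the snippet below is only a hypothetical illustration of the shape described here, one entry per vendor with a support level and documentation link per feature.

```python
# Hypothetical shape of one vendor entry, mirroring the per-vendor JSON files
# described above (names and URLs are placeholders, not the real schema).
vendor_entry = {
    "vendor": "ExampleVectorDB",
    "features": {
        "hybrid_search":    {"support": "full",    "docs": "https://example.com/docs/hybrid"},
        "multiple_indices": {"support": "partial", "docs": "https://example.com/docs/indices"},
        "managed_offering": {"support": "none",    "docs": None},
    },
}
```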

Demetrios [00:51:21]: So lucky for you, you get to benefit from the hundreds of millions of dollars that have been invested into vector databases in the last year because you can partner with all of them. So just jump on those marketing trains and get after it.

Daniel Svonava [00:51:38]: Yeah, exactly. And in some way help them deliver on that kind of overarching promise. Right. Basically, everybody is excited about deep-learning-based retrieval and feature engineering work, right? This is top of mind for people, and a lot of the materials out there make it look super easy. Right. So you can just run something through OpenAI and the retrieval quality will happen. And then in practice, folks get into these projects and it takes quarters to get something really working, and we want to sort of remove that disappointment. Right.

Daniel Svonava [00:52:27]: So everybody's excited and we want to keep the happy train going and people launching systems. And I think, yeah, it's going to take both of those pieces of the puzzle, the compute and the vector indexing and management, and those pipes have to work together basically. So that's what we want to do.

Demetrios [00:52:48]: Well, I love what you're doing. I also want to call out that it's not just for RAG. Right. I think we made it pretty clear. But just to reiterate: anyone doing anything that deals with embeddings, which of course recommender systems are huge when it comes to that, and that is more of a traditional ML type of system as opposed to the LLM type system, this vector compute feels like it is also really interesting for that, and it can help your system perform better.

Daniel Svonava [00:53:24]: Yeah. And then another kind of big one that we didn't really discuss much is analytics. You want to find customers that behave similarly, you want to find your machines in a factory that will break at the same time. So doing kind of clustering on top of these vectors, or training other models, or just doing, hey, show me similar products that sell similarly, things like that. I think by the end of this year, if we do a good job on partnerships, people will be doing these kinds of queries in their business intelligence, and they will not necessarily know that they actually run kind of a Superlinked workload under the hood, but they will be tapping into this kind of dream of deep-learning-based retrieval. Right. That's the goal: make it accessible to as many people as possible where they already work today. Right.

Daniel Svonava [00:54:24]: So the hype is awesome, but I really love to work with folks who have the ordinary problems and they just want to get their job done and kind of meeting them there. Right?

Demetrios [00:54:37]: Yeah. Oh, man, that's so cool. I'm getting goosebumps just thinking about it. I love what you are working on. I appreciate you coming on here and explaining it to me like I'm five. I know I took a little bit longer to get it, especially because this is like the second or third time that I have talked to you about it, which I didn't want to mention while we were talking about it. I should know by now, but I was just, you know, going a little slower so that all the listeners out there could stay with me.

Daniel Svonava [00:55:06]: Yes, we brought everybody along.

Demetrios [00:55:08]: Exactly. I wish it was like that, man. A bit of a baseball moment for me, but I will say we got there. Yeah, exactly. Well, hopefully I do have a better understanding. I will encourage everyone that is listening to this: if you're interested in it, go to superlinked.com. Also, if you want to partner with Daniel and the Superlinked team, get after it and make sure that you hit them up now, because, dude, I think y'all are going to blow up.

Demetrios [00:55:37]: I will say that right now. I'm going to call it, and I really hope so.

Daniel Svonava [00:55:42]: All right. Thanks a lot for having me. And it has been a pleasure to dive into this esoteric topic of vector embeddings. And, yes, see you around in the community and at the upcoming event.

Demetrios [00:55:58]: There we go.

