
The Future of Information Retrieval: From Dense Vectors to Cognitive Search

Posted Feb 17, 2026
# AI Agents
# AI Engineer
# AI agents in production
# AI Agents use case
# System Design

Speakers

Rahul Raja
Staff Software Engineer @ LinkedIn

Rahul is a Staff Engineer at LinkedIn, where he focuses on search and deployment systems at scale. He is a graduate of Carnegie Mellon University and has a strong background in building reliable, high-performance infrastructure. He has led many initiatives to improve search relevance and streamline ML deployment workflows.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Information Retrieval is evolving from keyword matching to intelligent, vector-based understanding. In this talk, Rahul Raja explores how dense retrieval, vector databases, and hybrid search systems are redefining how modern AI retrieves, ranks, and reasons over information. He discusses how retrieval now powers large language models through Retrieval-Augmented Generation (RAG) and the new MLOps challenges that arise: embedding drift, continuous evaluation, and large-scale vector maintenance.

Looking ahead, the session envisions a future of Cognitive Search, where retrieval systems move beyond recall to genuine reasoning, contextual understanding, and multimodal awareness. Listeners will gain insight into how the next generation of retrieval will bridge semantics, scalability, and intelligence, powering everything from search and recommendations to generative AI.


TRANSCRIPT

Rahul Raja [00:00:00]: Yeah, for an image, we can convert it into a vector. But for a movie, what we can do is extract some of the frames, since a movie usually has a lot of frames, and convert those into vectors. Then, when we want to search, we can match those vectors against whatever we have in the database. That can speed up the search query.

Demetrios Brinkmann [00:00:34]: People found out real quick in RAG that the R was the most crucial part of RAG.

Rahul Raja [00:00:42]: Yeah, RAG stands for retrieval-augmented generation. So if RAG is not doing retrieval, then I'm not sure what RAG is actually doing.

Demetrios Brinkmann [00:00:53]: Yeah, you're just entering a wildcard into classical search when you throw the generation into it. Unless you've really got that retrieval and that search down, you can have a hard time. It's not like, oh, we're going to add an LLM to it and it's magically going to change.

Rahul Raja [00:01:16]: Yeah, LLMs are a part of it. They can enhance the results; that's what the A stands for, they augment what the retrieval layer has already given. But unless you have a retrieval layer that is functioning properly, the G, the generator, cannot help much. So retrieval is definitely the most important part of a RAG system.

Demetrios Brinkmann [00:01:44]: Well, dude, you've been knee-deep in search for a while now. Basically, you are the coveted search engineer, and I want to learn all about it. I've been loving talking to folks about the new ways that we're doing search, and search is so topical right now since context is so important for agents. And how do you get that context? Well, you've got to search for it, right? Maybe you can start by just giving me a bit of the evolution of search.

Rahul Raja [00:02:21]: Okay. In traditional information retrieval or search systems, what we used to do is type a query and the search system would match your query against documents that might contain the terms you're looking for. It would look up which documents contain the query terms, and based on that keyword matching, using a data structure, mostly an inverted index, it would return some documents: these are the top links or top documents you're looking for. That was traditional IR. From there it evolved to dense retrieval.

Rahul Raja [00:03:06]: By dense retrieval I mean that queries and documents are now converted into embeddings. With all the pre-trained language models that have come in, we generate embeddings from the query and also the document. So the search system no longer matches only the keywords; it also tries to match the semantics, the meaning behind them. For example, if you type a query, it will also try to match words that are synonyms or have similar meanings. So we evolved from keyword-based matching to embedding-based matching, and BERT, Transformers, all of this helped us create embeddings fast and then match those embeddings. And now we're seeing that from dense retrieval we are moving towards the next phase, which is cognitive search. In this phase, it's not just about matching meanings or understanding the similarity between words; it's also about understanding the context behind the search.
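
To make the dense retrieval step concrete, here is a minimal sketch, assuming the sentence-transformers library and the "all-MiniLM-L6-v2" checkpoint as an illustrative choice; the documents and query are made up.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Illustrative model; any embedding model that maps text to vectors works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to tune BM25 for product search",
    "Scaling vector databases to billions of embeddings",
    "Recipe for a classic margherita pizza",
]

# Embed documents once (offline); embed the query at serving time.
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode("semantic search at scale", normalize_embeddings=True)

# With normalized vectors, the dot product is cosine similarity.
scores = doc_embeddings @ query_embedding
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```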

Rahul Raja [00:04:21]: So the search system now tries to solve the problem, to understand the goal the user has, what they're eventually looking for, versus just matching their queries to some documents or giving them some answers. That is what cognitive search has become.

Demetrios Brinkmann [00:04:45]: Wait, I'm not sure if I fully understood cognitive search. I know semantic search got huge, right? We're generating embeddings, and you've also got folks saying, all right, it's actually super effective if we do keyword search and semantic search and get that hybrid thing going on. But cognitive search is trying to understand the intention behind the search and then figure it out from there.

Rahul Raja [00:05:14]: Yeah. For example, if you want to search for something, your ultimate goal is not only to get some documents; you want to solve some problem. So cognitive search tries to understand what your eventual goal is: first of all, what is the context behind this search, and after searching that query, where do you want to go next? It can also do a series of actions, for example, return an answer to one query and, based on that, take another action, so it helps the user eventually reach their goal. And now that we have LLMs, it can also add personalization to the search and to whatever information it is generating or retrieving for the user.

Demetrios Brinkmann [00:06:02]: And how is this different under the hood?

Rahul Raja [00:06:06]: Yeah, under the hood, as I mentioned, we already have LLMs in place. LLMs help, first of all, to reason over what has been returned by the search system. They don't just return, for example, here is the answer to your search query; they also try to reason, as a system, about what was returned and what the reasoning behind it is. So LLMs are mostly powering this paradigm shift towards cognitive search.

Demetrios Brinkmann [00:06:46]: The idea of the next best action, or why are you making this search, and can we leapfrog what you're trying to do here and just get you what you're looking for, is a fascinating one, especially because I've seen it pop up now and again when I'm talking to an LLM or a chatbot, and then it says, would you like me to do XYZ next? Is that kind of what you're saying?

Rahul Raja [00:07:17]: Yes. Also, previously we used to measure the effectiveness of a search system using, let's say, the relevance of the results: you have this many documents, and out of those we returned this many, so what's the accuracy of this search result? But now we have also started looking beyond accuracy; we want to look at the happiness of the user. Is the user even happy? Did they understand what the search returned, and does it actually satisfy what they're looking for? Another thing is, say your search system returned a result, then what action did the user take next? Did they click on some post? Did they take some action that reduces the follow-up search queries they issue? In that case, we can judge the system not just by metrics like recall and accuracy; we can see where the user goes next and use that to determine if the search system is actually doing what it's supposed to do.

Demetrios Brinkmann [00:08:25]: It'S supposed to do. What are some high value signals that you can evaluate if the search is.

Rahul Raja [00:08:33]: Doing what it's supposed to be doing? Yeah, so I think I mentioned one of them is, let's say a user sends some query. Now let's say if the system didn't satisfy the user's requirements or it didn't answer the query that it's supposed to, then user is going to type some more follow-up queries and then they will keep on asking that some follow-up questions for that. So one signal is if we see that the user hasn't like type too many follow-up queries, that can be one signal. Like, it's not the utmost signal. There are many other reasons why the user is not asking follow-up questions, but that can be one of the reasons why your search system is effective. Also, let's say you return some results and if you have, let's say, call to action, for example, out of those search results, you want user to, let's say, buy something, click on something, go visit some website or something. Based on that, you can still evaluate that, okay, based on just this search, searching for this query once, like user got what he wanted when they make a sale, for example, they make a purchase, sorry, they made a purchase.

Demetrios Brinkmann [00:09:41]: For example. So now when you talk about relevance in a way that is the new relevance.

Rahul Raja [00:09:49]: Yeah. For dense retrieval systems, where we're using embeddings, relevance was still measured in terms of accuracy, recall, and all of that. But if you want to design for cognitive search, then this becomes the new metric: we care about the user rather than just comparing what the query returned.

Demetrios Brinkmann [00:10:25]: Query return. And you know, like with dense retrieval and just like vectors and embeddings in general, you could throw in a library. You have very battle-hardened libraries that I think everybody knows, like Faiss, and that would get you pretty far. Like, is there libraries that you can use for the cognitive search paradigm?

Rahul Raja [00:10:52]: Yeah, that is not as evolved right now as what we have for dense retrieval systems. If you want to build, for example, agentic search capabilities, you have some libraries to build the agents, but we don't have libraries you can just use to compute these metrics. That's because the metrics are different for each system: for some retrieval systems it might mean your user taking action X, and for others it might mean something different. Since each system isn't using some standard metric or accuracy term, right now there are no battle-tested libraries we can use to evaluate this.

Demetrios Brinkmann [00:11:45]: Yeah, I wonder if there ever will be, like, if we're going to have a consensus on what these terms are or what we evaluate for in that regard, because it is so use case specific.

Rahul Raja [00:12:02]: Yeah, I agree. For some of these signals, for example counting how much the number of follow-up queries has gone down, we could have libraries to do that. We could have libraries to track general actions: the user clicked something in your search results, or they went to a landing page. But I agree that since this is going to be very different for each system, having such libraries is not trivial.

Demetrios Brinkmann [00:12:36]: Well, and it really. What we're talking about really goes into the area of evals, and it's almost like product-specific evals, kind of, sort of, you know, if somebody is asking a bunch of questions to get what it wants, that should be caught in the eval sequence, right?

Rahul Raja [00:13:04]: Okay. By eval sequence, uh, you mean what?

Demetrios Brinkmann [00:13:08]: Uh, I guess sequence isn't the right word, but it should be caught with, hopefully when you're reviewing your data and you're seeing, um, like this isn't working as well as we thought it was, or the chatbot is not answering the question, or it's not doing what the user is expecting it's doing. And so when you're evaluating the multi-turn conversations that these chatbots are having, you catch that ideally. But to do that, you got to spend time with the data. You got to actually evaluate it. And maybe the, maybe you can train a judge if you're doing LLM as a judge to pick up on that type of thing. And I guess that's what you're talking about with like the libraries of Hey, if they keep asking the same question or if they're repeating, if the chatbot is repeating itself, that's like a bad signal.

Rahul Raja [00:14:05]: Yeah, I think also if you want to, yeah, let's say evaluate, yeah, like catch it in terms during eval. Yeah, this can be one of the manual review steps, which we can take to evaluate how the retrieval system is doing. I think it can be also automated. I think the number of follow-up queries, I think that's a bit simple. Yeah. So we can do like both manual and automated eval to catch this early.

Demetrios Brinkmann [00:14:35]: How have you seen the best folks like doing search or giving agents a search tool to leverage? Is it that you— each search tool is its own thing, like each way to search, whether it's hybrid search or dense vectors or whatever keyword search, or is it abstracting away and it's just one search tool and then under the hood you figure out how that works? Like, have you seen good ways versus bad ways?

Rahul Raja [00:15:11]: Yeah, I think so. Production-based systems, like if you, they usually have this abstraction where they hide all these things. Like your system is internally, it can do like multiple of these things. So usually systems in prod, they don't tend to, let's say, stick to sparse retrieval or dense retrieval. Like, they tend to do all of these because in production we have to think about or worry about all of these things. For example, there's cost factor, there's latency factor, speed is important, and we have to care about accuracy. So now, for things like if you want to really speed up things, you want to have very good accuracy, then the traditional sparse retrieval search, that is a very good option. Also, you don't— because if you want to go into the dense retrieval, you'll have to factor into many things like GPUs cost will go up.

Rahul Raja [00:16:07]: So you can't always run your queries using just dense retrieval. So usually there is an abstraction and based on the query needs also, it can decide whether it has to like go to which route. Like most, most of the use cases, it can be satisfied reliably using the sparse retrieval, using the inverted index. But yeah, depending on like if we have resources to spend, if we want to like, for example, trade-off between accuracy and speed, then we can, it can do the dense retrieval also.

Demetrios Brinkmann [00:16:44]: Talking about the scale, since you're saying, hey, think about trade-offs, think about, are you cool with latency? Are you cool with cost? I know that you've run some gigantic scale systems and you've played around with embeddings and we're talking like, how, how big are, how many embeddings? Like you're, you've been on some. Really big side of things. And you mentioned to me before that if you're not careful, that can kind of bite you in the ass. So can you talk to us about what we need to watch out for when you're running at that scale?

Rahul Raja [00:17:27]: Yeah, I think so. I think when you have, for example, let's talk about the vector embedding. So I think big companies like in production systems, they have like billions of vectors embeddings that they save. And then at that scale, it like not only becomes a simple modeling problem, like how do you need to generate the embeddings, it also becomes like in a lot of ways distributed systems problem. So for example, how are you going to effectively store the embeddings across, let's say you have different partitions of a cluster, and also how are you going to search the embedding? So this is again, and you have, let's say, multiple partitions. You have to retrieve all those embeddings effectively in real time while serving any query. So how are you going to do that? And then third problem is, let's say you have the embedding store. Now you have new data coming in, so you want to update the embeddings across all the partitions that you have.

Rahul Raja [00:18:26]: So how do you update those embeddings in real time. That's also a big problem. And also, if you are storing your embeddings, you can't store all of them just on disk itself, because retrieving that in response to a query and then comparing them across with the query and then returning the result to the user is going to take a lot of time if you are going to do that from disk. So you have to decide what, like how much embeddings you want to store in real-time to make it a live index, which we call it live index because it's being served off memory. And what are you going to store in the base index, which is actually stored in the disk? And when are you going to do that conversion? Like, when are you going to keep refreshing your live indexes from the base index? So all this, I think these are not only— not only relates to modeling or embeddings, general distributed systems problem which we have to like take into account if we are dealing with embeddings at that scale.

Demetrios Brinkmann [00:19:29]: Do you have a formula on how you decide that or is it case-by-case based?

Rahul Raja [00:19:34]: Yeah, there's no, I think, formula for this. It's unique to every use case. It's unique to the size of the embedding that you are dealing with. You can just go away by like keeping everything on just one computer hosted, let's say remotely, and just serving everything from, from the disk storage. So it depends on what scale you are dealing with. It's, I think, doing like setting all this, these things up in this way, for example, like distributed and replicated, it's an over-engineer, like people should not do it unless they really like hit that scale. So they should just go from the base.

Demetrios Brinkmann [00:20:14]: I want to ask a little bit about all of these different search techniques that you grapple with. I imagine you're constantly thinking about the trade-offs on each one, like cognitive search. In my eyes, it sounds amazing, but it's probably not actually going to be the most effective X percent of the time. The same with hybrid search, maybe just plain old keyword search is your workhorse that's going to be cheap, fast, and easy for the majority of your queries. And then you need to scale up. Do you think about like when to use one versus the other and how you think through those?

Rahul Raja [00:21:02]: In terms of scale, if I'm saying, let's say if I'm just starting up, I have let's say some users which want to search for, like, I have some products in my, let's say, database, then I think I just start off with setting everything in just the simple sparse retrieval way. I'll just create an inverted index based on whatever data that I have, and then I'll store it somewhere, and I'll just run my queries with that. Like, I won't even care about using GPUs or like doing all these vector-based search because I know that my— because users, they like don't only care about that they should get the most accurate result, they also care about they should get the result like reliably at all the times, like your system is always responding with something. I know that for— I think for small scale, I think that's fine. You don't have to be like 100% accurate returning results all the time. And I think when you are scaling. So I think that's again based on metrics. You'll have to check that in what scenarios or how many times your search is not returning the results what it's supposed to.

Rahul Raja [00:22:11]: In that case, we can just work with all the basic metrics that we have for search, like recall, accuracy, and all those things. So I think that will give us a good idea of when to use the vector-based search, when to use GPUs to, let's say, start encoding your queries and also your documents into vectors and then start using semantic search, then yeah, most of the small use cases, I think, yeah, that's not necessary.

Demetrios Brinkmann [00:22:37]: Kind of in that same vein, I talked with Jeff Huber on this podcast a few months ago, who is the founder of Chroma, and he was mentioning how there was this paper that came out probably now 6 months ago, and the headline was, you know, BM25 is basically all you need. It is the best. It's going to work always. But he did some digging and he saw that in that paper, what they were running these benchmarks on all of the queries were uniquely suited for BM25, right? It was like an easy layup for BM25. And so of course it's going to succeed because it was built for that. And I can't remember the exact queries that were made specifically for that, but that was kind of his argument. I wonder if you see certain queries in that regard where you're like, all right, BM25 or not, like don't believe everything you hear because BM25 isn't all you need and it's much more nuanced.

Rahul Raja [00:23:56]: Yeah, I think BM25, I think for, I think even for midsize, I think mid-scale companies, I think BM25 won't suffice. For example, let's say if you have more than 10 10,000 or so users, yeah, I don't think you can rely on BM25 to, yeah, for your retrieval system. Why is that? Yeah, I think again BM25 is because based on again the text-based retrieval, query-based, text-based search, sorry. So I don't think in BM25 you even like operate at the vector level. You're not doing any vector search on that. So I think most of the times people are not, they don't know what they're looking for. They just want to type some query and the retrieval system is, should be smart enough to figure out what the user intent is. And based on that, they have to return the result.

Rahul Raja [00:24:50]: I think so having, yeah. Also I mentioned earlier, if I think if it depends on the cost factor also, like if you want to do a trade-off for that, you want to minimize your cost also, then yeah. Sometimes we can run on BM25 using text-based retrieval, and sometimes we can do the dense retrieval also using vectors.

Demetrios Brinkmann [00:25:10]: I'm trying to digest this considering I, like, I am fairly interested in search, but I am by no means that knowledgeable in it. And so when I get to chat with people, I'm always excited about it because I learn a ton. Maybe we can just, I can just ask you this one. What are you excited about these days?

Rahul Raja [00:25:39]: Okay. In, yeah, in information retrieval, I think we talked about it briefly. So I think having agent capabilities is definitely one of the things which I'm excited about. So for example, by agent capability, I mean that. When you give a, let's say, query to your system, your information retrieval system, it doesn't just return answer to one query and then stop. So it's also, let's say, do a series of actions based on that answer. So for example, the first one can be getting a response to a query, and then based on that, it does another action. Maybe it queries the same search system again.

Rahul Raja [00:26:19]: So it does a series of actions. So this is what I mean by having agent capabilities. For information retrieval. Also now I think search, I think it's already there at many places. Like it's not based on only text-based search right now because you can also search based on images, multimedia. So multimedia retrieval, multimedia search is getting, I think I'm very hopeful for that. Like I think everyone now has support for this. I think that again calls for vector-based search because to match images, to match any multimedia with each other, you have to convert them into vectors.

Demetrios Brinkmann [00:26:58]: So that's all about it. But you're converting frames of a movie.

Rahul Raja [00:27:02]: Into a vector, or how are you converting videos? Yeah, for image, I think that's— we can convert that into a vector. But for, I think for a movie, what we can do is let's say we extract some of the frames of the movie. Because it usually have a lot of frames and just we can convert those into vectors. And then again, when we want to search again, yeah, we can match those vectors which, with whatever we have in database. So that can speed up here on.

Demetrios Brinkmann [00:27:32]: The search query that we have. Yeah, that I imagine you can also grab the subtitles and add vectorize that too, or help with the vector and the embedding.

Rahul Raja [00:27:44]: Yeah, or if you can, I think also just do an exact match with the subtitle— sorry, not the subtitle, with the metadata of the video, for example, title, the tags, what it has. So yeah, before going to that vector-based search, we can just do all this to first confirm that this is what.

Demetrios Brinkmann [00:28:03]: The multimedia is about. But you're not enriching the image with metadata of like a description of what is in the frame, are you? Or I guess depends on how much.

Rahul Raja [00:28:15]: Money you want to spend. Yeah. So do you mean like storing all the metadata also as vectors somewhere and then trying to match that?

Demetrios Brinkmann [00:28:25]: Yeah, exactly. Well, so basically you're grabbing frames from a video and before you embed them, you also run it through. Something that says, that describes what's in this frame, like a picture.

Rahul Raja [00:28:41]: Yeah, I think as you rightly pointed out, depends on how much money that you want to use for this. I think you can store a lot of things. You can store the metadata. Yeah, you can store the description also for, let's say, some of the frames. And then before matching the exact frame, you can match the metadata with this.

Demetrios Brinkmann [00:29:01]: What have you seen? Do you know many use cases? I'm sure there's a ton. I'm just drawing a blank right now on search for videos in this regard. And I have a friend, Amy, who is doing— she's— her company's called Cloud Glue, I think. And they're doing this where they're helping you search for videos and, or they're helping you search within your video collection.

Rahul Raja [00:29:29]: Yeah. I think, yeah, I have not seen. Many great examples personally. Yeah. For example, most video-based search. Yeah. That I've seen. For example, when I type a text.

Rahul Raja [00:29:40]: Yeah. I'm not sure even if YouTube tries to search that text with, let's say, some description or something which is going inside the video. I think that also mostly does search based on the metadata, what it has for the video. Yeah.

Demetrios Brinkmann [00:29:58]: Yeah. Guess one that I've seen do that well is probably TikTok, but other than that, I don't know what the use cases are and I'm sure there are a ton, like training videos or stuff like that and things that I'm just, you know, I don't encounter in my day to day. And so I can't think of it.

Rahul Raja [00:30:18]: But yeah, I think in Reels, I think in Reels also, for example, when I type a query, I I think to some extent it's searching for the content inside the video because it returns pretty relevant results based on what I search. Yeah. So that I think is doing some form of multimedia search.

Demetrios Brinkmann [00:30:37]: So the idea of being able to have multimedia search is cool. And the other thing that is exciting you right now is the idea of just being able to have like almost multi-turn search or enriching the search. It's not just the first thing that you get back. It's going over it, reasoning over the, what is retrieved maybe, and then going and conducting another search. So you have multi-hop search.

Rahul Raja [00:31:03]: Yeah. Right. So agentique is also like, that will help us to understand like getting a user reach to their goal. Yeah. Rather than just returning the most accurate result, the most relevant result. So this again, yeah, ties to. Yeah, that, that vision of search, the.

Demetrios Brinkmann [00:31:20]: Cognitive vision of search. One thing I've heard folks talk a lot about is the difficulties of being able to get the right amount of context to an agent at the right time, which as I talk to you, it's like, yeah, that's kind of a search issue. Right? That is making sure that you're not having too much crap that you're throwing at the agent that it needs to reason over and you're just getting it. It's that precision and recall type of trade-off that you were talking about.

Rahul Raja [00:32:01]: Yeah. So giving too much context to the user, I think in a, sorry, to the agent here, I think it Again, it's— yeah, I'm not sure how like the— any of the generative agents or that can help because the context that comes, it's mostly coming from the user, right? Maybe a user is not able to provide context to what they're looking for. Maybe they are able to like bombard the agent with extra information that it really needs.

Demetrios Brinkmann [00:32:35]: Yeah, I was thinking more along the lines of, hey, the agent is going and it's doing these multi hops and it's going and it's searching for something. It gets something, it retrieves it, boom. But if what it retrieves is maybe not relevant or there's only a little piece of what it retrieves that's relevant, that's really where we're talking about this is a search problem. It's less of. This like new AI LLMs problem?

Rahul Raja [00:33:05]: Yeah, in that case, I think to like ground an agent based on some specific context that you are looking for, I think RAGs can definitely help there because they— an agent can talk to that another retrieval layer and because retrieval layer that we can control, for example, let's say if the agent wants to query or only retrieve some of the context which is specific to, let's say, my company. So I can set up that retrieval agent in that way so that it can first talk to that RAG layer, get some specific context for that whatever query that user is looking for, and then pass that to LLM or generate an answer based on that context. So RAGs can really help to ground the agents to some company, like domain-specific.

Demetrios Brinkmann [00:33:54]: Data that we have. Yeah, there's so much good stuff. And especially when it comes to all the different ways, um, that you can like just encounter search or perfect search in your systems. And search has been around for so long and it's so fascinating to me that it's not a solved problem by any means.

Rahul Raja [00:34:23]: Right, right. Yeah, I think solved. Yeah, I think it will never be solved, I think fully, because it's just been evolving with every new piece of technology that comes in. Yeah, I see that. Yeah, it just keeps evolving. Now user like expect them more from search. I think some time back they just wanted that, yeah, just give me a list of links or list of documents that, that, that, which is most suitable for my query. But now based on LLMs, based on generative AI, then now users have expect, like are expecting more and more from search.

Rahul Raja [00:34:59]: So now search has to do a lot more just to make the users happy.

Demetrios Brinkmann [00:35:05]: Yeah. You know what? I just randomly thought of another use case for the video search. And it's when you have one of these copilots or like Rewind, Rewind.ai, which I haven't heard about recently. I don't know what happened to them, but they were supposedly watching everything that you're doing and taking screenshots every once in a while and recording what you're doing and so that you could search over your past history. Much easier. And that would be a great use case for the video search, right? It's just like if every 5 seconds they're just taking a snapshot of your screen and then they're vectorizing it or they're creating an embedding out of it, uh, which I imagine is how they were doing it, but something didn't work with that because I haven't heard anything about them for they blew up a few years ago and then I don't know what they're doing these days.

Rahul Raja [00:36:08]: Yeah, I don't know if I want to search my history like that. Like every action which I'm doing, it's recording. Okay. And then I have to search for that. Yeah. Never encountered a use case like this where I need to search what I was doing or basically what they are like, what's the most Search, I think use case that might do on my system is I have to look for some document and maybe like, I also want to search for something like what's, because I remember what's inside that document, but not exactly what was its name. So I just want that kind of text and also video, video and multimedia search. Yeah.

Rahul Raja [00:36:47]: But he's speaking.

Demetrios Brinkmann [00:36:49]: Yeah. As like, oh, I had a call with Jane yesterday and she said something. She said this number. What was that number? Or, um, you You know, when I was talking to this person, they referenced something and I can't remember exactly what they referenced. There is a world now where, all right, you probably have a call recorder and you can go to your call recorder and check that. Yep. But if you can have that all in one spot where you just search your history, your past, okay. And, and that might be like, I had the call yesterday, but what if I had the call two weeks ago? I can vaguely remember who it was with.

Demetrios Brinkmann [00:37:29]: What they were talking about, you know.

Rahul Raja [00:37:31]: That type of thing. Yeah. Yeah. That sounds interesting. Yeah. I never thought about having this kind of an app in my phone. I think for phone it would be more useful.

Demetrios Brinkmann [00:37:42]: Yeah.

Rahul Raja [00:37:42]: Yeah.

Demetrios Brinkmann [00:37:43]: The phone, because we use it more, right? But even, oh, I was on LinkedIn the other day and I saw a great post that was about this, but now I can't find that post. And so if I'm getting screenshots taken every 5 seconds, I imagine you could.

Rahul Raja [00:38:03]: Go and take it. Yeah. I wonder it will take like insane amount of memory and resources to like store all of that, that they're doing. If they're taking a recording of a system every 5 seconds or so.

Demetrios Brinkmann [00:38:18]: Or more, who knows? Yeah. The whole thing was that, oh, we figured out this special compression technique. And so it doesn't do that. But like I said, I haven't heard about them in a long time. And so I don't know, it might've failed horribly or people might've been like, I don't want some random company just spying on me.

Rahul Raja [00:38:42]: Yeah. Right. If it's a recording, then yeah, it might be using their data for something else.

Demetrios Brinkmann [00:38:48]: Yeah. You never know.

Rahul Raja [00:38:51]: Yeah.

Demetrios Brinkmann [00:38:51]: Yeah.

Rahul Raja [00:38:52]: They're training a lot of GPUs, maybe. Just due to the cost factor. Yeah, it didn't survive.

Demetrios Brinkmann [00:38:57]: That can be one of the reasons. Yeah. Also, yeah, who knows, but that's a little bit of a tangent. Uh, going back to what you're thinking about on a daily basis, I'm wondering about like the future in your eyes, uh, and where you see there's some unresolved challenges, maybe it's with cognitive search or even with the dense vectors and vector search, how do you think that we can.

Rahul Raja [00:39:28]: Move forward? Yeah, I think in, uh, just in cognitive search, I think you also pointed out, I think some of the things, for example, in vector-based search, we have a lot of battle-tested things already, but in cognitive search, we are like still far from it. We don't have a good way to measure the metrics which I was talking about. For example, let's say user is happy, or let's say they got in the end what they were looking for, like their intent of the search was satisfied. So that's still, I think, yeah, we need a lot of work there. Again, in terms of, I think, dense retrieval also, like we have GPUs, like we do vector embeddings for everything, but I think GPUs, I think getting the cost of these GPUs down again is a lot of challenge for big companies. Like you can't, like there are not infinite resources, you can't run everything on dense retrieval, you can't vectorize everything. So like getting the cost down for this GPU, like making your whole like AI infrastructure pipeline reliable and also cost effective, that's also a big challenge. So I think this is also what I hope, like it's, it looks like that it's not a big problem, but dealing with cost and accuracy with limited resources is often one of the big problems in a lot of companies.

Demetrios Brinkmann [00:40:55]: You bring up a point that has been stated on this podcast a few times over the years, and I wonder how you would go about tackling this problem, which is keeping the documents and files and everything in the vector database up to date. Because when you create embeddings, let's say you create embeddings for a certain document and then you throw it in the vector database or you, you, you know, throw it wherever you store it. And then there's a new. Version of that file that you need to have be referenced as the canonical truth, what do you do? You just go and like purge all of the past ones, but how do you purge all the past ones? Like, how do you make sure that now when you're doing this dense vector search, you're retrieving the correct file that is the most up-to-date one that you want to use?

Rahul Raja [00:41:56]: Yeah, I think this, uh, problem also persists outside of vector-based indexes. For example, even if you have massive, let's say, text-based index and you want to also first you want to update it constantly with the live events that you are coming in. And also then let's say after a few days your index becomes stale, then you have to rebuild the index again. So in terms of vector indices also, I think it's the same approach. So I think once you generate it, you can keep it for a few days, or let's say depends on like, you can keep it for a few weeks and then you can have updated gradually with all the live events that are coming in. And you can keep the snapshots of those live events just in memory. But let's say there's a need to also rebuild this index. You can't keep on like surviving with that index for long because the size of the live index will keep on growing.

Rahul Raja [00:42:59]: So I think a way to do this is you rebuild your entire snapshot of the index which are stored on disk gradually after, like, sorry, incrementally after every few weeks. And once you do that, because so each embedding is generated with a specific model, model version, so you have to make sure that you are using the exact like same model to generate the new embeddings again. And if you are using a different version of the model, then we have to compare those embeddings and then we can only make the switch from using the old embedding to the new. So in this way we are not actually purging the old embeddings. So we first generated a new set of embeddings, we did an evaluation between them, we did a compatibility check whether these embeddings are compatible, and then we make the switch. And the switch is only in terms of where your queries are going to be served. So you change that layer, sorry. And then you say that now just all the queries should use these embeddings to generate the responses and you just keep doing it after every few weeks or so.

Demetrios Brinkmann [00:44:02]: You remind me of how my buddy told me a story about how he had the exact same system and all he changed was the embedding model. Everything else was the same. And he was like, it's not going to be that big of a difference, right? It was completely different. He was like, You cannot use those at all.

Rahul Raja [00:44:25]: Yeah, embeddings are like very— those are kind of like version artifacts like we have in software systems. So you, even if you change, let's say a param or a weight in your model, you like your embeddings are just going to go haywire. Like you're not going to believe what it does to embeddings. Yeah. So that is like very important to make, keep that compatibility between the model and the embeddings.

Demetrios Brinkmann [00:44:52]: Yeah, the other piece that I think is fascinating when it comes to these, the embeddings and the vector databases, the vector stores that are kind of difficult that I heard people talking about back like 2 years ago, I haven't heard as much of it these days because maybe folks moved on to new problems or they just, they swept it under the rug and pretend like this doesn't exist anymore is like the access control with embeddings and vector databases?

Rahul Raja [00:45:27]: Yeah, I think the access control, I think it is similar to like the access control that we have for any version artifacts because we, I think there is like usually a lot of like layers that goes in that access control. So I think for embeddings also, I think the Yeah, I think in my experience, the access control that we have for any version system, like for a dataset, for example, that is same as embedding. So we consider like an embedding same as a dataset, which will just keep on having like more versions incrementally.

Demetrios Brinkmann [00:46:04]: Yeah. Yeah. Going back to what you were saying about how to update these files, whether it's embeddings or files or just data in general, I noticed that for certain things you can overweight the fact that there's, there's like certain features that you can really overweight and then it leads to a worse search experience. And I'll give you a prime example of this. When I was using Notion, I saw that I was looking for a file. Couldn't quite remember the name. I remembered some stuff that was in it and I spent so long trying to query this file or trying to like just find it in my gigantic Notion sprawl of a, oh fuck, it's a mess, dude. But that's a whole nother story.

Demetrios Brinkmann [00:47:01]: And Notion was over-indexing on. Certain features like, oh, this file was checked recently, or this file has been looked at by 5 people on your team. You know, like there's certain features that make it much harder for me to find an obscure file.

Rahul Raja [00:47:27]: Yeah. So I think, uh, like it depends on their indexing strategy. Yeah. Maybe what I think the features that they were using for indexing. Was not that great. But I think that if you're looking for the contents of the file, I think that's the, like, one of the first things which I'll, like, put my bets on. Like, that's the first thing which I'll index properly so that people, they usually, because they either remember the name of the document or they remember what was the document about. So, and they either search by, like, either of these two things.

Rahul Raja [00:48:02]: So at least it should return the documents properly based on these indexes, at least.

Demetrios Brinkmann [00:48:10]: Yeah. And this, if I throw into a search bar or something like, you know, a file that talks about X, Y, Z, and I kind of describe what's in the document, that is not being triggered for dense vectors? Is it like, I feel like I've heard of some search techniques where first you'll pass it through an LLM to try and gather the gist of it or create a query that will go and find it. Did I just make that up or is that something that you've seen?

Rahul Raja [00:48:47]: I think that's right. I think, uh, in case of documents, I think vector search is a good idea. Like you have generated embeddings of your documents, let's say, which, whichever you are storing in an offline way, and then you store it somewhere. Now based on each query, then you just compare that embedding with the query embedding that you have. So yeah, I imagine, uh, like because searching the content, yeah, that's a good way to search the content. Yeah.

Demetrios Brinkmann [00:49:12]: For a document. And now the, my friend Nishi was talking to me about how this is a really hard problem he faced at his company for search. And it was along the lines of he wants to be able to return what shops are open when someone searches for pizza on the app. You know, and so there's a lot of restaurants that offer pizza, but if someone is searching on his food delivery app for pizza, they don't want a restaurant that's closed. To be the first thing that shows up, even if it is the highest rated. How do you think through that? Like, what do you— because, well, I guess the other caveat to this was he was like, it's not like just open/close times are this magic number that we can reference or a row in the database. That was another thing that was a little bit hard, or even you can't like filter by that for some reason that I.

Rahul Raja [00:50:21]: Can'T remember. Yeah, I think this, yeah, to me, I think it's not a search problem. I think it can definitely be solved using search. Like you can, you can vectorize everything. Even based on that, you can see that which restaurant is open. But I think this is more of a distributed system problem. Like if you can't store the Yeah, I'm just thinking out loud. If you, let's say, if you have stored the timings of that restaurant, let's say if you are not able to do that in a relational database, then in a NoSQL database, and for each of those restaurants, you check whether at that time it's open or not.

Rahul Raja [00:51:05]: I think there are a few smart things we can do there. I think, let's say, there are a few times where people like search that more often, then we can create some indexes based on those time intervals that in this time interval these restaurants are open. So it depends on the query pattern that the app is getting. So for example, if I get a query at 5 PM, then I can see at 5 PM which are the restaurants that are going to open. So I can pre-calculate all this and I can store it in my database already. So that I can just return it at that time. So there are some like smart strategy that I, that I can do here. But yeah, I think this, I think I won't want to make it like.

Demetrios Brinkmann [00:51:50]: A search problem for me. Yeah. I like you humoring me on this one. The other one that he said was hard is that people will search for vegetarian options. And pizza margherita is a vegetarian option, but that's not going to come up normally because it's not tagged as vegetarian. It's just tagged as pizza. And even though it is vegetarian, it, it's like in vector space, vegetarian food is very far from pizza margherita.

Rahul Raja [00:52:30]: Hmm. Yeah. Right. I think that's why we should have a layer of like, yeah, I was going to say text-based search. But in this case, yeah, there is no vegetarian option specified anywhere. So you can't do a text-based search also. Yeah, again, in this case, yeah, I'll just pre-calculate it. If I see that people usually look for vegetarian or non-vegetarian options, I'll just mark my like options in the menu as vegetarian or non vegetarian.

Rahul Raja [00:53:00]: And then based on that, uh, I, yeah, I'll just, so if they, let's say they search pizza vegetarian. Yeah. Yeah. Based on that, I can just return because from vegetarian, I got these X number of things from my menu. And from that, now I can filter which one is pizza out.

Demetrios Brinkmann [00:53:18]: Of that. So, yeah, it's almost like, and I think he ended up doing this. But basically taking all of the menu items and then running them through an LLM and asking the LLM to tag them with certain qualities, whether it's vegetarian or gluten-free or Asian, whatever type of the type of food, the, and then you can use that. When you are throwing it into a vector database. You just have to make sure that what you're tagging is relevant.

Rahul Raja [00:53:59]: Yeah, right. Yeah, I think tagging we can figure out based on also like what do people search for? Like what are they looking for most? Vegetarian is one of the things that they look for. Yeah. So based on search, we can keep like evolving the tags. I think on a food menu, yeah, I'm sure there are a limited number of things which a user is searching for. So yeah, that should be like easy to estimate after like gathering some data from the users.

Demetrios Brinkmann [00:54:28]: Yeah, it's looking at the data you have and then trying to bucket that or just asking LLM, hey, bucket these 10,000 queries that we've had from our users. Yeah. Yeah, because he, it's, that's a, that's a great one. Any other hard search problems you've run into? And how did you solve them?

Rahul Raja [00:54:50]: Yeah, I think only— yeah, I can only recollect about— yeah, multimodal is one, agentic is another. I think mostly, yeah, it's mostly— I think what I have run into for search is not only like related to relevance, but also about like search infrastructure in general. So making the system like more reliable and like always dependable. So that is our like goal always. For example, another thing is also like make sure that the tail-end latency is not that high. For example, you're all the— for example, if few systems are lagging and returning a result, we have to make sure that the user is not seeing that latency. So most of I think these problems are also like distributed system problems and not only in terms of, yeah, what's the best result that we have to.

Demetrios Brinkmann [00:55:47]: Return to the users. Say more about that.

Rahul Raja [00:55:51]: Yeah, so in terms of distributed system problems, I think I already mentioned that let's say you are using vector-based search, I think then dealing with GPUs and keeping their cost down. Is one of the big factors that we like have to take into account. Also, when I think when you have indexes of like, let's say, billion embeddings, or you're also in cases of text-based search, you have very large indexes, then a lot of like distributed problems come into play. For example, you have to keep deciding, like one example is how many like how many shards or how many partitions you want to divide your data in. And how do you accurately define those many partitions? Like how do you make sure that when you decide on X number of partitions, then your queries are not running into issues, for example, out of memory, and you are able to load all the memory indexes fine, you are able to serve the queries from the indexes on the disk storage fine.. So you have to make sure, like, you have to like run experiments for all of that and then come into some conclusion because there's no magic number which can tell you that for, let's say, your index size is X terabytes, then you can just go run with these many shards, these many partitions, because you also have to take into account which is my, like, what are the query patterns that I'm going to get? For example, do you want each machine to be big? Like, you have to scale your system vertically. Rather than just scaling it horizontally where you just keep, let's say, X number of replicas for each partition. Because then, uh, so like analyzing your read and write queries, for example, what kind of, uh, in search is usually read, like there's no write which a user is doing.

Rahul Raja [00:57:43]: So you have to analyze the read queries and then design these systems, uh, like accordingly. Yeah, then again, I, I think I mentioned about live update. That's a big challenge. How do you make sure that your for example, let's say you have some X items in the menu. Now let's say someone goes and makes a change in the menu. How do you make sure that users are able to search for that updated thing, whatever it was updated? Let's say you added a new tag or you did something. So you have to keep refreshing your indexes periodically, like whichever is there in memory or on the disk, so that users are able to like users can see the latest results always. Yeah.

Rahul Raja [00:58:26]: So mostly it's, yeah, mostly it's distributed system challenges.

Demetrios Brinkmann [00:58:30]: Yeah. That live index or the live updates reminds me of a conversation that I had on here with Rohit and he was talking about how most folks default towards like, yeah, we need the fastest, we need like our data freshness to be, you know, in the milliseconds. And he was like, you don't know how much money I've saved people by just going from milliseconds to seconds. And there is hardly any noticeable change that the user feels, but on the backend, you're saving so much money.

Rahul Raja [00:59:14]: Yeah, yeah, that's I completely agree with this. I think people, yeah, when they think of a search system, yeah, they always think that I want my results to be faster, I want it to be 100% accurate, but that's not the— I think that's not always what a user is looking for. I think you rightly mentioned that going from few milliseconds to even a second, to some users, like, it won't make a difference. Like, there are some systems like Google or say you can't return the search results in seconds. So it depends on your users also, depending on what kind of queries that you are serving. So just trading, like having a trade-off between those latency and accuracy, I think that's completely acceptable. You have to decide. I think cost is definitely a big factor.

Rahul Raja [01:00:01]: So if latency is fine, I think.

Demetrios Brinkmann [01:00:04]: Then it's fine to do that trade-off.

Rahul Raja [01:00:07]: Yeah.

Demetrios Brinkmann [01:00:07]: And in this specific case, I also want to be clear that Rohit was talking about like the data freshness. And so that's why I was thinking about what you were saying with the live updates. Okay. The live updates and that getting incorporated back into the search system, you can try and have these live updates be constantly incorporated back into the search system, like every millisecond, or you could say, you know what, every 5 minutes is all right and save a whole lot of money because you're not having to perform those operations continuously and, and that you can save on compute.

Rahul Raja [01:00:47]: Yeah, I think most users can live with it if we don't provide like all the live updates within a few milliseconds. I think it's fine. They can. Wait, if even if it's getting reflected after 5 or 10 minutes, so they.

Demetrios Brinkmann [01:01:02]: Don'T have an issue with that. Yeah. Which feels like an eternity, but still like 5 or 10 minutes. I mean, you probably don't want to do that. It's probably can be a bad user experience depending on what exactly is being reflected. But I mean, you definitely like a few seconds. Is not that big of a deal. And then like dovetailing off of that and going into what you were saying with us understanding now that users, there's this pattern that's happening, at least within myself, when it comes to search, I'm more okay with search taking a little bit longer because I've been interacting with LLMs more.

Demetrios Brinkmann [01:01:48]: And so if it takes a few seconds rather than popping up instantaneously in milliseconds, it's not the end of the world for me. I would rather it be as fast as possible, but I'm not going to take a worse experience just because it's faster. I'm okay with waiting to have a better experience.

Rahul Raja [01:02:13]: Yeah, I think this is what we like see with, uh, using all the, uh, I think generative agents. Because there is some system which is giving us a link of all the documents instantaneously, but we are still fine to go and wait for the LLM to generate some answer and just reason that this answer makes sense and then just give me what I'm looking for. So in that case, even latency is acceptable. So it's fine to have that trade-off there.
