MLOps Community

Agents as Search Engineers // Santoshkalyan Rayadhurgam

Posted Nov 27, 2025 | Views 27
# Agents in Production
# Prosus Group
# Search Engine Agents

SPEAKER

Santoshkalyan Rayadhurgam
Engineering Leader @ Meta

Senior Engineering Manager - AI/ML

Engineering leader, AI for Ads at Meta. Ex-Lyft (Pink Subscriptions). Ex-Amazon (CV/ML).


SUMMARY

Search is still the front door of most digital products—and it’s brittle. Keyword heuristics and static ranking pipelines struggle with messy, ambiguous queries. Traditionally, fixing this meant years of hand-engineering and expensive labeling. Large language models change that equation: they let us deploy agents that act like search engineers—rewriting queries, disambiguating intent, and even judging relevance on the fly. In this talk, I’ll show how to put these agents to work in real production systems. We’ll look at simple but powerful patterns—query rewriting, hybrid retrieval, agent-based reranking—and what actually happens when you deploy them at scale. You’ll hear about the wins, the pitfalls, and the open questions. The goal: to leave you with a practical playbook for how agents can make search smarter, faster, and more adaptive—without turning your system into a black box.


TRANSCRIPT

Santoshkalyan Rayadhurgam [00:00:05]: Right. Hello everyone, I'm Tosh. Today we're going to talk about a shift that's happening quietly underneath nearly every retrieval system in production today. For about two decades, search was built around the assumption that users express fully formed intent. That is no longer true. Interfaces have changed, user expectations have changed, and the world in the past five years has changed significantly. And search systems that were once pretty deterministic, stateless, and, frankly, brittle.

Santoshkalyan Rayadhurgam [00:00:53]: Now they need to behave more like distributed reasoning engines. My talk is basically about how we get there and what we fundamentally redesign. So let's start with the problem. We're going to go through this in five chapters: one, why information retrieval breaks under real-world ambiguity; then what happens when retrieval becomes stateful reasoning; some building blocks of agentic systems; what these systems look like in production; and finally the very open, unknown question, which is how does AGI dissolve the boundary between search and understanding? So this is essentially a systems engineering view of the next decade of search.

Santoshkalyan Rayadhurgam [00:01:54]: So let's start with why our classical approaches bend and break. Tracing the progress we've made in search architecture: first we had lexical pipelines, which is your BM25 and TF-IDF. These were very deterministic and extremely fast. They also relied on sparse representations and had no semantic structure. They treated every token as an independent statistical event, and these systems fell apart when vocabulary diverged from user phrasing. Then we enter the new era, which we're still in. This is the whole vector-based RAG system, where we introduced embeddings, dense representations, and semantic similarity.

Santoshkalyan Rayadhurgam [00:02:54]: But they also came with some new challenges: chunking heuristics, KNN latency, and a fundamentally stateless generation step. So even as these embeddings got better, you had chunk collapses and vector drift, and irrelevance became a big operational cost. The next stage of evolution, which we're heading into, is agentic search, and it's an entirely different category. Instead of a single retrieval step, we get multi-turn reasoning strategies, tool calls, and state. Retrieval here becomes part of a control loop, not an endpoint. So I would say this is not search 2.0; it's a different computational model.

Santoshkalyan Rayadhurgam [00:03:54]: The reason for that shift becomes obvious when we look at user behavior. This two-by-two diagram captures a fundamental truth: users are increasingly operating in the high-ambiguity, high-complexity region. Most queries today are under-specified by design. Users say things like "find that Python memory thing from last week," or "a laptop for editing, but light." Static information retrieval assumes that text is truth. But what users are doing is expressing partial intent, not instructions. That means failure modes like zero session memory or intent collapse.

Santoshkalyan Rayadhurgam [00:04:58]: And if we add lexical brittleness, these are all natural outcomes of this architectural mismatch: the system does exactly what you tell it, which is precisely the problem here. Let's make it a little more concrete. The most common pattern in modern search is this example: "the Python memory thing from last week." It's incomplete, it's fuzzy, and it's partially recalled intent. It also has temporal grounding. So this is not a keyword retrieval problem; it's an interpretation problem. We need a system that can detect entities (Python is an entity), resolve temporal constraints like "last week," classify the domain (programming resources), and finally infer the missing structure.

Santoshkalyan Rayadhurgam [00:06:08]: Is that an article, a tutorial, or a code snippet? Without reasoning, this query is impossible to answer reliably. And I think this is the canonical example that motivates an agentic approach. So we move from static retrieval to stateful reasoning. Let's look at the core structure for an agentic search engine. I've written this as an agentic search state. It holds the original query and some sort of reformulation trajectory.

Santoshkalyan Rayadhurgam [00:06:55]: And you have the embeddings that are associated with each iteration and set of retrieval strategies and some confidence curves and diversity metrics. Right. Now this is a session local memory, which means you have a very consistent internal representation that survives across the tool calls and across iterations. Right. A challenge here is how do you do adaptive strategies? How do you kind of like perform failure signal detection. Right. And atomic component updates. Right.

Santoshkalyan Rayadhurgam [00:07:35]: And if you think about even diversity analysis, these are system problems and not like model problems. So what we need is a distributed stateful multi turn controller. Right. And with that state we can actually execute a controlled reasoning loop. The whole, there's, there's different versions of this Whole agentic search architecture, I think, fundamentally, I think treat it as a reasoning pipeline. We had the piped in memory thing example, so that is a realistic conversion process because the result entropy is extremely high and you have the system that is essentially blind. And if you kind of go to the next version, which is Python memory leak detection, you still have high entropy, but the strategy is ineffective. And then you have this whole each stage of the loop.

Santoshkalyan Rayadhurgam [00:08:44]: If you look at it, we have the query understanding here where we do entity extraction, intent classification and detection of ambiguity. And then you move to strategy selection, which is pretty dynamic. You have lexical semantic graph based hybrid, all of these approaches and then moving on to all the way to multiple backends that are kind of orchestrated with like fault tolerance. This loop is fundamentally, I would say a form of online optimization. Now to kind of build these systems we need solid design principles here. And so we have two philosophies I'm trying to contrast in this slide. One is you have your monolithic API, which is a single search endpoint. It has dozens of parameters, right? These are brittle for LLMs because they're non deterministic and it's nearly impossible to reason about them.

Santoshkalyan Rayadhurgam [00:09:53]: Right. And then you have composable tools here where these are atomic transparent functions like say, you know, keyword search, for example, or semantic search. These are serving as primitives for the agent. They kind of make the agent, the planning part of the agent, very tractable. And you're improving determinism here. That means, you know, debugging become simpler and managing a policy space is also easier in that sense. And those primitives only work if the substrate here is understandable. Right.

Santoshkalyan Rayadhurgam [00:10:36]: So this is my favorite benchmarking slide here for search. If you take a simple BM25, when it's paired with an agent, it outperforms a complex neural network that is without an agent. Agents need predictable distribution scores and semantics. But when you have opaque read anchors and black box embeddings, that makes reasoning difficult to understand, like what is the cause and what is the effect as well. And by contrast, I would say that systems like very transparent systems like BM25, they make the agent hypothesis very accurate and it's refinement kind of much more meaningful as well. So the takeaway is your backend doesn't need to be fancy, it needs to be predictable here. And let's kind of talk about a modern query interpretation in a sense, right? So if you look at our modern query systems, which is pretty dynamic in nature, right? How are we Kind of looking at agentic system performing. Yeah.

Santoshkalyan Rayadhurgam [00:12:05]: Free text into like transforming that into a structured meaning. Right. Our traditional NLV pipelines were sequential and brittle and what they produced was linear labels and not reasoning artifacts. Right. Now agentic interpretation instead it's generates some sort of a structured query type. Semantic intent and temporal constraints. Looking at our previous example and multiple hypothesis for any kind of disambiguation. So what we're trying to build is a probabilistic space for interpretation and not a single answer.

Santoshkalyan Rayadhurgam [00:12:49]: So this kind of makes the downstream strategies much more targeted. And it is this grounding in like linguistics that enables reasoning as well. And once you know what the user intent is, you can classify the task that they're kind of performing. And if we these examples kind of show like the intent specific feature vectors. Right. For example, for query, which is laptop for coding, the system kind of emphasizes cpu, ram, dev environment. Now the same thing for a video editing computer. You're looking at like GPU throughput, maybe a display surface.

Santoshkalyan Rayadhurgam [00:13:43]: And finally for a portable workstation, I think what's important is the mobility and energy constraints maybe. Right. So this is structurally different from the document embeddings that we've been doing. What we're trying to do here is build query embeddings that are conditioned on intent. Right. And you can see the outcome. I mean we've had like roughly 35% lift in precision, not a lot of latency overhead as well. So this is, you know, high frequency semantic inference.

Santoshkalyan Rayadhurgam [00:14:21]: Now interpretation isn't enough, Right. We need relevance and that requires real kind of human feedback. So the reality of things, we have element based relevant scoring actually that has some useful priors, but it has some famous limitations that it doesn't personalize, right. One, and it doesn't adapt to drift two, and it is not interpretable when it fails. Right. So a hybrid relevance basically combines LLM signals with behavioral feedback. Right. I have this whole hybrid relevance scaling equation here.

Santoshkalyan Rayadhurgam [00:15:11]: But basically what we're trying to do is the behavioral signals provide the grounding and then your human feedback corrects any model hallucinations. And we have AB weighting that gives us control as well. And this kind of closes the loop for retrieval to reasoning, to user and refinement as well. Now what is the, what's the cost or economics of this reasoning? So if you look at the table here, I'll just go through. Right. So the first tier is like cached patterns. These are kind of like the cost is essentially free ish and your latency is like about 10 milliseconds. These are like deterministic and repeated intents.

Santoshkalyan Rayadhurgam [00:16:03]: And then you move to distilled models where you know your latency is maybe about 50 milliseconds inexpensive and it's used for like simple reasoning. And then you move on to single pass agent which is, which is where the latency is like about 200 milliseconds and the cost is becoming moderate as well. Right. And finally when we go with full reasoning the latency is likely about 500 milliseconds high cost as well. So when we have complex queries that we receive like can some of these queries receive the depth? Right. And majority of the traffic kind of still remains extremely efficient. But when we have complex queries they receive these kind of depth like the economic backbone. You may call as for search and like agent context.

Santoshkalyan Rayadhurgam [00:17:03]: So what does it kind of look like in a real system? Right. So a modern production grade agentic search system, you have query routing that's happening so you know, using your top of the shelf flinks streams for any real time complexity. And then we have a cache layer and then an agent service which is a stateful orchestrator. Say now you can build it on temporal. Right. And then you have a search backend which is hybrid lexical plus vector retrieval along with like a multi stage ranking. Right. And finally you know this is what it looks like when kind of reasoning is meeting our production constraints.

Santoshkalyan Rayadhurgam [00:17:59]: We have about 100 millisecond in terms of like P50 latency and 6% zero result rate. Right. Let's, let's talk about where this whole trajectory kind of leads. I think the future of the search is an interesting topic. The next decade basically brings three kind of like horizons. I think near term as an end of this year, right. We're still looking at like multi turn clarification or some sort of a cross session memory and we are doing real time user learning. Now think About Horizon about 2026 looking at like domain specialized agents or microservices for reasoning.

Santoshkalyan Rayadhurgam [00:18:52]: This is already happening in some spaces. And how do we do anticipatory search as well? Long run we are looking at ambient intelligence which is something that's always available. Multimodal agents and they're operating across all your devices and context. The search becomes conversational here and conversation becomes predictive and predictive becomes embedded. So that's the realm of future for search. And beyond that, on philosophical note, search may disappear entirely. Right. So what does it look like when we have AGI? Right.

Santoshkalyan Rayadhurgam [00:19:41]: If an AGI fully Understands intent, context and the world state. Right. What does search become? So I kind of posit that there may be three features. One is search results. AGI anticipates needs. Right. You have information flows without explicit queries. That's one or two.

Santoshkalyan Rayadhurgam [00:20:12]: We have AGI that kind of routes knowledge between specialized agents or humans. And finally, we may also have something that's called reality querying, where your simulation is basically becoming a query primitive. The what test becomes computable at scale. I do think there's a bit of a paradox here. The better your search is getting, the less it's going to resemble search. I think at some point retrieval becomes understanding. Yeah, yeah. I think it's going to be interesting to see where the reality leads in the next decade.

Santoshkalyan Rayadhurgam [00:21:00]: And with that, you know, let's, let's add. I do think we're at a moment where search is transforming into something fundamentally more capable and more aligned with how humans actually think. Thank you.

Adam Becker [00:21:22]: Tosh, that was mind-bending. I was completely transfixed throughout. I don't know if "meditative" is the right word, but there's something deeply immersive in the way you were presenting it. If you go back one slide, in some ways it reminds me of that future of search dissolving into understanding, because it is anticipatory. I imagine that something similar might have happened in YouTube.

Adam Becker [00:21:58]: I mean, I just, it feels like back in the day I used to spend time on YouTube by searching, by using the search bar. And as recommendation became better, it's just, who uses the search bar on YouTube? I don't know.

Santoshkalyan Rayadhurgam [00:22:12]: Right.

Adam Becker [00:22:13]: And so do you imagine that it would be a transformation that might at least along this path that kind of resembles that where the system is, you're saying, predicting, embedding, surfacing, almost recommending the right piece of content at the right time.

Santoshkalyan Rayadhurgam [00:22:33]: Yeah, it's. I would say, I think it's, it's a very kind of like. It's a good observation. Right. What, what's kind of happening with YouTube recommendations is I think it's a really good early signal of where search is headed. I do think. Yeah, especially modern recommendation systems where you have like these YouTube TikTok or Instagram Reels. We've kind of shifted from this whole how do we retrieve relevant items to predict the next embedding.

Santoshkalyan Rayadhurgam [00:23:04]: I think the. If YouTube recommendation systems were modeling, hey, now, what are you probably going to watch next? Right. The Future systems model would be what are you going to probably going to ask next? Right. So I think in both cases the system is no longer kind of matching, but it's rather forecasting. I think that is where the essence of this whole search kind of dissolving into understanding comes from.

Adam Becker [00:23:32]: Yeah, I mean I see it now just with the way that I interact with ChatGPT, right. Where like it's, it's. Sure, I give it this seed of thought and then for the next five, six prompts I mostly say sure, yeah, go for it. Okay, I want to see it. Right. And it is almost, it's pulling me into the right. It is showing me what I should be interested in, type of recommendation. But I imagine it's sort of forecastable internally on their end.

Adam Becker [00:24:07]: Very cool. We have a couple of questions here from Apoorva for hybrid scoring. Do you have inputs on how to weigh the different terms?

Santoshkalyan Rayadhurgam [00:24:17]: Yeah, basically I think if you look at hybrid scoring, the main thing is how are you combining a few things. Right. One is LLM relevance clicks or dwell time, negative signals. So what we're trying to do is we're building a multi objective optimization function or any kind of heterogeneous signals that we have. So you don't want to treat these signals symmetrically because each of them is going to carry different statistical properties. So here's how I would kind of think about it. When is LLM? Relevance is kind of. It's a good prior to have, but it's not a proper ground tool.

Santoshkalyan Rayadhurgam [00:25:05]: Right. And you have your CTR and dwell time which are like high signal but high variance. So there are, there are the good short term indicators for relevance, but they're very sensitive to say some things like personalization, position bias. So you can typically apply like a sharper normalization, say you know, something like position normalized click model. Right. And then say if a user says look, this is not helpful or it kind of does a fast bounce, then you know, you actually have the negative feedback that's a low frequency but extremely high precision. Right. So we give disproportionate influence to that, sometimes even maybe a veto.

Santoshkalyan Rayadhurgam [00:25:52]: Because negative signals tend to be sparse, but they don't have. So I think the right way to kind of think about hybrid scoring is not a fixed formula per se, but rather how do we think about it as an adaptive policy.

Adam Becker [00:26:06]: Right.

Santoshkalyan Rayadhurgam [00:26:06]: That changes, the weights change.

Adam Becker [00:26:09]: Yeah. Remember like different regimes where one factor influences more than the other.

Santoshkalyan Rayadhurgam [00:26:15]: Right.

Adam Becker [00:26:16]: We got another one. Apoorva, any inputs you might share on the evaluation phase either of retrieval or of agent paths.

Santoshkalyan Rayadhurgam [00:26:26]: Yeah. So I don't think I covered evaluation of this tech so far, but I think fundamentally evaluating agentic systems are, I would say they're different from like your traditional IR problems because we are no longer kind of measuring a single hop. Right. You're. You're measuring a trajectory. Right. So in a dynamic agentic path, I think the evaluation metrics that I would kind of look at, it's more like a planning algorithm. So what is the convergence rate? How many steps does it take? The reasoning loop before termination.

Santoshkalyan Rayadhurgam [00:27:09]: Right. Or how many distinct strategies were explored? Was it like lexical first versus like graph first? Right. And quality, where how much information gain is happening per iteration? I think it's. It's pretty layered because retrieval kind of tells you this whole signal fidelity and agent dual tells you policy quality. Right. And then the combination of the sole retrieval plus agent is going to give you a system intelligence. So, yeah, I would say we can measure both the both of them independently and then together. And then the joint metrics, which is like convergence rate or entropy reduction, and the quality of the reasoning path, they kind of tell us whether this entire loop is working as we wanted it to.

Santoshkalyan Rayadhurgam [00:28:05]: Yeah.

Adam Becker [00:28:06]: Tosh, I see that you're asking a final question on my end. You've hiked Everest Base Camp, Kilimanjaro and Patagonia. Is that. Yeah. You haven't yet climbed anything in Yosemite. Do you think that if search fully dissolves and disappears, you'll finally have time to go to yourself?

Santoshkalyan Rayadhurgam [00:28:29]: Yeah, I. I really hope so. It's been. Whenever I try to climb Yosemite, something or the other happens. This time around, we actually have a new baby in the house, which is a good thing. And now we're planning to take the baby. So let's. Let's hope I do the hike before AGI happens.

Adam Becker [00:28:51]: Crossing my fingers that that happens. Tosh, thank you very much for coming and joining us. I think we might have some more questions in the chat. So if you want to drop in the chat and if people can connect with you, however you like, Twitter, LinkedIn, whichever method you use, I think it'd be useful if you do. Do you ever write about these things in the future of Search?

Santoshkalyan Rayadhurgam [00:29:12]: I think I do have a book in the making, but I'm planning to publish it by Bite Size Blog as well. And, you know, people can connect with me on LinkedIn. I'll drop my LinkedIn as well. Please do.

Adam Becker [00:29:25]: Kosh, thank you very much for coming.

Santoshkalyan Rayadhurgam [00:29:27]: It's a pleasure from you. All right, thanks, everyone.

Adam Becker [00:29:31]: Back soon.
