Does AgenticRAG Really Work?
Speakers

Satish Bhambri is a Data Scientist at Walmart Labs, previously Sr Data Scientist at BlueYonder and Distinguished Fellow & Assessor at SCRS, with over 10 IEEE and Springer publications and contributions in astrophysics and quantum computing. His work spans AI, deep learning, conversational AI, LangChain, RAG pipelines, and recommendation systems. Satish has contributed to research in machine learning, astrophysics, and quantum computing, with citations from NASA and Harvard. A passionate advocate for innovation, he has judged major hackathons for Y Combinator, WUST, and Innovate 2025. He's also a member of the Shack 15 incubator in San Francisco.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Satish Bhambri is a Sr Data Scientist at Walmart Labs, working on large-scale recommendation systems and conversational AI, including RAG-powered GroceryBot agents, vector-search personalization, and transformer-based ad relevance models.
TRANSCRIPT
Satish Bhambri [00:00:00]: The soul of an agent lies in how it dynamically generates the prompt that is passed to the large language model. For instance, if I just say, hey, this is my table schema, generate a SQL query for this, it'll give you a SQL query. But the problem would be it's going to be highly hallucinated.
Satish Bhambri [00:00:26]: It's been an interesting space so far. We started with something very basic. Let's say somebody asks me what ML is, for instance, what is this thing? Back in high school, if you remember, we would have these simple equations: we're given two data points, (x1, y1) and (x2, y2), and we were asked, okay, what would be the value of y3 at some x3? That's extrapolation. And essentially that's what all of ML is. It's just that instead of a line, now we have such complex data points, such complex curves, such complex contour maps. It's multidimensional, it's not even three-dimensional anymore. We just can't visualize it in a human way. But there are so many dimensions to so much of the data that we have.
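A minimal sketch of the extrapolation idea described above: fit a function to known points, then predict an unseen one. The numbers are made up purely for illustration.

```python
# "ML is extrapolation": fit a model to known (x, y) points, predict a new x.
import numpy as np

# Two known data points: (x1, y1) and (x2, y2)
x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])

# Fit a straight line y = m*x + b (a one-dimensional "model")
m, b = np.polyfit(x, y, deg=1)

# Extrapolate to a new x3 the model has never seen
x3 = 4.0
y3 = m * x3 + b
print(f"predicted y3 at x3={x3}: {y3:.1f}")  # -> 9.0
```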
Satish Bhambri [00:01:12]: And essentially it's all about being able to predict based on what we know, kind of finding the models or mathematical functions that could simulate what we have going on in the real world. So essentially what we are coming down to is creating more and more of those complex things, be it image processing, for example, in the live video feed for the Waymo cars that we see out there; they're able to predict when to stop based on distance calculations. And that's all functions, and that's how we got here to GenAI, which is a big word right now, with huge buzz around it. Interestingly, we started with neural networks. We had CNNs, convolutional neural networks. We had RNNs, recurrent neural networks. RNNs were the ones which were used extensively for NLP. The aim of natural language processing was that, the way we are talking right now,
Satish Bhambri [00:02:11]: We should be able to talk to machines too, and they should be able to understand the context of, let's say, any website, or what somebody's talking about. But the problem there was...
Satish Bhambri [00:02:26]: They did not have much of an attention mechanism. The attention span was so low that after a few words we just did not have an attention window that could go back and relate to what was being said.
Dimitrios Brinkmann [00:02:37]: It was like a goldfish.
Satish Bhambri [00:02:38]: Yeah, essentially.
Satish Bhambri [00:02:43]: And that did improve when LSTMs came into the picture, which are long short-term memory neural nets, and then GRUs came into existence, but still the attention span was so low. And that was primarily because of the architecture that we were following. What we were trying to do was induce gates in these neural networks which could ascertain how much of the context from the input that is coming through we want to retain and how much we want to forget. But those control gates were not efficient enough. There is no way we could have achieved what we have until and unless the 2017 revolution happened. There was this paper you would have heard of, "Attention Is All You Need." The moment it came through and the transformer architecture was introduced, it just changed the entire game. And here we are now.
Dimitrios Brinkmann [00:03:35]: One thing that I'm thinking about with this model evolution that you just broke down is what the next architecture is going to be. Because there are things that happen with transformers that are not things we want, like hallucinations, right? And so sometimes people will argue, well, that's a feature, not a bug. And others will say, well, you know, we really want it to be reliable, and if you're going to have hallucinations, then it's not going to be reliable. But at the same time it's AI, it's machine learning, it is probabilistic.
Satish Bhambri [00:04:16]: I don't know if hallucinations are the feature, but you truly said it, it's all probabilistic models, right? And it essentially depends on how even the transformer architecture evolved. So from LSTMs, as we were just touching on, we essentially found out, okay, what if instead of using these gates to control the attention span, we were to actually have some self-attention mechanisms? From there these architectures started evolving. We had encoder-decoder, then we had encoder-only models and decoder-only models. It started from 2017 and eventually it just grew so much. So for instance, encoder-only models were aimed at just understanding the context of what a particular text is talking about. And we had these query, key, and value vectors, and we were using softmax functions for assigning probabilities to these words. And then in the decoder we wanted to make sure that when it's predicting, we have some masking so that it's not biased by already pre-learned words. So that, for example, when you and I are talking, before I register the input, I should not have any bias induced in me.
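A toy sketch of the query, key, and value self-attention and the decoder masking just described, using plain NumPy. Shapes and random weights are illustrative only, not any production architecture.

```python
# Scaled dot-product self-attention with an optional causal mask,
# the mechanism behind the encoder/decoder models discussed above.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=True):
    seq_len, d = x.shape
    rng = np.random.default_rng(0)
    # Learned projections in a real model; random here for illustration
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    scores = Q @ K.T / np.sqrt(d)          # similarity of every token to every token
    if causal:
        # Mask future positions so a token cannot attend to words after it
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)

    weights = softmax(scores, axis=-1)      # attention probabilities per token
    return weights @ V                      # context-mixed representations

tokens = np.random.default_rng(1).normal(size=(4, 8))  # 4 tokens, dim 8
print(self_attention(tokens).shape)  # (4, 8)
```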
Satish Bhambri [00:05:34]: And that was the aim. But the problem is that that bias somehow gets induced anyway, because humans are also using these systems, and that's where these hallucinations actually come from even more. And since it's a probabilistic model as such, the softmax functions just produce probabilities. But moving ahead to the architectures which we are using in production right now, RAG was a big change. Yeah, yeah, that was the one. Yeah.
Dimitrios Brinkmann [00:05:59]: So it wasn't on the model level. It was more on the system level in that way. And so it went from like, all right, cool, we've got this really important model, but now how do we architect the system around it? And so last year, RAG was all the rage. I think probably the last two years, RAG was very important. And that was like the next step in the evolution. We could say we had the ChatGPT moment and then we started playing with it. We started using tools and chaining together prompts. And then RAG became very popular.
Dimitrios Brinkmann [00:06:29]: And then you moved on to agent RAG or agentic RAG. Right, but let's talk about RAG and what you were doing there and why it wasn't enough.
Satish Bhambri [00:06:40]: Yeah, I think this is a very interesting domain, specifically because when the hallucination started to come in, now we have something that understands the context of language. Now we have a model that wants to talk, but it just doesn't know what to talk about, essentially.
Satish Bhambri [00:06:59]: And RAG is essentially... I always envision it as being like a kid; you're watching him or her give an exam, and it's an open-textbook exam. So that kid is referring to the books and getting you what you're asking for. But at the same time, we need to check for two things: if it's referring to the right books when the questions are being asked, as well as, when it's answering, how much of the context that it is giving actually makes sense. So RAG essentially came into the picture where we would ask, let's say, GPT a random question, and it would start giving us irrelevant results: semantically correct, syntactically correct, but contextually not so relevant.
Dimitrios Brinkmann [00:07:45]: Yeah, and especially, I think I remember RAG became very popular just because of the fact that people wanted up to date information. And so then it was like, all right, well, we're just going to throw all the most recent information into a vector store and then we'll use that. And anytime there's something that comes up, we can search the vector store and get that information.
Satish Bhambri [00:08:07]: Right? Yeah. And that was really, I would say, the genesis of RAG in that sense. And moving on to where we are now with agentic AI, or what we call RAG agents, it's actually being used across way more contexts than it was even thought of for in the beginning. So as we were discussing, when RAG started, the aim was to actually ground the LLMs, you know, let them make much more relevant decisions based on the vector stores that we have, and, for example, the ones that we are implementing now. And it could be a very generic case study. Like if I want to talk to my databases, let's say. Right. And I need to generate a SQL query.
Satish Bhambri [00:08:53]: An LLM can generate a SQL query, but would it be relevant at all? No. Again, hallucinations come in. It doesn't even have a lot of context. And more than that, how can we even make sure it's safe? Because now if I'm talking to production or staging data, it's risky, because what if some user came in and just added a DROP TABLE statement right there?
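One possible guard against exactly this risk, sketched as a read-only check on LLM-generated SQL before it ever touches a database. The function and patterns are hypothetical, not Walmart's actual implementation.

```python
# Reject anything that is not a single read-only SELECT statement.
import re

FORBIDDEN = re.compile(
    r"\b(drop|delete|truncate|alter|update|insert|grant|revoke)\b", re.IGNORECASE
)

def is_safe_select(sql: str) -> bool:
    """Very conservative check: single statement, starts with SELECT, no DDL/DML verbs."""
    statements = [s for s in sql.split(";") if s.strip()]
    if len(statements) != 1:
        return False
    stmt = statements[0].strip()
    return stmt.lower().startswith("select") and not FORBIDDEN.search(stmt)

print(is_safe_select("SELECT name, revenue FROM sales WHERE quarter = 'Q2'"))  # True
print(is_safe_select("SELECT 1; DROP TABLE sales"))                            # False
```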
Dimitrios Brinkmann [00:09:13]: My job is not to drop tables. Yeah, your job gets more complex for sure. But so why did RAG fall over? Like, where was it lacking? Why did you switch to agentic RAG, and what are the differences between the two?
Satish Bhambri [00:09:30]: So if I were to make an analogous comparison: when the REST architecture came into existence, we started developing these REST services. I remember we started with SOAP services, then RESTful APIs for the standards, then we delved into microservices, basically contextualizing each service very specifically to the use case so that it's easier to scale, it's much more contextually relevant, and it gives us much more relevant results with respect to the architecture and reusability. And here in agentic AI, similarly, when we were grounding the RAGs for very generic use cases, we were getting great results. But now, in multi-agent systems, we are creating those agentic RAGs for very specific use cases. And now these agents are talking to each other rather than having one agent for the whole thing.
Dimitrios Brinkmann [00:10:26]: Okay, I see. So the idea is trying to like break it down into microservices.
Satish Bhambri [00:10:31]: Yes.
Dimitrios Brinkmann [00:10:32]: And say you're an agent that has access to this vector database and another agent can almost like use you as a tool.
Satish Bhambri [00:10:42]: Yeah, yeah.
Dimitrios Brinkmann [00:10:43]: And so the tool is search and retrieval type tool, but on our data in some place.
Satish Bhambri [00:10:51]: Yeah, yeah. And that also helps us in terms of specializing the context of a specific agent. It's a separation of concerns, and we can have as many layers for security or for enhancing or enriching the data in between. And it becomes individual bots that are just taking care of these things.
Dimitrios Brinkmann [00:11:13]: And are you making each agent... only giving it access to one database? So it's like, that is the marketing database, and you can call that agent and it can retrieve everything and then enrich or summarize the answer and then give it back to the main agent.
Satish Bhambri [00:11:38]: Yeah, somewhat in those terms specifically. So the aim here generally is that we create agents in a way that is very scalable, while at the same time making sure we have data governance in place. Because, let's say, we also have different dialects of data across different systems, and one of the ways we can implement that is using a SQL security layer which can transform queries, but at the same time sometimes that's not desirable because of the different systems in place, separation of concerns being there. So what we do is essentially design very specific bots for very specific use cases, and it also helps with cost optimization. What if this was not to be used in production but was to be used for...
Satish Bhambri [00:12:29]: When I say production I mean for the outer world, but more for internal efficiency of the workforce, let's say, or onboarding. So do we really need those kinds of resources to put into those agents, versus the ones which are going to be consumer-facing? So these help in making those kinds of decisions.
Dimitrios Brinkmann [00:12:51]: Yeah, exactly. There are a lot of different trade-offs that you can be okay with, I imagine, if it's just internally facing, and on...
Dimitrios Brinkmann [00:13:01]: So many different vectors, probably on quality and on speed. Or maybe you're like, no, we have to get it really fast. But with reliability, I would imagine you just have a lower bar if it's internal, because the internal user is going to be much more forgiving than the external user.
Satish Bhambri [00:13:23]: Oh, 100%.
Dimitrios Brinkmann [00:13:25]: Yeah.
Satish Bhambri [00:13:25]: And hopefully.
Dimitrios Brinkmann [00:13:26]: Hopefully. Yeah, exactly. And so then... All right, so I'm kind of understanding it. I think the thing that I ask myself, if I'm understanding this correctly: you have agents that are able to query databases. Why not just make an MCP server for the database?
Satish Bhambri [00:13:48]: We possibly could. That could be a way to go, but the main question that comes through is, what use case are we serving? Essentially, if, let's say, we have to let the business talk to a database and they want certain reports in just certain ways and we want to expose that, would we want to go the route of an MCP server and create a whole other layer for it? It's all about optimizing how much resource and information we should put into...
Satish Bhambri [00:14:26]: Achieving a specific use case at the end of the day. For consumer-facing, probably, that might make sense.
Dimitrios Brinkmann [00:14:32]: For me what it sounds like is you have different use cases and they're very verticalized. So maybe there's a team or there's a suite of folks that need information and you create an AI product that can do that thing really well. So create dashboards from the sales data or the financial data, whatever it may be.
Dimitrios Brinkmann [00:14:57]: And then you have another product, and it's in a way separated, and so you have separation of concerns, which is really good. But at the same time you have to create a whole new product around it. Is that it? I imagine some of the pieces are going to be reusable and you can say, all right, well, this is similar, we just need to change this and tweak some prompts, as if it were that easy, and give it access to this database instead of that database. But is that how it is? It's like each individual product, and then you have to upkeep the products for the internal teams.
Satish Bhambri [00:15:36]: Yeah, it's almost in that direction specifically because.
Satish Bhambri [00:15:41]: So we definitely can't discuss details of the internal architecture, but at a very high level, that's essentially what it boils down to.
Satish Bhambri [00:15:53]: Let's say I'm building a bot for one of the teams in Slack which aims at onboarding, for instance, and similarly there is a bot which works with different databases, let's say in sales. These two are going to have some interchangeable components, but their vector DBs are going to be different, for instance, because if it were just creating SQL queries, we don't need something like Vertex AI Matching Engine or Milvus for that matter. We can use something very lightweight like FAISS, which is Facebook AI Similarity Search. Still...
Dimitrios Brinkmann [00:16:27]: Running hard.
Dimitrios Brinkmann [00:16:29]: Created all those years ago. And it still is. Just amazing how well it works. Yeah, but I, I, so I understand that it's.
Dimitrios Brinkmann [00:16:40]: You choose what you need to use also depending on the use case, because the use case almost dictates what kind of.
Dimitrios Brinkmann [00:16:50]: Necessities you're going to have.
Satish Bhambri [00:16:53]: Yes, yeah.
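A minimal sketch of the lightweight option discussed above: a small FAISS flat index over a handful of schema docs. The embedding model and document contents are assumptions for illustration, not the production setup.

```python
# Tiny corpus + exact FAISS index: enough for "talk to my schema" use cases.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

schema_docs = [
    "Table sales(order_id, product_id, revenue, sold_at): one row per sale.",
    "Table products(product_id, name, category): product catalog.",
    "Table customers(customer_id, name, region): customer master data.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")          # small, CPU-friendly embedder
doc_vecs = model.encode(schema_docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])             # exact inner-product search
index.add(np.asarray(doc_vecs, dtype="float32"))

query = "how much revenue did we make this quarter?"
q_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q_vec, dtype="float32"), k=2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.2f}  {schema_docs[i]}")
```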
Dimitrios Brinkmann [00:16:54]: And so some of it can be, oh well, you're, we're going to need access to the same databases because there's some overlap. But I imagine you're not using the same agents for those, you're creating new agents because then there could be some context mixing and that could be bad.
Satish Bhambri [00:17:16]: Oh yeah, yeah, that's so true. So the soul of an agent lies in how it dynamically generates the prompt which is passed to the large language model. For instance, if I just say, hey, this is my table schema, generate a SQL query for this, it'll give you a SQL query. But the problem would be it's going to be highly hallucinated. It would not know where to join; it would not know how to join or which particular columns to join on. And the aim is to essentially eradicate that middle layer where we have to constantly describe these things. So then it boils down to, okay, how do we create that? How do we make sure that our system or our agent creates this dynamic prompt which is passed to the LLM? Because the LLM is going to do its hallucination on its own side for sure.
Satish Bhambri [00:18:07]: We can't stop that.
Dimitrios Brinkmann [00:18:09]: Yeah, it's like playing telephone a little bit. But how are you making the SQL queries then if you're not letting the LLM generate it?
Satish Bhambri [00:18:18]: Oh, the prompts are actually dynamically generated. SQL queries are definitely generated by the LLMs.
Dimitrios Brinkmann [00:18:25]: Okay.
Satish Bhambri [00:18:25]: So the prompts which are passed to the large language model, those are basically generated based on the documents which are retrieved by the retriever. And those documents are retrieved in this high-dimensional space through semantic similarity searching. And that's where the vector DB's role comes into the picture so much. Like, what kind of vector DB do you want to use? Do we really want to go with something heavyweight? Let's see, if I want to search across all of the, like, 40 million products of Walmart and I want to find the products for, let's say, all the users, it's humongous data. Even creating embeddings for that, my gosh, it just bogs down the system.
Dimitrios Brinkmann [00:19:10]: So expensive, I imagine. Yeah, I just, I can't even fathom how much data that would be.
Satish Bhambri [00:19:16]: Yeah.
Dimitrios Brinkmann [00:19:17]: And so then you have to decide.
Dimitrios Brinkmann [00:19:21]: What subset of the data you want, and then throw it into a vector DB, and you're spinning up new vector DBs for all these different use cases.
Satish Bhambri [00:19:31]: Yeah. So we make a choice based on what industry standard is being used and why, how much the cost is, and whether we have an open source solution for it. And sometimes open source solutions are available, like Milvus DB. Great vector DB. But the problem is it can have a little bit of higher latency, even with the same indexes, like IVF Flat or IVF PQ, as Vertex AI. But the thing there is, if I were to generate, let's say, recommendations for my customer base, and if I were to generate them once a day, I can probably use an open source one.
Dimitrios Brinkmann [00:20:12]: Yeah.
Satish Bhambri [00:20:12]: Why would I want to spend on something, you know, which is really costly for me? And it's again very use-case specific. But if it were an online serving model, where, oh, I'm generating them every half an hour, the pipeline is constantly running, or every 15 minutes, in that case, yeah, I would have to shell out that cost, and then it would make sense to use the very low-latency, very highly complex indexable vector DBs that we can actually use.
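A rough sketch of the cost and latency knob being described here: an IVF-PQ index in FAISS trades a little recall for much cheaper, faster search than exact search at scale. The sizes are toy numbers, not Walmart's catalog.

```python
# Approximate search: cluster the vectors (IVF), compress them (PQ),
# then only probe a few clusters per query.
import faiss
import numpy as np

d, n = 128, 100_000
rng = np.random.default_rng(0)
xb = rng.random((n, d), dtype="float32")                 # pretend product embeddings

nlist, m = 256, 16                                       # coarse clusters, PQ sub-vectors
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)      # 8 bits per sub-vector
index.train(xb)                                          # learn the coarse clusters
index.add(xb)

index.nprobe = 8                                         # clusters visited per query:
xq = rng.random((5, d), dtype="float32")                 # higher = better recall, slower
distances, ids = index.search(xq, k=10)
print(ids.shape)  # (5, 10): ten nearest product ids per query
```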
Dimitrios Brinkmann [00:20:51]: If you got the chance to just start from scratch and build something, how would you go about it?
Satish Bhambri [00:20:58]: Okay, let's take an example. Give me an example like what would we need to build?
Dimitrios Brinkmann [00:21:02]: What would we need to build? Is there something.
Dimitrios Brinkmann [00:21:06]: Is there something that you feel is uniquely valuable in the E commerce space?
Satish Bhambri [00:21:15]: There are a lot of things. I would say the two most prominent examples that come to me are... so, sometimes when we launch a new product, for instance,
Satish Bhambri [00:21:26]: Or a new customer experience, we generally go about doing multi-armed bandits or A/B testing. Yeah. But sometimes we already have so much of a...
Satish Bhambri [00:21:38]: Good control experience.
Satish Bhambri [00:21:41]: And that we don't want to sway away from because it can cost us potential users.
Satish Bhambri [00:21:49]: And if we try things out even in a multi-armed bandit way, like we use Thompson sampling to sway 1 or 2% of the users, are we really willing to take that risk in that case?
Dimitrios Brinkmann [00:22:00]: Because it's like you have so much, you don't want to lose what you have. You're playing defense in a way.
Satish Bhambri [00:22:08]: Yeah, it's the explore/exploit trade-off, basically.
Dimitrios Brinkmann [00:22:12]: Okay, yeah, that's fascinating to think about that. It's not like yeah, you can't be willy nilly because you're already so optimized.
Satish Bhambri [00:22:20]: Yeah. So then, in the explore/exploit trade-off that we generally go around here, we think in a way that, okay, I'm going to go ahead and use my control group, which is how it is right now, and I'm going to exploit it, but I might explore with, like, 1%. It's definitely a cost to the business, but it can yield a lot more. And it's very dynamic in how we'd go about it. This is called Thompson sampling, basically. Right. So we just sway the users like that.
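A small sketch of Thompson sampling for the explore/exploit trade-off described above, with a Beta posterior per arm. The conversion rates are simulated, not real data.

```python
# Thompson sampling: sample each arm's plausible conversion rate from its
# posterior, route the user to the best sample, update with the outcome.
import numpy as np

rng = np.random.default_rng(42)
true_rates = {"control": 0.050, "new_experience": 0.055}   # unknown in practice
successes = {arm: 1 for arm in true_rates}                  # Beta(1, 1) priors
failures = {arm: 1 for arm in true_rates}

assignments = {arm: 0 for arm in true_rates}
for _ in range(50_000):                                     # one simulated user per step
    sampled = {arm: rng.beta(successes[arm], failures[arm]) for arm in true_rates}
    arm = max(sampled, key=sampled.get)                     # route user to best sample
    assignments[arm] += 1
    if rng.random() < true_rates[arm]:                      # observe conversion
        successes[arm] += 1
    else:
        failures[arm] += 1

print(assignments)  # most traffic concentrates on the better arm as evidence accumulates
```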
Satish Bhambri [00:22:56]: But another way that we can possibly try is, let's say we launch something, some recommendation model, and we want to try it or direct users to it, but we want to do that after they are done experiencing the current product.
Dimitrios Brinkmann [00:23:10]: After they paid money. After they paid money. The upsell can be something that is experimental.
Satish Bhambri [00:23:17]: So there, I remember, we once used something, a kind of a bot, you know, and that was the first experience of us trying out RAG agents, essentially. In that specific use case, for instance, we were actually going through a lot of products, a lot of, you know, embeddings from the user data. And we could not have gone with something lightweight. So at that time we did explore very specific vector DBs like, you know, Vertex AI or, you know, Milvus DB. If a use case is like that, then yes, absolutely, I would go with those ones. But if it's a use case, something like, you know, I want to generate...
Satish Bhambri [00:24:03]: Reports for the business, where I don't have much schema data, you know, like it's barely in kilobytes, in that case I would use something very lightweight and open source, which is out there. All I need to make sure is that that lightweight vector DB is able to pick up the right context at the right time by doing the right similarity matching, be it cosine similarity, be it Pearson correlation, however it finds it. Pearson correlation is one of the really interesting ones when we think about user reviews and all now.
Dimitrios Brinkmann [00:24:39]: So that's on the Vector DB side. What about just architecting it? Architecting the whole system. You had a blank slate and you come into a startup and it's like, okay, sweet, we want to build this product. How do you go about that?
Satish Bhambri [00:24:57]: Excuse me.
Satish Bhambri [00:25:01]: That would of course very much depend on the kind of problem that we are solving. But let's assume you're solving a problem which comes through RAG agents or agentic AI. Yeah. So first of all, definitely, it would be what kind of data we have. Is it textual data? Is it image data? What kind of problem are we trying to solve? And let's say it's textual data, for instance: do we have different contexts of the data? Do we have different data governance
Satish Bhambri [00:25:32]: Rules in place? Do we have different geographical locations in place? Europe has huge data privacy requirements compared to...
Dimitrios Brinkmann [00:25:43]: This is a startup, we don't have shit.
Dimitrios Brinkmann [00:25:48]: I imagine that is a beautiful thing if you have the data governance rules and you know everything on also just where the data goes and how you can't do anything with the data unless you follow the processes. Sounds like an amazing place to be in, but I imagine that you get there or you need a team of people to be focusing on that. That's not something that just magically happens, right?
Satish Bhambri [00:26:19]: Yeah, that's true, that's true. But essentially when I'm trying to build up a prototype, I would want it to be scalable to a point where whenever, because those things are going to come through down the road. And I would want to be prepared with my architecture to be able to include that.
Dimitrios Brinkmann [00:26:34]: Oh, interesting.
Satish Bhambri [00:26:35]: Yeah. So in that case, I would go about building small agents for very specific use cases and then routing the incoming queries based upon what that specific context is. So that...
Satish Bhambri [00:26:50]: The agents which are catering to very specific data sets keep their scalability, while at the same time making sure that the routing that is happening produces very optimal dynamic prompts, which can then be used to query any large language model, tune it or fine-tune it the way that we want, at any temperature that we want.
Dimitrios Brinkmann [00:27:12]: And that routing happens with an LLM, or is that just a router? It's part...
Satish Bhambri [00:27:19]: Of the RAG agent itself. So it's an agent without, basically, the last part, which is the generator. So it's in the augmentation part where essentially we have a query coming in, and then we are routing based on what kind of context we want to route that query into, and then specifically querying those specific agents, in a sense, to generate the further dynamic LLM prompts that we would use.
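A hedged sketch of the routing step described here: embed the incoming query and send it to the most relevant specialized agent before any generation happens. The agent names, descriptions, and embedding model choice are invented examples.

```python
# Route a query to the closest agent description in embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

agents = {
    "sales_sql_agent": "Answers revenue, orders and product sales questions from the sales database.",
    "hr_agent": "Answers employee performance, onboarding and HR policy questions.",
    "onboarding_bot": "Helps new hires with setup steps, tools and internal documentation.",
}
names = list(agents)
agent_vecs = model.encode(list(agents.values()), normalize_embeddings=True)

def route(query: str) -> str:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = agent_vecs @ q                     # cosine similarity on unit vectors
    return names[int(np.argmax(scores))]

print(route("How did we do on revenue this quarter?"))   # likely sales_sql_agent
print(route("Where do I find the laptop setup guide?"))  # likely onboarding_bot
```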
Dimitrios Brinkmann [00:27:49]: Are there use cases that you.
Dimitrios Brinkmann [00:27:55]: Particularly like and have seen like a lot of.
Dimitrios Brinkmann [00:28:01]: Usefulness with?
Satish Bhambri [00:28:03]: Yes, absolutely. So of course, data analytics is one part. You know, what used to happen before was like business would get back to engineers and they would be like, okay, we want these kind of reports. Of course they would have their dashboards and like Power BI would be there. Right. But now they can just straightaway talk to these databases essentially. And that's a huge, huge win for.
Dimitrios Brinkmann [00:28:29]: Any... for the data team that doesn't have to service those requests anymore, that's for sure. Yeah, yeah. I remember talking to Donay about this because she built a data analyst agent. And one thing that she said was the hardest part in building out the agent, so that it gave correct answers and understood the context, was they had to build out a whole glossary of terms. Since a lot of what you say when you are speaking to another person uses this lingo. And even if it isn't marketing lingo, it is still fuzzy in the way that our company or our team describes it. So an MQL in marketing terms is a marketing qualified lead, and at this company we describe it as someone who has, you know, downloaded the ebook.
Dimitrios Brinkmann [00:29:26]: But at another company, it's not until you download the ebook and you come to a live event, or you reach out to sales because you went to a webinar. And so even with the same term, the same word, it's very loaded, and that happens across the board. And so, like, did you do a glossary thing like that?
Satish Bhambri [00:29:50]: Yeah. And that is a very interesting problem, to be honest. You know, even within the company, different teams have different lingo for different kinds of things sometimes. But yeah. One of the ways I got around that was essentially defining the schema docs which were being passed to, let's say, the RAG agent that we have been building. So you can define the relationships there, or the mapping essentially, which is the glossary. Because as such, the large language model doesn't understand anything. It's just understanding the language.
Satish Bhambri [00:30:28]: We are telling it, okay, what this is, how we need to do stuff, and it's just creating very semantically correct, syntactically correct answers for us. But yeah, so a mapping layer for very specific lingo, which can be mapped to very context-specific terms, becomes absolutely necessary for that.
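A minimal sketch of the glossary or mapping layer being discussed: expand company-specific lingo into explicit definitions before the query is embedded or placed into the prompt. The glossary entries are invented.

```python
# Expand fuzzy company lingo into explicit definitions attached to the query.
GLOSSARY = {
    "MQL": "marketing qualified lead: someone who downloaded the ebook",
    "this quarter": "the current fiscal quarter date range",
    "GMV": "gross merchandise value",
}

def expand_lingo(query: str) -> str:
    """Append definitions for any glossary terms found in the query."""
    notes = [f"{term} means {meaning}" for term, meaning in GLOSSARY.items()
             if term.lower() in query.lower()]
    return query if not notes else f"{query}\n(Glossary: {'; '.join(notes)})"

print(expand_lingo("How many MQLs did we get this quarter?"))
```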
Dimitrios Brinkmann [00:30:52]: And actually the other piece, I think that could get tricky and I would love to hear how you deal with it is that.
Dimitrios Brinkmann [00:31:00]: You get natural language questions like, how did we do this quarter? Which is like, what do you mean by that?
Satish Bhambri [00:31:10]: Right.
Dimitrios Brinkmann [00:31:10]: And so maybe the agent can come back and say, like, are you talking about revenue? And it's like, yeah, are you talking about revenue generated in the whole company or just your team? There are so many variables that when you talk, it's not clear.
Satish Bhambri [00:31:30]: True.
Dimitrios Brinkmann [00:31:30]: And so how, how are you dealing with that? That the agent isn't just giving you stuff? Because the hardest thing in the world is getting the agent to say, I don't understand. Right. Like it'll just come back with like, oh, here's how we did this quarter and you're kind of scratching your head. Like, I don't know if that's actually what I was looking for.
Satish Bhambri [00:31:50]: So true. And thanks, Dimitrios, for that, because that really invoked two thoughts in me. One is, I don't know if you've heard of Dr. De Kai. Yeah, he actually has just recently published a book. It's called Raising AI. Very interesting.
Satish Bhambri [00:32:08]: It's about how we need to work along with AI. And he's a professor at Stanford here. Really interesting. We'll definitely get into it. But from the technical side of it, going from abstraction to a very specific use case, that has been the biggest challenge for us at the end of the day. And the reason why a large language model or an agent would never say "I don't know" is because it works on our confirmation bias. At the end of the day, it just wants to answer no matter what.
Satish Bhambri [00:32:42]: So specifically, one of the ways that I was able to achieve it was contextualizing the prompts that we are developing dynamically. Not the ones which we write statically, but the ones which are being developed dynamically in those agents. And that's why that becomes the soul of the agent. Because if it's able to go from that abstraction, okay, how did we do this quarter, to okay, what do we need: revenue, sales, products sold. All this information can be pulled out, so it will give us very context-specific results. And how we do that depends on a couple of things. One is how our data schema is structured. And when we say this quarter, it's just going to pick up that date range.
Satish Bhambri [00:33:33]: How did we do? We can always map these queries to revenue, products sold, let's say sales, or employee performance. It could be anything. And that's where very context-specific agents come into the picture. If I am, let's say, the business and I'm just focused on the sales side of it, versus if I'm on an HR team working on employee performance, I would want two different answers to that. And hence, even in the reusability of those agents, we would have to configure them or tweak them in some way. Although we can deploy the same ones and people can constantly ask and they'll get a result, we would want them to be very specific and precise to what they are for.
Dimitrios Brinkmann [00:34:12]: So this is like a little bit of a personalized agent.
Satish Bhambri [00:34:15]: Yes.
Dimitrios Brinkmann [00:34:16]: So it knows that I'm on the HR team. And when I ask about how we did this quarter, it's like employee engagement.
Satish Bhambri [00:34:24]: Essentially. Yeah.
Dimitrios Brinkmann [00:34:26]: If I'm on the sales team, it's like how much did we sell? But maybe if I'm in the C-suite, I'm looking more at a holistic view of how all the numbers are.
Satish Bhambri [00:34:36]: Yeah. And if you imagine, let's say we are agents, right? What context do I have? I'm just going to go and look up, okay, what documents or what index I have in my DB. This is the query that came in. There is an embedding of this query, like, let's say, 0.12, 0.34. I'm going to go and see in the vector space what the closest points to this are. I pick that up, I bring it back.
Satish Bhambri [00:35:04]: Now, those specific points in the semantic search, they could mean very different things for different agents. But those points which are suspended in there, that depends on, of course, the embedding models that we are using. And because we cannot perform semantic search when coming from different embedding spaces specifically, because what happens is it's a way to represent our textual, pictorial, or any nonlinear data in a numerical form, essentially. And when we are picking it up from the schema docs, let's say, or from any documents that we are feeding into the RAG agent, those schema docs define essentially what context or which, let's say, tables to pick that data from. If I'm picking up from, let's say, a sales table, it's going to give me more context around sales. If I'm picking up from, let's say, employees performance table, it's going to give me more about that. So inherently as an agent.
Dimitrios Brinkmann [00:36:07]: Wait, sorry, I missed that. It was from vector space that it gives you that information.
Satish Bhambri [00:36:13]: Yes, because those vectors are essentially mappings, into numerical terms, of what these documents or schemas are.
Dimitrios Brinkmann [00:36:22]: So you're enriching the schema with the vector, with basically vector space.
Satish Bhambri [00:36:28]: Yeah, yeah, essentially.
Dimitrios Brinkmann [00:36:30]: Okay.
Satish Bhambri [00:36:31]: So it's like going into this, you know, multiverse and then just finding out, okay, I just need to pick something up. But I don't know what it's going to map to, because the mapping has been done based on what was fed in before. And what was fed in is very much dependent upon what kind of relationships we have defined beforehand, how the index has been created. And the index creation essentially happens as the first phase of this agentic RAG.
Dimitrios Brinkmann [00:36:58]: Wow, okay, that's super cool. I haven't heard of adding a little bit of extra context so that it understands. And it's almost like you're saying grounding the information. Again, it's going back to like RAG was all about grounding the models, and now we're grounding the agents with a little bit of extra vector space, semantic vectors and all that stuff. Tell me more about these dynamically created prompts.
Dimitrios Brinkmann [00:37:25]: How does that work?
Satish Bhambri [00:37:28]: Okay, let's take an example, and we'll go back to our example: how did we do this year? Yeah, or this quarter. Right.
Satish Bhambri [00:37:36]: And let's say we have three tables: we have a sales table, we have an orders table, and we have, let's say, a products table. Now, in the schema docs which are passed to the RAG agent, from which the index is being created, we are defining that, okay, these are the sales, we tie them to the products with a key, and there's a key between sales and products, and a third one, let's say customers, for instance, right? So the aim is, when we say how did we do this year, it would be based upon what we can tell about what customers bought, let's say, this quarter, or what products were sold,
Satish Bhambri [00:38:25]: And how much revenue was generated. So in that part where we are defining this essential schema of the tables in the docs, and the relationships inside the docs, when the embeddings are being generated, those embeddings are also encapsulating the semantics of the relationships between these three different tables. And when I ask this, and if this is the only information I have, I have not told the RAG to do anything else. When I ask about this quarter, it picks up the time range, finds out what data it can fetch from these tables in that time range. And then we fine-tune how the generation of these dynamic prompts is going to happen in the retriever methods and during our indexer as well. And from there we can actually be very specific about what we want in the reports that come out.
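A sketch of how such a dynamically generated prompt might be assembled from retrieved schema docs plus the resolved date range for "this quarter". The helper names and doc format are assumptions, not the actual internal implementation.

```python
# Build a grounded prompt: retrieved schema docs + an explicit date range.
import datetime as dt

def current_quarter_range(today: dt.date) -> tuple[dt.date, dt.date]:
    q_start_month = 3 * ((today.month - 1) // 3) + 1
    start = dt.date(today.year, q_start_month, 1)
    end_month = q_start_month + 2
    last_day = (dt.date(today.year + (end_month == 12), (end_month % 12) + 1, 1)
                - dt.timedelta(days=1))
    return start, last_day

def build_prompt(question: str, retrieved_docs: list[str]) -> str:
    start, end = current_quarter_range(dt.date.today())
    schema_context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "You write SQL for the schema below. Use only the listed tables and joins.\n"
        f"Schema context:\n{schema_context}\n"
        f"Interpret 'this quarter' as {start} to {end}.\n"
        f"Question: {question}\nSQL:"
    )

docs = ["sales(order_id, product_id, revenue, sold_at) joins products on product_id"]
print(build_prompt("How did we do this quarter?", docs))
```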
Dimitrios Brinkmann [00:39:19]: Interesting. So you're getting it, I'm trying to think about like the step by step nature of this. You're getting the query how do we do this quarter? There is a model that receives that. It also will go and search vector space.
Dimitrios Brinkmann [00:39:42]: And then it outputs a prompt to go for another agent to go and use.
Satish Bhambri [00:39:48]: Yes. Yeah.
Dimitrios Brinkmann [00:39:50]: And in the output it's where you're very specific because you have the information of. All right, this is whatever this is the relationship between these three tables. Here's what I want you to look for specific agent. Here's what you need to focus on. And then go and find that and come back. And then after it finds that, it comes back and it tells that master agent. Here's what I got.
Satish Bhambri [00:40:18]: Yeah, yeah.
Dimitrios Brinkmann [00:40:19]: And then the master agent will have all of this and it's input and then summarize it and then output something.
Satish Bhambri [00:40:27]: Yeah. And it could even use that to, let's say, retrieve or do more. So we have this output now, and we are of course going to have these evals in between. Right. Okay, now let's say we want to relate it to some other set of databases, and then the other agent picks it up and goes and creates a dynamic prompt based on that. So it's a multi-agent system just communicating with each other, at the same time making sure they're very contextually relevant to their own specific set of questions. But it's all happening because we are able to create very relevant, dense embeddings in these vector spaces. And let's say a user query comes in; that is also actually
Satish Bhambri [00:41:12]: Put in the vector space, or projected into the vector space, that being the right word. And then from there the picking is of, what is the relevant context that can match this query? So we don't even have to manually define a lot of things for RAG agents until and unless it's absolutely necessary because it's not picking things up. Which is a very, I would say,
Satish Bhambri [00:41:37]: To-and-fro process sometimes. And that's why prompts alone are not very successful in getting large language models grounded. Whereas RAG agents introduce this additional step of suspending our relevant information in this generated index. A query comes in, it's projected, now we find the match, we come back, we generate that dynamic prompt now with that context, and that gives us the exact results that we want.
Dimitrios Brinkmann [00:42:05]: Yeah, it gives you a much more rich field to play from. It's so much more enriched with all of that information as opposed to just my simple words of like, how did we do this quarter?
Satish Bhambri [00:42:19]: Yeah, yeah, essentially, yeah, like, I guess.
Dimitrios Brinkmann [00:42:22]: I'm lost on how.
Dimitrios Brinkmann [00:42:26]: Putting the query into vector space, right. And then seeing what it is semantically similar to.
Dimitrios Brinkmann [00:42:34]: How does it already have all of this stuff that it's semantically similar to?
Satish Bhambri [00:42:40]: Makes sense.
Dimitrios Brinkmann [00:42:42]: All right. I was hoping it did, but I didn't know because I was confused myself for a second there, to be honest.
Satish Bhambri [00:42:48]: No, no, no, that's a very valid actually point because what happens is.
Satish Bhambri [00:42:54]: So before we even started asking our agent any questions, what we did was build an index. That index was built on some documents, and "documents" is a very abstract term for any set of data that we would use. Now imagine a three-dimensional space where we just have x1, y1, and z1, and we have essentially put in just the data of our two tables, or, for that matter, this conversation. Let's say we want to go back and review this conversation and somebody wants to ask questions. So we have created a kind of analog for this. And what we have done is actually create embeddings from it, where, let's say, each question and each answer represents one document. Now, when we are creating embeddings for these different documents, we are suspending them in the vector space.
Satish Bhambri [00:43:50]: Now let's say somebody comes up and asks, okay, when Dimitrios asked this, what was the answer? Or did these two questions make sense? This query is essentially going to go into the same vector space, and embeddings are going to be generated using the same model with which the index was created from our conversation before. Then it's going to go ahead and perform that semantic similarity search, and we can tweak it to use any similarity: it could be Euclidean, cosine, Pearson, any similarity, and then it finds the closest matches. And we can also define how many neighbors we want it to find the similarity to, and that's like a hyperparameter that we can tune. If we include too much context, it can actually waver off, and with too little context it can also waver off. But generally, for example, for me, when I did it in the RAG agents, five nearest neighbors was a really good approximate nearest neighbor search number. It was a small use case essentially.
Satish Bhambri [00:44:57]: And yeah, so in that case, when the new user query comes in, it's suspended in that same space and then it performs that semantic similarity search and gives us back that dynamic prompt which is passed to the large language model.
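A small numeric illustration of the retrieval knobs just mentioned: the similarity function (cosine versus Pearson here) and k, the number of nearest neighbors, both treated as tunable hyperparameters. The vectors are made up.

```python
# Compare two similarity choices and pick the top-k matching documents.
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson_sim(a, b):
    return np.corrcoef(a, b)[0, 1]   # Pearson is cosine on mean-centered vectors

doc_vectors = np.array([
    [0.9, 0.1, 0.0, 0.2],   # doc 0
    [0.1, 0.8, 0.3, 0.0],   # doc 1
    [0.2, 0.2, 0.9, 0.4],   # doc 2
])
query = np.array([0.8, 0.2, 0.1, 0.1])

def top_k(similarity_fn, k=2):
    scores = [similarity_fn(query, d) for d in doc_vectors]
    return np.argsort(scores)[::-1][:k]          # indices of the k best matches

print("cosine :", top_k(cosine_sim, k=2))        # doc 0 ranks first here
print("pearson:", top_k(pearson_sim, k=2))
```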
Dimitrios Brinkmann [00:45:09]: And you're always updating it with the new queries. You're always adding the new queries to the vector space.
Satish Bhambri [00:45:14]: Yeah, so that's... So what happens is, let's say I did not get the results that I wanted. I'm going to go back to my docs, I'm going to check, did I define the schema correctly? What went wrong? Why wasn't it able to find a relevant context? And I'm definitely going to update those docs; I'm going to define another doc which defines the schema or relationships between these questions. So I'm essentially defining these things ahead of time to make sure that whatever answer our large language model gives, it's grounded, it comes from exactly what we want it to do.
Dimitrios Brinkmann [00:45:51]: How are you dealing with the problem of just having too much data and messy data?
Satish Bhambri [00:46:01]: Fortunately, for the use cases that we have worked with so far, the data wasn't too messy. But let's take, for example, what we were talking about in one of our conversations: if it's, let's say, a RAG agent which is talking to databases, it is going to be very straight up. It's going to be the schema of the database, the types of the columns that we have, and what relationships exist between those tables. Very straight up, nothing that needs to be inferred, essentially. But in a normal context of things, we will definitely have to make sure that we have proper data preprocessing in place, where we are cleaning up the data and removing any unnecessary tags, because we wouldn't want our LLM to focus on very unnecessary information.
Dimitrios Brinkmann [00:46:54]: Like it randomly overweights one word and you're like, why did you care about that word? It's not that big of a deal.
Satish Bhambri [00:47:02]: Yeah. And this also comes down to our initial discussion about how the transformer architecture essentially works. It's assigning these probabilities to all these words and doing MLM, which is masked language modeling, predicting the masked word. And if you want certain words to be upweighted more, you have to make sure that, for a very specific use case, we eliminate the parts which are not at all desired or required. Yeah. As such.
Dimitrios Brinkmann [00:47:30]: Well, let's say that some data becomes stale because for some reason or another you don't have the same policy anymore or you don't have this. You realize that, oh, this data actually was incorrect and so we need to change it out. How are you going about swapping things? Because I've heard that is a real pain in the butt when it comes to keeping your vector database up to date.
Satish Bhambri [00:47:59]: Yeah, that is actually a big challenge for sure. One of the ways would be to re-index our database, and it's definitely going to be a costly process.
Dimitrios Brinkmann [00:48:12]: Oh, interesting.
Satish Bhambri [00:48:13]: To create the index again.
Satish Bhambri [00:48:16]: And as the data increases indexing it becomes harder and harder.
Dimitrios Brinkmann [00:48:21]: Yeah, if you only want to change one file, it's like we got to re index. What?
Satish Bhambri [00:48:29]: Yeah. I wonder though, like in that case, how would we essentially go about it?
Satish Bhambri [00:48:38]: One of the ways could be inducing a negative example for something that we don't want to include in our RAG agents. For example, let's say a certain part of the data became stale. We can always include a relationship in the documents which says, if asked for this specific data,
Satish Bhambri [00:48:59]: This is stale data, we would not want it to go there. So even though it might be picked up in the semantic search, the good part would be that when it goes to the large language model, it is going to reject it immediately, because in our prompt we are specifying not to actually answer in that direction. But that's more of a sanity check or a smoke test.
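A hedged sketch of the two mitigations mentioned: tag stale documents with metadata so they can be filtered after retrieval, and add an explicit instruction to the prompt not to answer from them. The field names and documents are invented.

```python
# Filter stale documents after retrieval and ground the prompt explicitly.
from datetime import date

docs = [
    {"id": "policy_v1", "text": "PTO policy: 15 days per year", "stale": True},
    {"id": "policy_v2", "text": "PTO policy: 20 days per year", "stale": False},
]

def filter_retrieved(retrieved: list[dict]) -> list[dict]:
    """Drop anything flagged stale before it reaches the prompt."""
    return [d for d in retrieved if not d["stale"]]

def build_guarded_prompt(question: str, retrieved: list[dict]) -> str:
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in filter_retrieved(retrieved))
    return (
        f"Answer using ONLY the context below (as of {date.today()}).\n"
        "If the context does not cover the question, say you don't know.\n"
        f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    )

print(build_guarded_prompt("How many PTO days do I get?", docs))
```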
Dimitrios Brinkmann [00:49:19]: And it's also okay to do if there's one or two pieces of data, but if it's 20? No, now you can't do that so easily.
Satish Bhambri [00:49:27]: Yeah, it would not be as scalable. But that's a, that's a really good problem.
Dimitrios Brinkmann [00:49:31]: I'm just thinking, because I heard the example of, you have your HR handbook and certain policies get updated, and then what do you do when someone's going and asking their HR questions?
Dimitrios Brinkmann [00:49:51]: It's getting information from which handbook, which policies are getting referenced here. So this was back in the RAG days and I just remember people talking about how hard it was to keep their vector databases.
Dimitrios Brinkmann [00:50:05]: Tidy and up to date. And so I find it interesting for you. It's, it's almost like maybe you're not having to replace as much data so then you don't have that problem. You're only creating new data and then you can just use the time or the date created filter type thing.
Satish Bhambri [00:50:29]: Yeah, yeah. For most of our use cases that we have like worked with.
Satish Bhambri [00:50:34]: We don't have a lot of data which needs to be either removed or replaced. But if it were the case, I would believe that re-indexing would definitely solve the problem. But with the scale of the data, I'm sure we can always re-index certain parts of the vector DB as well, by, for example, just using one document, which is the part we need to update. So those queries would again be suspended, or projected, in that vector space, but with different values essentially.
Satish Bhambri [00:51:08]: And any negative examples could help us to not entertain our previous policies. But that is definitely an interesting problem and I think we'd have to look more with detail.
Dimitrios Brinkmann [00:51:21]: You don't do anything with LLMs and recommenders, do you? Are your recommender systems all very traditional still?
Satish Bhambri [00:51:29]: No, we do use.
Dimitrios Brinkmann [00:51:30]: Really?
Satish Bhambri [00:51:31]: Yeah, yeah, yeah.
Dimitrios Brinkmann [00:51:32]: How are you using LLM for recommenders?
Satish Bhambri [00:51:34]: So, for instance, let's say we have some recipe recommendations and I want to, you know, generate recommendations based on that page. And I would want to extract out, let's say, cooking tools relating to certain recipes, just to give you an example. And that way it's way, way easier to contextualize it using a large language model.
Dimitrios Brinkmann [00:51:55]: But you're just contextualizing it and then you're going to the traditional recommender systems style or it's a hybrid.
Satish Bhambri [00:52:03]: Yeah, hybrid basically.
Dimitrios Brinkmann [00:52:05]: I've heard about this a few times from people. It's like you slap a LLM on top of a recommender system just to make that part so much easier because you don't have to train this model on the whole cooking utensils and everything.
Satish Bhambri [00:52:22]: Yeah, I mean, fortunately for us, you know, our users don't scroll the Walmart page as often as Instagram. So that definitely gives us a little bit of latency headroom. Or, you know, specifically for the products that I have worked with, we have not had as much of an online serving model, but in that case latency is of the utmost importance. But for the use cases that I have worked with, and not to delve into too much detail for the company's confidentiality, yeah, we have used large language models, in essence, trying out...
Satish Bhambri [00:53:02]: Like, what features can we extract from the current recommendations so that we can use those features to recommend the next set of things immediately, and at the same time combine the user feedback to figure out what they are liking and what else they will like, in that sense. Because when we are taking natural text and trying to get the feedback out of an item, or essentially a description of an item whose set of instructions the user is following, there are so many products that could be relevant, and if one product has captured a user's attention, right there we have user engagement. The top of the funnel is already established. Now we have user awareness, and we can essentially use the top of the funnel, a little bit of a marketing term, to leverage: what else can we bring into awareness? As the user's awareness increases for the products which are related, engagement increases; an engagement increase leads to potential sales, and they definitely come back. And, for example, we were working on how YouTube banner impressions have led to sales of items, and there is no way for us to relate that. Yeah, and, yeah, sorry.
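A rough sketch of the hybrid pattern described above: an LLM extracts structured features, such as cooking tools, from the page a user is viewing, and those features feed the existing recommender as candidates. `call_llm` is a placeholder for whatever LLM client is actually in use; here it returns a canned response so the sketch runs end to end.

```python
# LLM as a feature extractor in front of a traditional recommender.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a hard-coded example extraction."""
    return json.dumps({
        "cuisine": "japanese",
        "ingredients": ["sushi rice", "nori"],
        "cooking_tools": ["bamboo rolling mat", "rice cooker"],
    })

EXTRACTION_PROMPT = (
    "Extract JSON with keys cuisine, ingredients, cooking_tools "
    "from this recipe page:\n{page_text}\nJSON:"
)

def extract_recipe_features(page_text: str) -> dict:
    return json.loads(call_llm(EXTRACTION_PROMPT.format(page_text=page_text)))

# Map extracted tools to catalog items the downstream recommender can rank.
catalog = {"bamboo rolling mat": ["sushi kit"], "rice cooker": ["5-cup rice cooker"]}
features = extract_recipe_features("Homemade sushi rolls: rice, nori, ...")
candidates = [item for tool in features["cooking_tools"] for item in catalog.get(tool, [])]
print(candidates)   # ['sushi kit', '5-cup rice cooker']
```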
Dimitrios Brinkmann [00:54:26]: No, no, no. Okay. So really you're generating the feature or you're extracting the features with the LLM.
Satish Bhambri [00:54:34]: Yeah, yeah.
Dimitrios Brinkmann [00:54:36]: And the features. But the features are already in some kind of feature store.
Satish Bhambri [00:54:42]: This is very dynamic, because what we are focusing on is what the user is essentially looking at.
Dimitrios Brinkmann [00:54:46]: Yeah.
Satish Bhambri [00:54:47]: And when we find out, let's say, for example, there's a recipe that the user is looking at, what products can be mapped to this recipe can be gotten from the description of the recipe, and that is very user-specific. Now, we know for a fact it's going to be a little bit slow for online serving; I mean, the latency is going to be high. But we do have that user's data, and we do know how we can further diversify the recommendations when the user logs in next time, or even at that time.
Dimitrios Brinkmann [00:55:23]: I see.
Satish Bhambri [00:55:23]: For that matter.
Dimitrios Brinkmann [00:55:25]: Oh, so you're using the LLM to enrich the recommendations. Yes, I understand. So the LLM... it took me a while, but I got there.
Satish Bhambri [00:55:38]: It was a random experiment somehow and it turned out to be pretty fascinating. I'm like, okay, wow, this works, you know?
Dimitrios Brinkmann [00:55:44]: Yeah, yeah, yeah.
Dimitrios Brinkmann [00:55:47]: Oh, I really like that. So all of the.
Dimitrios Brinkmann [00:55:51]: Pages that I'm looking at on Walmart, you can extract, especially if it's any type of a blog, type of post, like a recipe, you can extract everything from that page using an LLM and then enrich my customer profile with all that data.
Satish Bhambri [00:56:12]: Yes. For certain products. Yeah.
Dimitrios Brinkmann [00:56:14]: And then later you do the traditional recommender system. But you know that, like, oh yeah, he did this, he liked this, and we have some strong signals that he might need some chopsticks because he was looking at a sushi recipe.
Satish Bhambri [00:56:31]: Yeah. And, you know, the interesting part is it's not even that latent anymore. If you go right now and look, you will have, let's say, the cooking tools that you need to use right there, because it's a very lightweight model and we cache a lot of things, which helps us get back immediately and bring it into production. So it's definitely fascinating, and to our advantage, recommendation systems work on confirmation of personal bias. You know, it's going to tell you more things that you want to see at the end of the day. And that's why, you know, sometimes I feel, whenever we are scrolling Instagram, I like one kind of page and it just pushes me in that direction. Like, oh my gosh.
Satish Bhambri [00:57:18]: Okay. Wow, dude.
Dimitrios Brinkmann [00:57:19]: I know.
Satish Bhambri [00:57:19]: Yeah.
Dimitrios Brinkmann [00:57:20]: Or I have the hardest time because on my iPad, I can't log into.
Dimitrios Brinkmann [00:57:27]: YouTube on my other profile that I normally watch YouTube on, so I have it on my random, like, profile. And the recommendations I get are not that good at all. They're from like three or four years ago, when I used to watch things that I liked. And so...
Dimitrios Brinkmann [00:57:46]: A little bit of a tangent, but it is fascinating to think about that. Okay, so I like it, man. I really like it. I'm just wondering what else is there that might be worth hitting on.
Satish Bhambri [00:58:01]: Evals?
Dimitrios Brinkmann [00:58:02]: Yeah.
Satish Bhambri [00:58:02]: Like, how do we evaluate our RAG agents?
Dimitrios Brinkmann [00:58:04]: I like that. All right. Yeah, let's talk about that. All right.
Satish Bhambri [00:58:08]: We are in a very niche space that I would say we're still figuring out. One of the most intriguing parts of this space is that there are so many components to a RAG agent. Where do we even start evaluating? I mean, I'm getting an answer, but there are so many steps behind that answer.
Satish Bhambri [00:58:28]: There are retrieval systems, there's an indexing system. Where do I start evaluating what part is going right and what part is going wrong?
Dimitrios Brinkmann [00:58:37]: Yeah, how do you debug?
Satish Bhambri [00:58:38]: Yeah. And this goes back to the initial example that we touched upon: it's like a kid who's taking an exam, an open-book exam. Now, we have given the kid the question, okay, this is the question; it goes back to its books, finds the most relevant context, and starts giving us the results or answers. Now, one place to look is, is that kid, or is our RAG agent, referring back to the right set of documents or not? That's where the first evaluation starts. If it's not, we need to check our indexing. We need to make sure, are we defining the documents correctly or not? Are we passing the right context? If that's done, that's step one. The next thing would be, once the context is retrieved, which is, let's say, right, we can use recall for that, for example, like sensitivity, which is a very,
Satish Bhambri [00:59:36]: I would say, cliched evaluation strategy in machine learning. And then the next would be, okay, once we have that, the dynamic prompts that are being generated, are they relevant or not? And the third part would be, once the prompts are passed to the large language model, whichever one we are using, is it actually generating the right set of outputs or not? So these are the three steps we essentially use for evaluations in RAG.
Dimitrios Brinkmann [01:00:06]: And what are you doing just printing them out and having people label this? How do you go about this? And I imagine it's all offline. It's not in the moment.
Satish Bhambri [01:00:18]: Okay, yeah, yeah, it's all offline. Yeah. And it's not manual, of course. I mean, for prototypes, yes, we started by looking into each and every thing, debugging manually. But as I mentioned, recall is the one parameter that we use at the time of document
Satish Bhambri [01:00:41]: Picking for a RAG. And good recall means, okay, we are actually picking up the contextually relevant documents there. And then secondly, for example, how the large language model is performing: that evaluation is essentially done based upon the hyperparameters of that large language model. How much temperature are we setting? What value of k are we using? Is it producing contextually relevant results or not? And then we also have LLM feedback. So these results are fed into a large language model to check whether they are relevant or not. So it's a feedback module where the content generated by one RAG agent is fed into another large language model, which gives us the evaluation of that. It's still a work in progress.
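A simple sketch of the first evaluation phase described here: retrieval recall@k against a small hand-labeled set, with the LLM-feedback phase only stubbed out. The dataset and document ids are invented examples.

```python
# Phase 1: did the retriever pick the contextually relevant documents?
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

eval_set = [
    {"question": "revenue this quarter", "relevant": {"sales_schema"},
     "retrieved": ["sales_schema", "products_schema", "hr_handbook"]},
    {"question": "PTO policy", "relevant": {"hr_handbook"},
     "retrieved": ["onboarding_doc", "sales_schema", "hr_handbook"]},
]

scores = [recall_at_k(row["retrieved"], row["relevant"], k=2) for row in eval_set]
print(f"mean recall@2: {sum(scores) / len(scores):.2f}")   # 0.50 for this toy set

# Phase 3 (LLM feedback) would feed answers to a second model for grading.
def judge_answer(question: str, answer: str) -> str:
    """Stub for the LLM-feedback phase: ask a second model to grade relevance."""
    return f"Rate 1-5 how well this answers '{question}': {answer}"
```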
Dimitrios Brinkmann [01:01:28]: Right?
Satish Bhambri [01:01:28]: Now.
Satish Bhambri [01:01:30]: Again, we don't have, like, state-of-the-art methodologies. We're still getting there. But these three phases are the ones where we...

