MLOps Community

Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use

Posted Jun 17, 2026 | Views 9
# Agent Security
# AgenticRAG
# RAG

Speakers

Varsha Prasad Narsing
Principal Software Engineer @ Red Hat

Varsha Prasad Narsing is a Principal Software Engineer at Red Hat working on Kubernetes-native AI platforms, distributed systems, and agent infrastructure. She is a core contributor to the open-source OGX and Kagenti projects, focused on building scalable and secure infrastructure for enterprise AI workloads. Her work spans AI serving, retrieval systems, agent runtimes, workload identity, and Kubernetes-native operational platforms, with a particular interest in bringing cloud-native principles to the deployment and governance of AI systems at scale.

Varsha holds a Master’s degree from Carnegie Mellon University and is passionate about distributed systems, cloud-native technologies, open-source software, and the evolving intersection of AI and infrastructure.

Francisco Javier Arceo
Senior Principal Software Engineer @ Red Hat

Francisco has spent over a decade working in AI/ML, software, and fintech at organizations like AIG, Goldman Sachs, Affirm, and Red Hat in roles spanning software, data engineering, credit, fraud, data science, and machine learning. He holds graduate degrees in Economics & Statistics and Data Science & Machine Learning from Columbia University in the City of New York and Clemson University. He is a maintainer for Feast, the open source feature store, and a Steering Committee member for Kubeflow, the open source ecosystem of Kubernetes components for AI/ML.

David DeStefano
Staff Engineer @ EvolutionIQ

David DeStefano has spent the past decade building data and ML/AI systems across healthcare, insurance, fintech, conversational AI, and adtech. He's worked across the stack, from data pipelines to predictive models to agentic systems in production.

Rohan Prasad
AI/Data/ML Platforms @ EvolutionIQ

Rohan has been working on getting models into production for the last 10 years. His focus has been on building AI/ML systems that allow others to build AI/ML systems. He’s worked across various spaces, including banking, insurance, trading, and low-code platforms.

Sam Christensen
Staff Software Engineer @ EvolutionIQ

Sam Christensen, based in Middleton, WI, is a Staff Software Engineer at EvolutionIQ, with previous roles at Central (YC S24), Sunday, and Zapier. Sam holds a Bachelor of Science (B.S.) in Computer Science from the University of Wisconsin-Madison (2013-2016).

Arthur Coleman
CEO @ Online Matters

Arthur Coleman is the CEO of Online Matters. Past roles include VP Product and Analytics at 4INFO.


SUMMARY

Retrieval-Augmented Generation and agentic AI are increasingly common in enterprise deployments, but real enterprise environments introduce challenges largely absent from academic treatments and consumer-facing APIs: multiple tenants with heterogeneous data, strict access-control requirements, regulatory compliance, and cost pressures that demand shared infrastructure.

This paper identifies a fundamental problem underlying existing RAG architectures in these settings. Retrieval systems rank documents by relevance, not by authorization, so a query from one tenant can surface another tenant’s confidential data simply because it scores highest. The authors formalize this relevance-authorization gap alongside related shortcomings (tool-mediated disclosure, context accumulation across turns, client-side orchestration bypass) and introduce a layered isolation architecture combining policy-aware ingestion, retrieval-time gating, and shared inference, enforced through server-side orchestration. They validate it through an open-source implementation in OGX, a vendor-neutral OpenAI-compatible Responses API, showing empirically that ABAC gating eliminates cross-tenant leakage while introducing negligible overhead.


TRANSCRIPT

[00:00:00] good morning everybody. Welcome back to the reading group. It's been quite some time since we did it, uh, and it's a real pleasure to be back here with you. The team is very excited to chat with you today, present to you our current topic, um, which is securing the agent vendor neutral...

Oops, there we go. It's in misorder. Vendor neutral multi-tenant enterprise retrieval. So basically what Francisco and Varsha are talking about is the fact that there's a difference between relevance and access. So all of us worry about relevance when we're dealing with RAG databases, but the reality is in enterprise, and this is where I'm gonna make a comment later, Francisco, that you have to worry a lot about access control and governance.

And I... There was a great presentation last night on governance, by the way. So this is an interesting topic that I hope all of you will pay attention to and also ask questions about, because governance is where the real danger spots are in what we're doing right [00:01:00] now. And we can-- Again, we'll talk about that.

Guiding principles of this event, all our reading groups, is you come as you are, whether you've read the paper or not, we are okay. Um, we manage through. We don't assume that you've read the paper. Um, the session belongs to you. I always like to say that, you know, this is... We learn from each other, and we've had just as good sessions where people have been telling us about what they've built and what they've done, not just listening to us talk.

So the more that you contribute, the more all of us will get out of it. Now, the good news is that there are no dumb questions. In fact, you may have noticed even from our chat here, that if there's anyone gonna ask a dumb question in the room, it's me. Okay? So I insist that this be a no judgment zone, meaning you ask any question, you shouldn't be afraid.

No one's gonna sit here in judgment of you, and please do not do that. Um, and it's okay to have, you know, critic- critique and have critical insights, but criticism is not acceptable in this environment. Um, and [00:02:00] the questions that I mentioned in the link is... The link to the document where we put questions is in the chat, and it's a Google Doc, and you just have to put your name and the question, and we will answer them in the order received.

Okay. I'm Arthur Coleman, I'm the CEO of Online Matters. It has been a consulting company for twenty-five years. In a few months, it's going to relaunch, relaunch in a different form, but I'm not chatting about that yet. We're still in stealth. Our speakers today are Francisco Javier Arceo and Varsha Prasad Narsing from Red Hat, and, um, they're exceedingly knowledgeable.

I had the pleasure of spending about, uh, forty-five minutes with Francisco on a call recently where we were talking about this paper, and as I mentioned, I was working on my RAG database at the time. And so Francisco was talking about, 'cause I'm doing consumer, that enterprise is different than consumer.

So what I wanna say to people on this call who are doing consumer: I do not believe these issues-- I disagree with Francisco that these are enterprise-only [00:03:00] issues. Governance is a problem for all of us. And Francisco, I could give you three examples I've built since then that I've had to build into my system that deal with exactly the issues that are in this paper, and it's still a consumer app.

So we'll have that argument as you present. Now, the, the session is our, our, our presenters, our, our authors here are gonna present for about 20, 25 minutes, and then we're gonna have a sort of debate and round robin with our three hosts. So let me introduce them. David DeStefano, who is the ineffable David DeStefano, who has been on these sessions almost since, since the beginning.

Rohan Prasad, and Sam Christensen, who's a staff engineer at EvolutionIQ. Uh, Rohan is the head of data and ML AI. And so just not to give our show a plug, but, um, this is their website. They're in the insurance business, AI business, and we'll let them tell you more if they wish to. But in the meantime, we go on.

So I turn it over. Do I stop sharing, David? What do I do [00:04:00] here? Uh, you could pass it off to, uh, Francisco and, um- Okay ... they could, they can kind of run with it. There we go. Yeah. Now I can see everybody. Welcome everybody. Uh, yeah. Are you guys able to share your screens, Varsha, Francisco? Yeah, one second. Okay.

Let me just hit play on the, on the thing

Francisco, I bet you didn't expect to get challenged right up front. Yeah, not, not right at the start, but, you know, now I gotta figure out the UI. But, you know, that's what computers are for. Yes, to confuse us and make our lives diffi- difficult and miserable. Yeah, right? Uh, actually, I don't know if I can share.

I don't think I have permissions to. Oh, I have the slides. Oh. I can... That's right, I've got the slides, so why don't I share? And then you just tell me when to shift. Hold on one second. I forgot that I had the slides. Yeah, I- I do have to break away for one moment. David, do you have... Actually, I have to walk away for about two minutes here.

So hopefully, [00:05:00] um, David, can you share instead? Um, I actually, I, I don't have the ability to share either, unfortunately. Okay. Um, yeah, we can figure that out next time, but- Yeah. Francisco- That's all right. We can start off and you go for, um... Just put me on the, um, guest author thing. I just need two minutes to walk away.

I just need two minutes from you. Yeah, yeah. So- Just go ahead. Put us on the guest authors, and it'll take me two minutes to talk about myself 'cause of my huge head. Oh, I see. We go there. Oh, you did. All right. I'm just kidding. We, we, we can talk a little bit about, um- Okay. There you go. I'll be right back.

Awesome. Um- Hey everybody, I'm Francisco. Um, I helped write this paper. Uh, you know, I'm a big fan of the community. Um, let's see. I'll tell you a little bit about myself. Uh, I've been working in AI/ML for like, I guess 2016, so like 14 years, which makes me a dinosaur by, uh, [00:06:00] by most accords in this space, um, as ancient as, as you know, Fisher himself, um, or Gauss maybe might be a better example.

Um, you know, I, I've, I worked at a bunch of different enterprises, banks, uh, Fintechs as well. I started my own company for a little bit. Um, I, I started as a statistician and over time, like emerged more or morphed more towards the software side of things because I found that getting models into production was really, really hard.

Um, I remember the old days where you could hard code equations like regression models or logistic regressions in systems, and that would work really fine. And now we have these gargantuan, you know, trillion parameter models where like that's, that's not really as, as, as... Um, but you know, over time I learned a bunch of stuff about data and securing and security principles and all this other stuff.

Um, a-and that kind of led me to be really, really passionate about the work we did here. [00:07:00] Um, and let's see, I'm a maintainer for Feast as well as this, uh, project OGX. I work with the LLM community, and I'm also on the steering committee for Kubeflow. If you folks know about Kubeflow, we're the MLOps infrastructure, uh, you know, open source project with like a collection of a bunch of, bunch of things.

And so I feel really lucky that at Red Hat I get to work, uh, on open source across a bunch of different things, really, whatever gets my attention that day, um, is sometimes how it feels. Um, so it's, it's a real pleasure to meet everybody here and, um Yeah. Really excited to get to talk about this, uh, th-this paper.

Uh, Varsha, did you want to introduce yourself? Sure. Thanks, Francisco. Um, hey, everyone. My name is Varsha, and I work as a principal engineer at Red Hat. Um, my introduction is pretty simple. Um, I graduated six years back and joined Red Hat, been there ever since. I started working on Kubernetes, uh, on the, [00:08:00] uh, operator side, mostly on the control and data plane.

So my work revolved around building operators, maintaining their life cycle, helping users to get up to date to maintaining the states which the operator mainta-operator takes care of. So, um, I, uh, was the one who introduced Operator SDK, uh, and also Kubebuilder. Kubebuilder was a joint effort by a lot of maintainers, but, uh, yeah, uh, those are the two projects I still maintain.

Uh, around two, two and a half years back, I joined Red Hat AI. Um, I have been working on multiple projects ever since. Started working with something known as Kube, which deals with GPU orchestration, uh, because infrastructure is expensive, GPUs are scarce, so it deals with how to schedule GPUs efficiently.

Then moved on to work a little bit on Kubeflow and then eventually, um, working on OGX. So on the OGX [00:09:00] side, uh, I've been contributing to the RAG bits, skills, uh, evaluation and things like that and, um, yeah, that's it for me. Awesome. Awesome. Well, we can probably get this party started. Um- Awesome. Thanks. Any chance, Arthur, we could flip the next page?

Your, your wish is my command, Francisco Thank you, thank you, thank you. Um, so I guess we could start with like enterprise problems. To Arthur's point that if you care about security and access control, then these might not be limited to, to, to just enterprise problems. Um, but, but that's, that's definitely the root of where we started from because, um, Red Hat is so enterprise centric.

You heard Arthur mention, you know, like Linux and so like, you know, Red Hat Enterprise Linux is, is our core thing. One, one of Red Hat's great strengths is, is security and access control and, and, um, and patching CVEs. We're really good at that. Um, [00:10:00] and so like that was definitely the lens that, that we were working on this project through.

Um, and it started probably a year ago fro- from this project that we were working on that was previously called Llama Stack. It was open sourced by Meta AI. Um, and one of the, the core things that we built out was multi-tenancy. What multi-tenancy is, is basically just groups of people working on a software system, and when you have data, they may want to access it differently, right?

And you might not want every tenant to have access to everything. Uh, and so that's the kind of high level thing. You can think about it in terms at, at teams at your company. You have a marketing team, you have a finance-- you have a marketing team and a fraud team. Maybe you don't want them to talk to, to the same set of data and, and that's basically what, what it is at a high level.

Um- Uh, Francisco, would you mind like maybe elaborating on like the Llama Stack piece and that like evolving to what OGX is? Yeah. We don't got, we don't got to OGX, but like how, how did you guys go from saying Llama Stack was either [00:11:00] not enough or we, we kind of had to go and like implement this a little bit further?

Yeah. So it, so, uh, OGX is Llama Stack- Yeah ... renamed. And so, um, Meta, you know, expanded the project, 'cause one of the things, one of the things we heard feedback from people was that Llama Stack meant like, "Oh, does that mean I can only use Llama models?" Mm-hmm. As they were pivoting towards their, um, their latest, uh, models, I, I forgot what the, the, the Muse I think it was.

Um, like they wanted to kind of like, you know, rebrand it a little bit and especially- Gotcha ... added so much enterprise, uh, adoption with Oracle involved as well that like, "Hey, look, can we make this a lot more vendor neutral sounding?" And so we went with OGX. Awesome. Cool, cool. Um, and so yeah, I mentioned a little bit about enterprises and wanting multiple tenants on shared infrastructure.

And what does that mean? [00:12:00] Basically, if you have a database and like a single table, some people might say, "Well, I'll just make a sim-- a single table per tenant." As you get more tenants, teams or even, um, companies, because maybe you're selling B2B SaaS, and you're gonna say like, "Well, I'll have table isolation for every company," right?

Eventually, that starts to get expensive as you start to really create a whole bunch more tables. Um, and, and you get some just more operational overhead within that as well. Um, it gets easy if you just say, "Well, I can have a tenant in a, in-- as like a row identifier, and I can split things out that way," right?

So I can just query by tenant ID, and then you're kind of all good. Um, but that's in contrast to how approximate nearest neighbor search works, or basically the retrieval side of the RAG equation. For, for those that aren't familiar, though I'm sure everyone's familiar, RAG stands for Retrieval-Augmented Generation.
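The row-identifier approach mentioned here can be sketched with an in-memory SQLite table. This is only an illustration of the idea; the table name, column names, and sample rows are hypothetical and are not taken from the paper or from OGX.

```python
import sqlite3

# Hypothetical schema: one shared table, with tenant_id as a row-level identifier
# instead of one table per tenant.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, body TEXT NOT NULL)"
)
# An index on tenant_id keeps per-tenant lookups cheap as the table grows.
conn.execute("CREATE INDEX idx_documents_tenant ON documents (tenant_id)")
conn.executemany(
    "INSERT INTO documents (tenant_id, body) VALUES (?, ?)",
    [
        ("marketing", "Q3 campaign plan"),
        ("fraud", "Chargeback investigation notes"),
        ("marketing", "Brand guidelines"),
    ],
)


def documents_for(tenant_id: str) -> list[str]:
    """Return only the rows owned by the given tenant."""
    rows = conn.execute(
        "SELECT body FROM documents WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    return [body for (body,) in rows]


print(documents_for("marketing"))
print(documents_for("fraud"))
```

In a relational query the tenant filter is simply part of the WHERE clause, which is why this pattern is straightforward there and harder once ranking by vector similarity enters the picture.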

Basically, you know, you pull some chunks from a database, uh, you inject that into the context of an LLM [00:13:00] and have it generate the thing. Um, and the basic premise there is like, well, you might not want every tenant to have access to all the chunks, is, is kind of the, the, the simplest piece. And that's what we did in our paper was highlight, "Hey, look," um...

And, and one of the things that OGX does really well here is we open-sourced OpenAI's Responses API. And, and, you know, to OpenAI's credit, their Responses API design is, is, is great. I think it's, it's probably the best. It, it's a very ergonomic API. The RAG pieces basically happen on the server, which is very different than how mod-- or like how the version one, like RAG and client-side agent frameworks behaved.

Where for those frameworks like LangChain, uh, or, or LlamaIndex, you really did have to like handle that yourself, right? Like I pulled chunks from somewhere, I have my Milvus collection or whatever, and I got to go and, and, um, then inject them, uh, into, into the [00:14:00] context and then call chat completions. Now in the responses world, that changed where you upload a file using OpenAI's Files API, and then you can just say, "Use a file search tool call in my responses API interaction."

And what happens is that on the server, it's going and pulling and handling database junk to go pull out and, and parse and whatever, and do annotations and junk. Um, and then, and then that ends up being really helpful. But then that exposes you to this multi-tenant problem where it's like, okay, now how do I, how do I optimize the database but also balance security?
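As a rough sketch of what "use a file search tool call" looks like from the client side, the snippet below only builds the request body and prints it; it does not contact any server. The `file_search` tool type and `vector_store_ids` field follow the public Responses API shape, while the model name, vector store ID, and question are placeholders.

```python
import json


def build_file_search_request(model: str, question: str, vector_store_id: str) -> dict:
    """Build the body of a Responses API call that asks the server to run file search."""
    return {
        "model": model,
        "input": question,
        # The file_search tool names the vector store(s) the server should retrieve from;
        # chunk lookup and context injection then happen server-side.
        "tools": [{"type": "file_search", "vector_store_ids": [vector_store_id]}],
    }


# Placeholder values: the model name and vector store ID are hypothetical.
request = build_file_search_request(
    model="example-model",
    question="What does our travel policy say about rail fares?",
    vector_store_id="vs_example123",
)
print(json.dumps(request, indent=2))

# With the official `openai` Python package this body could be sent as
# `OpenAI(base_url=...).responses.create(**request)`, where base_url points at
# whichever Responses-compatible server is in use.
```

Note that the client never touches the chunks, which is the contrast with the client-side frameworks described above and also the reason authorization has to be enforced on the server.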

And then so, um, you know, credit to, again, OpenAI in, in supporting that natively in their API, and then in the open source, we'll be able to kind of reverse engineer that and build that natively. Um, and so, you know, again, naive per tenant duplication creates lots of operational infrastructure overhead and, you know, major AI SaaS providers have high concentration of adoption.

And so with OGX, [00:15:00] we create a completely vendor neutrality in every piece of the stack, um, which I think is what's really valuable. 'Cause if you're pointing towards OpenAI's responses API, now you can have like an open source alternative that handles all the stateful pieces of the responses API. It's inherently a stateful API, and you know, if you go and dig into it, you'll find that, that actually OGX is the best implementation of this stateful thing.

Um, next slide please.

So, um, architectures evolved rapidly and security assumptions did not. Um, a-actually Varsha, did you wanna take this slide? Yeah, sure. So, um, if we go a few, not even years or months back, we were stuck with just completions API. So what was the completions API? You had a prompt, you got a response. In between you had a black box, an LLM, that would generate some kind of text for you.

And then we evolved into RAG systems. We [00:16:00] said that agents are stateless, but we want to bring in context, we want to bring in memory, so we brought in stateful components through databases, through vector databases in them. And then we moved on to tool using agents. So in addition to having a knowledge base, we said that agents could make external tool calls, uh, get inputs from external environments, and then give a better answer to whatever the user query is.

And now we eventually have moved to a complete autonomous system where we have, we support responses APIs. What it does is you have, uh, an agent who is able to query a knowledge base, who is able to make tool call, who's able to understand which tool call to be made or which document to be used to come up with an answer, and then provide a response to the user.

So from step one to step four, our architectures evolved, but then our security concern was not even taken [00:17:00] into consideration. The only benchmark which we had was how relevant is the answer to the query, but we never said anything about whether the answer being provided by the agent is something a user should even have access to, whether a user should even be able to get that particular context from the answer which the agent, uh, comes up with.

So yeah, this is something brief about how quickly the architectures evolved, but then we were just stuck with the same benchmark on the quality of the answer rather than the security assumptions itself.

Awesome. Next slide, please So why do existing architectures fail? I think we've talked a little bit about that, right? Like client-side orchestration expands the trusted computing base is basically like, "Oh yeah, you can access everything." That's no bueno. Um, and tool execution frequently runs with elevated or shared con- credentials.

And I think people find-- are finding out the hard way about this when they kind of [00:18:00] YOLO OpenClaw, uh, or they have cr-- you know, Claude Code accidentally leak their credentials, and then you have to kinda go and create a new credential for the hundredth time. Um, and then, and then, like conversation state accumulates across turns without policy re-evaluation.

And so conversation state management, a part of the responses API is an interesting one, right? Where that starts to get, um, at what some of the, um, really powerful things that like OpenAI started to move towards. Like I think the, the, the canonical example here is that like you can now share chats in ChatGPT with your friends.

And, and you don't think about it like at the start, right? But it's like, oh wait, I just granted access someone to a particular conversation, right? And in the multi-tenant space, that starts to say, "Well, hey, look, do I want every tenant to share access to all of these conversations?" Maybe, maybe not, right?

And like we can debate it, but like, you know, before this wasn't a construct that people thought about. Um, and now be-- and because that's baked into the responses [00:19:00] API, it, it's something that we can, uh, can now do. Um, and again, and, and lastly, and probably most importantly, retrieval systems had no native authorization semantics.
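One way to picture "policy re-evaluation" on stored conversation state is to re-check ownership every time context is assembled for a turn, rather than once when an item is first added. The sketch below is a toy model under that assumption; `ContextItem`, `Conversation`, and the tenant-equality rule are hypothetical and do not describe how OGX or any hosted Responses API stores conversations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContextItem:
    text: str
    tenant_id: str  # tenant that owns the source of this item


@dataclass
class Conversation:
    items: list[ContextItem] = field(default_factory=list)

    def add(self, item: ContextItem) -> None:
        self.items.append(item)

    def visible_context(self, caller_tenant: str) -> list[str]:
        # Re-check ownership on every turn instead of trusting earlier turns:
        # an item retrieved for one caller is not automatically visible to the next.
        return [item.text for item in self.items if item.tenant_id == caller_tenant]


conversation = Conversation()
conversation.add(ContextItem("Campaign budget is 40k", tenant_id="marketing"))
conversation.add(ContextItem("Open fraud case on account 17", tenant_id="fraud"))

# The same stored conversation yields different context depending on who is asking.
print(conversation.visible_context("marketing"))
print(conversation.visible_context("fraud"))
```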

Th- this wasn't something that like, you know, V1 of these APIs designed for V0, uh, whatever index you like, um, designed for, because that wasn't how people kind of used the rules. It was just that like OpenAI was first, and they started with chat completions, and everybody just went haywire with it. Um, and like there wasn't any really second thought about like, well, what's the right retrieval system for this?

And in fact, you know, you know, Feast, which was extremely popular, like feature retrieval API, um, w- we never had any sort of search API that we had a standard on. Um, you know, Feast had-- I mean, did a lot of, uh, work in array generation, but never in the approximate nearest neighbor search retrieval. And so like we, we actually did add one, and so we actually do have the ability to have, you know, governance and RBAC there as well.

But there [00:20:00] wasn't really an open standard about this. And, and there's now an OpenSearch, uh, project in the Linux Foundation. But by and large, 90% of people are using OpenAI APIs. I mean, maybe it's 50/50 between Anthropic, right, or Gemini or somewhere in between there. But, um, OpenAI kind of set the precedent and, and now there's, you know, a, a, a move towards, um, other semantics, right?

And so now, now we can actually do this, but we just didn't have the infrastructure before or we didn't really agree on it. Um, and so like again, relevance and authorization, they're just fundamentally different. Um, you don't have to have consideration for one or the other. Um, but if you care about security, th- then you do.

Um, so- Yeah. Sorry to, sorry to interject, but like, I, I think- No ... one question I do have is, you know, like, in, especially in terms of existing architectures, um, what if I wanted to take like the naive approach? You know, just like query everything and at the end just like remove out results that aren't, you know, [00:21:00] uh, relevant or, you know, that aren't, I'm not authorized for.

Um- Yeah ... sorry, I shouldn't say relevant or like, yeah, I'm not authorized for. Like, does that break down somewhere or does that run into any issues? It, it really does break down, actually. I think we have a slide on this. Is that right, um, Varsha? Is that the next one or is that...? Yeah. I think we also talk about this like eventually towards the end, and we also run evals to just show that how, um, a single tenancy, um, or like filtering on a single tenancy cannot be scaled enough, and that's one of the reasons why your recall like just drops down, uh, if you don't have relevance and authorization done together.

Yeah. And, and, and the short version of it, and we'll go into more detail with the empirical results, is that like, um, if you do a post-process where you just find the, the, the top K and then filter afterwards, um, it turns out the reci- the precision might end up really bad. A- actually, or the recall as [00:22:00] well.

Um, you know, it... Because you might filter out all of the top ones because none of those were like actually within the tenant. And so like you, you pretty much have wasted your time with that. And so there's something called predicate pushdown where you essentially propagate the query or that filter of the tenant ID down to the database layer, so that at the database layer, when it's doing its own retrieval, it does the filter step first.

Um, and, and you know, some databases support this, some don't. And so where it doesn't, it, like, it, it's actually probably the, probably that little nugget is probably the most important conclusion of our paper, where it's like, "Hey, um, don't even try filters." I think you actually, you actually covered a bit in the next slide too.
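The difference between filtering after the top-K search and pushing the tenant filter down into the search can be seen on a toy corpus. This is a brute-force sketch with made-up two-dimensional embeddings, not the evaluation from the paper; the function names are hypothetical, and a real vector store would apply the filter inside its index rather than in a list comprehension.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy corpus of (chunk_id, tenant_id, embedding). All values are made up.
CORPUS = [
    ("a1", "tenant_a", (1.0, 0.0)),
    ("a2", "tenant_a", (0.9, 0.1)),
    ("a3", "tenant_a", (0.8, 0.2)),
    ("b1", "tenant_b", (0.6, 0.4)),
    ("b2", "tenant_b", (0.5, 0.5)),
]


def top_k(query, rows, k):
    """Rank rows by similarity to the query and keep the best k."""
    return sorted(rows, key=lambda row: cosine(query, row[2]), reverse=True)[:k]


def post_filter(query, tenant, k):
    # Search across every tenant first, then drop what the caller may not see.
    return [row[0] for row in top_k(query, CORPUS, k) if row[1] == tenant]


def pushdown_filter(query, tenant, k):
    # Restrict to the caller's rows first, then rank within that subset.
    allowed = [row for row in CORPUS if row[1] == tenant]
    return [row[0] for row in top_k(query, allowed, k)]


query = (1.0, 0.0)
print(post_filter(query, "tenant_b", k=3))      # []
print(pushdown_filter(query, "tenant_b", k=3))  # ['b1', 'b2']
```

Here the three most similar chunks all belong to tenant_a, so post-filtering leaves tenant_b with nothing even though it owns two relevant chunks. That is the recall loss described above.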

Yeah. Um, yeah. Yeah. Yeah, yeah. So, so formalizing it here. Yeah. A-and Varsha, do you want to walk through this, this, uh, this slide? Yeah. This is the crux of the entire paper. Like, I think this one, uh, particular equation kind of captures it all. [00:23:00] So what it just says is when a document is retrieved out of a document set, we need to ensure two things.

The LHS, I'll just, uh, like split the equation into two. The LHS says that a document which is retrieved from the knowledge base should obviously have a relevance score which is higher than some threshold which a user sets in. But in addition to that, we also need to consider the policy enforcement, which says that when a user is able to access that document, does-- do they have the right permissions to do so?

When the answer to that is yes, only then we move on to even checking relevance and getting an output, uh, from the agent. So this is the crux where till now we usually ignore, uh, the RHS side, the, uh, permission side of things, and we just concentrate on the relevance side of things. But what we are saying through this paper is let's also [00:24:00] concentrate on both the parts together, where we not only consider that the relevance is above a particular threshold, but we also ensure that some policy management is also in place in the system, so that when the agent refers to a particular document to get the chunks out of it, uh, we also ensure that that user has access to the particular document from the database or particular table.
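The slide's equation is not reproduced in the transcript. One way to write the condition as described, with symbols chosen here for illustration rather than taken from the paper, is: for a query q from user u over a document set D, with relevance score rel, threshold τ, and a policy predicate auth that is 1 when u may read d,

```latex
\[
R(q, u) \;=\; \bigl\{\, d \in D \;:\; \mathrm{rel}(q, d) \ge \tau \;\wedge\; \mathrm{auth}(u, d) = 1 \,\bigr\}
\]
```

A retriever that ranks on relevance alone returns the larger set with the auth conjunct dropped; the relevance-authorization gap is the difference between the two sets.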

So it is chunk-level gating that we bring in. We'll talk about that in the next slide, but this is the, uh, overall idea of the, uh, relevance authorization gap that we have introduced in the paper. A-and the paper says too that like, um, say for example, you're not using a backend database or vector store that has the capability, native capability of like predicate push down, that still the OGX will go about filtering based off of like the attributes of the user to make sure that...

And that's where you say the recall dips a little bit, but it's still gonna function and it's still gonna ensure that, um- Yeah ... regardless of your back end, that like you're [00:25:00] gonna only get what-whatever vectors that are, uh, relevant to you as a particular user. Yeah. Yeah. Exactly. Yeah. Yeah. And so it's, it's two layers of, uh, security in a, in a sense.

It was like, "Hey, we'll do the best we can," but like- Yeah ... still would recommend you use predicate push down where if possible. Yeah. Because, because in that sense, like it's free to do it, right? Yeah. I mean, so, so, so long as it's a, a supported, uh, um, a database and, and there are ways that you can take your own database, whatever we use, if it's not one of the ones that we support, and just, um, have what we call an out of tree provider that you can inject into OGX and at build time and, and, and still use the APIs.

And so, uh, so for the most part, there's no reason that you, that you, um, shouldn't do it, but like you know, I guess, uh, you know, we'll cover you a little bit even, even if, if, if you don't. Cool. Yeah. Next slide, please.

Slide? Yes. Next slide, please. Thank you. Yes. And so, um, the layer-- We call this in the paper the layered [00:26:00] isolation architecture, right? Where like one, it's policy aware ingestion, which is like quite literally, I give you a file, and I'm going to ingest it, i.e., like park chunks into the database, right? And, and in, even implicit in that is the idea that I'm gonna take the file, chunk it into, into, you know, pieces, sometimes, you know, overlapping, sometimes not, right?

If you want, you know, completely mutually exclusive. And I'll embed them, um, and, and then those embeddings along with like whatever chunk ID I put and some metadata will be inserted as a record in that database. Um, you know, we'll add some, some metadata to say like, "Hey, look, this tenant owns this thing," um, and by authored by this user or whatever.
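The ingestion steps described here (chunk the file, embed each chunk, store it with a chunk ID and ownership metadata) might look roughly like the following. Everything named in the snippet is hypothetical: `ChunkRecord`, `toy_embed`, and the metadata fields are stand-ins, and the toy embedding is not a real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    text: str
    embedding: tuple[float, ...]
    tenant_id: str  # ownership metadata attached at ingestion time
    author: str


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows; overlap=0 gives disjoint chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def toy_embed(text: str) -> tuple[float, ...]:
    # Stand-in for a real embedding model: (length, vowel count).
    return (float(len(text)), float(sum(c in "aeiou" for c in text.lower())))


def ingest(file_id: str, text: str, tenant_id: str, author: str,
           size: int = 4, overlap: int = 2) -> list[ChunkRecord]:
    """Chunk, embed, and tag every chunk with the owning tenant and author."""
    return [
        ChunkRecord(f"{file_id}-{n}", piece, toy_embed(piece), tenant_id, author)
        for n, piece in enumerate(chunk_text(text, size, overlap))
    ]


records = ingest("f1", "abcdefghij", tenant_id="tenant_a", author="alice")
for record in records:
    print(record.chunk_id, record.text, record.tenant_id)
```

Because the tenant tag is written on every record at ingestion, the retrieval layer has something to filter on later.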

Um, and then layer two is, is the, the, the retrieval gating, which is, you know, uh, access-based access control-- attribute-based access control and metadata filtering, which we talked about a little bit, um, as well the two layers, that's those pieces. And [00:27:00] then the third layer is the shared inference with tenant scope context.

And, and this is a really, really, uh, important piece of it where like the, the cost implications are at play here. Um, and, you know, basically between layer one and two, if things can't work, we're saving money on, on cost of inference, which, uh, tokens these days are getting pricey, uh, if you're following our AI overlords, Anthropic, I guess.
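For the retrieval-gating layer, an attribute-based check compares attributes of the caller with attributes stored on each chunk. The sketch below assumes two illustrative attributes, a tenant ID and an optional required role; these names and the rule itself are hypothetical and are not OGX's policy schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    tenant_id: str
    roles: frozenset


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    required_role: Optional[str] = None  # None means any user in the tenant may read it


def is_authorized(user: User, chunk: Chunk) -> bool:
    """Allow access only within the same tenant and, if set, with the required role."""
    same_tenant = chunk.tenant_id == user.tenant_id
    has_role = chunk.required_role is None or chunk.required_role in user.roles
    return same_tenant and has_role


def gate(user: User, candidates: list) -> list:
    """Keep only the candidate chunks the user is allowed to see."""
    return [chunk for chunk in candidates if is_authorized(user, chunk)]


candidates = [
    Chunk("c1", "tenant_a"),
    Chunk("c2", "tenant_a", required_role="finance"),
    Chunk("c3", "tenant_b"),
]
analyst = User("tenant_a", frozenset({"analyst"}))
controller = User("tenant_a", frozenset({"finance"}))

print([c.chunk_id for c in gate(analyst, candidates)])
print([c.chunk_id for c in gate(controller, candidates)])
```

Only chunks that pass the gate would be placed into the model's context, so unauthorized chunks never reach inference.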

Um, yes, do you wanna go to the next slide, please? And so this is probably my favorite slide. One, I made the graph in LaTeX, which is awful, if you have ever made a graph in LaTeX. So I spent a lot of time with it, even with the clanker. It turns out clankers aren't that good at creating straight 90-degree angles, but, you know, with some elbow grease, I got it in there.

But the reason I like this slide is not because of that, but because it [00:28:00] highlights the complexity of OpenAI's Responses API. And the thing I'd invite you to think about is, when you hear the Responses API, think Codex, right? Because that's what's powering it under the hood.

And some of these APIs are closed source, is what we call them, or there's a spec available or known. The Prompts API is a really good example, where if you go into OpenAI's admin portal, you'll see that you can create a prompt there, but they don't actually give you a prompt API that you can hit to create a prompt cache key.

A prompt cache key basically helps speed up inference because you can cache the prompt effectively, right? So there's a lot of utility from that. I'm not sure how adopted it is, candidly, but just by virtue of prompt caching, you get a lot of benefits from it, depending on how big the prompt is.

So it's a thing to be mindful of. And we implement all these in OGX. So we've, [00:29:00] you know, reverse engineered all of them. One of the things that's pretty interesting here is the file processor API, where you hear a lot about document ingestion, because that's super important for enterprise.

I worked in insurance 12 years ago at AIG, which is a small insurance company that almost took down the economy. And, you know, we actually paid vendors in different countries to annotate our data. Now you can do that with these small OCR models, which we've added support for in this file processor API.

And so the idea is I upload, like, a PDF, and behind the scenes we're gonna extract and embed the content of that PDF and then insert it so you can do RAG on it, which is a standard use case for everybody. And, you know, we've done a lot of work in the stack to support that. In OpenAI, it might use a vision language model.

It might use some small cheap models, kind of TBD. But that part is kind of mysterious. You can only find out once you actually [00:30:00] test with it, and so we've done a lot of that there. The other aspect of this diagram is the tools. You see in OpenAI's documentation a lot of tool call capabilities.

The one you're probably the most familiar with, like when you use ChatGPT, is web search. You might see, like, "Hey, OpenAI is searching the web for me," agentically, right? And really what agentically means is the, you know, inference thing is spitting out a token to say, "Do a tool call." And that's basically what we orchestrate and execute here.

And the interesting one is the file search tool call, which is approximately a wrapper around their search API. So we were able to kind of learn and, you know, tie these things together. What's not shown here is the skills API and the containers API. People are talking a lot about skills. Through OpenAI's Responses API, you can deploy an agent on their server and execute a skill, and that happens through the skills API, basically.

And [00:31:00] what that means is that increasingly more stuff happens on the server. And the more stuff that happens on the server, the more risk you have of leaking data to somebody you didn't mean to. I think there was recently an incident about this from one of the labs, where one user's data was being leaked to another, because this is a common thing that happens, where it's like, "Hey, why am I seeing David's chats all of a sudden?"

Like, oops, they forgot to add ABAC. You know, and so it's super embarrassing for them, and I guess people forget about it. But for everybody else that's, you know, not them, we fortunately have these great primitives in security that have been around forever. And so this is just to say that there's a huge amount of work under the hood beyond just the Responses API.

And it's actually quite brilliant, because the UX is so good that if you didn't spend as much time as [00:32:00] I have staring at this, it wouldn't be as obvious. But increasingly these things are becoming really, really great, and it's awesome that we have a spec available for it.

Cool. Next slide, please. Right there. Varsha, did you wanna take this one? Yeah. Yeah. Yeah. So we spoke about securing the data path, and about why chunk-level auth is important. I think Francisco also covered this a bit in the previous architecture diagram, but this slide talks more about how server-side orchestration is equally important.

So we can execute the chunk-level gating on the client side too, but inherently, clients cannot be trusted. Clients can skip your authorization requirements, and a client can be buggy. Which is why, on the enterprise level, [00:33:00] we say that server-side orchestration as an enforcement boundary is equally important.

And we have tools like LangChain and LangGraph, which execute this inference-and-tool loop, where you send a query, the agent executes something and comes back. But they all execute this on the client side. None of them execute this on the server side. And OGX helps you execute all of it on the server side.

The reason being, when you have a tool call being executed on the server, which we control, we can ensure that every tool call goes through a set of policy checks. So, for example, when an agent makes an MCP tool call, we need to validate whether the agent is even allowed to make an MCP call or not, or whether the agent is allowed to do a web search, or whether the agent is allowed to even look at a particular database.

So what we propose [00:34:00] here is, in addition to having your data path controlled and secured by these gating mechanisms on the database level, OGX also helps you secure the control path by ensuring that we have policy checks at every tool call, or at every request made by the agent. So yeah, that was the slide.
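As a rough illustration of the control-path idea, the sketch below puts a deny-by-default policy check in front of every tool call on the server. `ToolPolicy` and `ToolGateway` are hypothetical names invented for this example and are not OGX APIs. The tools are plain Python callables standing in for MCP calls, web search, or database access.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolPolicy:
    """Maps an agent ID to the set of tool names it may call (deny by default)."""
    allowed: dict[str, set[str]] = field(default_factory=dict)

    def is_allowed(self, agent_id: str, tool_name: str) -> bool:
        return tool_name in self.allowed.get(agent_id, set())


class ToolGateway:
    """Server-side choke point: every tool call passes the policy check before it runs."""

    def __init__(self, policy: ToolPolicy, tools: dict[str, Callable[..., Any]]):
        self._policy = policy
        self._tools = tools

    def call(self, agent_id: str, tool_name: str, **kwargs: Any) -> Any:
        # The check lives on the server, so a buggy or malicious client cannot skip it.
        if not self._policy.is_allowed(agent_id, tool_name):
            raise PermissionError(f"{agent_id} may not call {tool_name}")
        if tool_name not in self._tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self._tools[tool_name](**kwargs)


if __name__ == "__main__":
    policy = ToolPolicy({"support-agent": {"web_search"}})
    gateway = ToolGateway(policy, {
        "web_search": lambda query: f"results for {query}",
        "query_finance": lambda table: f"rows from {table}",
    })
    print(gateway.call("support-agent", "web_search", query="refund policy"))
    try:
        gateway.call("support-agent", "query_finance", table="ledger")
    except PermissionError as err:
        print("blocked:", err)
```

Because the agent never holds a direct handle to the tools, the only route to them goes through `call`, which is what makes the server an enforcement boundary rather than an advisory one.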

On the sixth slide? Yes. Next slide? Yes, please. Okay. Yes, and so, to summarize, OGX is our implementation. The open source is there. You can clone it, you can open a PR, you can build a business on top of it. We'd love that. It supports inference, vector stores, conversations, prompts, tools, skills, files, you know, lions and tigers and bears, oh my.

Then the provider abstraction, so we enable [00:35:00] pluggable vector DBs and model backends. I think we support like 31 inference providers or something silly. We support vLLM because we do a lot of work with the vLLM community. And so basically you can take an open-weight model like Nemotron V3 and start running it completely yourself.

You know, and you get all of the OpenAI compatibility. And it's deployable on Kubernetes through our OGX operator. We support annotations and a bunch of stuff that you get within the UX. There's a UI in there which gives it a ChatGPT-like functionality.

You know? That one you have to build from the repo, but it's there. And yeah, tokens are gonna get a lot more expensive. A hundred percent agree with you. And I do think that the phenomenon we're observing is models getting hyper huge, and then also models getting smaller and more efficient.

And you'll notice that as models get [00:36:00] more efficient, it's very likely people will start to use them client side. I think that's gonna be really cool, and basically you can then spin up OGX with vLLM, get going, and get the same UX. That's kind of the vision there.

And yeah, everything's fully open source and reproducible. In fact, our paper won several badges for reproducibility, actually all three badges that you could earn, because our evaluations were open source and reproducible. Next slide, please. And I put the link to the paper, not to the repository.

I'm getting that now into the chat so anyone can see that. Yes. Uh, let me see if I have the

I think it's, uh, Nicole. Oh, yes. And, uh, Varsha, did you remember this one?

Oh, you're on mute. Oh, shoot. Oh, sorry. Yeah, I just sent over [00:37:00] the evals repo, where you can basically run the evals and replicate them. But, yeah, the paper is also open, and we also have a repository where we have the source code, so that can be looked at too. I'll share that as well. But on the mostly-open to fully-open AI architecture, I think Francisco covered it in the beginning.

OpenAI has given us the right set of contracts and the right set of APIs, which can be followed to make it easier to interact with RAG systems and tool calls. But unfortunately, most of the OpenAI APIs are not open. So what we have done in OGX is follow the OpenAI API format, and we have tried to open source every possible API that they provide.

So we have an open implementation for various providers and various databases. We have an open [00:38:00] implementation on the inference side, which can be vendor specific if you want NVIDIA-specific or AMD-specific inference. So that is also available. We also have support for MLflow, so you can run evals there.

So what we have tried to do is take the OpenAI APIs and add provider support for as many solutions as possible, so that if you have to build a system on top of it, it is much easier to just bring in the most popular databases or the most popular tools, integrate them with OGX, and get the benefits of OGX out of the box.

So... You actually did a really good job of also just, like, outlining your conformance to these different APIs, just to make sure you get an idea of what your coverage is too, so yeah. Yeah. Very detailed. Yes.

Sorry about that. Uh, yes, and next slide, please. [00:39:00] Okay. And so, actually, Varsha, did you wanna take this one as well? Yeah, sure. So we ran a couple of evals. I'll just quickly go through all of them, but one important thing I wanted to answer was the predicate pushdown one. But the first one is basically, we know that gating is important.

We have established that we need chunk-level gating. So we did two things. First, we implemented gating on the client side, with the database itself, and second, we implemented gating on the server side. And as expected, gating reduces the leakage. So whether it's on the client or the server, we accomplish what we want.

But in addition to that, we say that server-side gating is important because the server controls the orchestration of who can make the call, who cannot, and all the policy requirements. So this is just to say that the first diagram and Figure 5, [00:40:00] Figure E towards the end, say that gating is important, but server-side gating brings you additional benefits.

And the next most important result, which I wanted to talk about and which was also asked about previously, is predicate pushdown. I think when we started testing SQLite, it didn't really support predicate pushdown, but now it does, and pretty much all of the databases, whether it's Qdrant, Milvus, or Faiss, support predicate pushdown.

But the whole idea is that with predicate pushdown, you can propagate the tenant ID to the database level. So when you execute a query, the database ensures that the matching happens after your filtering. So you only get chunks out of the database that are the most relevant and, at the same time, belong to a tenant which you have access to.

And this ensures that, in the end, your recall is good, [00:41:00] which means your results are as relevant to the query, or as close to the ground truth, as possible. Whereas if you use a database which does not support predicate pushdown, OGX does provide you filtering on the server side.

What it does is, out of the retrieved chunks that come out of the database, we run metadata filtering on the server side to ensure that all the chunks which do not conform to a particular policy are filtered out, and then we send the rest to the inference model to generate the response.

So you do get filtering and ABAC through OGX out of the box. But we say that on an enterprise level, it's very important to use databases that support predicate pushdown, so that OGX can propagate all your tenant ID metadata to the database table level, rather than just depending on what we provide on the server side. [00:42:00]
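The difference between the two paths can be shown on a toy in-memory store. This is a sketch under stated assumptions, not how OGX or any particular vector database implements either path: `search_pushdown` and `search_post_filter` are hypothetical names, similarity is a plain dot product, and the data is made up for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search_pushdown(chunks: list[dict], query: list[float], tenant_id: str, k: int) -> list[str]:
    """Filter by tenant first, then rank: what a database with predicate pushdown does."""
    allowed = [c for c in chunks if c["tenant_id"] == tenant_id]
    ranked = sorted(allowed, key=lambda c: dot(c["embedding"], query), reverse=True)
    return [c["id"] for c in ranked[:k]]


def search_post_filter(chunks: list[dict], query: list[float], tenant_id: str, k: int) -> list[str]:
    """Rank everything, take the top k, then drop chunks the tenant may not see."""
    top_k = sorted(chunks, key=lambda c: dot(c["embedding"], query), reverse=True)[:k]
    return [c["id"] for c in top_k if c["tenant_id"] == tenant_id]


# Made-up data: tenant B owns the two chunks closest to the query.
CHUNKS = [
    {"id": "a1", "tenant_id": "A", "embedding": [0.9, 0.1]},
    {"id": "b1", "tenant_id": "B", "embedding": [1.0, 0.0]},
    {"id": "b2", "tenant_id": "B", "embedding": [0.95, 0.0]},
    {"id": "a2", "tenant_id": "A", "embedding": [0.5, 0.5]},
    {"id": "a3", "tenant_id": "A", "embedding": [0.2, 0.8]},
]

if __name__ == "__main__":
    query = [1.0, 0.0]
    print(search_pushdown(CHUNKS, query, "A", k=2))     # ['a1', 'a2']
    print(search_post_filter(CHUNKS, query, "A", k=2))  # []
```

Neither path leaks tenant B's chunks to tenant A, but post-filtering spends the top-k budget on chunks that are then discarded, so tenant A gets nothing back here. A server-side filter can soften this by over-fetching, which is one reason pushdown is the recommended path.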

Awesome. Awesome. And yeah, I mean, just the key takeaways. Varsha, do you wanna finish this up, since you were kind of talking through parts of this? Sure. So the major key takeaway out of the whole 50-minute talk is: relevance is not authorization. Whenever you run benchmarks, don't just look at the relevance score, but also consider authorization and security, which are very important, on the enterprise level or not.

The second is that enterprise agentic systems require server-side enforcement, because you cannot trust clients. Clients can be buggy. Clients can just skip your authorization checks. Clients can do whatever they want, because you don't control them. The third one is, infrastructure is expensive. We acknowledge that, which is why it matters that shared infrastructure can still provide you an [00:43:00] isolated environment, rather than having to replicate the same thing across tenants multiple times.

So one of the very important takeaways from our paper is that when you use OGX, your cost does not multiply with the number of tenants. It rather multiplies with the number of models. So the takeaway is that, on an enterprise level, if you want to use shared infrastructure for multiple teams, you can still get the same level of isolation if you have the right set of checks available.

And just to conclude, whatever we built is not just theoretical or on paper. We also have this deployed on customer clusters. It is deployed at large scale using Kubernetes, so you can have the same benefits on a single node, [00:44:00] in a multi-cluster environment, or in a distributed environment where you have Kubernetes in place.

So yeah, these are the four takeaways. Next slide. Uh, yeah. Oh, another slide? Sorry. There we go. And so, one thing that we didn't mention in this talk, which is very much focused on our paper: you know, we also did a workshop for the AI Agentic Software Engineering workshop as well.

And, you know, we're actually gonna be presenting this as a poster session at the AI Engineering World's Fair. So you just got a preview of what we'll talk about at the World's Fair in San Francisco in a few weeks. And one of the things that we're really excited about is we actually did a benchmark of our RAG implementation compared to OpenAI's SaaS product.

And, you know, we published our results and shared them, and the cool part was that we were competitive. Our retrieval system is on par with [00:45:00] OpenAI's, which means we've reverse engineered it pretty well, which is a pat on my back and Varsha's. And so, you know, we care deeply about open source, obviously, as our careers have shown, and obviously the work by virtue of Red Hat.

But it's just great to see that you can get the same sort of feature parity with open source that you can with some of these SaaS products. Now, we can't claim, like, frontier intelligence. We don't have Mythos open source yet, but I'm sure that's coming. And I think when that comes, more and more clanking will be done through open source, and that gets me pretty excited.

So, you know, we have a link to the paper and a link to the evals repo. We shared them also in the chat. And then finally, a link to the GitHub repo. Feel free to give it a star, or comment with the middle finger if you think it sucks. We would appreciate that too, some feedback. I'm kidding. Uh, I mean, maybe not.

But, you know, please feel free to join the community and contribute, and see if it solves some of your problems. [00:46:00] We're doing a lot of great stuff there.

All right. Well, thank you, Francisco and Varsha. First of all, do we wanna show a slide while we're talking now? Any particular slide we should leave on the screen?

No, we can take them down. Yeah. Yeah. Okay. So everybody can see everybody, with their beautiful faces. We have a lot of questions. And Andrew, are you still on? And if you would share your image, if you're not... And I can't see you again. I can see a lot of people, but I don't see everybody.

Um, there you are. Andrew, there you are. Why don't you take the first? We've got three questions in a row. I tell you what, you and I will interleave. I have to also, Dave, in five minutes, I have to leave. So if we run over, can you pick up the Q&A? Yeah, sure. We got a hard stop too, but, uh, we'll, we'll try to push it here.

Andrew, go ahead. I guess my first question was with OGX: can we use this for RBAC and ABAC for non-RAG retrieval? Like, let's say I've got [00:47:00] my agent, my customer support agent, and I don't want it combing through the finance tables. You know, that just should be a hard guardrail. Yeah, I believe so.

I mean, it depends, because we have guardrails within OGX as well. So there's, like, inference-based guardrails, which happen, you know, agentically, as people call it, but really it's inference-based. And then these kind of deterministic ones. So if you have an explicit table that you don't want people to converge on, I actually think you get that for free.

Um- Uh, do you mean with, like, the tool use, for example? Like if you had a specific table that the agent would have access to? Yeah. I also asked about tool use, so this is probably converging on the same question. Like, could I have an IAM role assigned to my agent, or passed through to it, that prevents it from querying the finance tables?

[00:48:00] Hmm. Yeah, so, you could have a guardrail that prevents it from calling the, uh, which is inference-based- Inference-based, yeah ... preventing it. Or having a deterministic rule that says, "Hey, you can't call this particular table." That's kind of true. So the way OpenAI's Files API works, or, like, the RAG API works, is you create a file and then you create a vector store.

And I should have explained this before. When you create a vector store, that's basically a collection. You can attach a file to that vector store, and then when you actually do your search, that's when you orchestrate. The client says, "Hey, these are the tables I want to query," where a table is basically a vector store.

And so in your case, the OpenAI client still says, "Here are the vector stores I want to actually query." And then beyond that, if they don't have access to that table and they try to query it, we'll just reject that. [00:49:00] Maybe, Francisco, like, so the same policy awareness that you could put on your ingestion to populate your vector store.

For the tool use, if you have, like, an MCP server that says, "Query my internal entities table," that same policy awareness can happen for the tools. Yes. So that way when you're making a call to the inf... I think that's kind of what Andrew's question may be: can I apply- I see. I see. Can I apply that same kind of logic, or does the server handle the governance of what a tool may do on a system table, whether this tenant has access to do that or not, beyond- So that's a really good question because it's a...

That one's a little bit more nuanced, because if you're using this as an MCP, can you add ABAC over that MCP tool? I don't think so. But Varsha might know more about that one than I do. Yeah. Yeah. So in the case of external tool calls, for example with MCP, what you can do is say whether a tool call should be made or not, [00:50:00] but propagating the metadata is not possible, because your server is not controlling what is behind your MCP.

So it's very difficult- It's not- -to propagate the right set of metadata annotations back there. Yeah. If it was embedded in- It's not this- Sorry. Isn't this a problem of token exchange? Because with MCP, basically, you have an IDP, and you have token A, let's say, calling an MCP server. You preserve the identity, and then you exchange token A for token B, which has a different scope.

And in this scope, for example, you can specify some attributes that you can use for filtering. I use this mechanism with Chroma. Chroma, for example, [00:51:00] allows metadata filtering. And you can map the, let's say, multi-tenant behavior by using the tenant ID, user ID, and agent ID as attributes, for example.

And basically, for me, it's an on-behalf-of scenario, where you just exchange the token. You preserve the identity, but you change the scope. So when you do the ingestion, you can decorate the metadata with the tenant ID, user ID, agent ID if it's necessary, and some extra metadata attributes.

For example, indicating whether the document contains sensitive information or not. [00:52:00] And basically, you can, let's say, map differently, or you can use filtering, by using this token exchange approach. Yeah. I- And, um. So that's not how I think it works, but, I mean, if that's a way we can extend it, we'd love to hear more about it.
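The token exchange approach the attendee describes could look roughly like the sketch below. Everything here is simulated with plain dictionaries: there is no real identity provider, signature, or token format, and `exchange_token`, `filter_from_token`, and `matches` are hypothetical helpers rather than functions from Chroma, OGX, or any MCP library.

```python
def exchange_token(token_a: dict, audience: str, attributes: dict) -> dict:
    """Simulated on-behalf-of exchange: keep the subject, issue a new scope for a new audience."""
    return {"sub": token_a["sub"], "aud": audience, "attributes": dict(attributes)}


def filter_from_token(token: dict,
                      keys: tuple[str, ...] = ("tenant_id", "user_id", "agent_id")) -> dict:
    """Turn the scoped attributes carried by the token into an equality metadata filter."""
    attrs = token["attributes"]
    return {k: attrs[k] for k in keys if k in attrs}


def matches(metadata: dict, metadata_filter: dict) -> bool:
    """True when the chunk metadata satisfies every key in the filter."""
    return all(metadata.get(k) == v for k, v in metadata_filter.items())


if __name__ == "__main__":
    token_a = {"sub": "alice", "aud": "agent-gateway", "attributes": {}}
    token_b = exchange_token(token_a, "vector-store",
                             {"tenant_id": "acme", "user_id": "alice"})
    flt = filter_from_token(token_b)
    print(flt)
    print(matches({"tenant_id": "acme", "user_id": "alice", "sensitive": "no"}, flt))
    print(matches({"tenant_id": "globex", "user_id": "alice"}, flt))
```

The identity (`sub`) survives the exchange while the scope changes, so the same attributes written into chunk metadata at ingestion can be matched at query time.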

And, you know, Calin, if you wouldn't mind me asking, feel free to maybe post a GitHub issue about this, and Varsha and I can look more at it. If there's a way for us to extend it to support that natively, we would love to. Yeah. It's a standard way that we use MCP, right?

Because MCP by default doesn't have any, um, authentication, authori-

Basically, the new scope on the other side, in the case of RAG, you can use to do filtering. It's at least how I'm using it in production, for example, with any vector store that supports [00:53:00] filtering. I can do- Awesome ... this kind of mapping. I, I know- Yeah ... we're a little over time. I'd like to thank everybody for just, like, coming.

Uh, Varsha, Francisco, thank you so much for giving us your time and energy and putting this presentation together for us. The paper's awesome. Learned a lot from it. It's super applicable to the work that we do, for example, at EvolutionIQ. If anybody wants to contribute, I know you guys will share the information out with everybody as well.

I know you guys got a Discord. I don't know if you have a Slack channel. Um, but just excited to see where this project goes and, like, what you guys continue to do with it, and thank you again for your time. Yeah. Thank you, everybody. Thank you. Thank you so much. Thank you. Talk to you guys later. Thank you.
