MLOps Community

From Idea to Implementation: How to Self-Host an AI Agent // Meryem Arik // Agents in Production 2025

Posted Jul 30, 2025 | Views 53
# Agents in Production
# Idea to Implementation
# Self-host
# Doubleword

SPEAKER
Meryem Arik
CEO / Co-founder @ Doubleword

Meryem is the Co-founder and CEO of Doubleword (previously TitanML), a self-hosted AI inference platform empowering enterprise teams to deploy domain-specific or custom models in their private environment. An alumna of Oxford University, Meryem studied Theoretical Physics and Philosophy. She frequently speaks at leading conferences, including TEDx and QCon, sharing insights on inference technology and enterprise AI. Meryem has been recognized as a Forbes 30 Under 30 honoree for her contributions to the AI field.


SUMMARY

Generative AI and Agentic AI hold the potential to revolutionize everyday business operations. However, for highly regulated enterprises, security and privacy are non-negotiable, and shared LLM API services often aren't appropriate. In this session, we will explore the open-source landscape and identify various applications where owning your own stack can lead to enhanced data privacy and security, greater customization, and cost savings in the long run. Our talk will take you through the entire process, from idea to implementation, guiding you through selecting the right model, deploying it on a suitable infrastructure, and ultimately building a robust AI agent. By the end of this session, attendees will gain practical insights to enhance their ability to develop high-value Generative AI applications. You will leave with a deeper understanding of how to empower your organization with self-hosted solutions that prioritize control, customization, and compliance.


TRANSCRIPT

Meryem Arik [00:00:09]: Hello, I'm Meryem, I'm the CEO and co-founder of Doubleword. We make infrastructure for people that are self-hosting their AI models. We know a lot about when self-hosting makes sense and doesn't make sense, and when you're doing it, how to do it properly. So I'm going to talk about three things today. The first thing I'm going to talk about is what the characteristics are of AI agent inference and how that is different from traditional inference of LLMs that you might experience in a traditional chatbot. So what are the key characteristics? I'm then going to make the case that those characteristics of agents make them really well suited for self-hosted inference versus using API providers like OpenAI or Bedrock or something like that. And the third thing I'm going to cover today is how you should go about self-hosting inference for agents in the enterprise. And I'll give a couple of tips on what we think works particularly well when structuring AIOps teams.

Meryem Arik [00:01:12]: So the first thing I want to talk about is what inference looks like for AI agents. Agents are systems that can independently reason, act and adapt to achieve goals with limited supervision: they can understand problems and create plans, they can act, they can use tools, potentially they can call other models to achieve their goals, and they can also adapt, so they potentially learn from experience and improve performance. So this is how I'm defining an agent. But it's also important to know that people are using the word agent to describe a lot of different things at the moment. So I kind of have this graded scale of how agentic a system actually is. At the minimal end of agenticism, you're just calling an API and getting a response; at the other end you have a multi-agent system where agentic workflows can kick off whole other agentic workflows. At the moment, when people talk about agentic systems they're building in the enterprise, typically they're talking about the middle three. So on one end you have LLMs determining basic control flows, like use path A, and if not, path B. But we also get more agentic things where you can have an LLM that can choose the right thing to do based on all of the tools it's given access to.

Meryem Arik [00:02:45]: And so I'm going to talk about the difference between what we think of as a low agency agentic system versus a high agency system, and I'm going to talk about how inference is different for each of them. So let me just define them first. A low agency system might be a system with a simple input-output flow: I send a request to my model and I get a response. A high agency system might be a situation where I have a model that's a controller model, but maybe that model has the ability to use a range of different tools based on what the situation and the use case require. Maybe it can look up in databases and do some RAG, maybe it can call other tools, maybe it can search the web, maybe it can execute code. It's given a lot more things that it's able to do.
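To make that concrete, here is a minimal sketch (not from the talk) of the kind of high agency controller loop being described: a model served behind an OpenAI-compatible API that may keep calling an internal tool before producing a visible answer. The endpoint, model name, and `query_database` tool are hypothetical stand-ins.

```python
# Sketch of a high-agency controller loop, assuming a self-hosted,
# OpenAI-compatible endpoint and a hypothetical internal tool.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local server; key ignored

TOOLS = [{
    "type": "function",
    "function": {
        "name": "query_database",  # hypothetical internal tool
        "description": "Run a read-only SQL query against the KYC database.",
        "parameters": {"type": "object", "properties": {"sql": {"type": "string"}}, "required": ["sql"]},
    },
}]

def run_tool(name: str, args: dict) -> str:
    # Placeholder: in a real system this would hit internal databases, search, etc.
    return f"(stub result for {name} with {args})"

messages = [{"role": "user", "content": "Prepare a KYC summary for client ACME-123."}]
for _ in range(10):  # cap the number of reasoning/tool rounds
    resp = client.chat.completions.create(model="my-self-hosted-model", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:              # no more tool use: this is the visible answer
        print(msg.content)
        break
    for call in msg.tool_calls:         # every round here burns tokens the user never sees
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

The loop structure is the point: most of the tokens are generated inside the loop, not in the final printed answer, which is exactly the inference profile discussed next.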

Meryem Arik [00:03:35]: And so for my low agency system, I'm actually generating not very many tokens. Pretty much all the tokens that I'm generating are going directly into my output. So if I ask a question, the response I'm getting is pretty much the number of tokens that are being generated. Sensitivity of access is an interesting one, which people often don't think about. In low agency systems, you're not giving the model access to any information other than what you've put in your prompt. And so it's a very limited risk situation: as long as you haven't fed anything private into the prompt, then you're kind of fine. An example of these low agency systems is simple chatbots. These aren't the kinds of systems that completely revolutionize workflows.

Meryem Arik [00:04:23]: They aren't the kinds of systems that are relied on as an essential part of a business process, because they're just too simple and too dumb. A high agency system, on the other hand, generates a load of tokens. The tokens that you see as the output are actually only a small subset of the number of tokens that it's generating. It's generating tokens every time it's looking up in a database and writing SQL queries, it's generating tokens every time it's trying to search the web, it's generating tokens every time it's calling another model or a specialized model; it's generating a load of tokens behind the scenes. This might be familiar if you think of something like reasoning models. Reasoning models do a similar thing, but you can actually see the reasoning steps. The same kind of thing is going on with agentic systems, where they're generating a lot of tokens behind the scenes to actually get you the output.

Meryem Arik [00:05:22]: These agentic systems also have really high sensitivity of access. They might have access to internal tools, internal databases. You're not just giving them information that's limited to the input, you're giving them access to a whole bunch of your internal systems, which may hold information that you aren't entirely happy sharing with third-party systems. Potentially some of these services that you give it access to might not even be hosted on the cloud, or on the same cloud that your LLM is hosted on. So that's another consideration. An example of this kind of agentic system might be a KYC agent, an agent used in financial services that can take a particular client or potential client, do all kinds of background checks on them, cross-reference them against other things they're doing with the bank, and come up with KYC reports that otherwise would be manual. So the kinds of workflows that these high agency or highly agentic systems are able to reproduce can become very, very mission critical for the business. They can be the kinds of things that people are designing entire workflows around.

Meryem Arik [00:06:34]: And so the characteristics of agentic systems are: they're very, very computationally expensive, using a load more tokens than traditional inference; they are typically deeply ingrained in sensitive data and in highly proprietary business tools; and they can be highly business critical. If my KYC agent goes down, that's a real issue for me as a bank, compared to if my chatbot that didn't really understand my business context goes down. So they can be very, very business critical. And so I'm going to argue now that those three characteristics make agentic systems really well suited to self-hosted deployment. But before I go on and think about that serving and self-hosted deployment, I first want to give an overview of the landscape of the various options that you have to access your AI. So there are a couple of different options.

Meryem Arik [00:07:36]: The most common one is using public AI APIs like OpenAI or Anthropic. These are proprietary models that you don't get the weights of, hosted in third-party environments in a multi-tenant system. One layer more private is cloud APIs like Amazon Bedrock, Azure OpenAI, or Gemini hosted on GCP. These are often either proprietary or open models, so I can access a Llama or I can access a Gemini model, but they're still multi-tenant hosted and they're still not hosted within your own single-tenant environment. And then you have self-hosted deployments. This is when you do the deployment yourself on a VM or an on-prem GPU that you have. I've put up some of the technologies that you can use for self-hosting, including a plug for ourselves at Doubleword. And this is for the deployment of either open-source models or potentially custom models that you've trained yourself, deployed in your private environment.

Meryem Arik [00:08:42]: So those are the broad spectrum of options for AI deployment that I'm going to consider in this talk. Which deployment option is right for you? They each have their pros and cons. Public APIs are really good for rapid experimentation. If I was trying to get a POC for a weekend project up and running, I'd probably use something like OpenAI. They also give you access to frontier models; if you require frontier models, then often they're almost the only choice. And if you're not particularly latency sensitive or throughput sensitive, then they're a good option, because otherwise you might get rate limited and you do get noisy neighbor effects.

Meryem Arik [00:09:20]: And if you don't have data privacy concerns about it going to, for example, OpenAI, then they're a great option. Cloud APIs have a similar dynamic, but versus public AI APIs you get a larger range of models than just the proprietary models of that particular provider. And then self-hosted models are really good when your model is interacting with sensitive data or systems, because when you're self-hosting, you're deploying in your single-tenant environment, so you know exactly where the data is going and what the model is able to see. Self-hosted models also tend to be better when you're latency or throughput sensitive, because you don't get these noisy neighbor effects and you can really control the kind of latency and throughput that you should be expecting. The reason why you would want to self-host over the public APIs and the cloud APIs is, firstly, and I've said this a couple of times, you get these noisy neighbor effects. I actually posted about this on my LinkedIn the other day. Services like Bedrock are, on the whole, really fantastic, but the service level and service quality you get changes a huge amount depending on whether you're using it on a Monday or a Friday or a Saturday, and depending on the time of day you're using it. And it also changes region by region.

Meryem Arik [00:10:44]: So for example, the US regions in Bedrock tend to have pretty good latency and throughput if you're using them first thing in the morning on a Friday, but tend not to have high performance otherwise. So that's just a constant issue: you can get peaks and troughs in the latency and throughput that you can expect. Secondly, they're multi-tenant systems by design. Even when it says, you know, your OpenAI is deployed in your VPC, it's still a shared service. And sometimes for some industries and some applications that's not always appropriate, especially if you're a multi-cloud house. And also for some people who aren't in the US there are jurisdiction problems as well. For example, if you're in New Zealand or Mexico, you won't have access within your jurisdiction to a whole bunch of the frontier models, whereas you would if you were self-hosted. And so, agentic systems are very computationally expensive.

Meryem Arik [00:11:49]: As I say, they use a bunch of tokens, and self-hosting is really, really good if you're deploying at scale: it can be cheaper and faster, and you can get reliable, high throughput. Agentic systems can be really deeply ingrained in sensitive data and very proprietary workflows, which is great for self-hosting because you're deploying in your private environment. And agentic systems tend to be applications that are highly business critical because they're able to take over entire workflows, which makes them good for self-hosting because you have control over your application: you know that the model is not going to be deprecated unless you want to deprecate it, and you have complete control of your model, your code, and everything that goes on there. And so I think there's a really nice lineup between what agentic systems need and what self-hosting your AI can offer you. So my takeaway would be that, on the whole, agents and agentic systems tend to be pretty well suited to self-hosting. The question that I'm now going to answer in the remaining couple of minutes that I have is how to go about self-hosting and what you need to get that running. And so at the most basic level, what you need to self-host is some hardware, an inference engine like vLLM or SGLang, and an open-source model. Using those three, you can kind of turn it into a sandwich: you have your hardware, some kind of GPU somewhere, an inference engine which allows you to actually run inference at reasonably high performance, and then a model that you want to inference, and this is what you get.
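As a rough illustration of that basic sandwich, here is a minimal sketch, assuming a single GPU box, the vLLM inference engine, and an open-weight model; the model name and port are just examples, not a recommendation from the talk.

```python
# Minimal self-hosting "sandwich": GPU hardware + inference engine + open model.
# First launch vLLM's OpenAI-compatible server on the GPU box (shell command):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Then any client can talk to it much like a hosted API:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local server; key is ignored
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "In one sentence, what is KYC?"}],
)
print(resp.choices[0].message.content)
```

This is the "cucumber sandwich" described next: a working API, but with none of the day-two plumbing yet.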

Meryem Arik [00:13:48]: Now I would almost describe this as a kind of cucumber sandwich. It's inference that is fine, it will work, it'll get you an API, but it's probably not the thing that you want to put in production, it's probably not the thing that you want to scale. We need a sandwich that's a little bit more fulfilling than just our plain cucumber sandwich. Once you've done this first deployment and start getting into day two maintenance, you realize the list of to-dos can get really, really intense, and questions come up like: how do I create API gateways to route requests to my various models? Because presumably I'm not hosting just one model, I might be hosting a couple. What does authorization look like on top of this gateway? If I have multiple requests to the models, how do I orchestrate them? Maybe I have, for example, some situations where I have batch requests and real-time requests coming in at the same time. How do I prioritize them effectively? How do I add autoscaling? Because presumably I don't just want this to be a completely static service; I want to be able to scale up and scale down so I can get consistently low latencies. How do I monitor my usage of these GPUs? How do I add usage and chargeback processes, etc.

Meryem Arik [00:15:05]: Etc., etc. And the takeaway here is that you could build this all yourself and DIY it, but it does get very, very complicated and turns into a bit of a headache. My shameless plug here is that this is actually the type of work that we do at Doubleword, but moving very quickly on. And so we want to get to the situation where we have the kind of sandwich that will actually fill us up, that will solve all of the day one and day two problems, of which I've listed a subset here. And this leads me on to how I think we should structure self-hosted inference teams within enterprises, because this is so, so complex and there's a lot of work to do here. And given that in an enterprise you're going to be expecting to deploy dozens of AI agents and AI applications over the next one, two, three, five years, every single team doing this themselves is going to turn into a bit of a nightmare. Which is kind of what we did with ML, right? With ML models, each team could deploy it themselves because the model was so specific to their particular use case and very resource light. I don't think that's going to work with inference.
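To give a flavor of the gateway and authorization questions listed above, here is a toy sketch; this is not Doubleword's product, and the backends, model names, and tokens are purely hypothetical. It shows a single routing layer sitting in front of several self-hosted model servers.

```python
# Toy gateway: route OpenAI-style requests to one of several self-hosted
# model servers by model name, with a very naive bearer-token check.
import httpx
from fastapi import FastAPI, Header, HTTPException, Request

BACKENDS = {  # model name -> upstream inference server (hypothetical hosts)
    "llama-3.1-8b": "http://llama-node:8000/v1/chat/completions",
    "qwen-2.5-7b":  "http://qwen-node:8000/v1/chat/completions",
}
VALID_TOKENS = {"team-kyc-token", "team-support-token"}  # stand-in for real authz

app = FastAPI()

@app.post("/v1/chat/completions")
async def route(request: Request, authorization: str = Header("")):
    if authorization.removeprefix("Bearer ") not in VALID_TOKENS:
        raise HTTPException(status_code=401, detail="unknown token")
    body = await request.json()
    upstream = BACKENDS.get(body.get("model"))
    if upstream is None:
        raise HTTPException(status_code=404, detail="model not hosted here")
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(upstream, json=body)
    return resp.json()
```

Even this toy version hints at why the list keeps growing: prioritization, autoscaling, monitoring, and chargeback all end up layered onto this same gateway.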

Meryem Arik [00:16:20]: And I think with inference our customers are already starting to see, and we're starting to see, that inference really does need to be centralized, where you can have this sandwich built once, and correctly, and have this hub-and-spoke model where others can call it. And so before I finish, I have a couple of tips on how we see our clients structure their inference within enterprises, why I think it's a really good idea, and why we built our product for this kind of reality as well. So we think you should centralize your inference ops. The reason why is that GPU resources are really expensive, which is contrary to ML, where the resources weren't as expensive. And you're also working with shared models in a way that you weren't with traditional ML. In traditional ML pretty much every single model was trained for that use case, whereas in AI inference you have models that share a common backbone. So you might, for example, have a Llama architecture and then fine-tuned variants.

Meryem Arik [00:17:27]: We have this common backbone, and so I can have a bunch of different use case teams sharing this backbone and I don't have to deploy this model 10 different times; I can deploy it once and have 10 teams hit that. It's also difficult to get inference right. As I said, it's easy to get your LLMs spun up, but it's pretty hard to actually build it in a fault-tolerant and resilient way. And so if you have this inference center of excellence, then you can get it right and do it once. But you can't just have it all be completely centralized, because use case teams do need customization. So use case teams will need the ability to do things like fine-tuning, and to be given access to different tools, et cetera. So the world we're moving towards is going away from individual teams deploying a single vLLM container, because that doesn't scale on day two, to having more central teams, typically platform teams or teams that would have dealt with MLOps, centrally deploying models that are going to be used by a bunch of different applications and offering that as a service, essentially inference-as-a-service for self-hosted inference.

Meryem Arik [00:18:36]: And so I have a couple of takeaways. Agents are really computationally expensive and are going to be deeply, deeply ingrained into mission-critical systems. As a result, agents are really well suited to self-hosted AI inference rather than third-party APIs like Bedrock and OpenAI, and self-hosted inference in the enterprise should be centralized with a hub-and-spoke model. This is the way that you can make sure you do inference right and all of the downstream teams aren't blocked by infrastructure in order to get their applications done. Thank you very much. If you have any questions, I've also popped in a QR code to my LinkedIn, so feel free to connect with me there as well.

Adam Becker [00:19:21]: Awesome, Meryem, thank you very much. This was incredible and very engaging. There's a bunch of questions for you in the chat, so let's go through them very quickly, right from the top. So Brahma was asking, and this is about one of your first slides on risk and data: self-hosted or not, if you have agents going out and accessing internal and external data, isn't there still data security risk?

Meryem Arik [00:19:55]: So, yes and no. It's about the level of risk, right? If my self-hosted model is deployed in the same environment that the data was sitting in anyway, then you can dramatically reduce and minimize that risk, and you can also put in extra controls about who can see the output of the model and all of these things. So it's a matter of degree. Yes, there is additional risk compared with the data just being completely air-gapped so no one can access it at all, but it's about the level of risk. You need to balance the data being useful to people with data security.

Adam Becker [00:20:30]: What can you tell us about, and this is from Ravi, hybrid agent approaches, where, let's say, some regulated components are self-hosted and the other ones are on the cloud?

Meryem Arik [00:20:45]: Yeah, I actually quite like this hybrid approach. Specifically, things that I like here are using one of the really big generative models as a controller model, and this can either be self-hosted or not, because they've got great intellectual horsepower, and then using smaller specialized models to do some of the legwork. So that can be an option, although often it won't solve the data privacy problem, because things will typically get fed back into the big generator model, which you might not want to see all of that data.

Adam Becker [00:21:17]: Jyoti is asking, and I have a feeling I know the answer to this: where are you self-hosting?

Meryem Arik [00:21:24]: As in where am I self-hosting? We have servers on GCP and we have a bunch of on-prem servers as well. In terms of where our clients self-host: typically on VMs on the major clouds, and some have some on-prem as well. And Snowflake.

Adam Becker [00:21:40]: Snowflake, increasingly interesting. We've got what might be a comment from Chris: everything is going to remain separate if we have every team building their own agents with different backend architectures, and in that case we're not really moving intelligence forward. I feel like you would agree with that.

Meryem Arik [00:21:58]: Yeah, I think there's a big issue at the moment where you have use case teams building in silos and not sharing common infrastructure and common knowledge. The way that we advocate for doing self-hosted inference is you centralize the knowledge of the inference infrastructure, and then you can decentralize the business-specific and domain-specific knowledge to the business and to the teams actually building the use case.

Adam Becker [00:22:26]: We've got a question here from Ricardo: what hardware did you use for a given LLM, and what parameters did you select, that worked as expected for what number of clients? I would like to have a reference relating LLM sizes and the hardware needed to the number of clients.

Meryem Arik [00:22:42]: This is a great question. We actually have a model memory calculator on Hugging Face which will say, based on what model you have and what hardware you have, what kind of batch size, request load, and sequence length you can service. I'll try and pop it into the channel once I log on.
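For a rough sense of the arithmetic such a calculator performs, here is a simplified back-of-envelope sketch (not the actual TitanML/Doubleword tool): GPU memory is roughly the model weights plus the KV cache for the batch size and sequence length you want to serve. The default shapes below are assumptions loosely modeled on an 8B Llama-style model.

```python
# Back-of-envelope GPU memory estimate: weights + KV cache (ignores activations
# and serving overhead). All default shapes are illustrative assumptions.
def estimate_gpu_memory_gb(
    n_params_b: float,            # model size in billions of parameters
    bytes_per_param: float = 2,   # fp16/bf16 weights; 1 for int8, 0.5 for 4-bit
    n_layers: int = 32,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    max_seq_len: int = 8192,
    batch_size: int = 8,
    kv_bytes: int = 2,            # fp16 KV cache
) -> float:
    weights = n_params_b * 1e9 * bytes_per_param
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * max_seq_len * batch_size * kv_bytes
    return (weights + kv_cache) / 1e9

# e.g. an 8B model in fp16 with these Llama-3-8B-like shapes:
print(f"{estimate_gpu_memory_gb(8):.1f} GB")   # ~25 GB before activations and overhead
```

In practice the real calculators also account for activation memory, tensor parallelism, and engine overhead, which is why tools like the one mentioned above are handy.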

Adam Becker [00:23:01]: Awesome. Guillermo here is saying: given the complexity and token demands of agentic systems, self-hosting often requires significant investments in hardware, dedicated support teams, and infrastructure. In this context, why would a self-hosted solution be considered safer than a cloud-hosted agentic system deployed within properly configured VPCs, especially considering the advantages of elastic infrastructure and managed services in the cloud? Or is self-hosting primarily recommended only for handling controlled and highly regulated data?

Meryem Arik [00:23:32]: So once it's at scale it can actually be much cheaper, and you can also get more reliable throughput. I definitely hear you on it being more difficult and more infrastructure to manage. That's kind of why we started the company to begin with: we wanted people to be able to have the experience of using serverless cloud-hosted infrastructure, but with it being self-hosted and, you know, on-prem or in their environment. So I totally hear that.

Adam Becker [00:23:56]: Meryem, can you drop the link to that calculator when you get a chance? I think folks are asking for it in the chat. Let's see, let's take a couple more. So Guillermo is asking again: is there any specific tool that allows data leakage, or, I mean, I imagine it's for protecting against data leakage. I guess it depends on the type of data leakage.

Meryem Arik [00:24:19]: Yeah, it really depends. I mean as best practice a bunch of people are doing like PII detection models and stuff before they pass it into at least public LLMs. And you can train different classifiers depending on the kind of thing you're looking for. Yeah, go on, go for it.

Adam Becker [00:24:40]: No, no please.

Meryem Arik [00:24:41]: Yeah, I've just posted it but I'm not sure if I posted in the right place so I'll post it again afterwards.

Adam Becker [00:24:46]: Okay, cool. Yeah, you might have posted here, so if that's the case. Yeah. Okay, I'm just gonna copy it over.

Meryem Arik [00:24:54]: And so it's using our old branding, TitanML. We're now Doubleword, but it's still the same guys behind it. Guys and gals.

Adam Becker [00:25:01]: Awesome. Okay, a couple more. So Sandra, the link is in the chat for you. Gun is saying the choice between cloud versus self-hosted for agents does sound similar to the trade-offs, I guess the classical trade-offs, for traditional software engineering.

Meryem Arik [00:25:18]: Yeah, I mean, there are a lot of similarities. I think the key difference, at least between MLOps and LLMOps, and so I'm going to talk about that rather than traditional software engineering, is how resource hungry this stuff is. Dealing with GPUs and managing GPUs is just so much more expensive and cumbersome than CPUs. But yeah, a lot of the similar trade-offs exist.

Adam Becker [00:25:42]: Ravi is asking: do you have to de-sensitize your LLM agent traces?

Meryem Arik [00:25:49]: So this will really depend on the given application. So yes, potentially.

Adam Becker [00:25:57]: Okay, let's take maybe one more. And first of all, all these people are thanking you for taking their questions. So Chris is asking: do you expect to teach companies how to run and design the LLMs on their own, or are you setting up and developing those models and these LLMs with them?

Meryem Arik [00:26:14]: Yeah, and firstly thank you for the questions as well. So typically our clients start with deploying open-source models, and our involvement is essentially creating their AIOps team and getting that set up and running in a self-hosted way. Once they're comfortable with using self-hosted models, that's when they start exploring fine-tuning as well.

Adam Becker [00:26:34]: Meryem, this has been an absolute pleasure having you here today with us.

Meryem Arik [00:26:39]: Thanks so much.

Adam Becker [00:26:40]: Thank you. Stick around in the chat in case folks have more questions. I will drop your LinkedIn in the chat below as well, and we'll see you next time. Thank you very much.
