Generative AI Agents in Production: Best Practices and Lessons Learned // Patrick Marlow // Agents in Production
Patrick is a Staff Engineer on the Vertex Applied AI Incubator team, where he focuses on building tooling and reusable assets to extend Google’s cutting-edge LLM technologies. He specializes in the Conversational AI ecosystem, working with products such as Vertex Agents, Vertex Search, and Dialogflow CX. Previously, he was an AI Engineer in Google’s Professional Services Organization.
Data is a superpower, and Skylar has been passionate about applying it to solve important problems across society. For several years, Skylar worked on large-scale, personalized search and recommendation at LinkedIn, leading teams to make step-function improvements in their machine learning systems to help people find the best-fit role. Since then, he has shifted his focus to applying machine learning to mental health care to ensure the best access and quality for all. To decompress from his workaholism, Skylar loves lifting weights, writing music, and hanging out at the beach!
Generative AI Agents represent the current frontier of LLM technology, enabling dynamic interactions and intelligent workflow automation. Building and deploying production-ready agents, however, requires navigating various complexities and understanding key lessons learned in the field. This session distills over two years of practical experience into actionable best practices for successful agent development. We'll dive into techniques like meta-prompting for prompt optimization, establishing robust safety and guardrails, and implementing comprehensive evaluation frameworks. By addressing these critical areas, you'll be well-equipped to build more effective and reliable Generative AI Agents. Check out Patrick's recently published whitepaper on agents: https://www.kaggle.com/whitepaper-agents
Skylar Payne [00:00:02]: All right, we're just about ready to get started with our next session. Patrick here is going to bring the chill vibes instead of the music. How's it going, Patrick?
Patrick Marlow [00:00:15]: Doing great. How are you, Skylar?
Skylar Payne [00:00:17]: Great, great. Having a great day hosting this conference on track three. We've had a lot of great talks so far, and I'm super excited for this next one. So I'll just kick it off with a quick intro and let you take it away. Patrick is here, a staff engineer from the Vertex Applied AI Incubator at Google. And I think the big thing you should know about Patrick is he just published a white paper on agents with some collaborators, leveraging his wealth of experience working on conversational AI for a long time. So you should definitely go check that out. It just dropped on Kaggle.
Skylar Payne [00:00:55]: So that's definitely a way to kind of read up more on these ideas. But without further ado, let's just jump into it. Let's take it away.
Patrick Marlow [00:01:04]: Cool. Thanks, Skyler. And thanks everyone for having me today. You know, it's been a really cool conference so far. I've been watching a lot of the talks. They're all really great. Got to meet some llamas. I thought Dalai Lama is the best name for a llama ever.
Patrick Marlow [00:01:17]: It's really great. You know, it's funny, when Demetrios first asked me to do this presentation, I was going to put together something around agents with Gemini, things like that. Then I started noticing that everyone was doing "how to build an agent," so I made a slight pivot, and I'd like you to accompany me on this pivot journey today. Really, what I'm going to talk about today is lessons learned over the last couple of years of delivering generative AI agents and just being in the trenches. Getting started, a little bit about me: my name is Patrick Marlow, staff engineer on the Vertex Applied AI Incubator team at Google.
Patrick Marlow [00:01:52]: It's a mouthful. What does that mean? My team and I work on a lot of the cutting-edge stuff when it comes to large language models: function calling, the Gemini SDKs, things like that. I have been in the conversational AI and NLP space for a little over 12 years. Right now my focus is on multi-agent architectures and memory usage in agents, understanding how agents use these things efficiently. As Skylar just mentioned, I dropped a white paper this morning that I was a co-author on. It was really, really cool to be a part of that. So if you're interested in agents, and obviously you are, you're here, go check out that white paper.
Patrick Marlow [00:02:28]: I'm also a large contributor to the open source community. I manage a couple of open source repos at Google and I'm a contributor to LangChain as well. So I really enjoy that part of large language models and the agent community. It's really great. Okay, so let's get started. I'm going to do a super brief history of large language model applications and then we'll get into the lessons learned. So in the very early days, when we first started this whole GenAI journey, there were basically just models. As a user, you would send a query into that model, the model would magically respond with a bunch of amazing tokens, and you would look at the answers and you'd say, hey, these are really great.
Patrick Marlow [00:03:05]: But one of the things that we noticed very quickly is this concept of hallucinations: the model didn't always get things right, or it was confidently wrong. Very quickly we moved on from this architecture of just models into this concept of retrieval-augmented generation, or RAG. I would say 2023 was definitely the year of RAG. That's when everyone was implementing RAG solutions. This also brought the rise of a ton of amazing companies in the vector DB space. Now we had vector databases where we could do embeddings of all the different types of data that we wanted to store. Now when we send those queries in, the model can actually ground itself with external knowledge from these vector databases. We had fewer hallucinations, but still,
Patrick Marlow [00:03:51]: you could think of this as a single-shot architecture. There was a query in, there was a retrieval, there was some generation, and we're done right at the end. But what we were really lacking is the ability to add another level of orchestration around that. And that's really where we get the idea of agents. So agents came about late 2023 into early 2024, or at least came into the mainstream then, I should say. And we start to see this concept of reasoning with agents, orchestration with agents, being able to do multi-turn inference and things like that happening under the hood. You've got tools that they have access to, sometimes multiple models that they have access to, things like that. The concept of agents that we're going to be talking about delivering in production today is this architecture, and there's lots of variations on this architecture.
Patrick Marlow [00:04:38]: But roughly, this is what we talk about when we talk about agents in production. Getting to production, the first thing that I'd like to point out is that a production agent is not just a simple model. I think over the last couple of years there's been a lot of hyper-focus on which LLM you're using. Are you using o1? Are you using 3.5 Turbo? Are you using 4? Are you using Gemini Pro or Flash? There's a lot of hyper-focus on that model. But the reality of the situation is that your production agents are more than just the model. In fact, there's a lot that goes into it, whether you're taking that model and adding grounding to it, or adding tuning, or all the prompt engineering that goes around that.
Patrick Marlow [00:05:20]: And then you really start to see that there's more and more that gets layered into this, and you start to realize that there's an entire system design practice that goes into building out an agent, including orchestration, API integrations, CI/CD, analytics. There's so much that really goes into it. And so you realize that an agent, again, is not just a model anymore; you're really starting to think about your software development lifecycle best practices, how you do system design, how you put everything together. And if I'm predicting the future here, I would say that at some point a lot of the models that we're looking at are all going to become commoditized. They're all going to be fast, they're all going to be good, they're all going to be cheap. And so really what you're left with is the ecosystem that you're building around your model to make things truly good. And that's where your secret sauce comes into play for building your agents and putting them out in production. So today I'm going to touch on just a few of these things and summarize them into three points.
Patrick Marlow [00:06:19]: I'm going to touch on this concept called metaprompting, touch a little bit on safety and guardrails, and then finish up with evaluations. And really, after delivering hundreds of models into production with lots of different developers and customers and partners, these are the three consistent themes that we saw across everyone who was building really, really high quality agents in production. I want to share all that information with you today. So first we'll start with metaprompting. The TLDR around metaprompting is essentially that you're using AI to build AI. Now, the whole metaprompting architecture starts simply like this: you have some sort of metaprompting system, which we'll talk about in a little bit more detail here in a second.
Patrick Marlow [00:07:03]: And that metaprompting system is going to generate some prompts for a secondary system. We'll call that the target agent system; this is the agent you're going to put out into production. That agent will then produce some sort of responses, which you can then evaluate, and you can send those evaluations back into your metaprompt system. So you go around the circle as many times as you want to optimize the types of prompts that you would be sending or using in your target agent system. To give you a little bit of intuition around how something like that might look in production, here is a handcrafted prompt that I wrote. I'm saying, hey, you're a Google-caliber software engineer. You've got exceptional expertise in data structures and algorithms. Here are the things that you can do.
Patrick Marlow [00:07:46]: You can solve coding problems, you can debug code, you can write docstrings and perform reviews, and so on and so forth. It's the typical type of prompt that we might see in production, or that someone might write from a prompt engineering perspective. But the thing is, when we actually feed this to a metaprompting system, what we're asking the LLM to do for us is: take this prompt and write it in higher fidelity. Elaborate on all of this and add more description so that a secondary LLM system will be able to use it with much more accuracy. So after passing that through an LLM optimizer, this kind of metaprompting technique, you get something like this. Now, semantically it's pretty much the same as the previous prompt that you saw, but the difference is how it gets structured and the language that gets used. And again, the way I like to think about it is that humans aren't always necessarily great at explaining themselves. So what you're doing here is leveraging the LLM's capability to write, not necessarily to embellish, but to describe in higher fidelity and more detail the tasks that you want completed.
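As a rough, hypothetical sketch of that rewrite step (not the actual tooling Patrick's team uses), the core idea fits in a few lines. `call_llm` below is a placeholder for whichever model client you actually use, not a real SDK call:

```python
# Minimal sketch of LLM-driven prompt rewriting. `call_llm` is a placeholder
# for whatever model client you actually use (Gemini, GPT, Claude, ...).

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM of choice and return its text."""
    raise NotImplementedError("wire this up to your model client")

OPTIMIZER_INSTRUCTIONS = (
    "You are an expert prompt engineer. Rewrite the prompt below in higher "
    "fidelity: define the persona precisely, expand every capability into a "
    "clear, detailed description, and structure the result so a second LLM "
    "can follow it accurately. Return only the rewritten prompt."
)

handwritten_prompt = (
    "You are a Google-caliber software engineer with exceptional expertise "
    "in data structures and algorithms. You can solve coding problems, "
    "debug code, write docstrings, and perform code reviews."
)

# One pass through the "LLM optimizer": same semantics, much higher fidelity.
optimized_prompt = call_llm(f"{OPTIMIZER_INSTRUCTIONS}\n\nPROMPT:\n{handwritten_prompt}")
```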
Patrick Marlow [00:08:58]: And so there's a big difference between putting something like this into production versus putting the small snippet that I was showing previously into production. To show you a little bit of how that architecture could come into play, we'll talk about one of the first concepts for doing metaprompting, which is called seeding. The way that this works is essentially that you would start with your metaprompt system, and you would start with some system prompt. That system prompt might look something like this: hey, you're an expert at building virtual agent assistants. You're going to help this user write prompts for this virtual agent assistant system. Then you, the developer, would write what we'll call the seed prompt. Now, the seed prompt is essentially the same type of prompt that you saw me write previously in the handwritten prompt section.
Patrick Marlow [00:09:43]: But I'm going to give a little bit more information. I'm going to say, hey, the end user's company that you're writing this prompt for is Google Cloud, and they're a software engineer, and here's the task that we want to have accomplished. And then what happens is that the metaprompt system will generate these target agent prompts. Now, you can go around this loop as many times as you want, refining that prompt and modifying it in any way that you want. But when you're satisfied with the prompts that have been generated, you then drop those into your target agent system. So this is a really great way, if you're building an agent for the first time and you're not super great at prompt engineering and you're like, hey, I'm not a great creative writer, but I need something high fidelity to really start my first agent off. It's a super great system to use to kick off your agent prompt for the first time.
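The seeding flow might look roughly like this in code. It reuses the `call_llm` placeholder from the previous sketch, and the company, role, and task details are purely illustrative:

```python
# Seeding sketch: a metaprompt system prompt plus a developer-written seed
# prompt produce the first draft of the target agent's prompt.
# call_llm is the same placeholder defined in the earlier sketch.

METAPROMPT_SYSTEM = (
    "You are an expert at building virtual agent assistants. Given details "
    "about the end user's company, role, and task, write a complete, "
    "high-fidelity system prompt for that virtual agent."
)

seed_prompt = (
    "Company: Google Cloud\n"
    "Role: Software Engineer\n"
    "Task: solve coding problems, debug code, write docstrings, "
    "and perform code reviews."
)

# First draft of the target agent prompt.
target_agent_prompt = call_llm(f"{METAPROMPT_SYSTEM}\n\n{seed_prompt}")

# Go around the loop as many times as you like before dropping the result
# into the target agent system, e.g. asking for targeted refinements.
target_agent_prompt = call_llm(
    f"{METAPROMPT_SYSTEM}\n\nCurrent draft:\n{target_agent_prompt}\n\n"
    "Refine the draft: be more specific about code review expectations."
)
```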
Patrick Marlow [00:10:31]: Now, the second way that we can use metaprompting is in the concept of optimization. The way optimization works is just taking this system a step further. Once we put our agent into production, our agent is going to start producing responses. We can take those responses and evaluate them against different types of metrics. I think lots of other speakers today have talked about evals and things like that. Imagine evaluating your responses against coherence and fluency and semantic similarity, things like that. Essentially, what you do is you take those evaluations and send them back to your metaprompt system and you say, hey, given this existing prompt that I have, and here's a set of evaluation data points that we now have, can you optimize my prompt for better coherence? Can you optimize my prompt to reduce the losses in my tool calling? Then what this metaprompt system will do is, again, just go around that cycle and write better prompts to optimize for the metrics that you want optimized. Now, you might be looking at this system and you might be feeling a little bit like this Xzibit meme, where it's just like we're writing prompts to write prompts to produce prompts to do all kinds of prompts.
Patrick Marlow [00:11:42]: And I totally feel you on it. The cool thing is there are a lot of really great systems out there that you can use to do this type of prompt optimization. Now, if you're following along, take a screenshot here, capture this with your phone. This is going to take you over to a YouTube video that's a bit more lengthy, talking about how you can do this metaprompting with Gemini, with OpenAI, with Claude, with all the different systems. If you're also familiar, there are some really great open source systems out there like DSPy, AdalFlow, and Vertex Prompt Optimizer from Google. There's a lot of really great stuff out there that you can use for prompt optimization.
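To make that optimization loop concrete, here is a minimal, framework-agnostic sketch. `run_agent`, `evaluate_responses`, the eval queries, and the reuse of `call_llm` and `optimized_prompt` from the earlier sketches are all illustrative placeholders; tools like DSPy or Vertex Prompt Optimizer implement far more sophisticated versions of the same idea:

```python
# Optimization sketch: run the agent, score its responses, feed the scores
# back to the metaprompt system, and repeat. Everything here is a placeholder.

def run_agent(prompt: str, queries: list[str]) -> list[str]:
    """Placeholder: run your target agent system with `prompt` on `queries`."""
    raise NotImplementedError

def evaluate_responses(responses: list[str]) -> dict[str, float]:
    """Placeholder: score responses for coherence, tool-call accuracy, etc."""
    raise NotImplementedError

eval_queries = [
    "Reverse a linked list in Python",
    "Why is my binary search off by one?",
]
current_prompt = optimized_prompt  # from the earlier rewriting/seeding sketch

for _ in range(3):  # go around the cycle as many times as you want
    responses = run_agent(current_prompt, eval_queries)
    scores = evaluate_responses(responses)
    current_prompt = call_llm(
        "You optimize prompts for virtual agents.\n\n"
        f"Existing prompt:\n{current_prompt}\n\n"
        f"Evaluation results: {scores}\n\n"
        "Rewrite the prompt to improve coherence and reduce tool-calling "
        "losses. Return only the rewritten prompt."
    )
```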
Patrick Marlow [00:12:19]: So that's a little bit about metaprompting. We're going to move on to the next one, which is safety. I think for a lot of users putting their agents into production, their architecture might look something like this, especially if you're putting an agent into production for an internal use case. You're thinking, hey, my users are all super friendly. I'm going to give them a UI, I'm going to put a little bit of prompt engineering as my defense layer against my agent so they don't do anything dumb, and then I'm basically going to let this cycle repeat. But the reality is, if you put those agents out into the wild, so you actually let them go into the public domain, there are a lot of bad actors out there, right? And so those bad actors are going to be looking to break your agent. They're going to be looking to do prompt engineering, they're going to try to figure out a prompt injection, they're going to try to figure out ways to game the system that you've built.
Patrick Marlow [00:13:12]: And so it's important to think about that ahead of time and start implementing multilayered defenses against these types of things. And it's not always prompt injection. Where this really starts is thinking about input filters. When a user query is actually coming into the system, you can do things like language classification checks or category checks or session limit checks and things like that. For example, there are a lot of prompt injection techniques that play out over the course of many turns. So a simple safety measure is essentially not allowing your users to talk over 30 turns or 50 turns or something like that, because if you look at the long tail of conversation turns, that's where many of the bad actors are living, right? So implementing these types of input filters is also super important.
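A bare-bones sketch of that kind of input filtering might look like this; the turn limit, language set, categories, and classifier helpers are stand-ins for whatever your own stack provides:

```python
# Layered input filters applied before a query ever reaches the agent.
# detect_language / classify_category are placeholders for real classifiers.

from typing import Optional

MAX_TURNS = 30                      # cut off the long tail where abuse tends to live
SUPPORTED_LANGUAGES = {"en"}
BLOCKED_CATEGORIES = {"prompt_injection", "self_harm", "pii_harvesting"}

def detect_language(query: str) -> str:
    """Placeholder: return an ISO language code for the query."""
    return "en"

def classify_category(query: str) -> str:
    """Placeholder: return a content/safety category for the query."""
    return "general"

def input_filters(query: str, session_turns: int) -> Optional[str]:
    """Return a canned refusal if the query should be blocked, else None."""
    if session_turns > MAX_TURNS:
        return "This session has reached its limit. Please start a new one."
    if detect_language(query) not in SUPPORTED_LANGUAGES:
        return "Sorry, I can only help in English right now."
    if classify_category(query) in BLOCKED_CATEGORIES:
        return "Sorry, I can't help with that."
    return None  # query is allowed through to the agent
```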
Patrick Marlow [00:13:59]: Now, on the agent side of things, most people already have their APIs secured or are implementing some sort of safety filters, which is super great, but you also have to think about the return journey. So once the response has actually been generated from the agent and you're about to send it back over to the user, what types of things are you doing to protect that payload coming back? A lot of times it's thinking about things like: do I need to add error handling or retries? Does your LLM API sometimes give you a 500 error? Being able to handle those types of things, controlled generation, JSON outputs, that's all super important to generating a really safe solution. The other thing that we found users often overlooked is this concept of caching. I think there's a propensity for developers to always want to build the latest and greatest system, use the coolest technology. But if we take a step back, at the end of the day, the only thing that matters when you're putting these systems into production is the outcome that they achieve. It doesn't matter that they achieved that outcome with the latest top-of-the-line model; it just matters that they achieved a high quality outcome. And maybe that high quality outcome comes from caching responses. And so you can completely bypass your agentic system by simply knowing that you're going to have the same query over and over and over again: cache the response to that query and send it back to the user.
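Those response-side protections, retries on transient errors plus a cache that bypasses the agent entirely for repeated queries, might look roughly like this; the agent call and exception type are placeholders:

```python
# Response-side sketch: cache repeated queries so the agent (and its tokens)
# are skipped entirely, and retry transient API failures with backoff.

import hashlib
import time

class TransientAPIError(Exception):
    """Placeholder for your LLM client's 5xx / timeout exception."""

def call_agent(query: str) -> str:
    """Placeholder: invoke your actual agent system."""
    raise NotImplementedError

response_cache: dict[str, str] = {}

def answer(query: str) -> str:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in response_cache:          # cache hit: no tokens spent, instant reply
        return response_cache[key]
    for attempt in range(3):           # simple retry with exponential backoff
        try:
            result = call_agent(query)
            break
        except TransientAPIError:
            time.sleep(2 ** attempt)
    else:
        return "Sorry, something went wrong. Please try again."
    response_cache[key] = result
    return result
```

Note that exact-match caching like this only helps when queries repeat more or less verbatim; semantic caching keyed on embeddings is a common extension of the same idea.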
Patrick Marlow [00:15:21]: You don't even have to involve your LLM at all, or your agent system. You save money on tokens and you get a lot of speed back as well. So these are just little things to think about as you're building your agentic system and putting it out in production. The last piece is really thinking about feeding these signals back into your data analyst teams, your data science teams, and things like that, to be able to do things like take a look at analytics. What are people talking about? Then use those signals to inform how you're updating your prompts, your input filters, your output filters, all of the different things that come into play with that. So there's a lot around safety, a lot around creating this type of layered defense. Think about this as you're putting your agents into production. Okay, the last one that we're going to touch on today is evaluations, or evals. You might be thinking, Patrick, what are evaluations? Why do I need them? The one thing I would say is, if you're building agentic systems, the number one thing that you could do is just implement evaluations.
Patrick Marlow [00:16:24]: If you don't do anything else, implement evaluations, because it really helps provide a sense of measurement and gives you a barometer for how well your agents are doing in production. So again, evaluations are simply a process to measure the performance of your agent and identify losses and areas of improvement. So imagine a scenario that goes something like this; we've probably all already gone through it. You build a new agent, right, and you launch that thing out into production. Everyone's high-fiving and patting themselves on the back. We did a really great job. Our agent's out in prod.
Patrick Marlow [00:17:00]: Awesome, right? Then you go to release your next feature that's going to be attached to that agent. Maybe you're adding a new tool call, a connection to a database. You change your prompt up a little bit, but all of a sudden all of your users are coming back and saying, hey, this thing is garbage now. It's not responding correctly, it's hallucinating. There are all these things going on, and you and your team are sitting there going, what's happening? You start manually inspecting every single response that's coming back from the agent: in my base agent, the responses were coming back like this, but in my updated feature agent, my responses are coming back like this.
Patrick Marlow [00:17:40]: Why is this happening? You might feel a little bit like our friend Mr. Vincent Vega here, just wondering: what is going on? How do I solve for this, and why am I having to do all this stuff manually? Luckily for you, there are a lot of really amazing frameworks out there that help you do evaluations of your agent systems, your RAG systems, your LLM systems, all of the above. They all really start with this concept called a golden dataset. Some people also call these expectations; it's all really the same kind of stuff. What this really boils down to is that you need to define the ideal scenario for what should happen when someone is interacting with your agent. So when a user says this, the agent should say this. When a user responds with this, the agent should call a tool with these inputs and then it should say this.
Patrick Marlow [00:18:29]: So you're defining these expectations, and then what this allows you to do is take those expectations and compare them against your agent responses, the things that are actually happening at runtime, at inference time. And then you can score those against a set of metrics, again like semantic similarity, tool calling, coherence, fluency, things like that. Then, as you iterate on your agent, your expectations stay mostly static and you're able to see changes or variations in what's happening as you deploy your changes over time. So you might see, oh, my tool calling is actually suffering, and that is causing my semantic similarity and my agent responses to also suffer. This is why our users are not happy with this particular system. Again, evaluation is just a way to measure what's going on in your agentic system. Now, to give you one more real-world scenario of something that we see all the time with our customers and our partners: evals around multistage RAG pipelines.
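Before that RAG example, here is a toy version of a golden dataset and the comparison pass just described; the record shape, the scoring helper, and the agent interface are illustrative rather than any particular framework's schema:

```python
# Golden dataset ("expectations") sketch plus a simple comparison pass.
# semantic_similarity and the agent's return shape are placeholders.

golden_dataset = [
    {
        "user": "Reset my corporate password",
        "expected_tool": {"name": "reset_password", "args": {"system": "corp-sso"}},
        "expected_reply": "I've sent a password reset link to your work email.",
    },
    # ... more expectation records
]

def semantic_similarity(a: str, b: str) -> float:
    """Placeholder: embedding-based similarity score in [0, 1]."""
    raise NotImplementedError

def run_eval(agent) -> dict[str, float]:
    """Compare runtime agent behavior against the (mostly static) expectations."""
    tool_hits, similarities = [], []
    for case in golden_dataset:
        actual = agent(case["user"])   # assumed to return {"tool_call": ..., "text": ...}
        tool_hits.append(actual["tool_call"] == case["expected_tool"])
        similarities.append(semantic_similarity(actual["text"], case["expected_reply"]))
    return {
        "tool_call_accuracy": sum(tool_hits) / len(tool_hits),
        "semantic_similarity": sum(similarities) / len(similarities),
    }
```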
Patrick Marlow [00:19:28]: I'm sure all of us have built a multistage pipeline like this before. Essentially, imagine you have a user query that comes into a system, and you have a model that does a query rewrite on that input query to fix spelling or things like that. Then that query goes into a retrieval system in the middle, a vector database, fetching all these different pieces of information. Then at the end, you take that query and all of the retrieved information, you send it to a model again to do a summarization, and you get some result from it. You're essentially evaluating the outside of this pipeline, and you're looking at it and you're saying, hey, cool, we're getting a really great score here. Then someone says, hey Patrick, you can actually improve this score if you use a re-ranker in the middle. So you implement re-ranking in the middle and you see that your score is going up and you're like, hey, everything is going great.
Patrick Marlow [00:20:14]: Then let's say your vendor comes back to you and says, hey, we just launched this brand new model. It's the best thing since sliced bread. You've got to put it everywhere in your systems, because we just think it's better than everything that you've built before. You say, okay, cool, I'll try your new model. You throw that new model into production, and then what happens? All of a sudden your eval pipeline is saying, hey, things are suffering. Your responses are coming back really bad. But the interesting thing is you don't really know why that's happening, because you're using that model everywhere. Is it the query rewrite that's suffering, or the summarization, or the re-ranking? You don't really know what's going on, because you're only evaluating the outside of that pipeline. So this is where it's also important to think about evals not only from an end-to-end perspective, but also at every stage of the pipeline.
Patrick Marlow [00:21:03]: So that means performing evaluations on the query rewrite itself, and also on the retrieval, and also on the summarization, because this allows you to start to suss out what is happening at each stage and identify, oh, actually the largest losses that we're seeing are happening inside of the summarization stage. And so we could just swap that back to the previous model. We'll have two models out in production, and that will give us the highest quality results. So again, the takeaway with evals is: if you're not doing anything else, you absolutely have to be doing evals on your agents in production so that you can understand what's going on at each stage of the way. So if you're looking for a follow-up or leave-behind, take a screenshot of this, capture it with your phone. This will take you to one of our open source repos where we have a ton of notebooks and code around how to run evaluations with the Vertex Rapid Eval SDK. You can do this with a lot of other frameworks as well.
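Sketching that stage-by-stage idea in code, it might look something like the following; every stage function and scorer here is a placeholder for your own pipeline and whichever eval framework you use:

```python
# Evaluate each stage of a multistage RAG pipeline separately, not just
# end to end, so a regression can be traced to rewrite, retrieval, or
# summarization. rewrite_query, retrieve, summarize, recall_at_k,
# groundedness, and semantic_similarity are all placeholder helpers.

def eval_pipeline(test_cases: list[dict]) -> dict[str, float]:
    scores = {"rewrite": [], "retrieval": [], "summarization": [], "end_to_end": []}
    for case in test_cases:
        rewritten = rewrite_query(case["query"])                        # stage 1
        scores["rewrite"].append(
            semantic_similarity(rewritten, case["expected_rewrite"]))

        docs = retrieve(rewritten)                                      # stage 2
        scores["retrieval"].append(recall_at_k(docs, case["relevant_docs"], k=5))

        answer = summarize(rewritten, docs)                             # stage 3
        scores["summarization"].append(groundedness(answer, docs))

        scores["end_to_end"].append(                                    # whole pipeline
            semantic_similarity(answer, case["expected_answer"]))
    return {stage: sum(vals) / len(vals) for stage, vals in scores.items()}
```

Tracking these per-stage scores across deployments is what lets you pin a regression on, say, the summarization stage rather than on the pipeline as a whole.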
Patrick Marlow [00:21:59]: Again, it doesn't really matter to me which one; what's important is that you're actually doing evals on your production agents and understanding how to use some of these frameworks. Cool. All right, wrapping everything up and then coming back around to Q&A as we have time. Remember, the lessons learned here come from working with developers like yourself, working with partners and customers, and seeing all the things that people are trying. These were the most common things that we found to be the most impactful for building high quality agents in production: using metaprompting or prompt optimization techniques, implementing various stages of safety, guardrails, and error handling, and then, absolutely, every single engagement that we worked on had some form of evaluations in there. Without evaluations, there's really no way for you to tell what's actually going on in your system or how well it's working. With that, I will bring Skylar back on and I guess we'll jump over to the Q&A.
Patrick Marlow [00:23:01]: We can go from there. I'll leave this up if you want to connect with me as well.
Skylar Payne [00:23:06]: Awesome. Thank you so much. There are a few questions in the chat; we'll just jump through them. First question: when do you plan to release the Vertex AI Agents API to use the most powerful Gemini models to build assistants?
Patrick Marlow [00:23:22]: Yeah, I can't particularly give you the answer to that, because a lot of these things are still in preview. I unfortunately can't give you the timelines for when those things will be released, but they are being worked on today. If you're looking for something that is already a GA product where you can build agents, we have the Conversational Agents platform. This was previously known as Dialogflow CX. The Dialogflow CX platform, now the Conversational Agents platform, has a fully functioning API. You can build all of these agent experiences via the API; you don't have to go to the UI. One of the open source libraries that I manage, called SCRAPI, is actually a wrapper around this. I guess shameless plug there.
Patrick Marlow [00:24:07]: Try out SCRAPI to build agents. It's really fast to build agents with, and you have access to all of the latest models in that framework as well.
Skylar Payne [00:24:15]: Awesome. How do you think about managing versions of agents?
Patrick Marlow [00:24:21]: This is a really interesting one. It comes up all the time for me. I think the easiest way that we found to manage versions of agents is to break the agent up into all of its individual components and think of it all as code. The way we manage agents internally is basically pushing all of the prompts, all of the functions, all of the tools, everything into Git repos, and essentially you treat everything as code, right? And that even means the prompts themselves. So you're looking at, what are the diffs between these prompts? Do we need to roll back to previous commits, things like that? And so that's honestly the best way to manage versions of agents: treat them the same way that you would in a traditional software development lifecycle with CI/CD.
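A tiny sketch of what that "everything as code" setup can look like in practice; the repository layout and file names here are made up for illustration:

```python
# "Everything as code" sketch: prompts, tool schemas, and config live in a
# Git repo, so versioning, diffs, and rollbacks work like any other code.
#
#   agent-repo/
#     prompts/support_agent.txt     <- the system prompt, reviewed via PRs
#     tools/reset_password.json     <- tool/function schemas
#     config/agent.yaml             <- model, temperature, routing config
#
# Paths below are illustrative.

import json
from pathlib import Path

def load_agent_assets(repo_root: str) -> dict:
    """Load the versioned agent definition from a checked-out Git commit."""
    root = Path(repo_root)
    return {
        "system_prompt": (root / "prompts" / "support_agent.txt").read_text(),
        "tools": [
            json.loads(p.read_text())
            for p in sorted((root / "tools").glob("*.json"))
        ],
    }
```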
Skylar Payne [00:25:11]: Awesome. We're at time. It was a great presentation. I really loved the key takeaways that we had here and thought it was simply summarized. So everybody, Patrick has left his info here, so go ahead and connect with him. With that, we'll mosey on forward. Thanks for your time, Patrick.
Skylar Payne [00:25:32]: Take care.
Patrick Marlow [00:25:33]: Cool. Thanks, everyone. Thanks, guys.