MLOps Community

Beyond Chatbots: How to build Agentic AI systems with Google Gemini // Philipp Schmid // Agents in Production 2025

Posted Jul 23, 2025 | Views 190
# Agents in Production
# Google DeepMind
# Chatbots
# Google Gemini

SPEAKER

Philipp Schmid
AI Developer Experience @ Google DeepMind

Philipp Schmid is a Senior AI Developer Relations Engineer at Google DeepMind, working on Gemini and Gemma with the mission to help every developer build and benefit from AI in a responsible way.


SUMMARY

As AI continues to evolve, we will see a shift from static chatbots to dynamic agentic AI systems capable of autonomous reasoning, tool integration, and multi-step problem-solving. This talk explores how to design AI agents that leverage structured outputs, function calling, and workflow orchestration with Google Gemini.


TRANSCRIPT

Philipp Schmid [00:00:09]: So yeah. Hi everyone, I'm Philipp, I'm now part of the AI DevRel team at Google DeepMind, and we're going to talk a bit about, okay, what comes after chatbots, right? Everyone has seen and used ChatGPT, Claude, Gemini, but what's coming next? Quickly to myself: yeah, Philipp, AI DevRel, mostly working on AI developer experience. So how can we make sure that what we are going to release is in a state that everyone enjoys using? And I would say that we are currently at a turning point, right? In the last 10, 20, 40 years, AI and machine learning was all about predicting, finding patterns, trying to help us understand our data. And now, with ChatGPT and agents and Claude and Gemini, we are moving into doing something: we see that AI takes action instead of just giving us some kind of recommendations. And what we are going to do today is really look at a brief history of agents, where they are coming from. Then, what are agents? Yes, there are loops. And then look in detail at the different agentic patterns for how we can build those agents.

Philipp Schmid [00:01:16]: Look a bit at how evaluation of agents differs from LLM or general machine learning evaluation. Then look at some cool practical examples and quick starts on how you can start building agents with Gemini. And then maybe we have some time to talk a bit more about the future. So, where are we coming from? GPT and GPT-2, all of the first transformer models, were all about completion. We had models that started to generate text, and we learned, wait, they can generate poems and they are really nice autocompletion tools. And then we started to move into a world of, hey, what if we train them on data so that when I ask a question, the model doesn't continue with another question but really answers it? So we went into instruction tuning, where we basically trained the models to still complete text, but in a way that's more helpful for us. And then with ChatGPT at the end of 2022, it was all about conversations: how can I go beyond a single instruction with question and answer to really long multi-turn conversations? Then we entered a phase of, okay, that's nice, but what if I want a model to do something for me, like get information from an API? So we went back to instructions but added functions, where the model started to generate an output which helped us parse and then call APIs.

Philipp Schmid [00:02:40]: We are now at the point where, with all of those reasoning models and thinking models, we have way longer inference time. So the user-to-model response is no longer within seconds; it's really starting to go into minutes or even hours. And the whole big difference is that we as users just provide a goal and the functions the model can use, and then we expect the model to start, do a lot of different things, and come back with a very nice and good response for us. So where are agents coming from? I think the first paper was published by OpenAI, where they fine-tuned a pre-ChatGPT model, InstructGPT, on how to browse the web. The model had very simple actions; it received the state of the browser, and they used reinforcement learning to teach the model: okay, what should I do, where should I click? That was basically the first browser agent for LLMs. In addition to this, Meta also worked on tool use: how can we train models in a more general way to help the model understand for which question or instruction it should use a tool, rather than trying to just complete or continue the text?

Philipp Schmid [00:03:58]: Then we kind of started the whole reasoning journey. The ReAct paper in 2022 was the first kind of abstraction away from the single request-response pattern: how can we have the model first think or reason about what it should do, predict an action, and then observe the response to either call another action or return to the user? This whole ReAct pattern basically made its way into the models themselves. Today we mostly have function calling or tool use, where we provide a user input and the model either thinks about it or directly starts generating structured output, most of the time JSON, with the function the model suggests calling and the arguments it wants to pass. What is an agent? At least on our side, we define an agent as a system that uses an LLM to decide the control flow of an application. And I think it's very important to highlight the decision making here, because there's a big difference between just chaining LLM calls after each other and really having the LLM decide, okay, what should I do next? As agents, we have our reasoning, or our brain.

Philipp Schmid [00:05:12]: That's basically the model we are using during our conversation. We have tools, which are basically the hands or senses of the agent, with which it can go query an API and retrieve data for us. And then of course we need context and memory. Here we try to separate between short-term memory, which is basically everything you have in your conversation, and long-term memory, which is information and preferences from prior conversations with the agent, maybe from your own user experience; those are external and then provided to the agent. And if we put that all into a loop, we have the LLM more or less generating the whole time until the user request is served (a toy version of this loop is sketched below). If we take another quick look at what an agent is versus what a workflow is: a workflow is a predefined sequence of steps. So we might have a user input to query or generate a story, then another LLM continues generating the story, and then there's a final output. In an agent, we really have some kind of decision making.
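
To make that loop concrete, here is a toy sketch of the agent loop just described. `call_llm` is a hypothetical stand-in for a real model call, stubbed so the example runs end to end; a real implementation would send the history plus tool schemas to a model API.

```python
from typing import Callable

# Hypothetical, stubbed model call: a real implementation would send the
# history plus tool schemas to a model API and parse the reply.
def call_llm(history: list[dict]) -> dict:
    if any(m["role"] == "tool" for m in history):
        return {"type": "final_answer", "content": "It is 22°C in Toronto."}
    return {"type": "tool_call", "name": "get_weather", "args": {"city": "Toronto"}}

def run_agent(goal: str, tools: dict[str, Callable], max_steps: int = 10) -> str:
    history = [{"role": "user", "content": goal}]   # short-term memory
    for _ in range(max_steps):
        decision = call_llm(history)                # the LLM decides the next step
        if decision["type"] == "final_answer":
            return decision["content"]              # user request is served
        result = tools[decision["name"]](**decision["args"])
        history.append({"role": "tool", "content": str(result)})
    return "Stopped: step budget exhausted."

print(run_agent("What's the weather in Toronto?",
                {"get_weather": lambda city: f"22°C in {city}"}))
```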

Philipp Schmid [00:06:17]: So the LLM decides: should it call a tool, or should it just generate a response? So, very important, if we look at the differences again: that's a workflow, where we have a user query, two parallel calls, an aggregator, and a final output; and that's an agent, very similar from a visual perspective. But the big difference here is the decision making. In this case, an LLM decides: should I use the LLM for task A or the LLM for task B? In a workflow, there's no concrete decision making. Awesome. So now let's look at a few examples of how we can build agents. These are called agentic patterns.

Philipp Schmid [00:06:55]: They're mostly common structures which we have seen people agree on when building agents. One very common pattern is the reflection pattern. I guess most of you have used it, even indirectly, in ChatGPT or Gemini: you ask the LLM to generate something and it generates a response, and then you follow up with a prompt, hey, please reflect on your response, whether it is good or not. And depending on the reflection, or the feedback from it, the LLM generates a new version or continues to the final output. That's often used for writing and refinement, or even for code generation: you write code, and then you have some kind of review, or some automated process which reviews the code and tries to run it; if it fails, it puts the error message back into the LLM, which then tries again (a minimal sketch of this follows below).
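
Here is a toy version of that reflect-and-retry loop for code generation. `generate_code` and `revise_code` are hypothetical stand-ins for LLM calls, stubbed so the sketch runs; the automated review step is simply executing the draft.

```python
# Stubbed LLM calls: generate_code returns a deliberately buggy first draft,
# revise_code "uses" the error feedback to produce a fixed version.
def generate_code(task: str) -> str:
    return "print(resolt)"  # NameError: first draft is wrong on purpose

def revise_code(task: str, draft: str, feedback: str) -> str:
    return "result = 40 + 2\nprint(result)"  # a real LLM would use `feedback`

def reflect_and_refine(task: str, max_rounds: int = 3) -> str:
    draft = generate_code(task)
    for _ in range(max_rounds):
        try:
            exec(compile(draft, "<draft>", "exec"), {})  # automated review: run it
            return draft                                  # ran cleanly, accept it
        except Exception as err:
            draft = revise_code(task, draft, feedback=str(err))
    return draft

final_draft = reflect_and_refine("print the answer to 40 + 2")  # prints 42
```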

Philipp Schmid [00:07:47]: We already talked about it a bit. It is like most default pattern for calling LLMs. It's basically where the model has access to a JSON structure, JSON definition, and then tries to generate or decide whether it should generate a JSON output or if it should ask the user some more details for it to follow up with a response. Then of course we have the orchestrator pattern or planar pattern. That's a more common pattern I think we nowadays see in all of the bigger and more complex agents. The whole idea here is basically before start doing any action, we have the LLM first generate a plan and then dedicate or provide a task to different models and have each of those tasks be run by a separate agent or by other LLM calls or other models. Basically then our final pattern is the multi tool use pattern where the LLM, similar to the orchestration pattern has some kind of handoff mechanism between an agent 1 and an agent 2. Where the LLM instead of creating a plan, it more or less natively hands over to another agent using the full context of the model.

Philipp Schmid [00:09:23]: So in the orchestration pattern, the idea is that you have this flow and each agent might have its own context or history, while in the multi-agent pattern you have a single unified context which is handed off, or shared, between the different agents. A very common example is having separate agents for booking a hotel and booking a flight. A user starts with, hey, I would like to book a flight to Toronto, and a hotel as well. Maybe the flight booking agent starts the process. Once the flight is booked, it hands over to the hotel agent, and the hotel agent has all of the prior information from the user: when is the flight, when does it start, when does it end? That makes the experience of booking the hotel at the final destination much easier and nicer for the user (a toy version of this handoff is sketched below). So why do we need those agentic patterns? Building agents is still software development, and in general for software development it's very helpful if we have some kind of structured way of creating our systems.
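
A toy version of the flight-to-hotel handoff described above, with both agents trivially stubbed. The point is the single shared context: the handoff passes the full conversation along, so the hotel agent already knows the flight dates.

```python
# Two stubbed agents sharing one unified context.
def flight_agent(context: list[str]) -> list[str]:
    context.append("flight_booked: Toronto, arrive Jul 10, depart Jul 14")
    context.append("handoff -> hotel_agent")
    return context

def hotel_agent(context: list[str]) -> list[str]:
    # Reads the flight details straight from the shared context.
    flight = next(m for m in context if m.startswith("flight_booked"))
    context.append(f"hotel_booked: Toronto ({flight.split(': ')[1]})")
    return context

context = ["user: book me a flight to Toronto, and a hotel as well"]
for agent in (flight_agent, hotel_agent):
    context = agent(context)
print("\n".join(context))
```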

Philipp Schmid [00:10:29]: Very importantly, those patterns are not fixed tools; they are more like building blocks that help guide the process of building agents. We already saw in the orchestration pattern that it's very common to use the tool use pattern as part of it. So the idea here is really to compose the different patterns together. A perfect agent will, most of the time, not use only one pattern; it could be some kind of orchestration pattern with a tool use agent, with a reflection agent, or with something additional. And it helps, of course, to scale: if we build bigger agents and want to scale, it's very important that we know about those patterns, so that we can test them individually and improve our agent over time. So what is different between evaluating an agent and evaluating a model or an LLM in general? I think it's good to start by looking at LLM evaluation in general. Normally, when you create some kind of question answering system, you have an input and you have an output, and you are able to evaluate whether the answer is correct or not.

Philipp Schmid [00:11:35]: So it's very easy to verify, and it's very easy to calculate the cost: LLMs are priced on a per-token basis, so we can estimate how expensive it would be to run an evaluation for a specific set. And there are multiple evaluations available from academia, and public leaderboards we can use to compare different models or prior versions, making it easy to get a first sense of how good our model is. And of course, most of the time evaluation is measured in capability, meaning accuracy, like achieving 80% on math benchmarks with 5-shot prompting. Looks like my slides got a bit smaller, I don't know. For agents it's very different, right? Agents are taking actions in real environments, making them much harder to evaluate. What if you have a customer support agent which needs to send emails? That's much harder to evaluate than just looking at the output.

Philipp Schmid [00:12:42]: Cost can become unbounded. When running an evaluation, we don't know how many tools will be called or how many iterations we'll have in our agent loop, so it's really hard to predict at the beginning what the cost will be; we'll always have a range of cost. It's also very task specific: an agent is normally built for a small set of use cases in a very specific domain, so we always need to create our own evaluation. So it's very hard to start with.

Philipp Schmid [00:13:13]: And of course, I would say the biggest difference is that we need to, or should, start measuring reliability and not just performance. What I mean by this is that with capability we basically measure how correct or how good an answer is, while with reliability we measure how many times something is correct. If we look at a very naive example of understanding agent reliability: I created this visual with two agents, and if we run our evaluation only once, it looks like, hey, agent 1 performed much better; let's use the setup of agent 1 over the setup of agent 2. But look at a longer horizon. If you think about a customer support use case, you don't want only one customer to have a good experience, right? You want all of the customers, or most of the customers, to have a good experience. If you went with agent 1, one customer would have a very good experience.

Philipp Schmid [00:14:07]: Maybe a second one as well. But as soon as we are at 3, 4, or 5 runs, the performance is much lower compared to agent 2. Of course, cost and many other considerations come into play when selecting an agent. But reliability of agents is much more important than it was for traditional LLM use cases, because we want our agents to be successful on all of the runs, and not only in one case. Cool.
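
A toy simulation of the single-run trap described above. The success probabilities are made up; the point is that comparing one run per agent can pick the wrong winner, while comparing success rates over many runs does not.

```python
import random

random.seed(0)

# One full agent episode, reduced to a coin flip with a made-up success rate.
def run_agent_once(p_success: float) -> bool:
    return random.random() < p_success

def reliability(p_success: float, n_runs: int = 100) -> float:
    return sum(run_agent_once(p_success) for _ in range(n_runs)) / n_runs

# Agent 1 can win any single run, but over 100 runs agent 2 is clearly better.
print(f"agent 1: {reliability(0.55):.0%} of runs succeed")
print(f"agent 2: {reliability(0.90):.0%} of runs succeed")
```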

Philipp Schmid [00:14:35]: So let's look at a few examples of how you can start building agents with Gemini. If you haven't tried or started yet, the easiest way is to go to AI Studio. You can get an API key for free; you don't need any credit card. If you have a normal Google account, like with Gmail, you can sign up. We have a generous free tier. You can use all of the Gemini models, including 2.5 Pro and 2.5 Flash, and you can start experimenting directly in AI Studio.

Philipp Schmid [00:15:03]: That's our developer platform, where you can prompt the model directly with text, or you can even start using tools. On the right side in AI Studio you have access to different tools. We have built-in tools like Google Search, where you can ground your requests with search. So what happens behind the scenes? Gemini decides: okay, should I use Google Search to answer this, or can I answer it myself? Of course, if you are done experimenting or want to build your own kind of agent, you need code. One of the easiest ways is the Gemini SDK, which supports automatic function calling. Basically, what you see on the right side here is all you would need for building a weather agent. You create a Python function, and the Gemini SDK converts the docstring of the Python function and the input parameters into a JSON schema and then provides it to Gemini. And the Gemini SDK basically handles the function calling.
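
The slide code itself isn't reproduced in this transcript, but a minimal weather agent in that spirit, using the google-genai Python SDK's automatic function calling, looks roughly like this (the API key is a placeholder; the weather function is a stub):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_AI_STUDIO_KEY")  # free key from AI Studio

def get_weather(city: str) -> str:
    """Returns the current weather for a city."""
    return f"It is sunny and 22°C in {city}."  # stand-in for a real weather API

# The SDK turns the docstring and type hints into a JSON schema, lets the
# model decide whether to call the function, runs it, and feeds the result
# back to the model, returning the final human-friendly text.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What's the weather like in Toronto?",
    config=types.GenerateContentConfig(tools=[get_weather]),
)
print(response.text)
```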

Philipp Schmid [00:16:01]: So when you do function calling, the model might generate a structured JSON output; then on the client side you need to call the method, send the response back to the model, and get a human-friendly response. That function calling loop is handled by the Gemini SDK, so it makes it very easy to get started. We work a lot with the open source ecosystem to make sure that the Gemini models are well integrated and well supported. We work with LangChain and LangGraph, making it super easy for you to build all of those awesome composable graphs (a minimal LangChain example is sketched below). Or if you want to build multi-agent systems with CrewAI, Gemini is well integrated, and we also have a nice guide in our documentation for the customer support use case. Work with LlamaIndex if you want to process large numbers of documents or build agent workflows or research agents. And then of course the whole ecosystem is much broader. We try to really make sure that Gemini is well integrated in all of the popular tools you want to use or that are available.
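
As a taste of the LangChain integration mentioned above, assuming the langchain-google-genai package is installed and a GOOGLE_API_KEY is set in the environment:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# Expects a GOOGLE_API_KEY in the environment (the free AI Studio key works).
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
print(llm.invoke("In one sentence, what is an agent?").content)
```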

Philipp Schmid [00:17:05]: So we partner with Vercel, with Browser Use, Browserbase, LiteLLM, to make sure that Gemini is really accessible in the tool you want to use for building your agents, and that we are not limiting the things you can and want to build. Of course, if you are looking for really cool dedicated examples, we published a deep research agent using LangGraph on GitHub, which has a nice UI but also the backend which does all of the looping and the reflection, doing multiple Google searches and generating a report for you. Take a look, you can get inspired. You can clone it locally; you only need the API key from AI Studio, which you can get for free. Then, if you want to really look at how a coding agent works, or look at it on a broader scale, the Gemini CLI, which is now used at Google internally but also externally, is completely open source. So you can take a look at the system prompts, how the tools are implemented, how they are connected. It's also available on the Gemini GitHub repository. Awesome.

Philipp Schmid [00:18:11]: I hope this was a bit helpful for you and you learned something new. Looking forward to some questions.

Demetrios [00:18:18]: Questions there are already, dude. And of course, it's always great when I get to hear you give talks. I really appreciate this. Let me jump right into what people are asking. We've got one from Chris saying: so we need to develop parameters for how, quote unquote, the agent makes decisions and navigates probability space. This is where prompt engineering and design still needs to evolve, right? Did you catch that last part?

Demetrios [00:18:58]: I almost, yeah.

Philipp Schmid [00:19:01]: What was the question?

Demetrios [00:19:03]: Sorry. We need to develop parameters for how the agent makes decisions and navigates probability space. This is where prompt engineering and design still needs to evolve. Right?

Philipp Schmid [00:19:25]: Okay, got it. I mean, next to prompt engineering, a very popular term which has come up in the last few weeks is context engineering. I guess that describes it a bit better. In general, we always need to make sure what information we provide, at what time, to the LLM, to give it the capability to answer the user request. So prompt engineering, or the prompt, is one part of it; the tools are another part; external information is another part; and then how we want the output to look. And there is no single rule. I think that's why agent evaluation is also so much harder than LLM evaluation, because even for a very simple customer support agent, you need to correctly define what success means. And agents are not 100% reliable currently.

Philipp Schmid [00:20:18]: So you also need to factor in, okay, what happens if two customers are not correctly served? What if I sell them a car for $1? You definitely need to work on those guardrails and think through what could go wrong, and then build evaluations. And as soon as you have evaluations, you can really work on the context engineering part: okay, how can I make it better? Can I provide policies as part of the agent request? Can I run some pre-classifier to filter out something like trying to buy a car?

Demetrios [00:20:54]: So all of that is fascinating, especially this idea of giving it the right information at the right time, and making sure that it is the right time. You don't want to just constantly stuff as much as you can into the context window, or it's not going to be clean, and it's going to almost muddy the waters of what you really want. It reminds me of a conversation that I was having with my friend Nishi last week. He was saying that the majority of his time is spent on all the unsexy stuff, which is basically data engineering problems. So when you have a variable in the prompt, he's like, that variable is really hard to get right, especially if you're pulling it from different databases, or if the users are creating it and you need it in real time. And then it's not just this clean table that you're getting; it's not like some Kaggle dataset that you have access to, you know. So that's a key point that you're hitting on.

Demetrios [00:22:05]: We've got more questions. How do you measure developer productivity and developer experience outcomes? It's hard to convince executives to allocate a team and resources in this domain.

Philipp Schmid [00:22:24]: I mean, it's very hard. And I guess if you pick a single metric, it will definitely be the wrong one. I know PRs and lines of code are very popular. For me, at least, where I've seen the most success is shipping features: how fast can you iterate on a product? And I guess it doesn't matter if it's shipped with 10 PRs or 20, or 200 lines of code. And of course, shipping features should ultimately come down to revenue if you are a business-oriented company. So is your revenue growth going up by enabling more AI tools for your developers, or more AI as part of your product experience? Of course you cannot know beforehand what works and what doesn't work. I guess the easiest is to look at studies done by externals and third parties on how GitHub Copilot helped developers be more productive. I know there will always be arguments pro using AI and productivity tools.

Philipp Schmid [00:23:30]: And contra to it, I guess the easiest always is to start small iterate and show how much more business value you can generate. And not having weird abstractive metrics on number of accepted lines. I mean that's not going to help flow manager if manager's goal or KPI is to pro usage or like usage and like revenue like try to work on that.

Demetrios [00:23:56]: Yeah, nobody cares how many PRs were accepted. It's: how can you tie it to revenue, and what are the ways to do that? I think that's a fascinating question though, because it is something that a lot of folks are grappling with. Just like back in the day when the machine learning engineers were trying to figure out how to best champion their projects, it's kind of that same narrative, right? There are a ton more great questions coming through in the chat. I want to ask this one, which is very Google GenAI SDK specific, for you: at what point do you suggest switching from GenAI SDK patterns, loops, workflows, etc. to an external tool like LangGraph or LlamaIndex, etc.?

Philipp Schmid [00:24:52]: So what we really want to make sure, by partnering and collaborating with the ecosystem, is that we are not limiting what you are building. AI is not day zero anymore: many companies might already have existing workflows using other providers in other departments, and might have already decided, as company policy, hey, you can use LangChain and LangGraph to build agents, and we want to make sure that Gemini is well integrated there. The same might count for CrewAI and others. And if you have complete freedom and flexibility, go with the tool you are most familiar with, because that way you can start shipping the fastest. If you have experience using Anthropic with CrewAI, use Gemini with CrewAI; the same counts for LangChain. And if you really have no experience, no requirements, nothing...

Philipp Schmid [00:25:45]: We have a blog post on like giving short description on like what are different frameworks and of course like with tools like deep research try to do some high level research, look at how those different frameworks build agents and like what kinds of abstraction they're implementing and go with like a personal preference. And ultimately there's always the way to not use any of those tools and hand troll your own workflows and agents. I mean, we have seen that agents and workflows are very high level, simply defined where you just might have a sequence of steps which could be small functions, or you might have some kind of conditions where you use structured outputs to see, okay, do I need reflection? Do I don't need a reflection? So it really depends on the current environment. But what we really want to make sure and working hard on it is that we are not limiting or like hindering that you can use Gemini because it's not integrated into a one or two the other tool.

Demetrios [00:26:42]: Philipp, dude, it's always an honor when I get to chat with you. I feel like the luckiest guy in the world that you came and graced us with your presence on this live stream. This has been awesome, dude. And I hope we get to connect and do this more, because you all are doing some incredible stuff. I will say it here for the world to know: Gemini is the only model that I actually pay for, because I just absolutely love its capabilities. Maybe I'm losing out, but I don't even care.
