Real World AI Agent Stories
Software Engineer with 10 years of experience. I started my career as an Application Engineer but have since transitioned into Platform Engineering. As a Platform Engineer, I have handled the problems described below:
- Localization across 6-7 different languages
- Building a custom local environment tool for our engineers
- Building a Data Platform
- Building standards and interfaces for agentic AI within edtech
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps Community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Demetrios chats with Zach Wallace, engineering manager at Nearpod, about integrating AI agents in e-commerce and edtech. They discuss using agents for personalized user targeting, adapting AI models with real-time data, and ensuring efficiency through clear task definitions. Zach shares how Nearpod streamlined data integration with tools like Redshift and DBT, enabling real-time updates. The conversation covers challenges like maintaining AI in production, handling high-quality data, and meeting regulatory standards. Zach also highlights the cost-efficiency framework for deploying and decommissioning agents and the transformative potential of LLMs in education.
Zach Wallace [00:00:00]: So, hey, everybody, my name is Zach Wallace. I am an engineering manager at Nearpod. We are in the edtech space for K-12 throughout the US, but also throughout the world. And how do I take my coffee? I take my coffee black every time. Yeah, I learned that in college, where I was desperate for money and cream was expensive.
Demetrios [00:00:24]: All right, this guy feels like he is smack dab in the bullseye of the Venn diagram I would paint of my ideal guest. I knew that Zach and I were going to be friends, to be honest. I knew it. I had him on here. It was an excellent conversation. And why did I know that? Because he wrote a whole blog post on how they made their data platform more efficient at Nearpod. And right when he got on, he said, you know what I've been doing, though, which is pretty wild, is diving into the world of agents.
Demetrios [00:00:57]: And I have now done some crazy stuff when it comes to breaking down barriers between departments at the company. So we talk about all that. Welcome back to another MLOps Community podcast. I'm your host, Demetrios. Let's get into it. Let's start with this: we were supposed to come on here and talk all about the data platform and the transformation that you did there. Maybe you can just give us the TL;DR, like the super TL;DR of that, because we're going to take a bit of a curveball here and go on a totally different path.
Demetrios [00:01:48]: But I feel like there's a lot of valuable information in what you've done with the data platform. So let's go over that real fast and then turn left.
Zach Wallace [00:01:57]: Yeah, for sure. So, trying to summarize this as best as I can, and my 10-minute read on Medium felt like a summary, so I'm going to try to do a little better here. Essentially, we had data in a bunch of disparate systems, a bunch of disparate data sources, and we have both monoliths and microservices on the data architecture side. With that, we had data all over the place, and we didn't really have a good way of condensing the data, transforming the data, or processing it for reports in any manner. Right. And what we were able to do is utilize a combination of tools: dbt Core is what we're using under the hood, and it's been fantastic.
Zach Wallace [00:02:44]: Some of my engineers have just absolutely fallen in love with it, because it is transformational for an engineer going into the data engineering world. The best way to describe it is that it feels like you're engineering data the way you'd do software engineering. And then we're using Redshift, which in the past has had a bunch of really large issues, but we're using it like an Apache Spark. So it's just for transferring the data and bringing the data to where we need it. And then we're passing it to other areas. Like, you know, we're a subsidiary of a larger company, so maybe we're sending it to Snowflake, or maybe we're using S3 to process other areas of data. And we built ourselves a data product exchange, which is the underpinning of a data mesh. So you can identify where the interactions are across the data products and how you will be able to interact with data products throughout our system.
Zach Wallace [00:03:43]: And again, we have like 20 different disparate data sources, and they have anywhere from millions of rows to billions of rows. So we're talking large-scale data to some degree. Not Meta-large, right, but larger than a POC, if you will. And so we're able to do this with consistency, with reliability, and with confidence.
Demetrios [00:04:06]: Dude. So explain, what does it look like with the disparate data sources, and how did you pipe it in? Or did you just make each one of those its own API? I guess what I'm not super clear on is, give me the breakdown. You had this database over here, and then you had another database over there, and you had to connect those two with dbt, and then you're joining them and putting them in another database that then has this data contract around it or something, and it's a data product. What does that look like in practice?
Zach Wallace [00:04:41]: Sure, yeah. So we have data stored in these disparate systems. Most of them are Aurora DB, some are DynamoDB. But one of the key pieces is that Redshift enables something called zero-ETL, and essentially this is an ELT, so it's not an ETL like you would typically see, but it provides real-time updates for any data within any of these disparate systems. And if you hook it up, you're able to get the data transferred over to Redshift, and then you're able to process the data in Redshift if you want. Or we've set up exposures with dbt, so we're able to bring that data out of there into any other system, like Snowflake, for instance. Right.
Zach Wallace [00:05:28]: So bringing this into Snowflake enables us to actually do transformations in Snowflake, where it's very powerful. So again, we're really using Redshift like an Apache Spark. And it's a wild mental shift because of how easy zero-ETL is to set up. We set it up, and each database probably took like five minutes with the AWS console, and then we're able to send that and process that wherever and however we want. We use S3 as an intermediary between different data sources beyond Redshift. Right. So the microservices go into Redshift, the DBs go into Redshift, and then from there they'll go into S3 or Snowflake or something like that.
Demetrios [00:06:15]: But where does DBT come in? I didn't catch that part.
Zach Wallace [00:06:18]: Yeah, totally fair. So dbt is how we process and transform all of our data in any of our areas. We have multiple dbt repos based on the domain they live in. So we have one inside of Redshift and one inside of Snowflake. If we want to process the data for any of the domain-relevant areas inside of Redshift, then we'll run that through dbt, then that gets passed down to Snowflake, and we can do even more transformations there.
Demetrios [00:06:49]: Nice. Okay, and you did say that buzzword of 2022, I think: the data mesh. Why do you feel like this is data-meshy, but not like the full-on data mesh, right?
Zach Wallace [00:07:01]: So the data mesh by itself provides the ability to send data and define data anywhere it needs to go across your system. The issue is that, similarly to how you build microservices, you have to build this with domains in mind. Right? And so as we're building this, we're working towards a data mesh right now, where we're defining the domains. But it takes a while to break down microservices and monolith data architectures and really define the right domains. So we have a team working on that now, and they've been working on it for about 12 months. We have some data products, probably 20 or so, 30 maybe, that are able to transfer between different areas of our architecture. But ultimately the goal would be to implement streaming data products. Right now we're only batch processing, right? At this point in time, zero-ETL works really well for batch processing, but not for streaming.
Zach Wallace [00:08:02]: And so we have to build the other side of that. And that's why I would say we're sort of a data mesh: you can get data from batch processing, but not from streaming. And we need to really define how other teams can get in there, because this is an organizational shift. You're talking about going from the traditional MySQL data, maybe it's PHP, maybe it's some sort of TypeScript ORM sort of ordeal. Right. And you're defining these systems and how they work. But now we're bringing these out into a separate area and actually wiring them back from our transactional layer: we're able to send data to the analytics layer and then back into the transactional layer for further processing or real-time updates or anything.
Zach Wallace [00:08:48]: And that's, that's where we're still trying to learn, if you will.
Demetrios [00:08:53]: And what do you mean by data product?
Zach Wallace [00:08:57]: Yeah, that's a great question. Data product is a word you'll hear defined 18 different ways, so it's important for you to define it yourself in...
Demetrios [00:09:06]: That was kind of why I asked. Okay, for the listeners at home, AKA me: what is a data product in your mind?
Zach Wallace [00:09:14]: Yep. So a data product, as we've defined it, is the intersection of the data and the data definition. It's the transfer of a set of data that has a clear definition of what it is. So let's say a user, for instance: you send the times a user has logged in, you send the times a user has done something in your application, and all of a sudden that is user usage. Right. And that is a data product by itself, because you're able to define exactly what it is and you're sending the data, whether it's aggregated or just single rows, a bunch of rows.
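Zach's definition of a data product as "the intersection of the data and the data definition" can be sketched in a few lines. This is an illustrative minimal model only; `DataProduct`, its fields, and the `user_usage` example are assumptions for clarity, not Nearpod's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str        # e.g. "user_usage"
    definition: str  # human-readable contract: what this data means
    schema: dict     # machine-readable contract: column -> type
    rows: list = field(default_factory=list)

    def publish(self, row: dict) -> None:
        # Enforce the contract: only rows matching the schema are accepted.
        if set(row) != set(self.schema):
            raise ValueError(f"row does not match {self.name} schema")
        self.rows.append(row)

# The "user usage" product from the conversation: logins plus in-app events.
user_usage = DataProduct(
    name="user_usage",
    definition="Login times and in-app actions per user",
    schema={"user_id": int, "event": str, "ts": str},
)
user_usage.publish({"user_id": 1, "event": "login", "ts": "2024-01-01T08:00:00"})
```

The point is that the data never travels without its definition: consumers of `user_usage` get both the rows and the contract that says what they mean.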
Demetrios [00:09:54]: Nice. And so then theoretically you would have many different types of data products. Maybe there's user usage, but there's user profile and there's user. Whatever else you can think about.
Zach Wallace [00:10:09]: Exactly.
Demetrios [00:10:10]: Okay, cool. So now let's take a hard left and talk about what you've been doing recently, because that was almost like your past life. And you just told me that for the last three months you've been diving deep into agent architectures. And I thought, well, that's perfect, because I'm all-in on agents too. I do love talking about data engineering for ML and AI and data platforms, but I'm also fascinated by agents right now. So what's your story there?
Zach Wallace [00:10:47]: So I would rephrase how we stated it as a hard left; I would actually say this is the next step. So in this architecture, the key that you need across your system is data, right? You need the data of your users, of your system, of the architecture to be able to facilitate quality in your system with LLMs. And that's key, because you can ask an LLM to do whatever you want and it's going to answer however it interprets it. But without data, it's not going to have the quality you need to provide reliable and confident answers or suggestions to your users, however you're going to use this. And so the data platform is really the first step: it's getting the data in places where you can now utilize it for better quality LLM responses.
Zach Wallace [00:11:38]: And so we took on an endeavor with agents. We tried to see, okay, the market's demanding that we use LLMs in some capacity. I'm in the edtech world, and that's a dangerous world in which to get into LLMs, right? Because we have to consider the students, the parents, the state legislation, the national legislation. And we're a global company, so if we go globally, how does this affect different cultures around the world? And that's a tough problem to solve, as you've probably seen, whether it's language barriers, because LLMs are not great at transadaptations. They are good at translations in some cases, but not transadaptations. They're not great at identifying culturally significant events or culture-specific sensitive topics.
Demetrios [00:12:30]: Nuances.
Zach Wallace [00:12:31]: Yeah, nuances is a good way of putting that. And so as we're getting into this, there's a lot to think about, right? And what we started with in mind was question generation. We're in edtech; we're trying to provide value for the teachers. They work in almost every country that I've ever heard of. They work intense hours, they don't have enough time in the day to do what they need to do, and they're getting burnt out. The students are affected by that burnout, the parents are affected, everyone's affected by this. So we focused on the teachers, and that's a powerful way of approaching this. So with question generation: how can we reduce the time that teachers take generating questions? And we started building agents to do this, and that was really powerful. The way I would describe these agents is that they're almost like three-year-old consultants, if you will.
Zach Wallace [00:13:29]: So when you start, right, you're going to bring these into production, but you're essentially asking a three-year-old to generate school questions for school teachers. And we know that's not going to be what they need right now. Right. Like, we know this. But what it does is enable us to start building out these domains of specialists. And this is where the interesting part comes in, because it took us seven hours to build something like this. Right. Which in the past is something that's just unimaginable.
Zach Wallace [00:13:58]: We would have had to take 60 days at least. Right. Two months, maybe it's eight months, to do something like this or even something remotely close. But the dev cycle has been reduced so drastically to get a proof of concept out that now we're actually building independent services that other teams across our company can access and provide insights on. We need other people, other teams across the company, to help us. In the past we've been the bottleneck, but now we're providing this opportunity for other teams to collaborate closely with engineers in a way they've never been able to before.
Demetrios [00:14:38]: And when you say other teams, you mean other engineering teams, or can anybody in any department help other departments?
Zach Wallace [00:14:49]: And that's the key.
Demetrios [00:14:50]: Yeah, right.
Zach Wallace [00:14:51]: Because we have all this knowledge, we can build this super quickly. You've seen a bunch of different edtech companies come out with question generation or slide generation, but if you bring those to a teacher, at least from what I've heard, they're all going to say the quality is really subpar. Yeah, they generate lessons, but they don't meet any standards. They don't help you design a real lesson that you could immediately use in your classroom. Why is that? Because engineers are building these. We're three-year-old consultants ourselves. So now we're the three-year-old consultants telling other three-year-old consultants what to do.
Zach Wallace [00:15:27]: Right. And so we need to get those subject matter experts closer to the code, closer to the development cycle.
Demetrios [00:15:35]: And why do you say that you're using agents or why is this an agent problem as opposed to just like pinging an LLM?
Zach Wallace [00:15:44]: Yep. So think about this from a consultant perspective. Right. You're building a very domain-specific agent to define and handle a problem. As we're going through this, to give you an example, we have an input validation agent. What that does is validate our input. You know, we have a lot of legislation that we need to handle, a lot of sensitive topics across the nation and across international cultures that we need to think about. And because this is coming from our company, we do not want to give a stance on these, regardless of how we feel about them.
Zach Wallace [00:16:21]: We need legal, we need curriculum development, we need these other teams to be closer to us. Right. Then for the actual question generation, that's its own domain where it's just generating questions. The entire purpose of that consultant is to generate questions for all of our teachers. Right. And as you're thinking about this, you're actually building, again, these mini consultants. But you start to understand that these tasks need to be broken out, because, you know, it's one of those idioms of the past.
Zach Wallace [00:16:52]: Are you a jack of all trades and a master of none, or are you a master of one and you don't know the rest? I totally botched that, but you get the point. Right? And we don't want to build a jack of all trades in a lot of cases. So you're going to start to see these agents going through our system, so much so that now we're building an agent registry that can be seen and utilized throughout our system.
Demetrios [00:17:15]: So that other engineers, or whoever wants to, can come and grab them off the shelf and say, I'm going to put these three or four agents together to create my product.
Zach Wallace [00:17:25]: Exactly.
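The agent registry idea, where teams grab agents off the shelf and compose them into a product, could be sketched roughly like this. Every name here (`AGENT_REGISTRY`, the decorator, the stub agents) is an illustrative assumption, not Nearpod's actual implementation; real agents would call an LLM rather than return canned output.

```python
# A registry that agents add themselves to, so other teams can discover
# and reuse them instead of rebuilding the same agent in several places.
AGENT_REGISTRY = {}

def register_agent(name):
    def wrap(fn):
        AGENT_REGISTRY[name] = fn
        return fn
    return wrap

@register_agent("input_validation")
def validate(payload):
    # Stub: a real validation agent would screen for sensitive topics, etc.
    return {"ok": bool(payload.get("text"))}

@register_agent("question_generation")
def generate(payload):
    # Stub: a real agent would prompt an LLM with curriculum context.
    return {"questions": [f"What is {payload['topic']}?"]}

def compose(names):
    """Pick agents 'off the shelf' by name and chain them into a product."""
    agents = [AGENT_REGISTRY[n] for n in names]
    def pipeline(payload):
        out = dict(payload)
        for agent in agents:
            out.update(agent(out))
        return out
    return pipeline

product = compose(["input_validation", "question_generation"])
result = product({"text": "photosynthesis", "topic": "photosynthesis"})
```

The registry is what lets a product team assemble "these three or four agents" without owning any of them.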
Demetrios [00:17:26]: Wow. And that's why you can empower the other departments.
Zach Wallace [00:17:31]: Yep, exactly. So let's say we have 12 to 13 different teams across our company, and one of the product engineering teams says, oh, I need to go and build product feature XYZ. Right. Well, I want to use an agent for that. So what agents are available to me? What agents do I need to create? How do these agents interact? So, to give a very specific example, let's say that you want to tackle...
Zach Wallace [00:18:00]: And I need to take this out of edtech, because I don't want to cross the line of sharing too much.
Demetrios [00:18:05]: Theoretically, if you were in a different business, like e-commerce.
Zach Wallace [00:18:08]: Right, like e-commerce, exactly. So let's say that you go into e-commerce, and your goal for a product feature is to get the right product in front of the right user. Right. So you're going to think about agents to understand, okay, what are the individual functions that need to happen, and how would I associate them on a larger scale? Right. Do I need to have non-deterministic orchestration, or can I actually use deterministic orchestration and define which steps of these processes need to happen? And so for the e-commerce example, you're going to need to understand: has this user ever bought anything on your site? What are the typical products that they enjoy? Maybe throw curveballs, you know, to spark interest in other areas. So you would have an agent to go through and understand what the interests of this user are. Then you'd have an agent to go in and understand what the typical things we're selling today are.
Zach Wallace [00:19:01]: And then you'd have an agent to merge those two together into this user profile that you're trying to generate. And then there's a lot of other things; you can think input validation, you can make sure that you're not throwing errors. But then this is where the data platform comes in, because, again, you're sending data on the front end. But what if you're collecting data from these agents, and you're thinking, okay, these agents are throwing errors 30% of the time (hopefully it's not that bad, but as an example), and the agents are also letting you record a success value, like someone purchased something on this e-commerce platform? You can start to identify which are working well and which aren't. And so as you're taking this through, you can get into the data platform and start processing this data to build a feedback loop, to understand what we can update and where we can do this autonomously, where the agent is actually learning. Right now, if you're using ChatGPT or OpenAI or something like that, they don't have the ability to... I can't think of the word right now, but where you bring data back in and let it learn itself.
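The e-commerce example above, one narrow agent per task plus simple telemetry feeding the data platform, might look something like this. All of the agents are stubs standing in for LLM calls, and the names, the sample history, and the metrics dict are assumptions for illustration only.

```python
def interests_agent(user_id):
    # Stub "user interests" agent: in reality an LLM over purchase history.
    history = {42: ["running shoes", "socks"]}
    return history.get(user_id, [])

def catalog_agent():
    # Stub "what are we selling today" agent.
    return ["running shoes", "water bottle", "headphones"]

def merge_agent(interests, catalog):
    # Recommend catalog items matching interests, plus one curveball
    # to spark interest in other areas.
    matched = [p for p in catalog if p in interests]
    curveballs = [p for p in catalog if p not in interests][:1]
    return matched + curveballs

# Telemetry the data platform can consume to see which agents work well.
metrics = {"calls": 0, "errors": 0, "purchases": 0}

def recommend(user_id):
    metrics["calls"] += 1
    try:
        return merge_agent(interests_agent(user_id), catalog_agent())
    except Exception:
        metrics["errors"] += 1
        raise

recs = recommend(42)
```

Piping `metrics` (error rates, purchase events) back through the data platform is what closes the feedback loop Zach describes.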
Demetrios [00:20:11]: Like the retraining or the fine-tuning?
Zach Wallace [00:20:15]: Yep. And you can customize that a little bit by using RAG and updating the data that you're passing in and whatnot. So there are ways you can implement a feedback loop tying this whole system together.
Demetrios [00:20:28]: Yeah. It makes me think of this guy Tom that I was interviewing a while back; he's doing Mixpanel for voice agents. And with voice agents you can see it a little more clearly, because you're on the phone, it's real time, you're talking to them, and if something goes the wrong way, you want to know about that. Or if there's an expected call duration and all of a sudden all of your calls are taking just three seconds, when the average before was a minute or a minute and a half, you want to see that type of stuff. I hadn't thought about it with agents for what you're talking about, where you want this Mixpanel type of view to be able to understand where the agents are successful and where they're not successful in moving the needle on one of these metrics that's important for you.
Zach Wallace [00:21:28]: Yep, yep. And that's a power of agents, to be completely honest, because you can now have your analytics department working on one agent, building out what it means for these to be successful. You can have your product engineering teams working on implementing this exact agentic flow. Right. And then you have this idea of, well, why are we going to recreate the same agent in four or five different places? Right. It's sort of like object-oriented programming in some cases, because you have to come back to the fundamentals and understand: how can we repurpose and reuse this in another area? How can we break down the problem to pick apart the standard pieces, or maybe some of the more intricate pieces that are very domain-specific?
Demetrios [00:22:19]: And I guess when I think about agents, one thing I think about is how they're able to take some kind of a question or a request or an instruction and then figure out, out of all the possible actions they can take: okay, I'm going to use this tool. First of all, they have to understand that. So they have to know: should I ask for more context? Should I really clarify what is wanted here, what outcome they're trying to reach? And then, cool, I can go and grab this tool. My buddy Sam talked about how, in every case you can, you want to try to, A, narrow the scope of what you're trying to get the agent to do, but B, narrow the scope of what the tool is doing. And so when the agent interacts with the tool, you want to narrow that scope as much as possible. He gave me the example of having an agent write a SQL statement from scratch, versus the agent just having hundreds of SQL statements it can choose from, and it chooses the correct SQL statement because it knows what you're trying to do.
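Sam's point about narrowing tool scope can be made concrete: instead of letting the agent generate arbitrary SQL, give it a small menu of vetted, parameterized statements to choose from. The query names, the keyword-matching chooser (a stand-in for an LLM's tool selection), and the whole whitelist here are illustrative assumptions.

```python
# A narrow tool: the agent may only pick from these vetted, parameterized
# queries; it never writes free-form SQL.
VETTED_QUERIES = {
    "user_logins": "SELECT ts FROM logins WHERE user_id = %(user_id)s",
    "user_purchases": "SELECT item FROM orders WHERE user_id = %(user_id)s",
    "top_products": "SELECT item, COUNT(*) FROM orders GROUP BY item",
}

def choose_query(request: str) -> str:
    # Stand-in for the agent's choice: match the request against the menu.
    for name in VETTED_QUERIES:
        if all(word in request.lower() for word in name.split("_")):
            return name
    # Narrow scope also means failing loudly instead of guessing.
    raise LookupError("no vetted query matches; ask for clarification")

picked = choose_query("show me the purchases for this user")
```

Because the tool's surface is a fixed whitelist, a wrong choice is at worst the wrong vetted query, never an arbitrary (or dangerous) statement.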
Zach Wallace [00:23:43]: Yeah, yeah. And that's a powerful concept. There are a lot of implications in what you're bringing up.
Demetrios [00:23:51]: So yeah, let's go through them.
Zach Wallace [00:23:53]: Yeah. One of the things that we're learning a lot about right now is: how do you define the number of tokens and relate that to the cost and the time required to process this? As you have specialized agents, you're typically going to have fewer tokens, so they can run and get whatever they need much quicker from the LLM, versus broader, less specialized agents that are going to take longer to process the information and understand all of the context and all of that.
Demetrios [00:24:32]: Okay, so you're saying that when it is super narrow and you have this smaller scope, you can not only save money, but it's also more reliable.
Zach Wallace [00:24:48]: Exactly. And that's when you start considering multi-agent approaches. Right. We're using multiple agents for everything we do, because it makes our processes easier to understand from the engineering side and easier to adjust. So let me give you a breakdown of our time requirements for this project. We noticed that it takes about 10 to 20% of the time to actually build a POC and get something available for end users. But assessing the quality, understanding how this works with other departments, or diving into what we call false positives, which is where the agent reacts in a way that it believes is correct but isn't (sort of like hallucinations, in some capacity), when you're trying to fine-tune those with your prompt or with the code, and there's a blend there, that takes 80 to 90% of your time to debug. And so now it's sort of flipped.
Zach Wallace [00:25:52]: Right. So you need to be able to communicate exactly what's happening in each agent and understand exactly what its task is, to reduce communication channels between engineers on a team, between departments in a company, and across other communication channels.
Demetrios [00:26:09]: So you are probably coming to a place where you've got thousands of agents that you're dealing with. Or is it not that sprawled out?
Zach Wallace [00:26:18]: It is not anywhere near that right now, but we will. We will.
Demetrios [00:26:22]: Okay, so that's the end state. If you extrapolate this forward a few months or years, you expect that to happen.
Zach Wallace [00:26:32]: Yeah, yeah. And I mean, you've seen everyone from HubSpot to Salesforce to Zuckerberg talking about how there are going to be more agents than there are people. These agents could be like the mobile apps on your phone; you could think about that as the scale of agents I could see in the future.
Demetrios [00:26:52]: So I really like the idea of going into it and looking at agents as part of a DAG. It's just one more step in the DAG, and it just so happens that this step is a little non-deterministic, and we're giving it information from the other steps in the DAG. But it can be that type of visual representation in my mind. Do you tend to build them like that, or do you look at it differently?
Zach Wallace [00:27:30]: Yeah, so that's a great question. And it can be a little complex to identify at times. So I'm going to go with the base case and move up from there. Let's say you're building just a standard, single agent. The flow would be: you're coding, and then you have one node in this DAG that you're calling, and it's going to give you some sort of response, however you're intending. If we go to the e-commerce example, it's going to say, oh, these are the user interests, if you will. And so it's going to be able to identify those. Then you scale that up and start to say, okay, we want to add more to this to understand, well, what is the business interested in selling today? Right. Because there are going to be different valuations on all of your products, different margins, et cetera. So the next step is saying, okay, instead of just calling this user-interests agent, we're also going to call this other step, this other tool, which is identifying business needs.
Zach Wallace [00:28:34]: And so you actually almost have to build a third node which is orchestrating this, right? And that's where this non-deterministic orchestration comes into play. And it becomes fascinating, because you can now say, okay, I want you to bring both of these tools into play based on what you are seeing, right? It can choose to bring in one, or none, or two, or whatever. So it comes in, calls those tools as it sees fit, and condenses the information. And that's sort of your high-level non-deterministic orchestration. But let's say it's not really producing quality results, right? This non-deterministic orchestration is giving you some sort of summary, but it's not actionable. What are you going to do? Well, let's add another agent, right? And so you start to have this next node that is on the same tier in this DAG. Right. Because really you're just calling and receiving responses.
Zach Wallace [00:29:28]: And so you're getting and processing this information from the first two, and then you're going to send it to this next node. So rather than saying same tier, let's say those first three are in one tier of the DAG. And then you gather all this information and pass it to your analytics or whatever it is, your summarizer, if you will. I don't know what to call it here, because I'm not familiar with e-commerce, but you get my gist. So you pass it to this next node, and that summarizes and coordinates a quality response. But then let's say that you actually want this to be flexible based on current events. Then you can make that a non-deterministic agent too. So now you're going to have this DAG that just has a bunch of non-deterministic agents that are going and...
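The DAG Zach walks through, tool nodes on one tier, an orchestrator deciding which of them to call, and a downstream summarizer node, could be sketched like this. The decision rule inside `orchestrator` is a deterministic stand-in for the LLM's non-deterministic choice, and every name here is an illustrative assumption.

```python
def user_interests_tool(ctx):
    # Tier-1 tool node: stub for a "user interests" agent.
    return {"interests": ["shoes"]}

def business_needs_tool(ctx):
    # Tier-1 tool node: stub for a "what does the business want to sell" agent.
    return {"promote": ["headphones"]}

TOOLS = {"interests": user_interests_tool, "business": business_needs_tool}

def orchestrator(ctx):
    # Orchestrator node: brings in one, none, or both tools based on what
    # it sees in the request (an LLM would make this call in practice).
    chosen = []
    if ctx.get("known_user"):
        chosen.append("interests")
    if ctx.get("has_inventory_goals"):
        chosen.append("business")
    out = dict(ctx)
    for name in chosen:
        out.update(TOOLS[name](out))
    return out

def summarizer(ctx):
    # Next-tier node: condense the tool outputs into one recommendation list.
    return ctx.get("interests", []) + ctx.get("promote", [])

result = summarizer(orchestrator({"known_user": True, "has_inventory_goals": True}))
```

Making `summarizer` itself non-deterministic (say, LLM-driven and sensitive to current events) is the step Zach describes at the end: a DAG whose nodes are each agents in their own right.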
Demetrios [00:30:10]: Going and going, and they all have separate use cases or separate. It's almost like you're giving the agents more features and you're enriching the agent with all of these different steps in the dagger.
Zach Wallace [00:30:23]: Exactly.
Demetrios [00:30:25]: Yeah. That's fascinating. How are you looking at costs? Because I know you kind of mentioned that before.
Zach Wallace [00:30:32]: Yeah, yeah. So this is a fun one. With our agents, we have built a custom evals framework. We based it off OpenAI's, but we had to bring the evals to our engineers. We're a platform team, so our goal is to make the engineers more productive. Right. They're currently working in either Python or TypeScript.
Zach Wallace [00:30:54]: And Python for feature engineers is pretty easy. Right. You just have the basic OpenAI library. But for TypeScript, nothing existed out in the real world. And so we built our own custom evals framework that can dive in and handle this. Within that, we've hooked it up to CI/CD for confidence and different levers. And we're able to assess how much each of these agents is costing us based off our evals, and create an approximation for how much it's going to cost us in production based on usage and other metrics we're looking at.
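A back-of-the-envelope version of that cost approximation: take token counts measured in eval runs, multiply by per-token pricing, then by expected production usage. The model names and prices below are made-up placeholders, not any provider's real pricing, and this is a sketch of the idea rather than Nearpod's framework.

```python
# Placeholder pricing: dollars per 1,000 tokens, per model.
PRICE_PER_1K = {"small-model": 0.0005, "large-model": 0.01}

def cost_per_call(model, prompt_tokens, completion_tokens):
    # Token counts would come from measured eval runs, not guesses.
    return (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K[model]

def monthly_estimate(model, prompt_tokens, completion_tokens, calls_per_month):
    # Scale the per-call cost by expected production usage from telemetry.
    return cost_per_call(model, prompt_tokens, completion_tokens) * calls_per_month

# An agent averaging 800 prompt + 200 completion tokens, 1M calls/month:
estimate = monthly_estimate("small-model", 800, 200, 1_000_000)
```

Running this arithmetic in CI against every agent's eval traces is what turns "how much will this cost in production?" into a number you can gate a deploy on.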
Demetrios [00:31:27]: Whoa. And how did you do that? Because that's fascinating. I don't even know where you would start with that. I'm trying to think, like, how did you even break that down? Oh my God.
Zach Wallace [00:31:39]: Yeah, yeah. So we're able to utilize. I'm drawing a fine line between what I'm allowed to share and what I'm not. Right. So yeah, we have internal logistics that are able to define and measure cost usage with the LLM, identify what we're able to use and how we're able to use it within our internal logic, and associate that with real monetary values that we've assessed to be very, very close to the real world. And then we have all of our measuring and monitoring data within whatever telemetry you're using to assess, okay, how many total users are we expecting to use this, and associate those numbers with the actual calls that we're making. And what we found out is with agents, especially using non-deterministic ones for any reiteration or any look-backs or reflection, we're able to identify that we're getting really accurate results, to the tune of 98 to 100% accuracy in our evals, for a lower cost, because we're able to use cheaper models and whatnot.
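As a rough illustration of the kind of projection Zach describes, here is a sketch that prices eval-time token usage and scales it by expected production traffic. The internal framework and real rates aren't shared, so every per-token price and usage number below is invented:

```python
# Hypothetical cost projection: token counts measured during eval runs,
# priced with made-up per-1K-token rates, scaled by expected usage.

PRICE_PER_1K = {"cheap-model": 0.0005, "premium-model": 0.01}  # illustrative rates, not real

def eval_run_cost(model, prompt_tokens, completion_tokens):
    """Dollar cost of one call, as measured in an eval run."""
    total = prompt_tokens + completion_tokens
    return total / 1000 * PRICE_PER_1K[model]

def projected_monthly_cost(model, avg_prompt, avg_completion,
                           calls_per_user, monthly_users):
    """Scale the per-call cost by telemetry-derived usage estimates."""
    per_call = eval_run_cost(model, avg_prompt, avg_completion)
    return per_call * calls_per_user * monthly_users

# Same workload, two models: this is where "98-100% accuracy on a
# cheaper model" turns directly into a smaller projected bill.
cheap = projected_monthly_cost("cheap-model", 800, 200, 5, 100_000)
premium = projected_monthly_cost("premium-model", 800, 200, 5, 100_000)
```

If the evals show the cheaper model hitting the accuracy bar, the framework's approximation makes the savings explicit before anything ships.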
Demetrios [00:32:56]: And why do you think that is? Because you're passing in more context, you're giving it better information. What is the...
Zach Wallace [00:33:06]: It's half prompt engineering and half software engineering at the end of the day. And so we need to identify how we can reduce our token size and how we can reduce the number of calls. You're stuck in this optimization loop; you're never going to have it perfect. Right. But we have a ton of optimization nerds on our team that are really focused on, okay, what is the cost, what is the quality, and how do we optimize for those.
Demetrios [00:33:36]: And when they're looking at the cost, it's like, could we get rid of this sentence in the prompt? Because that means fewer input tokens. And when you multiply that by the number of prompts that we're going to be using with this agent, and all of the folks that are going to be using this agent out in the real world, that starts to add up.
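That arithmetic can be made concrete with a back-of-envelope sketch; every number here is invented for illustration:

```python
# Back-of-envelope: one trimmed sentence saves tokens on every call,
# and that multiplies across all production traffic. All figures below
# are made up to show the shape of the calculation.

tokens_saved_per_call = 25      # roughly one removed sentence
calls_per_day = 200_000         # assumed production volume
price_per_1k_input = 0.0005     # illustrative per-1K-input-token rate

daily_savings = tokens_saved_per_call * calls_per_day / 1000 * price_per_1k_input
annual_savings = daily_savings * 365
```

A few dollars a day per trimmed sentence sounds small, but summed across many prompts and agents it is exactly the kind of figure an optimization-focused team chases.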
Zach Wallace [00:33:58]: Exactly. And bringing this full circle, if you think about this from the perspective of consultants, maybe you ask them to do less in some cases. Right. Which has less time and money associated with it.
Zach Wallace [00:34:29]: But then at other times you say, okay, well, we've kind of completed this task and we don't really need that agent anymore. We don't need that consultant anymore because we're doing something different. And I think that's powerful, because these are so easy to spin up, the dev time associated with them enables us to remove things much quicker and say, okay, look, you spent two hours on this, I'm sorry, but this is no longer going to be used. And that's a much easier conversation with an engineer compared to spending six months or twelve months iterating over years and then removing a project from what they've been doing.
Demetrios [00:35:23]: Yeah, you're much less pot committed. Exactly. It's been, whatever, a couple hours that you put it together. And when you talk about the marketplace too, I can envision that you have the price tags there also. Like, if you use this agent, expect it to cost X amount. And if you're using all of these agents together, maybe it does a little fancy math and adds up the prices of all of the different agents that you plan to be using in production. And so you can get an estimate of, all right, cool.
Demetrios [00:35:57]: Well, this agent's probably going to cost us X amount of money. Are we okay with that? Can we make that back? And does anyone have the ability to just launch something into production? How does it go from me in some random department saying, I now want to create my agent, and I throw a few of these different small agents together and I say, cool, I think I've got it, let's launch it. And then what? How do I launch it? How do I go from "I've got something" to production?
Zach Wallace [00:36:35]: That's such a great question. So organizational dynamics are going to change. And this is one of the craziest realizations I've had over the past couple months. And coming back to something we talked about earlier, and I don't remember if we were recording or not, but in essence, we have to bring the departments closer. Right. So when you have this product feature, it's no longer people sitting in a closed room figuring out what the user needs, where you have two people ideating on this and then they send it to the engineers, and the engineers say what they can and cannot do and give you a time assessment. Yes, that's still going to be a factor, but the speed of determining that and enabling the actual POC or MVP is so much shorter.
Zach Wallace [00:37:29]: And that is the wild part about this, because you can now say, okay, if it takes seven hours to build this, we can give you a POC to see: are we meeting the user's demands based on your department's vision of this? In the past, even though we were iterating quickly (we're using CI/CD, so we're updating production), it was with very small pieces of the puzzle. With this, you almost take a whole puzzle and give it back to the other department and say, hey, look over this. What are we missing here? I'm a three-year-old, remember? So what did I not understand from this language gap that we have at this point in time? And then they can give you actionable feedback on where you need to focus your time as an engineer. For ed tech, that's bringing curriculum development in, that's bringing sales in, that's bringing marketing in. You're bringing these other departments into closer collaboration than we've ever seen.
Demetrios [00:38:25]: But this is for net new products, right? Or do you still feel like there's going to be that capability with the gigantic code base and the monolith, where you go back and you say, well, we want this feature. Okay, if it's a net new agent, I can create it in a few hours. If it's, I gotta go dig through what Johnny did two years ago, it's gonna take me a few months.
Zach Wallace [00:38:52]: Yeah, and that's a great question. And as every software engineer will say, it depends. Right. Because there's a lot of context that engineers have and will need to continue have for the monolith and for other areas of your application. But I think what we'll start to see is that there's going to be ways to deprecate old code bases or code bases that have been maybe some of the more technically challenging areas of your code base. And they can be updated with agents to solve the same concerns that we've seen in the past.
Demetrios [00:39:32]: Wow, that's fascinating to think about, because at the end of the day, the bottom line is the end user just wants the hole in the wall. They don't want the drill. They don't care if it's a hammer that's doing it. They want the hole in the wall. And so if you're giving it to them with agents, and it's actually a much better experience, and on your side of things the backend is much less complicated and easier to spin up and get validation from quicker, then it feels like a win-win. It also feels just intuitively very scary. It's scary for you too?
Demetrios [00:40:11]: Right. Okay. It's not just me.
Zach Wallace [00:40:13]: Yeah. So coming back to how we get this into production, because I never really answered that. I just got excited by the first part, the organizational dynamics. So what we will start to see is that it is a very scary thing to release your first LLM, AI, non-deterministic software development cycle. It's very, very tough to do that. And we took an approach where our evals were the single source of truth for our system. We have thousands of evals. It's very similar to TDD; you think about those as tests, right? Yeah, we're generating user prompts, not from production, but user prompts that we would foresee happening in production.
Zach Wallace [00:41:02]: And these prompts are being added to our evals to assess: what are the boundaries we are seeing, and how are these boundaries assessed by our LLM? For us, sensitive topics are something we have to be very careful about. Legislation differs across the world, so we have to consider very different legislative requirements across the world. And when we're doing that, we had to build our evals. So we have this product team coming with a product goal based on the users, we identify the product design sprints that you'll need with the relevant departments and the engineers, and now we're building out evals. Now, the interesting part about this is that the engineers in some companies are so far removed from what the end user does or actually inputs into the system. And so this collaboration loop is just tightening every step of the way. So you're building out these evals.
Zach Wallace [00:41:54]: With our CI/CD pipelines, we're able to get thousands of evals for each agent to run, with confidence levels of getting the right result 98 to 100% of the time. When you had that confidence with testing in the past, right, where you're testing deterministic functions, you felt that you were going to release reliable work. Right. With this, there's always going to be a notion of "but what if." Right. Because it's non-deterministic.
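A minimal sketch of what such a CI/CD gate could look like, with a stubbed grader and a toy eval suite (the real framework, graders, and eval cases are Nearpod-internal; everything below is illustrative):

```python
# Hypothetical CI/CD gate: run the agent over an eval suite and only
# allow the deploy when the pass rate clears a confidence threshold
# (Zach cites 98-100%). The grader and cases here are stand-ins.

CONFIDENCE_THRESHOLD = 0.98

def grade(prompt, response):
    # Stand-in for a real grader (exact match, rubric, LLM judge, ...).
    return response is not None

def run_evals(cases, agent):
    """Return the fraction of eval cases the agent passes."""
    passed = sum(grade(prompt, agent(prompt)) for prompt, _expected in cases)
    return passed / len(cases)

def ci_gate(cases, agent):
    pass_rate = run_evals(cases, agent)
    if pass_rate < CONFIDENCE_THRESHOLD:
        # In CI this non-zero exit blocks the deploy.
        raise SystemExit(f"eval pass rate {pass_rate:.1%} below threshold")
    return pass_rate

# Tiny illustrative suite; a real run would use thousands of cases
# of foreseen user prompts, including sensitive-topic boundaries.
cases = [(f"prompt {i}", "expected") for i in range(100)]
rate = ci_gate(cases, agent=lambda prompt: "some response")
```

Because the agent is non-deterministic, the gate gives statistical rather than absolute confidence, which is exactly the "but what if" Zach flags next.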
Demetrios [00:42:24]: What if it goes Air Canada on us? What happens then?
Zach Wallace [00:42:27]: Yeah, and that's always going to be there. It's non-deterministic by nature. You have to accept that. And so partly it's about accepting higher risks with production deployments. But as we're going through this and deploying this, I mean, really all we're doing is setting different environment variables. You're releasing this non-deterministic agent just as you would any other production deploy. Your risk is just increased.
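The "deploying is just setting different environment variables" idea could look something like this; the variable names and defaults are hypothetical:

```python
# Sketch of an env-var-driven agent rollout: the deploy flips config,
# not code. AGENT_* names and defaults are invented for illustration.
import os

def agent_config(env=os.environ):
    return {
        "enabled": env.get("AGENT_ENABLED", "false").lower() == "true",
        "model": env.get("AGENT_MODEL", "cheap-model"),
        "max_tokens": int(env.get("AGENT_MAX_TOKENS", "512")),
    }

# A rollout is a config change: same code path, different variables.
staging = agent_config({"AGENT_ENABLED": "true", "AGENT_MODEL": "premium-model"})
prod_default = agent_config({})  # agent stays off until explicitly enabled
```

Keeping the agent behind config like this also makes decommissioning cheap: turning it off is the same one-variable change as turning it on.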
Demetrios [00:42:55]: And where are some other areas? Because it feels like you could take what you are doing right now at your company and replicate that with different companies or use cases. I'm sure you've thought about it, like, huh, I bet this would be a good remedy for these problems too. What are some other areas that you feel could be useful here?
Zach Wallace [00:43:24]: That's such a great question. A lot of it depends on the data available. Right. And again, this comes down to quality, and to assess quality you have to have a notion of what is good versus what is bad. So I've worked in a few different areas, but I've focused on ed tech, so my main understanding is edtech. But I'm a massive fan of fantasy football and other areas like that. Right.
Zach Wallace [00:43:55]: So let's consider fantasy football, as in American football.
Demetrios [00:44:00]: Or do you say football like... yeah, which one?
Zach Wallace [00:44:04]: I grew up playing soccer in America, but football across the world. Massive Arsenal fan. So I have a huge love for football from outside of America. But all of my friends really only know American football, so I have to stay true to what I know and what my friends know. So we're going to consider American football here. But you can consider, you know, injury analysis. That's something you could have an agent for, where you're thinking about, okay, is this player going to play? How long is this player out for? You can consider the team's strength of schedule for the rest of the season and identify the trade targets that you want to focus on. Right.
Zach Wallace [00:44:46]: And I think that applies generally across many different sports. But there's a lot of ways you can implement agents to that degree. The only thing that matters is where's the data and who has the data. And that's really, really tough to find in the fantasy football world, because everyone is already price gouging it, really.
Demetrios [00:45:05]: As far as data goes? Like, to get data on all of that?
Zach Wallace [00:45:10]: Yeah, for backtesting and other purposes. Right. It's really hard to get data, quality data, across many different systems to enable you to have backtesting and future predictions of how it will perform. But let's say you go into the health space. You have a lot of HIPAA rules, so you have to be careful about that. But there are ways you can aggregate the data within there, you know, have an agent that's able to identify something as simple as, if we take the Silicon Valley "hot dog, not hot dog," if you've ever seen that show, you could take this into, like, "serious, not serious" for an issue.
Zach Wallace [00:45:48]: Right. And help them identify where they are in various elements of the space.
Demetrios [00:45:55]: Yeah. I was thinking about government contracts, and agents that can help you identify which government contracts to go after, or at least help you fill out the proposal as much as possible.
Zach Wallace [00:46:11]: Yeah.
Demetrios [00:46:12]: Now, one thing that I was thinking about back to that football example, the fantasy football example, is what in your mind makes this an agent need versus a traditional ML need?
Zach Wallace [00:46:29]: It's a great question. So I don't think those are mutually exclusive. I've been looking at a lot of, you know, easy-to-understand agent videos across YouTube and other things, because I've been basically giving this spiel to other departments across my company. So I'm trying to find the best way to be less technical. So the way I would define this is that ML or LLMs, any model, right, whether it's an LLM or a traditional ML model, are the brainpower.
Zach Wallace [00:47:03]: They can do very specific tasks. Agents are what perform the job for you with that brainpower. So it's less about LLMs and their integration with agents, and more about enabling a model to perform a task. And that was a wild piece that I noticed.
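One way to sketch that distinction in code: the model is a function that picks the next action (the brainpower), and the agent is the loop that uses it to actually do the job. Everything here is a hypothetical stub for illustration:

```python
# Model vs. agent, as Zach frames it: the model decides, the agent acts.

def model(state):
    # Stand-in for an LLM or traditional ML call: given the current
    # state, pick the next action. This is the "brainpower".
    if "data" not in state:
        return ("fetch", None)
    return ("answer", f"result based on {state['data']}")

# Tools the agent can use to act in the world (here, a fake fetch).
TOOLS = {"fetch": lambda state: {**state, "data": "fetched data"}}

def agent(task, max_steps=5):
    # The agent is the loop that turns model decisions into a finished job.
    state = {"task": task}
    for _ in range(max_steps):
        action, payload = model(state)
        if action == "answer":
            return payload            # the job is done
        state = TOOLS[action](state)  # perform the chosen action
    raise RuntimeError("agent did not finish within the step budget")

out = agent("summarize sales")
```

Swapping in a different model changes how well the decisions are made, but the agent loop, the part that performs the job, stays the same.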