AI Agent Development Tradeoffs You NEED to Know
SPEAKERS

Sherwood Callaway is an emerging leader in the world of AI startups and AI product development. He currently serves as the first engineering manager at 11x, a series B AI startup backed by Benchmark and Andreessen Horowitz, where he oversees technical work on "Alice", an AI sales rep that outperforms top human SDRs.
Alice is an advanced agentic AI working in production and at scale. Under Sherwood’s leadership, the system grew from initial prototype to handling over 1 million prospect interactions per month across 300+ customers, leveraging partnerships with OpenAI, Anthropic, and LangChain while maintaining consistent performance and reliability. Alice is now generating eight figures in ARR.
Sherwood joined 11x in 2024 through the acquisition of his YC-backed startup, Opkit, where he built and commercialized one of the first-ever AI phone calling solutions for a specific industry vertical (healthcare). Prior to Opkit, he was the second infrastructure engineer at Brex, where he designed, built, and scaled the production infrastructure that supported Brex’s application and engineering org through hypergrowth. He currently lives in San Francisco, CA.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Sherwood Callaway, tech lead at 11x, joins us to talk about building digital workers—specifically Alice (an AI sales rep) and Julian (a voice agent)—that are shaking up sales outreach by automating complex, messy tasks.
He looks back on his YC days at Opkit, where he first got his hands dirty with voice AI, and compares the wild ride of building voice vs. text agents. We get into the use of LangGraph Cloud, integrating observability tools like LangSmith and Arize, and keeping hallucinations in check with regular evals.
Sherwood and Demetrios wrap up with a look ahead: will today's sprawling AI agent stacks eventually simplify?
TRANSCRIPT
Sherwood Callaway [00:00:00]: I think that agents collapse on a simpler structure as the models get better. We actually had our first agent frameworks all come out during that period, and we got AutoGPT and BabyAGI, and none of them worked, or worked just enough to show promise. And then we got LangGraph Cloud that's trying to become Datadog, and then we've got AWS that's probably doing both.
Demetrios [00:00:17]: I still am not sure what it means to be an agent in the cloud.
Sherwood Callaway [00:00:22]: One argument you could make against these agent frameworks is that here we go, let's go.
Demetrios [00:00:33]: This is particularly close to my heart because I started in tech as an SDR and you are working on making that job obsolete.
Sherwood Callaway [00:00:44]: Well, we don't have SDRs at 11x, so that's. That's true.
Demetrios [00:00:48]: Really?
Sherwood Callaway [00:00:48]: Yeah, we actually only have account executives and other roles within the revenue org, but we don't have SDRs.
Demetrios [00:00:54]: And their calendars are booked.
Sherwood Callaway [00:00:56]: Yeah. Wow. Yeah, yeah. So for context, I'm a tech lead and engineering manager at 11x, and 11x is a company that's building digital workers. Our two workers today are Alice and Julian. They are both in the revenue org. Alice is our AI SDR, or sales development representative, product.
Sherwood Callaway [00:01:16]: And then there's Julian. Julian is our AI voice agent who actually specializes in inbound sales and speed to lead, if you're familiar with that concept. And I'll caveat by saying I've got some voice AI experience at 11x. Most of my role is focused on Alice, but my previous startup was a voice AI company, so I can talk a little bit about Opkit if that's helpful.
Demetrios [00:01:40]: Yeah. What were you doing at OpKit?
Sherwood Callaway [00:01:42]: Opkit was a YC Summer '21 company. We built one of the first vertical AI phone calling solutions for the healthcare market. We were automating calls to insurance companies on behalf of healthcare providers and provider groups, mostly for verifying patients' insurance eligibility, securing prior authorization, or claims follow-up. A lot of the operations that are relevant in the back office and billing department of these provider groups can actually only be done over the phone.
Demetrios [00:02:14]: I was going to ask. They don't have a website for that?
Sherwood Callaway [00:02:16]: Yeah, they have websites. The short story is that they're not that incentivized to make it very easy, and one of the ways they make it hard is by only allowing you to do certain things over the phone. We had been working in health insurance and medical billing for about a year and a half when ChatGPT and GPT-3.5 came out. We already had team members, operations folks, who were doing these calls for our customers as a sort of services component of our platform. And we realized almost immediately that we were going to be able to automate the call end to end using AI. So we pivoted to purely doing the voice AI solution for provider groups.
Demetrios [00:02:58]: But what happened? Why aren't you still doing that?
Sherwood Callaway [00:03:00]: We, I think we're a little early on the technology. Technology is.
Demetrios [00:03:03]: I mean that's summer 21.
Sherwood Callaway [00:03:04]: Yeah. And it's actually interesting. I watch the 11x team work through a lot of the same engineering challenges that we were working on at Opkit, kind of from the sidelines. I mean, they've definitely taken things to a new level. They're way beyond where we were previously and way more sophisticated. Those guys are probably as sharp as it gets when it comes to building voice agents. But it's funny to see that the industry is still grappling with how to do that well. And actually, voice agents are really different than regular text-based or multimodal agents.
Demetrios [00:03:38]: I talk about this all the time. Voice is such a rich medium. But when we type something out, if you and I both type the same thing, it's pretty much understood what we're trying to say.
Sherwood Callaway [00:03:52]: Yeah.
Demetrios [00:03:53]: If we say the same thing, it can be that I'm saying it with a certain tone and I mean one thing and you're saying it and you mean something completely different.
Sherwood Callaway [00:04:02]: Yeah.
Demetrios [00:04:03]: And so how you pick that up with these voice agents is very difficult.
Sherwood Callaway [00:04:07]: Yeah. Yeah. Those inflections and those paralinguistic cues obviously make things a lot harder. And then doing that in real time with latency constraints is also really challenging.
Demetrios [00:04:19]: But, so then let's, let's move over to Alice and what you're doing there.
Sherwood Callaway [00:04:22]: Yeah. You probably know the SDR role better than I do. I'm an engineer. But I was a founder previously and I've tried to do some outbound, which is why I have a little bit of appreciation for how...
Demetrios [00:04:32]: Painful it is. You're on the other side of it now, right? You probably get hit up all the time.
Sherwood Callaway [00:04:35]: Now it's all vendors coming at me, and ironically I'm building the cannon, right?
Demetrios [00:04:42]: Imagine they're using Alice to hit you up about that.
Sherwood Callaway [00:04:46]: I, Yeah. I wonder if I could suss out or recognize an Alice written email. Fortunately, I Think they're like, dynamic enough that I wouldn't notice?
Demetrios [00:04:53]: No. You talk about Alice being an agent. What does that mean for you? Why is it agentic as opposed to just a program?
Sherwood Callaway [00:05:05]: Yeah, that's a really loaded question. I mean, getting to the definition of what an agent is is a hotly debated topic. I think the characteristics that agents have are that they can automate complex and ambiguous tasks, tasks that you could not automate previously. So they sort of unlock a new category of things that can be automated with software. They use tools, sometimes they have memory, and they're generally built with one of these new agent frameworks, whether it's LangGraph or the new OpenAI Agents SDK. And effectively, at the center of the agent there is a loop that involves a language model. In that loop, the agent is planning what to do and then taking action, usually through a tool, then observing the results of that action on its environment, and then reasoning about whether it should continue to execute and take another action or whether it should end the execution.
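As an illustration of the loop Sherwood describes, here is a minimal TypeScript sketch of the plan, act, observe cycle. The `callModel` decision function and the tool map are hypothetical stand-ins, not 11x's implementation.

```typescript
// Minimal sketch of an agent loop: plan -> act (tool call) -> observe -> decide whether to stop.
// `callModel` and the tools are hypothetical; a real agent would wire in an LLM client here.

type ToolCall = { name: string; args: Record<string, unknown> };
type Decision =
  | { kind: "tool"; call: ToolCall }
  | { kind: "finish"; answer: string };
type Tool = (args: Record<string, unknown>) => Promise<string>;

async function runAgent(
  objective: string,
  callModel: (history: string[]) => Promise<Decision>,
  tools: Record<string, Tool>,
  maxSteps = 10,
): Promise<string> {
  // The state here is just a running history of everything that has happened so far.
  const history: string[] = [`Objective: ${objective}`];

  for (let step = 0; step < maxSteps; step++) {
    // 1. Plan: the LLM reasons over the current state and picks the next action.
    const decision = await callModel(history);

    // 2. Stop condition: the model decides the task is complete (or that it is stuck).
    if (decision.kind === "finish") return decision.answer;

    // 3. Act: the LLM only *names* the tool; the runtime executes it.
    const tool = tools[decision.call.name];
    const observation = tool
      ? await tool(decision.call.args)
      : `Unknown tool: ${decision.call.name}`;

    // 4. Observe: fold the result back into the state, then loop again.
    history.push(`Called ${decision.call.name}: ${observation}`);
  }
  return "Stopped: step limit reached.";
}
```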
Demetrios [00:06:06]: And inside of Alice, this looks like I have this data. I'm sending an email or I'm scraping a website to find out more about this person and then I'm personalizing a message and then I'm sending the email.
Sherwood Callaway [00:06:21]: Yeah, that's essentially right. So the main campaign creation flow, which is the core part of what Alice provides, starts with sourcing: helping you build a list of people that look like your ideal customer profile. We have an agent that specifically helps with building the right audience. Then there's research on the individual leads that are in your audience. We'll essentially have a deep research agent that creates a really comprehensive report on Demetrios and on Sherwood and anyone else who's enrolled in your campaign, using some of our data and also using web scraping and web search tools. Then, following research, there's sequence and message generation. The sequence is the set of messages that occur in the outreach process for this lead.
Sherwood Callaway [00:07:05]: So I'm going to contact Demetrios on day one, day three, day seven. This is something the agent decides. Finally, there's the content of each of those messages: what am I going to say to Demetrios in each message? What's my follow-up going to be? What value props am I going to emphasize? What personalization am I going to call out? I'm going to say, hey, MLOps, I saw you guys recently broke, what, a million views on one of your videos?
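As a rough sketch of the campaign-creation flow described above (sourcing, research, sequencing, message writing), in TypeScript. The stage interfaces and names are illustrative assumptions; at 11x each stage is really its own agent.

```typescript
// Illustrative pipeline for the campaign-creation flow: source an audience, research each
// lead, plan the outreach sequence, then write each message. Types and names are hypothetical.

interface Lead { name: string; company: string }
interface ResearchReport { lead: Lead; findings: string[] }
interface SequenceStep { dayOffset: number; channel: "email" | "linkedin"; message: string }

interface CampaignStages {
  sourceAudience(icp: string): Promise<Lead[]>;                              // build the audience
  researchLead(lead: Lead): Promise<ResearchReport>;                         // deep research per lead
  planSequence(report: ResearchReport): Promise<SequenceStep[]>;             // e.g. day 1 / 3 / 7
  writeMessage(report: ResearchReport, step: SequenceStep): Promise<string>; // personalized content
}

async function createCampaign(icp: string, stages: CampaignStages): Promise<SequenceStep[][]> {
  const leads = await stages.sourceAudience(icp);
  const sequences: SequenceStep[][] = [];
  for (const lead of leads) {
    const report = await stages.researchLead(lead);
    const sequence = await stages.planSequence(report);
    for (const step of sequence) {
      step.message = await stages.writeMessage(report, step);
    }
    sequences.push(sequence);
  }
  return sequences;
}
```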
Demetrios [00:07:30]: So they call that out and say congrats.
Sherwood Callaway [00:07:32]: They would say congrats on, like, breaking that milestone. I bet you have a need for more growth tools now that you've reached this threshold.
Demetrios [00:07:41]: Dude, you know what's so funny? Back in the day when I was doing the SDR work, I thought it was magic when a friend showed me a tool that gave you the tone you should use when speaking to somebody. They had this list of folks, so you could add your leads, and it would say, oh yeah, this person, we know, judging by how they speak on the Internet, that you can talk to them in this type of tone.
Sherwood Callaway [00:08:09]: Interesting. Yeah.
Demetrios [00:08:10]: And that for me was like, wow, incredible.
Sherwood Callaway [00:08:12]: Yeah.
Demetrios [00:08:13]: This is 10 steps further than that because now I don't need to even know what kind of tone or what to think about and I can pull in relevant data on recent things that have come up and I don't have to craft it.
Sherwood Callaway [00:08:25]: Yeah. And I think one of the product challenges that we have is, you know, we're selling this into sales organizations, to people like you who actually know what they're doing. Basically all of them have previously been SDRs and have opinions about how these things should work. And a balance we always need to strike is that we think Alice can do this very well. Alice is great at sourcing, great at writing messages, great at research, knows how to pick the right tone, knows what aspects of your product and business to emphasize when reaching out to this particular lead. And sometimes the user thinks that they know best and wants the messaging to have a particular structure, wants it to only ever reference this one case study, wants to target a specific type of lead, even though we might not think that's the best lead in terms of performance. So we have to strike this balance between providing control for the users and providing results.
Sherwood Callaway [00:09:26]: And the way that we think we get results is by effectively putting Alice on autopilot and letting her do what she does best. That's something that comes up in our product conversations and with customers a lot.
Demetrios [00:09:39]: Now, we talked about the reliability of this, and you mentioned, all right, well, we chose to outsource the hosting of the agents, in a way. What were the trade-offs? How did you look at that decision of what to go with? And you're using LangGraph, right?
Sherwood Callaway [00:09:56]: Yeah.
Demetrios [00:09:56]: So you said, we're going to use LangGraph. Why, and how did you come to that decision?
Sherwood Callaway [00:10:02]: Yeah, that's a great question. And controversial recently.
Demetrios [00:10:06]: Yeah, I forgot.
Sherwood Callaway [00:10:07]: Grab the popcorn.
Demetrios [00:10:08]: Yeah, there's Pydantic. Why didn't you use Pydantic AI?
Sherwood Callaway [00:10:11]: That's so true. Well, Pydantic... we're mostly a TypeScript shop, so probably not the right fit for us. But yeah, it's been a hotly debated topic, the agent framework wars, recently. I'm sure your listeners know the blog post that Harrison published in response to OpenAI's new agent framework. Their guide to building agents threw a little bit of shade on the idea of workflows and graph structures, and LangGraph is the agent framework that has graph in its name.
Demetrios [00:10:42]: The name. They went all in on the graph idea.
Sherwood Callaway [00:10:45]: Exactly. And actually, to their credit, I think it's an extremely flexible structure, LangGraph and graphs generally, for representing the agent. With LangGraph, everyone can build an agent that gets to production, and I think that is in part possible because of the graph structure that they chose. Now, a lot of other agent frameworks are much simpler. They're modeled around chat, and maybe they're sort of pure agents in the sense that there's no workflow; there are no predefined code paths between nodes.
Sherwood Callaway [00:11:20]: There's really just one node, and it's looping through this reason-act-observe loop, the ReAct agent that people are familiar with. That's a super powerful structure, and I think that as models get better, all agents start to collapse on this simpler form factor. But the reality is models have limitations today. Prompts aren't perfect, our evals are not that good, some things are expensive, and you want to create structure in the graph that ensures that a certain step performs well. A graph structure allows you to do that.
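To make the contrast concrete, here is a framework-agnostic TypeScript sketch (deliberately not LangGraph's actual API): a workflow-style graph with predefined edges pins down the fragile steps, while a "pure" ReAct-style agent is one node that loops until the model decides to stop.

```typescript
// Workflow-style graph vs. single-node ReAct loop, in plain TypeScript for illustration only.

interface State { input: string; notes: string[]; done: boolean }
type Node = (s: State) => Promise<State>;

// Workflow-style: explicit nodes with fixed edges, so a fragile step (say "research")
// always runs, and always runs before "write", regardless of what the model would choose.
async function runWorkflow(
  nodes: Record<string, Node>,
  edges: Record<string, string | undefined>, // predefined code paths between nodes
  start: string,
  state: State,
): Promise<State> {
  let current: string | undefined = start;
  while (current) {
    const node = nodes[current];
    if (!node) break;
    state = await node(state);
    current = edges[current];
  }
  return state;
}

// "Pure" agent: one node looping until the model says it is finished.
async function runReactAgent(step: Node, state: State, maxSteps = 20): Promise<State> {
  for (let i = 0; i < maxSteps && !state.done; i++) {
    state = await step(state); // the model decides everything inside this single node
  }
  return state;
}
```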
Demetrios [00:11:57]: Yeah. And if you're going out and you're selling a product to folks and you're claiming that it's going to work, you can't have it work some of the time. Right. And so a graph gives you much more reliability. And you thinking about it from that systems angle, I'm sure that was one of the things that you realized right away was we need this to be reliable.
Sherwood Callaway [00:12:17]: We recently rebuilt Alice from scratch over the last six or seven months. We started in October of last year, so October 2024. Alice, for context, was launched in January of 2023, so a lot had changed. We actually had our first agent frameworks all come out during that period, and we got AutoGPT and BabyAGI, and none of them worked, or worked just enough to show promise. You're like, wow, there's something cool here.
Sherwood Callaway [00:12:45]: And then the LangChain people are pretty forward thinking, so they launched LangGraph. Really what happened for us is we saw the way the landscape was changing, and then we saw some of the new agentic products that were being released. Specifically, the Replit Agent left a really big impression on our team. Are you familiar with the Replit Agent?
Demetrios [00:13:03]: Yeah, but I don't know why it would be different than like Cursor, Windsurf agents or Devin.
Sherwood Callaway [00:13:08]: Replit Agent was the first of these coding assistants to really be agentic. You know, Cursor had Tab, which is one of their main features, which would use a language model to predict where in your code you're going to jump to next and what the next change would be, and you could sort of just press Tab to apply that. They didn't have an agent in the platform.
Demetrios [00:13:32]: It was Autocomplete.
Sherwood Callaway [00:13:33]: It was effectively really, really good autocomplete that made GitHub Copilot look like a joke. And to their credit, it was all done on custom language models, some small language models involved; really impressive work that they did. Meanwhile, Replit was, in my opinion, the first company to bring this AI coding agent that loops through multiple steps, calling tools to build an entire project. The agent is your copilot and it's actually helping you build the project. The experience was unlike any product experience we had ever seen before. And they did this way before Cursor Composer, which is the Cursor agent and is now the default mode within Cursor.
Demetrios [00:14:16]: And so you saw that and you thought that's where we need to go.
Sherwood Callaway [00:14:19]: Exactly. We saw that, we knew that landscape had changed. AI products were going to look completely different and it had to be agents. And we just basically made the leap based on that conviction.
Demetrios [00:14:33]: And it's interesting that you're reassessing. I imagine you have to reassess continuously because it moves so fast and the landscape and what you're doing is one upping all the time. Yeah, right.
Sherwood Callaway [00:14:49]: Yeah. Keeping up with models is really hard.
Demetrios [00:14:50]: Well, not just the models, but you're seeing, okay, now we're not in a BabyAGI world, we're in a LangGraph world. So we could potentially do what we're trying to do now a lot easier.
Sherwood Callaway [00:15:03]: So obviously models are improving. I mentioned that I think agents collapse on a simpler structure as the models get better. One argument you could make against these agent frameworks, or against the ones that allow for more complexity like LangGraph, is that they incentivize you to create structure and scaffolding to accomplish short-term performance gains. Right? Like, I can add a few extra nodes here and create a new loop within my agent, and that ensures that this step in my campaign creation flow now consistently works. What happens is that you do this to expand the capabilities of your agent in a lot of places, and now you have a very complex graph.
Sherwood Callaway [00:15:42]: Then a new model drops and you're like: I could have accomplished all of what I just accomplished, using the new model, in a much simpler structure.
Demetrios [00:15:52]: I could have just waited.
Sherwood Callaway [00:15:54]: You could have just waited. And so you're doing this constant analysis of whether there's premature optimization: is this investment in my graph structure, or in the framework of my agent, going to pay dividends before a new model comes out that renders it obsolete? And then there's cleanup work to be done to simplify your agent afterwards, because what happens is you become sort of sclerotic, where the agent is locked into a structure that no longer really makes sense and prohibits it from doing other things that you might want.
Demetrios [00:16:29]: What kind of vectors do you look at as you're saying, hey, we need to be forward thinking in three months. We don't want to have to go and rebuild this from scratch.
Sherwood Callaway [00:16:36]: Yeah, that's a really good question.
Demetrios [00:16:39]: And I imagine you can't tell the.
Sherwood Callaway [00:16:40]: So obviously I can't tell the future. I barely keep up with the present. We try to stay on top of all the new model releases, and at minimum, when we get access to a new model, we'll have our team sort of bash on it in regular chat clients, get a feel for how it behaves, maybe use it in Cursor a little bit to see how it performs in an agent loop. We'll drop it into our agent and see what the performance impact is. To be honest, it's a lot of vibes today. We have an agent that works and an agentic product that's pretty successful, that customers are happy with.
Demetrios [00:17:16]: Can you just lock it in there then?
Sherwood Callaway [00:17:19]: I wish we could just say, you know, mission complete and we're done. But I think there's a lot more that we want Alice to do. And our North Star metrics are revenue generated for customers, pipeline generated for customers, meetings booked, positive replies to any of the outbound messages; some of these are hard to do attribution for, by the way. Anything we can do to create lift in those metrics justifies the investment on the engineering side. So we're just kind of keeping an eye on those metrics and on some of the evals that we've set up on the agent. And when we feel like there's a model that changes the game, we spend a little bit of time investigating it.
Demetrios [00:18:05]: Is there any time where you've been creating in creation mode and you thought, I could do this right now, but I think the next model drop is probably gonna make this all not worth my time to invest in it?
Sherwood Callaway [00:18:22]: Yeah, that's. That is a good question.
Demetrios [00:18:27]: Or we could do it retroactively and say, what did change in the last model drop?
Sherwood Callaway [00:18:33]: Yeah, we're still mostly on 3.5 and 3.7 Sonnet. We've obviously got access to some newer models from OpenAI since then, but I don't think we've seen a big enough performance lift from those models. There's also a cost and latency consideration, so it's a pretty complicated analysis at any given time to understand whether we should upgrade. First of all, we've got all these lagging performance indicators, so it's hard to even know what the performance lift is. Then, once we know what the performance lift is, how does that relate to the increase in cost? And what does it mean for the in-product user experience when you're using the agent or interacting with it? Does it feel slow when it's generating those messages? And then, meanwhile, in the background we are doing hundreds of thousands of emails per month.
Sherwood Callaway [00:19:34]: What's our throughput? What's the speed impact there? So I wish I could say we're more systematic about it. Maybe the best thing to say here for the MLOps community is that we don't have it down to a science, and it's a lot of vibe decision-making.
Demetrios [00:19:51]: Yeah, it does feel like you're not the only one.
Sherwood Callaway [00:19:55]: Yeah, I'm pretty sure that's the state.
Demetrios [00:19:58]: Of the industry right now.
Sherwood Callaway [00:20:01]: Yeah, we're not researchers either. I'm pretty proud of what we built with Alice and how far we are as a company with 11x. But we're really focused on commercializing AI and so we're happy to rely on the labs and on the other researchers and the benchmarks to help guide us. But we are very pragmatic and focused on creating a better product and so that guides a lot of our decision making.
Demetrios [00:20:29]: What did you decide on when it comes to LangGraph? Again, you told me that it's hosted. What does that even mean? Because I'm not sure I'm clear on a hosted agent versus what...
Sherwood Callaway [00:20:43]: Yeah, yeah.
Demetrios [00:20:43]: And why.
Sherwood Callaway [00:20:46]: There's a lot of places we could go with that. So, on the main decision-making criteria around LangGraph specifically: in September of last year, LangGraph was one of the best and most mature options for an agent framework. You can argue whether that's still true, but at the time it certainly was, and those guys are a leader in the AI dev tools and agent space. So we really wanted to partner with a team that could educate us, that was leading the way and had solid adoption, and that wasn't going to disappear in six months like many agent frameworks.
Demetrios [00:21:17]: That's so true.
Sherwood Callaway [00:21:18]: Have and probably will.
Demetrios [00:21:19]: Yeah, you have to watch out for that. That's a great axiom to be thinking about.
Sherwood Callaway [00:21:23]: Yeah, yeah. We use LangChain products in a lot of ways. We use their LangGraph agent framework. We use LangGraph Cloud, which I think is now called LangGraph Platform; this is their cloud hosting solution for AI agents. We use LangSmith for observability, which is coupled with LangGraph Cloud. That gives us visibility into the actual agent runs. And then we use some of their other SDKs within the agent itself.
Demetrios [00:21:52]: I still am not sure what it means to be an agent in the cloud.
Sherwood Callaway [00:21:57]: Yeah, yeah. Well, we could break down what the agent actually is. So we talked about how the agent is this loop, right? It's really an LLM in a loop, and the LLM is deciding, reasoning about its state and its environment. So given some input object, which represents the state of the world, and a prompt, which explains who it is and what its objectives are, it reasons about what action it should take. Many of the actions are typically tools that it can call. So if it chooses to call a tool, then you have something that executes the tool. Now, the LLM isn't executing the tool; it's just saying, I would like to call this tool. And then you've got maybe some case statement that says: if the LLM returned a tool call, get the tool call type, and based on whatever type was called, you then invoke a function which is implemented in the code.
Sherwood Callaway [00:22:54]: Or maybe a more modern way to do this is with MCP, where you have the agent have access to...
Demetrios [00:23:00]: Have you had to go and upgrade shit?
Sherwood Callaway [00:23:02]: I personally love MCP, I'm very interested in it, and I've built some side projects with it. But we're not using MCP in production today.
Demetrios [00:23:09]: Yeah, understandable idea.
Sherwood Callaway [00:23:11]: It's a very new technology, so it would probably not be the most pragmatic thing to do. But it's super interesting, and I think it does solve a lot of problems related to tools. We can get into that.
Demetrios [00:23:22]: So then it calls the tool.
Sherwood Callaway [00:23:23]: So it calls the tool. Something executes this tool. It could be the same process; let's call it a Node process, since we run TypeScript, so our agent is implemented in Node and this Node process is running. It calls the tool as a function. The function returns some results, then you update your state to reflect the new state of the world now that the function has been called. Then you repeat the loop, and eventually the LLM says,
Sherwood Callaway [00:23:51]: I'm either stuck and need to exit the process or I've completed the task and I should exit.
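A minimal sketch of the dispatch step he walks through, with hypothetical tool names: the LLM only requests a tool call; the Node process switches on the tool name, runs the matching function, and folds the result back into the state for the next loop iteration.

```typescript
// The "case statement" that turns an LLM tool request into an actual function call.
// Tool names, argument shapes, and the stub implementations are illustrative only.

interface ToolRequest { tool: "web_search" | "scrape_page" | "send_email"; args: any }
interface AgentState { facts: string[] }

async function webSearch(query: string): Promise<string> { return `results for ${query}`; }
async function scrapePage(url: string): Promise<string> { return `contents of ${url}`; }
async function sendEmail(to: string, body: string): Promise<string> { return `sent to ${to}`; }

async function executeToolCall(req: ToolRequest, state: AgentState): Promise<AgentState> {
  let observation: string;
  switch (req.tool) {
    case "web_search":
      observation = await webSearch(req.args.query);
      break;
    case "scrape_page":
      observation = await scrapePage(req.args.url);
      break;
    case "send_email":
      observation = await sendEmail(req.args.to, req.args.body);
      break;
    default:
      observation = "Unknown tool";
  }
  // Update the state so the next iteration of the loop can reason over the new observation.
  return { facts: [...state.facts, observation] };
}
```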
Demetrios [00:23:57]: And so that whole process is done on the LangChain cloud.
Sherwood Callaway [00:24:01]: Yes. So LangChain provides us this agent framework, LangGraph, which creates some nice abstractions for defining the agent that I just described. And what that ends up being is a JavaScript bundle that runs with Node in a container on LangChain servers, or LangGraph servers. And there are two things that sit in front of this that LangChain also provides. The first is an API that allows us to call our agent from either our backend or our frontend. And we do.
Demetrios [00:24:39]: You're just interacting with an API.
Sherwood Callaway [00:24:41]: Exactly. So our agent becomes an API endpoint for us. And as a side note, LangChain has also created this protocol called the Agent Protocol, which is like a set of guidelines for how that API should be structured. What are the different endpoints that it should have? What are the different entities and resources that it should expose, which HTTP methods should be used? What are the payloads?
Demetrios [00:25:07]: This was pre mcp.
Sherwood Callaway [00:25:08]: It's not for tool calling, it's really for exposing your agent. You know, Alice is a constellation of agents, but if we wanted to take Alice and turn her into an API endpoint that you could use instead of our dashboard, then one of the ways we could do it is by implementing an API that adheres to this Agent Protocol that LangChain has defined. So they have this API, and the API sits in front of a queue. Basically, every time I call this API endpoint, and I think an example endpoint would be like a runs create, like POST /runs, now I'm creating an agent run. Something at the front of the queue is holding my connection. Actually, it depends on whether I want to stream the results back or whether I just want to fire off the run. A simpler example is if I just want to fire off the run and don't care about streaming it back.
Sherwood Callaway [00:26:02]: I call this endpoint, I pass the parameters, like whatever agent I'd like to call and whatever the initial state is, and then an event is placed onto this queue. There's a consumer at the other end of the queue that sits in front of the Node process, and for every queue event that is received, it starts the Node process and passes in that state. The nice thing about having that queue there is that if I haven't totally scaled, or if LangChain in this case hasn't scaled up all of the agents so that it can handle a peak load, none of my agent execution requests get dropped or lost. They will all eventually get processed, just slower.
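The API-in-front-of-a-queue pattern he describes can be sketched like this. The endpoint shape and in-memory queue are illustrative assumptions, not LangGraph Platform's actual implementation; the point is that a spike grows the queue instead of dropping runs.

```typescript
// Fire-and-forget agent runs: an API handler enqueues the run, a consumer drains the queue.

interface RunRequest { agent: string; initialState: Record<string, unknown> }

const queue: RunRequest[] = [];

// Handler behind something like "POST /runs": accept, enqueue, return immediately.
function createRun(req: RunRequest): { accepted: true; position: number } {
  queue.push(req);
  return { accepted: true, position: queue.length };
}

// Consumer: processes runs at whatever rate the workers can sustain. During a traffic spike
// the queue grows and runs complete later, instead of requests being dropped or lost.
async function consume(runAgent: (r: RunRequest) => Promise<void>): Promise<void> {
  while (true) {
    const next = queue.shift();
    if (!next) {
      await new Promise((resolve) => setTimeout(resolve, 100)); // idle wait
      continue;
    }
    await runAgent(next);
  }
}
```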
Demetrios [00:26:51]: Yeah.
Sherwood Callaway [00:26:51]: And you can scale the queue or LangChain provides some ways for you to scale the queue. If you are hosting this yourself, you could also have your queue do some auto scaling or.
Demetrios [00:27:00]: Well, let's talk a bit about that because I guess one of the huge value props here is that you don't have to deal with that scaling issue. You don't have to worry about.
Sherwood Callaway [00:27:09]: So.
Demetrios [00:27:09]: Nice. Yeah, but what are the trade offs? There's always going to be something that you're.
Sherwood Callaway [00:27:17]: There's no free lunch. Yeah, there are trade-offs. I mean, this is a lot of infrastructure we didn't have to build.
Demetrios [00:27:24]: Yeah.
Sherwood Callaway [00:27:25]: And it's infrastructure we could have built. But Alice is also like a SaaS tool. I mean, there's the agent, but then there's the dashboard, and we have other background processes happening. So we're building a whole application and then we're building this agent that needs
Demetrios [00:27:39]: To hook into the application and you rebuilt it in October when you already had something out. So you can't just turn around and spend a year and a half building out the infrastructure, I imagine so that you can get this agent going.
Sherwood Callaway [00:27:53]: Oh yeah, there's a whole meta here, which is that our customer base was growing, our old product was starting to crumble.
Demetrios [00:28:00]: You were going fast.
Sherwood Callaway [00:28:01]: We had to go fast, go to market was not slowing down for us. We had to get this thing out really quickly. And it was quite the crunch. It was super painful. But we're on the other side of it now, so I can laugh about It.
Demetrios [00:28:14]: I heard somebody talk about that being type two fun.
Sherwood Callaway [00:28:17]: Yeah, yeah, that's exactly right. It was type two fun. It was a lot of late nights. My skin looks a lot better than it did in March or whatever, February or something.
Demetrios [00:28:27]: You got some sun. Thank you.
Sherwood Callaway [00:28:28]: Yeah.
Demetrios [00:28:30]: Not as pasty. I remember the first time that I met a little more pasty.
Sherwood Callaway [00:28:35]: Let's see, the trade-offs. It was super advantageous for us to do this just to get off the ground really quickly and not have to worry about building that infrastructure. I think some of the trade-offs were, you know, when we started using LangGraph Cloud, it was a newer product. We did run into some scaling issues at times. There were some limits on the extent to which they could scale an individual, what they call, deployment. So in fact, we actually had to create lots of deployments.
Demetrios [00:29:04]: Those queues or those.
Sherwood Callaway [00:29:06]: You can think about it like this: if you're deploying a service in Amazon, you deploy an ECS service. If you're deploying an agent in LangGraph, you create a deployment. And we had a lot of volume, spiky volume, and that led to us running into connection timeouts during those peaks. Obviously, one of the things we could do to make this better is to smooth out our request traffic, which we have done, and we take full responsibility for our query patterns. But we were also exceeding some of the limits of what they could do, the extent to which they could scale an individual deployment. So the way we got around this was we created 10 different deployments and had all of our clients round-robin across them. Now it's been fixed, so we're consolidated again. And then other challenges.
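A sketch of that round-robin workaround, with hypothetical deployment URLs and request shape:

```typescript
// Spread agent runs across N deployments by round-robin so spiky traffic doesn't pile up
// on a single deployment. URLs and payload are hypothetical.

const deploymentUrls = Array.from({ length: 10 }, (_, i) => `https://agent-${i}.example.com`);
let nextIndex = 0;

function pickDeployment(): string {
  const url = deploymentUrls[nextIndex];
  nextIndex = (nextIndex + 1) % deploymentUrls.length; // rotate to the next deployment
  return url;
}

async function startRun(initialState: Record<string, unknown>): Promise<Response> {
  return fetch(`${pickDeployment()}/runs`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ initialState }),
  });
}
```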
Sherwood Callaway [00:30:04]: We have a deployment pipeline that deploys all of our backend services. Our code is stored in a Monorepo, including the agent itself. So usually at any given time, whatever the head of main in our Monorepo is, represents the state of the world, what should be in production. And our deployment system works for all of our backend services. We've got a separate set of infrastructure now for our agents. So if we want to also auto deploy and auto roll back our agents, there's all of this additional deployment infrastructure we need to create to support that. And our deployment system gets way more complex. And then last thing that comes to mind is on the observability front, agents are software just like everything else.
Sherwood Callaway [00:30:53]: And we want to see the logs, we want to see the traces. And we typically pipe all of our observability telemetry, logs, traces, and metrics, to Datadog.
Demetrios [00:31:05]: This is different than Lang... which... what's the service? Lang...
Sherwood Callaway [00:31:09]: Yeah. This is another small complaint about the LangChain ecosystem: they have LangChain, LangSmith...
Demetrios [00:31:18]: Everything is graph. It's the year of the Lang mem.
Sherwood Callaway [00:31:20]: Is the year.
Demetrios [00:31:21]: It's not the year of the Lang Lang mem.
Sherwood Callaway [00:31:22]: There's more ironic because it's hard to remember.
Demetrios [00:31:25]: Yeah. And it's all about memory.
Sherwood Callaway [00:31:30]: And then there's two sets of documentation, for both Python and TypeScript. So you're like, am I on the right documentation page right now?
Demetrios [00:31:38]: Painful. Yeah.
Sherwood Callaway [00:31:40]: We've gotten better at navigating that. I think they've gotten better at positioning there and figuring out the products.
Demetrios [00:31:45]: Yeah. Call it something different.
Sherwood Callaway [00:31:47]: Yeah.
Demetrios [00:31:47]: The Langs have been exhausted.
Sherwood Callaway [00:31:49]: Yeah. It's the calling card.
Demetrios [00:31:51]: We.
Sherwood Callaway [00:31:52]: On the observability side, LangSmith offers a lot out of the box, which is great. But I'm a former infrastructure and observability engineer, and one of the core tenets of that role is that all of the telemetry should be in one place, so that when something's going wrong, a developer...
Demetrios [00:32:07]: You don't have to search, you don't.
Sherwood Callaway [00:32:08]: Have to go looking for it. And if you're not, if you're a product engineer who doesn't know what the whole observability stack is, you can just like jump into that tool and it's got everything. All of the logs are there.
Demetrios [00:32:18]: Yeah.
Sherwood Callaway [00:32:18]: You don't need to remember, you know, there's. There's a customer that's complaining or there's an outage or.
Demetrios [00:32:23]: Oh, there's. This is everything. Except for that.
Sherwood Callaway [00:32:25]: Except for that one thing, which, by the way, is where the bug is. So getting some of our logs and other telemetry into Datadog from LangSmith was not that easy, and it's still a little bit painful. I know that they're working on better integrations with Datadog to make those pipes better.
Demetrios [00:32:44]: That seems relatively trivial. I guess it's a huge problem for you, but it's like there's nothing that's blocking them from making that easier process in six months.
Sherwood Callaway [00:32:59]: It depends on where you think these agent framework companies are going.
Demetrios [00:33:03]: Oh, yeah.
Sherwood Callaway [00:33:03]: I think there's an incentive for LangChain to own everything. They want to lock you into their platform. All of their agent frameworks are open source, and a lot of their other tooling is open source, so the way that they capture value is by you using their hosting solution and by you using their observability solution. And so making the pipes easier so that I can get to Datadog may not be the number one priority, but they do want to make their observability tool really good too. So it kind of ends up becoming a question.
Sherwood Callaway [00:33:39]: Do you think that agents are different enough to require a different cloud? Will we end up using dedicated clouds for agents or will we just deploy agents like we deploy other software to existing clouds?
Demetrios [00:33:53]: Yeah. Is it complex enough to warrant its own cloud?
Sherwood Callaway [00:34:01]: I'm not sure about that. We talked about the infrastructure pieces that are required to host an agent, and you could imagine open source versions of all of those, or a single open source solution that combines those components. Something like an Elasticsearch or a Kafka that everyone knows how to work with, that everyone knows how to deploy, and that AWS has a one-click deploy for. So I don't think that's out of the realm of possibility.
Demetrios [00:34:39]: I'm surprised AWS hasn't stolen LangGraph actually and run it as a service on AWS. Now that you say that...
Sherwood Callaway [00:34:46]: They have their own agent framework, and they have AWS Bedrock and Bedrock Agents.
Demetrios [00:34:50]: But their agent framework, it's one of those ones where it's older, right? And I don't know anybody who's using that.
Sherwood Callaway [00:34:58]: I don't know anyone who's using Bedrock Agents. No offense to the Bedrock team, they're doing great work. We use Bedrock for inference. Bedrock has a partnership with Anthropic where they can host some Anthropic models.
Demetrios [00:35:12]: But why Bedrock, as opposed to just going to Anthropic? Because of the reliability?
Sherwood Callaway [00:35:16]: We originally switched from Anthropic to Bedrock because of rate limiting. We couldn't get high enough rate limits through Anthropic; you gotta know someone, you gotta have an angel. And so Bedrock gave us higher rate limits. Now, I think there's a difference in cost between the two platforms, and so this is where you get your procurement team involved.
Demetrios [00:35:40]: Sure. You don't want to give us these better rate limits.
Sherwood Callaway [00:35:42]: Yeah. Yeah.
Demetrios [00:35:45]: Well that's a great point. So then.
Sherwood Callaway [00:35:49]: It's just a ghost.
Demetrios [00:35:50]: Yeah, just a ghost. That's fitting for this room.
Sherwood Callaway [00:35:55]: It's super cool space.
Demetrios [00:35:57]: Yeah, it is cool. Thought about getting a bunch of cigars.
Sherwood Callaway [00:36:00]: Oh my God, that would be cool. But I wouldn't be able to like step out of work to come do the Podcast.
Demetrios [00:36:04]: Yeah. No. Or we don't have to smoke them. We just chew on them and look cool with them.
Sherwood Callaway [00:36:08]: Just have one like smoking in a tray.
Demetrios [00:36:12]: Yeah. I bet they would be very excited about us doing that.
Sherwood Callaway [00:36:17]: We could do an after hours thing.
Demetrios [00:36:19]: Yeah. Well, what else, man? There's a few other things, right? Like the trade-offs with LangGraph and hosting an agent there, Ling... Lang Cloud.
Sherwood Callaway [00:36:29]: LangGraph Cloud.
Demetrios [00:36:30]: They missed an opportunity to confuse the shit out of people more.
Sherwood Callaway [00:36:34]: Yeah.
Demetrios [00:36:34]: And just call it Langloud.
Sherwood Callaway [00:36:35]: I think LangCloud would have worked.
Demetrios [00:36:37]: That's the next iteration. Actually, you know what I would love to talk about is you as a observability guy. How do you see observability being different now with agents, if at all? And it can be totally cool if you say nothing. It's same same. It's still software.
Sherwood Callaway [00:36:56]: No, no, I think this is a rich question. We might have to noodle on it a little bit. So my background in observability is that I was the second infrastructure engineer at Brex, had an amazing mentor who grew me a lot there, built a lot of cool production infrastructure, and then at some point established the observability team at Brex. We had a couple of folks who were specifically focused on developer tools related to incidents and understanding how systems are behaving in production. In practice, that meant we set up Datadog, we set up auto-instrumentation for all of our services. We had like 30 or 40 different microservices running in our Kubernetes cluster.
Sherwood Callaway [00:37:39]: All of them were sort of uniformly and automatically instrumented. We had an incident process and tool that you would use for kicking those off and resolving them, and some other observability tools as well for things like fraud detection. And client-side observability is actually an interesting area too. That means capturing errors that occur on the client as opposed to on the server. For example, an exception occurs in your web app, and it happens in the user's browser. How do you send that back to a server somewhere so that someone can look at it, so someone knows that something went wrong?
Sherwood Callaway [00:38:23]: There are a lot of tools for doing this today. I think LogRocket was one of the first ones. PostHog has a session recording feature; it's pretty cool. With all the browser DOM events you can reconstruct exactly what happened in their experience. They're not actually recording your screen, but they might as well be, because when you look at the recordings, it's the exact same thing that happened. I think observability with agents can be really interesting.
Sherwood Callaway [00:38:49]: So there are two separate things we can talk about here. One is: how do we observe agents as more of our production systems become agents? How do we keep track of what's happening, make sure they're performing reliably, and reconstruct things after the fact? And the other one is: what does observability for regular software look like in a world of AI agents? To tackle the latter one, I think agents have a lot of potential to automate the production infrastructure or site reliability process. So much of what I do in my day-to-day is: someone says there's an issue, a customer reports an issue, or a customer success person reports an issue. I get an organization ID, maybe a URL, and I can roughly guess it happened within the last hour or something like that. Maybe I get a Loom or a session recording. That's crazy. And then I take whatever clues I've got, take them to Datadog or to another observability tool, or maybe I can use our production access to query our database.
Sherwood Callaway [00:40:00]: So if it seems like a record might be in a bad data state and I have an identifier, I can query for that in our Postgres database. And in Datadog I can plug in a trace ID or the organization ID and get the logs related to that request or org, and I start piecing things together.
Demetrios [00:40:19]: You're probably asking people on Slack. It's funny that you're mentioning this, because we had Willem in here yesterday and he's working on AI SDR... no, sorry, that's what you're working on. He's working on AI SRE, and I...
Sherwood Callaway [00:40:34]: I built one over there, actually. Yeah, I won a hackathon building an AI SRE.
Demetrios [00:40:39]: Yeah. That's awesome.
Sherwood Callaway [00:40:40]: That's so funny.
Demetrios [00:40:41]: He's doing that. He built a knowledge graph on all the stuff that's happening in Slack, all the stuff that's happening in the logs, all the stuff that's happening in your environment: how do you connect all of that and know what's valuable to tell the agent? Sometimes, he was saying, the agent will just go and do its thing and come back with nothing. But it will tell you, hey, I searched these places. And so then you don't have to go and search them.
Sherwood Callaway [00:41:06]: Yeah, that's valuable. Yeah. You have to really trust the agent's doing its job well. Especially if it comes back with nothing.
Demetrios [00:41:13]: Yeah, but you're like, are you sure about that? I don't believe you. Yeah, I'm gonna go check in.
Sherwood Callaway [00:41:17]: Agents like our agents just, they want to please, you know, language models want to please. Yeah, yeah.
Demetrios [00:41:22]: It's probably rare that they come back with nothing.
Sherwood Callaway [00:41:23]: Products like that have a lot of potential. It's a clear set of tools that you would integrate with, a clear set of data sources. And you're just correlating this commit with this error log line with this bad record in the database, over and over again, then summarizing what happened, creating a Linear ticket out of it, and responding to either the customer or the internal user. Agents can automate a lot of that process. So we see them automating the process of writing the code and shipping it, but we don't really see them automating what's happening in production.
Demetrios [00:41:57]: Yeah, root cause analysis of what's going on, why did that fail? And, and even preemptively saying, you sure you want to put that kind of scale on that kind of load right now?
Sherwood Callaway [00:42:09]: That's another piece, actually: AI code review, which happens somewhere in between those two things. And there are some cool tools there. I know someone, Merrill Lutzky from Graphite, which is one of the leading AI code review tools. We use Graphite and love it. It pretty consistently catches things. Now, the balance you need to strike there from a product perspective is how do you make sure that you're catching things and providing useful suggestions without being noisy? Because there's nothing devs hate more than
Demetrios [00:42:36]: Noise. That alert fatigue. Yeah. I know a friend that plugged an LLM into their GitHub instance and then turned it off like six hours later. They were like, this is not going to work.
Sherwood Callaway [00:42:47]: Yeah, yeah.
Demetrios [00:42:49]: So really recognizing what's valuable and what's not, that's a huge thing now when it comes to the observability of the agents.
Sherwood Callaway [00:42:57]: Yeah, I think so. I think that agents are kind of complex and scary and new, but it's important to remember that agents are just software. They're just regular software programs. And so observability in a lot of ways is the same. You should be instrumenting them with metrics, logs, and traces. You can do this using open source libraries. LangChain and LangGraph have a lot of this stuff out of the box.
Sherwood Callaway [00:43:26]: If you use that, you get that in LangSmith. But OpenTelemetry is the standard that has been used for most software monitoring tools thus far, and you can use a lot of those tools to instrument your agent. Depending on the agent framework that you use, it might already have hooks for emitting OpenTelemetry-compliant traces at each step, or similar things for metrics and logging. Otherwise you might have to build a little bit of your own middleware, or maybe wrap the agent framework so that it's instrumented. But all of those tools work just as well. Now, if you use a cloud solution like LangGraph, you are restricted in terms of what other processes you can run alongside your agent.
Sherwood Callaway [00:44:15]: And that is something that you should consider when you're thinking about observability for the agent. Because, for example, a very normal way to collect metrics, logs, and traces with traditional software would be to have a sidecar container that runs an agent process, like a StatsD agent; or Datadog has DogStatsD, I think, which is their flavor. You can't do that in a cloud environment that you don't have a lot of flexibility or control over.
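For instrumenting an agent step yourself, a minimal sketch using only @opentelemetry/api calls; it assumes the OpenTelemetry SDK and an exporter (for example, one that forwards to Datadog) are configured elsewhere, and the step names are illustrative.

```typescript
// Wrap each agent node / tool call in a span so every run shows up as a trace.

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("alice-agent");

async function tracedStep<T>(stepName: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(stepName, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Usage (hypothetical step): await tracedStep("write_message", () => writeMessage(report, step));
```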
Demetrios [00:44:42]: Interesting.
Sherwood Callaway [00:44:43]: So you might only be able to do that in something like AWS or GCP or Azure. Something to consider when you're thinking about where to host your agent is: what options does this give me in terms of observability, in terms of getting those metrics out?
Demetrios [00:44:57]: It restricts your architecture decisions. Like you can't just go and have a blank canvas. You go and draw in the lines.
Sherwood Callaway [00:45:04]: Yeah. And none of the agent clouds are as sophisticated as the, as the big clouds.
Demetrios [00:45:09]: So are you linking up these observability metrics with the evals, or are you fully separating them or are they kind of mixed and matched? How do you look at those?
Sherwood Callaway [00:45:18]: Yeah, evals. I think a lot of companies, most companies are not doing evals today. Everyone's interested in evals.
Demetrios [00:45:28]: But it's all vibes.
Sherwood Callaway [00:45:29]: But it's all vibes. A lot of vendors are talking about how they have support for evals. Usually those features take the same shape: you can curate a dataset from some of your production traces, and then you can run your agent n times, for the n items in the dataset. Each time you run your agent, it will run a set of heuristics against the output, and for each of those heuristics you get a score or a label. A common one is to detect hallucinations. We actually use this in our message writer.
Sherwood Callaway [00:46:08]: So we have an eval that actually automatically runs on some percentage of production.
Demetrios [00:46:15]: Oh, nice.
Sherwood Callaway [00:46:16]: Traces for the agent that generates the messages.
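As a sketch of that general shape (a dataset curated from production traces, a set of heuristic scorers, and a score or label per heuristic per item), with illustrative types:

```typescript
// Generic eval harness: run every scorer over every example and collect scores.

interface TraceExample { input: string; output: string }
type Scorer = (example: TraceExample) => Promise<{ name: string; score: number }>;

async function runEval(dataset: TraceExample[], scorers: Scorer[]) {
  const results: { example: TraceExample; name: string; score: number }[] = [];
  for (const example of dataset) {
    for (const scorer of scorers) {
      results.push({ example, ...(await scorer(example)) }); // one score per heuristic per item
    }
  }
  return results;
}
```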
Demetrios [00:46:18]: And this is also done with LangChain?
Sherwood Callaway [00:46:22]: This one... we've implemented it, and LangChain offers this capability, but for this one we were trying out another tool called Arize, which is another nice LLM or AI observability tool, and they also have support for evals. So this one we implemented using Arize. I don't have strong opinions yet. Oh, and this is using their core platform.
Demetrios [00:46:43]: Okay.
Sherwood Callaway [00:46:43]: Although Phoenix is really interesting. I've been interested in Phoenix for a while as a way to deploy sort of a production-grade observability tool locally for all of the developers on my team. So, you know, we're running these agents, and we can run our entire application locally using a single command, which is sweet. It starts up a bunch of services, starts up the client, starts up the agent server, which is built on LangGraph. What would be really nice is to also start up a dev server that allows you to view all of the agent traces that are happening on your local agent, and maybe even iterate on some of the prompts through that server. I think Phoenix might allow us to do something like that.
Demetrios [00:47:26]: Nice. Okay.
Sherwood Callaway [00:47:27]: Just make like a much better local development experience for building the agent.
Demetrios [00:47:30]: All right. So sorry I got you distracted with evals in production.
Sherwood Callaway [00:47:35]: The main one we have running, or one of the ones we have running today, is detecting hallucinations in messages. Obviously super awkward if you say, hey, Demetrios, congrats on passing 10,000 subscribers.
Demetrios [00:47:46]: And you're like, it was a million.
Sherwood Callaway [00:47:48]: It was a million, you.
Demetrios [00:47:50]: My ego is hurt.
Sherwood Callaway [00:47:51]: Exactly. Actually, that might generate a response.
Demetrios [00:47:54]: Yeah, yeah. Not the kind of response you want, probably, but actually I was telling you this earlier. Like, I imagine a lot of the stuff that AI generates, or Alice generates specifically, is a lot better than the human stuff where I was saying some random shit in my.
Sherwood Callaway [00:48:13]: Yeah, that's so funny. And we get scrutinized a lot on our messages too, because users will log in and they'll be like, you know, I saw this slightly weird message that you sent to one prospect. And maybe the prospect's LinkedIn says they work at, like, Intuit, and our data set, our email, says, like, Intuit Accelerate or something like that, because...
Demetrios [00:48:34]: So it's a little bit off.
Sherwood Callaway [00:48:35]: Yeah.
Demetrios [00:48:35]: It's not enough.
Sherwood Callaway [00:48:36]: A little off.
Demetrios [00:48:37]: Or it's like when you're spelling is it dessert or desert? And you're like, yeah, it's the same when you look at it and you don't know how to spell like me. And so you're like, yeah, I said dessert, right?
Sherwood Callaway [00:48:48]: Yeah. There's one other thing that's really funny about this. We try to show sources in the research alongside the message that we generate in the product. And oftentimes we'll have customers say, why'd you say that? Maybe the message says, we helped a similar client in the healthcare services industry
Demetrios [00:49:10]: Yeah.
Sherwood Callaway [00:49:11]: Save, you know, $50,000 in the last year using this solution. And our customer will say, wait, where'd you get that? That doesn't sound true. Then we pull up the research and, sure enough, there's a case study on their website that says exactly that. There's a lot of that that happens in the day-to-day, and it can be really delightful when it does happen. But we're also held to a high standard for that messaging. I think certain customers are more savvy when it comes to sales and understand that a lot of times sales reps will bend the truth or say things to provoke a response. And Alice can do that if you want her to.
Sherwood Callaway [00:49:49]: You. You can nudge her.
Demetrios [00:49:50]: Be a bit more sassy.
Sherwood Callaway [00:49:51]: Yes, we effectively have sassy mode if you want to enable that. We call it brand voice.
Demetrios [00:49:58]: Interesting. But.
Sherwood Callaway [00:50:01]: But also, yeah, we want to, we generally like want to be factual. Everything that Alice should say should be based in fact. And it's, it's a product design problem to make sure that the, the sources are alongside the messaging so that you can correlate. You know, if a user wants to double check something, they can do that easily.
Demetrios [00:50:19]: So then back to the evals. You're saying, all right, we are doing evals with LangSmith. We're also doing evals just kind of on vibes, figuring it out as it's coming out. And you have this cycle of whatever, 20%, did you say?
Sherwood Callaway [00:50:37]: So we do this in Arize right now. And the jury's out on which... you know, in our next episode, we can talk about which of these eval platforms I like better.
Demetrios [00:50:46]: Interesting. So you're doing it both with LangSmith and... or does LangSmith only do observability?
Sherwood Callaway [00:50:52]: LangSmith has an eval offering, and we've used an earlier version of it. We don't use it right now. Our team has been experimenting with Arize, which is another LLM observability platform, and they also have an eval offering that we've been playing around with.
Demetrios [00:51:06]: Oh, that's fascinating. So you have evals in LangSmith, but you're like, we're just going to stick to the observability piece of it.
Sherwood Callaway [00:51:15]: We tried an earlier version of it and weren't satisfied. It felt like the experience of setting it up was a little too hard. And our team is very curious about AI tooling, and we're always kind of shopping around for whatever the latest and greatest is.
Demetrios [00:51:30]: That's awesome.
Sherwood Callaway [00:51:31]: So we test drive a lot of products, and for evals, right now we're test driving Arize.
Demetrios [00:51:36]: Very cool.
Sherwood Callaway [00:51:37]: For this eval, we sample like 1% of our total traces. We have too much volume to run it on everything, because we're running an LLM in our eval, LLM as a judge, so it could be expensive if we ran it for 100% of our traces. We only do it on like 1%.
Sherwood Callaway [00:51:57]: But what's cool is it gives us this number between 0 and 1, which represents the average across all of the traces in a period of whether they have hallucinated or not. If that number is close to one, it means we're not hallucinating. If the number is close to zero, it means we are generally hallucinating. Exactly. You're burning that, right? This SDR is writing emails from their satellite connection, you know, from Deep Ply.
Demetrios [00:52:26]: Yeah.
Sherwood Callaway [00:52:28]: So that's the number we watch as a bellwether of whether a recent change to the prompt has made it less reliable or something like that. And what the eval is doing is comparing the inputs, which include this research report we've assembled, and the outputs, which are the email we've written, and basically asking, hey, is there anything in the email that doesn't match the research report?
Demetrios [00:52:55]: Okay, I have a very clear picture of how you're running your agents. You're hosting it with LangGraph. You've got LangSmith as the observability. You're exposing an API for that agent that you can then build services off of. You're relaying whatever that output is, taking 1% of it, and feeding it into Arize right now, saying, we've got these evals, and you're matching it up: this is what the research says, this is what you said, how accurate is it? Right.
Sherwood Callaway [00:53:29]: That's effectively how the eval works. Yeah, we send all of our traces to Arize in the same way that we send them to LangSmith. These are the agent traces, and we run a bunch of agents for Alice. One of these agents is the message writer. The input for the message writer is the research report plus some demographic info, and the output is the message. And so for the traces that are message-writer traces, for 1% of those, we run this hallucination eval.
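As a rough illustration of the eval being described here (sample roughly 1% of message-writer traces, ask an LLM judge whether the email is grounded in the research report, and average the labels into a score between 0 and 1), a sketch could look like the following; the prompt, judge model, and trace field names are assumptions, not 11x's actual implementation:

```python
# Illustrative LLM-as-a-judge hallucination eval over a ~1% sample of traces.
# Field names, prompt, and judge model are assumptions for this sketch.
import random
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an outbound sales email for hallucinations.

Research report:
{report}

Email:
{email}

Does the email contain any claim that is NOT supported by the research report?
Answer with a single word: "grounded" or "hallucinated"."""


def judge(report: str, email: str) -> int:
    """Return 1 if the email is grounded in the report, 0 if it hallucinates."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(report=report, email=email)}],
        temperature=0,
    )
    return 1 if "grounded" in resp.choices[0].message.content.lower() else 0


def hallucination_score(traces: list[dict], sample_rate: float = 0.01) -> float:
    """Average the judge labels over a random ~1% sample of message-writer traces."""
    sampled = [t for t in traces if random.random() < sample_rate]
    if not sampled:
        return 1.0  # nothing sampled in this period
    labels = [judge(t["research_report"], t["email"]) for t in sampled]
    return sum(labels) / len(labels)  # close to 1.0 means "not hallucinating"
```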
Demetrios [00:53:59]: Wait, but then what does... Because you've got Datadog, right? And Datadog is your observability, like the system observability. What is LangSmith doing then?
Sherwood Callaway [00:54:09]: LangSmith. So, I mean, some of these things are duplicated. LangSmith allows you to do logs, metrics, and traces too.
Demetrios [00:54:16]: But only for the agent, for the...
Sherwood Callaway [00:54:18]: Only for what's deployed in LangGraph Cloud.
Demetrios [00:54:20]: Oh, I see.
Sherwood Callaway [00:54:21]: Which is our agent. Now, it's nice to get all of that data into Datadog if possible, because that's where it is for all of our services.
Demetrios [00:54:29]: Yeah, I remember you saying that. Now that makes a lot of sense, because it's like Datadog can't cover what's happening in LangGraph Cloud.
Sherwood Callaway [00:54:37]: Exactly. I mean, there's two walled gardens right now.
Demetrios [00:54:40]: Yeah. And interesting.
Sherwood Callaway [00:54:42]: I think the LangChain team... I don't want to publicly promise anything on behalf of them, but I'm pretty sure they're working on some kind of integration that allows us to pipe more of our data to Datadog.
Demetrios [00:54:52]: But they're not going to let you put a Datadog instance in LangGraph Cloud.
Sherwood Callaway [00:54:56]: I don't think their ambition is to become a general observability platform, like having our backend services send their logs to LangGraph Cloud. That's probably outside of their scope.
Demetrios [00:55:09]: Yeah, that makes sense. So then why would you want Datadog in there? It would be better if you just took all of that from LangGraph and said, hey, we're throwing all the LangGraph traces, and whatever your cloud is doing, into Datadog.
Sherwood Callaway [00:55:27]: Yeah, and that would leave them... their value add would then be running the actual agent, hosting that API and the queue that sit in
Demetrios [00:55:34]: front of the agents, which, as we talked about earlier, may or may not be needed.
Sherwood Callaway [00:55:41]: Yeah. So everyone's being squeezed right now. You've got Datadog, which is trying to become LangGraph Cloud without the hosting, I guess. Then we've got LangGraph Cloud, which is trying to become Datadog. And then we've got AWS, which is probably doing both. So they're all trying to carve out more of the tooling and infrastructure that needs to surround the agent, and own it. But there's definitely a lot of overlap between these different tools.
Demetrios [00:56:06]: Yeah. And then you have Arize. That's doing...
Sherwood Callaway [00:56:09]: Arize is sort of a LangSmith competitor.
Demetrios [00:56:11]: Exactly.
Sherwood Callaway [00:56:13]: And they're really focused on LLM observability specifically. So they compete more directly with LangSmith than with Datadog, although Datadog has their own LLM observability suite.
Demetrios [00:56:24]: Yeah, but Arize is an API that you're hitting, or you're sending data to. It's like a cloud, a hosted version.
Sherwood Callaway [00:56:30]: Also, Arize is OpenTelemetry compliant, so...
Demetrios [00:56:36]: Oh. So, yeah. Okay.
Sherwood Callaway [00:56:37]: In the same way, you know, we can just instrument our agents using some open source libraries, and all of the data flows into Arize. Interesting. Yeah.
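To make the OpenTelemetry point concrete, here is a generic sketch: the standard OTel SDK exports spans to whatever OTLP-compatible backend you point it at, and an open-source instrumentor emits spans for the agent framework's calls. The endpoint and header names below are placeholders, not Arize's real values:

```python
# Generic OpenTelemetry setup: export agent spans to any OTLP-compatible backend.
# The endpoint and auth header below are placeholders -- consult the vendor docs.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.langchain import LangChainInstrumentor

exporter = OTLPSpanExporter(
    endpoint="https://otlp.example-observability.com",  # placeholder endpoint
    headers={"api_key": "YOUR_API_KEY"},                 # placeholder auth header
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Open-source instrumentation turns LangChain/LangGraph calls into OTel spans,
# which any OTel-compliant observability platform can ingest.
LangChainInstrumentor().instrument(tracer_provider=provider)
```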
Demetrios [00:56:45]: Dude. Well, this has been awesome, man. I appreciate you being so forthcoming. We'll see if your PR team allows half of the shit that we just talked about to get out into the world.
Sherwood Callaway [00:56:55]: I think the technical stuff is really good. This is fun. We could have gone on for longer, I think.
Demetrios [00:57:01]: Yeah, I think we can. Well, we'll continue the conversation tonight.
Sherwood Callaway [00:57:04]: Important.
Demetrios [00:57:05]: I'm going to.