The Hidden Bottlenecks Slowing Down AI Agents
SPEAKERS
Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.

Demetrios is the founder of the MLOps Community. At the moment he is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps Community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
Demetrios chats with Paul van der Boor and Bruce Martens from Prosus about the real bottlenecks in AI agent development: not tools, but evaluation and feedback. They unpack when to build vs. buy, the tradeoffs of external vendors, and how internal tools like Copilot are reshaping workflows.
TRANSCRIPT
Paul van der Boor [00:00:00]: The evals one. Typically, the bottleneck is actually not the tool itself. It's about having the ability to generate those eval sets and having a product feedback loop which you can then take data from to measure against, you know, new models and improve and sort of have a real flywheel, let's say.
Demetrios [00:00:22]: We approach the build versus buy question when it comes to building agents and what kind of vendors you want to onboard in the enterprise setting. Bruce and Paul talk us through this. Paul is the VP of AI at Prosus, and Bruce is an AI engineer. I myself am Demetrios, the founder of the MLOps community. Let's jump into this conversation. You're saying, hey, we have this culture of we should buy it, we shouldn't build it, we don't want to think about building it. Despite that, as we'll hear from Bruce later, it's hard. The tools that you have are so immature at this moment in time, and I kind of went step by step over the main areas where you have these categories forming.
Paul van der Boor [00:01:06]: Well, maybe what I can emphasize is sometimes the tool is not the hard part. So let's say if you want to have an eval solution, you want to know whether your models are good, that's the problem you have. And you want to know whether they get worse and how they perform and where they do worse and whether they do better. The hard part is not the tool. The hard part is you need an eval set that you curate. You need to have real users that give you new conversations that you can measure against. You need to have that capability. If you don't have that, it doesn't matter whether you build the tool yourself or not.
Paul van der Boor [00:01:45]: And so the tool itself, at least in our setup, is not the thing that unlocks our ability to do evals better.
Demetrios [00:01:55]: And human time is expensive.
Paul van der Boor [00:01:58]: Yes.
Demetrios [00:01:58]: What I think about, though, is there's an argument for if you have a tool that makes it much easier for humans to get that eval set, then you're saving time, Right?
Paul van der Boor [00:02:09]: But I think that you're absolutely right. But what we do in the team, to give you an idea, is we have these labeling parties. So we buy pizza, we invite folks. Everybody's welcome, not just folks from the AI team or engineers; anybody in the company is welcome. And we'll, for example, show them these are a certain set of answers for a problem set. So we'll do, like, I don't know, images for food, as one example. Or we do code. Obviously, if you do code evaluation, you need the folks who can rank or score or evaluate code snippets. We will do, let's say, customer conversations.
Paul van der Boor [00:02:45]: And then you need people in different languages, if it's Polish or Portuguese or Brazilian Portuguese or whatever. So we'll do these labeling parties to create data sets, right? And these are the eval sets that we can then score. You can see that on ProLLM, which Zukof will talk about in another episode, I think. So the hard part is getting those eval sets. And it's not that a tool all of a sudden solves it. We've tried them all, right? Whether it's Spellbook or Humanloop or Orq.ai and so on, some of them are really great. And by the way, they offer 15 things.
Paul van der Boor [00:03:19]: One of them is evals. The evals one. Typically, the bottleneck is actually not the tool itself. It's about having the ability to generate those eval sets and having a product feedback loop which you can then take data from to measure against new models and improve and sort of have a real flywheel, let's say.
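(For readers who want to see the shape of that flywheel in code: below is a minimal, hypothetical sketch of replaying a curated eval set against candidate models and scoring them, so every new model or prompt change is measured against the same baseline. The names EvalExample, callModel, and score are illustrative stand-ins, not Prosus's actual tooling, and the scoring here is a trivial exact match where in practice you would use human ratings or an LLM judge.)

```go
package main

import "fmt"

// EvalExample is one curated item from a labeling party:
// an input plus the human-approved reference answer.
// (Hypothetical structure, for illustration only.)
type EvalExample struct {
	Input     string
	Reference string
}

// callModel stands in for a request to whichever model is under test.
func callModel(model, input string) string {
	// In a real setup this would call the provider's API.
	return "model answer for: " + input
}

// score compares a model answer to the reference; here a trivial
// exact-match check, in practice an LLM judge or a human rating.
func score(answer, reference string) float64 {
	if answer == reference {
		return 1.0
	}
	return 0.0
}

func main() {
	evalSet := []EvalExample{
		{Input: "Classify this food image description...", Reference: "pizza"},
		{Input: "Summarise this customer conversation...", Reference: "refund issued"},
	}

	// Replay the same eval set against each candidate model so the
	// results stay comparable release after release.
	for _, model := range []string{"model-v1", "model-v2"} {
		total := 0.0
		for _, ex := range evalSet {
			total += score(callModel(model, ex.Input), ex.Reference)
		}
		fmt.Printf("%s: %.2f avg score over %d examples\n",
			model, total/float64(len(evalSet)), len(evalSet))
	}
}
```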
Demetrios [00:03:39]: Well, one thing that I think is wild is how you're actively encouraging the team to go out there and buy tools. You would prefer that they do that because, as Bezos said back in the day, focus on what makes the beer taste better. And creating your own orchestration system is not what makes the beer taste better. It's not going to make Toqan a better tool. If there's something off the shelf, go and grab that. And also, you have the angle of if you find out about a tool, then you might want to become an investor in it. Despite that, you have had a lot of trouble, per se. I guess that might not be the best word, but you've had a lot of difficulty finding tools that you can actually buy.
Paul van der Boor [00:04:24]: Yeah, less adoption of outside tools for our agents in production systems than I'd like, or than we'd like. But listen, time and people are scarce resources, right? So I always say to the team, if there's a tool that we built that we could have bought outside, there are no brownie points for that. There's just no value in that; we could have done other things, moved faster or whatever. So I very actively push the team, like, hey, check out this.
Paul van der Boor [00:04:57]: This new, whatever, open source solution. Check out this new tool out there for doing evals, for doing observability, for doing whatever. And I do think our team lives a little bit in the future in some ways, because our job is to test the latest and greatest and then figure out what makes sense. We have real problems that probably other folks will have 12, 18 months down the road. I want my team to have all the time that they can to spend on the things that make a difference, and so I encourage them to buy and use tools wherever that makes sense. It's also that if we understand, hey, these guys out there built something for a problem we know we have and others will have down the road, we should partner with those guys.
Paul van der Boor [00:05:43]: Let's make sure everybody in this group already has that. Like I said, we've got thousands of engineers all over the world that will have those problems after us. We should then expose this tool, like, hey, did you know that such and such can now help you with authentication in the agentic flow, or they can help you with observability of such and such part of the stack? And that happens. It also means, because Prosus has a big investment arm, we can also help partner with those founders if they're raising money, to say, hey, one, we're already using you, we want to use you. Two, you may want to have other users in the group because we're a big global company. And third, if you need help, investments and so on, we can provide that, because we use tools in my team to try and solve real problems that we think others will have soon.
Demetrios [00:06:34]: I tend to trust your opinions about what's real and what's not.
Paul van der Boor [00:06:39]: A lot.
Demetrios [00:06:40]: I heavily weigh them because you do live in the future a little bit, and you're able to tell me what is hype and what is real. And so when I see you saying, oh, this is great, but it's not for us, that kind of is a signal for me to say, huh, there's something wrong with that piece. Why is that? What is it? And the fact that you haven't onboarded a lot of these common tools that you hear about every day is a very big signal for me. Whether that is an orchestration thing like a LangChain, LangGraph, whatever, or an observability tool like a Langfuse (insert your favorite "Lang" in here), or an eval tool, that for me is a huge signal. It makes it all the more important when you do onboard a tool that I stop and I listen to why you chose that tool.
Paul van der Boor [00:07:43]: Yeah, I think sometimes, like I said, I do think in some ways we live in the future, because we're earlier than many and trying to build things at scale, in production agents or other, let's say, AI-powered products. So I think one thing that we do understand well is what are the problems that others will have down the road. We're also a bit of a strange team, and I've come to realize that as I look back: because we're much closer to it all, we train our own models, we work with the biggest labs out there, we have our own evals, we are not representative of many users that are coming down the line. So if you have other teams that already have ML models, or maybe they don't, smaller teams at smaller companies, they will have different needs, and even if you fast forward them 18 months, they will still not be the same as a sort of full AI team, right? So the problems are common, but the ways to solve them for a team that is less AI-experienced are different; they will maybe favor a more drag-and-drop solution, for example. So I'll give you an example: there are a lot of observability tools out there, and we looked at all of them, and many of them had this sort of drag-and-drop solution, right? And it turns out that for some people that's great, because they don't want to get into the intricacies of the coding and so on. But my team would say, well, listen, I need an SDK, or I need this to be headless, or basically to be able to interact directly with the underlying things. We need to debug it, I need to see the logs and so on.
Paul van der Boor [00:09:29]: So if you look at, let's say, the observability tools and the ones that create these chains of agents, you have Zapier, right, which anybody who has access to a keyboard and a mouse can use, and then you have all the tools, like, I don't know, the things that are native in GCP or AWS, right, that kind of help you with observability and chaining these things together. Those are very different end users. The people that access the AWS environments and the GCP environments and the Azure environments are your SREs, your engineers. The Zapier users are anybody. And so you need to solve for those different needs as well.
Demetrios [00:10:13]: Are you using coding agents?
Paul van der Boor [00:10:15]: Yes, we are. We started already using them early on and by the way, coding agents, I assume you mean things like Devin and Manus and many others.
Demetrios [00:10:26]: And that's something that you haven't built yourself.
Paul van der Boor [00:10:29]: No, we have not, no. So that's a clear area where we see amazing teams, shout out to the Cursor team, for example, who have really focused on the persona being the software engineer and how to really change the user experience for them to be able to use these things. Right. So we don't; we build products that are consumer facing, right. So we would never build a coding agent, but we use them all the time, we test them all the time. Very excited by all the things coming out.
Paul van der Boor [00:11:02]: We're investing in some of them as well. But if you think about, let's say, the first generation of tools in this space, the first one was GitHub Copilot; they were the first to come into this space. They of course partnered with OpenAI, and we could already see in the early days that this was actually pretty good. And then soon thereafter you saw newcomers like, of course, Cursor, Sourcegraph, Codeium, Replit and so on that came and started to do this slightly better. And then they evolved, and actually in my view they overtook, let's say, the first movers, in this case Copilot, to build what are now software-native, sort of agentic flows. And the conversational interface that Cursor built, where they trained their own models to actually execute these tasks, it was very quickly clear that that was better. And in fact my whole team moved to Cursor. Now we are sort of at the next level. Think of it like the autonomy levels they have with self-driving cars: level 1, 2, 3, 4. We're now at the next level, where agents like Devin don't just do autocomplete, don't just edit code, they can create entire environments.
Paul van der Boor [00:12:23]: Right. And execute a task end to end. We were playing with them, I think starting about a year ago when the Devin announcement came out, and it was a little bit underwhelming. Very promising, but underwhelming. Now, fast forward almost a year, we see that it's much, much more useful. You still need to look at the different use cases, let's say. So if I'm on my hobby project at home and I need to create a little, I don't know, a little app or little dashboard to track whatever household chores, to say something, by myself, I can use Devin. Perfect. Now, today.
Paul van der Boor [00:13:03]: And it will work basically the first time. Before, I had to try it five times. Now it'll work for something simple. Great. But how much of that really happens at work? Right.
Demetrios [00:13:14]: So I've got a legacy.
Paul van der Boor [00:13:15]: Yeah. I need to commit stuff. I have maybe smaller projects, some of them more demo-like; there you can also start to use Devin. But what happens there, in my experience at least today, is that if you let it create the entire code base from scratch for things that we need to actually work on together and maintain, it creates it in ways you didn't like: the architecture isn't exactly as you prefer, it's not to your taste, and it's not easy to maintain. That's where I don't see it working yet.
Paul van der Boor [00:13:47]: What we are now finding it works really nicely for is the other type of project, where you actually already have a code base; not a code base that 3,000 engineers are involved in, but a code base that maybe dozens of engineers, maybe hundreds, are involved in. But it's clearly documented repositories where you already have a CI/CD pipeline; things are already documented better because you've got more people than you can individually oversee. So that agent actually comes into something, and you can ask it to do these tasks that are now a card on a JIRA board, a ticket, I don't know, make sure this becomes XYZ compatible or insert this feature, and have Devin give it a shot. And that's where these things, in my experience, are now becoming great. Because it's an existing code base, we already thought about how the architecture should look, but you're making tweaks and modifications that are somewhat incremental. And the code base isn't so large that the agent can't handle the context. And of course, all these features that we're seeing now, like Devin creating its own wiki. Right.
Paul van der Boor [00:14:59]: Indexing the code base, various repos together, you can give it instructions that it then knows for anybody else working with Devin on that code base. That's cool. And we're tracking how many PRs Devin is actually making on the team.
Demetrios [00:15:15]: Is that a KPI or something?
Paul van der Boor [00:15:19]: Well, it's definitely that we want people to try it, so we have this duty now. We're basically saying Devin duty, right? So somebody's on duty this week on Devin, and this week it's Devin, but next week it'll be Manus or Cursor or Replit or whatever. But the goal is that during that Devin duty, any task you pick up from the JIRA board, you should first think, let me run it with Devin. Wow, cool to see, right? Because otherwise how do we know if this thing got better, and where do we apply it? And so that's the KPI: everyone needs to try it, be on duty, to sort of learn and discover. And then eventually we may set, like, hey, 10% of the PRs should be Devin's, if we believe that's feasible, or 20% or 25% or whatever.
Demetrios [00:16:02]: How do you ensure the cognitive load doesn't just go through the roof because people are submitting a bunch of shitty PRs?
Paul van der Boor [00:16:08]: Yeah, well, we had this, right? So we actually built our own PR reviewer, right? That thing was so bloody verbose. People turned it off after a couple of hours because it was just inserting a ton of comments. And then the cognitive load of having to read those comments compared to the submitted PR: it's easier just to look at it yourself and spare yourself the commenting from this verbose AI thing. So that's the question, right? What's the sweet spot where the work done offsets the additional cognitive load you need to put in to understand that AI's work? That's what we're trying to figure out. And so how many PRs is it worth giving to an AI software agent? We're going to play with an SRE agent soon.
Paul van der Boor [00:17:04]: We just try to understand where it is useful and where not. And to some extent there's also a team thing, an individual thing, more senior engineers versus more juniors. Are you familiar with the code base or not? And so on. For onboarding, for example, it's great. Tell me more. So for me, I don't spend a lot of time coding, unfortunately, let's say committing production-level code. But I obviously want to understand what's happening in the code. I can just interrogate a code base using Cursor, right? Or any of these other tools.
Paul van der Boor [00:17:35]: New hires have the same. They come in and they're like, hey, I need to work on this new repository. We move people from projects to projects, then they can come in and say, hey, describe to me how this code base works. What are the key endpoints, what are the key services, what are the standards of agreement, what are the utils? And it'll just describe that to you. And so it's a good way to get people up to speed or people like me that aren't sort of day to day in the code base.
Demetrios [00:18:00]: I think we should bring on Bruce now to talk about what tools you all have tested and where you decided to build your own and why. Great, you're here now.
Bruce [00:18:15]: I am.
Demetrios [00:18:16]: Who are you and what are you doing here, Bruce?
Bruce [00:18:18]: So I'm Bruce. I work at Prosus in the AI team; I've been there over two years now. I work as an AI engineer in the team, and what I spend most of my time on is building Toqan, which is our own agent, or agent platform, and we distribute that to the portfolio companies. So I work as an AI engineer and do software engineering, a mix of both of those.
Demetrios [00:18:43]: And you spent a good amount of time trying to incorporate different vendor tools that will help you build Toqan faster. I know that you've been working on Toqan for years. And so there is this maturity factor that goes into the tools that are out there that you can use and buy. Even if you wanted to pay money for something, maybe it just doesn't exist or it's not at a solid maturity level. I typically see folks using tools in a few different places, and you can tell me why or why not you ended up buying a tool in these places. And we can talk about which areas there's actually value in, which areas you potentially see value in the future if it matures, and where there's no value. One is prompting tools or prompt tracking.
Bruce [00:19:46]: Yeah. So I think for a lot of products out there and also for what we are doing, prompts are very essential to your system and storing that somewhere else, I guess is fine. Of course we also store it in a database, but we're not storing it in an external service that's doing the evaluations for us because it feels so essential to have a very good prompt that I think it's something you want to spend time yourself on.
Demetrios [00:20:16]: But then the evaluation piece.
Bruce [00:20:19]: So we've built our own evaluation flow for that. Two reasons. One big one is that within the AI team, we also have a team that does evaluations, that has a leaderboard. So we have the knowledge to do those things. And the second big reason is that if you're going to use an evaluations tool, you're going to need to send all the conversation data to an external party, and it's hard to convince our legal department that we're going to do that.
Demetrios [00:20:48]: So you bring up a great point there on the fact that maybe you wanted to use this tool, but to actually get it through and get the okay from legal is a whole nother beast.
Bruce [00:21:01]: Yeah, yeah, I guess so. So it's not only getting the okay from Legal. I also want to be confident that something like that goes well. If there are very new startup companies out there saying, oh, you can just send the whole conversation, it'll keep track of everything, it does feel a bit strange to send it all out there, because it's core to your product that you do well when you change the prompt, for example. And let's say I'm talking to a colleague at another company and they ask me this question; I would think, oh, I'm a bit hesitant to tell you how we do this, because they're probably not going to like it if we send all the PII to these eval tools. I think it really helps if these companies that provide that have a self-hosted version.
Bruce [00:21:52]: So the data is still at your place.
Demetrios [00:21:55]: Yeah, that's a common design pattern or vendor offering where it's like bring your own cloud or VPC, and so then you don't have to send that data anywhere.
Bruce [00:22:05]: No, that's true.
Demetrios [00:22:06]: But even still, I can imagine Legal is one vector that you're thinking about when you're thinking about that build versus buy.
Bruce [00:22:12]: Yeah.
Demetrios [00:22:13]: And so you said we're going to do the prompting tools and the evaluation tools in house. Yeah, those are two big ones in my mind that people will pay for. The other one is orchestration. And so these are frameworks like the LlamaIndexes and the LangGraphs, LangChains, that type of thing.
Bruce [00:22:35]: So I have an example in our code base where we built our own orchestrator and we just added Vertex AI, and we noticed that sometimes we were throwing timeouts. So the request was taking too long, and the timeout we set was five minutes, so that's very long. Because we're doing the request ourselves, we did the API implementation ourselves, we weren't using the SDK, we were pretty close to what the code does. So we easily added metrics there to see, oh, how long does it take to get the first chunk from Vertex AI? We saw it was super fast, so we still didn't know what was happening here. And then we added an extra metric which measured the time between chunks. And then we noticed all of a sudden that sometimes the time between chunks is super high, like minutes, which is probably just.
Bruce [00:23:25]: Yeah, this is probably just malfunctioning of Vertex AI, which is fine. But if you're using some library to do that, you're probably missing a way to fix that. So now, because we did do that, it was very easy for us to first add a special timeout saying, okay, if the time between chunks is really large, we just try again so the user can get a reply quicker. And because we're making the whole API implementation ourselves, we saw that in the headers there was a request ID that Vertex AI sent us. We could easily add that to the metrics and logs that we have; we exported those, sent those to Google, and they can easily see what happened. And I think you need to be quite lucky if a library supports precisely that, because it's quite niche, but it is a fix. I feel like using an orchestrator there, it probably wouldn't have that, or you'd have to be very lucky for it to pass that all the way up to the top.
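(A minimal sketch of the chunk-gap handling Bruce describes, written in Go since that's the language the team mentions using: measure the time between streamed chunks and retry the request if the gap exceeds a threshold, instead of waiting out the full request timeout. The channel-based stream and helper names are hypothetical stand-ins for their hand-rolled Vertex AI client, not the real implementation.)

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// streamChunks stands in for a hand-rolled streaming call to the model API;
// each string on the channel is one streamed chunk.
func streamChunks() <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch)
		for _, c := range []string{"Hel", "lo ", "world"} {
			time.Sleep(50 * time.Millisecond)
			ch <- c
		}
	}()
	return ch
}

// readWithGapTimeout consumes a stream and fails fast if the gap between
// two chunks exceeds maxGap, instead of waiting for the overall timeout.
func readWithGapTimeout(maxGap time.Duration) (string, error) {
	var out string
	chunks := streamChunks()
	timer := time.NewTimer(maxGap)
	defer timer.Stop()

	for {
		select {
		case chunk, ok := <-chunks:
			if !ok {
				return out, nil // stream finished normally
			}
			out += chunk
			// Here you would also emit a "time between chunks" metric.
			timer.Reset(maxGap)
		case <-timer.C:
			return "", errors.New("gap between chunks too long")
		}
	}
}

func main() {
	// Retry once if the stream stalls, so the user gets a reply sooner
	// than the full five-minute request timeout would allow.
	for attempt := 1; attempt <= 2; attempt++ {
		reply, err := readWithGapTimeout(2 * time.Second)
		if err == nil {
			fmt.Println("reply:", reply)
			return
		}
		fmt.Println("attempt", attempt, "failed:", err)
	}
}
```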
Demetrios [00:24:22]: But yeah, it's that transparency piece. You have so much more control over what you are able to see and what you're able to do. And you do say that this is a very niche and specific situation, but it can be generalized: if it's not this specific situation, I'm sure you encountered five more like it.
Bruce [00:24:41]: Yeah, yeah. And then it's nice that because one of our main advantages of using us instead of using a different orchestrator or agent framework is that we take care of that.
Demetrios [00:24:54]: You really lean on the reliability piece. Do you think that with the orchestration frameworks, although they give you a lot on the abstraction and they're able to make it much easier, they're able to make the prototype experience nice, the reliability experience drops? And so you have to balance that trade-off, which is very clear: speed to prototype is very high, but then reliability is very low.
Bruce [00:25:25]: Yeah, definitely. Reliability on those frameworks can be super good as well. And if they just have what you need, then it's perfect.
Demetrios [00:25:33]: Well, explain why they didn't have what you needed. I know that you had mentioned the different SDKs and specifically like a lot of these frameworks are in Python.
Bruce [00:25:43]: Yeah, yeah. So we use Google, and there are of course SDKs out there that do this kind of stuff. But even Google and Anthropic say, okay, this is our beta version. Which is funny, because Anthropic says in their documentation: use our SDKs if you're streaming, because the direct API implementation might not work as nicely.
Demetrios [00:26:10]: But so it's not only with the orchestration or the evals or these other tools, it's with the foundational models too.
Bruce [00:26:18]: Yeah, definitely. I think a lot of these AI products out there are really focused on Python; they're probably very good at building the Python SDK. But yeah, if you're not using Python, it's pretty hard.
Demetrios [00:26:34]: What are some downfalls or what are some pitfalls that you've had because of that?
Bruce [00:26:41]: Not using Python? Yeah, we were using Python two years ago, or even a year ago still, I think. So one thing we're missing is the SDKs, because we're sometimes doing these small tweaks that I just talked about. You could do that in Python with monkey patching, which is not possible in Go. But using Go was such a nice transition for us, because it's compiled, and there's already so much uncertainty with what comes out of a model. If the code itself is solid, then, you know, okay, this is fine.
Demetrios [00:27:17]: And you didn't look into using something like Pydantic?
Bruce [00:27:21]: We did.
Demetrios [00:27:22]: Why didn't that work?
Bruce [00:27:25]: We did, for the models that come in, like the objects. But then in the rest of the code, I think it just requires a lot more effort to make Python fully fail-safe, where in Go it's way easier to do something like that.
Demetrios [00:27:42]: It's almost like it comes out of the box in Go and in Python you have to do this extra work.
Bruce [00:27:49]: When I switched from Python to Go, then it felt super restricted, but in the end now it feels way faster because I just know, okay, I can't do this, I should do it this way. Something new comes in and it's just easier to switch.
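(A small illustration of the kind of strictness being described: decoding model output into a typed Go struct so anything unexpected fails loudly at the boundary, and the rest of the code can rely on the types. The ToolCall shape is invented for the example; it is not Toqan's actual schema.)

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// ToolCall is the shape we expect the model to return.
// Anything that doesn't fit this struct is rejected at decode time.
type ToolCall struct {
	Name string            `json:"name"`
	Args map[string]string `json:"args"`
}

func parseToolCall(raw []byte) (ToolCall, error) {
	var tc ToolCall
	dec := json.NewDecoder(bytes.NewReader(raw))
	// Refuse fields we didn't ask for, instead of silently dropping them.
	dec.DisallowUnknownFields()
	if err := dec.Decode(&tc); err != nil {
		return ToolCall{}, fmt.Errorf("model output did not match schema: %w", err)
	}
	return tc, nil
}

func main() {
	good := []byte(`{"name":"send_slack_message","args":{"channel":"operations"}}`)
	bad := []byte(`{"name":"send_slack_message","unexpected":true}`)

	if tc, err := parseToolCall(good); err == nil {
		fmt.Println("calling tool:", tc.Name)
	}
	if _, err := parseToolCall(bad); err != nil {
		fmt.Println("rejected:", err)
	}
}
```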
Demetrios [00:28:04]: It's funny that it's restricted in a good way.
Bruce [00:28:07]: Yeah, I guess so. Yeah.
Demetrios [00:28:09]: Now, on one hand, without Python you're able to have that reliability and you get all of these things out of the box. But then on the other hand, when you're working with the different foundational models or orchestration or any kind of vendor tool, you have to do extra work. Do you see it that way, or do you see it a little bit differently?
Bruce [00:28:40]: Yeah, it's a little extra work; I would think that's the case. Of course, there are frameworks and orchestrators out there that are super hard to replicate because they do something very well. That's also the reason why, for example, we use LLMs that we don't make ourselves for this agent, because that would be too much work. And the same goes for some agentic tools: building a whole OCR pipeline is too much work, so we can just use something that's out there. But for these core functionalities, like the LLM orchestrator in your agent, yeah, it's a little more work maybe, but that's in the beginning. In the end, if you're trying to find that time between chunks, it's way easier if you have the code and you don't have to fork anything and change the library, or monkey patch something in Python to get that one thing you need.
Demetrios [00:29:39]: Yeah, so I see it a little bit as you're investing more time up front, but then you save it on the back end. And when you have these edge cases, that's where you really see things shine because you can go and debug much easier.
Bruce [00:29:55]: And we try to figure out what we need, and then it's way easier to build something yourself. Maybe in six months we feel like, oh, this needs to be expanded. Let's say we do memory on a user level, and first we just build it on what the user and the assistant are saying. But then after a couple of months: okay, it would also be nice if it remembers how it did a tool call. Because we're now integrating a lot of tools, and maybe, if I say, okay, I want to send something in a Slack channel and I always call that channel "operations", but the actual ID is something like "toqan-operations", it would be super nice if it remembers that.
Bruce [00:30:40]: But if you didn't build that whole memory thing yourself to begin with and they don't provide that extra feature, then you're going to need to rewrite the whole thing, whereas if you built it yourself, you know what you want and you just add it. Of course you can do a feature request. But I just like that if something is so core to your product and you want to expand the feature, it's easy to do, instead of having to hack around or change the library for that newest feature. Right.
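(A minimal sketch of the kind of extension Bruce means: when the memory store is your own code, remembering something new, such as which Slack channel a user's shorthand maps to, is just another field and a small method, not a rewrite or a vendor feature request. All types and names here are hypothetical.)

```go
package main

import "fmt"

// Memory is a per-user memory entry. Originally it only captured
// user/assistant messages; ToolFacts was added later to remember
// things learned from tool calls (hypothetical structure).
type Memory struct {
	UserID    string
	Notes     []string          // facts from conversation
	ToolFacts map[string]string // e.g. user shorthand -> resolved value
}

// MemoryStore is an in-memory stand-in for whatever database backs it.
type MemoryStore map[string]*Memory

// RememberToolFact records something learned during a tool call.
func (s MemoryStore) RememberToolFact(userID, key, value string) {
	m, ok := s[userID]
	if !ok {
		m = &Memory{UserID: userID, ToolFacts: map[string]string{}}
		s[userID] = m
	}
	if m.ToolFacts == nil {
		m.ToolFacts = map[string]string{}
	}
	m.ToolFacts[key] = value
}

func main() {
	store := MemoryStore{}
	// The user says "operations", the Slack tool resolves the real channel;
	// store the mapping so the next tool call doesn't have to ask again.
	store.RememberToolFact("user-42", "channel:operations", "toqan-operations")

	fmt.Println(store["user-42"].ToolFacts["channel:operations"])
}
```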
Demetrios [00:31:10]: The next piece that I think a lot of people end up paying for, or that there is a lot of attention around, is the observability piece. So tools like Langfuse, or I think LangSmith also does this; obviously, insert your favorite word after "Lang" and there's probably a tool that is called that. Are you all using a specific observability
Bruce [00:31:39]: A tool for the agents, not specific for the LLM. So I think most of those Lang-something tools are all for agent-specific observability, but we use Datadog for our observability. I guess those Lang tools do a few more things out of the box for the observability on the agents. But yeah, we started using Datadog, and it's easy to just add the metrics and the logs there that you would probably also get from other observability tools. It also depends a bit on your backend as well. Most of the time with those Lang products you can also rerun if you make a change to your prompt; that's something we have in the backend ourselves, that's how it's built, with event streaming.
Bruce [00:32:45]: So yeah, I guess you could also use traditional tools, but if you don't have that in place, then it's probably nice to use something like that as well.
Demetrios [00:32:56]: And you don't feel like you're losing out by using a Datadog, which is not specific for it?
Bruce [00:33:05]: No. Yeah, maybe a little, I guess. Of course, when there's a new model coming out, it takes a little more effort for us to roll out that model and check whether it's still performing as well, because we don't have those tools. But then we save on not having to send all the data to that tool as well. And we have a vendor tool that does the LLM logs and traceability, and then we have Datadog, which does more of the software engineering side. Now it's all just combined, which is also nice to work with.
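(A sketch of the "just add the metrics yourself" approach using the DogStatsD Go client: the same agent-level signals an LLM observability product would chart, request latency, token counts, a model tag, can be emitted to Datadog directly. The metric names and tags are invented for illustration, and this assumes a local Datadog agent listening on the default DogStatsD port.)

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

func main() {
	// DogStatsD client pointed at the local Datadog agent (default port).
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Illustrative tags; use whatever dimensions you want to slice by.
	tags := []string{"model:candidate-model", "service:agent"}

	start := time.Now()
	// ... call the model / run the agent step here ...
	elapsed := time.Since(start)

	// Emit the same kind of signals an LLM observability tool would chart.
	_ = client.Timing("agent.llm.request_duration", elapsed, tags, 1)
	_ = client.Count("agent.llm.tokens_out", 512, tags, 1)
	_ = client.Incr("agent.llm.requests", tags, 1)
}
```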
Demetrios [00:33:46]: And so it does go back to this idea that bringing on a new tool is a bit of a journey. You need to get it through all of these different things, whether it's contract and pricing and legal, etc. So when you look at what's out there and their capabilities versus what you're using in house with a Datadog, you say, I think if we squint, we could probably make Datadog work for 90, 95% of what you get from all these other tools, and we don't have to go through any of those hoops.
Bruce [00:34:28]: Yeah, I think so. I always catch myself, if I look at any of these new products, being super excited about them, saying, oh, they can do this and they can do that, and it's better than Datadog because you can do this and this, and I convince everyone in the team we should do it, and I make the PR and we test it on staging, and then we find out either something is missing or we could have just done this in Datadog; it's not that different. Or in a more traditional software engineering tool. Not that Datadog is traditional, but in a sense it's not AI.
Demetrios [00:35:05]: It seems like that's a little bit of a trust issue also. Right. There's advertisement and what they say they can do and then what they actually do. And you don't find that out until you test it out yourself.
Bruce [00:35:17]: Yeah. So I have another example for that. We were looking into adding a vendor tool that could help us roll out tools more easily. So we have a bunch of more general system tools, we call them: it's OCR, it's image generation, all that kind of stuff that you expect from an agent. But then we also wanted to do OAuth tools.
Bruce [00:35:42]: So going to Google with the user's OAuth credentials and maybe creating a doc, or going to GitHub and listing PRs. Which is a lot of work, because you need to go into the documentation of Google, go through the API and be creative and think, if I have these scopes and I have this endpoint, I could probably make a tool called create blank document, which would be nice to have.
Demetrios [00:36:05]: Why can't this just be done with OAuth?
Bruce [00:36:08]: Oh, it can be done; we want to do it with OAuth. But finding that out, writing the tool descriptions, coming up with what a tool should do; it's not like the Google documentation says, oh, and these endpoints are super nice to create an agentic tool for. It kind of sounds like I'm talking about MCP now.
Demetrios [00:36:28]: Yeah.
Bruce [00:36:30]: A step back. So we were looking at a company that would provide all these tools for us and would do the OAuth flow, and we could just provide a user ID, saying, okay, this user ID, these arguments and this tool. So we found one, Composio, and they had a huge list of tools available. So I was super convinced: okay, this is it, this is super nice. I tested a couple of them, they worked.
Bruce [00:36:53]: So I made the PR, went to staging, we all started testing it, and then we noticed that a lot of the tools weren't working, and it was just minor things each time that needed to change, like, oh, you need to request an extra scope, or there was something you needed to add in the dashboard but you didn't do that. And it was also not really well documented, because new startups don't focus that much on documentation, sadly.
Demetrios [00:37:21]: I wish more did. Yeah.
Bruce [00:37:24]: So, yeah, that's definitely an example of where I got really excited, and it was a little too soon to communicate to the team.
Demetrios [00:37:31]: We should do this because then the team's looking at you like, dude, come on.
Bruce [00:37:38]: It's more that I'm saying, okay, this task is done at the end of the week, because then we have 20 tools, and then we have 30 tools, and then we tested five of them, and then you test the other 15, and then you needed extra scopes, and getting certain scopes at Google is so much work. Just things you didn't expect because it wasn't in the documentation. What I do like about these startups, though, is that they're always very much listening to you. So if you would tell them something like this, they would probably help you out. And they also love the feedback, I guess, because it feels like a feature request; they're just new anyway and the roadmap is not that defined yet.
Demetrios [00:38:22]: Scrapped that one and you decided to just go with mcp.
Bruce [00:38:26]: No, we didn't. A lot of our users are not used to working with AI or are not very technical; people in HR are using it. We're distributing our agents across the portfolio companies. So someone from HR is not going to go online, download some MCP, and hope that the Docker container runs instantly. I don't think even a regular person can easily set that up. So we considered it, and in the future it might be nice if, for example, Google themselves are going to offer MCPs, and maybe you go to your Google account and you can create an MCP link. That would be nice, I guess. But if the users need to run the MCPs locally themselves, it's going to be too hard for them.
Demetrios [00:39:14]: So basically you check that one off the list.
Bruce [00:39:17]: Yeah. And then another thing with these startups: we really tend to also look into the legal part and the PII and the privacy, because we need to be compliant and that kind of thing. So I'm used to, if I go into Datadog, I won't find PII. If there's PII, we get an alert saying, okay, there's probably PII in here; we never log PII. Sentry, the same thing. If I want to look at user data, you need to consent to that per conversation. But then you start using some of these vendor tools. So when I started using Composio and some users added their accounts so we could test it out, there were buttons where you could pick one of the tools, for example search documents, and you could pick one of the user IDs and just execute the tool.
Bruce [00:40:05]: So I thought, oh, maybe.
Demetrios [00:40:06]: So you had God mode.
Bruce [00:40:08]: Yeah, basically. So I thought, okay, maybe this is a one-off, it's not that bad. Maybe they're going to add an admin feature later, or a PII feature. But then I looked recently into Mem0, which is a long-term memory service, and added that to our system, just locally, to test it. I went to their dashboard again, and then I saw my whole conversation being there, on the front page of the dashboard, saying, oh, this is your recent traffic. And then I went to the memory section and I saw all the memories, which is basically all PII, because it's about, okay, this user likes this, and this user prefers it if you do that.
Demetrios [00:40:48]: Wow.
Bruce [00:40:49]: Yeah. So those tools are not really made in a way that respects PII and respects the data. There are probably features, buttons and settings out there for it, but that was something I needed to get used to as well.
Demetrios [00:41:03]: It was a little eye-opening. It's like you weren't trying to look, but you had to, you know, and then all of a sudden I can see you looking at your screen and then trying to put your head down, like, it was just local, I didn't mean to see that.
Paul van der Boor [00:41:17]: I promise.
Demetrios [00:41:19]: It does beg this question: you want this capability, and inherent in the memory capability is that you need to know things about people. But the way that the company is setting it up, they now have to think, oh, if we want to go out to various users, we want to make sure that we're not inadvertently making their lives hard by doing things like keeping PII or making it front and center.
Bruce [00:41:57]: For prototyping, you need to be able to see that quickly, right?
Demetrios [00:42:00]: Yeah.
Bruce [00:42:00]: But yeah, out of the box it seems just so different from the more traditional software engineering tools, where you definitely don't want PII in places like Datadog.
Demetrios [00:42:10]: Yeah. And it's like you would have to add this extra layer of getting rid of all this PII. So adding that extra work makes you then come back to the question of: should we just try this ourselves?
Bruce [00:42:23]: Yeah, yeah. And the self hosted version is then a really good option as well.
Demetrios [00:42:28]: Yeah. It feels like you're taking two steps forward and then one step back, or two steps back and one step forward, sometimes with the different tools, because they give you certain capabilities; it makes your life easier on one vector. But then when you are looking at other vectors, you're saying, oh my God, to actually get this into production and to get rid of all this PII and to comply with these norms that we have set up, we would have to do so much extra work. It's not worth it for us.
Bruce [00:42:57]: No, I mean, if it's easy enough to do, then you're probably going to save some time in the long run if you start from the beginning, just experimenting and building something like that yourself.
Demetrios [00:43:09]: And do you see this becoming more common or is this just a maturity thing?
Bruce [00:43:17]: No, I don't. I still think these vendor tools out there are probably going to stay for a long time, because there are always people building prototypes, and you don't always need to be super compliant, you don't always need to be able to scale it indefinitely. So I do think these vendor tools out there are perfectly fine, just not if you have an agent in production and you need to be compliant, you need it to be scalable, etc.
Demetrios [00:43:47]: Let's change gears for a second and talk about how you're leveraging different coding tools. Are you using Windsurf, Cursor, Devin? A little mix of all of them, Toqan even?
Bruce [00:43:59]: Yeah, we're doing a mix right now. Of course we're using our own tool, but since the IDEs got so good, we also use those quite a lot. So I use Cursor, we use Devin, and we started using GitHub Copilot now to do PR reviews, for example. The PR reviews really took away that first step where you maybe made a spelling error which didn't get caught by a linter, or you copy-pasted some Mongo adapter when you needed another Mongo adapter: you just copy-paste the whole thing and then forget to change one little name. The GitHub Copilot PR reviewer is pretty good at catching that. We did try other ones out there as well, but we stopped using those because they were super chatty and very confident. With GitHub Copilot, it collapses comments it's not confident about. So it really helps: if I made an error and send it off to my colleague, I see, oh, GitHub Copilot already said, okay, I don't think this is right.
Bruce [00:45:08]: That saves a lot of time, because it saves the time where I ask the colleague, the colleague instantly comes back, okay, here's the comment, I need to fix it, and then I need to wait again till he has time. So that's super nice to have.
Demetrios [00:45:21]: So these coding tools add to this piece that I find fascinating, which is: you've been building a lot of tools for building agents, like the eval tool or the orchestration tool. You do not go out there and buy something. But it's not because there's a lack of push from the team. I was talking to Paul, and he was saying that he's really trying to encourage you guys to buy stuff. Despite that, you come back and you say, ah, it's just not there yet. And there's this vector that maybe is worth exploring in your eyes: before, when you thought about building it yourself, it was with one scope in mind. But now that you are using these coding tools, maybe you're a bit more ambitious about being able to build it yourself.
Demetrios [00:46:19]: Do you look at it in that way, where you say, yeah, I can probably build this myself with the help of Cursor and Windsurf or whatever in a weekend?
Bruce [00:46:30]: Definitely, definitely. If something comes by and I think I get how it works, it's so much easier to have something like Cursor explaining what I think it should do, and then also me and Cursor trying to find out how it works. If Cursor wasn't there, it would have been so much work to try to copy something you think is worth copying instead of buying, like copying the functionality. Yeah. I really think that tools like Cursor enable you to quickly build these versions of existing tools where you think, okay, they're lacking.
Bruce [00:47:15]: Something like this, we can probably do it better. But if Cursor wasn't there yet, super hard. It takes way longer to experiment for a very small feature. Definitely. Yeah.
Demetrios [00:47:27]: That's all we've got for today. But the good news is there are 10 other episodes in this series that I'm doing with Prosus, deep diving into how they are approaching building AI products. You can check it out in the show notes; I'll leave a link. And as always, see you on the next one.