Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale
Speakers

Ereli Eran is a founding engineer at 7AI, where he builds agentic AI systems for security operations and the production infrastructure that powers them. His work spans the full stack - from designing experiment frameworks for LLM-based alert investigation to architecting secure multi-tenant systems with proper authentication boundaries. Previously, he worked in data science and software engineering roles at Stripe and VMware Carbon Black, and was an early employee of Ravelin and Normalyze.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
A conversation on how AI coding agents are changing the way we build and operate production systems. We explore the practical boundaries between agentic and deterministic code, strategies for shared responsibility across models, engineering teams, and customers, and how to evaluate agent performance at scale. Topics include production quality gates, safety and cost tradeoffs, managing long-tail failures, and deployment patterns that let you ship agents with confidence.
TRANSCRIPT
Ereli Eran [00:00:00]: The language itself is very sensitive and you need to be able to test different versions very quickly and see if a change of one phrase trickles downwards into a different conclusion at the end of the investigation. And you need to show where you started thinking in a certain way. It's quite complex.
Demetrios Brinkmann [00:00:26]: I was recently talking to a friend and he mentioned to me, never have I paid so much for a tool. And he was referencing Claude Code. Never have I paid so much for a tool and felt like I'm still the one that is coming out on top. Like I'm getting more value than I'm actually paying for. You know, every time the Spotify bill comes through and it's $10 a month, I sit there and I debate, should I cancel it? I don't know if it's actually worth it. I'm paying $100, $200 for Claude Code and I'm like, I would probably pay 5 times that because I'm getting so much value from it.
Ereli Eran [00:01:08]: I think that's what we all have been experiencing right now. And it's not only Claude Code. We have, you know, a little setup of Claude Code, and then people have been trying Antigravity, and obviously Cursor has been around; before that we were a VS Code Copilot shop. So we're switching between them, but the velocity impact is definitely huge. And it's still funny to see that sometimes Claude Code will give you an estimate of work and say, oh, this is going to be 3 weeks of work. And you say, just do it. Trust me. It's not going to be 3 weeks.
Ereli Eran [00:01:44]: We're going to— It's like, should I.
Demetrios Brinkmann [00:01:45]: Should I start with the first step? And then you're like, go for it. And you know, an hour later it's all done.
Ereli Eran [00:01:51]: Yeah, that's amazing.
Demetrios Brinkmann [00:01:53]: Oh, it is crazy. Now the context here is that you're doing agent work at a security startup, and I wanted to talk to you just because we've both been seeing that software engineering in a way is changing, but you're coming at it from the traditional machine learning engineering space and you're understanding what we used to call, you know, ML, we now call AI, and now we're leveraging AI in just about every workflow possible. So break down your journey a little bit.
Ereli Eran [00:02:33]: Yeah, so I think I don't have a traditional data science background in the sense that I joined the data science industry when it was peak hype, maybe 2012. I got hooked on this being, you know, the next best career. And I kind of self-taught and took online courses to get into this industry. And at the time it was very much around predictive analytics and statistics and machine learning. Some of the algorithms we were being taught were from the '80s or even from the '60s, right? But they were very useful at solving business problems when big data was the biggest, you know, hype.
Demetrios Brinkmann [00:03:13]: The Hadoop days.
Ereli Eran [00:03:16]: Yeah, Hadoop was one of the things that I started with, right? And I think at the time there was a kind of a change in software development methodology, because if you were doing traditional software, someone would write a requirement spec and you'd just build it and hopefully the customers would love it, and people tried to do more agile because they were saying, we don't know exactly what customers want. And on the data science side, you had these teams that would hire data scientists and say, okay, sprinkle some data science onto this project. And they would say, oh, but we don't have data, so we can't really do anything. So a lot of teams were kind of struggling, and the data scientists that were successful were those who were able to hold on to data or find data to solve their problem, right? So if you were good at getting datasets from other teams in the organization, you were successful.
Demetrios Brinkmann [00:04:09]: And you had to go barter at lunchtime.
Ereli Eran [00:04:11]: You had to make friends to actually get the job done. And a data scientist without data is really... you could be the best in Kaggle competitions, but it's not going to make you productive at work if you don't have data that you can use to turn your idea into a business problem that can be solved through data. And I think part of it is that data scientists, or at least people in predictive analytics, have to use some sort of proof to show that their thing works. And that proof comes from having data and having some of these methodologies of validation that are kind of core to this industry. If you train a model and you didn't have a graph to show it's working, how do you know it's working? That part was never part of the software engineering stack. People built software, just read the source code, said it'll work, and flew to the moon. Obviously they tested it, but it was not as methodologically sound as what we would see with data science. And I think now we're kind of experiencing another shift, in the sense that when we approach an agentic system, it's a hybrid of both data science and traditional software engineering practices.
Ereli Eran [00:05:31]: In the sense that agentic systems are just software in the end, but you use prompts to program something and the prompts are like predictive models. They're not deterministic in any way. And even within one vendor, one LLM provider, if you're using off-the-shelf commercial LLMs, you don't get the stability that you would get with software; at least software crashes in mostly predictable ways, but with LLM prompts, you might have latency deviations but also behavioral deviations. So I think that forces you to think differently than a traditional software engineer. And that's part of what we're experiencing, both as people that practice software engineering and as people that build agentic systems that have to deliver something.
Demetrios Brinkmann [00:06:21]: Let me see if I can play that back for you, because there's a point you hit on that I'm not quite sure you were trying to make, but it instantly made me think about how, since you're outsourcing the brain of the LLM to an API, usually to one of these big research labs, they can be a little bit unstable. We all know folks who used Anthropic over the fall of 2025 probably recognized how it's like, is it my thing that's not working? Is it because Anthropic's not working? You got to go dig through the logs and recognize, oh wow. All right. So what do I got to do to make sure that the uptime on my API calls is higher? And I don't think they could have done any more; they were just getting inundated because the demand was so high. So there's that, which is inherently unreliable, but then you're saying there are also the two sides of the coin where you're writing the prompts and you're doing more data science-y work, which is not like, does it compile? But you also have to create the software because you're creating the agent, and that is very software engineering work. So you've got these three pieces: the reliability side, you've got the data science side or the stochastic side, and then you've got this very deterministic side.
Ereli Eran [00:07:58]: I think part of what people forget and then kind of realize is that the creativity of the LLM is the stochastic side. In order to have a creative LLM, it has to have this random effect of variation in the output. And that is beautiful when you're trying to generate poetry, and it fails miserably when you write jokes, but it's very miserable when you're trying to gain someone's trust about a piece of software behaving in a particular way. I think in the end, when we write software, especially the high-level software that we write, we don't write in assembly, we write in very high-level languages, and writing prompts is the highest level of them right now. We're trying to tell the computer to do certain things, and in the end it needs to do what we want in a deterministic, predictable way most of the time. And if it doesn't do it, we have a problem. So we have to build systems around this variation. So the creative side of LLMs, which sometimes is very entertaining, is a challenge in writing agentic systems that have to follow orders.
Ereli Eran [00:09:14]: So how do you harness that? You can basically build guardrails into your system and test it, but you also have to test not only at authoring time, which was the normal way of writing software: you just write it and test it. In the same sense that people did whiteboard interviews, you could write an algorithm on a whiteboard and people had an understanding that this is a shared language, a high-level programming language, and any advanced user of that language would be able to read it and know how it will be compiled and run. With prompts, I could write a prompt on the wall, but no one can guarantee how it will be interpreted by any LLM. Even the authors of the LLM say, well, it's a good prompt, but no one knows, right? So that thing is challenging for everyone. It doesn't matter how advanced you are; even if you worked in the best research groups, you don't know how it will run. So that's the part of software where you have to basically build a garden around it.
Demetrios Brinkmann [00:10:25]: Yeah. Or pray to the software gods and hope that they hear you or the LLM gods, right? It's just like, I'm gonna throw this up. It is an absolute crapshoot what's gonna come back, but you can't build a business, a stable business off of that type of thinking.
Ereli Eran [00:10:42]: No, but I think it's the same when you want predictable systems: you can really narrow down your problems into small building blocks. A lot of the challenge here in agentic systems is deciding how much responsibility to give to each agent. Like, with the experience we have as users with Claude Code, you can give it a very freestyle task, like one sentence, and say, refactor this thing, add a feature. And you might have mixed results depending on, you know, your luck. Sometimes it will just find the right file in your repository and understand what it needs to do. But sometimes it goes awry and you have to start over and say, no, listen, here's the file. This is how we do things around here.
Ereli Eran [00:11:32]: And after you do that for a while, you might start putting that into the context, and then your context becomes a set of instructions like never, you know, change framework mid-flight when you're implementing a frontend feature. But you don't want to put everything in the context all the time. And I think that's the challenge of being both a software engineer and an agentic engineer: you can't put all the instructions that you want enforced in there all the time. You have to use them sparingly and in the right context to get the results you want. You can't simply say, here's everything that needs to be followed. I wish it would be possible, but we know from studies on context window limitations that even if the context window is 200,000 tokens, you can't really use all of it. You probably shouldn't be using more than 30, 40% to get anything decent out of it, which means you're always paying for capacity, for skill, but you can't really use all the MCPs that are out there or all the instructions that you can write down.
Demetrios Brinkmann [00:12:50]: Yeah, there was that blog post by Manus that talked about this and how they got around it, right? I can't remember the term that they came up with. It was something like progressive disclosure or something like that. And I've also heard from a friend of the pod, Brooke, who runs Koval, and she does a lot of things with voice agents, how a lot of times, since voice agents are these multi-turn conversations and they're very high stakes and you want to be as low latency as possible, you just can't have these gigantic prompts continuously being there for every turn. And what they'll do is they'll build graphs and then dynamically inject different pieces of the prompt at different points, because more or less they have ideas of how the conversation should flow. If somebody's calling up a customer support agent, you kind of know what they're calling about. And so you can, at different points of that conversation, inject different prompts in there. And it reminds me of that Manus progressive disclosure idea too.
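A minimal sketch of that dynamic prompt-injection pattern in Python. The stage names, snippet text, and the keyword-based classify_stage() heuristic are hypothetical stand-ins (a real voice-agent stack would use an intent classifier or an explicit conversation graph); the point is that only the slice relevant to the current turn goes into the prompt.

```python
# Sketch of "progressive disclosure" prompt assembly for a multi-turn agent.
# Stages, snippets, and the keyword heuristic below are illustrative only.

BASE_PROMPT = "You are a customer-support voice agent. Be concise and polite."

STAGE_SNIPPETS = {
    "greeting": "Confirm the caller's name and the product they are calling about.",
    "billing": "You may look up invoices. Never reveal full card numbers.",
    "refund": "Refunds over $100 must be escalated to a human agent.",
    "closing": "Summarize the resolution and ask if anything else is needed.",
}

def classify_stage(last_user_turn: str) -> str:
    """Toy heuristic standing in for a real intent/stage classifier."""
    text = last_user_turn.lower()
    if "refund" in text:
        return "refund"
    if "invoice" in text or "charge" in text:
        return "billing"
    if "bye" in text or "thanks" in text:
        return "closing"
    return "greeting"

def build_system_prompt(last_user_turn: str) -> str:
    """Inject only the snippet relevant to this turn, keeping the prompt small."""
    stage = classify_stage(last_user_turn)
    return f"{BASE_PROMPT}\n\n{STAGE_SNIPPETS[stage]}"

if __name__ == "__main__":
    print(build_system_prompt("I was charged twice on my last invoice"))
```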
Ereli Eran [00:14:09]: Yeah, I think many people are trying to find the right way to do this, because we could obviously compact the context window and compress it using various summarization techniques, but it's not deterministic, so we don't know exactly what will be lost if we compress too harshly. And I think the tricky part here is that every conversation is different. But if you understand your domain, and this comes back to data scientists: I worked with a lot of smart people in the past that were kind of converts, people who had PhDs in neuroscience or biology, and they came in to do data science in software, right? And some people have this allergy to studying the domain because they want to stay on the algorithmic side; they want to be a different type of person. But if you understand the domain and your users, then you can have a more opinionated view on these questions.
Ereli Eran [00:15:07]: What is relevant? So in the case of customer support, you would know there are some scenarios that are not reasonable. If someone asks a chatbot to start coding in Python, it's okay to say, no, this is not what I've been trained to do. And you don't have to use one LLM with all the prompts. You can have one LLM do the guardrails and then another LLM only do the business logic with a limited set of functionalities. And if it can't solve the problem, it cannot solve the problem. Having a less advanced LLM with limited skills is actually preferable in most of these contexts, both in terms of velocity and cost. And we can even see it with Claude Code. It doesn't tell us exactly when it's switching to Haiku and when it's switching to Sonnet or Opus, but you can tell that some tasks are better with a cheap, fast model.
Ereli Eran [00:16:10]: Let's say you were searching for source code and the task is to find which file we're going to modify. You don't need the most expensive model to run some grep commands, or ripgrep if you have that installed, right? And for those tasks, it's nice to have different subagents that can do that. They don't need the entire context to operate successfully; they feed back into kind of an orchestrator pattern.
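A rough sketch of that cheap-subagent/expensive-orchestrator split in Python. The model names and call_llm() are placeholders for whatever client you use, not Claude Code internals; the only real work here is the deterministic grep step that narrows the context before the expensive model ever sees it.

```python
# Sketch: deterministic search tool + cheap model for condensing, expensive
# model only for the final reasoning step. Model names are assumed tiers.
import subprocess

CHEAP_MODEL = "haiku-class-model"      # fast/cheap tier (assumed name)
EXPENSIVE_MODEL = "opus-class-model"   # slow/expensive tier (assumed name)

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: swap in your provider's client call."""
    return f"[{model}] would answer: {prompt[:80]}..."

def find_candidate_files(symbol: str, repo: str = ".") -> list[str]:
    """Deterministic 'tool': grep the repo instead of burning tokens on search."""
    out = subprocess.run(["grep", "-rl", symbol, repo],
                         capture_output=True, text=True, check=False)
    return out.stdout.splitlines()

def plan_change(symbol: str, instruction: str) -> str:
    files = find_candidate_files(symbol)            # no LLM at all
    summary = call_llm(CHEAP_MODEL,                 # cheap model condenses findings
                       f"Summarize how {symbol} is used in: {files}")
    return call_llm(EXPENSIVE_MODEL,                # expensive model sees filtered context only
                    f"{summary}\nTask: {instruction}")

if __name__ == "__main__":
    print(plan_change("build_system_prompt", "rename it to assemble_prompt"))
```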
Demetrios Brinkmann [00:16:38]: Yeah, the orchestrator... basically the constellation of models is becoming a very common pattern that I'm seeing. And I was literally editing a podcast just before we hopped on to talk about this with my friend Paulo. And he said, we tried so hard to replace all of our traditional machine learning models with new LLMs, but there are a few scikit-learn models in our workflow that do much better pound for pound against any LLM that you give it, just because of the nature of the beast of what you're trying to accomplish. And so in their workflows, they'll have that scikit-learn model there, and it comes with a bunch of inherent benefits too, because you don't have this gigantic model that you're now trying to serve, and the infrastructure around that. You've got something that is a lot more common, and people have been dealing with it for a lot longer.
Ereli Eran [00:17:41]: I mean, now that you've mentioned that, I've been thinking about my previous work. We did work many years ago on fraud detection, right? And I can't imagine someone replacing a fraud detection model with an LLM. If you can say a transaction is fraudulent because someone has a new laptop and they're buying a Rolex watch on a shop they just signed up for with a new email address, you could obviously have an LLM analyze that transcript of information and make a prediction. But in terms of velocity and cost, we had to get the response in less than 50 milliseconds, and we had to be correct practically all the time, right? And you can't have that with an LLM. You can have an LLM explain what the system did, which is probably where I would use it today. And I think people are a bit hesitant to say that because it's trendy to use LLMs for everything, but you can definitely still use graph methods and recommender system methods in a hybrid approach with your LLM. And I think that if you have the know-how to turn some section of your problem into a predictive problem, you'll get better results. And even with your context management, you have these opportunities. If you have a kind of knowledge base of context that you may need to include, the retrieval of that particular piece of context is a small machine learning problem: how to do efficient information retrieval and measure that the information you retrieve is relevant.
Ereli Eran [00:19:27]: Like obviously you can retrieve information, but who says it's relevant? Like that test of relevancy is a small data science problem that you can pick up if you have the appetite and the domain knowledge to say, I can, I can say what's relevant.
Demetrios Brinkmann [00:19:44]: Yeah, it's funny that you mentioned this hybrid approach, especially for fraud, because I was literally just reading an article, and I think it was Pinterest that took a hybrid approach on their fraud. They're trying to figure out ways to bring LLMs into their workflow, specifically around the fraud use case. I want to say it was that, but I'll bring up the article and try to figure it out a little bit better so that I don't misquote anything. But I wholeheartedly agree with you: if it ain't broke, don't fix it. That old saying still rings true. And the other thing that I was thinking about as you're saying that is how different it is when you are just working for yourself, trying to boost your own productivity or playing around with your own context, versus when you're trying to make a product that can then be used by many people, mass production we could call it, right? Because if I'm just doing it for myself, I think about how, oh, there are these tricks that I know when I'm looking at the code base and there's an error that's happening.
Demetrios Brinkmann [00:21:08]: And I'll say to Claude Code, okay, explain this whole file, explain it to me in as much detail as possible. And it will explain exactly what's going on. And then I use that as the context with either Claude Code, or I'll throw it in another model and say, now find the bug. Here's what's happening. Here's the flow. Here's the documentation. That is how it should be happening.
Demetrios Brinkmann [00:21:35]: Where's the difference? And that's a great trick for me as an individual, but how do you make that so that now when you have this AI product that's out there, it is operationalized?
Ereli Eran [00:21:51]: So I think the tricky part is that there's more than one author. If you look at a traditional organization in software engineering companies, you might have a kind of business and go-to-market side, and then you have software engineering, and then you have specialties within it. One of the specialties would be ML engineering or MLOps or data science or software engineering or, you know, DevOps people, right? We have this kind of skillset, and obviously security practitioners as well. Security is like a meta domain, because security is a mindset. You need to understand IT and software and also think like an adversary. So we have a lot of people whose security expertise is that they understand how attackers work and think, but they also understand how operating systems and networks work and have very deep, intimate knowledge about how to spot the behavior of those malicious actors as it's witnessed through the telemetry that we have in the security industry; we have EDR and we have network monitoring telemetry. So you have these people that are experts, and they can't all write the same software. So at least when we build the agentic systems, we try to say which part is agentic business logic, which part is instructions that relate to the security domain, like how do we investigate a security incident? How do we demonstrate to a human that we did, you know, everything that the human would do, in the same manner that they would do it? And then how do we take input from users who say, you didn't do what I want, you should have done this extra step, or in our organization this is okay; somewhere else it's not okay, but we allow it.
Ereli Eran [00:23:41]: So those types of contributions are all separate. We build 3 different parts of our system to allow these interactions, but in the end they all run in one runtime, and we have 3 people contributing code and prompts to the same kind of investigation in the context of our agents. And the tricky part is that they all need to have a feedback loop, right? For a software engineer building an investigation, it's very difficult to run an investigation without actual data. The first thing we did when we started the company, you know, I joined as one of the early engineers, we created a lab just so we could have a real investigation before we had any customers. We set up some machines running malware in a cloud provider, and then we would see the telemetry from some of the security vendors. And through the telemetry, we could tell the agent, you know, what would you do, and see how it would investigate, and through that build a process. Without actual data, it wouldn't have been possible. And we see that, obviously, customers want to contribute, and part of it is very much a product question.
Ereli Eran [00:24:57]: How do you let people contribute without letting them ruin the product, in the sense that they could make a mistake and, you know, write something bad and the product wouldn't work, and then who's to blame, right? So we have to put some guardrails on what contributions each user can do so the system still works, and give them feedback on their actions, right? The most critical part is that as a user, when you have written a piece of code in a high-level programming language, you can compile it, you can run it, you can run tests on it. That's your confidence. You know what you know through that. With LLM-based investigations, or an agent, how do you know that it's going to work? Sadly, you mostly have to try it. You can obviously test different parts of the system incrementally, but end-to-end is the most powerful proof point that we have. Yeah.
Demetrios Brinkmann [00:25:59]: And speaking about building for separate users, it then becomes much harder too when you are debugging. If the user is in some way, shape, or form not testing it or not looking at the logs as to why things are going wrong, there's no clear feedback. It's not like the agent says, oh yeah, I did all of this, I just got stuck in the last step. It's more like, huh, I wonder if it's not working because this problem is too hard and it's not capable, or it just got stuck in one of the loops and wasn't able to complete it. So I think there's a lot of that investigation work too that becomes a little bit of a nuance and a headache.
Ereli Eran [00:26:50]: Yeah, certainly, if you're using one of the popular frameworks like LangChain, you have limited visibility into what's happening unless you start instrumenting the state of your agent. So in LangChain, they support these graphs; LangGraph is the framework that allows you to create them, but the graphs mutate after each tool call. They accumulate state and accumulate information, and you need to build tooling to see it. Obviously you can get a commercial observability framework in place and you could have a page, but sometimes your agents will have hundreds of actions. So scrolling through a thread of hundreds of actions is quite limiting even for very technical users. So I think breaking your questions into what am I trying to look at is important. And we see it also in the product itself. When we show the agent's output, people really want to know what it did.
Ereli Eran [00:27:56]: So even if everything went well, we need to have a very nice audit trail of actions. We don't have to show every thought the agent had, but we need to show the users what the reasoning was behind taking every action. And it needs to map to what a normal human would do. So that part is really a UX question. How do we show enough and hide the information when it's too much? And sometimes, at least when we develop new content and new prompts, we need to show how things are different. So having a view that shows you this is my version A, this is my version B, and showing you the difference in a rerun of the same investigation, is a very powerful idea. Just being able to see the difference between them, because the difference might be hidden if you have a thread of hundreds of actions, right? So highlighting what is different can be quite useful for the users that are trying to understand their own changes.
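A small Python sketch of that version-A-vs-version-B view: rerun the same investigation under two prompt versions and surface only the actions that differ, rather than making a reviewer scroll two threads of hundreds of steps. The action dictionary format is an assumption for illustration.

```python
# Sketch: diff the action trails of two reruns of the same investigation.
import difflib

def summarize(actions: list[dict]) -> list[str]:
    """One line per action: tool name plus the reason the agent gave for calling it."""
    return [f"{a['tool']}: {a.get('reason', '')}" for a in actions]

def diff_runs(run_a: list[dict], run_b: list[dict]) -> str:
    return "\n".join(
        difflib.unified_diff(
            summarize(run_a), summarize(run_b),
            fromfile="prompt_v_a", tofile="prompt_v_b", lineterm="",
        )
    )

if __name__ == "__main__":
    a = [{"tool": "lookup_hash", "reason": "vendor flagged file"},
         {"tool": "close_alert", "reason": "no corroborating evidence"}]
    b = [{"tool": "lookup_hash", "reason": "vendor flagged file"},
         {"tool": "fetch_process_tree", "reason": "check parent process"},
         {"tool": "escalate", "reason": "obfuscated command line found"}]
    print(diff_runs(a, b))
```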
Demetrios Brinkmann [00:29:05]: Isn't that fascinating? How the UX design patterns are really the most crucial part in building the trust in my ability to know that this agent did the things that I wanted it to do, or that I would have done.
Ereli Eran [00:29:23]: Yeah, I think a nice part of it is that it's all human language in the end. In other domains, like in video or in image generation, if you're trying to understand why an image got misclassified, there's a lot of trying to say, oh, this pixel here is red and that's why it was a cat and not a car, and there's noise in those algorithms that people are trying to understand. But in the domain of LLMs, at least you can really see how one word could throw the LLM off. If you use the word suspicious, right? In human language we use these words freely, but with LLMs, as soon as you tell it something is suspicious, it's going to start thinking in those terms. So at least in the security space, you cannot say this file is suspicious. On what grounds? Why is it suspicious? Under what context or scenario is it suspicious? Because modern-day malware has become so advanced that it uses what we call living-off-the-land binaries. Instead of writing and compiling your malware into one file, you break the functionality of the malware into files that already exist on the operating system. So now you're running Windows and you have a PowerShell command, and PowerShell could be used by legitimate administrators to do administrative work, install and remove software.
Ereli Eran [00:30:57]: But also used by malware authors to kind of gain persistence or do anything. So you have this duality. And if you use language like the word suspicious in one of your prompts, or even if the vendor said this file is suspicious, now the LLM is already triggered to think that this is suspicious. And we needed to think like a scientific explorer and say, why would this be suspicious and in what context it is? And so the language itself is very sensitive and you need to be able to test different versions very quickly and see if a change of one phrase trickles downwards into a different conclusion at the end of the investigation. And you need to show where you started thinking in a certain way. So it's quite complex.
Demetrios Brinkmann [00:31:47]: And this suspicious piece is because it is basically just leading the LLM into saying like, oh yeah, it is suspicious.
Ereli Eran [00:31:55]: Yeah. So one of the features that, you know, my wife also works in this domain of AI these days. And she tells me you have to give prompts that the AI is going to have this notion of agreeableness. It's going to try to agree with you. So because it's been trained to agree with you, You have to kind of tell it, don't agree with me. And I think this is again, a counterintuitive thing because we think it's an intelligent beast, but it's not. It's a very nice parrot. So if you tell it, if you give it these words that are starting to, if you give it words that are triggering a particular line of thinking, you will see that it will try to agree with you.
Ereli Eran [00:32:42]: And we want it to be a scientific explorer and really answer questions in almost like a scientific way. So we can create proof points that resonate with humans. So if something is suspicious and it starts with vendor, security vendor says you have a suspicious file on your computer, we need to corroborate that with additional external information. We can't use the vendor's word of suspicious to say it's suspicious because if we trusted the vendors, you would have a million alerts per day. Like our problem in the security space is that vendors have been optimizing for never getting it wrong and they alert on everything that could be potentially malicious as suspicious. And we have a kind of alert fatigue and volume problem in the security space. So we can't blindly trust the vendors. We always need the secondary evidence.
Ereli Eran [00:33:39]: And that's part of the fun part here is like trying to build a system that collects secondary information to corroborate what, you know, one system says.
Demetrios Brinkmann [00:33:52]: Tangentially related to what you're talking about: back in 2023, when LLMs first came out, we had the creator of Airflow on here, Maxime, who also created Apache Superset. And at that time he was on a trip about how we need to treat prompts like code and less like something that we just kind of throw at the wall. We need to really have the ability to version our prompts and to understand them, and then also run unit tests against them. And all these things that you're talking about, like the champion-challenger of the prompts that we want to go out, and then being able to visualize how they are affecting the output in different ways. And the maturity that we were at, at that moment in time, was so far behind this idea, but it's still so true today when you really want to make sure that you're doing everything you can so that your AI product is reliable and useful to that user. Well, what do you know? It would be great if all of our prompts were versioned and we could figure out the lineage, and we could figure out how introducing a new prompt, or a new word into a prompt, affects that final product.
Ereli Eran [00:35:41]: It's funny that it came from, you know, someone who worked on Superset, because it's not dissimilar from what you'd see in the data analytics space. If you look at the products that allow people to write SQL, they all start with a kind of UI where you can paste your SQL and hit run and you get some table of results. Many of these products become analytics dashboards; maybe they support notebooks, maybe they support dashboarding. But then there's a question of why do we store the SQL? And some of them end up with different methods of persisting the queries. And eventually someone says, can you put it in source control? So at least the mature BI and analytics platforms will allow you to have some sort of revision or source control functionality, because you build on top of datasets and you share dashboards and you need some sort of source control. And that pattern, I think, is the same with prompt management. With us, we have been storing them in source control from day one, but we've built even better separation between source and prompts: treating them as content, putting them in a separate directory, and having multiple layers of validation against your prompts is very useful. And again, the separation is not only for having clean source control trees, but also because you have different personas writing those prompts. If you're a security engineer and you're touching a prompt, it's probably nicer that it's not embedded in a very long Python variable, right? It's nice if the files are in YAML, right? Having your prompts in a separate place and having the tools to test the prompts before you commit them.
Ereli Eran [00:37:34]: That's a nice kind of DevEx experience. I think DevEx is, you know, what gives us velocity. I think there was a Twitter thread the other day about, you know, Claude Code productivity, with someone from Anthropic saying they do 5 releases per developer per day. So if that's real, whatever the definition of a release is, how do you know that the releases are not breaking anything? You need to have a very solid process for testing those small prompt changes. So I think that's the key part: saying, okay, anyone can commit to source control and make a prompt change, but we need to have a couple of guardrails. One is, you know, have a suite of unit tests and integration tests, then have a kind of staging environment where you can see the prompts in action, running and working against real-life data. And once you have that, you can deploy it and give it to customers, but you have to monitor that it's still doing what you expect it to do and that the distribution of outcomes is what you expect it to be. And that's traditional observability in that sense.
Ereli Eran [00:38:52]: Like have, you we have the observability stack that kind of has been around open source observability tools are very popular. You can build tooling to kind of look for outcomes with the same observability tools you would use for tracking your web service availability, right? But you need to have the business, the domain knowledge and the business interest to say, I care that this prompt change doesn't break this outcome for this customer. So part of it is not because it's, if you're a security engineer, you might not be expert in DevOps, having the skillset of combining someone with the DevOps skillset with prompt engineering and AI engineer and security engineer works together on a feature, then allows them to be more independent long-term.
Demetrios Brinkmann [00:39:46]: Where are you having the evals fit into all of this? In this pipeline that you're talking about?
Ereli Eran [00:39:54]: So you have evals at a few levels. One, you can eval at the unit test level. So we basically have unit tests that are more like integration tests, and they actually make LLM calls. And then we have evals in a kind of staging environment where we can see things run against customer data. And then thirdly, we have LLM as a judge. LLM as a judge is a very nice feature, but we don't believe that it can run inline with traffic. My observation is that the business needs, at least in our domain, are that we have very strict SLAs; at least in the security space, people expect you to run an investigation and then, minutes later, they need to know the answer. 'Cause if it's real, it has a real impact on a business, and we are competing with humans.
Ereli Eran [00:40:52]: If humans take 10 minutes to an hour to look at an incident, we have to be faster. So our evals often run asynchronously as a kind of scheduled task, and they pick items from the same queue and revisit them and say, are we happy with this conclusion? Are we happy with the set of tools that we used? Are we happy with the hallucination level that we see here? Obviously we don't want to see any. So we can run deeper evals out of cycle, and we can run shorter evals before you release the software or before you give it to the customers. Does it make sense to have these three gates? There's also a cost element to it. In the ideal sense, we would run more, but we have limitations of both performance and cost. We can't have unlimited gating to get confidence, and there's a cost element to having full reruns of something. We are limited by LLM cost because we want to use the same LLM; we can't use a cheaper LLM to do the eval investigations, right?
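A minimal Python sketch of the out-of-band LLM-as-a-judge idea: a scheduled job samples recently closed investigations and grades them against a rubric, instead of judging inline with live traffic. The judge_llm callable, the rubric, and the field names are all assumptions for illustration.

```python
# Sketch: offline eval job that revisits a sample of closed investigations.
import json
import random
from typing import Callable

JUDGE_RUBRIC = (
    "Review this investigation transcript and answer in JSON with keys "
    '"conclusion_supported" (bool), "unnecessary_tools" (list), "hallucinations" (list).'
)

def run_offline_evals(
    investigations: list[dict],            # e.g. pulled from the same queue the agents use
    judge_llm: Callable[[str], str],       # same production-grade model, called out of band
    sample_size: int = 20,
) -> list[dict]:
    """Grade a random sample of closed investigations; results feed dashboards/Slack."""
    sample = random.sample(investigations, k=min(sample_size, len(investigations)))
    results = []
    for inv in sample:
        raw = judge_llm(f"{JUDGE_RUBRIC}\n\nTranscript:\n{inv['transcript']}")
        verdict = json.loads(raw)          # assumes the judge returns valid JSON
        results.append({"id": inv["id"], **verdict})
    return results
```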
Demetrios Brinkmann [00:42:08]: And are you having security engineers go through and craft some type of golden dataset just to have as something that the LLM as a judge can reference, or is it full on, just give it to the LLM and hope?
Ereli Eran [00:42:23]: So it's not a golden dataset, because we cannot use historical investigations; some of the data will age out. For example, if you're working on a security incident and you have data that has aged out, you can obviously store an anonymized version and keep it as an integration or unit test, but anything that requires a dynamic nature will require live and fresh data, because there's a timeline element to all these investigations. If I have malware today, I can't go back and run it on data from 6 months ago. It has to be fresh and recent and available in the downstream systems, or you have to mock half of the system. So in our case, we try to use recent data. And I think there are 2 reasons to do it. One is that we live in an adversarial space as well. So attacks change all the time.
Ereli Eran [00:43:26]: And part of it is because the security vendors change all the time, and attackers are changing the way they attack due to security vendors. I'll give you one example. We are very good at spotting email that has phishing text in it. If you write a text that says, you know, please use my new bank account, or whatever it is, the text analysis is very cheap and efficient. So what attackers do is embed an image, and they know that now we have to bring in OCR to read the content of the image, and now it's more expensive for you, right? And if you can read images, they will bring a PDF. So now you have to scan PDFs. This is an adversarial space, and because of that we can't rest on our laurels and say, oh, we have a test running for this, it's fine. We have to keep being good at what's currently being investigated, and that's where we use this sampling approach of fresh data and try to use attacks that are as real as possible.
Ereli Eran [00:44:31]: And if we need to run expensive jobs, we do it in an async fashion. And we do tap the shoulder of security engineers and say, hey, this one wasn't good. And we have a Slack integration to make sure that there's a human in the process, because, you know, at least as a company, we're trying to tell people that there are still humans behind the product. We're not just agents in a one-person shop. We have to create confidence, and confidence comes from having a person that actually has the ability to help you. So there's definitely a lot of human-in-the-loop shepherding of these agents.
Demetrios Brinkmann [00:45:10]: Are you okay talking about memory and shared memory? Because it feels like that would make the product much better if there are certain patterns that you're seeing, or a certain type of attack on one surface area or one client, and then this becomes common. You can bring that back, learn it, and then the agents can reference it later. Or is that just a naive way of thinking about it?
Ereli Eran [00:45:43]: So I worked for companies that had exactly this. Previous employers were pitching their product as exactly this: if we see an attack against one customer, the information about the attack will become threat intelligence, and that way you can share threat intelligence. It's the high ground. And that is very powerful for yesteryear's attacks. If you look at attacks that used hashes of malware, you can see the hash of the binary. And if you see that hash, you know it's game over.
Ereli Eran [00:46:17]: And sadly for us, it's never like that. It's never that easy because attackers are very sophisticated and they have polymorphic malware and they have living-off-the-line binaries. So we don't, you cannot, make just threat intelligence your only strategy to combat attacks. And what we see is that even with our customers have the best security products you can buy and they still get attacked every day and not everything is stopped by you the, know, email security, the EDR security and the network security products that they bought. And even if they bought a SIEM where they aggregate all the information to one product that is still not successful in aggregating everything they need. So having an agent that can go to systems that are not integrated into the SIEM is very powerful. And in terms of like memory, we have built mechanisms for us, for them to tell us about false positives because that's more interesting than, I mean, so I would say you can obviously have false negatives as well to our system, but it's often false positives that make the most noise because if you have once in a month backup software and once in a month backup software generates an alert every time it runs. So every month you have a spike of noise and it's just nicer to have like this collective memory.
Ereli Eran [00:47:47]: So we have a way of saying, if you see this, it's okay. And that kind of is injected in an intelligent way. So the lookup into the memory has to be aware of what are we looking at. So we have a notion of artifacts in the security space. So some people call them observables, but you might have an IP address or a domain or a file hash. And we have a way of saying, when you see certain elements in telemetry, you know, look it up and see if we have something to say about them. And that allows us to kind of inject this collective knowledge that we have. And we don't need to do that for attacks necessarily, because attacks are by nature suspicious and cannot be explained naively.
Ereli Eran [00:48:33]: If I show you a command line that is running from a malware, you wouldn't be able to understand what it's trying to do, because the authors of the malware have tried to obfuscate it. So it's already, it's the fact that you see obfuscation is making you suspicious because a normal developer would not try to obfuscate their code. So we have multiple levels of, you know, knowledge, but we don't need to tell the AI what is malicious because that is, it's almost like a common knowledge. Kind of, you know it when you see it sort of thing.
Demetrios Brinkmann [00:49:12]: I kind of want to switch gears and just get your opinion on where you think or where you've seen the most common failure modes when it comes to building agents that you're putting out into production.
Ereli Eran [00:49:26]: We, I mean, we started with writing the tools in, in the, like the wrong language. I think one of the things that, that we, we first did is say, you know, What's a good language to write backend in? And we chose Go, which is a very nice language. I love Go, but it is not the best language for both coding agents to write code in and when you're trying to build tools. So this was maybe 2 years ago, right? So today it seems funny. We have MCP, but this was before MCP. So when we tried to give tools to our first version of the LLM, we tried select the wrong language. And Go creates language safety, which is very nice for software engineers. But what you want with the LLM is really give it the maximum amount of tools.
Ereli Eran [00:50:15]: So I think one failure mode is giving it not enough tools. So it can't really do the job. Like you can't expect an agent to do what a human does if you don't give it the right set of tools. So we were a little bit behind when we were trying to handwrite all the tools in Go. The second thing is like when you have too many tools. So when you give the agent, you know, 1,000 tools, it's not going to be able to work. Like you have a limitation on the memory. And as we talked about context, the format in which the LLM vendors want to know about tools is very verbose.
Ereli Eran [00:50:49]: This JSON schema that describes the tools is not an efficient format for declaring the tools. So basically you cannot give all the tools to all the agents all the time. You have to have an intelligent way of saying what tools are relevant for the context in which you find, and that's like tenant-specific. So each customer has a different set of security software installed and each investigation has relevancy. So you have to apply this relevancy test so you don't pollute all your your context. And that's something that we've all seen with even software engineering coding agents. And I think the trend now is to use skills just because of that. Skills is kind is of, like, it's coming from Anthropic, right? But it's something we have similar notion of filtering and the idea that you can only pull into context subset of the functionality at the time is very appealing.
Ereli Eran [00:51:52]: So that is definitely one thing. And I think also, if you look at the other frameworks, like we use LangChain when it was one of their early frameworks. I think it's a good idea to evaluate kind of open source frameworks. LangChain, when we started, was not very mature. And I still feel like They're doing as best as they can as a fast-moving startup, popular in open source community, but you have to kind of not get swayed by GitHub stars. I think you have to kind of decide for yourself what is the business logic you want and how to build it and not say, I'm going to build it using this tool just because that's easy because it's not necessarily easy. If you look at the Pydantic Agentic framework, the Pydantic folks have released one. They basically say you probably don't need a graph.
Ereli Eran [00:52:51]: They have like in the documentation a that says segment like, do you really need a graph? Maybe you do. Maybe you need an orchestrator and a sub-agent and agents that call tools that are also agents, but maybe you don't. Like having a minimalistic version of your app is really a good way to start. I think a lot of people, they have, they want to go to the advanced mode immediately. And in our case, we have, we're like on a higher version of our system right now because we've kind of adapted multiple times and it's constantly changing. So try to prototype and evaluate before you kind of commit to a framework or tool.
Demetrios Brinkmann [00:53:34]: And it's funny about that graph piece because it feels like more and more you are just getting that one agent. The sub-agent architecture idea is becoming less relevant when the main agent is more powerful and can do more things.
Ereli Eran [00:53:54]: I agree that we wouldn't need sub-agents if we were able to manage the context of having one agent. So it's really about keeping the context unpolluted long enough to complete long-running tasks. And if you can't do that, then you have to break it up, because that's basically our way of hacking the memory and the context. But if you can have good visibility into the utilization of your context, you can solve a lot of problems with a single agent. It's just very tricky, because your hammer, writing more prompts, is destroying your agent. So you have to really find a toothpick, not a hammer, to slightly nudge it towards the outcomes you want without necessarily over-engineering a multi-agent graph. And the graphs, I'm not against using graphs.
Ereli Eran [00:54:57]: I mean, if you come from data analytics background, from using Airflow and using any kind of Spark system, you understand that graphs have a place in building data analytics jobs. But we have seen, at least in the security space, like two things that don't necessarily always work. We have SOAR platforms where you can build like automations using code. And we have seen platforms like, you know, Orchestrate and Temporal that are, you know, baking open source workflow engines. And I think those require software engineering skillset, but thinking about how to test those workflows in the agentic context is a lot of human context to consider. So when you pick your solution, you have to consider, do I need just a workflow engine and write some business logic in a general DSL for like a Temporal or Orkus? Or do I need an agentic framework and a graph of complexity? And deciding where do you need, like why does it need to be an LLM is like a tough question. Because obviously people are excited about LLMs, but I think it's a good question to always ask. Does it need to be an LLM? Can it be pure business logic? And who's the person to write the business logic? Does it need to be me or can it be the user? If you ask those questions, you will find that in many cases, you don't need everything to be an agentic LLM.
Ereli Eran [00:56:36]: Like you can get quite a lot of value from very limited set of LLM usage in your application. And it's obviously going to make it more testable and cheaper to run if you don't overuse agentic systems. And if you have to package skills, the skills are not LLM necessarily. You can write skills as regular software and write, you know, same way you write your MCP or whatever you, that functionality. Is traditional software, and that's fun to write because you know that part at least works.
