We Can All Be AI Engineers and We Can Do It with Open Source Models
Luke is a passionate technology leader. Experienced in CEO, CTO, tech lead, product, sales, and engineering roles. He has a proven ability to conceive and execute a product vision from strategy to implementation while iterating on product-market fit.
Luke has a deep understanding of AI/ML, infrastructure software and systems programming, containers, microservices, storage, networking, distributed systems, DevOps, MLOps, and CI/CD workflows.
At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
In this podcast episode, Luke Marsden explores practical approaches to building Generative AI applications using open-source models and modern tools. Through real-world examples, Luke breaks down the key components of GenAI development, from model selection to knowledge and API integrations, while highlighting the data privacy advantages of open-source solutions.
Luke Marsden [00:00:00]: My name is Luke Marsden. My title is CEO and founder, and I take my coffee with an AeroPress at home, which I drink a lot of, which is just like, aeropress, plus, like, nice coffee from a local brewery, plus a splash of milk. But I really like a flat white if I'm in a fancy cafe.
Demetrios [00:00:19]: Folks, welcome back to another MLOps Community podcast. I am your host, Demetrios. Today we're talking to a longtime friend, Luke Marsden, and I appreciate whenever he comes on this podcast. He's been on a few times, and that is because he is one of the reasons that this podcast and this community exist. He's the guy that told me we should start a community when I was working for him. He was CEO and I was just the lowest man on the totem pole. And that is why we are here today. I am with you, in your ear. Luke talks about two things that I want to highlight.
Demetrios [00:01:02]: First thing is AI Spec, which is all about creating standards as YAML files for how to have CI/CD for AI apps. And the second thing that I absolutely enjoyed him diving into and getting really technical with us on is how he set up the evals, specifically the tests for AI Spec and his product, Helix. All right, let's dive right into it. As always, if you enjoy this episode, share it with just one friend. Luke Marsden, I will always have a very special spot in my heart for you because I remember those times where I had no idea about technology at all. I specifically remember a time where we were walking down the street in London after a conference, and I was asking you what Kubernetes was, and you were telling me, and then I was like, so it's like Hadoop? And you're like, no, no, no, no. Different things. Yeah, very different.
Demetrios [00:02:22]: I'm like, well, how is Hadoop different than Kubernetes? And so you went, and you have always been so patient with me in explaining things, and that is why I'm very excited to talk with you again so you can be patient and show everybody else how patient you are when you explain things to me.
Luke Marsden [00:02:39]: Bless. Well, thank you, Demetrios. And likewise, you have a very special place in my heart, because when Dot Science shut down and the MLOps community was born from it, it was your drive and passion that made it a thing. There we go. And when I was speaking at an MLOps community meetup in San Francisco a few weeks ago, I said something at the beginning, which was: it's so incredible to be here in San Francisco with this thriving community around me that you and I started back in those dark days. So thank you.
Demetrios [00:03:17]: For people that do not know, Luke and I were working at Dot Science. You were the CEO at Dot Science and I was just some random sales guy who was trying to learn about tech. And right before the company went out of business, Luke told me, hey, maybe we should do something with like a community or something. And I said, okay, let's try it. And that's how the MLOps community was born. Let's get into CI/CD for gen AI. And also, you've come at AI and traditional ML and data from a DevOps background. And so I think you've always been very anchored in software engineering practices and software engineering.
Demetrios [00:04:06]: And so that's part of why I think you had that inspiration in the beginning to start Dot Science and say we need lineage, we need tracking for all of this stuff. Because if you make changes, who's going to know and how are they going to know and what are you going to be able to do about it? And as my friend Chad likes to say, it's data change management. And we have like GitHub to help with just regular code change management. So let's start there. CI/CD for AI. What does that even mean in your eyes?
Luke Marsden [00:04:38]: Yeah, so thank you. And I think I'd like to start that by talking about how software engineering gets done. We've had decades to figure out how to do software engineering well and to ship software well. And when we're shipping just regular software, there's basically two fundamental pieces that you need to be able to do production-ready software, and that is tests and deployment. And so that means being able to test your software automatically before you merge a pull request to the main branch, for example. And it means being able to control which version of software is running in production. And often there's a term that's used called GitOps, which is this idea that was coined by my old boss Alexis Richardson at Weaveworks back in the day. And the idea there was that you should keep a record of what you want running in production in a git repository and then reconcile that against production.
Luke Marsden [00:05:51]: So there's these two key ideas of CI, which is continuous integration, which just means testing your software, and then CD, which overlaps with this idea of GitOps, which is continuous delivery. And it's being able to know what version is running in production, roll it back easily and all that stuff. So then, okay, probably everyone I'm talking to already knows that, because CI/CD is a fairly common idea. But the question then is, well, how does it apply to generative AI apps, like LLM apps? And the thing I'm trying to really push on at the moment is to argue that it's basically exactly the same as software. It's just that the definitions of what we mean by testing and deployment are slightly different. And so you hear people all the time talking about evals. I think I saw you had Gideon on recently talking about evals, and that's great, because how you test gen AI apps is just evals. It's evaluations.
Luke Marsden [00:07:00]: Yeah, so my argument is basically that to do production-ready gen AI, you need CI/CD for gen AI. And then we can break that apart. The CI piece is: well, you do testing, you test your app before you merge a change into the main branch, and the way you do that is with evals. And then you deploy the app. And then the thing that we're trying to kind of make a reality is this idea that the spec for an AI app should just be version-controlled YAML, and that you could almost deploy it to a Kubernetes cluster in the same way that you deploy an ingress record or something. So I can go into more detail and explain what I mean.
Demetrios [00:07:48]: Yeah, yeah, tell me that I haven't heard that before. So version controlled YAML.
Luke Marsden [00:07:53]: Yeah. So I mean, if you're a sort of Kubernetes person or a DevOps person in general, you'll be familiar with this idea of having a git repository full of different YAML files for different parts of the application. So in a regular software application, you might have five different microservices, a database and so on. And all the different parts of that application are described by Kubernetes manifests. So they're YAML files in git that say: this is a deployment object, for example. This deployment object says, I'm going to deploy the front end, for example, like a web front end or an API server, and I want there to be at least five of them, and you can scale up to 10 or something. And they want to have this Docker image.
Luke Marsden [00:08:51]: And then when you apply that to your Kubernetes cluster, Kubernetes will reconcile that and be like, okay, I'm going to make sure that there's always five of these running, and if one of the servers dies, I'll spin up some more, and it will keep things in equilibrium like that. And so the thing that we're trying to make real, with this project called AISPEC.org, is to define a similar spec for how you can define a generative AI application. Because if you can do that, then the exciting thing is that you can manage that generative AI application in exactly the same way that you manage all your software. And why are we doing this? It's in order to make generative AI much more accessible to people who are familiar with DevOps and software engineering. Because I think the tooling has got good enough now that you can do that, and there's no reason that we can't all be AI engineers, whether we're a business person just prototyping an application or a DevOps or software person trying to productionize an application by adding tests to it and then doing GitOps with the manifest.
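To make the kind of manifest Luke is describing concrete, here is a rough sketch of a Kubernetes Deployment that would live in Git and be reconciled against the cluster, GitOps-style. The names and image are hypothetical.

```yaml
# Hypothetical Kubernetes Deployment kept in Git and reconciled against the cluster (GitOps)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                # the web front end / API server Luke mentions
spec:
  replicas: 5                   # "I want there to be at least five of them"
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example.com/frontend:1.2.3   # "they want to have this Docker image"
```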
Demetrios [00:10:10]: So how do you. So I love AI Spec, by the way. And if anybody wants to or feels inspired to join that movement, there is a channel in the MLOps community all about that. And how do you foresee a Kubernetes manifest giving detail about the system? The AI system? Yeah. What does that even look like?
Luke Marsden [00:10:41]: Yeah, yeah, yeah.
Demetrios [00:10:41]: Right.
Luke Marsden [00:10:42]: So I'll kind of walk you through that. Yeah, so the AI spec is basically. If you're not familiar, I guess everyone's familiar with YAML, right? But a YAML file is basically just a configuration file written in a text format. And I'll just go through the building blocks of an AI application, if that's okay, and we can talk about how each of those can be represented by one of those fields in this YAML file.
Demetrios [00:11:13]: And this is just so I'm super clear on it: it would be a microservice, which is a model. At the end of the day, you're getting a model out of it, or you're like hitting a model, or is it a RAG pipeline? Like, what are we looking at?
Luke Marsden [00:11:31]: So, yes, you can think of it as a microservice. How it's implemented can vary, but in particular, you can think of it as like part of your application, a microservice. And the AI spec argues that we should try and come up with a standard format of describing not just the model, but also a system prompt for the model that's kind of the baseline, but then also a definition for what we mean by knowledge. So for kind of a basic form of RAG that everyone can use. And then you can have all the different parameters that you can twiddle to make it better, chunk sizes and things, but then also API integrations. So this is something I'm really big on at the moment because it's what we're finding most of our customers want, which is the ability to integrate these LLMs into business systems. Because it's like, well, it's all very well to be able to just like have a chat with a chatbot, but it's much more interesting if the chatbot can talk to your JIRA instance and you can talk about the current Sprint.
Demetrios [00:12:41]: Right?
Luke Marsden [00:12:42]: So. Or you can say, write some code for me for this issue that I've been assigned, and then it will bring in all that context. So, yes, we're trying to make this AI spec be, well, let's see: which model you're using, any system prompts, knowledge, i.e. RAG. I think knowledge is a nicer, more accessible word than RAG, because RAG sounds all fancy.
Luke Marsden [00:13:07]: Retrieval-augmented generation? What are you talking about? It's just knowledge. It's giving the model knowledge. Then there's integrations, which is this idea of integrating with business systems. And then there's tests. Right? Remember we were talking about testing? So you should put the tests for your AI app in the app spec itself so that they all get versioned together. Right? It's like that's how you would do it with software.
Luke Marsden [00:13:30]: You'd have your tests, your test code right next to your real code, and then the deployment piece, which is like, how do you actually deploy that? But in terms of the actual AI spec, it's model, prompts, knowledge, integrations and tests. And I think if we version all those things together, then we're going to have a better time.
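As a rough sketch of what a spec like that could look like as version-controlled YAML, here is an illustrative example. The field names are assumptions for illustration, not the official aispec.org schema.

```yaml
# Illustrative only: field names are assumptions, not the official aispec.org schema
name: sprint-helper
model: llama3.1:8b                  # which open source model to run
system_prompt: |
  You are a helpful assistant for our engineering team.
knowledge:                          # "knowledge", i.e. a basic RAG pipeline, declared in the spec
  - name: docs
    source: https://docs.example.com
    refresh: nightly
integrations:                       # business APIs the model is allowed to call
  - name: jira
    openapi_spec: ./jira-openapi.yaml
tests:                              # evals versioned together with the app itself
  - name: list-issues
    prompt: What issues are there?
    expected: A list of the open issues with brief summaries.
```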
Demetrios [00:13:46]: Because for me, it feels like there are a lot more things that go into building a RAG system, if we're talking RAG, that aren't on that manifest. Like, how do you see that being reflected?
Luke Marsden [00:14:06]: So that's a really good question. And we've had some users come along and want to do more advanced RAG systems than the one that you get out of the box with AI spec. And the way that we've handled that is to have some other fields that are optional, that are not required. So basically, you should be able to have a good time most of the time by just having a knowledge spec inside your AI spec that says: scrape this website, strip off headers and footers, go 10 links deep, and refresh every hour or refresh nightly, because those are some of the examples for what we're doing in our out-of-the-box RAG. But then we had these other users come along and say, well, we're used to building these much more sophisticated RAG systems that might have different databases, and then a sort of agent in front that decides which database to query based on the user's intent, and things like this. And so what we did there was, we said, well, if you want to get fancy like that, then let's have some other fields in that spec that allow you to break out of the simple version into a more sophisticated implementation.
Luke Marsden [00:15:35]: I think how we did it was that you can reference an HTTP server that you're running that has some specific RAG endpoints, like customized ingestion and query endpoints. So the idea is, yeah, we can make it easy for most people out of the box, but then when you need to go in and do the more detailed RAG work, you can break out of that box. And what tools would you use to implement those HTTP endpoints? Well, you'd probably use LlamaIndex or whatever other sophisticated tooling you have for building advanced RAG systems. So we're explicitly not trying to compete with the likes of LlamaIndex with this proposal we're trying to make with AI spec. We're basically trying to make it so that, step one, any business person can prototype an application in a web interface. That's actually a really nice property. Step two, any person with some technical skills, like a sysadmin or DevOps person, or a product manager. Exactly.
Luke Marsden [00:16:52]: And with some technical skills, they could build this YAML spec, version control it, and write some tests for it so that they know that it keeps working, and if they change one prompt, whether it regresses other cases that they care about. You can think of it as test-driven development for gen AI. We need to get to that point as a community, I think. And then yes, if you want to go and do more advanced RAG stuff and you need to learn more about the RAG to make it good enough, and your tests will hopefully tell you if the quality is there, by the way, because the tests are just evals, then yeah, you can break out of the box and do more sophisticated stuff.
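As a hedged sketch of the two modes Luke just described, the out-of-the-box knowledge spec and the break-out to a custom RAG server might sit side by side in the YAML something like this. The field names and URLs are hypothetical.

```yaml
# Illustrative only: simple knowledge spec plus an optional break-out to a custom RAG server
knowledge:
  - name: website-docs
    source: https://example.com/docs
    scrape:
      strip_headers_and_footers: true
      max_depth: 10                 # "go 10 links deep"
    refresh: hourly                 # or: nightly
# Optional fields for teams with a more sophisticated RAG stack (e.g. built with LlamaIndex)
# exposed as customized ingestion and query endpoints over HTTP:
rag_server:
  ingest_url: http://my-rag-service:8000/ingest
  query_url: http://my-rag-service:8000/query
```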
Demetrios [00:17:36]: This is a bit of a tangent, so I'm going to warn you right now. Sure, I'm going to go down a different road. But I had the realization, or almost a thought experiment, where I was talking to Fernando probably three, four weeks ago about how he is constantly trying to use AI to solve problems that he has within the business, and he's rapidly prototyping things and he's rapidly trying to stand up agents or whatever. And one thing that he said in passing, but it stuck with me, is how easy it is right now for him to create an agent to do something versus if he were to bring on a SaaS solution to try and do that.
Luke Marsden [00:18:26]: Yeah.
Demetrios [00:18:27]: And I thought this was fascinating because it's a little bit of: you can rapidly prototype something so quickly, and if you are very comfortable with building AI apps, for lack of a better term, then it may be faster for you to build the AI app that uses AI, with all its headaches, than it would be for you to not use AI. And that for me is mind blowing, because up until now I've always been in the camp of the first rule of AI is don't use AI unless you have to. Yeah, yeah.
Luke Marsden [00:19:10]: Don't just cluster everywhere like.
Demetrios [00:19:13]: Exactly. But then if it's easier for you to create a prototype that will do what you want by using it versus hard coding it, that's where my eyes opened up from something that he said in passing. Right. And so I think what you're trying to do here is really cool, because you're just making that rapid prototyping even easier for folks. And so talk to me about almost these three levels. Right. You've got the non-technical person that can prototype something. You've got the semi-technical, or they're technical but maybe in a different domain than an AI engineering domain.
Demetrios [00:19:51]: And then you've got the full-on production engineer for the AI apps domain. What does it even look like if I'm a non-technical person? How do I interact with it and how do I get value from this? Because I know that it feels like it opens up so many doors.
Luke Marsden [00:20:16]: Absolutely. And on the point of that comment that you said Fernando made, I do think it's a super interesting time that anyone can now develop software quite quickly, especially with Cursor and Claude. And I think the age of rapid prototyping of just about anything you want to build is definitely upon us, I think.
Demetrios [00:20:45]: Sorry, there's the: yeah, we got 90% of the way there, and then the other 10% was really hard. Yeah, it takes longer than getting 90% of the way there.
Luke Marsden [00:20:55]: I'll come back to this sort of three layers thing. I think level one is the super non-technical person, where the only requirement is a web browser. And full disclosure: we got massively inspired by the GPTs feature in ChatGPT when we started thinking about enabling these very non-technical users to do it. Of course, our angle is open source models, and make it run locally. So if you have regulatory reasons, for example, not to be able to send your data somewhere else and you want to be able to do what I'm talking about, then come talk to me. The non-technical user being able to prototype something quickly, I think that is really exemplified in the GPTs feature in ChatGPT.
Luke Marsden [00:21:50]: And when GPTs launched, I was initially a bit skeptical, because they'd had a couple of iterations before that that didn't go so well. I can't even remember what they were called, but they were like different plugins for ChatGPT that everyone was a bit meh about. But then I started hearing something really interesting at conferences and meetups that I went to, which was I actually started hearing about serious business use cases where people had used GPTs internally inside a company to do really quite complicated things. So for example, there's a company here in Bristol that is doing risk assessments. This might sound slightly terrifying, but they're doing risk assessments for the film industry. I'm going to go and shoot a film on a big iceberg somewhere, like what do I need to think about? And they'd rigged up what they called a chain of GPTs. They'd done this all inside ChatGPT, by pointing and clicking, to prototype an application that could generate these risk assessments automatically, or at least generate suggestions for them that humans can then check. And they were having success with this thing, and I was like, oh, maybe I shouldn't be ignoring this GPTs feature, because actually it's kind of a silent winner potentially, because it's actually getting adoption inside businesses.
Luke Marsden [00:23:11]: So I'll recap what that is. I'm sure most people listening know already, but what you can do is you can go into ChatGPT and create what's called a GPT, like your own kind of customized thing, and you can give it an avatar and a system prompt, which is instructions like: behave like this, this is what you should do, this is what you shouldn't do. But then what's really interesting is that you can also give that GPT, your sort of customized chatbot, knowledge and integrations. And the knowledge piece is a basic RAG pipeline, like we were talking about earlier. Can we give everyone a basic RAG pipeline that works most of the time for a lot of use cases? And it's literally: put a website into this thing, or drag and drop a Word document into it.
Luke Marsden [00:24:03]: And that works. And it works pretty well. And then the integrations piece is where it gets super interesting for me: if I've got a swagger spec for a business API, I can also just put that in there. And now this LLM, this chatbot, can make API calls on my behalf. And so it can do things like integrating into business systems. So that's the basic level of what we're trying to enable with open source models: the ability for anyone to come along who's non-technical and use a web interface to construct one of these AI apps. And when I say app, what I mean is it's like a GPT from ChatGPT.
Luke Marsden [00:24:59]: So does that make sense?
Demetrios [00:25:00]: Yep, yep. So basically throw some information at it, give it a system prompt and integrate it with whatever you're working with and you're off to the races.
Luke Marsden [00:25:10]: Exactly, exactly. Now this is where it gets more interesting. I'm a software DevOps person, right. That's my background, like you said. And if I think about people building apps by pointing and clicking in web interfaces, I kind of scream and cry a little bit on the inside. I don't know if you've ever heard of Jenkins. It's like the old CI system. Jenkins was configured by people logging into it and pointing and clicking in a web interface.
Luke Marsden [00:25:42]: And it was widely hated. Right. In sort of DevOps, like more sophisticated DevOps teams, I guess, maybe 10 years ago. Because it's like that's the wrong way to do it. You shouldn't. Like, how are you going to make this system reproducible? Like, if we lose our Jenkins server, it can't be a snowflake. You probably heard this idea of like.
Demetrios [00:26:05]: Individual, unique piece of software.
Luke Marsden [00:26:10]: Yeah, exactly. And it's like, no, software should be declarative. We should be able to recover from losing a server by just spinning another one back up again. It should be cattle, not pets. Sorry, I know you're vegetarian, but.
Demetrios [00:26:23]: I've actually heard it. Funny enough, somebody the other day said this in passing: like, oh, what you're describing right now could be called GUI ops, because you have to work with a GUI to get what you want.
Luke Marsden [00:26:37]: Yeah. And it's bad and wrong.
Demetrios [00:26:42]: Don't forget that.
Luke Marsden [00:26:42]: Ever. So the interesting idea here is like, okay, so we can make building AI apps more accessible to non-technical people by giving them a web interface that's like the GPTs editor in ChatGPT. And you can see that today if you go and look at Helix ML and log into the SaaS: we've built this app editor thing in there, which is basically a GPTs clone from ChatGPT. And you can see that on the AI spec website as well; there's a little video demo of it. But then you get me waving a stick and saying that's the wrong way to do it. So how do you bridge from there's a GUI where you can do GUI ops to build the prototype of your AI application, and then how do you go from that to something that you feel good about from a DevOps perspective? And the way that we're trying to push that is this: with GPTs from ChatGPT, there is no output format of a GPT. You can't take it and export it.
Luke Marsden [00:28:02]: It's just there in the web interface and it lives there. And if someone goes in and clicks buttons and changes it, then it changes. And do you know whether it got better or not when you changed it? You're not doing evals on it. The idea that we're pushing, then, is that this app editor for these AI apps should output version-controllable YAML. Right? So there's the bridge. You go from a non-technical person having prototyped something, and, from an organizational perspective, I think it's quite interesting to ask the question: could a product person or a CEO or someone who's non-technical prototype an AI application in a GUI like this, but then hand the artifact, which is this YAML file, off to a technical team inside the organization who are maybe a bit closer to AI engineering? They don't actually need to be AI engineers, they just need to be software DevOps people who are familiar with pull requests and YAML files and IDEs, and then ask them to productionize it. So that's the kind of organizational question: can we enable people to prototype things easily and then bridge that gap into productionizing? So I can talk more about productionizing if you like, but.
Demetrios [00:29:26]: Well, the main question that I have as you're saying this is, this is for that GPT use case, which is almost in a sense productivity use cases, I guess you could say. Usually when I think about a GPT, I am getting productivity from the LLM in some way, shape or form. It's as you said, plugging into Jira and maybe creating code or summarizing the last or the next Sprint.
Luke Marsden [00:29:59]: Yep.
Demetrios [00:29:59]: Et cetera, et cetera. Or it's looking at Notion or Google Docs and helping me find and create a new doc with knowledge of past docs.
Luke Marsden [00:30:08]: Sure.
Demetrios [00:30:10]: What about when my app or my company product wants to incorporate AI into the product itself? Are you just purposefully not trying to go after that use case right now?
Luke Marsden [00:30:30]: That's actually a really good question. I shouldn't sound so surprised, but it's. Oh no, no. It's a really interesting question. I think you can get to both of those use cases through this. But I think that if you were starting from the perspective of wanting to incorporate AI into an existing business application, you probably wouldn't start with a sort of GUI ops way to set up the endpoints. But what you can still do with the AI spec is create API endpoints that you can call from your business application. And so actually, what we are building here is probably a faster way to get started at incorporating AI into the application that your company's building, by just spinning up some of these endpoints.
Luke Marsden [00:31:32]: And what we did there, which might be interesting to people, is we went all in on the OpenAI API. So we said everything should be OpenAI API, like the chat completions API, on the way in and on the way out. So what these AI spec compatible things do is: you can have this chunk of YAML that defines knowledge and API integrations, and then you can surface that whole app as an OpenAI-compatible API. So it's like you're talking to an LLM. Well, you are, but it's an LLM that's been imbued with superpowers: that LLM already knows about the contents of your RAG store, because it's got RAG built into it, and it already knows how to make API calls into API endpoints. So it's done a bunch more of the work than you get, for example, just with function calling from the OpenAI API, because it will actually make the API call for you and summarize the response. So yeah, I hadn't really thought about it like that before, so it's interesting. But I would say that you can satisfy both those use cases: the productivity use case, which is starting from playing around prototyping and then maybe pushing that into wanting to roll it out to more internal users and productionizing that and making sure that you've got quality there.
Luke Marsden [00:33:07]: But you could also use this approach for adding AI capabilities to an existing app, because it's an OpenAI-compatible API all the way through. And when I was looking at that, I was looking at the OpenAI SDK support in different languages and frameworks, and it's everywhere. I wasn't surprised, but if you want to plug an OpenAI-compatible API into your WordPress site, there's 10 different WordPress plugins to do it. And the same is true in Java Spring and all these other language communities. So I think that was a no-brainer.
Demetrios [00:33:43]: Yeah, because it's fascinating for me. I guess the thing is, when you want to incorporate it into your product, it's a little bit more complex, because how are you going to present this in your product as this AI feature? Is it going to be just one of those little star buttons that people can click on and then they have super AI features, or is it directly when someone is doing what they normally do, first it hits OpenAI and it tries to give you an answer before it goes into the product? And for each use case it's going to be different, and I think the way that you prototype those is a little bit more in depth. And also, I'm not sure if it's something that is going to be done by the CEO or the product manager over the weekend, but maybe it is. I know that when we had Philip on here from Honeycomb, he said something along the lines of: I think I prototyped our AI features within the app over a weekend. And then it was that whole thing of: it took me a weekend to create the prototype, but then another six months to create the real production-ready thing. But it goes back to the conversation, and really, I think, the theme of what we've been saying: you prototype it to get an early signal on whether it's valuable. And then if it's valuable, you can hand it off to the right folks who can productionize it. But it's a lot easier to see if it's valuable when there's a prototype.
Luke Marsden [00:35:26]: Yeah.
Demetrios [00:35:27]: And if you start playing with it and if you start to see like, is it able to do what I have in my mind?
Luke Marsden [00:35:36]: Yes.
Demetrios [00:35:36]: Can we at least get there a little bit or am I totally crazy for thinking that?
Luke Marsden [00:35:42]: Yeah, absolutely. And then if it is able to do what's in your mind, you can write a test for it which makes sure that it's going to continue to work. And the idea of being able to take this AI spec all the way from the prototype in the web interface through to the YAML format, and have the same artifact throughout that then gets version controlled and iterated on, is that you can add those tests early, and you can make assertions about what business capabilities you want this thing to be able to perform early on.
Demetrios [00:36:17]: Well, yeah, go into the tests. Like, how do you see those working? Or what have you seen for the tests that can be valuable ways of creating them, using them, doing them, all that stuff? Besides, I know we've talked about evals, and you have your evals, and so I guess you have a whole bunch of use cases or evals that you're looking at, and you want to continuously be adding to those evals as you find new use cases or new ways that your users are interacting with the feature. But what else are you seeing?
Luke Marsden [00:36:56]: Yeah, so I can describe a bit about my experience of building this Helix test feature and some of the learnings from that. Now, I'll start by saying we're not trying to replace something like DeepEval. There's lots of great evals tools out there that are really strong for doing general purpose evals and have all sorts of suites for things that they can handle, like: is the model, or the system, able to remember things that were said at the start of the conversation later in the conversation? And that's awesome. But what we're doing with the tests that we're putting into AI spec is having this hyper-specialized evals piece specifically for testing the features that are enabled by the AI spec itself, which is the knowledge piece and the API calling. And I'll just tell you about the actual experience I had of building this JIRA integration. The whole reason that we started down this road of adding a limited evals feature to our product was because we were talking to a prospect, a large bank in the Middle East. And they said to us, we want JIRA integration.
Luke Marsden [00:38:25]: Can you build that and show it to us? And we were like, okay, we'll have a go.
Demetrios [00:38:30]: Give us a week.
Luke Marsden [00:38:32]: Yeah, right. I mean, I think it was Thursday, and I said, we'll get it to you by Tuesday. Which was a stupid mistake, but, you know me, you've worked with me.
Demetrios [00:38:41]: Some things haven't changed at all. Exactly.
Luke Marsden [00:38:45]: So anyway, we were like, okay, well, let's build this JIRA integration. And this prospect also gave us this list of things they wanted the JIRA integration to do. And I was like, this is a perfect use case for building this kind of basic evals piece that we wanted to add. Because the questions were things like: what issues are there? Then the most interesting one, I thought, was: write code for this issue. So suppose you have a front end developer and you want to give them a cheat code, where they can start work on their issue by talking to this system. You can say, write code for issue DS9, and it will go and retrieve the issue and then start writing code that will solve it. So get the developer started on being more productive. And I was like, okay, well, this is super interesting. I plugged in the Jira OpenAPI spec and started testing it.
Luke Marsden [00:39:45]: And of course it doesn't bloody do what you want it to do. Like, so why would it be easy?
Demetrios [00:39:52]: What's that? Why would it be easy?
Luke Marsden [00:39:54]: Why would it be easy? Exactly. I mean, fortunately we've got all these fields in the AI spec that allow you to customize the prompting that goes into the API calling. But I'll tell you why it didn't work out of the box. It didn't work out of the box because a big part of the JIRA API is that you have to send it JQL, which is the JIRA Query Language. So Jira has invented this whole sort of SQL-ish syntax for how you search over your JIRA issues. And the models that I was using, even though I was using Llama 3.1, and I tried Qwen, and I tried a bunch of these, because we do have.
Demetrios [00:40:27]: They were not versed in JQL.
Luke Marsden [00:40:29]: They didn't, they were not well versed in JQL, or they weren't able to just figure out from the OpenAPI spec that they needed to write JQL, and somehow recall how to write JQL and how to handle all these different cases. So at that point I was like, well, it's time to write this eval system I've been wanting to write, so that I can build the JIRA integration. Because what you really want is to be able to iterate on the prompting that goes into making these API calls and know that it keeps working. So I had these seven different cases; I wanted to write seven different tests. And if I'm changing the prompting to try and get "show me overdue issues" working, I want to know that that hasn't broken "what issues are assigned to me?", which was one of the other cases. So it was a really good opportunity to put this together. And then the other signal that we had for, oh, it might be useful to build out this mini evals thing for these specific AI spec features of knowledge and API integrations, is that we have this wonderful customer in Germany called AWA or AAVER Networks, who are brilliant.
Luke Marsden [00:41:44]: They've been with us since the very beginning, and they were doing exactly the same thing on the app that they were building with Helix. And their app is a natural language interface to renting heavy machinery. So if you're a building site manager and you're driving from Hamburg to Berlin, at the moment you phone someone up and you say, have you got a crane that can handle 3 tons in Berlin next Thursday? And what this company is doing is a natural language interface for that, so that they can scale their sales team. Which is kind of an exciting use case.
Demetrios [00:42:26]: And so just a little side note, for the JIRA integration specifically, you are writing one big prompt that is trying to encapsulate these seven different use cases.
Luke Marsden [00:42:40]: Almost. So the way it actually works, and it's quite nice that we've probably got the time to dive into how the API integration works, is that it comes in three parts. There's a classifier, a request builder and a response summarizer. Those are the three bits, and I'll explain why. So the classifier takes the user's query and it just decides: do I need to run an API call based on the user's query? And that's a lot like function calling in the OpenAI API, but actually we didn't use function calling in the implementation, because we started before function calling was even available in the open source models. So actually we just do this all with prompting and asking the model nicely to output JSON, which actually works quite well now. So yes, there's these three pieces in the API integration. The first we call "is actionable", which is this classifier.
Luke Marsden [00:43:40]: Like if the user is just asking what's the capital of Paris, sorry, the capital of France, then you don't need to call the product catalog API to get the answer to that question. And so the classifier just determines, like, basically which API to call, if at all. Like, are there any APIs that we want to call? Then the request builder. Yeah, go ahead.
Demetrios [00:44:02]: It's always going to go to the LLM no matter what. And the classifier isn't classifying different LLM calls, it's just classifying which tool it's going to use.
Luke Marsden [00:44:14]: Correct. So the way you can think about this is that there's one LLM call that goes into the system from the user, and that ends up spawning sub-LLM calls that go and do these intermediate jobs that I'm describing. So an LLM call comes in from the user. Like I said, everything is an OpenAI-compatible API; big believers in that. We'll classify the request: is this a request that requires an API call at all? Or is the user just saying, hi, how's it going? My name's Bob. Hey, great to meet you.
Luke Marsden [00:44:50]: I don't need to call the product catalog API yet. But then if it does require an API call, then the second prompt is: construct the API call for me. And that's actually a bit more interesting, because in the first classifier it's like, oh, I might have three different APIs I can call and I know their descriptions, so based on the user's question, I can figure out which API to call. Then the second prompt is saying, oh, in the first prompt I figured out that I want to call the product catalog API, but now I've got a more interesting job, which is: based on the OpenAPI spec, that is the swagger spec, the definition of that API, how do I construct the API call that will answer the user's question? And the output of that sub-LLM call, if that makes sense, is a JSON object that describes back to Helix how to construct that API call. Helix then goes and actually makes the.
Demetrios [00:45:50]: API call, oh, interesting.
Luke Marsden [00:45:52]: The API call comes back, and then the third piece is summarizing the response from the API. So the API gives you a chunk of JSON back. You don't want to just slap a chunk of JSON in the user's face. So the job of that third piece is to turn that back into English, or natural language, and say, here's a nice summary of the response.
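To make those three stages concrete, here is a purely hypothetical trace of the intermediate outputs for a Jira-style query, shown as YAML for readability; the real internal format isn't described in the conversation.

```yaml
# Hypothetical trace of the three sub-LLM calls; the structure is illustrative only
user_query: "What issues are assigned to me?"
classifier:                 # step 1: is this actionable, and which API should handle it?
  is_actionable: true
  api: jira
request_builder:            # step 2: construct the API call from the OpenAPI (swagger) spec
  method: GET
  path: /rest/api/2/search
  params:
    jql: "assignee = currentUser()"
response_summarizer:        # step 3: turn the raw JSON response into natural language
  output: "You have three open issues assigned to you: ..."
```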
Demetrios [00:46:13]: Um, okay, so walk me through it. Now you're doing that with Jira.
Luke Marsden [00:46:17]: Yeah.
Demetrios [00:46:17]: And I say, give me a summary of all of my issues. And so when I say that, first it goes to Helix, and Helix then says, okay, this needs to go to Jira. It will hit another model that says, how do I construct an API call to Jira? And the other model will say, well, funny enough, JQL is your ticket. Here is how you use JQL for a JIRA call. It sends that back to Helix. And then Helix will send the API call with the JQL to the JIRA API.
Luke Marsden [00:46:59]: That's it.
Demetrios [00:47:00]: It gets what it needs and it sends back JSON to Helix. Helix then says, okay, now LLM call again to explain to me what the hell is inside of this big ass JSON file.
Luke Marsden [00:47:15]: That's right. So that's.
Demetrios [00:47:16]: And then, yeah, it goes back to the user.
Luke Marsden [00:47:18]: Exactly, exactly. And so that all happens like in a, in a couple hundred milliseconds.
Demetrios [00:47:23]: I was going to say in like three minutes. It's just that happens in like three to four minutes.
Luke Marsden [00:47:28]: No, it's pretty quick. Like we use the 8B models for this stuff actually. They're nice and fast and. Yeah. But from the user's perspective they just ask, have you got any cranes? We say, yeah, we've got seven different cranes you can rent. Which one would you like? So it's nice and natural from the user's perspective, but you're right, that's exactly what's going on under the hood.
Demetrios [00:47:48]: Wow. Okay. And now the piece that I think spawned this whole little conversation that we're having is the prompt that you are doing to make sure that it's covering each one of your use cases or your test cases. You're basically testing it on each one of these LLM calls.
Luke Marsden [00:48:16]: So I can now answer your previous question, which is: what are the prompts that go into the system? And the answer is that there's three of them. Basically, there's the prompt for the classifier, and all of these can be customized in the AI spec, and you find yourself having to customize them when you do any non-trivial integration, like the JIRA integration. So the is-actionable prompt, the classifier prompt, can be changed, because you might want to give the LLM more information about when to choose a certain API or not, or give it synonyms for things that the users often say that it might not figure out on its own, for which API call to make. Then there's: how do you construct the API request? So we have the prompt for how do I construct the API call, and of course, if the model doesn't know how to write JQL, then that's the prompt where you teach it about JQL. And then there's the prompt that's used for summarizing the response back to the user. And that's super useful for customizing the format of the response. So for example, for this customer in Germany, in their app all the prices are always in euros, and if something comes back from the API then you have to say that it's available. I remember that off the top of my head because I had to customize those prompts for that app, because the API doesn't say the prices are in euros, but you want to quote them to the user in euros.
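A sketch of how those three prompt overrides might be expressed alongside the integration in the spec; the field names here are illustrative, not a documented schema.

```yaml
# Illustrative only: three customizable prompts attached to an API integration
integrations:
  - name: jira
    openapi_spec: ./jira-openapi.yaml
    prompts:
      is_actionable: |        # step 1: when to call this API at all
        Call the Jira API for questions about issues, tickets or sprints.
      request_builder: |      # step 2: teach the model JQL by example
        Examples of JQL:
        - "What issues are there?" -> empty JQL (returns all issues)
        - "Get issues assigned to me" -> assignee = currentUser()
        - "Write code for issue DS9" -> key = DS9
        - "Show me overdue issues" -> duedate < startOfDay()
      response_summarizer: |  # step 3: how to present the API response to the user
        Present issues as a short list with key and summary.
```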
Luke Marsden [00:49:45]: So you can put all this like sort of useful tweaking and finessing in the way that the interaction works in there. So yeah, you've got these three different prompts you can edit. And now the question is, how do you stay sane while you're trying to make all of these different cases work? And the answer is you do test driven development just like you do with.
Demetrios [00:50:06]: Software, but you're testing the output, the final output.
Luke Marsden [00:50:09]: Correct. And I can now answer your other question, which is: are we testing the individual prompts or the whole thing? It's an integration-style test, so we're testing the whole thing. And so the tests, I mean, I'm looking at them here on my screen, they say something like: test JIRA issue search. The steps are that the user asks, what issues are there? And the expected output is a list of nine issues with brief summaries and/or details. And that's it. We're just writing these test cases in natural language as well.
Luke Marsden [00:50:44]: And then we're using the LLM as a judge in order to take the response back from the system and tell you whether it's good or not. Interesting. And this is how you do CI/CD for gen AI with open source models.
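And a hedged sketch of what those natural-language tests might look like in the spec, checked by an LLM as a judge; again, the field names are illustrative.

```yaml
# Illustrative only: natural-language test cases evaluated by an LLM as a judge
tests:
  - name: jira-issue-search
    steps:
      - prompt: What issues are there?
        expected_output: A list of nine issues with brief summaries and/or details.
  - name: overdue-issues
    steps:
      - prompt: Show me overdue issues
        expected_output: Only issues whose due date is in the past.
```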
Demetrios [00:50:59]: Yeah, because basically that LLM as a judge is coming in when it is summarizing the JSON that it's getting out of the API on that last call, correct? Yeah.
Luke Marsden [00:51:13]: And so I'll give you a couple of concrete examples, because this might be interesting for how I actually built this JIRA integration. So to begin with, I plugged in the OpenAPI spec and I started asking it questions like: what issues are there? What issues are there that are assigned to me? What overdue issues are there? It was just giving me garbage, and I looked into it. I clicked the debug button inside our test thing, and it shows me all the internal LLM calls, the three steps that I described. And I could see, ah, it's generating garbage JQL; this thing doesn't know how to write JQL. And so I started Googling. I think I might have asked Claude, to be honest, but I started looking up, how do you write JQL? And then I started adding to the second prompt, which is the one where you help the model construct the API call.
Demetrios [00:52:09]: The API call.
Luke Marsden [00:52:10]: Yeah, here are some examples. And it literally says in the prompt, I'm reading out examples of how to specify JQL: "What issues are there?" maps onto the empty string; if you just send no JQL at all, it will return all the issues. "Get issues assigned to me" maps onto assignee = currentUser(). "Write code for issue DS9".
Luke Marsden [00:52:35]: Well, you need to look up DS9, so you do key = DS9. And then "show me overdue issues": the JQL for that is duedate < startOfDay(). So I was like, okay, I can actually do this. And the way I wrote those was: I wrote a failing test, I ran the test suite, and the LLM as a judge checks all of the answers and says, this failing test for "show me overdue issues" isn't giving you the right answer.
Luke Marsden [00:53:05]: And then I can go in, inspect the JQL, and check what it's getting wrong. And then I can just add another item to that list of examples where I'm teaching it how to write JQL. And then the really satisfying thing about that is that you then run the tests again and you see they pass. And now you know that they're always going to get checked, and that that use case is always going to carry on working in the future, even if someone else goes in and changes the prompting for something else to enable some other use case. Because, and this is how we do CI/CD, you don't merge a failing pull request to main. Right.
Luke Marsden [00:53:45]: If you've got a pull request and there's tests failing, then in general the policy in engineering teams is, unless it's a super emergency, you don't merge the pull request to main.
Demetrios [00:53:54]: Yeah, I wouldn't know anything about that. I just force push.
Luke Marsden [00:53:57]: Well, fine, so do I. This is.
Demetrios [00:54:03]: And then I throw in slack.
Luke Marsden [00:54:05]: The.
Demetrios [00:54:05]: This is fine.
Luke Marsden [00:54:07]: Yeah, yeah.
Demetrios [00:54:07]: Dog.
Luke Marsden [00:54:09]: I don't even do that. No, I don't force push. But I do commit to main sometimes. But I'm a very bad person and I should feel ashamed.