How to Systematically Test and Evaluate Your LLM Apps
Gideon Mendels is the CEO and co-founder of Comet, the leading solution for managing machine learning workflows from experimentation to production. He is a computer scientist, ML researcher and entrepreneur at his core. Before Comet, Gideon co-founded GroupWize, where they trained and deployed NLP models processing billions of chats. His journey with NLP and Speech Recognition models began at Columbia University and Google where he worked on hate speech and deception detection.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
When building LLM applications, developers need to take a hybrid approach that draws on both ML and software engineering best practices. They need to define eval metrics and track their entire experimentation process to see what is and is not working. They also need to define comprehensive unit tests for their particular use case so they can confidently check whether their LLM app is ready to be deployed.
Gideon Mendels [00:00:00]: My name is Gideon Mendels. I'm the CEO and co-founder of Comet. I make my own espresso at home. I grind and weigh my beans. And depending on my mood, I either drink an Americano or a cappuccino.
Demetrios [00:00:17]: What is happening, good people of Earth? We're back for another MLOps Community podcast. I am your host, Demetrios, and I've got a little bone to pick with some of y'all. There is this trend going round. And instead of saying, I'm just following up or I'm circling back, tell me if you've noticed, the new trend is just a gentle reminder. Just a gentle ping on this. There's nothing gentle about it. You're bugging the shit out of me. I'm not answering for a reason.
Demetrios [00:00:52]: Get over it. No more gentleness. Let's be brutal. Let's just be crass. So, talking with Gideon today, I had a blast because he's been in the traditional ML world and he recently open sourced Opik, which is in the GenAI world. It is all around evaluation, and it's a tool to help you evaluate your LLM or GenAI product output. And we talked at length about how these two worlds, the experiment tracking that he's been doing for the past eight, nine years, and the evaluation tracking world, they're not different. They're actually eerily similar.
Demetrios [00:01:47]: And so let's talk to him about it. A huge shout out to the Comet team for sponsoring this podcast. I gotta say it. They are the reason that we are able to do what we are able to do. Thank you to them and all of our sponsors here at the MLOps Community podcast. Let's jump into this conversation. And as always, a gentle ping, a gentle reminder to pass this on to one friend. So, first thing we got to talk about is, what is that solar panel behind you?
Gideon Mendels [00:02:30]: Um, that's a photo of an astronaut. I think they're doing a spacewalk, working on the ISS, probably. I'm not sure, actually, but yeah, it's in our office. I should probably ask where this originated from.
Demetrios [00:02:45]: I'm literally looking at it. And for the people that are just listening in some world, some alternate reality. Before you told me what it was, I saw it as a doorway behind you. And it was like, I thought you were back in, like, the laundry room. And then there was some random solar panel that was stashed behind you overhead that probably was decommissioned or something.
Gideon Mendels [00:03:08]: Thankfully, the startup, the early startup days of working from the laundry room are, you know, six seven years behind us. But, yeah, hey, it's never too late.
Demetrios [00:03:19]: You've matured as a company. You don't need to both be doing your laundry and building your startup.
Gideon Mendels [00:03:25]: Exactly. Life goal.
Demetrios [00:03:28]: Exactly. So I'm excited to talk to you about Opik. I also want to just, like, dig into evaluation and evaluation metrics and what you've been seeing out in the field. Also, I think it's probably worth talking a bit about experimentation, because you're kind of the OG in that. First off, let's kick it off, man, with, like, you've been out there, you've been talking to a ton of people. What are some key evaluation metrics that you've seen people using?
Gideon Mendels [00:03:57]: Yeah, so kind of taking it from the top, right. For some context, when we talk about evaluation metrics, right, like you said, we've been doing ML experiment tracking for almost seven years, and evaluation is not a new concept. Right. I think it definitely behaves a little bit differently when it comes to LLMs versus kind of more supervised models. Depending on the task, we have, you know, accuracy, F1 scores. LLMs by themselves have kind of more traditional metrics, such as perplexity, which is not just a company name, it's an LLM metric, or language model metric. But yeah, I think in the context of people building on top of LLMs, so not necessarily training or fine tuning, things look a little bit different.
Gideon Mendels [00:04:45]: Right. If you have labels in a supervised model, then all the kind of classical machine learning metrics make sense. When we talk about LLMs, the way we view it, and it's definitely task dependent, there's essentially three types of metrics you could use. So let's take an example use case: we're building a RAG chatbot, internal or external. The basic ones are just the deterministic assertions. I don't want the model to ever say, I'm an AI model, so I can search for that string: assert does not contain.
Gideon Mendels [00:05:23]: So those are the classic ones from the software engineering world. But as we know, the output of an LLM could look very different from a string perspective, but semantically could mean exactly the same. So these deterministic assertions break very quickly, and that brings us to the two other sets of metrics. The first are the, call it, heuristic distance-like metrics. So, for example, if I have a question that I want my model to answer, say, what is the MLOps Community podcast about? I'm fine if the answer doesn't look exactly the same every time, as long as it covers the main things I wanted it to. So I can measure the distance with something like a BERT-based distance, or any type of distance between two embeddings, between my preferred answer and what I got. So those are the second set. And then the last one, which is a slightly newer concept, is using another LLM to produce a metric on an original LLM response, which is what's called LLM as a judge. So we can use it for the same example as earlier.
Gideon Mendels [00:06:35]: Instead of measuring the distance with cosine similarity, I can ask another LLM, hey, do both of these answers look the same? So that's, at a high level, what people use. There's still a lot to uncover in that space. It's still hard, but I would say those are the three main ones.
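To make the three metric families Gideon describes concrete, here is a minimal Python sketch covering a deterministic assertion, an embedding-distance heuristic, and an LLM-as-a-judge check. The model names, judge prompt, and function names are illustrative assumptions, not Opik's built-in metrics.

```python
# Sketch of the three metric families discussed above. Model names and the
# judge prompt are illustrative; swap in whatever your stack uses.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def passes_deterministic_assertion(answer: str) -> bool:
    # 1. Deterministic assertion: the output must never contain this string.
    return "I'm an AI model" not in answer

def semantic_similarity(answer: str, reference: str) -> float:
    # 2. Heuristic distance: cosine similarity between the embeddings of the
    #    model's answer and a preferred ("golden") answer.
    vec_a, vec_b = embedder.encode([answer, reference], convert_to_tensor=True)
    return util.cos_sim(vec_a, vec_b).item()

def judge_equivalence(answer: str, reference: str) -> str:
    # 3. LLM as a judge: ask a second LLM whether the two answers mean the same.
    prompt = (
        "Do these two answers convey the same information? Reply YES or NO.\n"
        f"Answer A: {answer}\nAnswer B: {reference}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```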
Demetrios [00:06:55]: So one thing that I want to point out that you mentioned and articulated beautifully is there's a difference between the traditional ML evaluation because it was more evaluation of the model when you were training it. And now with LLMs you have evaluation, but that's once it's out in the wild or once it's already been trained, it's already outputting something. You're evaluating the output on that.
Gideon Mendels [00:07:24]: Yeah. Well, so you're always evaluating the output, right? If you have a simple classification model, you compute the metrics based on the expected output versus what the model actually gave you. But you're right in the sense that it's no longer a model that you even have control over in terms of training and fine tuning. You could be using OpenAI's GPT-4 or something, but you still need to evaluate how well you're doing. So you don't control the model, the model weights, the fine tuning, but you do control the prompt and, in some cases, what some call production hyperparameters, such as temperature and maybe seed. But yeah, so it is quite different.
Demetrios [00:08:08]: So then the other piece that you were talking about is this: if you understand the question that's being asked, or if you understand what you're going for, you can use BERT-style metrics, right? And that's if you know what you want. But a lot of times you have no idea how things are going to be used. Or maybe it's not that you have no idea. You kind of assume or expect folks to use your AI feature or your AI product in one way, and then maybe somebody comes out of left field and starts using it in a different way. So you can't really use those BERT metrics.
Gideon Mendels [00:08:48]: Yeah, so it's a good point. Right. It's very use case dependent. And sometimes you're right, there are cases where you have no idea, you just have to put it out there and see what people do with it. If you're building a RAG or something, I would argue that in most cases, you should be able to create a, call it, golden dataset of some example questions and answers. But either way, your point is very valid, because even if you created that and you have some kind of testing and evaluation before you release it, once it's in production, people will definitely surprise you. So that's a big part of what Opik does.
Gideon Mendels [00:09:24]: Right. And there's definitely other solutions that do that. But the idea is you can bootstrap these datasets. You put your, let's call it, chatbot in production. Opik does all the tracing of all the responses, the questions, the chat sessions, everything happens automatically. And then you basically get a dataset that you can either manually label, human label, so a person can go and say, hey, this answer was against our company policy, or, hey, this answer had bias, or this answer was just plain wrong. And then you can also use some of the built-in evaluation, like LLM-as-a-judge metrics, and then you bootstrap this dataset with automatic and manual labeling. So the next time you iterate, you can actually run through those questions and answers.
Gideon Mendels [00:10:16]: So that's, I think, when I speak to teams and companies trying to deploy especially RAG applications, and I would say that's the majority of LLM use cases, not the only ones, but the majority, that's what separates those who made it to production, and production could be an internal solution, right, but still with real active users, versus the ones that put it out there and it kind of blew up. So it's this more data science oriented methodology versus the kind of more software engineering one that makes the difference.
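A hedged sketch of the bootstrapping loop described here, independent of any particular tracing SDK: production question-and-answer pairs are collected, auto-labeled by a judge callable, and appended to a golden dataset that humans can later review and correct. The field names and file path are assumptions for illustration.

```python
import json

def bootstrap_golden_dataset(traces, judge, path="golden_dataset.jsonl"):
    """Turn raw production traces into a labeled dataset for future eval runs.

    `traces` is an iterable of {"question": ..., "answer": ...} dicts captured
    in production; `judge` is any callable returning an automatic label (for
    example an LLM-as-a-judge verdict). Reviewers can later flip `label_source`
    to "human" and overwrite the label.
    """
    with open(path, "a", encoding="utf-8") as f:
        for trace in traces:
            record = {
                "question": trace["question"],
                "answer": trace["answer"],
                "label": judge(trace["question"], trace["answer"]),
                "label_source": "auto",
            }
            f.write(json.dumps(record) + "\n")
```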
Demetrios [00:10:57]: Yeah, explain that a little bit more, because I do like what you're saying there. The software methodology is, I know exactly what I do not want. So I'm going to say, if you pump out this string, just cancel that. Right. But there's going to be a million different ways that an LLM can say that same thing. So all those edge cases, you're going to have to either spend a ton of time trying to cover them all in your test cases, or you do something else. And I think that's what you're getting at with the data science methodology.
Gideon Mendels [00:11:34]: Yeah, yeah, absolutely. So, first of all, it's really fascinating to see. I would even say the majority of these people that I spoke to building LLM applications are software engineers, which is great, because being a good engineer helps you move so much faster. There's obviously data scientists who are amazing engineers as well, but it's like a slightly different paradigm. But to your point, it's very accurate. In the software world, testing is a huge component. There's books and methodologies and unit testing, integration testing, smoke testing.
Gideon Mendels [00:12:10]: There's a lot of ways to test software, but essentially all of those break when you try to test an LLM application, because to your point, I can write assertions, unit test assertions, integration test assertions, expecting a string, but with an LLM, you know, your LLM provider updates their version and you're going to get a different string, and probably, hopefully, a better one, but you don't want to break the build because of that, and you don't want to go and change all these strings again. That's where you need that fuzziness, if you want to call it that, which adds risk as well. But it's this deterministic versus non-deterministic nature of software and data science.
Demetrios [00:12:53]: So the fun part there is the idea of software engineers moving more towards understanding data science paradigms, and then vice versa. Data scientists moving more towards understanding software engineering paradigms, which I think a lot of people on the Internet and on social media have been screaming about for a good couple years. Like, data scientists should learn how to code, or they should at least learn how to use git or software engineering principles. At the very least, you're the first person that I've heard to say vice versa. Like software engineers, if you're dealing with LLMs, you want to start to understand the nuances of that fuzziness.
Gideon Mendels [00:13:36]: Yeah, yeah, exactly. So, by all means, I think data scientists should pick up as many software engineering skills as possible. The data scientists I've met in my career were, like, brilliant software engineers as well. Right? It just goes together. But I think, so, first of all, I think what we're seeing is the two fields converging. Don't get me wrong, you're still going to have the pure research, data science, machine learning people who deeply understand the math of these optimization algorithms and loss functions.
Gideon Mendels [00:14:17]: There's still definitely going to be that group, but I think the majority of the use cases around machine learning and AI will be this convergence of, call it, software engineer and data scientist, being able to use both of those things. But there's still a lot in the data science methodologies, just the way of thinking, that I think software engineers who have never been exposed to it before could benefit a lot from. Right? In data science, it doesn't matter which task you're working on or which algorithm or model you're using, it's an experimental approach and specifically a metric-driven approach. So, assuming a task, you have a dataset, hopefully a golden dataset, so you know what the right answer, what the right output, should be. And then you start changing, in this case, the prompts, the pipeline, these parameters. And every time you run one of these experiments, you test it against the dataset. Now, you still want to look at the data and all of that, but it's not this vibe check approach.
Gideon Mendels [00:15:28]: It's more of a metric-driven approach. Then with that, it opens up this kind of can of worms of all the things you should be aware of, overfitting and such. It's still a risk in this case, even if you're not controlling the LLM. If you have a small golden dataset and you're really tuning the prompt to, like, ten or 15 answers, you might overfit, meaning it will look perfect on those 15 answers. But then when you start putting it into production, it will fail. So that way of thinking is just, it's really the only way to go here. So it's exciting to see a lot of software engineers pick it up. And that's a very huge focus of what Opik is.
Gideon Mendels [00:16:08]: It's kind of trying to bridge those two paradigms.
Demetrios [00:16:11]: Have you seen ways to unit test LLMs that are super helpful?
Gideon Mendels [00:16:20]: Yeah. So I guess it's, again, the definition. The pure definition of unit testing, I mean, there's two sides of unit testing. There's the philosophy of what you should test, and what does that look like? Obviously, with an LLM, you don't control the internals. But I think the idea here is, again, having some deterministic, simple assertions is very helpful. But they're very brittle in that case. So that's where you want to start including more heuristics and LLM as a judge.
Demetrios [00:16:54]: And talk to me about LLM as a judge, because I know there's 50 different ways to do that. Have you seen certain ways that folks are gravitating towards more?
Gideon Mendels [00:17:09]: Yeah. So, right, LLM as a judge has its benefits and definitely some disadvantages. I wouldn't say there's, like, one specific judge, call it hallucination detection or bias detection. I wouldn't say that there's necessarily a way how everyone does it. Right? Like, LLM as a judge is a fancy name. It just basically means you take the output of your original LLM and then you tack onto it a prompt that asks a different LLM, or even the same one, hey, does this contain bias?
Demetrios [00:17:45]: Right, right.
Gideon Mendels [00:17:46]: So obviously people develop those prompts, and those prompts also perform differently on different LLM providers. So I wouldn't say there's necessarily a method that everyone uses. It's really exciting to see that OpenAI, at their last Dev Day, actually released some form of LLM-as-a-judge mechanism in their playground, so they wrote the prompts themselves, and knowing OpenAI, they probably did a good job developing those prompts. So that's great to see some form of consolidation there. The one thing I would say I've seen people do: LLM as a judge, and LLMs in general, depending on the scale, can be very expensive. So I've seen customers use a lower-cost or cheaper LLM in production, but then use an expensive judge to test it. So then you kind of get the best of both worlds.
Gideon Mendels [00:18:48]: That's one approach I've seen.
Demetrios [00:18:50]: Interesting. Now, coming back to Opik, because I wanted to dig into one of the pieces that you have around it, which is that track function. And for those who don't know, my understanding of it after reading the GitHub page, because it's fully open source, which is awesome to see, is that you can track just about anything that you want when it comes to these LLMs, as long as you tag it with like hey, this is important to me, I want to track it. Is that correct?
Gideon Mendels [00:19:27]: Opik is fully open source, with exactly the same functionality we have in our cloud-hosted version. So it's not a crippled product or anything like that. You can actually run it for production use cases at very large scale just from the open source project. But to your point, yes, it comes with built-in integrations with the major LLM providers and the major libraries like LlamaIndex, LangChain, Ragas, and so on. But even if you're not using any of those things, as you pointed out, we have this really cool way of auto-instrumenting your pipeline. So just by adding the track annotation above your function or class, we will track basically everything that is going on in your, call it, chain or pipeline. So yeah, I think you had the right idea there. And we've seen people track pretty different use cases and products with it.
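Based on that description and the project README, instrumenting a pipeline looks roughly like the sketch below; the function bodies are placeholders and the exact decorator options may differ, so treat this as an assumption and check the Opik docs.

```python
from opik import track  # pip install opik

@track
def retrieve_context(question: str) -> list[str]:
    # A vector-store lookup would go here; the decorator records inputs/outputs.
    return ["(retrieved chunk 1)", "(retrieved chunk 2)"]

@track
def answer_question(question: str) -> str:
    context = retrieve_context(question)  # nested tracked calls become spans
    # The LLM call would go here.
    return f"Answer grounded in {len(context)} retrieved chunks."
```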
Demetrios [00:20:32]: So that's everything, basically, on your side. Or if I'm creating an app, I can track whatever I want on my pipeline. I also wanted to get into: can we track the responses of folks, or signals that will help us understand if the response was valuable? Something like if someone copies and pastes this, or there's the thumbs up, thumbs down, which I think we as a community have come to understand is not that valuable, just because people that give you thumbs up or thumbs down may not be the best user segment that you actually are trying to shoot for, or whatever. There may be a lot of reasons why that can be biased or not the best signal. But a great signal could be you get one answer and then you don't ask for any follow-up questions and you copy paste. And so that could be something that is really interesting. Or if there are follow-up questions, then you can see what parts of the follow-up questions it's trying to dig into. And all that stuff I can imagine could be valuable for the app developer.
Demetrios [00:21:45]: Have you seen any of that?
Gideon Mendels [00:21:48]: Yeah, absolutely, right. So it depends on the use case. Opik is not just for, call it, chatbots or RAG applications. It's for essentially any LLM-powered feature. And by default, we'll track the input and output of that LLM, too. Definitely in the chat example you gave, you'll definitely see the responses. Now, I think where you're going with this is like, hey, can we feed in additional, call it, features, BI features or other ones, so we can later understand how good the response was. I haven't seen people use a copy-paste one.
Gideon Mendels [00:22:26]: That's actually a cool idea. But we've definitely seen people feed in a downstream, call it, product analytics event. You can create essentially all these quality evaluation criteria in Opik, again, both ones that you can manually tag and automatic ones. In your example, I think you said, hey, this user, based on this answer, did an activity in the product. You can add that as a, call it, categorical annotation. And then again, you bootstrap this dataset, so later, when you're reworking and improving your prompt, you can actually test how the new prompt does on those things. So it's pretty cool. Obviously, you wouldn't be able to simulate whether the user did that or not, but it's a very good proxy for quality. I think it brings up the question, can we actually feed it into the training process down the road? That's an open research question.
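One way to sketch the categorical-annotation idea: a downstream product-analytics event is mapped onto a quality label keyed by the trace that produced the answer, so it can later be merged into the evaluation dataset. The event names and labels here are hypothetical.

```python
def label_from_product_event(trace_id: str, event: str) -> dict:
    """Map a downstream product-analytics event onto a categorical annotation.

    A behavioral signal (copied the answer, completed a purchase, asked a
    follow-up) becomes a proxy quality label attached to the originating trace.
    """
    mapping = {
        "copied_answer": "useful",
        "completed_purchase": "useful",
        "asked_followup": "partially_useful",
        "abandoned_session": "not_useful",
    }
    return {"trace_id": trace_id, "annotation": mapping.get(event, "unknown")}
```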
Demetrios [00:23:26]: Oh, yeah, that's fascinating to think about. And the other piece to think about is not just, all right, we get that information, but then with that information, what do we do with it? Right? So, cool. Yeah, I understand that when we have this certain snapshot of prompts and temperature and RAG retrieval, whatever, the whole system, we have that snapshot that is tracked, and we see that it performs higher on these types of queries or this type of task, and we think that because we're seeing more copy paste, or we're seeing whatever that metric is that we're tracking downstream, like you said, the product analytics metric. Now we have that. What do we do with that?
Gideon Mendels [00:24:15]: Yeah, so I think we're talking hypotheticals. It's very use case dependent. But for example, if you're doing an A/B test, so you're running two different LLM variations or prompts in production, and you're testing these downstream product analytics metrics, very quickly you can know in production which one's better, which is pretty big, right, when you think about it. Now, that's why I'm saying it's very dependent. In some cases, you don't have those things. And then I think in some scenarios, maybe, and I'm sure if we look on arXiv, there's 50 PhD students working on this, but I think in some cases where you have those kinds of production labels, call them, you can start thinking of how you merge the traditional ML approach of things and this LLM approach.
Gideon Mendels [00:25:19]: So just thinking out loud, if you have an embedding vector of features, so, like, user information, for example, and then you have some kind of downstream supervised task, like, did the user buy the product or not, I'm just trying to keep it simple. Then, sure, you can add that context information in the prompt today, in many ways, which is what people are doing, but it'd be pretty cool to think, can we actually fine-tune the model with these features? It's similar to how people do RLHF, which in this case is not human, or I think there are direct approaches, DPO, if I'm not mistaken, where you basically train the model on these preferences. Yeah, so it'd be cool to see. It is so early and this space is moving so fast. So I think as companies and use cases start hitting the limits of what they can get by just changing up the prompt or their RAG or vector database parameters, we'll start seeing more of these sophisticated approaches.
Demetrios [00:26:32]: It's almost like, because it's been, quote unquote, democratized, we all can have our wacky ideas and just test them really quickly. And so that's where you get, hey, you know what, if we use a graph database, it kind of works better. Let's try this. And now we've got this thing called GraphRAG, right? And somebody can come out with a random paper, and next thing you know, a few folks are testing it and seeing if it's reproducible and if it's actually giving them some lift in what they're doing. And so I wonder if that is an area where we can see more exploration. That would be really cool, just to marry those product analytics metrics with more of the training process to say, can this give me a better LLM? Or maybe there's nothing there and it's absolutely a dead end, but it's definitely fascinating to ponder. And I'm sure it's going to be somebody else that's doing it, and neither you nor I are going to go and try and figure that out.
Demetrios [00:27:37]: But if anybody out there is listening and they are working on this, I would love to talk to them, because it's really cool to think about.
Gideon Mendels [00:27:43]: Super cool to think about. And that's what I love, you know, about kind of the research world, and, like, my past life before Comet, that's what I did. In many cases it's worth throwing spaghetti on the wall and seeing what sticks. But once someone finds an approach that shows promise, people build on top of it. And that's how we got to where we are. Right. Transformers and the Attention Is All You Need paper are a mix of a bunch of different approaches that we already had at that time.
Gideon Mendels [00:28:19]: Right. I was actually at Google when that paper came out. I had nothing to do with it, but it was written, like, a few buildings from where I was.
Demetrios [00:28:28]: Nice.
Gideon Mendels [00:28:28]: And we were using, at the time, LSTMs with attention. And sure, there's a lot of differences, but also, it's not that different. And I think now people are talking and starting to show that at the right scale, you can even make those older algorithms, like RNNs, LSTMs, GRUs, perform closely to Transformers. So, yeah, it's just really cool to see how much progress we've made based on, like, you know, thousands of people just testing different ideas and approaches.
Demetrios [00:29:04]: Yeah. Getting the ability for it all to fan out. And then you have survival of the fittest happening. And so, absolutely, it's cool to see. And it also kind of dovetails into this next topic that I wanted to talk to you about around version control and almost like experiment tracking, you have parallels with evaluation. Right. And it feels like they're almost like brothers or sisters. They're siblings in a way, because you want to be able to, in experiment tracking, you want to be able to have a snapshot of everything so you can go and reproduce it and be able to get that model or that same output and the same with like, LLMs and evaluation, and being able to have a snapshot and say, okay, now I want to be able to get that output all the time.
Demetrios [00:30:03]: I don't want it to be just some random thing, because I had one prompt that was tuned really nicely and I forgot what I saved it as, or I forgot exactly how I worded it, and now I can't ever get that output again.
Gideon Mendels [00:30:17]: Yeah, no, that's amazing, because that's really how we got to build Opik, right? Because, you know, Comet, our original focus is on the ML side and experiment tracking, right? Like, we have over 100,000 data scientists using that. And we power some amazing ML teams at, like, you know, Netflix and Uber and Stability AI and Cisco and many, many more. And what we started seeing is our own user and customer base kind of trying to hack the existing experiment tracking solution for these use cases. Because when you think about it, you're right, they're very, very similar. So in traditional ML, what do you control as a data scientist? You control the dataset, you control the hyperparameters, you control the code or the algorithm, and then, part with intuition, part with methodologies, you keep changing those things until you get a metric that's satisfactory. I'm not even getting yet to the reproducibility side of it, just the nature of it. You're testing many, many, many things, and you're trying to get a better result. When you think about these LLM applications, yes, you control different things.
Gideon Mendels [00:31:34]: You don't control them. I mean, let's start with what you do control. You control the prompt, you control some LLM parameters. If you're building a RAG, you have all these, call it, hyperparameters around chunking and configuration of how the database does the re-ranking and so on. And then you can create a test dataset, like we spoke about before, and you can continue iterating on those things until you get a result that's satisfactory.
Demetrios [00:32:04]: Yeah. The embedding model too. That's one that you could.
Gideon Mendels [00:32:07]: Yeah, absolutely. Absolutely. You want to test different embedders, you want to test different LLM providers. You're searching that space, it's very, very similar. Yes, you're changing different things, but you're still blindly searching a space, right? You can't run gradient descent on it, you have to blindly, with intuition, of course, and knowledge, start tweaking things until you get a result that's good enough. So it was really fascinating to see users trying to use Comet experiment tracking for that. You can get pretty close to what you need. But then we want to track the traces and the spans so people can look at the data.
Gideon Mendels [00:32:51]: We want to include these built-in evaluations, because that is what we saw customers needing to manually implement all the time. So, yes, very similar. And then that's the main layer of the value proposition of experiment tracking, whether it's in Opik or Comet, which is, I want to help you build a better application, a better model. But on top of it, and you called it out, you have these notions of reproducibility. I have multiple people working on this project, I want to pass it to someone else, they need to know exactly what I did to get to this result. And then the other piece is collaboration. If you don't use something like Opik or Comet, it's basically impossible to collaborate.
Gideon Mendels [00:33:39]: You can copy paste logs to Slack all day long. But Demetrios, if we're collaborating on something and I want to show you, hey, this is the prompt, these are the outputs, how do I share this with you? Today I put it in an Excel sheet. Sure, you can use Excel for everything, but it's really cool to see two products that have so much in common, also from a user and customer base perspective. When we identified this, call it 18 months ago, we decided to double down on Opik and build it. And we wanted to give it back to the community in this case, because we have a strong foundation with our existing platform. So we made it completely open source as well.
Demetrios [00:34:27]: It's almost like the microcosm of the macrocosm we were just talking about. I want to highlight that real fast. What you're doing when you're exploring this unknown space in your own prompt pipeline or your own LLM application, or creating a traditional ML model and doing your experiment tracking, that's exactly what we were just talking about with the advancements of AI, isn't it? It's just that for this, it's me going through it on my computer, working on it. But then for the advancements of AI, there's hundreds or thousands or hundreds of thousands of us that are all trying to figure out what the best ways to optimize this are, and we're all just exploring this unknown space together, and then whatever we see hits, hits, and we kind of run towards it. And so it's cool to think about that from the micro to the macro and how there are those parallels. But the other piece that I wanted to mention is this collaboration factor, which I think is a huge part of what's going on, since collaborating is almost like a must in certain ways, because you don't want to just have it all for yourself, or you want to see if you can get a fresh pair of eyes on something. What is it that I'm missing here? What do I not know? And can I ship it over to someone and see?
Gideon Mendels [00:36:03]: Yeah, that's a great question. So essentially Opik has, let's call it, three main views. I'm talking about the UI, right? There's a very robust SDK and an extension to pytest that allows you to add these evaluation metrics, but it depends on what we're trying to collaborate on. So let's start with the basics. We have a dataset, we created a dataset, I can share you a link to that dataset. You can obviously have multiple ones. In the UI, you can look through it, see all the inputs, expected outputs, and so on, whatever columns we have there. Not that different from an Excel table at this point. But then you want to start annotating things, because we established you really want to have this notion of what's good and what's not good.
Gideon Mendels [00:36:54]: So I can share this with you, you can go in and annotate different samples, I can annotate samples, I can send it out to external human annotators as well. And then we can also collaborate on which annotation categories we care about, because it's never as simple as this is good, this is bad, it's more complex than that. Then the last, call it the third main view, is the observability view. We put this in production, you're a product builder, you're an engineer, you want to see how people are using it. With Opik you can actually see all these sessions. For each one you can actually see a full breakdown of the traces and the spans. If you have tools that you're calling within your chain, you can see what their response is.
Gideon Mendels [00:37:49]: So you get just full observability of how people are using it. And then I can select a specific trace and share this with you, Demetrios, and say, hey, this seems odd, or we should fix this. So yeah, it's been pretty cool to see how many people use it. The product's only been officially out for, we're recording this on October 4, it's been like two weeks, and it's already reached, I think, 1,500 stars on GitHub and hundreds of users. So pretty awesome to see how people use it.
Demetrios [00:38:19]: Yes, and I definitely saw it when it came out, and I was very curious, and that's why I'm happy that we're having this conversation. I also wonder, because I wasn't clear on this, are you getting a snapshot of someone's whole setup? Like we talked about, there are all these factors or knobs that you can tune when you're dealing with your RAG pipeline. Let's stay centered on, like, a RAG use case. And you've got the way that you're chunking it, you've got the reranker, you've got the embedding model, the model, the actual LLM you're using, the provider that you're using. All of that feels like it should be caught in a snapshot if you want to reproduce what you want to reproduce. Are you getting all that information too?
Gideon Mendels [00:39:15]: Yeah, absolutely. We have this notion of experiment configuration, which is exactly those things. It's a snapshot of all these knobs and parameters so you can easily reproduce them. But then we go deeper, and in a RAG use case, you don't just want to see what the question and answer were, you want to see what the vector database returned, what was pasted into the prompt template. You can really dig in and see all those things, which helps you iterate on a specific parameter versus the system as a whole all the time. Maybe it's your chunking, maybe you want to add traditional BM25 input as well, which, by the way, seems to be extremely powerful and useful to include. So yeah, it's called experiment configuration.
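As a rough illustration, an experiment configuration is just a snapshot of every knob in the pipeline logged next to the results, so a teammate can reproduce the run. The keys and values below are assumptions, not Opik's schema.

```python
# One snapshot of the RAG pipeline's knobs, stored alongside its eval scores.
experiment_config = {
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
    "temperature": 0.2,
    "seed": 42,
    "prompt_template_version": "rag_answer_v7",
    "embedding_model": "text-embedding-3-small",
    "chunk_size": 512,
    "chunk_overlap": 64,
    "retriever": "hybrid",  # dense vectors plus BM25
    "reranker": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "top_k": 8,
}
```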
Demetrios [00:40:14]: And that is exactly what I was thinking about, is how valuable that can be in the collaboration aspect, because I can send you over all of that experiment configuration and you can pick up where I left off. And so then you can be like, oh, you know what, maybe you have some inkling of something, that two weeks ago you were playing around with a different configuration and you saw it was cool if you switched out one of these parameters, and then boom, you see a lift and you get to see some gains there. I also wonder about the actual output and that dataset that you're talking about, and then being able to zoom in to different pieces of that output and say, you know what, this is doing really well on everything except for maybe this slice.
Gideon Mendels [00:41:10]: Yeah.
Demetrios [00:41:11]: And I imagine you've got something like that because that also doesn't feel so different than the experiment tracking worlds, right?
Gideon Mendels [00:41:18]: Yeah, absolutely right. So on the dataset or the experiment side, one of the things you always want to do is compare against something, whether it's a golden dataset or experiment A versus experiment B. Right. So you have the evaluation metrics and you can look at how this experiment performs versus the other. But when you look at responses, you also want to be able to see an actual text diff. Right. This is how they responded differently. And then, yeah, you can slice and dice based on, say, show me all the dataset items where this evaluation score equals zero, for example.
Gideon Mendels [00:42:02]: And then you look at the ones where you failed, and then you go back and iterate on the prompt and you try to fix it. So that's exactly the process, which, very much back to our previous point, is very much a data science process versus the software engineering one.
Demetrios [00:42:17]: So I can really take that data that is output, and I can dig deep into each little piece of that and understand what it was when it was output this way. Okay, all of these, like you said, whatever it is, if the evaluation score is zero, and I'm looking at all these examples, I can know exactly what was pulled from the vector database or what kind of chunking happened. I can understand if there is something in my pipeline that is not working. It's a little bit easier to debug that.
Gideon Mendels [00:42:57]: Absolutely, absolutely. That's the goal. You mentioned something being wrong in your pipeline. I think that most of what we're talking about is, call it, in this development phase, and then it's an iterative process: you put it in production, you start getting new production data, you label it, you go back to development. Right? Yeah. A huge piece here that I'm very excited about, because I haven't seen anyone try to approach it the same way we do, is the CI/CD aspect of things. Because in software engineering, after you build all your tests, every time you release something, not even into production, right, like every time you create a pull request on GitHub, there's a build and it runs through all the tests, and it gives you so much confidence that whatever you did didn't break someone else's work or some old functionality. But how do you do that when you have an LLM in place, right?
Gideon Mendels [00:44:03]: Yes, we talked about the evaluation metrics, whether they're deterministic or heuristic based, but we really worked with our customers to try to figure out, okay, how do you use this in the CI/CD flow? So what we ended up doing is we built an extension library for pytest. Essentially, within your existing testing code, you can add additional tests and say, okay, this is a heuristic test. I am okay if the distance between the answer in my golden dataset and the answer from the LLM is lower than some threshold. Then you get that additional confidence that you didn't break anything, even if it's not LLM related. If you change something, let's say something's broken in your vector database pipeline, yes, you can write some software tests for that, but this is a really great way to have the confidence to ship. And then that just allows you to move so much faster, because you're not worried all the time that you broke something. So it's been really cool to have that opportunity to take a concept that is so meaningful for software engineering and try to bring it to the LLM world.
Gideon Mendels [00:45:22]: And I'm just excited to get more feedback. I'm sure there's a lot we should fix and improve. So it's been really fun.
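Opik ships its own pytest extension for this; the sketch below shows the same heuristic idea in plain pytest, so the shape of the check is visible without assuming that extension's API. The threshold, embedding model, and `my_rag_pipeline` import are all hypothetical.

```python
# test_rag_answers.py -- fail the build if an answer drifts too far from the
# golden answer. Plain pytest; Opik's pytest extension wraps the same idea.
import pytest
from sentence_transformers import SentenceTransformer, util

from my_app import my_rag_pipeline  # hypothetical: the pipeline under test

embedder = SentenceTransformer("all-MiniLM-L6-v2")

GOLDEN = [
    ("What is the MLOps Community podcast about?",
     "Conversations about running machine learning systems in production."),
]

@pytest.mark.parametrize("question,expected", GOLDEN)
def test_answer_stays_close_to_golden(question, expected):
    answer = my_rag_pipeline(question)
    vec_a, vec_b = embedder.encode([answer, expected], convert_to_tensor=True)
    similarity = util.cos_sim(vec_a, vec_b).item()
    assert similarity >= 0.75, f"Answer drifted from golden (cosine={similarity:.2f})"
```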
Demetrios [00:45:28]: Well, the thing that keeps screaming in my head is feature flags. And have you been thinking about how to incorporate that? Because it's a natural thing that you would want to have. It's, all right, cool, we're going to test out this new one, and we're going to slowly roll it out into production, or we're going to roll it out just in this area or that area, because we have 100% confidence that it's doing well here. And then we might do a little bit more of a rollout. So you're A/B testing on that rollout, or, I know some people call it, like, the champion/challenger kind of rollout. Have you thought about any of that? Or are you just playing on the evaluation side and then reincorporating it back into the CI/CD pipelines, but not quite going as far as the deployment piece?
Gideon Mendels [00:46:25]: Yeah, so Opik doesn't, or Comet, for that matter; we don't do any work around kind of production A/B testing. A lot of people, when we say experimentation, assume that's what we do, but no. Now, it might change, obviously, depending on what users and customers are asking for. But the existing platforms and tools for A/B testing, and essentially every company uses something, right, I think translate well here. So we're not trying to build something that is super established and has been done before and just put an LLM tag on it.
Gideon Mendels [00:47:11]: If the existing tools work well, by all means, let's use them. It's about where the existing tools don't work as well for this new paradigm that we get excited about adding more functionality. That's the history of the company as well. We had Git version control, but it just didn't cover all these other things that we needed from experiment tracking. I guess the short answer is we don't do any A/B testing or feature flag functionality. But there's some great tools out there.
Demetrios [00:47:44]: Yeah, but you work well with those. And then, along those lines, are there ways that you've seen folks architect their pipelines to have them be more bulletproof or resilient? Maybe not resilient, but just be more accurate and have more success in their AI products?
Gideon Mendels [00:48:05]: Yeah, I mean, there's a bunch of ways we can break this apart, right? And there's obviously a lot of knowledge in data engineering and software engineering about how to build reliable data pipelines. Probably not much to add there, I would say, trying not to repeat myself. But I think the main thing that adds this new layer is including automated evaluation testing for your LLM outputs. That's the new thing that is hard to do with the existing knowledge that we have around data and software engineering. And I've seen teams that adopt that approach and suddenly just move 10x faster. So that's been really amazing.
Demetrios [00:48:55]: And there's this notion of the evaluation. Basically, one thing that's hard with evaluation is the human labeling and how expensive that can get. Right. Have you been seeing anything that will help with that, besides the LLM-as-a-judge type of thing? Or is it just, you know what, the best teams hold labeling parties and embrace the whole labeling piece of it?
Gideon Mendels [00:49:30]: Yeah, I mean, sure, there are ways to get closer, but at the end of the day, you're right. Good human-labeled data is priceless, and it's very similar to traditional models. Right. There's definitely companies that will outsource that work for you if you're a smaller team or don't have budget. Labeling parties are always a great idea. But yeah, it's just, you know, crazy: even a small amount, like 50 or 60 samples, on a Q&A RAG use case, will help you so much.
Gideon Mendels [00:50:09]: So I definitely recommend, if you don't have that, just go ahead and label a little bit of data. It also gives you a lot of insights, once you start doing that, about how to design your prompt.
Demetrios [00:50:20]: And have you seen differences between, like, internal versus external facing LLM applications and how to evaluate things there? Because I feel like you can get into hot water with both of them. And so maybe there's no difference. Maybe with the internal stuff you have a little bit more leeway, because you're not going to get thrown on the front page of the Internet, or at least you're less likely to. But I don't know, maybe there isn't a difference and it should be taken with a lot of rigor in both cases.
Gideon Mendels [00:51:05]: Yeah, it depends on the use case. Right. And I think most companies, for that reason, start with internal use cases. There's less risk there. But I mean, look, there's companies building internal HR chatbots or RAG chatbots, and theoretically one can just answer questions about other people's compensation. Right? So there's risk there as well. That's one thing.
Gideon Mendels [00:51:31]: The bigger difference, and the main one, I would say, is usually the scale. Right? Like, if you're putting something in production in front of hundreds of thousands or millions of users, it just gets much, much harder in many ways. So that's more of an engineering thing. Yeah. But it has been cool to see some of our customers starting with internal use cases, getting a lot out of it, moving to, like, a beta, where some of their customers have access to it, and then finally putting it in production. It's such a good feeling to see someone do that. As a builder, it's so hard to build things that are good and bring value and that people are excited about, and seeing our customers do that is literally what gets me up in the morning.
Demetrios [00:52:21]: And when you talk about the scale between the two, internal versus external, I also, since we are really harping on evaluation, how do you manage the scale and doing proper evaluation at scale?
Gideon Mendels [00:52:41]: Yeah. So the first thing we had to do when we built Opik is support that scale, right? Because if you have hundreds of thousands of users or millions of users and a very significant number of QPS, and we want to do the observability side for that, we have to support that. So Opik was built from the ground up, based on a lot of the knowledge we already had around model observability, to support that. So that's the first thing that's hard, right? Like, how do we collect all this instrumentation information from you in real time without slowing down your application, at scale? So that's the first piece. If you're listening and you like that kind of work, definitely take a look at what we did there on the implementation side. We would love to get some feedback. We're pretty excited about it. But then you're right, because the evaluation is also really hard.
Gideon Mendels [00:53:44]: So all these queries we talked about, slicing and dicing the data, running automated evaluations, it just becomes much harder at scale, like many things. But this is where this world is going. We're talking about features and use cases that will be served to millions of customers. And maybe it's the nature of our existing customer base, where we serve some of these very, very large companies, I think about Netflix and Uber and the amount of users that they serve, that we have to think about those things. But it's really hard. And we're still early on, both Opik and the industry in general. So I think if we meet again a year from now, I'll be able to tell you a lot more about what breaks when you get to that scale.
Demetrios [00:54:31]: And one thing that is a classic observability type of feature, or best practice, is setting up alerts. Do you have that ability with Opik, and have you thought about that type of thing?
Gideon Mendels [00:54:46]: Because that's a good question.
Demetrios [00:54:48]: It's also.
Gideon Mendels [00:54:49]: Yeah, but too.
Demetrios [00:54:50]: Right. Because you don't really know. How do you alert? Like, when your software goes offline, that's one thing you can alert on. But when the LLM is giving kind of bad output, are you going to alert for that? How do you alert for that? What does that even look like?
Gideon Mendels [00:55:05]: Yeah, it's a good point. So Opik doesn't support alerts today. It very likely will soon. Alerts, like you said, are really hard, because you very quickly start going into this world of alert fatigue or too many false positives. And we have an observability product for traditional ML around feature corruption, model and concept drift. So it's very hard. Right. Like, if I tell you six, seven, or even five times that something's wrong in your model, and you then take a look and it's fine, then on the sixth time, when something's actually broken, you're just going to ignore the alert.
Gideon Mendels [00:55:43]: Yes.
Demetrios [00:55:44]: The boy who cried wolf.
Gideon Mendels [00:55:45]: Yeah, exactly. Absolutely. Right. So I wouldn't say we have figured that out exactly. But the idea with Opik is, yes, we give you a lot of built-in evaluation functions and metrics, but you can, and probably should in many cases, implement your own. So if you know something that's absolutely critical for you, like, I don't know, I've seen one of our customers who wanted to make sure their chatbot does not mention their competitor. Fair enough.
Gideon Mendels [00:56:17]: So that's a simple assertion, right? Does not contain X. Then you can definitely do alerts on that. So, yeah, that's a good point. I think it's on the roadmap, on the open source roadmap there. But if not, I'll make sure to add it later today.
Demetrios [00:56:32]: Well, also, the thing about alerts, I was trying to think how you would be able to, today, just capture value from understanding the output of the models. And there are those hard lines that you don't want to cross, such as, don't mention a competitor, or if this is some kind of question-answer bot, don't respond to people if they say, give me the answer in JSON or in Python. You don't want to be doing that, because as you saw, probably with the Amazon sales bot, it was doing that. And that's a great way to end up on the front page of the Internet, as we say. One thing that I was thinking about that could be very valuable is if all of a sudden I'm seeing a lot of folks ask the same question, or it feels like there's some kind of anomaly in the output, maybe that could be worth looking into. Like, hey, out of nowhere in the past X amount of time, minute or hour, we've seen an unusually high rate of this or that. That could be valuable because it's like a proxy signal, right? It's something telling you something is happening differently than usual, and maybe you want to go and look into it.
Demetrios [00:58:12]: And that was the only thing that I could think of. That could be a nice alert if you understand the output.
Gideon Mendels [00:58:18]: Absolutely. Yeah, we've seen a lot of those things on our MPM, model production monitoring, product, which is designed for teams that control the model, versus an LLM application. So, yeah, it'll actually be pretty cool to see how those things converge. I would say the majority of use cases I've seen out there are very early, meaning that kind of observability, they're not even thinking about it, they're just trying to get something working in its basic form. But I have no doubt we'll get there. If you think about the software side, the APMs, the Datadogs and New Relics of the world, I have no doubt we'll absolutely get there eventually.
Demetrios [00:59:04]: Well, I remember when we're a year.
Gideon Mendels [00:59:06]: It might take ten, but we'll get there.
Demetrios [00:59:08]: Yeah. I just say this because I remember the conversation that I had with Philip from Honeycomb probably a year and a half ago, and he was talking about how it became very obvious to them, once they put the AI product or AI feature into their product, when somebody was trying to prompt-inject the feature, because it was almost programmatic, there were a lot more queries per second, and it was pretty obvious it wasn't a human that was trying to do it. And so those are things where you can set them up. Maybe you don't need a tool like an evaluation tool. You can just set those hard rules, like, hey, if a user all of a sudden is querying 50 times per second, then that's probably not a human.
Gideon Mendels [01:00:02]: Yeah. And I think in that case they also have, it's an interesting LLM use case, because, if I'm not mistaken, they're generating queries for the user, like a text-to-SQL-ish use case. So you can actually know whether the query is syntactically correct, right, for example. So that's a super easy alert if your model is generating non-syntactically-correct queries. I'm not even talking about whether the query returns what the user expected, that's harder, but syntactically correct.
Gideon Mendels [01:00:37]: And that's a classic Opik use case: implement your own metric. I think we even have one for SQL query validity. But implement your own Opik metric and add something around that, and then you can have these alerts, or even just this information in the UI about which use cases break.
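Two custom checks of the kind mentioned here, sketched as plain functions: a hard "does not mention a competitor" assertion and a syntactic-validity check for generated SQL. sqlglot is used as an example parser; wiring these into an Opik metric or an alert is left to the reader.

```python
import sqlglot
from sqlglot.errors import ParseError

def mentions_competitor(answer: str, competitors: list[str]) -> bool:
    # Hard rule: flag any answer that names a competitor.
    return any(name.lower() in answer.lower() for name in competitors)

def is_valid_sql(query: str) -> bool:
    # Syntactic check for text-to-SQL output; semantic correctness is harder.
    try:
        sqlglot.parse_one(query)
        return True
    except ParseError:
        return False
```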
Demetrios [01:01:00]: So it's almost like you can today get alerts. You just have to do a little bit more creative thinking.
Gideon Mendels [01:01:06]: Yeah, there's no alerts built in. I mean, you can forward the logs and so on if you want to hack it. I'm sure there's a way to do it. But I definitely urge the listeners, if you want to go and open a pull request for an alerting system, there we go, integration to Slack, PagerDuty and such, we would love to accept it. Yeah.
Demetrios [01:01:27]: Yard open.
Gideon Mendels [01:01:28]: Yes. Now you baby.