MLOps Community

Navigating the AI Frontier: The Power of Synthetic Data and Agent Evaluations in LLM Development

Posted Jun 18, 2024 | Views 449
# AI Frontier
# Synthetic Data
# Evaluations
# LLMs
# Okareo.com
SPEAKERS
Boris Selitser
Co-Founder and CTO/CPO @ Okareo.com

Boris is Co-Founder and CTO/CPO at Okareo. Okareo is a full-cycle platform for developers to evaluate and customize AI/LLM applications. Before Okareo, Boris was Director of Product at Meta/Facebook, where he led teams building internal platforms and ML products. Examples include a copyright classification system across the Facebook apps and an engagement platform for over 200K developers, 500K+ creators, and 12M+ Oculus users. Boris has a bachelor’s in Computer Science from UC Berkeley.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Explore the evolving landscape of building LLM applications, focusing on the critical roles of synthetic data and agent evaluations. Discover how synthetic data enhances model behavior description, prototyping, testing, and fine-tuning, driving robustness in LLM applications. Learn about the latest methods for evaluating complex agent-based systems, including RAG-based evaluations, dialog-level assessments, simulated user interactions, and adversarial models. This talk delves into the specific challenges developers face and the tradeoffs involved in each evaluation approach, providing practical insights for effective AI development.

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Boris Selitser [00:00:00]: Boris Selitser. I am CTO, CPO, Okareo. You know, to do as many things as I can. Well, I, you know, had many years of self-deprivation, but I finally bought an espresso machine, and it's one of the best purchases I've ever made. And I make a mean cortado. It's two shots and then just super dry, just foam. That's. That's my coffee routine.

Boris Selitser [00:00:34]: Can't live without it.

Demetrios [00:00:36]: What is going on, MLOps community? This is another good old podcast. I'm your host, Demetrios, and we're talking to Boris today all about evaluation, synthetic data, and how it helps you evaluate AI agents. What he's been seeing in the agent field and what some best practices are. Also a few hot takes in there, as you know how we like to do it on this podcast. We kept our feet on the ground. We did not get too hypey. He even called out that we may be at peak agent hype. Let's see if we are, because I still think we've got a little bit more hype to go when it comes to the agent field.

Demetrios [00:01:24]: Now, he has worked at Facebook, and one thing that I want to call out from this conversation is how his time at Facebook helped him understand the power of choosing the right metrics and the power of being able to track those metrics. And then he also left a little caveat in there to say it's not only about choosing the right metrics, because you don't always get the metric correct the first time. So you have to iterate on it and update your metrics constantly. And this is so important when it comes to the evaluation metrics and how you're evaluating the output and also the retrieval, all of the fun stuff that we know when it comes to these AI systems. So, let's get into this conversation with Boris. And I should say that his company is hiring for a machine learning engineer. So if anybody is interested in joining, we'll leave the link in the description. All right, let's get into it.

Demetrios [00:02:32]: As always, if you like this episode, feel free to share it with one friend. That's how we can spread the good work. It's probably good to start where we almost were before I hit record, and that was on this convergence of software engineering and then data science and machine learning engineers. It's something that I've been thinking about quite a bit lately, too: how much do you need of one, how much do you need of another? How transferable are the skills of a software engineer into building really production-ready AI features or products? And what do they need to know? And then vice versa: a machine learning engineer who was doing traditional ML, now going into playing with LLMs. There's a lot of things that are very familiar and then some things aren't. So how do you see this space?

Boris Selitser [00:03:42]: Yeah, I think that is definitely a relevant question right now, and the way you framed it is accurate. There's a lot of convergence happening between the software engineering and traditional software development world and the world of traditional ML, which a lot of the listeners of this podcast come from. Before, you had six months to build a model, sometimes two to three, sometimes six, and at the end of it you don't know what you're going to get. You're driving a long-term experiment, so you have to be gritty, you have to be tolerant of failure, and you have to be able to iterate in these long cycles that maybe deliver high returns at the end. With the coming of a lot of the foundational models, the iteration cycles are starting to resemble software development cycles, and they're getting shorter and shorter. And you can get that same ROI out of the prosaic prompt tuning and tweaking. There's lots of good research saying that prompt engineering, as non-engineering as it might feel for some people, is actually quite effective in producing the outcomes and behaviors you want from foundational models.

Boris Selitser [00:05:14]: And so you get the same outputs that you would get elsewhere in six months, much faster. And that drives the mentality and the tools that people use and how they approach building systems in a different way. And I think from a software engineer's standpoint, there's a lot that is being automated away; the mechanical parts of writing code are slowly but surely being removed. So what does the software engineer do? They're picking up all these data skills, they're picking up all these metric-driven development skills, and they're thinking about how to transition into this world of rapid development with foundational models.

Demetrios [00:06:01]: Yeah, it's funny you say that too, because I was making a TikTok that my parents found and they gave me a bunch of shit for. But in the TikTok I was saying how you have this aperture that's been opened. It's not only that the number of people who can play with AI has increased like 100x, but also the speed at which you can get value from AI has shortened; it's gotten much faster to see if there's some value that you can get from something. And what I was saying is that it's easy to go from, like, zero to 80, 85%, and you've got a lot of software engineers that can do that. And then their mind's blown and there's magic that's happening. And not just software engineers, basically anybody, if you can think of it. And then you say, well, yeah, let's see if we can use an LLM to help us out on that. But to get from that 85% to that 95%, that's where all of a sudden you're like, well, I guess I might have to figure out data pipelines, and I might have to go deep down the rabbit hole on data engineering, because you're figuring out the vector database and your chunking strategies and how to update the data and keep it clean, and data access maybe, or just how to make sure that the data doesn't leak.

Demetrios [00:07:28]: And so people turn from software engineers into a bit of data engineers, too, I think.

Boris Selitser [00:07:36]: Yeah, there's like, a lot of, you know, non sexy parts, I guess, of, you know, data wrangling that go into building good quality models, good quality applications that use models. And so I think there's definitely, yeah, there's definitely like, you have to go deep and not be, like, scared to get your hands dirty in terms of actually diving into the little bits of data and figuring out what makes sense and what doesn't. And this is where I think a lot of the use cases that people are building now are not going to be grounded on a bunch of production data, on a bunch of historical data. They can be informed by that. And there's some things you could do with it, but a lot of it is new applications that are being built that weren't possible before. And this is where I think synthetic data plays a huge role. This is where people can actually leverage synthetic data to generate the scenarios and the behaviors that they expect from the model. And this is where there's a set of tools that they can bring into the whole building lifecycle, all the way from describing what they expect to have in the system to begin with.

Boris Selitser [00:08:55]: And think of that description now not in terms of functional specs necessarily, but more in terms of behaviors that are data-defined. And so categorizing those behaviors into groups, into things that you expect the model to do, including the negative cases, what you expect it not to do. The very basic example is a lot of jailbreaking, et cetera, that you include for your product. That doesn't have to be the vanilla jailbreaking testing that OpenAI and Anthropic are doing on their models themselves, but something more specific to the nature of your application, whatever you're building, and the new surface areas that exposes. So you include that in the description-and-testing flow. You start with defining what it is you expect, and then evaluations become a key component, right? Like without evaluations, you don't have baselines.

Boris Selitser [00:09:52]: And so you have to kind of run through and actually establish where you stand. And now your life becomes like very metric driven. And, you know, we can get into the whole conversation of, you know, what metrics actually make sense in this world. But as you continue to tweak, you have your baselines, you can iterate, and then you can actually drive the synthetic loop all the way through to improving the model, fine tuning it. And that's where I think there's a lot of leverage, where you can build these new use cases with synthetic data and not just have to rely on production data out there.

Demetrios [00:10:31]: What have you seen as far as well done prompt injections or looking at a specific product or AI product specifically, and people being able to protect themselves against prompt injections or just toxicity or ways that it could go off the handlebars.

Boris Selitser [00:10:58]: You know, honestly, it's kind of interesting. If you look at a lot of the jailbreaking data sets, they get quite humorous; there are things like DAN or the "yes man" and so on and so forth. And you could actually expand those into application-specific scenarios. But I think it's the nature of the application where you can actually go deeper. For example, if you're trying to build a routing component that routes to a collection of agents that you have, the obvious thing would be to catch all the things that are completely off topic, or where the user is intentionally trying to get outside the main purpose of that router. While it's not necessarily malicious in nature, it might be misguiding routing decisions and so on. So jailbreaking has this extreme connotation in a lot of the large-model sense, where people are coming in and intentionally trying to hack the thing. But then there's the more practical: I'm trying to build my app and make sure it doesn't embarrass me, the company, and so on.

Boris Selitser [00:12:19]: Right. So in that context, I think jailbreaking could be, you know, slightly less, you know, horrific, but still impactful and very relevant. And so I think it's more about like actually building things that are keeping the application in its lanes and sort of collecting a number of scenarios that capture that behavior.

Demetrios [00:12:42]: How do you feel about online evaluation and how needed is it? Because I feel like I've talked to a lot of people and they're like, yeah, that's great in theory, but we're.

Boris Selitser [00:12:57]: Not doing it. I'll give you what some of our customers are doing and back into what I think about it, but I think there's definitely potential value. One of our customers, for example, is driving this complex DevOps automation agent system. It's a multi-turn generation that's executing shell scripts and commands, driving Python scripts, provisioning Kubernetes clusters, and wrapping all that up with maybe a Jira update and ticket status, and moving all those pieces very independently. As you can imagine, in that kind of a scenario, lots of things can go wrong at many different levels. The thing you don't want to do is necessarily intervene, because by catching these things and saying, okay, at iteration or generation number 15, if this happens, then do that instead. You can't anticipate that, you can't plan that.

Boris Selitser [00:14:10]: And so you definitely don't want to intervene in an online fashion. In fact, in many cases, these planning functions of the LLM, these agents that are performing a planning function, will recover in very interesting and unpredictable ways, ways that weren't anticipated by the author in the first place. So if they encounter certain failures, they will find workarounds that you haven't seen before, and those are good and valuable, and you don't necessarily want to intervene when those variations introduce themselves. What we've seen with them that was effective is doing an offline post-processing of a lot of what happened in production and essentially mapping two layers of metrics. One is: was there overall task completion? And you always, if you can, want to tie that to user feedback and some sort of user signal, like was the user involved and were they actually happy with the output, with the completion of that task? So the top line is task completion, and then the next set of metrics that we look for is more control metrics: was the max number of turn generations exceeded? Was the thing caught in an error loop? I think there are a number of use cases where online evaluation could be useful. If you're doing something simple like product description generation, and you can do an evaluation fast enough to check the quality is within certain parameters, then you could do it. There's obviously the question of tolerances and latency as well.
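
For readers who want to make this concrete, here is a minimal sketch of the kind of offline, post-hoc trace scoring described above: a top-line task-completion signal tied to user feedback, plus control metrics like a blown turn budget or a repeated-error loop. The trace schema, field names, and thresholds are illustrative assumptions, not Okareo's actual format.

```python
from collections import Counter

# Hypothetical shape of a production trace: one dict per agent turn.
# Field names (tool, error, status) are illustrative assumptions.
MAX_TURNS = 20          # control threshold: was the turn budget exceeded?
ERROR_LOOP_WINDOW = 3   # control threshold: same error N times in a row

def score_trace(trace: list[dict], user_feedback: int | None) -> dict:
    """Post-hoc, offline scoring of a single multi-turn agent trace."""
    errors = [t.get("error") for t in trace if t.get("error")]

    # Control metric 1: did the agent blow past its turn budget?
    turns_exceeded = len(trace) > MAX_TURNS

    # Control metric 2: crude error-loop detection, i.e. the same error
    # repeated several times in a row suggests the agent got stuck.
    error_loop = any(
        len(set(errors[i:i + ERROR_LOOP_WINDOW])) == 1
        for i in range(len(errors) - ERROR_LOOP_WINDOW + 1)
    )

    # Top-line metric: did the task complete, and does user feedback agree?
    task_completed = trace[-1].get("status") == "done" if trace else False

    return {
        "task_completed": task_completed,
        "user_feedback": user_feedback,   # e.g. thumbs up/down, if present
        "turns": len(trace),
        "turns_exceeded": turns_exceeded,
        "error_loop": error_loop,
        "tool_calls": Counter(t["tool"] for t in trace if t.get("tool")),
    }
```

In practice each scored trace would be aggregated over a production window before being compared against a baseline.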

Boris Selitser [00:15:58]: And I think that's the reality of a lot of these evaluations: they're leveraging other language models under the hood. And so in many applications, if there's a user involved, introducing that into the loop might be prohibitive. There are use cases, and we have customers pushing in that direction, like code generation, where your tolerance as a user is much higher and you don't expect it to be instant. So there is a level of basic checks you can do, like does this thing compile, does it have imports, before you show that to the end user. We see people perform those checks before the code is shown to the user, and you can do that and still be effective. And I think that would be considered value by the user, because they don't want to get a bunk set of code that is immediately, obviously wrong after the fact.
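
As a sketch of the "does it compile, does it have imports" pre-check for generated code, here is a minimal example assuming the model emits Python and using only the standard library; the function name and the `required_imports` parameter are invented for illustration.

```python
import ast

def precheck_generated_python(code: str, required_imports: set[str]) -> list[str]:
    """Cheap, deterministic checks to run before showing generated code to a user.

    `required_imports` is whatever your application expects the snippet to use
    (an illustrative parameter, not a standard convention).
    """
    # Check 1: does the code even parse?
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc}"]

    # Check 2: collect what the snippet actually imports.
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    missing = required_imports - imported
    return [f"expected import '{m}' is missing" for m in sorted(missing)]
```

For example, `precheck_generated_python(llm_output, {"pandas"})` returns an empty list only when the snippet parses and imports pandas.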

Demetrios [00:17:04]: That's very true. Well, first off, I'm very happy to hear that there is a potential future timeline where we can have agents working with Jira and we don't have to be working with it. That is exciting for me. The second thing is that I think you talked about two fascinating things where when it comes to these agents actually doing what they're supposed to be doing, you let them try and complete the task. You never want to go and cut them off because they a lot of times get creative with how they actually finish the task. It may not be the way that you thought they were going to be able to finish it, but they figure out a way and they do a little bit of problem solving and dare I say, reasoning. And then you can get the user to say, like, is this actually valuable? Well, that's up to them. And then go through all of those different metrics that you feel like are important.

Demetrios [00:18:04]: So when it comes to those different metrics, you hinted at this earlier, but what are the ones that you feel like you should be looking at?

Boris Selitser [00:18:14]: I think it's probably already been well documented on this podcast that a lot of the generic benchmarks that talk about model-versus-model performance are good as an index, as a guide to what to select and try and play around with first. But at the end of the day, it has to be application specific. It has to be use case specific. Before I started Okareo with my co-founder, I spent some time at Meta, which has a very metric-oriented culture, and I supported a number of teams. On a very regular basis I would be reviewing hundreds of metrics monthly, both from the standpoint of is this a good metric to have in the first place, and how are we actually tracking on this metric? That has drilled into me, in a number of ways, what really good, practical metrics are and when they actually make sense. There is a set of work that was driven, I think, academically, around a lot of these very generic metrics that make sense in a paper but don't make sense for your specific application. My favorite is faithfulness, for example. Now what are you going to do with faithfulness for your application? I guess we have to have a promiscuity metric as well to correlate.

Demetrios [00:19:48]: I was going to do it. If you didn't, I was ready to jump in.

Boris Selitser [00:19:51]: I think, you know, is your LLM faithful to you? And it scores a 3.5 on faithfulness. Like, what do you do with that? How do you actually make that actionable? What conclusions do you draw from that? Ideally it's a metric that actually connects to your end-user value or to your application value; that's kind of your North Star. Now in practice, that's not easy, but there are some scenarios where that aligns pretty head on. So I'll give you an example. Everybody talks about customer support; it's one of the very popular categories of building with LLMs. Issue resolution rate for the customer that's trying to get support, that's an obvious thing. It's existed in the support industry for a long time. And if you can actually use that as your North Star, everything else that you build can ladder up to that metric.

Boris Selitser [00:20:56]: Now, it's not always straightforward to even measure that, but ultimately it connects directly to your business case, it connects directly to what you're trying to do. I think that's the top layer of North Star metrics that you want to be application specific. Some people use LLM-as-a-judge; that has its own set of problems and hair around it, but we can get into that as well. I think the next tier of metrics, the one people should really be investing a lot of time into, is more like unit-test-style metrics and evaluations. And as much as they can be pass or fail versus a score, that's always going to be more effective and actionable. I think OpenAI Evals sort of started that trend, and it wasn't a huge leap of imagination; many people have jumped to that.

Boris Selitser [00:21:59]: But I think as much as you can put into these kinds of deterministic evaluations of the output, and do that up front and really rapidly in your development cycle, that's a software engineering principle that's tried and true and has been indoctrinated into people's minds for many years now. I think it applies really well in the LLM application building space as well. For example, so I don't sound too abstract: if I'm generating JSON as the output to some next phase in my application, does it have the right format? Does it actually have the right sections that you expect for that output? Does it have the right structure, et cetera? Does it have the right components? Do all of those checks before you run anything more complicated, before you put any more involved and latency-heavy metrics on top of it. So I think those are the two main tiers. I think people should invest a ton into unit-test-style metrics and embrace the fact that this is going to be a custom environment; your metrics are going to be custom, specific to your application. Now, for a lot of data scientists, that's nothing new. For a lot of machine learning engineers, that's nothing new.

Boris Selitser [00:23:23]: I think for a lot of the software engineers that are coming to the AI party, it is kind of a new practice that needs to be put in place. Now, you also can't perfect your metric over a six-month period, because as I mentioned earlier, things change very rapidly. You want to be able to put something in place, iterate and adjust, and keep changing your evaluation metrics as well, and keep adding to them, because as new features get introduced, new vulnerabilities get exposed in your application and so on. You want to continue building on that metric set, but that becomes how you build your baselines, and it obviously allows you to iterate on that baseline.
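
Here is a minimal sketch of the unit-test-style, pass/fail checks on JSON output described above, meant to run before any slower, LLM-judged metrics; the expected keys are made up for illustration.

```python
import json

def check_json_output(raw: str, required_keys: set[str]) -> dict[str, bool]:
    """Deterministic, pass/fail checks on a model's JSON output."""
    checks = {"parses": False, "is_object": False, "has_required_keys": False}

    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return checks               # fail fast: nothing else is worth checking
    checks["parses"] = True

    checks["is_object"] = isinstance(data, dict)
    if checks["is_object"]:
        checks["has_required_keys"] = required_keys <= data.keys()

    return checks

# Example: an (illustrative) schema for a support-ticket summarizer.
result = check_json_output(
    '{"intent": "refund", "summary": "...", "urgency": "high"}',
    required_keys={"intent", "summary", "urgency"},
)
assert all(result.values())
```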

Demetrios [00:24:10]: It's funny you mentioned that it's not new for data scientists and machine learning engineers, because we had a panel at the last AI in Production virtual conference, and towards the end of the panel, a few people were talking about how it's almost like same, same but different when it comes to doing traditional machine learning and what you need to do with your golden data set. And now you go out and you do LLMs and you're using AI, and all of a sudden you're like, wow, I think our evaluation data sets are like a golden data set. And so it's a new term for an old thing that we're doing at the end of the day. And the other piece that I wanted to call out when it comes to metrics is that we've had an engineer on here, Sam Beam, and he talked about how, for him, the most important thing that he's doing is choosing the right metrics to be monitoring. And I think you kind of called that out really well: not only do you choose the right metrics, but how are you going to actually track them, how are you going to know if what you're doing is moving the needle on those metrics, and are you making sure that it's the right metric and you have the right visibility on that metric?

Boris Selitser [00:25:47]: It can get complicated. And I think that's why people want out-of-the-box RAG metrics, for example. And in some cases, that's a good starting point for people to begin building and experimenting around, so they actually know what's going on in the system, because otherwise you're kind of flying blind and you don't really know if the improvements you're making actually make any sense. I think metric selection takes the application side of your brain as well as the data science side of your brain, and you have to bring those two together in terms of what matters for my application and my use case or my business. What is actually the most important thing? Because, first of all, that changes over time, and we often pick the wrong metrics and have to iterate and change them over time to calibrate and find what actually works. So I'll try to come up with an example.

Boris Selitser [00:27:02]: "Was the customer issue resolved?" was the example I gave earlier. Now, if it's a customer support case, is that really the North Star in all cases? Well, how long did the resolution take? Was the customer really frustrated at the end of it? And so your issues get resolved at a high rate, but the resolution takes 2 hours for each customer on average. I'm just making something up here.

Demetrios [00:27:33]: But did they call back?

Boris Selitser [00:27:34]: Yeah. Did they? Well, I think that wouldn't be resolution, right? That would be kind of a repeat call on the same issue. And so I think people need to embrace the fact that, yes, this will be custom, yes, the metrics will be custom, and they'll have to change over time. And that's okay.

Demetrios [00:27:54]: So we had Sanket on here from Spotify, and he was primarily working on the vector store and the recommendation engine that they have, and how they were using their vector store and the embeddings to power all these different use cases in Spotify. Maybe it was the personalized DJ, or it was search, or it was recommenders, or, I don't know if you're a Spotify user, but they have your daily mix and your weekly update type thing. And so there are all these products that are able to go off of the back of it. And one thing that he said, which resonates a lot with what you're saying, is that he spends more time on evaluating if the retrieval is actually good than he does trying to figure out the algorithms or doing anything else at his job. For him, the most important thing is the evaluation of the system. And I know it's a little bit off because he's talking about a recommender system, but there still is that retrieval. If you're doing RAG, it still is about trying to figure out what metrics are important, how do I evaluate them, and then spending the time to really get familiar and intimate with those metrics.

Boris Selitser [00:29:20]: Yeah, I think retrieval is kind of convenient in the sense that it has a set of standard information retrieval metrics that have been defined. But what we see our customers spending the most time on, and it sounds similar to the scenario you outline, is defining relevance. Okay, I can measure my top most relevant results in terms of MRR- and DCG-type metrics, but defining what is most relevant for a particular user, for a particular query, that is a lot more ambiguous and a lot more open-ended. If I am on Spotify and I get these recommendations, is that really what I want to listen to? Does that really align with my taste preferences? That's never going to be ideal, right? And I think defining those data sets, and this is where we're using a lot of synthetic data to help people define the initial set of data based on their enterprise, their company knowledge that they can use for retrieval later: essentially taking their knowledge and generating the retrieval data sets that then ladder up, or line up, to the standard retrieval metrics, so you can isolate that retrieval step within your larger pipeline, within your larger application, and then you can measure it. But you still need subject matter experts, people who understand that field and whether something is actually relevant, and they still need to filter a lot of that data in order to make sure it's qualified for evaluation use or for anything else.
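
To show how a generated retrieval set can ladder up to a standard metric, here is a small sketch of Mean Reciprocal Rank over ranked results and relevance judgments; the documents and judgments are invented, and the synthetic generation and expert filtering are assumed to happen elsewhere.

```python
def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """MRR over queries.

    Each element pairs the ranked doc ids a retriever returned with the set
    of doc ids a subject-matter expert (or a filtered synthetic set) marked
    relevant for that query.
    """
    reciprocal_ranks = []
    for ranked_ids, relevant_ids in results:
        rr = 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Invented example: the relevant doc appears at rank 1 for the first query
# and rank 3 for the second, so MRR = (1.0 + 1/3) / 2.
print(mean_reciprocal_rank([
    (["doc_a", "doc_b", "doc_c"], {"doc_a"}),
    (["doc_x", "doc_y", "doc_z"], {"doc_z"}),
]))
```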

Demetrios [00:31:22]: All right, real quick, let's talk for a minute about the sponsors of this episode, making it all happen: LatticeFlow AI. Are you grappling with stagnant model performance? Gartner reveals a staggering statistic: 85% of models never make it into production. Why? Well, reasons can include poor data quality, labeling issues, overfitting, underfitting, and more. But the real challenge lies in uncovering blind spots that lurk around until models hit production. Even with an impressive aggregate performance of 90%, models can plateau. Sadly, many companies prioritize model performance for perfect scenarios while leaving safety as an afterthought. Introducing LatticeFlow AI, the pioneer in delivering robust and reliable AI models at scale. They are here to help you mitigate these risks head-on during the AI development stage, preventing any unwanted surprises in the real world. Their platform empowers your data scientists and ML engineers to systematically pinpoint and rectify data and model errors, enhancing predictive performance at scale.

Demetrios [00:32:34]: With LatticeFlow AI, you can accelerate time to production with reliable and trustworthy models at scale. Don't let your models stall. Visit latticeflow.ai and book a call with the folks over there right now. Let them know you heard about it from the MLOps community podcast. Let's get back into the show.

Demetrios [00:32:55]: So let's move into this idea of custom evaluations. I'm pretty set on the idea that if you're not using custom evaluation data, metrics, or whatever, you're kind of screwed. If you're just going off of the benchmarks, as you mentioned, then that's pretty much flying in the dark. I don't know if anybody actually does that; it seems like we've learned by now. But with the advent of synthetic data, are there any ways that you've found where you're like, okay, this is super useful to create that custom evaluation data set and also make sure that it is very use case specific for my app?

Boris Selitser [00:33:54]: Yeah. So there's using the benchmark metrics versus crafting your metrics for your use case, and then there's using benchmark data versus using your own data. Those are two separate dimensions. If we look at synthetic data, I think it allows you to really focus on behaviors that you care about and describe behaviors that matter to your application. Curating a dataset, creating a benchmark for yourself, is a lot of heavy lifting. So how do you get a jump start on that? How do you get that going? We see people use synthetic generators in conjunction with some seed data that they have from dogfooding, for example, or internal deployments, or just dev experiments that they're running, and then create templates for what they envision the ideal flow or the ideal scenario should look like, and then essentially use generators to create personalities, user personalities or user behaviors, and populate that template of scenarios.

Boris Selitser [00:35:14]: I think synthetic data becomes super useful to give you a set of expectations, get you going, and get you tuning your stack, tuning your application. And then it gives you that really solid baseline to iterate on prompts, to iterate on the embedding model you're using, to iterate on how you're doing query extraction, all of those different aspects of your particular setup, your particular architecture. What becomes interesting, and evaluations are at the core of this, is that out of that you have a lot of failures that still sift through, right? After you've done all your tuning, after you've fixed your scenarios with synthetic data, there are all these stubborn errors, learned somewhere in the model's weights, that you cannot get rid of. And what do you do in that case? Or it's intermittent: something happens in production and it happens once in a blue moon, but you don't know if that's exposing you to a lot of risk or not.

Boris Selitser [00:36:23]: And so what you can use synthetic data for in that case is to actually expand and multiply those errors and, again, run them through evaluations and map your risk and impact within those areas. What's going to actually trigger those kinds of failure scenarios? Is that a frequent use case? Is that something my users or my application will do often or not? And then you are essentially building a map. You're enhancing your overall model of the behaviors that you want to evaluate and expect your system to have. And you can leverage that further for fine-tuning.

Boris Selitser [00:37:01]: Now that you have enough data to describe your errors and success cases, you can then generate instruction sets in order to fine-tune and get rid of those stubborn, hard-to-see, or intermittent failures that you're seeing. That drives the synthetic data loop that I think is critical and not really leveraged as much as it could be in a lot of the LLM application building right now.
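
A minimal sketch of that loop, from a handful of observed failures to an instruction-tuning file, with stubs standing in for the synthetic generator and the corrected reference answers; the function names and JSONL format are assumptions, not a specific product's interface.

```python
import json

def generate_variants(failure_prompt: str, n: int) -> list[str]:
    """Stub for a synthetic generator (in practice an LLM call) that expands a
    failing input into n nearby variations."""
    return [f"{failure_prompt} (variation {i})" for i in range(n)]

def reference_answer(prompt: str) -> str:
    """Stub for producing the corrected/expected output, e.g. via a stronger
    model plus human review."""
    return "expected, corrected response for: " + prompt

def build_instruction_set(observed_failures: list[str], path: str, n_variants: int = 5) -> None:
    """Expand observed failures into prompt/response pairs for fine-tuning."""
    with open(path, "w") as f:
        for failure in observed_failures:
            for variant in [failure, *generate_variants(failure, n_variants)]:
                record = {"prompt": variant, "response": reference_answer(variant)}
                f.write(json.dumps(record) + "\n")

build_instruction_set(
    ["Customer asks to return an item bought with a gift card"],  # invented example
    path="finetune_instructions.jsonl",
)
```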

Demetrios [00:37:34]: And are you just asking an LLM to go and create this? What's your favorite way of creating synthetic data?

Boris Selitser [00:37:43]: So the way we work with our customers is actually to give them a suite of these generators, and then they assemble them. They use templates to derive scenarios that are specific to their world, their application, and then they deploy this suite of generators to produce the data. Today, there are a number of different generators fit for purpose, so I can't say that there's one default way of doing it.

Demetrios [00:38:12]: I like this vision of many different generators depending on what you're looking for. And you have a generator that is better suited for that use case or that, whatever it is that you're looking for, and you can fan out and create a nice little synthetic data set from that and potentially prompt it to get more use case specific or very pertinent to what you're trying to do. So, speaking of which, it feels like when it comes to agents, having the synthetic data there to also help the agents stay on track would be quite useful in their evaluation.

Boris Selitser [00:38:53]: Yeah. So staying with the customer support scenario, because it's frequent out there and we have some customers that are building just that: the thing you could do is actually think about your historical conversations, for example, and take a look at, this was a product-return type of conversation, this was a shipping-issue type of conversation, and collect those scenarios from your historical customer interactions and then build those into templates. This is going to be a template of a complicated return. This is going to be a template about a complicated policy question.

Boris Selitser [00:39:42]: And then use a language-model-based generator to simulate personalities and users and tone. This customer is very aggressive; this customer is a little bit different and has this other sort of behavior to them. And so essentially build a collection of simulated users that map to those scenarios. All of a sudden you have different aspects that you're trying to test, and then you have a lot of these user personalities populating those different aspects. And then you can drive that against your model under test, the model being evaluated, and see exactly what happens in that case.

Boris Selitser [00:40:33]: Now, this doesn't have to be an agent-based system on the other side; it could be something more simplistic, but it also can be an agent. The question of how you implement it on the other side is a bit of a separate question, but the nature of this setup allows for multi-turn, deep interactions that you're probing and expanding into with the model you're evaluating.
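
Here is a minimal sketch of that setup: persona-driven simulated users running multi-turn conversations against the model under test, with both sides stubbed out. Every name in it is illustrative rather than a real API.

```python
def simulated_user(persona: str, scenario: str, history: list[dict]) -> str:
    """Stub for an LLM-backed simulated user.

    In practice this would prompt a language model with the persona
    (e.g. "aggressive customer"), the scenario template (e.g. "complicated
    product return"), and the conversation so far.
    """
    return f"[{persona}] next message about: {scenario}"

def model_under_test(history: list[dict]) -> str:
    """Stub for the application or agent being evaluated."""
    return "assistant reply"

def run_simulation(persona: str, scenario: str, max_turns: int = 6) -> list[dict]:
    """Drive a multi-turn dialog and return the transcript for evaluation."""
    history: list[dict] = []
    for _ in range(max_turns):
        history.append({"role": "user", "content": simulated_user(persona, scenario, history)})
        history.append({"role": "assistant", "content": model_under_test(history)})
    return history

# Fan a few personas out over one scenario template, then feed each transcript
# into dialog-level evaluations (resolution, tone, turn count).
for persona in ["aggressive customer", "confused first-time buyer", "policy stickler"]:
    transcript = run_simulation(persona, scenario="complicated return of a gift-card purchase")
```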

Demetrios [00:41:06]: I do love the idea of giving an LLM a bunch of different personalities so that the agent's ready. So there is one piece that I'm thinking about when it comes to this, and I feel like you've also mentioned it, but not exactly; I would love a clear-cut idea of what the emerging agent design patterns are. Especially because it feels like we've got so many clear design patterns if you're doing traditional ML these days. Hopefully people are doing it, but you never know, they could still be just using an Excel spreadsheet to track their experiments. And you've also got clearer design patterns when it comes to RAG; every day, I feel like that's getting more and more advanced. I think people are accepting that knowledge graphs are a very useful piece of the RAG architecture. But when it comes to agent architecture or system design, I'm a little bit unfamiliar with that.

Demetrios [00:42:16]: So I'd love if you can break down what you've been seeing.

Boris Selitser [00:42:20]: Yeah. Maybe we can come back to this question, but I'm also curious, because you talk to a lot of people in this space, how far you think people are pushing agent architectures, or whether it's still a little bit on the emerging side right now. I'm informed more by what our customers are doing, by what I'm seeing in deployment, and there are a lot of really exciting things. Last year you were probably super tired of hearing the term RAG, but this year it's all agent, agent, agent.

Boris Selitser [00:43:09]: I think the hype cycle is also at its peak, I would argue. But if you look at what's in reality, there are some deployments, but it's still early days. I think if you think about the LLM stack, agents become a natural paradigm, with analogies to programming paradigms where you're breaking up your system into modules, and these modules are collaborating, they're passing messages, and they're then delivering the end application, versus you having this massive prompt that you're somehow templating in from a bunch of different parts, where you can't really control exactly what the end thing is going to look like and it becomes a kind of template-based monstrosity. Agents become a nice abstraction that people can reason about and write code around, and that has very nice parallels to how software systems have been built for some time now. I think there are a number of patterns. I actually like how Andrew Ng breaks it down in terms of typical patterns.

Boris Selitser [00:44:25]: There's the reflective agent, which is the simplest one: having the agent go back, look at its output, and determine if there are any errors that need to be corrected. Then there's tool use, which we're all familiar with: leveraging whatever tools to search the web, do retrieval, and other things. There's also the planning function, and one big distinguishing characteristic of an agent versus just a collection of prompts is that the control loop lives in the agent itself. Typically in these architectures your control loop is fixed, and you have a predetermined path where you're calling an LLM to perform a particular thing. Now you can take the control loop and put it into an LLM itself, which then decides what the steps are and how to break down complicated tasks. And then there are multi-agent setups that collaborate to perform a task, which people have started using human analogies for, like a manager and a worker, etcetera, which I find especially humorous because everybody jumps to some sort of job-replacement thing, which I don't think is intended by that at all. But if you look at how people are graduating from RAG architectures, they're typically putting that control loop in the middle, or they're adding a routing function that then decides which retrieval to invoke, if I have, say, one agent responsible for retrieval of meeting data, another one for meeting summarization, and a third one for retrieving documents that are relevant to a particular meeting.

Boris Selitser [00:46:17]: Now you have a routing agent that coordinates between these, decides which one to invoke and when, and assembles that into a larger output. That's a RAG-ish architecture that evolves into an agent-based one, and it's a natural way to add complexity without going too far too deep. But then I see people actually build agents that run on an actual compute environment and perform actions on that compute environment. In that case, a lot of interesting coordination issues emerge: I'm getting output from this EC2 node that I didn't expect, how do I handle it, let me try something else, and so on. So there's a lot of error handling and recovery, and the planning function has to be really thoughtful, with a lot of escape hatches, if you will, because this thing can often get into an endless error loop, or start looking at workaround after workaround and then get confused about the initial intent and get off track.

Boris Selitser [00:47:35]: So there's a lot of defensive, I wanted to say code, but the right word is prompt, that you have to build into it. But there's also some scaffolding you can inject in terms of code, putting in controls to make sure you're not passing those thresholds of safety.
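
Pulling the routing pattern and that defensive scaffolding together, here is a minimal sketch of a router plus a planning loop with simple escape hatches (a turn budget and repeated-step detection); the agent names and the `call_llm` stub are placeholders, not any particular framework.

```python
MAX_TURNS = 10

def call_llm(prompt: str) -> str:
    """Stub for whatever model call your stack uses."""
    return "retrieve_meetings: next week's meetings"

# Specialized "micro-agents", each with a narrow responsibility (invented names).
AGENTS = {
    "retrieve_meetings": lambda task: f"meeting data for: {task}",
    "summarize_meeting": lambda task: f"summary of: {task}",
    "retrieve_documents": lambda task: f"documents relevant to: {task}",
}

def route(user_request: str, scratchpad: list[str]) -> tuple[str, str]:
    """LLM-backed router: decide which agent to invoke next and with what task."""
    decision = call_llm(f"Request: {user_request}\nSo far: {scratchpad}\nNext step?")
    agent_name, _, task = decision.partition(":")
    return agent_name.strip(), task.strip()

def run(user_request: str) -> list[str]:
    scratchpad: list[str] = []
    seen_steps: set[tuple[str, str]] = set()

    for _ in range(MAX_TURNS):                   # escape hatch 1: turn budget
        agent_name, task = route(user_request, scratchpad)
        if agent_name not in AGENTS:             # defensive check on router output
            scratchpad.append(f"router produced unknown agent '{agent_name}', stopping")
            break
        if (agent_name, task) in seen_steps:     # escape hatch 2: repeated step, likely a loop
            scratchpad.append("repeated step detected, stopping")
            break
        seen_steps.add((agent_name, task))
        scratchpad.append(AGENTS[agent_name](task))
    return scratchpad
```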

Demetrios [00:47:59]: They are so fascinating, man. I do like that you said we're at peak hype, because it doesn't necessarily feel like I've seen agents working reliably yet in a way that people are outsourcing a lot of stuff to them. And I 100% agree with you on the idea of having modular agents or prompts, almost like microsystems or microservices, instead of the monolithic, gigantic prompt and "go figure it out." And the idea of passing information, or having just an agent that does something very, very small and specific, so that ideally it's not getting off track because it knows what it's going to do. You've broken this task down into very small pieces, and so hopefully it can't go too far off the rails. I was talking about this just last week, and my thinking got a little bit clarified on the idea of, oh, well, prompts: it feels like we're gonna have smaller prompts, and these prompts are going to be very pointed. But after talking to a few people, they mentioned how you should look at the prompt as almost like data coming in.

Demetrios [00:49:25]: And I recognize that there are two separate pieces to a prompt. Depending on the way you structure prompts, let's just say there are probably many other pieces, but if we simplify it, you can say you've got the task, or the outcome, that you're trying to get from the LLM, and then you've got the context that you pass in. And so the context can be data. The task is what I think needs to get smaller and smaller and be as small as possible, but the data can be as big as possible, so you make sure that that task gets completed correctly.

Boris Selitser [00:50:04]: Yeah. I mean, the obvious example is tool or function calling, right? And processing the output of those tools. If my agent is checking out a GitHub repo and then trying to ascertain what's in that repo and then going in and modifying a particular configuration file, the nature of the output is a little bit unbounded and you can't foresee it, but that agent needs to be able to process all of that and then essentially make decisions, figure out what to do with the output of that tool. And that could be of arbitrary length and structure, and it could get pretty complex. But that's runtime state, if you will; that's not how you're building your system. Your system is much more controlled and predictable, and it starts to more and more resemble code. I like the microservices analogy; I think it's spot on. How do you think about the right microservices that you want to have in your system, how do you split those responsibilities, and then focus on evaluating those focused responsibilities for that particular agent?

Demetrios [00:51:26]: One thing that I've noticed when it comes to agents: I was talking to someone who is in the process of winding down a startup that was working with agents. The thing is, they were in the fintech space, they were very bullish about agents, and they were getting real money to put these agents into production, you know, serious contracts of whatever, 40. They stopped that company because they were like, we couldn't build a product. Every time we went into a company, it was us trying to do something very tailor-made to that company. It wasn't us having a fintech agent that could do things, because everyone needed something new and something different. And you think, oh yeah, that's the beauty of agents, right? That's exactly why you would just want an agent and say, go and figure that out.

Demetrios [00:52:33]: But they weren't able to get it working and they weren't bullish on the tech being able to work. So they were like, yeah, let's wind this down. But then towards the end of the conversation, I was like, so what are you doing now? And they were like, well, I'm going to wind this down and then I'm going to go find a co founder and I'm going to start something else up in the agent space. Just like, what the hell?

Boris Selitser [00:52:56]: Yeah, well, I guess that's a typical story in the Valley, but I think the agent framework is definitely going to be a very tough space, if that's where the startup was playing. Some people ask, can you make any money building frameworks, or is it more of an open source type of thing? Obviously there are ways of making money on open source as well, but it's a different model, and I don't know exactly how this particular startup was pursuing it. I think it's very, very early to draw conclusions, because the whole agent paradigm of how to build with it is going to go through crazy evolution over the next year or so. We're just starting to see some emerging architectures and patterns, and we're starting to see some successful projects. It's still early days. Once your application becomes more complicated and is leveraging an LLM for a lot of different functions, it changes. If you're just doing summarization of a meeting, or just doing some classification with a language model, you don't need a lot of complexity; it's just a few prompts that you're throwing together. But if you're actually building a lot more, and you're trying to build things that break down large, complex tasks, there's going to be a lot of experimentation, and a lot of things are going to stabilize in terms of what works, what doesn't, and whether there are a ton of great businesses there.

Boris Selitser [00:54:43]: In terms of framework building, that's a very interesting conversation. I think it will tend to go the open-source-plus-plus route. You also have to choose your domain carefully, if we're talking from the business standpoint, of which I obviously know nothing. Are you building a generic agent for anybody out there, just trying to attack the world of possibilities? Or, as we see some people doing, are you building agents that just help with customer onboarding for, I don't know, tech companies? Then your focus is much more segmented, your use cases are much more segmented, and you're focused on that and the value you're delivering. In that case, you're also using the agent framework internally more so than trying to build it for company XYZ.

Boris Selitser [00:55:41]: You're actually focused on delivering that end value, and the agent happens to be a framework to get to that end. So again, without knowing anything about this particular company, I can share my high-level thoughts, but as you say, hopefully I'm predicting peak hype, because I'd like to get more to the plateau of productivity, as they say. It's definitely a hot and emerging area, and it makes sense: breaking things down into microservices makes sense. Whether graph-based state management is going to be the solution, or whether that's going to be too complicated for people, we'll see. It sounds really nice from a purist standpoint, but if I were to opine on that quickly, whether a graph-based design will take root is hard to say. I think a simpler state machine will probably be a safer start for a lot of people. A graph-based design that has lots of edges and message passing can be a little bit unwieldy to both design and fully reason about.

Boris Selitser [00:56:56]: But again, you know, I might be, like, regretting saying this in a couple of months. So we'll see.

Demetrios [00:57:03]: Dude. Boris, this has been awesome. Thank you for coming on here.

Boris Selitser [00:57:06]: Thank you, Demetrios. It has been a pleasure.

Demetrios [00:57:08]: Appreciate it.
