MLOps Community

Reliable LLM Products, Fueled by Feedback

Posted Jul 30, 2024 | Views 378
# LLMs
# Generative AI
# Feedback Intelligence
speakers
Chinar Movsisyan
CEO @ Feedback Intelligence

Chinar Movsisyan is the founder and CEO of Feedback Intelligence, an MLOps company based in San Francisco that enables enterprises to make sure that LLM-based products are reliable and that the output is aligned with end-user expectations. With over eight years of experience in deep learning, spanning from research labs to venture-backed startups, Chinar has led AI projects in mission-critical applications such as healthcare, drones, and satellites. Her primary research interests include artificial intelligence, generative AI, machine learning, deep learning, and computer vision. At Feedback Intelligence, Chinar and her team address a crucial challenge in LLM development by automatically converting user feedback into actionable insights, enabling AI teams to analyze root causes, prioritize issues, and accelerate product optimization. This approach is particularly valuable in highly regulated industries, helping enterprises to reduce time-to-market and time-to-resolution while ensuring robust LLM products. Feedback Intelligence, which participated in the Berkeley SkyDeck accelerator program, is currently expanding its business across various verticals.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

We live in a world driven by large language models (LLMs) and generative AI, but ensuring they are ready for real-world deployment is crucial. Despite the availability of numerous evaluation tools, many LLM products still struggle to make it to production.

We propose a new perspective on how LLM products should be measured, evaluated, and improved. A product is only as good as the user's experience and expectations, and we aim to enhance LLM products to meet these standards reliably.

Our approach creates a new category that automates the need for separate evaluation, observability, monitoring, and experimentation tools. By starting with the user experience and working backward to the model, we provide a comprehensive view of how the product is actually used, rather than how it is intended to be used. This user-centric aka feedback-centric approach is the key to every successful product.

TRANSCRIPT

Chinar Movsisyan [00:00:00]: My name is Chinar Movsisyan. My title? I'm a co-founder and the CEO here at Feedback Intelligence. We call it Feedback Intelligence. And how I take my coffee? I just take it however they give it to me.

Demetrios [00:00:20]: Welcome to the MLOps Community podcast. What is going on, everyone? I'm your host, Demetrios, and today, talking with Chinar, we had a conversation about metrics and evaluating your AI products, not evaluation metrics. She brought a whole new approach and idea that I had not heard before: really looking at AI products just like we look at other products, and figuring out the ways we can know if our AI products are successful. But not thinking about it from the lens that an engineer needs to be the one figuring out if the product is successful. There are so many other stakeholders in the loop, and we need to be thinking about how they are able to look at metrics around product usage, specifically the PM. How does a product manager know if one of these AI products is well received or not? And this comes back to something that I have been thinking about frequently when it comes to the democratization of AI. What we've been doing with AI is allowing everyone to leverage its power, except, for the most part, the majority of the tools are built for engineers.

Demetrios [00:01:47]: We aren't building as much tooling for the non-technical stakeholders who are also in the room as these AI products are being shipped. Chinar talks a lot about how we can think about the other people that are in the room and encourage them to be there. I love this analogy that she used, which was: where is the Mixpanel or Amplitude for AI product usage? We have not seen that yet. I'm excited for the idea. Would love to hear what you all think. So drop in a comment. Let me know if this resonates with you. And as always, if you liked it, and especially on this one, if you know a product manager that would like this kind of stuff, send it on over to them.

Demetrios [00:02:41]: I think product managers everywhere are thinking deeply about this and wondering how the hell they're supposed to do it with these AI products, but that's my assumption and I'd love to get a little validation on it. Let's jump into the conversation with Chinar. I'm managing to stay hydrated on a hot day in Germany, which is like one out of three that we have per year. So I know you're in San Francisco still and you have to deal with the heat a little bit more often, but I want to talk about all the different cool stuff that you have done up until now. So tell me more about these object detection models, putting them onto drones in the agriculture industry. Raspberry Pi deployment. You're going from computer vision models.

Demetrios [00:03:41]: I imagine they were very big. You had to make them small so that they worked on a drone. Give me more context, because that sounds like an awesome problem to be solving.

Chinar Movsisyan [00:03:51]: Yeah, yeah, that was very fun. And it was my first project, to be honest, for deep learning. I did some projects before that, but that was the one where I was like, oh, this is not going to be a research project, but something that is going to be in production. So I should be careful. We should be careful. And I remember that was YOLOv3. That was that time, the good old days. Yeah, we were excited about that.

Chinar Movsisyan [00:04:22]: And in terms of drones, solving the problem of object detection is all about high-resolution images with very tiny objects, for crop analysis and a lot of this stuff, or to automatically spray the field. So this was the project, and we had a lot of data, more than 10k high-resolution images, and we were required to annotate it. We had a lot of people there to manually go over it and annotate the data, and then we fine-tuned that YOLOv3, adapted it to that use case, like custom training, and got some models, some checkpoints. And of course it's not going to be good enough on a Raspberry Pi because of the drones. You cannot have GPUs on drones, unfortunately. And the idea was just to translate it to a lighter model, deploy it, and then see what's happening. Even after doing this, we got a lot of bugs. We had a lot of false positives and false negatives because of illumination, lighting, all this stuff. Different things were happening.

Demetrios [00:05:52]: But, yeah, I'm still waiting for the company that straps a GPU to a drone, maybe Nvidia. Yeah, that's the next billion dollar company, huh? Who wants to go and raise some money for that one? That is hilarious to think about. But the interesting piece there, I think, is you were doing this with YOLOv3. As you mentioned, it was deep learning before deep learning was cool. Most people, I imagine, were excited to work on deep learning back in those days, because everybody learns about it. It's the cutting-edge techniques.

Demetrios [00:06:32]: And I can only echo what I've heard from the community in all these conversations, how everybody wanted to work on deep learning until they got into the enterprise. And then it's like, oh, yeah, there's not really deep learning happening here. I have to go and do that. You didn't have that case, and you went out and also did other cool stuff with drones. Right? Like, that wasn't your only foray into it. Later, you were doing surveillance stuff and working with YOLO models.

Chinar Movsisyan [00:07:02]: Yeah, yeah, it was a surveillance use case, person and car detection for street surveillance purposes. And it was a small project, like a small startup, we can say. At that time, I was based in Armenia, and we were doing that for Yerevan. But the idea was very cool: how we can just have person detection and car detection, deploy it in front of shops and supermarkets, and not manually count what is on the monitor, the camera data, like 24 hours of footage, but just have some AI and automatically have some basic analytics on what's happening outside.

Demetrios [00:07:48]: Yeah, you don't have to have somebody with a clicker clicking every time a car drives by, which makes everyone's life a lot easier, I can imagine. And frees up some time for people. The other piece that I wanted to talk to you about was the whole generative AI stuff you were doing before generative AI. But before we do that, did you come up with any cool techniques or ways to get these big computer vision models smaller? Were you pruning in those days? Were you distilling them? What was it looking like?

Chinar Movsisyan [00:08:25]: I feel like that's one of the challenges that people are still facing right now with all these huge foundation models or transformers: whether we are able to distill it and deploy it on small devices, where there is no GPU, where there are no computational resources available. So that's the solution there. At that time, we were just doing a translation from Python to C, which was very helpful. With that, we were able to distill it, and we were able to keep the accuracy.

Chinar Movsisyan [00:09:08]: And on the other note, keep the speed, like how many frames we can process per second, which is very important for drones, where you just get a lot of data, a lot of video data, and you need to process it, and you cannot just skip frames there. So this was the technique that we came up with, and it was good. Good enough for that use case. But I imagine that for a lot of use cases, especially where drones are being used in military or other verticals, for sure there should be some other techniques to be as accurate as possible, like close to 100%.

Demetrios [00:09:50]: Okay, 20 seconds before we jump back into the show. We've got a CFP out right now for the Data Engineering for AI and ML virtual conference that's coming up on September 12. If you think you've got something interesting to say around any of these topics, we would love to hear from you. Hit that link in the description and fill out the CFP. Some interesting topics that you might want to touch on could be ingestion, storage, or analysis, like data warehouses, reverse ETL, dbt techniques, et cetera, et cetera. Data for inference or training, aka feature platforms: if you're using them, how are you using them? All that fun stuff. Data for ML observability, and anything FinOps that has to do with the data platform.

Demetrios [00:10:35]: I love hearing about that. How you saving money? How you making money with your data? Let's get back into the show. So now tell me about this plastic surgery stuff that you were doing.

Chinar Movsisyan [00:10:46]: So I got that project from doctors. They were doing plastic surgery, and their question, their challenge, was: hey, we are using Photoshop, very classic, and we have a lot of data, like images of faces, and we need to set the landmarks, understand that maybe the nose should be shaped this way, etcetera. A lot of manual work, and then back and forth with the patient on whether he or she likes the new shape or not. So this was the question. And we were like, okay, how can we solve this? They had a lot of reference data, reference images, like pairs where they had already manually done that. We took all of those and trained a pix2pix model.

Chinar Movsisyan [00:11:43]: I remember it was in TensorFlow, and it was a generative adversarial network based solution. After that, doctors were able to upload the actual image of a patient and get a generated image with a new face, like a new nose or something else. It was a very fun project, especially having all those images, a lot of different faces and different shapes, and then trying to train these generative adversarial networks in a way that gives good results for other data, data not in the training set.

Demetrios [00:12:33]: One of the first people that I talked to when diffusion models were getting really popular was a friend of mine who's a dentist, and I was showing him some of the cool stuff you could do with, at that time, Stable Diffusion. And he instantly was like, huh, I wonder if I could do this with people's teeth. You know, he has to constantly be thinking about how to better make the mouth and the teeth and the shapes of the teeth, especially if he's doing work on them or he's putting in new teeth. And I was like, I bet you could, but I think there's probably a better way to do it. Like, I would be a little bit worried about a diffusion model giving someone new teeth or a new tooth.

Chinar Movsisyan [00:13:19]: Yeah, yeah, yeah, that's very funny. And also I feel like that's helpful for non-technical people, for doctors, having some application deployed on their computer just to have some reference before doing their manual work. For sure they have to check it. You don't know what kind of generated image could be there. For sure there are false positives and all of this. We cannot guarantee 100%, but, yeah, that was really fun. And also a good project.

Demetrios [00:13:51]: Yeah, it gives them a little inspiration, so hopefully that helps. Now, I for sure want to talk about like, the idea of basically highly regulated industries. And it feels like you've worked in some different highly regulated industries like healthcare. Are there gotchas that you've seen that keep models or AI from making it to production?

Chinar Movsisyan [00:14:23]: One of the patterns that I can mention here is that in all these verticals, from the agriculture industry to healthcare, surveillance, and now fintech, we are involved in fintech now, there is this gap between domain experts, end users, and us as builders. That was something that we always dealt with, and now as well. We always think about building something, but the actual users are people who think that this is magic, like, oh, we need to get this output. Like doctors for plastic surgery purposes: we need to get this output. But there are a lot of cases where it's hard to communicate that output, why that is happening and why you should expect that as well.

Demetrios [00:15:27]: Yeah. Have you found any strategies on how to better communicate that?

Chinar Movsisyan [00:15:32]: No, I cannot say I found any. I remember when I was doing my PhD, I had my supervisor, but I also had a doctor, a professor, as a supervisor. The first two months were horrible, just to kind of have some language to talk in. I was saying something, and she was like... she's an expert, I mean, a professor in the cardiovascular field. But it was very hard to find this common ground and talk to each other just to understand, even though we were solving the same problem.

Demetrios [00:16:12]: And so you think that's because you didn't have that shared vocabulary or was it just because the understanding or the expectations from.

Chinar Movsisyan [00:16:20]: Yeah, I.

Demetrios [00:16:21]: The non-technical side were way out of scope of what is actually possible?

Chinar Movsisyan [00:16:27]: I would say both. But the second one, the latter one, has more weight than the first one. All these expectations, and also understanding of expectations, and understanding what's happening with AI. What is that? What is the definition of AI?

Demetrios [00:16:49]: Yeah, yeah. I feel strongly about this. And I feel like these days, especially because there is so much noise around AI, it's very hard to get non-technical stakeholders and even technical stakeholders on the same page. There are so many technical people that I've talked to who have played around with LLMs and they still think that it will do way more than it is actually capable of doing. And so being able to properly manage those expectations is an absolute art.

Chinar Movsisyan [00:17:22]: Yeah, yeah, 100%. And I feel like that will help a lot in terms of product. Like any AI based product development and making it to production. That's the goal. Right. We are not doing anymore. Well, we are doing research, but the goal is to have it in production.

Demetrios [00:17:46]: Exactly. And so speaking of these almost like product metrics, and jumping to that area, I think there is some fun stuff that we can look at when it comes to how you like to measure product metrics, how you like to think about evaluation metrics, almost like even we can go into monitoring and observability and what you've been seeing out there, because I think you have a different viewpoint than most people that I've talked to where we know that evaluation is the hottest topic right now. We know that everyone is talking about evaluation, but it's almost like nobody has really cracked that nut and figured out the tried and true and tested way to evaluate output. So what's your take on this whole scenario?

Chinar Movsisyan [00:18:41]: Oh, this is one of my favorite topics: evaluation, observability, monitoring, and then this product analytics, diagnosis, understanding of an AI-based product, not classic web applications. So in this topic, one of the examples that I try to bring as an analogy is this: whenever we build any web application and just host it, what do we do? We use some analytics, right? We are not waiting for a couple of months and then turning on Google Analytics or Amplitude after six months. No, whenever we host it, we just link it. And then every day we check it, right? We check what's happening. They use this button, they don't use this one. So my question is, why are we not doing the same thing with AI-based solutions? We only care about evaluation, which is very important. That's important.

Chinar Movsisyan [00:19:44]: I'm not questioning that. Yeah, we should evaluate it. We should calculate all these metrics, like F1 score, accuracy, recall, mAP, all these metrics. Those are very important. But once it goes to production, we should have some data-driven way of understanding how it is being used, how end users are using it, what their expectations are, what they actually want from this product, and then catch it, fetch it, and do this cycle again and again. I feel like we are taking the model-to-user direction. But right now, as we all are thinking about production LLM-based solutions, maybe we need to take the other direction, backward from users to model. So this is something that I feel like we are missing in this AI evaluation.

Chinar Movsisyan [00:20:49]: Observability, monitoring, like, chain.

Demetrios [00:20:54]: So how do you see that playing out in practice? Let's just take the quintessential RAG chatbot. Is it the conversations you're having with the chatbot, or is it beyond that? And I know that a lot of people will put in the thumbs up, thumbs down, which is a pretty bad metric, honestly. I think we've all recognized that doesn't really work, because if it's thumbs down, it's unclear why it is thumbs down and what that actually means. I am not the biggest fan of chatbots, but let's just stick to that example. How do you look at other metrics around it besides when somebody is getting very angry at the chatbot?

Chinar Movsisyan [00:21:45]: Yeah. Yeah. So basically, we can say that there are two options here in terms of feedback, in terms of understanding how the chatbot is being used. One of the options is explicit feedback, which is thumbs up, thumbs down, or plain text like: well, I'm not happy with the output. This is wrong. I did a query to get customer transactions in Q4 2022, but I'm getting business transactions. This is not correct financial analysis using this chatbot. This is explicit feedback.

Chinar Movsisyan [00:22:24]: But what about all the implicit feedback that we can get and then derive signals from? What I have seen, based on my experience, is that just the usage of the chatbot, like querying something, paraphrasing it, leaving the chatbot or closing it and then opening it again, all these signals can make a difference in terms of understanding what the issue is. Maybe there is an issue in terms of knowledge, a knowledge hole in the RAG system, like the context is not informative enough to retrieve business transactions, or vice versa.
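
As a rough illustration of the implicit signals described here (paraphrasing a query, closing the chatbot and opening it again), the following is a minimal sketch of how such signals might be counted from a session log. The `Event` type, the event names, and the 0.6 similarity threshold are all assumptions made for this example; the conversation does not describe how Feedback Intelligence actually derives these signals.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Event:
    kind: str       # hypothetical event names: "query", "close", "open"
    text: str = ""  # query text; empty for non-query events


def implicit_signals(events, rephrase_threshold=0.6):
    """Count rephrased queries and close-then-reopen patterns in one session."""
    signals = {"rephrase": 0, "reopen": 0}
    last_query = None
    closed = False
    for event in events:
        if event.kind == "query":
            current = event.text.lower()
            # A query that is similar to, but not identical to, the previous
            # one is treated as a paraphrase (an assumed heuristic).
            if last_query is not None and current != last_query:
                similarity = SequenceMatcher(None, last_query, current).ratio()
                if similarity >= rephrase_threshold:
                    signals["rephrase"] += 1
            last_query = current
        elif event.kind == "close":
            closed = True
        elif event.kind == "open":
            if closed:
                signals["reopen"] += 1
            closed = False
    return signals


session = [
    Event("query", "show customer transactions in Q4 2022"),
    Event("query", "show me customer transactions for Q4 2022"),
    Event("close"),
    Event("open"),
]
print(implicit_signals(session))
```

Counts like these are only raw signals. Whether a rephrase or a reopen is negative still depends on context, which is the root cause question taken up next.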

Demetrios [00:23:14]: Well, it also feels like, all right, maybe some kind of document is retrieved, or you give a link to where you grabbed a snippet of that document, and then that link is clicked upon. And so these are all signals. How do you make sense of which signals are good and which are bad? Because I'm thinking about this and I'm like, is that good? If they click out of the chatbot, does that mean that the answer, they've gotten their answer, or does that mean that they're frustrated and they don't want it anymore?

Chinar Movsisyan [00:23:44]: Yeah, that's one of the hardest problems. I frame it as root cause analysis. Root cause analysis in general is a term that we have known for a while, for decades, right, for traditional applications, and now it's root cause analysis for LLM-based solutions, LLM-based products. So in order to understand those signals, negative, positive, neutral, we have these logs, like queries and responses, and then some information about knowledge, like context, and maybe the retrieval system settings are available, like chunk size, all these things. How can we use all these inputs to perform some root cause analysis, derive all these signals, and then, based on those signals, maybe get some list of actions to automate this whole process?

Demetrios [00:24:44]: So it's basically trying to create metadata on all of the information that's going, like, the chunk size or which chunks were fed into the LLM. Are you also looking at which embedding model was used to create the embeddings? Are you looking at that far back? How do you.

Chinar Movsisyan [00:25:02]: So far, no, we haven't done that part yet. But I bring this analogy of onion layers: you can do this more and more, but it depends on the data that you have access to. If you have access to only the hyperparameters of the retrieval system and a couple of other hyperparameters, like temperature, yeah, you can do some analysis based on that and provide a list of actions about those hyperparameters as well. And when you have access to the embedding model, yeah, you can do more and more, but it depends on what information you have access to.
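
The onion-layer idea can be sketched as a log record whose optional fields determine how deep an analysis can go. This is only an illustration under stated assumptions: the `InteractionLog` fields and the layer names below are hypothetical, chosen to mirror the items mentioned in the conversation (queries, responses, retrieved context, chunk size, temperature, embedding model).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InteractionLog:
    # Always available: the conversation itself.
    query: str
    response: str
    # Optional layers; None means the team has no access to that data.
    retrieved_chunks: Optional[List[str]] = None
    chunk_size: Optional[int] = None
    temperature: Optional[float] = None
    embedding_model: Optional[str] = None


def available_layers(log: InteractionLog) -> List[str]:
    """Return which analysis layers this log record can support."""
    layers = ["conversation"]
    if log.retrieved_chunks is not None:
        layers.append("retrieved_context")
    if log.chunk_size is not None or log.temperature is not None:
        layers.append("hyperparameters")
    if log.embedding_model is not None:
        layers.append("embedding_model")
    return layers


log = InteractionLog(
    query="customer transactions in Q4 2022",
    response="Here are the business transactions...",
    chunk_size=512,
    temperature=0.2,
)
print(available_layers(log))
```

Keeping the deeper fields optional lets the same record type serve teams that can log only queries and responses as well as teams that expose their full retrieval configuration.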

Demetrios [00:25:50]: So, yeah, yeah, that makes a lot of sense. And basically you're saying, all right, let's get all the analytics on what the situation is instead of just the actual text that the user wrote back or the output text. It's like trying to capture every relevant piece of data so you can fully analyze the output or the experience that someone had with that chatbot, as opposed to that one question, did it get answered? What would the correct answer be versus what the output of the LLM is? You're like, how can we get a full picture of what's going on here?

Chinar Movsisyan [00:26:34]: Yes, yes. Yeah.

Demetrios [00:26:36]: Okay, so that's a chatbot. What are other ways that you've seen this happening?

Chinar Movsisyan [00:26:43]: Yeah, chatbots are one of the interfaces through which LLM-based solutions are deployed right now, right? Besides chatbots, we have agentic workflows, or conversational AI, or just a dashboard, like analytics for intelligent document processing, where, again, they can do some queries, and based on those queries, they see some analytics. And, yeah, agentic workflows are going to be the next big thing. They are being deployed in a lot of verticals, and root cause analysis of agents, of agentic workflows, is more difficult than just root cause analysis of RAG or prompt engineering.

Demetrios [00:27:41]: Yeah, break that down for me. What are things that we need to be aware of as we're trying to root cause the agentic workflow?

Chinar Movsisyan [00:27:49]: Oh, in order to root cause this agentic workflow... An agentic workflow is implemented based on LLMs, RAG, like small, different ways of implementing LLMs, right? Prompt engineering, all these things. So for an agentic workflow, it's like orchestration of other root cause analyses. Like root cause one, root cause two, and then orchestration of these. So that's why I'm saying it's more difficult than the classic RAG, as we can call it, but we should solve this to be able to solve the orchestrated problem.

Demetrios [00:28:43]: Yes. So it's not only that, basically, you're just going up in complexity. You're multiplying the complexity every time you add a new agent. And so what are some other ways that you are looking at the data to be able to root cause it? Like, I imagine you can get as creative as possible, but have you seen big levers that give you a much clearer picture?

Chinar Movsisyan [00:29:13]: Can you elaborate on the question a bit more?

Demetrios [00:29:16]: Yeah. So I could see myself thinking, wow, if I can get all these different metrics, like all the metadata around it, and what browser the person is using, or everything. You can get very granular with the data that you're trying to capture to give yourself that full story and that big picture. But at the end of the day, you may find that there are three or four metrics that give you 80% of that story.

Chinar Movsisyan [00:29:51]: Yeah, yeah, that's a really good question. So we can frame it as: what is that base combination that gives you that holistic view of your product, of your generative AI based solution? Yeah, that's a good one. I don't have an answer, but basically, it's all about peeling all those layers, identifying all these metrics, and then understanding which metrics can represent the others. Maybe instead of the other three metrics over there, you can use one and just have the picture of your product.

Demetrios [00:30:40]: And how do we know? Is there clear signals that you've seen where you're like, oh, when you find those metrics, and it's probably different for every use case, but when you do find those metrics, you recognize them because they are so different or they are such an anomaly, or is it like you're really just going off of intuition and instinct?

Chinar Movsisyan [00:31:10]: You know what we did recently? We introduced this impact score idea. This impact score is calculated using different metrics, the results of several metrics. With this, we can kind of have an average, and we were able to kind of introduce one, like, metric number. So they can use that, and they can try to make some decision based on that impact score. And right now we are iterating over that and trying to understand, from fintech to legal AI, how it is different.

Demetrios [00:31:58]: Oh, yeah. And the impact score, if I'm understanding that correctly, it's just the metrics, like how impact, how impactful this metric is to the greater picture or the big metric that you're trying to move the needle on. So like, how are these small metrics moving the bigger metric?

Chinar Movsisyan [00:32:18]: So, impact score. Imagine there are issues, and an issue you can describe using a couple of metrics, but which one has more weight in terms of analyzing it further? There is no one ground truth, right? Which one should we take into account? Like, for issue one, recall is x; for issue two, recall is y. And now how can we compare them? So we kind of map it to one dimension and then we do the comparison. Issues are compared only using the impact score. What is the impact score of this issue across different customers, across different models? So, how repetitive is that issue?
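
To make the idea of mapping several metrics to one dimension concrete, here is a minimal sketch that treats the impact score as a weighted average of metric values already scaled to the range 0 to 1, and then ranks issues by it. The actual formula is not given in the conversation beyond "several metrics" and "kind of an average", so the function, the metric names (`frequency`, `severity`), and the equal default weights are assumptions for illustration only.

```python
def impact_score(metrics, weights=None):
    """Collapse several metric values (each assumed in [0, 1]) into one number."""
    if not metrics:
        raise ValueError("at least one metric is required")
    if weights is None:
        # Assumed default: every metric counts equally (a plain average).
        weights = {name: 1.0 for name in metrics}
    total_weight = sum(weights[name] for name in metrics)
    weighted_sum = sum(metrics[name] * weights[name] for name in metrics)
    return weighted_sum / total_weight


# Hypothetical issues, each described by the same pair of metrics.
issues = {
    "wrong transaction type retrieved": {"frequency": 0.8, "severity": 0.4},
    "answer too long": {"frequency": 0.2, "severity": 0.5},
}

# Rank issues on the single impact-score dimension, highest first.
ranked = sorted(issues, key=lambda name: impact_score(issues[name]), reverse=True)
for name in ranked:
    print(name, round(impact_score(issues[name]), 2))
```

Passing a `weights` dictionary lets one metric count for more than another, which matches the question raised above about which metric should carry more weight for a given issue.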

Demetrios [00:33:26]: Oh, so I think I'm seeing that you're saying like, hey, this metric may be very high for some use cases, or some instances, maybe not use cases. It's more like the instance, the interaction with this end user. This metric is off the charts. But then on other instances, this metric is pretty much non existent. So why is that? How impactful is this metric? Is it a little bit wonky? What does it really tell the story of?

Chinar Movsisyan [00:33:57]: Yeah, yeah, yeah.

Demetrios [00:33:59]: What I'm seeing is that we need data analysts for our AI metrics and our AI usage metrics. Really?

Chinar Movsisyan [00:34:11]: Yeah, yeah, yeah. And we see that it's not only about data analysts, but also product managers. These people are involved in the process, like they are in charge of, like in charge of product usage. Right. It's their role. But also it's not a classic. Like, this is a bit new, like to understand these all metrics and underlying, like root cause of those, like numbers.

Demetrios [00:34:45]: Yeah. It's like you have to have a bit of the understanding of what a data analyst does or how a data analyst thinks. You have to be able to get intimate with that data. But it's not necessarily looking at how a user is interacting with a product, a traditional product, or how someone is buying something from an e commerce store. It's not that type of data analyst or the, the internal data that a company has and analyzing how to boost revenue or whatever it may be. It's very much, and I've heard this from I can't remember who right now my brain is failing me. But they were saying how data scientists are very much in a great position because exactly what you're saying sounds a lot like what a data scientist could excel at.

Chinar Movsisyan [00:35:45]: Yeah, yeah. So what I also noticed with this AI adoption, LLM, let's call it LLM, generative AI, or whatever, is that there are eval tools, an evaluation environment for developers. But these product managers are in between, and they need something to know about these products, right? Product managers, I don't think they can use eval tools, like open source, git clone, all these things, but they need something like Mixpanel a bit more, or Google Analytics a bit more, with more features for data analysis. So we kind of need a combination of all these existing things in one place so different stakeholders can do their job.

Demetrios [00:36:52]: And now that is a really key point because you're talking about how the product manager needs to understand the value of the AI product and just using the evaluation tooling that is built for software engineers and machine learning engineers or AI engineers, that doesn't serve them very well.

Chinar Movsisyan [00:37:16]: No.

Demetrios [00:37:17]: Yeah, that brings up another great point. And it's almost like the same idea, but because now we have so many different personas that are able to use AI, and the traditional tooling, like the infrastructure tooling and tooling in general, is built for engineers. I'm just making a broad, sweeping general statement: the majority of the tooling is built for engineers. You get the marketer who has been using low code, no code tools, and they've been able to create some kind of a use case that generates a lot of cash, but they have no way to track their prompts. Right. And it's like they're not going to whip up Weights & Biases or MLflow. No, that's not built for them.

Demetrios [00:38:09]: And so there's these tools that will need to come onto the market that are specifically for the low code, no code users or the people that aren't interested in the software tooling side of things, but they still need to do their job with the AI products that are going out.

Chinar Movsisyan [00:38:30]: Yeah, yeah, yeah. And on another note, there should be tooling for that, because what I have seen with these product managers is they are saying, "We try to communicate all these things to AI engineers using Miro; we write things down." And I'm like, how? That's not a good workflow. How do they use it? And they were like, "We translate that." Let's say it's financial analysis: I'm a product manager, I'm dealing with that financial analyst at Bank of America, and my AI engineer doesn't want to deal with that, of course. So then I need to translate that for this AI engineer. I need to get that information, write it down, and structure it for the AI engineer, maybe using Miro. I have heard that they are using Miro, and I'm like, how is that possible? You have a lot of tools.

Chinar Movsisyan [00:39:34]: These people have a lot of eval tools, a lot of things. You also use Jira. No, there should be something where all these people can just log in and everyone will be happy, right?

Demetrios [00:39:47]: Well, even just thinking about, if you are the product manager and you're doing what you were just talking about, let's gather as much data as possible and analyze this data so that we can get a full story on how effective our AI product is. Right? Let's say that is an awesome way of deciding if an AI product is valuable or not. How do you put that into practice? Right? Like, is the AI engineer going to call up your data engineering friends? You're going to be like, okay, so now all of a sudden, we've got a ton of data. We've got to figure out where we're going to store it, how we're going to present it to our product managers, how we're going to do all of that. You now are going through that cycle, and you're doing what you do with traditional products, but in a little bit of a wonky way, I would say. It's so new that you're kind of searching in the dark, and you're figuring things out as you go, and you're recognizing, "Oh, do we also need that data? Maybe that could be good. Whatever. Let's throw it in there just in case."

Chinar Movsisyan [00:41:01]: Yeah, yeah, exactly. Maybe we need that for the second iteration. And we have heard this as well. I was interviewing this small AI team, and they are delivering to financial institutions, finance, non-technical people. And I asked, what is the challenge right now? Is it about RAG? What is it about? And he was like, it's about translating these end users' experience, in terms of product usage, into code requirements, into feature requirements. This translation is about what kind of data we need, how to process that data, how to understand it, and how to incorporate that information into, let's say, a RAG implementation. So there is a kind of gap between these two people, even though they have the same mission: let's have an LLM in production.

Demetrios [00:42:13]: That is 100% something that I can understand and vibe with. It's like there is a disconnect, but it's not a disconnect on purpose. It's just a disconnect in the way that things have shaped up. It's almost like we've run really, really fast on the engineering side, and we've been trying to keep up on all the other sides of the business. But since the exciting part is all the cool stuff you can do on the engineering side, that's where the energy has been going, and that's what people have been focusing on. And now I understand completely what you're saying: evaluation is something that we've been thinking of and we've been talking about, but we almost need to look at it from a different lens. As opposed to only the engineers looking at evaluation metrics and evaluating the output, we need to bring in more stakeholders.

Chinar Movsisyan [00:43:19]: Yeah, yeah, definitely. Because this is not a one-sided problem. They say, "You are an engineer, you have to deliver a 100% accurate chatbot." But, I mean, it's not only my problem, right? There are a lot of people involved in this process. They get this business problem, they try to kind of explain it to me as an engineer, and then they are not happy with my work. But these people should take care of that as well. How can I do that alone, right?

Chinar Movsisyan [00:43:57]: And then have this back and forth? I had that before this generative AI, too. I remember when I was delivering that agriculture object detection model and everything. From time to time, they were coming back: "Oh, it's not working well, the weather changed. Now it's very rainy." And because of that, we didn't have much data. I didn't know that this fall we would have a lot of rain.

Chinar Movsisyan [00:44:28]: So this kind of thing, you know. But there should be something where everyone is aligned, some platform or toolkit. So basically, the same is happening with LLM-based solutions.

Demetrios [00:44:38]: It goes back to the big disconnect that you've seen with the subject matter experts not being able to properly work with the engineers and map out all these different use cases or all these failure modes or the whole user journey, whatever it may be that you're trying to explain. You have so many different people that need to be involved, and we as a community and as like, just people working on LLMs have almost over optimized, it feels like, for the engineering side of the house. And now we need to play catch up on all the other sides of the house. And how can we get every stakeholder properly in the room, having the conversation and making sure that the project is successful?

Chinar Movsisyan [00:45:35]: Yeah, yeah, 100%. If I compare the problem that I was facing six, seven years ago with this object detection task to now, the difference is that with these generative AI, LLM-based solutions, like a chatbot, it is more user-centric. A bunch of users are using it. But for the computer vision model, it was just, let's say, five drones with this spraying model to spray the field automatically. That was very different: like, five complaint messages. Okay, we will do something better. Here it's hard: thousands of users, I don't know, many thousands of users are using it, and then you're getting all this, and you don't have any insights, you don't have any data-driven way to understand. And then there is this disconnect, as you call it.

Demetrios [00:46:41]: This is fascinating to really harp on. And it's exciting to think about how much potential there is because of this, basically this gaping hole in the way that we're doing things now. And almost like a, not a utopia, but just, hey, wouldn't it be nice if we also had other people that could be dealing with this and helping us out and bringing their wisdom and expertise into the picture?

Chinar Movsisyan [00:47:11]: Yeah, yeah, yeah. This is a multi-angle problem. It's not only about engineering, it's not only about product, it's not only about maybe customer success. It's all about aligning all these stakeholders in one place, and, yeah, I feel like with that, we will have some approach to solve the problem, 100%.

Demetrios [00:47:40]: All right, before we go, I gotta know, what's your deal with flamingos?

Chinar Movsisyan [00:47:44]: Oh, my. I didn't remember that. Oh, I love them. You just saw that my hair is pink, so I love them so much. The design and everything we have is all about pink and purple, so it's just love. You cannot do anything about that.

Demetrios [00:48:04]: Well, this has been awesome. I really appreciate you coming on here, and thank you so much for making clear something that I have had a hard time articulating. But now that you say it, it's like, yeah, of course. And I really like the fact that you've gone out there. You've been talking to a whole lot of users that are trying to capitalize on AI. They've got AI in production, and now it's almost like the day two of it. Okay. So usually, when we ship products, we have observability, and we want to know how it's doing, but we don't really have that clarity on it.

Demetrios [00:48:44]: And so coming on here, talking about these issues, and hopefully there are people out there that have been seeing this, too, and they've been recognizing it. And I would encourage anyone that is also on the same wavelength to reach out to you.

Chinar Movsisyan [00:49:01]: Awesome. Thank you so much. Thank you for having me. This was awesome.
