MLOps Community

Evaluation // Panel 1

Posted Jun 28, 2023 | Views 376
# LLM
# LLM in Production
# Scalable Evaluation
# Redis.io
# Predibase.com
# Humanloop.com
# Anyscale.com
# Zillis.com
# Arize.com
# Nvidia.com
# TrueFoundry.com
# Premai.io
# Continual.ai
# Argilla.io
# Genesiscloud.com
# Rungalileo.io
SPEAKERS
Abi Aryan
Machine Learning Engineer @ Independent Consultant

Abi is a machine learning engineer and an independent consultant with over 7 years of industry experience adapting ML research to solve real-world engineering challenges for businesses across e-commerce, insurance, education, and media & entertainment. She is responsible for machine learning infrastructure design and for model development, integration, and deployment at scale for data analysis, computer vision, audio-speech synthesis, and natural language processing. She is also currently writing about and working on autonomous agents and evaluation frameworks for large language models as a researcher at Bolkay.

Prior to consulting, Abi was a visiting research scholar at UCLA, working at the Cognitive Sciences Lab with Dr. Judea Pearl on developing intelligent agents. She has authored research papers in AutoML and reinforcement learning (later accepted for poster presentation at AAAI 2020) and has served as an invited reviewer, area chair, and co-chair at multiple conferences, including AABI 2023, PyData NYC '22, ACL '21, NeurIPS '18, and PyData LA '18.

Amrutha Gujjar
CEO & Co-Founder @ Structured

Amrutha Gujjar is a senior software engineer and the CEO & Co-Founder of Structured, based in New York. With a Bachelor of Science in Computer Science from the University of Washington's Allen School of CSE, she brings expertise in software development and leadership to her work.

Amrutha has experience working at top tech companies, including Google, Facebook, and Microsoft, where she worked on a variety of projects spanning machine learning, data collection study platforms, and infrastructure. She has also had the honor of being a TEDx Redmond speaker and receiving several awards and honors, including being a National Merit Scholarship Finalist and an NCWIT Aspirations in Computing National Award Finalist and Washington Affiliate recipient.

In her current role at Structured, Amrutha is focused on unlocking the power of expert knowledge to supercharge language models. Prior to this, she spent four years at Facebook as a Senior Software Engineer, where she worked on the Community Integrity team building a knowledge graph to maintain policy designations of terrorism and hate organizations on Facebook platforms.

Amrutha's passion for computer science and leadership is evident through her work and involvement in several organizations, including being a keynote speaker for the Northshore Schools Foundation and attending the Grace Hopper Conference. Amrutha is also a Contrary Fellow and a Z Fellow, and has completed YC Startup School.

Connect with Amrutha on LinkedIn to learn more about her experience and to discuss opportunities in software development and leadership.

Josh Tobin
Founder @ Gantry

Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.

Sohini Roy
Senior Developer Relations Manager @ NVIDIA

Sohini Bianka Roy is a senior developer relations manager at NVIDIA, working within the Enterprise Product group. With a passion for the intersection of machine learning and operations, Sohini specializes in the domains of MLOps and LLMOps. With her extensive experience in the field, she plays a crucial role in bridging the gap between developers and enterprise customers, ensuring smooth integration and deployment of NVIDIA's cutting-edge technologies. Sohini's expertise lies in enabling organizations to maximize the potential of machine learning models in real-world scenarios through efficient and scalable operational practices. Her insights continue to drive innovation and success for enterprises navigating the rapidly evolving landscape of machine learning and operations. Previously, Sohini was a product manager at Canonical, supporting products from Ubuntu on Windows Subsystem for Linux to their Charmed Kubernetes portfolio. She holds a bachelor's degree from Carnegie Mellon University in Materials Science and Biomedical Engineering.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Language models are complex, which introduces several challenges for interpretability. The large amounts of data required to train these black-box models make it even harder to understand why a language model generates a particular output. In the past, transformer models were typically evaluated using perplexity, BLEU score, or human evaluation. LLMs amplify the problem even further: their generative nature makes them susceptible to hallucinations and factual inaccuracies. Evaluation therefore becomes an important concern.

TRANSCRIPT

Now we've got a panel coming up all about evaluation, so I'm gonna bring our host with the most onto the stage. Abi, how's it going? Hey, what's up? I am excited for this panel that you are going to lead right now. And before we jump on and do the full panel, I just wanted everyone to know that Abi and I, Abi's like one of our most frequent co-hosts on the MLOps Community podcast.

So if anyone wants to check out the podcast that we do, there is a QR code. Go ahead and take a snapshot of that, or just scan it, and you will see, uh, all the different podcast episodes that we have. Abi, if I had to put you on the spot and ask for your favorite podcast that we have recorded, which one would it be?

I think I'll have a hard time picking from two. One is by Alex Ner from Skel, and the other is from two people I'm very good friends with now, which is Maria and Basak. Hmm, that one just came out. So yeah, if anyone wants to check it out, they had this really cool MLOps framework and maturity assessment test that you can do if you are wondering where you're at with ML in your organization.

They talked in depth about that in one of the most recent episodes, so I'll leave a link to that in the chat and let everyone dig in. That was a really good one, I must say. And I just wanna mention too, because there were a few technical difficulties, in case anyone cannot see or your, uh, your stream got paused, like my stream did.

Just go ahead and refresh and it should hopefully come back, and if it's not getting back to normal, let us know in the chat so we can troubleshoot. But you should be good. Now let's bring on the rest of the guests. First up, Josh Tobin is here with us, Mr. Full Stack Deep Learning, and also the surprise guest of our in-person workshop that is going on in San Francisco later today.

So it's good to have you here, man. I'm glad that we are getting to do this and this is happening. We've also got Sohini Roy coming up. Hello. And this is where I get to say the last person. Then Abi, I'm handing it over to you to take over. All right. And where's Amrutha? You there? Yes. There it is. Okay, so now there's too many people on the stage.

I will remove myself and let you all take it from here.

Okay. Hi everyone. Hello. Well, so very quick intro for Josh, uh, Amrutha, as well as Sohini. Josh is the founder and CEO of Gantry. They're working on tools to analyze, explore, and visualize, so you can improve your models using Gantry. They have an SDK that is available. Amrutha is the CEO and co-founder of Structured.

They're building data engineering tools for LLMs, so everything from your ingestion to your preprocessing pipelines, they've got you covered. Sohini is the senior developer relations manager at NVIDIA, and there's a lot to mention about her. She's been very active in the MLOps as well as the LLMOps community, and one of the things we wanna talk about today is the open-source toolkit that they've recently released, uh, NeMo Guardrails, uh, and any more work that they have on LLM-based conversational systems.

So, I wanna start with all of you first, because you've been in the space for quite a while, all of you, from ML to now LLMs. How do you think about this, but also because now you have a company in this space, uh, plus, for Sohini specifically, you're talking to a lot of enterprise companies right now.

How do you think about evaluations at your company? Ooh, who goes first? So, Josh, sorry. I'm good with that in that series. Beautiful. Will the expert talk first? Yeah. Um, I have a lot to say on this. I mean, should I just do my whole thing or, uh, I'll, I can do the short version. Um, yeah. I think, um, for those of you who have been in the ML world, um, pre-LLMs, um, you know, there's kind of a way of thinking about evaluation that we're used to, which is, um, based on the fact that in the pre-LLM world, most of the time the project that we're working on starts by building a data set.

Um, and most of the time, since we're training a model, we have a clear, like, objective function that we're trying to optimize on that data set. And so that makes, you know, a, a naive pass at evaluation really easy. 'Cause it's like, okay, you hold out some data from your training set, you use the same metric that you are training the model on, and that gives you an evaluation of how your model's doing.

Now, that hides a lot of complexity under the rug, but, um, uh, I think moving to generative, um, AI and LLMs, um, some of those assumptions are violated, and that's, I think, what's making this so difficult for a lot of companies to figure out. So first of all, when I start an LLM project, I usually don't start by building a data set.

I usually start by, you know, thinking of what I want this thing to do and then just, you know, coming up with a prompt to try to encourage it to have that behavior. Um, and so that means that, um, one of the key challenges for evaluating LLMs is like, what data do we even evaluate them on? How do we know what the right data set is to, um, to test these models on?

And then the second question comes from, um, the second challenge comes from violating the other assumption that we have in traditional ML, which is that, you know, oftentimes in, in generative ML, we don't have a clear objective function, right? Like, let's say that you're building a summarization model. How do you measure, for two summaries of the same document, whether one is better than the other, or whether one is adequate?

Um, it's a really non-obvious question. Um, and so those two challenges I think are at the core of, like, what makes this really difficult for a lot of companies. And, um, I'll, uh, I'll pause there. I, I think there's a, um, a, a framework that you can use to think through how to, how to do this well. Um, but, uh, I, I will, uh, I'll save that maybe for a later question.

Oh man, what a, what a cliffhanger on that. I would say, uh, there's so many things to think about when it comes to evaluation. I agree that, you know, thinking with the end in mind first, that's definitely what we got our clients to do. Like, uh, similar to what Josh was saying, you know, uh, but also it's really tough when you're, you're creating something for such a generalized environment.

I think a good example of, of where this was interesting... actually, you know what, we'll, we'll pause on the example 'cause I think that'll be helpful later. Um, but just coming back to it more broadly, what do I think about evaluation? I think ultimately it comes down to accuracy and speed, right? But accuracy is so dependent on, on what the actual goals are.

Um, you know, thinking about, I guess, more detailed questions: is it helpful? Does the model actually do what you instructed it to do? And when it comes to accuracy, does the model actually answer, um, something correctly? Is it coherent? Those are two very different ways of measuring, you know, how accurate, how, how well, you know, things are set up.

I think hallucinations are another really important, um, thing to be considering. So does the model make up any part of the response? How are you putting controls in for that? You know, are they, are they deviating? Um, another part I think is, um, context. So does the model actually remember what you're talking about?

Are you staying on topic? Um, and I think also for, for my particular realm, when we talk about the framework NeMo Guardrails, which I'm sure we can get into in a little bit, for how it can actually support this, but is it safe? Is it secret? Like, that's what I keep reminding my clients, sorry. You know, is it secret?

Is it gonna, and is it safe? Are we gonna execute malicious code? Is it gonna call an external application? Is the data set robust enough to prevent individual identification? You know, especially when thinking about healthcare and privacy codes, um, toxicity, bias. I mean, there's so many, so many different factors.

But if you really have to sum it up to the top two, it's: is it accurate, and what's the speed, um, and latency for it? So much to talk about. This is a really great session. Uh, Amrutha, what, what do you think? Yeah. Um, so yeah. Um, great answer. Um, so one thing I always think about is, when you're doing evaluation for, um, you know, let's say like a classifier, there's also some notion of what ground truth looks like, what a correct answer looks like.

If you're building a classifier of, like, oh, cat versus dog, or red versus blue, there's a ground truth that you can always operate from and, like, evaluate against. I think with LLMs, um, one thing that I noticed is, because everything is so qualitative, even from person to person, like thinking about what does a good answer look like?

What does a bad answer look like? What is the expected length? How concise should it be? What kind of tone should it have? There are, like, I think, attributes of what a good answer looks like, and that varies based on the expectations of the user of the model. So I feel like there's an opportunity to build an evaluation mechanism that is highly personalized.

And I think the primitives of how to build that are things that are still, um, up in the air right now. And, you know, I'm really excited to find that. So I, I think I really love the collective perspective here, but I wanna, I wanna ask you more, um, about, you know, how do we think about that perspective for different companies?

So right now there are two different kinds of interest groups, uh, when it comes to large language models. The first are the model builders, and the second are the people, the developers, who are building their own applications on top of these large language models. What would an ideal performance mean for both of those interest groups?

And how do you, how do you decide how to think about building an evaluation framework when you're probably on each of those sides?

Yeah, I mean, I think, um, one thing that you're pointing out is one big change that I've seen in the industry in the last, um, uh, six months or so, is, um, call it the ChatGPT effect, right? So, um, in the old world, the olden days, you know, back, back before, for many of you, um, you know, ML was even a twinkle in your eye.

Um, machine learning projects were these, like, just really technically complex, um, uh, high-effort projects in most companies, right? Like, I personally don't know of any examples in large companies where a project, a deep learning project, took less than, call it, six months, and most of them took more than a year.

Um, now in the post-ChatGPT world, um, all these companies, like, you see these announcements every week or every couple of weeks from companies that are building, announcing their, you know, um, uh, ChatGPT-powered feature. And if you talk to these folks, the amazing thing is a lot of these, a lot of these products were built in, like, weeks, three, four weeks.

Um, so you ask yourself, like, how does that happen? And the answer, unfortunately, um, for those of us on the ML side of the house, is, um, the reason why they shipped those things so quickly is because the ML people on the team didn't build them. Um, it's the software engineers that built them. And, you know, that's not, that's not to say that ML folks don't have a role to play.

Uh, going forward, I think actually evaluation is one of the critical places where ML folks have a role to play. But what, um, what ChatGPT did is it really, um, kind of lowered the, um, the, the barrier to entry and also, um, I think what's really critical, like, the, the level of intimidation that most folks in the org have about interacting with these systems.

So if you're like an exec in a company and you're sponsoring a deep learning project, you kind of, you know, you kind of, like, hire the, the nerds, the PhDs, and you let them do their thing for six months and you hope that something good comes out of it. But now if you're building, like, a ChatGPT project, you've been playing around with ChatGPT on your own, you know, you know that it's not, um, that difficult to get it to do what you want it to do.

And so what, what I'm seeing is, um, what we're seeing is that a lot of folks in, um, uh, in non-technical roles are a lot more involved in the process of building these applications than they were, um, six months or a year ago. And so the implication that that has for evaluation, I think, is actually really positive, which is that, um, you know, one of the key things, um, for evaluating LLM-based applications is, um, I think the thing that we need to find as an industry is the right mix between automated and human-based evaluation.

Um, but the great thing about having more stakeholders involved is that you can involve those stakeholders in the process in such a way that they are helping you progressively evaluate the quality of your model as they go. Um, and so I think, you know, I, I really see, like, the, um, the non-technical stakeholders, you know, sometimes are actually doing a lot of development in the, um, prompt engineering sense, but they're also, like, the producers of evaluations, and then the technical folks are more the consumers.

A hundred percent. I mean, like, I think that's a really great point. I think that, um, the, the line is also, I think, for domain-specific information, like grabbing people who've been in the industry or in that particular domain for 20, 25 years, right, and bringing that information in and layering that down with,

you know, software engineers, ML engineers, and, and actually layering that in. It's a delicate balance, I think, again, between human evaluation and automated evaluation. Um, it's, it's an interesting question. I have to mull a little bit on that. Uh, I don't know, Amrutha, any, any other things to add? Josh, it's so great that you're going first.

I feel like you're, you're getting the right, like, framework in for us. Um, yeah. Um, I don't think I have too much to add on that one.

Yeah, I, I think that was a very, very comprehensive answer. I'll, I'll say, which is basically, like, I, I guess it, it depends. Most of, at least, my personal experience, having talked to a couple of companies, has been that none of the large language models, at least at big companies, have been deployed in production.

We're not really talking about OpenAI, we're not talking about Google and all of those companies, but let's say about banks, which were huge when it came to financial modeling and such, or were early adopters of machine learning models and such. Um, and it's, it's great that we have reduced that time for everybody to get started and have more stakeholders, because that was one of the problems with machine learning, which is, like, bringing everybody together at a single table and saying, you know, let's, let's assess what you really want out of this model.

So my second question here is: we are seeing so many evaluation benchmarks right now, from HELM to, um, you know, a couple of others, and there are so many leaderboards that are available. How do we create stable experimental setups to be able to validate and monitor the accuracy of our LLM-based applications during and post-deployment?

Yeah. Um, maybe I can kick off on that one. So, one thing I often think about is how do you, like, anytime you have an experimental setup, you want to keep all the variables the same as much as possible. And I think like in this case, like having like a set of prompts, uh, that you, uh, test against and benchmark against on like a continuous basis is like super important.

So one thing we've noticed, even from experimentation that we've done at our company, is you can put the same input into, like, GPT-3.5 Turbo, and it gives you a different answer, like, one hour later versus, like, one hour before. And, um, I think that, you know, there's, like, a lot of reasons for that, and it's, like, not, like, deterministic in the way that it answers.

And so it's, um, the outputs are, you know, going to be different even if you have the same inputs. But I think that, like, if you do a qualitative evaluation where you keep the inputs as stable as possible, and, like, maybe have, like, some sampling and ongoing measurement of the system health, and, like, perhaps alerts if it, like, drops too low, you can, um, turn this into, like, an observability problem as well in a lot of ways.

And I think that, um, building those kinds of systems will help like, you know, like help us respond when, um, When the quality of the output goes bad.
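Amrutha's suggestion, keeping a fixed prompt set and scoring it on a schedule, can be sketched in a few lines. Everything named below (the prompts, the `call_model` stub, the scoring heuristic, the alert threshold) is an illustrative assumption rather than anything the panelists shipped; the point is only the shape of the check.

```python
# A minimal sketch of a fixed-prompt regression check, assuming a placeholder
# model client. Pin the model version and set temperature to 0 in the real
# call to reduce run-to-run variance; prompts and thresholds here are made up.

from statistics import mean

BENCHMARK_PROMPTS = [
    "Summarize our refund policy in two sentences.",
    "List three risks of deploying an unmonitored chatbot.",
    # Keep this set fixed so results stay comparable across runs.
]

ALERT_THRESHOLD = 0.8  # assumed minimum acceptable average score


def call_model(prompt: str) -> str:
    """Stand-in for the real LLM call; replace with your own client."""
    return f"(placeholder answer for: {prompt})"


def score_output(output: str) -> float:
    """Toy quality score in [0, 1]; swap in human labels or an LLM judge."""
    return 1.0 if 0 < len(output) < 2000 else 0.0


def run_health_check() -> float:
    scores = [score_output(call_model(p)) for p in BENCHMARK_PROMPTS]
    avg = mean(scores)
    if avg < ALERT_THRESHOLD:
        print(f"ALERT: benchmark score dropped to {avg:.2f}")  # wire into real alerting
    return avg


if __name__ == "__main__":
    print(f"average benchmark score: {run_health_check():.2f}")
```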

Yeah, I, I totally agree with that. I mean, uh, not to do the, not to do the founder thing, but that's definitely one of the things we're betting on at Gantry, is, uh, observability playing a key role here. Um, but, uh, one thing I, I would want to add is, um, I think if your role is a researcher, or if you're in a really early stage of a project where you're kind of deciding what model to choose,

then, um, public benchmarks are helpful. Um, I tend to prefer the sort of Elo-based benchmarks. Um, I think they, um, correlate a little bit better with my subjective experience of interacting with models and how they seem to perform. But if your job is not research or, you know, prototyping something, but your job is building an application with language models, then, uh, you should not rely on public benchmarks. Like, they are, um, basically, almost, almost the equivalent of useless for you in that role.

Um, and the reason for that is they're not evaluating the model on the data that your users care about, and they're not measuring the outcomes that your users care about, right? And so if you think about, like, measurements in, uh, machine learning in general, um, there's kind of, like, I think of it as, like, a pyramid of, uh, of, you know, usefulness versus, uh, ease of measurement.

Um, the most useful signal that you can always look at is outcomes, right? Like, is the system that this model is part of solving the problem for your end users? It's a difficult thing to measure, but really, really valuable to measure. Um, and then, you know, easier to measure, but less useful than that,

um, are things like, hey, are there proxies that we can look at, like, uh, you know, um, accuracy, or asking another language model if the output of this language model is good, that correlate with the metric that we ultimately care about? And then all the way at the bottom of this pyramid, really easy, really accessible, super low-hanging fruit, but not very useful, is, like, looking at publicly available benchmarks.

Um, you know, you're just not really gonna learn that much about your task by doing that.
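To make the middle of Josh's pyramid concrete, here is a rough sketch of the proxy idea he mentions, asking another language model whether an output is good, applied to the summarization example from earlier in the panel. The grading template, the 1-to-5 rubric, and the `judge` stub are assumptions for illustration; a real setup would also spot-check this proxy against actual user outcomes before trusting it.

```python
# Rough sketch of a proxy metric: grade outputs on *your own* data with a
# second model (an "LLM as judge") instead of relying on a public benchmark.
# The rubric, template, and `judge` stub are assumptions for illustration.

JUDGE_TEMPLATE = """You are grading a summary for faithfulness and usefulness.

Document:
{document}

Summary:
{summary}

Reply with a single integer from 1 (unusable) to 5 (excellent)."""


def judge(prompt: str) -> str:
    """Stand-in for a call to whatever grading model you trust."""
    return "4"


def proxy_score(document: str, summary: str) -> int:
    reply = judge(JUDGE_TEMPLATE.format(document=document, summary=summary))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # treat unparseable replies as worst case


# Usage: average proxy_score over examples sampled from real traffic, and
# periodically compare it against outcome metrics higher up the pyramid.
```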

Yeah, my 2 cents: you know, like Amrutha said, standardization is king when it comes to this kind of thing. Um, automated testing, I think synthetic monitoring systems and those custom dashboards, uh, through that observability stack, I think, is getting really exciting. I've seen a lot of really interesting tools that are, that are doing this really well.

Um, I'm actually curious to dive a little bit more into, to Amrutha's response to this as well, but a quick shout-out to Arize. I think their CEO or CPO is doing a, a, a presentation on this later. I think their, their stuff is really interesting. Hugging Face also did, um, some custom evaluation metrics that I think make it much, much easier for, for the proper setup, and making sure you're standardizing and evaluating consistently throughout.

Um, uh, I think, Josh, you mentioned earlier doing user and human feedback for ongoing iterative development; I think that's gonna continue to be important for, for how we set up the evaluation and monitor the accuracy long term. But, uh, you're right, I, I think those, those public standard benchmarks are probably the, the base, and then continuing to, to evaluate and use your own metrics will be super important.

Yeah. So one of the questions I wanted to ask you, because we are going into the domain of talking about, you know, we need to have very domain-specific, company-specific benchmarks: uh, what are the popular use cases for large language models that you've seen, uh, in, in production right now?

I think one example that, that might come to mind is, I think, Google's, uh, I don't know, is it PaLM? However they choose to say it, they released their, their second version, I think, back in, in April. And it's, it's a really interesting case study on how to develop something that's very domain-specific, because, um, I, I like it because it's very much designed for purpose.

I think Josh said in the beginning about how do you design the prompts and the testing data appropriately for the outcome in mind, right? How do you design for the end goal in mind? Um, they used data sets that were in Q&A form, in long- and short-answer form. Um, the inputs were, you know, search queries for biomedical scientific literature and, and, and robust medical knowledge.

So that was actually a really, really great example of, of a good foundation. But I think, um, they had metrics that also aligned, and they tested it against, I think, um, the, what is it, the US medical licensing questions, um, as, as their metric for actually determining whether it's successful. Um, but in addition to that, what I liked is they, they reached out for human feedback, not only from clinicians, uh, clinicians to answer for the accuracy of the specific responses, but also from non-clinicians from, you know, a range of backgrounds, you know, tons of different countries and languages, to make sure the information that was coming out was accessible.

Um, and, and something reasonable. And I think that's actually, you know, everyone's gone on a, you know, Google spiral asking medical questions. So I think it was actually very relatable, and, and the results were, were really phenomenal in their ability to, to answer those medical-specific questions in a, in a, you know, in a, in a strong way.

Um, I hope I've, I've kind of dug into that, but I, I think that's one particular example where, where I think that, that was done really well. I think another one is actually BloombergGPT. I don't know if that's being used out in production, but I think, um, as far as domain-specific benchmarking and assessments for financial-specific questions, that one was, was a really strong example.

I think the use cases that I've found really compelling are, um, in like the customer success and customer response, uh, categories. And I think like it's one of those things where, um, you know, being able to respond to your users quickly is like, uh, you know, a huge win for companies. And being able to scale, like, uh, customer uh, interfacing, uh, roles is like a big challenge and can be expensive.

But, um, I think with LLMs you have like an opportunity to, um, really like, like leverage, like knowledge bases that you might have and expose that in a way that can like, you know, enable your customers to like, interact with your product in a more meaningful way. And I think that's really powerful. Um, there's also like a really good built-in way to evaluate the quality of responses here because you can just ask the user like, oh, did this solve your problem for you?

And you can also look at, like, um, you know, other metrics, like, oh, how often are they coming back? Like, how, how many messages did it take to resolve this issue? And there are, like, a lot of really great ways to, um, kind of proxy the quality of the responses. Uh, and so I think that's a good one. Good point.
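The feedback signals Amrutha lists, an explicit "did this solve your problem?" answer, repeat contacts, and messages-to-resolution, are easy to roll up once conversations are logged. A minimal sketch is below; the record shape and field names are assumptions made up for illustration, to be mapped onto whatever your own logs capture.

```python
# Sketch of rolling up the support-bot signals mentioned above.
# The Conversation fields are assumed, not a specific product's schema.

from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Conversation:
    message_count: int                 # total turns in the thread
    resolved_by_bot: bool              # user confirmed the bot solved it
    escalated_to_human: bool           # handed off to a support agent
    user_rating: Optional[int] = None  # e.g. thumbs up/down mapped to 1/0


def support_metrics(conversations: List[Conversation]) -> dict:
    resolved = [c for c in conversations if c.resolved_by_bot]
    rated = [c.user_rating for c in conversations if c.user_rating is not None]
    n = len(conversations)
    return {
        "resolution_rate": len(resolved) / n if n else None,
        "escalation_rate": sum(c.escalated_to_human for c in conversations) / n if n else None,
        "avg_messages_to_resolution": mean(c.message_count for c in resolved) if resolved else None,
        "avg_user_rating": mean(rated) if rated else None,
    }
```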

Josh, you wanna add on that?

I mean, how can you, right. Sorry. He might be frozen. I, I, I think Josh got frozen, but in, in the meantime, silence. I know.

Am I having a connection issue by the way? No, he, he can move. He can move.

Um, Josh, we were asking if you wanna add on that. Uh, no, I mean, I, I think there's kind of, like, in terms of real use cases, there's a trifecta. Um, one category is information retrieval. So it's things like search, uh, document question answering, things like that.

Um, then, uh, the second category is, um, uh, chat. So a lot of companies are building sort of chatbots into their products. Um, use cases like customer support, um, product features. Um, and then the third is, uh, um, yeah, like, uh, text generation, sort of marketing copy type use cases. Those are the three that, that, um, that we see the most.

Which one do you think is, like, most robust right now? Which is, like, meeting the needs the best right now? Mm. I mean, I get asked this all the time, and, yeah, it depends, right? Like, it's, um, for all of these, there's such a range of complexity, right? It's, like, really easy to just kind of dump, you know, uh, the ChatGPT API into your product and call it a chat app.

But then how well do you actually need it to work for it to be useful? That depends a lot on the context of your product and what you're trying to get it to solve. Um, so I would say all of them are relatively usable, um, depending on the broader product context.

And I think this is kind of, like, um, a general point on, um, how, how to think about, like, um, machine learning applications in general, which is, like, I don't think, um, I just made the mistake that I don't, I don't encourage other people to make, which is, like, I don't think it makes sense to think of ML use cases as grouped by, like, technical use cases.

Um, it's really, you should think about them in terms of product use cases, because that has a lot more of a role in determining the difficulty and, um, challenges that you're gonna face when you're building this application than what model you're using or, um, what set of techniques you're using to glue models together.

Um, 'cause ultimately, at the end of the day, like, the thing that makes building products with ML hard is that machine learning models just don't always get the answer right. That's always been true. It'll always be true. It's a, it's the nature of the technology. It's probabilistic, right?

So it's not always gonna be right. And so the question is, like, how do you build a product in a context where, you know, one of the components of your product is going to get the answer wrong? Like, you know that it's gonna get the answer wrong. Um, and so the broader product context of, like, what does it take to actually, um, work around that limitation is more important for the difficulty of the project than what set of algorithms you're using.

I really like the way you said that. Yeah. Yeah. Go ahead. One of the things I also wanted us to touch upon, maybe probably the last question that I will get to before we open up the panel to, uh, audience Q&A, is: as you're building the model, you need lots of data, and, um, all the data may, may not be good.

There are, there are techniques like early stopping to be able to filter the right kind of data sets. Um, and after that, yes, you've, you've built your model, you're now trying to deploy, but you're also putting in some checkpoints and doing some sort of logging as well throughout the process. How do you guys think about evaluating the performance at those specific points?

What tools have you seen being used, and what are the gaps that you've seen in, in terms of the tooling industry? Amrutha, for you very specifically, when it comes to data sets, what are the gaps that you've seen? And Josh and Sohini, for

both of you, I want to understand more so from, like, the perspective of the model checkpointing itself, as well as the prompt, uh, as well as, like, the prompt drifts and stuff.

So, uh, I'll let Amrutha go first, uh, let her cover the data side of things, and then we'll cover the model side of things. Yeah, absolutely. Um, so in terms of, like, dataset quality, uh, it definitely makes a really, really, really big difference. I think a lot of, a lot of systems are very garbage in, garbage out, and, like, you know, if you start off with, like, a bad, like, baseline, uh, you're going to be kind of disappointed with the results.

And I think, like, oftentimes, like, curating the dataset itself is, like, a really big challenge. Um, I think, like, sometimes, like, having, like, um, rules or heuristics, like, oh, this is about understanding your data set. Like, okay, this is, like, the missing data, the, uh, quality of the data, like, this is what's, um, incomplete.

This is what's perhaps inaccurate or not in the same form as, like, the rest of the data. I think, like, like, you know, some standard, um, data set cleaning techniques make a big difference. But then I think also curation makes a big difference, I think, with a lot of, um, a lot of these problems. Like, what you are looking for, uh, you, like, hypothetically wouldn't really need a model as well.

Like if you could define like that with words perfectly. Um, it would be difficult to, like, you wouldn't really need a model to do that. It could just be like a bunch of if statements. So you know, you are kind of dealing with like ambiguity around, uh, what you're looking for. And so there's also ambiguity around that, like curation of that data set.

And so I think it's, like, actually an iterative process. And so, like, the evaluation isn't really, like, a one-time thing, but it's something that you have to iterate on and, like, um, try to find, uh, over time as well. Um, not sure if that was, like, fully what you were looking for, but just general thoughts in that space.
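A few of the rule-based checks Amrutha describes (missing fields, truncated or oversized records, exact duplicates) can be sketched as a simple curation pass. The field name and length thresholds below are assumptions; as she notes, real curation is iterative and domain-specific, so treat this as a first filter rather than the whole job.

```python
# Sketch of a cheap, rule-based curation pass over raw records before they
# feed fine-tuning or retrieval. Field name and thresholds are assumptions.

MIN_CHARS = 20        # assumed: shorter is probably a fragment
MAX_CHARS = 20_000    # assumed: longer is probably a concatenation bug


def clean_records(records: list[dict]) -> list[dict]:
    """Keep records that pass basic completeness and de-duplication checks."""
    seen = set()
    kept = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text:                                   # missing or empty field
            continue
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):  # likely truncated or malformed
            continue
        key = text.lower()
        if key in seen:                                # exact duplicate
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```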

Yeah, I think that does answer it. Uh, Josh, do you wanna add on top of that and go into the model side of things? Yeah, I think, um, I'll take a slightly different position. I think, um, uh, I, I totally agree with what you're saying, Amrutha, if, if we're talking about machine learning, but this is the LLMs conference, and so, um, my strong contention is, if we're talking about LLMs, most of you, like most of the people in this room right now, should not be thinking about training a model.

Um, the, there is very low likelihood that you're gonna get better performance on an NLP task by training a model than you will by prompting GPT-4. That's just, that's unfortunately just the truth. It's a, it's a new world that we're living in. Like, we've, um, talked to, like, many companies at this point that had kind of ongoing NLP projects, six-month projects, nine-month projects, year-long projects, and, um, they're able to beat their performance by switching over to, um, large language models, calling an API, doing some prompt engineering, maybe some, like, few-shot, um, in-context learning, in a matter of weeks.

So, so, um, I think if your goal is to build products with machine learning, especially with, um, with NLP, then, uh, I think the, the rule of thumb is, like, no training models until product-market fit. Like, if, if you haven't built this thing out and demonstrated this thing can be really good using models that you can get off the shelf, um, for NLP, then it's not gonna work.

Like, you're not gonna be able to do better by training your own model. Um, for other domains of ML, I would have a very different answer to that question. Um, and then on the tooling side, I think, yeah, you should just come talk to Gantry. Um, we've got good solutions for all the problems that you just described.
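The pattern Josh is pointing at, calling a hosted model with a handful of in-context examples instead of training anything, looks roughly like the sketch below. The task, the examples, and the `complete` stub are assumptions for illustration; in practice you would swap in your own provider's client and your own labeled examples.

```python
# Sketch of few-shot, in-context prompting as a substitute for training a
# task-specific model. The task, examples, and `complete` stub are assumed.

FEW_SHOT_EXAMPLES = [
    ("The charger stopped working after a week.", "negative"),
    ("Setup took two minutes and it just works.", "positive"),
]


def build_prompt(text: str) -> str:
    shots = "\n\n".join(
        f"Review: {review}\nSentiment: {label}" for review, label in FEW_SHOT_EXAMPLES
    )
    return f"{shots}\n\nReview: {text}\nSentiment:"


def complete(prompt: str) -> str:
    """Stand-in for a hosted LLM call (e.g. a GPT-4-class chat endpoint)."""
    return "positive"


def classify(text: str) -> str:
    return complete(build_prompt(text)).strip().lower()


if __name__ == "__main__":
    print(classify("Battery life is far worse than advertised."))
```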

Tell us about it. Give the, give the pitch. Yeah, I mean, the pitch is, um, the pitch is, you know, I think the, um, the way that machine learning evolved, there's so much focus on training models. Our thesis, kind of from the beginning, and I think this is doubly true in the LLM world, is that training models is not the hard part anymore.

Um, the hard part is: you have this model, you've trained it, and you've gotten over the hump of, like, you know, the statistic that, like, every MLOps company used to say, like, 85% of models never make it to production, which is, like, a BS statistic, and it's definitely not true, if it ever was true.

But, um, you've gotten over that hump, you've gotten this model into production. That's when the hard part actually starts, which is, like, okay, how do we know if this thing's actually working? How do we know if it's solving the problem for our users? Um, how do we maintain it? Right? Like, I think, um, a really common issue is, um, you know, as you start to scale up the number of models per, um, ML engineer, per data scientist on your team, the percentage of time that ML engineers spend on, uh, maintenance increases and can, you know, quickly dwarf the amount of time that they're spending on actually building new features.

Um, and then lastly, you know, um, even if you have models up and running in production and you're maintaining them, you're leaving a lot of performance on the table if that's all you're doing. Because if you think about, like, what distinguishes really great machine learning powered products from, um, you know, average ones, it's, like, if you look at, like, ChatGPT or Tesla or things like that, the thing that they all have in common is that they're maniacal about building this loop around the model building process that involves, like, looking at outcomes, looking at user feedback, feeding that back into the training process.

So if you're not doing that, your, your machine learning powered product is just not gonna be great. Um, it could be okay, it could solve a problem, but it's not gonna be great. Um, and so what Gantry does is we basically, um, have, uh, infrastructure, like an infrastructure layer, that supports, um, uh, an opinionated set of workflows that teams of folks working on machine learning powered products can use to collaborate on this process of taking a model that's been deployed and using production data to maintain it, um, and to make it better over time.

And so we make that, um, a lot, uh, cheaper, easier, and more effective for machine learning teams to do together.

So it's hard to follow that. I mean, I'm a longtime fan of Gantry, a longtime fan of your, uh, blog series, so it's just, it's kinda nice to, to hear your, your perspective on it. I think for me, uh, for tool sets, I know there are a couple of ones about prompt engineering that are coming out. I have to play with them a little bit more to, to know,

uh, you know, which one's the most effective. Um, I'm getting an inkling on which one's gonna be the best. Um, but, uh, I think from the NVIDIA standpoint, obviously we, we talk about Guardrails a lot. Um, it's our new open-source framework that came out. We talk about, um, you know, I, I mentioned a little bit how even the, the data in, data out problem is, is already,

you know, something that, that we're gonna, it's, it's challenging, right? Um, but I think if you create the right guardrails for it, so making sure that we stay on topic, making sure that, you know, when someone asks a chatbot about your product's different offerings, are you gonna go and talk about, um, competitor offerings instead of your own?

Right? Um, also making sure that you've got, you know, I, I talked a little bit about, like, secrets and malicious code and management of external application access, and, uh, interactions that, you know, um, could lead to misinformation and toxic responses and inappropriate content. That's all something that we're all gonna have to work through.

I think guardrails will get you part of the way there. I think we're still experimenting with, with kind of some other tooling that can help, um, with the rest of it, but definitely, I would say, check it out. It's open source. NeMo is open source. Um, I think it might, you know, supplement a lot of the other tools that, that were mentioned today.
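For readers who want a feel for what "rails" do, here is a deliberately hand-rolled sketch of an input topic check and an output safety check in the spirit of what Sohini describes. This is not the NeMo Guardrails API (that toolkit uses its own configuration language); the allowed topics and blocked patterns are assumptions made up purely for illustration.

```python
# Hand-rolled illustration of input/output rails: keep the conversation on
# topic on the way in, and catch obvious leaks or unsafe content on the way
# out. This is NOT the NeMo Guardrails API; topics and patterns are assumed.

import re

ALLOWED_TOPICS = ("billing", "shipping", "returns")        # assumed product scope
BLOCKED_OUTPUT_PATTERNS = [
    re.compile(r"(?i)api[_-]?key\s*[:=]"),                  # naive secret-leak check
    re.compile(r"(?i)\b(rm -rf|drop table)\b"),             # naive unsafe-command check
]


def input_rail(user_message: str) -> bool:
    """Allow only messages that touch an in-scope topic."""
    return any(topic in user_message.lower() for topic in ALLOWED_TOPICS)


def output_rail(model_reply: str) -> str:
    """Replace replies that trip a safety pattern with a refusal."""
    if any(p.search(model_reply) for p in BLOCKED_OUTPUT_PATTERNS):
        return "Sorry, I can't help with that."
    return model_reply
```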

I can't see any of the questions, by the way, Abi, so we're gonna have to trust you. So I have, I, I have one question. I think that's probably the, I don't know, Demetrios is here. Are you going to kick us off, or can I make a plug for the, this workshop I'm doing tomorrow? Yes. Do it, and then we'll, yeah.

So I'm doing, like, a, um, hands-on workshop, a full hour, on, um, uh, this exact topic. So, um, if, uh, folks wanna learn more about it or chat more about it, I'll, I'll be there, um, tomorrow at, at some time in the morning, PST. I don't remember what time, but it's on the schedule. It's gonna be really cool.

So, as much as I would love to continue this panel, I sadly have to cut it off, because we are running short on time, and as you know, that is my one job: it's to keep us on time. Uh, and so I loved everything you all were saying, and there was a ton of questions in the chat that were coming through. So all of the panelists, it would be awesome if you jump in the chat and respond to some of these questions that came through, and I will, uh, leave it here and get you all outta here, but ask you to stick around and say a few things in the chat.
