MLOps Community

Incorporating LLMs in High-stakes Use Cases

Posted Jul 12, 2023 | Views 479
# High-stake Use Cases
# LLM in Production
# Moonhub
# Moonhub.ai

SPEAKER

Yada Pruksachatkun
ML Lead @ Moonhub

Yada is an NLP scientist and engineer, currently leading ML at the startup Moonhub, where she's building a next-generation people search platform to increase access to opportunity. Previously, Yada was a tech lead at the health tech startup Infinitus and a scientist at Alexa AI. She started her career in machine learning doing research in graduate school at the Bowman Lab and at Microsoft Research.


SUMMARY

While LLMs have taken the world by storm, especially for content generation, there are still many considerations in deploying them in high-risk applications. In this talk, I will go over best practices for building applications with non-determinism in mind, how to think about human-in-the-loop systems, and more.


TRANSCRIPT

All right, so we are going very quickly into our next talk with Yada. Hello, how's it going? Hello. Good. That's a great trick, having the timer on the other side of the screen. I should do that. Totally. And I like it because, as the MC, it's the worst when you have to interrupt an awesome talk.

But yesterday I got 20 minutes behind schedule, so I'm trying not to do that again. Cool. Do you want to share your slides? Yes, for sure. Can I just share my screen, or how does this work? Yeah, there should be a button where you can share your screen. All right.

Okay. Oh, there you go. Let's see.

All right. All right. Take it away. Hello everyone. Great talks so far; I hope everyone's having a great time. I'm Yada, and today I'll be talking about incorporating large language models in high-stakes use cases. A little bit about myself: we don't have that much time, so basically, I have a background in research and NLP engineering.

I've also worked in healthcare, and I'm currently working in tech. All right. So, we all know that there's a lot of hype, and we've also seen a lot of great headlines on medicine and law applications. I'm sure lots of folks are thinking: how applicable are these models to these use cases?

And they are, but with a huge caveat: all the various types of failure cases that you can see in these models, which are also prevalent in weaker models as well. Some of these are robustness to distribution shifts, which I know was talked about yesterday, as well as semantics-preserving perturbations and degradations in low-resource settings.

And you can go on and on. So what does this actually mean? Here, let's think about a therapy bot. I'm not saying we should build this, but let's use it as an example. In this case, it's really important to structure a therapy bot within some framework: maybe that's CBT, maybe that's family dynamics.

Controllability is really important. Bias and fairness are also important. So you can imagine you have this bot that you can call into, with a speech-to-text component; that speech-to-text could have issues with people with accents, and that could lead to downstream difficulties and a suboptimal experience for folks with accents, essentially.

So it's really important to think about these considerations when thinking about these new innovations. In terms of how to actually approach controllability and these limitations, it's usually wise to look at best practices from previous generations of models and ML systems. For controllability, for example, you could go back in time to the land of dialogue flow before language models.

Those were more structured environments where you think about intents, entities, and actual conversation trees and branches. What you can take from those, in terms of controllability, is having natural language understanding for your dialogue use case, as well as dialogue states.

So for therapy bots, maybe during the intake flow you should take into account and record the social history, the emergency context, et cetera, and make sure that's all recorded accordingly. Now, just speeding through some other best practices for incorporating models into high-stakes environments.

Of course, human-in-the-loop is always a best practice; we've heard a lot about it at this conference. If a domain expert is the one who's using your product, a doctor for example, you can just show them the output and have them check it. If not, have humans who are experts in the background who can check whether there are alerts or potential failure cases, at least in the worst case.

Now, one thing that is helpful is to break up tasks into smaller tasks. An example of this: if you have an information retrieval product, you might want to find a document that's relevant to a question, but instead of trying to do that over the entire document, maybe do it for each paragraph in the document and then combine the results.
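
As a concrete illustration, here is a minimal sketch of that per-paragraph approach. The TF-IDF scorer and the function below are my own illustration, not from the talk; any embedding model could stand in as the scorer:

```python
# A per-paragraph retrieval sketch: score each paragraph against the
# question instead of matching against the whole document at once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_paragraph(document: str, question: str) -> str:
    """Return the paragraph most relevant to the question."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    n = len(paragraphs)
    vectors = TfidfVectorizer().fit_transform(paragraphs + [question])
    scores = cosine_similarity(vectors[n], vectors[:n])[0]
    return paragraphs[scores.argmax()]
```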

And a couple of different thoughts on how to make tasks easier. Of course this is a generalization, and it depends on the task and the model, but usually classification is easier than generation. It also helps to reduce the output space: a thousand different classes is harder to learn than ten classes.
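
As one way to picture reducing the output space, here is a toy sketch (the intent labels are hypothetical) that maps free-form model output onto a small fixed label set, falling back safely when the model strays:

```python
# A toy sketch of shrinking the output space: accept the model's answer
# only if it falls in a small fixed label set, otherwise fall back safely.
VALID_INTENTS = {"schedule", "reschedule", "cancel", "escalate"}  # hypothetical labels

def constrain_output(raw_output: str, default: str = "escalate") -> str:
    """Map free-form model output onto the allowed label set."""
    cleaned = raw_output.strip().lower().rstrip(".")
    return cleaned if cleaned in VALID_INTENTS else default
```

A handful of well-chosen labels plus a safe default is far easier to validate than open-ended text.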

Of course, reducing the input space helps as well, so there is less variability in the input space. Now, we're all thinking about in-context examples. If you are using more off-the-shelf models, there are of course a lot of different best practices here, but in terms of choosing in-context examples, it's really important to also keep a database.

With this prompt database, you're usually going to be using some sort of embedding for retrieval, and it's also important to fine-tune these embeddings to make sure they're catered to your specific use case. Aside from that, as we've heard at this conference, we need a structured approach to building prompts, test sets, and suites, so you know exactly where your model is going wrong.
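
A rough sketch of that retrieval step, assuming a sentence-transformers encoder; the model name and the toy database entries are placeholders, and ideally the encoder would be the fine-tuned one just mentioned:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # swap in your fine-tuned encoder

# Hypothetical (input, output) pairs curated for the use case.
prompt_db = [
    ("I have trouble sleeping lately", "intake: record sleep history"),
    ("Can I talk to a real person?", "escalate: human handoff"),
]
db_vectors = encoder.encode([text for text, _ in prompt_db], normalize_embeddings=True)

def select_in_context_examples(query: str, k: int = 2):
    """Return the k stored examples most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(db_vectors @ q)[::-1][:k]
    return [prompt_db[i] for i in top]
```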

Now, speeding through some other approaches: sampling has been an approach in data science for a while, and the same holds for this newer class of models; we've seen that in self-consistency and other methods. So it's always best to ensemble, not only across black-box APIs but, if you have your own fine-tuned model, or even rule-based models and whatnot, you can ensemble all of these models.

This really ensures that you're using as much information as you have in creating a prediction. And of course, in high-stakes environments, the best practice is: don't use LLMs if you don't have to. You can use regexes, or more traditional models where you can look at the weights and understand what they're doing, like regression or random forests, although of course there are some issues with those models as well.
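
Going back to the self-consistency idea from a moment ago, a minimal sketch looks like the following; `call_model` is a hypothetical wrapper around whichever black-box API or fine-tuned model you are sampling from:

```python
from collections import Counter

def self_consistent_predict(prompt: str, call_model, n_samples: int = 5) -> str:
    """Sample several completions (temperature > 0) and majority-vote the answer."""
    answers = [call_model(prompt, temperature=0.7) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```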

And there are still issues with factuality, especially as the input, the context window, and what you're grounding on increase in size. Even though ChatGPT seems to be doing fairly well, in high-stakes environments you really need humans in the loop for emotional intelligence as well.

However, if you do still want to use models, it might be interesting to think about fine-tuning your own LLM, not just for creating a moat or whatnot, as folks are talking about, but also to have more control over the outputs, as well as the whole range of other things you can do when you own and have access to your weights.

So, off the top of my head: being able to look at confidence scores; having more control over exactly how your model performs, with no updates happening in the background that you don't know about; and being able to incorporate state-of-the-art methods from the community, research and otherwise, into your own models.

And a couple of last thoughts in closing. The first is evaluation. It's really important to evaluate over cohorts: in high-stakes environments, for all the different subpopulations that might be using your product, have performance metrics over each cohort. Look at robustness as well as calibration, that is, what is the correlation of confidence scores with how correct the model is?
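
As a sketch of what cohort-level evaluation plus a simple calibration check might look like (the record field names are assumptions about your eval set, and Pearson correlation is just one crude calibration signal):

```python
import numpy as np
from collections import defaultdict

def evaluate_by_cohort(records):
    """records: dicts with keys 'cohort', 'correct' (0 or 1), and 'confidence'."""
    records = list(records)
    by_cohort = defaultdict(list)
    for r in records:
        by_cohort[r["cohort"]].append(r["correct"])
    # Per-cohort accuracy surfaces subpopulations where the model underperforms.
    for cohort, outcomes in sorted(by_cohort.items()):
        print(f"{cohort}: accuracy={np.mean(outcomes):.2f} (n={len(outcomes)})")
    confidence = np.array([r["confidence"] for r in records], dtype=float)
    correct = np.array([r["correct"] for r in records], dtype=float)
    # Crude calibration signal: correlation of confidence with correctness.
    print(f"confidence/correctness correlation: {np.corrcoef(confidence, correct)[0, 1]:.2f}")
```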

And for dialogue, of course, use user simulators. Here are a couple of papers and works that could be interesting here as well; happy to share them in the chat later. And two last slides. When you see these headlines on Twitter, some questions to ask are: one, how different are those tasks from your own tasks?

Two, how different is your domain? And three, think about all your privacy, robustness, et cetera constraints. On the first two: if that gap is pretty wide, it might be the case that those numbers won't actually translate to good numbers for your own task in your high-stakes use case. And there are some open questions that I'd love to talk about with folks here

later as well: what does explainability look like in the world of black-box APIs, and what are the best practices for active learning, et cetera? Especially since you might not get confidence scores, how would you do active learning in that setting? With that said, that's it for now.

Super speedy. Happy to take any questions afterwards, but thank you so much; this is a really great conference. So, thank you so much. Yeah, that was awesome. And you know, when I was looking over the title before and it said LLMs in high-stakes use cases, I was like, where's she gonna go?

What's that? High-risk use cases or high stakes? And a therapy bot is definitely one, for sure. Thank you. Yeah, thanks so much, Lee. Cool. All right. Yeah, please drop some of those links in the chat, that would be awesome. And thanks again, Yada. Really appreciate it. All right. Have a good one.

Bye.
