Lessons Learned Productionising LLMs for Stripe Support
Sophie is a data scientist currently working on improving user experience and efficiency for Stripe’s Operations team. Her favourite thing about being a data scientist is getting to work on a huge variety of business problems, and using analysis and machine learning to solve them.
Her hobbies include trying to keep up with the impossibly fast-growing LLM space (inside work), and binge-watching The Office (outside work).
Large Language Models are an especially exciting opportunity for Operations: they excel at answering questions, completing sentences, and summarizing text while requiring ~100x less training data than the previous generation of models.
In this talk, Sophie discusses lessons learned productionising Stripe’s first application of Large Language Modelling - providing answers to user questions for Stripe Support.
Link to slides
All right, so our next speaker is Sophie Daly. I'm not going to pronounce it right, so I'm just going to bring her on the screen. Is it Dally? Daly? Daly. Ah, it was so close. Okay, well: productionising LLMs for Stripe Support. I think we're all fascinated to hear about that. So without further ado, here are your slides.
Take it away. Great, thank you so much. So my name is Sophie Daly, I'm a data scientist at Stripe, and today I'm going to chat about the lessons learned building our first application of large language modeling in Stripe's support space. For some very brief context, Stripe is a payments company. We offer a lot of different payments and data products, and we serve millions of customers all over the world.
So as you can imagine, having a large global customer base and a wide suite of products means that our support operations org is very, very busy: every week, our agents handle tens of thousands of support cases where they answer questions and solve problems for our customers. As the majority of this support volume is text-based, LLMs are an especially exciting opportunity for support, and there are many applications where this type of ML can add value.
But for our first undertaking, we decided to target the root of the support problem space, which is the very complex but high-value task of answering our users' questions. So our ultimate goal here is to help agents solve cases more efficiently by prompting relevant responses to user questions using GPT.
Our customers will always talk directly to our support agents. We want to prompt these agents with relevant responses so that they no longer have to spend time doing research to look up answers, which can be really time-consuming given how complex and wide-ranging the support problem space is. So for success, we need to make sure that our ML-prompted responses are information-accurate and hit the right Stripe tone.
The human experience is extremely important to us, and we don't want our agents to ever sound like bots. Of course, our ultimate measure of success here is that the agents are actually using this tool and it's helping them solve cases more efficiently. So the very first lesson we learned is that LLMs are not oracles.
When we ask out-of-the-box Davinci GPT a basic support question like "Hey, how can I pause payouts?", GPT will give a very plausible-sounding answer, but unfortunately it is factually incorrect. And this is true for the majority of questions that Stripe customers ask, because the materials that GPT has been pre-trained on are either out of date, perhaps incomplete, or else confused with generic instructions that might relate to another payments company.
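To make that concrete, asking the off-the-shelf model is only a few lines; the sketch below assumes the legacy openai Python client, and the model name and question are illustrative rather than Stripe's actual setup.

```python
# Sketch: asking an out-of-the-box GPT-3 model a Stripe support question.
# Assumes the legacy (pre-1.0) openai Python client and an API key in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="text-davinci-003",  # off-the-shelf Davinci, no Stripe-specific fine-tuning
    prompt="Hey, how can I pause payouts on my Stripe account?",
    max_tokens=150,
    temperature=0.0,
)

# The answer reads plausibly but is often factually wrong for Stripe-specific questions,
# because the pre-training material is out of date, incomplete, or generic.
print(response["choices"][0]["text"].strip())
```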
And while we could definitely improve this specific answer via prompt engineering, the scope and complexity of the Stripe support space is just too large for that to be a viable solution. So we found that to solve problems that require deep subject matter expertise at scale, we need to break down the problem into more ML-tangible steps.
First, we need to identify whether the user is asking a valid question or not. This step removes any chitchat or questions that don't have clear enough context. Next, we identify which topic the question relates to, and then, using topic-relevant context, we generate the answer to the question.
Finally, we modify the answer so that it meets that perfect Stripe tone, meaning the response is friendly but succinct in the way our agents target. The benefits here are that we now have a lot more control over our solution framework, and we can expect much more reliable and interpretable results thanks to fine-tuning, which in our case completely mitigated hallucinations.
Also, the other great benefit here is that fine-tuning on GPT requires just about 500 labels per class, so we could move really quickly, relying exclusively on expert agent annotations for our pilot. Our end solution consists of a sequential GPT framework: the first two fine-tuned classification steps filter candidate user support questions through to our fine-tuned agent response model, which generates the information-accurate response, and these responses are finally adjusted to meet Stripe tone before being prompted to the agents.
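As a rough illustration of that sequential framework, here is a minimal sketch; the fine-tuned model names, prompt formats, and label conventions are hypothetical, not Stripe's actual models.

```python
# Sketch of the sequential GPT framework: two fine-tuned classifiers filter the question,
# a fine-tuned response model answers it, and a final step adjusts the tone.
# All model names and prompt/label conventions below are hypothetical.
from typing import Optional

import openai


def complete(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Small helper around the legacy Completion endpoint."""
    resp = openai.Completion.create(
        model=model, prompt=prompt, max_tokens=max_tokens, temperature=0.0
    )
    return resp["choices"][0]["text"].strip()


def suggest_response(question: str) -> Optional[str]:
    # Step 1: is this a valid, answerable support question (vs. chitchat or missing context)?
    if complete("ft-valid-question-clf", f"{question}\n\n###\n\n", max_tokens=1) != "yes":
        return None

    # Step 2: which support topic does the question relate to?
    topic = complete("ft-topic-clf", f"{question}\n\n###\n\n", max_tokens=5)

    # Step 3: generate an information-accurate draft answer using topic-relevant context.
    draft = complete("ft-agent-response-model", f"Topic: {topic}\nQuestion: {question}\nAnswer:")

    # Step 4: rewrite the draft so it hits the friendly-but-succinct Stripe tone.
    return complete("ft-tone-model", f"Rewrite in Stripe support tone:\n\n{draft}\n\nRewritten:")
```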
So the second lesson we learned was how important it is to prioritize online feedback and monitoring. During our development, we relied on back-test evaluations to measure ML performance. For our classification models, this was standard practice using labeled datasets. For our generative models, we engaged with expert agents who manually reviewed and labeled responses so that we could get quantitative results.
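For the classification steps, that back-testing amounts to ordinary held-out evaluation; a generic sketch, where the hold-out examples and the predict_topic stub are placeholders rather than Stripe data, might look like this.

```python
# Sketch of an offline back-test for one classification step, on an expert-labeled hold-out set.
# The examples and the predict_topic stub are placeholders for illustration only.
from sklearn.metrics import classification_report


def predict_topic(question: str) -> str:
    """Stand-in for a fine-tuned topic classifier."""
    return "payouts" if "payout" in question.lower() else "disputes"


holdout = [
    {"question": "How can I pause payouts?", "label": "payouts"},
    {"question": "A customer opened a dispute, what should I do?", "label": "disputes"},
]

y_true = [ex["label"] for ex in holdout]
y_pred = [predict_topic(ex["question"]) for ex in holdout]

print(classification_report(y_true, y_pred))
```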
We also engaged with agents for user testing and training data collection, so that they could dictate exactly what the ML response prompts should be for different types of input questions. After many ML iterations, our offline feedback trended really well, and we got to the state where we were really confident in our ML accuracy and ready to ship. For our production setup, we designed a controlled experiment to measure the effect on cases where agents are prompted with ML-generated responses versus those that aren't.
Unfortunately, because online case labeling was not feasible at this scale, we had a considerable gap when it came to observing online accuracy trends. Once we shipped, we realized that the rate at which agents were using our ML-generated prompts was much lower than expected; very few cases were actually using our ML-generated answers. Because we didn't have visibility into online accuracy trends, we were pretty much in the dark trying to understand what was going on and whether there was a discrepancy between online and offline performance. So in the absence of online accuracy metrics, we developed a heuristic-based match rate to represent how often our ML-generated responses matched the response that agents actually sent to users.
Match rate provided a very crude, lower-bound measure of our expected accuracy, so that we could tell how the model was trending, and this was really important for us to be able to validate the quality of our ML responses in production.
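As a sketch of what such a proxy can look like, here is one way to compute a heuristic match rate; the text-similarity measure and threshold are illustrative, and Stripe's actual definition may differ.

```python
# Sketch of a heuristic match rate: how often the ML-suggested response roughly matches
# the response the agent actually sent. Similarity measure and threshold are illustrative.
from difflib import SequenceMatcher
from typing import List, Tuple


def is_match(suggested: str, sent: str, threshold: float = 0.8) -> bool:
    """Crude textual similarity between the prompted response and the agent's reply."""
    return SequenceMatcher(None, suggested.lower(), sent.lower()).ratio() >= threshold


def match_rate(pairs: List[Tuple[str, str]]) -> float:
    """Share of cases where the suggestion matched the agent's actual response:
    a rough lower bound on online accuracy."""
    if not pairs:
        return 0.0
    return sum(is_match(suggested, sent) for suggested, sent in pairs) / len(pairs)
```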
So even though our offline user-testing feedback was really positive and our online match-rate trends were good, in practice agents were just too accustomed to their existing flow of solving cases, and they ignored our prompts. This lack of agent engagement was a huge bottleneck for us in realizing efficiency impact, and we really needed a much larger UX effort if we wanted to increase adoption. There were many big learnings from this experiment.
The first one was to always ask yourself whether human behavior can affect solving your business problem, and if so, engage with your UX team really early. More practically for ML, we also learned the importance of deriving proxy online metrics in cases where we don't have the right data to measure accuracy (or whatever metric you're tracking) exactly.
Directional feedback using heuristics is a million times better than being completely in the dark. We also learned to ship each stage of the framework in shadow mode as soon as it's ready, instead of waiting for one large end-to-end ship. That way you can debug as you go and validate functionality and target expectations for each stage of the sequence.
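A minimal sketch of shadow mode per stage might look like the following; the names and the flag are illustrative, and any suggestion function, such as the pipeline sketched earlier, can be plugged in.

```python
# Sketch of shadow mode: run a stage on live traffic and log its output for monitoring,
# but don't surface anything to agents until the stage is validated. Names are illustrative.
import logging
from typing import Callable, Optional

logger = logging.getLogger("support_ml.shadow")


def handle_case(
    question: str,
    suggest: Callable[[str], Optional[str]],
    shadow: bool = True,
) -> Optional[str]:
    """`suggest` is any stage or pipeline that returns a draft response (or None)."""
    suggestion = suggest(question)

    # Always log, so online behavior can be compared against offline expectations.
    logger.info("shadow prediction question=%r suggestion=%r", question, suggestion)

    if shadow:
        return None  # agents see nothing while the stage runs in shadow
    return suggestion  # once validated, flip the flag to start prompting agents
```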
This experience also really taught us to prioritize monitoring efforts as highly as other ML development tasks. Often monitoring gets deprioritized as something we can catch up on later, after we've shipped, and this is especially easy to do when you're working in a lightweight, resource-constrained group like we were.
But the lesson learned is that a model is not shipped unless it has full monitoring and a dashboard, because this online feedback is key to ensuring that we're actually solving the problem we're targeting. So the last lesson I'm going to talk about today is how data is still the most important player when solving business problems using LLMs.
I think there's a bit of a misconception that newer or more advanced LLM architectures will be able to solve everything for us, and that we just need to integrate them or write the right prompt. But LLMs are not a silver bullet. Production requires data collection, testing, experimentation, infrastructure, et cetera,
just like any other ML model. And I think the age-old 80/20 rule of working in data science holds very true: writing the code for this LLM framework took a matter of days or weeks, whereas iterating on the dataset to train these models took months. In our experience, iterating on the label data quality also yielded much higher performance gains than using more advanced GPT engines.
The types of ML errors we were seeing related to gotchas in the Stripe support space, as opposed to more general gaps in language understanding, so adding data samples or increasing their quality usually did the trick when we had a gap in performance. We've also seen that scaling is more of a data management effort than an effort to advance our solution to a newer model.
We actually found that collecting labels for generative fine-tuned models adds a lot of complexity. So for our second iteration of this solution, which we're building right now, we've swapped out the generative ML component and replaced it with a more straightforward classifier. Moving away from generative modeling to more basic classification means we can leverage weak supervision techniques, like Snorkel machine learning or embedding classification, to label data at scale without requiring explicit human labelers.
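For example, weak supervision with Snorkel lets a handful of heuristic labeling functions vote on labels at scale; the labeling functions, topics, and example data below are illustrative, not Stripe's actual heuristics.

```python
# Sketch of weak supervision with Snorkel: heuristic labeling functions vote on each
# question, and a label model combines the votes into training labels at scale.
# The labeling functions, topics, and example data below are illustrative only.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, PAYOUTS, DISPUTES = -1, 0, 1


@labeling_function()
def lf_mentions_payouts(x):
    return PAYOUTS if "payout" in x.text.lower() else ABSTAIN


@labeling_function()
def lf_mentions_disputes(x):
    text = x.text.lower()
    return DISPUTES if "dispute" in text or "chargeback" in text else ABSTAIN


df_train = pd.DataFrame({"text": [
    "How can I pause payouts?",
    "A customer filed a chargeback, what should I do?",
]})

applier = PandasLFApplier(lfs=[lf_mentions_payouts, lf_mentions_disputes])
L_train = applier.apply(df=df_train)

# The label model denoises the overlapping votes into probabilistic training labels,
# which can then train a straightforward downstream classifier without hand-labeling.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100, seed=42)
weak_labels = label_model.predict(L=L_train)
print(weak_labels)
```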
We're also heavily investing in a subject matter expertise strategy program to collect and maintain this dataset. Stripe's support space changes over time as we advance and grow our products, so we need our labels to stay fresh so that our model is continuously up to date. Our goal is for this dataset to become a living oracle that will guarantee our ML responses stay fresh and accurate into the future.
So to recap, the goal of this project is to help our support agents solve cases more efficiently by prompting relevant responses to user questions. The three big lessons from this talk are: number one, LLMs are not oracles; we need to break down our business problems into more ML-manageable steps.
Two, online feedback really is key; monitoring is just as much a priority as any other ML development task. And lastly, data is king; a good data strategy will outweigh any fancy LLM architecture, especially for solving business problems that require deep domain expertise at scale. So please reach out if you have any questions.
We have a lot of open roles at Stripe, if anyone wants to join and work on some of these problems with me. That would be great.
Woo, awesome. Thank you so much, Sophie. I loved it, and I always love the Stripe graphics, they're just so warm and beautiful. Thank you. All right, well, thank you so much and have a wonderful rest of your day. Thank you. Bye-bye. Bye.