Productionizing Health Insurance Appeal Generation // Holden Karau // AI in Production Talk
Holden Karau is an American-Canadian computer scientist and author based in San Francisco, CA. She has worked for Netflix, Apple, Amazon, and Google. She is best known for her work on Apache Spark, her advocacy in the open-source software movement, and her creation and maintenance of a variety of related projects including spark-testing-base.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
This talk will cover how we fine-tuned a model to generate health insurance appeals. If you've ever received a health insurance denial and felt frustrated, this topic should resonate with you. Even if you haven't experienced this, come and learn about our adventures in using different cloud resources for fine-tuning and, ultimately, deploying on-premises Kubernetes in Fremont, CA. This includes the unexpected challenge of fitting graphics cards into the servers.
Productionizing Health Insurance Appeal Generation
AI in Production
Demetrios [00:00:00]: Now we get to bring out Holden. What's going on? Oh, my gosh, I'm so excited to finally get to be on a virtual call with you. It has been a long time coming. I've been following your work since I think you were at Apple, and I remember someone, my friend Willem, saying, like, oh, we should have Holden come and talk at one of the conferences we were having. And now we finally made it happen.
Holden Karau [00:00:28]: I'm happy to be here.
Demetrios [00:00:29]: Excellent. Well, you've got some great stuff coming for us. I know that you have already shared your screen, so I'm just going to hand it over to you and I'll talk to you in about 20 to 25 minutes.
Holden Karau [00:00:42]: All right, rocking. So I'm going to try and get us back on schedule, in part because I also have to go do work after this, which is much less fun. I'm going to talk about productionizing health insurance appeal generation. An alternative title for this was getting AI to production on prem without a... So, I'm Holden. My pronouns are she or her. I'm on the Spark PMC. I work at Netflix right now. I don't represent them, especially in this talk.
Holden Karau [00:01:15]: Right? Like, I'm talking very much about things that I do that are entertaining and fun. They bring me great joy, but they're not what my employer pays me to do. My employer pays me to work on related, you know, big data stuff, but a little bit different. You can follow me in a whole bunch of different places: Twitter, Bluesky, Mastodon. I do open source live streams, and a lot of the things that I'm going to be talking about today, I also live stream. And if you're interested, you can go back and watch a lot of me struggling with this. And I think that's interesting, or at least good to see that you're not alone if you're struggling. And if you're not struggling, that's great.
Holden Karau [00:02:00]: That's fantastic. Congratulations to you. So, in addition to who I am professionally, I'm trans, queer, and Canadian, although in America on a green card, which finally came through after almost a decade, and part of the broader leather community. This is something that I say for a lot of my talks: I think it's important for those of us who are building ML models and tools to try and have more diverse communities in the room building these. Here, it's a little bit more related, because I've seen healthcare outside of this country, and also being trans means I have different interactions with American healthcare than some other folks. So there's going to be some references that people might not love.
Holden Karau [00:02:47]: Feel free to come back for the next talk and take a bio break if this is not your jam. And this is going to be an abridged version, because we're running a little bit behind schedule. We can think of this talk as going from the laptop to the desktop to the server rack. All kinds of fun. So what is the problem that we're trying to solve with our computers? Health insurance in America frequently denies medical care. Anthem Blue Cross, which is the insurer that I currently have, denied around 20% of claims, and more than that in 2019. And so they've consistently been denying a lot of claims. And Anthem is not alone.
Holden Karau [00:03:38]: This is just the insurer that I happen to have. Allegedly, some of them are using AI to deny claims. Of course they deny that, but I somewhat trust the reporting a little bit more than I trust their denials. The other part of our problem is that my budget is super low, right? Like, this isn't a problem that I'm solving for Netflix. This is a problem that I'm solving because I have it. And so the resources that I have versus the resources like a large employer has are a little bit different. Personal motivation. I got hit by a car in 2019.
Holden Karau [00:04:15]: Lots of medical bills as a result of that. Also being trans, a bunch of surgeries. Personally, for me (not every trans person decides to have surgery), I've chosen some different things that have really helped with my life. I love them, but also a lot of medical bills and a lot of fighting with insurance. But my dog is amazing, and he helps me get through all of this. His name is Timbit. So I also want to be clear and set realistic expectations.
Holden Karau [00:04:45]: This is the Flex Tape, slap-it-on approach to trying to solve what is fundamentally a societal problem. Health care should be accessible to everyone, and there's only so much we can do with computers, right? We can try and make things better, but structurally, America needs to change. So how are we going to use computers to try and make the world suck a little bit less? We're going to make an ML model that'll take health insurance denials and produce appeals for people. And then, because that's only part of the thing, and most people don't actually have, like, an RTX 4090, let alone necessarily want to install Python or download a model off of Hugging Face, we're going to put together a front end to access the model. And to be clear, this is my approach. There are many other approaches that one could use to try and make healthcare in America more accessible to people. So we're going to need training data.
Holden Karau [00:05:41]: We're going to need evaluation data, and we're going to need a bunch of computers, and then we're going to need some software to put everything together. So training data is like, that's the hard part, right? The people that have them, for the most part, are insurance companies and doctors offices. Insurance companies aren't likely to help us because, well, our goal is to make their lives harder. And doctors offices are unlikely to help us because while our goal is indeed to make their lives easier, there's a lot of laws that they have to be sort of very careful about when it comes to sharing information. So they'd have to go ahead and go in and manually redact all of these things and be like, hey, I can make your life easier. Just spend the next few months redacting patient records. That doesn't sell very well. Right.
Holden Karau [00:06:34]: We can look at the Internet. Some people post their denials and appeals online. It's fantastic. Love it. Great data source. Not big enough to fine tune a model, but there are other sort of substitute data sources that we can use for training our model: the independent medical review boards. Now, California has this data, and it's open and you can download it, thank God. Other states also probably have this data, but it's not as open or it has restrictions around it.
Holden Karau [00:07:05]: Washington state, for example, has it, but it's not approved for commercial use. Is this commercial use? I don't know. I also don't want to be the one that finds out. So we're just using the California data for now. One day in the future, you know, we'll file a Washington state freedom of information style request and get the data from other states, Texas, et cetera. Every state has independent medical review boards. The results that are published don't contain the patient name or really too much information, so in theory it's difficult to figure out who the patient would be.
Holden Karau [00:07:49]: And in practice it probably is. And the great thing about this is that there's a lot of these, right? Like, the state of California is big enough that there are so many of these records, and we can take these records and we can generate synthetic data that we can use for fine tuning. So we're going to ask our LLMs to take our inputs (these are what our inputs look like) and output what the denials might have looked like, and similarly the appeals. There are some downsides to this. The generated data costs money to generate, not a huge amount compared to having humans do it. But the generated data might not be very good, so then we have to do some filtering on top of it.
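A minimal sketch of what that synthetic data generation step might look like, assuming an OpenAI-compatible chat API and a CSV export of the California independent medical review records; the column name, prompt wording, model id, and filtering thresholds are all illustrative assumptions, not the talk's actual code:

```python
# Hypothetical sketch: turn California IMR case summaries into synthetic
# denial/appeal training pairs with an LLM, then do some cheap filtering.
import csv
import json

from openai import OpenAI  # any OpenAI-compatible endpoint works here

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Below is a summary of an independent medical review case. "
    "Write (1) the denial letter the insurer might have sent and "
    "(2) an appeal letter the patient could send in response. "
    "Return JSON with keys 'denial' and 'appeal'.\n\nCase: {case}"
)

def generate_pair(case_summary: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model
        messages=[{"role": "user", "content": PROMPT.format(case=case_summary)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

with open("ca_imr_cases.csv", newline="") as f, open("synthetic.jsonl", "w") as out:
    for row in csv.DictReader(f):
        pair = generate_pair(row["Findings"])  # "Findings" column name is an assumption
        # Cheap filter: drop obviously empty or truncated generations.
        if len(pair.get("denial", "")) > 200 and len(pair.get("appeal", "")) > 200:
            out.write(json.dumps(pair) + "\n")
```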
Holden Karau [00:08:33]: And we also have to be careful about the licenses of the models we use. So let's say we go ahead, we do that, we've got a bunch of data, and then we're going to go ahead and we're going to pick a model to fine tune, ideally one that otherwise performs well and will fit on the kind of hardware that we happen to have available to us. Sorry, that's my dog. He's very upset. Normally, this is the sign that the mail is about to arrive. Could be something else. Thankfully, this has gotten so much easier. Fine tuning things is fantastic.
Holden Karau [00:09:06]: I used Axolotl. I don't actually know how to pronounce that properly. For the first attempt, I used Dolly and a whole bunch of shell scripts. Axolotl abstracts away almost all of those terrible shell scripts into terrible Python scripts. And then Lambda Labs, because while I have an RTX 4090, an RTX 4090 isn't enough to fine tune the model. It's enough to run inference on it, but that's about it. And then we've got a whole bunch of shell scripts still, because copying the data around and doing all of the setup on all of these computers is kind of painful, but it's much better than it used to be. And then we write some relatively simple config files, along the lines of the sketch below.
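A minimal sketch of what an Axolotl-style fine-tuning config along these lines might look like; the field names follow Axolotl's documented format, but the base model, dataset path, and hyperparameter values here are illustrative assumptions, not the exact config from the talk:

```yaml
# Illustrative Axolotl fine-tuning config (values are assumptions).
base_model: mistralai/Mistral-7B-v0.1     # assumed base model
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: qlora        # keeps the job within a single rented GPU's memory
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: ./data/synthetic_appeals.jsonl  # the generated denial/appeal pairs
    type: alpaca      # assumes the pairs are reshaped into instruction/output records

sequence_len: 4096
sample_packing: true

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
output_dir: ./out/appeal-model
```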
Holden Karau [00:09:45]: We write a model fine tuning config: specify what our base model is, where our data is coming from, sequence length, sliding window, then some special settings, and fine tune the model. All told, it costs about $112, right? That is not too, too much money. It's not like you just do it because you feel like it, but it's certainly within my budget. And then it's time to serve our model. So we've fine tuned it and we want to do our serving with vLLM. Really cool. We still need some GPUs, though. I have some Arm GPUs, these Arm-plus-GPU things from Nvidia.
Holden Karau [00:10:31]: Really interesting devices. They don't play very well with most of the tools. The other discovery that I made is that an RTX 3090 does not fit well into any of the 2U servers that I have. They also still cost money and use power. So we ended up using a desktop computer and shoving it in the bottom of this rack that's pictured here. I don't have a picture of that particular abomination, but it's okay. It used more power than I was expecting, so I had to shuffle things around, because I've only got 15 amps to play with, but not too, too expensive. And then doing the actual serving itself is relatively simple, right? We can specify a Kubernetes deployment, and this will just take the vLLM container and run it on Kube.
Holden Karau [00:11:19]: Here we see that I limit it to the amd64 hosts because the Arm hosts don't play nice with it. Right now, the runtime class has to be nvidia. That way we can actually access the GPUs from inside of our container. And then we also add some additional arguments. And this is like, hey, this is the model that we're using, listen everywhere. I don't know what enforce-eager does, but if we turn it on, it works, and if we don't, it doesn't. Bit of a joke.
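A minimal sketch of the kind of Kubernetes Deployment being described here, assuming vLLM's OpenAI-compatible server image; the model id, resource numbers, and service layout are illustrative assumptions rather than the actual manifest (note that it pulls the latest tag, which is exactly what the next paragraph warns about):

```yaml
# Illustrative vLLM serving Deployment (image tag, model id, and sizes are assumptions).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: appeal-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: appeal-model
  template:
    metadata:
      labels:
        app: appeal-model
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64        # the Arm hosts don't play nice with this
      runtimeClassName: nvidia           # exposes the GPUs inside the container
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest # pulling latest, which breaks periodically
          args:
            - --model=totallylegitco/fighthealthinsurance_model  # hypothetical model id
            - --host=0.0.0.0             # listen everywhere
            - --enforce-eager
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              ephemeral-storage: 40Gi    # the model weights get downloaded here
```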
Holden Karau [00:11:48]: One of the things that you'll note here is that we're pulling from latest. This is not a good idea, and in practice this actually does break every so often. Originally I was pulling from the Mistral container with latest, but periodically the Mistral folks update their containers and they generate the prompts to the model differently. And they do this ahead of releasing any of the source code changes, so you don't really know that it's going to break until it just breaks. So you should actually pin to a specific version. The other thing here is we need a fair amount of ephemeral storage, because that's where we download the model and put it. The RTX 3090 was $748.
Holden Karau [00:12:39]: It's more than it costs to train the model, but it's still pushing my budget limits. But it's a lot cheaper than renting a GPU with enough RAM to run model inference for, like, a year. This is much more affordable. The front end: we put it together in Django, because I work in Python and Scala, and it's a lot easier, in my experience, to make a front end really quickly in Python. Also, LLMs do a pretty good job of generating Python compared to generating Scala code. These can run on pretty much any of the computers that we've got in our rack, so it's probably actually running on some of the Arm machines. It's pretty great.
Holden Karau [00:13:24]: Those Arm machines are super efficient power-wise, so I love that. We also need Internet access. This is a bit overkill: here we've got an autonomous system for Pigs Can Fly Labs, and it has three upstream Internet providers. It's fun. If you took a Cisco networking class when you were in high school, this may have been your dream since you were a smallish child. If you didn't, you may be like, Holden, this is silly, and you are right in either case. But if you want Internet transit in Hurricane Electric Fremont 2, give me a call.
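A minimal sketch of how the Django front end described above might hand a denial to the vLLM server's OpenAI-compatible completions endpoint; the in-cluster service URL, model id, prompt format, and view wiring are all assumptions, not the actual front end code:

```python
# Hypothetical Django view that forwards a pasted denial to the vLLM server
# and returns the generated appeal. URL, model id, and prompt are assumptions.
import requests
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

VLLM_URL = "http://appeal-model:8000/v1/completions"  # assumed in-cluster service name
MODEL = "totallylegitco/fighthealthinsurance_model"   # hypothetical model id

@csrf_exempt  # sketch only; a real front end should keep CSRF protection
@require_POST
def generate_appeal(request):
    denial_text = request.POST.get("denial", "")
    resp = requests.post(
        VLLM_URL,
        json={
            "model": MODEL,
            "prompt": f"Denial:\n{denial_text}\n\nWrite an appeal letter:\n",
            "max_tokens": 1024,
            "temperature": 0.7,
        },
        timeout=120,
    )
    resp.raise_for_status()
    appeal = resp.json()["choices"][0]["text"]
    return JsonResponse({"appeal": appeal})
```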
Holden Karau [00:14:03]: And now we're going to go ahead and do a demo. I'm going to really hope that it doesn't crash. The other thing that I want to be really clear about is please don't use this right now. In real life, it is not super ready for production usage. Okay. And we're going to see if it loads. Otherwise we're going to debug it in real time. Let's see if it'll let me share a screen.
Holden Karau [00:14:34]: Okay, so here we see. This is sad. I'm not getting any pods back from totallylegitco right now. So that would be why it's not coming up. This is how we know it's a real demo, not a fake demo. Let's see. Why is it broken? Computers. Okay, so our router is up.
Holden Karau [00:15:08]: That's a good first step. Let's see here. Can we SSH to... Oh, okay. That's why: the computer running it crashed, most likely. We'll see if I can try another machine really quickly. Okay, that machine's up. Okay.
Holden Karau [00:15:48]: Yeah. Okay. Jumba is down, so that's not going to work. We're going to send an email to the fine folks at Hurricane Electric to reboot Jumba. But that's not going to happen in time to run the demo, so that's okay. Instead, let's go back and look at some of the places where this code lives. So most of the logic lives inside of totallylegitco. So the LLM stuff is all inside of this health-insurance-llm repo.
Holden Karau [00:16:41]: And we can see our deployment code lives inside of here, as does our fine tuning code and all those things. So if you want to play with this and you have a Kubernetes cluster that is currently up, you can go ahead and deploy it. In the meantime, my cluster is down, because it looks like the head node is also the node where the model inference is running. So it's just not going to work. But you can go ahead and run it too. Downsides of on-prem: it takes real time to reboot computers instead of fake time. The front end, which we can actually somewhat look at today, because I can run that locally. Does it? Yeah.
Holden Karau [00:17:42]: There we go. Port 8000. Why is it not listening? Sorry, demo sadness. Um. Run local. Yeah, let's... Oh, Docker container.
Holden Karau [00:18:22]: docker container ls. docker container stop. Got the model running locally too; stop that. So we can bring up the UI, and then, cool. Should work. Come on. Loading.
Holden Karau [00:18:55]: Well, yeah, let's call it on the demo. I don't think that's going to work. And we can have two minutes for questions. I'm sorry the demo didn't work, folks. Really bad luck with that computer crashing. But we know it's not staged because it didn't work.
Demetrios [00:19:16]: I was loving it. I am so excited that you were doing this. This is very cool. And that you broke down every piece of this puzzle. It's so much more than I usually think most people will talk about. And so that's awesome. Now there are some questions coming through. I mean, we'll start with some basic ones, which are: what is...
Holden Karau [00:19:42]: All of... I shouldn't say all of my servers. Most of my servers are named after characters from Lilo and... Oh, so here: Jumba. Lilo and Stitch. If you want to bring the screen back up, you can see Jumba.
Demetrios [00:19:58]: I thought it was a service. That is classic. Okay. Yes, perfect. There is the first question now. I'm sure more people have a few more questions. One is, yeah, this is Kubernetes serving. We definitely saw that.
Demetrios [00:20:19]: How do you feel about CPUs versus GPUs? Do you ever try and run it smaller?
Holden Karau [00:20:25]: Yeah, so that's a super solid question. I've played with CPU inference and it's never given me particularly great results here. Part of it is, I think, that realistically we're already fine tuning a model with a synthetic data set, and it works okay, but it doesn't work amazingly, right? And so when we throw away these extra bits of precision, it's not super fun. The other reason why I kind of didn't go super far down the CPU serving path is that the bitsandbytes library, which a lot of things use for quantizing, doesn't work super well on the Arm nodes that I have anyways. And by doesn't work super well, I mean doesn't compile. So it's like, yeah, I could run CPU inference on an x86 node and it would kind of work, kind of slow.
Holden Karau [00:21:32]: But my x86 nodes are already pretty power hungry.
Demetrios [00:21:43]: For all of us that really enjoyed this. You do this often, right? On Twitch?
Holden Karau [00:21:50]: Yeah. So I do this very often on Twitch and YouTube. And so later on this week, once that node is rebooted, I'll redo the demo on my YouTube and on my Twitch. And if you want to see that, I'll post a link to it and we can do that there. I'm really sorry about the demo again, folks.
Demetrios [00:22:13]: We are all very forgiving.
Holden Karau [00:22:16]: Thanks.
Demetrios [00:22:17]: You haven't heard my singing. People are very forgiving in this, because, you know what, they're about to hear some horribly out-of-tune singing. That's going to happen. Your demo is the least of their worries. Anyway, there's some questions about the GPUs, and if you use any special cooling when you're working with them.
Holden Karau [00:22:42]: Totally. So for inference: I have two GPUs, and all I do is keep it sticky to one GPU for inference. The reason why I have two is I had the theory that it might make sense to have a second model for inferring additional information. So instead of just generating the appeal, extracting information from the denial. But it turns out that, honestly, one model was pretty good at doing both. And so that's cool. That does let me start up a new model without having to take the old model offline first, which is great.
Holden Karau [00:23:29]: But, as we see, physical nodes fail, so it's only so great. For the fine tuning part, I leave that up to Axolotl for the most part. And it does a really good job of splitting up the work on multiple GPUs automatically. And I haven't felt the need to do anything funky with it. Like, nvidia-smi shows everyone's happy and using a bunch of power and chugging away. So that's been great. I'm really happy that I don't have to think about that problem.
Demetrios [00:24:03]: Nice. Have you found any tricks or tools that you like, just to make sure that you are getting that full saturation of these GPUs, or speeding up the inference on the other side of things?
Holden Karau [00:24:19]: Yeah. So for making sure that I'm getting the full strength of it on the fine tuning, that has not been a problem, right? Because Axolotl does a very good job of picking batch sizes that just clobber the GPU, and you can just watch the temperature go up and the power consumption go up, and you're like, okay, cool, good luck, don't catch on fire. For inference: so while there are a lot of health insurance denials, right now this is a thing that I am building, that I am using. And so my batch size is one.
Holden Karau [00:24:55]: And so that is obviously suboptimal. But I use vLLM in part because of the theory that once we achieve a second user, we'll be able to go up to a batch size of two, and that'll do a better job of getting the GPU utilization up. So that's why I use vLLM, so that we get nice batching. That hasn't played out in practice yet, just because it's very early stage.
Demetrios [00:25:26]: Nice. Last question for you, coming through in the chat, asking about, for inference, you mentioned CPU is not great and it hasn't really worked for you. But you were talking about the AMDs, right? What about, have you experienced or have you played around with Intel optimized libraries, like Intel...
Holden Karau [00:25:49]: Lex? Yeah, sorry, when I say amd64, I just mean x86_64. That's my bad.
Demetrios [00:25:58]: Okay.
Holden Karau [00:25:59]: Because that's what the nodes get labeled in Kubernetes, for whatever reason. But no, I haven't played with the Intel-specific accelerator libraries. That is on my list of things to do and explore, because I think it looks really cool. And that's actually something that I want to explore for my day job, too, because we've got some things that I think we could accelerate with some of the cool Intel stuff there. So if that's something that someone else thinks is cool, reach out. I'd love to chat. You can find me on LinkedIn, Twitter, and all those places.
Demetrios [00:26:34]: There we go. Excellent. Well, Holden, this has been an absolute pleasure. I am so thankful that you decided to come on here and code with us, show us what you're working on. And it has been a long time coming. Thanks so much.
Holden Karau [00:26:52]: Thank you so much for having me. I really appreciate it. And next time, we'll have a working demo.
Demetrios [00:26:57]: There we go. We'll find you on Twitch, don't worry. And we'll be watching.
Holden Karau [00:27:01]: Okay. Thank you so much. Thank you.
Demetrios [00:27:04]: See you, Holden.