Agents in Scrubs: Designing for the Complex Realities of Healthcare // Sarah Gebauer // Agents in Production 2025
SPEAKER

Dr. Sarah Gebauer is a practicing physician and healthcare AI evaluation expert with a unique combination of clinical expertise and technical assessment knowledge. As the founder of Validara Health, she bridges the gap between medical practice and artificial intelligence through rigorous, clinically relevant evaluation frameworks.
Dr. Gebauer's experience spans over a decade of clinical practice, health technology consulting for dozens of start-ups, and most recently, research at the RAND Corporation on AI evaluation for biosecurity risks. Her work encompasses both the potential and risks of AI in healthcare, with particular expertise in AI biosecurity assessment and validation methodologies for clinical applications.
As the founder of Machine Learning for MDs, Dr. Gebauer has built a thriving community of 500+ physicians developing their AI skills. Her weekly Substack newsletter on the intersection of AI and healthcare has grown to over 1000 subscribers since its launch in 2023, establishing her as a thought leader connecting clinical and technical domains.
Previously, Dr. Gebauer held multiple leadership positions at UCHealth and the University of New Mexico health system. She attended Stanford Medical School, completed an anesthesiology residency at UC San Francisco and a Hospice and Palliative Medicine fellowship at the Institute for Palliative Medicine. She holds a Graduate Certificate in Clinical Informatics from OHSU.
Dr. Gebauer's published work on AI benchmarking and evaluation has appeared in Nature, BMJ, and other leading publications. She brings a physician's perspective to the technical challenges of validating healthcare AI—focusing not just on what these systems can do, but what they should do to meaningfully improve patient care.
SUMMARY
Deploying AI agents in healthcare isn't just a technical challenge; it's a clinical one. As a physician working at the intersection of care delivery and machine intelligence, I'll walk through what it really takes to make agents useful, safe, and credible in high-stakes environments like hospitals and clinics. This talk will focus on:
- What makes healthcare environments uniquely hard for agents: ambiguity, interruptions, human variability, and risk tolerance
- Why typical evaluation metrics often miss the mark, and what to measure instead (think: harm reduction, workflow fit, and appropriate escalation)
- How to scope agent autonomy to reflect the real-world roles of nurses, physicians, and support staff
- Where agents can shine in augmenting clinical work, and where they're likely to fail without robust oversight
TRANSCRIPT
Sarah Gebauer [00:00:08]: Wonderful. Thank you so much. And thank you everyone. It's a pleasure to be here with you. So when I talk to engineers about AI agents in healthcare, a lot of times I get a question along one of two lines: either agent-based tools in healthcare are a clear upgrade from humans, of course they're going to be better, or anything in healthcare is a legal and logistical nightmare and they never want to deal with it. So if you're on one of those two extremes, I have great news for you. And that is that we already in medicine have a way to use unproven agents in healthcare.
Sarah Gebauer [00:00:47]: We have medical residents, and those are basically untrained agents who operate mostly autonomously with safeguards in place. And so we have this great framework that's been honed over hundreds of years, and we're able to use that framework to deploy AI agents in a very similar way, using a lot of the lessons that we've learned, with a lot of mistakes, over the last decade, or the last hundred to thousand years or so. So medical residents start right after medical school. They have never really been on their own. When I started as a medical resident, I was given a pager and a list of patients and told to go see them and give them whatever medicines they needed, with absolutely no instruction. And you can imagine that there are many mistakes when that happens, some of them mine, some of them by other people. But basically they're kind of partially trained agents, and they're constantly learning and constantly making decisions, mostly under some kind of supervision. They also usually have great test scores. I've been an attending physician; I've trained lots of residents.
Sarah Gebauer [00:01:56]: And right now the LLMs have amazing results on things like Step 1 of the US Medical Licensing Exam, for example, both for clinical reasoning and basic sciences. So, similar to a lot of my residents, they were great in terms of the book learning, and then they got to the bedside and maybe had no idea what to do. The residents right now don't have much real-world testing. That's obviously something we can do much more easily with agents than we can with humans, for a lot of different reasons. So today what we'll cover is: the key aspects of managing humans and AI agents in a healthcare setting; the risks and what you really need to be worrying about; how to know if your agent is ready to deploy in healthcare; and then post-deployment monitoring, how you keep an eye on it afterwards. So, three key aspects of managing humans or AI agents in healthcare. One is human oversight. You guys have heard "human in the loop" a million times, and we already do that for medical residents.
Sarah Gebauer [00:03:09]: And we can do that in a lot of ways in a similar context for AI agents. So identifying the crucial junctures is the most important step: can the agent realize when a critical juncture occurs and escalate to the appropriate person at that time? And then another piece is teaching them how to interact with team members and react in a time-appropriate manner. So, for example, if you have a stroke alert and a person who needs a medication within the next hour, or else the paralysis may become permanent for the rest of their life, then you really want a tool that is going to give you a very short, succinct answer, maybe with a small amount of description. Whereas if you're in neurology clinic and you are trying to diagnose a very unusual, rare disease, you really want a tool that's going to give you a long, detailed explanation with a lot of resources and references. Understanding that difference, and how that context impacts that communication, is really important. There have been lots of studies showing that the main cause of errors and patient harm in the healthcare setting is communication, and there's no reason to believe that the integration of AI agents will be any different in that regard. We still have to have excellent communication to take good care of patients.
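One way to make that context sensitivity concrete is to key the agent's output format off how much time the care team actually has. A minimal sketch of such a policy in Python; the ClinicalContext fields, the 60-minute cutoff, and the format knobs are all illustrative assumptions, not anything from the talk:

```python
from dataclasses import dataclass

@dataclass
class ClinicalContext:
    setting: str              # e.g. "stroke_alert" or "neurology_clinic"
    minutes_to_decision: int  # how long the team has before it must act

def response_style(ctx: ClinicalContext) -> dict:
    """Pick an output format matched to the clinical time pressure."""
    if ctx.minutes_to_decision <= 60:
        # Time-critical (e.g. stroke alert): short, direct, recommendation first.
        return {"max_words": 50, "include_references": False,
                "lead_with": "recommendation"}
    # Diagnostic workup (e.g. clinic): detailed reasoning with references.
    return {"max_words": 500, "include_references": True,
            "lead_with": "differential_and_reasoning"}

print(response_style(ClinicalContext("stroke_alert", 45)))
print(response_style(ClinicalContext("neurology_clinic", 7 * 24 * 60)))
```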
Sarah Gebauer [00:04:33]: And then fairness and bias are always important. We know that humans are biased, we know that healthcare workers are biased, and that's part of the reason that our tools are biased. But we have to be able to monitor that and help mitigate the risks of that in a healthcare setting. So, risks: that failure to escalate I mentioned is a really big deal. Say there's a patient who looks like they might be starting to get sick, and the agent, whether it's a resident or an AI agent, just doesn't quite know enough to know when they should make that phone call. I've worked with a lot of residents, and the difference between a good resident and a bad resident is knowing when to ask for help. And it's going to be exactly the same with AI agents. They are going to be evaluated on their ability to recognize when they need help and when they need to call in a human to do the next step in the process.
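That "knows when to ask for help" criterion can be written down as an explicit gate in the agent loop. A hedged sketch; the three signals and the 0.85 threshold are invented here and would need tuning against real escalation data:

```python
def should_escalate(confidence: float, severity: str,
                    novel_presentation: bool) -> bool:
    """Return True whenever the agent should hand off to a human,
    mirroring the good-resident heuristic of calling for help early."""
    if severity in ("critical", "deteriorating"):
        return True              # high stakes: always loop in a clinician
    if novel_presentation:
        return True              # unfamiliar territory: don't guess
    return confidence < 0.85     # low self-confidence: make the phone call

# should_escalate(0.92, "stable", False)        -> False
# should_escalate(0.92, "deteriorating", False) -> True
```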
Sarah Gebauer [00:05:29]: Absolutely critical. Time latency, as I mentioned: if you don't do things at the right time in medicine, it's a huge deal. This is different from a lot of other fields, like marketing and some sales settings. You really want the amount of time that the agent is thinking and calling on its different resources to be appropriate to the setting and to how much time the healthcare workers or the patients have to make a decision that's meaningful. And then a big thing with AI agents, which is not quite as much of an issue with human agents, is incomplete input. So when I walk by a patient's room and see they have their legs crossed, their arms are behind their back, they're looking through a magazine, I know that they're fine. Somebody might have called me in the middle of the night to rush in and check on this patient, and I can just walk by the room and know they're completely fine. I have a hundred different pieces of input coming into me just from walking by that room.
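On the latency point, one concrete pattern is to give every agent call a hard time budget tied to the setting, with an explicit fallback when the budget is blown. A sketch using a thread-pool timeout; the function names and the fallback payload are hypothetical:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def answer_within_budget(agent_call, question: str,
                         budget_seconds: float) -> dict:
    """Ask the agent, but never wait past the clinical time budget."""
    future = _pool.submit(agent_call, question)
    try:
        return {"answer": future.result(timeout=budget_seconds),
                "timed_out": False}
    except concurrent.futures.TimeoutError:
        # Better to tell the team to proceed without the agent than to
        # silently stall a time-critical decision like a stroke alert.
        return {"answer": None, "timed_out": True,
                "fallback": "escalate_to_human"}

# A stroke alert might get budget_seconds=5; a clinic question, 60 or more.
```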
Sarah Gebauer [00:06:34]: And we are not yet at a point where AI agents have that same amount of access to information about each patient. Therefore, one seemingly small piece of information could really change how you treat the patient, how you diagnose the patient, and ultimately what the patient's outcome will be. And then incomplete output. This is a major problem for medical residents: they will often tell you things and leave out key pieces of information. Part of being a human in the loop, or an attending physician, is being able to recognize when you should be getting different information, or different kinds of facts, from that resident. And that's another piece that is really important with agents: are they giving the right person the right information at the right time? And then agent-to-agent conflict or disagreement.
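Incomplete output is one of the easier risks to guard against mechanically: define the fields a handoff must contain and flag anything that omits them before it reaches a clinician. A minimal sketch; the required field names are invented for illustration:

```python
REQUIRED_FIELDS = {"patient_id", "working_diagnosis", "recommendation",
                   "confidence", "information_gaps"}

def missing_handoff_fields(agent_output: dict) -> list:
    """List required fields the agent left out of its report, the same
    check an attending runs mentally on a resident's presentation."""
    return sorted(REQUIRED_FIELDS - agent_output.keys())

# A non-empty result means the handoff is incomplete:
print(missing_handoff_fields({"patient_id": "A123",
                              "recommendation": "start diuresis"}))
# -> ['confidence', 'information_gaps', 'working_diagnosis']
```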
Sarah Gebauer [00:07:27]: This happens all the time in hospitals, amongst all levels of physicians and all kinds of healthcare workers. They just have different ideas about how to treat things. We don't really know how two agents working together with similar information might handle those kinds of disagreements, or how we are going to handle it when we disagree with an AI agent's recommendation. A lot of physicians feel like they're kind of damned if they do and damned if they don't: if they listen to the AI agent and it was wrong, then they're liable, but if they don't listen to the agent and it was right, then they're also liable. And there is, of course, a lot of legal liability that goes along with being a physician. And then, explaining themselves: there's a really clear format that I expect to receive information in.
Sarah Gebauer [00:08:17]: So, for example: this is an 87-year-old gentleman with a history of heart failure. He comes into the emergency room this evening after a very salt-heavy meal, complaining of shortness of breath and with swelling in his legs. I can tell you almost exactly what he has: a CHF exacerbation. And, more importantly, because the information was given to me in the format that I expect, I can very quickly make a diagnosis and treatment plan. If somebody had instead told me, well, there's a guy and he has swelling in his legs, but two weeks ago he went to his doctor and they listened to his heart and they found this sound, and the story goes on and on with details that really aren't as pertinent, then it's going to take me a lot longer to get to the point. Even more important, I'm going to be very annoyed by the time this person finishes telling me this story, and even more annoyed if it's a computer telling me the story in this non-protocolized manner.
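That expected format can be enforced in the agent's output layer, rendering whatever the agent knows into the standard one-liner before a physician reads it. A hedged sketch built around the talk's CHF example; the helper functions and their heuristics are hypothetical:

```python
def article_for(age: int) -> str:
    # Rough heuristic: ages spoken with a leading vowel sound ("eighty-seven",
    # "eleven", "eighteen") take "an"; good enough for typical adult ages.
    return "an" if str(age).startswith(("8", "11", "18")) else "a"

def one_liner(age: int, sex: str, history: list, presenting: str,
              context: str) -> str:
    """Render patient facts in the one-liner physicians expect:
    age/sex, pertinent history, then the presenting complaint."""
    return (f"This is {article_for(age)} {age}-year-old {sex} with a history "
            f"of {', '.join(history)}, presenting with {presenting} "
            f"after {context}.")

print(one_liner(87, "gentleman", ["heart failure"],
                "shortness of breath and lower-extremity swelling",
                "a salt-heavy meal"))
```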
Sarah Gebauer [00:09:24]: And then supervision: is it ready for the real world? We have to test these things on local data. Patients and settings are so important in healthcare, and institutions can differ in minute but very important ways in the kinds of patients they see, the kinds of illnesses they see, and what's most likely. We need to make sure that the models account for those kinds of differences amongst institutions. And then workflow integration is key. Whether it fits into the workflow is going to be what makes doctors, nurses, and healthcare workers happy with your product or not, as everyone knows. So you want to be tracking end-user satisfaction and doing time studies, making sure that it talks to the right person at the right time with the right kind of information. And humans in the loop are absolutely vital.
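A simple first gate for the "ready for the real world" question is to compare published performance against performance on the institution's own retrospective data, and block deployment when the gap is too large. A sketch; the AUC metric and the 0.05 tolerance are placeholder assumptions:

```python
def local_validation_passes(benchmark_auc: float, local_auc: float,
                            tolerance: float = 0.05) -> bool:
    """Flag models that validated well elsewhere but degrade on the
    local patient mix, case types, or documentation style."""
    return (benchmark_auc - local_auc) <= tolerance

# Published AUC 0.91 but 0.82 on local retrospective data: fails the gate.
print(local_validation_passes(0.91, 0.82))  # False
```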
Sarah Gebauer [00:10:17]: And then we also want to think about things like edge-case scenarios, failure modes, user acceptance, training, robustness, reliability; all these things are common in other fields as well, and we want to make sure we're still doing them in healthcare, of course. And then post-deployment: this is something we're not doing a great job of yet in healthcare, but I have high hopes. Things like performance tracking, which we do for residents with monthly evaluations and case log reviews, we can of course do with agents as well. And near-miss analysis, which we do with morbidity and mortality conferences, for example: for AI agents we can look at cases where humans overrode the agent's decision. And then feedback integration, which we do all the time for residents.
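The override-review idea maps naturally onto a small audit log: one row per case where a human rejected the agent's recommendation, reviewed periodically the way resident case logs are. A minimal sketch; the file layout and field names are made up for illustration:

```python
import csv
import datetime

def log_override(path: str, case_id: str, agent_recommendation: str,
                 human_action: str, reason: str) -> None:
    """Append one override event for later near-miss review, the agent
    analogue of a morbidity and mortality conference."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            case_id, agent_recommendation, human_action, reason,
        ])

# Clusters of overrides with the same reason point at a systematic gap
# to feed back into the agent, like monthly resident feedback.
log_override("overrides.csv", "case-0042",
             "discharge home", "admitted for observation",
             "agent missed rising lactate")
```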
Sarah Gebauer [00:11:08]: We say, hey, you didn't think about this diagnosis, and that was really important. And we need to capture that as well with AI agents. So, in summary: we can use lessons from medical training, and from the way hospitals use medical residents in high-risk settings, to give us ideas and tools for how to use AI agents here safely. We can anticipate and address risks, we can validate performance, and we can implement post-deployment tracking. And we can do all this and make it so that it's both a clear upgrade and not a huge, horrible logistical nightmare when you're practicing in healthcare. So thank you very much. And that's the end of my talk.
Adam Becker [00:11:51]: Sarah, thank you very much. That was absolutely incredible. Can you give us a little bit of context on how you began to think about all of this? Are you actively deploying these things? Like, how does your day-to-day fit into this?
Sarah Gebauer [00:12:09]: You know, that's such a great question, because I actually don't have any AI tools available to me in my day-to-day world. I do a ton of health tech consulting. I worked at RAND and did AI model evaluations for national security issues. So I've used AI models and evaluations extensively in other settings. But healthcare is very slow to change, and it just hasn't quite reached all the places where it seems like it should yet. A lot of the use cases right now are on the administrative side, so billing, revenue cycle management, those kinds of things, and not quite yet in the clinical area.
Adam Becker [00:12:46]: Yeah. Folks in the chat are asking: if you're trying to implement them in your workplace, what are the top use cases? Did you see any in more critical areas?
Sarah Gebauer [00:12:57]: Yeah, so there are so many great use cases, and I am very hopeful. I think it's really going to revolutionize healthcare; it's just going to be slower than we wish it were. Predictive modeling is being used pretty extensively, so predicting which patients might get sepsis, for example, which is a bloodstream infection that goes all over your body and has a high mortality rate. Those are being widely deployed. So you'll have a group of nurses in one location monitoring a large number of patients across all of a health system's hospitals, looking at which patients' vital signs or lab results suggest that they might be developing this, so we can catch it early, treat those patients early, and intervene more quickly. Those kinds of tools in the clinical setting are really important.
Sarah Gebauer [00:13:39]: And then of course ambient scribes are being very widely deployed, quickly, and those are more of a non-clinical kind of use case, although in the UK they are now considered a medical device. They summarize all the information between a doctor and a patient and put it into a nice doctor's note. And that is a huge time saver for physicians, or at least a mental time saver, even if it's not a physical one.
Adam Becker [00:14:01]: For folks in the audience, it's tons of engineers, folks building apps. What takeaway do you maybe want to leave them with? Obviously we've spent this time thinking about the framework. By the way, absolutely brilliant analogizing: normally we anthropomorphize agents; you've agent-ized humans. I mean, residents being the original agents.
Adam Becker [00:14:30]: You guys have had this experience. But is there something that you would wish for the engineers in the audience, perhaps trying to work on different products? What should they walk away with?
Sarah Gebauer [00:14:42]: One is that clinical input really is helpful and important. Number two is that trying to solve real problems, as you all know, problems that are really happening to people, is the best way to go about things. And number three, I am almost hesitant to say this, but doctors tend to hate tools that suggest diagnoses to them, because that's the one fun part of their job and they don't want to lose it. So there's a way for tools to be helpful and integrated and really be assistants in that arena, and I think that's crucial. Pitching it that way, instead of "this tool will diagnose everyone for you," is going to be an easier sell, because the best part of being a doctor really is figuring things out and using your brain.
Adam Becker [00:15:31]: Fascinating. Sarah, is there a good way to connect with you? Are you on Twitter?
Sarah Gebauer [00:15:36]: I'm on LinkedIn, and I have a Substack, ML for MDs, and I'm always happy to have more connections on LinkedIn as well. It was a pleasure to be here, and I really appreciate the invitation. I look forward to connecting with others in your community going forward.
Adam Becker [00:15:49]: Drop the links in the chat below and I'll make sure everybody gets them.
Sarah Gebauer [00:15:53]: Will do.
Adam Becker [00:15:54]: Thank you very much. Keep up the good work.
Sarah Gebauer [00:15:56]: Thanks.
