Todd Underwood leads reliability for the Research Platform at OpenAI, working to improve the reliability and usability of the software and systems that train some of the best models in the world.
Prior to that, he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services and are critical to almost every Product Area at Google. He was previously the Site Lead for Google’s Pittsburgh office. He recently published Reliable Machine Learning: Applying SRE Principles to ML in Production (O’Reilly Press, 2022).
Todd has primary expertise in distributed systems, especially for Machine Learning / AI pipelines, although he also has a background in systems engineering and networking. In addition to Reliable Machine Learning, Todd has presented work on ML at various conferences and forums including OPML20 and TWIMLCon 21 and 22. He has presented work on the future of systems and software reliability engineering at LISA13, LISA16, SREConEU15, and SREConEU22. He is a co-author of a chapter in the O’Reilly SRE Book. He has published three articles in USENIX’s ;login: magazine. He has presented work related to Internet routing dynamics and relationships at NANOG, RIPE, and various Internet interconnection meetings. He served on the Program Committee for OPML, was Chair of the NANOG Program Committee, and helped found the RIPE Programme Committee.
He is interested in how to make computers and people work much, much better together.