MLOps Community

Reliable ML

Posted Oct 05, 2022 | Views 831
# Reliable ML
# Revenue
# Decision Making
# Google
# Google.com
# Stanza
# stanza.systems
speakers
Todd Underwood
Research Platform Reliability Lead @ OpenAI

Todd Underwood leads reliability for the Research Platform at OpenAI, working to improve the reliability and usability of the software and systems that train some of the best models in the world.

Prior to that, he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services and are critical to almost every Product Area at Google. He was previously the Site Lead for Google’s Pittsburgh office. He recently published Reliable Machine Learning: Applying SRE Principles to ML in Production (O’Reilly Press, 2022).

Todd's primary expertise is in distributed systems, especially for Machine Learning / AI pipelines, although he also has a background in systems engineering and networking. In addition to Reliable Machine Learning, Todd has presented work on ML at various conferences and forums, including OPML20 and TWIMLCon 21 and 22. He has presented work on the future of systems and software reliability engineering at LISA13, LISA16, SREConEU15, and SREConEU22. He is a co-author of a chapter in the O'Reilly SRE Book. He has published three articles in USENIX's ;login: magazine. He has presented work related to Internet routing dynamics and relationships at NANOG, RIPE, and various Internet interconnection meetings. He served on the Program Committee for OPML, was Chair of the NANOG Program Committee, and helped found the RIPE Programme Committee.

He is interested in how to make computers and people work much, much better together.

Niall Murphy
Co-founder & Consultant @ Stanza

Niall Murphy has been interested in Internet infrastructure since the mid-1990s. He has worked with all of the major cloud providers from their Dublin, Ireland offices, most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). His books have sold approximately a quarter of a million copies worldwide, most notably the award-winning Site Reliability Engineering, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin, Ireland, with his wife and two children.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building LEGO houses with his daughter.

David Aponte
Senior Research SDE, Applied Sciences Group @ Microsoft

David is one of the organizers of the MLOps Community. He is an engineer, teacher, and lifelong student. He loves to build solutions to tough problems and share his learnings with others. He works out of NYC and loves to hike and box for fun. He enjoys meeting new people, so feel free to reach out to him!

SUMMARY

By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision-making, solve problems, or understand and influence customer behavior, you'll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind. (Book description from O'Reilly)

It was great that they wrote this book in the first place. The space is new, many people are entering it with a lot of questions, and the book answers those questions. There is also a lot of value in having all of the authors' experience documented in one place, where people can benefit from it.
