MLOps Community

DataOps is a Software Engineering Challenge

Posted May 17, 2022 | Views 483
# Maersk
# Software Engineering Challenge
# DataOps
# maersk.com
speakers
Micha Kunze
Lead Data Engineer @ Maersk

Micha has a background in physics; in fact, he has a Ph.D. in Biophysics. He's always been interested in crunching data, be that on HPC clusters or on his laptop.

Micha loves building and improving systems that provide value through data.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building LEGO houses with his daughter.

Ben Epstein
Founding Software Engineer @ Galileo

Ben was the machine learning lead for Splice Machine, leading the development of their MLOps platform and Feature Store. He is now a founding software engineer at Galileo (rungalileo.io) focused on building data discovery and data quality tooling for machine learning teams. Ben also works as an adjunct professor at Washington University in St. Louis teaching concepts in cloud computing and big data analytics.

SUMMARY

Micha's team delivers millions of forecasts a day for the global operations of one of the largest ocean logistics companies in the world. They need reliable systems while also changing quickly.

In this talk, Micha shares how they achieved this by following simple software engineering practices.

TRANSCRIPT

Quotes

“I think there are many practices from software engineers we can benefit from by applying them to data work.”

“We need to be excellent in operating data products as well. We don’t just develop models. We build the data, build the models, and operate.”

“Our complexity, a lot of it is coming from the business domain. It turns out to be very hard to run a global network of ships and utilize that to a good degree.”

“The amount of technology and data we have to integrate are very very different which makes it a bit harder. It’s not a greenfield. We have to integrate from systems that are really old to systems that are really new.”

“Customer experience is important but also customer behavior changes. The quality of the data because of that is a big big factor.”

“With as fast as possible, I don’t mean real-time. When we build new things and extract new features, that should be as fast as possible. It shouldn’t take six months or so to get new data.”

“We have a high change velocity so we live in continuous deployment. In production, we have roughly 20-25 changes per day. That’s important to us because we have to change quite quickly and react to a lot of things.”

“The model you built today won’t be good next year.”

“You can’t waste time on the bad experience that you set up. You want to do the right thing at scale.”

“If you want to have an industrialized setup, two things are really important: speed and reliability.”

“There are safe ways to test in production. You don’t have to go through different environments and create waste or create process steps that can increase errors.”

“We predict how much we have to do, when, and where.”

“Tests make you faster and that’s something I rarely see data teams doing.”

“Like a caveman, you can just make your changes and see what breaks. Do something about it, or roll back, or whatever you need to do.”

“Using generated data, you need to have a better understanding of the data. The connection of the data to the business is what creates value.”

“Observability tools cost you all the time. In that case, if you have an engineering team, I would highly recommend just sticking to open-source. It’s not worth it yet in my opinion.”

“The observability tools haven’t blown me away that I would really need them. I think the key-value store plus the metrics are already enough to get 80% of the value.”

“Run full pipelines, not unit tests.”

“Typically, if I want to test something, there is an investment upfront to write the test.”

“If the pipeline changes over time, of course, I have to maintain those tests as well, so it’s extra baggage. I want to make sure that when I write these, when I set up actual tests, I get a customer, not just schemas.”

“Test an interface that is rather stable. Spend extra time on the test.”

“It’s better to at least run the full job, not the pipeline.”

“Find one problem, fix one problem, find a new one.”
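
Several of the quotes above (running full pipelines rather than unit tests, using generated data, testing a stable interface) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the function names (generate_bookings, clean, forecast, run_pipeline), the record fields, and the naive moving-average forecast are all hypothetical stand-ins for a real pipeline.

```python
import random
from statistics import mean


def generate_bookings(n_days, seed=0):
    """Generate synthetic daily container counts (hypothetical schema)."""
    rng = random.Random(seed)
    rows = [
        {"day": day, "port": "A", "containers": rng.randint(80, 120)}
        for day in range(n_days)
    ]
    # Inject known-bad records that the pipeline is expected to drop.
    rows.append({"day": n_days, "port": "A", "containers": None})
    rows.append({"day": n_days + 1, "port": "A", "containers": -5})
    return rows


def clean(rows):
    """Drop records with missing or negative container counts."""
    return [
        r for r in rows
        if r["containers"] is not None and r["containers"] >= 0
    ]


def forecast(rows, window=7):
    """Naive stand-in model: mean of the most recent `window` days."""
    recent = sorted(rows, key=lambda r: r["day"])[-window:]
    return {
        "port": recent[0]["port"],
        "forecast": mean(r["containers"] for r in recent),
    }


def run_pipeline(rows):
    """The full job: raw records in, forecast out."""
    return forecast(clean(rows))


def test_full_pipeline():
    """Run the whole pipeline on generated data and check only its output."""
    result = run_pipeline(generate_bookings(30))
    # Assert on the stable output interface, not on intermediate steps.
    assert set(result) == {"port", "forecast"}
    # Valid generated values lie in [80, 120]; bad records must not leak in.
    assert 80 <= result["forecast"] <= 120
    return result


if __name__ == "__main__":
    print(test_full_pipeline())
```

The test only touches the input and the final output, so the intermediate steps can be rewritten without having to maintain the test alongside them.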
