MLOps Community

Scaling Data Reliably: A Journey in Growing Through Data Pain Points // Miriah Peterson // DE4AI

Posted Sep 18, 2024 | Views 697
Miriah Peterson
Data Engineer @ Soypete tech

Miriah Peterson, a seasoned engineer with 6 years of expertise in Go programming, excels as a Data Reliability Engineer. Her professional journey includes crafting videos, tutorials, and courses, showcasing her mastery in Go and Data Engineering. A dynamic speaker, Miriah has delivered talks on Go, machine learning, and data engineering. As a board member of Forge Foundation Inc. and an organizer of the GoWest Conference, Utah Data Engineering, and Machine Learning Utah meetups, she actively shapes the tech community. Miriah earned her bachelor's degree in physics from Brigham Young University in 2017, laying a strong foundation for her multifaceted contributions to the field.

SUMMARY

Software practitioners work to make their systems reliable. We hear teams boasting of having four or five 9s of uptime. Data systems depend on data that can be out of date or late. Pipelines and automated jobs fail to run. Data sometimes arrives late, changing the outcomes of processing jobs. All these situations are examples of data downtime, and they lead to misleading results and false reporting. As a DRE (Data Reliability Engineering) team, we borrowed tools and practices from SRE to build a better data system. In this talk, we will explore real-world reliability situations for our data systems and address three major topics to strengthen any pipeline: data downtime (what it is, how it affects your bottom line, and how to minimize it); data service level metrics (metadata for your data pipeline, and how reporting on pipeline transactions can lead to preventative data engineering practices); and data monitoring (what to look out for, and how to tell system failures from data failures).

TRANSCRIPT

Skylar [00:00:11]: But yes. Welcome, Miriah.

Miriah Peterson [00:00:16]: Hello.

Skylar [00:00:17]: Floor is yours.

Miriah Peterson [00:00:18]: Thank you so much. I hope you guys are having a fantastic conference and that you are ready for everything I have to share: tidbits and wisdom, and we'll all just leave as better engineers together. Like Skylar said, my name is Miriah. I am an engineer for Soypete tech, where I create content and give talks and stream on Twitch. I also have a day job where I do data engineering, and I have struggled, it is always a struggle, to build things reliably, but it's very important in today's society. So, do you guys have reliable data? I know you can't see each other, but raise your hands at home if this is you. Have you ever had a broken dashboard or a missing data view? Have you ever had an Airflow job not run? Have you ever had data not existing in a table or a data set, or a file not, you know, being able to be pulled from S3? Have you ever had duplicate data?

Miriah Peterson [00:01:41]: Have you ever run training on duplicate data? Oh, that's fun. Have you ever had an API unavailable? Maybe you're serving a model behind an API and you just can't call it. Have you ever had training jobs fail? Well, if this is you, you've experienced data downtime. Data downtime, and I stole this from Barr Moses, is any period where your data is partial, erroneous, or missing. This can also affect things that you serve with your data, right? We use data for training and inferencing. We use it for a lot of things.

Miriah Peterson [00:02:13]: And if you're not able to get results from it, you're experiencing a period of data downtime. And it's not just like software downtime when the servers are down, but it can come from data just being wrong itself. So what happens when your data is down? We obviously get broken dashboards. We get ML training that just doesn't work right. We'll get trainings that are inaccurate or are giving false results, or, you know, if you're like me and you've ever had financial operations run off of your data warehouse, that can break, too. Anybody working in fintech trying to predict stuff off of ML models? Oh, you better have some accurate data for that. So if we just step back, go back to the fundamentals. This is my favorite book ever.

Miriah Peterson [00:03:04]: Favorite book ever: Designing Data-Intensive Applications. The very first chapter of this book says that any system you design, so any piece of software you build, which is literally everything we do, has to be built with the mindset of reliability, maintainability, and operability. And if we don't use those as our foundation, we're not really building, you know, these kinds of systems or applications that can scale. To reference another book, another great book, Database Reliability Engineering: what does reliability mean? So we're building off of that, right? We want to make things reliable, maintainable, and operable. But what does that mean for failures? Systems will fail.

Miriah Peterson [00:03:48]: We want them to be able to break, because if they can't break, then we're scared to touch them, scared we can't innovate on them. You know, whenever you're like, oh, don't touch that legacy software, because if we touch it, then we own it or we break it. That's not what we want. We want something we can innovate on. And so what that means is we don't just want it to fail, we want to fix those failures fast, right? We want to fail fast and resolve fast. And that's how our systems become reliable, right? You can achieve five nines of uptime by resolving things quickly every time there is a failure. So let's talk about the whole point, right? We want to minimize downtime whenever our systems break.

Miriah Peterson [00:04:31]: We want to resolve it as fast as absolutely possible. This is your data engineering lifecycle, and here we have different points where things can fail. All of these yellow circles, which might be a little bit hard to see, are points where things can fail. We can fail when we're serving to machine learning or to analytics. We can fail at storage or transformation. But by leveraging security, data operations, orchestration, and software engineering, we can combat those in the most effective way. So that takes us to data reliability engineering, right? It's not a job, but it is a field and a mindset.

Miriah Peterson [00:05:13]: I stole this from the Data Engineering Podcast. Egor Gryaznov, a co-founder of Bigeye, said this on the podcast: you know what the next part of being a data engineer is? It's being a data reliability engineer. It's treating data quality as an engineering problem. What can we do to iterate on and make our data itself better, and make the things that come from our data better, right? If data is the foundation for AI and ML, we have to make that foundation better, iterate on it quickly, and solve it like we do an engineering problem. So how do we measure our data? Is anybody here an old school data engineer who remembers the five Vs? We haven't really talked about the five Vs in a while, but they're still important, because being able to quantify the availability and reliability of your data helps you express to the business the value that you're providing for potential AI and ML products, right? Volume: you want to know how much data is flowing through your streams. How much data are you using in your transformations, in your trainings? How much data is going through APIs when they're called? How much data is in the warehouse? Being able to say that confidently lets you give an importance to the different kinds of data you have. Variety: what are the sources of your data? Are you pulling data from various APIs? Are you pulling data from one database, or one table, or multiple tables? Veracity: are you getting the insights you expect from your data? Right. We expect things to behave a certain way, you know, within a certain statistical confidence, and we can get that by understanding our data.

Miriah Peterson [00:07:07]: And if the data doesn't match our statistical confidence, we have bad data. Value: what is the value of your data? Is all data being used? Is some data not being used? Should we store data we don't use, you know, given that storage is a cost? And then velocity: what is the throughput of your data flows? Are you able to get more data, train a new model, iterate, and deploy quickly? Are we able to get new data, embed it quickly, and use it as an embedding for our LLMs? Like, how quickly are we able to iterate on those foundations, ship them, and make them go live into our other products? The next part is, once we've figured out the data itself, now we have to figure out the services that are leveraging the data. These are basic metrics that we use in reliability engineering. I pulled these four straight from Google's site reliability engineering book, the four golden signals: latency, traffic, errors, and saturation. If you have an API that you're calling or serving a model from, how often is it being called? How long does it take to serve those calls? How many errors are being returned? Are we getting too many calls? Do we need to have more endpoints available to make it more available? An example of this: I have a Twitch bot with a model running behind it, so that Twitch chat can type something and get a response from the bot. Right now it runs at about a 30-second delay, which is usually about how long it takes for a normal person to respond in Twitch chat.
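
A minimal Go sketch of what instrumenting those four signals on a model-serving endpoint could look like, using the Prometheus client library. The metric names, handler, and port are illustrative assumptions, not code from the talk or from the Twitch bot.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metrics for a model-serving endpoint, covering the four
// signals: latency, traffic, errors, and saturation.
var (
	requestLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "model_request_duration_seconds",
		Help:    "Latency: how long each inference call takes.",
		Buckets: prometheus.DefBuckets,
	})
	requestTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "model_requests_total",
		Help: "Traffic and errors: calls served, labeled by status.",
	}, []string{"status"})
	inFlight = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "model_requests_in_flight",
		Help: "Saturation: requests currently being handled.",
	})
)

func predictHandler(w http.ResponseWriter, r *http.Request) {
	inFlight.Inc()
	defer inFlight.Dec()
	start := time.Now()

	// Call the model here; pretend it succeeded.
	status := "ok"

	requestLatency.Observe(time.Since(start).Seconds())
	requestTotal.WithLabelValues(status).Inc()
	w.Write([]byte("prediction\n"))
}

func main() {
	http.HandleFunc("/predict", predictHandler)
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}
```

Once these are exposed, latency, traffic, error rate, and saturation can all be graphed and alerted on from the same scrape.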

Miriah Peterson [00:08:53]: In chat, you type it, then you have to read it, then you have to respond, so 30 seconds to a minute is a normal iteration for chat itself. If it takes five minutes to respond, that is too long, and I need to do something to make that more available to Twitch chat for a better experience. So now that we've defined the basics, we have to apply observability to it. Making something reliable means minimizing downtime; making something observable allows us to understand what's happening to the data in our system, right? So we use SLAs, SLOs, and SLIs, or service level agreements, service level objectives, and service level indicators, to define that. Here's just an example for a traditional data pipeline, right? Most data pipelines run as batch scripts, right? Unless you're a FAANG company, you really don't need super high freshness. So, you know, about a day, right? We're not doing live trainings; we can train once a day.

Miriah Peterson [00:09:53]: So we have an SLA that our data freshness is always going to be at most 24 hours, right? Hopefully, you know, we have data newer than 24 hours, but the oldest it can be is 24 hours, right? That's our agreement. The objective, how we try to make sure that happens, is that our pipelines extract data from the source and complete their transformation once a day, and we have the remediation measure of triggering it manually if for some reason that process doesn't occur. And the SLIs, the way we indicate to the engineers that something has happened, would be an error on the pipeline, a timeout reported, and an alert that says, hey, this didn't run, you need to act and re-trigger it. So, great, we understand how to do it. Now we need to understand how to make those agreements for the right system. If you're a data engineer, you have lots of different kinds of people you report to. You might have a data scientist who's doing research and investigation and trying to make better models or do better tokenization or something like that. These people care mostly about historical data; they don't care about real time, right? So they're going to say, great, is the data there? Is the data accurate? So you're not getting any real-time alerts; you're just saying, oh great, this data is accurate, I'm not missing data.
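
As a rough sketch of how that agreement might be wired up, here is a Go freshness check that compares the newest load timestamp against the 24-hour SLA and raises the alert that tells an engineer to re-trigger the pipeline. The metadata lookup is stubbed out, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// freshnessSLA is the agreement from the example: the newest data in the
// warehouse should never be more than 24 hours old.
const freshnessSLA = 24 * time.Hour

// latestLoadTime would normally query pipeline metadata, for example a
// MAX(loaded_at) over the target table; it is stubbed out here.
func latestLoadTime() (time.Time, error) {
	return time.Now().Add(-26 * time.Hour), nil // pretend the pipeline stalled
}

func main() {
	loadedAt, err := latestLoadTime()
	if err != nil {
		log.Fatalf("SLI: could not read pipeline metadata: %v", err)
	}

	age := time.Since(loadedAt)
	if age > freshnessSLA {
		// The SLI fires: alert so an engineer can re-trigger the pipeline
		// manually, which is the remediation described above.
		fmt.Printf("ALERT: data is %s old, breaching the %s freshness SLA\n",
			age.Round(time.Minute), freshnessSLA)
		return
	}
	fmt.Printf("OK: data is %s old, within the freshness SLA\n", age.Round(time.Minute))
}
```

In practice a check like this would run on a schedule, from the orchestrator or a monitor, and feed whatever system actually pages the team.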

Miriah Peterson [00:11:16]: For the data scientist, then, there are no changes in data set size, and they can pull the data accurately. If you have a business executive, right, you might have financial operations running out of a warehouse, or you need a presentation for a board meeting; they might care more about 24-hour freshness, so the SLIs that we were using in that last slide apply perfectly to them. And then if you're an end user of an ML application, right, you're using something on your phone that has a chat application in it, you're probably going to want something that feels pretty instantaneous. So that means we need to have up-to-date models, we need to have SLIs of potentially, you know, just a few seconds, and if there are multiple errors in a few seconds, we're getting alerts and we're maintaining those contracts for the benefit of the end party or the stakeholder. The last thing I want to talk about is severity. Sometimes, right, we have lots of things that can go wrong with data, and a database not being available is a much bigger deal than duplicate values for a data scientist. So sometimes a downtime does not mean an outage, right? When we're trying to determine the severity of a situation, we're trying to say, is this something we need to address immediately? Can it wait till the next business day? Can it wait till the next sprint? That is typically how I like to think about severity.

Miriah Peterson [00:12:49]: If you get pinged in the middle of the night, is this an acknowledge-it-and-deal-with-it-tomorrow, or is this something where you have to get out of bed, get your cup of coffee, and address it right now, because if you don't, somebody on the other side of the world can't get their service and you're literally losing money by ignoring it? These severities kind of help us maintain a little bit of sanity, right? If you're a data engineer, you're constantly getting pinged by somebody saying, hey, my data is missing. Hey, this is wrong. Hey, can you run this report for me? And we don't want that. We don't want to be constantly pinged; we want to be ahead of it. So being able to say, is this actively causing us to lose money? Then I'll be on it. If not, great, I'll get to it tomorrow, or we'll punt it and address it in the future.
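
A small Go sketch of that triage logic, with hypothetical incident fields; the point is just that severity is decided by impact, actively losing money, blocking a stakeholder, or neither, rather than by which component broke.

```go
package main

import "fmt"

// Severity tiers as described in the talk: fix it next sprint, fix it the
// next business day, or get out of bed and fix it now.
type Severity int

const (
	NextSprint Severity = iota
	NextBusinessDay
	PageNow
)

// Incident carries illustrative impact flags; real metadata would come
// from your alerting or incident system.
type Incident struct {
	LosingMoney       bool // e.g. financial operations or a paid product is blocked
	BlocksStakeholder bool // e.g. a report or training run is waiting on the data
}

// triage maps impact to urgency.
func triage(inc Incident) Severity {
	switch {
	case inc.LosingMoney:
		return PageNow
	case inc.BlocksStakeholder:
		return NextBusinessDay
	default:
		return NextSprint
	}
}

func main() {
	fmt.Println(triage(Incident{LosingMoney: true}))       // 2 (PageNow)
	fmt.Println(triage(Incident{BlocksStakeholder: true})) // 1 (NextBusinessDay)
	fmt.Println(triage(Incident{}))                        // 0 (NextSprint)
}
```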

Miriah Peterson [00:13:36]: So with that, we are data reliability engineers. Now we're thinking in the mindset where we can say, hey, I am crossing the bridge between what I need to serve to my customer, their experience, and the data underneath it, right? We're making it so people can use the data in the experience they want, the way they want to use it; we understand the way they want to use it, and we're delivering that to them. That is a data reliability engineer, because you have the appropriate alerts in place, you understand what you're doing for reliability, and you're leveraging your SLAs to your advantage. And so just in conclusion, right, we're wrapping this up. We had ten minutes; we'll take our ten minutes. Everybody experiences downtime. We all have data downtime.

Miriah Peterson [00:14:31]: And the way to minimize our data downtime is with reliability engineering. This reliability is not quite what normal SREs do, right? We have to add a different complexity layer of involving data, so we want to leverage metrics for that. You're going to have more metrics the bigger your data organization is. You're going to want to be able to understand and quantify, using our five Vs, how accurate your data is and how reliable it is as a data set. Then you're going to want to leverage SLAs to say, great, now the systems on top of that data set are also covered. And make sure you use the appropriate urgency to remedy downtime, for your sanity.

Miriah Peterson [00:15:14]: Right. We all want to be happy when we go to work and not get burnt out, but also to set the appropriate expectations with your stakeholders. And thank you so much for listening to this talk, and I hope you enjoy the rest of the conference.

Skylar [00:15:34]: Awesome. Thank you so much. We're right at time, so we don't have time for questions, but yes, this was great. Loved learning a little bit about making our data reliable, and love the mentality of treating it like an engineering problem, so really love to see that. We have our next speaker ready, so I'm going to go ahead and bring him up. Thank you so much, Miriah.
