MLOps Community

Do We Really Need Data Contracts and Observability? (Hint: Yes) // Mark Freeman // DE4AI

Posted Sep 18, 2024 | Views 493
speaker
Mark Freeman
Tech Lead, GTM Engineering @ Gable

Mark is a community health advocate turned data scientist interested in the intersection of social impact, business, and technology. His life’s mission is to improve the well-being of as many people as possible through data—especially among those marginalized.

Mark received his M.S. from the Stanford School of Medicine where he was trained in clinical research, experimental design, and statistics with an emphasis on observational studies. In addition, Mark is also certified in Entrepreneurship and Innovation from the Stanford Graduate School of Business.

He is currently a senior data scientist at Humu where he builds data tools that drive behavior change to make work better. His core responsibilities center around 1) building data products that reach Humu's end users, 2) providing product analytics for the product team, and 3) building data infrastructure and driving data maturity.

SUMMARY

When companies explore data quality initiatives, it’s common to wonder whether data contracts or observability is more critical. In this talk, we’ll clarify the unique roles each plays: data contracts focus on preventing known data quality issues, while data observability detects unknown issues across the entire data system. Drawing on real-world insights, we’ll show how these two approaches complement one another—think of observability as a flashlight illuminating the whole data landscape, while contracts act as a laser pointer, targeting specific areas. Attendees will learn why using both is essential for ensuring data reliability and efficiency.

TRANSCRIPT

Adam Becker [00:00:05]: I think this is a perfect segue for Mark's presentation. Let's see, Mark, are you with us?

Mark Freeman [00:00:11]: I'm on mute. I'm on. I'm here now. So to be here, and that was a great talk, Aishwarya. Thank you so much for having me. And Adam, I'll let you do what you need to do real quick with the slides.

Adam Becker [00:00:21]: Thank you very much, Aishwarya. So, Mark, you're going to talk to us about whether or not we need data observability and data contracts, whether those are still needed. And I think that you told us, in short, the answer is no, that we don't.

Mark Freeman [00:00:40]: It is yes. Oh, okay. Yes.

Adam Becker [00:00:44]: I got that mixed up with another talk. Okay. So I'm going to share my screen.

Mark Freeman [00:00:48]: I need to go talk to that person.

Adam Becker [00:00:52]: Okay, so let me find your slides. I think they're right here. And I'll be back in a few minutes.

Mark Freeman [00:01:00]: Perfect. Perfect. And then you're going to be running the slides in the background as well?

Adam Becker [00:01:03]: Yes. Just tell me or indicate or do like a wink.

Mark Freeman [00:01:06]: Perfect. That sounds good. Smoke signal. Perfect. So, hello, everyone. My name's Mark Freeman. I'm a tech lead here at Gable AI. Basically, data contracts and data observability: what I'm really focusing on is, hey, why do we need data contracts? Why do we need data observability? They offer very similar things.

Mark Freeman [00:01:27]: My argument is, hint: yes, we do. And if you've never heard about data contracts before, I'll provide a high-level overview of that. And so next slide, please. This presentation is based on Chad Sanderson and I's upcoming O'Reilly book on data contracts. And to give you a little bit of background on how I got here: I'm a data scientist turned data engineer, and my last role was kind of the typical villain arc of a data engineer. I got brought into the startup to do all the cool ML and analytics things, and the data was just awful. It was just super hard to work with, and the infrastructure wasn't there. And because we were a startup, there wasn't much to work with in the first place.

Mark Freeman [00:02:11]: And so I just taught myself. I said, I need to do these analyses, I'm just going to build the infrastructure myself. And I started doing that over and over and over again, and I became a data engineer. Through that, I sat between a software engineering team and the data science team and just saw this huge disconnect between both sides in how they think about and use data. And so I really wanted to solve that problem. I was like, how do we fix this communication problem? And Chad Sanderson reached out to me. He's like, hey, I'm starting a company, Gable AI, to fix this exact problem. So I joined as the first employee to help really think through how we make data contracts happen and bring them to market. And so next slide, please.

Mark Freeman [00:02:48]: And so, basically, really understanding what the roles are in data quality. This is just a typical data stack, high level, where you have your transactional database, you have an analytical database, and then it's being replicated into where you do analytics or ML, things like that. And there are all these different points at which you can put data contracts or data observability. I would argue that everywhere you see a letter (A, B, C, D, E, F, G) is where you can put data contracts. Basically, it's this communication between one node in the data system and another, from where you're extracting to the target database. The same thing can be said for data observability as well. And so it leads to this question for many teams: where do I even use this? Does it make sense on one side or another side? Should I use both at every single area? And like many answers in data: it depends. But I argue that many times you can think of data contracts as upstream and data observability as downstream.

Mark Freeman [00:03:54]: Next slide. Here we can start diving into this. What is data observability? This quote is from Gartner, and I highly recommend it. This presentation won't focus too much on observability because there's already a great book on data observability from O'Reilly as well. But at a high level, data observability is the ability of an organization to have broad visibility of its data landscape and multilayered data dependencies, such as pipelines and infrastructure, at all times, with an objective to identify, control, escalate, and remediate data outages rapidly within acceptable SLAs. So a typical thing, like, hey, I think Aishwarya was talking about this in the previous conversation: the predictions are going outside of a standard deviation of XYZ. You are going to want to know that ahead of time. And that's something you can't really catch early; it's as it's happening, and you're seeing these data changes happen, that you want to be notified. But what if you could be notified before data is even written? And that's what data contracts are.
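
[Editor's illustration] The "predictions are going outside of a standard deviation" alert mentioned above could look roughly like the following. This is a minimal sketch, not something shown in the talk: the function name `out_of_band`, the three-sigma threshold, and the sample values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def out_of_band(history, latest, max_sigma=3.0):
    """Return True when `latest` sits more than `max_sigma` sample
    standard deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        # A perfectly flat history: any different value is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > max_sigma


# Hypothetical daily averages of a model's prediction score.
baseline = [0.48, 0.52, 0.50, 0.49, 0.51, 0.50, 0.47, 0.53]

for value in (0.51, 0.90):
    status = "ALERT" if out_of_band(baseline, value) else "ok"
    print(f"{value:.2f}: {status}")
```

Note that a check like this only fires once the data already exists, which is the "after you observe" property the talk goes on to contrast with contracts.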

Mark Freeman [00:04:59]: So going to the next slide. Data contracts are a data architecture pattern that extends software-driven collaboration to data teams, enhancing data quality through human-in-the-loop reviews, similar to how these systems have improved code quality for product teams. And what we're essentially talking about is the CI/CD workflow, GitHub, and version control. And if you look to the left here, we have the building blocks of data contracts. At the base layer we have data assets, whether it's an analytical database, transactional database, event streaming, or even first-party data and third-party platforms like Salesforce; that's a common messy data set you might have to work with, right? So then from there you're trying to understand what data we are expecting. And so we have the contract definition. So you have a data contract spec, which is typically a YAML file.

Mark Freeman [00:05:51]: From there you input the business logic, the schema registry, sorry, the business logic and the schema for that, and through that you cross-reference that to a schema registry or a data catalog. So we say, hey, for this data asset and this database, we expect this schema with this many columns, with this criteria, and it's going to cross-reference the data catalog and say, hey, does this match expectations? Yes or no. The next phase is detection. Whether it's change data capture, stream processing, lineage, static code analysis, or just live data monitoring, it's basically constantly checking: hey, a change is about to happen to the database, let's make sure it meets your expectations. And then finally, in prevention, how do you take action? This is through CI/CD, version control, monitoring, and alerts. So those are the pieces of it. But what does it look like altogether? Going to the next slide, we have a high-level view from the book, and there's a lot going on in the slide, so bear with me here.

Mark Freeman [00:06:57]: V1 of the book, right? We're trying to iterate and get this more condensed. But essentially what happens is, as a data consumer, say you're a data scientist, you're going to identify a constraint that you need, like, hey, we use this for an ML model. The data needs to be in this format for it to work. You're going to bring that up to your data producers upstream, at the transactional database where the data is being sourced, and say, hey, can you agree to this? We want to create a contract spec. And they're going to be like, yeah, we can do that, we agree to that. Well, here's the thing. What if that data producer left the next day, they got another job, and you have a new person? The new person has no idea who you are or what the agreement was. But because it's saved as code via the YAML file and version control, whenever they make a new request for a data asset change, it's going to go through the CI/CD process and version control. And that CI/CD check is going to check the contract.

Mark Freeman [00:07:53]: Hey, did we make a change? And is the data asset under change under contract? If yes, did it pass the checks? If it does, no error, you're happy. If it fails, then you're going to have failure tiering. So maybe it's a really important ML model that's doing the ranking for shopping, things like that, where it's tied to a high revenue number. You may want that to be a hard failure and be like, you cannot merge this code and change the data asset until we talk. Or maybe it's something a little more lightweight, like a dashboard that's used maybe once a month. Okay, we'll let this go through, but let's have a conversation. And so a big piece of this is how you can have a human-in-the-loop aspect to it and get the right people at the right time to talk, from the data producer side and the data consumer side. Next slide, please.
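
[Editor's illustration] The decision flow just described (is the asset under contract, did the checks pass, and if not, how severe is the failure) can be sketched as a small gate function. The tier names "hard" and "soft", the `ci_gate` function, and the messages are assumptions for illustration; the talk does not specify how tiers are encoded.

```python
def ci_gate(under_contract, violations, tier):
    """Decide what a CI check should do with a proposed data asset change.

    Returns (exit_code, message). A non-zero exit code is what would
    block the merge in a typical CI system.
    """
    if not under_contract:
        return 0, "asset is not under contract; nothing to check"
    if not violations:
        return 0, "contract checks passed"
    summary = "; ".join(violations)
    if tier == "hard":
        # e.g. a revenue-critical ranking model: stop until people talk.
        return 1, f"BLOCKED, talk to the data consumers first: {summary}"
    # e.g. a dashboard used once a month: let it through, start a conversation.
    return 0, f"WARNING, merge allowed, consumers notified: {summary}"


cases = [
    (False, [], "hard"),
    (True, [], "hard"),
    (True, ["type changed for amount_usd"], "hard"),
    (True, ["type changed for amount_usd"], "soft"),
]
for under_contract, violations, tier in cases:
    code, message = ci_gate(under_contract, violations, tier)
    print(code, message)
```

In a real pipeline the returned code would be handed to the process exit status (for example via `sys.exit`) so the CI system can fail the job.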

Mark Freeman [00:08:41]: And so these are the key differences that I see between data contracts and data observability. With data contracts, we are preventing specific data quality issues, while data observability is highlighting data quality trends. And one of the things is, to observe means it already happened, right? And so data contracts are before you observe, data observability after you observe. In addition, data contracts are included within the CI/CD workflow, while data observability complements the CI/CD workflow. And to elaborate on that, data contracts are informed by business logic, and data observability reflects how data captures business logic. Furthermore, data contracts have very targeted visibility, while data observability is looking at the entire data system. And then the biggest difference between both is alerts before change and alerts after change.

Mark Freeman [00:09:36]: Next slide, please. And as a takeaway, this is how I view both of these tools: flashlights and laser pointers. So say, for instance, you have your data system that you're trying to make sense of, right? If you just try to brute-force data contracts for everything, that's going to be a very tiring task, because there are so many different data assets and there's so much logic; you only really want the most important things. Well, for data observability, you want to know where all the skeletons are in the closet. And so I remember I was talking to Chris Bergh from DataKitchen, asking about where companies should start with DataOps, right? And he said, the first thing I think companies should do is put little thermometers everywhere to understand what is wrong with your data system. I think data observability is perfect for that. You have the flashlight, you can shine it over the entire data system and understand what is wrong and what is working. Then once you have an idea of, like, okay, these are the most important pieces.

Mark Freeman [00:10:30]: Take the laser pointer with data contracts and say: this critical workflow, we want to have extra attention on this and make sure everyone's aligned. Next slide. And so that completes my talk. Hopefully it fits within the time for that. Perfect. He's popping over right there. And so you can actually download the early release chapters of this book at Gable AI slash data contracts book.

Mark Freeman [00:10:56]: Or you can do the QR code. Fingers crossed. It worked. And then also feel free to add me on LinkedIn. Mark Freeman on there and happy to connect and talk to anyone and nerd out about data quality, data contracts.

Adam Becker [00:11:12]: Thank you very much. This was excellent and well done on the book. I'm stoked to get it. I'm just going to sign up right now.

Mark Freeman [00:11:19]: Thank you so much.

Adam Becker [00:11:21]: Of course, Mark, this was fascinating. Please stick around in the chat in case people have questions. Actually, there is one. Let me just see if I could snatch a little bit of time here. What tools available do we currently have to start using and implementing data contracts?

Mark Freeman [00:11:36]: Yes, that is a great, great question. We're going to be covering that in the book. There are two aspects to it. There's the open source side and then there's the closed source side. Obviously, I'm with Gable AI. Ours is a closed source system for that. So if you're interested, definitely check out our website for that. But there are emerging open source tools coming out that have a standardized contract spec.

Mark Freeman [00:12:02]: So I would look up data contract spec on GitHub and find tools like that. It's still kind of an emerging space, so standardization is relatively new. But the key thing is that there are data contract tools, but I would consider data contracts more of an architecture pattern. You can use various open source tools to build that pattern. One of the key ones that we're discussing in the book, to build a spec, is to use JSON Schema, because it has high flexibility for what you can say: this is what we expect. For various languages, there are different plugins for that.
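
[Editor's illustration] To make the JSON Schema suggestion concrete, here is a minimal sketch of a spec expressed as a JSON Schema document, plus a deliberately tiny checker. The `orders` fields are hypothetical, and `check_record` is not a JSON Schema validator: it only understands the `required` and `type` keywords on a flat object. A real setup would use one of the validator plugins the speaker alludes to.

```python
import json

# Hypothetical spec for one record of an "orders" asset, written as a
# JSON Schema document (draft 2020-12).
ORDER_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "orders",
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount_usd": {"type": "number"},
        "coupon_code": {"type": ["string", "null"]},
    },
    "required": ["order_id", "amount_usd"],
}

# Mapping from JSON Schema type names to Python types (subset only).
_PY_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def _matches(value, type_name):
    # bool is a subclass of int in Python, but JSON booleans are not numbers.
    if isinstance(value, bool) and type_name in ("number", "integer"):
        return False
    return isinstance(value, _PY_TYPES[type_name])


def check_record(record, schema):
    """Check only `required` and per-property `type`; return error strings."""
    errors = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in record
    ]
    for key, rules in schema.get("properties", {}).items():
        if key not in record or "type" not in rules:
            continue
        allowed = rules["type"] if isinstance(rules["type"], list) else [rules["type"]]
        if not any(_matches(record[key], t) for t in allowed):
            errors.append(f"wrong type for {key}: expected {' or '.join(allowed)}")
    return errors


print(json.dumps(ORDER_SCHEMA, indent=2))
print(check_record({"order_id": 7}, ORDER_SCHEMA))
```

Keeping the spec as plain JSON is what makes it portable: the same document can be read by validators in whichever language the producing service is written in.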

Mark Freeman [00:12:41]: Then in addition, for the CI/CD process, you have GitHub Actions, things like that. And then the data assets are just really dependent on your company and what you use for that. Then finally, for data catalogs, there are so many different data catalogs popping up right now. I know Snowflake and Databricks came up with their own data catalogs recently. Also DataHub and things like that. It's relatively new.

Mark Freeman [00:13:07]: There's no one all-encompassing tool yet that's open source. You have to piece these things together. There's a lot to talk about in one kind of question.

Adam Becker [00:13:15]: Yeah, yeah, of course. Last thing here. What is the book on data observability from O'Reilly?

Mark Freeman [00:13:20]: Yeah. Oh, man. I...

Adam Becker [00:13:23]: If you can find it and put it in the chat, yeah, that would be excellent.

Mark Freeman [00:13:26]: I'm happy to do that. I'll send that over. But if you look up O'Reilly data observability, it's literally called data observability.

Adam Becker [00:13:31]: Okay, sounds good, Mark. Thank you very much.
