MLOps Community

Data Engineering for ML

Posted Aug 18, 2022 | Views 1.4K
# Data Modeling
# Data Warehouses
# Semantic Data Model
# Convoy
# Convoy.com
speakers
Chad Sanderson
CEO & Co-Founder @ Gable

Chad Sanderson, CEO of Gable.ai, is a prominent figure in the data tech industry, having held key data positions at leading companies such as Convoy, Microsoft, Sephora, Subway, and Oracle. He is also the author of the upcoming O'Reilly book "Data Contracts" and writes about the future of data infrastructure, modeling, and contracts in his newsletter "Data Products."

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

Joshua Wills
Angel Investor @ N/A

Josh Wills is an angel investor specializing in data and machine learning infrastructure. He was formerly the head of data engineering at Slack, the director of data science at Cloudera, and a software engineer at Google.

SUMMARY

Data modeling is building relationships between core concepts within your data. The physical data model shows how the relationships manifest in your data environment, but then there's the semantic data model, where entity relationship design is abstracted away from any data-centric implementation.

Let's do the good old fun of talking about why data modeling is so important!

TRANSCRIPT

Josh Wills [00:00:00]: Hi, I'm Josh Wills. I am a software engineer at WeaveGrid. I'm a recovering manager. I used to be the director of data engineering at Slack many, many lifetimes ago, director of data science, etc. And I drink espresso, and in particular, Blue Bottle's 17ft Ceiling is just perfect. Like, it is basically perfect espresso as far as I'm concerned. Just straight up double espresso.

Josh Wills [00:00:23]: That's my happy spot.

Chad Sanderson [00:00:24]: My name is Chad Sanderson. I am the lead product manager for the data platform team at Convoy. And then prior to this, I was a product manager at Microsoft working on the big data team called the AI platform over at Microsoft. And I don't drink coffee, but I do drink green tea every morning.

Demetrios [00:00:54]: I am stoked because we've got Josh Wills and Chad Sanderson. This is like Christmas came early for me. I'm so stoked to have the both of you on here. I know this was kind of last minute, rearranging all of our schedules to make it work. It worked. I mean, originally, Chad, I wanted to talk with you. We had our first encounter. You came on the meetup.

Demetrios [00:01:17]: You blew my mind. I want to dive into more of this stuff. Since then, you've become somewhat of a LinkedIn personality. You might win LinkedIn's top voices for 2022. And, yeah, your personal brand has been growing. You're killing it. You've got your Substack going. That is also killing it.

Demetrios [00:01:35]: It's been making a lot of waves in the MLOps Community. Anyone out there that is not subscribed to Chad's Substack, go and do that right now. Do yourself a favor. You will not regret it. And last thing I will say, Josh, we gotta plug this before we jump into it. Josh is going to be leading a course on data engineering for machine learning, I believe.

Josh Wills [00:01:55]: That's right. I. That's right.

Demetrios [00:01:56]: And that's coming up in October. So anyone?

Josh Wills [00:01:59]: September. It's September. Mid-September. Yeah. Sorry, just had to get that.

Demetrios [00:02:04]: You are a slacker.

Josh Wills [00:02:05]: Yeah, exactly.

Demetrios [00:02:06]: September. It's happening. Get after it. We'll leave a link to that in the show notes if you want to jump in to his course. And we should probably start, Chad, with what I feel like made a lot of waves in the MLOps Community. You wrote an incredible blog post about data modeling. Can you just give us some background on that? What the hell is it that you're talking about, for those who do not subscribe to your Substack? And then we'll start doing the good old fun of talking about why it's so important.

Chad Sanderson [00:02:37]: So data modeling, if you're not super familiar with it, the sort of elevator pitch is you're basically building relationships between core concepts within your data. There's actually two places that you can do this. So there's something called the physical data model, which is like how the relationships manifest in your data environment, whether it's like Snowflake or whatever it happens to be. So this is the process of using dbt, and you're joining tables together and all of that stuff. But then there's something else, which is like a semantic model. I hate to call it legacy or old school, because I don't really think it's legacy. I think it's still pretty meaningful. And a lot of data architects still do this entity relationship design, where you think about the entities that are meaningful at your company, abstracted away from any data-centric implementation, and you then diagram the relationships between those entities.

Chad Sanderson [00:03:48]: You think about the cardinality, like, does it have a one-to-many relationship or many-to-one? You create this, essentially, diagram of the world, all of the properties that are relevant and sort of germane to each entity, and then that is what informs your physical model, like the thing that you actually build in Snowflake or whatever it might be. The point that I called out is that that process of taking sort of a strategic, very thoughtful approach to modeling is not that common in a lot of cloud-centric, data-based organizations. Convoy was certainly among them. What has become increasingly common, I've noticed, is the ELT approach of, like, let's just collect everything we possibly can from every source we possibly can, dump that into Snowflake or dump it into Databricks, and then we'll kind of, like, do a little bit of modeling just so that we can actually use the data to answer some business question. But it's not comprehensive. It's not like the full spectrum of relationships, which makes our data fundamentally less useful. And then you see a lot of these basically the same queries being written over and over and over again.

Chad Sanderson [00:05:06]: One reason that happens is because the data model is actually, like, not comprehensive. And so people end up replicating things and duplicating things. I think that's a big problem. I think there's a huge problem there on Snowflake spend. We're seeing this at Convoy now. If you do not have a great data model, you get a ton of replication, and your spend goes through the roof. The complexity of the queries is quite impactful, and when you start achieving big data at scale, that query complexity is, like, very meaningful.

Chad Sanderson [00:05:39]: Maybe when you're a small company, it's not. But once you start adding a lot of sort of data to the fire, it is actually pretty meaningful. And then just from the perspective of, like, the data being useful, like, being able to answer questions and not having data engineers and data scientists and data analysts, like, constantly getting pinged from your business partners, like product managers. Yeah. Like, having a great data model solves a lot of that. So that was sort of the core of the piece.
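
To make the semantic-versus-physical distinction concrete, here is a minimal Python sketch of an entity relationship model with cardinality, and of how it could inform foreign keys in a physical model. This is only an illustration under stated assumptions: the `Entity`, `Relationship`, and `foreign_keys` names, the shipper and shipment properties, and the `<parent>_id` naming convention are all hypothetical and do not come from Convoy or any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Cardinality(Enum):
    ONE_TO_ONE = "1:1"
    ONE_TO_MANY = "1:N"
    MANY_TO_ONE = "N:1"
    MANY_TO_MANY = "N:M"


@dataclass(frozen=True)
class Entity:
    name: str
    properties: tuple[str, ...]


@dataclass(frozen=True)
class Relationship:
    source: Entity
    target: Entity
    cardinality: Cardinality


# Hypothetical entities; the property names are illustrative only.
shipper = Entity("shipper", ("shipper_id", "name"))
shipment = Entity("shipment", ("shipment_id", "shipper_id", "status"))

# The semantic model: one shipper has many shipments.
semantic_model = [Relationship(shipper, shipment, Cardinality.ONE_TO_MANY)]


def foreign_keys(model):
    """Derive (child_table, fk_column, parent_table) hints for a physical model."""
    hints = []
    for rel in model:
        if rel.cardinality is Cardinality.ONE_TO_MANY:
            parent, child = rel.source, rel.target
        elif rel.cardinality is Cardinality.MANY_TO_ONE:
            parent, child = rel.target, rel.source
        else:
            continue  # 1:1 and N:M need other handling, e.g. a join table
        # Assumed convention: the child table carries "<parent>_id".
        hints.append((child.name, f"{parent.name}_id", parent.name))
    return hints


print(foreign_keys(semantic_model))
```

The point of the sketch is the direction of the dependency: the relationships are written down once, independent of any warehouse, and the table-level details are derived from them rather than rediscovered in each query.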

Demetrios [00:06:09]: So good, man. And who would have thought, like, data that is useful for people outside of the data scientists? That's a theory that we can all rally around. And I also think about, like, this idea of strategy and needing to have the strategy when you're going into things. And I imagine that you can get away with it. It's almost like the idea of technical debt, right? You can get away with it for a certain amount of time until you get to a certain level of scale and you see that, whoa, now, once we've hit such a large scale, we're not able to properly or quickly take advantage and gain insights from the data that we've got. Is that right?

Chad Sanderson [00:06:56]: I think that's certainly correct, and I think it manifests in other problems. It manifests in problems around ownership. You know, it manifests in problems around, like, there's a bunch of questions that are important for data teams to be able to answer that are just either not possible or it takes a lot of work. So a very common use case that I've seen happen is actually on, like, the customer service side, or, like, data science for customer service or operations, where, you know, say you have some support ticket or you have, like, an email thread between, like, an operations person and a customer, and you're sort of going back and forth, and you want to understand if this particular thread or the service ticket or whatever it might be had some real-world impact on that customer. Like, did they buy the thing because their problem was resolved, or did they not buy the thing? Like, what happened? If there's not a great data model in place to allow you to answer that question, it's really difficult to do. You end up having to do a whole bunch of fuzzy matching and things like that, a whole bunch of case statements. Well, if this thing happened and this thing happened and, you know, we see that, like, this particular shipment that we're kind of guessing might be related to the thing that we're talking about. You see a lot of that and, frankly, it's just inaccurate. It's wrong.

Chad Sanderson [00:08:18]: If you want to start automating those processes, it's really hard because you're going to be wrong like 50% of the time or more. It's really hard to build machine learning on top of that type of stuff, and it's everywhere. And then the last piece, I would say, is that there's fundamentally some aspects of the business that people are not even aware of that they need to be aware of. The way I like to think about a company is as a network. You actually have a bunch of different entities. They're interacting with each other all over the place and through life cycles. And what happens oftentimes is that product teams focus on one particular piece of that lifecycle.

Chad Sanderson [00:08:59]: And if the modeling is not done pretty well, you lose out on a lot of the benefit of how does the entity that you care about interact with all of these other entities in your business? Like, we have machine learning teams that are focused on our pricing model, for example. And there's a lot of things that impact a pricing model, but when the modeling is not done, it becomes incredibly hard to use that data for training and so on and so forth. So I think there's a lot of benefits to data science to having modeling done well, and some of them are not completely apparent.

Josh Wills [00:09:36]: Chad. Hey, I'm just going to hop in here. How you doing, guys? How's it going, y'all? You and I talked a long time ago. I think one of our mutual VC friends introduced us. I don't even remember why we were talking, but in that conversation, you said something to me that has stuck with me, and I've quoted it to other people. Chad Sanderson, head of data at Convoy. You said something, which was the idea that the modern cloud data warehouses have made it so cheap for me to just ask whatever question I want, relative to the old way of Teradata and OLAP cubes and all that kind of stuff, that it costs less for me to formulate and execute the query that I want to run than it does for me to invest in the cognitive overhead required in understanding someone else's query, someone else's business question and understanding.

Josh Wills [00:10:34]: Okay, what are the reusable components of that that I can then extract into my. You know what I mean? Just to answer my question, and share that kind of stuff. And that was deeply and profoundly insightful to me, because, again, I'm super old, right? And I've been doing this stuff for a long time. Reading the data warehouse modeling toolkit back when I started right out of college in 2001, was a formative experience in my life. And one of the true, like, nerdy. I don't know. Again, this is like. I don't know.

Josh Wills [00:11:07]: There was one day where I was working at Google where I met Dustin Hoffman and Jeff Dean on the same day, and I was really excited to meet Jeff Dean, while most people would be excited to meet Dustin Hoffman, right? But meeting Ralph Kimball, who wrote the data warehouse modeling toolkit, I had him autograph my copy and stuff. I was just such a nerd about stuff. Yeah. This is sort of the core of the problem, I think, at least as you framed it. It is now so cheap. I mean, I had a tweet about this once, which was like, Snowflake's genius is that they have invented a business model where anyone can just throw money at a problem, right? I want to answer my question. I want it faster, and I'm just going to throw money at this.

Josh Wills [00:11:50]: I'm just going to spin up the warehouse, make it bigger. Answer my question. And what a license to print money that is. And that being like the core of our problem, right, is that we've moved fast, we have infinite scalability, infinite flexibility. And in getting so excited about these things, we have thrown out all of the things that were great about the old, again, to be fair, heavily resource-constrained single Teradata box. It's got 30 terabytes. That's all you have.

Josh Wills [00:12:19]: You gotta be careful with it. You gotta schedule it, you gotta plan. We've thrown out the baby with the bathwater here, essentially. I guess my question for you, man, is, how are we going to put this genie back in the bottle? How do we start to unwind? I think we're all coming down from the sugar rush or the cocaine high, whatever your preferred analogy is here. We're like, okay, this has clearly gone too far. We've all lost our minds. Whatever your drug of choice is. Exactly.

Chad Sanderson [00:12:47]: Precisely.

Josh Wills [00:12:48]: Apple juice, whatever. To each their own. What do we do now? How do we start to fix this?

Chad Sanderson [00:12:57]: Yeah, yeah, yeah. I think that's exactly the right question. And one of the comparisons I think about is actually GitHub. When people talk about GitHub, I mean, nowadays, it doesn't have a huge number of competitors, but, you know, there used to be a lot of people doing source control, and GitHub sort of emerged because it did something else really well, which is, it actually allowed people to be agile. And all the principles of agile were like, you know, you've got this idea of creating new branches and reviewing those branches. And, like, you could do really great code reviews and make sure the right people are sort of in the loop.

Chad Sanderson [00:13:52]: And you can physically see what's changed. And you have alerts and monitors. You basically have all this really great infrastructure on how you move quickly, safely. And it sort of reminds me of where we are in data right now. What the cloud unlocked is our ability to move really fast. To your point, a little too fast. There's a cost to that. Yes, right.

Chad Sanderson [00:14:20]: There's a cost to that. And it seems like the next step is to kind of regress back a little bit. Like, to use your phrase, we need to put the genie back in the bottle a little bit, but we still want to move fast. We just need to have the safety guardrails in place that allow us not necessarily to have an extraordinarily well-modeled system from day one, because I think that's just impossible given how startups are formed these days. Like, you hire six engineers, zero data people, and maybe a product manager and a designer. So it's just not feasible to actually have a great data model from day one. But what would be great is if it was very simple, as these scalability problems start to emerge, to modify your data model easily and quickly and do it in an iterative way.

Chad Sanderson [00:15:12]: And the more your business scales and the more senior data people come on, instead of flooding them with service tickets, they take this GitHub-like approach where they're like, hey, there's some refactors that need to happen. Here's what needs to happen. I can do it quickly. I can do it in my incremental part of the model. Over time, we can figure out, like, how do these various business entities relate to each other, and it improves incrementally. I think that's the type of thing that has to happen. At Convoy, we've been trying to do this in a pretty manual way so far through the idea of, like, contracts, data contracts, which, I mean, you wrote a post about this, I'd love to talk about it more, but like, yeah, for sure. Contracts with this sort of abstraction on top of it of what I've been calling data design, or like data UX, which is literally like a design surface for your data model that people can add to collaboratively.

Chad Sanderson [00:16:13]: Like, okay, I've got an entity called a shipment, or I've got an entity called a shipper. I've got these, like, twelve properties, and I actually need a 13th foreign key so that I can answer some question. And I should be able to add that really fast. Someone should review it, that should be enforced through a contract, it flows into the warehouse, and then I can really easily and quickly answer my business question, and it's all safe. And I think all the tools to do that stuff exist today. Like, it doesn't actually need any net new advanced technology or anything. All the stuff is there, it's just applying it.
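
The workflow described here (add one property, have it reviewed, enforce it through a contract) could look roughly like the Python sketch below. It is a hedged illustration, not Convoy's implementation: the contract is modeled as a plain dictionary of field names to types, and `add_field`, `violations`, and the `carrier_id` field are hypothetical names invented for the example.

```python
# Hypothetical contract for a "shipment" entity: field name -> expected type.
SHIPMENT_CONTRACT_V1 = {
    "shipment_id": str,
    "shipper_id": str,
    "status": str,
}


def add_field(contract, name, typ):
    """Return a new contract version with one extra field (additive change only)."""
    if name in contract:
        raise ValueError(f"{name} already exists in the contract")
    return {**contract, name: typ}


def violations(record, contract):
    """List the ways a record breaks the contract; an empty list means it passes."""
    problems = []
    for name, typ in contract.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], typ):
            problems.append(f"wrong type for {name}: expected {typ.__name__}")
    return problems


# The "13th foreign key" from the discussion, shrunk to a 4th field for brevity.
SHIPMENT_CONTRACT_V2 = add_field(SHIPMENT_CONTRACT_V1, "carrier_id", str)

old_record = {"shipment_id": "s-1", "shipper_id": "p-9", "status": "booked"}
new_record = {**old_record, "carrier_id": "c-3"}

print(violations(old_record, SHIPMENT_CONTRACT_V2))  # the new key is missing
print(violations(new_record, SHIPMENT_CONTRACT_V2))  # passes
```

In practice the `add_field` step would be the reviewed change (for example a pull request against a contract file), and `violations` would run wherever the data is produced or loaded.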

Josh Wills [00:16:44]: Process workflow, like all that kind of good stuff. Yeah, yeah, for sure. Say I was like, I had questions and thoughts and they all just got away from me, like all of a sudden.

Demetrios [00:16:54]: As it happens, that happens to me all the time.

Josh Wills [00:16:57]: Don't worry about that. It's just the story of being a human, you know, it's just the worst, right? So I guess, Chad, what I hear you saying is we don't need to throw the baby out with the bathwater. Like, obviously no one is going back to the old days of whatever, right? So what you're saying is we can layer on contracts. Contract is a fun word for me. Contracts is a new term for me, because when you say contract, I hear in my head "schema," because I live in the land of protocol buffers and APIs and stuff like that. But obviously schema is an overloaded term in the data warehouse world.

Josh Wills [00:17:36]: If you say schema to people, they're going to be like, oh, you mean schema, a namespace in a database? I'm like, no, that's actually not what I mean. I love contract as a word for this, just to be clear and disambiguate that we're not talking about a literal database schema. You're saying this is a thing we can layer on to our existing data warehousing system. It's not a thing where you need to go, okay, we screwed up. Because it's just a very human instinct, and you see this with the dbt people, who are currently completely rewriting their IDE from scratch: let's just throw this out and start over. You know what I mean? I get that. I empathize with that, but I just feel like, in my career, I have learned that that is, 99% of the time, the wrong thing to do, you know? So is that what you're saying? Like, this is something we can, you know, gently introduce?

Chad Sanderson [00:18:25]: Basically, exactly. I think that pretty much every business I've talked to about this, Convoy included, usually has some set of use cases where contracts and sort of enforced schemas are incredibly meaningful. In our case, it was our pricing model. Right. We had the pricing model, one of the most important models in Convoy's business. It's responsible for deciding the price of our auctions.

Chad Sanderson [00:18:58]: Essentially, when we stop an auction. And the data was not good. We were dropping, I think, 10% of the rows that we were feeding into our training set. And that was because there was a tremendous amount of transformations. There was a lot of stuff that was just randomly happening, like, columns that were being dropped upstream, column names that were getting changed. Like, the engineers weren't really aware of what was going on downstream. And the reason that was happening is because we took this approach of, okay, like, software engineers, you guys don't really need to think about the data that you're emitting. We're just going to build a pipeline that pulls everything that you're capturing and dumps it into our storage environment, and then we're going to build all of our training sets on top of that. And it turns out that's actually not a great idea.

Chad Sanderson [00:19:47]: But if we can say, hey, what if that didn't happen? What if we could make it so that this data was actually incredibly trustworthy, that there was a clear owner to it, it solved a very discrete problem. And the actual consumers of the data, the data scientists, they understand the data that they need. They're like, I know what I need to power this model. I just need someone to implement it for me in the way that I want, and I need it to be trustworthy, and I need it to arrive on time. And that, to me, is the purpose that the contract serves. It does all of that through schema, and you can introduce this in a totally iterative way, one contract at a time: solve a business problem, get the value, and move on to the next one.
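
The failure mode described above, columns dropped or renamed upstream without downstream consumers knowing, is the kind of thing a contract check could surface before rows go missing from a training set. A minimal Python sketch follows, assuming both the contract and the producer's current output are available as column-name-to-type mappings; `breaking_changes` and the auction column names are hypothetical.

```python
def breaking_changes(contract_columns, producer_columns):
    """Compare the agreed columns with what the producer currently emits.

    Both arguments map column name -> type name (plain strings here).
    """
    problems = []
    for name, typ in contract_columns.items():
        if name not in producer_columns:
            # A rename looks identical to a drop from the consumer's side.
            problems.append(f"dropped or renamed: {name}")
        elif producer_columns[name] != typ:
            problems.append(
                f"type changed: {name} {typ} -> {producer_columns[name]}"
            )
    return problems


# Hypothetical contract for the data feeding a pricing model.
contract = {"auction_id": "string", "bid_amount": "double", "closed_at": "timestamp"}

# Upstream renamed closed_at and changed bid_amount's type.
producer = {"auction_id": "string", "bid_amount": "string", "end_time": "timestamp"}

for problem in breaking_changes(contract, producer):
    print(problem)
```

Extra columns on the producer side are deliberately ignored here, on the assumption that additive changes do not break consumers; a stricter contract could flag them too.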

Josh Wills [00:20:37]: Totally. Totally. I love that. This is an interesting, I want to call it a disconnect, Chad, I think, between us, that I think about a lot, and I think about a lot through my course as well. Which is that I guess I learned back in my Google days. I was hired at Google way back in 2008 as a statistician.

Josh Wills [00:20:57]: But as soon as I got there, all I did was write code because basically no one stopped me, more or less. Which would be like, I wanted to analyze some data about how an ad auction worked. I went into the ad server and wrote a code reverse, and I added the data that I wanted in the structure that I wanted it to do my downstream analysis, and sent it out for code review and had it reviewed with the logs team, because at the time, if you added a single int field to the logs, that was an extra terabyte of data every day, which was just great, and got it approved, and my logs started flowing and I got my data. I have, for better or worse, taken this attitude of, Game of Thrones style, "I will take what is mine with fire and blood" kind of thing to most of my career. If you're not going to give me the data I need, I'm going to go into your system and I'm just going to go get it. While I'm there, I'm probably going to fix a few bugs I find, because I'm, to be perfectly honest with you, like, much better at engineering than you are, and that's fine. That's the way that it works. And in small companies, that's kind of the way, because, you know, that's just the way it's done.

Josh Wills [00:22:15]: Right. But as companies get larger, like, I don't scale, you know what I mean? And, you know, God willing, I shouldn't be allowed to scale. I guess basically you need a process, you need contracts, you need workflows to handle this kind of stuff, and I guess I have never successfully made that transition in my career. For better or worse, I am very much myself. So I don't know, how would something like this work in a big company where there are too many people and you can't just go cowboy this kind of stuff? That's the part that's most fascinating to me about what you're talking about.

Chad Sanderson [00:22:55]: Yeah, yeah. I mean, ah, man, I would love it if more...

Josh Wills [00:23:03]: You would love me at Convoy. I would be like your ninja.

Chad Sanderson [00:23:06]: Amazing. You would be amazing at Convoy. It's so funny that you say that, because that has traditionally been my experience as well. There's just such a difference between companies of different scales, and it's something I didn't even realize until I joined Microsoft and then joined Convoy after that. In some places, there's almost like a Stockholm syndrome type thing going on with the data.

Josh Wills [00:23:39]: Exactly.

Chad Sanderson [00:23:40]: They're like, I'm not allowed to go up and touch this database or whatever. And so I just need to accept everything.

Josh Wills [00:23:50]: And I call it the Oliver Twist thing. It's like the "please, sir, may I have some more" kind of thing. Like, "please, please, data producer, please let me have the data." And I'm just like, man... Demetrios, can I swear on here? I've been swearing for a while, haven't I? Is that okay? Okay. So I'm like, fuck that shit. Fuck that. I am going to go in and get it. I don't care. This is how I was taught.

Josh Wills [00:24:14]: This is how it was done. I don't know. I want this to happen on my timeline. Not your timeline anyway.

Chad Sanderson [00:24:21]: And honestly, I think that is so the thing that I have felt sort of as a data platform lead is that what's really missing is the interaction between the producers and the consumers. And if the consumer is just going to go up into the warehouse, they're going to go up into production and make some change. And the producer, like, doesn't like that. Great. Like, that's now a conversation that can happen and we can talk about why, like, this data is important to me. You aren't like, you. You aren't getting it to me the way that I wanted, so I did it. So if you don't like it, we need to talk about our options here because I need to run my model and, you know, whatever.

Chad Sanderson [00:25:00]: But what I find is that, you know, I don't know if it's almost like a meekness or sort of an acceptance of like, well, you know, I can't get the stuff that I need. I ask people for this, they don't give it to me. And so instead I'm going to spend a month, like, you know, writing SQL in Snowflake.

Josh Wills [00:25:15]: Or doing some crazy analysis instead of just going and getting it.

Chad Sanderson [00:25:20]: Exactly, exactly.

Josh Wills [00:25:21]: I'm going to build a machine learning model to impute the missing features that I need to train. Like, it's absurd. Just go get the goddamn data. Like, what is wrong with you? Right? Yeah, but I think what I feel like we're saying here, and I think this is important and this is the part I lack, right, is like, you basically need a carrot-and-stick model.

Chad Sanderson [00:25:38]: Yes.

Josh Wills [00:25:39]: Right. Like, I'm the stick and you're the carrot. Because I'm just saying, like, if it's just all carrots, it's not going to work. Right. But if it's all sticks, it's not going to work either. Because, I mean, again, inevitably I get fired because, you know, someone takes me out. Like, I, you know, it's, every ninja goes down eventually. No one, no one survives forever.

Josh Wills [00:25:56]: Right. So, I mean, that's, that's the thing is you need that sort of pairing, basically. You need to figure out, like, what is your carrot stick strategy for moving this stuff forward and making this happen?

Chad Sanderson [00:26:06]: Yeah, that's fine.

Josh Wills [00:26:07]: That's useful.

Chad Sanderson [00:26:07]: You're right. And I think, I think that's, so here's sort of an interesting thought that I've been having, and I'm still trying to express this in a way that's, like, not confusing. And people don't yell at me because I get yelled at a lot on LinkedIn is.

Josh Wills [00:26:21]: I love it, man. You do, and I thoroughly enjoy it. Like, I love it, your posts are the best. The comments are fantastic. Anyway. Okay. Yes, please.

Chad Sanderson [00:26:29]: But so one of the things, so I sort of, I, I've been thinking a lot about this, like, GitHub comparison. And I think at first, when I was approaching this problem at convoy, I was approaching it basically the way that you were saying, which is a data scientist or someone would come to me and they'd say, hey, what's going on with this quality? Well, I don't understand why it's so bad. I was like, well, you're a software engineer and you care about the quality, so you, it should be on you. Go do it. Go get it done. But there was like a real hesitancy. There was a real hesitancy to do that. And I think a lot of it has just come from organizational design.

Chad Sanderson [00:27:15]: A lot of it has come from sort of this clear separation-of-concerns culture, like, mounting expectations of, like, you know, the business doesn't care if you put a lot of time into data quality and sort of data governance and things like that. We just want the model, and we just want you to give us a report. And so I need to, like, maximize my time as a data person to deliver those things. And now if I have to add sort of quality and ownership of some upstream, like, production thing into my workload, that's going to be a net negative for me. So I need to find someone else to do this, and the software engineers aren't listening. So a lot of the time, like, those things get redirected to the data engineering team. Like, hey, the pipeline broke. Go fix it, data engineers.

Chad Sanderson [00:27:56]: I can't yell at the software engineer, so I'm going to yell at you instead, because you have to listen to me.

Josh Wills [00:28:01]: That is so true. Go ahead.

Chad Sanderson [00:28:03]: Oh, yeah, yeah, yeah. But what I was gonna say is, what that's kind of led me to feel is that this sort of GitHub-based approach doesn't actually work for data. If the consumer cares about the code and they care about the quality, then that's great, and then it does work. But if you're living in a world where they don't care, and they're basically just saying, I just want you to give me this particular data asset when I need it, in this form, at this time, then code may actually not be the right level of abstraction for that. We're moving to a world at Convoy where what we're trying to do is abstract the code and instead focus on the semantics and, like, the data itself, wherein someone might be able to say, for instance, there's a real-world thing that happens. Let's say a shipment has been canceled. And I want to know every time the shipment has been canceled, and I want to know that in real time.

Chad Sanderson [00:29:15]: And here's what a cancellation means to me. And there's like six different places. There's like, in a lifecycle, there's six different places where that cancellation could occur. I want to know all of them. And here's, like, the properties about those cancellations that are, that are meaningful to me, all of that can actually be described using English and semantics.

Josh Wills [00:29:36]: Yeah, of course.

Chad Sanderson [00:29:38]: And if you're able to capture that as schema, and then either an engineer or a data engineer or whoever else, you have some agreement, and that's something we can talk about. But if there's some agreement to actually implement this as a contract, then once that data flows into your lake or your lakehouse or your warehouse or whatever it is, you can essentially map the semantic meaning to the actual code in the warehouse, thereby abstracting the code. Now, if I want to make a change, I can make a semantic change. The engineer understands how to do that from the schema sort of code perspective, and I can go and train my model without even needing to think about, like, oh, okay, I need to go into production and write something.
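
As a rough illustration of the idea Chad describes, a "shipment canceled" semantic event can be written down as a schema that producer and consumer agree on. The sketch below is Python; the class name, field names, and lifecycle stage names are hypothetical placeholders (the conversation only says there are six places a cancellation can occur), not Convoy's actual contract.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class LifecycleStage(Enum):
    # Hypothetical names for the six places a cancellation can occur.
    QUOTED = "quoted"
    TENDERED = "tendered"
    ACCEPTED = "accepted"
    DISPATCHED = "dispatched"
    AT_PICKUP = "at_pickup"
    IN_TRANSIT = "in_transit"


@dataclass(frozen=True)
class ShipmentCanceled:
    """A semantic event: something that happened in the real world."""
    shipment_id: str
    canceled_at: datetime
    stage: LifecycleStage
    canceled_by: str  # hypothetical property, e.g. "shipper" or "carrier"
    reason: str

    def __post_init__(self):
        # The contract is enforced where the event is created.
        if not self.shipment_id:
            raise ValueError("shipment_id must be non-empty")
        if not isinstance(self.stage, LifecycleStage):
            raise TypeError("stage must be a LifecycleStage")
```

Because the meaning lives in the schema, a consumer can ask for a new property as a change to this definition without touching the production code path directly.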

Josh Wills [00:30:22]: Dude, you said so much. And you have to forgive me, I'm going to comment on like three different things you said because they're each, like, you did. No, no, it was great. No, it's fantastic. Because I just, it's basically like, if I could just take that, your little, like the speech you just gave there, and just turn that into like an ad for my data engineering for ML course. Like, that would be like ideal. Like, just because one, you clip it all the time.

Demetrios [00:30:41]: We'll make a clip, don't worry.

Josh Wills [00:30:43]: What's up, bud?

Demetrios [00:30:45]: We'll make a clip.

Josh Wills [00:30:46]: Thank you. Thank you. Spear will be grateful. First thing you said, which I think, which is really like, the reason I want to teach a course, is because there are some things about the reality of doing data that cannot be talked about in podcasts and cannot be written about in blog posts, which is the implied social dynamics and cultural hierarchy of various engineering teams at various companies. This is a very real thing. Anyone who does this stuff for real can talk about this, but we never talk about it. We never talk because it's very touchy, it's very sensitive. It's not something that can be discussed out in the open.

Josh Wills [00:31:29]: I don't say you need a group therapy session or something. And that's to a certain extent what my course is. It's a place that's private, where the entrants are screened for practitioners, where we can have these conversations about these things and how do we deal with this. The second one I want to say, and again, this is just, again, advertisement: the two topics for the first session of my course are, one, data collection, obviously, because it's so foundational. The second one is data quality. And I think, like a lot of people, I have very much had the challenge of how do I sell data quality? Like as a thing, as a thing we invest in, as a thing we put money into. And I think what was funny to me was I've been reading up on how different companies handle this, and I love, love, love that Google wrote this wonderful paper about their data validation system.

Josh Wills [00:32:24]: I don't know if you've come across this, it's actually largely open source. It's in TFX, the TensorFlow Extended system, the data validation library they wrote. And basically, again, talking to my friends at Google, it has taken over at Google. Data validation, data quality with clearly defined schemas, specifications for what values are allowed for different fields, with what tolerances, is everywhere. It's everywhere. They use it for everything, all kinds of stuff. It's completely taken over there because the data quality stuff really does three different things. It's documentation. What does this model expect? What inputs does it need to see for it to work correctly? So it's documentation for new engineers exploring new models.

Josh Wills [00:33:11]: Two, it's actually like they use it for testing. They will use the schema and the specification, the contract, if you will, to auto-generate training examples that are designed to push the limits of the schema. Let's intentionally throw some weird data into this model and see what happens and see what breaks. And it catches bugs. So it's like fuzzing, effectively. Right. And the third thing, and this is the most important thing, and to me, this is what separates, like, people who can do ML from people who can't. Google has learned through trial and error and lots of years of mistakes that there is free money in doing a better job of constructing your training data sets by using data quality checks to basically test extreme values, rebalance your classes.

Josh Wills [00:33:59]: There's free money in there. You don't need a better architecture, you don't need new features, you don't need anything. You can do a better job of constructing your data set using these data quality metrics as a guideline and get like a 3 or 4% lift. Like, you'd be insane not to do that. It's like literally free money. Anyway, I'm sorry, I'll shut up about this, but like, this is the stuff I want to talk about because it's not getting talked about, and this is my opportunity to try to fix that.
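
To make the first two uses Josh lists concrete, here is a minimal Python sketch of a schema that states allowed values and tolerances, a validator that reports anomalies against it, and a generator of boundary rows for fuzzing-style tests. It deliberately does not use the TFX data validation API; the schema layout, field names, and thresholds are assumptions made up for illustration.

```python
# Hypothetical schema: per-field type, allowed range or values, and a
# tolerance on the fraction of rows where the field may be missing.
SCHEMA = {
    "distance_miles": {"type": float, "min": 0.0, "max": 3500.0, "max_missing": 0.0},
    "equipment": {"type": str, "allowed": {"dry_van", "reefer", "flatbed"}, "max_missing": 0.05},
}


def validate(rows, schema):
    """Return a list of human-readable anomalies found in `rows`."""
    anomalies = []
    for field, spec in schema.items():
        values = [row.get(field) for row in rows]
        missing = sum(v is None for v in values)
        if rows and missing / len(rows) > spec["max_missing"]:
            anomalies.append(f"{field}: {missing}/{len(rows)} missing exceeds tolerance")
        for v in values:
            if v is None:
                continue
            if not isinstance(v, spec["type"]):
                anomalies.append(f"{field}: unexpected type {type(v).__name__}")
            elif "allowed" in spec and v not in spec["allowed"]:
                anomalies.append(f"{field}: value {v!r} not allowed")
            elif "min" in spec and not (spec["min"] <= v <= spec["max"]):
                anomalies.append(f"{field}: value {v!r} out of range")
    return anomalies


def boundary_examples(schema):
    """Build one row at the low edge and one at the high edge of the schema."""
    examples = []
    for pick in (min, max):
        row = {}
        for field, spec in schema.items():
            if "allowed" in spec:
                row[field] = pick(sorted(spec["allowed"]))
            else:
                row[field] = pick(spec["min"], spec["max"])
        examples.append(row)
    return examples
```

The same `SCHEMA` dictionary serves as documentation of what a model expects, as the input to `validate`, and as the source of the edge-case rows from `boundary_examples`.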

Josh Wills [00:34:24]: Yeah. Anyway, I'm sorry. Oh, yeah.

Demetrios [00:34:27]: There is something that I wanted to. I wanted to touch on, which I think is a bit tangential to this, but stay with me and I'll bring it back around. And it is going back to this point where you've got some data that maybe has been. It's been tainted along the way, along its life cycle, along its journey, and that's because somebody upstream did something with it. How do you make sure, or how can you get someone, anytime that they touch data or they mess around with something in the database, they know everyone that it's going to affect, like those second and third order effects.

Josh Wills [00:35:08]: Totally.

Chad Sanderson [00:35:10]: Yeah. Yeah. So, okay, so there's a few things there that I think are really interesting to talk about. I think the core of your question, the core of the answer is lineage. The problem with most lineage systems today is that it's just metadata collected off of the warehouse. Like, you pull out, say, Snowflake metadata, and you can build a lineage system based on that and you can see how various tables are connected. But to your point, there's nothing that actually ties all the way back to the services themselves. And if you had that, if you went that one additional step, it actually becomes an order of magnitude more valuable for like three different reasons.

Chad Sanderson [00:35:47]: Like, one is sort of what you said, which is if you're a data producer today, data producers have literally no idea what's going to happen if they do anything. They drop a column. They're like, I don't know. Who knows? But if you did have some sort of lineage component, and I could go quite deep with this, actually, but if you sort of combine this idea of contracts and lineage together, then the data producer could essentially know: if I am going to drop a property or change a name, I can go out, check my lineage, see who are all the consumers of this, and go all the way downstream, and then be told, listen, if you make this change and you do it in a non-version-controlled way, you will break our pricing model. You will break the training set for our pricing model. Just having that level, being told that you're going to break something before you break it, I think, would fundamentally change how software engineers interact with data. It's not that like producers don't care.

Chad Sanderson [00:36:59]: Like they would just break the pricing model even if they knew that it would happen. They just don't know what's going to happen. And how could they? They're not really brought into that part of the data life cycle at all. So, yeah, I think that would be really important. The other thing, there's not really the idea of a backwards incompatible change when you're managing database changes. There's the idea of that change within the service, but not within what's going to happen to all the consumers of this, which is another area where I think data contracts are really amazing, because you could essentially say, hey, we understand what this schema is supposed to be. There's a contract here, and the change that you're going to make is going to be backwards incompatible.

Chad Sanderson [00:37:44]: Are you sure you want to do that? And if you are going to have to make an incompatible change, the best thing to do is to at least tell people what's coming and be able to communicate to all the consumers downstream. So that would be, I think, massively beneficial.
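
The lineage lookup Chad describes, where a producer checks who is downstream before dropping a property, amounts to a graph walk. The following Python sketch assumes a lineage map that already reaches back to the producing service; the asset names are invented for illustration.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
LINEAGE = {
    "shipments_service.shipments": ["warehouse.raw_shipments"],
    "warehouse.raw_shipments": ["warehouse.shipment_facts"],
    "warehouse.shipment_facts": ["ml.pricing_training_set", "bi.ops_dashboard"],
    "ml.pricing_training_set": ["ml.pricing_model"],
}


def downstream(asset, lineage):
    """Breadth-first walk returning every asset reachable from `asset`."""
    seen, order, queue = set(), [], deque([asset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Calling `downstream("shipments_service.shipments", LINEAGE)` lists everything a change to the production table could affect, including the pricing training set and the pricing model, which is the warning a producer would want before making the change.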

Josh Wills [00:38:02]: I'd say a couple of things there just sort of riffing on what Chad was saying. One is that at Slack, first and foremost, we relied primarily on our structured logs. We had Thrift schemas for our structured logs designed to go in the data warehouse. And that was obviously integrated with code review, integrated with CI/CD checks, such that if you introduced a backwards incompatible change to the logs, the CI/CD check would fail. There goes my dog barking. That's good. She probably just wants to get up. I'll go get her.

Josh Wills [00:38:33]: The CI/CD check would fail. And exactly that. You could, again, it could be overridden, but you had to have a conversation with people before you did it, that kind of thing. Two is like, it's certainly the case that people unintentionally, through ignorance or whatever, do these changes and stuff. But in my experience, the bigger problem is really just honestly, bugs, more or less. No one meant to break the thing. They didn't mean to break the pricing model, but they introduced a bug. Therefore this RPC call

Josh Wills [00:39:06]: that was actually critical stopped working silently and no one noticed and stuff like that. I just want to call out that, in my experience, schemas and contracts upfront remove just a huge class of the knucklehead stuff, like the unforced errors, which is key. That doesn't mean that you don't need data quality and data validation stuff downstream, because again, bugs happen. Nobody's perfect. This is normal. But again, we have to be defensive against it because especially with ML, it's just way too easy, Chad, as I'm sure you know, for this stuff to just slip through and no one notices until you end up with a model that is just, you know, terrible. Anyway. So.
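
The CI/CD check Josh mentions, one that fails on a backwards incompatible change unless someone deliberately overrides it, can be sketched in a few lines of Python. This is not Slack's tooling and it does not model Thrift's real compatibility rules (which depend on numeric field IDs and requiredness); it assumes a simplified rule where removing or retyping a field is breaking and adding a field is not. The function names and the `override` flag are hypothetical.

```python
def breaking_changes(old, new):
    """Compare two {field: type_name} schemas; list changes that break readers."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"type changed: {field} {old_type} -> {new[field]}")
    return problems


def ci_exit_code(old, new, override=False):
    """Return 0 to pass the build, 1 to fail it.

    `override` stands in for the explicit sign-off that should only be
    given after talking to the downstream consumers.
    """
    problems = breaking_changes(old, new)
    for problem in problems:
        print(problem)
    return 0 if (not problems or override) else 1
```

A check like this catches the unforced errors at review time; it does nothing about bugs that keep the schema intact but change the data, which is why downstream validation is still needed.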

Josh Wills [00:39:48]: Yeah.

Chad Sanderson [00:39:51]: Yeah.

Josh Wills [00:39:51]: Do you call it the dangers?

Chad Sanderson [00:39:52]: Yeah, I think that's a good point. And, you know, to sort of continue on that, I do find that in the data community and even in the data science community, again, I use the term Stockholm syndrome, where there is this expectation that, like, we need to put all of the quality checks downstream. And like you said, there obviously does need to be, like, you don't want to just leave your fate up to the hands of the gods and hope for the best. But at the same time, having the people who are producing the data also having skin in the game when it comes to quality is incredibly meaningful. I think one of the questions, and, Josh, I'm actually interested to hear

Josh Wills [00:40:45]: Oh, man. I was just about to ask you a question. You can't do this to me. Come on, I'm here.

Chad Sanderson [00:40:52]: how you handle this at Slack, is the biggest feedback I get from people when I talk about these things. And even earlier at Convoy when I was pushing a lot of these ideas, it was, like, well, you know, the engineers just don't care. Right. It's just not like the engineers don't understand that this is something that they need to be able to do. It's like an additional category of work that they need to take on. It doesn't actually benefit their service in any way to do all these things. So, like, how is that something that, you know, just adding these quality checks at Slack? Like, how did you overcome that? And sort of, how do you think about it more broadly?

Josh Wills [00:41:31]: That's just a perfect question, man. It's a perfect question. And actually, it's funny because it gets exactly to what I was going to ask you, which my answer, and I'm going to answer your question, and I'm going to ask basically my question right after that. Back to you. Okay. Right. So the answer is that we hired engineers who cared about that data. And basically, you know what their title was? Their title was machine learning engineer.

Josh Wills [00:41:56]: That was their title because the machine learning engineer, unlike literally every other engineer of the company, is basically the person who is going to both generate the data upstream in the production service and consume it and process it downstream in the data warehouse. They actually have, to your point, skin in the game and a stake in the fact that the data they get for their model to work and for their life to be good and happy, these things need to work. They're incentivized in this way. Okay? So ML engineering, at least at Slack, was the first best customer for all of this stuff. They wanted the schemas. They insisted on the schemas, right? They needed the schemas. It solved their problem because they, again, were the only person who lived on both sides of the fence here. Now, once we had built the minimal tooling, again contracts, you know, all the downstream data processing, all data quality checking, all that kind of stuff for those engineers.

Josh Wills [00:42:58]: Then it started to grow. It started to expand. And then, like, the growth engineers. So, I mean, I would say to me, like, you know, broadly speaking, in any company, there are only like three. Like, there's like four use cases for data, and three of them are things that engineers care about, right? One was obviously ML, obviously. Second is growth, like product-led growth in particular, Slack being like, obviously, the canonical product-led growth, the original product-led growth company, right? So then the growth engineers started glomming onto this and, like, working with their data scientists. And what was great about the growth engineers: for the ML engineers, again, love ML engineers, we built the most minimal sort of shitty tooling possible, just the minimal acceptable tooling. The growth engineers come along and they've got their front-end chops and their full stack, and they start using our tools and they're like, wow, this is garbage.

Josh Wills [00:43:45]: We're going to make this much better. So they start building the tools and adjusting them for their needs. Then the last one was really the performance engineers. This was my favorite one. When we started getting very serious about performance, because performance and cost, cloud spend, all that stuff, the performance engineers started coming in, and then they started using structured logs to capture weird database query edge cases and teams that were having cache misses way too often. They started getting serious about this stuff. It all spiraled out from there. But I think it brings us back to my question to you.

Josh Wills [00:44:18]: If I don't have those engineers who were designed to cut across the data warehouse, I don't know how to get started, my man. I don't. I don't know how to operate in a world where you're exactly right, because, like, the regular engineers don't care. They absolutely. It's, it's. And again, I've been a regular engineer. I empathize. They have to worry about security.

Josh Wills [00:44:36]: They have to worry about, you know, obviously performance. They have to worry about functionality. The product manager wants this. Like, they have to worry about 800 million things. And so I feel terrible piling on. Another thing they have to worry about, especially if, like, the tooling is immature and stuff like that. Right. So, dude, that is a problem I don't know how to solve.

Josh Wills [00:44:52]: To me, you solve the problem by hiring ML engineers and making them the first best customer for the stuff you're building, because they do care. It's in their interests.

Chad Sanderson [00:45:01]: Anyway, I totally agree with you. So I can tell you one of the ways that I've thought about this at Convoy, and, please, a bit of it is somewhat authoritarian, I guess. So who knows?

Josh Wills [00:45:19]: That's great. I love. I'm the stick man. I'm with you. Let's do it. Carrot and stick. Let's bring some stick. What do you got?

Chad Sanderson [00:45:25]: Yeah. Yeah. So I've basically found success in two main ways. The first way is actually with the underwater data engineering team, where the data engineering team right now, a lot of them are software engineers. They're building these pipelines. The stuff breaks for reasons outside their control. And they are essentially caught in between the pincer of the software engineers who are owning these upstream things and the downstream folks that are just causing them all sorts of pain. That was actually where a lot of these ideas originated at Convoy: it was the data engineers who were like, can we just go and solve this problem ourselves? Like, can we just go in and very, very easily add the data that we need, add the data that this downstream team needs, build it as a contract, and then hand it off to that software engineering team and say, hey, if you make any additional changes to this, just understand that there is some process in place, there is some CI/CD workflow where if you make a backwards incompatible change, you're not going to be able to do that, or you have to alert people.

Chad Sanderson [00:46:43]: And generally speaking, like, that additional level of ownership, which is basically just owning an API, was not super painful for anybody.

Josh Wills [00:46:52]: Totally. Exactly right.

Chad Sanderson [00:46:55]: And so that was actually really helpful. The data engineering team loved that and there were a lot of use cases. I remember when we were trying to get this rolled out to certain software engineering groups sort of in the organic way that you were talking about. There was just sort of gradual adoption as people realized, like, oh, man, these contracts are awesome. And sometimes the software engineers, when they were first starting to adopt, were like, hey, we're going to budget a week or maybe a whole sprint for doing this work, even though it might just be adding one or two properties to an existing thing. And the data engineer said, yeah, I'm just going to go do it myself. I'm just going to go in, I can do this in ten minutes. And then it's done.

Chad Sanderson [00:47:31]: And that happened a lot.

Josh Wills [00:47:34]: Exactly. But then at the same time, dude, I see so many places where the data team and the prod team aren't even in the same AWS account. It's bizarre. Literally, physically, they're like, again, physically in the before times. Right? In the before times, they'd physically be sitting like 10 feet apart, but technology wise, they would be living in completely and utterly different universes. They're galaxies. Yeah, precisely. The production system and the data system, to your point, Chad, live in completely different worlds.

Josh Wills [00:48:05]: So I am curious for you, the first time this happened with the data engineers, how did that go? Who did they talk to? How did that work? Was there a meeting? Was it like just a Slack DM? How did that work?

Chad Sanderson [00:48:18]: Yeah, so the type of tooling that we built is probably similar to maybe what existed at Slack. So there's actually two things, and this is going to sort of help, then I can explain the process that happens sort of as a result of the tools. We essentially have sort of two things. We have one thing that's like an SDK. We don't use, like, Protobuf or Avro at Convoy, so we have like an internal sort of IDL that we built. It's like very similar.

Josh Wills [00:48:52]: Is it like JSON schema? Like, what is it? Is it.

Chad Sanderson [00:48:54]: Yep.

Josh Wills [00:48:54]: Yep. Okay.

Chad Sanderson [00:48:55]: Yep.

Josh Wills [00:48:55]: Nice.

Chad Sanderson [00:48:56]: Yep, it's JSON schema.

Josh Wills [00:48:57]: Right on.

Chad Sanderson [00:48:58]: Yep. And so we basically created a library, put CI/CD checks around it, that made it really, really simple to emit these things that we called events. But they could just be a database change, could be an event, theoretically. We were actually mainly focused on this other idea, which I can get to a little bit later, called semantic events, which are capturing real things that happened in the world. There's a reason that we focused on that. That is an answer to the question you just asked a second ago. Then the second piece that we built was actually an abstraction of the production table itself that lived in the stream processing layer. The idea was, if you're an engineer, you could basically wrap your production table in this abstraction.

Chad Sanderson [00:49:52]: You could say, I only want to push these seven or eight things that are within a contract into the warehouse in a structured format. Everything else I can just continue to dump and not think about. And there's some really interesting things that go along with that. So in that way, we were sort of capturing as events, as clearly defined schema with a contract, every single update to the important entities within our databases. And then we also had these backend events that were being emitted from the services by the software engineers or the data engineers, if it was simple. So the goal through those two pieces of technology was, we want to make this as unbelievably simple as possible for anybody to do who understands how to write JavaScript or TypeScript or something like that. It's just real easy. Then the pipeline is actually auto-generated.

Chad Sanderson [00:50:46]: We would say, okay, we would check the contract, we would create a new Kafka topic, push that event via Kafka into the warehouse. We'd automatically parse it like once it landed in Snowflake. And so you sort of, then you can actually, we built another thing that's like a catalog that sits on top of that, essentially links the contracts to the schema registry exactly to the underlying, like Snowflake table.
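
The table abstraction Chad describes, where only the contracted properties of a production row flow to the warehouse in a structured form, can be pictured as a projection applied to each row change. The Python sketch below is only an illustration under assumed names: `CONTRACT_FIELDS`, the row layout, and `to_contract_event` are hypothetical, and the Kafka and Snowflake steps are left out entirely.

```python
# Hypothetical contract: the handful of columns the producer has agreed
# to publish. Everything else on the row stays out of the contracted event.
CONTRACT_FIELDS = ("shipment_id", "status", "updated_at")


def to_contract_event(row_change, contract_fields=CONTRACT_FIELDS):
    """Project a raw row change down to the contracted fields.

    Raises KeyError if a contracted field is absent, so a violation
    surfaces at the producer rather than downstream in the warehouse.
    """
    missing = [f for f in contract_fields if f not in row_change]
    if missing:
        raise KeyError(f"row is missing contracted fields: {missing}")
    return {f: row_change[f] for f in contract_fields}
```

Columns outside the contract can keep changing freely, because consumers only ever see the projected event.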

Josh Wills [00:51:10]: Totally.

Chad Sanderson [00:51:10]: And so there's sort of a clear line of sight there. And so what happened after we sort of had these two pieces was that we had a data science team. Oh, and this is another piece of context: data engineering at Convoy is actually like a centralized organization. They're not embedded. They're actually technically part of the software engineering organization. They're not a part of the data organization. And so what happens is that the data science team would say, we've got some data that's breaking. They would sort of make their traditional complaint to data engineering.

Chad Sanderson [00:51:44]: Like, this is a problem. Can you guys go fix it? And what we did is we initially worked with them to say, what is the data that you actually care about and that you need? Do you know what service that this data is coming from? And if not, we can go figure that out for you. We went out and basically did that. We said, okay, we figured out that the data you need is coming from this service. And the way that we can express the data set you're asking for is actually through, like, maybe three or four real world semantic events. Like, you want to know every time that a new auction is completed, every time, like, a new auction has. There's like, a bid is made on that auction, and every time, like, some customer, I don't know, like, makes a bid, like, they press a button in the app to say, like, let's, you know, we want to have this auction, and then we need to add, like, three or four new properties on the actual auction entity. And with that combination of data, we can hand that off to a business intelligence engineer, and they're going to be able to derive whatever datasets that you need.

Chad Sanderson [00:52:47]: The data engineer went in, they implemented that in a day. The data flowed through in real time, and then the BI engineer was able to, since it was all properly modeled in advance, it took them two or three days to build the data set.

Josh Wills [00:53:01]: Easier to build the model, cheaper to build the model. So much cheaper to build the model. Right? Because, I mean, this is, like, one of my things I harp on, like, the cheapest time to do a join is the moment the data was created. By far. Everything after that, streaming, Snowflake, way more expensive. Way more expensive to do the join then. Do it when it's cheapest. Oh, my God.

Josh Wills [00:53:22]: If you can. Oh, I love that, man. That's fantastic. I love that. I mean, that, that. I mean, that data engineer you're talking about, whoever was doing that, I mean, that sounds like, that sounds like an ML engineer. I wouldn't, I wouldn't utter that person's name lest, you know, the recruiters, like, descend upon them after hearing. So just keep that one to yourself.

Chad Sanderson [00:53:39]: Yeah, I will, for sure. So there's one use case that I want to give you, which is really helpful in thinking about this, and we discovered it a bit later, is there is actually a product use case here, too. And the product use case that we found, this is sort of why we've sort of been pushing for this idea of semantic events. And we found that a huge number of the questions that, like, both data scientists and product managers want to have answered is like, I want to understand, in a sequence, like, what are all the things that happen to a specific entity in some order? And then there is some set of, like, real world funnels that exist. And then I basically want. So an example of this might be shipment cancellations. It was really hard at Convoy to figure out, you know, who is responsible for a shipment cancellation? Is Convoy responsible for that cancellation? Did we, like, mess up somewhere? Did we get the wrong information from a shipper? Like, why did this person cancel? And the only way that you can know that is by looking at the log of everything that happened in the sort of history of that entity over time. And this was like a huge product question.

Chad Sanderson [00:54:57]: Like, this is something that, if you can answer that correctly, there's millions of dollars that we could potentially be saving. And we didn't have a great way of doing it. We're like, well, you can do it through contracts. The way that you would do it is you would say, let's look at the lifecycle of this particular shipment entity. Let's very, very clearly document all the things that would be germane to cancellation, that would cause a cancellation. Let's instrument that in the code itself through a contract, clearly defined schema, well modeled in advance. And then once it flows into the warehouse, we can basically generate that sort of history table of that particular entity, sessionize it. And then all you have to do is.

Chad Sanderson [00:55:39]: It becomes trivial at that point to solve that. So we actually found that a bunch of product teams started asking their engineers for that, and then contracts were the mechanism of implementation.

Josh Wills [00:55:49]: That's super smart, man. If you can tell a product manager that they can have a funnel for something, oh, my God, they'll do anything you want. That's dreamy.

Chad Sanderson [00:55:59]: Exactly.
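
The history table Chad describes is, at its simplest, the events for each entity put in time order, after which a question like "what happened right before the cancellation?" becomes a lookup. A minimal Python sketch follows; the event names, the `ts` and `shipment_id` keys, and both function names are hypothetical.

```python
from collections import defaultdict


def entity_histories(events):
    """Group events by shipment and order each group by timestamp."""
    histories = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        histories[event["shipment_id"]].append(event["name"])
    return dict(histories)


def step_before(histories, target):
    """For each entity that reached `target`, return the event just before it."""
    result = {}
    for entity, names in histories.items():
        if target in names:
            i = names.index(target)
            result[entity] = names[i - 1] if i > 0 else None
    return result
```

With well-modeled events, attributing a cancellation is a matter of reading off the preceding step per shipment rather than reverse-engineering it from table snapshots.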

Josh Wills [00:56:00]: I think I want to call out here, like, one of the themes of my course, and I'm sorry to be like a broken record about this stuff, right? It really is about, like, relationships between engineering teams. And the funny thing is, like, this pattern of analysis of, like, the sequence of events occurs all over the place. They call it, like, customer journeys over in marketing land, right? And then, but at the same time, in, like, the land of observability, this is traces and spans. This is like, what is the life cycle of a request? When does a request die? What's the slow? Right? And so it's all the same thing. And the joke to me is like, literally it's the same thing. It's the same technology. Now, your example is perfect because it's, like, for these custom things where it doesn't fit naturally in an e-commerce conversion workflow or a trace on a request, the data teams are left with constructing these custom systems basically to do exactly this stuff.

Josh Wills [00:56:56]: The funny thing to me is the price point of constructing this funnel changes dramatically even though it's the same problem. You know, like, the engineering challenge is essentially identical, but, like, for observability, it's like not that much money. And for marketing, it's like, you know, extreme amounts of money, like that kind of thing. I'm sorry, but, yeah, it's just funny to me. And anyway, I love these conversations just because we get to talk about this stuff. It's so fun. Ah, dude. Demetrios, it's like 10:20 and I got to get going.

Josh Wills [00:57:22]: I'm sorry.

Demetrios [00:57:23]: I know. I wish we could talk all afternoon. I know. You gotta go. I appreciate both of you coming on here. I got to be a fly on the wall and just watch you two chat about this stuff, which I was hoping to do. I'm so thankful for this happening. And maybe we can do it again sometime.

Josh Wills [00:57:42]: Never know, right? Just ping me randomly like ten minutes before you want to do it, and I'll let you know if I can. Okay.

Demetrios [00:57:49]: If you're free, that is it. This is awesome, guys. I appreciate it.
