MLOps Community

The Rise of Modern Data Management

Posted Apr 24, 2024 | Views 847
# Modern Data
# Machine Learning
# Gable.ai
SPEAKERS
Chad Sanderson
CEO & Co-Founder @ Gable

Chad Sanderson, CEO of Gable.ai, is a prominent figure in the data tech industry, having held key data positions at leading companies such as Convoy, Microsoft, Sephora, Subway, and Oracle. He is also the author of the upcoming O'Reilly book, "Data Contracts," and writes about the future of data infrastructure, modeling, and contracts in his newsletter, "Data Products."

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

In this session, Chad Sanderson, CEO of Gable.ai and author of the upcoming O'Reilly book "Data Contracts," tackles the necessity of modern data management in an age of hyper-iteration, experimentation, and AI. He will explore why traditional data management practices fail and how the cloud has fundamentally changed data development. The talk will cover a modern application of data management best practices, including data change detection, data contracts, observability, and CI/CD tests, and will outline the roles of data producers and consumers. Attendees will leave with a clear understanding of modern data management's components and how to leverage them for better data handling and decision-making.

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Chad Sanderson 00:00:00: My name is Chad Sanderson. My title is chief executive officer. My company is Gable AI. Well, I drink tea. Does that count instead of coffee?

Demetrios 00:00:10: Welcome back, everyone. We are in another MLOps Community podcast session. I am your host, Demetrios. And today, talking with Chad Sanderson, he dropped a few bombs on me. This is, I think, the third time that he's been back, and to be honest, he does not disappoint anytime. He is one of the greatest thought leaders in this space of data platforms, data quality, data engineering, you name it. He has strong stories, and I think the reason that I'm able to latch on to the stories that he tells is that he encompasses and articulates what people are feeling and talking about in the community, on LinkedIn, just in general, so well.

Demetrios 00:00:58: And he harps on why it doesn't need to be like this. First of all, he talks about the problem. He has a deep understanding of the problem from his experiences as a data engineer, and he knows that data engineers can be a bottleneck sometimes, when these different roles, like a data scientist or a data analyst, are asking a data engineer to go debug why my data isn't coming in the same format that it was a day ago, or why my machine learning model all of a sudden is going haywire and I think it's a data quality issue. He's able to talk and harp on why data engineering does not need to be 90% debugging. He also goes through what a traditional analytics system looks like, what a traditional data platform system looks like, how these systems currently break down, and why he has set out to fix it with what he's doing at his new company, Gable. It was really cool to see because I knew him when he was at Convoy, and he was talking a lot about the problems that they were having there and also the issues that they were trying to solve, and he mentioned that. And then he went out, and for the last year and a half, he's been talking to just about everyone in the data space, talking to companies small and large about how they are running their systems.

Demetrios 00:02:23: And he's in a lucky position because he is so well known. He gets to have access to so many different people at different-size companies, different maturity levels, and different quality of data infrastructure, and he can show what it looks like in all of these phases and where there are common problems across all company sizes, maturity levels, you name it. Let's get into this conversation with my man Chad. And as a little aside, a little secret: he's going to be speaking at the AI Quality Conference. We've got a few tickets left, so come join us June 25 in San Francisco. Oh, yeah. And of course, if you liked it, give us a share. Let us know what you thought of the episode.

Demetrios 00:03:18: Rate us on Spotify, subscribe on YouTube.

Chad Sanderson 00:03:21: All that fun stuff.

Demetrios 00:03:22: We will talk to you on the other side. Chad Sanderson, it has been a year since we talked, I think, and when you came on last time, we were beating the drum about data quality. It's probably more pertinent than ever before. I want to know, though. It's been a year, man. What has changed?

Chad Sanderson 00:03:49: Well, in the last year since we've talked, there have been some meaningful, meaningful changes. One, I got married. So congratulations. That was great. That happened in October. And then number two, I started a company, left Convoy. I now have a business called Gable AI. And we are off to the races.

Chad Sanderson 00:04:16: Building a company, the founder's journey, is very interesting and different than what I've done before, but pretty rewarding as well.

Demetrios 00:04:25: I believe it. So I want to get into Gable a ton, because you pretty much created this term "data contracts," and I know you are thinking deeply about data contracts and how to help productize that idea. Before we go down that rabbit hole, though, you've been talking with people for a year, maybe over a year, on how they're implementing their data pipelines and what their data architecture looks like. Do you have insights for us on anything that you've learned when it comes to setting up proper data infrastructure?

Chad Sanderson 00:05:06: Yeah, that's a good question. So I think there are a few major trends starting to emerge that are different from the last few years, especially the sort of data bubble, I'll call it, of the early 2020s. It was the midst of COVID, interest rates were still low, and so growing startups could afford pretty large bills for their data infrastructure: Snowflake, Databricks, a whole host of tools, even if they didn't have that much to show for it. That's something that we're seeing starting to change. Fewer companies are starting to assume that the data will inherently be valuable and that teams will just be able to find value within it. And more companies are starting to demand from the beginning that data teams have a good idea of how that data will be used and how it will generate ROI before the infrastructure exists. This is actually a callback to the way data infrastructure used to be set up and created in the eighties and nineties, before storage and compute were decoupled. If you were a company like Ford and you wanted to invest in a data warehouse, this was an incredibly expensive task. And so you had to think very carefully, not just from a cost perspective, a literal storage and compute cost perspective, but from a human capital cost.

Chad Sanderson 00:06:43: How many people are we going to be deploying to answer some data questions? How many data architects? We'll need to have some tool that's handling all the ETL, doing the transformation in some centralized place. What will we do with the data? And so this is where you see the data warehouse becoming a very popular trend. Not because it was a generic system that sat on top of all data, but because it was very, very useful for answering a specific set of questions that were driving business value. And so what we're seeing is teams starting to think a little bit less about building up an end-to-end stack from day one, or primarily thinking about analytics as the main use case for data. The latter is actually a pretty important point, because analytics, for a very long time in a company's life cycle, is not going to deliver massive business value. It's more so giving you information about the product. And so it makes sense why product teams really want analytics.

Chad Sanderson 00:07:49: But this is very different from something like invoice reconciliation, which has an actual cost to it. And if you don't do it well, it can impact the business negatively. Same thing with machine learning. And we're beginning to see a rise in more of those, what I would call operational use cases for data, as the primary use cases of data infrastructure, and that requires the infrastructure to be built a little bit differently.

Demetrios 00:08:13: How have you seen people do this wrong?

Chad Sanderson 00:08:16: Well, I think it ties back to the analytics use case, because oftentimes the impetus for a data investment comes from product teams who want analytics for their product. Organizations can build their data infrastructure in service of analytics and not in service of operational systems. So as an example, a very common analytics data pipeline is: you've got Snowflake for storage, you've got a data lake and you can just throw all the data that you care about in there, you've got something like Fivetran to quickly plug into these services or databases and move the data into something like Snowflake, you've got dbt sitting on top of it to very quickly and rapidly transform it, and then you have an analytics tool like Looker. This is a very common setup for early-stage companies. Basically thousands of teams have been using this. The problem is that setup does not actually scale, nor does it support the operational use cases.

Chad Sanderson 00:09:23: One, there's a quality and governance issue there, which is: if we are extracting data from the data producers using a tool like Fivetran, and the producers haven't taken any level of ownership over that data, meaning they're not treating it as a product, they're not treating it as an API, they don't have any accountability to it, then when something changes, it is 100% the responsibility of the data teams to deal with that change. Now, in an analytics world, that's okay, because if the change affects your dashboard, the company doesn't lose any money. You just go in and fix whatever the issue is. But if it's a machine learning model or it's an AI system, it actually does have an impact. You can't just go back retroactively and fix things, especially if you don't know about the change. Let's say that your model is training every night and you have a data quality issue that maybe your test didn't catch for some reason. Then your model is going to be retrained on bad data, and the predictions will start having a real financial impact on the company.

Chad Sanderson 00:10:30: That's one instance, but invoice reconciliation is another. If we don't catch that problem in a certain time window and deal with it, there are real, you know, physical consequences. And that means there has to be ownership. There has to be visibility of the data as it flows through the system. There have to be very clear expectations that are well defined. There has to be some feedback loop from the person that's using the data to the person who's producing the data. There's a whole set of features and functionality that we could talk about. But I think the best way of summarizing it is: in the old world, eighties and nineties, you had centralized data management, and you can think about centralized data management in the same way you can think about the management of a library, where you've got books coming in, they get stored in a catalog, you've got a librarian, and they're responsible for what comes in.

Chad Sanderson 00:11:26: They know everything that goes out. They're checking what comes in and what goes out. They're organizing the books in a certain way so that they're easily findable and discoverable. And you have a staff of librarians, and this is their job, and it's what they do all day. And I think I read that the average number of books was like 20,000. Like 20,000 books in a library, which is not all that many if you're thinking in terms of datasets. Right. Some companies have hundreds of thousands or millions of datasets, and that was sort of the old way.

Chad Sanderson 00:11:58: And what we're moving to is the federated way. It's like, well, we don't have these librarians anymore that are these centralized kind of people managing all the data in the company and determining who gets to have access and who doesn't. So how do you do this in more of a federated environment where the people who are actually writing the books are making those books available to the people who want the books? What does that system have to look like? And you'll find that it looks a lot more like a bookstore or like Amazon, something like that. And I think that's the change we're undergoing in data right now.

Demetrios 00:12:35: The question that came to my mind directly as you were saying that is: how can we make sure that if we are writing a book, other people, in this metaphor, come and buy our books? And it's not like I just write a book for nobody to read.

Chad Sanderson 00:12:55: So that is one area where the metaphor slightly falls down. Because in Amazon, right, the purpose of writing the book is to sell it. But in data, the purpose of producing data may not be for other people to use it. I might produce data as an engineer because I need that data to run my application, right? Like, if I'm collecting data about, you know, a customer goes to my website, I ask them to fill out a form. It includes their name, their age, you know, whether or not they've bought any new socks in the last three days. I will use all of that information as the owner of that service to do interesting things. Like, maybe I want to send them a gift card for their birthday. Maybe I want to show their name on their profile page.

Chad Sanderson 00:13:41: If I know they haven't bought any new socks in the last three days, maybe I'm going to show socks as, like, the number one item for them to buy in their list of recommended items. Like, I'm actually using that data for something. And so I'm not thinking about how other people in the company might be using that data as well. Right? Like, if I'm saying, hey, this is useful to me, the way most organizations work is that the data teams will come in and look at all this data that's stored in the transactional system, and they will say, well, this is actually really important data for me, too. Like, we would like to know what the average age of all of our customers is. Maybe there's something meaningful we want to do with that. Or we want to know, like, whether or not people have bought socks in the last three days.

Chad Sanderson 00:14:25: Like, is that a relevant question for us to continue asking? What should we be showing them? In those cases, they will use tools like Fivetran, these data extraction tools, to forcibly rip the data out of the hands of the engineers that are producing it and put that data into their own environments where they can analyze it. Now you have this problem where the engineering team is making changes to the data that they really care about, changes that make sense for them, and that might mean changing stuff. They might decide, hey, today in my database I've got sort of a first name field and a last name field, but I don't think that's relevant. We're wasting space or whatever. So I'm going to create a single column that's just called name. Now, as long as it is only affecting me, that's no big deal. It's just my service.

Chad Sanderson 00:15:16: I build my service in a way that can handle that change. But these downstream teams are extracting data, and maybe they're putting it into some marketing automation or something, and they're always expecting first name and last name to be in different fields, and I've just broken that downstream marketing automation. The problem, I feel, is not that these product engineers don't care, it's that they don't know. Right? Like, if you go to any application developer and you ask them, do you know how your data is being used, where dependencies on you exist, what it's being used for, and how valuable it is to the company, most of these developers are going to say, no, I have no clue. And if they don't have any clue, then to your point, they are not able to even start thinking of: how do I make this data available in a safe, structured way, to make it accessible and to make it fit for purpose? In the application world, it makes total sense to try to build in this very decoupled way, where I'm not stepping on anyone's toes. I can move as fast as I want within my own service. I don't have to depend on or rely on anyone else. And the whole purpose of that is speed. And the reason speed is important is because this is the age of rapid iteration and rapid experimentation.
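
To make that failure mode concrete, here is a minimal sketch in Python of the kind of check that would catch it. The schema and field names are hypothetical; this illustrates the idea, not any particular product's implementation.

```python
# Upstream schema before and after the engineer's "harmless" refactor.
old_schema = {"id": "int", "first_name": "str", "last_name": "str", "age": "int"}
new_schema = {"id": "int", "name": "str", "age": "int"}

# Fields the downstream marketing automation was built against.
downstream_expectations = {"first_name", "last_name"}

# Columns that disappeared, intersected with what consumers rely on.
removed = set(old_schema) - set(new_schema)
broken = removed & downstream_expectations
if broken:
    print(f"Breaking change: downstream consumers depend on {sorted(broken)}")
    # -> Breaking change: downstream consumers depend on ['first_name', 'last_name']
```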

Chad Sanderson 00:16:38: We know that some features are not going to work with customers, we know some will. The faster that we can ship things and try things out, the more that we can hill-climb our way to some sort of meaningful success with that product. But data is not like that. If I am computing a metric called profit, by definition, that metric will take a dependency on some other metric called revenue, and another metric called cost. And the revenue metric will itself take a dependency on the variety of sources of revenue within a company. When I worked at Microsoft, one source was Bing ads revenue. And then you also had all the revenue you got from selling Windows, and then you had the revenue from the Microsoft store. These are all sources of revenue.

Chad Sanderson 00:17:34: And then if you trace that line all the way back, then you can split it out into the individual line items. And so there are these chains, these trees of dependencies that form when you're trying to answer questions on data. And that cannot be decoupled, right? It's impossible to decouple that. And so when you step back and you actually look at how the data flows throughout a company, it does look more like a supply chain, where you've got a producer of kind of the raw materials, and these might be the logs that are being emitted from some service. And then that data gets transformed in some raw layer or silver layer of a data warehouse into a bit more normalized, a bit more aggregated. And then they get transformed yet again into a domain. And so here's all the events for a particular business unit. And then from there, they're transformed again into a particular metric.
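
As a toy illustration of these dependency chains (metric names are invented for the example): profit depends on revenue and cost, revenue depends on its individual sources, and anything that breaks upstream ripples down.

```python
# Metric dependency graph (a DAG): each metric lists what it is computed from.
metric_deps: dict[str, list[str]] = {
    "profit": ["revenue", "cost"],
    "revenue": ["ads_revenue", "windows_revenue", "store_revenue"],
    "cost": [],
    "ads_revenue": [], "windows_revenue": [], "store_revenue": [],
}

def upstream(metric: str) -> set[str]:
    """Everything a metric transitively depends on, i.e. everything
    whose breakage ripples down to it."""
    seen: set[str] = set()
    stack = list(metric_deps.get(metric, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(metric_deps.get(dep, []))
    return seen

print(upstream("profit"))
# {'revenue', 'cost', 'ads_revenue', 'windows_revenue', 'store_revenue'}
```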

Chad Sanderson 00:18:38: And on and on it goes. And if there's any break in the chain at any point, then it has a ripple effect on everyone downstream that is leveraging that data. And so one of the things we've kind of been dancing around while we're having this conversation is this idea of change management, right? If I am a software engineer and I'm trying to make changes to my code, there are very well understood systems for how to do that without causing damage to everyone else that is dependent on that code. And we call this DevOps. If I'm an engineer and I'm adding a new feature, that goes through a process: there are unit tests and integration tests, and there are pull requests, so humans look at these changes, and I get a diff, so there are things that make it easier for the humans to actually review the changes. Then there's a whole process that runs to merge the branch back into the main trunk of the codebase, then there's a whole process of ensuring that deployment is safe.

Chad Sanderson 00:19:47: Then there are tools like LaunchDarkly that allow you to toggle that feature on and off. Then you've got tools like Amplitude and Mixpanel that actually track how users engage with that feature. All of this is change management. I've made a change. I want to make sure that the change was safe and it had the intended effect. But if you look at the data world, none of that exists. None of that infrastructure I just talked about exists, and data changes just as frequently as software changes. If I'm an engineer and I'm making a change to my database, that's data changing.

Chad Sanderson 00:20:26: If I'm adding a new event that is emitted every time a user interacts with a feature, that's data changing. If I'm removing an event, that's data changing. Data is changing all the time. Yet there is no system for the teams actually leveraging that data to give any human-in-the-loop review or feedback. There's no integration testing for data. There's no way to actually check to make sure that the data is flowing the way that it is supposed to. These are all things that you detect after the fact, right? Like, way after the changes have already happened. And generally you're looking at the outcomes, the use case. Like, you're looking at me.

Chad Sanderson 00:21:09: You're saying your machine learning model has failed, and that's how I know that a data issue has occurred.

Demetrios 00:21:16: Yeah. All of a sudden the dashboards are starting to go wild, and it's like, ooh, maybe that was because my data was a little bit dirtier than I had thought, and it passed through one too many hands, or somebody did something upstream and I have no idea how or what. And then I have to go and be a detective and figure out exactly who changed what and where they changed it.

Chad Sanderson 00:21:42: Oh, yeah. Well, I think it's on both sides of the workflow as well. There's definitely the workflow of: a thing has changed, I have been impacted, how do I root cause this, and what do I do moving forward? Usually that work is put onto the data engineer, because if you're a data scientist or an analyst, you don't know that.

Demetrios 00:22:04: It's very hard just saying, hey, it stopped working. What's going on? Why is this going all haywire?

Chad Sanderson 00:22:10: Why did this happen? I have no idea. All I know is that it's impacting me, so I'm going to escalate it to my data engineering team. And then data engineering is usually the one that has to trace the lineage and say, okay, well, where is this data actually coming from? I can try to manually figure out the services that may have changed. I can look at the pull requests over time. Then I have to go and talk to the owners of those PRs and say, hey, did you make a change? Did it impact this? I have to manually investigate it myself, and then I have to convince them that they need to roll that change back or put a fix into their PR. And that takes time, right? Because they don't really care. It doesn't affect them. So it goes onto their backlog, it may take weeks, it may take months, and then once the fix is actually put into place, now I need to do a backfill, and that's the data engineering team.

Chad Sanderson 00:22:56: And usually there are way fewer data engineers in a company than there are data scientists or analysts. So they become a bottleneck, right? They're getting all of these requests, the requests are kind of stacking up, and it's very hard to address them all. That's one side of the problem. And then the other side of the problem is, well, if you don't trust the data to be accurate, and if there's no clear ownership of the data, that means it is your responsibility as a data scientist to make sure, with 100% confidence, that this data means semantically what you expect it to mean, that there are validation rules in place, and that if anything changes, you are going to at least know about it. And hopefully it doesn't totally blow your model up. And that takes a lot of time and effort and work too. In fact, I think I've read that something like 75% of data science time is spent on validation. And that might seem like, oh, well, that's just kind of the state of the world.

Chad Sanderson 00:23:54: But this is not how software engineers work. They don't spend 75% of their time trying to figure out if the API that they're ingesting is giving them the right thing or not. They trust that API because it is an API. It's very clearly documented exactly what this thing is that they're consuming, what it means, and how to use it. And because that API exists, it means they can take a dependency on it and just refer back to the documentation and the usage and the spec on how to manage it properly. When you think about data in that way, data is an API too, right? When I'm taking a dependency on data coming from some database, it's just another interface. And yet I'm doing that without the contract (there's that word) that a software engineer gets when they are leveraging an API for their services. I'm taking the dependency, but I'm not getting any of the guarantees.

Chad Sanderson 00:24:58: I'm not getting any of the ownership, none of the trust, none of the documentation. And I think, like, this is one of the big things that needs to change.

Demetrios 00:25:07: All right, real fast, I want to tell you about the sponsor of today's episode: AWS Trainium and Inferentia. Are you stuck in the performance-cost trade-off when training and deploying your generative AI applications? Well, the good news is, you're not alone. Look no further than AWS Trainium and Inferentia, the optimal infrastructure for generative AI. AWS Trainium and Inferentia provide high-performance compute infrastructure for large-scale training and inference with LLMs and diffusion models. And here's the kicker: you can save up to 50% on training costs and up to 40% on inference costs.

Demetrios 00:25:50: That is 50 with a five and a zero.

Chad Sanderson 00:25:54: Whoo.

Demetrios 00:25:55: That's kind of a big number. Get started today by using your existing code in frameworks such as PyTorch and TensorFlow. That is AWS Trainium and Inferentia. Check it out. And now let's get back into the show.

Demetrios 00:26:09: So, how are you doing it? Are you basically injecting something? I'm thinking about those tests that you can get when you go to the doctor, where they check to see if your veins are clogged or not. They put something inside your veins, and then the blood pumps all through the body, and you can see the places where you have, like, different clots or parts of the veins that aren't working as well, because it's been traced, right? I think it's, like, blue dye or something. And so under the special machine, you can see, like, okay, these veins are not working that well, and you have a clear picture of it. I almost feel like you need to do that, where you put a stamp on the data, and if it comes out of this API, then it is with this, like, pedigree, and you know these things. But then if anybody takes it and they use it, and you then have that data downstream from three or four different transformations or uses, or people have touched it in many different ways, you can still retrace it, because it has, like, all the different pedigrees as it's gone through this supply chain.

Chad Sanderson 00:27:26: So what you're describing right now is, I think, a more advanced form of data lineage. And if folks in the audience are not familiar with data lineage, it's essentially the process of looking at an analytical environment and parsing through SQL, basically stepping through each query and understanding how that query was built. What table does it rely on and what table does it produce? And when you can answer all of those questions for every query in your data environment, then you effectively have this big graph, and the graph shows you how one column flows to all these other different tables and views in the company. This is really great if you're using a tool like Snowflake or Databricks. And the reason why is because these companies have effectively built abstraction layers on top of data. They have created the interface, which as an example would be Snowflake SQL. And everybody knows how to interact with Snowflake SQL.
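
For readers who want the mechanics: the table-level version of this can be sketched with the open-source sqlglot parser (an assumed choice for illustration, not a tool the speakers mention). For each query, find the table it produces and the tables it reads from; doing that for every query yields the lineage graph.

```python
import sqlglot
from sqlglot import exp

query = """
CREATE TABLE analytics.daily_revenue AS
SELECT o.order_date, SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON c.id = o.customer_id
GROUP BY o.order_date
"""

tree = sqlglot.parse_one(query)

# The table this query produces...
target = tree.this.find(exp.Table)
print("produces:", f"{target.db}.{target.name}")  # analytics.daily_revenue

# ...and the tables the SELECT reads from.
sources = sorted({f"{t.db}.{t.name}" for t in tree.expression.find_all(exp.Table)})
print("reads from:", sources)  # ['raw.customers', 'raw.orders']
```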

Chad Sanderson 00:28:39: You also have tools like dbt that write the transforms. And dbt also provides an interface that's even easier to use than Snowflake SQL. And they actually manage a lot of this lineage work for you. The problem is generally not downstream. By downstream, what I mean is what happens to the data after it arrives in some environment where a consumer of the data can do something with it. So it could be a notebook, it could be Snowflake, it could be Databricks, right? As long as I can access the data as, like, a data scientist, the problem is not really there. It's everything prior to that. Because where that data is coming from doesn't have nice, clear abstractions.

Chad Sanderson 00:29:26: It's the wild west, right? We could have a Python script that's pulling data out of Excel spreadsheets and dumping CSVs into a data lake and then pushing that on some cadence into our analytical database. And anyone at any point in time can go in and make a change to the structure of that file, and we would never know about it until it actually reaches us, which is too late. It could be a software engineer that is adding new logging to some feature that they've created, and then we're pushing that event via Kafka into a database, and then from the database we're going into the analytical environment. And so if I don't have the link between the analytical environment and the event code, that will always be a manual slog for me, because I will have to make that connection myself. And this is sort of the state of the world for a lot of data scientists. There is that gap between when the data actually lands in some storage system and what's going on before that.

Chad Sanderson 00:30:42: And so when people talk about a source of truth, that's what they're talking about, right? When someone says, hey, I want to know the source of truth of these transactions, they're not talking about the table in Snowflake. They're talking about the actual moment that the event of a transaction occurs. And what does that mean? What is the underlying code that is producing that interaction, and where can they find that? So in terms of what Gable does, there are a few things. Gable is kind of a meaty platform. We tackle this problem in modular pieces. People can use any number of the modules that we produce, or they can tie them together into a more seamless experience, which is where it really gets magical and powerful. But the core of the platform is what I would call our API.

Chad Sanderson 00:31:37: We can build a consistent abstraction on top of many different types of data sources automatically. So if you think about what actually makes up any data source, there are really only two components. There's the structure of the data, so that would be the schema, the data types, column names, stuff like that, and the semantics: what does that structure actually mean? What does any particular column or row mean in the real world? And then there are the contents of the data itself, the values that I would expect from that data. If I have five categorical values in some particular field, I would never expect a sixth or seventh to suddenly appear. Like, maybe my system is not built in such a way to deal with that. Or if I have an age field, I would probably not expect the age to ever be less than zero or greater than 300, unless you've got…

Demetrios 00:32:43: Putting conditions and almost, like, guardrails onto the data, so that people understand, okay, no matter what and how tainted things can get, the data is not going to come out like this. And if it does, we need to figure something out. Are you alerting people if it does show up like that? Or is it just not allowed and you get, like, a null or…

Chad Sanderson 00:33:05: It's a bit of both. So there's sort of a flow that happens. There are really, like, four main steps in Gable. The first step is the identification of these data sources. And we have some magical AI-driven mechanisms to scan a code base and automatically figure out where in code data is being produced. And so we can figure out: here are all the database tables that are being produced, all the structures that exist in your code, all the event code that's being produced.

Chad Sanderson 00:33:41: We could connect to some file structure like S3, scan through S3, and then figure out for each file: is this file related to data? Does it have a schema and stuff like that? Or is this more of a non-data-related file, like a Word document or something? And we can do this with third-party stuff too: Salesforce, SAP, whatever, it doesn't really matter. But once you have all of those identified, then we can extract the structure and the contents of the data, and we can represent those in a standard format. So it doesn't matter if it's a Postgres transactional database, a Segment event, a Python dictionary, a .NET executable, Java, whatever; all of that data can be extracted and organized and formatted in the exact same way. Once that's in place, we can then automatically enrich that data with lots of useful information. Because we're injecting ourselves into the Git version control platform of choice, so GitLab, GitHub, whatever, you can extract the ownership on the repo level: who is the person that actually owns the code that is producing this data? That is true ownership. You can extract how that data asset has changed over time, and then you can also extract how these assets are actually connected to each other.
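
A rough sketch of that "consistent abstraction" idea, using a made-up normalized shape rather than Gable's actual data model: whatever the source, its structure is represented the same way and then enriched with ownership from the producing repo.

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str               # e.g. "orders" or "UserSignedUp"
    source_type: str        # "postgres" | "segment_event" | "python_dict" | ...
    schema: dict[str, str]  # field name -> type: the structure of the data
    owner: str              # enriched from the Git repo that produces the asset
    repo: str

def from_sample_record(name: str, record: dict, repo: str, owner: str) -> DataAsset:
    """Infer a normalized schema from one sample record of a dict-shaped source."""
    return DataAsset(
        name=name,
        source_type="python_dict",
        schema={field: type(value).__name__ for field, value in record.items()},
        owner=owner,
        repo=repo,
    )

event = {"user_id": 42, "email": "a@example.com", "bought_socks_recently": False}
print(from_sample_record("UserSignedUp", event, "github.com/acme/web", "web-team"))
```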

Chad Sanderson 00:35:20: Like, where do we see the same information in different assets? And that's sort of the lineage graph that I mentioned before. That all happens automatically. Once you have that, you can start to have humans come into the loop, and a human can say, okay, well, I have some expectations on this particular data. Those expectations might be the constraints you mentioned. So there might be some specific constraints that you have in mind that a machine would never really be able to figure out. Or maybe a machine could figure it out by doing some profiling, but we're not there yet. Or it might be something like an SLA: hey, I always expect there to be 1,000 events emitted over this time period, and if that number ever drops, it means something is probably wrong. What we can do then is, based on the expectation, alert when we detect a change is going to happen or has happened. So if you're in CI/CD, for example, we will know: is the structure of this code event going to change based on a PR? Hey, a software engineer has decided to get rid of a column that you, the data scientist, are using.

Chad Sanderson 00:36:41: We know that before it happens, and the same goes for new things being added. The other option is to look at the contents of the data and say, hey, the contents of the data have changed in a way that's going to affect your machine learning model. We're going to tell you about that before the pipelines actually run and cause…
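
The two kinds of checks just described, reduced to a few lines of illustrative Python (thresholds and names are invented for the example): a CI-time schema check that flags a pull request removing or retyping a column with downstream consumers, and a runtime SLA check on event volume.

```python
def check_schema_change(before: dict[str, str], after: dict[str, str],
                        consumed: set[str]) -> list[str]:
    """Run in CI against the schema a pull request would produce."""
    violations = []
    for col in set(before) - set(after):
        if col in consumed:
            violations.append(f"column '{col}' removed but has downstream consumers")
    for col in consumed & set(before) & set(after):
        if before[col] != after[col]:
            violations.append(f"column '{col}' retyped {before[col]} -> {after[col]}")
    return violations

def check_volume_sla(event_count: int, expected_min: int = 1000) -> list[str]:
    """Run against live data: alert when emitted events drop below the SLA."""
    if event_count < expected_min:
        return [f"only {event_count} events in window, expected >= {expected_min}"]
    return []

# A PR drops 'age', which a data scientist's model consumes:
print(check_schema_change(
    before={"id": "int", "age": "int"},
    after={"id": "int"},
    consumed={"age"},
))  # ["column 'age' removed but has downstream consumers"]
```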

Demetrios 00:36:59: All that damage, which I imagine just seems like magic for anybody that you talk to or that you're working with, because you're like, oh my God, that is going to be a lifesaver for me. I've been spending 75% of my time trying to figure this out, and now you're just telling me it's going to work. That's incredible.

Chad Sanderson 00:37:15: Well, there are a lot of things that you get for free with a system like this that I think are really cool. I mean, one thing is, because we're registering the source data, you get a nice catalog of all of your data sources, which is the thing that most data people actually want. Right? Like I said before, they don't care as much about the tables; they have to go back to the sources anyway. They just want the sources. And that's something that we get for free because of this registration process. Same thing with change management: live change detection is a big part of our platform. We have a change log, so it shows over time.

Chad Sanderson 00:37:58: How has a data asset changed? Who actually made the change? What was the change? It's all the context that's very relevant for the teams that are consuming all this information. And there's a bunch of other things too, around data contracts: how do we agree on them, and how do we collaborate as the data evolves? The thing that was most meaningful to us when we built this is: how do we help the data producer, so the engineer or the salesperson or whoever is generating this data, understand that they are not in an isolated environment anymore? As long as we can integrate with GitHub or GitLab, because we're automatically scanning and profiling this code and figuring out what it's connected to and who is dependent on it, we can provide that feedback without them actually having to do anything. Meaning, if I'm a software engineer, I'm just going about my day-to-day work, and I'm shipping features, and I decide to change some SQL code that exists somewhere in my system. I will be told by Gable: hey, you're about to change records that other people in the company are using. Here are the people in the company who are using these records. Here is what they are using those records for.

Chad Sanderson 00:39:18: You need to go and speak to them before you make this change. Or at the very least, you should write up a summary of what this change is going to be, and Gable will manage communicating that to the right people. So it's not about applying AI magic to resolve all data quality issues or something like that. It's applying the AI magic to bring the people together and have closer connectivity when it comes to managing and creating data in the first place.

Demetrios 00:39:50: Yeah. Just that awareness has got to be so brilliant for teams. I'm curious: you left Convoy a year, year and a half ago with ideas on what was going to be the most valuable. It seems like you had a pretty clear vision, but it's been polished over the last year and a half. What are one or two features that people have been asking for nonstop, where you were almost surprised by these features, because for you, that was a non-thing, but now you're talking to a lot of different companies?

Chad Sanderson 00:40:31: I think something I wasn't expecting is how much of an end-to-end system teams really want. My initial goal when I set out was: can we build something that, by pressing a button, can scan all of this code, figure out who owns that data and what it actually means, put the constraints, like the CI/CD constraints, in there, and really start to treat data as another step in the DevOps process? That was my main goal: bringing DevOps to data and using data contracts as the vehicle to do that. What I found, though, is that people have a much broader vision of what data contracts are. There's the CI/CD approach that I mentioned, but people also want contracts for cataloging purposes: I get all of the source data in one place, and I can actually sort through that source data and segment it by domains. People also want the live change detection on their data sources. I want to know when someone makes an update in Excel, in real time, so that I can either protect myself or, if I'm on a platform team, protect my data consumers.

Chad Sanderson 00:41:54: And I never want these issues to actually reach the analytical database in the first place. Right. So there was a lot more interest in that than we originally thought. And I would say I've also been surprised. Oh, sorry, go ahead.

Demetrios 00:42:09: I was just looking at how far upstream you are going. You're thinking very far upstream. And if you can catch this in the moment the data is created, I think that's where you want to live.

Chad Sanderson 00:42:26: Yeah, our goal is to go all the way, all the way upstream. I think that there are…

Demetrios 00:42:33: Maybe all the way.

Chad Sanderson 00:42:35: I think that there are some places where you can't do that, or there are problems that are introduced outside of that core data source. So, for example, you may have a human that is entering some data into a text field, and they just enter something that should never be entered. It doesn't make any sense and doesn't work with your pipelines or your dashboards. You can't really catch that during the CI/CD workflow because there's no code being changed, but you would be able to catch it basically as soon as that data becomes accessible through some system moving it from that front-end interaction to either the database or the analytical database. You want to try to catch that as soon as humanly possible. But really, I think the future is going to be all about code, and also direct integration with any platforms or source systems that modify data: Excel or Google Sheets or Salesforce or SAP. A lot of data work happens in these products, and there really is no oversight for data quality or data governance. If I'm a salesperson, I'm going to change around my Salesforce schema in a way that makes sense for me and my team, in the way that I want it. But if that sales data is really critical to downstream teams, there's no real way of them knowing that or understanding that.

Chad Sanderson 00:44:08: And so all of that, like change management, change detection, CI CD, these are all places we want to insert ourselves.

Demetrios 00:44:17: I want to just look at the other side of this, which is: if you didn't have that, you'd have to go into each one of these systems and manually try to create something that says, hey, alert me if this ever gets changed. And just thinking about that almost gives me a headache, with all these different data sources and all these different people that you're interacting with. Maybe for a small team at a startup, you're running fast, it doesn't matter that much, and you can deal with that technical debt once you find product-market fit or you start taking off. But if you're at a gigantic company, just think of the amount of people working at that company that you would have to interface with to try and figure out: how am I going to protect my assets, which are these things that I've been working on for the past, whatever, three to six months? And I thought I was having success with them, but little did I know, somebody I've never met in some random department changed everything. And now I don't know why, but I've got to go debug it, or I go and ask a data engineer to debug it.

Chad Sanderson 00:45:32: Yeah, exactly. At these big companies, there are potentially thousands of people who all touch different parts of the data that you're using, and you have no clue. A lot of these teams have thousands of microservices, tens of thousands of datasets, hundreds of thousands of CSVs. There are just enormous, enormous amounts of data, and it's very, very unclear where it all comes from and what it means and why it exists and when it changes. And it's not just any change that happens; it's anytime you want to use that data. There's actually this relatively narrow safe zone where you've already gone out and done all the work to understand what the data is and where it's coming from and what it means and all that. And that takes a lot of time.

Chad Sanderson 00:46:26: And then there's all the changes that could potentially happen to that data because it's plugged into so many sources. So you've kind of got this relatively infrequent window where everything's okay, everything is safe, and you've just kind of perpetually got your fingers crossed that today is not going to be the day, but it always is the day. At some point. At some point, the day comes, and when it does come, it's very scary and it's a big problem.

Demetrios 00:46:58: That's why I've been championing the data engineering appreciation day.

Chad Sanderson 00:47:03: Yes, data engineers definitely need that.

Demetrios 00:47:06: So before we jump, what was the other feature that surprised you, that people were asking for a bunch?

Chad Sanderson 00:47:12: Yeah. Well, I think the other thing that was surprising to me was how excited data producers and software engineers were about this. I think that data teams really, really underestimate this. There's a bit of Stockholm syndrome going on, I think, where they've probably tried in the past to go to their engineers and say, hey, I need you to own this data, or I want you to be more thoughtful about me. And they kind of got brushed to the side, or the engineering team said, hey, this doesn't really matter. But I think that may just be the perception. When I've talked to data producers, they all seem to acknowledge the problem. They're like, yeah, we understand that this is an issue, because we get asked about it all the time.

Chad Sanderson 00:47:56: The issue is: I, as a software engineer, don't have a mechanism to help them with this. All of the solutions that I'm given are like, well, you need to be very thoughtful about what data lands in this table. But I don't know anything about Snowflake. I don't know anything about Databricks. I don't know anything about the data lake. I don't know about the ETL systems that you're using. I don't know anything about dbt.

Chad Sanderson 00:48:21: So it's like, in order for me to help you with this problem, I have to learn four or five different tools. And if something does go wrong, I don't have any way to resolve it. I don't know how to root cause it. I don't even know how to figure out if it was a big deal or not. I don't know what right or wrong even looks like. So there's this whole list of requirements that need to be met in order for a producer to take ownership. And if those questions are not answered, then this just seems like a very murky, unclear, difficult problem. And they don't know how big of an investment it's going to be, and so they don't do it.

Chad Sanderson 00:49:02: And so when they see the way that we tackle this, it clicks: oh, this is all DevOps. This is just another step in the workflow that I already run, where we're accounting for data the same way that we account for security or the same way that we account for code quality. And if I do make a change that's not good, I'm going to be told, and I'm going to be given all the context on everyone I need to talk to, why I need to talk to them, what good actually looks like, and what I need to do in order to progress this pull request forward. That is very exciting. It turns this from a nebulous, unclear, very confusing problem into something that is very clearly scoped, very clearly defined, and fits into the workflows that they already use. So that is probably the biggest and most interesting thing to me: just how much software engineers have gravitated towards these solutions.

Demetrios 00:49:58: Are you interfacing more with the DevOps teams, or is it data engineers? Is it a little bit of everybody? It feels like, since the surface area is so big, you have to make sure that you are on top of it with each one of these people and that you understand their needs just as much as anybody else's.

Chad Sanderson 00:50:19: Yeah, I mean, I think the data engineers and data platform teams are probably still the folks that we talk to the most, because they're impacted the most; there's the most in it for them if there's a data quality issue. For the data consumers, so the data scientists and the analysts, they're also impacted, but they're impacted in a different way. They're usually not the ones doing all the root-causing and the lineage, all the stuff that takes a bunch of time. They just want to make sure that it doesn't happen, and they don't really care how.

Chad Sanderson 00:50:55: And then on the side of the data producers, they don't want to cause people pain. They don't want to be seen as the bad guy, and they certainly don't want to be in a situation where the whole company knows that they were the ones responsible for an outage to a machine learning model. That's what they want to avoid, but they just don't have the systems to do it. We start in the middle, and then we work our way out. My guess is that over time, we will go the same route as the DevSecOps kind of workflow. I think DevDataOps will become a term that most DevOps teams and most engineering teams are using. And that means mainly selling into the software engineering organization.

Chad Sanderson 00:51:44: In the same way that software engineers have now started to realize that security is a critical part of their workflow, that they can't just pass it off to someone else in the organization and wipe their hands clean of it, it's their responsibility, too. If they don't follow security best practices from the beginning, then hacking and fraud will inevitably happen, and all you can do is deal with it. In the same way, if you don't do data management from the beginning, data quality issues will happen, and you just have to deal with them. And that's not acceptable anymore in the age of artificial intelligence. Maybe in the age of analytics it was acceptable, but now it's becoming a lot less acceptable.

Demetrios 00:52:26: It's funny you mention that, because I was just at KubeCon two weeks ago, and every other booth was a security booth. It is very much understood that there are no excuses for any type of security breach. So I do see this vision where you're like, we cannot use these excuses anymore for poor data quality or just for poor data practices, because so much of the business is relying upon this. You have so many different use cases that are coming from the data. And it feels like when you have things set up, and you know where the data sources are and how the data is being used, and you have that rich lineage, you can quantify it even better. You can tell, okay, this data is generating a whole ton of money through these machine learning models or through these other examples that you showed. I think that is another one that shows teams the value of the data in a way that is irrefutable.

Chad Sanderson 00:53:34: Yeah, exactly. I think that we are going to start to see a big shift over the next five to ten years, not just in the tools, but in the companies that are using data. And it will split. On one side will be companies that are using data primarily for analytics, and those companies will invest in low-cost tooling: very cheap storage and compute, very cheap analytic systems, probably open source. And on the other side will be companies that are actually using data to make money. AI/ML is obviously one example, but there are other ways that you can monetize data or use data to save large costs. Invoice reconciliation is one example that I mentioned.

Chad Sanderson 00:54:20: And those companies are going to have to treat data like a product from the very moment of its inception. And that means not just putting quality into place downstream, but upstream as well.

Demetrios 00:54:34: Beautiful, man. Well, I always appreciate you coming on here. I'm super excited to get to meet you in person for the conference. We are going to be talking about this so much. We're gonna spend the whole day talking about quality. I had to have you there, because when I think about quality, you're like one of the first people that comes to my mind. It's almost like we couldn't have a conference about quality, AI quality, data quality, without having you there. So, yeah, this is gonna be fun. It's gonna be a whole lot of fun. And we're gonna get to talk about this stuff all day long.

Chad Sanderson 00:55:11: Amazing. Well, I can't wait for it. And I will show up in my Sunday best, and I'll cut my beard so I don't look like a wild man living out of a cave.

Demetrios 00:55:21: That's it. Actually, on a not-so-serious note, I was thinking, should we do some kind of a theme, like a Hawaiian shirt or something? So if anybody has any ideas, throw them at us, because I am open to it right now. It's not too late to figure out what the theme or the fun things that we're going to do are. We're already going to have a comedian there. We're going to have some jam sessions happening, so beware of that, too. I don't know the quality standards on those. We may need a tool like Gable to check on the upstream possibilities of the quality and give the stamp of approval for the musical session qualities. But you'll see when you come.

Demetrios 00:56:12: I guess that's the fun of it.

Demetrios 00:56:14: Hold up.

Chad Sanderson 00:56:15: Wait a minute.

Demetrios 00:56:15: We gotta talk real fast, because I am so excited about the MLOps Community conference that is happening on June 25 in San Francisco. It is our first in-person conference ever.


Demetrios 00:56:27: Honestly, I'm shaking in my boots, because it's something that I've wanted to do for ages. We've been doing the online version of this. Hopefully I've gained enough of your trust for you to be able to say, I know when this guy has a conference, it's going to be quality. Funny enough, we are doing it. The whole theme is about AI quality. I teamed up with my buddy Mo at Kolena, who knows a thing or two about AI quality. And we are going to have some of the most impressive speakers that you could think of. I'm not going to list them all here, because it would probably take the next two to five minutes, but just know we've got the CTO of Cruise coming to give a little keynote.

Demetrios 00:57:11: We've got the CEO of You.com coming. We've got Chip, we've got Linus. We've got the whole crew that you would expect. And I am going to be doing all kinds of extracurricular activities that will be fun and maybe a little bit cringe. You may hear or see me playing the guitar. Just come. It's going to be an awesome time. Would love to have you there.

Demetrios 00:57:36: And that is again, June 25 in San Francisco.

Demetrios 00:57:41: See you all there.
