Putting the AI back in Medallion Lake Design // Simon Whiteley // DE4AI
Simon is a Databricks Beacon, Microsoft MVP and owner of Advancing Analytics. A deep techie with a focus on emerging cloud technologies and applying “big data” thinking to traditional analytics problems, Simon also has a passion for bringing it back to the high level and making sense of the bigger picture. When not tinkering with tech, Simon is a death-dodging London cyclist, a sampler of craft beers, an avid chef, and a generally nerdy person.
In recent years, companies have seen an explosion in adopting lakehouses and reaping the rewards, but time and time again, we hear from people that they regret the layering of their lake. The zones don't quite fit what they were trying to achieve, and no one in the company understands what "silver" vs. "gold" actually means. Worst of all, it has become the domain of engineers and analysts alone. The original boom of data lakes was down to the AI revolution, so how do various AI personas fit into the mix? In this session, we'll recap a mature, production-grade lake design, then overlay our various AI activities on top. Whether you're an AI engineer, citizen scientist or grizzled data science guru, you'll leave this session with a better understanding of how lakehouse design works for you.
Adam Becker [00:00:05]: The first person that we want to bring on board onto the stage is Simon. Let's see. Simon, are you around?
Simon Whiteley [00:00:11]: Yeah, yeah, here. Hello.
Adam Becker [00:00:14]: Wait, is that. Oh, my God. That's not the profile photo I have. Give me 1 second. Anne, is this, is this Simon? This is Simon. Okay, Simon, the floor is yours. I'll be back in a few minutes.
Simon Whiteley [00:00:34]: Okay. Well, yeah, apologies if it's not what you expected, but, you know, we need to make these things work, right? So, my name is Simon. Hello there. I am a data engineer from the UK. I run a company called Advancing Analytics. And what I do is I build lots and lots and lots of lakehouses. Essentially, I teach people how to be an engineer, but also how to bring their data teams together. It's just a hard thing that we have to try and work out.
Simon Whiteley [00:01:04]: Right. So I thought I would spend a little bit of time today explaining one of the most oversimplified, but also quite complex, things: the medallion architecture. And hopefully this will become a little bit more sensible as we go. So, firstly, data engineering. There are essentially two types of data engineer. There's the traditional, the OG, the original data engineer: essentially software engineers with a passion for data. They kind of took that route in, and a lot of that started because they were looking at data science and saying, I need a data pipeline that is robust, scalable, repeatable, all the good engineering stuff, just to serve a particular training dataset or a machine learning model.
Simon Whiteley [00:01:55]: That's not to say that was the only way, but it's fairly common now. The other one comes from the other direction. That's traditional SQL people. I was offending people in the other chat, saying they're not real data engineers, but that's not quite what I mean. Data engineering with a traditional analytics focus is saying, I'm trying to build a generic data model, a star schema, a super big giant analytical model that can answer maybe 80% of the different questions a business might have. Very different from the single serving data pipeline of the past. Now the two are doing the same thing. They're both saying, yes, let's use data engineering.
Simon Whiteley [00:02:36]: Let's use software engineering in the preparation and transformation of data. So good, we're in the right state. That's what we're trying to do as data engineers. Now, the original SQL people, the data warehousing people, 20 years ago, we were doing the same thing over and over and over and over again. Essentially, that is: we stage some data into a database, into a schema. Then we pick it up and we clean it. We maybe sort out the dates, we standardize currency, we get it into a state where it all does the same thing. Then we pick it back up and we maybe make facts, dimensions, the data model that we want to show the business.
Simon Whiteley [00:03:20]: Now, if I've cleaned one bit of data, and I use it in 20 different dimensions, well, I just go back to the same cleaned bit of data. It's all about reuse. Reusability is a key thing. Now, that's old, that's legacy. That's not what the cool kids are doing. So when they came out with things like the lake house, when we started saying everything should be a lake, not just specialist things, well, they had to have a new name for this stuff, which was essentially, what's this thing called? The medallion architecture. So we're doing the same thing. We've got bronze, we've got silver, we've got gold, and it kind of maps to exactly what we were doing before.
Simon Whiteley [00:04:01]: So I've got my data going into bronze, and it's in a raw state. It might have duplicates, it might be unclean, it might not even be the right data that I'm wanting. I've not really done much checking on it. I then pick up that data and I do my filtering out bad rows, cleaning data, scrubbing it, applying some kind of cleansing routine, loads of things we can do in there. And then I pick it back up and I put it into my business-level aggregates. I make it tidy, I give it to the business. Sure. So that all sounds simple.
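The bronze → silver → gold flow he describes can be sketched in plain Python, with lists of dicts standing in for Delta tables. The table and field names here are invented for illustration, not taken from the talk:

```python
# Bronze: raw landed records — duplicates and messy values included.
bronze = [
    {"order_id": 1, "amount": "10.50", "currency": "GBP"},
    {"order_id": 1, "amount": "10.50", "currency": "GBP"},  # duplicate row
    {"order_id": 2, "amount": "3.00", "currency": "gbp"},   # inconsistent casing
]

def to_silver(rows):
    """Dedupe on the key and standardize values — the cleansing routine."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({
            "order_id": r["order_id"],
            "amount": float(r["amount"]),          # standardize type
            "currency": r["currency"].upper(),     # standardize casing
        })
    return out

def to_gold(rows):
    """Business-level aggregate: total spend per currency."""
    totals = {}
    for r in rows:
        totals[r["currency"]] = totals.get(r["currency"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'GBP': 13.5}
```

The point isn't the code itself but the shape: each layer is derived from the one before, and the cleaned silver set can be reused by many downstream gold models.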
Simon Whiteley [00:04:31]: Really, really, really simple. Now, the worrying thing is that bit at the end: ML. So data science only happens at the end of this process, on data that's been scrubbed and cleaned and turned into a data model. But obviously, that's just not how it works. The real world is, sadly, a little bit more ugly, complex, different. So if we're saying things need to get real, well, we have to get real about the medallion architecture too. It's not as easy as it's made out to be, and that's the problem with the medallion architecture.
Simon Whiteley [00:05:10]: It oversimplifies things, and so we need to take it with a little pinch of salt. So take a really simple data science workflow, kind of more your traditional machine learning kind of thing. We can say, yeah, there's pictures coming from Databricks, coming from various lakehouse people, coming from Delta saying, this is how it fits into medallion. But it's saying we'll pick it from any layer of the lake, bring it in, do your data prep, make your features, put it into a feature store of some kind, do some inference and spit out a results table. But it never maps back into the medallion architecture, never maps back to the lake. It's kind of like a separate process. And that's the whole reason why we went to the lakehouse, to make this the same thing and bring it all together. So it's just not quite right.
Simon Whiteley [00:05:53]: So let's take that even simpler. Let's work out what those different things we're actually doing are. If I've got my bronze, silver and gold, well, firstly, I'm going to pull data from any one of those as my source if I'm trying to put it into some kind of machine learning model. Now, the most obvious to use is silver. So I want data that's already been cleaned. Someone's gone and tidied up my various different things. They've removed PII data, or they've standardized addresses, they've standardized currency. Whatever it happens to be, someone has gone and cleaned that data.
Simon Whiteley [00:06:25]: But if I'm doing something like fraud detection in insurance, well, actually, if someone cleaned that data, they've probably over-sanitized it. I might have lost some of those predictive signals I actually really needed in the first place. So sometimes it doesn't make sense to go to cleaning that someone else has done. You want to kind of do your own cleaning, but you have to make that decision. So one big important thing is I should be able to see between those different layers what cleaning has been done. I should have transparency and lineage and be able to actually go, right, what has happened in terms of transformation between those layers. But I can make a decision and go, I want the raw data. I know I need to do my own work then to clean that data up and dedupe and all that good stuff.
Simon Whiteley [00:07:04]: I can go to the silver layer, and then I need to know what's been done to it and what's changed. But I can also reuse it, which is nice and efficient. I may want to go to the gold layer. The gold layer might be, I might have all my different products and my product extensions, my categories brought together, and I've already applied all the business logic. I don't need to do that myself every time I use products in some kind of model; I can just take the one existing set. Not quite the same as a feature store, but something's been prepared for the business in some kind of model, so I might as well reuse that. So, in reality, if I'm saying, what are my different AI personas and which bit of the medallion do they go to? Well, all of them, but for different reasons and different use cases. It just kind of makes sense. So the output of that stage is going to be some kind of clean data.
Simon Whiteley [00:07:47]: And again, no one really talks about how that plumbs back in, and we'll come to that in a second. Now, again, keeping my really obvious basic flow, I'm going to do some kind of feature extraction, some kind of feature engineering, or maybe just joining to an existing feature store and plumb that into my workflow. But where does that live? That's not mentioned anywhere in my medallion architecture. It's not really plumbed into my existing process. It kind of lives on the side. And then when we actually make a model, and we've done our experimentation, and we've done some maybe batch inference and augmented an actual table of data, well, it doesn't really go back in either. It didn't really fit as gold, because, well, it's not really a transformed data model. It's an augmented table that's got an extra column, but it's maybe not ready for the business, so it doesn't quite fit how we speak.
Simon Whiteley [00:08:32]: Which is why lots of people in the more data science advanced analytics teams get really confused when all the engineers are talking about bronze, silver, gold, because it doesn't quite map, it doesn't extend to what they're trying to do. Again, the real world is just a little bit more ugly than that. So, yeah, we can take this really simplified view of the world and try and jam things in. Technically, our clean data absolutely would go back in the silver layer. The feature store kind of also lives in the silver layer, but it's a different part of the silver layer. It's like stretching the analogy or augmented data that we've applied some inference to. Mine technically, is gold because it's had curation then, but it's not a ready data model. The problem with the medallion architecture is it's oversimplified.
Simon Whiteley [00:09:16]: It's trying to put things into three boxes, and everyone always believes it and goes, okay, I have to design my whole architecture to have three distinct boxes. And just sometimes it's more complicated than that. Sometimes it's a. Sometimes you can get away and life is nice and easy. Sometimes it's a bit harder. So, to step through really quickly what we tend to see, I mean, you might have someone who's got a really simple use case and they just have two layers. Great. You might be working with a giant enterprise bank and they've got as many as six layers.
Simon Whiteley [00:09:43]: Not everyone needs it, but we see it. So certainly I've got my raw files coming in, just whatever I'm getting, maybe some JSON, some barque, the results of some kind of stream, whatever it happens to be, we land it, we validate the structure of it. So I'm not cleaning any of the data, not doing anything to change that data itself, but I'm getting it into a standard, structured table source, if that's the kind of thing I'm doing, maybe I'm getting a lot of text documents and I'm chunking it up so I can then put it into a vector database, whatever I happen to be doing. I've got that step between flat files and then structured landed data that hasn't been cleaned yet. And then I do some cleaning. So I'm actually changing the values in there, applying some rules, and removing special characters from strings, whatever that happens to be. Now, I then may take that and do another layer. Now that might be my feature store, but that might also be maybe I've got ten different ERP systems.
Simon Whiteley [00:10:31]: I don't need to combine it into a canonical model, a standardized view of the world. I need a place to do that. That's not quite the clean layer, it's not quite my gold analytical model. If I need another hub, not everyone needs that. That is quite a big enterprise, very structured way of doing things. It's just about making space. And then I'll put it in some kind of data model, some kind of star scheme or one big table, whatever your fancy happens to be. And then you might even have a semantic model on top.
Simon Whiteley [00:10:57]: I'm not saying you have to go to the nth degree of making it complex, I'm just saying it's a little bit more complex than bronze, silver, gold most of the time. Now, the thing to understand is various data teams, wherever you happen to sit, if you're an AI engineer, you should be plumbing these various different factors back in. You should be talking to your data engineers about, right, which layer should I plumb my clean data in? Because I don't want to have to go and clean it again every time I build something. Where should our feature store live and how should it be managed? And can you take a look after it? Can you build any kind of batch inference just into the standard ETL, so it's not a separate path when it happens to be things, essentially, we should be talking to each other, working in the same way, but also understanding that if it's just bronze, silver, gold, you're going to have to break the rules because that's a little bit too fixed, a little bit too rigid to actually make things work. So, yeah, that was my super, super, super quick blitz through, kind of understanding what goes inside the medallion architecture, trying to understand how these things fit together. And essentially, it's a bit harder than it makes itself out to me.
Adam Becker [00:11:58]: Simon, thank you very much for this. I feel like if only I had met you maybe like six years ago, seven years ago, five years ago, it would have saved me a lot of heartache not knowing how these things fit together. And also just serving the data scientists completely un-useful data for a very long time, simply because, well, the moment that I served them a bunch of data and asked them to do some predictive magic with it, they came back and they're like, well, you've already manipulated it and added a tremendous amount of bias in the way that you've cleaned it. And I really need to be higher.
Simon Whiteley [00:12:37]: Upstream and all of the options are valid. Right. So sometimes the cleaning is just standard cleaning, and that is absolutely fine. They can use it as long as it's very clear and it's documented. Hooray. Data governance. As long as they actually know what's been done to it and they can make a decision going that hasn't added bias or that has. Oh, I was going to do that anyway.
Simon Whiteley [00:12:56]: I've saved myself a job and now we could just repeat and use that. Great. But again, sadly, it's a communication piece. Right. It's about showing people what's there, communicating what's been done, and making sensible decisions as a team.
Adam Becker [00:13:11]: Speaking of communication, if folks want to find you, I hope you're going to be in the chat for a little bit longer in case they have questions. Thank you very much, Simon.
Simon Whiteley [00:13:20]: This has been.