Airflow Sucks for MLOps
Stephen Bailey is an engineering manager for the data platforms team at Whatnot, a livestream shopping platform and the fastest-growing marketplace in the U.S. He enjoys all things related to data, and has acted as a data analyst, scientist, and engineer at various points in his career. Stephen earned his PhD in Neuroscience from Vanderbilt University and has a Bachelor's degree in philosophy. When he's not putting one of his four kids in time-out, he writes weird, tech-adjacent content on his blog.
At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Joe Reis, a "recovering data scientist" with 20 years in the data industry, is the co-author of the best-selling O'Reilly book, "Fundamentals of Data Engineering." His extensive experience encompasses data engineering, data architecture, machine learning, and more. Joe regularly keynotes major data conferences globally, advises and invests in innovative data product companies, and hosts the popular data podcasts "The Monday Morning Data Chat" and "The Joe Reis Show." In his free time, Joe is dedicated to writing new books and brainstorming ideas to advance the data industry.
Stephen discusses his experience working with data platforms, particularly the challenges of training and sharing knowledge among different stakeholders. This talk highlights the importance of having clear priorities and a sense of practicality and mentions the use of modular job design and data classification to make it easier for end users to understand which data to use.
Stephen also mentions the importance of being able to move quickly and not getting bogged down in the quest for perfection. We recommend Stephen's blog post "Airflow's Problem" for further reading.
Stephen Bailey [00:00:00]: I'm Stephen Bailey. I'm a data engineer at Whatnot, and I drink a whole pot of black Folgers coffee in the morning. Just the cheap stuff, but in volume.
Demetrios [00:00:11]: What's going on, everybody? We are back with another MLOps Community podcast. I am Demetrios, one of your hosts, and today I am joined by none other than Mister Ternary Data himself, Mister ODSC keynote speaker, Mister Wyoming rock climber, aka the Fundamentals of Data Engineering author, Joe Reis. What's going on, dude?
Joe Reis [00:00:40]: How are you doing? Good to see you.
Demetrios [00:00:42]: I am excited because usually it's like I get someone like you that I get to interview, but today, I mean, Christmas came early, man. We got to interview Stephen, and you are here right by my side, making sure that this interview hit all of the right notes.
Joe Reis [00:01:00]: Yeah, it was great. Great jam that we did. Liked that a lot.
Demetrios [00:01:06]: Dude, this was killer. I'm saying it. I said it before, but I'll say it again. This may have been the best one. This may have been my favorite one, just because the clarity. And I used three words to describe what I felt like this conversation was. It was philosophical, it was actionable, and it was a third word that I can't remember right now, but it was a big word that was fun and smart. And, of course, I can't remember it to make me seem absolutely like an airhead.
Demetrios [00:01:39]: But what were some of your takeaways there, Joe?
Joe Reis [00:01:42]: I mean, the thing I really like about Stephen is he just has, I think, a very innate sense of practicality to him, and it shows both in the discussion and in, you know, the workflows that he's implemented and whatnot. So I think that's something I always appreciate, also being, I would say, someone trying to be more pragmatic than not, or whatnot. Pun intended. So, yeah, I mean, it was definitely a breath of fresh air.
Demetrios [00:02:10]: Oh, it's so good. And it was so cool how he walked us through what Whatnot is, in case there's anybody listening that does not know. And then he went through how he got to build out the data platform, what he does, where he sits. And I just thought one of the key points that he brought up was...
Demetrios [00:02:30]: How it is so clear to them what their priorities are. Like he said, priorities come to him. He doesn't need to go sort them or search them out, because they bubble up, and he needs to just grab them when they cross his path, because otherwise he'll have problems down the line. That was a huge takeaway. And it's got to be nice to be able to have that clear of a vision at a company, and as an engineering team, and as the data platform team, to know what those priorities are and know what you need to keep in mind when you're building. And then just how he sits in the middle. And not him only, but the whole data team and the data platform team, they sit in the middle of so many different stakeholders. And we mention this a lot: when you're working with data, you have to loop so many different people in.
Demetrios [00:03:25]: But this is next level. I mean, the different people that he talked about: he's got the machine learning engineers, they've got the data scientists, he's got the software engineers, then trust and compliance, then the CEOs or the board. There are so many different things that each one of these stakeholders wants. And I love the way he breaks it down.
Joe Reis [00:03:50]: Oh, for sure. The thing is, the setup is very simple too, right? There's moving parts, but there's not a lot of complexity. And as far as we could tell, everyone seems aligned and knows the priorities and knows the goals. And the other thing I found cool was just the sense of "move uncomfortably fast." I think this is a good mantra: doing what's 80% there is good enough. All too often in the data world especially, we tend to have a lot of navel-gazing and striving for perfection, which is great, except you do miss out on the opportunities in front of you. If everyone's clearly aligned with the priorities, y'all can move fast together.
Joe Reis [00:04:40]: I think that's a great setup to be in. You're set up for success. It's awesome.
Demetrios [00:04:43]: Yeah, so true. It is a great setup to have. And a lot of people probably know Stephen from his Airflow blog post, or Substack, that he put out. If you have not read that, I highly recommend it. We'll leave a link to it in the description. And he's also got another great one that we didn't even get into. But Joe, I know you posted this in the notes on the document that we were looking at. It's "What exactly isn't dbt?", where he goes through and says what you should and shouldn't be using dbt for, in his view, and why dbt shouldn't be an orchestrator.
Demetrios [00:05:19]: And so it's so cool to see how deeply he's thought about the orchestration layer and how it can be done differently, how he would like to see it. And who knows? Maybe he'll end up starting a company around that someday.
Joe Reis [00:05:34]: I think it'd be really cool. The thing I like about him, too, is he's in Ohio, right? He's landlocked and he's not, I guess, for better or for worse, polluted by all the noise in the data world, right? So I think having that kind of remoteness also gives you a sense of clarity, that purity. It's very pure.
Stephen Bailey [00:05:53]: Yep.
Demetrios [00:05:53]: It's that purity.
Joe Reis [00:05:54]: Like the dark snow on a Columbus morning in the winter. Yeah.
Demetrios [00:05:58]: Are you busting out poetry on us, Joe?
Joe Reis [00:06:01]: I didn't realize that. I would not inflict that on people. That's terrible.
Demetrios [00:06:05]: Oh, well, for anybody that has not read those blog posts, go and read them. For anybody that has not bought Joe's book yet, go and buy that. Fundamentals of Data Engineering will change your life.
Joe Reis [00:06:16]: It will change your life.
Demetrios [00:06:18]: Will change your life.
Joe Reis [00:06:19]: It'll make you rich, it'll make you famous.
Demetrios [00:06:23]: Yeah. It will change so many different things. Just trust us on that one. And if you are not already joining us in the over 30 cities where we have MLOps Community meetups happening, I recommend you do that. If you're not subscribed to our three different newsletters that we put out, we've got two weekly newsletters and one monthly newsletter. Go jump on that. We'll leave a link to that in the description. And if you're not in Slack, join us, because there's all kinds of great conversations happening there.
Demetrios [00:06:54]: Last but not least, I want to give a huge thank you and shout out to our sponsors of this episode. Wallaroo is a platform designed to be a control room for production ML, to facilitate deployment, management, observability, monitoring, and optimization of models in a production environment. They cater to AI teams large and small, working on projects organized in a way that works for them. Teams of data scientists, ML engineers, DevOps, and business analysts can access and work in an integrated fashion in their environments via SDKs, UIs, or APIs. Get your hands on and grow your skills with Wallaroo by downloading and installing their free community edition. We'll leave a link to that in the description. At least go check it out. That's the least you can do.
Demetrios [00:07:47]: They sponsored the episode. They sponsored the community. They are big supporters, and we gotta say thank you to them. You can check out their free community edition right now. So without further ado, let's jump into it with Stephen. All right, dude, what is Whatnot?
Stephen Bailey [00:08:08]: Whatnot is a livestream auction site. It's kind of like Twitch meets eBay. And so we have sellers going live all hours of the day, and they are selling whatever they're passionate about. A lot of times it's collectibles, like sports cards, Pokemon cards, other trading card games, but we're branching out a lot into other categories like fashion and clothing. And we even have some of the whatnot-type categories where it's just, whatever you bring to the show, you can sell. And so we have people opening return pallets from stores and selling what's in those. I bought some returned mail the other day. Got, like, a Harvard Medical School, you know, primer on knee health.
Stephen Bailey [00:09:01]: That was great.
Joe Reis [00:09:02]: Interesting.
Stephen Bailey [00:09:03]: We've got to go.
Joe Reis [00:09:04]: QVC meets Wayne's World. Is that kind of it?
Demetrios [00:09:08]: This sounds like my place. I need to hang out there more, man.
Stephen Bailey [00:09:11]: It's amazing. It is really sneakily addictive. You kind of get on there and you start browsing, find something you're interested in. Then you just hang out for a while and see what people are buying. And a lot of the sellers are very, very engaging and they kind of gamify the retail experience where you buy one thing and then they'll throw in another thing or they can kind of deal make on the spot.
Joe Reis [00:09:38]: I'm just watching a clip on it right now. I'm watching Tuesday Value with Cactus. And I think I'm just going to spend the rest of the podcast just watching this clip here, so you guys can carry on.
Demetrios [00:09:50]: Interesting.
Joe Reis [00:09:51]: Yeah, it's very interesting stuff.
Demetrios [00:09:53]: I'm out, dude.
Joe Reis [00:09:54]: I'm done with morning mixers of Jonah. He's crazy.
Demetrios [00:10:00]: That's incredible.
Stephen Bailey [00:10:02]: We just dropped... we just released a new feature called Drops, where you enter for a chance to win something, and we sold someone a seat on the Voyager... what is it called? The Virgin Galactic flight?
Demetrios [00:10:19]: No.
Stephen Bailey [00:10:20]: Yeah. So we're sending someone to space. Interesting.
Demetrios [00:10:22]: What? Wait, what do they have to do.
Joe Reis [00:10:27]: For Whatnot in return? Do they have to, like, pitch a product while they're flying into space, or...
Stephen Bailey [00:10:30]: No, no. Just be a part of the experience and just live. Yeah, just live. Make it home.
Demetrios [00:10:40]: Yeah. That would be a PR disaster, actually. So wish them well. Anyway, dude, you were talking about Whatnot because you work there and you're doing data stuff there. Can you give us a little bit of a breakdown on what you're doing at Whatnot? And I wanted to off, man. I wanted to use it.
Joe Reis [00:11:00]: The only question I have, though, to be sure before you get to that, is: do you have, I guess, a public profile on there? Are you selling stuff on there on the side? I just want to check that out while we're talking.
Stephen Bailey [00:11:09]: Yeah, yeah, yeah. My username is data cat. Give me a follow. Liking, liking. There we go.
Demetrios [00:11:15]: Oh, my God.
Joe Reis [00:11:15]: I'm going, yeah.
Stephen Bailey [00:11:18]: I actually do sell occasionally. We have a big focus on dogfooding and whatnot. And so one of the ways you dogfood is by actually getting on there and selling. It's a very humbling experience. Every time I go on there, I just lose money for an hour, because I have no following. So for my first sale, I bought, like, 25 new bestseller books and was like, this is gonna be great. People are gonna come in and get really engaged. And it was just me selling brand new books for $1 for an hour and a half. So the bonus for viewers is they get great deals.
Stephen Bailey [00:11:59]: You can stay in the game.
Demetrios [00:12:01]: Do you have a hall of fame of the wildest stuff that you've seen on there?
Stephen Bailey [00:12:06]: It's like every month there's something new that's crazy that gets sold. We've had several stunts where we've had giveaways, where we offered someone the chance to play Post Malone, who loves the Magic: The Gathering card game, live. So we gave away a slot to do that, and we did this whole livestream of him playing one of our Whatnot users. We had a sports card seller crack open one of the most valuable cards in the world, which is called the Triple Logoman. It's a LeBron James card that has the Logoman patch, from the shorts of three of his NBA championship jerseys, all on one card.
Stephen Bailey [00:12:59]: And the card sold for $5 million. So that was opened on a Whatnot livestream earlier this year. Whoa. It's just wild. I think one of the things you learn quickly working at Whatnot is how passionate people are in these different niches of the collectibles world. People will drop thousands of dollars on a Pokemon stream just on a Wednesday night. And it really speaks to not just the things, the cards themselves, but to the community around it. Because that's what's really fun about Whatnot: you get in there, someone's opening a pack, you pull a cool card, and it's not just you getting that and being excited about it, it's like everyone else is like, whoa.
Demetrios [00:13:49]: Can't believe you got that.
Stephen Bailey [00:13:50]: That's amazing. That card is so cool. And so it really is, like...
Demetrios [00:13:53]: A social experience. Dude, we need to get some people stoked like that... stoked on the MLOps Community stuff that we're doing.
Stephen Bailey [00:14:05]: Livestream model training. I think that's where it's at.
Demetrios [00:14:08]: Yeah. Not exactly a spectator sport, but, you know, it's like watching... My dad was way into triathlons, and so I went and watched him, like, once or twice, and that's about all I got in me. It's not the spectator sport that you would hope it is. You just watch people whip by once, and then you're like, all right, see you in 2 hours. Especially if they're doing, like, Ironmans and stuff like that. So, anyway, we came to talk about the Whatnot data and just your data thoughts in general, because you've been putting out so much awesome material, man. Like, when it comes to the data layer in machine learning and in analytics and data engineering, you've got some quote-unquote thought leadership and hot takes, as I would classify them.
Demetrios [00:14:58]: And maybe it's because you are in the middle of the country that you don't get polluted. Your mind doesn't get polluted by all of these hardcore addicts on each of the coasts. But let's just break it down. You're working at Whatnot. What are you doing there, and what has your journey been to get there?
Stephen Bailey [00:15:18]: Yeah, I live in Ohio, for those who haven't caught it, and I think there's something to that. I think being in the middle of the country, you're at least a little insulated from the buzz around you. No one around me... Joe agrees.
Joe Reis [00:15:33]: Yeah, I live in Utah.
Stephen Bailey [00:15:34]: No one knows what data engineering is, so.
Joe Reis [00:15:37]: Yeah, well, Demetrios, you live in a village in the middle of Germany, so people...
Demetrios [00:15:43]: Yeah, people are like, machine learning? What? So I feel you. I think we're all in the same boat on this call.
Joe Reis [00:15:53]: We're in the same village, the same bar, whatever. So.
Stephen Bailey [00:15:58]: One of the really great things about working at Whatnot is we kind of built out the whole data landscape quickly and in proto-form. So we had a data platform team that was distinct from the analytics team, that was distinct from the machine learning team. And so we've been able to build with a bit more specialized mindset than just having a single team that was doing everything a little poorly. And we've kind of grown into this distinct responsibility set. So I'm on the data platform team, which is orchestration, the key pipelines, really getting data from, especially, our main application systems and making it available for operational, ML, and analytics use cases and things like that. So I get to jump into the individual domains in the company, trust and safety and the product experience, occasionally, but it's very much more about enabling the rest of the company to do those more advanced use cases. It's been great. I really loved working at a product-first company that has a lot of people who are very experienced.
Stephen Bailey [00:17:21]: I mean, I think I've been in places where I was the smartest data person in the room, and that is okay. Like, there's nothing wrong with that. But being in a place where there's just so many talented engineers who have a direction in mind, like they've seen it done at scale, and they can build from the start towards that future state, where we have really mature event logging, mature analytics, mature experimentation, mature ML models. Being able to be on the ground floor and build up to that has just been an incredible experience. Really, really great.
Demetrios [00:18:03]: Dude, I have to ask you, because I've been putting together a bunch of different best-of-the-year lists, since we are kind of at the end of the year, and I wanted to create the MLOps Community awards. And so I was going through a bunch of Slack threads that were contenders for the best-of-the-year Slack thread. And one of them that came up, and it really makes me think about what you're talking about, was a bit of a discussion around: how much should you be thinking about the future? How much technical debt should you be taking on just to get something out there and do it dirty, so that you can validate something? And how much should you really just try and bulletproof it from the beginning, even if it makes you a little bit slower? And there's a certain few people in the community that are very opinionated that you should never just do something dirty, because it's always going to catch up to you. And then there's others who are saying, just get it out there and see. Like, there's plenty of projects that have been absolutely incredible underneath the hood that never saw the light of day, because by the time they actually got out there, there was nothing there, or the goalposts had moved. So what are your thoughts on that?
Stephen Bailey [00:19:26]: Yeah, that's a great question. I think that's one of the fundamental tensions in the data world: between business value, and building a platform and thinking holistically and systematically. And I think you almost can't do both at the same time. If you have your one hat on, you're like, go solve a problem, add on. It's really hard to do the system-level stuff perfectly. You want to do it as well as you can, but you really do want to move as fast as possible. I think at Whatnot, one of the things that's really set us up for success is that we divided the analytics side and the data platform side. At the beginning, we hired two people.
Stephen Bailey [00:20:11]: They had two distinct job responsibilities. The data platform side was much more focused on getting a machine learning product out in production, and the analytics side was much more focused on the board meetings. And those two concerns are pretty distinct. On the machine learning side, you have to have really high-quality systems and pipelines that are rigorous, that are defined as code, whereas on the analytics side, honestly, none of that really matters, as long as you're able to answer the right questions, you have good definitions, and you have something that's fairly maintainable. That could just be a single layer of dbt models. So my sense is you want to get to value as quick as possible, but you want to build in processes where you can go in and think holistically. And the way we've done that at Whatnot is we have this data platform team that's distinct from the lines of business. It's kind of separated a little bit. And so we can think about how we should improve orchestration, how we should improve reusability, how we can improve the developer experience, and those things. But it is a secondary concern.
Joe Reis [00:21:27]: Walk me through this. When you started at Whatnot, and it was very early days with the data team, what were some of the initial questions that you wanted to answer in order to show that you were on the right path?
Stephen Bailey [00:21:40]: Yeah. So Whatnot is livestream retail, so it's got all your typical retail metrics, like number of orders and things like that. But then we also have these product engagement metrics, like: how many shows are people watching? How long are they staying in shows? What's their product analytics journey? I'd say those are the two sets of metrics that are most important for the customer journey at Whatnot. And so getting those exposed to the end users, or to our internal stakeholders, is concern number one, and making sure that those are exposed reliably and consistently. But then it kind of forks from there. I think as soon as we had those base metrics, we started wanting to know: how can we operationalize this for a recommendation algorithm in the product? That became the ML use case, recommendation. Like, can we get people into a feed that is being driven algorithmically? Because that is your flywheel for additional experimentation long term. So the first version of recommendation was a huge goalpost to get to.
Stephen Bailey [00:22:53]: And then on the other side, I'd say we picked a couple of the highest-value use cases operationally to start drilling into. And that was trust and safety, for example, where, because it's livestreams, you have all these potential ways that people can abuse or ruin the experience for others. And so being able to expose data quicker, but also create kind of a language for us to understand: all right, what does it mean? What is a problematic activity? How do we define that? Then, how do we get that in the hands of agents? I would say those were the three top priorities for data. They all have very different flows, because the analytics flow is fairly simple, a couple of dbt models, and you've got to have that consistent and integrated. But then the ML world is its own world once you have the raw data. The operational world, too, is its own world with its own stakeholders. And the flow that stakeholders use for looking at trust and safety data is very different from the flow that a board member would use for looking at a dashboard.
Demetrios [00:24:01]: Dude, awesome. That's so cool. Can you break down real fast... I do want to get into some of the articles that you've written, and you said the magic word of orchestration earlier, which almost made me jump in and say, let's talk about Airflow. But I bit my tongue. And I want to go into the idea of... you mentioned the recommender system, and recommender systems are something I've been digging into pretty deeply recently. And I'm wondering, what does your recommender system look like? What is the stack? How do you set that up? And how do you make sure that... I imagine it's real time; you want to be recommending things to people continuously and making sure that it's the best recommendations. And so I want to start with just what it looks like, but I also want to hear about these kind of edge cases. And how do you know that the recommender system is actually recommending things that are interesting, and it's not just recommending the most popular stuff? Like, how do you go through all of that? Fun question, too.
Stephen Bailey [00:25:11]: Yeah, there's so much to unpack there, first of all. And the recommendation system is so interesting to me. It's the first time I've worked with a team that's built a recommendation system and the first time I've really been exposed to it. I think one of the fascinating things about data is how much you can know, and also how little you can know, about so much. I feel a recommendation system is pretty... I don't want to say straightforward, but it's clearly a data application. But it has so much nuance in it, because it's such a mature data application that it almost is its own thing. I guess, just as a sidebar, I think it's really interesting how data products or data use cases tend to do that. Whether it's computer vision or whatever, as soon as they grow to a certain point, they can branch off as their own distinct subproblem.
Stephen Bailey [00:26:07]: But anyway, the big thing at Whatnot, the thing that distinguishes our recommendation system and a lot of our product-serving infrastructure (whether it's recommendation or the other use cases we now have for serving data into the product) is it has to be real time, it has to be very low latency, and it has to scale to very high workloads. The way we do that for most of our real-time applications is we use a tool called Rockset. We'll have our offline data processing, and then we'll dump data into a Rockset real-time analytics layer, which basically has extremely fast lookups. You can also write SQL queries, so it makes it pretty easy to maintain and manipulate over time. It exposes the results of those queries to other services through an API. The main application can just query it through an API, get the result set, and filter it down to certain users. That works really well. The latency is very low, and I think most importantly, the speed of iteration to get a new endpoint up and running is very low.
Stephen Bailey [00:27:21]: As long as we can reliably dump the data into a certain spot, we can kind of add it into our repertoire of things to serve. So a lot of our recommendations are actually served that way. We'll pre-compute the recommendations, dump them in there, and then they're ready for the system. What goes on in the training aspect, I'm actually not that familiar with, and I think that's one of the distinctions between a data platform engineer and an ML engineer at Whatnot. On data platform, we're responsible for a lot of the infrastructure and what's going on from a data movement standpoint. But what happens within a step, a lot of times, has so much more sophistication than is really necessary for me to dig into. So I actually don't know what we're doing in many of the recommendation system cases.
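For readers who want to picture that serving path: below is a minimal sketch of the pattern Stephen describes, querying precomputed recommendations out of Rockset through its SQL REST API. Everything here (the host, the collection, the field names) is an assumption for illustration, not Whatnot's actual schema.

```python
# A sketch of the serving pattern described above: precomputed recommendations
# sit in a Rockset collection, and the application fetches them with a
# parameterized SQL query over Rockset's REST API.
import os

import requests

ROCKSET_HOST = "https://api.usw2a1.rockset.com"  # region-specific; assumed


def get_recommendations(user_id: str) -> list:
    """Look up precomputed recommendations for one user."""
    resp = requests.post(
        f"{ROCKSET_HOST}/v1/orgs/self/queries",
        headers={"Authorization": f"ApiKey {os.environ['ROCKSET_API_KEY']}"},
        json={
            "sql": {
                # 'recs.livestream_recs' is a hypothetical collection that the
                # offline pipeline dumps scored recommendations into.
                "query": (
                    "SELECT livestream_id, score "
                    "FROM recs.livestream_recs "
                    "WHERE user_id = :user_id "
                    "ORDER BY score DESC LIMIT 20"
                ),
                "parameters": [
                    {"name": "user_id", "type": "string", "value": user_id}
                ],
            }
        },
        timeout=2,  # this is a low-latency serving path, so fail fast
    )
    resp.raise_for_status()
    return resp.json()["results"]
```

The appeal of the pattern is exactly what Stephen says: once the offline job can reliably land rows in a collection, standing up a new endpoint is mostly a matter of writing one more SQL query.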
Joe Reis [00:28:14]: Interesting. How do you coordinate with the ML teams to make sure that everything is, I guess, flowing well? What's the overlap with that team?
Stephen Bailey [00:28:27]: Yeah, that's one of the projects that I've been working on recently. In the last six months, we had to migrate off of a tool called Spell, which got bought, into SageMaker, the SageMaker world. SageMaker is Amazon's ecosystem of ML-related model functions. The SageMaker workflow determines our interfaces. What we want to provide as a data platform team are the tools, basically copy-paste tools, where people can configure essentially a training job. In SageMaker, they provide a Docker container, they provide a shell script configuration, and the hyperparameters. Then they can just copy-paste it into our orchestrator, Dagster, and kick that off on a schedule very simply. A lot of the work of the data platform team was in setting up:
Stephen Bailey [00:29:23]: What are those different steps that data scientists can use to launch these training jobs, create models, and deploy the models? And then just making them very easy to reproduce for different variations of the task.
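As a rough illustration of that copy-paste pattern, here is what a Dagster op that launches a SageMaker training job from a Docker image plus a dict of hyperparameters can look like. The image URI, IAM role, bucket, and hyperparameters below are placeholders, not Whatnot's configuration.

```python
# A sketch of launching a SageMaker training job from Dagster via boto3.
import time

import boto3
from dagster import job, op


@op
def train_ranker_model() -> str:
    """Kick off a SageMaker training job and block until it finishes."""
    sm = boto3.client("sagemaker")
    job_name = f"ranker-train-{int(time.time())}"  # unique name per run
    sm.create_training_job(
        TrainingJobName=job_name,
        AlgorithmSpecification={
            # the team's own training image, pushed to ECR (placeholder URI)
            "TrainingImage": "123456789012.dkr.ecr.us-west-2.amazonaws.com/ranker:latest",
            "TrainingInputMode": "File",
        },
        HyperParameters={"epochs": "10", "learning_rate": "0.01"},
        RoleArn="arn:aws:iam::123456789012:role/sagemaker-training",
        OutputDataConfig={"S3OutputPath": "s3://example-bucket/models/"},
        ResourceConfig={
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )
    # Block so that downstream deploy steps only run against a finished model.
    sm.get_waiter("training_job_completed_or_stopped").wait(TrainingJobName=job_name)
    return job_name


@job
def nightly_ranker_training():
    train_ranker_model()
```

A data scientist copy-pasting this template only needs to swap the image, the hyperparameters, and the schedule, which is the reproducibility Stephen is after.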
Joe Reis [00:29:42]: You mentioned Dagster. One of the things we wanted to talk about in this episode was your article on your Substack called "Airflow's Problem." So you switched sides? You're a Dagster dude now, or what happened?
Demetrios [00:29:59]: Yeah, or are you under SageMaker Pipelines? What do you got under the hood now?
Stephen Bailey [00:30:07]: You have everything. I feel like that's the state of the world right now. You have a little bit of everything. You got some dbt Cloud, you got some SageMaker.
Demetrios [00:30:13]: That's so true. And that's why it's such a headache. That's why, I think... yeah, it's such a headache. Your Substack rings so true with so many people. But anyway, I'll let you answer.
Joe Reis [00:30:24]: Want to describe, like, all the data tools that you have?
Demetrios [00:30:28]: That's the stream, dude. That's what you do.
Stephen Bailey [00:30:30]: Data platform and whatnot. Yeah.
Demetrios [00:30:35]: So, yeah, I jumped in. I didn't want to derail that, but hit us with what it looks like. You're using Dagster; are you also using any of these other ones? What else, for example?
Stephen Bailey [00:30:47]: So our philosophy is the orchestration layer, that scheduling layer, should be centrally managed. We don't want dbt Cloud to have a bunch of GUI-scheduled runs in it, just running independently, and then have SageMaker have its own schedules in there. We don't want our ingestion pipelines to be running on their own schedule that's built in their own tool. We want a single pane of glass for scheduling, so that we can hook things together, make sure that SageMaker jobs run after dbt jobs. And if the dbt job takes an extra 4 hours for some reason, all of those training jobs shouldn't then be delayed and have to be rerun, and things like that. It sounds like a very simple problem, but it's actually...
Demetrios [00:31:40]: Yeah, it's too advanced, right? Like we've gone too advanced and that's what makes it challenging.
Stephen Bailey [00:31:47]: Yeah. And where I was when I wrote that article, "Airflow's Problem": I was setting up a lot of this infrastructure, and we had just reached this point as a team where we were starting to have lots of data products and data applications in different places. We had some jobs that were starting to run, like trust-and-safety-related policy violation sensors, where they'd check whether something had happened and send a Slack alert. And then we had all these training jobs that were being created, and they didn't look exactly the same. One might have a training job, and one might have another post-process job, and then it has to get put into another system. And then we had, of course, our dbt and pipeline stuff, and it just felt like chaos. And when I looked at the orchestrator landscape, I looked at Airflow and Prefect and Dagster, and I had previously used Argo Workflows, which is a Kubernetes-native orchestrator. So I was fairly familiar with the space. I just felt like the orchestration layer was the right place to solve this chaos.
Stephen Bailey [00:33:02]: It's the only tool that's so tightly integrated with all of the different pieces of the infrastructure, and the only tool that knows about all the pieces, that could possibly provide an interface to managing it. But when I looked at Airflow, and really even when I looked at Dagster and Prefect, and just orchestration in general, it just didn't feel like we were solving that problem of being able to see all of the things that are out there in the world, much less being able to help manage them and make them work together. I do think that Dagster and the Elementl team are starting to push in this direction through their use of asset declarations. dbt does this a little bit, because it has its refs. So if you add new models in, they should fit into the graph. And the problem with dbt, of course, is that it's just a subset of things. It's not going to manage your model definitions and stuff.
Stephen Bailey [00:34:08]: Then actually Airflow, too. They've started to add OpenLineage and really start trying to connect the operations that are happening at the processing layer to this map of what you have and what you're managing at the data level. I think that's the right approach, and I think that's probably where we're moving towards.
Demetrios [00:34:37]: And so if I understand this correctly, it's you saying that, because the orchestration layer basically has its tentacles in everything, it should be doing more, not less.
Stephen Bailey [00:34:52]: Yeah, that's what I think. If we're going to solve the problem, like if we want to be able to intervene and manage the system as a whole, then there's got to be something that has at least as many tentacles as the orchestrator does. Because you look at something else, another tool that's kind of like this: catalogs. Data catalogs are the one tool that actually, I think, treats the data ecosystem in a way that is as broad as it is. A data catalog will probably index Google Sheets for you if you wanted it to. So they'll go to Google Sheets, they'll pull in your machine learning models, they'll look at your BI tool, they'll look at your database, and they'll pull it all into one place. But they can't do anything about managing any of that stuff. It's just visibility.
Stephen Bailey [00:35:46]: It's adding some metadata on top of it, giving you a search engine on top of your catalog. The orchestrator, on the other hand, if you're pulling that Google Sheet into Snowflake, it knows about both of those things, and it can do things to both of those things. It can create that map between them, and you're doing it in a way that actually lets you intervene, if you want to, say, stop pulling that in, or only pull it in once a month.
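To make the contrast concrete, here is a toy sketch of that Google-Sheets-into-Snowflake example written as Dagster assets. Because the orchestrator defines both sides of the movement, it can show the edge and intervene on it (pause it, reschedule it, stop it), which a catalog cannot. The sheet ID, credentials, and table name are hypothetical.

```python
# A toy version of the example above: the orchestrator knows about the sheet
# AND the Snowflake table, and owns the edge between them.
import pandas as pd
import snowflake.connector
from dagster import asset
from snowflake.connector.pandas_tools import write_pandas

SHEET_ID = "1AbC-hypothetical-sheet-id"


@asset
def ops_budget_sheet() -> pd.DataFrame:
    """Pull a Google Sheet via its public CSV export URL."""
    url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"
    return pd.read_csv(url)


@asset
def ops_budget_table(ops_budget_sheet: pd.DataFrame) -> None:
    """Land the sheet in Snowflake; pausing this asset stops the pull."""
    conn = snowflake.connector.connect(
        account="example_account", user="loader", password="...", database="RAW"
    )
    write_pandas(conn, ops_budget_sheet, table_name="OPS_BUDGET", auto_create_table=True)
```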
Joe Reis [00:36:14]: Interesting. Walk me through your decision criteria. You were using Airflow, you wrote a nice article about it, and now you're not using it. But you also mentioned Astronomer in the article, if I'm not mistaken. Was this maybe a viable option? Why did you decide to go the route that you did?
Stephen Bailey [00:36:35]: I did POCs, personally did POCs, with Prefect and then Airflow itself. We were on Airflow for a while, and then Dagster as well. Ultimately, what I really liked about Dagster was their focus on the engineer. I felt like, of all of these tools, Dagster had the clearest focus on the data engineer as a persona, and it was coming out. They had just released branch deployments, which is basically, every time you create a new PR, it'll create a sort of distinct environment for your code, and you do dev and testing there. They had CI/CD actions out of the box. And it's not to say any other tool doesn't have that. There's CI/CD for Airflow, of course, GitHub Actions all over the place. But I think when we were...
Stephen Bailey [00:37:36]: Your vendors are a little bit like a part of your team, to some extent, especially when you're smaller. And so I knew I was going to be leaning on someone to help us get stuff stood up, and I definitely saw that energy from the Dagster team. But I also think one of the things that came out of that article I wrote was: if you're going to go with a vendor, you are making a partnership at more of the platform level. Astronomer, for example: their core value is being able to make it really easy to deploy Airflow, and to deploy it multiple times, and then also to aggregate metadata and observability metrics on top of those Airflow instances. It's not just the orchestration framework, the syntax for creating a DAG. It's really about how you are deploying this thing and monitoring the infrastructure, and then getting metadata out of it. How are you sending those Slack alerts and making sure that they're going to the right people, especially as the complexity evolves? It's almost like it's not a decision, in many cases, whether you're going to use Dagster or Airflow. It's: what are you going to use at the level above that to help coordinate these things if you're going to grow? We know our trajectory. We're planning for a lot of growth.
Stephen Bailey [00:39:10]: We're planning for other teams to be deploying Dagster workflows and plugging into this orchestrator. And so that's where we knew that was the problem we wanted to solve.
Demetrios [00:39:22]: I love the human design aspect of that, how it's a bit of an abstraction above, and thinking about this holistic view, not just, oh, this tool gets the job done in the best way. Well, what is that best way? Define best for me, because is it the fastest? Well, not quite. And you're looking at it in a much different way. And it's funny you say that, because I was at a meetup in Berlin, one of the MLOps Community meetups, a few months ago, and there was a data engineer there, and he was talking about how he was so excited about Dagster because of some... I can't remember what the exact feature was, but it was like, it's data self-aware or some shit. I can't remember the name of it exactly. And so I'm just gonna take this moment to say, Dagster does not sponsor this episode, but if they want to...
Demetrios [00:40:17]: Pete, hit me up, man. We're here. We got the availability. This is what I like. They can clip this all they want and get some good footage for their social medias. But anyway. Joe, I know you had a question, man.
Joe Reis [00:40:33]: I don't want the Dagster team clipping this. I just want to say that too. Nick Schrock's a friend, and the Prefect crew are friends too. And my company, Ternary Data, we're actually partners with Astronomer, full disclosure. So I would say dotted line to Dagster as well. I ain't got a dog in the fight either way. Yeah.
Joe Reis [00:40:53]: An interesting thing you brought up, though, is sort of centralizing the orchestration layer, which makes a ton of sense. It's sort of the hub, or, I guess, the air traffic control, for maybe an overused metaphor. But you mentioned, too, you want other teams to start deploying Dagster and have that level of control. One thing I want to know more about is: how do you train and share knowledge among different stakeholders on these platforms? Because it would definitely seem like you'd want to at least have a common body of knowledge, or competencies, in order to be productive. How do you do that?
Stephen Bailey [00:41:33]: Yeah, it's a challenge. So I think there's two levels to it. The first is just education on the platform. How do you write a Dagster sensor and then plug it in? One of the things that I really like about Dagster is it's created for the jobs that you run to be modular. We have four repositories that push jobs into the Dagster central scheduler, and they're managed independently. It's intended to be used that way, where you have different environments and different Docker containers and different requirements. Because we had data science workflows and machine learning workflows, I knew we had to support all sorts of insane Python package requirements and stuff like that. So there was never going to be a single base image where you just use this and plug it in and it works.
Stephen Bailey [00:42:32]: So we have this kind of heterogeneous set of jobs that get pushed to Dagster, and that is just a training issue, really. You're just coming up with a playbook and a template for how you do it, how you create the Dockerfile, making the image creation process as simple as possible. The more challenging part is making processes aware of each other, and allowing other teams to know what can be relied on. This is a lot of what data mesh, I think, is intended to make clearer: to create processes whereby one group can publish data products to other groups, and then other groups can do whatever they want with them. That can be very challenging, because we have our dbt project, and that's really where the majority of our data pipeline work is done. But it can be unclear to end users, like, all right, which model should I select from, and which ones are not subject to change? I'd say we've done two things here that have made our lives a little easier. One is we've decomposed our one main dbt project into marts, sub-marts that have different classifications.
Stephen Bailey [00:44:00]: So we have raw data marts. We have a core set of Kimball-modeled marts that we're building out now. We have a couple of other product-use-case-specific marts. And then we also give people scratch spaces to build their own ad hoc stuff. There is a bit of a choose-your-own-adventure opportunity available for end users. They can choose from the raw data, they can get some of our core dimensionally modeled data that we have team guarantees around, or they can pick more highly vetted stuff that's been published by our machine learning team, for example. There's an implicit knowledge that I'm getting data from this source, rather than just selecting from a random Snowflake table.
Stephen Bailey [00:44:50]: So that sort of classification, of what each of these things is and who owns them, is really important. And then the second piece is Dagster. We've moved all of our stuff over to Dagster's assets framework, which allows for the jobs themselves to be linked to upstream dependencies. And so when someone is writing a machine learning job, they actually reference, by name, the dbt table that job depends on, and that's brilliant. It creates an explicit dependency on that table. Now, how you trigger that machine learning job can vary.
Stephen Bailey [00:45:38]: You can make it sensor-driven, so if the upstream table changes, that's when the machine learning job runs, and that's very nice. But at least by having that asset framework, you can go and look at that machine learning pipeline and see exactly what jobs it's referencing. And it allows me, as a data platform person who doesn't really have perfect insight into things, to more quickly jump in and troubleshoot things if they go wrong, and also to give them some sort of advice. So it kind of creates an interface for us to support other people, simply by having them declare what specifically they're relying on.
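A minimal sketch of the two patterns Stephen describes, assuming a recent Dagster version: an ML asset that declares its dependency on a dbt-built table by name, and an asset sensor that kicks off retraining only when that upstream table actually materializes. The asset, table, and job names are made up.

```python
# Declaring an ML asset's upstream dbt dependency, plus sensor-driven retraining.
from dagster import (
    AssetKey,
    Definitions,
    RunRequest,
    asset,
    asset_sensor,
    define_asset_job,
)


@asset(deps=[AssetKey("fct_orders")])  # 'fct_orders' stands in for a dbt model
def order_ranker_model() -> None:
    """Train against the dbt table; the dependency is explicit in the asset
    graph, so a platform engineer can see exactly what this job relies on."""
    ...


retrain_job = define_asset_job("retrain_job", selection=[order_ranker_model.key])


@asset_sensor(asset_key=AssetKey("fct_orders"), job=retrain_job)
def retrain_on_new_orders(context, asset_event):
    # Fires only after a fresh materialization of fct_orders lands.
    yield RunRequest(run_key=context.cursor)


defs = Definitions(
    assets=[order_ranker_model],  # the dbt assets themselves would come from dagster-dbt
    jobs=[retrain_job],
    sensors=[retrain_on_new_orders],
)
```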
Demetrios [00:46:19]: Dude, so many times I've tried to ask how people solve that problem, because that is one of the cruxes that you'll get between, maybe it's data scientists or data analysts, and platform teams. And it's such a headache to know: all right, if I'm changing something, or if I'm iterating on some kind of data, how is that going to affect everything downstream? And to know that this has been claimed by these projects or this model or whatever it may be. Then you have an idea of, all right, great, now if something is broken, or if somebody comes to me and says, hey, why isn't this working, then you can just troubleshoot it and you can go into it and say, oh, maybe this is why. So it helps you get to that end state much faster. And I really love hearing about how you implemented that. It's so cool. So, all right, Joe, you got any more questions before I jump to a few meta questions on the Airflow blog post?
Joe Reis [00:47:25]: Go for it. Go for it.
Demetrios [00:47:27]: So I just want to know: why do you think it struck such a nerve with people when you wrote this Airflow blog? Like, what is it that... well, I mean, it's a Substack. I keep calling it a blog, but same, same.
Stephen Bailey [00:47:38]: Right.
Demetrios [00:47:39]: And what is it that got people? Because it created a lot of buzz. First of all, a lot of people really liked it. And then it was on Hacker News for a few days. It was on the front page of that, and there were congrats all around.
Stephen Bailey [00:47:59]: Yeah, it's so funny. I mean, what really struck me personally was, as a writer, getting to the top of Hacker News. I didn't even realize it had happened at first, for, like, maybe 20 hours, right? I just published this in my pajamas at 09:00 a.m. on a Friday, and then, like, it doesn't...
Demetrios [00:48:21]: Feels like you did the Whatnot stream and bought, and then went and...
Stephen Bailey [00:48:24]: Sold two Pokemon cards. But, yeah, I think it struck a nerve because I caught the feeling that a lot of people have with Airflow, where you feel like you're a little frustrated, but it's fine. Airflow is good. Airflow is great, and it solves the problem of orchestration, and it solves it pretty well, and everybody knows about it. But there's so many people that are a little bit frustrated, or just frustrated enough that they really want to talk about it. So just critiquing Airflow, I think, is going to strike a nerve. That's what I found out. But also just the sense that Airflow is so inevitable in the data industry. This is really what I came down to in that post: Airflow is so inevitable that, I think, it stifles the conversation of what that tool there could be, because it has defined the space so completely for the last six or seven years. Right.
Stephen Bailey [00:49:44]: I think the tools that we use inform the way we think about problems. And the way Airflow is created is, it's created as pipelines, individual DAGs. And so you tend to think of your world as a set of discrete processes that run independently of each other on a schedule, and they're idempotent, and they run on backfills and things like that. And that's the way engineers start to think about the problem. But if you look at something like dbt, dbt basically creates this graph, and there's all the ease-of-development stuff, but to me, what is special about dbt is it puts you in this mindset of managing this graph of data assets. Some of them are managed by dbt, and some of them are exposures. But ultimately, what you're thinking about is how data is moving from point A to point B, and that just makes you build in a different way.
Stephen Bailey [00:50:41]: And so I think that's really it: that sort of completeness of Airflow's model, the way Airflow is molded to the problems we're solving, is, I think, what struck such a nerve.
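For contrast, here is roughly what that task-centric Airflow model looks like in code: a schedule-driven DAG of discrete, idempotent, backfillable tasks. Nothing in the code says what data the tasks produce; that map lives in the engineers' heads, which is the mindset Stephen is describing. The DAG and task names are illustrative.

```python
# The classic Airflow mental model: tasks and schedules, not data assets.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # pull from some source


def transform():
    ...  # build some tables


with DAG(
    dag_id="daily_orders",
    schedule_interval="@daily",  # the schedule, not the data, drives runs
    start_date=datetime(2023, 1, 1),
    catchup=True,  # independent, rerunnable runs per date interval
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t1 >> t2  # ordering between tasks, not between datasets
```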
Joe Reis [00:51:01]: It's kind of like the Java of the data world, in a way. Everyone wants to hate on it, but it's good enough for the job, honestly. And I don't even see some hard numbers on this, but I can't tell that Airflow is really suffering from any decline in installations, for example, or popularity. It is what it is. It's captured the mindshare. Every team I talk to, too, it's like they want to explore other options, but inevitably they go back to just using Airflow and just making it work. I guess you'd say it's good enough.
Stephen Bailey [00:51:37]: It's good. People know it. So, I mean, the fact that you can hire a data engineer and they know how to use Airflow, that's great. That's huge.
Demetrios [00:51:46]: Yeah, that's speed. That's very worthwhile, right? Because you can get someone up and running like that and they don't have to learn a new tool.
Stephen Bailey [00:51:54]: I think you look at what Astronomer is doing with it: they bought OpenLineage and are integrating lineage into the product, and they're starting to build some of that higher-level monitoring into the product. And maybe they'll come in and really improve the speed at which you can deploy a data platform and the scope that it can address. But I also think Airflow is Airflow. They're not going to change very much in the next seven years. That would almost get rid of the big value. They'll smooth out the corners, they'll improve the UI, but, I mean, all they...
Joe Reis [00:52:37]: Need to do is just pay attention to what, you know, Prefect and Dagster are doing and just make incremental changes. They've got the incumbent base, so all they have to do is just tack on stuff to stay wherever they need to be. So it's an interesting race. But then again, I mean, they're so much further ahead. They came out years before any of these other platforms did.
Demetrios [00:53:00]: Well, I think you brushed over something that is so important, and it kind of opened my eyes in a way, which was saying: it's been this way for so long that we haven't been creative in thinking about the solution. It's like we've gotten complacent and said, these are DAGs, this is how we run DAGs, and this is what it is. And then I just think about all of the subsequent tooling that is based on that theory, and that's how you do things. And really, if you're thinking about it in a different way, and you come with something totally out of the box, and you say, like you mentioned, okay, dbt has a graph way of looking at it, now that brings a whole other perspective into what could be the next iteration of this. If we were to just scrap everything and start over today, and we had never heard of DAGs, then what would that actually look like? Right?
Stephen Bailey [00:54:01]: Yeah. I think one area where it drives me crazy right now is the ML world and the data world. I would say Airflow is kind of the traditional data processing world. And in the machine learning world, it's not necessarily that the speed is any greater. You have all these offline training jobs; that's where most things happen. But what's different there is that the machine learning models have this sort of ontology of what is happening at each step and what's getting deployed. So it's not just, this processing job runs daily. It's, this model is getting retrained on a certain basis, and then we're going to take action based on what happened in that last retraining.
Stephen Bailey [00:54:52]: And then when I, as a user, want to look at it, I want to look at that model and that model's history. I don't want to look at the last couple of dates, et cetera. What I'm thinking about is the model, and all of my language is geared around that. I think one of the reasons the machine learning world has gotten so chopped off, especially from an orchestrator standpoint, is because of that difference in looking at things. It's not just a DAG that's continuously loading data. You want to think about the model, and you want to think about whether the model's been deployed, et cetera. What ends up happening is you have to build a new orchestrator just for machine learning models.
Stephen Bailey [00:55:37]: And I would love one orchestrator, or fewer. I'd just love it to be easier.
Demetrios [00:55:43]: Honestly, there's been too much vc money that's poured into this space for that to happen anytime soon.
Joe Reis [00:55:52]: Fair. Next cycle?
Demetrios [00:55:54]: Yeah.
Joe Reis [00:55:57]: Let me ask you this. How do you interact with the software engineering team?
Stephen Bailey [00:56:04]: We are implementing an event bus, a Kafka streaming solution. And the data platform team, not me but someone else on the team, has been the point person setting that up. We really see the data platform team as an extension of the engineering org. The core stakeholder there, and the people we report to, is the engineering organization. In some ways, we are kind of like a queryable, high-volume data interface into the engineering organization. That's our primary affiliation, especially with the event streaming project.
Stephen Bailey [00:56:47]: We are taking the lead in setting up the schema, data contract, and data quality infrastructure, not necessarily enabling what events are going to be going through, but tracking them across projects, across the different clients. When I think about data, I really think the purpose oftentimes is to take data from one domain and give it to others. And so events are an interesting example, where within the organization we already have these discrete domains: the client side, with different client operating systems, and a couple of different backend services. And so our role there is just to make sure that we are building a set of events that anybody, any of the engineering teams, can use in a reliable way. It's kind of like the public infrastructure of the team.
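Here is a hedged sketch of what schema-enforced event production on that kind of bus can look like, using Confluent's Kafka client with a schema registry: a client emitting a malformed event fails at serialization time instead of polluting downstream consumers. The `events_pb2` module stands in for a hypothetical protoc-generated data contract; brokers, topic, and fields are placeholders.

```python
# Schema-enforced event production: the protobuf contract is registered with
# a schema registry and checked before anything reaches the topic.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.protobuf import ProtobufSerializer

import events_pb2  # hypothetical module generated from a .proto file

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
serializer = ProtobufSerializer(
    events_pb2.ProductViewed, registry, {"use.deprecated.format": False}
)

producer = SerializingProducer(
    {"bootstrap.servers": "kafka:9092", "value.serializer": serializer}
)

event = events_pb2.ProductViewed(user_id="u_123", livestream_id="ls_456")
# Serialization validates the payload against the registered schema, so the
# data contract is enforced in code rather than by convention.
producer.produce(topic="product_events", value=event)
producer.flush()
```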
Demetrios [00:57:51]: Dude, so many different pieces that you're servicing and thinking about, and just how you're in the middle of all of this. You're right where the action is.
Stephen Bailey [00:58:01]: Yeah, it's crazy, it's crazy. I mean, it's a lot to keep track of. And I think one of the tensions that the data platform team experiences is just trying to keep up with everything that's going on. We are really pushing to allocate all of our effort towards the highest-impact use cases, and then letting the other things just work themselves out. And that was not true in my experience at previous employers, where the data team had high visibility, but it wasn't always clear what was the highest-value and highest-impact thing. So we pursued a lot of things that didn't end up being that useful. At Whatnot, because so much is happening, if we don't get to something quick and solve it, and solve it well enough, then that thing either just takes on a life of its own and it's gone, it dies, or it becomes a bit of a mess that we then have to go and clean up. So the priorities kind of come to us a lot more than we have to sort them out. Yeah, yeah.
Demetrios [00:59:13]: It's funny. Joe told me something last night. What were you telling me, Joe? It was like data monetization. You had something funny that I found hilarious.
Joe Reis [00:59:23]: I don't know, I probably say a lot of hilarious things about monetization. I mean, it's just one of these things where... I mean, it's pretty cool that you're close to the line of value, though, where I think there's a direct reflection between what you do from a platform perspective and the output of it. That's because you're tightly coupled to the engineering aspect of it, right? Yeah, I can't remember what I said, Demetrios. But the common trope I've been hearing for the last bit in data is, oh, we need to provide value and ROI and monetize our data. But it's interesting. We talked about funding cycles and stuff, so to touch on that for a second: I noticed that conversation sort of disappeared from data teams for a bit, too. It's like, oh, we'll just provide things and stuff and we'll experiment a lot. And I think that's very quickly coming to a halt.
Joe Reis [01:00:17]: And it sucks in a lot of ways. I can't blame the people on these teams, maybe their managers, or maybe the people that hired them, but I feel like a lot of teams were hired when there was a lot of money sloshing around. You get something off of Whatnot, you go buy a data team while you're at it. And, you know, it's just pretty funny.
Demetrios [01:00:34]: So you can buy data teams on Whatnot now. This is crazy.
Stephen Bailey [01:00:38]: That's a good idea. No rules against it.
Joe Reis [01:00:43]: Start trading data teams and whatnot.
Demetrios [01:00:45]: Yeah, I want to ask one. Oh, go ahead, go ahead. Yeah, yeah.
Stephen Bailey [01:00:49]: I was just going to add there: I think I've been that person who was hired into, like, the first data science role, and it's such a poorly understood role, and it can go so many different ways. I think a lot of times people just don't understand how to get value out of the data team. The people are not being pushed to create things that can be put directly into the product, or they're not part of the engineering team, they're not part of the sprint cycles. I really could see a world in the future where there is no data team. You're just hiring data professionals, it's just better understood what kind of value they provide, and they just do it.
Demetrios [01:01:33]: That's so true. That's so true. And that's an interesting world to think about. Now, to finish, I know you mentioned before that you have philosophies and whatnot, and you're building with certain philosophies in mind. Can you tell us what these philosophies are? What are some things that you try to keep in mind as you're building? I know when we interviewed the DoorDash folks, they were very big on making everything 1% better all the time. That was their thing: just getting 1% better and really chipping away at the big ice sculpture. Or they also talked about the velocity of machine learning.
Demetrios [01:02:15]: So how quickly can you take an idea and put it into production? Do you all have certain philosophies? I can't remember what philosophy you mentioned earlier. I think it was building with scale in mind, knowing that things are going to be big. So that's one, right? What else did you have on there?
Stephen Bailey [01:02:36]: One of our more distinctive philosophies as a company is to move uncomfortably fast and to get things in front of customers, for the most part; when it comes to the data platform team, to get things in front of our internal stakeholders. And also to not overthink things too much. It can be really tempting to try and build the best data model possible and really get feedback from everybody and things like that. But we do try to figure out the 80% best solution and then go with it as fast as possible, and just iterate over time. That said, as a data platform team, sometimes we have to think about that and really make space to have more thoughtful decisions, especially when it comes to data modeling. Because in the data world, as soon as you publish something, it's part of the infrastructure. It's like laying down pipes, and then a city gets built on top of it.
Stephen Bailey [01:03:51]: You can't go back, and it's a very expensive thing to go back and redo some of those decisions. We try very much to identify what is a type one decision, where we can just walk it back if we need to, and what's a type two decision, and treat them accordingly. So "move uncomfortably fast" is one. The other piece I think that is pervasive is "own everything and nothing." In the data platform world, we see a little bit of everything, from the client-side and backend code of our application, where we have events getting emitted, and we want people on our team to be able to go back there. Even if they are not making commits and changing stuff, we want them to be able to know: why is this event sending an error, or not being validated correctly? We need to understand the protobuf schemas and how they get validated during the development cycle.
Stephen Bailey [01:04:48]: We need to know what our trust and safety team is using from a data perspective. We need to have that visibility across the whole system. Even if we're not going to manage it tightly, we want to be operating with an understanding that we can step in and assist any of our stakeholders across the whole platform. And that is a constant learning effort for everybody.
Demetrios [01:05:14]: That transparency, man, I love that. That's incredible. Well, honestly, I don't want this one to end. I feel like I'm just getting so much wisdom downloaded here. This has been so good. But I know you got coral reefs to go and livestream, and maybe some guitar playing and whatnot, so I will let you get back to that and get back to your life. But thank you so much for coming on here and teaching us.
Stephen Bailey [01:05:42]: Yeah, this was great. Thanks, guys. A lot of fun.
Joe Reis [01:05:45]: Of course. Thank you.