Data Meh-sh: Composability over ideology in the data stack // Stephen Bailey // DE4AI

Posted Sep 24, 2024 | Views 866

Speaker

Stephen Bailey

Data Engineer @ Whatnot

Stephen Bailey is an engineering manager for the data platforms team at Whatnot, a livestream shopping platform and the fastest-growing marketplace in the U.S. He enjoys all things related to data, and has acted as a data analyst, scientist, and engineer at various points in his career. Stephen earned his PhD in Neuroscience from Vanderbilt University and has a Bachelor's degree in philosophy. When he's not putting one of his four kids in time-out, he writes weird, tech-adjacent content on his blog.

SUMMARY

Data can't be split into microservices! But teams should own their data! But there should be one definition for metrics! But teams can bring their own architectures! Data platform teams have a tough job: they need to find the right balance between creating reliable data services and decentralizing ownership -- and rarely do off-the-shelf architectures end up working as expected. In this talk, I'll discuss Whatnot's journey to providing a suite of data services -- including machine learning, business intelligence, and real-time analytics tools -- that power product features and business operations. Attendees will walk away with a practical framework for thinking about the maturation of these services, as well as patterns we've seen that make a big difference in increasing our adoption while reducing our maintenance load as we've grown.

TRANSCRIPT

Skylar [00:00:11]: All right. Welcome, Stephen Bailey from Whatnot.

Stephen Bailey [00:00:15]: Oh, hello.

Skylar [00:00:17]: I'm very excited for this talk. I don't exactly know what it is, but I feel like from the title, I'm gonna like this talk, so I'm ready for it.

Stephen Bailey [00:00:25]: Yeah, the title is a little bit of clickbait, I think, at the end of the day. But, you know, if you've heard of data mesh, then hopefully you find it interesting. And if you haven't heard of data mesh, you should go look it up. Hopefully it entices you. Well, thanks everybody for being here. My name is Stephen Bailey. I am an engineering manager at Whatnot. Let me figure out how to get over here.

Stephen Bailey [00:00:51]: Whatnot is a livestream marketplace. You can kind of think of it like eBay meets Twitch. It's live shopping, there are auctions in real time, and there are lots and lots of use cases for us to do interesting stuff with data and AI. I've been at the company for the past three years or so and seen it grow from a small Series B company to a much larger one with a very big following today. And in this talk, I want to talk about our journey as a data platforms team in scaling out our data architecture. So not just AI and ML use cases, but also the applications and analytics use cases that really drive the business forward. As for what our journey as a data platform looks like: it's no secret that the modern data stack is pretty crazy. There are lots and lots of options for basically anything you need to do with data: orchestration, analytics, BI, notebooking, ML, AI agents, monitoring, observability, storage, and the different cloud platforms like Google's. There are lots of choices and lots of complexity.

Stephen Bailey [00:02:13]: If you're anything like us, and I think a lot of companies are, you'll start out with a pretty simple, nice medallion architecture for your modern data stack. You get it up and running pretty quickly. But as new use cases come up to drive value in the business, you're going to stack on new things like a recommendation system or maybe some in-app analytics. You'll bring in Kafka and streaming systems that integrate with your data warehouse. You might do experimentation and metrics work, or integrate with operational tooling. You might even put in a real-time analytics database. The slide's a little outdated. I don't have any chat agents or vector databases in here, but you probably have some of those too.

Stephen Bailey [00:02:55]: You end up with this very complex stack that is generating lots of data products every single day. Success breeds more complexity, which is great on one hand, because that probably means that you're doing important and impactful things for the business. But it also means that, as a data platforms team, it can be very easy to go underwater and lose sight of who owns what, what's important, what's deprecated, what's live, and what is blowing up your systems. So you need solutions that are going to help you. You need a framework to approach this problem. One of the architectural patterns that has been very popular in the last couple of years, and which I think is very well articulated, is one called data mesh. Data mesh is an approach to enabling lots of different teams in your organization to create, share, and govern data products, and really drive the whole organization up the chain of data maturity so that you can move quickly and get more value from your data. From the opening line of this book, the data mesh manifesto is really centered on these things: respond gracefully to change, sustain agility, increase the ratio of value from data to investment, and be able to do all of these things at scale.

Stephen Bailey [00:04:32]: And so it's worth understanding what is proposed here. There are a few different principles at play: data as a product, a self-serve data platform, domains owning their data, and governing this data at the level of the system. It all comes together into a sort of utopian view of data. But one of the problems that I've found with a big, sweeping approach to managing data like this is that in a small, growing, or very rapidly evolving organization, it can feel very heavy-handed to try to create domains and decentralize ownership of data and operations, because, in fact, centralization can be a very efficient way to do things early in an organization's lifecycle. And you really have to ask the question, in the same way that you have to ask, do you have big data? You probably don't. Do you have a big organization? Well, many times we don't. But if you're dealing with AI, ML, data warehousing, data lakes, and streaming data, you very likely have big complexity. And so how do you handle that? Well, over the last couple of years, what we've found again and again, as we have deployed new systems into our environment and then deprecated them as well, is that to reduce complexity, you have to simplify.

Stephen Bailey [00:06:10]: And in order to simplify, you have to have a consistent pattern for standing up and monitoring these different systems, whether they're tried-and-true data warehousing systems or brand-spanking-new AI agent labeling systems. And it looks very simple, actually. You need to name the service. You need to create it with the intention to allow self-serve in the future, which really means putting in guardrails around what people can and can't do. You need to stretch the data engineering organization to own interfaces end to end. So not just focusing on, hey, is this database up? But really understanding what users are going to do to create data and consume it. You need to create a central control plane and best practices around how you do infrastructure management, monitoring, and orchestration. And then you need to measure performance. We only have a couple of minutes, and I'm probably going to go over anyway.
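
The five parts of the pattern listed above (a name, guardrails, owned interfaces, a central control plane, performance measures) could be written down as a simple checklist per service. The following is only a sketch: the `DataService` class, its field names, and the example values are hypothetical and do not come from the talk or from Whatnot's codebase.

```python
from dataclasses import dataclass, field


@dataclass
class DataService:
    """Hypothetical descriptor for one data service; field names are illustrative."""

    name: str                      # 1. name the service
    guardrails: list[str]          # 2. self-serve limits on what users can and can't do
    interfaces: list[str]          # 3. producer/consumer interfaces owned end to end
    control_plane: dict[str, str]  # 4. shared infrastructure, monitoring, orchestration
    slos: dict[str, float] = field(default_factory=dict)  # 5. performance targets

    def missing_parts(self) -> list[str]:
        """Return the parts of the pattern that are still empty for this service."""
        parts = {
            "guardrails": self.guardrails,
            "interfaces": self.interfaces,
            "control_plane": self.control_plane,
            "slos": self.slos,
        }
        return [key for key, value in parts.items() if not value]


# Example values are assumptions made for illustration only.
event_bus = DataService(
    name="event-bus",
    guardrails=["schemas declared up front"],
    interfaces=["producer client", "consumer interface"],
    control_plane={"orchestration": "Dagster", "monitoring": "Datadog"},
)
print(event_bus.missing_parts())  # ['slos']
```

A check like `missing_parts` is one possible way to spot a service that was stood up without every part of the pattern in place.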

Stephen Bailey [00:07:16]: I'm going to walk through one example here of what we call the event bus at Whatnot. This is our service that allows developers, whether they're backend, frontend, iOS, or Android developers, to generate analytical events that can then be consumed in our data warehouse, in our real-time database, or by other Kafka consumers in the application environment. What it looks like at a high level is this. You have several event producers. I'm showing only one here, but you have multiple event consumers. You have a couple of different technologies underneath. We even have a schema registry, defined in Protobuf, that syncs with both the event producers and the event consumers. This is a fair amount of complexity, but at the end of the day, what really matters to the business is that we can produce events, we can consume events, and we have a way to define them very easily that allows us to create contracts between consumers and producers. Everything else is just implementation details.
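
To make the produce/consume shape concrete, here is a toy in-memory sketch in Python. It is not how the event bus described in the talk is built (that one sits on Kafka and a Protobuf schema registry); the class `InMemoryEventBus`, the event name `bid_placed`, and the two sinks are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryEventBus:
    """Toy stand-in: producers publish by event name, consumers subscribe to it."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> int:
        """Deliver the payload to every subscriber and return how many received it."""
        handlers = self._handlers.get(event_name, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = InMemoryEventBus()
warehouse_rows: list[dict] = []
realtime_rows: list[dict] = []
bus.subscribe("bid_placed", warehouse_rows.append)  # e.g. a warehouse loader
bus.subscribe("bid_placed", realtime_rows.append)   # e.g. a real-time database sink

# One produced event reaches both consumers without the producer knowing about them.
delivered = bus.publish("bid_placed", {"auction_id": "a1", "amount_cents": 500})
```

The producer only knows the event name and payload; which consumers exist is an implementation detail behind the bus.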

Stephen Bailey [00:08:23]: And so when we think about the event bus, this is the thing that we actually think about. I would argue you can do the same thing with any system, whether it's a real-time online analytics system or a machine learning inference endpoint deployment system. It looks something like this, kind of like a pipe. For guardrails, we use data contracts: declarative schemas in Protobuf that sync with the clients. As I mentioned, we worked really closely with the iOS, Android, and backend developers to implement producer clients that were event-bus aware. They weren't just the Kafka client, and they weren't just the Segment SDK. They're actually wrappers around these things that can integrate with the rest of our event bus system. And then we designed really simple, ergonomic, easy-to-use consumer interfaces that anybody in the company could use.
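
A producer client that is "event-bus aware" might look roughly like the wrapper below: it checks an event against its declared schema before handing it to the underlying transport. This is a sketch under stated assumptions. The talk describes Protobuf schemas and wrappers around the Kafka client and the Segment SDK; here a plain dict of field names stands in for the schema, a callable stands in for the transport, and `EventBusProducer`, `listing_viewed`, and the field names are hypothetical.

```python
from typing import Callable


class EventBusProducer:
    """Hypothetical wrapper: enforces the declared contract before sending."""

    def __init__(self, send: Callable[[str, dict], None], schemas: dict[str, set[str]]):
        self._send = send        # underlying transport (stand-in for a real client)
        self._schemas = schemas  # event name -> set of declared field names

    def emit(self, name: str, payload: dict) -> None:
        if name not in self._schemas:
            raise ValueError(f"event {name!r} has no declared schema")
        unexpected = set(payload) - self._schemas[name]
        if unexpected:
            raise ValueError(f"undeclared fields: {sorted(unexpected)}")
        self._send(name, payload)


sent: list[tuple[str, dict]] = []  # in-memory stand-in for the real transport
producer = EventBusProducer(
    send=lambda name, payload: sent.append((name, payload)),
    schemas={"listing_viewed": {"user_id", "listing_id"}},
)
producer.emit("listing_viewed", {"user_id": "u1", "listing_id": "l9"})
```

Because the guardrail lives in the wrapper rather than in each application, an event with an undeclared name or field is rejected at the producer instead of surfacing later in the warehouse.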

Stephen Bailey [00:09:27]: For monitoring and orchestration, we also use the same tools, Dagster, Datadog, and Terraform, to stand up this service as we do with any other service. That makes it much easier for data engineers on the team to manage, because there's basically one place to go if you want to orchestrate it or look at its status, one place to go if you want to look at SLOs, and one place to go if you want to look at the Terraform. Then finally, we measure performance using SLOs. The event bus SLOs and measurements live in the same place as those of all our other services, and we can measure them in the same way as everything else. Again, this looks very simple, almost embarrassingly simple, but it takes a lot of work and a lot of consensus to create a pattern that is enforceable across almost any system you want to stand up, and that your team has also bought into, understands, and is comfortable with. So when we think about ways to manage the complexity of data architecture, especially as you're going from a small team to a medium-sized team, or as you're innovating with the latest tooling, what we've found is this: I think top-down organizational patterns are great. I love the data mesh book, I love the philosophy. But what it feels like on the ground to me is that we're not missing this organizational pattern so much as we're still missing good, consistent patterns around the deployment and management of data services. And this is really just a vibe, but I feel like we're in the pre-Docker days when it comes to managing data pipes and data movement across systems. It's like we have all of the compute and tooling, but we don't quite have it articulated in a way that's modular and easily trackable, and that can bottle up the complexity so that it's very easy to move quickly.
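
As an illustration of measuring a service with SLOs, the sketch below computes the fraction of events delivered within a latency threshold and compares it to a target. The talk does not say which SLOs the event bus uses; the latency-based definition, the function names, and the sample numbers are assumptions.

```python
def slo_attainment(latencies_s: list[float], threshold_s: float) -> float:
    """Fraction of events delivered within the latency threshold."""
    if not latencies_s:
        raise ValueError("no measurements")
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s)


def meets_slo(latencies_s: list[float], threshold_s: float, target: float) -> bool:
    """True when the attained fraction is at least the target."""
    return slo_attainment(latencies_s, threshold_s) >= target


# Hypothetical numbers: 4 of 5 events arrive within 60 seconds.
samples = [12.0, 30.0, 45.0, 58.0, 240.0]
print(slo_attainment(samples, threshold_s=60.0))          # 0.8
print(meets_slo(samples, threshold_s=60.0, target=0.99))  # False
```

Using one definition like this for every service is what lets the results sit side by side in a single place.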

Stephen Bailey [00:11:49]: So thank you for your time. I'm happy to answer questions and hope you have a great rest of your day.

Skylar [00:11:57]: Awesome, thank you so much. I didn't see any questions come up in the chat, but being in the pre-Docker days definitely resonates a lot. I think a lot of the challenges I've personally faced, I should say professionally faced rather than personally... But yeah, a lot of the challenges I've faced at work really just come back to the fact that we don't treat data in the same way we treat deployable services. It definitely resonates a lot.

Stephen Bailey [00:12:29]: It's not an easy problem. I think when I was thinking through that analogy, data pipelines have an exponential curve versus a linear curve. If you deploy one new service, you're deploying one new thing, but if you deploy a new service that generates data pipelines, you could potentially be deploying a lot of different point integrations.
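
One way to read the point-integration remark is to count connections. In the sketch below, which is an interpretation and not something stated in the talk, direct integrations grow with the product of producers and consumers, while routing through a shared service grows with their sum. Strictly speaking that growth is multiplicative rather than exponential. The function names are hypothetical.

```python
def point_to_point(producers: int, consumers: int) -> int:
    """Every producer integrates directly with every consumer."""
    return producers * consumers


def through_shared_service(producers: int, consumers: int) -> int:
    """Every producer and every consumer integrates once with a shared service."""
    return producers + consumers


for n in (2, 5, 10):
    print(n, point_to_point(n, n), through_shared_service(n, n))
# 2 4 4
# 5 25 10
# 10 100 20
```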

Skylar [00:12:56]: Totally. Totally. Well, awesome. Thanks for your time. Looking forward to catching up again soon.

Stephen Bailey [00:13:02]: Perfect. Thanks.

Skylar [00:13:03]: Awesome. Take care.
