MLOps Community

Reproducible data science over data lakes // Ciro Greco // DE4AI

Posted Sep 18, 2024 | Views 435
Ciro Greco
Founder and CEO @ Bauplan

Ciro Greco, ex-VP of AI at Coveo. Ph.D. in Linguistics and Cognitive Neuroscience at Milano-Bicocca. Ciro worked as a visiting scholar at MIT and as a post-doctoral fellow at Ghent University. Currently "building something new" at Bauplan.

In 2017, Ciro founded Tooso.ai, a San Francisco-based startup specializing in Information Retrieval and Natural Language Processing. Tooso was acquired by Coveo in 2019. Since then Ciro has been helping Coveo with DataOps and MLOps throughout the turbulent road to IPO.

SUMMARY

As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.

TRANSCRIPT

Skylar [00:00:08]: I'm gonna bring up our next speaker. Welcome. How you doing today?

Ciro Greco [00:00:13]: I'm pretty good.

Skylar [00:00:14]: Can you hear me well? Yes, we can hear you loud and clear. I'm very excited for this talk. I have told Ciro before that I have created a worse version of what he's building internally at my company, and I'm excited to have a professional version of it. So I'm excited for him to share everything he has with you today.

Ciro Greco [00:00:37]: Can you see my screen or not? Because I don't see.

Skylar [00:00:40]: There we go.

Ciro Greco [00:00:41]: I see. Okay. Okay. So there was something that you needed to do?

Skylar [00:00:44]: Yeah.

Ciro Greco [00:00:45]: All right, awesome. Well, thank you very much. It's great to be here. So, my name is Ciro, and I'm the founder of Bauplan. Today we're going to talk about reproducibility of pipelines over data lakes, which is a problem that we have had for a long, long time and really wanted to find solutions for. I'm going to talk mostly about the open source technologies and the design principles that we used while building our platform. You don't have to use our platform to use these things. This is hopefully going to be useful for everybody: lessons we learned so that there's no need for you to relearn them from scratch.

Ciro Greco [00:01:28]: Essentially, the feeling that a lot of folks have when they develop on data is that the same pipeline is never the same. It's like a river where nobody can bathe twice, because everything is moving and everything is shifting. And that's mostly because data is effectively an open system, and it gets complicated to debug and reproduce issues when the time comes. We kind of wanted to step back and understand the main reason why reproducibility is effectively a major problem in data projects. And the reason is that reproducing stuff on a data system requires a bit more than what we're usually used to in software engineering. This is an example that I am sure a lot of people have lived firsthand: well, yesterday a pipeline broke, and now it's today, and we need to fix it. We don't know exactly what happened.

Ciro Greco [00:02:28]: It can be the code, it can be the data, it can be the environment, it can be the cloud architecture. We still don't know. So the first thing that you want to do, following the usual best practices in software engineering, is to reproduce the problem deterministically, so you can investigate what happened by recreating essentially the same problem in a different timeframe. But when you try to do that, it often happens that, well, the data has changed: a day went by and the data is now different, so we can't really reproduce the exact dataset. The code also has changed: did somebody push new code into production? And the environment has changed, because new dependencies have been introduced along with the code. So it becomes complicated to get the deterministic reproduction that I would need for an investigation that doesn't take days but is easy to carry out.

Ciro Greco [00:03:27]: There's a reproducibility checklist, if you will: a number of things that you need to have versioned across time if you want to do this on a data system. It is essentially same code, same data, same environment and, for the pedantic, same hardware. What you see here in the picture, taken from a paper that we published at SIGMOD this year, is just an example. You have a certain data frame of a certain size, like 1 billion rows, that you want to version; then you have certain code that corresponds to a certain commit; and then you have a runtime that is dockerized and containerized, so you have certain dependencies frozen in time. And because you execute stuff, hopefully in the cloud, you can control what hardware you have. This is what you would have, in principle. We're great fans of data lakes: we focus a lot on building on data lakes and on object storage.
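To make that checklist concrete, here is a minimal sketch of what pinning those four dimensions could look like as a "run manifest". All field names and example values are illustrative assumptions, not Bauplan's actual data model.

```python
# Illustrative only: a tiny "run manifest" that pins the four dimensions of the
# reproducibility checklist. Field names and values are hypothetical.
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class RunManifest:
    data_ref: str       # snapshot/commit of the data catalog (same data)
    code_commit: str    # git SHA of the pipeline code (same code)
    image_digest: str   # container image digest (same environment)
    instance_type: str  # cloud instance type (same hardware, for the pedantic)


manifest = RunManifest(
    data_ref="main@2f9c1ab",        # hypothetical catalog reference
    code_commit="9b3d2e7",          # hypothetical git commit
    image_digest="sha256:4f1c0aa",  # hypothetical image digest
    instance_type="m5.xlarge",      # hypothetical instance type
)

# Persisting this alongside every run is what makes the run replayable later.
print(json.dumps(asdict(manifest), indent=2))
```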

Ciro Greco [00:04:26]: It's a very widespread architecture in enterprises. And essentially there's a new version of that architecture coming up these days, because of what is happening in the open source community with open formats, and what is happening with certain vendors that are re-architecting the stack in a way that is object storage friendly, like Databricks, or Dremio, or Starburst and Powerland. And the truth of the matter is that building on object storage has a lot of advantages for enterprises, but it also tends to create messy data stacks. So while you go through that reproducibility checklist that I was describing before, where you try to reproduce the same data, the same code, the same environment, to make sure that you can time travel in the right way, in a system like this it can be very hard, because things can be scattered all over the place, because all the different pieces might not have been built organically to address this problem. So sometimes there's a piece missing, or sometimes the pieces are there but you have to patch them together, and it's not exactly easy to understand where things are. So this problem needed, at least on our end, a way to think about it that is a bit more systematic.

Ciro Greco [00:05:44]: So we were just like, all right, don't panic. Let's think a little bit about the things that we would need no matter what. And we came up with this kind of hierarchy of abstractions where, whatever your implementation is, you're going to need a data layer where the data is stored, and that's the place where you want to make a decision about data being versioned throughout time. Then you're going to have a compute layer, where you're going to have to keep track of what the runtime looks like. Is that a Spark cluster? Is that just a Python runtime that you run, perhaps using an Airflow deployment? Or is it something else, like an OLAP SQL engine? Whatever it is. And then the code layer, which luckily enough is the part that is already mature in most organizations: taking care of versioning your code and having a git system where your team can work.

Ciro Greco [00:06:38]: These three layers need to be somehow engineered as one, so that you can organically travel through this hierarchy and be able to reproduce your pipelines. That's what we did. That's not necessarily the only way to do it, but that's what we did. We chose object storage for the reasons that I discussed before: we're great fans of data lakes and we think it's the future of data infrastructure. We chose Parquet because it's a very well-known open format, but we decided to represent the files as Iceberg tables. For those of you who are not familiar, Iceberg is an open table format that will represent essentially your Parquet files as tables using a metadata layer, and give you all the things that you usually have with tables, like transactions or incremental updates and so on.
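As a small illustration of what that metadata layer buys you, here is a minimal read sketch using pyiceberg. The catalog name, endpoint, table identifier, and snapshot id are placeholders, and a real setup would also need credentials and warehouse configuration.

```python
# A minimal sketch: reading an Iceberg table that fronts Parquet files on
# object storage. Names and the snapshot id below are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",                                       # hypothetical catalog name
    **{"uri": "https://my-catalog.example/api"},  # hypothetical REST endpoint
)

table = catalog.load_table("analytics.orders")    # hypothetical namespace.table

# The metadata layer gives you table semantics over raw Parquet: schema,
# snapshots, transactions. Reading a past snapshot is time travel.
current = table.scan().to_arrow()
older = table.scan(snapshot_id=1234567890).to_arrow()  # hypothetical snapshot id
```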

Ciro Greco [00:07:30]: It's a very useful way of thinking about and interacting with your data. Those tables are then stored in a data catalog. We chose Nessie, which is a very cool open source project maintained by the folks at Dremio. And the cool thing about Nessie is that it gives you the possibility of zero-copying your entire data lake, so multiple tables at once. You now have versioned tables, so you can time travel, which is the first thing we want to have. But also, because of this abstraction, you can effectively create branches, which is great because now you have sandboxes, and sandboxes are great when you want to debug. For the rest, we use our own runtime, which is the core of our platform. You can use whatever runtime you like, but the essential part is to make sure that the runtime is containerized, so you have the possibility of traveling back in time and reproducing the environment.
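To give a feel for the git-for-data semantics described here, below is an illustrative pseudo-client. It is not the real Nessie client API; the class and method names are assumptions meant only to show why branching is cheap: a branch is just a new reference to existing table metadata, so no Parquet files are copied.

```python
# Illustrative only: branches as zero-copy references to table metadata,
# mirroring the git-like semantics of a Nessie-style catalog. Not a real API.
class GitLikeCatalog:
    def __init__(self) -> None:
        # branch name -> commit hash; commits point at table metadata, not data files
        self.refs = {"main": "c0ffee1"}

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        # Zero-copy: only a pointer is duplicated, no data files move.
        self.refs[name] = self.refs[from_ref]

    def commit(self, branch: str, new_hash: str) -> None:
        # Writes land only on the branch, leaving main untouched.
        self.refs[branch] = new_hash

    def merge(self, source: str, target: str = "main") -> None:
        # Promote the sandbox state: production now points at the new metadata.
        self.refs[target] = self.refs[source]


catalog = GitLikeCatalog()
catalog.create_branch("debug_yesterdays_incident")     # a sandbox for debugging
catalog.commit("debug_yesterdays_incident", "deadbe2")  # changes stay in the sandbox
catalog.merge("debug_yesterdays_incident")              # promote once verified
```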

Ciro Greco [00:08:20]: And the code versioning is the usual: use an IDE, use a git system. What you're going to get out of this is that even your runs can now be immutable and reproducible: you're going to have the same code and the same environment running on the same hardware in the cloud. Once you have all these three things, you can check all the boxes of the reproducibility checklist, and you can time travel along all the dimensions that you need to reproduce an issue in a dynamic system like a data system. Now, all this is great, but these projects might be very different from one another. How do we interact with all these different components? The important part for us was how to think about APIs that allow people to work with a system like this, hiding the complexity of all the moving pieces, and give the developer just a few entry points that can be memorized fairly fast.

Ciro Greco [00:09:23]: There's a video here that just gives you an idea of the abstraction set we chose. Essentially, we have a bunch of APIs that you can call from your CLI that are really reminiscent of the syntax that you use in git. So you can do commands like bauplan branch, which shows you all the different zero copies of your data catalog. These are versions of your data catalog. You can create a new version, a new branch, again with bauplan branch, and then you do branch checkout, very reminiscent of git. Once you are in one of these branches, you have the tables that have been zero-copied into your branch, and you can explore all the data that you have. Now, crucially, once you have your branch, you want to run some computation on it and change some data artifacts, create a new table. In this case, we're going to create a new table.
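For readers following along, the branching workflow just described could look roughly like the sketch below. The command names follow what is said in the talk, but the exact bauplan CLI syntax, subcommands, and branch naming are assumptions and may differ from the documented interface.

```python
# Rough sketch of the git-like data-branching workflow from the talk.
# Exact CLI syntax is an assumption; treat these commands as illustrative.
import subprocess

def sh(cmd: str) -> None:
    print(f"$ {cmd}")
    subprocess.run(cmd.split(), check=True)

sh("bauplan branch")                           # list the zero-copy branches of the data catalog
sh("bauplan branch my_debug_branch")           # create a new branch, i.e. a sandbox
sh("bauplan branch checkout my_debug_branch")  # point your session at the sandbox
```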

Ciro Greco [00:10:18]: But the crucial point about the runtime (in this case it's our runtime, but it doesn't matter much) is that all the functions need to be containerized. Each of the functions that you see running on the screen right now is running in the cloud as a container that I can checkpoint over time. If anything goes wrong, I can always reproduce every single node of my pipeline in terms of the dependencies and the libraries that were used to execute that code. And once you have that, you can again leverage the same git-like ergonomics and just go: I'm going to check out main, then I'm going to do branch merge, and I'm going to take these new artifacts, the changes that I made in my pipeline, into my main branch, which is effectively your production data environment. So ideally you have just a handful of API calls to handle this complexity and do code versioning, time travel, dockerization, and execution in the cloud, so you can go back in time and reproduce your stuff. We talk a lot about this stuff, so get in touch if you want to chat about it. We are pretty passionate about making things easy for developers on data lakes, which tend to be quite complicated. And that's all I've got for you guys.
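Continuing the earlier sketch, the closing steps (run the pipeline on the branch, then promote it to main) could look like this. The checkout and merge command names follow the talk; the run command and exact syntax are assumptions.

```python
# Rough sketch of the final steps: execute on the sandbox branch, then merge
# back into main (production). Exact CLI syntax is an assumption.
import subprocess

for cmd in (
    "bauplan run",                           # assumed command: run the pipeline; each function executes in its own container
    "bauplan branch checkout main",          # switch back to the production view of the lake
    "bauplan branch merge my_debug_branch",  # zero-copy merge of the new artifacts into main
):
    print(f"$ {cmd}")
    subprocess.run(cmd.split(), check=True)
```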

Skylar [00:11:43]: Awesome. Thank you so much. This was super exciting. Definitely going to reach out to learn more after this but thank you so much for sharing with us.
