
Unified Data + AI Governance with Unity Catalog // Michelle Leon & Victoria Bukta // DE4AI

Posted Sep 17, 2024 | Views 919
SPEAKERS
Michelle Leon
Staff Product Manager @ Databricks

Michelle is a PM at Databricks, building the unified platform for Data and AI. Prior to Databricks, she led product teams at Webflow on core product and infrastructure, and was an engineer turned PM at Airbnb. She is based in San Francisco.

Victoria Bukta
Member of Technical Staff @ Databricks

Product manager

SUMMARY

In today’s multi-vendor data and AI landscapes, organizations often find themselves struggling with fragmented governance. The proliferation of diverse tools and platforms leads to increased overhead, making it challenging to maintain a unified governance strategy across data and AI assets. This session will explore what a typical multi-vendor organization looks like and highlight the common challenges they face. We’ll delve into the complexities of the current governance space, focusing on the inefficiencies and risks that arise from tool sprawl. The talk will then introduce Unity Catalog’s mission to simplify and unify governance across diverse data formats and AI assets. Attendees will gain insights into how Unity Catalog’s multi-format, multi-asset approach enables seamless governance, empowering organizations to effectively manage their data and AI resources under a cohesive framework. Join us to discover how Unity Catalog can transform your organization’s governance strategy, reducing overhead and enhancing control over your data and AI assets.

TRANSCRIPT

Demetrios [00:00:06]: And we've got not one, but two speakers coming up next. Let's see if we can bring Victoria to the stage. Where are you at? Hello. And hey, where are you at? Hey. All right.

Michelle Leon [00:00:20]: Hi.

Demetrios [00:00:21]: How y'all doing?

Michelle Leon [00:00:23]: Good, how are you?

Demetrios [00:00:24]: I'm excited for your talk. That's how I'm doing. And I got a little bit of a secret to tell you all. After your talk, I'm gonna be playing some music. So if anybody has prompts for what they want me to sing about, drop those in the chat. But in the meantime, we're gonna be learning, not just wasting away our lives with a guitar. Anyway, I'll jump off the stage.

Demetrios [00:00:51]: I'll share the screen right now so you can get your, uh, presentation going. Oh, that's not the right screen. There's too many screens back here. I think one of you has to share yours fast.

Michelle Leon [00:01:04]: I can share my screen. Um, and then, yeah.

Demetrios [00:01:12]: And meanwhile I'm going to drop the links in the chat. The GitHub link is now in the chat for everybody that is wondering. And then also, if you scroll up, but I'm going to do it again, the SingleStore Now event is here. I'm going to resend that. And you're now sharing your screen. Let me see. Let me see.

Demetrios [00:01:42]: Oh, yeah. Here we go. Rock and roll.

Michelle Leon [00:01:47]: Awesome. All right, well, hopefully, okay, hopefully this is all working. Well, hello, everyone. Michelle here, joined by Victoria. Both of us are from Databricks, and I'll quickly introduce myself, and then I'll pass it over to Vicki. I'm a PM at Databricks, based in San Francisco. A few things you can talk to me about: you can talk to me about Unity Catalog, talk to me about Delta Lake and storage formats, and then also where to find the best burritos in the Mission neighborhood.

Michelle Leon [00:02:23]: If you are in San Francisco or traveling through. Totally hit me up. All right, Vicki. Hey.

Victoria Bukta [00:02:29]: So, hi, I'm Vicki. I came from Tabular, and before my time at Tabular, I was at Shopify. I don't have an official title at Databricks yet, but hopefully I get one soon. So I was a PM at Tabular, and at Shopify I was a data engineering manager, and I ran their data ingestion team. I'm based in Toronto, Canada, and I'm interested in all things streaming and data ingestion. That's kind of been my expertise and area of interest, but I'm super interested in Unity Catalog and the interoperability story that Unity Catalog can give us. And that's what we're going to talk about today. In terms of hobbies, I love sailing and all things sailing, so you can talk to me about that.

Victoria Bukta [00:03:18]: So today we want to talk about data catalogs and, of course, Unity Catalog. But first off, what is a data catalog? When we talk about catalogs, we want a piece of infrastructure that is going to be the source of truth for all of our critical data assets, maybe ML assets, and all the properties and metadata that go along with them. That extra information gives us a lot of what we need to actually work with those assets. So let's take a look at what this looks like if you don't have a data catalog. Maybe this is what your organization looks like. Everyone's talking about open lakehouses, so here's a lakehouse architecture where maybe you have some object storage. This object storage could be S3, Azure Blob Storage, GCS. We're writing files to object storage, and these files could represent a variety of data assets.

Victoria Bukta [00:04:21]: They could be in formats like Parquet, JSON, CSV, maybe Delta or Iceberg format. And you could have various applications that are coming and bringing that data, writing it to object storage. Or sometimes we're working with unstructured or semi-structured data, and this could include anything from log files to things like audio, video, and images. In the ML space, it's really common to have these big datasets that are just collections of images. And again, various applications writing and producing these assets. This often makes up your bronze layer. And then with that data, you might use a collection of different engines for different kinds of use cases.

Victoria Bukta [00:05:05]: So Apache Spark, MLflow, maybe Snowflake, Trino. And with those engines, we're manipulating the data to produce more refined datasets, going from bronze to silver, silver to gold, which you'll then use for maybe reporting. The point here that I want people to take away is that we have multiple applications producing data, and we have multiple applications accessing that data and then processing it to create new assets. So let's go to the next slide of what this looks like under the covers. Under the covers, if you actually want to access and discover this data without a data catalog, in object storage you're just going to have a bunch of folders that you have to traverse to find those datasets and then access them. So discovery can be hard because you have to look in your object storage.
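To make that concrete, here is a minimal sketch of what catalog-free "discovery" looks like in practice: walking prefixes in object storage and guessing what the files represent. The bucket and prefix names are hypothetical.

```python
# Discovery without a catalog: walk folders in object storage and hope
# the layout tells you which datasets exist and what format they use.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Traverse the lake prefix by prefix; nothing records ownership, schema,
# or format, so all you get back is raw object keys.
for page in paginator.paginate(Bucket="acme-data-lake", Prefix="bronze/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])  # e.g. bronze/events/date=2024-09-01/part-000.parquet
```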

Michelle Leon [00:06:03]: Accessing it.

Victoria Bukta [00:06:05]: How do I get access to this data? It can be difficult, not straightforward. We have to use IAM permissions and manage those IAM groups. Then there's observability: we want to ask questions like who is accessing this data, when, how are they manipulating it, and with what engines. And then of course there's lineage: what datasets were used to produce new ones? We want to understand the whole flow of data within your organization. So let's go to the next slide and insert a catalog here. Hive Metastore has been a historic legacy catalog that people have used for a really long time.

Victoria Bukta [00:06:45]: And let's talk a little bit about some of the problems that it solved. First off, Hive gave us this ability to register a table, and this table was basically an abstraction for data that lived in object storage. We can give it a name, we can give it a location, we can register information like what format this table is in, which tells the engine how to interpret it, along with extra metadata like its schema and maybe even extra properties. All this metadata is really important because it gives engines, or whoever is accessing the data, information about how to interact with it. So let's go to the next slide. That's the historic world, but now I want to use a different asset.
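Here is a minimal sketch of that registration step via Spark SQL against a Hive-compatible metastore; the database, table, schema, and bucket path are all hypothetical.

```python
# Registering a table in a Hive Metastore through PySpark. The metastore
# records the name, schema, format, and location, so any engine that
# talks to it knows how to interpret the files underneath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("register-table")
    .enableHiveSupport()  # back the Spark catalog with a Hive Metastore
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS bronze.events (
        event_id STRING,
        ts TIMESTAMP,
        payload STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://acme-data-lake/bronze/events/'
""")
```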

Victoria Bukta [00:07:35]: Hive worked really well for CSV data and maybe Parquet data, but now we're moving into a world with more complex types of assets. So let's bring Apache Iceberg into the picture. With Apache Iceberg I could use Hive, but maybe I want to use a more native Iceberg REST catalog to manage those assets. This is great: maybe this REST catalog gives me some credential vending, so that I don't have to manage IAM permissions to get access to those files. But now what we're starting to see is a segmentation, with some datasets available in Hive and other datasets available in my Iceberg REST catalog. So observability starts to become hard because I have multiple tools in here. And again, because of multiple tools, lineage starts to become difficult as well.
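As a rough sketch of that pattern, here is what connecting to an Iceberg REST catalog with PyIceberg can look like. The URI, token, and table name are hypothetical; the access-delegation header asks the catalog to vend storage credentials so the client never needs its own IAM permissions on the bucket.

```python
# Connecting to an Iceberg REST catalog and reading a table through
# catalog-vended credentials (configuration keys per PyIceberg's REST
# catalog support; treat exact values as assumptions).
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/iceberg",
        "token": "<bearer-token>",
        # Ask the catalog to hand back short-lived storage credentials.
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
    },
)

table = catalog.load_table("analytics.orders")
arrow_table = table.scan().to_arrow()  # data is read via vended credentials
print(arrow_table.num_rows)
```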

Victoria Bukta [00:08:32]: So let's go one step further and add Glue and Delta to this picture. Maybe I want to use Delta for some of my applications; I have certain use cases for that. And maybe I have teams in my organization that are on AWS and they want to use Glue instead of Hive. Hive is legacy at this point, so they have Glue, but we haven't turned Hive off yet. This is really common in a lot of the customers that we see. So now I have three catalogs, and I have four formats.

Victoria Bukta [00:09:02]: I can have my CSV data, my Delta data, my Iceberg data, as well as maybe some unstructured data, like pictures. So I have four data formats and three catalogs. Discovery is becoming more and more segmented. It's really hard to find my assets, and I don't know how I'm going to figure out who is accessing what. How am I going to audit this thing? How am I going to map the lineage of how datasets are getting produced and from what assets? How am I going to track the data through my organization? This is becoming really, really difficult. Cool. So let's move to the next slide.

Victoria Bukta [00:09:40]: So the problem here is really that the metadata for these different assets, whether it's the Hive data, so that CSV data, whatever is sitting in Glue, whatever is sitting in my Iceberg REST catalog, is siloed because of these different metastores and catalogs. It's leading to fragmented discovery, governance, auditing, and lineage. What we're seeing here is that we don't have a unified view of my data and AI assets. We're going to have duplicated access policies that we have to create: in some cases you're accessing file paths and having to declare and create IAM roles; in the other case I can rely on the catalog. This is going to create lots of bugs in your security setup. Auditing is going to be hard, because if you have different systems, they're going to output different logs.

Victoria Bukta [00:10:33]: I'm gonna have to reconcile all of that, and the silos this causes lead to lineage gaps. Maybe my lineage tool doesn't work with all the different catalogs. Things are getting really tricky here. So, Michelle, what is the solution here?

Michelle Leon [00:10:53]: Oh, my goodness. That sounds so painful and complex, Vicki. And it's also the reality of what many, many large enterprises' and organizations' architectures look like today. It's a bunch of these different systems: you have multiple different catalogs, multiple different formats, you have to sync access control policies across all of them, and you're not able to have that unified view. It's really painful.

Michelle Leon [00:11:19]: Before we get to the solution, let's first talk about what we want. Both of us are product managers here at Databricks, and we have something like 10,000 customers, I think, using Unity Catalog in production. When we talk to them, here's what they want. People ultimately just want to manage their data and AI assets in one place: not having to split the view across all these different catalogs, or, in some cases for an AI asset, maybe not even using a catalog at all, still working directly with file paths and giving tools direct access to underlying storage, like your S3 bucket. That leads to the second thing we want, which is to govern assets through a single source of truth, instead of having multiple different catalogs or governance solutions that, as an administrator in my organization, I would have to go sync policies between. Or as a user in my organization, maybe I have to go get permissions.

Michelle Leon [00:12:24]: I mean, how many times does this happen to you? You have to go get permissions for several different tools. All you wanted to do was get this dataset so that you can do some data science or prepare it for training an ML model, and you actually had to go on this multi-day journey of requesting so many different permissions with so many different tools, and half the time you don't even know which administrator to talk to in order to get the right permission. Then the third thing people want is to be able to leverage best-of-breed tools with their data. There's tons of innovation out there: lots of great data and AI tools and query engines and platforms springing up that are purpose-built in many cases for specific use cases. Some are really great for real-time analytics; some are really great for processing massive datasets for training these GenAI models. We want to be able to leverage all these best-of-breed tools with your data without needing, again, to have this data copied across multiple different places.

Michelle Leon [00:13:30]: These are the three things we want. What we like to call this, basically our goal, is having unified governance for all of your data and AI assets, where any engine or tool or platform can read and write to it. So you have a single copy, and you also want to make sure that there's no vendor lock-in. I just talked about this amazing dream where you have this single source of truth, but when we talk to customers, they want to make sure they're not locked into a specific vendor and that they always own their underlying data. That's what this idea of an open lakehouse is built on. At Databricks, this is something that we've been talking about for years, and we're really excited to see the industry as a whole adopt this term: this concept of an open lakehouse for data.

Michelle Leon [00:14:21]: So you own your data, you own your assets, and what we provide is this pink box here in the middle, this open catalog that helps stitch together a unified view where you have unified governance for all your different data and AI assets. So tables; objects, like images, like unstructured and semi-structured data that you might be storing in an abstraction we call volumes; or AI/ML assets, such as GenAI functions that your tools or RAG applications are leveraging, your ML models, maybe your vector indices. All of that is governed together in this single open catalog solution where you have this unified view. You're able to do discovery across all these assets. Imagine earlier, maybe I had to go look at several different tools or several different catalogs to stitch together this view, or to find the specific asset or dataset or model that I'm looking for; now discovery can be in a single place. You can get lineage end to end across all of these different things: that this table, or maybe this version of a dataset, was used for training this model, or is leveraged by this vector DB, which is managing embeddings across the images that you had here in the objects.

Michelle Leon [00:15:51]: And so now you're actually able to trace lineage across all these different assets with this unified view, and also get great observability and monitoring. This is something that's very important. We're talking about this concept of an open lakehouse: everything here is owned by you, and you can view all the files here in underlying storage. But as an administrator, especially at a larger organization, something you'll really care about is being able to have observability, monitoring, and auditing across all of these different assets. And then finally I'm going to talk about this box at the top here, which is engines and platforms.

Michelle Leon [00:16:29]: I think I talked a little bit about this: the end goal is that you're actually able to leverage all of these different engines and platforms, whatever is the right tool for the job, with any of these assets that are governed in this open catalog. So it has open access and interoperability with a really broad ecosystem of tools. That's the vision for what an open lakehouse for data and AI can enable, and that's really what we envision here at Databricks and what we've built Unity Catalog for. There are three things that we see as the pillars of Unity Catalog. The first is that it's multi-format, so it supports any table format, any of the ones that Vicki was just mentioning. In reality an organization will have some stuff still in CSV and JSON, semi-structured data, and Parquet files that are more legacy, and then also maybe more advanced table formats like Delta and Iceberg. And so we really believe it's super important, and unity is in the name, to make sure that we're actually supporting any format of tabular data.

Michelle Leon [00:17:34]: And then the second is that it's multimodal. This multimodal, I think, can be interpreted in a couple of different ways. The first is that we want to provide this universal catalog that's not just for tabular assets, which is what I just talked about, but also your non-tabular data and your AI and ML assets like models. We want all of that to be governed with this universal catalog. The second interpretation is that it's also really important that Unity Catalog adopts multiple open standards: you're actually able to access your data via many of the different open standards and interfaces that are out there. And then finally, the third pillar: unified. We want to deliver on that end-to-end vision of having a single catalog which can govern access across your entire data estate.

Michelle Leon [00:18:23]: At the end of the day, that's what we hear people want; that's what you want. You don't want to have to glue together multiple different tools, or buy yet another vendor that has to go through security review just to stitch together a picture on top of all those different tools. Because you had so many tools, now you have to buy another tool to manage all those tools; that's how you get into a messy state. So here's a little bit more of an architecture diagram, or more of a marketing architecture diagram, that talks about that vision and dives a level deeper into how Unity Catalog enables this. You access this broad ecosystem via Unity Catalog implementing open standards, such as the Unity REST API, which is an open standard; I'll talk about this a little bit later, but Unity Catalog is actually open source. Unity Catalog also implements the Iceberg REST catalog API, so that any query engines which implement the Iceberg REST catalog interface are also able to talk to

Michelle Leon [00:19:19]: Unity Catalog. And then finally, with Unity Catalog you can govern and manage any of these data and AI assets, and it provides multi-format support, as I mentioned. Okay, so now, and we're probably coming up on time soon, I'll take us through a little bit of a walkthrough with some example use cases to make it more concrete. The first is something Vicki mentioned earlier, which is temporary credential vending. Say I'm a user at my organization, I'm a data scientist, and I just want to use something lightweight for analytics. I want to use DuckDB.

Michelle Leon [00:20:00]: And DuckDB actually has an integration with Unity Catalog. I want DuckDB to be able to read from tables. In the previous world, maybe I would have to connect DuckDB individually to underlying storage like GCS, S3, or ADLS. Instead, now I can just have it talk to Unity Catalog. Unity Catalog will then vend a temporary credential downscoped specifically to the files that make up that table, the actual table files and the metadata files, and then DuckDB will be able to go read it via Unity REST APIs. One last point on vending temporary credentials, because I said a lot of words, and there are technical talks online that we've done on this too: this is basically the core primitive of governance upon which higher-level governance, such as role-based access control or attribute-based access control, is built. It's the core underlying primitive that Unity Catalog implements.
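As a sketch of that DuckDB-to-Unity-Catalog flow, here is roughly what it looks like with the experimental uc_catalog DuckDB extension against a local open source Unity Catalog server. The endpoint, token, and table name follow the OSS quickstart defaults; treat the exact extension options as assumptions that may change.

```python
# Point DuckDB at the catalog, not at object storage; the catalog vends
# temporary, downscoped credentials for the table's files.
import duckdb

con = duckdb.connect()
con.sql("INSTALL uc_catalog FROM core_nightly; LOAD uc_catalog;")
con.sql("INSTALL delta; LOAD delta;")

# Register the Unity Catalog endpoint as a DuckDB secret.
con.sql("""
    CREATE SECRET (
        TYPE UC,
        TOKEN 'not-used',
        ENDPOINT 'http://127.0.0.1:8080',
        AWS_REGION 'us-east-1'
    )
""")

# Attach the 'unity' catalog and query a table through vended credentials.
con.sql("ATTACH 'unity' AS unity (TYPE UC_CATALOG)")
con.sql("SELECT * FROM unity.default.numbers").show()
```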

Michelle Leon [00:21:08]: And the second interface that Unity Catalog implements is the Iceberg REST catalog API, which enables external access, as I mentioned, to Unity Catalog tables for Iceberg readers such as Trino. And we also do credential vending via the Iceberg REST catalog API as well. Now let's get a little bit more fancy. Let's say I want to start leveraging multimodal data in this abstraction that we call volumes, which is basically just a set of files. Previously, without Unity Catalog, you'd maybe have to provide access to S3 for some of these tools and start copying and pasting file path names around, and that becomes really hard to manage. Instead, now we provide this nice abstraction called volumes, where you can provide a name, and there's other metadata about that volume.
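Here is a minimal sketch of that volumes abstraction, using Databricks SQL syntax in a notebook where `spark` is predefined; the catalog, schema, volume, and file names are hypothetical.

```python
# Create a governed volume; files then live under a stable,
# access-controlled path instead of a raw bucket URL, so tools never
# need direct S3/GCS/ADLS credentials.
spark.sql("CREATE VOLUME IF NOT EXISTS main.vision.training_images")

# Read a file through the governed volume path like any local file.
path = "/Volumes/main/vision/training_images/cats/cat_001.jpg"
with open(path, "rb") as f:
    image_bytes = f.read()
print(len(image_bytes), "bytes read through the governed volume path")
```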

Michelle Leon [00:21:57]: And then tools such as MLflow, or Unstructured.io for unstructured data ingestion, are able to read and write that multimodal data in Unity Catalog for your AI/ML use cases, like training or serving applications. Then another really interesting example; let's get even more advanced with some of the new innovations around GenAI. Something that's really taking off in GenAI, and I'm sure tons of talks at this conference have already covered it, is these ideas like agentic flows, or AI agents that can leverage tools. One thing that's really interesting about tools: at the end of the day, you can just think of these as functions. They're Python or SQL functions. And something that already exists today is that you can register Python and SQL functions in Unity Catalog, with a little bit of metadata around them: you can give them a name, a three-level namespace, a description of how to use them.

Michelle Leon [00:23:00]: You can register what the parameters are. Then with tools such as LangChain, LlamaIndex, etcetera, you're actually able to go build these applications, these agents, that can work with those functions registered in Unity Catalog. And the great thing is that all of this has that unified governance: end-to-end lineage, auditability, observability, monitoring, all with this single catalog. So that's how Unity Catalog enables broad interoperability. Finally, I'll close out with this last piece, because I think the really cool thing is that Unity Catalog is open source. This is something that we did this past summer. For everything that Vicki and I just talked about, you can find a lot of resources at unitycatalog.io; that will be your hub.
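To ground the functions-as-tools idea from a moment ago, here is a hedged sketch of registering a SQL function in Unity Catalog and handing it to an agent. The function itself is hypothetical, UCFunctionToolkit is assumed to come from the databricks-langchain package (check current docs for exact names), and a Databricks notebook with a predefined `spark` is assumed.

```python
# Register a SQL function with a three-level name and a description;
# that metadata is what lets an agent decide when to call the tool.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.tools.to_fahrenheit(celsius DOUBLE)
    RETURNS DOUBLE
    COMMENT 'Converts a temperature from Celsius to Fahrenheit.'
    RETURN celsius * 9 / 5 + 32
""")

from databricks_langchain import UCFunctionToolkit  # assumed package/name

# Wrap the governed function as a LangChain tool.
toolkit = UCFunctionToolkit(function_names=["main.tools.to_fahrenheit"])
tools = toolkit.tools  # pass these to a LangChain agent executor
```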

Michelle Leon [00:23:52]: If you want to learn more about Unity Catalog, the roadmap, and upcoming integrations, you can also check out the GitHub repo. I would love to invite everyone here, if you're interested in learning more about Unity Catalog, contributing to open source, or contributing to the discussions about what additional data types or types of AI/ML assets we should be supporting in Unity Catalog: please join us in the Slack and GitHub discussions and our community meetups, which we host biweekly on Thursdays with the Linux Foundation AI & Data foundation. So please join us, because I think the core reason why we decided to open source Unity Catalog is, of course, enabling access and freedom from vendor lock-in, but also to spur innovation. So with that, thank you so much for your time.
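For readers who want to poke at the open source server mentioned above, here is a small sketch of hitting its REST API directly. The route and response shape follow the OSS quickstart's local server; treat them as assumptions and check the repo docs.

```python
# List the tables registered under the catalog and schema that ship
# with the open source Unity Catalog quickstart server.
import requests

BASE = "http://localhost:8080/api/2.1/unity-catalog"  # assumed local default

resp = requests.get(
    f"{BASE}/tables",
    params={"catalog_name": "unity", "schema_name": "default"},
)
resp.raise_for_status()
for table in resp.json().get("tables", []):
    print(table["name"], table.get("table_type"))
```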
