MLOps with Databricks
SPEAKER

Maria is an MLOps Tech Lead with over 10 years of experience in Data and AI.
For the last 8 years, Maria has focused on MLOps and helped establish MLOps best practices at large corporates.
Together with her colleague, she co-founded Marvelous MLOps to share knowledge on MLOps via training, social media posts, and blogs.
SUMMARY
The world of MLOps is very complex: there is an endless number of tools serving its purpose, and it is hard to get your head around them all. Instead of combining and managing various tools, it may make sense to opt for a platform instead. Databricks is a leading platform for MLOps. In this discussion, I explain why that is the case and walk you through Databricks MLOps features.
TRANSCRIPT
Maria Vechtomova [00:00:00]: So I'm Maria. I'm MLOps Tech Lead at Ahold Delhaize. I'm also a co-founder of Marvelous MLOps, a company where we teach other people about MLOps, and MLOps with Databricks specifically as well. I'm very critical when it comes to my coffee. You probably know about that already. I like to drink a latte with oat milk, and you need to have a perfect amount of coffee, perfectly ground, with a perfect micro... how do you call it? Microfoam. So yeah, I can make a better coffee than most coffee shops.
Demetrios [00:00:42]: We are back for another MLOps Community Podcast. I am your host, Demetrios, and today I have the pleasure of speaking with Maria from Marvelous MLOps. If you are in the MLOps universe at all, you probably have heard of her. This episode does a gigantic deep dive. I cannot stress enough how deep we go into Databricks. The pros, the cons, the good, the bad, the ugly, everything about it. And stick around. At the end she talks a bit about the course that she's going to be doing and also the book that she is writing.
Demetrios [00:01:17]: So if you've ever had the urge, or if you are currently working with Databricks and you're doing some kind of ML or AI, you should probably get in touch with her, because she is by far the expert in this world. Let's dive into it: everything you need to know about Databricks and MLOps on Databricks. The last time that I spoke with you, you were telling me how, yeah, I think I'm gonna start posting more on LinkedIn. And oh boy, did that work out.
Maria Vechtomova [00:01:57]: Yeah, indeed. It was actually exactly two years ago that we recorded the first podcast together, and a lot has changed since then. A lot.
Demetrios [00:02:11]: Well, what have you been up to?
Maria Vechtomova [00:02:14]: Well, I think two years ago I had this urge to start posting about what I do on LinkedIn, and also writing articles, because I had seen so many experts on LinkedIn and everywhere, you know, talking about MLOps and what it is, and I thought that a lot of the things out there are just not true. And I've been doing it for many years, so probably my voice should also be heard. Well, I've kept going since then. It's really, really cool. I met a lot of awesome people and, yeah, I love it.
Demetrios [00:02:48]: And you've focused the majority of — not necessarily your content on LinkedIn, but really the content that you do beyond LinkedIn — on deep diving into Databricks. And I would love to talk about Databricks, like why you chose it. This is not sponsored by Databricks by any means, although they are a proud sponsor of the MLOps Community and we love working with them. But specifically, I want to talk about the good, the bad, and the ugly from somebody who's a user, somebody who's done the majority of their certifications, and who is also now giving courses on how to better use Databricks. I would love to just start with why you even chose Databricks in particular.
Maria Vechtomova [00:03:38]: Yeah, yeah, that's really a good question. So I've been doing MLOps for a very long time, like more than eight years, before it became a thing. We built our own tooling around model registry and experiment tracking, using, you know, Teradata databases or any other SQL database as a backend, and JFrog Artifactory for model artifact storage. And we were doing it when no one was doing it, and it was really, really cool. Since then I've tried doing MLOps with all kinds of different tools, on the cloud, on-prem, and, well, as I said, built tools myself. And I feel like I've seen it all to a certain extent — like, how do you connect the different pieces of your ML setup to make sure it's robust and you can roll back things whenever needed. And there are a lot of platforms and there are a lot of tools.
Maria Vechtomova [00:04:37]: So when you look at the tools, you can find an almost perfect tool for each of the components of your ML setup. But combining them all together is complicated; you need to invest a lot into that and have a big team to be able to do it. Platforms, on the other hand, are less customizable. But they have all these MLOps components which you need — model registry, compute, a serving component, a monitoring component, things like that for data versioning — they have them all. They're not perfect, but you still may want to get one working for your use case instead of trying to combine all the existing tools. If you're on-prem, you probably don't really have a choice. You have to go open source, and then you have this luxury of choosing.
Maria Vechtomova [00:05:33]: But when you are on the cloud, if you are in a large organization, bringing in any tool is really, really complicated. You know, all these sourcing procedures, getting trusted vendors, all of that — it's a lot of pain for everyone included in the process. So you want to go through this process only if you really, really want to and have to. And the platforms have really gotten good enough. I feel that wasn't the case like three years ago, but now platforms are good enough for pretty much everything you need to do for MLOps. However, if you look on the Internet at what people say, how people tell you you should do things — in my opinion, a lot of it is just not correct; it's not promoting best practices. Like, if you look at Databricks-specific training, it's all around notebooks. If someone follows me on LinkedIn, you know I don't really like notebooks, and for a reason.
Maria Vechtomova [00:06:33]: I started as a data scientist myself. I used notebooks a lot myself. That was the only way Python was taught to data scientists. In 2016, any course was just plain notebooks. And when I started learning Python, that was the way. And then I started trying to put my first models into production, like building an endpoint — that was the use case we had. And I saw how much of a struggle it was to get from a notebook to something production ready.
Maria Vechtomova [00:07:06]: It was a lot of pain. And that was the first revelation for me: why notebooks are not really a great place to start. And unlearning notebooks is really, really hard, because people have been using them for years; they know how to do things easily with them. You know how it is: if you're used to a certain tool, getting to another tool, another way of doing things, is extremely hard. And the longer you've been doing it, the harder it is. It's just like with anything in life. And in my opinion, notebooks harm the MLOps life cycle much more than anything else out there. So that's just how I feel about it.
Maria Vechtomova [00:07:48]: So we need to teach data scientists to do data science properly, not using notebooks, or at least teach them how to translate a notebook into something that is production ready, more or less. That's what I try to talk about a lot, and that's also what we teach in the course. So why Databricks specifically? That's maybe another question. Well, if you look at all the platforms that are out there — there was, I think, a survey that the Ethical Institute of AI did. It was a questionnaire, and they asked professionals in the field: what tools are you using for model registry, for data lake, for serving, for training? And Databricks came out as the number one tool. And it's not a surprise to me, to be honest. Databricks is growing like crazy.
Maria Vechtomova [00:08:43]: So Databricks has pretty much become the tool of choice for ML these days in many companies, just because it's easy. Data engineering is done on Databricks, all the data is in Unity Catalog on Databricks — it's just a logical step forward to do ML there as well: train your models and maybe even serve and do monitoring on it too. So there are many options out there. But on another hand, it's not just the leading platform; I feel it's very different from other vendors that I've seen. They're very open to people criticizing them. When you tell them you don't like something, it's not like they're surprised.
Maria Vechtomova [00:09:26]: They actually know that that's the case, but it's somewhere not in their first priority, because they have other things that are more important for them to fix. And yeah, if I have thought of a cool feature on Databricks that might be useful, they have thought about it too, and they're probably working on it already. That's something that I haven't seen in any other vendor out there, and it's pretty impressive.
Demetrios [00:09:56]: So there's so much to unpack here. Let me try to go about it, because the first thing that jumped to my mind was you were mentioning how platforms these days feel good enough. They weren't three years ago, but now — I imagine when you say platforms you're talking about the Vertex AIs and the SageMakers out there. Yes, the Databricks platforms. And so you can get enough from that platform that it is not necessary to go out and cobble together many different tools.
Maria Vechtomova [00:10:30]: Yes, indeed.
Demetrios [00:10:31]: The other piece that I think is crucial that you mentioned is: we've all been through trying to onboard a new vendor, and we've probably thrown our fists up in the air and said, I'm never doing this again. I do not want to answer any more questions from the DevOps folks or the DevSecOps team about this vendor, or the compliance and the ISO or the SOC 2, whatever it may be. It takes so long. I literally just went through this, where I chose one tool only because they were already accepted into the ecosystem and we didn't need to do the onboarding process. Even though I knew it was a worse tool — I wanted a different tool that I enjoy much more — it's like, oh, if we want to do that, then we're looking at a two-month lead time to onboard that new vendor, versus let's just use what we have and amplify the capabilities of this tool that's already been accepted. And so it feels like that is where a lot of us stand. And that is also one of the reasons that you chose Databricks: you saw that survey and you said, looks like a lot of people are on it.
Demetrios [00:11:54]: I can only see this growing, because their capabilities are growing, and with the stuff that folks are doing on Databricks, it's inevitable that it's going to grow. If you're doing your data engineering there, the data has gravity, and you're probably going to start branching out, if you haven't already, on different use cases.
Maria Vechtomova [00:12:15]: Yeah. Well, the reason we did everything on Databricks is because we already had Databricks. And Databricks was not per se perfect three years ago, a bit more than that, but even back then you could do enough things on it for it to be pretty okay. And it was still a better choice than the Azure components that we also had access to. So basically, the company I worked for is on Azure and we also use Databricks. And if I have to choose between Azure and Databricks, I would always choose Databricks, just because it's way easier, in my opinion, to do MLOps on it.
Demetrios [00:12:56]: That's the reason why. But I also want to know the bad part. You mentioned that Databricks is pretty open to criticism. There are things, though, that I know when I talked to you, you said: yeah, this is how it's done in Databricks, but it's probably not the best way to do it. Whether that's just trying to work in notebooks and pretending like those are ready for production, or — I know we had also talked about the Databricks Feature Store and how that isn't necessarily the best way of doing things. So maybe you have some best practices that you've found as you've been...
Maria Vechtomova [00:13:41]: Yeah.
Demetrios [00:13:42]: ...going through Databricks and learning about it, or getting deeper into it.
Maria Vechtomova [00:13:47]: Yeah, definitely. So I think one of the biggest pain points, still until today, is the development process on Databricks. As I mentioned, everything is around notebooks, and I think ML code must be packaged. That's just how it is: if you want professionally written ML code, it must be packaged. And it is possible to do that on Databricks; it's just not very straightforward. How so? For example, Databricks comes with the notion of runtimes. A runtime is pretty much a containerized version of your environment.
Maria Vechtomova [00:14:20]: It has pre-installed packages and other software, and you basically want a reproducible environment locally to develop in. Because if you develop in a notebook, let's say, you can't really reproduce exactly the same environment locally. It's just impossible; you can get an approximation of it. But because it's an approximation, you can only develop up to a certain state, and to test, you want to test on Databricks again — but you don't want to keep pushing back and forth between notebooks and local development. So there are other ways, like using Asset Bundles, for example. That's something that I absolutely love using. And it is an underestimated, I think,
Maria Vechtomova [00:15:09]: way of developing on Databricks. And we actually have a lightning session about it next week. It's probably not going to come out before this podcast does, but there will be a recording, so maybe we can insert the link somewhere.
Demetrios [00:15:24]: And Asset Bundles are just a feature of Databricks?
Maria Vechtomova [00:15:28]: Well, Asset Bundles indeed are a feature developed by Databricks. It has a lot of components by itself, I would say. But first of all, it's a way of defining a workflow. So if you have an orchestration pipeline on Databricks, you can define it in a JSON file, or you can define it in a databricks.yml file, which is the definition of the Asset Bundle. And whenever you deploy that bundle, your workflow definition gets deployed together with all the assets, packages, and other files that are required for your deployment. It used to be really, really hard to deploy things on Databricks, because you had to take care of all of that yourself: making sure your packages get uploaded to Databricks, your Python files get uploaded to Databricks, and all the other files that you need.
Maria Vechtomova [00:16:26]: And you had to build your own logic around it to make sure it was all there. We actually built the whole thing for it — basically something very similar to Asset Bundles, internally — and now we are deprecating it because Asset Bundles does it all. But it's not just for the workflows, it's also for development. And that's something that people don't talk about much, I feel. People use it for deployment, but they don't really use it for development. But for development, I believe, it's also a really, really nice feature.
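For readers who haven't seen one: below is a minimal sketch of what a databricks.yml Asset Bundle definition can look like, in the bundle's native YAML. The project name, entry point, and cluster spec are illustrative assumptions, not details from the episode.

```yaml
# databricks.yml -- a minimal Asset Bundle sketch (all names hypothetical)
bundle:
  name: house_price_model

artifacts:
  default:
    type: whl        # build the project as a Python wheel and upload it
    path: .

resources:
  jobs:
    train_model:     # the workflow definition, deployed together with the code
      name: train-house-price-model
      tasks:
        - task_key: train
          python_wheel_task:
            package_name: house_price
            entry_point: train
          job_cluster_key: default
      job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: 15.4.x-cpu-ml-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 1

targets:
  dev:
    mode: development   # isolates deployed resources per developer
    default: true
```

Deploying is then a single `databricks bundle deploy -t dev`, which uploads the wheel and all referenced files along with the job definition.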
Demetrios [00:16:57]: So that's a great one on the development process. What about — what exactly was it that we were talking about the other day with the Feature Store and how that works?
Maria Vechtomova [00:17:07]: Well, there are many feature stores available out there, right? It's not just Databricks; you have Feast, Hopsworks, and other tools available. Those other tools focus more on just being feature stores, but Databricks has all kinds of components in it. So if you look at the Databricks Feature Store, there are pretty much two constructs — two ways you can interact with the Feature Store: a feature function and a feature lookup. A feature lookup basically defines how you look up a key in a feature table. So, say, you want to look up a customer ID and return some values from the table for that customer ID — that's what you would use it for. It doesn't really have a fallback, so if it doesn't find the key, it will just return None, which is by itself not very convenient.
Maria Vechtomova [00:18:07]: So to work around that you could use a feature function: for example, if the value is None, then look up another thing or return a default value instead. The feature function by itself, I think, is quite an ugly construct. You have to define your Python function in SQL, and that's the only way. I don't know who came up with it in the first place, but I'm just not a fan of it, for many reasons. First of all, there is no versioning of that thing, right? There is, of course, version control that you have, but when you create this function, there is no way to point it to a version of the code, like a version of that function, that is used. Also, that function will behave differently depending on the runtime you use, of course, because of the Python version and the versions of the Python libraries you use in the import statements in your SQL query. It will behave differently if you are on Python 3.10 or Python 3.11, and that Python version is defined by the runtime. And when you are running on serverless, it's even worse, because you can't choose the runtime. There is a concept of environments in serverless, so it may get very confusing for people how that function by itself behaves.
Maria Vechtomova [00:19:28]: Also, that function has certain limitations when it comes to serving. So, for example — there is also a thing called feature serving on Databricks, and the idea is actually quite good, right? Sometimes you want to serve just features and not models: you want to look up a certain key in a table and return things back. It's quite convenient. And you can also serve a combination of feature functions and feature lookups, like a stack: it's a list that you define, and the order of the list defines the order of execution of those elements. There is no conditional statement, so all of these things always get executed. All of this together is defined as a feature spec — that's what it's called, and that's what you serve. And the feature function there, if it has to output some complex data type — I don't know, something that is not an integer or a string — it will fail.
Maria Vechtomova [00:20:29]: I hope they will fix it soon, because it seems like not really intended behavior. Things like that. Also, the feature engineering package itself only works in a notebook or in a Databricks environment; you can't run that Python code on your own machine. So I can go on and on with this. To be honest, I'm not a fan of this feature.
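To make the two constructs concrete, here is a minimal sketch using the databricks-feature-engineering package — which, as Maria notes, only runs in a Databricks environment, where a `spark` session is available. The table, function, and column names are hypothetical.

```python
# A sketch of the two Feature Store constructs: a lookup plus a
# SQL-defined Python feature function used as a fallback.
from databricks.feature_engineering import (
    FeatureEngineeringClient,
    FeatureFunction,
    FeatureLookup,
)

fe = FeatureEngineeringClient()

# The feature function must be defined as a Python UDF in SQL -- here it
# fills in a default when the lookup returned None for a missing key.
spark.sql("""
CREATE OR REPLACE FUNCTION main.default.loyalty_or_default(loyalty_score DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
return loyalty_score if loyalty_score is not None else 0.0
$$
""")

features = [
    # Look up feature values by key; returns None when the key is absent.
    FeatureLookup(
        table_name="main.default.customer_features",
        lookup_key="customer_id",
        feature_names=["loyalty_score"],
    ),
    # List order defines execution order; there is no conditional,
    # so every element always runs.
    FeatureFunction(
        udf_name="main.default.loyalty_or_default",
        input_bindings={"loyalty_score": "loyalty_score"},
        output_name="loyalty_score_filled",
    ),
]

# The feature spec bundles lookups and functions together so the stack
# can be served behind a feature-serving endpoint.
fe.create_feature_spec(
    name="main.default.customer_feature_spec", features=features
)
```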
Demetrios [00:20:56]: It sounds like a mess, to be honest.
Maria Vechtomova [00:20:58]: Yeah, in my opinion it's quite a big mess. And that's too bad, because I think there is great potential in it. And another thing that I find one of the most frustrating pieces, to be honest: most of machine learning is still done in Pandas, right? I mean, now Polars is becoming a thing and people are actually migrating to Polars, but still not all the libraries support it. So Pandas is still pretty much mainstream. So, feature traceability.
Maria Vechtomova [00:21:34]: This lineage only works when you use PySpark, so whenever you convert to Pandas or something else, it's not going to work any longer. That's kind of weird.
Demetrios [00:21:47]: Wait, did you tell me though that there's a workaround for that?
Maria Vechtomova [00:21:51]: Yeah, there are some ugly workarounds for that, which we actually teach. But yeah — why, why, why would you design it like that?
Demetrios [00:22:00]: Yeah, it feels like it's part of the downsides that you get with a managed system, but it's also the trade-off that you're deciding: hey, I would like a managed system, so I want certain decisions to be made for me. Right? And it's inherent that if you're using a platform, they're going to have opinions about how to do things. This is one of those times where your opinions and the platform builders' opinions diverge drastically.
Maria Vechtomova [00:22:37]: Yeah, on this specific feature I would say so. But this is just one of the things that I really don't like about it; a lot of other things are awesome. Like I said, I think bundles are really great. The way you define workflows has drastically improved compared to how it used to be. The way you do training on Databricks is also really good.
Maria Vechtomova [00:23:03]: But there are a lot of really cool parts — I think most of the things are cool parts. So I guess that's one of the reasons why I also talk about Databricks. If all of that was not great, I wouldn't be.
Demetrios [00:23:16]: No. Are you diving into any of the Mosaic side of things, or how they've incorporated that into Databricks?
Maria Vechtomova [00:23:28]: Not in the course that we have now, but we are going to launch an LLMOps course, and I'm also touching on it in the book that I'm writing. So I cover MLOps but also LLMOps use cases, where the Mosaic part is indeed covered.
Demetrios [00:23:48]: So we're gonna run with that term, huh? We're gonna use LLMOps.
Maria Vechtomova [00:23:54]: Well, I don't know what you would call it. For me, everything is MLOps, to be honest.
Demetrios [00:24:00]: Yeah, I think we are very biased, because for me too. But LLMOps never felt like a term that had sticking power. Yeah, it's an interesting one.
Maria Vechtomova [00:24:16]: Or AIOps. AIOps doesn't sound really good.
Demetrios [00:24:20]: No. AIOps I also interpret as AI for operations — like using AI to get fewer alerts in Datadog.
Maria Vechtomova [00:24:33]: Oh, yeah, yeah, yeah, yeah, indeed. Yeah.
Demetrios [00:24:37]: But yeah, I don't know what we can call it besides Vibe Ops. That's my new term, of course.
Maria Vechtomova [00:24:43]: Let's just call it MLOps. I think its popularity as a term is actually quite big, so I can push for it.
Demetrios [00:24:52]: Yeah. And I do think it's gone through the hype cycle and now it's on the uptick again. Of course, when LLMOps came out, it went down and was in that trough of disillusionment, and now it's coming back up, because I think folks are realizing: okay, we need to figure out our production environments no matter what, whether we're using LLMs or traditional ML. It's kind of similar.
Maria Vechtomova [00:25:23]: Yeah, for sure. There are way more similarities than people would like to think. And, well, I give this example quite a lot: if you look at the data science hype cycle, data science terms started appearing around 2015 or 2016, something like that, and it took another three years or so, maybe five, before MLOps became a real thing. I guess we are going through the same cycle these days, but the cycle will be even bigger. I mean, AI popularity has grown way further than what ever happened with data science in the past. And because of that, I also expect that the next MLOps hype will happen faster, but also be much bigger than we've ever seen before. And, well, thanks to the vibe coding, we will have a lot of work to do.
Demetrios [00:26:20]: Yeah, that's good job security. That is for sure.
Maria Vechtomova [00:26:25]: Yeah.
Demetrios [00:26:26]: Are there times when you've been working with folks and recommended that they don't use Databricks?
Maria Vechtomova [00:26:34]: Oh yeah, for sure, definitely. I think everyone should use whatever makes sense for the situation they are in. For example, we use Databricks a lot, also for model serving — we have some model serving on Databricks — but I wouldn't recommend it everywhere. One of the situations where I would definitely not recommend it is when the whole website is hosted by you, running on your Kubernetes. If you want low latency, you would want to host your model and serve it on exactly the same Kubernetes cluster, not anywhere else.
Maria Vechtomova [00:27:13]: So yeah, definitely in that situation, don't use Databricks for that.
Demetrios [00:27:19]: So the anti-pattern would be: oh, we're going to bring in Databricks and have it be outside of our Kubernetes cluster?
Maria Vechtomova [00:27:27]: It would be an anti-pattern because it's going to be slower, 100%. Also, Databricks is not going to work for everyone. I'm not knocking the serving part — I think it can be quite useful because it simplifies things a lot — but it's not going to work for everyone. There is a limitation of 20,000 requests per second for the whole workspace, and that's under some assumptions, so realistically it's going to be less than that. And that's for the whole workspace: if you have multiple endpoints on the workspace, they are all covered under this umbrella limit.
Maria Vechtomova [00:28:05]: For some companies it's enough; for some it's never going to be enough. So then you just need to pick the tools that make sense for what you're doing. That's always going to be the case. And most companies are going to be fine, because they don't have any, you know, hard requirements or anything like that, and most of what is done is still batch.
Demetrios [00:28:33]: Yeah. And this 20,000 — it sounds like you've run up against that. We don't need to talk about why or how or what, but is there not a way to go and negotiate with sales to bring that up?
Maria Vechtomova [00:28:48]: Well, I guess there could be, but that's already the increased capacity. There is a default capacity which is lower than that, and you can create a request to, you know, raise it up to that number. I guess if you are a really big customer, then it might be possible. But I doubt it would be that easy for any customer, to be honest.
Demetrios [00:29:16]: Okay, so that's when not to use it. Now, what are some things that you've seen — and I think we're both going to be at the Databricks Data and AI Summit in June. You're giving a talk, right?
Maria Vechtomova [00:29:33]: I don't know yet, actually. I may be giving a talk at the DevRel theater. But anyway, I will be around there.
Demetrios [00:29:40]: So what are some exciting developments that...
Maria Vechtomova [00:29:44]: ...are on my radar on Databricks? So, something I actually like — it's not per se a new thing. Well, some things I probably can't talk about, so I need to be really cautious with that.
Demetrios [00:30:00]: But yeah, you're privy to insider information. I didn't realize you were that cool. That's awesome.
Maria Vechtomova [00:30:07]: Okay. I can't say certain things, but okay, I can tell you what I can tell. So there is AI/BI Genie, which I think is super cool.
Demetrios [00:30:16]: What is that?
Maria Vechtomova [00:30:17]: So basically it's this AI-for-BI kind of tool. And that's something that within our organization we are now going to use more extensively. It really simplifies the way other teams — teams that don't per se have enough knowledge to code things — can interact with data. That's something we're trying to incorporate in our product teams. I think it is a pretty cool development. It's been there for a while, but I think now it's getting to a state where it's actually nice to use.
Demetrios [00:31:01]: So we're talking about using Databricks in this utopian world where we learn the platform and the ins and outs of it, and that is all we need to know. And if we can optimize that, we optimize our whole setup and system. But for most folks, I feel like Databricks is just one part of the stack. What have you seen in that regard?
Maria Vechtomova [00:31:26]: Oh yeah, I agree. So from what I've seen — and this is the most common pattern — everything batch happens on Databricks. So basically the whole model training, which you need to, you know, retrain probably once per week or whatever your retraining cycle is, can happen on Databricks. And I think it will simplify your life significantly, especially if all your data is in Unity Catalog. It will make it so much easier than using anything else, and that's why I think it's very smart of Databricks to have Unity Catalog in place. So, what I've seen a lot — and that's also where we are coming from.
Maria Vechtomova [00:32:06]: And we still have this kind of diversified way of deploying things. So we train our models, and that results either in a model artifact that needs to be served — so you basically have a model serving use case — or you have batch serving, which means that you just store data somewhere in some database and, at request time, you just need to query the database; or you have a mixed scenario, where you need an artifact plus you need to look up some data somewhere. And when you just want to look up some data somewhere, without any models, I think Databricks wouldn't be my first choice. So there are multiple ways of doing that. It's either, you know, model serving with a lookup in some other database — that would be one of the ways —
Maria Vechtomova [00:32:59]: or you could use online tables on Databricks, but then you're limited to feature serving and model serving, and there are certain data types that are not supported. It's just too complicated, in my opinion. So I would rather go for serving some FastAPI app on Kubernetes and looking things up in some database like DynamoDB, Cosmos DB, MongoDB, things like that. And that's the most common approach I see within organizations. It makes total sense, by the way. And when you do model serving, you may want to serve it on Databricks, but MLflow serve also works with Kubernetes, so you could deploy it the same way but on Kubernetes, and that would also be a nice approach. So I think you just need to think about what makes sense for your specific use case and go for that.
Maria Vechtomova [00:33:58]: Databricks makes monitoring of the endpoints much easier. That's one of the upsides, and why I would vote for model serving there if it's possible and works for you, if it fits the requirements. Because for all the API calls, you can look up that information: it's all stored in an inference table, which you can enable, and that really makes it so nice to monitor. So it's easier than using some other tools to do the same thing. That's just what I think about it, because we've tried different approaches with that. So, yeah — as always, it depends.
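As a sketch of what that monitoring can look like: once inference tables are enabled on an endpoint, request and response payloads land in a Delta table you can query like any other. The table name below is hypothetical, and the column names (timestamp_ms, status_code, execution_time_ms) follow the documented payload schema but should be treated as assumptions.

```python
# A sketch of basic endpoint monitoring from an inference table.
# Runs in a Databricks environment where `spark` is available.
from pyspark.sql import functions as F

payload = spark.table("main.default.my_endpoint_payload")

(
    payload
    # Derive a calendar day from the millisecond request timestamp.
    .withColumn("day", F.to_date(F.from_unixtime(F.col("timestamp_ms") / 1000)))
    .groupBy("day", "status_code")
    .agg(
        F.count("*").alias("requests"),
        F.avg("execution_time_ms").alias("avg_latency_ms"),
    )
    .orderBy("day")
    .show()
)
```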
Demetrios [00:34:46]: I'm glad you brought up MLflow right there, because I know that there's been a lot of work on MLflow, specifically around extending MLflow to new LLM capabilities. When we chatted before, you were saying how cool the new updates to MLflow are. I've seen in the MLOps Community that there have been some threads going in Slack from folks who don't like where MLflow has been going, but maybe you can talk to us about what you like about it these days.
Maria Vechtomova [00:35:17]: Okay, yeah, sounds good. I actually became an MLflow Ambassador, so I'm probably the right person to talk about it.
Demetrios [00:35:25]: There we go.
Maria Vechtomova [00:35:26]: No, I'm also — it did feel a bit old for a while; nothing major was really happening, in my opinion, for a while, and it was, you know, not very intuitive to use for new users. But I feel the documentation is just awesome now. It improved significantly over the last few years, and if you don't understand something, you can probably find it in the documentation. So that's one of the things that definitely improved. But also the LLM features — that's something that came out recently.
Maria Vechtomova [00:36:03]: They have MLflow Tracing now. They also have the Prompt Registry, which is super cool and really makes a lot of sense. There is also the AI Gateway in MLflow. Well, I don't think there is any other tool out there that is that feature-rich, to be honest. Which makes it, on one hand, a really cool tool; on the other hand, it might not be straightforward to figure out how to use it properly. So that's probably the other side of the coin.
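For a flavor of the tracing feature: a minimal sketch with the `@mlflow.trace` decorator, available in recent MLflow releases. The retrieval and answer functions are stand-ins, not anything from the episode.

```python
# A minimal MLflow Tracing sketch: each decorated call is recorded as a
# span (inputs, outputs, latency) that you can inspect in the MLflow UI.
import mlflow


@mlflow.trace
def retrieve_context(question: str) -> str:
    # Stand-in for a retrieval step, e.g. a vector search call.
    return "retrieved context for: " + question


@mlflow.trace
def answer(question: str) -> str:
    context = retrieve_context(question)  # nested call becomes a child span
    # Stand-in for an LLM call.
    return f"answer built from [{context}]"


answer("Why is the feature function hard to version?")
```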
Demetrios [00:36:44]: When you talk about MLflow serving on Kubernetes, is this the managed version or the open source version that you were talking about?
Maria Vechtomova [00:36:54]: So, open source — MLflow serve is functionality that comes from open source MLflow. But basically exactly the same thing is used on Databricks; it's just that you don't have to run mlflow serve anywhere yourself, you use commands to deploy endpoints on Databricks instead. But exactly the same thing happens behind the scenes, which makes it easy to deploy pretty much anywhere with exactly the same format. One of the things that I find less intuitive with Databricks serving is that when you deploy, you just deploy and wait; things may fail and it's very hard to debug. But what people don't realize is that it's exactly the same thing as MLflow serve, which you could just run locally and debug.
Maria Vechtomova [00:37:48]: So there are ways to test it. We're actually writing a blog about it as well now, because it's not very clear to people how to do that. The MLflow model format is very similar to what BentoML is doing, in a certain sense: it's just packaging your model in a format that can be served in a certain way, and you can deploy it anywhere. So you can make an image out of it, a Docker image, and then use that for serving — or you use Databricks, but Databricks does exactly the same thing in the background. So I guess it doesn't matter where you deploy it; on Databricks it's just a bit easier, because all of these complex parts are hidden from you.
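A sketch of that local debugging loop, assuming a model registered in Unity Catalog (the model name and alias are hypothetical): because a Databricks endpoint wraps the same MLflow pyfunc format, you can load and exercise the model on your own machine before deploying anything.

```python
# Load a registered model locally and call predict() -- the same pyfunc
# interface a Databricks serving endpoint invokes behind the scenes.
import mlflow
import pandas as pd

mlflow.set_registry_uri("databricks-uc")  # models live in Unity Catalog

model = mlflow.pyfunc.load_model("models:/main.default.house_price@champion")
print(model.predict(pd.DataFrame({"sqft": [72.0], "rooms": [3]})))

# The same artifact can also be served outside Databricks, e.g.:
#   mlflow models serve -m "models:/main.default.house_price@champion" -p 5000
#   mlflow models build-docker -m "models:/main.default.house_price@champion" -n house-price
```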
Demetrios [00:38:37]: So if you had the ability to create your favorite stack — we could say using Databricks and then plugging in different pieces where needed, pieces that aren't native Databricks options, so you can extend the platform — let's talk about a specific use case. So it's not like: ah, well, it depends; if you're using this use case, then you would want this or that. Let's talk about a recommender system use case. How would that look? What would you swap out? Assuming that we don't have to onboard any new vendors and do anything new for specific pieces — let's just pretend the work of actually bringing on a tool is nonexistent, because we already know that is not true.
Demetrios [00:39:37]: But in this hypothetical world it is. How would you extend it?
Maria Vechtomova [00:39:43]: So recommender systems are usually largely precomputed, right? It's not something that we compute at the moment a request comes in — maybe some parts of it, but most of it is actually just looked up somewhere, already precomputed, because it's expensive to compute, you know, and we have some latency requirements. So it means that you have to look it up somewhere in the first place. That's an assumption that we have.
Demetrios [00:40:14]: Okay.
Maria Vechtomova [00:40:15]: Model training can be done on Databricks, and for a recommender system — at least with what we have — we use Spark a lot, and it makes total sense, because all of the processes that we run can be distributed for data preprocessing. Then the bulk of the logic of what we do for the recommender system is also custom-made and also runs in PySpark. And it basically results in something that looks like a very, very large dictionary in the end — well, that's something that you could store in some database. Then it makes the most sense to just have some FastAPI app running on Kubernetes, or — I don't know, it depends on what your requirements are — even an Azure Function can be good enough for you. So there are multiple options; you just need to see what you already have within your organization, what patterns you have, and choose based on that. If Kubernetes is a big part of what you do, I would totally go for it: deploy just a FastAPI app, look things up in some Cosmos DB, DynamoDB, whatever database you have, and return the value back. And for the monitoring stack — you can't use inference tables on Databricks then, so you have to do something else. Well, what we use: we have App Insights, and we also have Prometheus and Grafana set up, and that's what we currently use.
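A minimal sketch of that serving pattern — precomputed recommendations behind a small FastAPI app. The in-memory dict stands in for whatever key-value store you'd actually use (Cosmos DB, DynamoDB, MongoDB); all names here are illustrative.

```python
# Precomputed-recommendations lookup service; run with:
#   uvicorn app:app --port 8000
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Stand-in for the "very large dictionary" the Spark pipeline writes out.
RECOMMENDATIONS: dict[str, list[str]] = {
    "customer-123": ["sku-9", "sku-4", "sku-17"],
}


@app.get("/recommendations/{customer_id}")
def get_recommendations(customer_id: str) -> dict:
    items = RECOMMENDATIONS.get(customer_id)
    if items is None:
        # In a real service you might fall back to popular items instead.
        raise HTTPException(status_code=404, detail="unknown customer")
    return {"customer_id": customer_id, "items": items}
```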
Demetrios [00:41:52]: What about the data layer of things, like the whole data pipelines and processing, moving data around, creating features, et cetera?
Maria Vechtomova [00:42:02]: Yeah, so, well, we are on Azure, and we don't handle the raw data part — the data ingestion part. There are some limitations that don't allow us to use Databricks for that, so that's why it's still Azure Data Factory. But when the preprocessing happens, it writes all the data to Unity Catalog. So we are basically consumers of that data — our team consumes the data, and the data is produced by another team. If something is wrong with the data, it's not our responsibility to fix it; it's that team's responsibility, which simplifies certain things and complicates others, obviously.
Maria Vechtomova [00:42:49]: So that's the data layer that we are dealing with. Then we have our own custom data preprocessing that is required for our models, because, you know, the data engineering team doesn't care about specific data transformations. So we have our own data engineering pipeline that is used for our personalization stack, and that is shared across multiple sub-products within our personalization domain.
Demetrios [00:43:15]: This pipeline is with Databricks, or Airflow?
Maria Vechtomova [00:43:17]: It is on Databricks. It's using Databricks workflows, and we write back to Unity Catalog as well. Then the sub-products — which are, for example, recommendations on the basket, on the product detail page, or personal offer recommendations, which is something we do as well — these are separate sub-products that run after that other pipeline has finished, and the results are either written to some database where we can use FastAPI to look them up — we actually do serving on Azure Functions these days — or, for the other part, we actually have a model as well, and we do Databricks model serving for that. But as I said, if you have certain requirements and Databricks model serving doesn't fit them, you could use MLflow serve and deploy it somewhere else.
Demetrios [00:44:21]: So I like how I asked you what your ideal stack is, and you gave me what your actual stack is.
Maria Vechtomova [00:44:27]: But it is very ideal. I really like what we have now. It took a long time to figure it out, and we went through a massive migration — we are almost done with it. It's a great feeling, and it's actually really the way we imagined it in the beginning. So given the situation that we are in, I don't think there is a better stack for us at the moment.
Demetrios [00:44:50]: That's why — yeah, I liked your answer. It's like: there's no difference between my ideal stack and my current one.
Maria Vechtomova [00:44:57]: Yeah, indeed. We are very proud of what we achieved. It was a lot of effort.
Demetrios [00:45:03]: Yeah. What were some of the things that you did during this migration that didn't work out?
Maria Vechtomova [00:45:11]: Yeah — we actually wanted to use Databricks feature serving and also online tables for feature lookup, and because of the limitations we faced, we never could use them. That was one of the downsides, because it would have simplified our deployment stack, right? The whole deployment could have happened in just one big pipeline, in a workflow on Databricks. Instead, we now have multiple pipelines, which is still manageable, but it's less perfect, I would say.
Demetrios [00:45:47]: And you kind of glossed over something that I want to go back to, which is that you're not the owners of the raw data that gets thrown into Unity Catalog. You're just a consumer of it, which has its pros and cons. How do you break down the pros and cons?
Maria Vechtomova [00:46:03]: Yeah, so I think the data quality part is what it's all about. Of course, there is some monitoring on the data ingestion side, on the way they process data and what they put in Unity Catalog — they have some quality checks in place. However, those quality checks are very different from the checks that we do. For them, it's things like: the schema makes sense, the values are within certain acceptable ranges, the count is normal, things like that. But what we check for are very different things — it's more around, you know, the statistical properties of the data. And the data engineering team —
Maria Vechtomova [00:46:42]: because we're not the only consumers of the data; there are other consumers too — they don't really care about these things. So it can happen that we see that things are broken and they haven't noticed it, just because they don't check for the things we care about. And that's a universal problem that everyone has.
Demetrios [00:47:05]: This feels a lot like where you would want to create this data-mesh-y concept — data contracts — and have the producers and consumers shake hands and say: all right, we agree that these are the quality checks I need, or these are the things that I'm looking for, and I want the data with this type of freshness, and in this style or this schema, whatever it may be. Have you thought about doing that?
Maria Vechtomova [00:47:38]: Yeah, we tried, but I think it's always about, you know, how big the teams are, what their priorities are, who they are reporting to. All these things matter. And if there is this kind of movement from above, then things may change. But, you know, if you're just one of the consumers and there is a larger team whose priorities are way different, it's just hard.
Demetrios [00:48:07]: Yeah, yeah, that's such a great point. How do you influence the other team, who has other priorities, to take into account that this data they're giving you sometimes goes haywire and you're not able to extract the most value from it? So it's almost like you have to go up your food chain in order for them to go horizontal and then down their food chain, instead of you just going and talking to that other team and saying: hey, can we set something up like this and that?
Maria Vechtomova [00:48:45]: Yeah.
Demetrios [00:48:45]: And so you tried the data contracts. Were the data contracts actively put into place, or were the producers just like: yeah, we'll get around to it — and they never did?
Maria Vechtomova [00:48:56]: Yeah, the second option.
Demetrios [00:48:58]: Okay, that sounds eerily familiar to a few stories that I've heard. So, all right, before we jump off, I do want to highlight the course that you are creating, all about Databricks. It is obviously clear to me, as I've been talking to you for the last hour, how knowledgeable you are about Databricks from a practitioner's standpoint. You've been getting your hands dirty with everything Databricks, and you've also been staying up to date with everything that is coming out. What's the course about? When is it? How can I sign up?
Maria Vechtomova [00:49:31]: Yeah, so this is a cohort-based course that we have on Maven, and the idea is that we really want to teach everything that we know about MLOps on Databricks. So it's a highly practical course. Every week we go through a piece of theory, and we actually show the code to people and explain how it's done. Everyone needs to create their own code based on their own data set, and we review pull requests. Every week we iteratively cover another thing, and by the end of the course everyone has a full, end-to-end ML project that can be reused in pretty much any company.
Demetrios [00:50:16]: So.
Maria Vechtomova [00:50:16]: And that was always our goal with this course: to actually build something highly practical. And we are super active on Discord, so people keep asking us a lot of questions. Also, after the course, everyone keeps access to the Discord, so you kind of build a community in a certain sense. The next cohort starts on the 5th of May and goes on until the 23rd of June, and we also have another cohort that starts on the 1st of September. That will be the last cohort of this specific format of the course, because, as I mentioned earlier, we are going into LLMOps and maybe going to have extended cohorts — we haven't figured out exactly how it's going to be — but starting from November it will be a different course.
Demetrios [00:51:09]: Okay, so the course is awesome. I will also mention that you've been generous enough to give everyone in the MLOps Community a discount code, so we'll leave a link in the description with the discount code and everything in there. And that's awesome, thank you. You are writing a book too. What's that about?
Maria Vechtomova [00:51:29]: Well, MLOps for Databricks, what else?
Demetrios [00:51:32]: I should have known.
Maria Vechtomova [00:51:34]: Yeah. It's basically all my brain dump — everything I know about MLOps and Databricks will be in that book. And it's basically the guide that I always wanted to have myself if I were getting started with Databricks. Also very highly practical, with all kinds of considerations. The book will be coming out at the beginning of next year, but there is an early release of the chapters. The first chapters are already coming out, I think next week, and about half of the book will be out around July, I believe.
Maria Vechtomova [00:52:12]: And I will finish writing the book around October — that's the goal, at least. The chapters will be appearing on the O'Reilly platform, so everyone who has access to the O'Reilly platform can read them. So yeah, if you want earlier access, that's the way.