MLOps Community

How Data Platforms Affect ML & AI

Posted Jan 26, 2024 | Views 312
# Data Platforms
# AI
# Machine Learning
# The Oakland Group
SPEAKERS
Jake Watson
Principal Data Engineer @ The Oakland Group

Jake has been working in data as an Analyst, Engineer, and/or Architect for over 10 years. He started as an analyst in the UK National Health Service, converting spreadsheets into databases that tracked surgical instruments, then continued as an analyst at a consultancy (Capita), reporting on employee engagement in the NHS and dozens of UK universities. There Jake moved reporting from Excel and Access to SQL Server and Python, with frontend websites in d3.js. At Oakland Group, a data consultancy, Jake has worked as a Cloud Engineer, Data Engineer, Tech Lead, and Architect, depending on the project, for dozens of clients both big and small (mostly big). He has also developed and productionised ML solutions in the NLP and classification space. Jake has experience building data platforms in Azure, AWS, and GCP (though mostly Azure and AWS) using Infrastructure as Code and DevOps/DataOps/MLOps. In the last year, Jake has been writing articles and newsletters for his blog, including a guide on how to build a data platform: https://thedataplatform.substack.com/p/how-to-build-a-data-platform

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps Community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

I’ve always told my clients and colleagues that traditional rule-based software is difficult, but software containing Artificial Intelligence (AI) and/or Machine Learning (ML) is even more difficult, sometimes impossible.

Why is this the case? Software is difficult because it’s like flying a plane while building it at the same time. But because AI and ML make rules on the fly based on factors like training data, it’s like trying to build a plane in flight where some parts of the plane will be designed by a machine, and you have little idea what that is going to look like until the machine finishes.

This goes double for more cutting-edge AI models like GPT, where even the creators of the software have only a vague idea of what it will output.

This makes software with AI/ML more of a scientific experiment than engineering, which is going to make your project manager lose their mind when you have little idea how long a task will take.

But what will make everyone’s lives easier is having solid data foundations to work from. Learn to walk before running.

TRANSCRIPT

Jake Watson 00:00:00: So, my name is Jake Watson. My job title is principal data engineer, and my coffee is bourgeois espresso.

Demetrios 00:00:12: Welcome back to the MLOps Community. I am your host, Demetrios, and hoping that you are having a magical day full of surprises. And beautiful surprises at that.

Demetrios 00:00:25: Oh, stop the tape. Before we get into this next episode, I want to tell you about our virtual conference that's coming up on February 15 and February 22. We did it two Thursdays in a row this year because we wanted to make sure that the maximum number of people could come each day, since the lineup is just looking absolutely incredible.

Let me name a few of the guests that we've got coming, because it is worth talking about. We've got Jason Liu. We've got Shreya Shankar. We've got Druv, who is in product for applied AI at Uber. We've got Cameron Wolfe, who's got an incredible podcast and is director of AI at Rebuy Engine. We've got Lauren Lochridge, who is working at Google, also doing some product stuff. Oh, why are there so many product people here? Funny you should ask, because we've got a whole AI product owners track along with an engineering track.

And then, as we like to, we've got some hands-on workshops, too. Let me just tell you some of these other names, because we've got them coming, and it is really cool. I haven't named any of the keynotes yet either, by the way. Go and check them out on your own if you want. Just go to home.mlops.community and you'll see. But we've got Tunji, who's the lead researcher on the DeepSpeed project at Microsoft. We've got Holden, who is an open source engineer at Netflix. We've got Kai, who's leading the AI platform at Uber. You may have heard of it. It's called Michelangelo. Oh, my gosh. We've got Fazan, who's a product manager at LinkedIn. Jerry Liu, who created good old LlamaIndex. He's coming. We've got Matt Sharp, friend of the pod, and Shreya Rajpal, the creator and CEO of Guardrails. Oh, my gosh, the list goes on. There's 70-plus people that will be with us at this conference, so I hope to see you there!

Demetrios 00:02:36: And now let's get into this podcast. Today we're talking with Jake, all about data platforms. I love his articles. I'm going to just come right out and say it. I'm a huge fan of what he's done when it comes to building data platforms and what you need to know as you are building out your data platforms. He's gone deep into everything from the architecture, how you can add and maximize value, and then all the pieces that come with the data platform that, coincidentally or not so coincidentally, have a lot of overlap with your ML platform. So these are things like DataOps, data modeling, data pipelines, oh my God, data transformation, data quality, data governance. Can I say data one more time? Let me try and sneak that in there. Data security, obviously, and organizational pieces, like just people, teams, hierarchy of teams, and how you make that happen. So this conversation centered around all of these posts that he's done and things that he's been seeing out in the wild. And what has been most effective? I almost said effectful. Pretty sure that's not a word. What has been the most effective as he's been out there doing his thing as a principal data engineer? Hope you enjoy. And if you can do one thing and one thing only, we would love it if you share this podcast with one friend of yours. Talk to you soon. Well, Jake, it's a pleasure to have you on here. I am very excited because we get to talk all about the data platform today, how that interfaces with ML platforms, and really what I am calling 2024 the year of the data engineer, which feels like a role that was a little bit forgotten in 2023. I don't know if you felt that too, but the AI hype and the LLM hype made people think that these jobs were almost obsolete, I think. And those that have been in the pits, digging trenches, knew that they weren't going anywhere. If anything, they were going to get more valuable. And so I've been seeing a ton of demand in the community for topics like this that are going to help upskill people, because even if they are further down the line on their LLM journey, they still need that good old data. They need it to be transformed. They need it to be modeled correctly. And so who better to talk to than you about all this good stuff?

Jake Watson 00:05:19: Thank you. I'm glad to be here. I'm looking forward to a fun chat, just getting into the weeds. Yeah, I definitely felt that the hype was all-consuming, but now I'm starting to see increasingly people asking me, how do I get this LLM thing working at scale and with our data, et cetera, et cetera.

Demetrios 00:05:44: Exactly. So let's give everyone a bit of background on you. I know you from your blog and your Substack, or whatever, like, newsletter, blog, whatever Substack is these days, and the work that you've done on there. You have super comprehensive blog posts on how to build data platforms and the reasons for the different components. I want to get into all of that, but I think it's worth talking to people about what you do day in, day out.

Jake Watson 00:06:10: Yeah, that's absolutely fine. My title is principal data engineer. I work for the Oakland Group, which is a data consultancy. Even my title can be a bit misleading in that respect; you could simply say data consultant, as I tend to work through a variety of issues, usually starting with clients when they haven't even got anything built and it's a blank sheet of paper. So I'm trying to architect and design that, gather requirements, and make sure they're going down the right path.

Demetrios 00:06:43: Okay, so you wrote a really good blog post that I want to dive into that is all about the data platform foundations. And I'm wondering what inspired you to write that post.

Jake Watson 00:06:57: I might have described it a little bit in my intro post when I wrote it, which was basically that data platforms can end up being dozens of different elements, and lots of talking with lots of different people as well. You usually need a really large support network in your company to get a platform built, keep supporting it, and keep it maintainable, which can be a whole different kettle of fish. It can be quite easy to build a POC and another thing entirely to actually keep it running in production. So I wanted to get that across, about all the different parts, and also try to explain that there are lots of different varieties as well. It's not just your modern data stack, which might have 20 different vendors all attached to it, which I know is a gross, overwhelming thing. It might just be a database taking some data, doing some transformations, producing some outputs, and getting some insights from that. It can be something as simple as that, or it can be something very complex, where you've got your Ubers and Airbnbs building their own custom pieces of software to do things that no one else in the industry is doing.

Demetrios 00:08:20: Yeah. So one thing that I really liked is how you basically showed both sides of the spectrum. You said, look, here is what it can be if it is the simplest of the simple, and maybe there's a one-person band that is trying to keep up this whole platform, and then here's what it can look like in the messiest of situations. So in your day to day, what's more common? What do you feel like you're seeing a lot of, as far as patterns?

Jake Watson 00:08:51: When it comes to the platform, it's somewhere in between. As I mentioned, I work with quite a lot of enterprises that can either be starting the journey or might be quite far along on it. So it can vary. They've usually got a couple of people on their data team, so it's rarely just one or two, but I have worked with those people, and also sometimes with people that don't even have a data team yet and are still trying to figure out how they can use all this, how they can actually make some data-driven insights, all the way to large 5,000-person organizations that have lots of different microservices pumping data from here, there, and everywhere. It's everything in between. I suppose the average data team will be your classic two-pizza team.

Jake Watson 00:09:55: It can be not too bad to scale to that. It gets quite difficult to scale beyond that, because you then have to start splitting into teams, and then you have to have another level of hierarchy on top, and then you need to think about whether you need to split up parts of your platform so different teams can work autonomously, at speed as well. You start to get into the data fabric, data mesh sort of landscape. So it can be quite difficult to move on from there, and some people just say, oh, I'll just build another data silo and deal with it that way.

Demetrios 00:10:34: And one thing that is clear to me is that you talk about how there is that data mesh that's coming up. You also speak about the organizational side of things, right? What you're talking about right there is not so much a problem of scaling the technology. It's more, how are we going to have our teams be responsible for the data platform now? What are we going to look at as far as a hierarchy of the people that are working on the data teams? Are there ways that you've seen this be successful? I mean, I know data mesh is a buzzword. I've seen people talk about how data mesh is dead in 2024, and that it was a very lofty goal. I don't know if it really caught on as much as people were hoping, unless you have other thoughts on that? I would love to get your take. But as far as the organizational side of things and how people can architect that, have you seen success in different patterns there?

Jake Watson 00:11:46: I'd say there have been a couple of success stories, but as you say, they've probably been fewer and further between than people expected. I think it does come back to the people element: people usually came to data mesh because they have people issues. But people issues are quite hard to solve, so it will take a few years to solve them. It's usually organizations that have already been on the data mesh track but didn't quite realize it at the time. I was working in one large public sector organization that was almost doing data mesh but didn't realize it: each of their lines of business was building its own platform, with a central governance plane underneath managing all the data governance and billing. So I think a couple of organizations were already on the path but didn't realize it. But I think it's going to be a slow burn, and it doesn't fit a lot of organizations either. I think you need to be a large organization to truly benefit, and you need to be highly distributed by nature.

Jake Watson 00:13:07: So if you're a company that's globalized and has lots of offices in lots of different countries, or a company like an insurance firm that is by nature very split up, with your home insurance and your car insurance each having their own data teams, then it can make more sense there, as opposed to someone that already has a centralized, somewhat well-working data team and doesn't need to go that far. And there are some things you can do on the way to a data mesh. There's the classic data mart modeling approach that's been around for ages: splitting up your warehouse by domain, and that can get you a long way towards solving a lot of your pain. It won't be a fully distributed system like a data mesh, so it won't have, say, separate DevOps systems, which might mean you can only scale so far with a data mart structure. But you might want to explore that and see if you're still hitting the pain before trying other methods.

Demetrios 00:14:20: And there are so many places that I want to take this. One is really along the lines of different structures for the data platform. Are there design patterns that you have seen that have worked out well and are easy-ish to implement, or that have been useful, almost like your go-to recommendation?

Jake Watson 00:14:48: I've been quite a big fan of what I think Databricks called the medallion architecture. It's also been adopted by many people who are streaming. I've seen it a bit with people that are building with Flink, if you've come across it, the stream processing library which, like Spark, can do both batch and real-time streaming, and that can reduce the overhead and the amount to learn if you want to do both batch and real time. I'm starting to see more and more people slowly wanting streaming, because their users are starting to demand real-time updates and notifications. They've been using those Uber and Netflix apps, getting notifications in real time, and saying, I want our company to have a bit of that too, and they start demanding it from their data team. So the team has to start figuring out, how do we get real time?

Jake Watson 00:15:51: And one of the easy ways to get good-enough streaming is through something that can offer both batch and real time, like Databricks, and I know Snowflake has been offering it too. I'm not going to try and be too much of a shill for one or the other.
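To make the layering Jake describes concrete, here is a minimal sketch of a medallion-style bronze/silver/gold flow in PySpark. The paths, table names, and cleaning rules are illustrative assumptions, not anything from the episode:

```python
# A minimal sketch of the medallion (bronze/silver/gold) layering in PySpark.
# All paths, column names, and cleaning rules below are assumptions for the
# example, not from the episode.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw data landed as-is from the source system.
bronze = spark.read.json("s3://lake/bronze/orders/")

# Silver: cleaned and conformed -- dedupe, fix types, drop junk rows.
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("amount") > 0)
)
silver.write.mode("overwrite").parquet("s3://lake/silver/orders/")

# Gold: aggregated, business-ready tables for BI or ML.
gold = silver.groupBy(F.to_date("order_ts").alias("order_date")).agg(
    F.sum("amount").alias("daily_revenue"),
    F.countDistinct("customer_id").alias("daily_customers"),
)
gold.write.mode("overwrite").parquet("s3://lake/gold/daily_orders/")
```

The same bronze/silver/gold shape applies whether the silver step runs as a batch job or as a streaming job, which is part of why the pattern travels well between Spark and Flink.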

Demetrios 00:16:11: Yeah, actually, it's funny you mention that, because one thing that I was talking to a friend about the other day was how so many companies will start with one, be it Snowflake, and then they kind of hit the maximum that Snowflake can handle, and then they'll have to go to Databricks for a data lake. And so they end up paying both of them millions of dollars, and they're like, there's got to be a better way than this. We're using them both. We can't figure this out.

Jake Watson 00:16:47: It differs for each company, so I can't offer a silver bullet on that. I know there is a cost implication with both; they can cost a lot of money at scale, which means some people start moving back to open source. That's probably a lot of the reason why the likes of Netflix and Airbnb, and I think there was a recent blog about Instacart, moved some of their stuff back from managed services to running it themselves, so they can save a lot of cost. I think Instacart moved from AWS Kinesis to Kafka on Kubernetes, for example, because once you start running those servers and paying an overhead on top for the managed services, those extra costs can add up to the number of engineers you could just hire to run the service yourself. But I don't know how often that comes up; they tend to be quite high-maturity companies that can do that sort of thing.

Demetrios 00:17:58: So one piece that you have on data platforms, almost another thing that I feel like a lot of people talk about incessantly, is the data modeling piece and how important it is to get that right for the rest of the downstream effects. Right? Can you talk a bit about how you see it?

Jake Watson 00:18:22: It's a tricky one with data modeling, because it comes back to this: you're trying to get stuff done as quickly as possible, so it's one of those things that can be left by the wayside until it becomes a massive problem, and then it becomes an absolute pain to sort out later. I think the one problem at the moment is that there's no one golden data model to use. Every organization will have something slightly different, and might have multiple data models at that. For example, with machine learning, classically you want one big table, so you can put that into a matrix and pass it into scikit-learn or PyTorch or TensorFlow. But for business intelligence, Power BI and Tableau prefer a star schema.

Jake Watson 00:19:14: Kimball dimensional modeling with facts and tables. So you might end up running multiple things and then you might be doing log analytics, which is sort of like it's all in JSON and elastisearch and MongoDB. So you can end up with multiple different models and that can make it hard to sort of make a single source of truth on that to mesh that all together. So I wouldn't say have a multiple, but be flexible, don't try. So I find it works better if people are a bit more flexible and not try and have one golden type of data modeling technique of saying, oh, everything has to be kimball, everything has to be one big table. You're going to find you might have to be flexible there and sort of like fit it best to your use cases and tooling that you're using is probably the best bet I've got, is the best advice I can ask for on that one.

Demetrios 00:20:11: Really? Yeah. And then how do you deal with the sprawl that inevitably comes from that? Right.

Jake Watson 00:20:19: With that, I think there has been an increasing call to bring back the good old days of enterprise modeling, or, the term is escaping me, conceptual modeling, which is getting back to what is a customer, what is a product, and can we possibly have a golden dataset for customer and product, which is a long and hard road to go down.

Demetrios 00:20:55: Hey everyone, my name is Aparna, founder.

Jake Watson 00:20:57: Of Verize, and the best way to stay up to date with mlops is by subscribing to this podcast. And another problem with that is conceptual modeling was born in the 80s. So back in the days of waterfall and the people, and you'll find that a lot of the people that preach the good word of conceptual modeling are still doing waterfall. So there needs to be a bit of give and take where you need to take conceptual modeling into the days of cloud and agile, where you're trying to split up the work so it doesn't become, I'm going to spend three years just to make one golden customer table and work out what the hell a customer is, but also not turn it into a mess of a sprawl where we're never going to work it out. We don't care. And now we're just going to make 50 versions of the customer table and no one knows actually how many customers we have in our company. So it's trying to find that balance. So you can bring fast, you can bring some value, but you might find that you have to make compromises in that.

Demetrios 00:22:02: Yeah, and one other piece that you talk about a bit is data quality, and how basically you're making trade-offs when you are dealing with all these different kinds of data models. Inevitably, I think you're going to have to ask, how can we keep the data quality high and have that integrity in our data modeling?

Jake Watson 00:22:22: I've read a couple of people who say you can solve everything with just enough tests. I'm a little bit skeptical of that approach. There are always more potential tests you can write, because it's very rare to have 100% coverage, and data quality tests are queries against your data warehouse. If you're using the likes of Databricks and Snowflake, you'll get charged per query, so the more data quality tests you write, the more it will cost you, not to mention all the maintenance costs. So data quality is a bit like classic technical debt in software engineering, where the more the debt accrues, the harder it can be to make changes, because you know that if you make any changes, you can introduce data quality issues, and you need data quality tests to give you some confidence there. I know the likes of Monte Carlo have called it data debt, where the more data debt you accrue, the harder it is to change at pace in an agile manner. So there is a balance: data quality can cost you a lot of money to do right, but it can also cost you not to do it.

Jake Watson 00:23:50: It costs you not changing and drilling value for the business. And the more value you give, the better your company is going to be.

Demetrios 00:23:58: And so you break down a lot of different pieces when it comes to the data platform, and in these different blog posts you go in depth on all these pieces, right? Is there a hierarchy, where you feel like you can't get to, say, data quality before you have data modeling done, or you can't worry about your data transformations? Or is it all weighted and valued equally? How do you look at that type of thing as you're building out your data platform and trying to build for the future too?

Jake Watson 00:24:33: It really depends on how fast you want to move and deliver value. If you're doing a quick-and-dirty POC or MVP, depending on which term you prefer, then you might not concentrate too much on getting the modeling right. You're just trying to prove a thing can work and will deliver value, rather than getting hung up on making robust tests and robust modeling. However, if you're working on a critical reporting service, accounting, safety reports, things where lots of money or lives can depend on that data, then you want robust quality on that. And with modeling, if you know you're going to end up working on a rather big system that will have to scale out, then buying into that modeling early on can reap much return, so if you can do that, I recommend it. I think the answer is you've got to take the user requirements, fit them to what works, and then also build a plan to say, right, we're going to come back to that at a later date. If we've proved that X works, then we'll go to Y, and keep our technical debt down.

Jake Watson 00:25:56: There is also a monitoring element to this. You want to monitor how well your data platform is doing. So if it isn't becoming as robust as you want, you want to concentrate on getting your technical debt down. If you're feeling like you're getting too many errors and people aren't trusting your data platform, then you need to get more robustness into it with data quality. And if you're feeling that you're not making changes fast enough and your data model is holding you back, either in performance or in how quickly you can make those changes, then you may need to look at remodeling your data so it works better in terms of performance or rate of change.

Demetrios 00:26:37: So as I'm looking at this and thinking about a data platform and having a strong data foundation in place, one thing that I realize is that before you get a data foundation up and running, it's really hard to talk about doing any kind of AI or ML, and I imagine you've seen that a ton. So where have you seen success when it comes to bolstering AI and ML on top of a data platform? What are some things and design patterns that you have seen that have been useful in that regard?

Jake Watson 00:27:21: There are a couple of different ways. A common one I've seen is a completely separate data science team, which has its own pros and cons, and which might build out its own ML platform; that's quite common. In terms of the data platform, what I find often useful is that you do want those fundamentals in there. You want to have some trust in your descriptive analytics before you start your inferential analytics, so you can build on top of a solid foundation where you know your pipelines are working and the data you're outputting is of high enough quality. That makes it easier; you're not so much building the plane while flying it. It comes back to what I think I mentioned in one of my blogs, the classic data science hierarchy of needs pyramid: you want to get the foundational analytics in place first. You want to look back at your historic data and know that's good before you move on to forecasting into the future and making wisdom and knowledge from it. But I have seen some success with data science teams going away and creating their own solution, building their own thing, which is kind of interesting.

Jake Watson 00:29:02: It would probably go against most classical data practitioners and that hierarchy of insight we just talked about. I suppose what it gives them is a bit more autonomy. There's more risk of getting the wrong outputs, because they don't know the data as well and they're building on a less solid foundation, but they have more autonomy to work and build fast, which can be quite useful when you're just trying to get something out quickly.

Demetrios 00:29:33: Yeah, that's interesting, because it does feel like there is a very strong tendency for companies, once they hit a certain maturity, to have the data platform team, the ML platform team, and then the data scientists are almost the users of the ML platform; they're the customers of the ML platform team. And so the ML platform team goes and makes sure that those data scientists are taken care of and getting what they need.

Jake Watson 00:30:04: Yeah. I don't think the community at large has properly figured that out yet. And it can be quite tricky, because data access is one thing. The common one is data scientists wanting access to the raw data, which can be a whole minefield to get through for political reasons, because that data might be extremely sensitive for whatever reason, and companies are very worried about getting it out there. There's also the case that sometimes people go to the raw data too early, thinking they can just use it and figure it out themselves, when actually you need to step back a bit, have a look at the business logic that an analytics team has created, and then apply that to your machine learning models, if that makes sense. You're trying not to run before you can walk; fully understand your data estate and the domain you're in before you start hacking at the raw data, because you can end up wasting so much time just trying to recreate what the business intelligence team has already created. It's quite common to create features that are very similar to what a KPI is. Basically, models can end up just being a collection of KPIs.

Demetrios 00:31:48: Oh, interesting. Yeah. So don't reinvent the wheel. If you are looking for features, maybe the BI team has already created them.

Jake Watson 00:31:54: With a KPI, it's one thing. Try to look at the existing work done so far, rather than new things. Just try to find those low-effort, high-value things when feature engineering, rather than trying to create your own complex calculations just to create a feature. Try and find the low-hanging fruit, in that sort of sense.

Demetrios 00:32:20: Wow. Okay. Yeah, I hadn't thought about that, but it feels right. I trust your opinion on it.

Jake Watson 00:32:31: I don't know how controversial that assumption is, to be fair. But I even came across one blog a few years back where someone was wondering whether there might be some future where both business analysts and data scientists are using feature stores, which I think is not something really done, so it's quite controversial. Because at the end of the day, you're trying to build insights either via features or KPIs, and maybe feature stores are the one place where data scientists and business analysts can come together.

Demetrios 00:33:11: Are we just calling the same thing by different names, when you have an analyst calling it a KPI and a data scientist calling it a feature, but the root of it is really the same?

Jake Watson 00:33:24: Yeah, it can be. I'd say there are times when it can be different, because it can definitely be expressed differently in the outputs. A model wants integers or decimal numbers, preferably in a matrix; it doesn't want text. Whereas business intelligence end users want text. So there are sometimes differences like that, because an end user will want, say, a traffic light report, where they want green, red, amber, but a model will want one, two, three, or zero, one, two, or something like that.
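Here is a tiny illustration of that point: the same "KPI" expressed as text for a BI report and as integers for a model. The mapping and column names are assumptions for the example, not a standard:

```python
# The same traffic-light KPI expressed two ways: readable text for BI users,
# ordered integers for a model. The mapping and data below are made up.
import pandas as pd

kpi = pd.DataFrame({"region": ["north", "south", "west"],
                    "status": ["green", "amber", "red"]})

# BI view: human-readable traffic-light labels.
print(kpi)

# Model view: the same column encoded as ordered integers.
encoding = {"green": 0, "amber": 1, "red": 2}
kpi["status_code"] = kpi["status"].map(encoding)
print(kpi[["region", "status_code"]])
```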

Demetrios 00:34:03: Yeah. And one thing that I think both ML and data engineers can agree on is they have a shit-ton of pipelines happening all over the place. Like, pipelines are just the go-to. And we talked a little bit about Airflow before we hit record, and we've had a lot of people on here talk about how, for ML, Airflow can quickly become a bit of a mess, and so it might not be the best choice. I'm interested to hear your take: A, data pipelines, what do you see them as when it comes to the data platform, how do you look at them? And B, how do those then feed data into the ML platform, and what does that synergy look like?

Jake Watson 00:34:54: Yeah, I can definitely see how Airflow can run out of scale, especially for ML, because Airflow, at its most basic element, is creating a DAG, and ML almost likes to create a lovely loop: you get your output data, and a lot of ML algorithms like to create that loop where it's almost doing feedback, you could say. And it can be quite painful to scale as well, in terms of management. I can imagine some data scientists just want to get away from managing an Airflow pipeline; they want to be building new features, making their model better, getting more insights for their customers, rather than managing an Airflow instance. I know there are more interesting alternatives that might suit data scientists.

Jake Watson 00:36:02: I know there's Kubeflow, which is sort of more aimed at...

Demetrios 00:36:10: I mean, there's a ton, right? There's Kubeflow, there's Flyte, there's ZenML, and then there are the data engineering ones. You've got, like, Mage and Dagster and Prefect and all that fun stuff. So there's a plethora of them.

Jake Watson 00:36:25: Yeah, I think Prefect, for example, because I've used that in the past as well, works much better for ML. It's a bit more flexible and the API is easier. It's very much a system that has tried to learn from Airflow and some of its limits, and Airflow itself has done that a bit too; I think it has been upgraded a fair amount over the last few years. Again, it can really depend, because you can get away with using Airflow to a certain extent, depending on how many models you want to produce and what kind of models. If you're not feeding the data back to the start, it's much easier to build a pipeline, because as I say, it's all about building a directed acyclic graph, a DAG, and they prefer to just go one way. They don't want to be doing things like looping back on themselves.
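To illustrate the strictly one-way shape Jake describes, here is a minimal Airflow DAG sketch. It assumes Airflow 2.4+ for the `schedule` argument, and the task names and bodies are placeholders, not anything from the episode:

```python
# A minimal sketch of the one-way DAG shape discussed above, using Airflow's
# Python API (assumes Airflow 2.4+ for the `schedule` argument). Task names
# and bodies are placeholders; the point is that tasks flow strictly forward,
# extract >> transform >> load, with no loop back to the start.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")

def transform():
    print("clean and reshape the data")

def load():
    print("write the result to the warehouse")

with DAG(
    dag_id="one_way_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3  # strictly acyclic: no edge may point backwards
```

An ML retraining loop, where a model's outputs feed back into the next run's inputs, has to be expressed outside the DAG itself (for example, as a separate scheduled run reading the previous run's artifacts), which is part of the friction Jake is pointing at.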

Demetrios 00:37:27: Yeah, that's a great point. Well, Jake, this has been awesome, man. I appreciate you coming on here and teaching me a little bit more about data platforms. And of course, 2024, it's all about them data engineers, I'm going to call it right now. We're in January. We're going to be hearing a lot more about how vital they are, especially as these LLM projects run up against data problems and people start calling in the data engineers all over the place.

Jake Watson 00:37:56: Yeah, I've heard there's a lot of talk about intelligent data platforms, which is mixing generative AI and ML and data all into one place. So it's almost a symbiotic relationship, where you're using AI to make your data platform better, but you're also using your data platform to make better AI and ML.

Demetrios 00:38:21: Yeah, you see a lot of text-to-SQL large language models coming out because of that. So we'll see. Hopefully it looks promising. It seems exciting. Let's see where the future takes us. Yes.

Jake Watson 00:38:34: Just make sure you have the right data for it; have high-quality, robust data for it.

Demetrios 00:38:42: This is Skyler. I lead machine learning at HealthRhythms. If you want to stay on top of everything happening in MLOps, subscribe to this podcast now.

