MLOps Community

The Power of Combining Analytics & ML on One Platform

Posted Apr 16, 2024 | Views 105
# ML
# Analytics
# Twirl
# twirldata.com
SPEAKER
Rebecka Storm
Co-founder @ Twirl Data

Rebecka has 8 years of experience building data products and teams, and is passionate about enabling fast iterations on ML models. She is now co-founder of Twirl, a data platform for analytics and machine learning that runs within your cloud account. Before Twirl, Rebecka was Head of Data at Tink and ML Lead at iZettle, where she led teams building both internal and customer-facing data products. She has also co-founded Women in Data Science AI & ML Sweden, and works hard to get more women into the field.

SUMMARY

Analytics and ML often live in separate worlds: analytics happens in SQL and dashboards, ML in Python and notebooks. However, combining them on one platform brings a lot of benefits: speed, consistency, data quality, and autonomy. Building a platform that works well for both isn't easy, though. In this talk, Rebecka speaks about some approaches she's seen, some tricks for keeping analysts and ML engineers out of each other's way, and what Twirl is doing to bridge the gap between these two fields.

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Rebecka Storm 00:00:02: I'm Rebecka. I'm here to speak about the power of unifying machine learning and analytics on one platform. I know this is a machine learning crowd, but I'm hoping this will be an interesting topic anyway. Before I dig into this particular topic, I figured I'd introduce myself a little bit. I have a background in physics; that's what I originally studied. I thought I wanted to go work in a lab, so I did that for a year in the US, using tape.

Rebecka Storm 00:00:27: That was really cool, but not my thing. I realized quite quickly while I was there that what I really enjoyed was working with data, the sort of experimental data we were collecting there. That's what I've done since then. I spent many years at iZettle in different roles, working with data, mostly focused on data science and machine learning. I started out as an individual contributor, and I like to say I did a lot of applied machine learning there, working close to the business on all kinds of problems. I did everything from credit scoring to fraud detection.

Rebecka Storm 00:00:59: I worked with the sales team, I worked with marketing automation and optimization, as well as inventory forecasting for small businesses. I ended up leading the machine learning team at iZettle for about two years and then moved on to Tink, where I joined in a very different role: I was Head of Data, managing three teams doing analytics and data engineering. So I've also seen a bit of that side. And after a while, I started to see a pattern. I was increasingly frustrated by how all the teams I had worked with, at different companies, doing different kinds of work, were spending so much time on laying foundations. So much of the team's total time was always going into foundations and preparing for all the nice things we'll do in the future.

Rebecka Storm 00:01:46: From an engineering perspective that can be quite satisfying: you're building this great foundation. But I was also really frustrated by the fact that we couldn't spend more of our time working closer to the actual problems we wanted to solve at the end of the day. That eventually led to me starting Twirl, which is the company I co-founded and am running today. I'll talk a little bit more about Twirl later. When we started Twirl, my co-founder Eric, who's over there, and I talked to a lot of companies about their data platforms and their data stack. We have talked to over 100 tech companies at this point about their data platform, and we learned a lot from that. Some of the learnings were exactly what we expected, given what we had seen ourselves, and some were really surprising. But one that perfectly resonated with my own experience is the topic for today's talk.

Rebecka Storm 00:02:37: Basically, the idea of combining analytics and machine learning in one platform. The clear pattern we saw when we interviewed this fairly large number of companies was that those who were serious about both analytics and machine learning, if they hadn't already standardized on one platform for those use cases, were going in that direction. It's common to start from analytics and then add machine learning functionality, or vice versa. Some companies had started with entirely separate setups, but were gradually moving towards a unified approach. I thought this was interesting because it exactly matched my experience: at iZettle, where I had worked, we had one platform for both analytics and machine learning from very early on, whereas Tink was in the process of unifying theirs at the time I joined. So that's my topic for today; that's what I'm going to talk about. And the first and most obvious question is: why would you want that? Why would you want one platform? Why do I think that makes sense, and why do companies tend to move in that direction? I think there are a couple of reasons.

Rebecka Storm 00:03:39: I think the most obvious one is the first one here. Basically, if you can reuse components across different data disciplines, you'll usually be a lot faster. You can reuse definitions of metrics, of what's a customer. You can reuse entire pipelines; you can basically reuse anything that any other data team member has built. One of my favorite examples of how this can be successful is from iZettle, where one project we worked on was trying to generate leads for the sales team. We wanted to tell the sales team which of the recently onboarded customers they should be calling and helping get set up, which ones had a high likelihood of being good users, users worth the sales team's time.

Rebecka Storm 00:04:35: And it took us three days from coming up with the idea for this project until we had a first model running in production and sending them leads. We obviously took a lot of shortcuts along the way. I think in the very first version we were populating a Google Sheet or something like that, which they worked from to action all these leads, and we iterated. And the ML model at this stage was super, super simple. But just getting from idea to a model in production in a few days is kind of rare, I think. And that was very much a consequence of being able to reuse existing definitions of what's a promising lead, what do we mean by a customer, things like that. Another one, which I guess is a little less obvious, is consistency, but I think this one is probably familiar. How many of you have spent time troubleshooting why one value of a metric isn't the same as this other value of the metric over here? I think it's super common to have multiple definitions of the same thing.

Rebecka Storm 00:05:29: One example would be: you're training a churn model, and you have one definition of churn that was implemented by the ML team. Then you have a dashboard that shows churn by month, and it shows a different number, because it's a different definition of churn. And then you might have a third definition when you're running an A/B test to see if you can reduce churn. I think this is a very common challenge, and sort of well known. And I have a strong belief that combining analytics and ML on the same platform reduces it: it makes it much easier to build your ML model on top of the definition that the analytics team has already built, for example when they built the churn reports and dashboards. In a perfect world, what I just described is how things would always work. Most ML projects would start from a clearly defined label that we've been tracking for a while; we monitor this metric already, we know its baseline value, and the goal of the project is to improve it.
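(A rough sketch of that kind of reuse, not from the talk: an ML training set can read from the same churn table the dashboards do, so the model trains on the exact definition the business reports on. The table, column, and connection names below are invented.)

```python
# Hypothetical sketch: one shared churn definition feeding both a
# dashboard and an ML training set. All names here are invented.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse/analytics")

# The same table the analytics team's churn dashboard reads from
labels = pd.read_sql(
    "SELECT customer_id, month, churned FROM analytics.customer_churn",
    engine,
)
features = pd.read_sql(
    "SELECT * FROM analytics.customer_features",
    engine,
)
# Join features to the shared label to build the training set
training_set = features.merge(labels, on=["customer_id", "month"])
```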

Rebecka Storm 00:06:24: I don't know about the rest of you, but my experience is that that perfect world is not very common. The opposite has actually been more common for me: for the problems I've worked on, we came up with a project we wanted to do, or some ML model we wanted to build, and then had to first go define the label, because it didn't exist. A good example here is also from iZettle, where I worked on credit scoring for our first lending product. iZettle was issuing loans to small businesses, and we wanted to build the first machine learning model to do credit scoring and figure out who should be offered a loan. As we were about to start building this model, we asked: okay, what label do we train on, what's the outcome here? And that turned out to be non-trivial, to say the least. I think we spent months on just this problem of which label to train on, because ultimately that's a question of who iZettle, in this case, wants to offer loans to, and that's not obvious. Is it the ones where iZettle will make the most money? Is it the ones where the loans will help the small businesses successfully grow? Is it the ones where we see their business growing thanks to the loan? It's complicated.

Rebecka Storm 00:07:30: And by keeping analytics and ML on the same platform, all that work that goes into creating a definition for the machine learning case can be reused for other things, like analytics and insights, and tracking the performance of the machine learning model, among other things. The next point, which I think is less obvious, is data quality. My experience, and this was also the case in many of the companies we interviewed, is that data quality, which I think most consider a huge problem, tends to be higher in data sets that are used for multiple things, especially if they're used for something that's customer facing. If you have a customer-facing feature that's going to go down, or look terrible, if the data is wrong, then all teams in the company tend to care a lot more about that data being correct, and that helps a lot with data quality for all kinds of other purposes. So there's a win-win in having multiple teams work on top of the same data sets. You can again use the lending example from iZettle. This was a product where loans were issued based on data from the data platform. Those data sets were in perfect condition, or no, they weren't, but they were in great condition compared to most other data sets, because if anything was wrong in them, it cost the company money; we might lend money to the wrong company.

Rebecka Storm 00:08:47: And that was pretty expensive. So all the teams involved in producing and transforming data tended to really care about keeping it high quality. The final point is autonomy. When machine learning and analytics happen on the same platform, that typically lowers the barrier to doing machine learning for lots of people who might not be ML experts. My favorite example here is actually from Tink, where the product analytics team, pretty much completely independently, managed to build and deploy an anomaly detection model for a metric that they had been building and working on for a long time. They were in the perfect position to understand that this is a metric that needs to be monitored, and to understand the metric and its expected behaviors, because they had already worked with it. If you had brought in two ML engineers to do this, it would have taken them much, much longer to get started on the problem, because they wouldn't have had the deep understanding that these product analysts had from the very start. So by making it easy for a team like that product analytics team, you can have a lot more models up and running in production, and you can create a lot more value.
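(A toy sketch, not Tink's actual model: one simple way to monitor a metric like that is a rolling z-score check. The function name, window, and threshold below are invented for illustration.)

```python
# Hypothetical metric anomaly detection: flag points that deviate more
# than `z` rolling standard deviations from the trailing mean.
import pandas as pd

def flag_anomalies(metric: pd.Series, window: int = 28, z: float = 3.0) -> pd.Series:
    rolling = metric.rolling(window)
    zscores = (metric - rolling.mean()) / rolling.std()
    return zscores.abs() > z

# Usage: anomalous_days = daily_metric[flag_anomalies(daily_metric)]
```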

Rebecka Storm 00:09:53: So these are some of the advantages of combining machine learning and analytics on the same platform. For now, I'm going to just assume that I've convinced you it's a good idea to keep them both on the same platform. And I think that sparks the next question, which is how? How do you build a platform that works for both? That's fairly non-obvious, I think. So I'm going to try to break it down into the steps companies usually take: where do they start, what do they think about next, and where do they end up? I think one of the most important prerequisites, if you want one platform for both analytics and machine learning, is that it needs to be able to run transformations in both SQL and Python. Those are the typical languages of choice for analysts and ML engineers or data scientists, respectively. For SQL, a common approach is to use something like dbt; it gives you a nice way to work with and chain lots of SQL transformations. Python is a little bit more tricky. I would say the most common setup, although it's gradually being more and more challenged, is Airflow.

Rebecka Storm 00:10:51: So introducing Airflow and working towards common best practices: running all jobs in separate containers so that you don't have dependency madness, figuring out some nice CI/CD, etcetera. Those, I would say, are the two most important languages to support. Of course, if you can support more, that's better, and a lot of companies will end up needing to do more than just plain SQL and Python, but this I would say is the minimum. The problem with the setup I just described, something like dbt and then Airflow on top of that with containers, is that it makes it hard to mix and match. It's not easy to go back and forth between SQL and Python, but I'll come back to that later.
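(A minimal sketch of that Airflow-with-containers pattern, not a prescription: each Python job runs in its own image so its dependencies stay isolated. The DAG and image names are invented, and this assumes Airflow 2.4+ with the Docker provider installed.)

```python
# Hypothetical DAG: the training job runs inside its own container,
# so its pinned dependencies can't collide with other jobs'.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="churn_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    train_model = DockerOperator(
        task_id="train_model",
        image="registry.internal/churn-train:latest",  # deps baked into the image
        command="python train.py",
    )
```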

Rebecka Storm 00:11:28: The next thing you typically want to do is create a good local developer experience. By local, I mean a way for people to run code before it's deployed to production. You basically want to test that things work before you deploy, so that you don't break things in production. How many of you have ever broken something in production? Yeah, I think it's quite common, partially because any data system is hard to test properly; you need to test it on good, production-like data. But I think it's also because tools in this space aren't very good, or haven't been very good. I don't know how many of you work with tools like dbt, but I think dbt has done a fantastic job in creating that local developer experience, making it easy to test multiple tasks that depend on each other and to test your full pipeline. Airflow, for example, has not done a particularly good job in making this easy, although it's getting better.

Rebecka Storm 00:12:22: But this tends to be something that companies who want one platform for analytics and machine learning spend time on, making this a smoother process so that you avoid those costly errors in production and catch them early instead. The next thing that's common to run into, and the next thing you need to figure out, is basically: how do I avoid my analysts breaking my machine learning models, or vice versa, my ML engineers wreaking havoc and bringing down the dashboards that the CEO looks at every day? This is a very hyped phrase today, but I think some kind of data contract is the solution to this. There's an increasingly clear convergence towards this as the best practice for keeping people from getting in each other's way and breaking things for each other. I'm not going to spend too much time talking about this, because it's a big and interesting topic on its own, but I do think it's almost a necessary part if you want one platform that is used by different kinds of people with different skill sets and different use cases. Related to this is some kind of ownership. I could have phrased this as code ownership or data ownership, but especially if you have one system that's becoming a big monolithic system, it becomes really important to know who built what data set, who's responsible for it, who I should ping on Slack if it breaks, who should be monitoring it, and so on. And here I know that data mesh is a very trendy approach to fixing this.

Rebecka Storm 00:13:53: But what I've seen work well in the places I've worked, and also heard a lot of other companies mention, is a very simple way to achieve this: just assigning code owners on GitHub. That has worked pretty well as far as I've seen. To be clear, if not everyone is used to this: it's a feature in GitHub that basically lets you assign a code owner to a directory, so that the people in that team, or an individual person, get pinged whenever a pull request is opened to change that code. It's a way of assigning ownership without breaking the system up into smaller pieces (a sketch of such a file follows below). The next thing, and this is where it starts to get really messy, is when you want to process data at multiple cadences. A common situation here is that you have your Airflow DAG running once per day. That works well for a while, until suddenly you want to make some machine learning prediction more often, or you have a dashboard that needs to be updated more frequently. It doesn't really matter if it's an ML or an analytics use case, but if you have one big DAG that you need to run every day, another that you need to run every week, and a third that you need to run every hour, this gets messy quite quickly. It can be fine as long as things work, but when something breaks and you need to do a backfill or a rerun, it gets super messy.
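(For the ownership point above, a hypothetical CODEOWNERS file, placed at .github/CODEOWNERS; the team and path names are invented.)

```
# Each team is pinged on pull requests that touch their directory
/pipelines/analytics/   @acme/analytics-team
/pipelines/ml/          @acme/ml-team
/pipelines/shared/      @acme/analytics-team @acme/ml-team
```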

Rebecka Storm 00:15:05: The reason the multiple-cadence setup gets messy is that you essentially have multiple dependency DAGs, but they're not explicit in your code, so you need to keep track of how they depend on each other yourself. Modern orchestrators like Dagster and Prefect make this possible to fix, so that you can deal with multiple cadences within the same DAG. But it's quite a lot of work; it's a pretty steep learning curve, and it takes a lot of time to get a system up and running that works well for this. The final step on this curve, and I'm obviously oversimplifying a little bit here, is when you start to need services. This is more common for ML use cases than analytics, although it can happen with both: you have some use case where you need true real time.
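(A rough sketch of making those cross-cadence dependencies explicit, here in Dagster; the asset and job names are invented, and this assumes a recent Dagster 1.x API. The hourly job depends on a daily asset, and the orchestrator tracks that dependency instead of you.)

```python
# Hypothetical Dagster assets on two cadences with an explicit dependency.
from dagster import (
    AssetSelection,
    Definitions,
    ScheduleDefinition,
    asset,
    define_asset_job,
)

@asset
def daily_transactions():
    """Rebuilt once per day, e.g. from the warehouse."""
    ...

@asset
def hourly_lead_scores(daily_transactions):
    """Scored every hour; the dependency on the daily asset is explicit."""
    ...

daily_job = define_asset_job("daily_job", selection=AssetSelection.assets(daily_transactions))
hourly_job = define_asset_job("hourly_job", selection=AssetSelection.assets(hourly_lead_scores))

defs = Definitions(
    assets=[daily_transactions, hourly_lead_scores],
    schedules=[
        ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *"),
        ScheduleDefinition(job=hourly_job, cron_schedule="0 * * * *"),
    ],
)
```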

Rebecka Storm 00:15:47: The previous point was about anything that's batch, even if it's mini-batch and fairly low latency. But once you start needing sub-second latency, you enter a whole different world, and I think many of you are painfully aware of this. Once you need this, the default solution, or the one that I've seen and experienced, is that you build separate backend services; once you need sub-second latency, that tends to be the most common solution. And that means you're creating a world that's completely separate from your main data platform. It typically also means you're creating a separate world for prediction time compared to training, which has some obvious problems, like training-serving skew, which I think is probably a familiar topic for most of you. But what's also interesting is that when you create a separate backend service to deal with incoming requests and make predictions live, you also miss out on the possibility to use data points from that main platform.

Rebecka Storm 00:16:44: They typically sit in your data warehouse. They're the outputs of pipelines that you're using to monitor churn or keep track of your users' behaviors. If you want to also use those kinds of slowly changing data points in your machine learning models, you'll end up having to build some kind of layer to copy data from your data warehouse into some lower-latency database, so that your microservice can fetch those features as well, the ones that are user-specific and don't come in the request to the service directly. So again, this is a little bit hand-wavy, but this is roughly the path that many companies go down when they want to build a complete platform to solve all these types of problems related to both ML and analytics. And as I think you can tell, it's very possible, but it's also a lot of work. I think that's why companies often feel the way I did at both iZettle and Tink: why are we spending so much time on building the foundation? When do we get around to solving the problem? And this is basically the whole idea behind Twirl; this is kind of where Twirl comes in.

Rebecka Storm 00:17:48: So this is what we're doing. We are taking on all of these problems, and a couple of others, basically trying to build, once and for all, a solution to the problems that are not so business-specific but more foundational, enabling work on the actual, more interesting, data-specific problems: building the actual models. I'm not going to talk much more about Twirl, because that's not the topic of this talk. But I will say: if you're interested in this, please come talk to me after. I'm very excited to talk about all these different aspects of working with data. That's great. Thank you.
