MLOps Community

Why All Data Scientists Should Learn Software Engineering Principles

Posted Jul 05, 2024 | Views 244
# Data Scientist
# Software Engineering Principles
# Coding
Catherine Nelson
Freelance Data Scientist

Catherine Nelson is the author of "Software Engineering for Data Scientists", a guide for data scientists who want to level up their coding skills, published by O'Reilly in May 2024. She is currently consulting for GenAI startups and providing mentorship and career coaching to data scientists. Previously, she was a Principal Data Scientist at SAP Concur. She has extensive experience deploying NLP models to production and evaluating ML systems, and she is also co-author of the book "Building Machine Learning Pipelines", published by O'Reilly in 2020. In her previous career as a geophysicist, she studied ancient volcanoes and explored for oil in Greenland. Catherine has a PhD in geophysics from Durham University and a Masters of Earth Sciences from Oxford University.

Demetrios Brinkmann
Founder & CEO @ MLOps Community

Data scientists have a reputation for writing bad code. This quote from Reddit sums up how many people feel: “It's honestly unbelievable and frustrating how many Data Scientists suck at writing good code.” But as data science projects grow, and because the job now often includes deploying ML models, it's increasingly important for DSs to learn fundamental SWE principles such as keeping your code modular, making sure it is readable by others, and so on. The exploratory nature of DS projects means that you can't be sure where you will end up at the start of a project, but there's still a lot you can do to standardize the code you write.


Catherine Nelson [00:00:01]: I'm Catherine Nelson. I'm the author of Software Engineering for Data Scientists and a freelance data scientist. And I make my coffee in a fancy espresso machine. And I like a latte.

Demetrios [00:00:15]: What is up, MLOps community? We are back with another podcast episode. I am your host, Demetri-os, and today we're talking with Catherine all about software engineering for data scientists. This is an episode where we break down what is going on when it comes to data scientists leveling up and understanding how systems work, not just their little piece of the puzzle. So if you are a data scientist and you are looking to expand your skills, this is a great one for you. And in addition to this, I would highly recommend that you get Catherine's book that just came out, with the title Software Engineering for Data Scientists. And if you're not a data scientist and you're listening to this, hopefully it will give you a little more empathy for the data scientists out there, because a lot has been put on their backs. One thing that I was thinking about after we got off this call and after I talked with Catherine is just how vague a data scientist's job description is, and their roles. In one company, it could mean one thing; in another company, the tasks that you are expected to undertake can be completely different.

Demetrios [00:01:23]: And because of that, because there's not that standardized understanding of what a data scientist actually does and should know and should be able to accomplish, well, you get a bit of everything. And so you may have that data scientist who knows how to write in YAML or who is able to take ownership of the model they create. They put it into production. They don't just throw it over the wall. Or you might have the complete opposite, where there's a data scientist who doesn't understand git. So this is an episode to go into the cultural aspect of that. And if you are a data scientist, to give you a bit of direction on what you should know and also why you don't know these things. Let's get into it with Catherine.

Demetrios [00:02:18]: And as always, if you like this episode, it would mean the world to me if you can share it with just one friend. And get Catherine's book; links are in the description. Talk to you later. I have to mention, to start, your first book, because I think you wrote it with a friend, Hannes, right?

Catherine Nelson [00:02:45]: That's correct. Yeah. Yeah. That was my entry into the world of writing books. And, yeah, I was really glad to have that opportunity to write with Hannes. It was a ton of fun and, yeah, really enjoyed that.

Demetrios [00:03:00]: Yeah. And so the other thing that I wanted to mention before we really get into the meat and bones of software engineering for data scientists, and that whole discussion, is the idea of meeting in person more often. And I know that you just got back from PyCon.

Catherine Nelson [00:03:17]: Yeah, that's right. Yeah.

Demetrios [00:03:18]: How was it?

Catherine Nelson [00:03:20]: It reminded me how good meeting in person is. It's not the meetings that you expect to happen that really made it special and worthwhile. It was the unexpected things, like the dinner with an old friend where we just discussed anything and everything and all our thoughts on the rise of AI and what we were doing in our careers and all these kinds of things, and you just don't get that kind of thing when you have to pre-plan meetings. So, honestly, it felt intimidating going to a big conference, thousands of people. I work remote. I don't spend time in places where there are thousands of people, but it was good to remember just how worthwhile it is.

Demetrios [00:04:19]: So I had almost the same situation happen because I just got back from the Data and AI Summit, and it's so funny, you cross paths with people and you realize, hey, I didn't know you were going to be here. Oh, yeah, let's hang out. And then all of a sudden, some people that I've only been talking to online for the past four years, now I get to see them in person, and we have coffee or we have dinner or we have lunch, whatever it may be. And you can't get that if it's like, hey, are you free? Yeah, send over my Calendly link, and then it's blocked off. Those serendipitous moments are very hard to get virtually. And so I feel you 100%. And that's also.

Demetrios [00:05:07]: I mean, we're doing the AI quality conference. I am thinking about that so much right now, is how to make the attendee experience just top notch and provide for those serendipitous experiences. Because I imagine there's gonna be a lot of people that are gonna know others at the event, but they don't realize that the other person is going until they get there and they're waiting in line for coffee or to register with them, and then it's like, hey, let's hang out. And so how do you have all of the cool talks, but also, how do you have the spaces for people to hang out and have those moments?

Catherine Nelson [00:05:43]: Well, I have one suggestion for you, which I heard at a conference a while ago, and they actually announced in the sort of intro keynote talk that if you were standing in a circle with a bunch of people talking to each other, they encouraged you to leave a gap in that circle.

Demetrios [00:06:01]: Oh, nice.

Catherine Nelson [00:06:01]: So that someone could join in. Like, if you don't know a bunch of people to get started with and there's a bunch of people hanging around talking, it feels intimidating to step into that. But if you leave a gap to encourage new people to join, that just makes more space for those serendipitous.

Demetrios [00:06:21]: Yes, it's much more welcoming. I know we didn't come here to just talk about in person tech conferences all day. I really want to know about your new book, which is all around software engineering for data scientists, right?

Catherine Nelson [00:06:38]: Yep. That's, that's the title. That's what it's all about. Yeah.

Demetrios [00:06:43]: Yep. You can't get any more clear than that. I think it is a topic that has come up so many times in the last four years doing this podcast. And whether it is like, I think one of the huge champions for this is Matt Sharpen talking about clean code for data scientists, or we have Laszlo in the community who is gung ho about like production code and what you need to know if you're a data scientist. And so there's a lot of threads we can pull on. Let's just start. Why did you feel inspired to write this book?

Catherine Nelson [00:07:23]: Yeah. So it's something that gets so much bad press. You see all these Reddit threads about, like, data scientists write terrible code. What is going on? I'm ashamed of this. But then my experience, when I was early in my data science career: I was the only data scientist on a team of designers and engineers and product people. Nothing in my background, in the courses I'd taken, had prepared me for that or given me the language to work on a software team. So I had questions like, what is an API? What is a test? How do I write a test? And I wanted to learn more about this.

Catherine Nelson [00:08:13]: And I started asking people for recommendations, for books, for whatever. And the books and the examples were all aimed at web developers, or the examples were in Java or similar things. There was a whole load of extra knowledge that I needed to be able to get at all the good stuff about what clean code is. That wasn't really useful to me as a data scientist.

Demetrios [00:08:44]: It wasn't approachable because all of a sudden now you're like, wait, C? What is that, Java?

Catherine Nelson [00:08:51]: Yeah. Do I need to learn this language well enough to understand the examples, to learn what clean code should be?

Demetrios [00:08:59]: Oh, yeah. So that can be very painful.

Catherine Nelson [00:09:02]: And then later on, I had mentees who I saw having exactly the same problem. So around about middle of 2021, I was just thinking, this should be a book. I didn't think I was going to write it, but I was like, this really should exist. There should be a guide that's aimed at people who are writing code in Python, who are not particularly developing for the web, but they want to write good code, because as data scientists, we don't want to write bad code. We're not writing sprawling Jupyter notebooks just to be perverse or anything. There are good reasons for this.

Demetrios [00:09:45]: Yeah, I thought they were into S and M, and that was the reason why.

Catherine Nelson [00:09:51]: And just like the exploratory nature of a lot of data science projects does lead you to write long scripts and do that quickly, just to kind of try out your ideas and go through that process. It's useful to explore data in a Jupyter notebook where you can see examples of it easily. But then as data science has become more mature, then you start to need to put that code into production, and that just needs a completely different way of thinking about things.

Demetrios [00:10:26]: Yeah.

Catherine Nelson [00:10:27]: So actually, the bar becomes a lot higher for data science because you have to know how to explore your ideas quickly, but then also turn it into solid, robust production code.

Demetrios [00:10:40]: Yeah. So it's funny, you mentioned the idea around wanting to explore as fast as possible, and basically there, the code is the least of your worries. So you're just trying to see, like, hey, is this assumption that I have actually valid? Is there something there? And you're figuring out where you're going to go with things and what you're going to invest time into. And so the last thing that you want to be investing time into is making sure that your code looks nice, because that's the least of your worries.

Catherine Nelson [00:11:13]: Yeah, you just got to hack it together as quickly as possible. But what I wanted to equip people with in my book was to know the difference between the code that you just hacked together as quickly as possible and what good code is. I take a whole chapter for that; the chapter is called "What Is Good Code?" And I just explore that idea and why you might want to do this, because it helps you scale up, it helps you avoid technical debt when you get into production systems, and so on.

Demetrios [00:11:50]: Yeah, it does feel like the wise data scientist understands that difference and they know, all right, I'm doing this now just to move fast. But once I have something and once I'm ready, I'm not going to just say, this is okay to send over to someone else on the team. I need to take the time and really format it, or just put in those hours to make sure that people understand where I'm coming from and it's not just spaghetti code.

Catherine Nelson [00:12:28]: And as you get more experience, you start to develop a sense for when pieces of that code might be useful for another project or similar projects, and you develop a library of things that you can use. And because you're going to run it repeatedly, you want to be sure that it works well, you want it to be efficient, you want it to be tested, and so on.

Demetrios [00:12:52]: Yeah. So there are a few things that you mentioned where you feel like there's a disconnect, right? And I've heard it a lot on here where people will say, wow, I worked with a data scientist and they didn't even understand git. And so that's like the meme, like you were saying on Reddit. You hear it a lot. I've been guilty of it.

Catherine Nelson [00:13:15]: Yeah.

Demetrios [00:13:15]: And it's so easy to throw data scientists under the bus. Sorry to all of you listening out there, but I feel like it's getting better. With books like yours, and just movements and people talking about it, it's understood now that you shouldn't be doing that, or that it's not the best. Where else do you feel like there is a disconnect? Because you mentioned that, from all of your training to become a data scientist, you hadn't understood things like APIs, and code review, I imagine, was a foreign concept. What else is there where you're like, okay, let's build a bridge into more software engineering practices that data scientists can follow?

Catherine Nelson [00:14:05]: Security was definitely another one that's something that isn't talked about in data science courses because you're like, oh, let's go download this dataset from the web. We don't need to take any care with that. We don't need to treat that securely.

Demetrios [00:14:23]: Or the PyPI packages. It's like, yeah, we'll just download any of these PyPI packages, right?

Catherine Nelson [00:14:29]: Yeah, we'll just save our models in pickle and upload them everywhere.

Demetrios [00:14:34]: All classic.

Catherine Nelson [00:14:35]: But if you're just dealing with open data, open source software, like most of the data science training, then why would you know any of those aspects of things? It's just not relevant. But then suddenly you're working in a software team, you're working with real people's real data. It becomes really important.
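To make the pickle remark concrete, here is a minimal sketch of why loading a pickle from an untrusted source is risky. The `Payload` class and `load_untrusted` helper are hypothetical names made up for illustration, and the harmless `eval("6 * 7")` stands in for whatever code an attacker might choose; this is not an example from the book or the episode.

```python
import pickle


class Payload:
    """Hypothetical object someone could hide inside a shared 'model' file."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it at load time.
        # A harmless arithmetic expression stands in for arbitrary code.
        return (eval, ("6 * 7",))


def load_untrusted(blob: bytes):
    # pickle.loads will call whatever callable the payload names.
    return pickle.loads(blob)


blob = pickle.dumps(Payload())
result = load_untrusted(blob)

# The loaded "model" is not a Payload at all: it is the value returned by
# the code that ran during unpickling.
print(type(result).__name__, result)
```

The point of the sketch is only that unpickling runs code chosen by whoever produced the file, which is why model files from unknown sources deserve the same care as executables.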

Demetrios [00:15:00]: And are there specific ways that you found basically just to level up? Let's take the security one, for example. Going from me not knowing anything to me at least having a grip on it, should I just familiarize myself with the terms, or should I familiarize myself with all the different ways that there are vulnerabilities? Or should I just know: don't download open datasets off of Hugging Face?

Catherine Nelson [00:15:25]: In my book, the chapter on security is just like, hey, security is a thing you should think about. This is what a risk means, this is what a threat means, this is what a vulnerability is. Here's where you can go and find ones that might be relevant to the work that you're doing. Some basic stuff on just physical security and some details about the consequences when people don't. So I put in some details on high-profile security breaches and just how big a deal that is. Security for machine learning is this whole new thing as well. And adversarial attacks on machine learning models, that is absolutely something that data scientists and machine learning engineers should be aware of and thinking about: just the possible methods of attack and what they can be doing to mitigate those.

Demetrios [00:16:17]: It was funny, I saw a LinkedIn post. You know how LinkedIn now will prompt you to answer a question? It's like, you're an expert and you should answer this question. I'm pretty sure they're trying to train a model off of all those answers, and I saw something that I thought was hilarious because somebody's answer popped up on my feed. It had nothing to do with the question. And it was obviously just data poisoning. And so I can only imagine that person writing a whole lot of answers, or potentially just prompting ChatGPT with, hey, I'm an angry, grumpy coder who doesn't like this. Answer these ten questions for me, and then they go and answer them. And now you have to think about data poisoning.

Demetrios [00:17:05]: So hopefully people also are taking that into account. But that, I would imagine, is more in the realm of a data scientist. Like, they understand that much more than they understand Kubernetes.

Catherine Nelson [00:17:18]: Yes, definitely. And a lot of what you can do about that comes down to careful monitoring and analysis of your model and how it's performing. So that is absolutely in the realm of the data scientist or the machine learning engineer.

Demetrios [00:17:32]: Yeah, yeah. We would always ask, I think, in the beginning of the community, basically in 2020, when things were very unclear on who owns what: What is a data scientist? What is an ML engineer? Not that I'm saying it's super clear now, especially with the rise of the AI engineer and the question of what's the difference between that and an ML engineer. But I remember we would constantly be asking one another, who gets the ping at 3:00 a.m. when the model's gone rogue? Who's owning it? What does the ownership look like? Nine times out of ten, it's not the data scientist because they're not able to be like, oh, roll back and they can.

Catherine Nelson [00:18:19]: They're the ones taking the big-picture view rather than the immediate production monitoring, I should think.

Demetrios [00:18:26]: Exactly. And so that kind of dovetails nicely into the next question that I wanted to ask, which was, yeah, from data scientist to machine learning engineer. If we loosely describe a data scientist as someone who is very much about the modeling, working in statistics and analysis and creating models, but less on the infrastructure side, and an ML engineer as someone who maybe is more on the platform side, supporting the data scientists, and less on the modeling and definitely less in the Jupyter notebooks realm of things: how have you seen people successfully transition from a data scientist to a machine learning engineer?

Catherine Nelson [00:19:20]: I think the biggest thing is a change of mindset, because you're moving from someone who's focused on exploration and uncertainty and, oh, I've got this big problem to solve, but I don't know how I'm going to do it, and I'm completely happy with that and exploring that space, to someone who's saying, I'm going to make this run repeatedly, I'm going to make this scale up, I'm going to make sure it's running well. So you're moving from that very open and exploratory type of a job to something where you're very focused on standardization and making sure that everything is efficient and robust and tested. So you might still be running some experiments, but it's in a narrower field. So yeah, I think that's the main thing. That's what I've seen people do. And a lot of the way that they've gone about that is just getting really interested in the code that you're writing and learning how to write that well, so that you're less focused on the project direction and the science and more interested in, like, oh, how am I actually doing this task? How do I do that efficiently? And thinking about how to scale things up, thinking about how to make each piece more efficient and fit together better. Yeah.

Demetrios [00:20:55]: It does feel like it's going from the ad hoc nature to the reliability piece and saying we've done this once and now we're going to do this n plus one.

Catherine Nelson [00:21:11]: Yes.

Demetrios [00:21:12]: So you have to figure out how to do it scalably. All these questions are kind of, for lack of creativity, the software engineer's bread and butter. Right. And so I do understand the idea of changing the mentality from being that exploratory nature to being more software engineering focused, where.

Catherine Nelson [00:21:39]: You've kind of established that a thing is possible and you know you're going to do it. You have a good idea of what the outcome of the project is going to be at the start, and you know that you need to build that.

Demetrios [00:21:53]: And so have you seen those PoCs be hacked together, just really trying to move as fast as possible to get something out, and then we can iterate on making it better? Is that how you generally would go about the projects, or do you still want to focus on making sure things are tip-top before they get pushed out, even if it is going to take longer?

Catherine Nelson [00:22:24]: It's going to be a balance. Right. And this is something where I feel there's not a good established process for moving from that PoC to something in production. There's been talk for a long time of, do you put notebooks in production? Do you try and find some way of just taking that exploratory code and just running with it? Or do you sit down and do a big refactor into something that is more usable by the rest of the production system? And I'm firmly in the second camp. I think taking that messy code and then refactoring it is absolutely the way to go, but that requires a lot of skills that might not be the bread and butter of the data scientist, and it's something that I don't often see baked into the data science project lifecycle. You kind of need to set that time aside to do that stuff.

Demetrios [00:23:32]: How many story points?

Catherine Nelson [00:23:34]: Yeah, right. Quite a few actually.

Demetrios [00:23:39]: Yeah, exactly. That is so funny that you mention it. And it is almost like, with your book, data scientists don't have the excuse to just throw it over to somebody else. Because now, hopefully, if any software engineers or data engineers or anybody out there is working with a data scientist who says, hey, well, I don't know how to make this better, then just let us know and we can send them a copy of your book, and they can read it and have a few story points for that one.

Catherine Nelson [00:24:16]: There you go. I mean, there might be situations where it is a good idea to hand that code over, if you have some particular requirements, if the data scientist needs to move on to another project. But then that means there's a gap between the code and the intention of what the data scientist was aiming for with that model. You lose that insight into why that particular model was chosen if they're no longer responsible for that code. So it makes it a little harder to then update the model in the future. If updating the model needs the data scientists to go back to their exploratory code and then train a new model and then hand that over and have someone else write the code to actually put that new model into production, that doesn't seem like a smooth workflow to me.

Demetrios [00:25:17]: No. Yeah, that's not scalable. What is a better way?

Catherine Nelson [00:25:21]: I think if the data scientist is able to own that model, to a certain extent, it's helpful. I don't want the data scientist to end up doing everything, but at least being able to work together with the software engineer on that production code so that they have an understanding of what's gone into it, being able to communicate better rather than just throwing it over the wall and thinking their job's done.

Demetrios [00:25:53]: Well, because the pushback that I've heard from quite a few people is that you want to have everyone working in their zone of greatness. And so if you do have a data scientist whose zone of greatness is more that exploratory, being able to spot trends or anomalies or whatever it may be, understand the data and have a very intimate relationship with that data, that's where their zone of greatness is. And then any moment that they're not doing that, you're really missing out on someone, that you're paying a lot of money to do that.

Catherine Nelson [00:26:30]: That makes a lot of sense. Yeah.

Demetrios [00:26:32]: Have you thought about that other side of it and, I guess, what the pushback would be if that was it? Because intuitively, I know what you're saying is a better route because you don't have these hangups, but I also see the side where you have that problem when people are saying, well, I'm good at this, I'm not good at refactoring code, or I'm not good at all this other stuff that comes along with it.

Catherine Nelson [00:27:07]: If I was a data scientist whose specialization was in that, whose zone of greatness was the exploration, I would still want to be able to read and understand the production code that is based on that work. So I think that's one of the things I'm aiming for with my book: to equip people with the language and the terminology so that they can understand more complex code, even if they're not writing it themselves.

Demetrios [00:27:42]: So what are some of these keys to being able to work with software engineers or data engineers or just data scientists? If you're the lone data scientist on a team, how have you seen that be successful?

Catherine Nelson [00:27:56]: I think one of the big things is just understanding the incentives of the software engineers. This is common to anyone in a different discipline that you're going to work with. What is their background? What makes them tick? What are they trying to achieve? And hopefully they do the same with you, so that we're all working towards this one thing in different ways. But, yeah, understanding how important things like standardization are to a software engineer and why that might be important, because that's what makes code easier to maintain. That's what makes it easier to get started on someone else's code. All these things. And more specifically, if I'm thinking about trying to understand more complex code, then having some knowledge of things like object-oriented programming, what that is, what that looks like. What a suite of tests looks like. What is a unit test? What is an integration test? So that you just have some better expectations around what a bigger code base might look like.
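For readers who have not met the two terms mentioned above, here is a minimal sketch of the difference between a unit test and an integration test. The functions `clean_ages` and `mean_age` are hypothetical examples invented for illustration, and plain `assert` statements are used so the sketch runs without any test framework.

```python
def clean_ages(raw):
    """Keep only plausible ages, converting strings like '42' to ints."""
    cleaned = []
    for value in raw:
        try:
            age = int(value)
        except (TypeError, ValueError):
            continue  # skip missing or malformed entries
        if 0 <= age <= 120:
            cleaned.append(age)
    return cleaned


def mean_age(ages):
    """Average of a non-empty list of ages."""
    if not ages:
        raise ValueError("no valid ages")
    return sum(ages) / len(ages)


def test_clean_ages_unit():
    # Unit test: one function, checked in isolation.
    assert clean_ages(["42", None, "abc", 150, 7]) == [42, 7]


def test_pipeline_integration():
    # Integration test: the two steps working together on raw input.
    assert mean_age(clean_ages(["30", "n/a", 50])) == 40.0


test_clean_ages_unit()
test_pipeline_integration()
```

In a larger code base the same functions would typically be collected by a test runner rather than called by hand, but the distinction is the same: unit tests pin down one piece, integration tests check that the pieces fit together.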

Demetrios [00:29:12]: Understanding the CI/CD pipeline, understanding how to push things to git, makes a lot of sense.

Catherine Nelson [00:29:19]: What even is CI/CD?

Demetrios [00:29:21]: Yeah.

Catherine Nelson [00:29:21]: Correct.

Demetrios [00:29:23]: Yeah. And you would hope that most people understand it, or at least these days are aware of the different terms, but sadly, I guess it's not always the case.

Catherine Nelson [00:29:39]: No. And I think even if you're coming from a computer science degree, you might not get that exposure if you were into the theoretical side of things. I'm thinking about the junior data scientist, junior ML engineer, who is fresh out of a bootcamp, fresh out of a master's degree. They're probably just not going to have had exposure to a lot of these terms. And it all makes sense once you have a basic idea of what's going on; then you can start to read the documentation. But there's this period where you don't even know enough to learn more from the more complex documentation.

Demetrios [00:30:29]: Yeah, it's funny because I was going to mention even just seeing some. Well, no, let me take that back. The junior data scientist who has only played on Kaggle and only used the Titanic dataset, and now they're thrown into a job where, yeah, you've got PII to worry about. You've also got, like you said, maybe you're working with a UX designer. Maybe you're working with data engineers or software engineers, DevOps folks. And they all have to be conscientious about different parts of the system.

Catherine Nelson [00:31:10]: Yeah.

Demetrios [00:31:10]: And so they're going to be coming to you, asking you about certain parts of the system, and if you have that common language and that common understanding of what things are, you're going to be able to go much further.

Catherine Nelson [00:31:21]: And ideally, you'd have a great mentor in your team who would be able to explain these things to you and answer all your questions. Yeah, sure.

Demetrios [00:31:32]: The amazing prompt there is, yeah, you're a software engineering mentor, and I have lots of questions for you. Yeah. But even understanding how to help you with the code and how to help you think about things differently and look at the system design documents. And so I understand the desire for different data scientists to want to be able to understand the system as a whole as opposed to just their little piece.

Catherine Nelson [00:32:09]: You don't need to be an expert on the whole, but just knowing what's there is, I think, is key.

Demetrios [00:32:16]: Yeah. And I find it super helpful, too, just to know the different tools that are out there and what they do. Like me learning: even though it's not my bread and butter, I do understand what Kafka does. Or, okay, why do I always see Kafka and Flink together? And now I'm starting to see lots of design diagrams with Apache Arrow. What does that do? And the more that you can get into those types of things the better. And that's probably a lot of reading of different engineering blogs that will help you see, like, okay, what design principles are they taking and what are they actually doing? What are the design decisions they made and the systems that they created? That can help a ton, too.

Catherine Nelson [00:33:09]: Yeah, definitely. Yeah. I think just reading around what's going on, what the new trends are, has been something that's been really helpful to me my whole career.

Demetrios [00:33:21]: When you think about data scientists like interacting with larger code bases, how can this be done well and not pushing problems? Hopefully there's redundancies in place so that nothing can actually escape. But you can imagine that someone who is fairly new at this can mess up a few times.

Catherine Nelson [00:33:49]: Yeah, I think for starters, like any software system, you want to be really clear about what the requirements are for that code and how it is going to interface with the rest of the system. So designing that well from the start, making sure that that's clear to the data science team who's working on their bit of code, is going to help a lot. So if the system needs to return an answer in some specific amount of time, then make that clear; make that something that's baked into the requirements. And then, yeah, if you've got incentives like this, then that makes people want to build that system so that it fits those requirements. Then you start needing to make each piece of code run well so that the whole system runs well, and also setting up tooling that works with data scientists rather than against them.
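One way a response-time requirement like the one described above could be baked into code is as an explicit, checkable budget. The sketch below is only an illustration under stated assumptions: `predict` is a hypothetical stand-in for a real model call, `within_budget` is a made-up helper, and the 0.5-second budget is an arbitrary example value.

```python
import time


def predict(features):
    """Hypothetical stand-in for a real model call."""
    return sum(features) > 1.0


def within_budget(fn, args, budget_seconds):
    """Run fn(*args) and report whether it finished within the time budget."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds


# The requirement is written down once and checked every time this runs.
result, elapsed, ok = within_budget(predict, ([0.4, 0.9],), budget_seconds=0.5)
assert ok, f"prediction took {elapsed:.3f}s, over the 0.5s budget"
```

A check like this is coarse, since wall-clock timings vary between machines, but it turns "it needs to be fast" into a number that the data science team and the rest of the system can agree on.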

Demetrios [00:35:04]: So when you talk about that, it reminds me of when I was at the Data and AI Summit. There was a company whose whole thing is built around making sure you have the right permissions in place for all the data. And I was like, wow, this makes a lot of sense, because I've heard horror stories of people who join a company and then three to six months later, they still don't have access to the right data to get their models running. Or they're getting dummy data and it looks nothing like what is actually in production. And so any model that they try and push is failing horribly. And it was like, oh, this is super cool to think about. But I just wanted to interject on what you were saying with provisioning access.

Catherine Nelson [00:35:57]: Yeah, definitely.

Demetrios [00:35:59]: And the other piece to that, when it comes to the data scientists working with larger code bases, I think the biggest question that we had over the span of two to three years, basically before ChatGPT was released and everybody's attention shifted, the constant discussions that would come up probably once every other month in the MLOps Community were about putting Jupyter notebooks into production. And then the other side was, should a data scientist learn Kubernetes? And so I'm wondering, from your side of things, how far is too far? Well, first off, maybe, and I've seen it a lot, there's a lot of people that have come through the community, and there's one that is absolutely hilarious. Somebody introduced themselves in the MLOps Community Slack, and they said, I used to be a happy data scientist and then I had to push my code to production, and now I'm having to learn Kubernetes and I'm having to learn all this stuff because they didn't have the support from a team and the software team was not able to give them what they needed. So you do hear those stories of when a data scientist will level up and learn Kubernetes. I think for 90% of the data scientists it's way too cumbersome to ask that of them. And so I guess the question I have is when is it taking it too far?

Catherine Nelson [00:37:41]: Yeah, yeah, that's a great point because I can imagine that if you're responsible for everything to do with the model, and that includes the Kubernetes side of things, even just the Kubernetes piece could be a full-time job. It makes me wonder why there isn't a team supporting that model if you're taking it that far. But then if there isn't the need to be exploring new models so much, then that might be the way that you end up going. It's a real problem with data science being a job description that could mean pretty much anything.

Demetrios [00:38:30]: Yes, yes, the data. I remember I was talking to someone in Amsterdam when we had a meetup and they were saying how they were hiring a data scientist and I was like, oh, so what does a data scientist mean to you? And they said, you know, they can do anything and everything from data pipelines to CI/CD to Kubernetes to monitoring it and working with Datadog. And I'm like, oh, so you're looking for a unicorn, you're not looking for a data scientist. They're like, yeah, but unicorns exist. I know, I've done it. And I'm like, man, that is just really hard to expect of someone. And I guess it's more common these days. And you've undoubtedly heard about the T-shape, where you understand the whole life cycle, but you go deep on one aspect of it.

Catherine Nelson [00:39:26]: Yeah, absolutely. It is unfair to expect someone to learn all these things, but I can also see how it happens because data science does attract people who like to learn new things. So I can see that once people get into a data science job, they're like, oh, I need to learn this thing to make that happen. I need to learn this. And they are able to learn all these new things and they're interested in them, and then suddenly that sets a precedent for being able to do everything.

Demetrios [00:40:02]: Totally. It's like, okay, you're an analytics engineer, you're a data engineer, you're DevOps, you're a data scientist, machine learning engineer. Yes, all wrapped up into one. So hopefully you're getting paid like two people, not one, because with all that knowledge, that is incredible. But.

Catherine Nelson [00:40:23]: And if you're having a good time learning all those new things and you're excited about it, that's fine.

Demetrios [00:40:29]: Yeah, yeah. One of the original co-hosts of this podcast, David Aponte, he started as a data scientist, and then he moved into machine learning engineering, and he started learning Kubernetes. And then he really enjoyed that. For a while, he was doing more DevOps and SRE stuff, and then he started going a little bit further down the stack, and now he's just optimizing GPUs, and he's very much doing the research side of machine learning, but to make sure that these LLMs, or just foundational models, can work fast on whatever GPU or CPU is coming out of the box with the computer. And so it's like what you just said. Since he enjoyed learning stuff, he just kind of kept going and kept learning. And next thing you know, he's got a PhD and he is way beyond what I would have ever expected. Right, but that's how it works.

Demetrios [00:41:33]: All right, real quick, let's talk for a minute about our sponsors of this episode, making it all happen. LatticeFlow AI. Are you grappling with stagnant model performance? Gartner reveals a staggering statistic that 85% of models never make it into production. Why? Well, reasons can include poor data quality, labeling issues, overfitting, underfitting, and more. But the real challenge lies in uncovering blind spots that lurk around until models hit production. Even with an impressive aggregate performance of 90%, models can plateau. Sadly, many companies optimize for prioritizing model performance for perfect scenarios while leaving safety as an afterthought. Introducing LatticeFlow AI.

Demetrios [00:42:20]: The pioneer in delivering robust and reliable AI models at scale. They are here to help you mitigate these risks head-on during the AI development stage, preventing any unwanted surprises in the real world. Their platform empowers your data scientists and ML engineers to systematically pinpoint and rectify data and model errors, enhancing predictive performance at scale. With LatticeFlow AI, you can accelerate time to production with reliable and trustworthy models at scale. Don't let your model stall. Visit LatticeFlow AI and book a call with the folks over there right now. Let them know you heard about it from the MLOps Community podcast. Let's get back into the show.

Demetrios [00:43:04]: So, on that note, I think there's probably an interesting topic to cover, which is the future of data science and how you feel things are moving, especially because now a lot of people are just able to grab a model off the shelf, get it running, and it's better than spending six months trying to get their model going right.

Catherine Nelson [00:43:38]: Yeah. And I was asked this question at a conference. I did a talk a few weeks ago, and my answer has completely changed from then to now.

Demetrios [00:43:50]: Oh, wow.

Catherine Nelson [00:43:51]: And I'll explain this. So three weeks ago I said that I was pretty certain that data science was kind of diverging into a more analytics side and a more machine learning engineering side. So you either got interested in the statistics side and you went deep in the analytics side, or you were training models and experimenting with those and then maybe moving to deploying the models into production. But then I recently started consulting for a GenAI startup and started playing around with LLMs. And for me, this brings an extra dimension because evaluating LLMs is hard. It's really uncertain how you do this at the moment, but I think that working out whether LLMs are doing the task that you want them to be doing is absolutely in the data science set of skills, but it needs both statistics knowledge and knowledge of the internal workings of the machine learning model. So I think it brings both those divergent skill sets back together again.

Demetrios [00:45:07]: Oh, fascinating. So the people that are primed to take on these kind of hairy problems of evaluating the output of the models are data scientists.

Catherine Nelson [00:45:20]: Yes.

Demetrios [00:45:22]: That makes 100% sense to me. And especially when you start looking at the output as a corpus of data, as opposed to one to one, like, is this one answer what we thought was going to come out? It's like, no, there are these 1,500 answers. How do we look at those as a dataset?

Catherine Nelson [00:45:46]: Yeah, yeah. I was listening to a podcast which interviewed Christopher Manning at Stanford, and he made the point really well, that machine learning has changed. It used to be that you collect a dataset that's aimed at your problem and then you train a model and you have this model that's suited to that problem, and that's relatively easy to evaluate because it's one model, one problem. But now with LLMs, you're expecting them to do all these different tasks. So that means that if you change something about the model, there are trade-offs all over the place. It might get better in one place, it might do worse in another just because you've changed the prompt a little bit. So evaluating whether the change you've made is a good thing or not suddenly becomes super hard.
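
The trade-off described here can be made concrete by scoring outputs per task rather than as one aggregate number. The sketch below is only an illustration under stated assumptions: the task names and the evaluation records are made up, and it assumes each output has already been judged correct or incorrect by some separate process.

```python
from collections import defaultdict


def accuracy_by_task(results):
    """Group (task, is_correct) records and return the accuracy for each task."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, is_correct in results:
        totals[task] += 1
        correct[task] += int(is_correct)
    return {task: correct[task] / totals[task] for task in totals}


def compare_versions(baseline, candidate):
    """Return the per-task accuracy change (candidate minus baseline)."""
    base = accuracy_by_task(baseline)
    cand = accuracy_by_task(candidate)
    return {task: cand[task] - base[task] for task in base if task in cand}


# Made-up evaluation records: (task name, was the output judged correct?)
baseline_results = [
    ("summarize", True), ("summarize", False),
    ("extract", True), ("extract", True),
]
candidate_results = [
    ("summarize", True), ("summarize", True),
    ("extract", True), ("extract", False),
]

deltas = compare_versions(baseline_results, candidate_results)
# Tasks where the candidate prompt did worse than the baseline.
regressions = [task for task, delta in deltas.items() if delta < 0]
```

In this toy data the new prompt improves one task and regresses another, which an overall accuracy figure (unchanged at 75%) would hide. Real evaluations would need far more examples per task before any difference could be trusted.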

Demetrios [00:46:44]: Yeah, you know what, another piece that I, this is a little bit not related, but it sparked this idea.

Catherine Nelson [00:46:52]: Yeah.

Demetrios [00:46:52]: That I've been thinking about quite a bit is how a lot of these tools, especially when it comes to prompting and prompt tracking, they're not built for the people that are actually doing a ton of the work. The person that is there iterating on lots and lots of prompts can be anybody. It can be a data scientist, it can be a marketer, it can be a salesperson. And a lot of the tools that are out there right now are like Weights & Biases, which is specifically geared towards data scientists. And so a marketer or a salesperson isn't going to be like, all right, let me just go open up Weights & Biases to track my prompts.

Catherine Nelson [00:47:35]: Right, right.

Demetrios [00:47:36]: So it's a little bit tangential, but I was thinking about that a ton. I am generally against advising people to start something in the space because it's just so crowded and there are so many different tools out there. But that feels like something that might need to happen, especially for the random marketers who are getting a ton of value by running a bunch of their marketing copy through ChatGPT.

Catherine Nelson [00:48:11]: Yeah, I feel like the community hasn't even settled on a process for how you do these evaluations, let alone turned that into any kind of tool. There's no consensus yet, so it's going to be fascinating to see that evolve.

Demetrios [00:48:30]: Yeah. And the tools that are out there on evaluation, I have a document that eventually I imagine I'll make public, but it's just a Notion database, and it's called "Same, Same but Different". And it's all the tools that are out there that I've seen or come across that are in the MLOps or LLM AI infrastructure space. And one thing is for sure, there's a lot of evaluation tools out there.

Catherine Nelson [00:49:03]: Yeah, that's right. Yeah.

Demetrios [00:49:04]: And I guess it makes sense because, like you're saying it's a hard problem that has not been cracked yet, so everybody's trying to take a swing and saying, hey, well, we did this at XYZ, our last company. We kind of got it figured out. So here's how you can do it. But, yeah, we probably will be standardizing around a how in a year or two, hopefully. But judging by how fast things move.

Catherine Nelson [00:49:35]: Yep. Because it feels like it's very easy to make something that works well in a few cases and kind of eyeball it, write some tests, but how you scale that up, that's hard.

Demetrios [00:49:51]: Yeah, exactly. And yeah, being able to predictably trust the output. I mean, honestly, that's kind of the whole reason that we are doing the AI Quality Conference, because it's like, well, we're all a little bit confused about how to go about this. Let's talk to each other and see if we can surface some of these practices where people have figured out different tips and tricks on what to do and how to make it more reliable. Because like you said, getting 80% of the way there is great. And then it could take you double, triple the amount of time to get 90% of the way there.

Catherine Nelson [00:50:30]: Absolutely. Yes. Yeah, it sounds like it's going to be a great event.

Demetrios [00:50:35]: Yeah, yeah, I'm looking forward to it. You're able to make it, right?

Catherine Nelson [00:50:39]: Yes. Yeah, that's right.

Demetrios [00:50:40]: Yes. Oh, I'm so excited.

Catherine Nelson [00:50:42]: Yeah, me too.

Demetrios [00:50:43]: Well, this has been really, really thought-provoking talking to you. I am 100% in the camp that data scientists need to learn how to at least understand what the system does, and it will be immensely valuable for any data scientists out there to learn the jargon at the very least and understand what APIs do, understand how to work with Git, understand what security limitations there are or what's good practice, what's bad practice when it comes to security and all that fun stuff. So it's really cool to see that your book is out there. For anybody that wants to get a copy of your book, we're going to leave a link in the description, but it is called, dun dun dun dun, as we mentioned before, Software Engineering for Data Scientists. That's great. Yeah. Thanks for writing it.

Demetrios [00:51:42]: I know that there's a ton of people out there that are going to be better engineers because of it.

Catherine Nelson [00:51:49]: That's fantastic. And I had a lot of fun writing it as well, so I'm excited to see it get out into the world and start hearing what people think of it. So, yeah.

Demetrios [00:51:59]: What was one thing that you learned while writing it that stays with you still?

Catherine Nelson [00:52:05]: How much time I can save by automating things, because there's an initial setup cost to setting up tools and automating processes that you do repeatedly. But once you've done that and you can just press a button and run your tests and lint your code and run the formatting, it's great. It saves you so much time.
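
The "press a button" setup mentioned here could be as small as one script that runs each step in turn. This is a hedged sketch rather than the workflow from the book: the choice of `black`, `ruff`, and `pytest` is an assumption made for illustration, and `run_checks` is a hypothetical helper name.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each (name, command) pair in order; return the names that failed."""
    failed = []
    for name, command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(name)
    return failed


# Assumed tool choices; swap in whatever formatter, linter, and test
# runner your project actually uses.
CHECKS = [
    ("format", [sys.executable, "-m", "black", "."]),
    ("lint", [sys.executable, "-m", "ruff", "check", "."]),
    ("tests", [sys.executable, "-m", "pytest"]),
]
```

Calling `run_checks(CHECKS)` from the project root returns a list of the steps that did not pass, so an empty list means everything succeeded. The same idea is often expressed with a Makefile or pre-commit hooks instead of a Python script.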

Demetrios [00:52:30]: Yeah, it's heavy on the front end, but then your future self will be thanking you.

Catherine Nelson [00:52:39]: Yes, absolutely.
