MLOps Community

Clean Code for Data Scientists

Posted Jun 07, 2023 | Views 731
# Clean Code
# Data Scientists
# Shopify
# Company.shopltk.com
SPEAKERS
Matt Sharp
MLOps Engineer @ LTK (formerly Reward Style & LIKEtoKNOW.it)

Author of LLMs in Production, through Manning Publications. Has worked in ML/AI for over ten years, building machine learning platforms for startups and large tech companies alike. My career has focused mainly on deploying models to production.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

Abi Aryan
Machine Learning Engineer @ Independent Consultant

Abi is a machine learning engineer and an independent consultant with over 7 years of experience in the industry using ML research and adapting it to solve real-world engineering challenges for businesses for a wide range of companies ranging from e-commerce, insurance, education and media & entertainment where she is responsible for machine learning infrastructure design and model development, integration and deployment at scale for data analysis, computer vision, audio-speech synthesis as well as natural language processing. She is also currently writing and working in autonomous agents and evaluation frameworks for large language models as a researcher at Bolkay.

Prior to consulting, Abi was a visiting research scholar at UCLA working at the Cognitive Sciences Lab with Dr. Judea Pearl on developing intelligent agents and has authored research papers in AutoML and Reinforcement Learning (later accepted for poster presentation at AAAI 2020) and invited reviewer, area-chair and co-chair on multiple conferences including AABI 2023, PyData NYC ‘22, ACL ‘21, NeurIPS ‘18, PyData LA ‘18.

SUMMARY

Let's delve into Shopify's real-time serving platform, Merlin, which enables features like recommender systems, inbox classification, and fraud detection. Matt shares his insights on clean coding and the new book he is writing about LLMs in production.

TRANSCRIPT

I am Matt Sharp. I work at Shopify. I'm a data developer there, a very generic title, but essentially a machine learning engineer. And I don't drink coffee or tea, so I'm a water guy for sure.

Welcome back to the MLOps Community podcast. I am your host, Demetrios, and I am here with one of the best in the business.

Abi, how's it going, Abi?

Great. There's a lot happening in the tech world. Too many changes, too many improvements, way too many announcements. We had a wonderful conversation with Matt. Matt is a chemical engineer turned data scientist turned data engineer, and right now he's leading data at Shopify.

So I think I'll let Demetrios take it on. What were your favorite parts of the conversation, Demetrios?

Well, for those who don't know, Matt is also a bit of a LinkedIn influencer because of all of the conversations and posts that he has up around clean code for data scientists.

He refers to himself as a recovering data scientist, and we got into that in this conversation. We also talked a little bit about the platform that Shopify created, Merlin, the blog posts that they wrote about it, and how they're able to enable (there's a little bit of a tongue twister for you) the real-time recommender systems, the inbox classification, and also fraud detection, all that good stuff that you hear about for real-time use cases. They break down exactly how they're doing it.

And I think the biggest takeaway for me on Merlin and the whole real-time serving platform that they created was the ability to quickly iterate and get up and running with a sandbox.

And then if you prove out some value, you can go and run with it. That was my takeaway. What about yours?

So for me, I think I loved his advice on what clean code means.

I thought he had a hot take about those posts that blow up on LinkedIn and Twitter, like, "I've been a data scientist or a software engineer for X amount of years, and I still Google how to do the most basic things in Python." And his response to that was hilarious.

It was like, if you've been a data scientist for 12 years and you're still figuring out the basic building blocks of Python and you have to Google that, you're not good at your job. So let's just be very candid about that. I thought that was a funny hot take that is a little bit contrarian to the things that we see and hear on social media.

So overall, I think the conversation was a good combination of talking about clean coding and coding practices for data scientists, and then we moved to real-time streaming, talking about Merlin, and then finally to LLMs in production, some of his work on the topic, and some of his opinions.

So overall, there's something for everybody. He is coming out with a book too, on LLMs in production. I think that is actually the title of the book, and it should be out soon. If it is out already, we will leave a link in the description.

If it's not out, we will hopefully leave a link to the early bird version of it so that you can check it out. And without further ado, I think it's time to get into this conversation. If you like and enjoy this podcast, it means the world to us if you just share it with one friend. One friend. That's all we ask, because that is the way that we're gonna spread the good word of MLOps, or as some people are calling it these days, LLMOps, or as other people are calling it these days, LLMs in production.

And we've been having all kinds of LLMs in Production conferences happening, and I think by the time this bad boy comes out, we are either going to be gearing up for the LLMs in Production conference, which you can find a link to in the description below, or we're gonna have just had it, in which case you can find a link to all the replays in the description below.

So depending on where you are in time when you listen to this, you can click on the link below and get everything about LLMs in production to your heart's desire. That's it for now. Let's jump into the conversation with Matt.

First thing I want to know, to start it off: do you have a really big wall in front of you that you watch? You have a projector behind you, and I'm guessing that it's your home movie theater that you project onto a wall.

It is.

I knew it.

Yeah. So, just on the second floor, across the living room, there's just this empty canvas, and so we'll have home videos and stuff up there. It's pretty nice.

Oh, that looks so cool. Well done. All right, Matt, great to have you here. I have been a fan for a while. You've been preaching the good word on LinkedIn about clean code for data scientists. I think it struck a nerve, because occasionally I'll see one of your posts pop up with thousands of likes and views and all of that good stuff.

So you are a bit of an influencer these days, and I think there's a reason for that. Lots of people have either been a data scientist or work with data scientists that may or may not know the fundamentals, and so it's very important to get it right, and you are trying to raise awareness about this. What brought you into this space?

Like, why did you decide that you were gonna be the one to start talking about this?

Well, I guess, to start off, I used to be a data scientist myself, so I get it. When you're a data scientist, there are just so many things to learn: it's statistics heavy, and there's analyzing data, and being able to answer stakeholders' questions and speak to business value, and all of these things, so that oftentimes the actual code portion is seen as, "Well, I got it working."

Like, you know, that's fine. But at least when I was a data scientist, there were so many errors and so much wasted time because I had to go and fix some bug or do other things. And when I was there, there was no one around me who knew coding, you know? So it was mostly just me and some other data scientists trying to hack things together.

And then you go to a tech shop and it's completely different, right? At Shopify, for example, when do you think they start expecting a data scientist to start writing clean code? And the answer is, like, immediately, right? Yeah.

Like, if you go in and you look at our documents saying, okay, hey, you're a senior engineer, these are the requirements for your job, being able to write clean code is actually inside of our requirements. And it starts as an intern, right?

Because when you come in as an intern, they don't necessarily expect you to be able to solve problems, but they do expect you to write clean code. So it's very much a fundamental that needs to be there right at the beginning. And I don't think a lot of data scientists realize that.

One very quick question, because I want to clarify this for our listeners. There's a lot of conversation about writing clean code, but what does that mean for you? Does that mean the use of functions or writing modular code? Does that mean writing unit tests and integration tests while writing the code?

Or does it mean structuring all of those things? What are your top tips for when you say, okay, this passes my mark for clean code?

So clean code, to me, the definition is how easy it is to understand. And so the best way for you to gauge it is: if you write code and you come back six months from now and you've completely forgotten what you were working on, can you get caught up within a couple of seconds or minutes?

Can someone else come in and read your code and understand it? Right? Clean code is all about the communication aspect. It's very much a technical skill, but it's a lot more of a soft skill, because the whole point of clean code is making sure that you can collaborate with other people.

Because you don't work in a vacuum, right? You need to be able to communicate with analysts and other data scientists about the statistics and about how you're working with the code. And then you also need to be able to go in and, hopefully, work with data engineers and other engineers and other people, and they need to be able to talk.

And the code is really the communication piece of how this all works. And so when code is clean, it's communicable.

And how do you define clearly communicable code? Does that mean it's well commented, or does that mean you are using the right variable names, that the function naming is done in a way that makes sense, instead of using very random variables to store things?

How do you define it, or what are the top three things that you would say are important for that?

So naming variables is definitely, probably, the largest aspect of writing clean code. Just because if the variables are well named and well defined... you know, like if you read a model called "cars," right?

It's just like, well, what could be in there? A data set that I often point to: at my old company we had a table called Locations, and "locations" sounds like a great name. But what does it mean, right? Because we had locations for merchant addresses, and then we also had online locations, which were just websites, and it got really confusing.

And the problem was that this locations table, people put it up, and it was actually zip codes, and we never used it. But it was up there and it got integrated into a bunch of different things. And so no one wanted to touch it and remove it and clean things up, but everyone would always try to merge onto that table and get their location data, and they thought the result was great, and then they would be like, wait, why isn't this working?

Why is my data analysis not working? And it's just like, well, it's because you're using the locations table, which isn't our locations. Locations aren't locations. Right? You want merchant addresses. So a good name is both very particular and also broad enough to include all the use cases of what you're trying to do, and understanding that and getting the hang of good names is obviously really important.
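
The locations story can be pictured with a small Python sketch. Everything here is hypothetical (the table contents, the `join_on_key` helper, the merchant ids); it only illustrates how a vaguely named table can make a join silently return nothing, while a specific name makes the mistake hard to make.

```python
# Hypothetical tables, for illustration only.
# A vague name: "locations" could mean addresses, websites, or zip codes.
locations = {"10001": "New York", "94105": "San Francisco"}

# The same data under a name that says what it actually holds.
zip_code_to_city = locations

# A specific name for the table people were really looking for.
merchant_addresses = {
    "merchant_1": "350 5th Ave, New York, NY 10001",
    "merchant_2": "1 Market St, San Francisco, CA 94105",
}

orders_by_merchant = {"merchant_1": 3, "merchant_2": 5}


def join_on_key(left, right):
    """Inner join of two dicts on their shared keys."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}


# Joining onto the vaguely named table matches nothing, because its keys
# are zip codes, not merchant ids. No error is raised; the data is just gone.
bad_join = join_on_key(orders_by_merchant, locations)
good_join = join_on_key(orders_by_merchant, merchant_addresses)

print(len(bad_join), len(good_join))
```

With names like `zip_code_to_city` and `merchant_addresses`, the wrong join reads as wrong before it is ever run.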

I think there's a common joke where it's just like, the two hardest things in programming are naming, cache invalidation, and off-by-one errors. So yeah, naming is definitely hard. But oftentimes when you talk with data scientists, most of them don't necessarily come from a computer science background.

And so a lot of the tips I give often involve just, how do you structure a project so that you can find the files you need? Because that comes down to the communication, right? If I'm given a Ruby on Rails project, Ruby on Rails has this awesome structure where it has scaffolding.

And so every project is exactly the same. It's really easy to find: okay, here's where the JavaScript is, here's where the controllers are, here's where the views are. It's really easy to go in and dig into them because they're all the same. But data science projects often don't have that type of scaffolding or other things like that.
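
As a rough illustration of what Rails-style scaffolding could look like for a data science project, here is a minimal Python sketch. The directory names in `PROJECT_LAYOUT` and the `scaffold_project` helper are assumptions chosen for the example, not a standard and not anything Matt or Shopify prescribes.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; one common convention, not a standard.
PROJECT_LAYOUT = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "tests",
]


def scaffold_project(root):
    """Create the same directory skeleton for every project and return the paths."""
    created = []
    for relative in PROJECT_LAYOUT:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    (root / "README.md").touch()
    return created


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp) / "churn_model"
    created = scaffold_project(project_root)
    created_names = sorted(p.relative_to(tmp).as_posix() for p in created)
    readme_exists = (project_root / "README.md").is_file()

print(created_names)
```

The point is less the specific folders than that every project gets the same ones, so "where are the tests?" always has the same answer.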

And so it's really hard to find even basic things, and you're like, where are the tests? Do they even have tests? Oh, okay, no, this is a data science project, I forgot. And a lot of the problems often stem from the fact that data scientists are always using Jupyter Notebooks, right?

But they haven't learned the computer science aspect, so they don't know how to manage those, and they often end up with lots of different naming. And so a lot of my tips go into that: okay, this is how you get the most out of your Jupyter Notebook, this is how you can communicate more, and other things like that.

Dude, I love it. I mean, coming out hard on the data science projects. And you went there, so I'm gonna follow your lead and ask about Jupyter Notebooks and the things that you've found, because I actually enjoy finding out about other people's processes when it comes to Jupyter Notebooks and all of the power that they present in their phases of the machine learning life cycle.

But then what do you do once you've reached the end of their life cycle, and how do you get out of them as quickly as possible? And what do you recommend to data scientists then, if ever? Because some people I know don't like to get out of them.

You know, I've spent enough time in Jupyter Notebooks, so I'm really not a fan.

Everything from managing their kernels to just the tricks you can play. And this was actually, I don't know if it was for April Fools, but it was just like, hey, this is a prank I played: you go in, you run some code, and then you delete that cell, and no one's ever gonna be able to understand why their Jupyter Notebook works, and then it doesn't work anymore, just because you redefined some imports or things like that.
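
The deleted-cell prank comes down to hidden kernel state. The sketch below imitates it in plain Python: each string stands in for a notebook cell and a dict stands in for the kernel's namespace. The cell contents and variable names are made up for illustration.

```python
# Each string stands in for a notebook cell; `kernel` is the shared namespace.
kernel = {}

deleted_cell = "unit_price = 20"          # run once, then deleted from the notebook
visible_cell = "total = 5 * unit_price"   # the only cell left on the page

exec(deleted_cell, kernel)   # the author ran this before deleting it
exec(visible_cell, kernel)   # works, because unit_price still lives in the kernel
print(kernel["total"])

# A colleague opens the saved notebook: only visible_cell exists, fresh kernel.
fresh_kernel = {}
try:
    exec(visible_cell, fresh_kernel)
    rerun_error = None
except NameError as error:
    rerun_error = error

print(type(rerun_error).__name__)
```

Restarting the kernel and running all cells top to bottom is the usual way to flush this kind of state out before sharing a notebook.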

And so generally, how I like to use a Jupyter Notebook, though, because they are powerful: you can iterate really quickly, you can look at your data more easily. They are really powerful. But I think people make their cells too small.

You often wanna make them do one thing, but a cell isn't a function, right? There's no restriction for it to do one thing. So what I like to do is think about cells as if I was to put this into a scripting project, inside of just pure Python files: a cell would be a single file, with its separate imports and doing its own specific thing, usually defining a class or a group of functions that I can then import into other cells. And what makes this strategy powerful is that each cell is encapsulated, and you can ensure that each cell will be able to run on its own.

The next powerful thing about doing this is that when it comes time to communicate, or maybe move it into a more foundational Python script or library, it's just a matter of taking each cell and copying it into a script file, and then making sure you adjust the imports to include the other things you need.
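
A minimal sketch of the "one cell equals one file" idea, assuming a hypothetical cell that would later become a file such as `cleaning.py`. The function names and data are invented for the example; the shape is what matters: the cell carries its own imports and only defines things.

```python
# --- notebook cell, written as if it were the file cleaning.py ---
import statistics


def drop_missing(values):
    """Return the values with None entries removed."""
    return [value for value in values if value is not None]


def zscores(values):
    """Standardise values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(value - mean) / stdev for value in values]


# --- a later cell (or file) only uses the names defined above ---
raw_ages = [31, None, 45, 38]
clean_ages = drop_missing(raw_ages)
print(clean_ages)
```

Moving this out of the notebook would then mostly mean pasting the first part into a file and replacing the shared notebook scope with an import of those two functions.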

And so I find that strategy really powerful as far as transitioning between a Jupyter Notebook and an actual project, as well as just making sure things are clear, things always run, and things work well.

I think this is a really interesting nerve that you're hitting, which comes down to this difference between doing an exploratory project, where you're just exploring the data at the first level, trying to understand whether it makes sense, versus writing production-level code. You can do it in anything: you can write it in a Jupyter Notebook, or you can choose to write the Python code from the get-go.

But I don't think that distinction is usually made clearly, which is, there's no problem with using any tool. The problem is how you write at different stages of using that tool. So while writing modular code is a little bit more important when you're writing production-level code, do you think it's equally important when you are doing exploration as well?

Or do you think the cell code is okay there, we can keep using quick df calls and just run that while we are doing exploration, and maybe once we are writing the production-level code, that doesn't really fly?

Yeah, no, I talk about this a lot. I think clean code should be written at the very beginning, no matter what you're doing. Because with exploratory data analysis, the workflow I see data analysts do all the time is: okay, I have a new exploratory data analysis.

I create a new Jupyter Notebook, and then the first thing they do is go find their last analysis and start copying the code over, right? Like, oh, I don't wanna redo this again. But they wrote that first code thinking, oh, this is just a quick script, and it's really bad. And then they copy it over and try to run it, and they're like, oh no, I made a mistake.

And now they go and make fixes, and then they make edits, and it's just like, no. And this is actually something I recommend: have a Jupyter Notebook that is for notes, like actual notes, right? You're going to copy and paste: okay, this function was really useful.

Copy and paste it over, give notes about what it does, and clean it up. And then when you are gonna copy code, you're first gonna look at that notebook that has all your clean code. It's gonna have all your reference stuff in it, and then you're copying over and things are just gonna go a lot smoother when it comes to a quick analysis.

Right. Because no quick analysis is ever quick. And so it's almost more important to make sure your code is clean when it's quick and when it's supposed to be an exploratory data analysis. Because one of the things software engineers have learned is that when you're writing code, you yourself are reading it about 10 times more than you're writing it.

You're scanning up, you're looking up: oh, what did I set this to? Okay, what is this variable? All code is read way more than it's ever written. And that is the important part of why clean code is good. Clean code is way more valuable for you, right?

It's for future you. Earlier I said six months, but you'll find value five minutes from now, right? When you're coding and you're just looking back up, you're like, oh, what did I do up here? What did I name that variable? And so good clean code is useful.

And this is often one of the values of actually working in an IDE versus just a general Jupyter Notebook: IDEs come with all of the helpful code completion tips and other things like that. And so I often encourage data scientists: hey, go open up an IDE.

It's fine if you use Jupyter Notebooks and you love that, but open it up and see what tools and what things are there and ready for you out of the box, because it'll definitely change your life.

Yeah. And it's so interesting that you point that out, the foundation, and how we as humans are in such a hurry all the time to get to the next thing or try and figure something out that we do these quick hacks, and then we just start throwing something on top of these quick hacks.

And that foundational level is so flimsy that it can end up coming back and biting you in the ass real quick. And so, I want to ask one more thing before we move on. One of my most popular posts on LinkedIn was about streaming data in real time, and that seemed to blow up. I wanted to ask you, before we jump to the next one: what has been your most popular post on LinkedIn thus far?

I imagine it's around something clean code.

So this is interesting. I think my most popular post ever was taken down because it was too controversial.

Taken down by who? You or LinkedIn?

LinkedIn.

By LinkedIn? No! About clean code, about data scientists? Some data scientists at LinkedIn got ticked.

So it was, yes. It was very simple. And I'm known to be a little controversial sometimes, but all the post said was, "clean code is greater than clean data." And it kind of blew up, and a lot of people were really angry with this.

But my basic thought process around this is just: clean data comes from clean code. All data ultimately comes from code. Whether you're taking a video or using a digital camera, all of that is code working under the hood to get the data, and then you're putting it into a database and you're using data pipelines and doing all these things.

And so there was just this point where, you know, they say 80% of the data scientist's job is cleaning data. And everyone was always saying, hey, data is so important, you have to clean it, garbage in, garbage out. And it's something that data scientists understand very well, because when their data is bad, it just ruins their day, and it makes the analysis way harder.

It makes training models way harder. But they don't want clean code, because that's work that is on them to do. But if they had clean code, their data would be cleaner, right? And this is something you see with things like what Chad Sanderson's pushing, with models and APIs and having data contracts and things like that. It's just like, yeah.

If you just had clean code and data contracts and these other simple principles, you're gonna have clean data. It's going to be a result of it. And so anyway, it was just a very simple post, clean code is more important than clean data, and it just blew up. And I think some people got mad and they took it down.
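
To make "data contracts" a little more concrete, here is a minimal Python sketch of the general idea: the producer and consumer agree on the shape of a record, and rows that break the agreement are rejected at the boundary instead of being cleaned up later. The `OrderRecord` fields and the `validate_rows` helper are hypothetical and are not taken from any specific data contract tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderRecord:
    """Hypothetical contract for one row an upstream service promises to emit."""

    order_id: str
    amount_cents: int
    currency: str

    def __post_init__(self):
        if not self.order_id:
            raise ValueError("order_id must be non-empty")
        if not isinstance(self.amount_cents, int) or self.amount_cents < 0:
            raise ValueError("amount_cents must be a non-negative integer")
        if len(self.currency) != 3 or not self.currency.isupper():
            raise ValueError("currency must be a 3-letter uppercase code")


def validate_rows(rows):
    """Split raw dicts into records that satisfy the contract and rejected rows."""
    accepted, rejected = [], []
    for row in rows:
        try:
            accepted.append(OrderRecord(**row))
        except (TypeError, ValueError) as error:
            rejected.append((row, str(error)))
    return accepted, rejected


rows = [
    {"order_id": "A1", "amount_cents": 1999, "currency": "USD"},
    {"order_id": "A2", "amount_cents": -5, "currency": "USD"},   # negative amount
    {"order_id": "A3", "amount_cents": 700},                     # missing currency
]
accepted, rejected = validate_rows(rows)
print(len(accepted), len(rejected))
```

Only the first row satisfies the contract here; the other two are caught where the data enters, which is the "clean data as a result of clean code" argument in miniature.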

That's so wild. I wonder if they took it down because of the comments potentially being toxic, and LinkedIn doesn't want that to be the experience that people have on LinkedIn, or if it was just some engineer at LinkedIn who was like, fuck this, and then went behind the scenes and was like, delete post.

Yeah, I don't know. It was up for a good while, and people even took screenshots of it. So I know it existed, but I can't find it anymore. And I even had a direct link to it.

Wow.

And that just returns a 404. So I've been thinking about reposting it, just because it's been taken down.

This was way back when I was first posting about clean code. It would really upset a lot of data scientists, and no one thought it was important. But nowadays, all the influencers have found that if they post, like, hey, here are some good SQL cleaning tips, or here's some good code, the content does really well now, and I see it a lot.

And so I'm really glad that the community has changed for the better and realized, hey, clean code is actually important. But yeah, when I first started talking about clean code, the only people I could satisfy were old, hardened software engineers, like, "Yes, you're finally saying what I always wanted."

Right. But I would often have some head of data science or some VP of data science or something like that say, it doesn't matter, you just need to get it working. But I don't see that as often.

Wow. It's incredible.

I think people have now moved on from that stage where people were like, can we actually derive value out of building a machine learning or a data science team? I think everybody has sort of moved to where they have enough maturity to know that, yes, you can derive value, but you're not gonna be able to derive value out of bad practices, or bad software engineering practices specifically, because you have to be able to write code that passes the bar of at least some best practices of production-level software engineering.

Yeah. I guess just one more thing before we move on: code is where the rubber meets the road for a data scientist. And it's important to realize that, because a lot of data scientists think their job is reading white papers all day, or maybe that the end result is writing a white paper, like that is the glory, that their job is to write white papers.

But ultimately, no, it's to write code, right? They are programmers. If they wanna manipulate data, if they wanna write machine learning models, if they wanna do any research on data, it involves writing code, and that is where the rubber hits the road.

And it's just so important to hone that craft and realize: hey, yeah, you are a programmer. You're a very specialized programmer that does a lot of statistics and does a lot of research, but your craft is writing code. It's not reading and writing white papers, or whatever else you may think it is, talking in meetings about statistics. No, ultimately, the value you bring people is when you write code.

Yes, and you will be held responsible for that. So it's cool that you've talked about it. It's cool that you continue to preach the good word on it, and you're doing hard work and I'm glad somebody's doing it.

I wanna go into this real-time streaming. As I mentioned, one of my best-performing posts on LinkedIn was talking about real-time streaming. And you all at Shopify just put out a blog post that goes really in depth into how you are doing real-time streaming. One of the things that I think we can kick off this part of the conversation with is Merlin: what exactly Merlin does, and then how it fits into the rest of your stack.

Yeah. So Merlin is just the machine learning platform at Shopify. Um, and, uh, like really, um, Like when you, when you look at it, it's, it's, it's really not much, but it, it's just a way for data scientists to quickly manage a machine learning project. Um, of course it, and like I say, it's not much, but it's, it's really cool.

And so like Merlin, the big thing with Merlin is it's built on top of Ray. The main thing is it makes parallel processing really, really easy.

Um, and so, like, the basic principle is it has a head node, which has a bunch of worker nodes. And so like when you start a training job, it'll start spinning up worker nodes to process that data. And so it just makes working with really, really large data sets so much easier. And so Merlin, um, essentially just provides a bunch of command line tools so that way you can quickly spin up a project, manage your dependencies, you know, create your images and, you know, has some boilerplate code for you to.

Kind of do all this stuff, and makes it easy. Um, specifically my work stream that I work on inside of Merlin is about making sure that we can do online inference. Um, so we provide a bunch of, you know, helper functions and things like that, so that way you can quickly spin up a new REST API, a new service with your machine learning model, and to be able to.

Have it inside of the Shopify ecosystem and to be able to, you know, get inference out of it. And so, um, it's just really simple. You just, in your terminal, write Merlin services create, and that creates a repo. It creates, you know, simple config files, um, easy enough. Uh, so that way, like, a data scientist doesn't have to go in and mess with Terraform or other things like that, but like they can make sure that they have everything needed.
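To make "a new REST API, a new service with your machine learning model" concrete, here is a minimal sketch of what such an online inference service might look like. It is not Merlin's generated code: the `/predict` route, the JSON payload shape, and the `predict`, `start_server`, and `call_predict` names are all assumptions made for illustration, and the "model" is a stand-in that returns the mean of its inputs. It uses only the Python standard library.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: list[float]) -> float:
    """Stand-in for a real model (hypothetical): returns the mean of the features."""
    return sum(features) / len(features) if features else 0.0


class InferenceHandler(BaseHTTPRequestHandler):
    """Handles POST /predict with a JSON body like {"features": [1.0, 2.0]}."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            features = [float(x) for x in payload["features"]]
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected JSON with a 'features' list")
            return
        body = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging for this sketch.
        pass


def start_server(port: int = 0) -> HTTPServer:
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_predict(server: HTTPServer, features: list[float]) -> dict:
    """Small client helper: POST the features and return the decoded JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    try:
        conn.request(
            "POST",
            "/predict",
            body=json.dumps({"features": features}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

Calling `start_server()` and then `call_predict(server, [1.0, 2.0, 3.0])` returns a dictionary with a single `prediction` key. A platform like the one described would additionally handle the repo, config files, and infrastructure that this sketch leaves out.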

You know, define their, you know, what GPUs they need, how much, you know, CPU and memory; it takes care of all the autoscaling. Um, so one of the recent projects I worked on was adding GPU autoscaling to it. Um, and so yeah, so that's kind of what my team works on. We created a Merlin pipelines library that you can just import, and then it has everything you need, from getting data from our different data stores to, uh, you know, transforming it and getting it into however you need, as well as making sure that it's all distributed.

And using Ray from the get-go. They're now on version two and there's been lots of improvements, but, um, kind of working with the team over there at Anyscale, and so there's been kind of lots of learnings. Um, like one of the things we found is a lot of times, like, you know, data scientists end up often just working on the head node and not taking advantage of the worker nodes and making sure things spin out correctly. And so just making sure that we can make that easier.
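The head node and worker node idea, and the pitfall of doing all the work on the head node, can be sketched in a few lines. This is only an analogy, not Ray or Merlin code: a thread pool from Python's standard library stands in for the worker nodes, and the function names (`split_into_chunks`, `process_chunk`, `run_on_workers`) are made up for the example. In Ray itself, remote tasks play the role that `pool.map` plays here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, nearly equal pieces."""
    n_chunks = max(1, min(n_chunks, len(rows) or 1))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional row each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """The work a single 'worker' does; here, a sum of squares."""
    return sum(x * x for x in chunk)


def run_on_head_only(rows):
    """The pitfall: everything is processed in one place."""
    return process_chunk(rows)


def run_on_workers(rows, n_workers=4):
    """The 'head' splits the data, hands chunks to workers, then combines results."""
    chunks = split_into_chunks(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)
```

Both functions return the same answer; the difference is where the work happens. With a real cluster the chunks would be shipped to separate machines, which is what makes very large data sets tractable.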

So there's, um, one big thing that I want to know, because as you know, I am a happy Shopify user with our funny shirts that we've been selling, which is, uh, yeah, "I hallucinate more than ChatGPT."

They've been discontinued because we can't just have them available all the time. We gotta put a little bit of scarcity around it. But me as a Shopify user, what are some of the things that, like, you are working on that I, as that end user, am seeing in my interactions with Shopify? Um, so there's, uh, kind of a lot of cool projects happening under the hood.

Um, so like one of the projects that, uh, the team has been, uh, working on a lot is supporting, like, not-safe-for-work filtering, because, you know, Shopify supports lots of different businesses. Um, and so, you know, when you go onto an app though, you don't necessarily, you know, depending on who you are, right?

Like necessarily getting, you know, kind of an adult shop or lingerie shop or, you know, like a violent, you know, like a gun shop or something like that. Maybe, um, You know, if, if you're just looking for Lego toys, you know, and something like that, like, or a water gun, you don't necessarily want real guns popping up, right?

And so having some sort of, uh, a filter there, um, is something the team's been working on and improving. Um, and so those are some of the things. Uh, the other thing, like, we have projects, uh, like, so one of the cool projects I think is, like, you know, recommendations for new shop owners.

Um, so like when you create a new shop, like there's a lot that needs to happen, right? And like understanding, you know, the best path forward to create your shop can be a confusing process. So like we have some models in place that help kind of understand who you are, what type of shop you're trying to build, and then like make recommendations as you go.

And we try to follow what you're doing and continue to make the next best action recommendation so that way you can build and grow your shop nice. More organically. And so, uh, that's kind of cool. We also have other machine learning models that just, um, you know, are for internal use, uh, understanding, uh, what we call the grid, which is just like, you know, understanding our own data, um, and understanding trends.

And, you know, what, what should we do? Like maybe we, should we do different promotions or should we do different outreach? How should marketing work? You know? And so like we do have models internally that, you know, help help the business do the business things. And so like, uh, yeah, there's, there's lots of different, um, There's lots of different projects.

Yeah. Shopify. Um, I think, uh, it was said as like, you know, with online, uh, co or you know, e-commerce, like there's so much that can be done and there's so much we can help our shop owners become better at. Um, and so, you know, like mm-hmm. Uh, I think the saying is saying like, we're more likely to die from like constipation than we are from starving.

Right? Like we have, you know, trying to shove too much in and not focusing on the highest priorities, than we are to be like, oh, we don't have anything to do. Right? Like, no, we have so many ideas floating around. There's so many, uh, cool projects, uh, that people come up with. And so, yeah, always trying to figure out what the highest priority ones are and what can be the most valuable is always really useful.

So yeah, uh, Shopify has invested a lot recently, uh, into, um, like OpenAI, and we have a partnership with GPT. And so like we've integrated it, um, into a lot of different things that you might be able to see. So when you're creating your store and you wanna start adding like products to it, uh, like there's a whole workflow.

Process. And inside of that, you can actually add, um, GPT recommendations, right? So like, it's like, hey, I want to add sunglasses. And if you add this to your workflow, it'll actually come up with a cool description of like, what those sunglasses are, and like kind of be a little bit salesy, and like, hey, like you're the coolest with these shades, and things like that.

It helps, uh, kind of automate a lot of those different things as well as, you know, like, hey, you can use it to like, maybe auto-generate tags. Um, so saying like, you know, who this is for and things like that. And so, um, so yeah, lots of different projects happening inside of Shopify to make your job easier.
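As a rough illustration of the workflow step described above (a product goes in, a salesy description and some tags come out), here is a small sketch of the text handling around a language model call. It does not call any model and is not Shopify's implementation: the prompt wording, the `Tags:` reply convention, and the function names `build_description_prompt` and `parse_tags` are all assumptions for the example.

```python
def build_description_prompt(product_name, attributes=None, audience=None):
    """Assemble a prompt asking for a short description plus a final 'Tags:' line."""
    lines = [
        "Write a short, upbeat product description for an online store.",
        f"Product: {product_name}",
    ]
    if attributes:
        lines.append("Attributes: " + ", ".join(attributes))
    if audience:
        lines.append(f"Target audience: {audience}")
    lines.append(
        "Then suggest up to five short tags, comma-separated, "
        "on a final line starting with 'Tags:'."
    )
    return "\n".join(lines)


def parse_tags(model_reply):
    """Pull the tags out of the last 'Tags:' line of a reply, if there is one."""
    for line in reversed(model_reply.splitlines()):
        if line.strip().lower().startswith("tags:"):
            raw = line.split(":", 1)[1]
            return [tag.strip() for tag in raw.split(",") if tag.strip()]
    return []
```

The prompt from `build_description_prompt("sunglasses", ["polarized", "black frame"])` would be sent to whatever model the store uses, and `parse_tags` would then split the reply into tags. Asking for the tags on a fixed final line is one simple way to make the reply easy to parse; a structured output format would be another.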

Yeah. Well, I know we're gonna get into a lot of that. Hopefully I can convince somebody at Shopify to come and talk at the next LLMs in Production conference because that would be awesome. Maybe you, maybe somebody else, I don't know. But I want to take these last couple minutes to talk about the book that you're writing. What book are you writing, first of all, and what made you wanna write it?

Um, so we're writing a book, um, LLMs in Production, which happened to be the same title as your last conference. Um, so, well, you know, great minds think alike. Yeah, exactly. Yeah. So this is something that's kind of been in the works for a while now. Um, me and my co-author, Chris Brousseau, um, he's kind of a legend in the NLP community per se.

Um, he works at Mastercard. Um, and so, it really is a nice pairing cuz he comes with a lot of that natural language processing, um, background. You know, he's worked as a translator and he knows several languages. Uh, went in and became a data scientist and kind of learned a lot of that.

And I kind of come from the other end of, hey, like, one of the things I've always been focusing on in my career is like, how do I get a model into production? Like how do I make it easy for people to use and actually usable? Like, one of the things I hate the most is when someone creates a cool model and it just sits in, like, their project workspace on their laptop or something like that.

It's like, no, I want, like, let's get this out to the world. And, um, and so, yeah, so we kind of come in from both ends. And so yeah, this book, LLMs in Production, it really focuses on, um, how to put and work with and maintain these large language models inside of production. I think, um, cuz what kind of inspired this though is like, we saw a lot of people talking about writing LLM books, but a lot of it is more focused on.

The data scientists talking about the theory, it's really going into like mm-hmm. How do you train one of these massive models and, you know, how, how do you deal with all those things? Uh, so we, we really wanted to take it the other way. And, and so like our, our book is really more geared towards the MLEs, the software engineers, you know, the, the people who also wanna work with these, um, models, but, you know, like kind of more get them into a more useful state.

And so, um, so, yeah. We're talking everything from, you know, so like there are things we're gonna be talking about in the book, uh, that, you know, like, okay, you know, like what data sets do you need to train an LLM? Um, but like a data scientist is gonna care about that question differently from like a machine learning engineer, right?

Like a machine learning engineer, mm-hmm, or a data engineer cares about that in the aspect of like, how do I go and get that data, right? Like how do I build the data pipeline, yeah, around it, like how do I clean it or web scrape it or, you know, do those things, versus a data scientist is gonna worry more about like, okay, like how do I clean it?

How do I train it? You know, like, and so, um, we often have a lot of the same questions, but our reasoning and purpose for them are different. And so, yeah. So, uh, this book is, uh, I think what's gonna be really cool is, so at the end of the book, we're gonna have three different chapters, uh, that allow you to go and put an LLM into production.

So we're probably, um, one chapter will be just like, hey, working, you're gonna set up a cluster in the cloud. You're gonna go in and you're gonna deploy a model into the cloud. You're gonna then, you know, create maybe like a phone app or something, and you're gonna be able to go interact with it and use it.

Maybe set up some sort of chatbot. You know, uh, the other project we're working on is, um, you know, like code completion. Like getting your own GitHub Copilot, something like that where you can go in, you can just get something maybe set up on your laptop or, again, you had it from the cloud before, so like how can you deploy an LLM that you fine-tuned for that and kind of just use it on your own data, you know, so that way you can keep all your code, you know, prevent from pushing your code up to OpenAI,

so that way you have more control and making sure it's clean. And then the last one, uh, I think is really gonna blow people's mind. Um, so like we've gotten. And there's definitely a smaller part of creator who sets, if you want to go onto Reddit and things like that. But like we've gotten LLMs working on a Raspberry Pi and we wanna show you how you can do that as well.

How you can get LLMs onto the edge. And I think that's really cool cuz when we first started talking about this idea, we were just like, yeah, LLMs on the edge. Ha ha. Like, it's not, you know, it's not even possible, you know, several months ago. And it's just like, no, you just have to have, like, tons of massive GPUs and things like that.

But like, you know, like with compression as well as LLaMA and other things like that, um, you can get a small LLM, you know, 6 billion parameters, 7 billion parameters, down onto a Raspberry Pi and get it running. And, um, it may be really slow, but like we hope we'll be able to speed that up with like, uh, a TPU stick, like with Coral or something like that.
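A quick back-of-the-envelope calculation shows why compression matters here. The sketch below assumes a model in the 7-billion-parameter class and an 8 GB Raspberry Pi 4 (the largest memory option for that board), and it counts only the weights, ignoring activations, caches, and runtime overhead. The function name `weights_size_gb` and the bit widths chosen are illustrative assumptions, not figures from the book.

```python
def weights_size_gb(n_params, bits_per_param):
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


PI_RAM_GB = 8  # assumption: the 8 GB Raspberry Pi 4 model
N_PARAMS = 7_000_000_000  # assumption: a 7-billion-parameter model

for bits in (16, 8, 4):
    size = weights_size_gb(N_PARAMS, bits)
    verdict = "fits" if size < PI_RAM_GB else "does not fit"
    print(f"{bits}-bit weights: {size:.1f} GB ({verdict} in {PI_RAM_GB} GB of RAM)")
```

At 16 bits per parameter the weights come to about 14 GB, which is well beyond the board's memory, while 4-bit quantization brings them down to about 3.5 GB. That is the gap compression closes, at some cost in accuracy and, as noted, speed.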

And so yeah, we're pretty excited about this book. So currently, I can't wait. I have a little bit of information, which is, you've written the initial drafts for chapter two and chapter three. When does the early access come out? Uh, so I think the early access comes out after you've written a third of your book and.

Since we're going with Manning, they have MEAP is what it's called, M-E-A-P, which, uh, gives people early access. And so yeah, we're supposed to be done with our side, uh, May 22nd. So hopefully the first MEAP will be soon after that, by the time this podcast comes out, approximately around that time.

Oh, okay. Yeah. So yeah, if you're, uh, watching this podcast, go check it out, like, look for the MEAP. Hopefully it's out really soon. Yes. We'll leave a link. So, dude, this is awesome. I look forward to reading it. Thank you so much for your time. This has been great. Awesome. Thank you.
