MLOps Community

Don't Listen Unless You Are Going to Do ML in Production

Posted Mar 10, 2022 | Views 435
# ML Orchestration
# Model Serving
# Leverage
# Banana.dev
SPEAKERS
Kyle Morris
CTO @ banana.dev

Hey all! Kyle did self-driving AI @ Cruise, robotics @ CMU, currently in business @ Harvard. Now he's building banana.dev to accelerate ML! Kyle cares about safely building superhuman AI. Our generation has the chance to build tools that advance society 100x more in our lifetime than in all of history, but it needs to benefit all living things! This requires a lot of technical + social work. Let's go!

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

Adam Sroka
Director @ Hypercube Consulting

Dr. Adam Sroka, Director of Hypercube Consulting, is an experienced data and AI leader helping organizations unlock value from data by delivering enterprise-scale solutions and building high-performing data and analytics teams from the ground up. Adam shares his thoughts and ideas through public speaking, tech community events, on his blog, and in his podcast.

Many organizations aren't getting the most out of their data and many data professionals struggle to communicate their results or the complexity and value of their work in a way that business stakeholders can relate to. Being able to understand both the technology and how it translates to real benefits is key.

Simply hiring the most capable people often isn’t enough. The solution is a mix of clear and explicit communication, strong fundamentals and engineering discipline, and an appetite to experiment and iterate to success quickly.

If this is something you’re struggling with - either as an organization finding its feet with data and AI or as a data professional - the approaches and systems Adam has developed over his career will be able to help so please reach out.

Cutting-edge data technologies are redefining every industry and adopting these new ways of working can be difficult and frustrating. One day, there will be best practices and playbooks for how to maximize the value of your data and teams, but until then Adam is eager to share his experiences in both business and data and shed some light on what works.

SUMMARY

Companies wanting to leverage ML specialize in model quality (architecture, training method, dataset), but all face the same set of undifferentiated work to productionize the model. They must find machines to deploy their model on, set it up behind an API, and make inferences fast, cheap, and reliable by optimizing hardware, load balancing, autoscaling, clustering launches per region, queueing long-running tasks, standardizing docs, billing, logging, CI/CD that integrates testing, and more.

Banana.dev's aim is to simplify this process for all. This talk outlines our learnings and the trials and tribulations of ML hosting.

TRANSCRIPT

0:00

Demetrios

What's up, Adam? How are you doing, man?

0:01

Adam

Very good. Very good. How are you?

0:04

Demetrios

I am excellent, dude. We just got done talking with Kyle from Banana.dev – I think that's the actual address. But he is an ex-Cruise machine learning guy. And, dude, I got some killer takeaways from this. But I want to hear what you thought first, before I tell you mine.

0:27

Adam

Yeah, very interesting discussion, lots of... It's interesting to see someone with such a sharp-minded focus on just their problem like, “We're just going to do this thing.” And I love that in the conversation, he said that “Actually, with certain things, we just say ‘We suck at that – that's not us. Come back later’” kind of thing that's missing, I think, from a lot of what people try to do. They're like, “Oh, well, yeah – we can do that as well. Actually…” It's nice and refreshing. And this – what Banana and them are doing – is in a very competitive space, it's not well understood. So he's taking an opinionated, but quite a pragmatic approach to just getting stuff done. So I really liked that. Yeah, some brilliant material within the conversation.

1:12

Demetrios

That laser focus, man. It was killer to see that. I totally agree with you. I also think he had a quote that I might put on a shirt or sweatshirt and try to put it in the merch store, which you all can find, as you know, if you've listened to this podcast before. You can check out the link for that in the description. But the quote was “Paying for idle GPUs is really dumb.” I need that on a shirt. It's really killer. Also, the three different axes along which he looks at serving things – definitely worth keeping an eye out for that. We are labeling this “Do not listen to this unless you're going into production.” Because this is very much production kind of content, quality stuff. I loved it. Kyle is, again, the CEO of Banana.dev, ex-Cruise, and he did robotics at CMU. The guy is doing stuff. And then we did a little personal exploration at the end.

2:17

Adam

Also worth saying, he’s super generous. Hit him up for advice and tips and stuff. He said he was really forthcoming with his time, and he wants to contribute to the community. So really, really cool.

2:28

Demetrios

So true. Yeah. So feel free to reach out to him. He just wants to help. He was like, “It's not going to be a sales call. It's just I want to help people get into production.” So there we go. Alright, man, let's get into it.

intro music

2:43

Demetrios

Let’s talk for a minute about Banana.dev, man. I got to know what's up with that. And I always think whenever I say Banana (whatever) incorporated, Banana enterprises, it reminds me of like, we're going back to our roots and it's so easily relatable. I love it because you can draw a banana and it's really easy, and it makes for a great logo. But then it makes me think like we're very kindred spirits with the banana because of our monkey roots. I want to know what's going on with you and Banana.dev.

3:12

Kyle

Yeah, so just like the origin of the naming? Basically, there's a common meme in the dev community of using bananas for scale and that's what we want people to do – we want people to use Banana for scale. The other, more personal reason is I like to say “Banana is the next Apple.” Because it's like, if you're pitching something, it's pretty easy. People like fruits. But I think from the dev side, “Banana for scale,” that's what we're going for. Yeah.

3:42

Demetrios

Amazing.

3:43

Kyle

It's pretty simple. chuckles

3:43

Adam

Really good joke, once you've explained it.

3:46

Kyle

We had different names before. We were Booste. But we heard several pronunciations like “boost-eh”, “boost-ee” and we were just like, “No. This isn't working.” chuckles

3:58

Demetrios

So what are you guys doing? Can you tell us a little bit about Banana.dev before we jump into this full conversation?

4:04

Kyle

Yeah. So the point of Banana is we make productionizing ML really easy. You come to us with a model, you've got the quality you need, you want to go to prod, we give you back an API that you can call in two lines of code, and then you can use your models. It's like GPT-3, or like some third party – except it's your baby. Your model. And what this means is that it'll scale naturally, we keep the cost low, we make the latency fast. We do all that stuff you need. You don't have to worry about it. You focus on the quality, which differentiates the business. We do the undifferentiated production infra.
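
To make the "API you can call in two lines of code" idea concrete, here is a minimal sketch of what calling a hosted-model endpoint over HTTP generally looks like. The URL, header, and payload shape are assumptions for illustration, not Banana's actual API.

```python
import requests

# Hypothetical endpoint and payload shape -- not Banana's actual API.
API_URL = "https://api.example-model-host.dev/v1/infer"
API_KEY = "your-api-key"

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "your-model-id", "inputs": {"prompt": "Hello, world"}},
    timeout=30,
)
print(resp.json())
```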

4:43

Demetrios

Hmm. So what's the great vision there? Is it just to help more people get into production? For more people to realize value with machine learning?

4:53

Kyle

Yeah, so the motivation was – I was at Cruise before – we were productionizing ML and we realized that it takes months to just get a model to the quality you need to go to prod. And then once you go to prod, it takes another few months to handle all the long tail of stuff that breaks, like from auto scaling, spinning up machines, if you want to do something like serverless GPUs, being able to just handle high volume, optimize latency – now you need to know all these tools. Suddenly you have to know Docker, K8s, you have to know CUDA if you want to optimize inference, and it's like, “Wow, this isn't what I signed up for. I'm a data scientist (or whatever). I'm working on model quality. And now I have to do all this.” So kind of my broad motivation is to accelerate the creation of really smart machines, like AI, whatever ML, whatever you wanna call it. And I think productionizing is just a way we can do that – it hasn't been solved yet. And it's just starting. The pain point is being hit by more and more companies every day. So the vision would be that, when you go to production and you want to scale machine learning – use Banana for scale. That's the simplest term. And right now we have early offerings with serverless GPUs, which I think is the most valuable offering we have right now. Because we can just say, “Hey, go to prod. We’ll spin up GPUs 10 times faster, so you only pay for usage.” And this can really help if you have a lot of idle time.

6:22

Adam

So that's really interesting. So in like, the utopia, ideal world, I don't have to hire any machine learning engineers, now. I can just have my data scientists interact with your team and your tools. So that's the big sell. And then the thing I would get onto following that is, there's a lot of competition in this space, right? People are trying to figure this out, it's very young. Is that the goal, really, to kind of take that headache away from companies?

6:48

Kyle

So, what we've seen is two cases. There's either the case where the team doesn't have infrastructure engineers, and yeah – their data scientists will talk directly to us. This might be an early startup, they've built the model, and now they're like, “I don't want to spend months hiring and then wait for them to build something, if you can just do it.” So that's kind of the speed-to-market value prop. The other is where a team is actually like “We want to hire. We need infrastructure engineers. But we want those engineers to be empowered using the best tooling.” And then we try to work directly with them and provide that tooling. So they need flexibility – they don't want to be vendor-locked. Nobody likes that. So that's the other angle.

7:28

Adam

Yeah. The other thing I was gonna ask then was – we had a conversation with James Lamb a while back, and he talked about how they built… I can’t remember the name, it was his previous employer. He'd done the hard yards and built the machine learning engineering system there, and they found that the outcome of that was that it was very… it was optimized for one very particular type of model and workload. And if you're talking about early offerings being GPUs, I'm guessing they're in computer vision. Is that the kind of early stage stuff that you're doing at the moment?

7:57

Kyle

Yeah. So, large ML models are typically on GPUs – you need that to do quick inference. We do language models, computer vision, text-to-speech was actually a really popular use case because it's a large batch process. Yeah. The second pain point for customers is – typically, spinning up a model can take minutes. With stuff like GPT-J, it takes 15+ minutes, but we've got it down to seconds. So that allows serverless offerings. All that optimization is on the GPU side. You can use CPU, it's just very slow, typically. A slow inference affects your user experience. But in some cases it won’t, so we'll take the CPU approach then.

8:45

Demetrios

So, I wanted to get into a little bit of what you were talking about right before we hit record, and that is some of the common pitfalls that you've seen on the engineering side when you want to take a model into production, and also then start to scale it. Can you hit us with some of the things that you've seen over your tenure?

9:12

Kyle

Yeah. I'll start with the most common. The first step that teams take is – they build out a model, they're used to notebooks, they have their Google Colab (whatever) and then they go to prod and they say, “Hey, we're gonna use the same setup.” We've seen teams spin up a notebook and then offer it as a Flask app and literally, if you close the laptop – prod goes down. We've seen things as janky as that. Because that's how you did dev, right? We've seen people spin up, basically, tmux sessions on a remote machine, again, like port forwarding a server. They load their models directly into a Sanic or Flask app and then they try to just return the inference results. This is usually the first step. And it's great, I love a mindset of like “ship fast”. The problem is, firstly, you can't scale it. If you have any sort of traffic spikes, and you're doing inferences that take on the order of seconds, now your user experience will tank quickly. The second is, you don't have a way to roll out deploy updates, so prod continuously goes down. We see this problem where folks typically already have customers and they're wanting to offer ML to them. So a better solution, right? They don't usually start with ML. So they built this thing, they put it in prod, and now they're like, “Oh, man, I have hundreds of people using this and I don't know how to fix this. Because if I change anything, prod goes down.” So that's like the second main blocker. That causes a lot of headaches. Fixing that takes weeks. So that's the first area to save time. It also causes your users headaches. I think the takeaway here would be, “When you're going to production, you really have to think about your end user. And I know there's this whole ‘do things that don't scale,’ but if you're deploying something to an already scaled audience, you really have to think twice.” I worked on cars before at Cruise and it's like, you can't just push something to production when it's a car on the road. You have to think about – there's already an ecosystem out there. There's people using the thing. There’s people interacting with your product in extraneous ways. So that's a big thing and once you do, it's hard to revert going to prod – when you don't have the infrastructure to push the new update. You've just rolled over users to this broken thing, essentially.
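
For readers who haven't seen it, a minimal sketch of the "notebook straight to Flask" pattern Kyle is describing follows; the file names and routes are illustrative, not anyone's real service.

```python
from flask import Flask, jsonify, request
import torch

app = Flask(__name__)

# Model loaded once at import time -- if this process (or the laptop hosting it) dies,
# prod dies with it. No health checks, no replicas, no rollout strategy.
model = torch.load("model.pt", map_location="cpu")
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    features = torch.tensor(request.json["features"])
    with torch.no_grad():
        output = model(features)
    return jsonify({"prediction": output.tolist()})

if __name__ == "__main__":
    # Single dev server -- any traffic spike or multi-second inference tanks the user experience.
    app.run(host="0.0.0.0", port=8000)
```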

11:34

Adam

I would just say plus one for the word “janky”. Someone cheered me for saying that the other day. All of a sudden I thought I'd made it up, and I'm sure I've heard it elsewhere. So I'm glad that you've used that word, because it's definitely not me. The thing I wanted to ask, because you've touched on this, and it's something that gets discussed now and then, and I think I did a post about it earlier – about roles. I touched on the fact that “Okay, all of a sudden, you've got data scientists, it's an especially difficult skill set to acquire. You need to learn a lot of other stuff now – Kubernetes, Docker, all this stuff – that wasn't in the Coursera course I paid for.” So one thing that I’m constantly wondering about – you've done the hard yards – is there such a thing as machine learning engineering? Or is it actually just software engineering? Is there really a difference? Or is it just good software engineering applied to ML?

12:25

Kyle

I mean, you could generalize that and just say, “Is there such a thing as engineering? Or is it just solving problems?” Adam laughs That's personally the way I think of it, as a founder – I'm just like, “Hey, I treat organizational problems as systems. Organizations are systems and people, like literally workers, processes, whatever. And then machine learning is just another way of basically making a computer do a thing.” That level of reduction may annoy people because I know there's purists that like thinking that there's a Software 2.0 with machine learning – you could argue that neural networks have an architecture. But then you have to dissect even machine learning – there's traditional support vector machines, k-nearest neighbors methods, principal component analysis, that aren't using neural nets. So there's this architecture – like there's the control flow, and then the data. I haven't dug in too much into it. I just see it as problem solving. And I think neural nets, in particular, in machine learning are a really powerful tool. So I would say, if you learn how to use those, you're an engineer that uses that tool. But you should use any tool that you need to solve your problem, basically, as an engineer, in my opinion.

13:40

Adam

Very cool. Okay. I suppose it just comes from – if it's a useful model, then use it. Right? I guess. Okay, to flip that same question – I guess then, who's in the team? Is it software engineers you're hiring? Is it data scientists, infrastructure guys and gals?

13:59

Kyle

Yeah. So typically, early on in a company you want to hire generalists – people to just hack and make stuff work. We started extending into folks that did a lot of consulting on ML. So, you have engineers that have worked directly with folks like Andrew Ng, have done large consulting projects – they have a lot of data points, they know how to solve a problem. We need people that can do two things: understand the requirements from a customer, and then figure out how to build it. There's so much going on in productionizing for ML, that if you have those skills – that's great. But adaptability, I think, is more important. So a lot of us on the team didn't have the ‘perfect skill set’ that we'd be hiring for now, and we think it's changing so fast that we're just like, “Here's the basis.” Maybe, for example, “Can you code? That's good. Do you know how servers work? Do you know Python?” But then it's like, “Alright, you're gonna have to learn Kubernetes, Docker, Pulumi – if you want to do infrastructure as code – all this other stuff.” But we don't require that upfront, because it's like, “You'll figure it out. And we might be building the next Pulumi.” Right? Whatever's needed.

15:12

Demetrios

Something that you touch on, which for me is probably one of the most important parts, and what I see as a disconnect from interviewing so many people over the last few years is, how not enough people think about the way to interpret the problem or actually using machine learning to solve a problem, as opposed to using machine learning to do whatever it is that you need to do and forgetting about the problem completely. So when I think about Banana, and how it's like, “I can come with what I have, and I can bring it. And then you guys help me scale and you help me do everything to productionize that.” Have you thought about like… is there not a disconnect with – well, what about if what you're producing (what you're operationalizing) isn't actually solving the problem that they want? Because I imagine you're not going upstream and saying, “Hey, are we solving your problem?” We're just going to package it up and get it out there and have it so it's bulletproof.

16:25

Kyle

Yeah. We had a previous iteration of our company where this was the failure mode. What we did is we focused on quality and productionizing – that was a mistake. Companies should pick one thing and do it well when they’re starting. And we didn't align on requirements, so we spent a lot of time. We basically had this amorphous set of just like a customer saying, “I want high quality.” And then we don't know what that means and then we keep working on it and they’re like “It's not good enough.” For productionizing, where we focus now – same thing happens, right? You need to really make this tangible. If you're trying to understand the requirements of somebody trying to productionize a model, here's what you need to have them write down and align on: you need to know inference latency, “How fast does the inference have to be?” You need to understand traffic spiking, “How much does it need to scale up or down?” You need to understand price, “How much are they willing to pay for this hosting?” And you need to understand how quickly their model loads into memory, because when you're talking about production, it's not model quality anymore, it's the user experience and their business – how much they're paying you – that matters. So quality is like training and everything, and when you’re in prod, there's other things like reliability too, but basically it comes down to “Does it scale?” “Can I afford it?” and “Is it fast for my customer?” Because that's what they need. If you don't have those numbers and you start building, you have a very high chance of being disappointed. Because you will do weeks of work, they'll come back and they'll say, “Oh, this is like a 30-second inference. I need seven seconds. I need like four seconds. I need 500 milliseconds.” And we're like “Whoa! But now it's going to cost you this much.” And they're like, “I can't afford that.” So you can think of it almost like this cube – we call it the production cube. There are three axes: there's like latency, there's cost, and then there's how quickly it can scale. And we look at “Are we able to meet those three constraints for our customer?” And if not, we just tell them like, “Hey, we're not here yet. We'll do a service fee and then we will do some custom work.” Sometimes if it's a really big deal, we'll just buckle down and be like “They obviously need this.” So you really have to understand, in numbers, what the customer is looking for – what they expect – and then set those expectations before you build anything.
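
One way to make the requirements Kyle lists tangible is to write them down as a small structured checklist before any build work starts. This is only a sketch; the field names are assumptions, not a Banana artifact.

```python
from dataclasses import dataclass

@dataclass
class ProductionRequirements:
    """The numbers to write down and align on before building (illustrative field names)."""
    max_inference_latency_s: float   # how fast each inference has to be
    peak_requests_per_s: float       # how hard traffic can spike up or down
    max_monthly_cost_usd: float      # what the customer is willing to pay for hosting
    model_load_time_s: float         # how long the weights take to load into memory

def fits_production_cube(req: ProductionRequirements,
                         achievable_latency_s: float,
                         achievable_cost_usd: float,
                         achievable_peak_rps: float) -> bool:
    # The "production cube": can all three axes (latency, cost, scale) be met at once?
    return (achievable_latency_s <= req.max_inference_latency_s
            and achievable_cost_usd <= req.max_monthly_cost_usd
            and achievable_peak_rps >= req.peak_requests_per_s)
```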

18:51

Demetrios

And then how does it work with like… because they're coming to you at the end of the road, almost. A lot of shit can break upstream. So how do you make sure that your SLOs are met if stuff is breaking upstream? And how can you tell them that, “Hey, that's actually your part that is breaking, not our part.”?

19:16

Kyle

The way we partition is – they own quality, we own productionizing. So if they tell us like, “Hey, the model is not of good enough quality. Our customers don't like it.” We say “That's not up to us.” Unless it's something to do with latency, inference speed, and then price – like we can't scale – then we'll come in. So we have a very clear division of ownership. It'd be cool to grow into those other areas. But right now we're just purely production inference.

19:48

Demetrios

And like, if you have data pipelines that are breaking upstream, is that part of the quality also? And you're saying “Look, I'm not getting what I need from you guys, so I can't do my job.”?

20:02

Kyle

Yeah. At the moment, we have customers send us the model artifacts – so the weights and the architecture – and then we say, “Hey, we're gonna load the weights into the model architecture. We're gonna host.” How you create those weights, that's up to you. Your data pipelines, your notebooks, everything that's pre-production. I think that can change in the future, but right now, it's just a matter of “How do you 10X in one area?” Right now we are 10X in production. If you come to us and you're paying like 10K a month for GPU hosting, we can probably bring it down to like 1 or 2K. So that's the thing. And then we just proudly say “We suck at the other stuff. We don't do that yet.” I think (this is kind of a meta thing) but it's kind of relieving, as a founder, to be able to confidently say you suck at certain things and say you're really good at something, rather than saying you're just a dev shop and you can do everything. Because then the question is like, “Well, why don't I just use all these other services?” So you really have to 10X something to win in this market.
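
For readers unfamiliar with the "weights plus architecture" hand-off, here is a minimal PyTorch sketch of what loading shipped weights into an architecture looks like before hosting it behind an API. The architecture and file name are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical architecture definition the customer ships alongside their weights.
class CustomerModel(nn.Module):
    def __init__(self, in_dim: int = 128, out_dim: int = 10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# "Load the weights into the model architecture" -- then put it behind the serving layer.
model = CustomerModel()
model.load_state_dict(torch.load("weights.pt", map_location="cpu"))
model.eval()
```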

21:12

Adam

I think that’s a really powerful mindset to have across the board anywhere. I really like that. I want to go back a little bit because I’m trying to extract some of that experience and knowledge for listeners. You went through a bit of a bullet list on your fingers there of stuff that you need. Is that the same if I'm… so the scenario is “I'm a data scientist in a company that's finding its feet with ML. I've got a model that I like. I want to put it into production.” Is that list the same? What's the kind of biggest thing to focus on? Actually, are there any resources that you've written up, like a blog post that we can point listeners at? Because I think that'd be really helpful.

21:49

Kyle

Great point. That sounds like a piece of content we should write. I don't think we have yet. Let me try and distill it here – the TLDR for a listener. So you have a model. You're trying to go to production with it. First thing you always need to ask is “What's best for the customer?” The thing to note – the thing customers care most about – they don't care how much your model costs behind the scenes. That's your cost. They care about how much they're paying, and they care about how fast they get a response. So if you're doing a consumer application, latency is going to be your number one thing. Customers need fast response time, because fast response time means higher retention and engagement. Imagine you build a chat bot and your ML takes 20 seconds per message – that's going to suck way more than if it took half a second. Versus if you're doing batch processing, something like “Hey, send us your job. We'll finish it and get back to you when it's done.” You don't have to care about latency as much – maybe now price is the angle. But customers care about how much they're paying, how reliable the service is, how fast they get the value they want. One interesting insight here is just about customer psychology – people just assume reliability. We have Amazon, GCP, Azure – people assume five nines uptime. If you don't have that, you're not even playing in the same field. You can't come in with like 80% uptime and expect customers to survive with your product. Unless you're like MVPing with like a handful of trial users. Once you're actually scaling up – you're trying to grow – your retention curve will just be destroyed by uptime, because people will say, “But all these alternatives are more reliable.” People just don't think of the problem anymore, right? It's kind of like ordering food that takes three hours to get, and you're like “This doesn't take three hours, it takes 20 minutes with Uber.” So you need to know that about your customer.

23:47

Adam

I think people forget that it’s an artifact of the modern world, I suppose, like “I'm gonna compare every website to things like Google homepage and stuff like that, without really considering the differences in resources, and expertise and knowledge and stuff.” Do you have any advice for people that are dealing with unrealistic expectations on that production end? Again for like, “I'm a machine learning engineer. I'm trying to get something set up.” How would you go about expectation management?

24:19

Kyle

Sure. So you're talking about expectation management with customers, right?

24:24

Adam

Yeah.

24:25

Kyle

So again, let's anchor these axes: price, latency, reliability. Let’s say those are the big ones. Customers don't care about scale because they're one person. Your business cares about scale, because you're meeting many customers. Customers just want reliability, speed, price. The expectations along those vary. The one customers typically care about the most is price. My advice – this is from what I've done across dozens of different projects I've built and what I’ve seen from successful companies – is you should start with people willing to pay a lot. So try to bias towards people that have money. Right? And then treat price optimization as unlocking the market as you go – as like a consequence of scaling.

25:16

Demetrios

That’s a very good point.

25:17

Kyle

Basically what you can ask – we've got a flowchart for it – you should say, “I have this glorious product vision. It's gonna change the world. Great. How do I give it to one person this week (or next week, whatever)? And how do I find the one person that's willing to pay for the price it's currently at?” If you can't find anyone – that should be a flag. You should really be sussing that out. Because everything, at least personally that I’ve built – we've had customers come in and just offer hundreds of thousands, and millions of dollars, to solve a problem. So if you can't find somebody that's willing to pay (upfront, very high) you might not be solving a big problem. Then once you have that – that one person – be like, “How do I find 10 customers?” And now you're gonna exhaust the people that are willing to pay a lot. And now you're gonna have to… just consequentially, the goal is to give this product to as many people as you can provide value to. You're gonna have to lower the price just as a consequence. Again, reliability is expected. We just covered price – start high, optimize. That's how everyone does it. Uber started with special vehicles, right? Like these fancy – what was it? – black limo service.

26:32

Demetrios

Tesla – same.

26:33

Kyle

Then they rolled out the economy option. If they started with the economy option, they might not have been able to have the unit economics to grow. I don't know if that's for certain, but you just look at the first principles – look at a business – you don't have money yet, you’re not big, you can't be penny pinching. Reliability is assumed. People just want high uptime. I know, it's depressing, but that's just the world we live in. People want good products. You're competing with giants. In terms of latency – this one’s interesting. With latency, it doesn't have the same properties as price and reliability. With latency sometimes, for example, we have customers where if the product takes this long to respond, people don't use it. And then as the latency comes down, it's kind of like unlocking that adoption curve. From what I've seen just from dozens of customers, it's not as easy to just say, “Oh, let's find people that are willing to have bad latency,” as opposed to like, “Let's find people that are willing to pay a lot.” Latency is kind of a spikier distribution. Some people are okay, some people aren't. You kind of just have to hone in and be like, “Are people not using my product because the latency sucks?” And you have to identify if that’s the problem. So for example, there was a bunch of stuff with GPT-J – people were putting it in prod, trying to do chat bots, and people were just like, “This isn't fast enough. We need better latency.” But then you optimize GPT-J for latency, and now the quality is not there. You'd do quantization, and you'd make it smaller or whatever.

28:11

Adam

Yeah. I can see that, because a lot of the stuff we do at Origami actually is like… it's kind of a hard threshold. It's like the latency can be whatever. But there's a threshold at which you're unusable, kind of thing. Because it is very patchy and we've got systems – billing systems kind of stuff.

28:27

Demetrios

And also, I want to just point out – hard truth being thrown out there. If people aren't willing to pay top dollar for what you've built, then that should be a red flag. That is such a good quote, like, “If you can't find someone to pay a pretty penny for it, then you're doing something wrong.”

28:48

Kyle

Maybe we could re-distill – we kind of rambled through stuff. But when it’s distilled, it's really simple – if people aren't bugging you for your product and willing to pay a lot, you might not be solving a real problem. That's just hard. Reliability is assumed. Start high with price. Latency – that's where your main work is going to be. With latency, you need to be like, “Am I unlocking the market by lowering it?” With price, just start high all the time. With latency, try to get it low. Don't start by trying to get price low. That's basically how I distill it into a chart. Keep price high and start by lowering your latency – that's what drives customer experience. Find the people that are willing to pay a lot for that experience and then lower price as you scale.

29:35

Demetrios

You know what I love, Adam?

29:37

Adam

What do you love?

29:38

Demetrios

Gettin’ a new job!

29:40

Adam

Really? Okay. Well, it’s funny you should say actually, because Origami is hiring. We're hiring at some pace as well – I kind of keep losing track of what we've got going on. But definitely on the tech side – we're looking at data folks, data engineers. I expect more to come, but any kind of software engineers or engineering managers that listen to this, please reach out. But I think there's something you're looking at doing to maybe help with this.

30:04

Demetrios

Dude, we created a jobs board. Adam ooh’s And you can see on that jobs board right now, Instill AI is looking for a founding engineer. You've got Zoro, US, looking for a senior machine learning engineer. There's all kinds of great jobs being posted on our jobs board. So whether you are trying to hire somebody, or you're trying to get hired, get over to our jobs board at MLOps.Pallet.com. You can also find that link in the description below. That's it. Or just go work for Adam – go work with Adam. That would be awesome. At origami. All right, let's get back into it.

30:42

Demetrios

There's some things that we mentioned before we started recording and one of them was like – hopefully what this podcast will be today is – something for someone who is already in prod. Right? Like, “Don't listen to this podcast unless you're in production or you're going into production.” What are some certain things that you want to talk to people who are in production about that you haven't already said?

31:10

Kyle

The biggest thing is paying for idle GPUs is really dumb. GPUs are expensive. Right now, AWS, GCP aren't offering serverless GPUs. That's kind of what motivated our team to go like, “This is so dumb.” Like, you wouldn't leave your car on and pay for a gas bill. You wouldn't leave your TV on and pay for the electricity bill. So why do people leave compute on when they're not using it, when it costs more than gas and a television’s electricity in some cases? If I was looking at a team that’s already in prod, I'd be asking, “How much idle compute time are you using? Could you put that into hiring an engineer?” It's really easy to burn 3-5K a month on compute. When you look at 3-5K, you can hire an engineer for that. So I would ask people, “How much are you spending on dead compute? And how many people could you hire with that?” And just trying to figure out how you can get that down is valuable. I think that's the big thing. Then the second would be, look at your latency tolerance. If you hit a traffic spike, like I already mentioned, latency affects how much your customers use your product. So really dissect “What is the worst latency your customers are getting?” That'll give you a lot of insight into what's working with your product. Sometimes when we just fix that, we realize that like, “Hey, 10% of users are getting just awful latency – multi-minute wait times – and some people are getting seconds.” When you fix that, you can add auto-scaling and better distributed workloads that can directly impact retention. These are the main things I'd say to teams in prod: watch your bill, and make sure you're scaling, basically.
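
A back-of-the-envelope calculation makes the idle-GPU point concrete. The hourly rate and utilization below are assumptions for illustration; substitute your own instance type and provider pricing.

```python
# Assumed numbers: an on-demand cloud GPU at roughly $3/hour, actually busy 10% of the time.
hourly_rate_usd = 3.00
hours_per_month = 730
utilization = 0.10

total_bill = hourly_rate_usd * hours_per_month   # ~$2,190/month for an always-on instance
idle_cost = total_bill * (1 - utilization)       # ~$1,971/month spent on dead compute

print(f"Always-on bill: ${total_bill:,.0f}/mo, of which ~${idle_cost:,.0f} is idle time")
```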

32:58

Adam

I think as well, we’ve not really spoken about it that much, but there's a real significant human cost to learning this stuff – actually going through the 6-9 months, where you figure all this out for yourself. It can be quite stressful, difficult, painful, whatever – can lead to big catastrophes. So one thing that I wanted to touch on, we've talked about how to optimize the decision making process, match requirements and things. On the technical side – more deeply technical side – what's the one hiccup or hurdle in productionizing ML systems that maybe doesn't get talked about enough, or some people might be unaware of, or have you got an interesting anecdote about something that really stumped you that was kind of unforeseen?

33:46

Kyle

Yeah, I could speak of this personally – I didn't know how to productionize ML before. I learned this on the job. So maybe… the angle here is maybe this will help others that are just about to productionize foresee a few hurdles that… entrepreneurs essentially tend to think stuff will take a couple of days, and then it takes a week. So I can de-risk a couple of those and go through the process of what I got stumped on. First is machine orchestration. Again, you have one replica running of a machine – you have your server – that's great. But the moment you need to scale, you need to be able to have health checking to auto-start a machine, spin it down. Now to do that, to be able to start a machine from a script, you need to have a consistent environment, which means you need to learn Docker. And Docker is a headache, if you haven't done Docker. Then once you want to start doing something like auto-scaling, then you want to start doing bigger orchestration with tools like Kubernetes. That was the second big headache for me to figure out. It is a fun tool, but it's not as fun when you’re trying to serve demand and you're blocked on it. It's more fun when it's like a weekend project you're learning. So, Kubernetes, Docker, then the third big hump you're gonna get to is – sometimes you're gonna want to use GCP, sometimes you're gonna want to use AWS, because they have different offerings that work for you. Now you have to learn a whole suite of how to operate in a console with like, TPU nodes, or TPU VMs, or GPUs on AWS with EC2 versus GCP. It almost seems like they intentionally make it really difficult to do both, because they want to have that vendor lock. So you're gonna spend, like… Dude, I have like an encyclopedia in my head of all the terms with AWS, like setting up warm pools on EC2 with instance templates and… You just have to get to know all that to be able to orchestrate resources. That's gonna cost you weeks as an individual, if you haven’t done it.

35:54

Adam

The documentation is really good as well. chuckles

35:57

Kyle

Yeah. chuckles Exactly. So the third thing that is going to come in, that would probably be really… I'm trying to explain it from what's helpful to you, not just like what the headaches are. The first is – learn Docker and Kubernetes. Learn Pulumi – that one's been really cool. Or some other infrastructure-as-code tool, basically, because that lets you write effectively some Python code that spins up the AWS resources. And you can do that without having to learn the console. So that'll save you like a week of just becoming an AWS-certified guru, basically. What else? Then I think the other headache was… So again, really simple customer requests – if somebody's like, “Can you make my model 20% faster? Just the inference. Make it a little faster. But I can't pay a lot more.” So we're like, “Okay, let's figure it out.” Now you need CUDA, or you need to know GPU programming. chuckles You have to optimize kernels. And assuming you have a deep CS background – like you understand how operating systems work, how drivers work – you can do that. If you don't, that's gonna be a learning curve. Then you have to learn a whole new language construct and this whole paradigm of programming for GPUs. Then if you want to do TPUs, that's a whole different type of programming – you need JAX. So just for latency, there's all these tools – they're all their own headaches. If we zoom out, I’d be like: Docker, Kubernetes, Pulumi are probably the biggest bang for your buck. You have to learn those three tools sooner rather than later – that's what I wish I had learned from day one. I kept prolonging it, because I was like, “I don't want to have to learn a new tool.” But if you know those three tools well, you'll be able to set up auto-scaling, restart logic, reproducible environments, and use Google and AWS interchangeably. You'll just have to do the work, but you'll be able to.
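
To illustrate the "Python code that spins up AWS resources" point, here is a minimal Pulumi sketch. The AMI ID and instance type are placeholders, not a recommendation for any particular workload.

```python
import pulumi
import pulumi_aws as aws

# A single GPU instance defined in Python instead of clicking through the AWS console.
# The AMI ID and instance type are placeholders -- substitute ones valid in your region/account.
gpu_box = aws.ec2.Instance(
    "inference-server",
    ami="ami-0123456789abcdef0",   # placeholder deep-learning AMI
    instance_type="g4dn.xlarge",   # entry-level NVIDIA T4 instance
    tags={"purpose": "ml-inference"},
)

pulumi.export("public_ip", gpu_box.public_ip)
```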

37:56

Demetrios

One thing that I find interesting and fascinating when talking to a lot of SREs is the idea of like… what is it called – chaos engineering, and how to do chaos engineering. Because it sounds like what you're talking about is very much focused on this – scaling up, scaling down, and making sure that you have the SLOs in place. How do you look at chaos engineering? Are you purposefully going through there and testing the resilience and testing theories that you have when you're scaling up and scaling down? And what are some things that you focus on when you do that?

38:37

Kyle

Could you define chaos engineering for me? Is it like extreme programming? What is it about?

38:40

Demetrios

Basically, it is intentionally bringing chaos to your system to test it out and see how it does under stress. We had someone on a few weeks ago, or actually last week, and they were talking about how it was like, “You want to bring this chaos in because you want to have a theory about what will happen when this chaos is introduced. And you want to see if your theory is correct. Just in case that, out in the wild, that actually does happen.”

39:14

Adam

Yeah, you set up services that shut off regions, shut off infrastructure, by intent. Once in a 1000 events, they just do it regularly. It was a Netflix thing, wasn't it? That led to them being so reliable.

39:28

Kyle

So your question is like, “How do we do chaos engineering? Or how do we approach that type of stress testing?”

39:32

Demetrios

Yeah, exactly. What do you think about that kind of… just basically keeping the resilience and keeping the lights on?

39:44

Kyle

Yeah. When you're a startup, it's a fine balance of being resilient, but also not overoptimizing. The way I kind of look at it is, “Okay. We're trying to serve some customer. We haven't served them before. What are the ways this could break?” Just make a list. I think the problem is that engineers get stuck in their head a lot, thinking of all these cases that don't exist. And then the customer just uses this tool in a really simple way and everything breaks. So I think I would just say – be your customer. Especially if it's a consumer app, just be the customer – use it as the customer. Look at like, “What are the hot paths going to be?” And then distribute it as “What are the common use cases? What are the common ways this could fail?” For example, you don't need to stress test Korean servers if you don't have servers in Korea. You can also do multi-region deployment. We've had this – we've launched servers for customers in South Korea. So we have to start thinking about that and we'll ask ourselves, “Hmm. If I'm a customer in South Korea and I'm using Banana, what's the first thing I'm going to care about? Probably latency.” So we see, “What's the latency if it's in the West Coast? What's the latency in Korea?” We just recheck it. Then we ask, again, like, we zoom out – does that matter to me? And then we might be like, “Hey, it's a batch job which doesn't affect retention. So actually, it doesn't matter.” And then we're like, “Cool, let's just not worry about it.” chuckles It almost sounds too simple. But it's like – just be your customer. Try to get in their shoes the best you can and that's 1) Gonna help you make better business decisions. But 2) help you figure out what to stress test. And that'll avoid the problem of an engineer coming to you saying, “Hey, I've tested all these specific sub-regions of things.” That is great. Again, if you're working on autonomous cars, you need to do that. Because the consequence of not doing it is you could cause irreversible damage, and you can't mess up. But when you're deploying SaaS or something, you need to just ask “What are the actual real failure modes that can happen?” Because you're resource-constrained as a startup. The equivalent with autonomous cars would be like, “Hey, we're deploying cars in Palo Alto on nice open roads, and we're only launching during daylight.” Well, then you should just say, “We're not going to drive at nighttime.” We're going to test the day, we're going to pass these cases, and we're going to be very clear that we do not drive at nighttime, because we don't support that yet. So with ML, you can do the same thing, like “Latency doesn't matter for this customer’s use case, we don't need to worry about those types of spikes.”
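
The "what's the latency from the West Coast versus Korea" check Kyle describes can be as simple as timing a request from each region of interest. A rough sketch follows; the endpoints are hypothetical.

```python
import time
import requests

# Quick-and-dirty latency check against two hypothetical regional endpoints.
ENDPOINTS = {
    "us-west": "https://us-west.example-inference.dev/health",
    "ap-seoul": "https://ap-seoul.example-inference.dev/health",
}

for region, url in ENDPOINTS.items():
    start = time.perf_counter()
    requests.get(url, timeout=10)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{region}: {elapsed_ms:.0f} ms")
```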

42:25

Demetrios

chuckles As you were talking about this, I have a favorite question that I like to ask all of our guests. And I want to ask you about it before I jump into some of your website – your personal website – and some of the great blog posts that you have on there. Do you have any war stories for us? What can we learn from your wrongdoings? chuckles

42:48

Kyle

chuckle Yeah, the hardest part is picking one. laughs

42:53

Demetrios

That's a true engineer right there. chuckles Yeah. That's basically everyone.

43:01

Kyle

I think… Okay. Let me just give a really concrete one. chuckles If you're using Redis queues, make sure they're not public or else people will start throwing shit on your Redis queues. Back when we were really early on, we had a production Redis queue that was open and we started getting – prod traffic was coming down – and we're like, “What's going on?” Basically, we would take a task off the Redis queue and then look at what the script on top was and then try to run that script, and somebody was trying to do remote code execution. So they were putting a URL to their own bash script that basically wiped the machine and then started mining Bitcoin. And they were putting it in our Redis queue. It was spreading out across all our GPUs – huge headaches. Basically, don't keep it public – you will get hacked. You might think, “Oh, nobody has time to do that.” The thing is, it's not time – people have bots that scan. They use something called Google Dorks. You just write a Google query to search for failure modes – like in the URL, an SQL injection, whatever – and then they'll automatically find your site, just like a crawler would find your site and post spam. So that was a big one. Trying to think of like… yeah, that was a huge headache, and it was such an easily preventable thing. chuckles
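
Two takeaways from that war story can be sketched in a few lines: keep the queue off the public internet and behind auth, and never execute arbitrary payloads pulled from it. The host, password, and task names below are illustrative assumptions.

```python
import redis

# Connect with authentication to a Redis instance that is NOT exposed to the public internet
# (bind it to a private network/VPC and require a password in redis.conf).
r = redis.Redis(host="10.0.0.5", port=6379, password="use-a-real-secret", db=0)

ALLOWED_TASKS = {"run_inference", "warm_model"}  # hypothetical task names

# Pull a task name off the queue and dispatch it via an allowlist --
# never download and run scripts or URLs taken from queue payloads.
_, task = r.blpop("jobs")
task_name = task.decode()
if task_name not in ALLOWED_TASKS:
    raise ValueError(f"Rejected unknown task: {task_name!r}")
```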

44:30

Adam

I saw a demo of that, actually, from a senior Microsoft salesperson who went through a whole tech demo of how Azure is like the most hacked or attacked site online or something. He turned off some of the security stuff that they normally have and this new instance (of whatever it was) within seconds, it was getting 20,000 attacks a second from all this automated stuff that was just trying to break in and stuff. It was really quite impressive. He said “Look, we protect against this.”

45:02

Kyle

I think the other example I can give – I can't go into the exact example, but I can give enough context to really not do it. Basically, if you're deploying a tool that people can use, you really need to understand what they're using your tool for. I can't emphasize that enough. If you don't understand how downstream people are using your tool, really bad things can happen. Like if they're supporting something that you don't support, if they're causing harm, if they're breaking the law – and it will happen if you build tools. It's surprising. Like you'll build a hammer and say, “This is for nails,” and somebody's gonna throw it at someone’s window, and you're gonna be like, “What the heck?” This is really kind of a similar thing to “Don't do stuff public,” but at the same time – just look at how people use your tools. Understand it.

45:54

Demetrios

Alright, so let's get into a little bit of this personal stuff that you have. I think it's awesome that you put out this personal blog – and we found it – so I have to ask some questions on it. You have a really cool post about just what you want in life. And there was like a footnote at the bottom where you edited your original post, going from “I want a long life” to now it's like, “I want to live hard and die fast,” almost. Can you talk to us about that? What's the catalyst behind that?

46:29

Kyle

Oh, that's so amusing – the fact that somebody saw that, because of how I blog. Firstly, I just write in prod on there, like I try to go there and I say, “This is a new post.” And then I think about it. But I don't think too much about who's reading it and stuff. I just say, “Hey, this is valuable to me. I hope somebody else thinks so too.” Basically, I had this kind of realization – I've been training for an ultra and I got rhabdo, which is basically when the muscle breaks down too fast, gets into your blood and your kidneys stop working. It really messed me up. So I highly recommend if you see that happening, you should get checked out. But basically what it kind of made me think was that I didn't regret the process of training to the point where that happened. I kind of see it like it's just how I want to look back on my life. If I was like summiting a mountain or something like that, and I saw myself, and I passed away on that mountain – the two questions I'd ask myself are, “Was the mountain this guy was climbing big? Did it look inspiring? Or was I outside of a Walmart on a hill and just passed out and died?” I'd be like, “That's not inspiring.” So I want to be like, “Am I doing something big with my life?” The second is – this happens all the time – like you excavate something or someone from a mountain from 1000 years ago, and people say, “How hard was this person pushing?” If I found myself, I'd want to look at myself and be like, “Yeah, I was pushing to my absolute maximum. Maybe I wasn't the smartest about it. Maybe I didn't have everything figured out. But I pushed to my absolute limit.” So the big distinction is that it's a tougher change. Maybe this will be different in five years, in my life. But I realized that for me, trying to live a really long time was sacrificing my ability to push myself to my limit. And I feel like it's just such a privilege to be able to hit a true limit that I don't want to lose that. It doesn't mean “Hurt yourself in the same way over and over again,” it doesn't mean “Inflict harm for the sake of harm.” It means, “I want to train for an ultra. If something like my kidneys stop working, I want to be like, ‘Wow! What a privilege to be able to hit that point. It's so rare to hit that level where you just can't keep going.’” Now that I know, I won't do it again, but I wouldn't want to live in a way where I don't get to experience those extremes.

49:08

Demetrios

You’re looking for the red line.

49:09

Kyle

Yeah, so it's kind of like my value. I want to live healthy. I want to live long. Like, I get that. But I don't want to do it by tiptoeing around life and avoiding things that could push me to my limit. If I look back on my life, the other thing I was really proud about is – I did a Kilimanjaro hike a few years ago with food poisoning. I blacked out multiple times, oxygen was low, I had to get carried into camps. It was brutal. But I look back to that and I'm like, “Man, that's one of the things I’m really proud of or happy about.” That was such a limit that I didn't think I could hit. It would just feel really tragic if I hit the end of my life – even if I lived to 100 – and I look back and be like, “I didn't ever push past 90%.” I would feel sad, even if I lived an extra 10-20 years. Whereas if I look back and I say, “Yo. Man, I messed up. This is it.” Or like, I get diagnosed with something, and maybe have a year left – I want to make sure that I don't regret not pushing hard enough.

50:11

Demetrios

Amazing. Amazing to hear this. And I love this philosophy side. It might become more commented on, on the podcast. If anyone wants to hear more philosophy from our guests, let us know. I know this is a technical podcast, but that was a very inspiring answer and it makes me think about how I'm living a little bit differently. Now, let's… I think let's finish it here. This has been awesome. Kyle, I thank you so much. If anyone wants to check out what Kyle is doing, go to Banana.dev and you will see all of his stuff. He's pushing hard on his company, too. You just heard the philosophy. You heard everything. I mean, you were… Yeah, there were some real gems in there, man. I appreciate you coming on here and teaching us some stuff.

51:02

Kyle

Perfect. If I can add one thing – if you have any questions about going to prod – free of charge – just message us, hit us up. We want to help in any way. If you're building your own thing, just at least come to us and don't make the same mistakes. But we're happy to give advice because it helps us learn too, about the different pain points. Yeah, good luck, basically, if you're taking on that challenge.

outro music
