Building ML/Data Platform on Top of Kubernetes
Julien is a software engineer turned Site Reliability Engineer. He is a Google Developer Expert, a certified Data Engineer on Google Cloud and Kubernetes Administrator, and a mentor for the Women Developer Academy and the Google for Startups program. He is working on building and maintaining a data/ML platform.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.
When building a platform, a good start would be to define the goals and features of that platform, knowing it will evolve. Kubernetes is established as the de facto standard for scalable platforms but it is not a fully-fledged data platform.
Do ML engineers have to learn and use Kubernetes directly?
They probably shouldn't. So it is up to the data engineering team to provide the tools and abstraction necessary to allow ML engineers to do their work.
The time, effort, and knowledge it takes to build a data platform is already quite an achievement. Once it is built, one has to maintain it, monitor it, train people for on-call rotations, implement escalation policies and disaster recovery, optimize for usage and costs, secure it, and build a whole ecosystem of tools around it (front-end, CLI, dashboards).
That cost might be too high and time-consuming for some companies to consider building their own ML platform as opposed to cloud offering alternatives. Note that cloud offerings still require some of those points but most of the work is already done.
0:00
Julien
Bonjour! [speaks French].
0:13
Demetrios
How's your French coming along Vishnu?
0:15
Vishnu
Just like my Spanish, it is stunted – terrible. I don't understand anything beyond maybe “bonjour”. That's all I got. What about you? Are you fluent in French?
0:26
Demetrios
Oh, no. Not even close. But I understand a few words, like I tell most people. I mean, Julien, our guest today is not from France, he's from Belgium. Just as a little bit of a sanity check. I don't think he would like it if we confused the two. But I lived on both sides of France enough to pick up a few words in French. Currently, I’m living in Germany and before this, I was living in Spain. So I can say the basics to get me by and get myself a croissant. Anyway, it was cool, man. I really enjoyed this conversation. What did you think?
1:02
Vishnu
It was great. Julien is an infrastructure engineer extraordinaire. He's been at a number of different companies, most recently at Spotify, and perhaps has another adventure left in him. His focus is on enabling technologies like Kubernetes in an ML context to help scale intensive data and ML-related jobs. So, a really fascinating technical background, but he shared a lot of different nuggets. What were your favorites, Demetrios?
1:31
Demetrios
Dude, I loved when he talked about – from the beginning, it was almost like from the get-go – he was on fire. He touched on what we always say here, but it cannot be repeated enough – the technology problems are easy, it’s the people problems that are the difficult ones. The processes are so much more intricate and detailed than the actual tech. And then there was the whole… His views on chaos engineering, in the context of machine learning, were awesome. So keep your ears ready for that one when you get the chance, and perk up if you hear it. But what about you? What were some key takeaways that you had?
2:16
Vishnu
Yeah, I think for me, there were two takeaways that are really crucial to being a machine learning or data engineer at this moment in time. Number one, I would say, is this point around how to manage your career. It's easy for engineers to focus on what technologies they know, but he offers some great suggestions on how to think about it in a bigger way around what problems you're trying to solve. So I really suggest listening to that. The second piece was – what chaos engineering is and how that mindset, which he goes through, is actually really useful in the context of highly complex modern production ML systems. So a lot to learn from what he was talking about. Any closing words, Demetrios?
3:00
Demetrios
Yeah, I just want to give a huge shout out to our sponsor, Superwise. Yesterday, or two days ago, they released their SaaS version of the product, so if anyone wants to just go play around with a monitoring tool for machine learning, you can check that out in the links below. That being said, let's talk to the man himself, Julien Bisconti. I hope I said that right. Bisconti! You know, I'm a professional at butchering names. So here we go, let's talk to Julien.
[intro music]
3:35
Vishnu
We were just having a great conversation about the two kinds of people that we tend to have on our podcast here. And I think our listeners will be very familiar with this, where there are sort of more product and organizational folks, who tend to talk about the organizational challenges of MLOps. And then there are also the more implementation- and technically-oriented professionals who talk about ‘how’ challenges in MLOps. You told us that you were both and I'm really excited to get into that. But tell me about what you're so passionate about on the non-technical side that I think that everyone else should hear. You're telling us about the process and why that's so important. How do you arrive at that?
4:18
Julien
Well, basically, when I was a software engineer, I transitioned towards site reliability engineering. So I focused on the reliability, the cost – it mainly has to do with cloud and resources and providing a platform to the developers and helping them with the code to feature observability, like metrics, logs, tracing, and those kinds of things. I would say, I envy people who have technical problems, because I would say that most problems are truly organizational first. And I would say that it's much harder to change a process than it is to change code for the simple reason that making people understand the context of why we do things is actually very hard.
Humans don't have high bandwidth communication: speech, text, and even video are actually really low bandwidth, especially now that everybody's working remotely, we can feel that. Face-to-face is a little bit better, but it's very hard to explain to people sometimes why they should or should not go a certain way, what the pros and cons in the trade-off are, and how far it is. And I think that data is actually what helps a lot with this. Companies that make data-driven decisions have a much better, a much faster, decision process. They make decisions faster and more decisions in a given time. And that helps them to actually increase speed. For a company, I don't think I've found another metric that defines success other than speed. [cross-talk]
6:05
Demetrios
There’s something you mentioned right before we hit record, and I wanted to just go into this a little bit more with how you envy people with technological problems. Can you just say that again and why that is?
6:23
Julien
Well, because – basically everything has been solved, at least from my point of view. Maybe we need a more performant algorithm, maybe we need a bit more compute – a little bit more time or money – but those problems are solved already. And it comes from the fact that if you don't know something, maybe it's good to learn from someone who already solved that problem. It's hard to find those people, but they exist.
Sometimes it can be from the same discipline or across disciplines. And I think that's from the site reliability engineering part, which is basically “How do you deal with an incident?” Well, airlines have had to deal with that on a very serious level – it's people’s lives. We can actually learn from other domains – from domains other than tech, basically. Tech is everywhere these days, but the domain they are in has also some really good nuggets of wisdom that we can use.
7:25
Demetrios
The main thing there, if I'm understanding it correctly – for you, the technological part has been solved. And that's where you're like, “The hard part is the people and the processes.” You mentioned that it's much easier to change a piece of tech out than it is to change out a process when you have people asking ‘why’ and you have this low bandwidth conversation.
7:53
Julien
Definitely. I'm gonna give you the best example. We talk about the machine learning platform, right? Do you know how long it takes to build a platform from scratch? And just that, think about it – it’s two years basically. You're in for a journey, given that you already know what you're doing. That's not a given, you know? People who build machine learning platforms are not like…
8:17
Vishnu
Rare.
8:18
Julien
Actually, they’re probably listening to this podcast. So hello, everybody. [chuckles] Nice to meet you. The thing is that, I was telling that around the time the cloud came around – so it was 2015. People were comparing VM to VM, the cost of the cloud, to the cost of a server that you can buy. And then they forget that they actually have to build that data center. And then they have to connect it to the internet. But to connect it to the internet, you have to go to a telecom provider, which is usually a monopoly – they just overcharge you in the premium for the bandwidth that you get. So you add all those little costs, and you get the total cost of ownership. This is something that people don't compare often, because it's more like an art than a science.
How do you really measure the time it takes to build a data center rather than to run it? You understand? Those are different consequences for the operation costs and accounting and all those things. But people don't realize that if you need something now, the cloud is actually a really good option. Even if you scale, it's still – I don't know if you heard the story – but Uber had too high costs of observability. The storage and acquiring of the metrics were just prohibitive. The percentage that they were making was just too high compared to the total cost. So they created their own database. I'm not sure if you understand what kind of position you have to be in to say, “You know what? Today I'm gonna create a new database and it's going to be awesome.” I mean, I don't think there is a harder problem than creating a database that is efficient to store and query. Compilers are hard, but databases are just as hard. So you see, this is kind of what I want to talk about.
When you start building an ML platform, you start from nothing. Let's say that two years later, you have your platform. But during those two years, the cloud didn't freeze – it still evolved. Suddenly, you find out that you’re stuck with your platform, and it’s going to be really hard because you had to learn and find the right level of abstraction for your APIs, and teach people, and train them, and put on-call and all that – the total cost of ownership actually has such a toll on the organization.
11:05
Vishnu
Yeah, I think that's a really interesting way that you phrase this. We talk about this a lot in terms of the “build versus buy” context. Now, the build versus buy is often a trite oversimplification because that's the way that you have to present it to your managers. But the reality is, when you're an engineer, or when you're a working professional charged with developing ML products, there's a lot of different atomic decisions that you make in terms of adoption or decision to build it yourself. And what you're saying is, “If I am starting to do that ‘building myself’ earlier in the journey, I'm essentially missing out on the advances of other cloud providers.” Is that what you've seen over the past few years from the companies that you've worked with? That those that have chosen to do it themselves have basically missed out on crucial innovations in the cloud sphere?
12:00
Julien
There are exceptions. I know a few exceptions and those people have hired such a brilliant team. I mean, those people are not in conference – they don't talk about it, but they are wizards when it comes down to the computer. This is basically the level that you need to build a proficient platform. You know, they often say “You're not Google,” but as you understand, Google builds platforms. They use their own platform. But the thing is, if you're in a big organization, whether it's someone in another business unit, or in the cloud, it’s almost the same level of communication. You probably don't talk to them directly. So there is an exception to everything I say. I'm not saying this is the way and that's the gospel, and please, let's join hands and sing Kumbaya around the fire. It's really not that.
13:01
Demetrios
I would love that, though.
13:03
Julien
[laughs] You would play guitar around the fire and we would have an MLOps campfire. I would join. But, you know – it's just hard. It's just hard to build a platform.
13:20
Vishnu
It's a great point you just made, which is when I think about building a business, which is what we all do – whether you're at a big company, or small company – you're building a business, right? There are a couple of different resource pools that you're trying to draw from and try to build a business with. And the reality is, to do the kinds of things in terms of ML platform requires drawing from a very deep and very specific talent pool that not every company can realistically put the effort and money into hiring. And your point is that some of these companies that have done it successfully, that we look to as examples of ML platforms, have hired from pools of talent that are simply not accessible to everyone else. And I think that that's a journey that I have seen myself, having been at multiple startups – making frank realizations in terms of like, “Who are the people that we're trying to put together to build a business that we can? And where can we realistically draw from?”
Now, with that said, I have an honest question. Every engineer is sitting in their office, home office, wherever they are, thinking, “Hey, I could use this tool that already exists for this problem, or I could figure out the problem from first principles and then have that on my resume and learn a lot more. And in the long run, it may be better for my career to say, ‘Oh, I have this understanding of these things I've built,’ rather than relying on these accelerants that other people sort of ridicule, saying, ‘Oh, well, he or she or they don't actually know from first principles the problem that they're trying to solve.’” Hence the temptation to build your own database – so that you can write a blog post about it. You've had a successful technology career. What's your advice to me and that developer that's thinking, “Hey, should I build it or buy it for my own career potential?”
15:10
Julien
Well, there are many ways to increase your value on the hiring market. I think having a high tech resume is just one of them. I would say if you have to invest in something, then it would be more like ‘take a class on negotiation skills,’ because that's basically your salary. [Vishnu laughs] No, but it's true.
15:32
Vishnu
It's true. It’s a great point.
15:33
Julien
It’s just a few hours and you're going to be briefed and trained. And the next time you are in front of a recruiter, this is where all the tricks pay. It's the highest return on investment you can have. So it has nothing to do with technology. I would say teaching has had the greatest impact on my life professionally and personally. When I arrived in Sweden four years ago, I basically didn't know anyone. It's through meetups and the local community (by giving talks) that I made a lot of friends there. It created a great network – I still keep in touch with those people today. So I would say, you can be the best of the best and then suddenly, you're going to be in one company and never write a blog post about that.
I'm not sure that building a database is even considered for it – like, how good is it going to be? You have tons of stories of people trying to start an open source project, and then going down in flames and crying because it's just so hard to maintain and to build and to review issues. Honestly, if you want to write code, please do – enjoy it. Does it have to run in production with 1000 other people hammering on it? That's another problem. If you want to get good, there are many ways to do it. But what are you optimizing for? What do you want out of your career? What would make you proud? What would make you look in the mirror and say, “Yes, I'm really happy with myself. I am proud of what I did.”
For me, it was really to help other people, so I started teaching. I get a lot more out of helping others than actually having some really fancy skills on my resume. There's also this concept that if you have a certification or something like that, it's just one indicator. But usually, if you're at the cutting edge, there is not even a blog post about it. If there is a book about the topic, you’re already too late. If there is somebody who can teach you about that, you are lucky. So this is where I feel like even those strategic problems are a kind of combination of all problems put together. It was a very, very experienced consultant who once told me, “Every problem can be broken down into a simple Unix command.” [Vishnu laughs] Since he told me that, I could not unsee it. It is so true.
18:20
Vishnu
It is so true.
18:22
Demetrios
Yo, Vishnu – real quick. Do you like getting new jobs?
18:25
Vishnu
I love getting new jobs. And I love looking at benefits packages – because it's the best part of starting a new job. Do you like jobs?
18:33
Demetrios
Let's talk about some jobs that we got in the community right now. Walmart Labs is looking for a Director of Data Science. We've got ZenML looking for a Developer Advocate role. What else do we have?
18:47
Vishnu
We’ve got StockX looking for a Senior Applied Scientist role focused on machine learning. I really endorse taking a look at this job. Sam and Ray, who are both community members, are hiring for this job at StockX. And by the way, if you want to check out more jobs, check out our MLOps Pallet, you can find it in the Jobs channel or go to MLOps.pallet.com/jobs
19:09
Demetrios
For me, there's something interesting that you're talking about there too. This comes from – I was speaking a few days ago with my friend Henry, who works at LaunchDarkly. He does not have anything to do with machine learning, but he's a back end engineer, and he was talking to me about these little hacks that he's found along the way when it comes to growing in your career and along your path. One was, he was talking about how important it is to document things and just how he's seen over time that the people that really excel in their careers are the ones that are voracious documenters. The whole reason behind that – and I've said it before, but it bears repeating – if you can, as one engineer, affect the way that 10 engineers work, it's going to be a lot more powerful than you doing 10 engineers’ work, no matter what. That is just going to give you infinitely more leverage.
So that was one. And I asked him, “Okay, what's more? What are some other good hacks that you've heard about?” And this is what I was thinking about, to your question, Vishnu – he said, “You know, a lot of times people don't realize how powerful it can be for a company to just enforce someone (some individual contributor) to take the bull by the horns and onboard a SaaS product.” And most of the time, it is so easy to just grab a product off the shelf, incorporate it into your company, and you don't need much. For a lot of the products that are out right now, it's not like you need to go through procurement – you don't need to do much. You just take that Brex card that you have for your startup, you pay for it, and then you go.
That was a huge one where I felt like, “Oh, I've never realized that.” But because – if you do that, you are affecting the way that the company works and even if you're able to just shave like 1% off of each engineer’s time that they spend repetitively doing things, you have made a significant impact in the business. We can bring this back to the machine learning aspect and talk about that. But what I really want to talk about with you, Julien, is the idea of chaos engineering. Because that's something that, for me, is fascinating and I would love to hear how you look at chaos engineering when it comes to machine learning.
21:46
Julien
Yes, sure. Actually, chaos engineering has had a big impact on me, especially because of those cross-domain problems that were solved independently and they can all come together. I would say this is a mentality shift. In site reliability engineering we have the concept, what we call “service level objective”. Basically, you set up statistics on what you find acceptable on your service. Let's say, “I want all the requests to answer below 100 milliseconds over a period of time.” This creates what we call a “budget”. So when you are actually above that number, you can say “I have some budget. I have wiggle room to actually break things.” And this translates into what we call an “error budget” – this budget is called an error budget. That budget you can spend on doing chaos experiments. But chaos is a very misleading name.
Actually, we don't add chaos to the system – we reveal the chaos that is already inside the system. Let me give you an example. When you scale, if you have three VMs – you go from three VMs to five that you create. Then you find out that the load is going down, so you can scale down. How do you know which machine is going to get killed first? This is basically the kind of question you have – how do you find the answer? So the experiment that you can run is, state your hypothesis “the newly created ones are killed first,” and then you test it. It's a very scientific approach. It's not at all like, “Hey, let's break production.” [chuckles] They say some people have done chaos engineering inadvertently, but it's not at all like that.
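To make the error-budget idea from this answer concrete, here is a minimal sketch in Python. It assumes the SLO is phrased as "a target fraction of requests must answer under a latency threshold"; the 100 ms threshold comes from the conversation, while the target percentage, the class name `LatencySlo`, and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencySlo:
    """Hypothetical SLO: `target` fraction of requests must finish under `threshold_ms`."""
    threshold_ms: float  # e.g. 100 ms, the figure used in the conversation
    target: float        # e.g. 0.99; this percentage is an assumption, not from the talk


def error_budget(slo: LatencySlo, total_requests: int) -> int:
    """Number of requests allowed to miss the threshold in the measurement window."""
    return round(total_requests * (1.0 - slo.target))


def remaining_budget(slo: LatencySlo, latencies_ms: Sequence[float]) -> int:
    """Budget left after subtracting the requests that were already too slow."""
    slow = sum(1 for ms in latencies_ms if ms > slo.threshold_ms)
    return error_budget(slo, len(latencies_ms)) - slow


def can_run_chaos_experiment(slo: LatencySlo, latencies_ms: Sequence[float],
                             reserve: int = 0) -> bool:
    """Only spend budget on an experiment if more than `reserve` is still left."""
    return remaining_budget(slo, latencies_ms) > reserve


if __name__ == "__main__":
    slo = LatencySlo(threshold_ms=100.0, target=0.99)
    window = [50.0] * 995 + [250.0] * 5  # made-up latencies for one window
    print(remaining_budget(slo, window), can_run_chaos_experiment(slo, window))
```

The design choice here is that the experiment is gated on measured data rather than on a calendar: when the service is already burning its budget, the check returns `False` and the experiment waits.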
23:50
Julien
The chaos part is actually quite small. There is all the analysis part that comes after, and getting the data, and the prerequisite to chaos engineering is actually that you have stellar monitoring. Because if you don't know what's happening, well, you just know you broke things and that's the result. Sometimes, to confirm a hypothesis, you find questions you didn't have before because suddenly you understand how the system works. And you understand the edge case of the system. It could be so many things. I’ve had things that nobody could have figured out until we hit some problem and scaling is one of them. Usually, if you have a database and suddenly the number of backends just scales up, at some point the number of connections to the database might become too much and you start dropping them. The database is still accessible but some backends just cannot access it.
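The connection problem described above is simple arithmetic once it is written down. The sketch below assumes each backend keeps a fixed-size connection pool and the database enforces a hard connection limit; the function names and all the numbers are hypothetical.

```python
def connections_needed(backends: int, pool_size_per_backend: int) -> int:
    """Total connections the database sees if every backend fills its pool."""
    return backends * pool_size_per_backend


def max_backends(db_max_connections: int, pool_size_per_backend: int,
                 reserved: int = 0) -> int:
    """Largest backend count whose pools still fit under the database limit.

    `reserved` is an assumed number of connections kept aside for
    admin tools, migrations, or monitoring.
    """
    if pool_size_per_backend <= 0:
        raise ValueError("pool_size_per_backend must be positive")
    usable = db_max_connections - reserved
    return max(usable // pool_size_per_backend, 0)


if __name__ == "__main__":
    limit, pool = 100, 10  # made-up values for illustration
    print(max_backends(limit, pool, reserved=5))  # backends that fit
    print(connections_needed(12, pool) > limit)   # True means some backends get refused
```

Running a check like this before raising the autoscaler's upper bound is one way to turn the finding from a chaos experiment into a number that can be tuned.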
So you have that insight, “How can we improve on that?” Then say “Well, okay. There are some numbers. You can fine-tune those things.” And this chaos engineering gives you such confidence in the system that you know you can break it. You have confidence to the point like you know exactly how long it would take if a whole region goes bust – how long it will take to recreate it into a different region. Sometimes, we talk about infrastructure as code and things like that, but I find that sometimes the simplest solution scales best. Implementing infrastructure as code has a lot of tricks and the problem is, if you automate something that you use once every six months, I'm pretty sure you're gonna have to relearn everything you need the moment you need to use it, because you don't remember how it is automated.
Sometimes having those runbooks helps a lot. If going to the UI of the cloud, and creating a cluster manually, and then triggering the CI/CD pipeline to redeploy everything there, is your fix – that's an acceptable fix, in my opinion. Because you spend very little time preparing for it and the problem is solved. So it's all about risk, right? It's like, “How much reliability do you need? How much usability are you going to have when there is a problem?” And it all comes down to cost, basically. Money translates to everything, I can tell you this.
26:33
Vishnu
I want to ask a quick question here, before you jump into that. This is the first time I'm really hearing, in a clear fashion, what chaos engineering is. I liked your distillation of it a lot. As I hear it, I'm reminded of how entropic, or chaotic, machine learning systems are. You have the data, which is constantly changing, you have the model, and especially with the advent of deep learning models, we have (to put it politely) some unpredictable models with a lot of different factors involved. Then you have the code itself, which involves a lot of different components because of how challenging it can be to code and configure machine learning systems. In the context of all of that, what is the role of chaos engineering in building production machine learning systems?
27:22
Julien
Oh, it's very interesting because I had this question recently. The solution became very complicated, very quickly, in the sense that – let's say you have a model that is behaving badly in production. What do you do? Do you fall back to the previous version? Do you still have it? Is it easy to deploy? Or do you fall back to a heuristic, like some back end? The funny thing is, many people think that MLOps is kind of like, you take a DevOps engineer and you put him doing MLOps, and it's gonna go fine. But the problem that MLOps has, it's kind of the same but a different scale. Because we don't deal with code, we deal with data. All the tooling that we have is built for code, meaning like a few megabytes, as opposed to MLOps machines, which are like terabytes, sometimes petabytes. So the order of magnitude – you cannot create something that has the scale more than 100 times of the scale that it was meant to fix basically.
So these load balancers, for instance, are not meant for that use case. They just either keep retrying, but they fall back onto something. This concept is actually quite hard to implement in any load balancer because they're not meant to be automated in that way. They don't say “If this one fails, just trigger the redeploy of the previous model.” So at the end, you have two models in production, and imagine what that does to the monitoring. How do you differentiate which model gets used?
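One way to picture the fallback and the monitoring question together is to do the fallback in application code and return the name of whichever predictor actually answered. This is only a sketch under that assumption, not how any particular load balancer or serving system works; `predict_with_fallback` and the predictor names are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Predictor = Callable[[dict], float]


def predict_with_fallback(features: dict,
                          candidates: Sequence[Tuple[str, Predictor]]
                          ) -> Tuple[str, float]:
    """Try each (name, predictor) in order and report which one answered."""
    errors = []
    for name, predictor in candidates:
        try:
            return name, predictor(features)
        except Exception as exc:  # deliberately broad in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all predictors failed: " + "; ".join(errors))


if __name__ == "__main__":
    def current_model(features: dict) -> float:
        raise ValueError("model is misbehaving")  # simulated bad deployment

    def previous_model(features: dict) -> float:
        return 0.7  # made-up score

    def heuristic(features: dict) -> float:
        return 0.5  # made-up constant rule

    chain = [("model-v2", current_model),
             ("model-v1", previous_model),
             ("heuristic", heuristic)]

    served_by = Counter()  # stand-in for a real metrics label
    for _ in range(3):
        name, score = predict_with_fallback({"user": 1}, chain)
        served_by[name] += 1
    print(served_by)
```

Because every prediction carries the name of the model that produced it, the monitoring side can split metrics per model instead of mixing the two versions together.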
So you see, it's those little details, that in the grand scheme of things make things so complicated, and that's why changing the process of how people do things is really the one thing that is going to define success. Because they need to very quickly change how they do things in order to map whatever the business needs. I heard I think it was in the podcast, I don't remember what it is, “The ML engineer should be able to ship things to production.” And I have to say it really depends what you want them to focus on – the ML part or actually the MLOps part? Because as soon as you get people who know both machine learning and operation, they're pretty rare. It's not a big pool of people.
30:00
Julien
If you think about it, it takes five years to get a Master's degree in computer science. You don't have to wait five years for hiring new personnel. I take a degree as an example, but it takes time for people to learn those things. I had a data scientist come to me and say, “What’s Docker?” And I was like, “Are you kidding?” I had such a… I had to really find the empathy in me to understand like, “Okay, let's do a workshop. I'm going to teach you everything you need to do.” And it got me thinking, like, “Should they really be focusing on that? Is that really what they want to do with their life?” If I'm a data scientist and suddenly, I'm a glorified sysadmin doing debugging with GPU and drivers and all those things. It's like, “Why? Why?” They should hate themselves. I would hate myself if it was my job. You understand, right?
So this is where the platform is interesting, because you need to provide the right level of abstraction to the user to find that. It's a hit and miss. You have to try, you have to talk, you have to iterate, you have to have a good organization, a safe organization, an inclusive organization. Because people come from different backgrounds and they have different knowledge. This is why the Spotify model is quite… I mean, in Sweden, it's not really specific to Spotify. But in Sweden, companies have a very safe – psychological safety, meaning that it's okay to discuss something without throwing chairs at each other. They are very open to dialogue. That facilitates a lot of things, I would say. It's one of the best ways to deal with that.
32:06
Vishnu
There's so much to unpack in what you just said. I want to start first by saying, you're probably one of the funniest guests we've had in a while. I'm over here cracking up. So this is awesome. Number two, I think your comment about having empathy and understanding how people fit together is so crucial. Because I do think that companies and teams that are working on engineering tend to be lazy about people and not technology. And that's weird to me. Because people are much harder to hire, and find, and pay, and compensate, and keep, right? But when we hire someone, we don't think – no manager is sitting there and saying, “What is my 90-day plan to upskill this person to the point where they're productive?” They're saying, “Okay, new hire. You're getting hired – go. Now learn.” To me that's so ass-backwards, right? It's like, “Come on!”
But I digress from my main question, which is – you have very succinctly put what the core challenge of MLOps is, which is “We need to configure the data, not just the code.” How have you solved this problem in previous technical environments, given the tools that you had? Can you tell us a little bit, either about your experience at Embark or Spotify – great companies – how have you practically solved this problem? Maybe through an example of a model or a system that you were trying to help productionize.
33:37
Julien
This is where I tell you – use the cloud. [laughs] If you have something to do, it is gonna save you hours and hours – days, if not more. And it's easier because once you figure out your use case, your level of abstraction, it's easier to build it when you already know what it looks like. This is the hard part. Let's say you want to learn about finance, but you’re learning in Spanish. If you don't speak Spanish, it’s going to be twice as hard. Here, you would already understand what layer of abstraction you're working on, and what your people can manage, and then you can recreate it yourself. Most of the time, I would say there is not a lot that is readily available or packaged, and so you have to find those tricks. Those tweaks can be like the file name of the file in the bucket. You know, you can find some dirty tricks, like having some kind of super efficient database like Spanner – that will solve most of your scalability problems.
Yeah, you pay for it, but I would say count the cost of how long it's going to take, how many resources, and all the people who are going to build something that they could get with a credit card, and who cannot focus on actually building the feature. For a startup, it sounds… I mean, I would be very surprised if a startup came to me and said, “You know what? We really need to build our own ML platform.” Today, I mean – not two years ago, but today. Because now there are so many options that they could find. I would say that they should first ask the question, “Is ML the real solution you need? Can you do something that does half the job – handle 50% of the use case with something that you already know?” Consistency is key here.
I'm going to tell you a little bit of a story. After the Second World War, the US Army made studies on airplanes, why they fail – what kind of errors the pilots were making. And they found out that 50% of the errors were what they called “substitution errors,” meaning that when the pilots went into the plane, the dashboards of the planes were different because the planes were different – the way the altitude was displayed, the buttons, the controls were different. And that caused half of the mistakes. They standardized the dashboard, and overnight, half of the mistakes were gone. They didn't standardize the plane, though. It's just the way the interface from the human to the machine – they standardized that. Think about how we build our tools. In your stuff, if you have one language, one platform, one API – those things that you already know where to look – and then when you start feeling the pain, like an organism, like a cell that is divided in two, you just branch out. So this is a little bit how the concept of evolving into something works. Start simple because simple scales.
37:10
Demetrios
So you're talking about standardizing. This idea of standardization has come up quite a bit. I know that I've heard it, talked about it – where it's standardizing on an industry level. But you're saying to just standardize within the company, so that everyone in the company understands how things work at this company, and it will be much easier and there will be less mistakes made?
37:35
Julien
Yes, definitely. Otherwise, it's more like, “Oh, there is this open source project. Oh, there is that open source project,” and they kind of overlap. In the end, everything is installed and you don't really know what to use. You're going to spend more time asking around, “What should I use? What is good to use? What is maintained?” It’s like, “Oh, we forgot to deploy that.” – those kinds of things. That's the problem with dealing with computers – we don't see what's going on unless you know the keystroke that will actually get you there. So working early on IAM management is actually really important. Because half of the errors I see are more like “Hey, I don't have access to this and I don't understand why.” Of course, they're not going to tell you “This is the place you should go if you don't have access.” [chuckles] That's kind of a security breach, right? If it's an attacker, he would be so happy to see that.
This is the hard thing about hard things – those are the things that we don't see. It's everything like backups, reliability, monitoring – it sounds so easy, but I see people logging so much that at some point I renamed a service “blowhole” because that's all I could understand it was doing. It’s just like pouring data over things. And the cost of data, storing those logs, was getting to be more than running the service. And I'm like, “You know what? You have to choose. Either you do distributed system or you do logging. Which one do you want?” Because in every system, when one request goes through any of the services, every time it's one log entry that gets added. In the end, it’s just an exponential explosion of data that is actually completely useless because it's stack traces, and they use it for debugging.
This is where I advise “If you are building a distributed system, use tracing, because tracing is built to actually efficiently store those data.” This is where – it sounds crazy to say “don't log” – but I worked at companies where I was not allowed to log unless it was an unrecoverable error. Like “If all else fails, you can log, because it costs too much to log.” So there are all those tricks that you learn once you start moving from one scale to another – it's actually the same problem, but different scale, which makes it another problem. It's like it's the same problem, but a little bit different. So that's where you'll have problems that you never had before. You thought, “Okay? I just use a database,” and then you realize you have two terabytes in that database, queries are getting slower, and those things. It's still a database, but you need a different database that can handle that kind of thing. And those problems are very, very complicated.
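[Editor's sketch] The "log only on unrecoverable errors, trace everything else" policy described above can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the tooling used at any of the companies mentioned: the names `Trace`, `CountingHandler`, `handle_request`, and the service names are hypothetical, and a real system would use a dedicated tracing library rather than an in-memory list.

```python
import logging
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One record per request; each service hop appends a compact span."""
    request_id: str
    spans: list = field(default_factory=list)

    def add_span(self, service: str, duration_ms: float) -> None:
        self.spans.append((service, duration_ms))


class CountingHandler(logging.Handler):
    """Counts emitted log records so the log volume can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        self.count += 1


def handle_request(request_id, services, fail_at=None, logger=None):
    """Walk a request through each service (hypothetical pipeline).

    Normal hops only add a span to the trace. A log entry is written
    only when a hop fails in a way the request cannot recover from.
    """
    if logger is None:
        logger = logging.getLogger("platform")
    trace = Trace(request_id)
    for service in services:
        if service == fail_at:
            logger.error("request %s failed in %s", request_id, service)
            break
        trace.add_span(service, duration_ms=1.0)  # placeholder duration
    return trace
```

With this shape, a healthy request that crosses three services produces one trace with three spans and no log entries; only the failing request adds a log line.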
40:37
Demetrios
There is something that I wanted to get into, because you mentioned “Okay, in the beginning, or if you're a startup and you're looking to go into these things, just use the cloud, especially because it will give you that standardized, easy way to go from zero to one.” I've heard that said before – I can buy into it. Now where I'm at, though, is – how do you reconcile the trade-offs that you get when you use a Vertex or the SageMaker, and it gets you kind of what you're looking for. But then there are all these edge cases that you have to spend months trying to fight with your SageMaker to actually get what you want out of it.
41:22
Vishnu
Great question.
41:23
Julien
This is exactly the use case – Okay. Now you understand the problem. This is where building your platform – you have an actual reason to build your platform. It becomes sustainable. You say, “Hey, if we really need this from the business perspective,” I mean, we're dealing with data. When you're dealing with financial data, you know how much it’s going to cost to run those things and you know how much it's gonna cost not to do it. So it's better to have those data-informed decisions, data-driven decisions than actually just winging it and say, “You know what? Today, I feel like everybody on Slack today makes a decision about building a platform.” This is terrible. It's undocumented. You cannot justify it. And not even for others – just for you. In six months, when somebody will ask you, “Why did you do that?” Document your architectural decision. This is very, very important. It doesn't take a lot. It's just like one page, “We found this problem. This is what we had. This is how we think of solving it. Maybe we were wrong, please forgive us.” [chuckles] That's how I would write it.
Assume you're gonna be wrong and plan. It’s like a chaos engineering mindset. Assume something will go wrong and plan for it. If you're okay with the risk, then it's all good – you're not going to stress about it. And this is why – it's the relationship we have with the unknown. We can very much minimize that. That question is great, because that means “Yes, you have a use case for building a platform,” or maybe a part of a platform, or the part that's missing from the cloud. And if you have that problem, probably other companies have it as well. So that's a great open source opportunity to collaborate. You see? You build that community out of that. It might even become a startup product at some point, because the cloud cannot manage to evolve as quickly as a startup. They evolve very quickly, but they have a different scale. Once they have to change something, it's like millions of users. It's not the same as “Hey, we're gonna start small with 10 users and see how it feels.” And probably there is a market for this kind of thing.
So having this need-driven development is actually much, much better and easier to justify, I would say. It's all about the concept where people were chanting about multi-cloud, and I think they misunderstood. Because what the cloud really lacks is a common security model. As you know, a CPU will run, right? The CPU, the RAM, the networking – it's more or less the same. But for security, every cloud has its own security model and different levels of granularity that don't translate very well. And that's really hard too. Most projects got killed because of that. However, multi-cloud makes sense when you have your people negotiate the contract for renewal with the cloud provider and say, “Hey, we are on two clouds. If you don't give us a good discount, we’re going there.” That's how it works.
It's not at all about technology, it's about leverage. So multi-cloud is very much about having that discount with cloud providers. It's not about reliability, that's what I mean. Because it's way, way harder to manage those two clouds. And when people migrate – actually, some projects never migrate. They just let it die and then rebuild something into the new cloud, because it's just impossible to reverse-engineer everything that has been done or translated. It’s like changing the engine of an airplane while you're flying. It’s just really, really hard, those things. So there are many things like that.
45:11
Julien
It's very much using your wisdom and seeing the landscape that you're in, and strategizing according to that. There is a great, great talk by Simon Wardley – it's called Crossing the River by Feeling the Stones. He talks all about strategy and how you can plan years in advance, like, “Okay, many companies have had those problems so the solution will come by that time,” and you can actually plan for the future regarding what you should build and buy. This is why – start with the cloud. Because you don't know even if your product is gonna make it or not. Or you might have to pivot at some point, and it's easier to outsource all that infrastructure part to a cloud provider that you pay, rather than trying to compete with it with people that you're going to hire and train, that have to maintain all that infrastructure. It's about strategy. That's why I say that tech is actually not really a problem for me – it's a fun problem to have, I would say. I'm really happy when I have a tech problem. That makes my day. I love that.
Also, if you ask a developer – don't ask the barber if you need a haircut. If you ask a developer, “What's the problem?” He's going to come up with a code solution. More code, more tests, more deployment, more manifest, more everything. It's important to ask, “What's the right level of abstraction that we should apply the solution to?” Sometimes changing a process solves months of development. So are we going to need that really? If we change that process, we can actually save… we don't need to build this. Or, if we actually keep something very simple, like putting everything into one folder – that’s just an example. But you see, keeping things simple is actually a full time job and it takes a lot of meetings.
47:29
Vishnu
I totally see your point across your answer there, which is that this experience of trying to make the complex simple, and finding ways to do that iteratively, is the essence of modern software engineering in practice. Right? And I think that that is a really great takeaway for us to wrap up on. Unfortunately, we're at time. Julien, it has been a pleasure to have you on. It's been a while since I've laughed as much as I have on a podcast and also learned a lot at the same time. So thank you so much.
48:11
Julien
Thank you for having me. I had a blast.
[outro music]