The Challenge with Reproducible ML Builds
Omoju Miller is the Founder and CEO of Fimio, Inc., where she is building 21st-century developer tools. In the past, she was Technical Advisor to the CEO at GitHub, co-led non-profit investment in Computer Science Education at Google, and served as a volunteer advisor to the Obama administration's White House Presidential Innovation Fellows.
In this talk, we will learn about repeatable, reliable, reproducible builds for ML written in Python. We will go through what it means for a process to be reproducible. Furthermore, we will talk about the need for accessibility and ease in collaboratively building and working on large-scale ML.
AI in Production
Adam Becker [00:00:05]: Next one up is Omoju. Where you at?
Omoju Miller [00:00:11]: Yes.
Adam Becker [00:00:11]: All right.
Omoju Miller [00:00:12]: Hello. Hello.
Adam Becker [00:00:15]: How you doing?
Omoju Miller [00:00:16]: I'm very fine, thank you.
Adam Becker [00:00:18]: Great. So you've got ten minutes on the clock. Do you want to share your screen or anything before we dig in?
Omoju Miller [00:00:26]: Yes.
Adam Becker [00:00:28]: Yeah. There you go. That's what they see. All right, cool. Well, ten minutes on the clock. Feel free to get rocking.
Omoju Miller [00:00:36]: All right, thanks, everyone. Thank you so much for inviting me to this conference. I'm very excited to talk to you all about the challenge with reproducible ML builds. So we are living in the golden age of LLMs right now. All of us, every single one of us, is extremely lucky to be alive at this particular moment in time. And for those of us in the Bay Area, we are basically in Florence at the start of a new golden era. And this golden era has been ushered in by LLMs, which is a very exciting time. Back in 2018, I was a senior ML engineer and data scientist at GitHub, and I was speaking at the Kingdom conference, and we were talking about something we were calling semantic search, which was basically using natural language to search for functions.
Omoju Miller [00:01:32]: And it was something very simple: you would type in, I want to flatten a list, and it would give you a Python function. And we were very excited about this. This was in 2018. The zenith of that work at GitHub was a collaboration with OpenAI, where they eventually created what we now know as GitHub Copilot. And that basically started us off in the LLM race. Copilot had such a massive impact that even the term copilot has become something people just use to mean a pair programmer or an AI assistant. And as amazing as all of that is, and it is truly, truly, truly remarkable, actually building LLMs is nontrivial. It's very, very difficult. And even if you overcome that difficulty, if you want to deploy them in production environments or move them into different environments, it is something that is very, very hard.
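For context, this is the kind of Python function such a semantic search would return for a query like "I want to flatten a list"; a generic sketch, not GitHub's actual search result:

    # The kind of function a "flatten a list" query would surface
    # (a generic example, not GitHub's actual output).
    def flatten(nested):
        """Flatten one level of nesting."""
        return [item for sublist in nested for item in sublist]

    print(flatten([[1, 2], [3], [4, 5]]))  # [1, 2, 3, 4, 5]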
Omoju Miller [00:02:38]: It's not easy to do. And that is because the way we fundamentally build machine learning software is very, very different from how we build regular software. The reason is that there's a tight integration between the software and the hardware layer. What we are talking about is actually computational linear algebra. And because of that, we can accelerate how these things work by using a GPU instead of a CPU. It makes everything go that much faster. So because of this integration between software and hardware, it's nontrivial to move working code from one environment to the next. And what that means is that, in addition to being an ML engineer, you now also have to be a DevOps expert.
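A minimal sketch of that coupling, assuming PyTorch as the framework: the same linear-algebra code runs on CPU or GPU, but only if the installed build matches the machine's driver and CUDA stack, which is exactly what breaks when code moves between environments.

    # Assumes PyTorch is installed; the same matrix multiply runs on CPU or GPU,
    # but a torch build compiled for the wrong CUDA stack fails on the new machine.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1024, 1024, device=device)
    y = x @ x  # computational linear algebra, accelerated when a GPU is present
    print(device, y.shape)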
Omoju Miller [00:03:29]: You also have to be an infrastructure expert. And that's asking a lot of any one person. Something that in my mind should be very easy is: you wake up, you see that maybe Google or OpenAI has dropped a new model, and the model is open-sourced on Hugging Face. And you're like, cool, I want to go and play with it. You go to a playground, maybe Hugging Face Spaces, you see it, it's working. You're like, great. Now you're like, I actually want to take that model and create my own deployable build. I actually want to build this in a production-like environment, quickly deploy it, and play around with it for my own use case. Something that should be so simple is actually quite hard, and it takes a significant amount of time.
Omoju Miller [00:04:24]: So what I did earlier in the week was try to figure out how long it would actually take me to do this. The first thing I did was pick a random model. I was like, all right, I'll take the NVIDIA Canary 1-billion-parameter model that is on Hugging Face. So I go there, and in my mind I just figured I'd go to the model's page and download the model. And they're very nice. They give you the little prompts of what you need to do: just copy these instructions, put them in your terminal, and we should be good. I should have known that was not the case, because I forgot to read the fine print.
Omoju Miller [00:05:07]: It's not going to work automatically out of the box, because even though it says git lfs install, it's actually not an installation. It is more an initialization of Git LFS, and you must already have Git LFS preinstalled on your system. All right, no problem. I go and figure out how to actually install Git LFS. I decide to use Homebrew. I get it going. I install it on my system.
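For reference, there is also a pure-Python route that sidesteps git and Git LFS entirely; a minimal sketch using the huggingface_hub client, assuming it has been pip-installed:

    # Sketch: download the model repo without git or Git LFS,
    # assuming `pip install huggingface_hub` has been run.
    from huggingface_hub import snapshot_download

    # Pulls the repo contents, including LFS-tracked files like the checkpoint
    local_dir = snapshot_download(repo_id="nvidia/canary-1b")
    print(local_dir)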
Omoju Miller [00:05:32]: I go back, run those commands. Now, this time they actually work. I'm able to clone that repo and pull that model down, and it's awesome. But what I get is a model in some kind of format, the .nemo format. I really don't know what to do with it. This is not giving me a lot to work with. So I go back and try to figure out, okay, what is this? Oh, now I need to go install something else. I have to go install this NVIDIA NeMo thing.
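Loading that checkpoint looks roughly like the sketch below once NeMo is installed; treat the class and argument names as assumptions drawn from the model card, since they vary across NeMo versions:

    # Sketch, assuming `pip install "nemo_toolkit[asr]"` has succeeded
    # (nontrivial on its own, since NeMo pulls in heavy, hardware-dependent packages).
    from nemo.collections.asr.models import EncDecMultiTaskModel

    # Downloads and loads the .nemo checkpoint from Hugging Face
    canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
    transcript = canary.transcribe(["sample.wav"])  # argument names vary by NeMo version
    print(transcript)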
Omoju Miller [00:05:58]: All right, cool. So I go ahead and install that. But before I did that, I was like, maybe I can cheat and go to Spaces. Because if there's a Space that is up and running, that means somebody has actually figured all of this stuff out, and I could just take their code and copy it. And luckily there were a couple of Spaces up and running that were using this model. So I go in and clone that repository. Of course it's not going to work, because both are named the same exact thing and there's a clash, and you have to do this and you have to do that.
Omoju Miller [00:06:30]: Eventually I just was like, okay, I can keep on doing this, and I'll eventually get to a point where I can build something. But more than likely, even if I decide to build it locally on my machine and it works, if I then want to move it to my production environment, I knew it was probably going to fail, because there are so many different parts of that process that are not captured: all the configurations, all the requirements, all the prerequisites that must already be on your system for the thing to work. And so this is actually a huge problem. And clearly there are many companies deploying LLMs in production. So what are they doing? What it turns out they're doing, talking to a lot of people, is hacking together their own bespoke solutions, keeping their fingers crossed, and hoping the builds don't break. This is not sustainable. It definitely will not scale. What we need is a repeatable, reliable, collaborative build system.
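As an illustration of what capturing those configurations could mean, here is a minimal sketch that records the interpreter, platform, and pinned packages next to a build; the JSON layout and field names are illustrative, not any tool's actual format:

    # Sketch: snapshot the build environment so it can be reproduced elsewhere.
    # The field names here are illustrative only.
    import json, platform, subprocess, sys

    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines(),
    }

    with open("build_environment.json", "w") as f:
        json.dump(env, f, indent=2)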
Omoju Miller [00:07:31]: You basically need 21st-century developer tooling. We need all the things required to create these reproducible builds captured in an environment that is easy and accessible for everybody to use, and that shouldn't require you to be an expert. So what I propose we do is innovate on top of Git. We already have Git. It helps us with collaborative coding, and that has transformed the world, making open source what it is. And now we need to take it to the next level, because the requirements we have today are different. What we want is not just collaborative coding. What we want is collaborative building and shipping in an open environment.
Omoju Miller [00:08:19]: So a commit used to be something that was just about code. And at Fimio, we are creating an entirely new class of commit object that is a fully reproducible build. So every time you commit into our system, it is something that can work. And here is a rough demo. We're still working on our application; it's not fully done yet, but here's a rough demo, and hopefully this plays fast enough. You log in, we're logging in with GitHub, you go into the repo you've already decided on, you pick your entry-point file, and we commit it and build it for you.
Omoju Miller [00:09:03]: And when this finishes, if the build succeeds, what you end up with is a program that gives you a quick example of what the output is, and then you can share this with somebody else. For example, if you're working with a PM and you've just done some work, and you want to quickly see what the results are, you can do that. But more importantly, the part that is the most exciting for us is that at the end of all of this, that entire workflow you just saw is a commit, a new kind of commit that has four specific types of things, each with its own shard: the code itself, obviously; all the specifications you need for the environment and configs; all the inputs you used to test; and the artifacts that were generated. And the reason this kind of work matters, and is needed at a time like this, is because we are about to see the next big fight for open platforms. LLMs are amazing. They are truly transformative technology. They are like an oracle.
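To make the four-shard idea concrete, here is a hypothetical sketch of what such a commit object might carry; the names and hashing scheme are illustrative only, not Fimio's actual data model:

    # Hypothetical sketch of a build-level commit; fields and hashes are
    # illustrative, not Fimio's real schema.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class BuildCommit:
        code: str         # content hash of the source tree
        environment: str  # hash of the environment specs and configs
        inputs: str       # hash of the inputs used to test the build
        artifacts: str    # hash of the artifacts the build generated

    commit = BuildCommit(code="sha256:...", environment="sha256:...",
                         inputs="sha256:...", artifacts="sha256:...")
    print(commit)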
Omoju Miller [00:10:12]: And because they have been trained on the entire Internet, they are a source of our wisdom and a source of our history. Right now, because of the complexity of the build processes and of actually getting them up in production, most of our LLMs are behind closed doors. They are in silos. Big companies have enough manpower and budget to have dedicated teams: one team does the infrastructure, one team does the build, they have SREs. And so that knowledge stays behind the walled garden. We don't want that to be the case. We want an open Internet where all the knowledge we need is available, something almost like a Wikipedia for LLMs. But for that to even exist, you need open, collaborative builds.
Omoju Miller [00:11:06]: And that means you need a different kind of tooling to get us there. If any of the stuff I've said is exciting to you, and you want to get into this kind of work and use what we've built, because we are building it for you, and I'm looking for people right now to test it and make sure it's doing exactly what we believe it will do, please reach out to me at omoju@fimio.xyz. You can also reach me on X at @omojumiller. Thank you.