Why is Open Data Important for Open Source Models?
Andriy is the co-founder and CTO of Nomic AI, a venture-backed startup on a mission to democratize access to artificial intelligence. Prior to Nomic, Andriy was an early engineer at Rad AI, where he trained multi-billion parameter LLMs to assist radiologists, and a Ph.D. student at NYU's Courant Institute of Mathematical Sciences. He cares about making AI systems, and the data they are trained on, more accessible to everyone.
"Open" in the context of machine learning systems is a poorly defined and poorly understood term. Can you access and use the model weights without restrictions? Can you access the model training code? Can you access the model training data? This talk explores the various definitions of "open" in the context of machine learning models and highlights why reproducible training code and data are crucial to determining which machine learning model to select for your production use cases.
AI in Production
Slides: https://drive.google.com/file/d/13RncizF1aX85zFOSdZMScog_H0tQLHGQ/view?usp=drive_link
Adam Becker [00:00:05]: Next up we have... Andriy, let's see. Andriy, are you here?
Andriy Mulyar [00:00:09]: I'm here, yeah.
Adam Becker [00:00:10]: Okay. Good to have you, man. Today we're talking about, as I understand it, openness in general. Right? We talk about this a lot; every PM is going to talk about, like, open source models. What does that actually even mean? I think you're going to help organize our thinking a little bit around this.
Andriy Mulyar [00:00:29]: That's the hope.
Adam Becker [00:00:31]: Let's hope so. I'll be back in about ten minutes. Take it away.
Andriy Mulyar [00:00:36]: Cool. All right, so, hi, everyone. My name is Andriy. The goal of this talk is to discuss openness in machine learning, and specifically one aspect of openness that I think often gets overlooked. The argument I'm going to make in the next ten minutes is that it's probably the most important part. So again, the goal here is to understand open source in the context of machine learning. What does that actually mean? I think a lot of people have a very bad impression, or a bad definition, of what that is. And then to understand the importance of open data, which I'm going to argue is the crucial component of openness in machine learning. So the outline is: what makes an AI system open, open source machine learning today, and then the importance of open data.
Andriy Mulyar [00:01:19]: A little bit about me. I'm the co-founder and CTO of a company called Nomic. The general one-liner we give people is that we build explainable and accessible AI systems. We focus on allowing people to better access state-of-the-art systems and their capabilities, and we also focus on building tooling that makes those systems more accessible. You might know us through some popular projects we've been participating in. We drove the GPT4All project, which allowed people to access large language models in early 2023, and we still maintain it. We've also recently released a state-of-the-art, fully open text embedding model that is getting some adoption.
Andriy Mulyar [00:01:55]: Well, that was about two weeks ago. We're also based in New York, so if you're ever in the city, come to one of our events. So if I go into Google, and this is actually not the first thing I did when putting together this talk, and I say, what is an open source machine learning system, I get this IBM SEO article that says an open source AI system is something where the source code is freely available. When I first saw this, I thought, what the hell does this even mean? Anyone who's built an AI system knows that source code actually means nothing when it comes to the actual construction of an AI system, let alone how you're able to reproducibly release it. So what is an AI system's source code? If you're coming into this from the traditional software world, this might sound like a reasonable statement, but what really is the source code for an AI system you've built? It comes down to three components. And the reason I say the word system here is because it's really a system of several otherwise unrelated things that you put together to build the whole end-to-end AI system that you can call open. The first thing is the dataset.
Andriy Mulyar [00:02:56]: Nowadays, datasets are really large. This is, for example, an entire dump of the text on the Internet, something like The Pile; an entire dump of the images on the Internet, like LAION; or an entire dump of a cleaned version of the Internet, like C4. Then there's an algorithm associated with it: something you put the dataset through that outputs something called a model at the end. These are the otherwise unrelated components that all come together to form the end-to-end system. So what is an open source machine learning system? Well, you need to have transparency and openness in each of these components independently and in all of them put together, including all the lines between them: not just, for instance, the source that goes into the training logic, not just the source for the dataset processing, not just the source for doing model inference.
Andriy Mulyar [00:03:40]: These are all different components of a machine learning system, and they aren't what people usually mean by source code. So I think one of the biggest problems a lot of people have, and something we've been exploring through our community and through conversations with people, is that they don't really have a fundamental understanding of what it even means to release a machine learning system openly. That's one of the motivations for this talk; I want to shed a little bit of light on this. So there's the dataset component that goes in. What samples do I feed into my model? Can anyone access those same samples? Did you release those samples when you were building your machine learning system? Are those samples only valid at a point in time? For example, if you release a dataset of images, can I go in and query those same images? If you just gave me URLs, sometimes the image links die. The algorithm and the training code are the actual algorithmic improvements behind the system you're building. Are they openly disclosable? Are they even disclosable internally in your company? If you take a vanilla transformer architecture and make a few modifications to it, are you actually letting the individuals who interact with your system know that these are the changes you made to the training algorithm? And on the model artifact side, do you openly release the model? So in summary, what this really means is: your end-to-end pipeline takes in a training dataset, takes in an algorithm, and outputs model weights, with inference code at the end to execute the model.
Andriy Mulyar [00:05:08]: Maybe at the end to execute the model is that end to end system reproducible so that somebody can go in and grab it and iterate on that system themselves. This is what we mean when we say open source machine learning, the whole end to end thing. And what are those current practices right now? Right. So I sort of gave you sort of a rough definition that this is sort of a little bit non standard from what you might typically see in typical software open licensing, this sort of thing. Currently, a lot of organizations, they see the value in, for instance, openly releasing models. A lot of model hubs, for instance, like hugging face, they host these models. One of the things that people do really well right now is the opening of the model weights. So it's a norm right now that you will go in and you will release the model weights if you're inclining to openly release your model.
Andriy Mulyar [00:05:52]: One of the things with these model weights is that a lot of people go around saying, oh, this model is open source, this model is not open source. Open source is not really the right word; the right word is maybe open weights. Most of these models on release don't actually have a typical open source license. So, for instance, Stability now releases under a non-commercial license, which means you can't use it for anything other than research purposes. So you can't truly build on it if you're a company.
Andriy Mulyar [00:06:17]: Quen, which is a top open source LLM out of China, also has a custom license. Usually anytime you see a custom license, there's some sort of asterisk there that you need to really pay attention to. But when it comes to openness, this is an amazing start in machine learning. On the very same, usually open training code exists. So the models come with the ability for you to go in and take the model weights and maybe go in, rather take the model weights and apply it through like a fine tuning algorithm, and they'll usually reveal what this are. Because these procedures are pretty standard. It's a very sort of standard thing to be able to lower the learning rate and add some additional fine tuning data and be on your merry way with sort of a diff version of a foundation model that you're working with. These are kind of the current practices.
Andriy Mulyar [00:07:02]: What doesn't happen right now is that people don't often associate the datasets with the models they release. There are probably several reasons for that. One of the big reasons is that the data is probably the secret sauce, right? What makes GPT-4 better than Gemini? Everyone knows what a vanilla transformer is. Everyone has access to the best engineers to optimize CUDA kernels and train the models larger and faster. The big difference is all the hand-curated data they've gathered from providers like Scale, for example, to scale up the set of fine-tuning data points they're working with. Same thing with Stable Diffusion versus Midjourney. Why is Midjourney so much better? They've done a lot more data work. The other piece here is that datasets are really expensive to collect.
Andriy Mulyar [00:07:42]: So once you collect the data set, do you really want to open it up to the rest of the world? And it also maybe will put your company under legal consequence. So, for example, if data sets are copyrighted, your company might get sued. For example, the Wall street, the New York Times is suing OpenAI right now, or maybe one larger major news publisher is suing the New York Times right now for this very, very same point. So this is the reason you wouldn't be able to, you might be motivated not to release your open data. The problem with not releasing your data is that it's kind of the whole gist of what the machine learning model at the end is capable or not capable of doing. So there's kind of two arguments why I want to make here, why people should really be moving forward and releasing the data that their machine learning systems work on. The first one is really the generalization performance argument. We oftentimes evaluate our machine learning.
Andriy Mulyar [00:08:32]: We do put a little effort into train a machine learning system, and we evaluate it on a handful of benchmarks, maybe accountable number on your finger, number of benchmarks. And then we say, oh, this model is better than the other model. There's no way to actually truly know how your model is going to perform on outer distribution benchmarks without having knowledge of what the model has been trained on. For example, if I wanted to, for instance, use my llama model or mistral model, whose underlying pretraining data sets, we actually don't know what they are, even though the model themselves are open. And if I wanted to apply that model to, let's say, like a clinical task. So using clinical notes, having knowledge, if a model has seen clinical notes during pretraining, would guide me to pick one model over the other and obviously save me a lot of time and save me a lot of resources in any sort of downstream thing that I do. So this would be a very useful place to be able to understand where the underlying training data came from. And the other thing is, people build over top of these models.
Andriy Mulyar [00:09:24]: This is basically exactly what I just said. As people build over top of these models, knowing the data that goes into the pretraining is really important. The other thing is the capabilities and advancement argument, and this is the one that I most believe in. This is the reason why we at nomic focused on really doing open source properly by releasing training data and training code, baking your systems fully reproducible, I think, actually builds a really strong following and moat behind any sort of downstream use cases of your system they're using. So, for example, by increasing the stability, you actually increase the ability for a larger audience to make modifications to your machine learning model. And this in turn, improves model capabilities for the tasks that you care about that you've been working on. Just to give you like, a list of successes here, I think the two biggest successes, in my opinion, from the past year, so ever since Salama was, so the weights were openly released by meta in early 2023, you've seen a giant boom in sort of people being able to build their own sort of like, large language models and run them locally, run custom versions of them, run custom versions of those models. Now, I said the pretraining data of those models was not of a model like llama was not released.
Andriy Mulyar [00:10:35]: But I think the thing that was critical to actual, this boom that you saw in 2023 was the fact that individuals were releasing the curated fine tuning data sets publicly. For instance, Noose research publicly released their open Hermes data set, databricks publicly released their human preference data set called Dolly. And actually, we ourselves, when we released GPT for all, openly released that data set, and immediately you had dozens upon dozens of people iterating on new models. And that is what caused that sort of explosion in sort of open source model capabilities throughout 2023. This is, I think, sort of one of the most crucial points. The last thing here is also, I would say, on the embedding model side, a couple of weeks ago we released one of the sort of first fully open embedding models that was an open AI quality. So what this means is you can go in and grab all 240,000,000 trading data pairs and with a reproducible pipeline generated at OpenAI, add a quality or actually OpenAI small quality text embedding model all by yourself. And one of the things I wanted to note here is for anyone who might be thinking, hey, what does this actually get from my company? Why would I ever want to do this? We've actually been seeing just in the past two or three weeks, people moving off of other closed providers solely because of the model auditability reasons.
Andriy Mulyar [00:11:45]: So it's not just you're doing some sort of like saintly act for the world, there's real business reasons. You may want to actually go in and fully make sure that you have this data component in your open source system. And our actual inference API has increased in usage ever since we've done this, which is maybe quite surprising for many people. So the thing I want to leave everyone here with is you do about this. How do you navigate this world where you have all these open source models and you have all these closed source models and you're trying to figure out what do you iterate and pick on? Well, the thing that I would first say is, the thing you want to always come back to is you want to build machine learning systems at your company, at your organization, in industry, exactly the same way you would build them in academia. You want to make sure they're fully reproducible. And if you choose to release your models, I would say strongly consider also releasing the training data, because the impact that your model is going to have on open release is going to be lowers of magnitude greater if you actually have the full replicability that comes with the training data release. You could be an advocate at your company for this one way to advocate.
Andriy Mulyar [00:12:44]: For instance, maybe point to the things I just said in the previous slide. Open source is an advantage. It's not a concession your company has to make because of some reason. It'll build a moat around the product and around the model that you're putting out and drive usage and benefits for your company. And convincing your team of this is maybe not a trivial actual thing. It is not a trivial thing to realize without actually having evidence of it. And sort of the final thing I would say here is that building with the mindset of opening up your system allows you to improve your internal practices by making sure that your internal stack is fully reproducible, so that if you wanted to, you can click a button and, for instance, allow anyone to reproduce your system. It allows you to actually do better machine learning.
Andriy Mulyar [00:13:24]: It's a lot harder to fool yourself into thinking your model is more capable than it actually is. If, for instance, you want to be able to expose the model to the world and actually have that sort of level of auditability and people actually using it and testing it in situations that are maybe out of domain for the task that you worked on. So kind of in final here is that this is a sort of short ten minute argument for why open data is sort of one of the most important things you can do if you're really thinking about openly releasing your machine learning models.
Adam Becker [00:13:55]: Andriy, this was absolutely incredible. We're going to share the slides for sure, in case people can use them to champion opening up the models they're building in house. Thank you very much for that. There are a couple of things where you just put the fear of God in me. One of them is that it could just be the case, and I know some people have been speaking about this, that we've been overfitting: it is so difficult for us to get an accurate, reliable assessment of out-of-distribution performance that unless you release the data, we will never know. So I know there's a lot of work people are trying to do to take data and figure out whether a particular prediction has actually come from something that looks like it was in the training distribution.
Adam Becker [00:14:44]: But all of that just sounds like a much more difficult task to do than to just release some of that data. Though I understand all of the arguments against.
Andriy Mulyar [00:14:52]: Yep.
Adam Becker [00:14:54]: Awesome, Andre, thank you very much. And you guys are based in New York, which means that we should partner her up and do an event for Mlops NYC.
Andriy Mulyar [00:15:04]: Adam, we're already way ahead of you here. Planning one already with you guys.
Adam Becker [00:15:10]: Okay, sounds good. Oh, that's right. Yes. That's where I know the name from.