MLOps Community

Accelerating Multimodal AI

Posted Jun 21, 2024 | Views 274
# Multimodal AI
# ML Models
# Runway
Ethan Rosenthal
Member of Technical Staff @ Runway

Ethan works at Runway building systems for media generation. Ethan's work generally straddles the boundary between research and engineering without falling too hard on either side. Before Runway, Ethan spent 4 years at Square. There, he led a small team of AI Engineers training large language models for Conversational AI. Before Square, Ethan freelance consulted and worked at a couple of e-commerce startups. Ethan found his way into tech by way of a Physics PhD.


We’re still trying to figure out systems and processes for training and serving “regular” machine learning models, and now we have multimodal AI to contend with! These new systems present unique challenges across the spectrum, from data management to efficient inference. I’ll talk about the similarities, differences, and challenges that I’ve seen by moving from tabular machine learning, to large language models, to generative video systems. I’ll also talk about the setups and tools that I have seen work best for supporting and accelerating both the research and productionization process.



Ethan Rosenthal [00:00:00]: My name is Ethan Rosenthal, and I'm a member of technical staff at RunwayML, and I take my coffee black, no sugar, drip.

Demetrios [00:00:11]: Welcome back to the MLOps Community podcast. We are back. I am your host, Demetrios. For the next hour, we're going on a bit of a journey with Mister Ethan Rosenthal. He is a returning guest, and boy, oh boy, he does not disappoint. I'm going to give him an award today. It is the first of its kind. It is the most buzz word of 2024.

Demetrios [00:00:32]: And that is because he introduced what he is calling the multimodal feature store. He also talks about the differences, or should I say, the tension between engineers and researchers, and giving researchers the flexibility to go off and create and make magic, but also making sure that that magic can be productionized. He's come up with a few tips and tricks for us. I thoroughly enjoyed this conversation. I feel like he's got a great viewpoint of what ML engineers are going through, especially for his use case. Working at RunwayML is a fun one because they're doing so much cool stuff with generative AI and those foundational models that it puts him in a very unique position, and he's learned a ton and he articulates his learning very well. So, Sir Ethan Rosenthal gets the award today for the most buzzword of 2024, and we're going to get into this conversation. Oh, yeah.

Demetrios [00:01:44]: By the way, if you enjoy the podcast, you can engage with the app you're listening on in many different ways to let that recommender system know. Let's get into it. So you were at Square, or Block, and you were talking about working with LLMs back in those days, and this was very early in the LLM boom. I think maybe before ChatGPT?

Ethan Rosenthal [00:02:21]: This was right after the Stable Diffusion stuff got big, but before ChatGPT. And so, yeah, in hindsight, it seems like I had amazing foresight by working on language models, but I just happened to kind of, like, fall into that work at the time.

Demetrios [00:02:41]: And how did you fall into that?

Ethan Rosenthal [00:02:44]: I was working on a different team at Square and wasn't as happy on that team. Kind of wanted to do something different, wanted to do something a little more AI or something like that, do a little more deep learning, and there just happened to be a team opening. I talked with the team and ended up there. I was like, oh, I don't really know anything about NLP, but I used to do recommender systems, so I know what embeddings are. So this should be fine.

Demetrios [00:03:08]: Same thing.

Ethan Rosenthal [00:03:08]: Yeah. And then, like, a year and a half later, ChatGPT comes out. I was like, oh, wow, okay, now everybody's interested in this.

Demetrios [00:03:16]: Yep, yep. You're the OG. You're the thought leader in this space. You've been doing it since one year prior to that.

Ethan Rosenthal [00:03:23]: Therefore, I'm the OG.

Demetrios [00:03:26]: Yep. You're like, I was doing this during the AI winter, and that's how far back I go. But then you moved on. It's funny that you mentioned Stable Diffusion. You moved on to the creators of Stable Diffusion, to Runway, and now you're doing all kinds of fun stuff. Can you break down what you're doing these days?

Ethan Rosenthal [00:03:47]: Yeah. So in January, I left Square and moved to Runway. Yeah, Runway is one of the creators of Stable Diffusion. I think most people know of Stable Diffusion for generating images, but all the way back at the start of 2023, which is 20 years ago in AI years, Runway had one of the first generative video models. And so that's, I think, what Runway is most known for, and one of our hottest products is around generating video. So you can write some text and generate a video from that. You can take an image and then animate that image, turning it into video.

Ethan Rosenthal [00:04:32]: And then we have a bunch of other tools to do interesting things, interesting, creative things with video.

Demetrios [00:04:39]: Yeah. And for those who haven't seen the demos, those would always be so much fun for me because you could see how much thought the product people at Runway were putting into the user experience. And it's one of those products that, for me, it feels like, wow, okay, this is actually a good use of AI. This is something that is quite helpful for a video editor, because they can now type into a prompt box or the context box and say, okay, this video, but without such hard colors or without the street lamps on the street when people are walking down, and it will automatically remove those. Sometimes, I think, better than other times, but still, the idea was there, and so that captured the imagination of a lot of people.

Ethan Rosenthal [00:05:33]: Yeah, I think it's kind of hard to thread the needle between AI hype and actual value and stuff like that. So I think one of the things I liked when I was looking for new places to work was that it seems like Runway's tools are actually valuable to all sorts of content creators and people who just generally work in the creative industry. And I think it helps that the company's founded by creatives. And, as you mentioned, the UX is a part that I really like. It's fine to write prompts all day, but I really hope we're all not prompting ten years from now. I think it's more interesting when we can use the full capabilities of our computers to do things.

Ethan Rosenthal [00:06:20]: One of the older tools that Runway built was this green screen tool. If I want to crop a person out of every frame of a video back in the day, you'd have to go in by hand and draw a lasso around them and crop them out of every frame. But instead, we have this nice tool where you can draw a mask or paint a brush over the areas that you would like to be cropped out. And then with the, quote, magic of AI, you can now just do this on the whole video. It disappears. There's no prompting involved. It's just kind of using your creative tools like you would expect to.

Demetrios [00:06:55]: Yeah, so, wonderful. So, basically, you were like, all right, I spent a lot of time at a company that jumped on the crypto hype, changed its name to Block, and now I need to find something that is not chasing that hype. And I think Runway has done a great job of that. I will say they are doing some cool stuff. And now your day to day, though: you went from working with LLMs to, are you still working with LLMs? Is it diffusion models? What does your day to day look like these days?

Ethan Rosenthal [00:07:26]: It's kind of a little bit of everything. And to be clear, I do some modeling work, but I also like, if I spent all day staring at loss curves, I think I'd be a little sad. On the flip side, if I'm writing code and just waiting for unit tests to run all day, then I also get a little sad. And so I try to sit a little bit in between. And so, yeah, we definitely have diffusion models, language models, all sorts of stuff like that here. But I've been spending a lot of time on kind of creating nice training datasets and organizing our training data and things like that in order to help the researchers help to scale what they're doing, train on larger datasets, make sure that we have interesting diversity in our data sets, and incorporate new and interesting inputs into these models.

Demetrios [00:08:25]: Well, I know how cumbersome that can be. Hopefully, those researchers appreciate you.

Ethan Rosenthal [00:08:30]: Yes, they're very appreciative. Very nice.

Demetrios [00:08:34]: Excellent. And the last piece that I will say about Runway, which is also an incredible feat: it all runs in the browser. So it's not like an app that you download to get this AI magic. It's all happening in the browser. We had Brennan on here, who is your colleague. He talked to us a lot about how you all are using Kubernetes in the infrastructure of Runway. And so if anyone wants to listen to that, I highly encourage it.

Demetrios [00:09:06]: It was a fascinating talk. But now I want to talk with you about a little bit of a buzzword bingo that you were mentioning before we hit record. And it's all about what you're calling a multimodal feature store. Can you break down that idea for me?

Ethan Rosenthal [00:09:22]: Yeah, before everybody rolls their eyes, because I know that that's a lot of cheesy buzzwords, I guess I should say what that means. So let's start with multimodal. Multimodal, honestly, is kind of top of mind right now. You look at GPT-4o, which came out, what, last week, two weeks ago, and we see that all sorts of different modes are now being incorporated. And so what I mean by mode is, we have images, we have videos, we have audio, we have text. And I think historically these have all been treated somewhat separately. So when I was back at Square, I worked on language models, and everything we worked on was just with text. We might incorporate tabular information in the language model, but at the end of the day, in order to feed that model, you convert all of your information into text and then train a model off of that.

Ethan Rosenthal [00:10:20]: But nowadays we have all sorts of inputs into our models. So you can send an image to ChatGPT and ask it about it. We can send video to Google's systems. You can talk to GPT-4o. So on the input side, we have lots of different modalities that we want to feed into the model, but then also on the output side, we have different modalities that we want to generate nowadays. With text to speech, we are generating speech. We can generate text, but then all the way to Stable Diffusion and things like that, we can generate images and generate videos. And I feel like if you had asked me six months ago whether multimodal AI is important, and things like that, I definitely would have rolled my eyes. I have now joined a company that does this.

Ethan Rosenthal [00:11:13]: So I've probably spent the last couple of months drinking the Kool Aid, but that's okay. Kool Aid's tasty and it does seem like this is where everything is going. And I think that people do see that there are benefits to learning from and producing different modalities. So you start to get some kind of cross modality learning and things like that. But anyway, so that's multimodal, right? But then the other side is a feature store. So I think actually, last time I was on this podcast, we talked about feature stores, probably. Yeah.

Demetrios [00:11:51]: You had a blog post about feature stores, right?

Ethan Rosenthal [00:11:54]: I did. Way back in the day, I had a couple. One was about really liking them. We had a great one at Square, a feature store for tabular data. And I was like, this is fantastic. But then I also, I think, had a blog post saying that I thought feature store companies were a little silly and that we should probably just replace them with good streaming databases. So it's a...

Demetrios [00:12:19]: Then I brought on somebody from a feature store company.

Ethan Rosenthal [00:12:21]: You did.

Demetrios [00:12:21]: You guys should talk.

Ethan Rosenthal [00:12:22]: It was a fun co host for you to have for that conversation.

Demetrios [00:12:28]: I'm surprised that you agreed to come back on here. You probably thought I was going to ambush you with Stability AI founder Emad. And you're like, dude, come on.

Ethan Rosenthal [00:12:38]: Surprise.

Demetrios [00:12:39]: Are you kidding me?

Ethan Rosenthal [00:12:40]: No, you'll be nice to me this time, and then I'll come back on a third time and then you'll get me. So, yeah, I think for most people, feature store is maybe still a bit of a buzzword, but I think people are somewhat familiar with them. Nowadays for tabular learning and things like that, we have classification and regression models where we want to compute a whole bunch of features about our data and use these for both training and inference in a performant manner. And so I think this is hard enough for tabular data. Back when I was at Square, you might have a feature that's like, what percentage of payments have been blocked from this merchant within the last 60 seconds, and we want to use that information to power a fraud model. So if this merchant tends to have a lot of payments that have been blocked recently, then perhaps they're riskier than others. Aggregating these events, turning them into features, and serving them for inference: that's already difficult. Now we get to the world of multimodal AI, and we have lots of different modes that we have to deal with to incorporate into our feature store.
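
The blocked-payments feature Ethan describes can be sketched as a trailing-window aggregation. This is a minimal, single-merchant illustration in plain Python, not how Square's feature store worked; the class name `BlockedPaymentRatio` and its methods are hypothetical.

```python
from collections import deque


class BlockedPaymentRatio:
    """Fraction of a merchant's payments blocked within a trailing time window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, was_blocked), oldest first
        self._blocked = 0       # running count of blocked events in the window

    def _evict(self, now: float) -> None:
        # Drop events that are window_seconds old or older.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, was_blocked = self._events.popleft()
            self._blocked -= int(was_blocked)

    def record(self, timestamp: float, was_blocked: bool) -> None:
        # Assumes events arrive in timestamp order.
        self._evict(timestamp)
        self._events.append((timestamp, was_blocked))
        self._blocked += int(was_blocked)

    def value(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        return self._blocked / len(self._events)


feature = BlockedPaymentRatio(window_seconds=60.0)
feature.record(0.0, was_blocked=True)
feature.record(10.0, was_blocked=False)
feature.record(50.0, was_blocked=True)
ratio_at_55 = feature.value(now=55.0)  # all three events are still in the window
ratio_at_65 = feature.value(now=65.0)  # the event at t=0 has expired
```

The same value has to be reproducible offline for training, which is where the training/serving parity discussed later comes in.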

Ethan Rosenthal [00:13:52]: And so for this, I think that inference is perhaps a bit less important. We're not dealing with very fast streams of events that we have to quickly aggregate, as we might be for a fraud application or for DDoS attack applications and things like that. But instead, let's say we're training a model to generate video. We might have a bunch of videos, perhaps unsurprisingly, in our data set. Now, our feature store needs to contain videos. And videos are really big, and they're slow to decode, they're slow to download and things like that. And so we might have videos, but then we might also have other information about these videos that is relevant for training. So as an example, we might want to know what is the width and the height of the video or what is the resolution of the video, because that might be informative for our model.

Ethan Rosenthal [00:14:51]: And so you have this high variance between the different columns of your data set that you're training on. So one column might contain videos and another column might just contain how many pixels are in the video. And storing all of this data and being able to query this data and then being able to filter it, and then in particular, being able to use this data during very large scale training jobs is totally non trivial. And it's kind of like an unsolved problem right now, I would say.
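
To make the "one column is a video, another is just a number" point concrete, here is a small sketch in plain Python. The `VideoRow` schema and the example URIs are hypothetical; the idea is only that filtering on cheap scalar columns should not require touching the large video blobs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRow:
    video_uri: str   # pointer to the large, slow-to-decode blob (hypothetical path)
    width: int
    height: int
    num_frames: int

    @property
    def pixels_per_frame(self) -> int:
        return self.width * self.height


def select_min_resolution(rows, min_width, min_height):
    """Filter on scalar metadata only; no video is downloaded or decoded."""
    return [r for r in rows if r.width >= min_width and r.height >= min_height]


rows = [
    VideoRow("s3://example-bucket/a.mp4", 1920, 1080, 240),
    VideoRow("s3://example-bucket/b.mp4", 640, 360, 120),
    VideoRow("s3://example-bucket/c.mp4", 1280, 720, 300),
]
hd_rows = select_min_resolution(rows, min_width=1280, min_height=720)
hd_uris = [r.video_uri for r in hd_rows]
```

A real system would likely keep the metadata in a columnar format and the videos in object storage, but that split is an assumption here, not something stated in the conversation.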

Demetrios [00:15:32]: So when you're talking about these multimodal feature stores, it's primarily for the training. And when we think about feature stores in their past iteration or what I imagine a lot of people understand them for these days is for serving, and specifically like low latency online serving. Right. When you need that fraud detection or you have a recommender system or whatever it may be, you want to get something out real quick. That's when you're using feature stores. But what you're saying is, okay, well, for a multimodal feature store, this is for training, because we have so much unstructured data and structured data that can go together, and we need to be able to understand both of those.

Ethan Rosenthal [00:16:21]: Yeah, yeah. I think that for the former case, the fast inference is the hard part of feature stores for tabular data and things like that. But you also need to have the training component as well for, let's say, unimodal or regular feature stores. The reason is that you don't want this skew to occur. You want to make sure that the data that your model was trained off of matches the data that is being fed into the model during inference as best as possible. Ideally, and most feature store companies do this, they try to maintain this parity between the training features and the inference features. The training features are not so hard to store.

Ethan Rosenthal [00:17:05]: We have formats like Parquet and things like that nowadays where we can store these large datasets. But then the hard part is making sure that inference matches that, although I guess they also have to backfill data when you add new features and other sorts of things that are fairly complicated. And the datasets tend to be quite large for the training datasets. But you're right that for what I'm talking about now, this is primarily focused on the training side of things. And that's also because most of the inputs to these generative AI models are provided during inference, and there's not that many inputs. If you think about what you are inputting into ChatGPT: maybe some text, maybe components of a system prompt, and maybe an image or something like that. But it's not like we're dealing with thousands of inputs to the model during inference. And so yeah, the inference side is much easier, but then the training side becomes much harder for these AI feature stores.

Demetrios [00:18:09]: I really like that difference that I hadn't thought about before, where when you are dealing with traditional, quote unquote traditional ML, you have all those different inputs and you have a lot of them, and then you also have that low latency that you need to be hitting when it comes to LLMs or just these foundational models that we're dealing with now. You've got the prompt, and that's kind of it. There's a lot of iterations on how you could do the prompt and what you do. Maybe it's with voice, and maybe you're giving different chain of thought, chain of reason, whatever you want, but it's still kind of just a prompt, right? And it's not like you're giving all those data points into the model and saying, okay, this is it. Is it fraud or not? And so that's why I'm a big proponent of saying, yeah, okay, you have your LLM foundational model use cases, and those are totally different than your traditional use cases that you would get from classical ML. And so it does, it does feel like that. I still think the, I think this is fascinating when it comes to the multimodal feature store. Then can you just break down the benefits that you would see from that training perspective? If you have that, what are the benefits?

Ethan Rosenthal [00:19:32]: Yeah. So I think one piece is searchability, right? So I think with tabular data, lots of people have Snowflake or data warehouses like that. And if all of your data lives in a warehouse, then you can write some SQL queries and you can calculate some aggregations of your data. You can answer analytics questions, you can segment your data and things like that. I think that when you're dealing with this unstructured multimodal data, it becomes a little bit harder to do that. For example, say I want to search over a data set of images for some semantic quantity: find all images with cats or something like that. We can do that nowadays, but we have to first calculate embeddings, vector embeddings, for all of the images, and then maybe we run some sort of nearest neighbor search over this.

Ethan Rosenthal [00:20:25]: But that's much more complicated, I would say, than just running a SQL query in your warehouse. That ends up becoming a component where maybe you want to have vector search that sits on top of your feature store, where your training data lives. And you need that to be performant. Because maybe I want to fine tune my data or maybe I want to find certain subsets. Maybe I want to ensure there is sufficient semantic diversity in the data that I am training off of. And so as a result, if I break that down, how do I run vector search? I might need to first calculate a bunch of embeddings related to the data that I have. And you can think of that as doing feature engineering. We're taking our data and then we're calculating a feature from it. But in this case, the feature that I'm generating is an embedding.
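
The embed-then-search flow described here can be sketched with a brute-force nearest neighbor search. This is a toy illustration in plain Python: the three-dimensional vectors are made-up stand-ins for real image and text embeddings, and a production system would use an approximate index rather than scoring every row.

```python
import math


def cosine_similarity(a, b):
    # Assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=2):
    """Brute-force search: score every stored vector and return the top-k ids."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy 3-d "embeddings"; a real pipeline would compute these with an image encoder.
image_embeddings = {
    "cat_on_sofa.jpg": [0.9, 0.1, 0.0],
    "kitten.jpg": [0.8, 0.2, 0.1],
    "city_street.jpg": [0.0, 0.1, 0.9],
}
cat_query = [1.0, 0.0, 0.0]  # stand-in for the embedding of the text "cat"
top_matches = nearest_neighbors(cat_query, image_embeddings, k=2)
```

The expensive part in practice is the step this sketch skips: computing the embeddings for every image in a large dataset.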

Ethan Rosenthal [00:21:21]: So it necessarily takes a little while to do this, especially on a big dataset. So the computation piece is a bit more gnarly than running some kind of streaming aggregation query with a regular feature store. This also means that my feature store needs to support storing things like vectors, and then I also need to be able to run vector search over it. That's an example from the querying and curation side. There's also basically the analog of just straightforward feature engineering. As an example, I think at the start of 2023, Runway had a paper about our model Gen-1, which is a video to video model. So you take a video and then you can write some text, or maybe take an image, and convert the video into a style that matches the image that you've provided, or convert the video based off of the text that you've provided. And so in that paper, I think text can be converted into an embedding with a CLIP model, I believe is what we used.

Ethan Rosenthal [00:22:35]: And then the video to video model, under the hood, uses this model called MiDaS, which does depth estimation. So, based on an image or based on a video, you can estimate the depth: how far away is every pixel from the camera? And so this starts to get into the fact that a lot of these image and video generation models are not just one model. I think we often think of GPT-3 and GPT-4 and things like that as gigantic single models. But instead, in this example, this video to video model consists of several pieces. There's a diffusion model in there. There's the CLIP model for converting both images and text into embeddings. There is a depth estimation model. And so now if I want to train this model, there's two ways that I can do it.

Ethan Rosenthal [00:23:28]: I can take all those different models. I can take my diffusion model that I want to train, I can take my CLIP model, I can take my depth estimation model, and I can calculate these features on the fly. So, as I'm training my diffusion model, I get a video, and on the fly, I calculate the depth of every pixel in that video. But alternatively, I could precalculate all of this, and I could store this information in my feature store. So I could take all of the text that I might have, and I could precalculate CLIP embeddings and store that in my feature store. And it's the same idea as doing feature engineering on the fly. Maybe as a simple example, we often normalize features before they get input into our model.

Ethan Rosenthal [00:24:22]: And so that is taking a feature and then transforming it in some way. And so I could do that where I take text and on the fly, I convert it into an embedding. Or I can precalculate this and store this in my feature store. But then you can imagine, maybe I want to use a different model other than CLIP. Maybe I want to use some other model that converts text into embeddings. And now I can start to store all of this information in my feature store and give all of these different features to the researchers, where they can start to choose what they want to use when training models, and they can start to run experiments based off of all of these different precomputed features.
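
The precompute-and-store option can be sketched as follows. Everything here is illustrative: `toy_encoder` is a deterministic stand-in built from a hash, not CLIP or any real model, and `PrecomputedFeatures` is a hypothetical in-memory substitute for a feature store. The point is only that each encoder's output becomes its own named column that an experiment can select.

```python
import hashlib


def toy_encoder(name):
    """Return a deterministic stand-in for a text encoder (not a real model)."""
    def encode(text):
        digest = hashlib.sha256(f"{name}:{text}".encode("utf-8")).digest()
        return [b / 255.0 for b in digest[:4]]  # 4 floats in [0, 1]
    return encode


class PrecomputedFeatures:
    """Maps (sample_id, feature_name) to a value that is computed once and reused."""

    def __init__(self):
        self._store = {}

    def materialize(self, feature_name, fn, samples):
        # Run the (expensive) transform once per sample and keep the result.
        for sample_id, raw in samples.items():
            self._store[(sample_id, feature_name)] = fn(raw)

    def get(self, sample_id, feature_name):
        return self._store[(sample_id, feature_name)]


captions = {"clip_001": "a dog running on a beach", "clip_002": "city at night"}
features = PrecomputedFeatures()
features.materialize("text_emb/encoder_a", toy_encoder("encoder_a"), captions)
features.materialize("text_emb/encoder_b", toy_encoder("encoder_b"), captions)

# An experiment picks which precomputed column to train on.
chosen = "text_emb/encoder_b"
training_inputs = {sid: features.get(sid, chosen) for sid in captions}
```

Swapping `chosen` is then the whole cost of trying a different text encoder in an experiment, since no encoder has to run inside the training loop.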

Demetrios [00:25:08]: Oh, so you're giving them new datasets.

Ethan Rosenthal [00:25:12]: In a way, or new features built on top of the datasets.

Demetrios [00:25:17]: Yeah. Okay. And again, this has nothing to do, really, with speed. The precomputing doesn't. It's going to be faster, but that's not really why you're doing it.

Ethan Rosenthal [00:25:32]: No. Instead it's more that we, we can now have a canonical place where all of this information lives, and the same way that a data warehouse becomes the canonical place that everybody basically solves their analytics needs. But we can now have a canonical place where we have large training datasets with all sorts of features and information against them that can then be used by downstream researchers for running all sorts of experiments.

Demetrios [00:26:06]: And do you think that this is going to be solved by traditional feature store companies? Or is this something that feels like it's going to be a new database that comes out of nowhere and says, yeah, we do vectors and we also do a little bit of... because I've seen people doing semi-structured data, or, I feel like I've seen some people. My friend Cody Coleman, I don't know if you know him, but he is doing Coactive. And one of their value props is, hey, let's be able to search video just as easily as we can search a data warehouse. And so I wonder if, because of the opinions that the traditional feature stores have taken...

Ethan Rosenthal [00:26:55]: My guess is that there's not a large amount of overlap between the implementations of traditional feature stores and these more unstructured multimodal feature stores. There is some, but as we talked about, I think a big part of the traditional feature stores is around the inference side of things, and that's the expensive side of things, too. And a lot of that is not here, and it's not relevant for these kind of multimodal feature stores.

Demetrios [00:27:32]: And the last point that I want to make on this, before we move on to something else, is talking about large scale distributed training and how this, in your mind, can help.

Ethan Rosenthal [00:27:44]: Yeah, so I think when you run jobs to train AI models nowadays, especially if you're dealing with multimodal AI and things like that, you are going to be running your job on many machines. Each machine often has many GPUs, and among the different GPUs, you might be running your job with multiple data loader workers. So you basically are massively parallelizing requests to your training dataset. When the dataset that you have is small, it's simple. Let's say I'm training a model on two different machines. If your dataset is small, you can just download the whole dataset on both machines. Maybe you train on half the dataset on one machine and half the dataset on the other machine, and that's it. Things are easy.

Ethan Rosenthal [00:28:45]: But if your dataset is... To be fair, this is what I did when I was at Square. We were dealing with text data, and we had a lot of data, but in the grand scheme of things, it was relatively small. We could just download the entire dataset to disk and then we could train a model off of that dataset. But when you have videos and images and embeddings, these get quite large quite fast, and it is totally intractable to download that entire dataset on every single machine that you are training your model off of. Instead, you need to be able to download portions of your dataset. You want to really only download the portions that are relevant to training your model. I think we talked about this before, where maybe I want to have different embedding models for my dataset. I don't necessarily want to download every single embedding that I've calculated.

Ethan Rosenthal [00:29:46]: I probably only want to download the ones that I care about and use those as input into my model. One way to solve this problem is you have your feature store that we've already talked about, and then you have a totally separate system that creates training datasets for you. Maybe you query the feature store, you generate your training dataset, and then you have your models train off of that training dataset. This is kind of like the classical machine learning world. Maybe I have to query a database, but at the end of the day, I create my design matrix. In scikit-learn parlance, I create my X and my y, my X matrix and my y targets that I want to train my model off of. And so maybe it's one service's job to create those training datasets, and then your models train off of this. But again, if the data that you're dealing with is really, really big, it could be wasteful to have to first create training datasets and then train your model off of these training datasets.

Ethan Rosenthal [00:30:50]: And so in an ideal world, your feature store can solve for searching and querying and storing all of your data, but it also solves for performantly pulling down batches of that data while you are doing a large scale training job. What you need there is to be able to query your feature store and only return the specific rows and columns that you care about for training your model. You need to be able to do this in a distributed fashion, where certain portions of my data are sent to one machine and certain portions are sent to the other machine. I think in an ideal world, you can solve for both large scale training and all of the other requirements of the feature store all in one.
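
The "only the rows and columns I care about, split across machines" requirement can be sketched as a generator. This is an illustration under simplifying assumptions: the table is an in-memory list of dicts, rows are assigned to workers round-robin by index, and the function name `stream_batches` is hypothetical. A real store would push the column projection and sharding down to storage instead of iterating over every row.

```python
def stream_batches(rows, columns, rank, world_size, batch_size):
    """Yield batches for one worker: its shard of rows, only the requested columns."""
    batch = []
    for index, row in enumerate(rows):
        if index % world_size != rank:
            continue  # this row belongs to another worker
        batch.append({name: row[name] for name in columns})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


# Toy table: two embedding columns exist, but this run only needs one of them.
table = [
    {"id": i, "video_uri": f"videos/{i}.mp4", "clip_emb": [i, i + 1], "other_emb": [0, 0]}
    for i in range(10)
]
worker0 = list(stream_batches(table, ["id", "clip_emb"], rank=0, world_size=2, batch_size=2))
worker1 = list(stream_batches(table, ["id", "clip_emb"], rank=1, world_size=2, batch_size=2))
```

Each worker sees a disjoint half of the rows and never materializes `video_uri` or `other_emb`, which mirrors the idea of skipping embeddings an experiment does not use.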

Demetrios [00:31:40]: Okay, so the whole value prop there is that. Since these datasets are gigantic and I only want a sliver of it, and I don't even, that sliver is going to be too big for me to train on my one or two machines. I just want the data that is relevant to me when the training is at a certain point. Once it gets past that point, I want to pull down new data that is relevant to me and et cetera, et cetera. That keeps going until the training's done.

Ethan Rosenthal [00:32:11]: Exactly. Yeah. When doing deep learning, like this, stochastic gradient descent, everything operates on batches of data. And so you want to be able to kind of stream batches from the feature store and only kind of process these batches that are relevant for you and ignore everything else.

Demetrios [00:32:33]: That makes a lot of sense. Okay, dude. Wow.

Demetrios [00:32:38]: All right, real fast, I want to tell you about the sponsor of today's episode, AWS Trainium and Inferentia. Are you stuck in the performance cost trade off when training and deploying your generative AI applications? Well, good news is, you're not alone. Look no further than AWS's Trainium and Inferentia, the optimal infrastructure for generative AI. AWS Trainium and Inferentia provide high performance compute infrastructure for large scale training and inference with LLMs and diffusion models. And here's the kicker. You can save up to 50% on training costs and up to 40% on inference costs. That is 50 with a five and a zero.

Ethan Rosenthal [00:33:24]: Whoa.

Demetrios [00:33:25]: That's kind of a big number. Get started today by using your existing code and frameworks, such as PyTorch and TensorFlow. That is AWS Trainium and Inferentia. Check it out. And now let's get back into the show.

Demetrios [00:33:40]: So, feature store. This is cool. I think we could get sued calling it a feature store just because so many people out there have one idea of what it is. We got to think of a different name. But I do like this idea, and I do think that it can make life a lot easier for the researchers and perfect segue into, like, dealing with researchers and making sure that when you're working with them, you're able to both let them run wild, but also get something for production. And have you found any tricks on how to do that? How to make sure that researchers are actually producing for production, or at least once it gets past a certain phase, they understand what the productionizing means.

Ethan Rosenthal [00:34:40]: Yeah, I mean, I don't have any silver bullets, but I have lots of thoughts. A lot of my experience is drawn from my team at Square, but I've seen similar things here at Runway. I feel like there are a lot of thoughts out there around how to solve this and how to drive more value out of researchers, data scientists, machine learning engineers, and things like that. And I think oftentimes, and probably these people have been on this podcast, there are lots of tools that are sold to try to do this as well. I have used some of these tools, and I've tried to use many of the tools, and I think that one thing I often see is a wall that people throw their models over. So there are tools that are really designed to help researchers iterate quickly on what they are doing and give them maximum flexibility.

Ethan Rosenthal [00:35:40]: But then in order for that researcher to deploy their model and things like that, they might need to, I don't know, package their model up into a script or something like that. With a lot of these tools, I would say, oftentimes the deliverable of the researcher to the real engineers, the team that's involved with productionizing, is that the researcher needs to serialize their model to some artifact, and then maybe, if they're lucky, they can hand off a script, like a Python file, to another team.

Demetrios [00:36:16]: Right?

Ethan Rosenthal [00:36:17]: Exactly. Yeah. They email it. Sometimes they put it on a zip disk and send it over. And the problem that I have with this is that this often ends up meaning that the researcher kind of has to reinvent their model from scratch every single time. So all of their models are just one big, long script that they might hand off to somebody else. And all they know about is the script; the details of which libraries are run during inference, information about the Docker container, and everything else like that is kind of abstracted away from the researcher. But I think that having to hand off a script prevents people from being able to reuse code.

Ethan Rosenthal [00:37:08]: And so when I was at Square, we had a single code base that did both training and inference. And so we had one codebase with a lot of shared code, and a lot of libraries, like internal libraries, that had been built up over time, and we were able to reuse code that we had written during both training and inference. And I think that if you talk to a lot of conventional software engineers, this idea of having a single code base that does training and inference feels maybe a little bit bad. Like, we should have code that is for training, and then we should have code that is for inference. And everything in between needs to be serialized or something like that.

Demetrios [00:37:56]: Yeah, it's like staging and dev.

Ethan Rosenthal [00:37:58]: Exactly.

Demetrios [00:37:59]: Staging and prod. Yeah.

Ethan Rosenthal [00:38:01]: But I think the problem with this is that if you don't get to reuse code, you don't get to enjoy all the benefits of, like, conventional software engineering. And so by having a single code base to do all of this in, you get to reuse your code. You get to keep things nice and DRY: do not repeat yourself. You get to write tests. If I'm just handing a script off to somebody and they're going to run that in production, then I'm not going to bother to write tests. But if I have a shared code base, I've already built out CI/CD on that codebase. I can write tests and things like that.

Ethan Rosenthal [00:38:38]: And then as well, the trade-off of this is that by having this singular codebase that other people are working in, you now need to write code that doesn't break other people's code. And so if you want to kind of modify something in the training pipeline, you don't have the full flexibility to do that. Instead, you need to make sure that your modifications don't break other jobs that are using that code for training. And this can slow people down. But I think that the benefits of working in a shared code base end up outweighing the cons or the potential slowdowns of limiting some flexibility. I know that this was a long, ranty response, so maybe I'll pause there.
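
As a rough illustration of the shared-codebase argument, the sketch below shows one preprocessing function imported by both a training path and an inference path, plus the kind of small test that becomes natural to write once the code lives in one place. All of the names (`normalize_caption`, `build_vocab`, `encode`) are invented for this example and the "model" is just a vocabulary lookup; it is not code from Square or Runway.

```python
from typing import Dict, List


def normalize_caption(text: str) -> str:
    """Shared preprocessing: the single definition that both paths use."""
    return " ".join(text.lower().split())


def build_vocab(captions: List[str]) -> Dict[str, int]:
    """Training stand-in: assign an id to every token seen in the data."""
    vocab: Dict[str, int] = {"<unk>": 0}
    for caption in captions:
        for token in normalize_caption(caption).split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(caption: str, vocab: Dict[str, int]) -> List[int]:
    """Inference stand-in: reuses normalize_caption, so inputs match training."""
    return [vocab.get(token, 0) for token in normalize_caption(caption).split()]


def test_encode_matches_training_preprocessing() -> None:
    vocab = build_vocab(["A cat  runs", "a DOG runs"])
    # Different casing and spacing at inference time still hit known tokens.
    assert encode("  A   Dog RUNS ", vocab) == [vocab["a"], vocab["dog"], vocab["runs"]]
    # Tokens never seen in training fall back to the unknown id.
    assert encode("a bird", vocab) == [vocab["a"], 0]


test_encode_matches_training_preprocessing()
```

If the inference side were a separate hand-written script, a change to `normalize_caption` during training could silently diverge from what runs in production. With one definition and one test, CI catches that kind of drift.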

Demetrios [00:39:27]: Well, I guess the other side of it is that people probably think, okay, but my researchers, for better or worse, they probably aren't thinking about how to optimize the code or how to create, quote unquote, production code. They're just thinking about how to get whatever it is in their head out and try and make it so that model works. And maybe there's a certain metric that they're trying to get as high as possible, and then it's somebody else's job to figure all the other stuff out. And when you're asking a researcher to now figure out, like, all right, make sure that you're conscientious of not breaking things when you're working on stuff, maybe you gotta go and figure out these unit tests or integration tests, and you also gotta start using git and start doing things that you haven't been doing for your whole career or that you haven't liked to do. And now I really think you put it perfectly. It potentially can slow people down, but the benefits outweigh the cons there.

Ethan Rosenthal [00:40:49]: Yeah, I think they do. And there's ways to help people with this, too. Right. I think I have a lot of empathy for the researchers. I went to grad school. I wrote really, really bad code when I was in grad school. I wrote all MATLAB code.

Ethan Rosenthal [00:41:06]: It's hideous, just like any old code that you've written should be. If your old code doesn't look hideous, then it means that you haven't learned anything since you wrote it. But I eventually learned over time various, I don't know, for lack of a better term, best practices that software engineers use. And not all of these best practices are applicable to iterative research environments, but I think some of them are. And I think one of the hard things for me, and probably one of the hard things for researchers who are in a certain environment, is if you don't have any examples to build off of. So if you tell somebody, hey, here's an empty GitHub repo, please put all your modeling code in there, write unit tests, set up CI/CD, build a Docker container, and all of these other things, then they're going to have a hard time.

Demetrios [00:42:02]: It's like staring at a blank page: go and write a book.

Ethan Rosenthal [00:42:08]: Exactly. Then they're going to reinvent the same book or something.

Demetrios [00:42:13]: Yeah, but like, yeah, it's just a lot more overhead to think about it all the way through than if you're editing something that's already on the page. 100%. Sorry I cut you off.

Ethan Rosenthal [00:42:23]: Yeah, no, I think it's a great analogy. Like, it's much easier to edit, it's much easier to refactor than it is to write something from scratch. And so if you have a code base that already exists, that is already running tests and already has examples of "this is how you write a test for this type of thing," then it's a lot easier to just add your own test on top than it is to build out an entire testing suite from scratch. And so there's still the factor that some people just don't want to do this. I've worked with people where they're like, this shouldn't be my job, and I generally don't like working with those types of people. I don't know. I think it's good to kind of work at the boundaries of what you think your job should be, because it smooths the interface between people on different teams. And so, you know, if you're a researcher and you have a bit of interest in the production side of things, that little bit of interest can go a really long way.

Ethan Rosenthal [00:43:24]: Because if you kind of learn how to talk like an engineer, then you can communicate with them better and you can get stuff shipped faster. And the converse is true as well, that engineers who understand a little bit of the modeling side of things, they're going to be able to work a lot better and understand requirements a lot better from the researchers. I think that code ends up being the best way for everybody to communicate with each other about this stuff.

Demetrios [00:43:52]: Aside from this monorepo idea, are there other infrastructure pieces that you've seen that have helped a ton when it comes to researchers and engineers playing nice together?

Ethan Rosenthal [00:44:06]: Yeah, I think on the training side of things, there's lots of tools that will claim to kind of train your model very nicely in the cloud or serve your model very nicely on GPUs in the cloud and things like that. I think that again, some of these tools end up running into the issue where you need to package up your code. Either you need to use some weird domain-specific language to do this, you need to write a gnarly YAML file in order to do this, or you need to rewrite. If I have an existing codebase, I need to rewrite everything and turn it into a script and take my requirements.txt and my Docker information. I need to convert all of that into this tool, which isn't great. What I really like is if you can still work within Docker world. A lot of training under the hood happens in Docker. A lot of inference happens in Docker.

Ethan Rosenthal [00:45:12]: I think from the researcher's side, the hard thing can be that if I want to train a model in the cloud, I might need to build an entire Docker container, upload that from my laptop, and then wait for it to be downloaded on different machines. All of this is very painful, but I think the closer that experience can get to basically just being Docker with a whole bunch of caching involved, the closer my experience is to writing code on my laptop in a shared code base, then hitting run when I want to run a model. And if that happens to run in the cloud on lots of GPUs, then that is the ideal experience: I can sit in my code base and write code how I like to write it, using whatever IDE I want, but then, with low latency, abstract away running things in the cloud. I think there's some new companies that are doing this in interesting ways: Modal and Bauplan and people like that. This is what we did at Square.

Ethan Rosenthal [00:46:20]: You have a shared codebase, you write code on your laptop. And then we had a tiny little CLI where we package up everything into a Docker container, ship it up to the cloud, and then run a job on GPUs. It was a bit slow because we weren't great at caching things, but I think the better you can cache and the faster that experience becomes, the better things are.
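
The "tiny little CLI" workflow described here (build a container, ship it, run a job on GPUs) might be sketched as below. This is an assumption-laden illustration, not the Square tool: the image name is a placeholder, and `submit-gpu-job` is a hypothetical command standing in for whatever actually launches work on the cluster. The `docker build`, `--cache-from`, `-t`, and `docker push` pieces are standard Docker CLI usage. The sketch only plans and prints the commands; it does not execute them.

```python
import shlex
from typing import List


def plan_commands(image: str, context: str, job_args: List[str]) -> List[List[str]]:
    """Return the shell commands a package-and-run CLI could execute, in order."""
    # --cache-from asks Docker to consider a previously pushed image as a
    # source of cached layers; how much this helps depends on the builder setup.
    build = ["docker", "build", "--cache-from", image, "-t", image, context]
    push = ["docker", "push", image]
    # Hypothetical launcher: stands in for the real job submission mechanism.
    submit = ["submit-gpu-job", "--image", image, "--"] + job_args
    return [build, push, submit]


commands = plan_commands(
    image="registry.example.com/research/trainer:dev",  # placeholder name
    context=".",
    job_args=["python", "train.py", "--config", "base"],
)

for command in commands:
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(shlex.join(command))
```

Keeping the planning step separate from execution makes it easy to add a dry-run mode and to unit test the tool without Docker installed. The caching point from the conversation shows up in the build step: the faster unchanged layers are reused, the closer the loop feels to running code locally.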

Demetrios [00:46:45]: Yeah. Because you know what is not going to work? Trying to get that researcher who barely wants to learn git to learn Kubernetes.

Ethan Rosenthal [00:46:53]: God, no. No. It would be nice if nobody had to learn it. For those people who do know it, it's nice when only they are the ones doing it.

Demetrios [00:47:03]: Somebody told me like, oh yeah, there's this whole generative DevOps movement. And I made the joke, like, all right, well, I'll believe it when these LLMs have Kubernetes fluency, you know, or like YAML fluency, I think, is the way that I put it. And they're like, it's coming, it's coming. I still am not so bullish on that and I'm not sure if it's going to ever happen, but hopefully I'll be proven wrong.

Ethan Rosenthal [00:47:28]: Yeah, it would be nice if we have something other than Kubernetes before the AI models have to learn Kubernetes.

Demetrios [00:47:37]: That's a different way of thinking of it. Just get rid of Kubernetes and then you won't have to worry about teaching AI Kubernetes or YAML. I like that.

Ethan Rosenthal [00:47:47]: My other very brief rant is that I hate YAML, and at Square I replaced all the YAML with Python code. We replaced all the YAML with Pydantic models, kind of dataclass-like configs, where you get nice type hints and validation and everything else. And this ended up paying dividends as well, because I feel like what often happens is you have engineers who start building out all of this crazy custom parsing of YAML files, because the YAML becomes the interface between the researcher and the engineers. Because the engineer is like, all right, as long as you can get it into YAML, then I can run your training job or serve your model. And then the researcher is like, well, actually I need this weird custom logic. And instead of just being able to write Python code to do that, the engineer now has to write custom parsing of the YAML file in order to implement the logic that the researcher wants. And now you have these weird YAML transpilers and it's a mess.

Ethan Rosenthal [00:48:53]: And I think people should just write Python code itself.
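
A small sketch of the "configs as Python" idea follows. The conversation mentions Pydantic models; to keep this example dependency-free it uses the standard library's `dataclasses` with hand-written checks as a stand-in, so the validation that Pydantic would provide automatically is spelled out in `__post_init__`. The field names and the `sweep` helper are hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class OptimizerConfig:
    learning_rate: float = 1e-4
    warmup_steps: int = 1000

    def __post_init__(self) -> None:
        # Manual validation; a Pydantic model would express this declaratively.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.warmup_steps < 0:
            raise ValueError("warmup_steps must be non-negative")


@dataclass(frozen=True)
class TrainConfig:
    dataset: str
    batch_size: int = 32
    num_gpus: int = 8
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

    def __post_init__(self) -> None:
        if self.batch_size % self.num_gpus != 0:
            raise ValueError("batch_size must divide evenly across num_gpus")

    @property
    def per_gpu_batch_size(self) -> int:
        # Derived values are plain Python, with no custom YAML parsing needed.
        return self.batch_size // self.num_gpus


def sweep(base: TrainConfig, learning_rates: List[float]) -> List[TrainConfig]:
    """The 'weird custom logic' case: generate variants with ordinary code."""
    return [
        replace(base, optimizer=replace(base.optimizer, learning_rate=lr))
        for lr in learning_rates
    ]


base = TrainConfig(dataset="clips-v1", batch_size=64, num_gpus=8)
variants = sweep(base, [1e-3, 1e-4])
```

The point of the design is that a learning-rate sweep or a derived per-GPU batch size is just a function or a property. With YAML as the interface, each of those would need its own parsing convention on the engineering side.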

Demetrios [00:48:56]: Yeah. First the researcher spends probably five to 10 hours researching YAML and how Kubernetes even works and what Kubernetes is, and then they're like, actually, you know what? Fuck this. They try and get somebody else to do it or they try and figure something else out. So, yeah, that is a classic one. Man, that is so funny. The last piece on this from my end is how are you structuring teams so that you have the researchers and you have the engineers playing nicely on that organizational front?

Ethan Rosenthal [00:49:34]: Yeah, so we have the kind of cringy title of "member of technical staff" for all of the technical people at Runway, which I feel like is the new, cool AI company thing to call everybody.

Demetrios [00:49:48]: Yeah, it's the flat. Yeah, in like 2010, it was flat management. So you didn't have a hierarchy.

Ethan Rosenthal [00:49:55]: Didn't, like, Zappos do that or something? Yeah, yeah, yeah. So maybe that's what it is. But, you know, the argument is that boundaries between teams should be fuzzy and people should feel free to work on what they want to work on. That said, we still do have teams, and so we have a large team of researchers. We have a kind of solidly backend software engineering team and a front end software engineering team. And then I sit on the machine learning acceleration team that sits in between backend and researchers. And so this is kind of like machine learning acceleration, not engineering, but acceleration.

Demetrios [00:50:35]: Huh. And wait, what is the goal? Or like, what kind of KPIs do you have? Or OKRs, I guess. What are the metrics that you care about? No, none.

Ethan Rosenthal [00:50:46]: I mean, we're a startup, so I think...

Demetrios [00:50:49]: With artistic leadership too. That's the other piece.

Ethan Rosenthal [00:50:55]: Exactly. I think things change too fast to set quarterly KPIs and things like that. But generally I feel like the team that I'm on is really oriented around kind of speeding up the research, both training as well as inference, for the researchers. And so the faster they are able to run experiments, the faster they're able to train their models, the better that their models scale when training: all of that side of things we work on. And then on the inference side, the GPUs are very expensive. I don't know if you knew this, but they cost a lot of money, and so you want them to go brrrr. You want to use them as much as possible. And so on the inference side, the faster we can run inference, the more models we can pack on a GPU and things like that, the better we can utilize our capacity.

Demetrios [00:51:53]: Is it all homegrown stuff that you have?

Ethan Rosenthal [00:51:57]: No, we have a giant mixture of everything.

Demetrios [00:52:01]: Yeah. Nice, nice. Well, this has been fascinating, man. I really appreciate this idea of the multimodal feature store, and it's something that I'm going to chew on for a bit, I think especially when it comes to thinking about how it enables the large-scale training and the distributed training and getting the data exactly when you need it. It reminds me of the Toyota just-in-time model: you don't want to create the car before somebody has demand for that car. You don't want the data to be downloaded until there is somebody that has demand for that data or a model needs that data for its training run. And the searchability on the features of a multimodal feature store also seemed fascinating.

Demetrios [00:52:52]: And then being able to segment your data sets and then give researchers different options for the flavors of datasets, I think that's a cool way of looking at it too, because then they can hopefully get more out of the same data, almost, I guess, is what it would be. It's like you can squeeze the lemon a little bit more and get more juice out of it.

Ethan Rosenthal [00:53:19]: Exactly. There's like that Google paper, you know, everybody wants to work on models, but nobody wants to work on the data part of it. I don't remember what the title is, but it's the same idea that the model is this tiny component, but then the quality of the data that you're training off of is hugely important. It's just typically a lot less fun than futzing around with neural nets and attention and things like that. I think that the more fun we can make it, or more realistically, the less painful we can make inspecting and understanding your dataset and deciding what your model trains off of, the better results we'll get.

Demetrios [00:54:02]: Yeah, and I also like what you were talking about on how to accelerate the researchers. You're the researcher enablement department or acceleration department, and that is a really cool role to have. You get to think about, for these researchers, what is the best structure and what is the best way that we can make sure they are utilizing all the resources we have to their maximum capability, and we are getting as much as possible from them. And hopefully nobody's overspending on GPUs in the meanwhile.

Ethan Rosenthal [00:54:36]: Never, never. Optimal spending the whole time.

Demetrios [00:54:39]: Yes. Excellent. Well, thanks, Ethan. This was great, man.

Ethan Rosenthal [00:54:44]: Thanks so much.
