PyTorch's Combined Effort in Large Model Optimization
Dr. Michael Gschwind is a Director / Principal Engineer for PyTorch at Meta Platforms. At Meta, he led the rollout of GPU inference for production services. He led the development of MultiRay and TextRay, the first deployment of LLMs at a scale exceeding a trillion queries per day shortly after its rollout. He created the strategy and led the implementation of PyTorch optimization with Better Transformer and Accelerated Transformers, bringing Flash Attention, PT2 compilation, and ExecuTorch into the mainstream for LLMs and GenAI models. Most recently, he led the enablement of large language models for on-device AI on mobile and edge devices.
At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Explore PyTorch's role in boosting model performance, on-device AI processing, and collaborations with tech giants like ARM and Apple. Michael shares his journey from gaming console accelerators to AI, emphasizing the power of community and innovation in driving advancements.
Michael Gschwind [00:00:00]: My name is Michael Gschwind. I'm an engineer on the PyTorch team. I helped create Torch Chat, which is the PyTorch library for LLM (large language model) inference. And my coffee is usually a latte; I take it with a double espresso.
Demetrios [00:00:21]: I feel honored that I got to talk to Michael today. This is another episode of the MLOps Community podcast. I'm your host, Demetrios, and this is one of those times where I almost have to pinch myself. This man has contributed so much, and maybe he has been behind the scenes, or if you have been in the ecosystem for a few years, you probably know who he is, and that's for a good reason. The dude has done a lot, and recently with his product or project Torch Chat, he talked about how he was able to combine all of these different projects that he's worked on while at Meta in the PyTorch ecosystem. And really, when you talk to him, you can tell that he is all about that optimization. Another thing that became glaringly, blazingly obvious as I spoke with him is he gives a lot of people a lot of credit for everything. He recognizes that proverb: if you want to go fast, go alone.
Demetrios [00:01:33]: If you want to go far, go together. He basically embodies that. And I can tell you that because one of the first things that he told me was, okay, I want to give a shout out to that team, but I don't want to talk about anything that team is doing because I want them to come on here and talk about it themselves. I don't want to act like I can take credit for that. So, you know, you've got a gem of a person when that's what they're telling you. Let's get into it with Michael. And as always, if you enjoy this episode, share it with one friend. I'll see you on the other side.
Michael Gschwind [00:02:10]: Yeah, I've been around the block a couple of times. I built what was arguably the first accelerator, the Cell in the PlayStation 3. So if you played PlayStation 3, you used my chip. If you used the Xbox 360, I worked on that one, too.
Demetrios [00:02:32]: So everyone like myself, who enjoyed Tony Hawk growing up, it was thanks to you.
Michael Gschwind [00:02:38]: Jesus, that sounds exciting. Yeah. And I think there is a path that led from there. It was terribly hard to program initially. The goal was to put all the performance that you can get into one chip. And to do that, we sort of had to make compromises. It's not, you know, the comfort of a traditional CPU where, you know, you have branch prediction and all those caches, whatever. It was much more like, okay, you want the pure performance, we're going to give you as much performance as possible, but it means there's nothing else, no comfort on this chip.
Michael Gschwind [00:03:24]: It's all on you to manage that. And so users did find that somewhat hard to program, to say the least. And I ended up spending a good part of my time building software environments, ecosystems, compilers. And that straight up led to AI, which turned out to be the most voracious consumer of both compute performance and, in the end, accelerators, GPUs, et cetera. So that was really cool. From enabling accelerators for a broad range of applications within IBM, accelerators powered the first petaflop computer, the first computer delivering a petaflop per second. And there is a direct connection from there to the use of accelerators in AI today. They are the largest consumer today of compute performance and they have a particular affinity to the sorts of things that work well on accelerators.
Michael Gschwind [00:04:46]: Large matrices, very repetitive operations, but many of them, similar to what you have for graphics processing, where you want to process a lot of pixels at the same time. Here you're processing a lot of neurons, a lot of layers at the same time, but with the same general operations, matrix multiplies in particular. And that was a match made in heaven. And so, yeah, the history of accelerators in many ways has become, you know, an enabler of AI. I think Geoff Hinton and his team were the first who saw that potential when they trained models for the ImageNet competition and beat the traditional image-understanding entrants in the ImageNet competition back in 2012. How time flies.
Demetrios [00:05:54]: And were you, because you were working on the PlayStation 3 and then the Xbox and mainly in the gaming world, I guess, or the chips side of the gaming world. And how did the transition go into AI? Was it fairly straightforward where you recognized, okay, this is same chip, it's just different application?
Michael Gschwind [00:06:18]: So the story is a little bit longer. After Cell and the game consoles, I spent a considerable amount of time on supercomputers. So I built three of the world's fastest supercomputers. That was Roadrunner, the first petaflop supercomputer, then Blue Gene, and at the end the Summit system, which was using GPUs together with the POWER architecture to power national labs applications. And it's from there that I was looking around: hey, what's the next cool thing you could do with this technology? We have the Summit-style supercomputers where we connect GPUs with CPUs in a very tight collaborative fashion. This was before the Grace Hopper CPU-GPU that Nvidia later released. We had been working with Nvidia on bringing the Nvidia GPUs to the IBM systems.
Michael Gschwind [00:07:50]: And so as I was looking around back then, over 10 years ago, there were these ImageNet results that Geoff Hinton had. And I was like, we need to try that. And so we built a system to train what were the CNNs of the day, AlexNet in particular. And we trained the first AlexNet model in under an hour. And so that was like, hey, this is working. This is a really good direction. Of course, Facebook beat us shortly after that, which, you know, as I mentioned, competition is good in our industry. You want to see that, because somebody comes and says, what you're doing is cool, but I have these ideas for how we can do that even better.
Michael Gschwind [00:08:48]: So I think this collaborative competitive framework is a great way of bringing people in, harnessing their ideas, allowing them to build on the ideas of others. And I have to say, research is a team sport. So there is a lot of wisdom in teams, in collaboration, where you might not have all the answers, but you have a part of the answer and some of your colleagues, or competitors for that matter, may have the other part of it. And overall, you know, by having that cooperation, coopetition, whatever you want to call it, we get an overall better outcome. And I think that was goodness and super exciting to see. So I was really thrilled when I joined Facebook to have the opportunity to work with some of those people that had built those exciting training systems. And PyTorch was really just coming into its own. And I think PyTorch's commitment to open source, to a community-based model, has served it super well and is the foundation of a lot of innovations and improvements that people have brought to that ecosystem.
Michael Gschwind [00:10:40]: Having that openness means you can pursue your own ideas, you can contribute, but you don't have that time cost where you first have to build a framework before you can prove that, you know, I have a better whatever it is, matrix multiply kernel, or a better way of doing quantization or whatever. You can build on the work of others, you know, and focus on your own interests and your own ideas and improvements while relying on the community of other users, of other contributors, to take care of the things that are important, but maybe not your core competence.
Demetrios [00:11:33]: So talk to me a little bit about Torch Chat now, what you've been working on recently and the three goals as to why you created it.
Michael Gschwind [00:11:46]: So, okay. Torch Chat is in many ways the culmination of several development strands, if you will, in the PyTorch ecosystem. In the LLM ecosystem there was the optimization of PyTorch for LLMs. So we've done a lot of investment over the years to optimize PyTorch for LLMs. There was Better Transformer, the initial acceleration for LLM inference, then Accelerated Transformers that brought in Flash Attention, that brought the new scaled dot product attention operator and many other infrastructure improvements. There was the torch.compile work that allowed you to take those models that you have in PyTorch and export them such that you can run them standalone, outside of a Python-hosted environment, like a normal application. And then there was the ExecuTorch on-device work that brought model inference, LLM inference with LLM optimizations, to on-device AI. And so one of the thoughts here, or one of the goals in building Torch Chat, was to integrate all of these different development strands, these different silos of optimization, both to show users how these different technologies can be used, but also to integrate them, how you can integrate them, and to build that end-to-end solution that provides a seamless environment to deploy LLMs for inference, from servers on the one side all the way to on-device applications at the other end of the spectrum. So that was the first underlying impetus, which was both enabled by these PyTorch developments.
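As a rough illustration of the export path described here (a minimal sketch only, not the full Torch Chat workflow; the tiny TinyMLP module is made up for the example), torch.export captures a PyTorch model as a standalone graph that backends such as ExecuTorch or ahead-of-time compilers can then consume:

```python
import torch
from torch.export import export

# A deliberately tiny stand-in model; any eager nn.Module works the same way.
class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP().eval()
example_inputs = (torch.randn(1, 16),)

# torch.export captures the model as a self-contained ExportedProgram that no
# longer depends on the Python class above, which is what lets downstream
# backends run it outside a Python-hosted environment.
exported = export(model, example_inputs)
print(exported)
```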
Michael Gschwind [00:14:27]: Then Torch Chat is also a driver for ongoing optimization for LLMs in PyTorch and ExecuTorch. Just by having that integrated environment, you can now benchmark. We know in engineering that if you benchmark something, if you measure it regularly, it gets better. Not by magic, but by some engineer looking at it: this is not good, we can do this way better. You know, if you don't have those numbers out there, you don't give somebody that "hey, I think I can do this twice as fast" moment. Yes, please, send us your PR. That's exactly what we want, right? So getting that platform where people can measure, where people can try out their own ideas and measure them, is sort of almost like a dialogue with the community.
Michael Gschwind [00:15:35]: Now, what can I do here to contribute, to bring my ideas to the table? And it's been an important driver for PyTorch optimizations. We found kernels, we found opportunities for "hey, this should do better" by comparing with other inference solutions, by comparing between, say, the Mac and a Linux platform. ARM code generation hadn't been, until very recently, a big focus. And so with Torch Chat running on the Mac, people started to ask questions: hey, why is this performance the way it is? We thought, you know, based on the rating of this machine, you'd get better performance. And so we looked at it, and again, what gets measured gets improved. We looked at the code, we partnered with Meta's own LLVM team, Meta's own compiler teams: what can we do to generate better code for ARM? We partnered with ARM, we partnered with Apple to bring better kernels. A lot of the performance comes from the BLAS kernels, the GEMM (generalized matrix multiply) kernels. And so it comes down to getting the BLAS kernels right, partnering with the community that is most intimately aware of what improvements you can do for that particular platform.
Michael Gschwind [00:17:34]: And again, bringing that into the common ecosystem gives the community a better outcome by bringing another set of eyes to the table, another set of ideas into the ecosystem. And it's not just Torch Chat. I would love for Torch Chat to be this catalyst in some ways, so that people can look at it like, what are they doing to get performance, whatever, and for people to blatantly plagiarize it for their own application. That's what open source is good for. Hey, I saw this idea. I may not be doing LLM inference, but this is a cool idea which I can use for my own application over here. So there is some of that. Just aggregating all of these libraries, ideas, ways to structure kernels to do things better provides a reference for others.
Demetrios [00:18:53]: I'm a really big fan of this. To summarize, you were able to leverage many different veins of the PyTorch ecosystem that were already there to create a more complete product, we could say in a vague sense. I don't know if product is the right word, but to create something new that is leaning on, or standing on, the shoulders of giants. And then you recognized, you know, what would be really good is if we can benchmark these, but not just benchmark them for ourselves, benchmark the performance and let everyone see, because we want to know and we want the numbers to continuously get better. And I think one thing that I really appreciate about what you just said is the competitiveness. It's almost like, you know, whenever you see a number and you think, how could this be faster? You get that itch. And so there's going to be other people that are going to be out there and they're going to get that itch. And if you throw it out and make it public, the numbers are only going to get faster.
Michael Gschwind [00:20:05]: Exactly. I think that's what good engineering is about: looking at a solution, asking, well, this is great, what's the next thing we can do with that? Either how can we use it and give it more value by using it in a different space, or how can we take it in this space and make it more performant, lower power, more reliable. There are so many attributes that people care about where you can imagine improvements. And hey, I would like to run it without Python, because Python is cool but it's also a very large ecosystem, so it's not going to run on my small device. We've been running Torch Chat on a Raspberry Pi, which I thought was one of the coolest things ever. That's the other thing in engineering. Every time you achieve something, you have the next goal. So I think the next goal for Torch Chat should be to run a model on an Apple Watch or whatever other smartwatch, like natively, not just talk to the Internet and pull down the answer from a server, but run it on the watch, on device.
Michael Gschwind [00:21:27]: Wow.
Demetrios [00:21:28]: Yeah, it's fascinating you say that too, because we've had conversations in the MLOps Community Slack where folks are saying, I need to optimize battery. I need to know how I can put this model on device and make sure that it just doesn't kill the battery right away. So there are, like you say, so many different vectors that you want to optimize for, and it's really cool when you can give that to the community, and each different use case is going to have their vector that they're trying to optimize for. And so combined, you're able to get to places that you probably wouldn't be able to get to if it was just one team trying to work on their one use case. And I'm wondering, as I talk to you, what are some things that surprised you, that gave you just an enormous amount of lift as you were trying to optimize, that either you created, or came from stuff that you and your team have done, or just the community has done, and it surprised you because you wouldn't have thought to try and go in that direction.
Michael Gschwind [00:22:41]: A big shout out to the developers of XNNPACK and to our own team members that are a part of XNNPACK development. One of the engineers on our team developed a really cool 4-bit matrix multiply for ARM that uses an instruction that only exists on ARM, that does a vector of 4-bit by 4-bit multiplications and then accumulates it, all integrated into a quantization kernel with scaling for the group-wise quantization. And that gave amazing performance that made CPU inference of LLMs, in particular for on-device, scream in ways that were just amazing. It's another example where you integrate different technologies or different dimensions: there's the instruction set architecture of ARM, there is the XNNPACK architecture, there is the specific kernel, and that's integrated with ExecuTorch and comes to Torch Chat. Just if you look at the steps, at how many communities and experts were necessary, it's hard to imagine that you could build that in a clean room without these amazing communities that all bring their own ideas to the table.
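To make the group-wise 4-bit scheme concrete, here is a minimal PyTorch sketch of symmetric group-wise int4 weight quantization, the scales-per-group idea described above. It is illustrative only: the real XNNPACK/ARM kernel packs two 4-bit values per byte and fuses the dequantize-and-accumulate into the matrix multiply itself, and the function names here are made up for the example.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Symmetric group-wise 4-bit quantization of a 2D weight matrix.

    Each row is split into groups of `group_size` values, and every group
    gets its own scale, so quantization error stays local to the group.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_grouped = w.reshape(out_features, in_features // group_size, group_size)
    # Per-group scale: map the largest magnitude in the group to the int4 max (7).
    scales = w_grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    # Values land in [-8, 7]; a real kernel would pack two of these per byte.
    q = torch.clamp(torch.round(w_grouped / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize_int4_groupwise(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.to(torch.float32) * scales).reshape(shape)

# Quantize a Linear-sized weight and check the reconstruction error.
w = torch.randn(64, 128)
q, scales = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, scales, w.shape)
print("max abs error:", (w - w_hat).abs().max().item())
```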
Demetrios [00:24:47]: 100% in agreement with everything that you just said. And speaking of which, you've built products that go on different chips and interface with different chips at a very low level. Are there tips or tricks that you could give us? Because if I was talking to a friend who is working on these very problems, he was saying, man, it's just so hard, because GPUs aren't written in CUDA. It's like, if I need something to happen with the GPU, or if I really want to figure out how to squeeze the most out of it, I've got to read that manual. And sometimes you've just got to know somebody at Nvidia to be able to get that little, I don't want to say trade secret, but that thing you want to figure out. Or sometimes it's even like petitioning or lobbying so that something can get changed in a GPU. And I wonder if, with all your experience, you've found there are easy ways to do things, there are hard ways to do things, there are tips and tricks that can help us as we're trying to squeeze the most that we can out of the juice.
Michael Gschwind [00:26:03]: I think the number one magic is collaboration. There are a number of open source solutions and teams all around the space. And so deciding to go it completely alone will make it difficult to engage the right people that have the right idea. Maybe you're missing this one thing and you don't know where to get it, while somebody else may have that. And I'll give you a couple of examples. There is the CUTLASS team at Nvidia, and they're producing an amazing library for doing numerically intensive processing on GPUs. And that's been an amazing relationship. A lot of capabilities we get from CUTLASS.
Michael Gschwind [00:27:15]: Conversely, we have partnered and contributed optimizations, and these guys work at Nvidia. So if you have your kernel and you contribute it to that library, a) you make it available to many more users that might benefit from it, but they also might improve it, because they have insights or ideas that build on top of your own solution. And with CUTLASS being linked to Nvidia teams that think about how to optimize the architecture, the GPU architecture, and the library architecture, you're much more likely to find that person that has either the ability to say, hey, there's this little-known approach that, you know, if I integrate it here, improves performance significantly, or, hey, this is really good, but if we optimize this or that in our next chip, then we could build something that's even more amazing. So now, through these collaborating communities, you get this continuum where the chip manufacturers, the vendors, whether that's Nvidia or ARM or Qualcomm or Apple, on the one hand build the infrastructure on which you're running, all the way to the software end users. And there is the opportunity really to do end-to-end optimizations that way without needing to own or understand the entire stack. Because frankly, it's very unlikely that you're going to find an LLM expert who also happens to do chip design.
Demetrios [00:29:36]: The other thing that I wanted to talk to you about is this idea of the cloud versus on device and I know you have some strong thoughts about that and how it's almost like this false dichotomy or false paradigm. In the past maybe it was more valid, but now because of advancements, it is more realistic to be able to do what you want to do on device.
Michael Gschwind [00:30:05]: On device: there are many reasons why on-device is exciting and on the brink of a breakout, if it hasn't already broken out. On the one hand, there is: well, I want to run in a disconnected environment. Whether it's I want to use the LLM as a translator and I'm in a place where I don't have network access but I still need translations, or I'm running in a car and don't have the right connectivity, or I just care about privacy and want to get some explanations about something or other that involves data that is private, and I don't want to upload it into the cloud. There's a range of applications. So in the past, on-device AI has been seen as this special thing on the side where you have to develop models that are specifically for on-device. And that meant that the on-device models were, you know, a small set of models, like for keyboard prediction or gesture recognition, and sort of very niche: well, you need that on device. But it wasn't benefiting from the broader research, the broader investment into AI models, because they were always the thing on the side. With the release of ExecuTorch, and in particular the inclusion of LLM optimizations for ExecuTorch, ExecuTorch being PyTorch's backend
Michael Gschwind [00:32:10]: for on-device, embedded, edge-type applications, it became practical to take models that are general PyTorch models and run them on device. Then Torch Chat took that and created this continuum where it's actually the same software infrastructure that you can use to run and export and exercise models, optimized models, from server all the way to on-device. So there's a consistent model definition that will run on either servers or on device. So you can use the same models and try them out on a server, then ask the question: so how does this run on device? There will evidently be somewhat different goals, like you will possibly want to deploy a larger model on the server. On device you're more memory limited, you're compute constrained. So Llama 7 billion or Llama 1 billion versus Llama 70 or 405 billion. But it's the same definition.
Michael Gschwind [00:33:38]: And if you look at some of the challenges that you're facing from server to on-device, they're in fact very similar. For example, on device one of the big challenges is memory footprint, but also compute power. And one of the answers to that is quantization: taking the 32-bit or 16-bit floating point values and then asking the question, well, how many bits can I get away with? And as I mentioned earlier, we have this screaming 4-bit kernel for ARM processors that we use for embedded, and that's quantization. That quantization allows you to get really amazing performance on device. But the same approach also applies to servers, to GPUs. GPUs are also memory limited.
Michael Gschwind [00:34:44]: They have a fixed device memory size on the card. And so you're facing some of the same constraints, only with slightly bigger numbers. Plus, on GPUs you also find, like on all other platforms, that you're bandwidth limited. So quantization helps you not just with your footprint, but also with taking advantage of the bus, which has a limited number of pins, you know, a limited number of bits that can be simultaneously transmitted. And so if you need to transmit more data, it takes a longer time, because you send more of these packets. So if you look at that, you're basically looking at the same sort of basic technologies, like quantization. And Torch Chat uses a library called torchao, architecture optimization, that one of the PyTorch teams is developing, and that takes a look at your model and finds ways to compress the computation, to compress the weights, to make computation more efficient. And it applies equally well on the one hand to servers, where you care about bandwidth or the amount of weights that you can store on a single GPU, because the largest models today are multi-GPU inference, because you can't fit all the weights on a single device anymore.
Michael Gschwind [00:36:38]: So you're talking fundamentally about the same constraints that you have with on-device. Yeah. And so integrating that for LLMs in Torch Chat, and more broadly for AI with libraries such as torchao, really allows different communities or different use cases to benefit from that shared investment. That's an important opportunity. And I think with the Torch Chat library in particular, but more generally with ExecuTorch and PyTorch, creating this community that has a much broader reach enables models that used to be either just for this one domain or that other domain to cross domains and really benefit all the users. There will be differences, for example, in size. If you're going on device, you might pick a smaller Llama model than if you're running on a bigger server. But it's the same technologies and the same primitives and optimizations that you can share between users and get the benefit of shared investment.
Michael Gschwind [00:38:09]: So users do not have to reinvent or recode the same optimizations just for different domains.
Demetrios [00:38:22]: Well, this was the first time that I've heard someone put it like that, where on device and GPUs or accelerators, you're still, you don't have to think of them as two separate things. It's just one has bigger numbers than the other. And so don't get into the trap of thinking that you need to have this one with the bigger numbers in order to do what you want to do. And almost the idea of standardizing it across all of the places that you want to run your model is a fascinating way of abstracting the device and just recognizing what am I trying to do and what are my constraints.
Michael Gschwind [00:39:11]: And then you can talk to the device experts and have a conversation: what else can we do to optimize for that target? There is, for example, the GPU MODE community that is very focused on optimizations you might do for GPUs. And some of the engineers that are very active on that are engineers that work on torchao, the architecture optimization library. So you get that benefit there. But we also have, for example, engineers that worry about and work on on-device AI contributing to that library and partnering with ARM and embedded manufacturers. So having that continuum is good. And then once you have that and the same primitives, you can ask, well, what's the best implementation for my device? Or are there additional things I can do for my device that are very specific to what this device offers me?
Demetrios [00:40:26]: Yeah, exactly. It's almost like you don't want to, as you were mentioning before, have to reinvent the wheel or do things very custom for each device that you're on. You want to get 80% of the way there or 90% of the way there and then make those final tweaks to try and optimize for whatever device you're on if you need it. Because some of the time just out of the box will probably be good enough.
Michael Gschwind [00:40:57]: Right? Especially if the stakeholders, the developers of a particular device, be it ARM, be it Nvidia, increasingly AMD, are working with the PyTorch community, and with other communities, to optimize or contribute their specific functions, their specific optimizations, so it becomes transparent, where the ecosystem, basically depending on where it runs, can pick the best implementation to solve your problems. There's just one linear operator in PyTorch, but under the hood there are different libraries, from cuBLAS for GPUs, to Composable Kernel on AMD, to the ARM libraries, et cetera, to XNNPACK, just providing implementations that the PyTorch ecosystem and Torch Chat can rely on to get the best performance on a particular device.
Demetrios [00:42:29]: There's one thing that I believe everyone who has ever used PyTorch is in strong agreement with, and that is the developer experience is strong. And thinking about it in the way that you're talking about is one of those reasons: you recognize that, okay, the standard way, or the way that a developer is going to think, is along these lines, so let's try and make it as easy as possible for them to go about it and not add friction, unintentionally or intentionally, to the workflow. And so having the standardizations and recognizing where you can get a little extra oomph out of the box if needed is great, but if it's not needed, that's even better.
Michael Gschwind [00:43:26]: Maybe you just transparently get that optimization, period. So that's one of the great things I've seen with Torch Chat. In response to Torch Chat, there have been improvements to PyTorch, to torchao, to some of these other libraries, that have looked at what a model does here and where the opportunities for further improvement are. I mentioned our work with ARM for on-device. There is similar work with GPU vendors to create better kernels, be it in CUDA or in Triton, which is the GPU language coming out of OpenAI that we use for a lot of the applications that we generate code for. When you use the torch export to export the model, it will generate the Triton code if you're targeting a GPU and then run it through that compiler. And you get a lot of the optimization that has been developed for Triton as part of this ecosystem. And that's another of these pillars where it's a parallel community, but by integrating them into this ecosystem, it now offers benefits to all the PyTorch users that use export, including for Torch Chat. If you export that for GPU with the Torch Chat export, it'll generate the Triton code and build the model that way.
Demetrios [00:45:33]: Are there any anti-patterns that you've seen recently that are worth talking about?
Michael Gschwind [00:45:39]: I think the biggest anti-pattern is premature optimization. There is this tendency that people sometimes develop that, oh, my problem is so different, whatever is out there won't apply here. And they start to build a very specific, very niche solution. And sometimes that is necessary, like if you have, you know, crazy constraints that just nobody else can satisfy. But you'd be surprised how often you find out that constraints or challenges similar to the ones that you are facing in your application exist in so many other places in the ecosystem. And if you're operating in an environment where you can just turn around and grab that solution and bring it to your problem, it's so much more efficient. So cutting yourself off by premature optimization, or by "I can't use a general AI framework," I think becomes a very limiting anti-pattern at the end of the day.
Michael Gschwind [00:47:22]: You might need some optimization for your target, but it might actually be better to build the overall solution and then ask which parts here do I truly have to optimize for my target and where else can I just get a free ride. I'm often joking that good engineers are lazy, and that doesn't mean they don't want to work. It means, like, oh, I've built this two times already, why should I implement the same solution for the third time? Let me take this solution over there, so now I can spend my engineering time on creating something new of value rather than redoing what's already there, maybe in a slightly different context. So cutting yourself off from the broad sharing of code and ideas, I think, is one of those anti-patterns that I see people sometimes fall into.
Demetrios [00:48:33]: So there's a funny thing that you mentioned there with the over-optimizing too early. Sometimes I can imagine that you're thinking you're taking two steps forward, but really you're taking two steps back, and you recognize it a little bit later: wow, if actually this wasn't optimized, maybe we could do what I'm trying to do a little bit better. How do you know when it is time to optimize?
Michael Gschwind [00:49:07]: First, get it to run. If it's not running in whatever form, shape or color, it's probably too early to optimize. And then I would say you want to figure out what the obstacles are to running with the performance, or the battery life, or whatever your constraints are for your target, and to ask, well, specifically, what are the pieces I might optimize to get there, so I can focus on those pieces that make the difference, that are make-or-break. And you can still reuse everybody else's work for the rest, because that allows you to focus on where the real value of your creation is: I can make this run on whatever, this device, or with that much battery life, or whatever. So there is that part, and then there is, well, you know, the gazillion other things that I need functionally, like a file system. I'll not rewrite a file system for LLMs, hopefully. But you still need a file system, because you've got to store stuff, right? I think everybody will see it in the case of the file system.
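In PyTorch terms, "figure out what the obstacles are" usually starts with measurement rather than guessing. A minimal sketch with the built-in profiler (CPU-only here; adding ProfilerActivity.CUDA covers GPU time as well):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# A toy model stands in for whatever workload you are about to optimize.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)
x = torch.randn(64, 1024)

# Measure first, optimize second: the table shows which operators actually
# dominate, so effort goes to the pieces that are make-or-break.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.inference_mode():
        model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```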
Michael Gschwind [00:50:56]: But there are many other things, like quantization algorithms. Okay, so I want to compress the data. It doesn't actually, you know, matter what device it is. It's the same sort of quantization approaches. Well, if I can go to torchao, the architecture optimization library, to pick up my quantization algorithm, that's goodness. Maybe I have a RISC-V and I don't have a RISC-V kernel yet. Then I can use the quantization algorithm that compresses the weights. And if my contribution is running the first LLM on a RISC-V smartwatch, the optimizations or the work that I have to do, specific to that, are the RISC-V kernels. XNNPACK has some of those already.
Michael Gschwind [00:51:54]: But I Much like for ARM, our team working with X&M PAC and ARM teams developed a better matrix multiply quantized matrix multiply for arm. I can see that there is a similar opportunity, say for RISC V and that's the contribution that enables this to run to scream on that device. Maybe partnering with RISC V. Hey, wouldn't it be cool? There is this instruction that ARM has if you had something similar. But hey, I also have an idea how to do that even better, if you're building your RISC V smartwatch that speaks LLM, that does LLMs, it may come down to you just need a few kernels and most of the rest, like the quantization algorithms itself, like the LLM, like the model export or Buddhic Tequitorge, you can build on all of that and you know, focus on getting your product, your idea out by focusing on what is needs to be different to enable what you want to do rather than rebuilding the world.
Demetrios [00:53:26]: Are you bullish on types of new training architectures or new model architectures for this optimization? Like you said, hopefully we can see the smartwatch, the LLM running natively on a smartwatch, and I feel like it's going to be a little bit of each area. As you've talked about, there's such a broad way that you can go and so many different vectors where you can optimize. And I wonder if you're bullish about the model architectures, or if there are other places where you're feeling very bullish that you want to talk about.
Michael Gschwind [00:54:06]: There are a couple of things that I'm looking forward to in the next innovations. At the model architecture level, transformers are everywhere, and in some form, shape or color they're here to stay. But there is a discussion of, are there improved transformer implementations, the Mamba architecture? So that'll be exciting. There's definitely more opportunity there. I'm typically more focused on infrastructure, so not there. There are probably many more of these out there in research, and having those researchers use PyTorch means that when they find something, we can bring it in, plug it in very quickly as a community, because it might need a couple of new operators or whatever, but it's still the same tokenizers, it's still the same linear kernels, et cetera. Other areas: I'm definitely excited about numerics.
Michael Gschwind [00:55:34]: The whole trend that we've been seeing of exploring numeric formats: I mean, we started out with FP32, FP16, bfloat16, FP8. Now we're down to FP4, people looking at 4-bit floating point numbers. So I think there's a lot of innovation and experimentation that's going on, and the infrastructure and the environment that we have, I think, is very optimized for that. Typically, insights are a function of how many iterations you run. A lot of this is very experimental. So if you can run more experiments, you get more insights and you get better results. So optimizing for turnaround, for iteration speed, is something that's been on my mind, and I know on other people's minds as well.
Michael Gschwind [00:56:45]: Andrew Ng had pointed out how that is critical if you want to drive success in a startup, for example; iteration speed is almost directly proportional to the rate of innovation. So I'm excited about both what we can do with iteration speed, but also what we can do to speed up how long it takes to run one iteration, one experiment, so that we might get new insights that allow us to build better, faster, more reliable, more efficient systems.
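As a small footnote to the numeric-format trend mentioned above, here is a sketch of how those shrinking formats show up as PyTorch dtypes, assuming a PyTorch 2.1+ build where the float8 dtypes are defined; FP4 is still experimental and has no built-in dtype, so it is not shown.

```python
import torch

# Bytes per element for the progressively smaller floating point formats.
for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.float8_e4m3fn]:
    print(dtype, torch.empty((), dtype=dtype).element_size(), "byte(s) per element")

# Casting a model's weights to a smaller format is a one-liner; whether the
# target hardware has fast kernels for that format is a separate question.
model = torch.nn.Linear(256, 256)
model = model.to(torch.bfloat16)
print(next(model.parameters()).dtype)
```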