MLOps Community

Efficient Deployment of Models at the Edge

Posted Jan 17, 2025 | Views 22
# AI
# Models at the Edge
# Qualcomm
speakers
Krishna Sridhar
Vice President @ Qualcomm

Krishna Sridhar leads engineering for Qualcomm™ AI Hub, a system used by more than 10,000 AI developers spanning 1,000 companies to run more than 100,000 models on Qualcomm platforms.

Prior to joining Qualcomm, he was Co-founder and CEO of Tetra AI, which made it easy to efficiently deploy ML models on mobile/edge hardware.

Prior to Tetra AI, Krishna helped design Apple's CoreML, a software system that was mission-critical to running several experiences at Apple, including Camera, Photos, Siri, FaceTime, Watch, and many more, across all major Apple device operating systems and all hardware and IP blocks.

He has a Ph.D. in computer science from the University of Wisconsin-Madison, and a bachelor’s degree in computer science from Birla Institute of Technology and Science, Pilani, India.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Qualcomm® AI Hub helps to optimize, validate, and deploy machine learning models on-device for vision, audio, and speech use cases.

With Qualcomm® AI Hub, you can:

Convert trained models from frameworks like PyTorch and ONNX for optimized on-device performance on Qualcomm® devices.
Profile models on-device to obtain detailed metrics including runtime, load time, and compute unit utilization.
Verify numerical correctness by performing on-device inference.
Easily deploy models using Qualcomm® AI Engine Direct, TensorFlow Lite, or ONNX Runtime.

The Qualcomm® AI Hub Models repository contains a collection of example models that use Qualcomm® AI Hub to optimize, validate, and deploy models on Qualcomm® devices.

Qualcomm® AI Hub automatically handles model translation from source framework to device runtime, applies hardware-aware optimizations, and performs physical performance and numerical validation. The system automatically provisions devices in the cloud for on-device profiling and inference. The following image shows the steps taken to analyze a model using Qualcomm® AI Hub.

TRANSCRIPT

Krishna Sridhar [00:00:00]: I'm Krishna. I'm VP of Engineering at Qualcomm. I work on Qualcomm's new Qualcomm AI hub. I take my coffee always as a latte with 2% milk.

Demetrios [00:00:13]: Today we're taking it to the Edge. Yes, I did just say that. All about AI on the Edge. Welcome back to the MLOps community podcast. I'm your host, Demetrios. And talking with Krishna, I got to learn all about what Qualcomm is doing in the chip sector, but mainly to make it easier for AI and ML developers to grab their models, bring them into the Qualcomm AI hub, and then easily deploy them onto Qualcomm chips or any chip for that matter. On the Edge, they give you so many different stats and help you optimize for what you need to optimize for. It was really cool to talk to him because I feel like AI on the edge is something that I do not know enough about.

Demetrios [00:01:07]: And there's a lot of people doing some magnificent stuff. I asked him what some of his favorite use cases were on the Edge and he gave me a whole list of them. He rattled off a few. One being having to do with cricket, which is not a sport that comes up much. Let's get right into it. As always, if you enjoy this episode, it would mean the world to me if you share it with one friend. And a huge shout out to Qualcomm for sponsoring this episode. It is because of sponsors like you that we get to do this cool stuff.

Demetrios [00:01:42]: So if you're deploying on the Edge, definitely go check out the AI Hub that Qualcomm has set up. Let's go. I'm sensing a pattern in how you do things. You start companies and then you sell them to big companies.

Krishna Sridhar [00:02:07]: Yeah, yeah, yeah. Well, I've done it once, and I've been part of a company for which it happened the second time.

Demetrios [00:02:17]: So, yes, it's very cool. And what were the companies? Has it always been in the AI space?

Krishna Sridhar [00:02:25]: Yes, it's always been in the AI space. I've been in the AI space for, I don't know, almost 11, 12 years. So it's been. It's been a while.

Demetrios [00:02:36]: Wow. Very cool. What made you want to get into it?

Krishna Sridhar [00:02:40]: I actually have an interesting story. I was doing my PhD in Wisconsin on numerical optimization, and I'm actually being deadly serious about this. My PhD thesis was how to help oil companies optimize for taxes in their drilling. [Demetrios: You found your life's purpose.] Yes. And at that time I was obviously much younger and I was like, oh, this is so cool. Such interesting problems to solve. There's a lot of cool technology. And then one day I wake up and I'm like, wait a second, I'm actually helping oil companies optimize their drilling with taxes.

Krishna Sridhar [00:03:29]: And I mean, the story of how this whole thing came about is kind of fascinating in itself, which is a lot of governments in Africa, they started imposing these complicated tax laws so that oil companies would end up paying their taxes to them. And then it's sort of a game of cat and mouse where you're like, okay, now let me optimize how to do this. And it became an interesting problem in itself. So I was doing that and then, you know, a lot of interesting problems in machine learning, which was what it was called back then, were around how to do numerical optimization. So that's how I got into ML or AI. And then I've been there since. I figured it's a much more useful endeavor than helping oil companies with taxes.

Demetrios [00:04:25]: Yeah, I can imagine since then you've, you've been around the block per se, and you were doing stuff with Apple and building a platform, right?

Krishna Sridhar [00:04:36]: Yeah. So I had this unique opportunity to help design CoreML, which is Apple's inference engine to deploy models on, on the Edge. And what was fascinating about deploying models on the edge is, you know, is the fact that you can do extremely innovative stuff in a way that's very snappy from an experience perspective, as well as extremely privacy sensitive. So the fact that your data doesn't need to leave the phone. And of course the, you know, the quintessential example of it is kind of face ID or face identification that's now there in every phone. I mean, there's no opportunity for you to release your biometric data to the cloud, get it verified and then get it sent back. Doesn't quite make sense. But at the same time you need, you need your face ID to work within, let's say 300 milliseconds.

Krishna Sridhar [00:05:40]: So you have to run some pretty complicated neural networks to be able to do face authentication locally on your phone in a way that's extremely secure. And that was sort of an example of on-device ML/AI. And I'm going to do a little pop quiz with you. Let's say you take a modern smartphone and you take a picture with it. How many AI models do you think run in that, you know, in that 300 to 700 millisecond time frame when you press click?

Demetrios [00:06:17]: Well, I do know that it will get better because I see it visually with my eyes. I'll, I'll take a photo. Especially when you're in portrait mode with like the iPhone and it looks like it's not that good at first, and then it refreshes or it updates the pixels and boom, it's like, wow, that was magic. Now it looks really good. I can imagine. What if we say, I don't want to, like, go too crazy and make you. I'm going to, I'm going to lowball it and say five models.

Krishna Sridhar [00:06:51]: Okay, that's actually not a terrible guess. I think it's order of 25. 25? Yeah. So that's not a small number when you don't have more than 300 to 500 milliseconds. And if you include post processing, you don't have more than a second. That's all the time you have. And it has to do everything, including, you know, framing, sometimes face detection, recoloring, sometimes, like all these complicated things have to be done in 20, 25 milliseconds.

Krishna Sridhar [00:07:19]: You get a perfect shot. And as, as, as you see, the phones are getting better and better at it. I don't know if you saw the latest Android phones. You have these really cool features where, you know, you have a group of friends, you know, all of their eyes are not open at the same time. So, you know, can you find the shot where everybody's eyes are open at the same time? So you got to process a bunch of different frames, find the shot, you know, make the perfect photo. So all of that, you know, has to happen real time. You don't have an opportunity to, like, send all of this information to the cloud. You want to see the picture right away.

Krishna Sridhar [00:07:51]: So you have to do it locally on the device, even with assistants. I mean, if you talk about the Google Assistant or Siri, a lot of the work has to get done locally. Of course, a bunch of the heavy processing around answering the questions happens in the server, but the speech processing, the text to speech, a lot of that happens locally. So you can have a pretty snappy experience. And you also have, you know, a ton of compute that's sitting right there on your phone. Like, you know, with the latest Snapdragon, you're looking at 40 to 45 TOPS. So, you know, you can use all of that compute to take pictures, to do speech processing, to do video processing.

Krishna Sridhar [00:08:38]: You can do a lot of that locally. So that's what sort of got me into this field is pretty fascinating.

Demetrios [00:08:45]: And going back to that, well, just so I can place it too, what level do you like to play at in the stack and how do you look at it? Like, like where, where is your sweet spot or your special sauce?

Krishna Sridhar [00:08:58]: So I've had the experience over the years to interact with almost all the layers of the stack, ranging from. Actually, before I answer that question, I'll, I'll tell you what makes this a fascinating problem. So I talked about, you know, 20, 25 different, you know, models running on a camera. I think on a modern smartphone you can have, you know, order of a thousand, you know, models running, you may have an order of, you know, a hundred thousand applications that are built on top of on device AI. So these are some pretty large numbers. And, and what's important is the approach that, you know, we've always had to this problem is build systems that other people can use. Because we're not building all the models that go into all of this technology. We don't, we don't.

Krishna Sridhar [00:09:45]: That's not our special sauce. Our special sauce is building systems so that the people who are innovating with the AI models can bring them into our hardware as easily, as quickly as possible. So what that means is we need to be able to build a system that can translate anything a researcher is doing in the cloud to train the models to be able to bring them onto a device, to run them on the device. Yeah, and what makes this a pretty challenging problem is, is the fact that the pace at which AI innovation is happening is absolutely insane. It's so rapid, it's changing all the time. The software people are using in the cloud to train models are changing all the time. They're using different frameworks. Back in the day, it was like TensorFlow is flavor of the week.

Krishna Sridhar [00:10:35]: Today it's all PyTorch. And things are changing so fast in that ecosystem. At the same time, your hardware innovation is happening at an incredibly rapid pace. The kind of hardware that was being used in a phone or a laptop even five years ago, it looks nothing like what it is today. Almost all the phones today and all the computers today have dedicated neural processors, neural engines to run specialized AI models. So you have your hardware that's changing and you have the AI and frameworks that are fast moving. So you have two fast moving things, one on the top and one on the bottom. So that's what makes this a pretty interesting problem because you got to map a fast moving thing at the top to a fast moving thing at the bottom.

Krishna Sridhar [00:11:20]: So you got to have pretty stable infrastructure. And that's what is exciting about this particular problem is how do you build stable infrastructure that can handle a fast moving AI ecosystem and a fast moving hardware ecosystem that allows someone to target their latest models to not only the latest hardware, but also potentially previous generation hardware?

Demetrios [00:11:42]: Yeah, it's, it's a nice metaphor how you talk about the top and bottom moving so quickly and almost like you're trying to set up that foundation and have a strong foundation, but at the same time you're trying to be able to hit the target and the target's moving. So the foundation's moving, the target's moving, and you're having fun trying to play between both of those.

Krishna Sridhar [00:12:07]: Yeah, and that, that's what makes this, this whole space super fascinating.

Demetrios [00:12:11]: Okay, so now getting more along the lines of you're working at Qualcomm, you're doing all kinds of cool stuff there with on device AI. How are you looking at it in that sense of coming back to the metaphor, like being able to understand how everything is moving in these lower levels, but also servicing all of the innovation and how everything's moving at the upper level?

Krishna Sridhar [00:12:42]: Yeah, so a few things have gotten a little bit easier. So I mean, the way we look at it is, you know, there's still a lot of stuff going on in your classical computer vision, classical speech spaces, but there's also a lot of new stuff going on with generative AI, with, you know, with new language models and new image generation models. So what's interesting there is, you know, the best cloud model from let's say two years ago performs worse than the best device model today. And that's a fascinating thing in itself. So what that basically means is things that we were able to do with an order or two orders of magnitude more compute a year or two years ago, we can do now on a local device. So almost all the latest flagship Android smartphones today, they have a language model that's in there to do things like summarization, completion, text replies. All of these things can be done locally on your phone. So let's say you get a long email, you know, your boss likes to write super long emails with like 25 pages, and you can just hit summarize and you get a summary of it, and that information can stay local on your phone.

Krishna Sridhar [00:14:18]: Now those are the kinds of tasks that are now being done with LLMs on the device, which look a little bit different from, let's say, capturing a photo on a smartphone and doing tone mapping. That looks a bit different from your text summarization use cases. So to answer your question, what has changed is there have been more use cases coming on, especially with GenAI, but luckily a lot of the architectures for these models look similar-ish. From a compute perspective, it's typically transformers and they don't look that different. So that makes it slightly easier. So everything is fast moving on the top, but the techniques that are being used are similar, and that allows us to build great systems because we're not necessarily building wildly different systems for your camera and your language models. So that's one thing that has changed in the last two, three years: there's been more and more transformers that are being deployed.

Krishna Sridhar [00:15:30]: They look pretty similar. In fact, the architectures of models that are shipping on your phone to do language tasks are not very different from, let's say, your ADAS, or driver assistance systems, that are shipping on Qualcomm chips in your car. It's kind of fascinating that those two kinds of problems are using similar techniques. So that has made our life a little bit easier. But the hardware is still innovating, and there's always new features in the hardware. So that part hasn't quite changed. But I would say the differences come in more nuanced things like precision. And to give you a good example, when you're doing something like large language model summarization, it's not as sensitive to precision, which means you can do something with four-bit weights, let's say.

Krishna Sridhar [00:16:32]: Whereas when you're doing, you know, ADAS in your car and you're doing precise calculation of, you know, where is this next car or where is the pedestrian, that stuff's pretty sensitive to precision. So you can't really afford to go lower bit. You've got to keep it higher bit to keep the accuracy high. So that is what I see as differences these days in terms of hardware: you have slightly more nuanced requirements or changes in the hardware to accommodate different kinds of precisions for roughly similar architectures. So that, I think, is what's different now than it was, let's say, two, three years ago.
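
To illustrate the precision point above, here is a minimal sketch of symmetric uniform quantization, showing how the worst-case rounding error on a set of weights grows as the bit width shrinks. The weight values are made up for illustration, and the scheme is a textbook one chosen for brevity; it is an assumption, not a description of how Qualcomm's tooling quantizes models.

```python
def quantize_dequantize(weights, bits):
    """Round weights to a signed `bits`-bit grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1            # largest integer level: 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:                      # all-zero input: nothing to quantize
        return list(weights)
    restored = []
    for w in weights:
        level = max(-qmax, min(qmax, round(w / scale)))  # nearest level, clamped to range
        restored.append(level * scale)
    return restored


def max_abs_error(weights, bits):
    """Worst-case difference between original and quantized-then-restored weights."""
    restored = quantize_dequantize(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Made-up example weights, not taken from any real model.
WEIGHTS = [0.013, -0.42, 0.27, 0.9, -0.66, 0.051, -0.008, 0.33]

if __name__ == "__main__":
    for bits in (4, 8, 16):
        print(f"{bits:>2}-bit max error: {max_abs_error(WEIGHTS, bits):.6f}")
```

Whether a given error level is acceptable depends on the task, which is the trade-off described above: a summarization model may tolerate the coarse 4-bit grid, while a perception model in a car may not.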

Demetrios [00:17:15]: And do you think it's the hardware's job to be doing that kind of optimization or is it able to be done in the software layer?

Krishna Sridhar [00:17:26]: This is a great question and I kind of have a good analogy for this. Right? So almost all hardware today on modern smartphones, cars, laptops, industrial devices, pretty much everything Qualcomm ships today, we do this thing which is called heterogeneous compute, which means you have a CPU, you have a GPU and you have a neural processor. And the analogy I give is it's a spectrum where the CPU is the most flexible and programmable piece of technology, where you can pretty much do anything. You can program it to do anything. Now, there's no guarantee that everything will be fast, but you can do everything. On the other side, you have your neural processor which is extremely specialized. So it can do some things really, really fast.

Krishna Sridhar [00:18:25]: It's designed to do some things really, really fast. And when you make a design at the hardware level, you can typically get an order of magnitude more efficiency, an order of magnitude more power efficiency, and an order of magnitude more performance. But then it comes at the cost of flexibility. You can't program just anything. Even though it's technically Turing complete, it's not easy to program everything. So it's only efficient at certain things. And then you have your GPU which is somewhere in the middle, which is more programmable, but it's also more efficient at certain things.

Krishna Sridhar [00:19:03]: But it's kind of a spectrum that way. So it's always a dance of like, okay, what are the features that are so critical that we need to bring them into hardware to get that extra order of magnitude more efficiency versus what can we make do with a slightly less efficient software implementation, maybe on the CPU? And that's the balance you're always playing with this technology.
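
The flexibility-versus-efficiency spectrum described above can be sketched as a simple placement rule: send each operation to the most specialized unit that supports it, and fall back to the CPU otherwise. The capability table and operation names below are hypothetical and chosen only for illustration; real support varies by chip and runtime.

```python
# Hypothetical capability table: which operations each compute unit can run.
# Real hardware support differs by chip and runtime; these sets are illustrative.
UNIT_SUPPORT = {
    "npu": {"conv2d", "matmul", "relu"},
    "gpu": {"conv2d", "matmul", "relu", "softmax", "resize"},
}
PREFERENCE = ["npu", "gpu"]   # most specialized (assumed most efficient) first
FALLBACK = "cpu"              # the CPU is assumed able to run any operation


def place_ops(ops):
    """Assign each op to the first preferred unit that supports it, else the CPU."""
    placement = []
    for op in ops:
        unit = next((u for u in PREFERENCE if op in UNIT_SUPPORT[u]), FALLBACK)
        placement.append((op, unit))
    return placement


if __name__ == "__main__":
    for op, unit in place_ops(["conv2d", "softmax", "custom_nms"]):
        print(f"{op:>10} -> {unit}")
```

A real scheduler would also weigh the cost of moving data between units; this sketch ignores that and only captures the "specialized first, flexible last" ordering.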

Demetrios [00:19:33]: Yeah. You have all these knobs that give you these different trade offs. Right. And I imagine you've seen some surprising use cases or trade offs happening. What are some areas or, or different specific situations where you didn't think it was going to work or it was optimized in a certain way that you were a little bit surprised by?

Krishna Sridhar [00:19:58]: Oh, this is a good question. I would say universally over the past five years, every time I see some new features that are described in the hardware of the silicon, there's a certain expectation of what people are going to do with it. And I would say I've had a 0% success rate in predicting how people end up using it, because obviously people will end up using the things that you tell them to use. But then people always find creative ways to deploy stuff on your hardware. And that has been the case right from the very beginning of accelerated computing in this neural network space, because when you write the spec, you write it a certain way, but when a programmer is reading the spec, they read it differently and they do some pretty fascinating things that surprised me. And Transformers was a great example. I don't think the hardware necessarily was optimized for deploying transformers, but we've figured out ways to do it in software to optimize it in ways that I wasn't expecting.

Krishna Sridhar [00:21:10]: And you know, it's now, I would say an order of magnitude more efficient to deploy in software, primarily because of how we, you know, optimize the software for deploying these kinds of networks. So it happens all the time that people find more interesting and creative and innovative ways to deploy models on hardware.

Demetrios [00:21:33]: Okay, so we've got the spectrum of easy hard trade offs, Fast, slower or flexibility I guess is 1x axis, y axis type of thing. And we've also got the different ways that the chips are being optimized for certain scenarios. Whether it's a chip that's inside of a car and that's being optimized for that precision, or if it's a chip that's inside of a phone, it doesn't necessarily need that precision. The thing that I'm fascinated by is, as you are looking at on device AI and ML, what do you think some challenges are that we're hitting up against or some walls that maybe are hopefully we're going to get over in the next couple months or, or what are just some, some things that you feel like people are grappling with that they haven't had before. Because now, as you mentioned, things are just getting so good. So we're starting to push the limits in different ways that we didn't necessarily get to in the past.

Krishna Sridhar [00:22:51]: Yeah, this, this is a great question. It actually ties in nicely with your previous question as well. So the thing with, especially with these SLMs and LLMs, the things that at least I didn't expect was how quickly we'd run up against memory issues. So now it's not just about compute. There's also memory that you have to factor in. And now that's actually become the biggest bottleneck where the pace at which we can bring in larger models I'd say is somewhat correlated to what kind of memory technology innovations can be made, what kind of algorithmic innovations can be made to consume less memory. And that's become a bottleneck more than the raw compute per se. So that's one thing that's been different in the last, I'd say two years.

Krishna Sridhar [00:23:52]: And the second one, which is another fascinating constraint, is power efficiency or battery life for a lot of, you know, mobile compute. When I say mobile, I mean PC, I mean an industrial chip. I mean, you know, even a car battery life is pretty important and you don't want to be spending too much battery or power to do some of these things. So that's another area where you may, you know, you may have the raw horsepower to go, let's say at you know, 100 miles an hour, but you prefer to go at 30 miles an hour because it consumes less power. So the example I give is, hey, I have this LLM that can generate 50 tokens a second where I'm like I can't read 50 tokens a second. So you don't need to give me the information. At 50 tokens a second you can give to me at 10 tokens a second I can still barely read it, but it consumes a lot less battery life. So that's the, you know, the trade off sometimes you can choose to make.
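
The "50 tokens a second versus 10" trade-off above can be sketched as a pacing wrapper around a lazy token source. This is a minimal sketch under assumptions: the function name `paced` and its parameters are hypothetical, and it only spaces out token requests; any actual power saving would depend on the hardware idling between tokens, which this code does not model.

```python
import time
from typing import Callable, Iterable, Iterator


def paced(tokens: Iterable[str],
          tokens_per_second: float,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Yield tokens no faster than `tokens_per_second`.

    Because `tokens` is consumed lazily, a generator-based model upstream
    is only asked for the next token when it is due.
    """
    interval = 1.0 / tokens_per_second
    next_due = clock()
    for token in tokens:
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)          # wait until the next token is due
        yield token
        next_due += interval


if __name__ == "__main__":
    # Stand-in token stream; a real model would generate these one at a time.
    for tok in paced(iter("reading speed is the real limit".split()), 10.0):
        print(tok, end=" ", flush=True)
    print()
```

The `sleep` and `clock` parameters are injectable so the pacing logic can be exercised without real waiting.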

Krishna Sridhar [00:25:02]: And that's been pretty interesting to balance out over the last couple years. And the third thing, which I still think is the bottleneck, is just being able to do more high quality things with gen AI in a limited, compute-constrained setting. So I mean, obviously the dream is to have something like ChatGPT running entirely locally, but I would assume it still consumes maybe two orders of magnitude more compute than a constrained device can handle. So there's still a lot of innovation that has to get made to be able to compress that technology to smaller sizes while improving the compute. So that still is an ongoing exercise, and I expect that over the next couple years more and more of that will happen and you'll see more and more use cases that are fascinating and can happen locally and quickly and snappily on any device without needing an incredible amount of compute in the cloud to do everything.

Demetrios [00:26:20]: Yeah, and I wonder if we just move towards, like you were talking about with the photos and how there's 25 models that are working on a photo every time you take a photo, are there just going to be a whole ton of models that we can use that are small language models? Or I guess that might not be the ideal architecture or the ideal way to do it, to load up 20, 25 different models, and you just use LoRAs. But there's gotta be a way that we can take advantage of the small language models in that respect.

Krishna Sridhar [00:26:50]: I mean, the similar analogy I give is, especially with a lot of photo processing, it's not like they're 25 totally different models. They shared a common backbone and they had different heads. So there was a lot of shared compute. So that allows you to do a lot of things with efficient compute. So with LoRAs and things like that, it's a similar analogy. You've got to share compute. You've got to share not just compute, but also memory. As I said, that's become the biggest constraint, as well as space.

Krishna Sridhar [00:27:24]: So, let's say you go buy a smartphone, you don't want 30 gigabytes of that to be consumed by, you know, models. You want that to be, you know, maybe one or two gigabytes because you need space to store photos, store your memories, things like that. So, you know, space and memory, all of that factors in as well. So I'm sure we'll find more efficient ways to share compute and share memory and share, you know, space across use cases. Those I expect to happen.
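
The shared-backbone idea above can be sketched in a few lines: the expensive feature extraction runs once per input, and several cheap task-specific heads reuse its output. The `Backbone` class and the two heads below are toy stand-ins invented for illustration, not real camera models.

```python
class Backbone:
    """Toy stand-in for an expensive shared feature extractor."""

    def __init__(self):
        self.calls = 0  # counts how often the expensive part actually runs

    def __call__(self, pixels):
        self.calls += 1
        return {
            "mean": sum(pixels) / len(pixels),
            "max": max(pixels),
            "min": min(pixels),
        }


# Cheap task-specific heads that all consume the same shared features.
HEADS = {
    "brightness": lambda f: f["mean"],
    "contrast": lambda f: f["max"] - f["min"],
}


def run_heads(backbone, heads, pixels):
    """Run the backbone once, then every head on the shared features."""
    features = backbone(pixels)
    return {name: head(features) for name, head in heads.items()}


if __name__ == "__main__":
    backbone = Backbone()
    print(run_heads(backbone, HEADS, [10, 50, 90]))
    print("backbone calls:", backbone.calls)
```

Adding another head costs only that head's compute and parameters, which is the same sharing argument made above for LoRAs on top of a common base model.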

Demetrios [00:28:03]: Yeah, it's interesting you bring up memory because we had Bernie Wu on here from MemVerge a few months ago and he was very much talking about how we've come to this place where memory is the biggest constraint. And one thing that he mentioned he's trying to enable is the shared memory pools. But that works with a big GPU cluster. Right? That doesn't necessarily, or even potentially you get a bunch of CPUs together, I don't know. But you don't get a bunch of phones together to share memory. So it's almost like a hard out. You can't do that.

Krishna Sridhar [00:28:48]: Yeah, I mean, you. But you do have the option of having multiple, let's say multiple cores or multiple NPUs or multiple GPUs within the phone, and you can share memory within those. I mean, those kinds of architectures can be done as well. So. Yeah, I mean, those technologies I think are still really key for us to look into before we can do more cool things locally.

Demetrios [00:29:14]: Yeah, yeah. And it's also a great point that I don't want to buy my phone and half of my space is already taken up by all these small language models or Large language models, Semi large language models that are on there. And you have to buy the upgraded phone just so that you can have a little bit of space for your photos and your videos.

Krishna Sridhar [00:29:38]: Yeah, definitely. In fact, I think almost all the phones today come with at least 8 to 12 gigabytes of RAM, because otherwise you can't do LLMs. It's fascinating that they have to have more RAM just so they can do more AI on it. They don't necessarily need more RAM for, let's say, more regular applications; it's primarily because of AI that they're upgrading their RAM across all phones in the premium segment.
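
A rough back-of-the-envelope calculation shows why model weights press on both storage and RAM. The sketch below assumes a hypothetical 3-billion-parameter model (a figure chosen for illustration, not one mentioned in the conversation) and counts weights only; activations, caches, and runtime overhead are ignored.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model size used only for illustration.
NUM_PARAMS = 3_000_000_000

if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_footprint_gb(NUM_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {size:.1f} GB")
```

Under these assumptions, halving the bit width halves the footprint, which is why lower-precision weights matter as much for memory and storage as they do for compute.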

Demetrios [00:30:05]: Yeah. So talk to me a bit about what you're doing at, like, Qualcomm AI Hub, because I think there's some fun stuff there for the developers that want to develop on the edge and they want to put models onto the edge.

Krishna Sridhar [00:30:20]: Yeah, I can kind of summarize that very, very quickly. So, you know, for those who are unaware, Qualcomm's the largest manufacturer of silicon in the world. And we manufacture silicon for phones, for PCs, for cars, and, you know, industrial automation and IoT devices. So we manufacture silicon for a lot of power-constrained settings. And one of the things that we're trying to solve is to make it extremely easy for a developer, whether that be a manufacturer of these devices or cars or phones or PCs, or someone who's building an application on these devices. We want to make their job of bringing the latest and greatest AI innovations to our devices as easy as possible. And, you know, we have a little saying, which is they have to be able to do this within.

Krishna Sridhar [00:31:22]: Within five minutes, within five lines of code. So we built a system which allows people to, you know, as soon as they finish training, they can say, hey, this is my model. This is the device I want to run it on and go. And what our system does is it'll take the model, it'll translate it to run most efficiently on our neural processor. It'll optimize it. We actually even have physical devices in the cloud where we'll measure performance. Right there, we'll run it, we'll measure accuracy, and we'll give back a result in five minutes to the developer saying, okay, this model will run in 60 milliseconds on these devices. Here's how you should run it on the device.

Krishna Sridhar [00:32:09]: Here's the model for you to download and run. And these are the performance characteristics of the model. If you want to Tweak it some more. And here's a link so you can look at it and share it with your colleagues so that they also understand what you're deploying. And here's an automation for you to do this programmatically in your. In your system. So we have sort of automated the process of being able to deploy models on all our devices. And that, we believe is the key to more and more innovation and more and more iteration and more and more deployments of more complex things across all our different devices.

Krishna Sridhar [00:32:51]: So that, in a nutshell, is what Qualcomm AI hub is.

Demetrios [00:32:53]: Okay, so so many questions I have for you on this, because I'm assuming we have a lot of different generations of chips that are getting put into different devices. And so I'll think of, like, a random one, because this is just where my mind goes. A refrigerator that has chips in it. And I want to put some kind of a model on that. But the refrigerator might have lots of generations of chips within the AI hub. Does it also give you, hey, run it like this on this chip, run it like this on that chip.

Krishna Sridhar [00:33:33]: And we do one step better, which is we also have a configuration which allows you to deploy the same model on multiple generations.

Demetrios [00:33:41]: Oh, nice.

Krishna Sridhar [00:33:43]: And then we actually have a small little system on the device that detects which generation it is, knows what features are there in that particular generation, and efficiently maps the operations of the compute to that chip's generation accordingly. So we also allow developers, to go back to your example, to create one model that'll run differently on different generations of the refrigerators. Maybe on the latest generation, it'll run faster. On the older generation, it may run slower, but it'll still run. So that sort of flexibility is also something we can do.
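
As a rough illustration of the generation-aware mapping described above, here is a minimal Python sketch. It is not Qualcomm's on-device runtime: the generation names, the feature table, and the `plan_execution` helper are all hypothetical, chosen only to show how one model can be mapped to different compute units depending on what a given chip generation offers.

```python
from dataclasses import dataclass

# Hypothetical feature sets per chip generation (illustrative values only).
GENERATION_FEATURES = {
    "gen1": {"cpu"},
    "gen2": {"cpu", "gpu"},
    "gen3": {"cpu", "gpu", "npu"},
}

# Preferred compute units for this sketch, most efficient first.
PREFERENCE = ("npu", "gpu", "cpu")


@dataclass(frozen=True)
class Op:
    name: str
    supported_on: frozenset  # compute units able to execute this op


def plan_execution(ops, generation):
    """Map each op to the best compute unit available on this generation."""
    available = GENERATION_FEATURES[generation]
    plan = {}
    for op in ops:
        candidates = [
            unit for unit in PREFERENCE
            if unit in available and unit in op.supported_on
        ]
        if not candidates:
            raise RuntimeError(f"{op.name} cannot run on {generation}")
        plan[op.name] = candidates[0]
    return plan


# One model definition, planned differently per generation.
MODEL_OPS = [
    Op("conv", frozenset({"cpu", "gpu", "npu"})),
    Op("custom_nms", frozenset({"cpu"})),
]

for gen in ("gen1", "gen3"):
    print(gen, plan_execution(MODEL_OPS, gen))
```

The same op list yields an all-CPU plan on the oldest generation and moves the convolution to the NPU on the newest one, which mirrors the "slower on old chips, faster on new chips, but it still runs" behavior.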

Demetrios [00:34:27]: So we all know about the autonomous driving use cases, or just like chips inside of cars. We also know about chips inside of our phones because we play with them every day. What are some fun use cases that you've been seeing from developers deploying AI onto chips or in places that we wouldn't necessarily think of?

Krishna Sridhar [00:34:48]: Okay, I'll give you a couple of use cases that I think were pretty fascinating. I'll start with the mobile one because, you know, I thought it was a pretty fascinating scenario. So I'm a big, you know, cricket buff. I play in a recreational cricket league. You know, it's a pretty intense league. We play for eight hours on a weekend.

Demetrios [00:35:14]: Yeah.

Krishna Sridhar [00:35:14]: And one of my buddies once, you know, brings his phone and he connects it to a tripod and he starts recording the practice session that we have before our game. And he's like, hey, you know, I built this app that can do, you know, real time tracking. I don't know if you watch cricket or tennis, or I think it might be there in soccer too, but there's a system called Hawkeye, which can do real time ball tracking, real time, you know, prediction. So in tennis it tells you whether the ball was in or out. In cricket, it sort of tells you whether the ball's hitting the stumps, you know, and in soccer, I'm guessing it predicts whether the goal has gone in or not, whether enough of the ball has gone in or not.

Demetrios [00:36:05]: Or offsides maybe.

Krishna Sridhar [00:36:06]: Yeah, or offsides. Right. I mean, offsides is more like people tracking, but this is more like ball tracking. Right. So Hawkeye is the system which I think they charge something like $100,000 or something per international game. Per game. Wow. Just to do real time tracking.

Krishna Sridhar [00:36:22]: So he's like, hey, you know, I built this thing which can do it on your phone. You just plug in your phone and it can do real time tracking. It can do, you know, a lot of the important things that you want from a tracking system. It can do it just with the camera technology on your phone. You don't need sensors, don't need, you know, lidars, don't need anything. Just phone tracking. And he made it into a commercial app. And I think they got something like a hundred thousand downloads or something.

Krishna Sridhar [00:36:51]: Last I spoke to them, they were at a hundred thousand downloads a day because people started using it at home, you know, either for their little recreation games or street games or whatever. So I thought that was a pretty fascinating use case where they would do real time tracking. They would, you know, automatically take your, let's say, 30 minute practice session and clip it down into just the important bits. So to give the baseball analogy, it's like, you know, you just record your entire baseball session and they break it down into just the important clips where the important things are happening and everything else is clipped out. And they do all of this locally on the device, which I thought was pretty fascinating, that there's enough compute to do all of that on the device. Yeah, so that I thought was a pretty fascinating use case, even though it's on a mobile.

Krishna Sridhar [00:37:56]: So that's one. I'm trying to think of one on the PC. On the PC there were a bunch of really fascinating use cases I've seen around, you know, music and auto tuning and automatic generation of music, like auto DJing and things like that. I've seen stuff like that which is, you know, pretty fascinating. You listen to the music and you're like, oh my God, it's actually automatically DJing some music, generating some stuff in the middle. All of that again happening locally on the PC. I thought that's pretty fascinating. I actually thought the use case that Microsoft released, I don't know if you played around with it on the latest PCs, it's called Recall.

Krishna Sridhar [00:38:45]: That's pretty fascinating because it can do a lot of things locally and help you effectively do searches across pictures, across screenshots, across actual real time things that are happening on your computer, which I think is incredibly fascinating, that you can ask questions like, hey, I was in a meeting at noon, who are all the people in that meeting? And it can kind of answer that question by looking through stuff in your computer, you know, looking through stuff around your calendar, invitations, potentially screenshots of who's in the meeting, you know, like just fascinating things that can be done entirely locally. I've been super impressed with that particular feature. I definitely recommend people give that a spin. Then I'll throw in a few more that I've been really interested in. Actually, in gen AI, the thing that I've been the most impressed with has been the automatic code generation. That stuff's just getting so good it can practically finish your functions. Like the programmer productivity has just gone up to the point where when people start using those, they just can't go back. That's, I think, another interesting use case. In terms of actual devices, I think I've seen a lot of innovation in the space of physical security. It sounds pretty boring, but I think it's pretty interesting, where you can have a tiny little device that's plugged into your home network and it can pull in all the different cameras that might be there. And not just home, office too. It pulls in all the feeds from all the different cameras, analyzes them, and it can do this all in real time and then only send the important things back to the cloud.

Krishna Sridhar [00:40:55]: So that's also a pretty fascinating use case for physical devices in security where previously it required a lot of cloud compute. Now a lot of stuff's happening locally. I don't know if you've recently flown on airlines in Europe, but they have a ton of real time security stuff that's going on all locally on the device. You know, they compute your facial features and they do real time detection. They do a lot of things that are happening locally in the space of security. And you think about it, the number of cameras and systems that are there in an airport to be able to secure that. Not super straightforward. But again, as compute gets more efficient, it becomes more economical to do things like that.

Krishna Sridhar [00:41:47]: So in terms of new devices, I'm generally fascinated by things like in the physical security space.

Demetrios [00:41:53]: And when you say compute gets more efficient, it's just because we are able to only grab the important bits and do everything on device until something is flagged as, whoa, this is important. And it's able to create that processing on device that knows when something is important.

Krishna Sridhar [00:42:15]: That's correct. That's the part that I think is really cool, is you have a little, let's say a little, you know, SoC or little chip that's plugged into the camera that can do a ton of processing, and that just filters out all the stuff that's not important.
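
The "filter on the device, send only what matters to the cloud" pattern described above can be sketched in a few lines of Python. This is a toy under stated assumptions: the `Detection` record, the label set, and the 0.8 confidence threshold are invented for illustration and do not come from any particular camera product or SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Detection:
    camera_id: str
    label: str
    score: float  # model confidence in [0, 1]


def filter_important(
    detections: Iterable[Detection],
    labels_of_interest: set,
    min_score: float = 0.8,  # hypothetical threshold
) -> Iterator[Detection]:
    """Yield only the detections worth uploading; everything else stays local."""
    for det in detections:
        if det.label in labels_of_interest and det.score >= min_score:
            yield det


# Simulated output of an on-device model across several camera feeds.
local_results = [
    Detection("cam-1", "cat", 0.95),
    Detection("cam-2", "person", 0.91),
    Detection("cam-2", "person", 0.40),
    Detection("cam-3", "tree", 0.99),
]

to_upload = list(filter_important(local_results, {"person"}))
print(f"uploading {len(to_upload)} of {len(local_results)} detections")
```

Only the confident "person" detection leaves the device here; the rest of the stream is discarded locally, which is where the savings in bandwidth and cloud compute would come from.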

Demetrios [00:42:38]: Yeah, yeah. So going through the AI Hub again, I've got a model, I trained it in PyTorch, and I want to come and deploy it through the AI Hub onto one of these chips for my refrigerator so that it can scan and tell me if my spinach has gone bad or something like that. Maybe inside my refrigerator I've got a little camera and so it knows, all right, this spinach, get new spinach, and it sends me a text message or something like that. How do I go about deploying it on the AI Hub and making sure that it works on my refrigerator?

Krishna Sridhar [00:43:17]: Yeah. So, you know, you train your model in the cloud and once you're done training the model, you just use the Python API of AI Hub. You say, okay, these are the devices I want to deploy to, and we literally translate to all those devices. We actually have physical devices in the cloud that map to those SoCs. So we'll take the translated model, run it on those devices that we have in the cloud, tell you exactly what kind of latency you're going to get, what kind of accuracy you're going to get, and then give you back all those results in a few minutes. It's really as simple as that. There's certainly nothing more complicated than that. And what has been fascinating as a result of that is because it's such an automated system, it allows us to build on top of it to do more interesting things.

Krishna Sridhar [00:44:07]: And what we've been able to do is bring in a ton of integrations into the system. So we have an AWS integration, we have a bunch of third party integrations with companies like Dataloop. And what these integrations, let's say the AWS integration, allow us to do is let a developer fully automate the process. So they use AWS, launch a bunch of training jobs in their system, then use AI Hub, launch a bunch of optimization, compilation, and inference jobs on the physical devices, and you have the whole system automated where they can retrain a new model, deploy a new model, and everything is automated for them. So that is one cool integration that's happened by automating the process of deploying on the edge. And by doing that, the next cool thing that has happened is now we've done a whole bunch of partnerships with model makers. So these model makers, companies like Mistral and Microsoft for that matter, or even Meta, they typically release their models for the cloud and they partner with a lot of cloud vendors saying, hey, you can make our models available for inference in the cloud. Now they're also partnering with us to make all their models available locally on the device.

Krishna Sridhar [00:45:29]: So all of their models are available locally on the device. On Qualcomm AI Hub, developers can just go to AI Hub, see that, oh, Mistral's latest model is available, download the model and start using it. And this allows us to create an entire ecosystem around the automation of deploying on the edge.

Demetrios [00:45:49]: It's just one more step in the DAG, basically.

Krishna Sridhar [00:45:53]: Yeah, and we'll just keep building, building along.
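
To make the "one more step in the DAG" idea concrete, here is a small Python sketch that orders a train, compile, profile, validate, publish pipeline with the standard library's `graphlib` (Python 3.9+). The step names and the stub actions are assumptions made for illustration; in a real setup each action would call out to a training service or a device-cloud API rather than append to a log.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical pipeline).
PIPELINE = {
    "train": set(),
    "compile": {"train"},            # translate the model for the target device
    "profile": {"compile"},          # measure latency on a hosted device
    "inference_check": {"compile"},  # verify accuracy on a hosted device
    "publish": {"profile", "inference_check"},
}

log = []

# Stub actions: each one only records that its step ran.
ACTIONS = {step: (lambda s=step: log.append(s)) for step in PIPELINE}


def run_pipeline(dag, actions):
    """Run every step after all of its dependencies have completed."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        actions[step]()
    return order


order = run_pipeline(PIPELINE, ACTIONS)
print(" -> ".join(order))
```

The edge deployment steps slot in after training like any other node in the graph, so retraining a model automatically re-triggers compilation, profiling, and validation downstream.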

Demetrios [00:45:59]: We talked about optimizing for battery life earlier. I wonder, when I'm playing in the AI Hub, if I have those specific requirements of I need to optimize for X, how do I go about it?

Krishna Sridhar [00:46:12]: That's a great question. So, you know, we're at the point where we give the first sets of important information around what's actually happening on the device. So as I'd mentioned before, all of Qualcomm's SoCs are entirely heterogeneous, which means you have CPUs, you have GPUs, and you have neural processing units. So now the first thing you, as a developer, want to know is, hey, here's my model. Is it running on the CPU? Is it running on the GPU? Is it running on the neural processing unit? Is it, you know, partially on the neural processing unit? What fraction of it is on the neural processing unit? How can I get all of it on the neural processing unit? So that's the first step towards optimizing for more efficiency, because the more of it that lands on the neural processing unit, typically the more power efficient it is. And so what developers can do is play around with their models or work with us sometimes to improve the functionality of the neural processing unit from a software perspective so they can get more of their use cases running on the neural processing unit. So that allows them to really optimize for efficiency and get more of their models running as power efficiently as possible.
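
The question "what fraction of my model is on the neural processing unit?" can be answered from a per-layer profile. The Python sketch below assumes a simplified profile format, a list of `(layer_name, compute_unit)` pairs, which is an invention for this example rather than the actual report format of any profiler. It counts layers, which is only a proxy; weighting by per-layer execution time would be a reasonable refinement.

```python
from collections import Counter


def compute_unit_breakdown(layers):
    """Return the fraction of layers assigned to each compute unit."""
    counts = Counter(unit for _, unit in layers)
    total = sum(counts.values())
    return {unit: n / total for unit, n in counts.items()}


def layers_off_npu(layers):
    """List the layers that fell back to the CPU or GPU."""
    return [name for name, unit in layers if unit != "NPU"]


# Hypothetical per-layer placement for a small model.
LAYERS = [
    ("conv1", "NPU"),
    ("conv2", "NPU"),
    ("softmax", "NPU"),
    ("topk", "CPU"),
]

breakdown = compute_unit_breakdown(LAYERS)
fallbacks = layers_off_npu(LAYERS)
print(breakdown)
print("candidates to rework:", fallbacks)
```

The fallback list is the actionable part: those are the layers a developer would try to replace or restructure so that more of the model lands on the NPU.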

Demetrios [00:47:30]: And I'm not sure if I fully understood. You're packaging it all up there, and then are you also deploying it onto the devices? How does it actually look? Or do you package it up and then I grab it and then I go and I deploy it on device?

Krishna Sridhar [00:47:48]: It's more the latter. We package it all up and then you download that package and then it's super easy for you to deploy that on your device locally. So what we've done is we've automated all the gnarly stuff around getting this model, translating it, measuring performance. All of that we've automated. So all you need to do is you train your model. You're like, okay, I want to deploy this, and then you get a little package and then you deploy that. It's just super straightforward. And we have all kinds of examples to allow you to get that package and deploy it too.

Demetrios [00:48:24]: Yeah, so then it takes away a lot of those questions of like, oh, do I need to bake this model into a Docker container? What's going to be the best optimization here so that I can optimize for this chip? And yes, I can see the value in that.

Krishna Sridhar [00:48:43]: All of that's been automated away. And to showcase the real breadth of it, what we've done is we've taken, I'd say, about the 150 most popular models across all different use cases. Speech, audio, video, text, vision, small language models, AI generation, everything. And we've made all of the recipes for deploying all of those models available publicly. So if you go to the website, you can find, you know, 150 different recipes of how to deploy all of these models in the most efficient ways across all of our different chips. And all of this has been fully automated.

Demetrios [00:49:18]: Nice. Now here's the real test. Is it only Qualcomm chips or have you branched out too?

Krishna Sridhar [00:49:28]: Great question. So, you know, in the mobile and the PC space, you typically deploy in a multi chip environment. If you're a developer, let's say, you know, I gave you that example of the developer that made this ball tracking application. They're shipping this model on different Android phones, on iPhones, on all kinds of different phones. Right. So at least in the Android ecosystem, the role we play as a leader in the Android ecosystem is to make sure that these kinds of applications can deploy their models across various different SoCs in the Android space. So the actual artifact that we provide to developers can actually be deployed on other Android phones that don't have a Qualcomm SoC. They'll run on the CPUs and the GPUs, but they run pretty efficiently on those non Qualcomm SoC phones.

Krishna Sridhar [00:50:36]: And the reason that's important is we want to make it super easy for an application developer to build one model and deploy that on any Android phone. And our viewpoint is it'll run fast on an Android phone, but if it has a Qualcomm SoC, it'll run really fast. And that's kind of our take on it is like we're not making it run slow on other things, we're making it run faster on Qualcomm. So we do make sure that from a compatibility perspective, especially in Android and PC, we are compatible with all the standards, we're compatible with the rest of the community. And these models can be deployed on other SoCs, but they run super fast on the Qualcomm SoCs. That's our sort of philosophy.
