MLOps Community

AWS Trainium and Inferentia

Posted Jun 04, 2024 | Views 1.1K
# Accelerators
# AWS Trainium
# AWS Inferentia
# aws.amazon.com
SPEAKERS
Matthew McClean
Head of Customer Engineering, Annapurna ML @ AWS

Leads the Annapurna Labs Solution Architecture and Prototyping teams helping customers train and deploy their Generative AI models with AWS Trainium and AWS Inferentia

Kamran Khan
Head of Business Development and GTM, Annapurna ML @ AWS

Helping developers and users achieve their AI performance and cost goals for almost 2 decades.

SUMMARY

Unlock unparalleled performance and cost savings with AWS Trainium and Inferentia! These purpose-built AI accelerators offer MLOps community members enhanced availability, compute elasticity, and energy efficiency. They integrate seamlessly with PyTorch, JAX, and Hugging Face, and enjoy robust support from industry leaders like W&B, Anyscale, and Outerbounds. Fully compatible with AWS services like Amazon SageMaker, they make getting started easier than ever. Elevate your AI game with AWS Trainium and Inferentia!

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Matthew McClean [00:00:01]: My name is Matt McClean, and I lead the Annapurna customer engineering team. My team essentially helps customers adopt AWS Trainium and Inferentia for their workloads. I work for AWS; I've actually been working for AWS for more than eleven years now, and I've been working in the Annapurna and AWS Trainium team for a couple of years. And how I take my coffee? Well, my wife's Italian, so I prefer to take it in Italy, at a café, with Italian coffee.

Kamran Khan [00:00:29]: Hello, my name is Kamran Khan. I lead our business development and go-to-market teams around Inferentia and Trainium. And I like all types of coffee. Generally, at home I'll make myself a flat white, and on the go I like a variety of light to medium roast drip coffees.

Demetrios [00:00:51]: Welcome back to the MLOps Community podcast. I am your host, Demetrios. And we're back with another banger today. Today we're talking to Matt and Kamran about Inferentia and Trainium, these two AWS offerings that were demystified for me over the next 40 minutes. I got the deep dive on what exactly they're good for, what they're not so good for, how they work, and what they work with. If you're using SageMaker, do they work with that? If you're using Bedrock, how can you utilize them? Why would I even utilize them instead of the common GPUs? And what are their strengths? They really break down why they created both of these. Basically, they said, you know what? We want a chip that is tailor-made for deep learning workloads. And you know how Amazon and AWS like to do it. They just went out there and made it happen. They turned it into a reality.

Demetrios [00:01:54]: We talk with Kamran and Matt all about it. If you liked this episode, as always, share it with a friend. We would love to hear from you in the Spotify feedback section. That's always great. I get a nice little chuckle when people send in funny comments; I read every one of them. Let's get into it. I'll see you on the other side with Matt and Kamran. All right, so, Matt, I gotta call it out first.

Demetrios [00:02:25]: You were in the community in 2021. I was just looking at our conversations in Slack from three years ago. It's awesome to finally get to have you here on a podcast, and it's almost like we get a two-for-one today with Kamran too. We want to talk all about AWS's Inferentia and Trainium. It's probably a good place to start with just giving us a rundown on what it is and why AWS has invested in something like this.

Kamran Khan [00:02:56]: Yeah, absolutely. So at AWS we've been investing in different types of accelerators and use cases for customers for a very long time, right? And as the AI revolution has really taken hold, I think the interesting part is that most consumers, most people, really picked up on AI with the introduction of ChatGPT last year, right? Where they're like, oh man, look what AI can do for me, look what these large language models can do for me. But those of us in this community have been seeing this trend for the last 10 to 15 years, really since ResNet started showing the potential of what models can do and how they can start providing superhuman responses to simple applications like computer vision and classification, and now with language, with video, with images and so much more. So because of this, AWS has been looking at this space and asking, well, how can we offer customers more choice, higher performance, and of course lower cost, and make it easier and more accessible to more users and developers and customers? And so we started investing in our own generation of purpose-built accelerators, our AI chips called Inferentia and Trainium. We launched our first Inferentia AI chips back in 2019. Inferentia 1 focused on inference acceleration, and it was really good for CNN applications like YOLOs and ResNets and image classifiers. And then, time is flying, right.

Kamran Khan [00:04:37]: In 2022 we introduced Trainium. And Trainium really was designed as the successor to Inferentia in some regards. It was our second ML chip. It brought a lot of modernizations and really focused on LLMs and generative AI, and how to accelerate and distribute large models across multiple accelerators and make it very efficient and cost efficient as well. And then last year, we introduced Inferentia 2, the successor to Inferentia. Now we're building this family of accelerators basically to provide users choice, give them the opportunity to get higher performance and lower cost, and make it easier and more accessible for more users. We want to continue to invest in these types of accelerators. Trainium 2 was announced at last re:Invent by Adam.

Kamran Khan [00:05:29]: It's going to be available later this year, and it's going to come with 4x higher performance than Trainium 1. And we're pretty excited about what that's going to open up, and the possibilities that's going to open up for a lot of users as well.

Demetrios [00:05:41]: So I love that you're talking about choice and bringing more choices, because Amazon is kind of known for that, especially AWS. They're known for all the different choices. And I want this conversation to really be around when we would use it, why we would use it, and what the strong points are, the weak points and whatnot. Of course, the first question that comes to my mind really is, how does this compare to a GPU? How is it different? How does it match up, and which GPUs should we compare against? There are a lot of unanswered questions that I can imagine people sitting at home or in their cars are listening and thinking: okay, cool. So Trainium and Inferentia, it's like GPUs, but not.

Matthew McClean [00:06:35]: Yeah, so we've designed Trainium and Inferentia really to be a super specialized accelerator for deep learning applications. GPUs originally were graphics processing units, right? So they were never designed for deep learning workloads, though obviously they're a great fit. But we decided to design something really specific to deep learning. I guess the key difference comparing a GPU and our hardware is that on a GPU you have a lot of these streaming multiprocessors, thousands of cores and tensor cores. Essentially, what is a deep learning application, and why do you need an accelerator? It comes down to a couple of core things. One, you need to accelerate matrix multiplications, the linear algebra calculations. That forms more than 90% of the operations, or the TFLOPs. This is for both training and inference.

Matthew McClean [00:07:39]: You need something because a standard CPU is just not fast enough; a GPU is good for that. And what we provide in our hardware is what we call a tensor engine. It's actually a systolic array that can really accelerate these matrix multiplications. It can do, for example, a 128 by 512 matrix multiplication in a single clock cycle. So it's really designed to accelerate matrix multiplications. The other part that accelerators need is really high-bandwidth memory. In this respect, they're quite similar.

Matthew McClean [00:08:13]: So we also have HBM, and it's really designed to take the weights, gradients, and so forth that are stored there and really quickly move them into the cache, move them into the accelerator to do the computation. So that's the other key part. The main difference, I would say, is the way the computation is done.

Demetrios [00:08:36]: Okay. And that's through the tensor engine. I like that. And I can imagine people are going to want to know, like actual numbers and benchmarks. And so we'll leave a link to a blog post. But do you have any numbers off the top of your head?

Kamran Khan [00:08:52]: Yeah, some numbers off the top of my head. Of course, I think the best place is going to be our documentation. We actually publish a range of different benchmarks and hardware specs as well, so you can learn exactly what the TFLOPs are. But for training applications, we like to think about the effective TFLOPs you can achieve while training, say, large language models. And so we like to look at what our effective TFLOPs per dollar are.

Kamran Khan [00:09:22]: And when we look at that and compare it to traditional accelerators on the market on AWS today, we're able to lower the cost to train models by up to 46% using Trn1, even against the more common or latest generation of other accelerators available on AWS. So it's very competitive. And at the same time, on the inference side, we see a very similar story as well, because of the conscious decisions we made in our hardware, really specializing in just these machine learning workloads rather than trying to be general purpose. For example, we get this question quite a bit: can I run crypto algorithms on Inferentia and Trainium, or can I do graphics simulation on Inferentia and Trainium? Going back to your earlier point of, I think it's like a GPU, so can I run other workloads that require that level of compute power? And the answer is no, we can't actually run crypto algorithms, and we can't run simulation workloads.

Kamran Khan [00:10:27]: And it's not that the hardware necessarily can't support it. It's that we haven't invested in the software stack to enable them, because we want to be specialized, we want to be focused, and that gives us the ability to really hyper-tune: up to 50% lower cost to train models, and at the same time, when you're deploying models, higher performance and up to 40% lower deployment cost as well, using Inferentia 2.

Demetrios [00:10:51]: I like this idea of, all right, you're looking at the price of training or the price of inference, and you're seeing that it's going down like 46%, which is not a small number. And I like that number, especially if I'm thinking about spending a lot of cash. If it's 46% less, that's a good chunk of change right there. And the next obvious question I have, just as someone who is not super familiar with Trainium or Inferentia, is: do I still use CUDA? What do I work with Inferentia and Trainium on?

Matthew McClean [00:11:35]: Yeah, so we don't support CUDA. CUDA is a library very specific to GPUs, and specifically to NVIDIA GPUs. What we have instead is our own SDK, the Neuron SDK. And what that comprises, essentially, think of it as three blocks. First we have the framework integration. So for example, we work with PyTorch, we also work with TensorFlow, typically for inference, and we're very soon going to have JAX support. Basically, the first part is the tie-in.

Matthew McClean [00:12:08]: Basically what we want to do is allow our users to stay in the ML framework that they prefer and minimize the code changes they need to make to use our purpose-built hardware. That's the first part of the SDK. Then we have the compiler itself. Our compiler stack is based on XLA. XLA was originally part of the Google TensorFlow project but is now an open source initiative, and we're one of the founding members of the OpenXLA initiative. There we work with Google, we work with others like Meta, NVIDIA, Apple, and other companies on evolving the XLA standard.

Matthew McClean [00:12:47]: That defines all the operators that get lowered in the compilation flow, and it uses an intermediate representation called StableHLO to define all those operators. Our compiler takes the HLO and then compiles the computational graph, the forward pass, backward pass, and optimization step, all into low-level machine code. We call that the Neuron Executable File Format, or NEFF. Basically, that runs the specific commands on the different engines in our accelerator. Then these NEFFs get run on the accelerator by a runtime. The runtime executes these compiled artifacts and also does a lot of the networking. When you're doing a distributed inference or training job, you have all these collective communications you need between the different accelerators, and the runtime handles that too.

Matthew McClean [00:13:42]: So with all these components, you get end-to-end coverage of our software stack and you're able to run your PyTorch code on our accelerators.
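To make that compilation flow concrete, here is a minimal sketch of the inference path, assuming torch_neuronx is installed on an Inf2 or Trn1 instance; the toy model and input shape are placeholders:

import torch
import torch_neuronx
from torch import nn

# Placeholder model standing in for whatever you normally build in PyTorch.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
example_input = torch.rand(1, 128)

# trace() runs the XLA-to-NEFF compilation flow described above and returns a
# module backed by a compiled artifact that executes on the NeuronCores.
neuron_model = torch_neuronx.trace(model, example_input)

# The compiled module is used like any other torch.nn.Module, and it can be
# saved and re-loaded without recompiling.
output = neuron_model(example_input)
torch.jit.save(neuron_model, "model_neuron.pt")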

Demetrios [00:13:53]: So if I'm understanding this correctly, you're really trying to allow people to use PyTorch or TensorFlow or whatever they're used to using. And if you want, you can just stop there. But if you want to go deeper, you can.

Kamran Khan [00:14:08]: Through Neuron. Yeah, you can.

Matthew McClean [00:14:11]: There's a few different ways. Most users would just use standard PyTorch and then compile their model; for PyTorch we use the XLA version of PyTorch that is also used by Google when running on TPUs. Most users can just keep writing their code in PyTorch, and it will work fine. If they want to have more control, then there are a couple of different ways people can get better performance. For example, if there's a certain operator that we don't support and they want to implement it, they can actually write their own custom C++ operators. One of the engines in our accelerator is a SIMD engine, so you can write C++ code to execute custom operators, say a PyTorch operator, and that will run on the accelerator. In a couple of months, we're going to have a new kernel interface.

Matthew McClean [00:15:05]: It's called the Neuron Kernel Interface, or NKI for short. If you're familiar with Triton, there are many customers that want to further optimize performance and want to do what we call tile-level computations, right? So we reduce the amount of memory transfers between HBM and SRAM. If you're familiar with, for example, the FlashAttention algorithm, that is the core of how FlashAttention works: it operates in SRAM rather than going back and forth to HBM. So you can implement various algorithms like FlashAttention, like Mamba, using NKI, which will be launching very soon. And that's another way that customers can get further performance, have more control over.

Kamran Khan [00:15:50]: Yeah.

Matthew McClean [00:15:50]: How the code is run on our accelerators.
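To ground the "stay in PyTorch" path described above, here is a minimal training-loop sketch using the standard PyTorch/XLA idiom that the Neuron SDK plugs into; it assumes torch-neuronx and torch-xla are installed on a Trn1 instance so the XLA device maps to a NeuronCore, and the model, data, and hyperparameters are placeholders:

import torch
import torch_xla.core.xla_model as xm
from torch import nn

device = xm.xla_device()  # the NeuronCore exposed as an XLA device
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.rand(32, 128).to(device)          # placeholder batch
    y = torch.randint(0, 10, (32,)).to(device)  # placeholder labels

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # mark_step() cuts the lazily built graph; this is the point where the
    # XLA/HLO graph is compiled to a NEFF and executed on the accelerator.
    xm.mark_step()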

Demetrios [00:15:53]: And is it only like, do they have to go together? Are they married, or joined at the hip type of thing?

Kamran Khan [00:16:02]: Or.

Demetrios [00:16:03]: If I use Trainium, then I can use whatever else I want, and I don't necessarily need to be stuck in with Inferentia.

Kamran Khan [00:16:10]: Yeah, that's actually a really important one, because we get that question all the time. Let's say you're starting with Trainium: you train and fine-tune your models, starting from, let's say, Meta's Llama 3 model, which is a great choice to start, and then you fine-tune it and it's working great for your application. Are you locked in? The answer is no. Our philosophy is to try to be part of the heterogeneous world. Working on and building with Inferentia and Trainium and the software stacks around them, we know that users and developers are going to be pulling and mashing together different libraries and environments that will be best for them, maybe not for everyone, but best for them. The idea is we want to work within that world. We know it's going to be a heterogeneous compute environment where you're going to be utilizing CPUs for certain tasks, GPUs for certain tasks, and Inferentia and Trainium for certain tasks.

Kamran Khan [00:17:12]: So when you're training a model with Trainium, you can deploy that model anywhere. Or I should say that the entry and exit, let's say you're fine-tuning a model, is going to be a checkpoint, right? You could pick up Meta's checkpoints off Hugging Face, for example, and fine-tune them with your data sets for your application space using Trainium; the output is just going to be another checkpoint as defined by PyTorch. And then you can deploy that anywhere. Inferentia is a great option, but you could pick up any GPU, or embed that model if it's for an on-prem solution or an embedded device. There are a lot of wearables and a lot of different versions now; we're seeing a lot of the AI wearables with what Humane is doing, with what other solutions are doing. And vice versa as well: you could take models that are pre-trained or fine-tuned on Trainium and then deploy them to Inferentia seamlessly, just by taking those checkpoints.
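The checkpoint portability described above can be illustrated with a short, hedged sketch using the Hugging Face transformers API; the model name and paths here are placeholders, with a small public model standing in for a Llama 3 base:

from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "distilgpt2"  # stand-in for a base model pulled from the Hub
model = AutoModelForCausalLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# ... fine-tune on Trainium, e.g. via the PyTorch/XLA loop sketched earlier ...

# The result is an ordinary checkpoint on disk.
model.save_pretrained("./finetuned-checkpoint")
tokenizer.save_pretrained("./finetuned-checkpoint")

# Any backend (Inferentia, a GPU host, or an on-prem box) reloads it with the
# same API; only the runtime/compiler underneath changes.
served_model = AutoModelForCausalLM.from_pretrained("./finetuned-checkpoint")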

Demetrios [00:18:15]: So basically that's just one part of the whole workflow, right? If you're trying to go and do something with deep learning these days, or AI, machine learning, whatever you want to call it, you've got all the other pieces around it. And I instantly think about how this works with SageMaker. So that's probably where we could start. But then I want to know, okay, what if I'm running Kubernetes? And what if I have Ray? How does this fit with all those different pieces of the puzzle? And so with SageMaker, that's probably the easiest one to start with, because I think a lot of people in the community and listening are using SageMaker. And then maybe we can go to Bedrock, and we can go to just EC2 or whatever it may be.

Matthew McClean [00:19:09]: Yeah, I think it always starts with the team.

Kamran Khan [00:19:12]: Right.

Matthew McClean [00:19:12]: Responsible for either training or deploying the models, right. And what is their, for example, level of competency? What do they know today? What is their company policy or standard that they have to adopt? Because a lot of companies will standardize on a certain technology platform, and teams are constrained to adhere to that policy. And I see it as a spectrum of ease of use and control, right. So taking into account where you are as a team on that spectrum of ease of use versus control is kind of where you land in terms of what is the best choice. Obviously, SageMaker is an excellent AWS service, offers full lifecycle management, and it has full support for Inferentia and Trainium. So, for example, there are actually two options.

Matthew McClean [00:20:05]: On the training side, we have the standard, sort of ephemeral training jobs that many of the listeners may be familiar with. Using the SageMaker SDK or through Studio, you can launch a training job. It'll spin up the instances in the background and manage the infrastructure. So that's one option. We also have support for a new service that was launched at the end of last year, SageMaker HyperPod. If customers are familiar with a Slurm interface and more of a fixed cluster, maybe they want to have many different jobs or long-running training jobs, so they don't want them to be ephemeral and want a dedicated cluster, then HyperPod is a nice solution there.

Matthew McClean [00:20:47]: And it handles all of the failover as well, essentially managing the infrastructure when you have a hardware issue and replacing the hardware for you. So those are the two options in SageMaker. And then on the inference side, we also have endpoint support. You can deploy to Inferentia, or you can even deploy to Trainium, because Trainium has a few more accelerators and more HBM memory. So for a large model, say Llama 3 70B, sometimes Trainium makes a better instance choice for doing your inference of those models. That's also supported in the SageMaker endpoint service.
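As a rough illustration of that endpoint path, here is a hedged sketch using the SageMaker Python SDK; the S3 model location is hypothetical, the framework versions are assumptions to check against the currently supported combinations, and PyTorchModel or a custom container would work the same way:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker

model = HuggingFaceModel(
    model_data="s3://my-bucket/finetuned-checkpoint/model.tar.gz",  # hypothetical location
    role=role,
    transformers_version="4.36",  # assumed version combination; verify against
    pytorch_version="2.1",        # the SageMaker SDK's supported images
    py_version="py310",
)

# ml.inf2.* instance types put the endpoint on Inferentia 2; ml.trn1.* would
# target Trainium for very large models, as discussed above.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)

print(predictor.predict({"inputs": "Hello from Inferentia"}))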

Demetrios [00:21:25]: Oh, interesting. Yeah, that is another piece to it. Like Inferentia basically can take whatever you throw at it.

Kamran Khan [00:21:34]: Yeah.

Matthew McClean [00:21:34]: So maybe I can explain the difference between Inferentia and Trainium. Essentially, at the accelerator level, think of them as basically the same. Each accelerator chip has two NeuronCores, so you can think of a NeuronCore a little bit like a GPU. Inferentia can be deployed in more regions and has lower power requirements because it has fewer chips. In Trainium we have 16 accelerators, whereas in Inferentia we have twelve. There's also the network connectivity between the instances: we're using EFA networking between the instances. In Trainium we have up to 1600 gigabits per second of network connectivity, whereas with Inferentia, typically when you're doing inference, you don't need this high-speed, high-bandwidth networking connectivity between the instances.

Matthew McClean [00:22:25]: So it's only 100. Inferentia is also a cheaper instance option, but you can actually train on Inferentia, which is interesting, and then you can actually do inference on Trainium. So yeah, I know it's a little bit confusing.

Demetrios [00:22:40]: Just to make it easy for everybody. But are there people doing a lot of that? Just because you can doesn't mean you should, right? Are there people that are doing it?

Matthew McClean [00:22:51]: So there are. As I mentioned, for customers who want to deploy large LLMs, like Llama 3 70B, when you're getting towards the 100-billion-parameter models, Trainium is actually a better choice.

Demetrios [00:23:07]: Okay. Okay, fascinating to think about that. Now what about with just an EC2 instance? Yeah.

Matthew McClean [00:23:14]: So on EC2 there are many different options. We have a lot of customers who have standardized on using, for example, Kubernetes as their platform for doing either inference or training. So we have full support. We've even developed Kubernetes plugins which help with the resource management of what we call Neuron devices. For example, when you configure your Kubernetes pod, you can say, okay, I want to have four Neuron devices, and there's a plugin that will help manage reserving and allocating those resources. So that's a very popular option.

Matthew McClean [00:23:55]: We have customers using ParallelCluster. ParallelCluster is essentially for customers who like using a Slurm interface; it's an open source offering targeted at HPC and machine learning workloads, as I said, through a Slurm interface. So that is also a popular option for training your large models. And then we always have good collaborations with partners. We have support in Ray: we've worked with the Anyscale team and the Ray open source community, so, for example, you can deploy your models with Ray Serve and also use Ray Train. Just recently we've announced support for Ray Train.

Matthew McClean [00:24:30]: So you can use their API and use Ray as a way of managing your training or inference workloads, and Outerbounds as well. I'm sure many people are fans of the Metaflow open source project; we had a recent blog, and they've actually added support for Trainium and Inferentia in the Metaflow platform. So that's also another great option customers like.
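For the Ray integration mentioned above, a minimal, hedged sketch of requesting Neuron accelerators for a task might look like the following; the "neuron_cores" resource name is an assumption about how the cluster advertises NeuronCores, and the task body is a placeholder:

import ray

ray.init()  # assumes a Ray cluster (or a single Trn1/Inf2 node) is already running

@ray.remote(resources={"neuron_cores": 2})  # assumed resource key; adjust to your cluster
def run_inference(prompt: str) -> str:
    # Inside the task you would load a Neuron-compiled model, for example the
    # torch_neuronx-traced module sketched earlier, and run it here.
    return f"processed: {prompt}"

print(ray.get(run_inference.remote("Hello from Ray on Neuron")))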

Demetrios [00:24:54]: Nice. And last but not least, because there is such a rich ecosystem, I am wondering about Bedrock. And maybe you guys can clarify too, when you've seen people using Bedrock versus SageMaker, why, and what the use cases are there?

Kamran Khan [00:25:14]: I think there are many, right? And going back to the beginning, AWS is about offering customers choice. When we think about it at AWS, we have this concept of the generative AI stack. At the top you're going to have the services Amazon is building, like Transcribe, like Translate, things like this, that are end-to-end services where the models are integrated, the user experience is integrated, and it's just an API or service that you use. It makes it really simple for developers to integrate, but it provides you less customizability, right? You get the performance you get, you get the feature set you get, and over time it improves. The next layer down that we've now added is the Bedrock level, which gives you a little bit more customizability. As more models have become popular, the way to consume those models is models as a service, and that's where Bedrock really sits, right? So if you want to get access to the greatest models being produced today, like from Anthropic or Mistral or AI21 and many others, and even Amazon with our Titan series of models, you can very quickly get access to those and other open source models and then just consume them as an API and pay for them per token, tokens per dollar, right? A very simple use case there. And now you can even fine-tune certain open source models as well.

Kamran Khan [00:26:46]: Kind of customize them to your liking as well.

Demetrios [00:26:49]: Are you fine-tuning them with Trainium?

Kamran Khan [00:26:51]: Yeah. So Bedrock heavily utilizes Inferentia, but also other accelerator types, right? It uses not only Trainium, it uses Inferentia to build and serve these models across the board, basically all of the resources AWS has at its disposal to engage with customers. And I think Bedrock makes sense for a lot of folks, and it depends where your ML expertise is, right? If it's like, hey, we're just trying to build a really quick prototype, or we're just at the application level and we don't have a lot of folks who understand how to train models, because, as easy as it's getting in today's world, it still requires some experience to do it efficiently or to even build the services around them, then Bedrock is a great answer for that. But if you're going down the stack a little bit, where you want to customize the interaction of models, you really want to own all of the data around the model and how it's being deployed and where it's being deployed. Certain applications are very sensitive to data residency and lineage.

Kamran Khan [00:27:54]: Right? Like, where is the data coming from to train the model? Where are the servers running everything? It can't leave our borders. And this is probably more important for European countries, which have really strict data privacy laws.

Demetrios [00:28:05]: I love it.

Kamran Khan [00:28:06]: So for that, SageMaker, Inferentia, Trainium, or EC2 are going to be better options. It gives you that same kind of ease of use but greater flexibility over where you're deploying your models, how you're setting them up, which libraries they're integrating with, and what other data they're commingling with. And if you're building a broader pipeline of services, you can co-locate your accelerators into one region to improve the overall throughput and latency. So there are a lot of benefits, and it's really just about what the user's objectives are, what the total application, the end application, looks like, and what's the right fit and the right choice for them. And actually, I would say it's not an either-or decision, right? What we see a lot with the customers we work with is that they'll use services from Bedrock.

Kamran Khan [00:28:58]: They'll be using one model from Bedrock. Let's say they're like, all right, I want to use one of the Claude models from Bedrock. I also want to integrate with ChatGPT's APIs, right? And so they have part of their services coming from there. And then, hey, we're going to be running all our Llama 3 models, the 70-billion-parameter Llama 3 models, on Inferentia and Trainium, right?

Kamran Khan [00:29:21]: It gives you that choice. I can get the right models for the right part of my application and build the best experiences for users and customers. Yeah.

Demetrios [00:29:33]: And that in a way makes a ton of sense, just because you're giving so much resilience to your application too by having that. And so I fully understand Bedrock now. It's like, all right, if you just want that API, that's Bedrock. It's that OpenAI API experience, in a way, where you get a model, you call the API, and you're good. But if you want to go deeper and have much more control, that's where you're going into the SageMaker territory, or EC2, depending. There are levels and layers of how much control you need and want, and that determines which service you're going to use, depending on what you're looking for.

Kamran Khan [00:30:13]: Yeah, I mean, if Bedrock meets the application needs, like you're getting the right tokens per second, the right latencies, and it has the right models that fit your application and needs, it's a great option, right, because it's super simple. But if you need to peel that layer back and go one layer down, that's where hosting your models comes in, starting even with JumpStart, because the experience is very similar in terms of getting started, or with the Hugging Face TGI containers. You can deploy any model pretty easily, but you have greater control, greater flexibility. You can tune the variables that matter the most to the user experience that you're trying to generate for your customers. So we get that question quite a bit as well.

Matthew McClean [00:31:02]: Right.

Kamran Khan [00:31:02]: Which is, what's better, Bedrock or SageMaker or EC2? And I don't think one is necessarily better. It's just, what is the use case? What are the things that are most important to you and your applications?

Demetrios [00:31:16]: Yeah, and I do like this idea too, that maybe you're going to have a different piece of your application using different parts or different services depending on what your needs are. But also I can imagine that you get different teams using different services too, because of their needs. And so, looking at it from the point of view of, hey, if this super simple API gets you there, then no need to overcomplicate it with a bunch of services. Just use the API.

Kamran Khan [00:31:50]: Yeah.

Matthew McClean [00:31:50]: If customers are unsure, I think always start at the easiest-to-use service and then only really go down once you find, based on your application or context, that you need to, because going down is going to mean a lot more costs and a lot more things to manage.

Demetrios [00:32:08]: Yeah, once you hit the limits, then go deeper. But until you hit those limits, don't feel like you need to start at level hard.

Kamran Khan [00:32:17]: That's the funny part. That mindset you just mentioned, start at level hard, right? I feel like a year ago, two years ago, when we were working with customers and users, it was always like, I have to train my own foundational model, right? And I have to do it all from scratch, and I have to set up my data and curate my data sets. It was always starting at hard, because there was an unwillingness to use open source to a certain degree in enterprise environments, or it was like, oh, we have to own everything. And I think that mindset has really changed with users, especially in the enterprise space. Like, no, no, no, I can start with open source. I can start with a Llama model or a Mistral model or a Zephyr or something else, or a Stable Diffusion model, and I can fine-tune it with my own data set and do it much more cost effectively and easily, with less manpower on our side, and just get to market a lot quicker. I think the idea now is how quickly can we get it done, rather than, hey, let's own the entire thing. No, let's leverage as much from the community as possible and contribute to the community along the way.

Demetrios [00:33:29]: Okay, so you're talking a lot about what you've seen out there in the wild. I think this is a good point to transition into actual ways you've been seeing people use Trainium and Inferentia, and what some examples of that are.

Matthew McClean [00:33:45]: One of the key partners and customers that we're working with is Anthropic. Towards the end of 2023, about September 2023, we announced an agreement with them. So they are initially deploying Claude models onto Inferentia and Trainium, and these are actually going to go into Bedrock, as Kamran was mentioning earlier. So we're in the process of moving those across. And Kamran also mentioned Trainium 2 earlier. Trainium 2 is going to be launched very soon, towards the middle to end of 2024.

Matthew McClean [00:34:19]: So yeah, Anthropic are going to be training their next-gen models on Trainium. They're a really exciting customer to work with, and yeah, we look forward to seeing those models on Bedrock.

Kamran Khan [00:34:31]: And we've been working with a range of different startups and enterprise customers, customers bringing AI into their traditional applications right now, leveraging LLMs to improve the experiences and make them more user friendly. And on the startup side specifically, we're starting to see the emergence of new, complete products. One of those areas, which I think is gaining a lot of momentum in the market, is the concept of AI agents, or autonomous agents, made very personalized to your workflows, to your company, or to yourself specifically. One of these companies we've been working with just recently launched a new service called, I think it's called MyNinja, but their company name is NinjaTech. They're based here in the Bay Area, and they've been leveraging Trainium and Inferentia to fine-tune their models, to deploy them at scale, and to serve all of their model needs. One of the things that's interesting about them is that when they were starting out, they started with integrations from all of the models-as-a-service offerings, of course, as we talked about. But really quickly they're like, you know what, we can create highly customized, very efficient models.

Kamran Khan [00:35:50]: Starting from open source models, we're going to fine-tune those models specifically for our use cases, which include code assistants, researchers, basic information gathering and reporting, and also scheduling tasks, and outperform some of the most expensive foundational models that are trained from scratch. And so with Trainium and Inferentia, they're basically able to utilize Llama 3 models to serve their customers. They're using a range of them, all the way from the small 8-billion-parameter models, Code Llama models kind of in the middle, around 30 billion parameters, and all the way up to the 70-billion-parameter models, and they're able to reduce their inference and deployment costs by 80% by utilizing Trainium to serve these models. Another example: more customers want to build more customized models as well, so we've been working with others who are trying to create better tools in this space. Fine-tuning models can be challenging at a certain point. There are a lot of different techniques, from SFT to TRL-style reinforcement learning to how you implement PEFT: what are the right techniques to not overfit your model and make sure it's as efficient as possible? So we've been working with another startup called Arcee AI, and they're trying to optimize this space where users have data with particular sensitivities around privacy and security. They can set up training environments inside customers' domains, like in their AWS private clusters, and do end-to-end fine-tuning of models and create really highly specialized models at a fraction of the cost. So they're utilizing Trainium to accelerate this and seeing up to 90% cost reduction, utilizing both their domain experience, which is a combination of highly efficient LLM fine-tuning, and also a concept that they've, I don't know about pioneered, but really leaned into, which is small model training. So as everyone's trying to go to larger models, 70 billion, hundreds of billions of parameters, which can increase the abilities of these models, they're also thinking, well, how do I take a bunch of small models, less than 5-billion or even less than 3-billion-parameter models, create a lot of specialized small models, and merge them together into a final model? So they have this model merging technology as well, and they're utilizing Trainium to lower the cost of building all of these models even further and deploying them for a wider range of use cases.

Kamran Khan [00:38:38]: So, great solutions across the board. You can definitely check out Ninja's solutions, they're live now, and then Arcee, at Arcee AI, as well.

Demetrios [00:38:50]: Yeah, huge shout-out to Arcee. I'm actually friends with one of their founders, Brian Benedict.

Kamran Khan [00:38:55]: Oh yeah, Brian's great.

Demetrios [00:38:57]: I love that guy. I love what they're doing. The model merging is fascinating. And there's another guy in the community, Maximilian, who is doing a lot on that front too. It feels like model merging is something that is getting very popular, and the small language models theme, or narrative, is also bubbling up quite a bit. I imagine you guys have been seeing it a ton too, because all in all, if you can go smaller and still accomplish your goal, then it's going to be much better to be smaller, because it's faster and it's cheaper.

Kamran Khan [00:39:43]: And I think that's the hidden cost that people don't realize. They're like, well, okay, fine, I can train my model and it's a one-time expenditure. So I'll bite that bullet, I'll train my 70-billion-parameter or 100-billion-parameter model, or maybe more, and it costs me X dollars. But now, if you want to serve it and keep that real-time, low-latency, high-response engagement with your user to actually use that model, it can be very expensive to do so. Inferentia and Trainium help reduce those costs, but it is still more expensive. So if you can accomplish the same quality of model at a fraction of the model size, your deployment options open up. You can use smaller instance sizes, less compute, and you can scale further without investing more in compute cost and operating cost.

Kamran Khan [00:40:35]: I think there's a lot of potential here. I think we're optimizing, or the industry is optimizing, on two fronts. One is how do we make models more capable and expand their use cases, so that's going to be super large models. But at the other end of it, how do we make small models more efficient? And so I think we're seeing both of those trends. And Arcee is kind of across the board, but really, I'm excited to see what they're doing with small models and model merging.

Demetrios [00:41:04]: Have you guys seen agents that have been working well in production?

Kamran Khan [00:41:10]: Yeah, absolutely. Actually, Ninja is one of those companies. They're using a few different models that are all running on Inferentia and Trainium to serve their agents. It's a great solution as well. But I think we're going to see a lot more agents, and Amazon launched their Q products, which are also trying to help in this space. Personally, I think we're going to see agents infiltrate all aspects of our lives as we move forward. I just saw that Google is replacing Google Assistant on all of their Android phones and tablets with Gemini, right? So they made that announcement as well.

Kamran Khan [00:41:47]: So I think agents are in our future across the board, right? But I think the differentiator will be good, high-quality agents. Not just, let me ask you the weather, or who won the basketball game last night, but very specific and personal. And it's going to require being able to deploy and fine-tune these models in more of a reinforcement-learning-like way, where they're constantly learning about you and your habits, and then implementing that or updating the models as fast as possible.

Demetrios [00:42:19]: Excellent, fellas. Well, this has been awesome. I appreciate you all coming on here and talking to me about this, because I know there's a lot of confusion and people just want to use it. Just now I was looking through the MLOps Community Slack and somebody's asking about GPU resources in AWS, and what's this and what's that? So I guess the last question I've got for you: I imagine there's some kind of a chart that you all have, as far as, well, if you're looking for this type of GPU, you'll probably be good with this instance of Inferentia or Trainium. Does that exist? And if not, can we make it?

Matthew McClean [00:42:57]: I would say a good starting point is our documentation. We have the Neuron SDK documentation, and I assume we can put a link in the show notes. There we have a lot of information, for example tutorials and getting started guides, and we even have a performance table of all the most popular open source models, such as Llama 2, Llama 3, Mistral, Stable Diffusion models, and BERT, which instances they can be deployed to, and what kind of performance you can get. We have different combinations, right? Depending on what people are optimizing for: if they really need, say, low latency, you probably want to use more accelerators, or maybe optimize for costs and use fewer. So yeah, we have many different options there.

Matthew McClean [00:43:41]: So I encourage folks to check out our documentation and, yeah, give it a try.

Kamran Khan [00:43:48]: Awesome.

Demetrios [00:43:49]: I love it, guys. Thanks for doing this. It feels like I've learned a ton all about Inferentia and Trainium. And I especially like the fact that they're not joined at the hip. You get to choose; it's choose your own adventure when it comes to this. The 46% cost saving is huge. Then there's also the idea of start simple and go down the stack as need be. When it gets more complicated, or you find you're hitting your limits, then you can go from Bedrock to SageMaker to EC2 or whatever it may be. But it's cool to know that all along the way, if you're utilizing Trainium or Inferentia on any of these services, you can keep utilizing them.

Demetrios [00:44:45]: And it's not like, oh, this only works with Bedrock. So it's nice to see that. That's the Amazon way, I guess: you probably wouldn't be able to ship something if it could only work with one of these services. Enough people have written about that; books have been written on that subject. But that's not for today. I'll let you guys go. This was awesome.

Matthew McClean [00:45:06]: Thanks.

Kamran Khan [00:45:07]: Thanks very much. It was fun.

Demetrios [00:45:10]: The value.
