MLOps Community

Productionalizing AI: Driving Innovation with Cost-Effective Strategies

Posted Nov 06, 2024
# Trainium and Inferentia
# AI Workflows
# AWS
speakers
Scott Perry
Principal Solutions Architect, Annapurna ML @ AWS

Scott Perry is a Principal Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. Prior to joining AWS in 2017, Scott worked in a variety of solution architecture and consulting roles, spanning data center infrastructure, telecommunications, and computational genomics research.

Eddie Mattia
Data Scientist @ Outerbounds

Eddie is a data scientist with experience in building and scaling data-driven solutions. From wrestling SLURM clusters in grad school to dealing with massive cloud operations at startups and big companies alike, Eddie has years of experience building solutions to customers' data science and cloud infrastructure challenges.

Ben Epstein
SUMMARY

Successfully deploying AI applications into production requires a strategic approach that prioritizes cost efficiency without compromising performance. In this one-hour mini-summit, we'll explore how to optimize costs across the key elements of AI development and deployment. Discover how AWS AI chips, Trainium and Inferentia, offer high-performance, cost-effective compute solutions for training and deploying foundation models.

Learn how Outerbounds' platform streamlines AI workflows and makes the most of underlying compute resources, ensuring efficient and cost-effective development.

Gain insights into the latest advancements in cost-efficient AI production and learn how to drive innovation while minimizing expenses.

TRANSCRIPT

Ben Epstein [00:00:06]: Okay, we are live at another MLOps meetup. We have Scott and Eddie with us here. I think it's going to be a good session, maybe a slightly shorter session than normal, but we'll have both folks on again. Yeah, welcome on, guys. Thanks for taking the time. Why don't you both give a little intro.

Ben Epstein [00:00:25]: We can do a little bit of chatting, understand what the talk is about today, and then we'll jump into the talks.

Eddie Mattia [00:00:32]: I guess I'll start here. My name is Eddie. I work at a startup called Outerbounds, which maintains open source software called Metaflow that many MLOps practitioners and data scientists are familiar with. My talk today is going to be essentially about how we integrated Metaflow with AWS Trainium and the software stack surrounding it. And I was lucky to work closely with Scott's team on this. So maybe that's a nice segue, Scott.

Scott Perry [00:00:59]: For your intro for sure. Thanks, Eddie.

Ben Epstein [00:01:01]: Hey everybody.

Scott Perry [00:01:01]: My name is Scott Perry. I'm one of the solutions architects on the Annapurna Machine Learning Accelerator team here at AWS. We deal specifically with our custom ML chips, Trainium and Inferentia, and I basically spend my days helping customers onboard and optimize their models to work with our chips. So I'm super excited to be here.

Ben Epstein [00:01:18]: That's awesome. I'm actually a big fan of Metaflow. I've used it in the past. We've had a bunch of talks, even with Jacopo a couple months ago, who did his big "You Don't Need a Bigger Boat" session and leveraged Metaflow for part of that training. So I'm stoked; he went into a lot of depth. But I mean, it's always great to hear from the people who built it.

Ben Epstein [00:01:36]: Do you want to kick us off with that talk first?

Eddie Mattia [00:01:39]: Yeah, let's do it. Awesome.

Ben Epstein [00:01:41]: All righty, Scott, I'll take you off and we'll jump back in in a minute.

Scott Perry [00:01:44]: Sounds good.

Eddie Mattia [00:01:45]: Okay. So the title of the talk is Develop and Train Large Models Cost-Efficiently with Metaflow and AWS Trainium. The diagram you see here, if you're curious to read about this, we actually published a full blog post explaining the details in more depth than I'll go into in this talk, so definitely check that out. And if you want to take a look at the code base this is all based on, we essentially built an examples repository showing how to do things like just getting set up, making sure you can test the Trainium infrastructure and get your distributed training environments working with Metaflow, as well as, of course, fine-tuning examples with Hugging Face and Trainium and a pre-training Llama 2 example. If you want to check all this stuff out, go to this QR code; it's also linked throughout the blog post. So I want to start with just a general slide on the state of AI.

Eddie Mattia [00:02:39]: We use this diagram that's from a blog post. I don't even know if it's appropriate to call it a blog post, but there's this brilliant writer named Stewart Brand who wrote about this a couple decades back, discussing the complexity of how societies evolve, and especially relating it to business and product questions. And it all comes down to this idea of pace layering: basically, how do these complex systems learn and keep learning? The idea is that there are these concentric rings, and they interact in surprising but also somewhat predictable ways, especially as new technology evolves. So when we're thinking about Metaflow, MLOps, and connecting them to some of these new resources, we need to deal with the emerging AI problems that developers have, things like really big compute, which is where AWS Trainium comes in. One really simplified, probably oversimplified, way to invoke the pace layering idea is that we want to build an AI and machine learning infrastructure stack that lets us get access to the opportunities at this commercial layer.

Eddie Mattia [00:03:42]: For most businesses, that's kind of how they're thinking about AI, but that requires following this meandering path at the top of these concentric rings, where we have what's fashionable. There are all these different trends and ways to train LLMs, there are different architectures, there are even people researching what happens if we get rid of the transformer. There are all these different paths to follow at the top. But what we're talking about is really how we build that infrastructure layer, how it looks, and how we can make it robust no matter what's happening at these outer layers. So before I get into those details, I'm going to show a quick video, since we have some extra time now. I put together a sped-up video of what it looks like to actually deploy this stack, so I'll explain it here and then use it as we go throughout the presentation.

Eddie Mattia [00:04:34]: What you're seeing right now is kind of a vanilla AWS experience, if you're familiar with any cloud engineering: we're just taking a generic CloudFormation template for Metaflow and we're going to deploy that onto AWS. Metaflow can be deployed on any cloud or on premises, and you can use different kinds of job queues and compute providers. In this case we're going to be using AWS Batch. These CloudFormation templates are already pre-configured to let Metaflow and all the services it needs to run interact with AWS Batch. The second piece that we're seeing here in the video is programmatically creating the Trainium resources. So we're going to create this compute environment inside of AWS, link those job queues into Metaflow automatically, and basically, once we've deployed these two things, we'll see these two stacks appear in CloudFormation.

Eddie Mattia [00:05:29]: And I think the rest of this video is just kind of boring, waiting for all the stuff to turn green. Once we have all of these creation events completed, we're basically going to have a Metaflow stack, which is kind of a general MLOps framework if you're not familiar with it, and we're going to have that linked into a Batch environment that's ready to run jobs on Trainium in a distributed computing environment. Sorry, let me exit this video. There are more videos on how to configure Metaflow for your specific deployment, and then of course those examples that I mentioned earlier; you can pretty much imagine how to extrapolate this to any of the Neuron SDK examples you find on the Internet. So essentially, zooming out, let's look at the components of this infrastructure stack that I spoke about very hand-wavingly on the pace layering slide, the thing that we want to be robust as things change dramatically at the modeling layer, the top layers of the stack, over the next decade or so. This is kind of a common stack that we use at Outerbounds, a conceptual organization, if you will, of how things happen in the machine learning lifecycle.

Eddie Mattia [00:06:39]: And essentially what I'm proposing on this slide is that really nothing changes at these conceptual levels when we're dealing with LLMs and all the massive compute requirements, massive data requirements, things like this. At the end of the day you still need data at the foundational layer of the stack. This can be a data lake in S3, it could be a warehouse provider like Snowflake built on top of S3, but it's needed to store a whole bunch of stuff, whether you're training scikit-learn models or using the Neuron SDK to train state-of-the-art PyTorch models that are ready to compile down to Trainium and Inferentia devices. So we store things like what Metaflow needs to do its job and handle all of the MLOps plumbing. We need to be able to store and retrieve the different training, evaluation, and inference datasets, things like this, and ideally do that in a way that's rather efficient; it can be a big vector of cost if not done carefully. Then of course we also have the models, which, when dealing with Trainium devices, are going to be slightly different than the normal PyTorch models we might be used to. So making sure that we have infrastructure that's robust to all of these different situations becomes very important, and potentially extremely useful in driving down costs as we compare the different compute services that we want to use on the back end. When it comes to compute, Metaflow helps us link it to the different runtimes where we want to access this data very easily. Essentially we get this declarative paradigm inside of our Python code, where Metaflow can let you dispatch a job to Kubernetes or to AWS Batch, as we see in this case, and then dynamically configure things like how many CPUs we need, how many Inferentia or Trainium devices we need, what kind of Docker image we might need to run this environment and put software dependencies inside of it, and, as we mentioned earlier, wiring it up to the AWS Batch job queue where we actually have access to these Trainium devices in AWS.
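To make that concrete, here is a minimal sketch of the pattern Eddie describes: a Metaflow step annotated so the work is dispatched to an AWS Batch queue backed by Trn1 instances. The Trainium-specific decorator argument comes from the metaflow-trainium integration he references, and the image and queue names are hypothetical, so treat the details as assumptions rather than a reference implementation.

```python
from metaflow import FlowSpec, batch, step


class TrainiumHelloFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Ask AWS Batch for a Trainium-backed job: the trainium= argument is assumed here
    # (from the metaflow-trainium extension), along with host memory, a Neuron-enabled
    # image, and the job queue wired up by the CloudFormation templates in the video.
    @batch(
        trainium=16,                                  # 16 chips = 32 NeuronCores on a trn1.32xlarge
        memory=480_000,
        image="my-registry/neuron-pytorch:latest",    # hypothetical Neuron deep learning image
        queue="trainium-job-queue",                   # hypothetical Batch job queue name
    )
    @step
    def train(self):
        # Inside the container, the Neuron SDK exposes the chips as XLA devices.
        import torch_xla.core.xla_model as xm
        print("NeuronCore visible as XLA device:", xm.xla_device())
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    TrainiumHelloFlow()
```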

Eddie Mattia [00:08:40]: So with Metaflow you can annotate any Python function that's in a workflow with these kinds of decorators, and you get the ability to run Python jobs inside of your cloud environments or your on-prem devices. Of course, if you're dealing with Trainium, that's going to be in AWS, but the idea is that we need to be able to connect these different compute environments to all of the things we talked about at the data layer and all these other layers. Another thing that we built as part of the integration with Trainium: Trainium has a very nice monitoring stack. If you're getting started, or haven't used it yet, neuron-monitor is the name of the command line interface. It's very handy and can give you a lot of really important information as you're learning whether your model is behaving as you expect, and then, of course, optimizing as you move forward and move towards production. We built a tool that essentially allows you to slap on another one of these decorators and run neuron-monitor at whatever frequency you want. Then, inside the Metaflow UI, you'll get these nice plots that show, over the lifecycle of your function, was I utilizing these Trainium cores. NeuronCore is actually the correct way to say it.
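Mechanically, the monitoring hook Eddie describes can be thought of as a small sidecar around the neuron-monitor CLI, which periodically emits JSON snapshots of NeuronCore and memory utilization. A minimal sketch of that idea follows; the decorator in the metaflow-trainium integration wraps something similar, and the JSON field layout varies by Neuron SDK version, so the key handling here is deliberately generic.

```python
import json
import subprocess

# neuron-monitor ships with the Neuron SDK tools on Trn1/Inf2 instances.
# By default it writes one JSON document per reporting period to stdout.
proc = subprocess.Popen(["neuron-monitor"], stdout=subprocess.PIPE, text=True)

try:
    for line in proc.stdout:
        snapshot = json.loads(line)
        # The exact schema is version-dependent; just surface the top-level sections
        # (runtime data, system data, etc.) rather than assuming field names.
        print("snapshot sections:", sorted(snapshot.keys()))
except KeyboardInterrupt:
    proc.terminate()
```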

Eddie Mattia [00:09:50]: Like, how well were my functions actually using those devices? On the orchestration side, this is a very nebulous word in some ways, I think. Of course, it means very precise things to some people, but a lot of people use it to describe many different processes in ML. What I'm really talking about here is having a way to stitch together these different functions. Maybe some of them run locally, some of them run in the cloud. Each is going to have a different set of dependencies that it needs in terms of code, data, maybe model checkpoints, and orchestrating all these things and turning them into schedulable and triggerable units of compute is, in other words, one of the main jobs that Metaflow, and Outerbounds on the enterprise side, does for our customers. So in the workflow that's in the GitHub repo, for example, we basically have a config file where we set a bunch of parameters: Hugging Face datasets, what we're actually going to set for the hyperparameters of the PyTorch model and the Neuron SDK, parameters related to caching, and many other different optimizations.

Eddie Mattia [00:10:55]: And then the compute environment itself we're also including in this example. The idea is that Metaflow is then going to expose these very simple commands, so you could put them inside of a Streamlit dashboard if you wanted something even higher level. But the level of interaction, once you've written the workflows, is going to be pretty high level, and you can take these workloads and package them up to run wherever your target is. Maybe it's on AWS Step Functions, maybe it's on Airflow. In this case, Argo Workflows is the way people tend to do it on Outerbounds, where we often use Kubernetes as a compute platform. Another way that we think about orchestration is being able to scale things up and down smoothly.
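The "very simple commands" here are Metaflow's standard CLI entry points; shown below as comments against the hypothetical flow file from the earlier sketch.

```python
# Run locally right now; steps carrying the @batch decorator are dispatched to AWS Batch:
#   python trainium_hello_flow.py run
#
# Compile and deploy the same flow to an orchestrator as a schedulable, triggerable workflow:
#   python trainium_hello_flow.py argo-workflows create
#   python trainium_hello_flow.py step-functions create
#
# (Airflow is also a supported compile target in recent Metaflow releases.)
```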

Eddie Mattia [00:11:40]: So say you want to just test, does my Trainium compute environment work at all? Maybe you want to use the smaller Trainium devices in that case, where the cost is going to be quite a bit lower. I don't remember the latest prices exactly, but probably a factor of 20x between the small Trainium instances and the full big ones that you need for, say, a pre-training run of a model. There's a lot to say about all the dimensions of scale, but I don't have too much time, so I want to keep moving through the Metaflow piece here, and then versioning, moving up the stack further. So we're moving towards these artifacts that a data scientist might be thinking about, where a lot of the trendiness in AI is happening. One way that we think of versioning is the different runs of each of these workflows. Maybe they run inside of a CI job, maybe a developer is running them from their laptop or cloud workstation; Metaflow, and therefore Outerbounds, is going to be versioning each of the runs of these workflows by default. This is really, really helpful in cases like using Trainium, where usually the jobs are very, very large.
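That default versioning is queryable after the fact through Metaflow's client API. A minimal sketch follows; the flow name comes from the hypothetical example above and the artifact name is illustrative, and nothing here is specific to Trainium.

```python
from metaflow import Flow

# Every execution of a flow is versioned and addressable by ID.
run = Flow("TrainiumHelloFlow").latest_successful_run
print(run.id, run.created_at)

# Drill into the steps and tasks of that run to see what happened and when.
for flow_step in run:
    for task in flow_step:
        print(flow_step.id, task.finished_at, task.successful)

# Artifacts assigned to `self` inside a step are stored per run and can be pulled back later,
# e.g. a checkpoint path the training step saved (hypothetical artifact name):
# checkpoint_path = run.data.checkpoint_path
```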

Eddie Mattia [00:12:43]: And oftentimes we want to be able to get a perfect view of what happened in all of our past runs and make sure everything is happening as expected when we launch a job. Of course, another aspect of versioning is tying it into GitHub, so you can see how we structured all of the code in our GitHub repo. In terms of deployment, you can think of it a couple of different ways. You can think about deploying the workflow itself, as we talked about earlier; that might be on Argo Workflows or AWS Step Functions. And in terms of deploying the actual models produced inside these workflows, AWS has a lot of nice blogs around the Internet. There are a lot of good examples popping up around how you would connect these workflows that produce checkpoints via the Neuron SDK and then run them on the Inferentia EC2 instances, which have the same NeuronCores, well, not exactly the same but similar NeuronCores, to the Trainium EC2 instances that we use for training in the repo.

Eddie Mattia [00:13:46]: As I mentioned, there are a lot of different ways to think about deployment, and we didn't include this one in the repo because it's an Outerbounds-only thing. But again, the idea is that serving these apps with a Streamlit dashboard becomes pretty trivial. Making sure that you can connect all of the infrastructure pieces, like Inferentia and Trainium devices, to these sorts of APIs that are easy for data scientists to use, that we're familiar with, is critical. On the final piece of the stack, we have modeling, and this is where data scientists like myself tend to spend a lot of our time. We're not necessarily thinking about the infrastructure, like how our Snowflake database is set up; we're thinking about how we build models on top of that data, which is usually our main job. This is where the Neuron SDK comes in super handy, and it's the segue I wanted to set up for Scott's talk, where I'm guessing he'll go into much more detail on what these Trainium devices are and how they connect to the software stack. But essentially my impression of the Neuron SDK, working with it for the past six to nine months, has been that there's an extremely deep and diverse library of examples. They have all the different kinds of encoders and decoders, vision models, multimodal models, lots of different stuff, and they're adding to these example libraries very frequently, so definitely check that out.

Eddie Mattia [00:15:16]: And yeah, I think this layer is where there's always a lot of activity, and the AWS team is doing a great job of staying up to date on the latest trends and adding those examples back into the Neuron SDK. Making sure that you have a way to connect that to the rest of the systems your organization uses to operationalize software is really key going forward if you want these investments to pay off. So that's my spiel for today. Super appreciative of the MLOps community and of Ben for setting all this up and organizing, and I'll definitely stick around for the Q&A. But I also wanted to mention that the Outerbounds team will be very active this fall, so if you're at any of the events you see on the screen here, definitely come find us. Thanks.

Ben Epstein [00:16:03]: Which event are you most excited for?

Eddie Mattia [00:16:07]: Good question. I had a very good time at the GenAI and MLOps World Summit last year in Austin, so I'm very excited for that. I'll be there about 10 days. If anyone's around, I'm very involved.

Ben Epstein [00:16:20]: Yeah, I'm kind of interested to see what you guys are talking about at the PyTorch Summit.

Eddie Mattia [00:16:26]: Yeah, well, unfortunately, that event already happened, so this slide was for my previous one. There are a lot of features related to what I just talked about; the way the Neuron SDK is being used inside Metaflow was a lot of the stuff we talked about at PyTorch.

Ben Epstein [00:16:44]: No, super cool. Okay, let's bring on Scott to go a little bit deeper. I mean, it's a great follow-on, like you said, Eddie. Walking through it, seeing how Metaflow gets deployed, now how does it actually work under the hood? How are these things getting orchestrated and scheduled? So, Scott, you want to jump right in?

Scott Perry [00:17:04]: For sure, for sure. I just want to say thank you, Eddie, for all the integration work you did there, integrating Metaflow with Trainium and the Neuron SDK. We really appreciate it, and you did that in record time, so thanks so much.

Eddie Mattia [00:17:15]: Yeah, if anyone gets a chance to work with Scott's team, do it. They're awesome and they make it very easy.

Scott Perry [00:17:19]: Thank you. Super cool. So, hey everybody. At this point we're going to pivot into enabling high-performance generative AI workloads using our AWS-designed ML chips. Okay. So if you're not familiar with the generative AI stack at AWS, we actually have different services and different levels of entry points depending on your use case. Right. So maybe you're more of a business-focused user: you want to start taking advantage of LLMs and foundation models, but maybe you don't have the technical depth or the time to really get into it.

Scott Perry [00:17:47]: You can take advantage of Amazon Q. So basically you bring your data, you don't have to write a lot of code, you configure the tool, and you're up and running in, say, a day or two. A very quick time to market without having to invest in the technical depth and expertise that you might need with some of the other levels. If you are a bit more advanced, you're used to working with LLMs, and you want to integrate some LLMs or foundation models into your existing tooling or applications, we also have Amazon Bedrock. You can think of this as a platform that basically gives you hosted LLMs and foundation models, both for inference and training, as well as some guardrails and agents and customization capabilities around them. So it's a great place to get started if you do have some technical depth in the area but you don't want to focus so much on the underlying infrastructure and customization under the hood. But for the more advanced users that need that ultimate flexibility and level of control over the infrastructure, we do have the lower level of the stack, which is the true infrastructure.

Scott Perry [00:18:43]: So you can think of this as the EC2 instances that maybe have GPUs or the ML chips that we design, Trainium and Inferentia, things like SageMaker, or maybe EC2 capabilities like UltraClusters or EFA networking. My talk today is going to focus more on this bottom level of the stack, the infrastructure itself, which is where our chips, Trainium and Inferentia, and the Neuron SDK live. Okay. So, working at AWS we have the luxury of getting to talk to a lot of customers, and during the deep learning craze a few years ago we were hearing from a lot of people that wanted to take advantage of deep learning workloads in the cloud, but the current offerings at the time in some ways weren't meeting their expectations and left a lot to be desired. Some of the common things people were asking for: basically, they wanted the best performance they could get in the cloud at the best cost. Right? Pretty obvious. They wanted the services and offerings to be easy to use, because obviously if there's a lot of friction involved in trying to adopt one of these services, it's going to be a non-starter for a lot of people.

Scott Perry [00:19:45]: In this recent world, a lot of people are also concerned with energy efficiency under the hood, right? They care if their model is fast, but they also don't want to be unnecessarily contributing pollution or extra energy usage if they don't need to. And then of course they need to have their services available when they need them. So with these sorts of themes in mind, a few years ago AWS embarked on designing and building out a custom set of ML chips designed specifically for these types of workloads. Back in 2019 we launched the first ML chip at AWS, called AWS Inferentia. This chip is targeted, obviously, at inference, at predictions, and back then it was focused on the smaller deep learning models that existed at the time, things like ResNet or YOLO, for example. And with Inferentia back in 2019, launching this first chip and this first instance type, we were able to achieve up to 70% lower cost per inference than comparable instances at the time, which is pretty cool. So we saw good momentum there. We learned a lot from that endeavor, and we have since launched Inferentia 2, the follow-up version of that chip, back in 2023.

Scott Perry [00:20:51]: And this is obviously targeted at the more recent models: larger models, transformer-type models and diffuser models, sort of the generative AI type workloads. We took a lot of the lessons learned from the original work on Inferentia 1 and were able to apply them to Inferentia 2. With this offering, we were able to see up to 40% better price performance than comparable EC2 instances at the time. And in 2022, we actually launched AWS Trainium. This was our first training-specific ML chip designed completely in house here at AWS. It's got a similar architecture to Inferentia 2, but it's targeted more at those large-scale distributed training workloads.

Scott Perry [00:21:30]: And with Trainium, we're able to drive up to 50% savings on training costs compared to comparable EC2 instances. So I did want to take a minute and focus a little bit on the Inferentia 2 and Trainium architecture and explain some of the components here. I won't spend too, too much time on this, but I think it's important because when people hear that we've been designing and launching our own chips at AWS, they sort of assume that this is just AWS's version of a GPU. And that's definitely not the case. So if we take a look here, we've got a schematic of each chip: Inferentia 2 on the left and Trainium on the right. At a high level, you can see that the architectures are very, very similar, but when you drill in, I think you'll see that these are tailored specifically to the types of workloads that we're trying to address here.

Scott Perry [00:22:13]: So we'll take a look at Inferentia 2 here. At a high level, first off, we've got these two NeuronCores, and a NeuronCore within this architecture is the basic level of compute that's user-addressable, right? So if you were doing distributed inference or distributed training, for example, you might shard your model across each of these two NeuronCores in a chip. But within the NeuronCores, you can see that this is not just a general-purpose processor. We've actually got these targeted engines that do very specific things, which map onto the types of workloads that we're trying to address. First off, we have the tensor engine. If you think about deep learning and generative AI workloads, a lot of the compute behind the scenes is actually matrix multiplies and matrix-type operations. Those are handled by this tensor engine, which is actually powered by a systolic array. Okay.

Scott Perry [00:23:03]: We've also got dedicated engines for vector operations, like batch normalization, for example, and a scalar engine to handle things like activation functions. And we have a set of general-purpose SIMD processors embedded within these cores as well. These are general-purpose processors, but we can actually write code and apply it to these processors at any point. So if there's a new operator that comes up tomorrow, for example, that we didn't know about when these chips were first launched, even users can write C code, apply it, and run it on these cores, which have access to the on-chip memory just like the other engines. So in a way it future-proofs these architectures for operators that we don't know are around the corner. We also have HBM memory within these chips, 32 gigs of memory per chip, and we have a collective communications engine here. The benefit of this is that the chips can actually do compute and collectives at the same time.

Scott Perry [00:23:59]: Right. So when you're done with compute, at some point you usually have to run collectives to maybe reduce across your set of workers. By being able to overlap these operations, you can actually improve your throughput for both training and inference. And down below we have these NeuronLink capabilities. This is a high-bandwidth, low-latency link between chips within an instance, so in the distributed setting this helps improve performance quite a bit. So hopefully you can see that this is, you know, obviously not a GPU. Right.

Scott Perry [00:24:29]: Like, there's been a lot of thoughtful design put into these chips to make sure that we're actually able to address the deep learning and generative AI workloads themselves. So again, not a general-purpose chip. We are seeing some customers, say if you're doing a lot of matrix multiplies in a non-machine-learning workload, you might still be able to use our chips, but we're laser focused on deep learning and generative AI. So these chips, as I mentioned before, we started launching them in 2019, and since then we've had quite a bit of momentum, both internal and external to AWS. You can see a lot of big names, big logos. We're super excited to see that customers are really interested in adopting Trainium and Inferentia. And with our latest offerings, I mean, the level of demand has just been insane.

Scott Perry [00:25:13]: So everybody's keeping us pretty busy. And within the Amazon family of companies as well, we have a lot of use cases that involve machine learning, so we've seen pretty good uptake internally too. So it's great to have chips that are high performance, and there are lots of advantages to being able to design these chips in house, but customers need to be able to adopt them. At the end of the day, we know our customer stacks are varied and nobody's using an identical stack. Everybody wants to customize and use the tooling that they prefer, and they don't want to change that just to adopt a new technology. So we've put a lot of effort into ensuring that we have good ecosystem support.

Scott Perry [00:25:51]: Right. So that includes things like the machine learning frameworks, PyTorch and JAX primarily; we still have some customers using TensorFlow as well. We were a founding member of the OpenXLA initiative; our stack actually makes good use of XLA under the hood. And then there are other ecosystem and third-party players like Hugging Face, for example. We have great integration with Hugging Face Transformers and the Trainer API via a collaborative project called Optimum Neuron. This is a library written largely by Hugging Face that allows you to continue using Transformers and the Trainer API, which you're probably used to already, but under the hood take advantage of Trainium and Inferentia. And of course Metaflow.

Scott Perry [00:26:30]: We just heard from Eddie on the integration work that Outerbounds did, where you can actually take advantage of Trainium using Metaflow today. And there are others, like Ray, Weights and Biases, Lightning, and the list continues. We're seeing more and more integration work happen all the time. Okay, so it's one thing for me to sit here and tell you how awesome the chips are. Obviously I love them, but I think it's more meaningful to hear testimonials from our customers themselves. If you're not familiar with Ninja Tech, they recently released a set of AI personal assistants, and the models that power these assistants were actually trained and deployed on Inferentia and Trainium.

Scott Perry [00:27:06]: So here we can see a testimonial from the CEO and founder saying that he's very impressed with the level of model support, but also that they were able to save up to 80% in total costs and see 50% more energy efficiency than they were able to previously with GPUs. So a pretty compelling testimonial, I think. And similarly, if you haven't heard of Leonardo AI, this company allows you to design visual assets, whether you're a professional or a hobbyist, using AI-assisted capabilities. By moving some models over to Inf2, so Inferentia, Leonardo was able to see an 80% cost reduction compared to their previous GPU usage without sacrificing performance. So again, I think this is a pretty real-world and compelling use case and testimonial. Okay, so for Trainium and Inferentia: the chips themselves are called Trainium and Inferentia, but you actually consume them via EC2. On the Trainium side we have the Trn1 instance type, and we actually have three different flavors of Trn1 today.

Scott Perry [00:28:13]: So there's the trn1.2xlarge, which is a smaller instance with just a single chip. This is good for smaller fine-tuning jobs, or if you're just trying to get introduced to Trainium, for example. We also have the trn1.32xlarge and trn1n.32xlarge. These are larger instances with 16 chips, which would be 32 NeuronCores, a large amount of HBM memory, and good inter-instance networking capabilities for distributed training. Right. The only difference between the trn1 and the trn1n is that the trn1n has twice the networking capabilities for the distributed training cases. So, depending on your use case, if you're just trying out Trainium for, say, fine-tuning, you might want to get started with the trn1.2xlarge. If you want to jump into distributed training, we've got some tutorials around how you can take advantage of the larger instances to get started right away.

Scott Perry [00:29:05]: And the idea here is that by working with Trainium, you can take the models that you're working with today, say in PyTorch, map them over to Trainium, and start realizing up to 50% cost-to-train savings. On the Inferentia side, it's a similar situation. Inferentia is the chip, but you actually consume it as an end user via the Inf2 instance types. Here we have four different instance sizes, from xlarge up to 48xlarge, and the big differences are really the vCPUs, the memory, and the number of Inferentia chips. Depending on your use case, some people prefer to deploy a single model per instance, maybe in EKS using pods, and inf2.xlarge might be a good fit there. Other users want to combine a whole bunch of models, or maybe run large models, and require the larger instances. We're just trying to make sure that we can suit the use cases that you have in mind and give you the flexibility to scale up and scale down as needed.

Scott Perry [00:29:58]: In terms of availability, we're currently in 23-plus regions with more on the way. So if you're interested in getting started with Inferentia and Trainium, ideally it's already available in a region that's close to you. If it's not, please reach out to your account team and we can make sure to record the interest and start working towards providing capabilities where you need them. Okay, so I did want to take a minute to talk about the Neuron SDK. Up until now we've been talking about the chips and the instances; that's really the hardware. But to take advantage of these chips, to really realize the performance and the cost savings, it's important to have a solid set of software to actually power those chips. And in our case, the Neuron SDK is the complete set of software that we use to drive these chips.

Scott Perry [00:30:45]: And there are a few components there. At the base layer, running on the EC2 instances, is the Neuron runtime. You can think of this as a driver and a runtime library that allows you to load models, execute models, that sort of thing, on the actual host itself. To convert your code and get it to run on Inferentia and Trainium, there's framework integration, right? The idea here is that you don't necessarily have to rewrite everything to make it work with Trainium and Inferentia. You could basically take your PyTorch model that you already have working today, move the model and the tensors that are going into the model onto XLA devices via the framework integration, and run your model as normal. The only difference is that we're actually a compiler-based stack, so when you use this framework integration, it's going to trigger a compilation process. What happens under the hood is that when you're running your model, we identify the compute graphs that represent the computations underlying your model and extract those as an XLA graph. Then we compile using the Neuron compiler: we take that XLA graph and optimize it to run efficiently on Inferentia and Trainium hardware.
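A minimal sketch of that framework-integration path on the training side, using the torch-xla interface that ships with the Neuron SDK's torch-neuronx package; the model and data here are throwaway placeholders.

```python
import torch
import torch_xla.core.xla_model as xm  # provided by torch-neuronx / torch-xla

device = xm.xla_device()                        # on a Trn1/Inf2 instance this maps to a NeuronCore
model = torch.nn.Linear(1024, 1024).to(device)  # moving the model onto the XLA device
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for _ in range(10):
    x = torch.randn(8, 1024).to(device)         # tensors also live on the XLA device
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # The captured XLA graph is executed at the step boundary; the first step triggers
    # Neuron compilation, and subsequent steps reuse the compiled artifact.
    xm.mark_step()
```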

Scott Perry [00:31:53]: And in most cases, that compilation on the training side is meant to be a very transparent process. You can do ahead-of-time compilation if you want, or you can do just-in-time and just allow it to happen. On the inference side, you usually call the trace capability on your model and force it to compile, and then the output of that compilation is an artifact that you can run with PyTorch, for example. We've also got a set of userland tools, things like neuron-monitor, neuron-ls, and neuron-top, that allow you to view and probe the hardware to make sure you're using it appropriately and maybe diagnose performance bottlenecks, those sorts of things. We also have a profiler as part of that package. A new capability that we've recently launched is the Neuron Kernel Interface, NKI, which I'm going to talk about a little bit later. This allows you to write lower-level kernels that execute directly on the hardware, giving you a little bit more control over the execution of your models.
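The inference-side trace path Scott mentions looks roughly like this with torch-neuronx; the model and shapes are placeholders.

```python
import torch
import torch_neuronx  # the Neuron SDK's PyTorch integration

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example_input = torch.randn(1, 128)

# Tracing triggers Neuron compilation and returns a TorchScript module whose
# compiled graph targets Inferentia/Trainium NeuronCores.
neuron_model = torch_neuronx.trace(model, example_input)
neuron_model.save("model_neuron.pt")

# Later, e.g. inside a model server on an Inf2 instance, load and run the artifact:
loaded = torch.jit.load("model_neuron.pt")
print(loaded(example_input).shape)
```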

Scott Perry [00:32:53]: And above the Neuron SDK, this is where the ecosystem partners come in. We've called out Optimum Neuron here. This is that collaboration with Hugging Face where they've adapted Transformers, the Trainer API, SFT Trainer, and all these Hugging Face concepts that you're probably working with today; they've converted and optimized them to work with Trainium and Inferentia. So this is probably the easiest way to get started with Trainium and Inferentia today, honestly: just try the Optimum Neuron library, whether you're doing fine-tuning or deploying models, and you'll see that it feels very much like the Transformers experience you're probably already used to. And because Trainium and Inferentia are consumed via EC2 instances, we have support for them in a variety of AWS services: EC2 proper, ECS, EKS, SageMaker, ParallelCluster, Batch, you name it. And in terms of model hosting, we strive to support the most popular model servers and hosting platforms out there, things like vLLM, Hugging Face TGI, DJL, TorchServe, and Ray Serve, for example.
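To give a rough sense of what "it feels like Transformers" means in practice, here is a fine-tuning sketch assuming the optimum-neuron package; the model, dataset slice, and argument values are illustrative rather than taken from the talk.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments  # drop-in Trainer replacements

model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# A tiny slice just to keep the sketch cheap to run.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = NeuronTrainingArguments(
    output_dir="bert-imdb-neuron",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

# Same Trainer surface area as Transformers; on a Trn1 instance, compilation for the
# NeuronCores happens during the first training steps.
trainer = NeuronTrainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```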

Scott Perry [00:33:59]: And I briefly mentioned the Neuron Kernel Interface. One thing users were calling out to us is that a lot of people working with GPUs today have custom CUDA kernels that they're using, and previously that was challenging, because we don't natively support CUDA with Trainium and Inferentia. So if you made heavy use of CUDA kernels, there would be some friction in trying to move those models over to Inferentia and Trainium. But recently we launched the Neuron Kernel Interface, NKI, which is similar to OpenAI's Triton. This gives you a lower-level interface for implementing your own code that gets compiled and runs directly on the Neuron accelerator, so Trainium and Inferentia, without passing through the full compiler flow, right?
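To give a sense of the shape of an NKI kernel, here is a sketch modeled on the tensor-add style of example in the Neuron documentation; treat the exact API surface (nki.jit, nl.load/nl.store, buffer types, tile shapes) as an approximation rather than a reference.

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl


@nki.jit
def tensor_add_kernel(a_input, b_input):
    # The output tensor lives in device HBM; compute happens on tiles staged through
    # on-chip memory by the load/store calls below.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    ix = nl.arange(128)[:, None]   # partition dimension of the tile
    iy = nl.arange(512)[None, :]   # free dimension of the tile

    a_tile = nl.load(a_input[ix, iy])
    b_tile = nl.load(b_input[ix, iy])
    nl.store(c_output[ix, iy], value=a_tile + b_tile)
    return c_output
```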

Scott Perry [00:34:42]: So as an end user, you're probably writing your model code in PyTorch or JAX, for example, but there's a chance that the compiler doesn't fully optimize the code you've written, or it just doesn't behave exactly how you want it to, or maybe you have a few optimizations in mind that you think could speed up your model. This NKI interface allows you to get that fine level of granularity and control, so you can introduce your own code and run it directly on the accelerator engines. We're very excited about this. There are a bunch of examples you can take a look at in the documentation; things like flash attention have been implemented using NKI. So we're really excited to see what the community can contribute back. And I mentioned the tooling.

Scott Perry [00:35:23]: So we've got neuron-top and neuron-ls here. On the right-hand side you can see neuron-top, which is our text-based graphical view of utilization when you're doing a training job or doing inference. For example, here you can see that there are 32 total NeuronCores; you can see the utilization on those cores up above, memory utilization on the bottom, as well as some system information. And on the left-hand side we have neuron-ls, which is a command line tool that gives you a view of how many NeuronCores you have, which processes are using those cores, that sort of thing. And there's some additional tooling as well. So from a high level, if you did want to get started with Trainium or Inferentia today, what would that look like? We've whittled it down; it's meant to be easy here. First off, you launch an instance, whether that's a Trainium instance or an Inferentia instance.

Scott Perry [00:36:11]: We do have Deep Learning AMIs and Deep Learning Containers for you to take advantage of, so you don't have to install the software from scratch if you don't want to. So you boot an instance, and you pick the framework or the integration that you're going to be working with. For example, you might say, hey, I want to get started with Hugging Face, using Optimum Neuron. Generally you take your code, so you've got some existing PyTorch code today. If you're working with Hugging Face Optimum Neuron, you don't have to make any changes; you basically just tell it which model you're using. If you're using a vanilla PyTorch model, you might have to make a few modifications just to let it know that it's XLA. So a few code changes, you run your model, you get an output that can run on Trainium and Inferentia, and then you go to town.

Scott Perry [00:36:51]: You should be good after that. And obviously I'm glossing over some of the details, but this is the high-level flow that we tend to see, and we're striving to make it as easy as possible. Okay, so where to get started? I've mentioned it a few times: Optimum Neuron, if you're familiar with Hugging Face and you like their interface and ease of use, is probably the easiest way to get started. I've included a link there. But within the Neuron documentation, we've got loads of detailed tutorials, architectural details, and all kinds of goodies. So if you're interested in learning more about the architecture or just the services themselves, please take a look.

Ben Epstein [00:37:29]: That was awesome, that was really particularly cool. I would love to have you, and potentially other people from your team, come on and do a live session with it, like building a kernel live. It could be intimidating for a bunch of people to even start thinking about building kernels, but I think it would be very cool to have you, or someone from the community together with you, come on and live-build one.

Scott Perry [00:37:53]: Yeah, I can talk to the team. I mean, we have a few individuals in particular that are very focused on NKI right now. Exciting times when it comes to this custom kernel capability and where people are going to take it. Right, because that was one of the challenges before: with the compiler itself, the end user doesn't really have a lot of control, other than you provide the code and you get the output that runs on the accelerators. But with NKI, you really can have that deep level of control over how your model is going to operate on the various engines, and finally we're at the point where we can give that capability to the community. Right. So hopefully we'll see lots of contributions back, custom kernels for new models and that sort of thing.

Scott Perry [00:38:34]: Everybody's super pumped about it.

Ben Epstein [00:38:36]: Yeah. Eddie, have you or your team started taking advantage of that yet, or are you planning to?

Eddie Mattia [00:38:42]: We haven't been using NKI yet. I'm also very excited, as Ben was saying, to understand it, especially when you mentioned how it's sort of like OpenAI's Triton. That's a really interesting way to think about what would limit me from using the Neuron SDK. I think we've talked about this a lot, Scott: CUDA is so deeply integrated in so much of the deep learning stack that it's kind of hard to imagine another hardware provider navigating around that. But yes, this sounds really cool, that you guys have something ready that I can get my hands on.

Scott Perry [00:39:14]: Definitely, definitely available now. And we tried to include a few relevant examples so people can kind of see how to implement and what an implementation looks like. Super cool stuff.

Eddie Mattia [00:39:28]: Yeah. But yeah, Ben, to answer your question, on our side the integration has been a bit more high level: how do we implement it with Metaflow so you get all the benefits of the nice stuff that Scott's team has built around these accelerators on the software side, like the monitoring and the neuron-ls functionality? How do we tie that into what Metaflow already does?

Ben Epstein [00:39:49]: Yeah, and Metaflow has such strong integrations with a lot of components, even down to storing the metadata, checkpointing models and things like that. The idea of it being almost one click and it kind of runs somewhere else; they don't even have to worry about the compute. Scott, I'm curious: you were showcasing the services that AWS offers that can integrate with this new tech, and I saw EC2 and SageMaker, things like that. The first thing that popped into my head is, oh, you know, you could set up a Lambda instance and have this be almost serverless.

Ben Epstein [00:40:25]: Is that something that you guys are working on, or scale to zero, something anybody has asked for?

Scott Perry [00:40:30]: I don't actually know off the top of my head about the availability of Trainium and Inferentia via Lambda. It's something I'd have to check and get back to you on, but the closest analog, I guess, would be AWS Batch today. Yeah, because it's still kind of this ephemeral hardware that just shows up when you need it, and it can definitely take advantage of Trainium and Inferentia today. And that's the initial integration work that Eddie did; that's how they're leveraging Trainium, via Batch. That's probably the closest thing.

Ben Epstein [00:40:58]: Yeah. Honestly, I didn't actually see Batch on that list. I must have missed it.

Eddie Mattia [00:41:03]: Yeah, I would say too, Ben, on the serverless piece, or the UX of the serverless piece: in my experience of doing this integration and running some larger-scale experiments, I was always using on demand, always just asking for that cold-start spin-up of these Trainium devices, and they were always available within 10 or 15 minutes, which is awesome, once I knew what regions to look in. To me that's a qualitatively different user experience than, say, shopping for GPUs has been for a lot of the last couple of years.

Ben Epstein [00:41:37]: Yeah, a hundred percent. My workflows with GPUs have become increasingly these batch, serverless types of interfaces, and technology like this, AWS Batch, Modal, things like that, have made it possible. It's been a paradigm shift in terms of access, of being able to compute large things while being GPU poor, essentially. It's really a paradigm shift. What are your. I mean, when I've interfaced with Metaflow, I've really thought about it much more from the tracking perspective.

Eddie Mattia [00:42:11]: Right.

Ben Epstein [00:42:12]: Like, especially with the open source tooling, just the idea of everything being tracked and checkpointed over time. As you tighten these integrations, what are maybe the biggest new capabilities that Metaflow is offering through AWS and through these heavier cluster systems?

Eddie Mattia [00:42:33]: I'm very curious to hear your thoughts on this as well, Scott and Ben. But in my point of view, I wouldn't say it's something new necessarily; the big value proposition is connecting these big, large-scale jobs, the kind of stuff that I got trained to do in grad school and then was doing one-off over the past five or ten years. With GPUs it's kind of like you log into the box with SSH, run your command from the head node, and then the MPI cluster just does whatever. I feel like there's a big shift going on right now, moving from that world, where it kind of just requires one expert modeler who's willing to deal with all the infrastructure and figure out how to do it, towards the same way people just deliver software normally. That's kind of the whole thesis of MLOps in some ways, and I think it's starting to mature to the point where you can connect these huge compute workloads into the regular system where you deploy all the different things that become your end product for whoever your end user is. So yeah, I'd say moving in that direction is the totally not sexy but actual value prop that I'm excited about here.

Ben Epstein [00:43:40]: It's kind of.

Eddie Mattia [00:43:40]: What do you guys think? What are your thoughts on this link between these huge compute jobs and the more mainstream, pre-existing systems?

Scott Perry [00:43:49]: Yeah, I mean, on our side we're seeing more and more customers using EKS or Kubernetes for their machine learning workloads. And I think the reality is, instead of, like you said, having to build all this infrastructure manually, you can just have a node group that scales up the Trainium instances and then submit your distributed job and not worry so much about the infrastructure. You're just focused on getting the results. So there's a lot of interest in the EKS integration, and we do support EKS for Trainium and Inferentia today. And I know, Eddie, that was sort of phase two, right? We started with the Batch integration, and we do want to focus on the EKS integration for Metaflow as well. I think that would be a nice follow-up, maybe get another blog post as well.

Eddie Mattia [00:44:30]: Yeah, there we go.

Scott Perry [00:44:32]: Yeah, but yeah, curious, Ben, Eddie, are you seeing the same thing? Is EKS, or Kubernetes in general, sort of the preferred platform for distributed training and inference today?

Ben Epstein [00:44:43]: I think it kind of goes back to what Eddie said, in that I think it's dependent on the team that you have. That's definitely awesome if you have the team to support it, and I mean, it is infinitely easier than it was even two years ago, honestly, and almost anyone can spin up a few people and make that happen. But in addition to that, I think the fact that it's so easy to do things that are even easier than EKS, the idea that you don't even need Kubernetes to run large ML workflows, to me is actually the bigger shift. You can start with a team of two people and just submit some batch jobs and not actually have to think about kernels, and really just have to think about, all right, I've built a training data set, I've maybe fine-tuned a model just with LoRA on an L4 GPU.

Ben Epstein [00:45:29]: And now I can deploy that on this gigantic system and not have to worry about it and get, you know, 100 QPS throughput. That to me is really what's pretty game-changing. Something I was reading just today was about, I think they called it the universal assistant model, the new paradigm that Hugging Face had in a blog post around assistant models, where if you have a model like Llama 405B, you can use a smaller Llama model, like the 1B, to do speculative decoding with it and actually make it a lot faster by using the smaller model where you can. But then a new one came out where you can actually pair any smaller model with any bigger model, even when the tokenizers are different.

Ben Epstein [00:46:16]: To me that's a place where this becomes really cool, where you can leverage these enormous clusters and actually get a 405-billion-parameter model running, which I would never otherwise consider even trying, and run it alongside a smaller model. Now you're getting this incredible throughput with this crazy hardware and you're able to fall back to the bigger model. It's this crazy best of both worlds that even six months ago wasn't possible.

Scott Perry [00:46:39]: Yeah.

Ben Epstein [00:46:40]: Like that to me is making it particularly awesome.

Eddie Mattia [00:46:45]: Yeah, I definitely agree. I think about the possibilities for very small shops, or even people that aren't at a company and are just trying to experiment with this stuff; the cost is reasonable enough that in your own AWS account you might be able to try some of this if it's useful to your career. But on the other part of your question, Scott, I'm definitely seeing Kubernetes only increasing in adoption; in the preferences of a lot of platform engineers, and of the people at our customers who make these decisions, Kubernetes seems to be increasing.

Scott Perry [00:47:16]: Okay. And I ask because I'm a bit biased; you know, my day to day, I'm working in Kubernetes most days.

Eddie Mattia [00:47:21]: Yeah.

Scott Perry [00:47:22]: So it's.

Eddie Mattia [00:47:23]: What I will say on the open source Metaflow side is, I know there's still a lot of power users of AWS Batch; that was kind of the first compute integration inside Metaflow, like five or six years ago or whenever. And for a lot of the companies that only use the original Metaflow-on-AWS version, Batch is still kind of the blessed path for many folks.

Scott Perry [00:47:43]: Yeah, I think there's a lot of genomics use cases. It's the same thing. It's kind of proven it just works. It's there when you need it.

Eddie Mattia [00:47:48]: So yeah, it does what it does well.

Ben Epstein [00:47:50]: Yeah, I think we're going to be forever in this battle between serverless and not serverless, back to serverless and not serverless, and it's just going to keep going back and forth pretty much forever. Each will be battling the other to make their version a little bit better, and consumers just end up having the coolest pieces of technology to work with. I mean, it's a very, very cool spot to be in. Well, thank you guys both so much. We're just about at time. Do you guys have any other thoughts or comments about the talks?

Scott Perry [00:48:20]: I just wanted to call out that I'm going to re:Invent; hopefully some of the listeners are as well. I've got a session, I think CMP304, so if you're interested in coming to see me, or you just want to chat afterwards, I'll be around, so feel free to reach out. Otherwise, thanks so much, guys, for having us in today. It's a pleasure to be here, and I'm really excited to see you guys again.

Ben Epstein [00:48:39]: Yeah, and a huge thanks to AWS for making this episode possible. We really, really appreciate it. I mean, that's really what makes the MLOps community continue to thrive, so we really appreciate that.

Eddie Mattia [00:48:48]: And I'll say thanks to all the listeners and to Ben for setting this up and organizing, and of course Scott, you and your team as well, for helping us out and supporting this.

Ben Epstein [00:48:57]: All right, thanks so much, everyone.

