AIStore Virtual Meetup by NVIDIA

Architect for DGX Cloud Storage and Data Services


An exclusive virtual meetup for industry individuals handpicked by NVIDIA.
A demo by NVIDIA sets the stage, followed by an open roundtable to trade insights, feedback, and war stories.
The session is an active discussion on AI data stores: what's working, what's holding things back, and what needs to evolve.
Demetrios [00:00:01]: We wanted to go over some of the cool stuff that the Nvidia folks are doing with AIStore and then have an interactive conversation and really talk about what we're seeing, what our experiences are and all that fun stuff. And I can try and lead the discussion, but we've got lots of really smart people on this call, so hopefully nobody feels shy. And if you do, I will break out the guitar and I will show you why you don't need to feel shy at all. So yeah, I think we're going to see a little bit of what AIStore is first.
Rodney [00:00:45]: Yes, we're going to hear about AIStore this morning from Pradeep from our product organization.
Demetrios [00:00:52]: There we go. Cool. And then we're going to have a little pow wow Kumbaya fire session. Yeah?
James J [00:00:58]: Yes.
Demetrios [00:01:00]: All right. I like that.
Rodney [00:01:02]: Cool. Do you want to, if we're ready to kick off, do we want to do a couple introductions on the Nvidia side?
Demetrios [00:01:06]: Yeah, that might. I like the way you think. I totally jumped over it, but it's good to get to know each other. Yeah. Hit us. Rodney, you want to start?
Rodney [00:01:15]: Yeah, I'll start us off and then I'll hand over to Pradeep and other members of the team. So my name is Rodney Scheider. I lead DevRel for DGX Cloud at Nvidia. I've been working with the community. I think I see some folks I've chatted with before on here. So I've been involved in working with the MLOps community for a few months now. We've had some super cool conversations about AI tech and what folks are using, and so we thought this would be a great idea to bring some new technology, as Demetrios said, with AIStore. So with that I'll pass it over to Pradeep.
Pradeep [00:01:50]: I'm Pradeep. I'm an architect in the DGX Cloud storage and data services group and I'll be joined today in my presentation by Abhishek, Aaron and Phil. Aaron, go ahead.
Aaron W [00:02:10]: Lately been working mostly on the Kubernetes side of things. Operator and a lot of our observability.
Demetrios [00:02:20]: Excellent. There's one more person. Who is it?
Abhishek [00:02:24]: Hi, I'm Abhishek. I'm a software developer at Nvidia. I work on AIStore as well. I work on the Python SDK, the operator and the ETL part of AIStore as of now.
Demetrios [00:02:38]: Super cool. So Pradeep, while you share your screen, I would love to know who we're here with on the other side of the house. So if, if folks want to jump in and like just drop in a note on who you are, what you're doing, where you're at in the world, that's all cool. Just so we can have a quick idea and then we don't need to go through like 20 different people introductions on the other side. So yeah, Pradeep, kick us off man.
Pradeep [00:03:10]: Right, so I'm going to talk about AIStore, which is an open source product being developed by Nvidia. I'm going to start with the motivation for building this. There's a need for a fast storage tier close to GPUs. Now there are many other storage solutions, but there are gaps in many of those. For example, there's the notion of a source of truth. A lot of companies, a lot of people want to keep one copy of their data, that is the canonical copy, and then use that data by copying or by synchronizing to other locations. And typical source-of-truth storage services are S3 or any other large object store like GCS or Azure Blobs. We internally use an object store called SwiftStack. A lot of times the source of truth is in one region, or sometimes it's multi-region, but still it's a small set of regions.
Pradeep [00:04:28]: And where it is being used by GPUs could be in a completely different region, could be in a different zone, and copying or using data from those faraway locations doesn't always give the best performance. There's a lot of flakiness in the network, the throughput is low, and it's in general not the best experience for applications. So in addition to the network distance, there are also the high egress costs that most CSPs impose on data going outside the region where it's stored, even if it is to the same CSP. So using the source of truth directly is fraught with these problems. CSPs also provide shared storage within each region or even within each zone, for example Lustre, a parallel file system that is very popular for high performance computing, but it has a limit on how big it can go. We internally use a lot of Lustre, many many petabytes, but it still has a limit. It cannot go to exabytes. There are also problems with the reliability of Lustre.
Pradeep [00:05:42]: It was built for an on-prem high performance computing, supercomputing kind of cluster, not so much for a cloud environment where nodes keep going up and down frequently. NFS is another option, but it doesn't have the same throughput as parallel file systems. It also has some limits on capacity. Then object stores: there are some object stores that are built for fast access from within individual zones, like S3 Express One Zone. Those are good. But many other object stores don't have sufficient throughput. They are mostly built off of hard disks and not SSDs.
Pradeep [00:06:25]: And with hard disks you don't get as much throughput as you need for some of these very demanding AI workloads. And then there are raw SSDs on the GPU nodes themselves. They have limited capacity because you're limited to one node, and they are not shared across nodes. So every time a job or different parts of a job need a piece of data, it needs to be redownloaded to every node. And when the job dies or finishes and then gets rescheduled again in the future, it can go to a different set of nodes, and then it won't be able to reuse what was downloaded earlier. It'll have to redownload to that node. So those are all the gaps. So what can a cluster-local fast tier of storage bring here? It needs to be cluster-local because it has to have high throughput to serve very demanding AI workloads.
Pradeep [00:07:23]: Data preparation, small-model training, and inference all have very high throughput needs. When you are training very large models, especially text models, then the data ingestion may not be as demanding, but the checkpoints are very big, so reading and writing checkpoints still needs a lot of throughput. Then a lot of times there's a huge amount of repetition in terms of reading the same data over and over again, for different jobs and for multiple epochs within a job. Instead of reading the data across large distances, it makes sense to read it from within the cluster. Then, why am I calling this a fast tier instead of a cluster-local storage service? That's because we want the data to be synchronized between the source of truth and the local copy here. Manual synchronization across these locations is very complex, error prone, and frankly frustrating for a lot of our customers.
Pradeep [00:08:34]: And keeping the global source of truth and the local copy in sync in an automated way is very convenient, very desirable. By the way, feel free to pause, interrupt any time and ask questions. We want this to be interactive. We built AIStore to satisfy all the needs that I just talked about in the previous two slides. It is an open source product. It's been on GitHub for the last seven years, since the beginning, since when it started. It is very lightweight. It is built from scratch.
Pradeep [00:09:19]: It's an object storage system tailored for AI workloads. So lightweight and tailored for AI workloads, what does that mean? It means it only implements the small subset of features that are necessary for AI workloads: reading and writing blobs, listing, deleting. It doesn't have versioning, it doesn't have detailed tags, it doesn't have lots of other features that popular CSP object stores have nowadays, just because what we have seen is that AI workloads, both training and inference, mostly don't need those features. It provides linear scalability with every added storage disk or node. In the rest of this presentation you'll see some of our dashboards that show that as you keep increasing the number of nodes, you get proportionally more throughput.
Pradeep [00:10:14]: The I/O is balanced across all the nodes and the SSDs in a very efficient way. There are no hotspots. This is also elastic, meaning you can keep adding nodes, you can also remove nodes, and the data will get automatically redistributed. It is deployable anywhere, from a single Linux machine to many machines to a huge cluster managed through Kubernetes, for example. We'll go through a few more details on that. It's been in development for seven years, and any data system, any storage system needs a lot of time to bake to make sure there are no deep bugs. AIStore has gone through that pain, and what we have built so far is very reliable now, and it is in limited production at Nvidia, limited just because we are ramping up gradually. We'll also talk a little more about how it has been used in production at Nvidia. I will pause there before handing off to the next presenter.
Pradeep [00:11:24]: Any questions?
Rodney [00:11:31]: I see a couple hands up. Are you able to unmute and ask questions?
Steen M [00:11:37]: Was that my name?
Pradeep [00:11:39]: Yes.
Steen M [00:11:40]: All right, maybe I'm the only one with a hand raised. Yeah, thanks for that introduction, very nice. I am currently investigating various tiering systems for hot storage, especially for working with workloads close to a lot of GPUs. One of the requirements that a lot of the users have right now is that they come from a classical HPC world. A lot of the HPC workloads that they are currently using expect POSIX file systems. Is there anything in here related to POSIX file systems, or do you need something else to abstract it? I know you talked a bit about Lustre, but we've already tested out Lustre and that just falls short in so many areas.
Pradeep [00:12:38]: Yeah, very good question. So actually we have faced a similar requirement and constraint internally as well, and we have two approaches to solving that problem. One is a library called Multi-Storage Client that integrates with all the different object stores as well as AIStore and provides a file-like interface in Python. We have integrated this library with many of the Nvidia frameworks like Megatron, NeMo, Modulus, and then we are also planning to integrate it with some non-Nvidia open source projects like PyTorch and others. Many of those frameworks expect a POSIX kind of interface, and Multi-Storage Client provides that. That's one approach. Another approach is to provide a file system interface directly through FUSE adapters, and those FUSE adapters can access AIStore or other object stores; that is still in development. We are targeting that, again, for AI workloads.
Pradeep [00:13:52]: There are other FUSE systems available out there that are also being used in some places, but we are trying to make it even better for AI workloads.
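For readers who want a concrete picture, here is a minimal sketch of the file-like-adapter idea Pradeep describes. It assumes a generic object-store client with a get_object(bucket, key) call returning bytes; the names are hypothetical placeholders, not the actual Multi-Storage Client API.

```python
import io

class ObjectFile(io.RawIOBase):
    """Toy read-only, file-like adapter over an object store.

    `client.get_object(bucket, key)` is a hypothetical call standing in for
    whatever SDK actually backs the adapter; a real implementation would read
    byte ranges lazily instead of pulling the whole object up front.
    """

    def __init__(self, client, bucket, key):
        self._data = client.get_object(bucket, key)  # fetch the object once
        self._pos = 0

    def read(self, size=-1):
        end = len(self._data) if size < 0 else self._pos + size
        chunk = self._data[self._pos:end]
        self._pos = min(end, len(self._data))
        return chunk

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = len(self._data) + offset
        return self._pos

    def readable(self):
        return True

    def seekable(self):
        return True

# A POSIX-minded framework can then treat the object like an ordinary file handle:
# f = ObjectFile(client, "training-data", "shard-0001.tar"); header = f.read(512)
```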
Demetrios [00:14:04]: I see another hand up. James, where are you at?
James J [00:14:11]: So I'm James. I'm actually in the UK, in the Cotswolds. I work at Visa doing machine learning, infrastructure, DevEx and a few other things, and GenAI at the moment. My question is actually quite an interesting one, because I just started reading the docs and it's about the balanced I/O stuff; that's what jumped out instantly. So if you've got a massive Kubernetes cluster, I'm just curious how this works, because the data is going to be spread across the whole cluster, right? It's going to be partitioned. I'm curious about when it tries to access the data, how is it going to find it, like the nearest node? Because you might have nodes across three data centers, because that's pretty much AWS for example, and there's latency in the networking between each one. Is it trying to put the data in a logical place based on the machine it's trying to get to? I'm just curious if there's any more documentation about how it does I/O distribution, because it's really quite an interesting space.
Pradeep [00:15:11]: Yes. So AIStore is actually meant to be deployed... one instance of AIStore is meant to be deployed in one zone.
James J [00:15:19]: In one availability zone. Okay, that makes sense.
Pradeep [00:15:23]: Yes.
James J [00:15:24]: Okay, that was my question, not multiple zones. So that's fine. Okay, cool.
Pradeep [00:15:31]: Each AIStore cluster can also talk to other AIStore clusters. So if you have different zones, it can pull from another AIStore cluster in another zone rather than going all the way to S3 in a different region, for example.
James J [00:15:42]: Fine, that makes sense. Cool. All right, that explains a lot. Thanks for answering my question.
Demetrios [00:15:48]: Nice. I see one other hand. Yen, where are you at?
Yen K [00:15:54]: Yes, I'm right here.
Demetrios [00:15:57]: Cool.
Yen K [00:15:59]: Can you hear me?
Demetrios [00:16:00]: Yeah, yeah, fire away.
Yen K [00:16:02]: Yeah. I have a question for you, Pradeep. I'm really curious about the elastic cluster architecture. How does that work in terms of its ranking capabilities?
Pradeep [00:16:15]: Ranking as in where to put the data or.
Yen K [00:16:19]: Yeah, like, are you talking about using the search capability of Elastic, or is it something else completely different?
Pradeep [00:16:28]: No, no, sorry. What I mean is elasticity in terms of growth and shrinkage. I'm not talking about Elasticsearch.
Yen K [00:16:34]: Oh, okay, okay, that makes sense. Okay. And then behind the scenes, just out of curiosity, I'm a total noob in this particular area. So in the back end, what sort of tooling or features or tech stack is used? Or is it all dictated by Kubernetes microservices in the back end?
Pradeep [00:16:58]: Yeah, so the data path is completely built from scratch. That's all AIStore, and it's all open source. The management is driven mostly by Kubernetes. We have an operator, and we'll go into more detail there in subsequent slides.
Yen K [00:17:15]: I see. Okay. Okay, well that makes sense. Okay, cool. Well, thanks.
Pradeep [00:17:20]: Okay, I'll hand it off to Abhishek if there are no more questions.
Abhishek [00:17:25]: Yes? Am I audible?
Pradeep [00:17:27]: Yeah. Yes. Yeah, yeah.
Abhishek [00:17:29]: Hi guys, I'm Abhishek. So let's go through an overview of AIStore. At the crux of the system we have AIS proxies and AIS targets. AIS proxies are gateways that clients reach out to. They are just lightweight endpoints that you point your application to, to read data from or write data to. All the storing of the data is done on AIS targets, which are the storage nodes. These nodes use PVs and PVCs on top of the disks to store the data. We will look at this part more deeply in the next few slides.
Abhishek [00:18:04]: The AIS cluster, as Pradeep said, is elastic in nature. Elastic, meaning it can grow and shrink at any time. There's no hard limit on the number of proxies or the number of targets. You can easily add any proxy or any target at any time in the cluster's life cycle. And moreover, the AIStore cluster linearly scales up with each added disk or each added storage node. So if you're adding more nodes, you can expect more throughput. And if you're removing disks or removing nodes, your throughput will drop accordingly.
Abhishek [00:18:38]: On the right side you can see the front-end API. There are multiple different ways you can reach out to AIStore. AIStore in itself is an S3-compatible object store, so you can use any existing S3-compatible clients such as s3cmd, the AWS CLI, or Boto3 to reach out to AIStore. AIStore also has its own native API clients. We have the Go- and Python-based SDKs, and we also have the AIS CLI, which you can use to monitor or manage the cluster. Everything in here is open source. On the left hand side we have backends. AIStore supports multiple different backends.
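As a rough illustration of the S3-compatibility point, existing tools can be aimed at a proxy endpoint instead of AWS. A minimal sketch with boto3; the host, port, and /s3 path prefix below are placeholders for whatever endpoint your deployment actually exposes, and the bucket is assumed to exist already.

```python
import boto3

# Point a standard S3 client at an AIStore proxy instead of AWS.
# "http://ais-proxy:51080/s3" is a placeholder endpoint, not a guaranteed default.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ais-proxy:51080/s3",
    aws_access_key_id="placeholder",      # ignored unless AuthN is enabled
    aws_secret_access_key="placeholder",
)

s3.put_object(Bucket="my-bucket", Key="sample.txt", Body=b"hello aistore")
obj = s3.get_object(Bucket="my-bucket", Key="sample.txt")
print(obj["Body"].read())
```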
Abhishek [00:19:22]: We have native support for AWS, Google Cloud, Azure and now, recently, Oracle Cloud. You can also connect different AIStore clusters to the AIStore cluster which you are running. All of them sit behind a generic backend interface, so you don't need to change any API on your end. It's a very simple get/put request. So do you have any questions on the previous architecture? Yeah, James.
James J [00:19:56]: Curious. So I can understand adding a node. Okay, so you create more capacity. So let's say you've got a petabyte of data sitting across, I don't know, 100 nodes, right? Just an example I chose, maybe that's a bit high, but N terabytes of data, right? And let's say you are now refreshing some nodes because of a security vulnerability, and you've got to roll nodes over in Kubernetes.
James J [00:20:24]: So you're rolling 10% of your whole cluster at once because you need to roll the whole cluster. How does that work? Because, okay, these are cloud machines; it's not like they're on-prem where I can just take them offline for a little while, bring them back again and they've still got the data that was on disk. What happens on the back end when I lose 10% of the whole cluster and it's trying to find data? I'm just curious how the system repartitions itself.
Abhishek [00:20:54]: So this is a valid case, we have seen this before. In AIStore there's a lifecycle state called maintenance mode. You can put storage nodes into maintenance mode, or proxies into maintenance mode. For proxies it doesn't matter, because you can take down proxies at any time. But if you put targets into maintenance, all the data that is stored on the target will be rebalanced onto all the active nodes. So there is a rebalance process where the cluster map changes and the data is moved from the node that you are putting into maintenance to the other nodes. And once you have added a node back, the cluster map changes and things rebalance based on the cluster map changes again.
Abhishek [00:21:38]: We use consistent hashing on the proxies, and based on that it will be decided whether some of the data will be moved to that new node.
James J [00:21:48]: Okay, so just so I'm clear, and I think for the rest of the group, from my understanding, in a best-case scenario (I'm a DevOps engineer as well as a bunch of other things): I set these nodes to maintenance, I cycle them out, the data's moved off and put on the rest of the cluster, which has enough capacity, and I then bring up 10 new nodes to return my cluster to how it was. Does it then go, I've got 10 new storage nodes?
Abhishek [00:22:20]: Yes.
James J [00:22:20]: So I'm going to now rebalance the cluster and just move it around. Okay, my last question is, what happens if I just pull the plug on one of these machines? Right, there's no maintenance. Because in the cloud things can just go: you might have a machine, and then it's gone. How does it cope with those kinds of failures?
Abhishek [00:22:44]: Yep, we've seen this one before as well. So say you have 10 targets and one of your targets abruptly stops working, like the whole node goes down. Eventually the keepalive will detect that it's not a part of the cluster anymore, so it will be removed from the cluster map. Proxies redirect requests to the targets, right? So if the target is removed, the proxies will redirect requests to a new target, and that new target will realize that it doesn't have the data on itself.
Abhishek [00:23:16]: So it will pull the data from the remote cloud, store it onto the cluster, and then subsequent gets will be from that new target. So essentially you lose the data and it's fine: there will be a cold read, but subsequent reads for the same object will all be warm.
James J [00:23:34]: Okay, amazing. So it's self-healing. Thank you, thank you for such a great answer.
Pradeep [00:23:40]: Thank you so much, and I just want to add, at the top of the slide there: we also have an erasure coding mode, so you can reconstitute the data from other nodes instead of having to pull it from the source as well.
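As a toy illustration of the reconstruction idea only: the sketch below uses a single XOR parity slice, whereas AIStore's actual erasure coding uses configurable data and parity slice counts, so treat this purely as the concept in miniature.

```python
def xor_parity(slices):
    """Byte-wise XOR of equal-length slices, producing one parity slice."""
    parity = bytearray(len(slices[0]))
    for s in slices:
        for i, b in enumerate(s):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(slices, parity):
    """Rebuild the single missing slice (marked None) from the survivors plus parity."""
    missing = slices.index(None)
    survivors = [s for s in slices if s is not None]
    slices[missing] = xor_parity(survivors + [parity])
    return slices

data = [b"aaaa", b"bbbb", b"cccc"]       # an object split into three equal slices
parity = xor_parity(data)                # parity slice kept on a fourth node
damaged = [b"aaaa", None, b"cccc"]       # one target lost
assert reconstruct(damaged, parity)[1] == b"bbbb"
```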
Abhishek [00:23:56]: Yep. So if there are no more questions, I can move on to the next slides. Okay, so here are some of the features of AIStore. We provide high availability and data protection. As Pradeep said, we have erasure coding; you can control the number of slices and the tolerance. We have self-healing, as we have seen from an example in production at Nvidia. We control whole deployments through an AIS operator.
Abhishek [00:24:29]: We will come to the operator in the next few slides. We also have batch operations. This is one thing which I've noticed a lot of data engineers or data scientists use: if you want to run a machine learning training job and you don't want to get high latencies or low throughput on the first epoch, you can just prefetch the entire data set before running the machine learning job. The data is then already stored on the cluster, and subsequent reads for those objects are directly from the disk instead of going to the S3 backend. I've shown this in a later slide as well; we'll come to this part later. We also have a copy operation where you can copy data sets from one S3 backend to another S3 backend. You can also copy data sets from AWS to Azure or GCP, and you can also copy them to a local AIStore bucket.
Abhishek [00:25:24]: In AIStore we also have an authentication server called AuthN. It's an OAuth2-compliant server. So when you enable authentication, all your requests to AIStore will require a valid token. That token will have permissions based on whether the user can read the specific bucket; it's controlled through roles. AIStore has read-after-write consistency and it supports write-through caching.
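The prefetch pattern Abhishek describes boils down to warming the cluster-local tier before epoch one so that no epoch pays cold-read latency. A hypothetical sketch of that pattern follows; the client methods (list_objects, prefetch, wait_for_job, get_object) are placeholders, not necessarily the real SDK or CLI calls.

```python
def warm_dataset(client, bucket, prefix):
    """Pull a dataset from the remote backend into the cluster before training starts."""
    keys = client.list_objects(bucket, prefix=prefix)   # enumerate the dataset
    client.prefetch(bucket, keys)                        # kick off the batch prefetch job
    client.wait_for_job("prefetch")                      # block until the tier is warm

def train(client, bucket, prefix, epochs):
    warm_dataset(client, bucket, prefix)                 # cold reads happen once, up front
    for _ in range(epochs):
        for key in client.list_objects(bucket, prefix=prefix):
            sample = client.get_object(bucket, key)      # warm read from cluster-local disks
            ...                                          # feed the sample to the training loop
```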
Abhishek [00:25:59]: There are also more features on the AIStore website as well as the GitHub page, where you can check them out. So this is the simplified read flow which I was explaining earlier. Consider you want to fetch an object, so you do a get call to the proxies. A load balancer is sitting in front of all the proxies, so essentially you reach out to the load balancer. The load balancer routes your request to a proxy. If you do a get request on a proxy, it will redirect you, saying that the data might reside on this target.
Abhishek [00:26:39]: So the redirect goes to that target. If the target finds that it doesn't have that object, it will reach out to the backend to fetch that object, write it to disk, and return the object to the client. This is what I meant by read-through caching: all subsequent reads of that specific object won't go to the backend, unless you have specified that you want to fetch the latest version or there was an out-of-band update; it will be directly returned from the target itself. Any questions on this slide? Perfect. So coming to the AIStore Kubernetes integration: we have an AIStore operator. We have an AIStore custom resource definition that we use to manage the whole life cycle of AIStore.
Abhishek [00:27:35]: The AIStore operator is responsible for the entire lifecycle of AIStore. You can easily scale or remove nodes, you can add proxies, targets, etc. We have Ansible playbooks: for deploying AIStore we need to set up the underlying environment, we need to format disks, we need to increase the number of open file descriptors, we need to tune the network to make it perform better. We have provided Ansible playbooks for all of that. We also have Helm charts to deploy the operator and the AIS cluster deployments. Even for scaling and adding nodes, we have Helm and Ansible playbooks. And you can also maintain customizable templates for each environment.
Abhishek [00:28:24]: So if you want to change your AWS creds, or any of the creds, or if you want to add or remove nodes, you can just sync the Helm chart for that respective environment and everything will be in place. Yep. James, go ahead.
James J [00:28:44]: Sorry, I'm asking lots of questions. It's just the nature of...
Abhishek [00:28:46]: No worries, no worries.
James J [00:28:47]: About the Ansible stuff: what OS is supported? Is it, like, the RHEL family? Is it Ubuntu?
Abhishek [00:28:55]: So yeah, we initially supported Linux, like the Ubuntu family, but later on we started deploying a lot on OCI, and there... I think it's.
Aaron W [00:29:11]: It's CentOS.
Abhishek [00:29:12]: Yeah, CentOS. So I think we support most of the Linux OSes for now.
James J [00:29:19]: Thanks.
Abhishek [00:29:21]: Yep. Any other question?
Yen K [00:29:24]: Yeah, I have a quick question regarding the Ansible playbooks. Just out of curiosity, how different is that versus using Terraform or any other IaC framework for this?
Abhishek [00:29:39]: So, we initially supported Terraform, but because of the size of the team we have stopped the support for Terraform and removed it from the operator. But it's similar. Terraform is pretty straightforward; we don't have support for it right now, but using Ansible, or with the Helm charts, you can do essentially the same things and get your cluster up.
Pradeep [00:30:07]: Okay.
Aaron W [00:30:08]: I would say ideally we would use something like Terraform for some of the initial node setup. In this case we've been doing most of our setup through the OCI console ourselves, and the Ansible playbooks are more for the system configuration on top. So ideally I think you would have a Terraform setup for your actual nodes, your security groups, that kind of thing, and then probably have a custom-built node image. But for now the Ansible playbooks are generally just for setting up that host from whatever scratch Linux you have.
Yen K [00:30:49]: Okay. Okay. So technically it kind of sounds like initially you're going to have to set up Terraform in some shape or form and then hook it up to the Ansible playbooks that you have set up here. Is that right?
Aaron W [00:31:01]: Yeah, yeah. These playbooks kind of start at a higher level, expecting you to have something set up already.
Yen K [00:31:06]: Okay, cool. All right, thanks. That's good to know.
Abhishek [00:31:13]: Yeah.
James J [00:31:14]: Just a quick follow-on from what she was saying. I presume you use the Ansible to actually create a custom AMI or have some base image, then use that and customize it afterwards. Or are you having to run Ansible every single time you bring up a fresh node?
Aaron W [00:31:32]: We currently do run it every time we bring up a fresh node. We need to get a custom-built Oracle node image; that's definitely on our list.
James J [00:31:41]: A pre-built node image that you just...
Aaron W [00:31:43]: Ideally, yes, yes, you would have a custom node.
Abhishek [00:31:45]: There's some work going on in that, but I don't think I can disclose details.
Yen K [00:31:50]: Okay.
Abhishek [00:31:53]: Okay. So moving forward, if there are no other questions. Yep. So this is how AIStore looks in a Kubernetes environment, or in a Kubernetes cluster. At the very top we have a proxy StatefulSet which manages the number of proxies in the whole cluster. We also have targets. If you look at the targets, on the very left we have a pod, and that node will have a lot of disks. We typically use nodes with NVMe drives.
Abhishek [00:32:26]: On top of every NVMe drive we create a PV, which is then attached to the pod using a PVC, and this is how the data is stored on the disks. So if a client is reaching out to the cluster, it reaches out to an AIS proxy load balancer, which routes it to a proxy, and then finally the proxies redirect the client to the respective target. Once the request comes to the target, the target finds which disk the object is on and it is retrieved. And to control all of these things, we have an AIS operator which resides on one of the nodes. Any questions on the slide?
James J [00:33:15]: One quick question with the operator. Are you running leader elections? So if it dies, or the node it's stuck on disappears, are there copies of it running so that the rest of them will do a leader election and decide which one becomes the new leader? Otherwise the whole cluster is offline while it...
Pradeep [00:33:38]: We use slightly modified algorithms. So instead of consistent hashing, we actually use rendezvous hashing, which doesn't have as much dependency on a leader. So a lot of this is leaderless.
James J [00:33:56]: Okay. So the operator is not so critical to day-to-day operations of the cluster. It just means you can't provision any nodes or do anything like that if the operator went offline because it died and got restarted; the cluster would continue to be available. Or is the operator critical?
Pradeep [00:34:16]: Yes, yes. So we have the data path and the control path. The data plane path is taken care of by the AIStore code, but Kubernetes takes care of the control plane path. If the operator dies, Kubernetes will bring it back.
Abhishek [00:34:36]: It's a Deployment, so the operator will come back up again, identify what state the AIStore cluster is in, and take over from there. And about leaders: we have a primary proxy, but if that primary proxy goes down it's still fine. There is a leader election inside of AIStore as well to re-elect the primary proxy. We need the primary proxy for things like keepalive failures, reaching out, health checks and all those things.
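For the curious, here is a minimal sketch of rendezvous (highest-random-weight) hashing, the general technique Pradeep mentions: each object is owned by the node with the highest hash score, and removing a node only remaps the objects that lived on it, which is what keeps rebalancing incremental.

```python
import hashlib

def owner(obj, nodes):
    """Rendezvous (HRW) hashing: the node with the highest hash(node, obj) score wins."""
    def score(node):
        return int.from_bytes(hashlib.sha256(f"{node}:{obj}".encode()).digest(), "big")
    return max(nodes, key=score)

nodes = ["target-1", "target-2", "target-3", "target-4"]
objs = [f"shard-{i:04d}" for i in range(1000)]
before = {o: owner(o, nodes) for o in objs}

nodes.remove("target-3")                 # a target leaves (maintenance or failure)
after = {o: owner(o, nodes) for o in objs}

moved = sum(1 for o in objs if before[o] != after[o])
# Only the objects that lived on target-3 get new owners; everything else stays put.
print(f"{moved} of {len(objs)} objects remapped")
```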
Demetrios [00:35:15]: I see another question coming through from Rafael too. I just wanted to make sure.
Rafael [00:35:21]: Yeah, I would like to ask one thing. Maybe it's a dumb thing, but is the load balancer basically topology-aware of the cluster? I mean, I'm running a distributed workload and I would want to make sure that the client pods are able to reach the data that is most efficient to get, so closest in terms of topology and with the highest bandwidth at that point.
Abhishek [00:35:48]: So, in the production setups that we have, if the GPU nodes are in a certain region, we try to deploy the AIStore cluster in the same region. That's how we get the lowest latency and highest throughput, because they are mostly in the same region and they use the backbone of the infrastructure, like the backbone of the OCI network, to reach out or fetch data from the nodes.
Rafael [00:36:13]: Okay, thanks very much.
Abhishek [00:36:19]: Any other questions? Fine. Let's move ahead with the next slide. I think Aaron can take it up from here.
Aaron W [00:36:28]: Yeah, I'll try to move a little quickly here. So I just want to talk a little bit about why we deploy in Kubernetes and some of the things that it enables for us. One of the things that Kubernetes lets us do is scale linearly as nodes are added, like we talked about. Here in the slide I have a couple of images from our custom dashboards, from one of the clusters we ran a benchmark on. In this case we had a set of benchmark nodes that were just saturating the cluster with requests the whole time. In this top image here, you can see the throughput per node. And what we're showing here on the right is that each of these 16 target nodes contributes an equal 11.1 gigabytes per second to the cluster's overall throughput, which was around 178 gigabytes per second. And you can also see there's not really any deviation over two hours of benchmarks here.
Aaron W [00:37:28]: So we saw some stability there as well, and we ran it for much longer than that. Below here you can see the disk utilization on a single node, and you can see that it's also distributed across the disks on the individual node. So this just shows that because these objects are equally distributed across nodes and disks, you can expect your performance to scale linearly with your cluster size, as long as you have an appropriately scaled client. And one thing to note about keeping this even distribution across disks and nodes: AIStore does have those batch jobs. We have a batch job for rebalance across nodes, and we have a batch job for resilver, which balances objects across drives on a node. So as the data is moved or changed, you can keep that scaling going. On to the next slide, I wanted to talk a little bit about our observability stack.
Aaron W [00:38:29]: So today in production, we use a set of Helm charts to take advantage of some of the publicly available tooling in Kubernetes to monitor the cluster health. We have a whole set of AIStore custom metrics and alerts so we can keep an eye on exactly what's going on in the cluster. You can also see a lot of these through the CLI; we offer a whole performance section in the CLI. These AIS pod metrics include put and get metrics per bucket. You can see the error rate of different types of errors. You can see disk usage. You can see some of our cloud metrics as well.
Aaron W [00:39:09]: And once we export all those, we export them as Prometheus metrics. Then we use the Alloy tool from Grafana Labs to aggregate, relabel and ship all the signals: metrics, logs, traces. I have it divided up here in this image as each of the deployments. So we do offer a sample of what we use for our deployments of observability over a few different Helm charts. kube-prometheus-stack manages a lot of our local monitoring cluster in a separate node pool here, and we use this, on the right side here, to store some more detailed information for a shorter period of time without having to worry about external access. And then with Alloy we can connect any external LGTM stack, so Loki, Grafana, Tempo, Mimir, and we can push to anything else externally as well for our users to view. So we found Grafana Alloy to be pretty user-friendly.
Aaron W [00:40:21]: They call it a vendor-neutral distribution of the OpenTelemetry Collector. OpenTelemetry is a fairly new set of tools and standards for observability, but it has a lot of really good compatibility and integration with existing tools. So definitely check that out if you haven't tried OpenTelemetry. Any questions on those? All right, I'll keep moving then. So this is our custom dashboard, or just a couple of panels from it. You can see here that we bring in the metrics from some of these different tools. Right now we pull in system metrics from node exporter, so you can see things like disk; we have CPU and memory monitoring, network monitoring. And then we can also bring in metrics from Kubernetes, things like pod health, node health, StatefulSet size. And then of course we also get all of our metrics from AIStore. Some of these custom metrics that AIStore exports or makes available will include labels for filtering on certain things.
Aaron W [00:41:38]: So here I just showed a couple where, for an object get statistic, you can filter based on node to see how your nodes are performing relative to each other. You can filter based on bucket, so you can see who's getting lots of bucket usage, and you can see how large the objects being pulled from each bucket are. This is really useful for us for capacity planning and troubleshooting. Some of the other metrics we provide, like I said, are errors of different types and all of our statistics from buckets with different cloud backends. And then we also export statistics about a lot of the various batch jobs that AIStore runs. So it's kind of an ongoing project for us to figure out how to collect all of these metrics from all these different sources and show them to users in a way that's useful.
Aaron W [00:42:37]: And I'll post a doc in the chat with some of the metrics we expose as well.
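For a flavor of how such metrics get consumed downstream, here is a sketch against Prometheus's standard HTTP query API; the metric name ais_target_get_bytes and the bucket label are placeholders, and the real names are in the doc Aaron links.

```python
import requests

PROM = "http://prometheus.monitoring.svc:9090"   # placeholder Prometheus endpoint

def get_rate_by_bucket(metric="ais_target_get_bytes", window="5m"):
    """Sum a counter's per-second rate grouped by its bucket label (names are placeholders)."""
    query = f"sum by (bucket) (rate({metric}[{window}]))"
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("bucket", "unknown"): float(r["value"][1]) for r in results}

# Print per-bucket GET throughput, busiest bucket first.
for bucket, bps in sorted(get_rate_by_bucket().items(), key=lambda kv: -kv[1]):
    print(f"{bucket}: {bps / 1e9:.2f} GB/s")
```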
Pradeep [00:42:44]: Thanks, Aaron. So I'll move quickly through the last couple of slides in the interest of time; I would like to keep some time for questions from you guys in the community. We've been using AIStore in production for a few different cases. The one we are talking about today is automated speech recognition training. They train a bunch of different models and they are currently using about 128 high-end GPUs. It's a medium-sized cluster; it's not too large, but also not too small.
Pradeep [00:43:25]: They have GPUs in multiple places and they have a source of truth in one of our object stores, SwiftStack. So what AIStore provides them is workload portability. AIStore keeps the data in sync on each of these clusters. There's a different deployment, a different cluster, in each of these data centers, and they synchronize data with the source of truth. And AIStore is able to provide a very high throughput compared to other large storage services. On a per-GPU basis they are able to get up to 0.65 gigabytes per second per GPU, which is about 700 Gbps overall for 128 GPUs. Then on the right side, a little more detail about the hardware, the infrastructure. We use two different regions in OCI, IAD and ORD, and we are using 16 servers for each of these clusters, and the total capacity is 1.2 petabytes.
Pradeep [00:44:34]: Very good latency metrics: with a single stream we are able to get 10 MB in 17 milliseconds. You can parallelize that into multiple streams depending on how much network you have available. AIStore is able to saturate the network; in our benchmarks, we achieved up to 91% of network throughput. Then to talk a little about deployment. AIStore can be deployed in multiple ways. One is as a storage service that is abstracted away from users on a separate set of storage nodes. But another way is to deploy it in a hyper-converged format where it uses the GPU nodes' SSDs.
Pradeep [00:45:19]: A lot of GPU nodes have a huge amount of SSD capacity, and it is highly underutilized today. Customers want a way to use those SSDs, and they want an easy way. Using them individually is not useful because of the repeated downloads, so we need to pool them together into one single high-capacity store, and AIStore provides that. Then on that deployment you can configure the durability with erasure coding. You can turn it on or off, because GPU nodes do tend to go down more frequently than other storage nodes. And then you can also have the backend object store.
Pradeep [00:46:04]: We have providers written for many of the different ones that Abhishek also mentioned earlier. We can also do without an object store in the backend, so data can just be written to AIStore locally. For example, not every checkpoint needs to be stored in a very highly durable form; some intermediate checkpoints can be stored just in the local AIStore. And the advantage is you get the high network throughput within the cluster as well as very low cost, because people are already paying for those GPU nodes and the SSDs are already there. It's essentially free. Any questions? Now we have some questions for the community, so everybody please feel free to jump in without having to be called on.
Pradeep [00:47:06]: I would love to make it as interactive as possible. What kind of storage do you currently use? I heard Steen mention Lustre earlier; what other kinds of storage do you use?
Demetrios [00:47:31]: And I see somebody's raising their hand. Yeah, you want to jump in? And then if others want to just drop answers in the chat too, for the simple ones, that is helpful.
James J [00:47:47]: Sure.
Steen M [00:47:47]: I just wanted to volunteer myself, if for nothing else than to break this silence. So yeah, as I alluded to before, Lustre has been part of the stack for a couple of years now. But one of the things that really troubled us was the "just throw more CPUs at it" kind of problem solving that we experienced, which is also why we're looking in several other directions, both for the hot storage tiering but also for stuff like global namespacing. On paper a lot of problems are about performance, but when it comes down to it, it's really about discoverability and availability of data across various compute environments. So I'd like to hear your thoughts about that. It's a congested market, there are many vendors, and this solution, at least at the slide level, seems to overlap with some of the commercial vendors on the market right now that we're looking into. So, some thoughts around that.
Pradeep [00:48:54]: Yeah, absolutely. So we support many different commercial vendors, we certify them, and we are happy that they are building some solutions that have overlaps with this. See, this was started seven years ago. At that point there was nothing in the market like this. And even today we feel there are still some gaps. We are happy for others to catch up, and we are happy to provide them our code and our designs and everything. Our interest is to make sure that GPUs are kept fed with data without interruption, to make the best possible use of the GPUs.
Pradeep [00:49:34]: Yeah, and we are building this internally, using this internally. We are also working with others who want to use it. Does that answer your questions?
Steen M [00:49:49]: Well, somewhat, at least, and I think it's probably a much, much longer conversation if we need to go into the technical details. So I'll just stop here.
James J [00:49:57]: Thanks.
Pradeep [00:49:58]: Yeah, exactly.
Demetrios [00:49:59]: I like that pragmatism. I see another hand, James. There we go. What you got for us?
James J [00:50:07]: I'm going to throw a bit of a spanner in the works. I suppose we're talking about SBX storage and just general storage. How difficult would it be to integrate into a feature store? It's the same problem, especially with ML training: pulling all the features down from a feature store and having them locally available on the cluster while it's doing massive training jobs. Is that functionality there, or could we build it ourselves if we needed to? I'm just curious.
Pradeep [00:50:42]: Yes. So the interface it provides to clients is an object-like interface, but on the back end you can implement your own provider to connect to a feature store and put the data into some Parquet format or any other format that gives you a bag of bytes. And yes, it's absolutely possible to do that.
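Conceptually, the pluggable-provider idea Pradeep describes, the same generic backend interface Abhishek mentioned earlier, reduces to a small get/put contract per bucket and key. A hypothetical Python sketch purely to illustrate the shape; AIStore's real backend providers are written in Go, and the feature-store methods below are invented for the example.

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Hypothetical minimal backend contract: a bag of bytes per (bucket, key)."""

    @abstractmethod
    def get(self, bucket, key):
        ...

    @abstractmethod
    def put(self, bucket, key, data):
        ...

class FeatureStoreBackend(Backend):
    """Toy provider serving feature groups (e.g. serialized Parquet) as objects."""

    def __init__(self, feature_store):
        self.fs = feature_store  # any client exposing read_group/write_group (invented names)

    def get(self, bucket, key):
        return self.fs.read_group(group=bucket, name=key)      # bytes of a feature group

    def put(self, bucket, key, data):
        self.fs.write_group(group=bucket, name=key, data=data)
```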
James J [00:51:03]: Okay, cool. That's really interesting. I know it's not what you guys...
Demetrios [00:51:13]: Uh. Oh wait, we may have lost you. Or maybe it was me.
Pradeep [00:51:19]: Yeah, no, yeah, I don't hear James anymore.
Steen M [00:51:22]: Definitely. Yes.
Demetrios [00:51:23]: Okay. James, we lost you.
Pradeep [00:51:28]: Hey.
James J [00:51:29]: Oh, sorry. I just said that was it. That was my question, over. There was nothing else.
Pradeep [00:51:34]: Thanks.
James J [00:51:35]: I was hiding myself to show that I finished.
Demetrios [00:51:39]: All right, let's keep it rocking. What other questions are there?
Pradeep [00:51:49]: So what training or inference frameworks are people using?
Demetrios [00:51:56]: Say that again: training frameworks or inference? Sorry, I missed it because I was talking. Training frameworks and inference, yeah. Okay, who's got something? I would also love to hear from some other people. I know James is willing to chat, I'm sure...
Demetrios [00:52:28]: ...which I really appreciate about you, James. But maybe there's other folks that want to chime in, or you can just drop it in the chat so that we can go from there. I see Rafael, you got something? What you got for us?
Rafael [00:52:43]: Thanks very much. I have a question on the previous point about the challenges of storage. I'm interested in how AIStore, from what you offer, supports multi-tenancy and maybe tenant separation in terms of resources. We'll be working with multiple teams, multiple workloads, maybe multiple customers, and we would like to ensure, for security reasons, that they only operate on their separate backend partition, or maybe something like this.
Pradeep [00:53:20]: Yes. So we do have authentication and authorization built in. We didn't cover details on that, but you can find more details in the documentation. You can authorize specific users to specific buckets, and that way you control access. We are also able to utilize any underlying encryption features that are provided by the underlying storage; in OCI, we do use their encryption. Now, from a capacity perspective, we actually don't have strict quotas. The capacity is shared across all users; it's kind of similar to a shared file system approach.
Pradeep [00:54:07]: This is not meant to be used as a shared cluster across different companies. It's meant to be used just within one company, cooperatively among different teams.
Rafael [00:54:19]: Okay, so basically each team should deploy its own AIStore for its use, yes, and select appropriate nodes for their workloads.
Pradeep [00:54:33]: Depends on how strict of a separation you need. Internally, we do have multiple teams sharing the same clusters. It's just the organization that's keeping them separate. They do share the same capacity. We manage capacity more on a resource governance level.
Rafael [00:54:51]: Okay, got it. So basically at this point, there are no features that would support, like, availability or zone separation in terms of given capacity units.
Pradeep [00:55:06]: No. Yeah.
Rafael [00:55:07]: Thanks very much.
Pradeep [00:55:10]: Thanks. Yeah, good questions.
Demetrios [00:55:17]: And Yen, I see you.
Yen K [00:55:19]: Yeah, I have a question here about AIStore. I was just taking a look at the git repo that you have for AIStore and I was curious. It seems like it's more catered to the AWS cloud platform. What about Azure and GCP? How easy is it to configure or set this up if teams want to leverage its AI/ML capability with AIStore? Is it pretty straightforward?
Abhishek [00:55:53]: In the repo you'll see mentions of S3 because a lot of people usually use S3, but we do support the other backends as well, which are GCP, Azure and now OCI. There are ways where you can add your credentials, and you can get started with a local cluster as well if you want to try it out today.
Yen K [00:56:14]: Oh, okay, okay. And would it be... Is there a separate git repo for that, or is the configuration just slightly different?
Abhishek [00:56:22]: It's just a configuration change; you just need to add the right credentials. For Azure I think you need to set environment variables or files, and it should then pick up your buckets on Azure.
Yen K [00:56:37]: Okay.
Abhishek [00:56:38]: I think it's containers in Azure.
James J [00:56:40]: Yeah.
Aaron W [00:56:40]: I will say, if you're looking at the Kubernetes repo, there's definitely some missing stuff for Azure and OCI for automatic configuration. If you do configure it and mount your bucket properly, AIStore itself as the cluster will work. We just don't have all of the automation set up for all those yet.
Pradeep [00:56:58]: Okay.
Yen K [00:57:00]: That's good to know.
Abhishek [00:57:03]: But you can raise a feature request through the repo and we can take a look into it as soon as possible.
Aaron W [00:57:10]: Yeah, and that goes for all these things. If you guys have stuff you want, go open an issue and we'll prioritize things a bit if we see requests for them.
Yen K [00:57:20]: Okay, yeah, that's good to know.
Demetrios [00:57:26]: Excellent. Well, this is super cool. I want to just give a huge thanks to the Nvidia team and to everybody for joining us. James was asking about the recording: we're not going to make it public, but we will share it with everyone that came to this session. That was kind of the deal, that if you come to the session we can share it with you, but we ask that you don't go and post it all over the Internet. And yeah, hopefully there have been some fruitful learnings, and if anybody wants to continue the conversation, that is great, you can hang out on here and keep talking. I gotta go, but I will see you all later. Hopefully soon.
Demetrios [00:58:10]: We're doing an AI in Production event coming up, so join us for that, and otherwise I'll catch you all next time around. See ya. Thanks guys.
Steen M [00:58:20]: See ya.
Pradeep [00:58:21]: Thank you.
Abhishek [00:58:22]: Thank you.
