Distributed Training in MLOps: How to Efficiently Use GPUs for Distributed Machine Learning in MLOps

Optimizing GPU Utilization for Scalable Machine Learning Operations
March 19, 2025


Efficient GPU orchestration is key (Image AI Generated using Ideogram 2.0)
Efficient GPU orchestration enables MLOps to support distributed training and serving of increasingly complex models. GPUs excel where CPUs cannot meet the computational demands of large-scale machine learning models or petabyte-scale datasets. By processing thousands of operations concurrently, GPUs reduce both training time and cost.
Organizations can lower energy usage and operational costs by effectively distributing workloads across multiple GPUs. For instance, DeepSeek-V3, a 671-billion-parameter Mixture-of-Experts model, was trained on 14.8 trillion tokens using a 2,048-GPU cluster in under two months, consuming about 2,788,000 NVIDIA H800 GPU hours. This shows how efficient GPU use can significantly boost performance at scale.
If you missed the first article of my mini-series on distributed processing in MLOps, I explored core strategies for distributed training and practical MLOps implementation patterns. In this Part 2, we dive deeper into GPU-accelerated distributed training, unpacking critical technical considerations for scaling modern Machine Learning workloads. Key topics include:
- System setup — how to enable GPU-accelerated machines for distributed training
- Orchestration — Strategies for architecting GPU-optimized clusters in Kubernetes and HPC systems
- Performance optimization — System-level tuning to maximize throughput and resource utilization.
See other articles in this series:
Accelerate MLOps with Distributed Computing for Scalable Machine Learning
Enable your MLOps platform to train bigger and faster — because your models deserve a “team effort”
Break GPU Vendor Lock-In: Distributed MLOps across mixed AMD and NVIDIA Clusters
Unlocking the forbidden cloud hack: Running distributed PyTorch jobs on mixed GPU clusters with AWS NVIDIA g4dn and AMD…
I used AWS G4dn instances with NVIDIA T4 GPUs and G4ad instances with AMD Radeon Pro V520 GPUs for this demo. Although this particular AMD GPU is no longer officially supported by the latest ROCm releases, the necessary libraries can be unofficially built from source. All image definitions and build instructions — for CUDA 12.4 on Turing architecture and ROCm 6.2.2 on GFX1011, plus PyTorch 2.5.1 with example training jobs and cloud infra — are available in my repo:
GitHub — RafalSiwek/distributed-mlops-overview: An exploration of distributed processing solutions…
Enabling Multi-GPU Communication for Distributed Training
Efficient multi-GPU communication is critical for scaling distributed training, yet achieving near-linear speedup is challenging without proper optimizations.
Imagine distributing a model across multiple GPUs: in theory, training should scale almost linearly, yet in practice, each all_reduce operation — the backbone of gradient synchronization — turns into a series of memory copies. Without GPU-optimized communication libraries, data takes a long detour: GPU → PCIe → CPU RAM → PCIe → NIC → Network → Repeat. This introduces significant latency and throughput reduction.

With gRPC-based collectives, data moving between GPUs must be staged in CPU memory, which increases processing latency
Standard multi-node communication methods, such as gRPC, are common in many distributed systems. gRPC is excellent for general-purpose remote procedure calls but is not tailored for high-throughput, low-latency tensor transfers: it relies on the CPU for serialization, deserialization, and extra data staging across the network layers. This makes gRPC considerably slower for large tensor transfers.
GPU-optimized collective communication libraries, like NCCL (NVIDIA) and RCCL (AMD), offer far better performance. Their advantage comes from designs that exploit modern hardware topologies — such as executing an entire all-reduce operation in a single CUDA or ROCm (HIP) kernel. In TCP-based cloud network environments, NCCL/RCCL aggregates small tensors into fewer large packets, amortizing TCP/IP overhead. This avoids fragmented communication and lets data move directly from GPU memory to the network interface, freeing up CPU cycles.

Benchmarks show that these libraries can accelerate multi-node communication by up to 5–6x compared to standard CPU-mediated methods like gRPC, especially when large tensor workloads are involved.

Comparison of throughput performance between MPI+NCCL and gRPC (source: https://arxiv.org/pdf/1901.01703)
Frameworks like PyTorch abstract these backends behind a common distributed API. NCCL/RCCL excel in GPU workloads, using topology-aware algorithms (ring, tree) to maximize network bandwidth, which makes them ideal for homogeneous multi-node GPU clusters. In contrast, Gloo prioritizes portability but relies on CPU-mediated network transfers.
Implementing GPU-accelerated, multi-process distributed training with PyTorch is straightforward and involves two main steps: initializing the process group with the GPU-optimized nccl backend (RCCL on ROCm builds) and binding each process to its local GPU before wrapping the model in DistributedDataParallel.
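A minimal sketch of those two steps, assuming the script is launched with torchrun (which sets LOCAL_RANK) and using a toy linear layer as a placeholder model:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Step 1: initialize the process group with the GPU-optimized backend
# ("nccl" resolves to NCCL on CUDA builds and RCCL on ROCm builds of PyTorch).
dist.init_process_group(backend="nccl")

# Step 2: bind this process to its local GPU and place the model on it.
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10).to(local_rank)        # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])  # gradients sync via all_reduce
```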
The rest stays the same as for standard distributed training:

Multinode distributed training with PyTorch DDP example using torchrun elastic launcher on 2x AWS G4dn.xlarge instances (NVIDIA T4 GPU)

Multinode distributed training with PyTorch DDP example using torchrun elastic launcher on 2x AWS G4ad.xlarge instances (AMD Radeon V520 GPU)
PyTorch works the same with both AMD's RCCL and NVIDIA's NCCL libraries, requiring no extra effort: their APIs are nearly identical thanks to AMD's HIP portability layer and device-aware builds of PyTorch. For a hands-on look at how all_reduce works — and to compare raw C++ implementations with NCCL and RCCL optimizations — I've built a practical example here.
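As a quick PyTorch-level illustration of the collective itself (a minimal sketch assuming a torchrun launch; the repo's C++ versions go into the lower-level details), each rank contributes its own rank id and every rank receives the sum:

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL on CUDA, RCCL on ROCm
local_rank = int(os.environ["LOCAL_RANK"])     # provided by torchrun
torch.cuda.set_device(local_rank)

# Every rank fills a GPU-resident tensor with its own rank id ...
tensor = torch.full((4,), float(dist.get_rank()), device=f"cuda:{local_rank}")

# ... and a single collective sums the tensors in place on every rank,
# letting NCCL/RCCL handle the transport instead of CPU-mediated copies.
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

expected = sum(range(dist.get_world_size()))
print(f"rank {dist.get_rank()}: got {tensor[0].item()}, expected {expected}")
dist.destroy_process_group()
```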
Moving from CPU-mediated communication to GPU-optimized collective operations makes distributed deep learning more efficient. By removing bottlenecks and enabling near-linear scaling, these advancements make large-scale models and datasets viable for production.
GPU-Accelerated Distributed Training on Kubernetes
Managing thousands of GPU devices — like the 5,000-plus deployed at Uber — presents a formidable challenge. Organizations must ensure effective GPU utilization to avoid over-provisioning or underutilization. Without proper scheduling and cluster scaling, these resources can become costly and inefficient, consuming power while staying idle.
Kubernetes is the most popular platform for container orchestration, automating provisioning, scheduling, and scaling across large clusters. It efficiently manages GPU resources by:
- Abstracting infrastructure via declarative configurations.
- Integrating vendor drivers (e.g., NVIDIA, AMD) for GPU resource exposure.
- Optimizing GPU allocation with specialized schedulers.
- Allocating resources based on topology.
- Automating maintenance tasks like driver updates.
These features make Kubernetes perfect for large-scale AI and Machine Learning operations.
To enable GPU support in Kubernetes, install the vendor-specific drivers and deploy the matching device plugin (e.g., NVIDIA or AMD). This exposes custom GPU resources (like nvidia.com/gpu) for pods to request:
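As a minimal sketch using the official kubernetes Python client (the pod, image, and script names are placeholders), a training pod requests a GPU through the extended resource exposed by the device plugin:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# Placeholder names; the key detail is the extended resource under "limits":
# the NVIDIA plugin exposes "nvidia.com/gpu", the AMD plugin "amd.com/gpu".
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-worker"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="my-registry/pytorch-train:latest",  # placeholder image
                command=["python", "train.py"],            # placeholder script
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one full GPU for this pod
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The same pattern applies on AMD nodes, where the device plugin exposes amd.com/gpu.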
With these resource requests in place, a distributed training job can be launched using the same methods described in the first article of the series.
GPU Operators
Kubernetes provides access to specialized hardware — NICs, InfiniBand, and RoCE adapters — through its device plugin framework. However, managing nodes with these resources requires configuring multiple software components, including drivers, container runtimes, and libraries.
GPU operators provided by AMD and NVIDIA simplify the task by automating the management of all necessary components, including device drivers (for CUDA or ROCm), the Kubernetes device plugin for GPUs, automatic node labelling, and resource monitoring:

Manual install vs. automation by GPU Operator with fully containerized components (based on: NVIDIA)
These operators deploy components that automate GPU node provisioning, node annotation, and feature discovery. They install and update drivers, configure telemetry, and perform resource discovery to ensure each node’s GPU capabilities match scheduled workloads. For example, NVIDIA’s GPU operator configures device sharing — allowing a single GPU to serve multiple tasks — provisions the optimal driver for machine learning frameworks like PyTorch, and integrates with RDMA-capable resources to increase data throughput.
Both operators also expose monitoring and observability metrics that can be consumed by tools such as Datadog or Dash0 to provide real-time insights into resource utilization and enable dynamic cluster monitoring.
Despite these benefits, the extra initialization steps can increase pod launch times, with delays potentially accumulating across distributed operations:

Running GPU-accelerated PyTorchJob on 2x AWS G4dn.xlarge nodes (with NVIDIA T4 GPU) took ~20 min to launch, install drivers and pull the container images. The training job took 5 seconds
Additionally, careful integration with external webhook handlers and the plugin CRD is essential to avoid potential misconfigurations.
By orchestrating thousands of high-performance GPUs, Kubernetes dramatically cuts training times — processing petabyte-scale datasets in days rather than months. Integrating Kubernetes’ scheduling with next-generation hardware simplifies machine learning, allowing faster training of larger models while reducing costs through optimized resource use.
Performance Tuning and Optimizations
Despite advanced communication libraries and Kubernetes plugins, GPU utilization often plateaus at 60–70% due to resource fragmentation, communication overhead, and scheduling inefficiencies. This underutilization hampers distributed training workflows, delaying model training and increasing operational costs.
Addressing these bottlenecks increases GPU utilization, shortens training time, and makes better use of infrastructure spend. For companies, this means quicker time-to-market for AI products and better ROI on GPU hardware.
Increasing GPU Utilization with Device Sharing
Even as teams optimize batch sizes and dataloaders to maximize GPU usage, fully saturating GPUs remains tricky. Sometimes, it is more efficient to share a single physical GPU across workloads, such as pods or processes. However, maintaining isolation and avoiding performance degradation in this setup is a critical challenge.

GPU concurrency mechanisms (source: NVIDIA)
NVIDIA GPUs offer advanced sharing capabilities through CUDA Multi-Process Service (MPS), time-slicing, and Multi-Instance GPU (MIG) partitioning. All of these features are supported and can be configured through the Kubernetes device plugin:
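For illustration, here is a sketch (not an official recipe) of delivering a time-slicing configuration for the NVIDIA device plugin as a ConfigMap via the kubernetes Python client; the ConfigMap name, namespace, and profile key are placeholders and must match what the plugin or GPU operator is configured to read:

```python
from kubernetes import client, config

config.load_kube_config()

# Time-slicing configuration understood by the NVIDIA k8s device plugin:
# each physical GPU is advertised as 4 schedulable nvidia.com/gpu replicas.
time_slicing_profile = """\
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
"""

config_map = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(
        name="time-slicing-config",   # placeholder ConfigMap name
        namespace="gpu-operator",     # placeholder namespace
    ),
    data={"any": time_slicing_profile},  # key = named config profile
)

client.CoreV1Api().create_namespaced_config_map(
    namespace="gpu-operator", body=config_map
)
```

With four replicas advertised per card, four pods can each request nvidia.com/gpu: 1 and time-share a single physical GPU.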
AMD GPUs allow device sharing via virtualization partitioning using direct SR-IOV (Single Root I/O Virtualization) or AMD-branded MxGPU SR-IOV.
SR-IOV enables a physical PCIe GPU to be subdivided into several “virtual functions” (VFs), with each VF appearing as an independent, dedicated GPU instance that can be directly assigned — or “passed through” — to a guest environment such as a virtual machine or container.
The ability to partition an AMD GPU is typically detailed in its datasheet, as with the AMD Instinct™ MI325X Accelerator, which specifies virtualization support for up to eight partitions via SR-IOV. Enabling VF provisioning over SR-IOV requires a specific system configuration and may involve manual VF provisioning.
The AMD K8s Device Plugin can enumerate these GPU partitions as available devices, enabling them to be schedulable Kubernetes resources.

GPU partitioning mechanism over SR-IOV
Limitations:
- When using MIG instances, NCCL treats each slice as a separate device. Cross-MIG communication might route through the network (even on the same GPU), increasing latency.
- GPU sharing with SR-IOV partitioning is not a purely software-configurable approach, and the AMD K8s GPU plugin does not provide the same configuration flexibility as the NVIDIA operator.
Configuring GPU sharing for Kubernetes workloads lets teams match GPU allocation to workload requirements, supporting lightweight inference tasks alongside intensive training jobs on the same hardware. These optimizations lower infrastructure costs through higher GPU utilization.
Network, NUMA and Topology Awareness
Optimizing GPU utilization helps reduce inefficiencies at the device level, but multi-node distributed training faces a different challenge — the overhead caused by network communication and hardware topology:

Multinode distributed training with PyTorch DDP example using torchrun elastic launcher on 1x G4dn.12xlarge with four NVIDIA T4 GPUs (left) and 4x G4dn.xlarge with one NVIDIA T4 GPU each (right)
As shown in the video comparison above, when workloads span nodes, communication latency over slow TCP transport (25 Gbps) combined with cross-NUMA memory access contention silently throttles performance.
NUMA alignment of scheduled processes minimizes memory latency and maximizes efficiency. In NUMA systems, each processor has local memory that can be accessed much faster than remote memory over an interconnect. For example, if a training process runs on a processor but its data resides on a remote NUMA node, the extra latency from remote access can slow down the overall training process.

It is faster to access local resources than remote ones. (source: Boost)
In Kubernetes clusters, heterogeneous node architectures increase the challenge, requiring precise alignment of GPUs, CPUs, and memory to NUMA domains to minimize latency.
The Topology Manager can enforce NUMA alignment with --topology-manager-policy=single-numa-node and --cpu-manager-policy=static. Additionally, numactl can be used in pods to bind CPUs/GPUs to NUMA nodes:
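For example, a container spec might wrap the training entrypoint in numactl to pin it to one NUMA node. This is a sketch with placeholder image and script names, assuming the image ships the numactl binary; it slots into a pod spec like the one shown earlier:

```python
from kubernetes import client

# Container whose entrypoint binds the training process to NUMA node 0
# (CPU cores and memory) before starting PyTorch.
trainer = client.V1Container(
    name="numa-pinned-trainer",
    image="my-registry/pytorch-train:latest",  # placeholder image
    command=[
        "numactl",
        "--cpunodebind=0",   # run only on CPUs of NUMA node 0
        "--membind=0",       # allocate memory only from NUMA node 0
        "python", "train.py",
    ],
    resources=client.V1ResourceRequirements(
        limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"}
    ),
)
```

With --cpu-manager-policy=static, an integer CPU request like this one receives exclusive cores, which the Topology Manager can then align with the GPU's NUMA node.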
Modern schedulers like Volcano provide NUMA-aware scheduling capabilities:
volcano/docs/design/numa-aware.md at master · volcano-sh/volcano
A Cloud Native Batch System (project under CNCF).
Limitations:
- Strict NUMA policies may lead to pod scheduling failures if resources are fragmented. Although a node might have enough total resources, they may be spread across multiple NUMA zones — each typically formed by a separate CPU socket with its dedicated memory — instead of being available together in one zone. As a result, a pod that requires all its resources from a single NUMA domain may not get them and will remain stuck in the Pending state.
- Mixed NUMA topologies across cluster nodes complicate affinity rules because inconsistent NUMA configurations — such as differing numbers of nodes or mismatched CPU and memory layouts — make it challenging to schedule related workloads on a single NUMA node. As a result, workloads can span multiple NUMA nodes (i.e., be split across them), which forces remote memory access, increases latency, and degrades performance.
By co-locating GPUs with their associated CPUs and memory within the same NUMA node, teams reduce cross-domain traffic and PCIe bottlenecks, streamlining communication for collective operations.
Reducing CPU-GPU Bottlenecks
Slow data transfers between CPU and GPU, or inefficient preprocessing, stall training pipelines and leave GPUs idle.
Optimizing the data loader to pre-allocate data in pinned CPU memory can address this issue. This enables asynchronous, non-blocking transfers to the GPU. In PyTorch, you can achieve this by setting pin_memory=True.
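A minimal sketch of this pattern, with a toy dataset standing in for a real preprocessing pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset as a placeholder for a real feature pipeline.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,     # preprocess batches in parallel on the CPU
    pin_memory=True,   # stage batches in page-locked (pinned) host memory
)

device = torch.device("cuda:0")
for features, labels in loader:
    # non_blocking=True lets the host-to-device copy overlap with compute;
    # this only helps when the source tensors live in pinned memory.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass here ...
```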
Another way to tackle this issue is to bypass the CPU entirely when transferring data between GPUs during collective operations, using Remote Direct Memory Access (RDMA).

With RDMA the data can be transferred between the devices directly, without engaging the CPU (source: Dell blog)
RDMA requires compatible hardware, including RDMA-capable NICs (e.g., Mellanox ConnectX-6/7 for InfiniBand/RoCE) and GPUs with GPUDirect RDMA support (NVIDIA A100/H100 or AMD Instinct MI-series). On Kubernetes, RDMA is enabled via device plugins (e.g., k8s-rdma-device-plugin) that expose RDMA resources, plus kernel modules like nvidia-peermem (for NVIDIA GPUs) or rocm-core with libfabric (for AMD GPUs).
AWS EKS supports RDMA over AWS EFA, but only on selected instance types (there is no support for g4ad); the performance increase, however, is substantial:

AWS EFA performance (source: AWS blog)
Adopting DMA and RDMA-driven data transfer strategies unlocks significant efficiency gains. Pinned memory allocation reduces latency, allowing GPUs to process data continuously without waiting for CPU-GPU transfers.
Fast Storage Access
Inefficient storage slows data retrieval, limiting scalability when hundreds or thousands of GPUs access feature stores or warehouse systems concurrently.
Multimodal workloads — for example, vision-language model training — demand rapid loading of massive models onto GPUs. Tensor-parallel operations force thousands of devices to access shared blocks of object-stored training data and metadata simultaneously, while frequent checkpointing requires scratch storage that can handle continuous overwrites.
AI-oriented storage solutions from vendors such as Weka, Vast Data, DDN, or Dell ECS implement hybrid software and hardware solutions for RDMA, channelling data directly from storage to GPUs while bypassing CPU overhead:

RDMA-optimized storage increases throughput by bypassing the CPU
Using performant storage systems enables high IOPS and ultra-low latency to support thousands of concurrent operations.
Collectives Benchmarking and Tuning
Network fabrics, GPU architectures, and workload patterns differ, requiring tuning of collective operations like all_reduce and all_gather. For example, adjusting the chunk size parameter defines the data processed per reduction step, while setting the ring buffer count determines the number of buffers for overlapping computation and communication.
Frameworks like NCCL and RCCL offer tuner plugins to auto-optimize algorithms and protocols, while MPI provides guidelines for parameter adjustments. However, these tools need iterative benchmarking to ensure their effectiveness.
Regardless of the approach, benchmarking the setup using available tests — such as the OSU benchmark for MPI collectives or the dedicated NCCL and RCCL test suites — ensures accurate performance assessment:
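As a lightweight starting point, a PyTorch-level micro-benchmark of all_reduce (a sketch assuming a torchrun launch) can give a first estimate before reaching for the dedicated suites:

```python
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")     # NCCL on CUDA, RCCL on ROCm
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

for num_floats in (2**16, 2**20, 2**24):    # 256 KiB, 4 MiB, 64 MiB of fp32
    tensor = torch.ones(num_floats, device=f"cuda:{local_rank}")
    for _ in range(5):                      # warm-up iterations
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        size_mb = tensor.numel() * 4 / 1e6
        # Algorithm bandwidth: bytes moved per all_reduce call per second.
        print(f"{size_mb:8.1f} MB  {elapsed * 1e3:6.2f} ms/iter  "
              f"{size_mb / 1e3 / elapsed:6.2f} GB/s")

dist.destroy_process_group()
```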
Integrating these optimizations into Kubernetes-based pipelines and automation tools for periodic re-tuning keeps AI infrastructure efficient against evolving workloads. It ensures high GPU utilization and cost-effective model training at scale.
Summary
Managing clusters with thousands of GPUs presents challenges such as network congestion, inefficient resource allocation, and communication overhead, which can lead to performance bottlenecks and increased operational costs. Without effective orchestration, organizations risk underutilized hardware, slower training times, and higher energy consumption.
Strategies to improve GPU efficiency include:
- Optimized Multi-GPU Communication: NCCL (NVIDIA) and RCCL (AMD) reduce latency by enabling direct GPU-to-network communication, avoiding CPU bottlenecks, and improving throughput.
- Kubernetes for Scalability: Kubernetes automates GPU resource management, ensuring efficient scheduling and scaling across multiple machines and enabling faster training on massive datasets.
- Performance Optimization: GPU sharing (MPS, time-slicing, MIG), NUMA-aware scheduling, and RDMA for data transfers maximize GPU utilization and reduce idle cycles.
- Benchmarking Collectives: Tuning collective communication backends (NCCL, RCCL, MPI) through iterative benchmarking ensures optimal performance, leveraging hardware capabilities for faster training.
Strategic distributed GPU orchestration allows organizations to efficiently leverage hardware, drive innovation, reduce costs, and dominate competitive markets.
. . .
Reach out to me via my social media
Originally posted at: https://medium.com/weles-ai/how-to-efficiently-use-gpus-for-distributed-machine-learning-in-mlops-94add9801a2b