Nvidia outlines AI strategies for GPUs and Kubernetes

Nvidia's strategies to accelerate AI development on GPUs with Kubernetes. Source: CNCF

At the KubeCon event in Paris this week, Nvidia engineers spoke of challenges and solutions for accelerating AI workloads using GPUs with Kubernetes.

Some measures involve Nvidia making its hardware and firmware more amenable to Kubernetes, while others will require extensions and addons to the cloud native orchestration platform.

During the keynote, engineering manager Sanjay Chatterjee introduced Nvidia Picasso, a generative AI foundry that allows businesses to build and deploy foundation models for computer vision. It is built on Kubernetes and supports the model development lifecycle from training to inference.

Nvidia, whose GitHub account hosts several libraries and plugins for Kubernetes, will continue to support the development of AI infrastructure by contributing to the cloud native ecosystem, both around Picasso and wherever else it touches GPUs, Chatterjee said.

To expand the potential of GPUs for running AI workloads on Kubernetes, Nvidia is tackling several challenges at once, from low-level Kubernetes mechanisms for requesting GPU access to higher-level processes for mapping GPUs to workloads.

Chatterjee walked through three key areas of focus: topology-aware placement, fault tolerance and multi-dimensional optimisation.

Topology-aware placement is important for optimising GPU utilisation in large-scale clusters. By minimising the distance between the nodes that a given AI workload runs on, Nvidia aims to improve cluster occupancy and performance.
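As a rough illustration of the idea, and not Nvidia's actual scheduler, the Python sketch below places a job on as few racks as possible. The function name `place_job`, the inventory dictionaries and the rack labels are all hypothetical; a real cluster would read topology from node labels or a fabric manager.

```python
from collections import defaultdict


def place_job(free_gpus, rack_of, gpus_needed):
    """Pick nodes for a job while touching as few racks as possible.

    free_gpus: dict of node name -> free GPU count (hypothetical inventory)
    rack_of:   dict of node name -> rack label (hypothetical topology label)
    Returns a dict of node -> GPUs taken, or None if the job cannot fit.
    """
    # Group nodes that still have free GPUs by rack.
    racks = defaultdict(list)
    for node, free in free_gpus.items():
        if free > 0:
            racks[rack_of[node]].append(node)

    capacity = {
        rack: sum(free_gpus[n] for n in nodes) for rack, nodes in racks.items()
    }

    # Best case: one rack holds the whole job. Take the tightest fit so that
    # larger racks stay available for larger jobs.
    fitting = [rack for rack, cap in capacity.items() if cap >= gpus_needed]
    if fitting:
        order = [min(fitting, key=lambda r: (capacity[r], r))]
    else:
        # Otherwise span racks, largest first, which minimises the rack count.
        order = sorted(capacity, key=lambda r: (-capacity[r], r))

    placement = {}
    remaining = gpus_needed
    for rack in order:
        # Fill the emptiest nodes first to keep the job on fewer nodes.
        for node in sorted(racks[rack], key=lambda n: (-free_gpus[n], n)):
            take = min(free_gpus[node], remaining)
            placement[node] = take
            remaining -= take
            if remaining == 0:
                return placement
    return None  # not enough free GPUs anywhere in the cluster
```

With two four-GPU nodes in rack A and two eight-GPU nodes in rack B, an eight-GPU job lands entirely in rack A, leaving rack B free for a larger job.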

Fault-tolerant scheduling ensures the reliability of training jobs by adding visibility into the underlying GPU infrastructure, detecting faulty nodes early, and automatically redirecting workloads to new nodes when necessary. This is important for large jobs, such as AI model training, where faults can lead to performance bottlenecks and possible failure.

"Automated fault-tolerant scheduling, which is both reactive and proactive, is an essential requirement with scaling out in GPU clusters," Chatterjee explained.
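The distinction between reactive and proactive handling can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Nvidia's implementation: the `reschedule` function, the `ready` and `ecc_errors` health fields and the error threshold are hypothetical stand-ins for whatever signals a real GPU health monitor exposes.

```python
def reschedule(assignments, node_health, spare_nodes, ecc_threshold=10):
    """Move jobs off failed or degrading nodes onto healthy spares.

    assignments: dict of job -> node it currently runs on
    node_health: dict of node -> {"ready": bool, "ecc_errors": int}
                 (hypothetical health signals)
    spare_nodes: list of idle node names
    Returns (new_assignments, moves); each move is (job, old, new, reason).
    """
    def usable(node):
        health = node_health[node]
        return health["ready"] and health["ecc_errors"] < ecc_threshold

    # Only healthy spares are candidates for taking over work.
    spares = [node for node in spare_nodes if usable(node)]
    new_assignments = dict(assignments)
    moves = []

    for job, node in sorted(assignments.items()):
        health = node_health[node]
        if not health["ready"]:
            reason = "reactive"   # the node has already failed
        elif health["ecc_errors"] >= ecc_threshold:
            reason = "proactive"  # the node is degrading, so move early
        else:
            continue              # healthy node, leave the job alone

        # A job with no spare available is marked None and waits in the queue.
        target = spares.pop(0) if spares else None
        new_assignments[job] = target
        moves.append((job, node, target, reason))

    return new_assignments, moves
```

A production scheduler would also checkpoint and restore training state around each move, which this sketch leaves out.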

Meanwhile, multi-dimensional optimisation is about automatically balancing the sometimes opposing needs of developers, the business, cost and resiliency. "We need to think about a configurable, multi-objective optimisation framework that will make deterministic decisions by considering the global constraints in a GPU cluster," Chatterjee said.
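One simple way to picture such a framework is a weighted score over candidate placements, with hard limits applied first. The Python sketch below is an assumption-laden illustration, not a description of Nvidia's design: `choose_plan`, the metric names and the cost cap are hypothetical, and a weighted sum is only one of several ways to combine objectives.

```python
def choose_plan(plans, weights, max_cost):
    """Pick one candidate plan by weighted score, subject to a cost cap.

    plans:    list of dicts with a "name" plus numeric metrics (hypothetical)
    weights:  dict of metric -> weight; a negative weight penalises a metric
    max_cost: global constraint, plans above it are never considered
    Returns the winning plan, or None if no plan satisfies the constraint.
    """
    # Hard constraint first: drop anything over budget.
    feasible = [plan for plan in plans if plan["cost"] <= max_cost]
    if not feasible:
        return None

    def score(plan):
        return sum(weight * plan[metric] for metric, weight in weights.items())

    # Break ties on name so the same inputs always give the same answer.
    return min(feasible, key=lambda plan: (-score(plan), plan["name"]))
```

Changing only the weights changes the outcome, which is the "configurable" part: a throughput-heavy weighting favours a tightly packed plan, while a resilience-heavy weighting favours one that spreads the job out.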

Earlier this week Nvidia unveiled its latest B200 Blackwell GPU which, it claims, is twice as powerful as current GPUs for training AI models. Blackwell brings more built-in hardware support for resiliency, and the Silicon Valley chip designer is actively engaging with the Kubernetes community to make the most of these advances to ease GPU scaling challenges.

"[We have] been engaging with the community on the low-level mechanisms to enable GPU resource management, and we will keep engaging to solve the GPU scale challenges as well," Chatterjee said.

In a related development, Kevin Klues, a distinguished engineer at Nvidia, discussed Dynamic Resource Allocation (DRA), a (relatively) new way of requesting resources in Kubernetes that gives third-party developers more control.

"It puts full control of the API to select and configure resources directly in the hands of third party developers," he said. "It also gives them the ability to precisely control how resources are shared between containers and pods, which is one of the main limitations of the existing device plugin API."

DRA, a Kubernetes API which is currently in alpha, complements Nvidia's efforts to optimise GPU utilisation and resource management, Klues explained.

Nvidia has become one of the world's most valuable companies thanks to the massive demand for GPUs to train and run AI models. Meanwhile Kubernetes is already widely used as a platform to support and deploy these models, but further integration will require an effort from both the chip designer and cloud native developers.

"For many, Kubernetes has already become [the default] platform. But we still have a lot of work to do before we can unlock the full potential of GPUs to accelerate AI workloads on Kubernetes," said Klues.