Enable GPU Stack

The LLMOS GPU Stack is a toolset that brings virtual GPU (vGPU) and multi-accelerator support to the LLMOS platform, improving GPU utilization and flexibility.

This guide provides a quick overview of the GPU Stack architecture, its components, and basic configurations.

note

The LLMOS GPU Stack is always enabled because it is a core part of the platform.

info

Currently supported GPUs: Nvidia CUDA

Planned future support:

  • AMD ROCm
  • Ascend CANN
  • Cambricon MLU
  • HYGON DCU

Prerequisites

For GPU Stack to work correctly, ensure the following requirements are met:

GPU Stack Components

When the GPU Stack is active, the following components are installed in the llmos-gpu-stack-system namespace:

  • Nvidia GPU Operator: Manages Nvidia GPU resources.
  • GPU Device Plugin: A DaemonSet that exposes and manages GPU devices.
  • GPU Device Manager: A Deployment that manages LLMOS GPU custom resource definitions (CRDs).
  • Volcano Scheduler: A plugin for batch and gang scheduling of workloads with GPU resources.

Configurations

GPU Operator

  • Enabled NVIDIA GPU Operator: The Nvidia GPU Operator is enabled by default to manage Nvidia GPU devices.
  • vGPU Count: Sets the maximum number of vGPU instances that can be created per GPU. Default: 10.
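
To make the vGPU Count setting concrete, the sketch below computes the upper bound on vGPU instances for a node. It assumes the limit applies independently to each physical GPU; the function name and the 4-GPU node are illustrative and not part of LLMOS.

```python
DEFAULT_VGPU_COUNT = 10  # default value of the vGPU Count setting


def max_vgpu_instances(physical_gpus: int, vgpu_count: int = DEFAULT_VGPU_COUNT) -> int:
    """Upper bound on vGPU instances, assuming the limit applies per physical GPU."""
    if physical_gpus < 0 or vgpu_count < 1:
        raise ValueError("physical_gpus must be >= 0 and vgpu_count must be >= 1")
    return physical_gpus * vgpu_count


# Hypothetical node with 4 physical GPUs and the default setting.
print(max_vgpu_instances(4))  # 4 GPUs x 10 vGPUs each = 40
```

This is only an upper bound on the number of instances; how many workloads a GPU can actually serve still depends on its memory and compute capacity.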

Device Manager

  • Resource Settings: Configure resource requests and limits for the GPU Device Manager pod.
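
As an illustration of what requests and limits look like, the sketch below models a Kubernetes-style resources stanza and checks that each request does not exceed its limit, which Kubernetes requires. The values are hypothetical, not the Device Manager's actual defaults, and the helper functions are illustrative and handle only a small subset of Kubernetes quantity formats.

```python
# Hypothetical values; the real defaults for the GPU Device Manager are not stated here.
resources = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "512Mi"},
}

_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def memory_bytes(value: str) -> int:
    """Convert a memory quantity such as '128Mi' to bytes (Ki/Mi/Gi suffixes only)."""
    for suffix, factor in _MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # no suffix: treat as a plain byte count


def requests_within_limits(res: dict) -> bool:
    """Return True if the CPU and memory requests do not exceed their limits."""
    req, lim = res["requests"], res["limits"]
    return (
        cpu_millicores(req["cpu"]) <= cpu_millicores(lim["cpu"])
        and memory_bytes(req["memory"]) <= memory_bytes(lim["memory"])
    )


print(requests_within_limits(resources))  # True
```

Requests are what the scheduler reserves for the pod, while limits cap what it may consume, so a request above its limit is rejected.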

Monitoring and Status

Once the GPU Stack is active:

  1. Check the status of GPU Stack components in the llmos-gpu-stack-system namespace.
  2. Ensure the Nvidia Cluster Policies in GPU Management are marked as Ready.
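
The first check can also be scripted. The sketch below is one possible approach rather than an LLMOS tool: it assumes `kubectl` is installed and configured for the cluster, and inspects the JSON returned by `kubectl get pods -n llmos-gpu-stack-system -o json`. The function names and the sample pod names are hypothetical.

```python
import json
import subprocess

NAMESPACE = "llmos-gpu-stack-system"


def unready_pods(pod_list: dict) -> list:
    """Return names of pods that are neither Ready nor completed successfully."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # one-shot pods that finished successfully are fine
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            names.append(pod["metadata"]["name"])
    return names


def fetch_pods(namespace: str = NAMESPACE) -> dict:
    """Read the pod list through kubectl (assumes kubectl can reach the cluster)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Hypothetical sample mimicking the shape of the kubectl JSON output.
sample = {
    "items": [
        {"metadata": {"name": "device-manager-example"},
         "status": {"phase": "Running",
                    "conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "device-plugin-example"},
         "status": {"phase": "Pending",
                    "conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"name": "validator-example"},
         "status": {"phase": "Succeeded", "conditions": []}},
    ]
}

print(unready_pods(sample))  # ['device-plugin-example']
```

Against a live cluster, calling `unready_pods(fetch_pods())` returns an empty list when every component pod is ready.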

From the dashboard, you can monitor GPU usage and status on the following pages:

  • Overview page: Displays GPU metrics when LLMOS Monitoring is enabled.
  • Nodes page: Displays the GPU device count and metrics for nodes with GPU devices.
  • GPU Devices page: Shows the status and details of all GPU devices.