Enable GPU Stack
LLMOS GPU Stack is a toolset that brings virtual GPUs (vGPU) and multi-accelerator support to the LLMOS platform. It improves GPU resource usage and flexibility.
This guide provides a quick overview of the GPU Stack architecture, its components, and basic configurations.
note
The LLMOS GPU Stack is always enabled because it is a core part of the platform.
info
Currently supported GPUs: ✅ Nvidia CUDA
Planned future support:
- AMD ROCm
- Ascend CANN
- Cambricon MLU
- HYGON DCU
Prerequisites
For GPU Stack to work correctly, ensure the following requirements are met:
- Nvidia Driver: Install the Nvidia driver on nodes with GPU devices. See the Nvidia GPU Driver Installation Guide.
- CUDA Compute Capability: Each GPU must have a CUDA Compute Capability of 7.5 or higher.
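The compute capability requirement above can be checked on a node before enabling GPU workloads. The following is a minimal sketch, not part of LLMOS itself: the helper names are hypothetical, and it assumes the installed driver ships an `nvidia-smi` that supports the `compute_cap` query field (older drivers may not).

```python
import subprocess

MIN_COMPUTE_CAPABILITY = (7, 5)  # minimum stated in the prerequisites


def parse_capability(text):
    """Turn a string such as '8.6' into a comparable (major, minor) tuple."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def check_gpus(csv_text, minimum=MIN_COMPUTE_CAPABILITY):
    """Return (name, capability, ok) for each 'name, compute_cap' CSV line."""
    results = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so GPU names containing commas stay intact.
        name, _, cap = line.rpartition(",")
        capability = parse_capability(cap)
        results.append((name.strip(), capability, capability >= minimum))
    return results


def query_local_gpus():
    """Ask nvidia-smi for the name and compute capability of each GPU.

    Assumption: the local nvidia-smi supports the compute_cap field.
    """
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
```

On a GPU node, `check_gpus(query_local_gpus())` lists each device with a flag showing whether it meets the minimum. Capabilities are compared as (major, minor) tuples rather than floats so that the comparison is exact.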
GPU Stack Components
When the GPU Stack is active, the following components are installed in the llmos-gpu-stack-system namespace:
- Nvidia GPU Operator: Manages Nvidia GPU resources. Learn more.
- GPU Device Plugin: A DaemonSet that exposes and manages GPU devices. Learn more.
- GPU Device Manager: A Deployment that manages LLMOS GPU custom resource definitions (CRDs).
- Volcano Scheduler: A plugin for batch and gang scheduling of workloads with GPU resources. Learn more.
Configurations
GPU Operator
- Enabled NVIDIA GPU Operator: The Nvidia GPU Operator is enabled by default to manage Nvidia GPU devices.
- vGPU Count: Sets the maximum number of vGPU instances that can be created per GPU. Default: 10.
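As a rough illustration of what the vGPU Count setting implies for capacity, the sketch below multiplies the number of physical GPUs on a node by the per-GPU limit. The function name is hypothetical, and the result is only an upper bound on vGPU instances; how many workloads actually fit depends on the memory and compute each one requests.

```python
DEFAULT_VGPU_COUNT = 10  # default value of the vGPU Count setting


def max_vgpu_instances(physical_gpus, vgpu_count=DEFAULT_VGPU_COUNT):
    """Upper bound on vGPU instances for a node with the given GPU count."""
    if physical_gpus < 0 or vgpu_count < 1:
        raise ValueError("physical_gpus must be >= 0 and vgpu_count >= 1")
    return physical_gpus * vgpu_count
```

For example, a node with 4 GPUs and the default setting allows at most `max_vgpu_instances(4)`, that is 40, vGPU instances.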
Device Manager
- Resource Settings: Configure resource requests and limits for the GPU Device Manager pod.
Monitoring and Status
Once the GPU Stack is active:
- Check the status of the GPU Stack components in the llmos-gpu-stack-system namespace.
- Ensure that the Nvidia Cluster Policies entry in GPU Management is marked as Ready.
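The component check above can also be scripted. This is a minimal sketch, assuming `kubectl` is installed and configured for the cluster; the helper names are hypothetical and not part of LLMOS. It reports pods in the namespace whose phase is neither Running nor Succeeded.

```python
import json
import subprocess

NAMESPACE = "llmos-gpu-stack-system"  # namespace named in this guide


def unhealthy_pods(pod_list_json):
    """Return {pod name: phase} for pods that are not Running or Succeeded."""
    items = json.loads(pod_list_json).get("items", [])
    return {
        pod["metadata"]["name"]: pod.get("status", {}).get("phase", "Unknown")
        for pod in items
        if pod.get("status", {}).get("phase") not in ("Running", "Succeeded")
    }


def fetch_pods(namespace=NAMESPACE):
    """Fetch the pod list for a namespace as JSON text via kubectl."""
    return subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
```

Calling `unhealthy_pods(fetch_pods())` returns an empty dictionary when every pod is in a healthy phase. A Running phase does not guarantee that all containers are ready, so treat this as a first-pass check alongside the dashboard status.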
From the dashboard, you can monitor the GPU usage and status via:
- Overview page: Displays GPU metrics when LLMOS Monitoring is enabled.
- Nodes page: Displays the GPU device count and metrics for nodes with GPU devices.
- GPU Devices page: Shows the status and details of all GPU devices.