Enable Monitoring
LLMOS Monitoring makes it easy to track cluster and GPU metrics with ready-to-use Grafana dashboards, Prometheus rules, and more, using the Prometheus Operator.
This page describes how to enable monitoring and alerting within a cluster using the built-in LLMOS Monitoring addon.
Enabling Monitoring
To enable monitoring, go to the Cluster Tools page as an admin user. Click Install, and you'll be directed to the Monitoring configuration page.
Requirements
- If your cluster has multiple nodes and monitoring needs persistent storage, enable Ceph System Storage before setting up monitoring.
- Ensure your cluster meets the resource requirements:
  - CPU: At least 1250m
  - Memory: At least 1210Mi
  - Storage: At least 50Gi
  - For details, see Resource Limits and Requests.
Prometheus Settings
- Admin API: Enable the Prometheus Admin API for advanced features like snapshots and deleting time series. Disabled by default.
- Scrape Interval: How often Prometheus collects metrics. Default: 30s.
- Evaluation Interval: How often Prometheus checks alerting rules. Default: 30s.
- Retention: How long metrics are kept. Default: 10d.
- Retention Size: Maximum size for stored metrics. Default: 50GiB.
- Resources: Set resource requests and limits for Prometheus pods.
- Persistent Storage: To retain data across deployments and upgrades, configure persistent storage for Prometheus. At least 50Gi is recommended.
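The scrape interval, retention, and storage size settings affect each other, so a rough estimate can help when sizing the Prometheus volume. The sketch below applies the capacity-planning formula from the upstream Prometheus storage documentation (retention time × ingested samples per second × bytes per sample). The 2 bytes per sample figure is an assumption at the upper end of the commonly quoted 1 to 2 bytes, and the series counts are hypothetical; actual usage depends on your workloads.

```python
# Rough disk estimate for Prometheus. This is a planning sketch, not a
# measurement: real usage varies with label cardinality and compression.
#   disk = retention_seconds * samples_per_second * bytes_per_sample

SCRAPE_INTERVAL_S = 30          # default scrape interval from this page
RETENTION_S = 10 * 24 * 3600    # default retention of 10d, in seconds
BYTES_PER_SAMPLE = 2            # assumption: upper end of the 1-2 bytes guideline
GIB = 1024 ** 3


def estimated_disk_bytes(active_series: int,
                         scrape_interval_s: int = SCRAPE_INTERVAL_S,
                         retention_s: int = RETENTION_S,
                         bytes_per_sample: float = BYTES_PER_SAMPLE) -> float:
    """Estimate on-disk size for a given number of active time series."""
    # Each active series contributes one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    return retention_s * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    for series in (100_000, 500_000, 1_000_000):  # hypothetical series counts
        size_gib = estimated_disk_bytes(series) / GIB
        print(f"{series:>9} series -> about {size_gib:.1f} GiB")
```

Under these assumptions, one million active series at the default settings comes to a little under 54 GiB, which is above the default 50GiB retention size. In that case a shorter retention, a longer scrape interval, or a larger volume may be worth considering.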
Grafana Settings
- Resources: Set resource requests and limits for Grafana pods.
- Persistent Storage: Configure storage to retain custom dashboards during upgrades or redeployments.
Note: Default dashboards provided by LLMOS Monitoring don’t require persistent storage.
AlertManager Settings
- Enable AlertManager: Enabled by default.
Resource Limits and Requests
You can adjust resource requests and limits during installation. The table below shows the default values for each component; the request totals match the minimum CPU and memory listed under Requirements:
Component | CPU Request | Memory Request | CPU Limit | Memory Limit |
---|---|---|---|---|
prometheus-operator | 100m | 100Mi | 500m | 200Mi |
prometheus | 750m | 750Mi | 1000m | 3000Mi |
alertmanager | 100m | 100Mi | 1000m | 500Mi |
grafana | 100m | 100Mi | 200m | 200Mi |
kube-state-metrics | 100m | 130Mi | 200m | 200Mi |
prometheus-node-exporter | 100m | 30Mi | 200m | 50Mi |
Total | 1250m | 1210Mi | 3100m | 4150Mi |
Persistent Storage: Prometheus requires at least 50Gi of storage.
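If you change any of the defaults during installation, the totals change with them. As a minimal sketch, the following script recomputes the Total row from the per-component values in the table. The helper names (`cpu_millicores`, `memory_mib`, `totals`) are illustrative and not part of LLMOS, and the parsers only handle the quantity suffixes that appear on this page (`m` for CPU, `Mi` and `Gi` for memory).

```python
# Default requests and limits, copied from the table above as
# (cpu_request, memory_request, cpu_limit, memory_limit).
DEFAULTS = {
    "prometheus-operator":      ("100m", "100Mi", "500m",  "200Mi"),
    "prometheus":               ("750m", "750Mi", "1000m", "3000Mi"),
    "alertmanager":             ("100m", "100Mi", "1000m", "500Mi"),
    "grafana":                  ("100m", "100Mi", "200m",  "200Mi"),
    "kube-state-metrics":       ("100m", "130Mi", "200m",  "200Mi"),
    "prometheus-node-exporter": ("100m", "30Mi",  "200m",  "50Mi"),
}


def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as '750m' or '2' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_mib(quantity: str) -> int:
    """Parse a memory quantity with an 'Mi' or 'Gi' suffix into MiB."""
    if quantity.endswith("Gi"):
        return int(float(quantity[:-2]) * 1024)
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def totals(components: dict) -> tuple:
    """Return (cpu_request_m, memory_request_Mi, cpu_limit_m, memory_limit_Mi)."""
    cpu_req, mem_req, cpu_lim, mem_lim = zip(*components.values())
    return (
        sum(map(cpu_millicores, cpu_req)),
        sum(map(memory_mib, mem_req)),
        sum(map(cpu_millicores, cpu_lim)),
        sum(map(memory_mib, mem_lim)),
    )


if __name__ == "__main__":
    cpu_r, mem_r, cpu_l, mem_l = totals(DEFAULTS)
    print(f"requests: {cpu_r}m CPU, {mem_r}Mi memory")
    print(f"limits:   {cpu_l}m CPU, {mem_l}Mi memory")
```

Editing an entry in `DEFAULTS` to match your planned configuration gives the new minimum CPU and memory the cluster needs to have available before installing.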