Enable Monitoring

LLMOS Monitoring uses the Prometheus Operator to track cluster and GPU metrics, and includes ready-to-use Grafana dashboards, Prometheus rules, and more.

This page describes how to enable monitoring and alerting within a cluster using the built-in LLMOS Monitoring addon.

Enabling Monitoring

To enable monitoring, go to the Cluster Tools page as an admin user. Click Install, and you'll be directed to the Monitoring configuration page.

Requirements

  • If your cluster has multiple nodes and monitoring needs persistent storage, enable Ceph System Storage before setting up monitoring.
  • Ensure your cluster meets the resource requirements listed under Resource Limits and Requests.

Prometheus Settings

  • Admin API: Enable the Prometheus Admin API for advanced features like snapshots and deleting time series. Disabled by default.
  • Scrape Interval: How often Prometheus collects metrics. Default: 30s.
  • Evaluation Interval: How often Prometheus checks alerting rules. Default: 30s.
  • Retention: How long metrics are kept. Default: 10d.
  • Retention Size: Maximum size for stored metrics. Default: 50GiB.
  • Resources: Set resource requests and limits for Prometheus pods.
  • Persistent Storage: To retain data across deployments and upgrades, configure persistent storage for Prometheus.
    • At least 50Gi is recommended.
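
To sanity-check the retention settings above against the recommended volume size, the sketch below applies the rough capacity formula from the Prometheus storage documentation (retention time in seconds × samples ingested per second × bytes per sample). The active series count and the 2 bytes-per-sample figure are assumptions for illustration, not LLMOS values; substitute numbers measured on your own cluster.

```python
# Rough Prometheus disk estimate: retention * ingestion rate * bytes per sample.
# ACTIVE_SERIES and BYTES_PER_SAMPLE are illustrative assumptions, not LLMOS defaults.

SCRAPE_INTERVAL_S = 30     # default scrape interval from this page
RETENTION_DAYS = 10        # default retention from this page
RETENTION_SIZE_GIB = 50    # default retention size from this page
ACTIVE_SERIES = 500_000    # assumption: measure this on your own cluster
BYTES_PER_SAMPLE = 2       # assumption: Prometheus docs cite roughly 1-2 bytes


def estimate_disk_gib(active_series, scrape_interval_s, retention_days, bytes_per_sample):
    """Return the approximate on-disk size in GiB for the given settings."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample / 2**30


estimate = estimate_disk_gib(
    ACTIVE_SERIES, SCRAPE_INTERVAL_S, RETENTION_DAYS, BYTES_PER_SAMPLE
)
fits = estimate <= RETENTION_SIZE_GIB
print(f"Estimated usage: {estimate:.1f} GiB (fits in {RETENTION_SIZE_GIB} GiB: {fits})")
```

If the estimate approaches the Retention Size, consider a shorter Retention, a longer Scrape Interval, or a larger persistent volume.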

Grafana Settings

  • Resources: Set resource requests and limits for Grafana pods.
  • Persistent Storage: Configure storage to retain custom dashboards during upgrades or redeployments.

Note: Default dashboards provided by LLMOS Monitoring don’t require persistent storage.

AlertManager Settings

  • Enable AlertManager: Enabled by default.

Resource Limits and Requests

You can adjust resource requests and limits during installation. The table below shows the default minimum requirements:

Component                  CPU Request   Memory Request   CPU Limit   Memory Limit
prometheus-operator        100m          100Mi            500m        200Mi
prometheus                 750m          750Mi            1000m       3000Mi
alertmanager               100m          100Mi            1000m       500Mi
grafana                    100m          100Mi            200m        200Mi
kube-state-metrics         100m          130Mi            200m        200Mi
prometheus-node-exporter   100m          30Mi             200m        50Mi
Total                      1250m         1210Mi           3100m       4150Mi

Persistent Storage: Prometheus requires at least 50Gi of storage.
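
If you change any of the defaults during installation, it can help to recompute the totals before comparing them with your cluster's free capacity. The following is a minimal sketch that sums the values from the table above; the helper names are hypothetical, and the parsers handle only the "m" and "Mi" suffixes used on this page.

```python
# Sum the default requests and limits listed in the table above.
# Helper names are illustrative; only the "m" and "Mi" suffixes are supported.

DEFAULTS = {
    # component: (cpu_request, memory_request, cpu_limit, memory_limit)
    "prometheus-operator": ("100m", "100Mi", "500m", "200Mi"),
    "prometheus": ("750m", "750Mi", "1000m", "3000Mi"),
    "alertmanager": ("100m", "100Mi", "1000m", "500Mi"),
    "grafana": ("100m", "100Mi", "200m", "200Mi"),
    "kube-state-metrics": ("100m", "130Mi", "200m", "200Mi"),
    "prometheus-node-exporter": ("100m", "30Mi", "200m", "50Mi"),
}

COLUMNS = ("cpu_request_m", "memory_request_mi", "cpu_limit_m", "memory_limit_mi")


def millicores(value):
    """Parse a CPU quantity such as '750m' into integer millicores."""
    if not value.endswith("m"):
        raise ValueError(f"expected a millicore quantity, got {value!r}")
    return int(value[:-1])


def mebibytes(value):
    """Parse a memory quantity such as '130Mi' into integer mebibytes."""
    if not value.endswith("Mi"):
        raise ValueError(f"expected a Mi quantity, got {value!r}")
    return int(value[:-2])


PARSERS = (millicores, mebibytes, millicores, mebibytes)


def totals(defaults):
    """Return the per-column sums across all components."""
    sums = dict.fromkeys(COLUMNS, 0)
    for row in defaults.values():
        for column, parse, value in zip(COLUMNS, PARSERS, row):
            sums[column] += parse(value)
    return sums


print(totals(DEFAULTS))
```

With the unmodified defaults, the sums match the Total row of the table. Edit the dictionary to reflect any values you override on the configuration page.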