GPU Monitoring Setup
Overview
Kubeadapt supports GPU cost monitoring through integration with NVIDIA DCGM (Data Center GPU Manager) Exporter. This allows you to:
- Track GPU costs alongside CPU and memory
- Monitor GPU utilization per node and workload
- Optimize GPU usage with rightsizing recommendations
- Identify idle GPUs for cost savings
GPU Metrics Collected:
- DCGM_FI_DEV_GPU_UTIL - GPU compute utilization percentage
Prerequisites
Before enabling GPU monitoring, ensure:
- NVIDIA GPUs in your cluster nodes
- NVIDIA device plugin installed (or let the GPU Operator install it via Option 1 below)
- Helm 3.x for Kubeadapt installation
- Cluster admin permissions
Supported GPU types:
- NVIDIA A100, V100, T4, P100
- Any NVIDIA GPU with DCGM support
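Before installing, you can confirm that your nodes actually advertise GPUs to the scheduler. A quick check, assuming the standard nvidia.com/gpu resource name (nodes without GPUs show `<none>`):

```bash
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'
```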
Installation Options
You have two options for enabling GPU monitoring:
Option 1: Install GPU Operator with Kubeadapt (Recommended)
Use this if you don't have GPU Operator installed yet.
Pros:
- Single Helm install for everything
- Automatic DCGM Exporter deployment
- GPU Operator manages NVIDIA drivers and device plugins
- Simpler configuration
Cons:
- Installs additional components (GPU Operator stack)
Option 2: Use Existing DCGM Exporter
Use this if you already have DCGM Exporter running.
Pros:
- Reuses existing infrastructure
- Lighter weight (no additional deployments)
Cons:
- Requires manual scrape configuration
- Need to know your DCGM Exporter namespace/labels
Option 1: Install with GPU Operator
Step 1: Enable GPU Operator in Helm
Install (or upgrade) Kubeadapt with GPU Operator enabled:
```bash
helm install kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --create-namespace \
  --set agent.enabled=true \
  --set agent.config.token=YOUR_TOKEN \
  --set gpu-operator.enabled=true
```

Or, if already installed, upgrade:

```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true
```

Step 2: Enable DCGM Scraping
Create a gpu-monitoring-values.yaml file with the DCGM scrape configuration:
```yaml
# gpu-monitoring-values.yaml

gpu-operator:
  enabled: true
  operator:
    defaultRuntime: containerd
  dcgmExporter:
    enabled: true
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "512Mi"
        cpu: "250m"

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Enable DCGM Exporter scraping
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - kubeadapt
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$
```

Apply the configuration:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f gpu-monitoring-values.yaml
```

Step 3: Verify GPU Operator Installation
Check that GPU Operator components are running:
```bash
kubectl get pods -n kubeadapt | grep -E 'gpu-operator|dcgm'
```

Expected output:

```
gpu-operator-6b8f9d7c4d-x7k9m          1/1   Running   0   2m
nvidia-dcgm-exporter-abcde             1/1   Running   0   2m
nvidia-device-plugin-daemonset-fghij   1/1   Running   0   2m
```

Step 4: Verify GPU Metrics
After configuration, GPU metrics should appear in the Kubeadapt dashboard.
Sign in to app.kubeadapt.io and navigate to your cluster to verify GPU cost data is visible.
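You can also query Prometheus directly over its HTTP API. A minimal check, assuming the bundled Prometheus is exposed by a service named kubeadapt-prometheus-server on port 80 (both are assumptions; adjust to your install):

```bash
# Port-forward the bundled Prometheus server (service name is an assumption)
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &

# Query current GPU utilization and print only the result array
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result'
```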
Expected output:

```json
[
  {
    "metric": {
      "__name__": "DCGM_FI_DEV_GPU_UTIL",
      "gpu": "0",
      "instance": "10.0.1.42:9400",
      "job": "dcgm-exporter"
    },
    "value": [1705315200, "85.5"]
  }
]
```

Option 2: Use Existing DCGM Exporter
If you already have DCGM Exporter running in your cluster:
Step 1: Identify Your DCGM Exporter
Find the namespace and labels of your existing DCGM Exporter:
```bash
kubectl get pods --all-namespaces -l app=dcgm-exporter
```

Example output:

```
NAMESPACE    NAME                   READY   STATUS    RESTARTS   AGE
gpu-system   dcgm-exporter-abc123   1/1     Running   0          10d
```

Step 2: Configure Prometheus Scraping
Create a values.yaml with your DCGM Exporter details:
```yaml
# existing-dcgm-values.yaml

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Scrape existing DCGM Exporter
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - gpu-system  # Change to your namespace
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter  # Change if your label is different
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$
```

Apply the configuration:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f existing-dcgm-values.yaml
```

Step 3: Verify Scraping
Verify your DCGM Exporter pods are running, using the namespace and label from Step 1:

```bash
kubectl get pods -n gpu-system -l app=dcgm-exporter
```

All pods should show Running status.
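To confirm the exporter is actually serving metrics, you can port-forward a pod and fetch its metrics endpoint. A sketch using the example pod name from Step 1 (9400 is the DCGM Exporter default port; adjust if yours differs):

```bash
# Port-forward one exporter pod (pod name from the listing above)
kubectl port-forward -n gpu-system pod/dcgm-exporter-abc123 9400:9400 &

# The GPU utilization metric should appear in the output
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```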
Agent Configuration
Enable GPU Monitoring in Agent
To enable GPU cost tracking in the Kubeadapt agent, add the GPU monitoring flag:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true
```

Or in values.yaml:
```yaml
agent:
  enabled: true
  config:
    token: "YOUR_TOKEN"
    enableGpuMonitoring: true
```

Verify Agent Configuration
Check agent logs to confirm GPU monitoring is enabled:
```bash
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu
```

Expected output:

```
INFO: GPU monitoring enabled
INFO: Discovered 4 GPU nodes in cluster
INFO: Collecting metrics from DCGM exporter
```

Viewing GPU Costs in Dashboard
Once configured, GPU metrics will appear in your Kubeadapt dashboard:
Dashboard Features
1. GPU Cost Overview
- Total GPU spend per month
- GPU utilization percentage
- GPU-enabled nodes count
2. GPU Utilization by Node
- Per-node GPU usage graphs
- Idle GPU identification
- GPU memory utilization
3. Workload GPU Usage
- GPU allocation per pod
- GPU request vs. actual usage
- Rightsizing recommendations for GPU workloads
GPU Metrics Available
Node-level:
- GPU count per node
- GPU model and memory capacity
Workload-level:
- GPU requests (nvidia.com/gpu)
- GPU compute utilization percentage
- GPU idle time detection
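For ad-hoc analysis, you can aggregate the collected metric directly in Prometheus. A sketch using the HTTP API (assumes a port-forward like the one in Option 1, Step 4; Hostname is the DCGM Exporter's default node label, but verify it against your exporter's label set):

```bash
# Average GPU utilization per node over the last hour
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))' \
  | jq '.data.result'
```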
Troubleshooting
DCGM Exporter pods not running
Symptoms:
```bash
kubectl get pods -n kubeadapt | grep dcgm
# No pods or CrashLoopBackOff
```

Common causes:
- No GPU nodes in cluster
- NVIDIA device plugin not installed
- GPU Operator failed to install
Solution:
```bash
# Check for GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu" != null)'

# Check GPU Operator logs
kubectl logs -n kubeadapt deployment/gpu-operator

# Reinstall GPU Operator
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=false
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true
```

No GPU metrics appearing
Symptoms:
GPU cost data not visible in Kubeadapt dashboard.
Common causes:
- DCGM Exporter not scraped by Prometheus
- Incorrect namespace or labels in scrape config
- DCGM Exporter not exposing metrics
Solution:
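A quick way to tell whether Prometheus is scraping the exporter at all is to inspect its active targets (assumes a Prometheus port-forward like the one in Option 1, Step 4):

```bash
# Show health and any scrape error for the dcgm-exporter job
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.labels.job == "dcgm-exporter") | {health, scrapeUrl, lastError}'
```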
```bash
# Verify DCGM Exporter is running (adjust namespace and label to your install)
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check DCGM Exporter logs
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter

# Verify Kubeadapt agent logs
kubectl logs -n kubeadapt -l app=kubeadapt-agent | grep -i gpu
```

GPU costs not showing in dashboard
Symptoms:
- DCGM metrics available in Prometheus
- GPU costs not visible in Kubeadapt dashboard
Common causes:
- Agent GPU monitoring not enabled
- Agent not collecting GPU metrics
- GPU pricing not configured
Solution:
```bash
# Enable GPU monitoring in agent
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true

# Check agent logs
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu

# Verify GPU pricing is configured in dashboard
# Navigate to Settings → Cloud Providers → GPU Pricing
```

MIG Mode Limitation
IMPORTANT: DCGM Exporter in Kubernetes mode does NOT support container-level GPU utilization mapping when MIG (Multi-Instance GPU) is enabled.
If using MIG:
- Node-level GPU metrics: Available
- Container-level GPU metrics: Not available
- Future: eBPF-based agent will support MIG mode
Workaround:
- Use GPU node labels for cost allocation
- Manual GPU cost distribution based on GPU requests (see the sketch after this list)
- Wait for eBPF agent support (roadmap)
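For request-based distribution, a rough starting point is to total nvidia.com/gpu requests per namespace. A minimal sketch (assumes jq, as used elsewhere in this guide; counts whole-GPU requests only):

```bash
# Sum nvidia.com/gpu requests per namespace as a basis for manual cost splits
kubectl get pods --all-namespaces -o json | jq -r '
  [ .items[]
    | { ns: .metadata.namespace,
        gpus: ([ .spec.containers[].resources.requests["nvidia.com/gpu"] // "0"
                 | tonumber ] | add) }
    | select(.gpus > 0) ]
  | group_by(.ns)[]
  | "\(.[0].ns): \(map(.gpus) | add) GPU(s) requested"'
```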
Best Practices
1. Right-size GPU Requests
Monitor GPU utilization and adjust requests:
```yaml
# Before (over-provisioned)
resources:
  requests:
    nvidia.com/gpu: 1  # GPU utilization: 25%

# After (right-sized)
resources:
  requests:
    nvidia.com/gpu: 0  # Moved to CPU-only node
```

2. Use GPU Node Taints
Prevent non-GPU workloads from running on expensive GPU nodes:
```bash
# Taint GPU nodes
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
```

```yaml
# GPU workloads need a matching toleration
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```

3. Enable GPU Time-Slicing (Optional)
For multiple workloads sharing a single GPU:
```yaml
# GPU Operator configuration
gpu-operator:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 4  # 4 containers can share 1 GPU
```

GPU Pricing Configuration
Configure GPU Costs in Dashboard
- Navigate to Settings → Cloud Providers
- Select your cloud provider (AWS, GCP, Azure)
- GPU Pricing section:
  - Set hourly cost per GPU type
  - Or enable automatic pricing from the cloud provider API
Example GPU pricing:
```
NVIDIA A100 (80GB): $3.67/hour
NVIDIA V100:        $2.48/hour
NVIDIA T4:          $0.95/hour
```

On-Premises GPU Pricing
For on-prem clusters, calculate GPU cost based on:
```
GPU Hourly Cost = (Hardware Cost / Depreciation Period in Years) / Hours per Year

Example:
- Hardware: $10,000 per GPU
- Depreciation: 3 years
- Hourly cost: $10,000 / (3 × 365 × 24) = $0.38/hour
```

What's Next?
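To reproduce the arithmetic for your own hardware cost and depreciation period (plain awk, nothing Kubeadapt-specific):

```bash
# Hourly cost = hardware cost / (depreciation years * hours per year)
awk -v cost=10000 -v years=3 'BEGIN { printf "$%.2f/hour\n", cost / (years * 365 * 24) }'
```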
After enabling GPU monitoring:
- Dashboard - View GPU costs and utilization
- Available Savings - GPU rightsizing recommendations
- Workload Details - Per-pod GPU usage
- Cost Query - Custom GPU cost queries
Need Help?
- NVIDIA GPU Operator Docs
- DCGM Exporter Docs
- Support - Email support team