GPU Monitoring Setup
Overview
Kubeadapt supports GPU cost monitoring through integration with NVIDIA DCGM (Data Center GPU Manager) Exporter. This allows you to:
- Track GPU costs alongside CPU and memory
- Monitor GPU utilization per node and workload
- Optimize GPU usage with rightsizing recommendations
- Identify idle GPUs for cost savings
GPU Metrics Collected:
- DCGM_FI_DEV_GPU_UTIL - GPU compute utilization percentage
Prerequisites
Before enabling GPU monitoring, ensure:
- NVIDIA GPUs in your cluster nodes
- NVIDIA device plugin installed (or let the GPU Operator install it via Option 1 below)
- Helm 3.x for Kubeadapt installation
- Cluster admin permissions
Supported GPU types:
- NVIDIA A100, V100, T4, P100
- Any NVIDIA GPU with DCGM support
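Before installing, you can confirm that your nodes actually advertise GPUs to the scheduler. A quick check, assuming the standard nvidia.com/gpu resource name (nodes without GPUs show `<none>`):

```bash
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'
```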
Installation Options
You have two options for enabling GPU monitoring:
Option 1: Install GPU Operator with Kubeadapt (Recommended)
Use this if you don't have GPU Operator installed yet.
Pros:
- Single Helm install for everything
- Automatic DCGM Exporter deployment
- GPU Operator manages NVIDIA drivers and device plugins
- Simpler configuration
Cons:
- Installs additional components (GPU Operator stack)
Option 2: Use Existing DCGM Exporter
Use this if you already have DCGM Exporter running.
Pros:
- Reuses existing infrastructure
- Lighter weight (no additional deployments)
Cons:
- Requires manual scrape configuration
- Need to know your DCGM Exporter namespace/labels
Option 1: Install with GPU Operator
Step 1: Enable GPU Operator in Helm
Install (or upgrade) Kubeadapt with GPU Operator enabled:
```bash
helm install kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --create-namespace \
  --set agent.enabled=true \
  --set agent.config.token=YOUR_TOKEN \
  --set gpu-operator.enabled=true
```

Or, if already installed, upgrade:

```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true
```

Step 2: Enable DCGM Scraping
Create a gpu-monitoring-values.yaml file with the DCGM scrape configuration:
```yaml
# gpu-monitoring-values.yaml

gpu-operator:
  enabled: true
  operator:
    defaultRuntime: containerd
  dcgmExporter:
    enabled: true
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "512Mi"
        cpu: "250m"

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Enable DCGM Exporter scraping
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - kubeadapt
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$
```

Apply the configuration:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f gpu-monitoring-values.yaml
```

Step 3: Verify GPU Operator Installation
Check that GPU Operator components are running:
```bash
kubectl get pods -n kubeadapt | grep -E 'gpu-operator|dcgm'
```

Expected output:

```
gpu-operator-6b8f9d7c4d-x7k9m          1/1   Running   0   2m
nvidia-dcgm-exporter-abcde             1/1   Running   0   2m
nvidia-device-plugin-daemonset-fghij   1/1   Running   0   2m
```

Step 4: Verify GPU Metrics
After configuration, GPU metrics should appear in the Kubeadapt dashboard.
Sign in to app.kubeadapt.io and navigate to your cluster to verify GPU cost data is visible.
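You can also query Prometheus directly over its HTTP API. A minimal check, assuming the bundled Prometheus is exposed by a service named kubeadapt-prometheus-server on port 80 (both are assumptions; adjust to your install):

```bash
# Port-forward the bundled Prometheus server (service name is an assumption)
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &

# Query current GPU utilization and print only the result array
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result'
```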
Expected output:

```json
[
  {
    "metric": {
      "__name__": "DCGM_FI_DEV_GPU_UTIL",
      "gpu": "0",
      "instance": "10.0.1.42:9400",
      "job": "dcgm-exporter"
    },
    "value": [1705315200, "85.5"]
  }
]
```

Option 2: Use Existing DCGM Exporter
If you already have DCGM Exporter running in your cluster:
Step 1: Identify Your DCGM Exporter
Find the namespace and labels of your existing DCGM Exporter:
```bash
kubectl get pods --all-namespaces -l app=dcgm-exporter
```

Example output:

```
NAMESPACE    NAME                   READY   STATUS    RESTARTS   AGE
gpu-system   dcgm-exporter-abc123   1/1     Running   0          10d
```

Step 2: Configure Prometheus Scraping
Create a values.yaml with your DCGM Exporter details:
```yaml
# existing-dcgm-values.yaml

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Scrape existing DCGM Exporter
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - gpu-system  # Change to your namespace
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter  # Change if your label is different
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$
```

Apply the configuration:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f existing-dcgm-values.yaml
```

Step 3: Verify Scraping
Verify your DCGM Exporter pods are running, using the namespace and label from Step 1:

```bash
kubectl get pods -n gpu-system -l app=dcgm-exporter
```

All pods should show Running status.
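To confirm the exporter is actually serving metrics, you can port-forward a pod and fetch its metrics endpoint. A sketch using the example pod name from Step 1 (9400 is the DCGM Exporter default port; adjust if yours differs):

```bash
# Port-forward one exporter pod (pod name from the listing above)
kubectl port-forward -n gpu-system pod/dcgm-exporter-abc123 9400:9400 &

# The GPU utilization metric should appear in the output
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```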
Agent Configuration
Enable GPU Monitoring in Agent
To enable GPU cost tracking in the Kubeadapt agent, add the GPU monitoring flag:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true
```

Or in values.yaml:
```yaml
agent:
  enabled: true
  config:
    token: "YOUR_TOKEN"
    enableGpuMonitoring: true
```

Verify Agent Configuration
Check agent logs to confirm GPU monitoring is enabled:
```bash
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu
```

Expected output:

```
INFO: GPU monitoring enabled
INFO: Discovered 4 GPU nodes in cluster
INFO: Collecting metrics from DCGM exporter
```

Viewing GPU Costs in Dashboard
Once configured, GPU metrics will appear in your Kubeadapt dashboard:
Dashboard Features
1. GPU Cost Overview
- Total GPU spend per month
- GPU utilization percentage
- GPU-enabled nodes count
2. GPU Utilization by Node
- Per-node GPU usage graphs
- Idle GPU identification
- GPU memory utilization
3. Workload GPU Usage
- GPU allocation per pod
- GPU request vs. actual usage
- Rightsizing recommendations for GPU workloads
GPU Metrics Available
Node-level:
- GPU count per node
- GPU model and memory capacity
Workload-level:
- GPU requests (nvidia.com/gpu)
- GPU compute utilization percentage
- GPU idle time detection
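For ad-hoc analysis, you can aggregate the collected metric directly in Prometheus. A sketch using the HTTP API (assumes a port-forward like the one in Option 1, Step 4; Hostname is the DCGM Exporter's default node label, but verify it against your exporter's label set):

```bash
# Average GPU utilization per node over the last hour
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))' \
  | jq '.data.result'
```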
Troubleshooting
DCGM Exporter pods not running
Symptoms:
```bash
kubectl get pods -n kubeadapt | grep dcgm
# No pods or CrashLoopBackOff
```

Common causes:
- No GPU nodes in cluster
- NVIDIA device plugin not installed
- GPU Operator failed to install
Solution:
```bash
# Check for GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu" != null)'

# Check GPU Operator logs
kubectl logs -n kubeadapt deployment/gpu-operator

# Reinstall GPU Operator
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=false
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true
```

No GPU metrics appearing
Symptoms:
GPU cost data not visible in Kubeadapt dashboard.
Common causes:
- DCGM Exporter not scraped by Prometheus
- Incorrect namespace or labels in scrape config
- DCGM Exporter not exposing metrics
Solution:
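A quick way to tell whether Prometheus is scraping the exporter at all is to inspect its active targets (assumes a Prometheus port-forward like the one in Option 1, Step 4):

```bash
# Show health and any scrape error for the dcgm-exporter job
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.labels.job == "dcgm-exporter") | {health, scrapeUrl, lastError}'
```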
```bash
# Verify DCGM Exporter is running (adjust namespace and label to your install)
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check DCGM Exporter logs
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter

# Verify Kubeadapt agent logs
kubectl logs -n kubeadapt -l app=kubeadapt-agent | grep -i gpu
```

GPU costs not showing in dashboard
Symptoms:
- DCGM metrics available in Prometheus
- GPU costs not visible in Kubeadapt dashboard
Common causes:
- Agent GPU monitoring not enabled
- Agent not collecting GPU metrics
- GPU pricing not configured
Solution:
```bash
# Enable GPU monitoring in agent
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true

# Check agent logs
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu

# Verify GPU pricing is configured in dashboard
# Navigate to Settings → Cloud Providers → GPU Pricing
```

MIG Mode Limitation
IMPORTANT: DCGM Exporter in Kubernetes mode does NOT support container-level GPU utilization mapping when MIG (Multi-Instance GPU) is enabled.
If using MIG:
- Node-level GPU metrics: Available
- Container-level GPU metrics: Not available
- Future: eBPF-based agent will support MIG mode
Workaround:
- Use GPU node labels for cost allocation
- Manual GPU cost distribution based on GPU requests (see the sketch after this list)
- Wait for eBPF agent support (roadmap)
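For request-based distribution, a rough starting point is to total nvidia.com/gpu requests per namespace. A minimal sketch (assumes jq, as used elsewhere in this guide; counts whole-GPU requests only):

```bash
# Sum nvidia.com/gpu requests per namespace as a basis for manual cost splits
kubectl get pods --all-namespaces -o json | jq -r '
  [ .items[]
    | { ns: .metadata.namespace,
        gpus: ([ .spec.containers[].resources.requests["nvidia.com/gpu"] // "0"
                 | tonumber ] | add) }
    | select(.gpus > 0) ]
  | group_by(.ns)[]
  | "\(.[0].ns): \(map(.gpus) | add) GPU(s) requested"'
```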
Best Practices
1. Right-size GPU Requests
Monitor GPU utilization and adjust requests:
```yaml
# Before (over-provisioned)
resources:
  requests:
    nvidia.com/gpu: 1  # GPU utilization: 25%

# After (right-sized)
resources:
  requests:
    nvidia.com/gpu: 0  # Moved to CPU-only node
```

2. Use GPU Node Taints
Prevent non-GPU workloads from running on expensive GPU nodes:
```bash
# Taint GPU nodes
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
```

```yaml
# GPU workloads need a matching toleration
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```

3. Enable GPU Time-Slicing (Optional)
For multiple workloads sharing a single GPU:
```yaml
# GPU Operator configuration
gpu-operator:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 4  # 4 containers can share 1 GPU
```

GPU Pricing Configuration
Configure GPU Costs in Dashboard
- Navigate to Settings → Cloud Providers
- Select your cloud provider (AWS, GCP, Azure)
- GPU Pricing section:
  - Set hourly cost per GPU type
  - Or enable automatic pricing from the cloud provider API
Example GPU pricing:
```
NVIDIA A100 (80GB): $3.67/hour
NVIDIA V100:        $2.48/hour
NVIDIA T4:          $0.95/hour
```

On-Premises GPU Pricing
For on-prem clusters, calculate GPU cost based on:
```
GPU Hourly Cost = (Hardware Cost / Depreciation Period in Years) / Hours per Year

Example:
- Hardware: $10,000 per GPU
- Depreciation: 3 years
- Hourly cost: $10,000 / (3 × 365 × 24) = $0.38/hour
```

What's Next?
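To reproduce the arithmetic for your own hardware cost and depreciation period (plain awk, nothing Kubeadapt-specific):

```bash
# Hourly cost = hardware cost / (depreciation years * hours per year)
awk -v cost=10000 -v years=3 'BEGIN { printf "$%.2f/hour\n", cost / (years * 365 * 24) }'
```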
After enabling GPU monitoring:
- Dashboard - View GPU costs and utilization
- Available Savings - GPU rightsizing recommendations
- Workload Details - Per-pod GPU usage
- Cost Query - Custom GPU cost queries
Need Help?
- NVIDIA GPU Operator Docs
- DCGM Exporter Docs
- Support - Email support team