GPU Monitoring Setup
Overview
Kubeadapt supports GPU cost monitoring through integration with NVIDIA DCGM (Data Center GPU Manager) Exporter. This allows you to:
- Track GPU costs alongside CPU and memory
- Monitor GPU utilization per node and workload
- Optimize GPU usage with rightsizing recommendations
- Identify idle GPUs for cost savings
GPU Metrics Collected:
- DCGM_FI_DEV_GPU_UTIL - GPU compute utilization percentage
Prerequisites
Before enabling GPU monitoring, ensure:
- NVIDIA GPUs in your cluster nodes
- NVIDIA device plugin installed (or let the GPU Operator install it, as in Option 1 below)
- Helm 3.x for Kubeadapt installation
- Cluster admin permissions
Supported GPU types:
- NVIDIA A100, V100, T4, P100
- Any NVIDIA GPU with DCGM support
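To confirm the prerequisites, you can check that your nodes advertise the nvidia.com/gpu resource and that a device plugin (or the GPU Operator) is running. A quick check using only kubectl and jq:

```bash
# List nodes that advertise NVIDIA GPU capacity
kubectl get nodes -o json | \
  jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | .metadata.name'

# Look for an NVIDIA device plugin or GPU Operator pod (name varies by installation method)
kubectl get pods --all-namespaces | grep -iE 'nvidia-device-plugin|gpu-operator'
```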
Installation Options
You have two options for enabling GPU monitoring:
Option 1: Install GPU Operator with Kubeadapt (Recommended)
Use this if you don't have GPU Operator installed yet.
Pros:
- Single Helm install for everything
- Automatic DCGM Exporter deployment
- GPU Operator manages NVIDIA drivers and device plugins
- Simpler configuration
Cons:
- Installs additional components (GPU Operator stack)
Option 2: Use Existing DCGM Exporter
Use this if you already have DCGM Exporter running.
Pros:
- Reuses existing infrastructure
- Lighter weight (no additional deployments)
Cons:
- Requires manual scrape configuration
- Need to know your DCGM Exporter namespace/labels
Option 1: Install with GPU Operator
Step 1: Enable GPU Operator in Helm
Install (or upgrade) Kubeadapt with GPU Operator enabled:
```bash
helm install kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --create-namespace \
  --set agent.enabled=true \
  --set agent.config.token=YOUR_TOKEN \
  --set gpu-operator.enabled=true
```
Or, if Kubeadapt is already installed, upgrade it instead:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true
```
Step 2: Enable DCGM Scraping
Create a gpu-monitoring-values.yaml file with the DCGM scrape configuration:
```yaml
gpu-operator:
  enabled: true
  operator:
    defaultRuntime: containerd
  dcgmExporter:
    enabled: true
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "512Mi"
        cpu: "250m"

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Enable DCGM Exporter scraping
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - kubeadapt
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$
```
Apply the configuration:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f gpu-monitoring-values.yaml
```
Step 3: Verify GPU Operator Installation
Check that GPU Operator components are running:
```bash
kubectl get pods -n kubeadapt | grep -E 'gpu-operator|dcgm'
```
Expected output:
```text
gpu-operator-6b8f9d7c4d-x7k9m            1/1     Running   0          2m
nvidia-dcgm-exporter-abcde               1/1     Running   0          2m
nvidia-device-plugin-daemonset-fghij     1/1     Running   0          2m
```
Step 4: Verify GPU Metrics
After configuration, GPU metrics should appear in the Kubeadapt dashboard.
Sign in to app.kubeadapt.io and navigate to your cluster to verify GPU cost data is visible.
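You can also confirm the raw metric is reaching Prometheus by querying it directly. A minimal sketch, assuming the Prometheus server installed by the Kubeadapt chart is exposed by a service named kubeadapt-prometheus-server on port 80 (adjust the service name and port to your installation):

```bash
# Port-forward the Prometheus server bundled with Kubeadapt (service name is an assumption)
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &

# Query the GPU utilization metric via the Prometheus HTTP API
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result'
```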
Expected output:
```json
[
  {
    "metric": {
      "__name__": "DCGM_FI_DEV_GPU_UTIL",
      "gpu": "0",
      "instance": "10.0.1.42:9400",
      "job": "dcgm-exporter"
    },
    "value": [1705315200, "85.5"]
  }
]
```
Option 2: Use Existing DCGM Exporter
If you already have DCGM Exporter running in your cluster:
Step 1: Identify Your DCGM Exporter
Find the namespace and labels of your existing DCGM Exporter:
```bash
kubectl get pods --all-namespaces -l app=dcgm-exporter
```
Example output:
```text
NAMESPACE    NAME                   READY   STATUS    RESTARTS   AGE
gpu-system   dcgm-exporter-abc123   1/1     Running   0          10d
```
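Before wiring up scraping, you can confirm the exporter actually serves the metric. A quick check, assuming it listens on the default DCGM Exporter port 9400 (as in the example instance shown earlier); adjust the namespace and pod name to your cluster:

```bash
# Port-forward the existing DCGM Exporter pod
kubectl port-forward -n gpu-system pod/dcgm-exporter-abc123 9400:9400 &

# The metric should appear in the exporter's /metrics output
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```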
Step 2: Configure Prometheus Scraping
Create an existing-dcgm-values.yaml file with your DCGM Exporter details:
```yaml
# existing-dcgm-values.yaml

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Scrape existing DCGM Exporter
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - gpu-system  # Change to your namespace
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter  # Change if your label is different
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$
```
Apply the configuration:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f existing-dcgm-values.yaml
```
Step 3: Verify Scraping
Verify DCGM Exporter pods are running:
```bash
kubectl get pods -n gpu-system -l app=dcgm-exporter  # use your exporter's namespace and labels
```
All pods should show Running status.
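To confirm Prometheus is actually scraping the exporter, you can inspect the active targets for the dcgm-exporter job. A sketch, again assuming the bundled Prometheus is exposed as kubeadapt-prometheus-server (adjust to your installation):

```bash
# Port-forward the Prometheus server bundled with Kubeadapt
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &

# List active targets for the dcgm-exporter job and their scrape health
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "dcgm-exporter") | {scrapeUrl, health}'
```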
Agent Configuration
Enable GPU Monitoring in Agent
To enable GPU cost tracking in the Kubeadapt agent, add the GPU monitoring flag:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true
```
Or in values.yaml:
```yaml
agent:
  enabled: true
  config:
    token: "YOUR_TOKEN"
    enableGpuMonitoring: true
```
Verify Agent Configuration
Check agent logs to confirm GPU monitoring is enabled:
```bash
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu
```
Expected output:
```text
INFO: GPU monitoring enabled
INFO: Discovered 4 GPU nodes in cluster
INFO: Collecting metrics from DCGM exporter
```
Viewing GPU Costs in Dashboard
Once configured, GPU metrics will appear in your Kubeadapt dashboard:
Dashboard Features
1. GPU Cost Overview
- Total GPU spend per month
- GPU utilization percentage
- GPU-enabled nodes count
2. GPU Utilization by Node
- Per-node GPU usage graphs
- Idle GPU identification
- GPU memory utilization
3. Workload GPU Usage
- GPU allocation per pod
- GPU request vs. actual usage
- Rightsizing recommendations for GPU workloads
GPU Metrics Available
Node-level:
- GPU count per node
- GPU model and memory capacity
Workload-level:
- GPU requests (nvidia.com/gpu)
- GPU compute utilization percentage
- GPU idle time detection
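Because the scrape config above keeps only DCGM_FI_DEV_GPU_UTIL, node-level utilization and idle detection can be derived from that one metric with standard PromQL. Example queries, run against the same port-forwarded Prometheus as earlier (label names may differ depending on your exporter version):

```bash
# Average GPU compute utilization per scraped exporter instance
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg by (instance) (DCGM_FI_DEV_GPU_UTIL)'

# GPUs that averaged below 5% utilization over the last hour (idle candidates)
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 5'
```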
Troubleshooting
DCGM Exporter pods not running
Symptoms:
```bash
kubectl get pods -n kubeadapt | grep dcgm
# No pods or CrashLoopBackOff
```
Common causes:
- No GPU nodes in cluster
- NVIDIA device plugin not installed
- GPU Operator failed to install
Solution:
```bash
# Check for GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu" != null)'

# Check GPU Operator logs
kubectl logs -n kubeadapt deployment/gpu-operator

# Reinstall GPU Operator
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=false
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true
```
No GPU metrics appearing
Symptoms:
GPU cost data not visible in Kubeadapt dashboard.
Common causes:
- DCGM Exporter not scraped by Prometheus
- Incorrect namespace or labels in scrape config
- DCGM Exporter not exposing metrics
Solution:
```bash
# Verify DCGM Exporter is running
# (adjust namespace/labels to your deployment, e.g. kubeadapt for Option 1 or gpu-system for an existing exporter)
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check DCGM Exporter logs
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter

# Verify Kubeadapt agent logs
kubectl logs -n kubeadapt -l app=kubeadapt-agent | grep -i gpu
```
GPU costs not showing in dashboard
Symptoms:
- DCGM metrics available in Prometheus
- GPU costs not visible in Kubeadapt dashboard
Common causes:
- Agent GPU monitoring not enabled
- Agent not collecting GPU metrics
- GPU pricing not configured
Solution:
```bash
# Enable GPU monitoring in agent
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true

# Check agent logs
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu

# Verify GPU pricing is configured in dashboard
# Navigate to Settings → Cloud Providers → GPU Pricing
```
MIG Mode Limitation
IMPORTANT: DCGM Exporter in Kubernetes mode does NOT support container-level GPU utilization mapping when MIG (Multi-Instance GPU) is enabled.
If using MIG:
- Node-level GPU metrics: Available
- Container-level GPU metrics: Not available
- Future: eBPF-based agent will support MIG mode
Workaround:
- Use GPU node labels for cost allocation
- Manual GPU cost distribution based on GPU requests (see the sketch after this list)
- Wait for eBPF agent support (roadmap)
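For the manual distribution workaround, a minimal sketch of proportional allocation by GPU requests; the node cost and request counts below are hypothetical:

```bash
# Hypothetical: one GPU node costing $3.67/hour, two pods requesting 3 and 1 GPU slices
NODE_GPU_COST=3.67
POD_A_REQUESTS=3
POD_B_REQUESTS=1
TOTAL=$((POD_A_REQUESTS + POD_B_REQUESTS))

# Split the node's GPU cost in proportion to each pod's requests
awk -v c="$NODE_GPU_COST" -v a="$POD_A_REQUESTS" -v b="$POD_B_REQUESTS" -v t="$TOTAL" \
  'BEGIN { printf "pod-a: $%.2f/hour\npod-b: $%.2f/hour\n", c*a/t, c*b/t }'
```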
Best Practices
1. Right-size GPU Requests
Monitor GPU utilization and adjust requests:
```yaml
# Before (over-provisioned)
resources:
  requests:
    nvidia.com/gpu: 1  # GPU utilization: 25%

# After (right-sized)
resources:
  requests:
    nvidia.com/gpu: 0  # Moved to CPU-only node
```
2. Use GPU Node Taints
Prevent non-GPU workloads from running on expensive GPU nodes:
```bash
# Taint GPU nodes
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
```
```yaml
# GPU workloads need a toleration
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```
3. Enable GPU Time-Slicing (Optional)
For multiple workloads sharing a single GPU:
```yaml
# GPU Operator configuration
gpu-operator:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 4  # 4 containers can share 1 GPU
```
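Once time-slicing is applied, each physical GPU should be advertised as multiple schedulable nvidia.com/gpu resources. A quick way to check the advertised count on a node (replace <gpu-node> with your node name):

```bash
# With replicas: 4 and one physical GPU, Capacity/Allocatable should report nvidia.com/gpu: 4
kubectl describe node <gpu-node> | grep -i 'nvidia.com/gpu'
```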
GPU Pricing Configuration
Configure GPU Costs in Dashboard
1. Navigate to Settings → Cloud Providers
2. Select your cloud provider (AWS, GCP, Azure)
3. In the GPU Pricing section:
   - Set hourly cost per GPU type
   - Or enable automatic pricing from the cloud provider API
Example GPU pricing:
```text
NVIDIA A100 (80GB): $3.67/hour
NVIDIA V100:        $2.48/hour
NVIDIA T4:          $0.95/hour
```
On-Premises GPU Pricing
For on-prem clusters, calculate GPU cost based on:
```text
GPU Hourly Cost = (Hardware Cost / Depreciation Period in Years) / Hours per Year

Example:
- Hardware: $10,000 per GPU
- Depreciation: 3 years
- Hourly cost: $10,000 / (3 × 365 × 24) = $0.38/hour
```
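The same calculation as a one-liner, if you want to plug in your own numbers:

```bash
# Hourly GPU cost = hardware cost / (depreciation years * 365 * 24)
awk -v cost=10000 -v years=3 'BEGIN { printf "$%.2f/hour\n", cost / (years * 365 * 24) }'
```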
What's Next?
After enabling GPU monitoring:
- Dashboard - View GPU costs and utilization
- 💎 Available Savings - GPU rightsizing recommendations
- Workload Details - Per-pod GPU usage
- Cost Query - Custom GPU cost queries
Need Help?
- NVIDIA GPU Operator Docs
- DCGM Exporter Docs
- Support - Email support team