Node Monitoring
Overview
Node Monitoring provides infrastructure-level visibility through node groups (auto-scaling groups or managed node pools):
- Node costs - Hourly and monthly costs per node and node group
- Resource utilization - CPU, memory, and GPU usage percentages
- Efficiency metrics - How well nodes pack pods and use resources
- Instance type tracking - Which machine types are deployed
- Health monitoring - Healthy, warning, and critical node counts
- Spot instance recommendations - Workloads suitable for spot instances
Access: Select cluster (Clusters page or sidebar dropdown) → Nodes
Node Group Metrics
Summary Cards:
- Total Nodes: Aggregate count across all filtered groups
- Total Cost: Hourly rate for all nodes (multiply by 720, i.e. 30 days of 24 hours, for an approximate monthly figure)
- Avg Efficiency: Average efficiency score across all node groups
- Healthy Nodes: Count of nodes in ready state
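The hourly-to-monthly conversion mentioned for the Total Cost card can be sketched as follows. This is a minimal illustration; the function name and the sample hourly rate are hypothetical and not part of the product.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the multiplier quoted above

def monthly_cost(total_cost_hourly: float) -> float:
    """Approximate monthly cost from the hourly Total Cost card."""
    return total_cost_hourly * HOURS_PER_MONTH

# Hypothetical reading of the Total Cost card: $1.92 per hour.
print(f"${monthly_cost(1.92):,.2f} per month")
```

An average calendar month has about 730 hours (8,760 / 12), so the 720-hour multiplier slightly understates a full year's spend.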
Per-Group Metrics:
- nodeCount - Number of nodes in group
- totalCostHourly - Combined hourly cost for all nodes in group
- avgEfficiency - Average of CPU and memory efficiency
- instanceType - EC2/GCE instance type (e.g., m5.2xlarge, t3.large)
- region - Cloud provider region
- zones[] - Availability zones
- totalPods - Total pods scheduled across all nodes in group
Health Status:
- health.healthy - Nodes in Ready state
- health.warning - Nodes with degraded performance
- health.critical - Nodes in NotReady or error state
Trend Data:
- trends.nodeCountChange - Node count delta vs previous period
- trends.costChangePercent - Cost change percentage
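The summary cards can be reproduced from the per-group metrics. The sketch below assumes each node group is available as a dictionary keyed by the field names above; the payload shape and the sample values are hypothetical.

```python
from statistics import mean

# Hypothetical node group records using the per-group field names listed above.
groups = [
    {"nodeCount": 6, "totalCostHourly": 2.304, "avgEfficiency": 62.0,
     "health": {"healthy": 5, "warning": 1, "critical": 0}},
    {"nodeCount": 3, "totalCostHourly": 0.2496, "avgEfficiency": 38.0,
     "health": {"healthy": 3, "warning": 0, "critical": 0}},
]

def summarize(groups: list[dict]) -> dict:
    """Aggregate per-group metrics into the four summary card values."""
    return {
        "total_nodes": sum(g["nodeCount"] for g in groups),
        "total_cost_hourly": sum(g["totalCostHourly"] for g in groups),
        # Unweighted mean across groups, following the Avg Efficiency description.
        "avg_efficiency": mean(g["avgEfficiency"] for g in groups) if groups else 0.0,
        "healthy_nodes": sum(g["health"]["healthy"] for g in groups),
    }

summary = summarize(groups)
print(summary)
```

Because the efficiency average is unweighted, a small inefficient group moves it as much as a large one; weighting by nodeCount would be an alternative if that is not the intent.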
Individual Node Metrics
Identity Fields:
- name - Node hostname (e.g., ip-10-0-142-18.us-east-1.compute.internal)
- instance_type - Instance size (m5.2xlarge, t3.large, etc.)
- arch - Architecture (amd64, arm64)
- os - Operating system
- age_days - Days since node creation
Resource Capacity:
- total_cpu - Allocatable CPU cores
- memory_total_bytes - Allocatable memory in bytes
Resource Usage:
- cpu_usage_percent - Current CPU utilization (0-100%)
- memory_usage_percent - Current memory utilization (0-100%)
- gpu_usage_percent - GPU utilization if GPU-enabled (optional)
- gpu_model - GPU hardware model (e.g., NVIDIA Tesla T4)
Cost Breakdown:
- total_cost - Total hourly cost for this node
- cpu_cost - CPU portion of hourly cost
- memory_cost - Memory portion of hourly cost
- gpu_cost - GPU cost if applicable
- cpu_cost_per_core - Unit cost per CPU core/hour
- memory_cost_per_gb - Unit cost per GB memory/hour
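One plausible reading of how these cost fields relate is sketched below. It assumes, without confirmation from this page, that cpu_cost is total_cpu times cpu_cost_per_core, that memory_cost is memory in GB times memory_cost_per_gb, and that total_cost is the sum of the parts. The unit prices and the GiB conversion are illustrative.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: "GB" is treated as GiB here

def cost_breakdown(total_cpu: float, memory_total_bytes: int,
                   cpu_cost_per_core: float, memory_cost_per_gb: float,
                   gpu_cost: float = 0.0) -> dict:
    """Derive the hourly cost fields of one node from unit prices."""
    cpu_cost = total_cpu * cpu_cost_per_core
    memory_cost = (memory_total_bytes / BYTES_PER_GB) * memory_cost_per_gb
    return {
        "cpu_cost": cpu_cost,
        "memory_cost": memory_cost,
        "gpu_cost": gpu_cost,
        "total_cost": cpu_cost + memory_cost + gpu_cost,
    }

# Hypothetical 8-core, 32 GiB node with made-up unit prices.
node = cost_breakdown(total_cpu=8, memory_total_bytes=32 * BYTES_PER_GB,
                      cpu_cost_per_core=0.03, memory_cost_per_gb=0.004)
print(node)
```

If the reported total_cost for a node differs from the sum of its parts, the difference likely reflects costs this sketch does not model.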
Efficiency Scores:
- cpu_efficiency - See Resource Efficiency
- memory_efficiency - See Resource Efficiency
- workload_efficiency - Combined CPU + memory efficiency
Location:
- region - Cloud provider region (us-east-1, us-west-2, etc.)
- zone - Availability zone (us-east-1a, us-east-1b, etc.)
Filtering & Sorting
Available Filters:
Search:
- Filter by node group name or instance type
Node Count Range:
- nodeCountMin - Minimum nodes per group
- nodeCountMax - Maximum nodes per group
Cost Range:
- costMin - Minimum hourly cost per group
- costMax - Maximum hourly cost per group
Efficiency Levels:
- high - High efficiency
- medium - Medium efficiency
- low - Low efficiency
- all - No efficiency filter
GPU Enabled:
- yes - Only GPU-enabled node groups
- no - Only non-GPU node groups
- all - All node groups
Location:
- region - Filter by specific cloud region
- zone - Filter by specific availability zone
- instanceType - Filter by instance type
Timeframe Selection:
- Historical data selection
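The filters combine as a logical AND: a node group is shown only if it passes every filter that is set. A minimal client-side sketch of that behavior follows. The GroupFilter class, the matches function, and the "name" key are hypothetical. The efficiency-level and GPU filters are omitted because their thresholds and underlying fields are not specified here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroupFilter:
    """Filter values; None (or an empty search) means the filter is not applied."""
    search: str = ""
    nodeCountMin: Optional[int] = None
    nodeCountMax: Optional[int] = None
    costMin: Optional[float] = None
    costMax: Optional[float] = None
    region: Optional[str] = None
    instanceType: Optional[str] = None

def matches(group: dict, f: GroupFilter) -> bool:
    """Return True when a node group passes every filter that is set."""
    text = f.search.lower()
    if text and (text not in group["name"].lower()
                 and text not in group["instanceType"].lower()):
        return False
    if f.nodeCountMin is not None and group["nodeCount"] < f.nodeCountMin:
        return False
    if f.nodeCountMax is not None and group["nodeCount"] > f.nodeCountMax:
        return False
    if f.costMin is not None and group["totalCostHourly"] < f.costMin:
        return False
    if f.costMax is not None and group["totalCostHourly"] > f.costMax:
        return False
    if f.region is not None and group["region"] != f.region:
        return False
    if f.instanceType is not None and group["instanceType"] != f.instanceType:
        return False
    return True

# Hypothetical node groups; "name" is an assumed key for the group name.
groups = [
    {"name": "general-pool", "instanceType": "m5.2xlarge", "region": "us-east-1",
     "nodeCount": 6, "totalCostHourly": 2.304},
    {"name": "burst-pool", "instanceType": "t3.large", "region": "us-west-2",
     "nodeCount": 3, "totalCostHourly": 0.2496},
]

selected = [g["name"] for g in groups
            if matches(g, GroupFilter(nodeCountMin=4, costMax=3.0))]
print(selected)
```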
Spot Instance Recommendations
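Purpose: Identify workloads that can safely run on spot/preemptible instances. Savings are typically in the 70-80% range relative to on-demand pricing, but actual discounts vary by instance type, region, and time.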
Recommendation Fields:
- resource_name - Workload name (Deployment, StatefulSet)
- namespace - Kubernetes namespace
- resource_type - Deployment, StatefulSet, DaemonSet
- priority - High, Medium, Low
Cost Analysis:
- current_hourly_cost - Current on-demand cost
- target_hourly_cost - Projected spot instance cost
- estimated_savings - Monthly savings estimate
- savings_percentage - Percentage cost reduction
Migration Assessment:
- is_migratable - Boolean: safe to migrate
- controller_type - Deployment type
- current_replicas - Number of replicas
- minimum_recommended_replicas - Minimum for HA
- has_pdb - PodDisruptionBudget configured
Compatibility Checks:
- controller_type_ok - Controller supports spot
- storage_compatible - Storage type works with spot
- pdb_compatible - PDB configuration allows spot
- rolling_update_ok - Rolling update configured
- volume_type_ok - Volume type supports spot
Storage Configuration:
- local_storage_enabled - Uses local storage
- volume_type - PVC storage class
- storage_migratable - Storage survives node termination
PDB Configuration:
- min_available - Minimum pods available
- threshold - Threshold percentage
- pdb_migratable - PDB allows spot migration
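The cost analysis and compatibility fields can be cross-checked by hand. The sketch below assumes that savings_percentage is the relative difference between current and target hourly cost, and that estimated_savings uses a 720-hour month. Neither formula is confirmed by this page, and the sample recommendation is hypothetical.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours

# Boolean compatibility fields listed above.
COMPATIBILITY_CHECKS = (
    "controller_type_ok", "storage_compatible", "pdb_compatible",
    "rolling_update_ok", "volume_type_ok",
)

def failed_checks(rec: dict) -> list[str]:
    """Names of compatibility checks that are missing or false."""
    return [name for name in COMPATIBILITY_CHECKS if not rec.get(name, False)]

def savings(rec: dict) -> dict:
    """Recompute savings from the current and target hourly costs."""
    current = rec["current_hourly_cost"]
    hourly_delta = current - rec["target_hourly_cost"]
    return {
        "estimated_savings": hourly_delta * HOURS_PER_MONTH,
        "savings_percentage": (hourly_delta / current * 100) if current else 0.0,
    }

# Hypothetical recommendation record.
rec = {
    "resource_name": "checkout-api", "namespace": "shop",
    "current_hourly_cost": 1.0, "target_hourly_cost": 0.25,
    "controller_type_ok": True, "storage_compatible": True,
    "pdb_compatible": False, "rolling_update_ok": True, "volume_type_ok": True,
}

print(savings(rec))
print(failed_checks(rec))
```

Listing the failed checks by name, rather than reducing them to one boolean, shows what to fix before migrating (here, the PodDisruptionBudget configuration).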
Instance Type Specifications
AWS Examples (2025 Pricing):
t3.large:
- CPU: 2 cores
- Memory: 8 GB
- Hourly cost: $0.0832
- Use case: Burstable, low-traffic workloads
m5.large:
- CPU: 2 cores
- Memory: 8 GB
- Hourly cost: $0.096
- Use case: General purpose
m5.xlarge:
- CPU: 4 cores
- Memory: 16 GB
- Hourly cost: $0.192
- Use case: Moderate workloads
m5.2xlarge:
- CPU: 8 cores
- Memory: 32 GB
- Hourly cost: $0.384
- Use case: High-performance workloads
m5n.large:
- CPU: 2 cores
- Memory: 8 GB
- Hourly cost: $0.119
- Use case: Network-optimized
Node Health Status
Health States:
Healthy:
- Node in Ready state
- All health checks passing
- No resource pressure
Warning:
- Degraded performance
- High resource utilization
- Minor health check failures
Critical:
- NotReady state
- Unreachable
- Out of memory/disk
- System component failures
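The three health states can be approximated from standard Kubernetes node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) plus the usage percentages. The mapping below is an illustrative guess, not the product's actual rule set, and the 90% threshold is an assumption.

```python
def classify_node(conditions: dict[str, str],
                  cpu_usage_percent: float,
                  memory_usage_percent: float,
                  high_usage_threshold: float = 90.0) -> str:
    """Map node conditions and utilization to healthy / warning / critical.

    `conditions` maps a Kubernetes condition type to its status string
    ("True", "False", or "Unknown").
    """
    # Ready "False" is NotReady; "Unknown" usually means the node is unreachable.
    if conditions.get("Ready") != "True":
        return "critical"
    # Assumption: memory or disk pressure counts as "out of memory/disk".
    if any(conditions.get(c) == "True" for c in ("MemoryPressure", "DiskPressure")):
        return "critical"
    # Assumption: PID pressure or very high utilization is only a warning.
    if (conditions.get("PIDPressure") == "True"
            or max(cpu_usage_percent, memory_usage_percent) >= high_usage_threshold):
        return "warning"
    return "healthy"

print(classify_node({"Ready": "True"}, 40.0, 55.0))
print(classify_node({"Ready": "True"}, 95.0, 55.0))
print(classify_node({"Ready": "Unknown"}, 0.0, 0.0))
```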
Common Workflows
Review Node Group Efficiency:
- Navigate to Dashboard → Nodes Tab
- Sort by avgEfficiency (lowest first)
- Identify groups with low efficiency
- Click group to view individual nodes
- Review pod allocation and resource usage
Identify Spot Instance Candidates:
- Navigate to cluster → Spot Recommendations
- Filter by priority "High"
- Check is_migratable = true
- Review compatibility checks
- Estimate monthly savings
- Implement via node selector or taints/tolerations
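The last step, steering a workload onto spot nodes, amounts to adding a nodeSelector and, where spot nodes are tainted, a matching toleration to the pod template. The sketch below builds such a patch body as a Python dictionary. The helper name is hypothetical, and the label and taint keys depend on the cluster. The example uses the EKS managed node group label eks.amazonaws.com/capacityType=SPOT, while GKE labels spot nodes with cloud.google.com/gke-spot=true; verify the keys for your provider.

```python
import json
from typing import Optional

def spot_scheduling_patch(label_key: str, label_value: str,
                          taint_key: Optional[str] = None) -> dict:
    """Build a pod-template patch that targets spot nodes.

    label_key/label_value: node label identifying spot capacity.
    taint_key: taint on spot nodes to tolerate, if the cluster uses one.
    """
    pod_spec: dict = {"nodeSelector": {label_key: label_value}}
    if taint_key is not None:
        pod_spec["tolerations"] = [
            {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
        ]
    return {"spec": {"template": {"spec": pod_spec}}}

# Example keys only: an EKS-style capacity label and a hypothetical "spot" taint.
patch = spot_scheduling_patch("eks.amazonaws.com/capacityType", "SPOT",
                              taint_key="spot")
print(json.dumps(patch, indent=2))
```

The printed JSON has the shape of a patch body for a Deployment (for example, one passed to `kubectl patch deployment <name> -p '<json>'`). Test it on a non-production workload first.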
Find Cost Optimization Opportunities:
- Sort node groups by totalCostHourly
- Check avgEfficiency for top expensive groups
- For low-efficiency groups, consider:
  - Smaller instance types
  - Node pool consolidation
  - Autoscaler adjustments
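The same triage can be scripted against exported node group data. In this sketch the function name, the top_n cutoff, and the efficiency threshold of 50 are all hypothetical; the page does not define what counts as low efficiency.

```python
def optimization_candidates(groups: list[dict], top_n: int = 5,
                            efficiency_threshold: float = 50.0) -> list[dict]:
    """Most expensive node groups whose avgEfficiency is below the threshold."""
    most_expensive = sorted(groups, key=lambda g: g["totalCostHourly"],
                            reverse=True)[:top_n]
    return [g for g in most_expensive if g["avgEfficiency"] < efficiency_threshold]

# Hypothetical node groups; "name" is an assumed key for the group name.
groups = [
    {"name": "general-pool", "totalCostHourly": 2.304, "avgEfficiency": 62.0},
    {"name": "batch-pool", "totalCostHourly": 4.608, "avgEfficiency": 31.0},
    {"name": "burst-pool", "totalCostHourly": 0.2496, "avgEfficiency": 38.0},
]

for g in optimization_candidates(groups, top_n=2):
    print(g["name"], g["totalCostHourly"], g["avgEfficiency"])
```

Ranking by cost before filtering by efficiency keeps attention on the groups where a fix saves the most. A cheap, inefficient group such as burst-pool falls outside the top two and is not reported.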
Monitor Node Health:
- Check health.critical count
- If >0, click group to view problematic nodes
- Review node logs and events
- Consider replacing the node if issues persist