Node Monitoring
Overview
Node Monitoring provides infrastructure-level visibility through node groups (auto-scaling groups or managed node pools):
- Node costs - Hourly and monthly costs per node and node group
- Resource utilization - CPU, memory, and GPU usage percentages
- Efficiency metrics - How well nodes pack pods and use resources
- Instance type tracking - Which machine types are deployed
- Health monitoring - Healthy, warning, and critical node counts
- Spot instance recommendations - Workloads suitable for spot instances
Access: Select cluster (Clusters page or sidebar dropdown) → Nodes
Node Group Metrics
Summary Cards:
- Total Nodes: Aggregate count across all filtered groups
- Total Cost: Hourly rate for all nodes (multiply by 720, i.e. 24 hours x 30 days, for an approximate monthly cost)
- Avg Efficiency: Average efficiency score across all node groups
- Healthy Nodes: Count of nodes in ready state
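The summary cards can be read as simple aggregates over the per-group records. The sketch below is illustrative only: it assumes each group is a dict carrying the per-group fields documented on this page, that Avg Efficiency is an unweighted mean across groups, and that efficiency is on a 0-100 scale. None of these assumptions is confirmed by the product.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days, the multiplier quoted for Total Cost


def summarize(groups):
    """Aggregate per-group records into the four summary-card values."""
    total_hourly = sum(g["totalCostHourly"] for g in groups)
    # Assumption: an unweighted mean across groups, not weighted by node count.
    avg_eff = sum(g["avgEfficiency"] for g in groups) / len(groups) if groups else 0.0
    return {
        "total_nodes": sum(g["nodeCount"] for g in groups),
        "total_cost_hourly": total_hourly,
        "total_cost_monthly": total_hourly * HOURS_PER_MONTH,
        "avg_efficiency": avg_eff,
        "healthy_nodes": sum(g["health"]["healthy"] for g in groups),
    }


# Illustrative data, not real cluster output.
groups = [
    {"nodeCount": 3, "totalCostHourly": 1.152, "avgEfficiency": 60.0, "health": {"healthy": 3}},
    {"nodeCount": 2, "totalCostHourly": 0.1664, "avgEfficiency": 40.0, "health": {"healthy": 1}},
]
summary = summarize(groups)
```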
Per-Group Metrics:
- nodeCount - Number of nodes in group
- totalCostHourly - Combined hourly cost for all nodes in group
- avgEfficiency - Average of CPU and memory efficiency
- instanceType - EC2/GCE instance type (e.g., m5.2xlarge, t3.large)
- region - Cloud provider region
- zones[] - Availability zones
- totalPods - Total pods scheduled across all nodes in group
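As a rough illustration of avgEfficiency, the sketch below averages CPU and memory efficiency per node and then averages those values across the group. Only the "average of CPU and memory efficiency" part comes from this page; how the product rolls node values up into a group value is an assumption here.

```python
def group_avg_efficiency(nodes):
    """Mean over nodes of (cpu_efficiency + memory_efficiency) / 2."""
    if not nodes:
        return 0.0
    per_node = [(n["cpu_efficiency"] + n["memory_efficiency"]) / 2 for n in nodes]
    # Assumption: every node counts equally, regardless of its size.
    return sum(per_node) / len(per_node)


# Illustrative node records using the per-node field names from this page.
nodes = [
    {"cpu_efficiency": 80.0, "memory_efficiency": 60.0},
    {"cpu_efficiency": 40.0, "memory_efficiency": 20.0},
]
avg = group_avg_efficiency(nodes)
```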
Health Status:
- health.healthy - Nodes in Ready state
- health.warning - Nodes with degraded performance
- health.critical - Nodes in NotReady or error state
Trend Data:
- trends.nodeCountChange - Node count delta vs previous period
- trends.costChangePercent - Cost change percentage
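The two trend fields can be derived from a pair of snapshots. This is a minimal sketch, assuming the percentage is computed relative to the previous period's hourly cost; the function name and the handling of a zero baseline are hypothetical.

```python
def compute_trends(current, previous):
    """Derive the trend fields from current and previous period snapshots."""
    prev_cost = previous["totalCostHourly"]
    # A zero baseline has no meaningful percentage change, so return None.
    cost_pct = None
    if prev_cost != 0:
        cost_pct = (current["totalCostHourly"] - prev_cost) / prev_cost * 100
    return {
        "nodeCountChange": current["nodeCount"] - previous["nodeCount"],
        "costChangePercent": cost_pct,
    }


trends = compute_trends(
    {"nodeCount": 5, "totalCostHourly": 1.92},
    {"nodeCount": 4, "totalCostHourly": 1.536},
)
```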
Individual Node Metrics
Identity Fields:
- name - Node hostname (e.g., ip-10-0-142-18.us-east-1.compute.internal)
- instance_type - Instance size (m5.2xlarge, t3.large, etc.)
- arch - Architecture (amd64, arm64)
- os - Operating system
- age_days - Days since node creation
Resource Capacity:
- total_cpu - Allocatable CPU cores
- memory_total_bytes - Allocatable memory in bytes
Resource Usage:
- cpu_usage_percent - Current CPU utilization (0-100%)
- memory_usage_percent - Current memory utilization (0-100%)
- gpu_usage_percent - GPU utilization if GPU-enabled (optional)
- gpu_model - GPU hardware model (e.g., NVIDIA Tesla T4)
Cost Breakdown:
- total_cost - Total hourly cost for this node
- cpu_cost - CPU portion of hourly cost
- memory_cost - Memory portion of hourly cost
- gpu_cost - GPU cost if applicable
- cpu_cost_per_core - Unit cost per CPU core/hour
- memory_cost_per_gb - Unit cost per GB memory/hour
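One plausible way the cost fields relate is sketched below. It assumes the CPU and memory portions are unit rate times allocatable capacity, that the total is the sum of the three portions, and that "GB" means 2^30 bytes. The unit rates in the example are made up for illustration, and the product may allocate cost differently.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: "GB" means GiB; use 10**9 if decimal GB applies


def node_cost_breakdown(total_cpu, memory_total_bytes,
                        cpu_cost_per_core, memory_cost_per_gb, gpu_cost=0.0):
    """Split a node's hourly cost into CPU, memory, and GPU portions."""
    cpu_cost = total_cpu * cpu_cost_per_core
    memory_cost = memory_total_bytes / BYTES_PER_GB * memory_cost_per_gb
    return {
        "cpu_cost": cpu_cost,
        "memory_cost": memory_cost,
        "gpu_cost": gpu_cost,
        "total_cost": cpu_cost + memory_cost + gpu_cost,
    }


# Hypothetical unit rates for an 8-core, 32 GB node without a GPU.
breakdown = node_cost_breakdown(8, 32 * BYTES_PER_GB, 0.03, 0.0045)
```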
Efficiency Scores:
- cpu_efficiency - See Resource Efficiency
- memory_efficiency - See Resource Efficiency
- workload_efficiency - Combined CPU + memory efficiency
Location:
- region - Cloud provider region (us-east-1, us-west-2, etc.)
- zone - Availability zone (us-east-1a, us-east-1b, etc.)
Filtering & Sorting
Available Filters:
Search:
- Filter by node group name or instance type
Node Count Range:
- nodeCountMin - Minimum nodes per group
- nodeCountMax - Maximum nodes per group
Cost Range:
- costMin - Minimum hourly cost per group
- costMax - Maximum hourly cost per group
Efficiency Levels:
- high - High efficiency
- medium - Medium efficiency
- low - Low efficiency
- all - No efficiency filter
GPU Enabled:
- yes - Only GPU-enabled node groups
- no - Only non-GPU node groups
- all - All node groups
Location:
- region - Filter by specific cloud region
- zone - Filter by specific availability zone
- instanceType - Filter by instance type
Timeframe Selection:
- Historical data selection
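To make the filter semantics concrete, here is a sketch of how the filters could be applied client-side to a list of node-group records. The group fields name and gpuEnabled and the filter keys search and gpuEnabled are hypothetical names chosen for illustration. The efficiency-level filter is left out because the thresholds behind high, medium, and low are not documented here.

```python
def matches(group, filters):
    """Return True if a node-group record passes every supplied filter."""
    search = filters.get("search", "").lower()
    if search and search not in group["name"].lower() \
            and search not in group["instanceType"].lower():
        return False
    if not filters.get("nodeCountMin", 0) <= group["nodeCount"] \
            <= filters.get("nodeCountMax", float("inf")):
        return False
    if not filters.get("costMin", 0.0) <= group["totalCostHourly"] \
            <= filters.get("costMax", float("inf")):
        return False
    gpu = filters.get("gpuEnabled", "all")  # "yes", "no", or "all"
    if gpu == "yes" and not group["gpuEnabled"]:
        return False
    if gpu == "no" and group["gpuEnabled"]:
        return False
    if "region" in filters and group["region"] != filters["region"]:
        return False
    if "zone" in filters and filters["zone"] not in group["zones"]:
        return False
    if "instanceType" in filters and group["instanceType"] != filters["instanceType"]:
        return False
    return True


# Illustrative records, not real cluster output.
groups = [
    {"name": "general", "instanceType": "m5.xlarge", "nodeCount": 4, "totalCostHourly": 0.768,
     "gpuEnabled": False, "region": "us-east-1", "zones": ["us-east-1a", "us-east-1b"]},
    {"name": "burst", "instanceType": "t3.large", "nodeCount": 2, "totalCostHourly": 0.1664,
     "gpuEnabled": False, "region": "us-west-2", "zones": ["us-west-2a"]},
]
selected = [g["name"] for g in groups if matches(g, {"nodeCountMin": 3, "region": "us-east-1"})]
```

Filters that are omitted impose no constraint, which mirrors the "all" options listed above.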
Spot Instance Recommendations
Purpose: Identify workloads that can safely run on spot/preemptible instances, typically for 70-80% cost savings compared with on-demand pricing.
Recommendation Fields:
- resource_name - Workload name (Deployment, StatefulSet)
- namespace - Kubernetes namespace
- resource_type - Deployment, StatefulSet, DaemonSet
- priority - High, Medium, Low
Cost Analysis:
- current_hourly_cost - Current on-demand cost
- target_hourly_cost - Projected spot instance cost
- estimated_savings - Monthly savings estimate
- savings_percentage - Percentage cost reduction
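The cost-analysis fields are related by straightforward arithmetic. The sketch below assumes a 720-hour month (the multiplier used for Total Cost on this page) and that the percentage is taken relative to the current on-demand cost; the function name is hypothetical.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days


def spot_savings(current_hourly_cost, target_hourly_cost):
    """Return the monthly savings estimate and the percentage cost reduction."""
    hourly_saving = current_hourly_cost - target_hourly_cost
    percentage = 0.0
    if current_hourly_cost:
        percentage = hourly_saving / current_hourly_cost * 100
    return {
        "estimated_savings": hourly_saving * HOURS_PER_MONTH,
        "savings_percentage": percentage,
    }


# Illustrative prices: an on-demand rate of 0.384 and a projected spot rate of 0.096.
savings = spot_savings(0.384, 0.096)
```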
Migration Assessment:
- is_migratable - Boolean: safe to migrate
- controller_type - Deployment type
- current_replicas - Number of replicas
- minimum_recommended_replicas - Minimum for HA
- has_pdb - PodDisruptionBudget configured
Compatibility Checks:
- controller_type_ok - Controller supports spot
- storage_compatible - Storage type works with spot
- pdb_compatible - PDB configuration allows spot
- rolling_update_ok - Rolling update configured
- volume_type_ok - Volume type supports spot
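When reviewing a recommendation, it can help to list which compatibility checks failed rather than looking only at the overall flag. This is a minimal sketch; treating a missing check as failed is an assumption, and this page does not state how is_migratable is derived from the individual checks.

```python
CHECKS = (
    "controller_type_ok",
    "storage_compatible",
    "pdb_compatible",
    "rolling_update_ok",
    "volume_type_ok",
)


def failed_checks(recommendation):
    """Return the names of compatibility checks that are false or missing."""
    return [name for name in CHECKS if not recommendation.get(name, False)]


# Illustrative recommendation record.
rec = {
    "controller_type_ok": True,
    "storage_compatible": False,
    "pdb_compatible": True,
    "rolling_update_ok": True,
    "volume_type_ok": False,
}
failed = failed_checks(rec)
```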
Storage Configuration:
- local_storage_enabled - Uses local storage
- volume_type - PVC storage class
- storage_migratable - Storage survives node termination
PDB Configuration:
- min_available - Minimum pods available
- threshold - Threshold percentage
- pdb_migratable - PDB allows spot migration
Instance Type Specifications
AWS Examples (2025 Pricing):
t3.large:
- CPU: 2 cores
- Memory: 8 GB
- Hourly cost: $0.0832
- Use case: Burstable, low-traffic workloads
m5.large:
- CPU: 2 cores
- Memory: 8 GB
- Hourly cost: $0.096
- Use case: General purpose
m5.xlarge:
- CPU: 4 cores
- Memory: 16 GB
- Hourly cost: $0.192
- Use case: Moderate workloads
m5.2xlarge:
- CPU: 8 cores
- Memory: 32 GB
- Hourly cost: $0.384
- Use case: High-performance workloads
m5n.large:
- CPU: 2 cores
- Memory: 8 GB
- Hourly cost: $0.119
- Use case: Network-optimized
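For comparing instance types, the hourly prices can be converted into monthly and per-core figures. The sketch below copies three of the entries listed above and assumes a 720-hour month; the helper names are hypothetical.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days

# A subset of the instance types listed above.
INSTANCES = {
    "t3.large": {"cpu": 2, "memory_gb": 8, "hourly": 0.0832},
    "m5.xlarge": {"cpu": 4, "memory_gb": 16, "hourly": 0.192},
    "m5.2xlarge": {"cpu": 8, "memory_gb": 32, "hourly": 0.384},
}


def monthly_cost(instance_type, count=1):
    """Approximate monthly cost of `count` nodes of one instance type."""
    return INSTANCES[instance_type]["hourly"] * HOURS_PER_MONTH * count


def hourly_cost_per_core(instance_type):
    """Hourly price divided by the number of CPU cores."""
    spec = INSTANCES[instance_type]
    return spec["hourly"] / spec["cpu"]


three_node_pool = monthly_cost("m5.2xlarge", count=3)
```

In these figures m5.xlarge and m5.2xlarge cost the same per core, so the choice between them comes down to how well pods pack onto each size rather than to unit price.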
Node Health Status
Health States:
Healthy:
- Node in Ready state
- All health checks passing
- No resource pressure
Warning:
- Degraded performance
- High resource utilization
- Minor health check failures
Critical:
- NotReady state
- Unreachable
- Out of memory/disk
- System component failures
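The three states could be approximated from node readiness and utilization as sketched below. The input flags ready, reachable, and pressure and the 90% utilization threshold are hypothetical; the product's actual classification rules are not documented on this page.

```python
from collections import Counter

WARN_UTILIZATION = 90.0  # hypothetical threshold for "high resource utilization"


def classify(node):
    """Map one node record to "healthy", "warning", or "critical"."""
    if not node["ready"] or not node.get("reachable", True):
        return "critical"
    busiest = max(node["cpu_usage_percent"], node["memory_usage_percent"])
    if node.get("pressure", False) or busiest >= WARN_UTILIZATION:
        return "warning"
    return "healthy"


def health_counts(nodes):
    """Count nodes per state, in the shape of the health.* group fields."""
    counts = Counter(classify(n) for n in nodes)
    return {state: counts.get(state, 0) for state in ("healthy", "warning", "critical")}


# Illustrative node records.
nodes = [
    {"ready": True, "cpu_usage_percent": 35.0, "memory_usage_percent": 50.0},
    {"ready": True, "cpu_usage_percent": 95.0, "memory_usage_percent": 60.0},
    {"ready": False, "cpu_usage_percent": 0.0, "memory_usage_percent": 0.0},
]
counts = health_counts(nodes)
```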
Common Workflows
Review Node Group Efficiency:
- Navigate to Dashboard → Nodes Tab
- Sort by avgEfficiency (lowest first)
- Identify groups with low efficiency
- Click group to view individual nodes
- Review pod allocation and resource usage
Identify Spot Instance Candidates:
- Navigate to cluster → Spot Recommendations
- Filter by priority "High"
- Check is_migratable = true
- Review compatibility checks
- Estimate monthly savings
- Implement via node selector or taints/tolerations
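The last step can be sketched as a patch to the workload's pod template. The example builds the patch as a plain Python dict. The label eks.amazonaws.com/capacityType with value SPOT is the one EKS managed node groups apply to spot nodes; the taint key example.com/spot is hypothetical and should be replaced with whatever taint your spot nodes actually carry, and other providers use different labels.

```python
import json


def spot_scheduling_patch(selector_label, selector_value, taint_key):
    """Build a Deployment patch that targets spot nodes and tolerates their taint."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # nodeSelector restricts the pods to nodes carrying this label.
                    "nodeSelector": {selector_label: selector_value},
                    # The toleration only permits scheduling onto tainted nodes.
                    "tolerations": [
                        {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
                    ],
                }
            }
        }
    }


patch = spot_scheduling_patch("eks.amazonaws.com/capacityType", "SPOT", "example.com/spot")
patch_json = json.dumps(patch, indent=2)
```

A nodeSelector forces the workload onto spot nodes, while a toleration alone merely allows it, so the two are often used together when spot nodes are tainted.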
Find Cost Optimization Opportunities:
- Sort node groups by totalCostHourly
- Check avgEfficiency for top expensive groups
- For low efficiency groups, consider:
  - Smaller instance types
  - Node pool consolidation
  - Autoscaler adjustments
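The same workflow can be approximated over exported node-group data. In this sketch the name field, the top_n cutoff, and the efficiency threshold of 50 are hypothetical choices, not product defaults.

```python
def optimization_candidates(groups, top_n=3, efficiency_threshold=50.0):
    """Among the top_n most expensive groups, return those with low efficiency."""
    most_expensive = sorted(groups, key=lambda g: g["totalCostHourly"], reverse=True)[:top_n]
    return [g["name"] for g in most_expensive if g["avgEfficiency"] < efficiency_threshold]


# Illustrative records, not real cluster output.
groups = [
    {"name": "general", "totalCostHourly": 0.768, "avgEfficiency": 72.0},
    {"name": "batch", "totalCostHourly": 3.072, "avgEfficiency": 35.0},
    {"name": "burst", "totalCostHourly": 0.1664, "avgEfficiency": 20.0},
]
candidates = optimization_candidates(groups, top_n=2)
```

Limiting the review to the most expensive groups keeps attention on where a resize would save the most; the inexpensive but inefficient "burst" group is skipped when top_n is 2.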
Monitor Node Health:
- Check health.critical count
- If it is greater than 0, click the group to view the problematic nodes
- Review node logs and events
- Consider replacing the node if issues persist
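For a scripted version of the first two steps, the sketch below picks out groups whose health.critical count is above zero and lists the worst first. The name field and the function name are hypothetical.

```python
def groups_with_critical_nodes(groups):
    """Return (group name, critical count) pairs, highest count first."""
    flagged = [g for g in groups if g["health"]["critical"] > 0]
    flagged.sort(key=lambda g: g["health"]["critical"], reverse=True)
    return [(g["name"], g["health"]["critical"]) for g in flagged]


# Illustrative records, not real cluster output.
groups = [
    {"name": "general", "health": {"healthy": 4, "warning": 0, "critical": 0}},
    {"name": "batch", "health": {"healthy": 5, "warning": 1, "critical": 2}},
    {"name": "burst", "health": {"healthy": 1, "warning": 0, "critical": 1}},
]
attention = groups_with_critical_nodes(groups)
```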