Node Monitoring

Overview

Node Monitoring provides infrastructure-level visibility through node groups (auto-scaling groups or managed node pools):

Node costs - Hourly and monthly costs per node and node group
Resource utilization - CPU, memory, and GPU usage percentages
Efficiency metrics - How well nodes pack pods and use resources
Instance type tracking - Which machine types are deployed
Health monitoring - Healthy, warning, and critical node counts
Spot instance recommendations - Workloads suitable for spot instances

Access: Select cluster (Clusters page or sidebar dropdown) → Nodes

Node Group Metrics

Summary Cards:

Total Nodes: Aggregate count across all filtered groups
Total Cost: Combined hourly rate for all nodes (multiply by 720, i.e. 30 days of 24 hours, for an approximate monthly cost)
Avg Efficiency: Average efficiency score across all node groups
Healthy Nodes: Count of nodes in ready state
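
The monthly figure implied by the Total Cost card can be reproduced with a small helper. This is a minimal sketch: the name monthly_cost is hypothetical, and the 720-hour month is the convention stated above, so 31-day months will come out slightly higher in practice.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the convention used by the Total Cost card


def monthly_cost(hourly_cost: float, hours_per_month: int = HOURS_PER_MONTH) -> float:
    """Convert an hourly rate into an approximate monthly figure."""
    return hourly_cost * hours_per_month


# Example: three m5.xlarge nodes at $0.192/hour each
total_hourly = 3 * 0.192
print(f"${monthly_cost(total_hourly):.2f} per month")  # prints "$414.72 per month"
```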

Per-Group Metrics:

nodeCount - Number of nodes in group
totalCostHourly - Combined hourly cost for all nodes in group
avgEfficiency - Average of CPU and memory efficiency
instanceType - EC2/GCE instance type (e.g., m5.2xlarge, t3.large)
region - Cloud provider region
zones[] - Availability zones
totalPods - Total pods scheduled across all nodes in group

Health Status:

health.healthy - Nodes in Ready state
health.warning - Nodes with degraded performance
health.critical - Nodes in NotReady or error state

Trend Data:

trends.nodeCountChange - Node count delta vs previous period
trends.costChangePercent - Cost change percentage
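
As an illustration of how the per-group fields relate to the summary cards, the sketch below rolls a list of node group records up into the four card values. It assumes each record is a plain dictionary keyed by the field names above, that efficiency is expressed on a 0-100 scale, and that Avg Efficiency is an unweighted mean across groups; the function name and the output keys are hypothetical, and the product may weight groups differently.

```python
from statistics import mean


def summarize(groups: list[dict]) -> dict:
    """Roll per-group metrics up into the four summary-card values."""
    return {
        "totalNodes": sum(g["nodeCount"] for g in groups),
        "totalCostHourly": sum(g["totalCostHourly"] for g in groups),
        # Assumption: a simple unweighted mean across groups.
        "avgEfficiency": mean(g["avgEfficiency"] for g in groups) if groups else 0.0,
        "healthyNodes": sum(g["health"]["healthy"] for g in groups),
    }


# Illustrative records, not real data
groups = [
    {"nodeCount": 4, "totalCostHourly": 0.768, "avgEfficiency": 62.0,
     "health": {"healthy": 4, "warning": 0, "critical": 0}},
    {"nodeCount": 2, "totalCostHourly": 0.768, "avgEfficiency": 38.0,
     "health": {"healthy": 1, "warning": 0, "critical": 1}},
]
print(summarize(groups))
```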

Individual Node Metrics

Identity Fields:

name - Node hostname (e.g., ip-10-0-142-18.us-east-1.compute.internal)
instance_type - Instance size (m5.2xlarge, t3.large, etc.)
arch - Architecture (amd64, arm64)
os - Operating system
age_days - Days since node creation

Resource Capacity:

total_cpu - Allocatable CPU cores
memory_total_bytes - Allocatable memory in bytes

Resource Usage:

cpu_usage_percent - Current CPU utilization (0-100%)
memory_usage_percent - Current memory utilization (0-100%)
gpu_usage_percent - GPU utilization if GPU-enabled (optional)
gpu_model - GPU hardware model (e.g., NVIDIA Tesla T4)

Cost Breakdown:

total_cost - Total hourly cost for this node
cpu_cost - CPU portion of hourly cost
memory_cost - Memory portion of hourly cost
gpu_cost - GPU cost if applicable
cpu_cost_per_core - Unit cost per CPU core/hour
memory_cost_per_gb - Unit cost per GB memory/hour
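
The unit-cost fields can be sanity-checked against the other cost and capacity fields. The following sketch assumes that each unit cost is simply the component cost divided by allocatable capacity and that "GB" means 2**30 bytes; neither assumption is confirmed by this page, and unit_costs is a hypothetical helper name.

```python
BYTES_PER_GIB = 1024 ** 3  # assumption: "GB" is treated as 2**30 bytes


def unit_costs(node: dict) -> dict:
    """Derive per-core and per-GB hourly rates from a node's cost breakdown."""
    memory_gb = node["memory_total_bytes"] / BYTES_PER_GIB
    return {
        "cpu_cost_per_core": node["cpu_cost"] / node["total_cpu"],
        "memory_cost_per_gb": node["memory_cost"] / memory_gb,
    }


# Illustrative cost split, not real pricing data
node = {
    "total_cpu": 8,
    "memory_total_bytes": 32 * BYTES_PER_GIB,
    "cpu_cost": 0.256,
    "memory_cost": 0.128,
}
print(unit_costs(node))
```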

Efficiency Scores:

cpu_efficiency - See Resource Efficiency
memory_efficiency - See Resource Efficiency
workload_efficiency - Combined CPU + memory efficiency

Location:

region - Cloud provider region (us-east-1, us-west-2, etc.)
zone - Availability zone (us-east-1a, us-east-1b, etc.)

Filtering & Sorting

Available Filters:

Search:

Filter by node group name or instance type

Node Count Range:

nodeCountMin - Minimum nodes per group
nodeCountMax - Maximum nodes per group

Cost Range:

costMin - Minimum hourly cost per group
costMax - Maximum hourly cost per group

Efficiency Levels:

high - High efficiency
medium - Medium efficiency
low - Low efficiency
all - No efficiency filter

GPU Enabled:

yes - Only GPU-enabled node groups
no - Only non-GPU node groups
all - All node groups

Location:

region - Filter by specific cloud region
zone - Filter by specific availability zone
instanceType - Filter by instance type

Timeframe Selection:

Choose the historical period for which metrics are displayed
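
The filters above combine as a logical AND. A rough client-side equivalent is sketched below, assuming node groups are dictionaries keyed by the per-group field names plus a hypothetical name key. Efficiency levels are left out because the thresholds behind high, medium, and low are not documented here.

```python
def filter_groups(groups: list[dict], filters: dict) -> list[dict]:
    """Return the node groups that satisfy every filter that is set."""
    search = filters.get("search", "").lower()
    selected = []
    for g in groups:
        # Search matches either the group name or the instance type.
        if search and not (search in g["name"].lower()
                           or search in g["instanceType"].lower()):
            continue
        if not (filters.get("nodeCountMin", 0) <= g["nodeCount"]
                <= filters.get("nodeCountMax", float("inf"))):
            continue
        if not (filters.get("costMin", 0.0) <= g["totalCostHourly"]
                <= filters.get("costMax", float("inf"))):
            continue
        if filters.get("region") and g["region"] != filters["region"]:
            continue
        if filters.get("zone") and filters["zone"] not in g["zones"]:
            continue
        if filters.get("instanceType") and g["instanceType"] != filters["instanceType"]:
            continue
        selected.append(g)
    return selected


# Illustrative records, not real data
groups = [
    {"name": "web", "instanceType": "m5.xlarge", "nodeCount": 4,
     "totalCostHourly": 0.768, "region": "us-east-1", "zones": ["us-east-1a"]},
    {"name": "batch", "instanceType": "t3.large", "nodeCount": 10,
     "totalCostHourly": 0.832, "region": "us-west-2", "zones": ["us-west-2b"]},
]
print([g["name"] for g in filter_groups(groups, {"nodeCountMin": 5})])  # prints "['batch']"
```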

Spot Instance Recommendations

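Purpose: Identify workloads that can safely run on spot/preemptible instances, which typically cost 70-80% less than on-demand instances. Actual savings vary by instance type, region, and current spot pricing.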

Recommendation Fields:

resource_name - Name of the workload (e.g., a Deployment or StatefulSet)
namespace - Kubernetes namespace
resource_type - Deployment, StatefulSet, DaemonSet
priority - High, Medium, Low

Cost Analysis:

current_hourly_cost - Current on-demand cost
target_hourly_cost - Projected spot instance cost
estimated_savings - Monthly savings estimate
savings_percentage - Percentage cost reduction
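
The relationship between the four cost fields can be sketched as follows. This assumes estimated_savings is the hourly difference scaled to a 720-hour month and savings_percentage is relative to the current on-demand cost; the function name spot_savings is hypothetical and the service may compute these values differently.

```python
def spot_savings(current_hourly_cost: float, target_hourly_cost: float,
                 hours_per_month: int = 720) -> dict:
    """Estimate monthly savings and percentage reduction for a spot migration."""
    hourly_delta = current_hourly_cost - target_hourly_cost
    # Guard against division by zero for workloads with no recorded cost.
    percentage = 100 * hourly_delta / current_hourly_cost if current_hourly_cost else 0.0
    return {
        "estimated_savings": hourly_delta * hours_per_month,
        "savings_percentage": percentage,
    }


# Illustrative figures: $0.384/hour on demand versus $0.096/hour on spot
result = spot_savings(current_hourly_cost=0.384, target_hourly_cost=0.096)
print(f"{result['savings_percentage']:.1f}% -> ${result['estimated_savings']:.2f}/month")
# prints "75.0% -> $207.36/month"
```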

Migration Assessment:

is_migratable - Boolean: safe to migrate
controller_type - Kind of controller managing the workload (e.g., Deployment, StatefulSet)
current_replicas - Number of replicas
minimum_recommended_replicas - Minimum for HA
has_pdb - PodDisruptionBudget configured

Compatibility Checks:

controller_type_ok - Controller supports spot
storage_compatible - Storage type works with spot
pdb_compatible - PDB configuration allows spot
rolling_update_ok - Rolling update configured
volume_type_ok - Volume type supports spot
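
When reviewing a recommendation, it can help to list which checks failed rather than reading each flag by hand. The sketch below assumes a recommendation is a dictionary of the boolean fields above and treats a missing field as a failure; whether is_migratable is derived from exactly these five checks is not stated on this page.

```python
COMPATIBILITY_CHECKS = (
    "controller_type_ok",
    "storage_compatible",
    "pdb_compatible",
    "rolling_update_ok",
    "volume_type_ok",
)


def failed_checks(recommendation: dict) -> list[str]:
    """Return the names of compatibility checks that are false or missing."""
    return [name for name in COMPATIBILITY_CHECKS
            if not recommendation.get(name, False)]


# Illustrative recommendation with one failing check
rec = {
    "controller_type_ok": True,
    "storage_compatible": True,
    "pdb_compatible": False,
    "rolling_update_ok": True,
    "volume_type_ok": True,
}
print(failed_checks(rec))  # prints "['pdb_compatible']"
```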

Storage Configuration:

local_storage_enabled - Uses local storage
volume_type - PVC storage class
storage_migratable - Storage survives node termination

PDB Configuration:

min_available - Minimum pods available
threshold - Threshold percentage
pdb_migratable - PDB allows spot migration

Instance Type Specifications

AWS Examples (2025 Pricing):

t3.large:

CPU: 2 cores
Memory: 8 GB
Hourly cost: $0.0832
Use case: Burstable, low-traffic workloads

m5.large:

CPU: 2 cores
Memory: 8 GB
Hourly cost: $0.096
Use case: General purpose

m5.xlarge:

CPU: 4 cores
Memory: 16 GB
Hourly cost: $0.192
Use case: Moderate workloads

m5.2xlarge:

CPU: 8 cores
Memory: 32 GB
Hourly cost: $0.384
Use case: High-performance workloads

m5n.large:

CPU: 2 cores
Memory: 8 GB
Hourly cost: $0.119
Use case: Network-optimized
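
To compare instance types on a like-for-like basis, the hourly prices can be normalized per core. The sketch below uses a few of the figures listed above; the dictionary layout and the cost_per_core helper are hypothetical, and prices should be re-checked against the provider's current price list before making sizing decisions.

```python
# Figures copied from the list above (hourly on-demand prices in USD).
INSTANCE_TYPES = {
    "t3.large": {"cpu": 2, "memory_gb": 8, "hourly": 0.0832},
    "m5.xlarge": {"cpu": 4, "memory_gb": 16, "hourly": 0.192},
    "m5.2xlarge": {"cpu": 8, "memory_gb": 32, "hourly": 0.384},
    "m5n.large": {"cpu": 2, "memory_gb": 8, "hourly": 0.119},
}


def cost_per_core(spec: dict) -> float:
    """Hourly price divided by the number of CPU cores."""
    return spec["hourly"] / spec["cpu"]


# Cheapest per core first, with the 720-hour monthly equivalent.
for name, spec in sorted(INSTANCE_TYPES.items(), key=lambda kv: cost_per_core(kv[1])):
    print(f"{name:<12} ${cost_per_core(spec):.4f}/core-hour  ${spec['hourly'] * 720:.2f}/month")
```

Within the m5 family listed here the per-core price is constant, so moving to a smaller size changes the granularity of scaling rather than the unit price.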

Node Health Status

Health States:

Healthy:

Node in Ready state
All health checks passing
No resource pressure

Warning:

Degraded performance
High resource utilization
Minor health check failures

Critical:

NotReady state
Unreachable
Out of memory/disk
System component failures
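
One way to picture the three health states is as a function of Kubernetes node conditions and utilization. The sketch below is an assumption, not the product's actual rule set: it uses the standard Ready, MemoryPressure, and DiskPressure condition types, and the 85% utilization threshold and the classify_node name are hypothetical.

```python
def classify_node(conditions: dict[str, str],
                  cpu_usage_percent: float,
                  memory_usage_percent: float,
                  high_utilization_threshold: float = 85.0) -> str:
    """Map node conditions and utilization to healthy, warning, or critical."""
    # A Ready status of "False" or "Unknown" (unreachable) counts as critical.
    if conditions.get("Ready") != "True":
        return "critical"
    # Assumption: memory or disk pressure corresponds to "Out of memory/disk".
    if conditions.get("MemoryPressure") == "True" or conditions.get("DiskPressure") == "True":
        return "critical"
    # Hypothetical threshold for "High resource utilization".
    if max(cpu_usage_percent, memory_usage_percent) >= high_utilization_threshold:
        return "warning"
    return "healthy"


print(classify_node({"Ready": "True"}, cpu_usage_percent=92.0, memory_usage_percent=40.0))
# prints "warning"
```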

Common Workflows

Review Node Group Efficiency:

Navigate to cluster → Nodes
Sort by avgEfficiency (lowest first)
Identify groups with low efficiency
Click group to view individual nodes
Review pod allocation and resource usage

Identify Spot Instance Candidates:

Navigate to cluster → Spot Recommendations
Filter by priority "High"
Check is_migratable = true
Review compatibility checks
Estimate monthly savings
Implement via node selector or taints/tolerations
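
The last step can be sketched as a pod template fragment that pins a workload to spot nodes and tolerates their taint. The label and taint keys below are placeholders: spot node labels and taints differ by cloud provider and by how the node groups were created, so substitute the ones used in your cluster. The helper name spot_scheduling_patch is hypothetical.

```python
import json


def spot_scheduling_patch(label_key: str, label_value: str,
                          taint_key: str, taint_value: str) -> dict:
    """Build a pod template fragment with a nodeSelector and a matching toleration."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # Schedule only onto nodes carrying the spot label.
                    "nodeSelector": {label_key: label_value},
                    # Allow scheduling onto nodes tainted as spot capacity.
                    "tolerations": [{
                        "key": taint_key,
                        "operator": "Equal",
                        "value": taint_value,
                        "effect": "NoSchedule",
                    }],
                }
            }
        }
    }


# Placeholder keys and values; replace with the labels and taints in your cluster.
patch = spot_scheduling_patch("example.com/capacity-type", "spot",
                              "example.com/capacity-type", "spot")
print(json.dumps(patch, indent=2))
```

The output mirrors the spec.template.spec fields of a Deployment or StatefulSet manifest, so it can be merged into an existing manifest by hand or with tooling of your choice.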

Find Cost Optimization Opportunities:

Sort node groups by totalCostHourly
Check avgEfficiency for the most expensive groups
For low-efficiency groups, consider:
- Smaller instance types
- Node pool consolidation
- Autoscaler adjustments
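
The same triage can be done programmatically on exported node group data. This is a minimal sketch that assumes records keyed by the per-group field names plus a hypothetical name key; the top_n cutoff and the 50% efficiency threshold are illustrative choices, not documented defaults.

```python
def optimization_candidates(groups: list[dict], top_n: int = 5,
                            efficiency_threshold: float = 50.0) -> list[dict]:
    """Return the most expensive groups whose efficiency is below the threshold."""
    most_expensive = sorted(groups, key=lambda g: g["totalCostHourly"], reverse=True)[:top_n]
    return [g for g in most_expensive if g["avgEfficiency"] < efficiency_threshold]


# Illustrative records, not real data
groups = [
    {"name": "web", "totalCostHourly": 3.072, "avgEfficiency": 71.0},
    {"name": "batch", "totalCostHourly": 1.536, "avgEfficiency": 34.0},
    {"name": "tools", "totalCostHourly": 0.166, "avgEfficiency": 12.0},
]
for g in optimization_candidates(groups, top_n=2):
    print(g["name"], g["totalCostHourly"], g["avgEfficiency"])  # prints "batch 1.536 34.0"
```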

Monitor Node Health:

Check health.critical count
If greater than 0, click the group to view the affected nodes
Review node logs and events
Consider replacing the node if issues persist