Node Monitoring

Overview

Node Monitoring provides infrastructure-level visibility through node groups (auto-scaling groups or managed node pools):

  • Node costs - Hourly and monthly costs per node and node group
  • Resource utilization - CPU, memory, and GPU usage percentages
  • Efficiency metrics - How well nodes pack pods and use resources
  • Instance type tracking - Which machine types are deployed
  • Health monitoring - Healthy, warning, and critical node counts
  • Spot instance recommendations - Workloads suitable for spot instances

Access: Select cluster (Clusters page or sidebar dropdown) → Nodes


Node Group Metrics

Summary Cards:

  • Total Nodes: Aggregate count across all filtered groups
  • Total Cost: Hourly rate for all nodes (multiply by 720, i.e. 24 hours × 30 days, for an approximate monthly figure)
  • Avg Efficiency: Average efficiency score across all node groups
  • Healthy Nodes: Count of nodes in ready state

Per-Group Metrics:

  • nodeCount - Number of nodes in group
  • totalCostHourly - Combined hourly cost for all nodes in group
  • avgEfficiency - Average of CPU and memory efficiency
  • instanceType - EC2/GCE instance type (e.g., m5.2xlarge, t3.large)
  • region - Cloud provider region
  • zones[] - Availability zones
  • totalPods - Total pods scheduled across all nodes in group

Health Status:

  • health.healthy - Nodes in Ready state
  • health.warning - Nodes with degraded performance
  • health.critical - Nodes in NotReady or error state

Trend Data:

  • trends.nodeCountChange - Node count delta vs previous period
  • trends.costChangePercent - Cost change percentage
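
As a rough illustration of how the summary cards relate to the per-group metrics, the sketch below aggregates a list of node group records. The record shape (plain dictionaries), the unweighted mean used for Avg Efficiency, and the percent-change formula are assumptions made for this example, not the product's documented implementation.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days, the multiplier suggested for Total Cost


def summarize(groups):
    """Aggregate per-group records into the four summary-card values."""
    total_hourly = sum(g["totalCostHourly"] for g in groups)
    # Assumption: Avg Efficiency is an unweighted mean across groups.
    avg_eff = sum(g["avgEfficiency"] for g in groups) / len(groups) if groups else 0.0
    return {
        "totalNodes": sum(g["nodeCount"] for g in groups),
        "totalCostHourly": total_hourly,
        "totalCostMonthly": total_hourly * HOURS_PER_MONTH,
        "avgEfficiency": avg_eff,
        "healthyNodes": sum(g["health"]["healthy"] for g in groups),
    }


def cost_change_percent(previous_hourly, current_hourly):
    """Assumed formula for trends.costChangePercent: relative change in percent."""
    if previous_hourly == 0:
        return 0.0
    return (current_hourly - previous_hourly) / previous_hourly * 100


# Illustrative values only.
groups = [
    {"nodeCount": 3, "totalCostHourly": 1.152, "avgEfficiency": 60.0,
     "health": {"healthy": 3, "warning": 0, "critical": 0}},
    {"nodeCount": 2, "totalCostHourly": 0.1664, "avgEfficiency": 40.0,
     "health": {"healthy": 1, "warning": 0, "critical": 1}},
]
summary = summarize(groups)
```

If the product weights efficiency by node count or cost, the mean above would need to change accordingly.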

Individual Node Metrics

Identity Fields:

  • name - Node hostname (e.g., ip-10-0-142-18.us-east-1.compute.internal)
  • instance_type - Instance size (m5.2xlarge, t3.large, etc.)
  • arch - Architecture (amd64, arm64)
  • os - Operating system
  • age_days - Days since node creation

Resource Capacity:

  • total_cpu - Allocatable CPU cores
  • memory_total_bytes - Allocatable memory in bytes

Resource Usage:

  • cpu_usage_percent - Current CPU utilization (0-100%)
  • memory_usage_percent - Current memory utilization (0-100%)
  • gpu_usage_percent - GPU utilization if GPU-enabled (optional)
  • gpu_model - GPU hardware model (e.g., NVIDIA Tesla T4)

Cost Breakdown:

  • total_cost - Total hourly cost for this node
  • cpu_cost - CPU portion of hourly cost
  • memory_cost - Memory portion of hourly cost
  • gpu_cost - GPU cost if applicable
  • cpu_cost_per_core - Unit cost per CPU core/hour
  • memory_cost_per_gb - Unit cost per GB memory/hour
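
The field list does not state how the cost fields relate to each other. One plausible reading, sketched below, is that the unit costs are the component costs divided by allocatable capacity and that the components sum to total_cost. Both relationships, the GiB interpretation of "GB", and the sample values are assumptions for illustration.

```python
GIB = 1024 ** 3  # assumption: "GB" in memory_cost_per_gb means GiB


def unit_costs(node):
    """Derive per-core and per-GB hourly unit costs from a node record."""
    cpu_per_core = node["cpu_cost"] / node["total_cpu"]
    mem_per_gb = node["memory_cost"] / (node["memory_total_bytes"] / GIB)
    return cpu_per_core, mem_per_gb


# Hypothetical node record; the CPU/memory split is invented for the example.
node = {
    "total_cpu": 8,
    "memory_total_bytes": 32 * GIB,
    "cpu_cost": 0.24,
    "memory_cost": 0.144,
    "gpu_cost": 0.0,
    "total_cost": 0.384,
}
cpu_per_core, mem_per_gb = unit_costs(node)
components = node["cpu_cost"] + node["memory_cost"] + node["gpu_cost"]
```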

Efficiency Scores:

Location:

  • region - Cloud provider region (us-east-1, us-west-2, etc.)
  • zone - Availability zone (us-east-1a, us-east-1b, etc.)

Filtering & Sorting

Available Filters:

Search:

  • Filter by node group name or instance type

Node Count Range:

  • nodeCountMin - Minimum nodes per group
  • nodeCountMax - Maximum nodes per group

Cost Range:

  • costMin - Minimum hourly cost per group
  • costMax - Maximum hourly cost per group

Efficiency Levels:

  • high - High efficiency
  • medium - Medium efficiency
  • low - Low efficiency
  • all - No efficiency filter

GPU Enabled:

  • yes - Only GPU-enabled node groups
  • no - Only non-GPU node groups
  • all - All node groups

Location:

  • region - Filter by specific cloud region
  • zone - Filter by specific availability zone
  • instanceType - Filter by instance type

Timeframe Selection:

  • Select the historical period that the displayed metrics cover
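
To make the filter semantics concrete, here is a minimal sketch of how these filters might combine, assuming every active filter must match (logical AND). The group fields `name` and `gpuEnabled`, and the filter keys `search` and `gpuEnabled`, are hypothetical names chosen for the example. Efficiency levels are left out because the thresholds behind high, medium, and low are not given here.

```python
def matches(group, filters):
    """Return True if a node group record passes every filter that is set."""
    search = filters.get("search", "").lower()
    if search and search not in group["name"].lower() \
            and search not in group["instanceType"].lower():
        return False
    if group["nodeCount"] < filters.get("nodeCountMin", 0):
        return False
    if group["nodeCount"] > filters.get("nodeCountMax", float("inf")):
        return False
    if group["totalCostHourly"] < filters.get("costMin", 0.0):
        return False
    if group["totalCostHourly"] > filters.get("costMax", float("inf")):
        return False
    gpu = filters.get("gpuEnabled", "all")  # "yes", "no", or "all"
    if gpu != "all" and group["gpuEnabled"] != (gpu == "yes"):
        return False
    region = filters.get("region")
    if region and group["region"] != region:
        return False
    return True


# Illustrative records only.
groups = [
    {"name": "general-pool", "instanceType": "m5.2xlarge", "nodeCount": 3,
     "totalCostHourly": 1.152, "gpuEnabled": False, "region": "us-east-1"},
    {"name": "gpu-pool", "instanceType": "g4dn.xlarge", "nodeCount": 2,
     "totalCostHourly": 1.052, "gpuEnabled": True, "region": "us-west-2"},
]
filters = {"search": "m5", "costMax": 2.0, "gpuEnabled": "no"}
selected = [g["name"] for g in groups if matches(g, filters)]
```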

Spot Instance Recommendations

Purpose: Identify workloads that can safely run on spot/preemptible instances, for potential cost savings of 70-80% compared with on-demand pricing.

Recommendation Fields:

  • resource_name - Workload name (e.g., the name of a Deployment or StatefulSet)
  • namespace - Kubernetes namespace
  • resource_type - Deployment, StatefulSet, DaemonSet
  • priority - High, Medium, Low

Cost Analysis:

  • current_hourly_cost - Current on-demand cost
  • target_hourly_cost - Projected spot instance cost
  • estimated_savings - Monthly savings estimate
  • savings_percentage - Percentage cost reduction

Migration Assessment:

  • is_migratable - Boolean: safe to migrate
  • controller_type - Kind of controller that manages the workload (e.g., Deployment, StatefulSet)
  • current_replicas - Number of replicas
  • minimum_recommended_replicas - Minimum for HA
  • has_pdb - PodDisruptionBudget configured

Compatibility Checks:

  • controller_type_ok - Controller supports spot
  • storage_compatible - Storage type works with spot
  • pdb_compatible - PDB configuration allows spot
  • rolling_update_ok - Rolling update configured
  • volume_type_ok - Volume type supports spot

Storage Configuration:

  • local_storage_enabled - Uses local storage
  • volume_type - PVC storage class
  • storage_migratable - Storage survives node termination

PDB Configuration:

  • min_available - Minimum pods available
  • threshold - Threshold percentage
  • pdb_migratable - PDB allows spot migration
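
The sketch below shows one way the cost and compatibility fields could fit together. It assumes that a workload is migratable only when all five compatibility checks pass, that estimated_savings is the hourly difference over a 720-hour month, and that savings_percentage is relative to the current cost. None of these rules is confirmed here; treat the function as an illustration of the field relationships, not the product's logic.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days

CHECKS = (
    "controller_type_ok",
    "storage_compatible",
    "pdb_compatible",
    "rolling_update_ok",
    "volume_type_ok",
)


def assess(rec):
    """Derive savings and migratability from a recommendation record."""
    hourly_delta = rec["current_hourly_cost"] - rec["target_hourly_cost"]
    return {
        # Assumption: every compatibility check must pass.
        "is_migratable": all(rec[c] for c in CHECKS),
        "failed_checks": [c for c in CHECKS if not rec[c]],
        "estimated_savings": hourly_delta * HOURS_PER_MONTH,
        "savings_percentage": hourly_delta / rec["current_hourly_cost"] * 100,
    }


# Illustrative record: on-demand $0.40/h, projected spot $0.10/h.
rec = {
    "current_hourly_cost": 0.40,
    "target_hourly_cost": 0.10,
    "controller_type_ok": True,
    "storage_compatible": True,
    "pdb_compatible": True,
    "rolling_update_ok": True,
    "volume_type_ok": True,
}
result = assess(rec)
blocked = assess({**rec, "storage_compatible": False})
```

Returning the list of failed checks alongside the boolean makes it easier to see what must change before a workload becomes a candidate.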

Instance Type Specifications

AWS Examples (2025 Pricing):

t3.large:

  • CPU: 2 cores
  • Memory: 8 GB
  • Hourly cost: $0.0832
  • Use case: Burstable, low-traffic workloads

m5.large:

  • CPU: 2 cores
  • Memory: 8 GB
  • Hourly cost: $0.096
  • Use case: General purpose

m5.xlarge:

  • CPU: 4 cores
  • Memory: 16 GB
  • Hourly cost: $0.192
  • Use case: Moderate workloads

m5.2xlarge:

  • CPU: 8 cores
  • Memory: 32 GB
  • Hourly cost: $0.384
  • Use case: High-performance workloads

m5n.large:

  • CPU: 2 cores
  • Memory: 8 GB
  • Hourly cost: $0.119
  • Use case: Network-optimized
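
As a small worked example, the snippet below turns a few of the hourly prices listed above into monthly figures and a per-core rate. The 720-hour month is the same approximation used for the Total Cost card. The lookup table and helper names are made up for this sketch.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days

# Specs and hourly prices copied from the list above.
INSTANCES = {
    "t3.large": {"cpu": 2, "memory_gb": 8, "hourly": 0.0832},
    "m5.xlarge": {"cpu": 4, "memory_gb": 16, "hourly": 0.192},
    "m5.2xlarge": {"cpu": 8, "memory_gb": 32, "hourly": 0.384},
}


def monthly_cost(instance_type, count=1):
    """Approximate monthly cost for `count` nodes of one instance type."""
    return INSTANCES[instance_type]["hourly"] * HOURS_PER_MONTH * count


def hourly_per_core(instance_type):
    """Hourly price divided by CPU count, for comparing instance sizes."""
    spec = INSTANCES[instance_type]
    return spec["hourly"] / spec["cpu"]


one_large_node = monthly_cost("m5.2xlarge")
two_medium_nodes = monthly_cost("m5.xlarge", count=2)
```

With these prices, two m5.xlarge nodes cost the same as one m5.2xlarge, so the choice between them comes down to pod packing and failure-domain considerations, not price.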

Node Health Status

Health States:

Healthy:

  • Node in Ready state
  • All health checks passing
  • No resource pressure

Warning:

  • Degraded performance
  • High resource utilization
  • Minor health check failures

Critical:

  • NotReady state
  • Unreachable
  • Out of memory/disk
  • System component failures
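
The exact rules behind each health state are not spelled out here. As a sketch, the function below maps standard Kubernetes node conditions (Ready, MemoryPressure, DiskPressure) and the usage percentages onto the three states. The mapping itself and the 85% warning threshold are assumptions chosen for illustration.

```python
WARNING_USAGE_PERCENT = 85.0  # hypothetical threshold for "high resource utilization"


def classify(conditions, cpu_usage_percent, memory_usage_percent):
    """Map node conditions and usage onto healthy / warning / critical.

    `conditions` maps a Kubernetes node condition type to its status
    string ("True", "False", or "Unknown").
    """
    # A missing or non-True Ready condition is treated as NotReady/unreachable.
    if conditions.get("Ready") != "True":
        return "critical"
    # Assumption: memory or disk pressure counts as "out of memory/disk".
    if conditions.get("MemoryPressure") == "True" or conditions.get("DiskPressure") == "True":
        return "critical"
    if max(cpu_usage_percent, memory_usage_percent) >= WARNING_USAGE_PERCENT:
        return "warning"
    return "healthy"


state = classify({"Ready": "True", "MemoryPressure": "False"}, 92.0, 40.0)
```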

Common Workflows

Review Node Group Efficiency:

  1. Navigate to Dashboard → Nodes Tab
  2. Sort by avgEfficiency (lowest first)
  3. Identify groups with low efficiency
  4. Click group to view individual nodes
  5. Review pod allocation and resource usage

Identify Spot Instance Candidates:

  1. Navigate to cluster → Spot Recommendations
  2. Filter by priority "High"
  3. Check is_migratable = true
  4. Review compatibility checks
  5. Estimate monthly savings
  6. Implement via node selector or taints/tolerations
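
The steps above can be sketched in code, assuming the recommendations are available as a list of dictionaries with the fields described earlier. The label and taint key `example.com/lifecycle` is hypothetical; substitute whatever label and taint your spot node groups actually carry. The patch uses the standard Kubernetes pod spec fields `nodeSelector` and `tolerations`.

```python
def spot_candidates(recs):
    """Keep high-priority, migratable workloads, largest savings first."""
    picked = [r for r in recs if r["priority"] == "High" and r["is_migratable"]]
    return sorted(picked, key=lambda r: r["estimated_savings"], reverse=True)


SPOT_KEY = "example.com/lifecycle"  # hypothetical node label and taint key


def spot_scheduling_patch():
    """Build a workload patch that steers pods onto tainted spot nodes."""
    return {"spec": {"template": {"spec": {
        "nodeSelector": {SPOT_KEY: "spot"},
        "tolerations": [{
            "key": SPOT_KEY,
            "operator": "Equal",
            "value": "spot",
            "effect": "NoSchedule",
        }],
    }}}}


# Illustrative recommendations only.
recs = [
    {"resource_name": "api", "priority": "High", "is_migratable": True, "estimated_savings": 120.0},
    {"resource_name": "db", "priority": "High", "is_migratable": False, "estimated_savings": 300.0},
    {"resource_name": "worker", "priority": "High", "is_migratable": True, "estimated_savings": 210.0},
    {"resource_name": "cron", "priority": "Low", "is_migratable": True, "estimated_savings": 15.0},
]
candidates = spot_candidates(recs)
total_monthly_savings = sum(r["estimated_savings"] for r in candidates)
```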

Find Cost Optimization Opportunities:

  1. Sort node groups by totalCostHourly
  2. Check avgEfficiency for top expensive groups
  3. For low efficiency groups, consider:
    • Smaller instance types
    • Node pool consolidation
    • Autoscaler adjustments
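
A minimal sketch of this workflow, assuming node group records are dictionaries with the per-group fields described earlier. The `name` field, the `top_n` cutoff, and the efficiency threshold of 40 are hypothetical values picked for the example.

```python
LOW_EFFICIENCY = 40.0  # hypothetical cutoff for "low efficiency"


def optimization_targets(groups, top_n=3):
    """Return names of the most expensive groups that also score low on efficiency."""
    expensive = sorted(groups, key=lambda g: g["totalCostHourly"], reverse=True)[:top_n]
    return [g["name"] for g in expensive if g["avgEfficiency"] < LOW_EFFICIENCY]


# Illustrative records only.
groups = [
    {"name": "batch", "totalCostHourly": 3.84, "avgEfficiency": 28.0},
    {"name": "web", "totalCostHourly": 1.92, "avgEfficiency": 71.0},
    {"name": "tools", "totalCostHourly": 0.17, "avgEfficiency": 15.0},
    {"name": "ml", "totalCostHourly": 2.50, "avgEfficiency": 35.0},
]
targets = optimization_targets(groups)
```

Here "tools" has the lowest efficiency but is excluded because it falls outside the three most expensive groups, which keeps attention on the groups where changes save the most.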

Monitor Node Health:

  1. Check health.critical count
  2. If the count is greater than 0, click the group to view the problematic nodes
  3. Review node logs and events
  4. Consider replacing the node if the issues persist