Node Monitoring

Overview

Node Monitoring provides infrastructure-level visibility through node groups (auto-scaling groups or managed node pools):

  • Node costs - Hourly and monthly costs per node and node group
  • Resource utilization - CPU, memory, and GPU usage percentages
  • Efficiency metrics - How well nodes pack pods and use resources
  • Instance type tracking - Which machine types are deployed
  • Health monitoring - Healthy, warning, and critical node counts
  • Spot instance recommendations - Workloads suitable for spot instances

Access: Select cluster (Clusters page or sidebar dropdown) → Nodes


Node Group Metrics

Summary Cards:

  • Total Nodes: Aggregate count across all filtered groups
  • Total Cost: Hourly rate for all nodes (multiply by 720 for monthly)
  • Avg Efficiency: Average efficiency score across all node groups
  • Healthy Nodes: Count of nodes in ready state
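
The monthly conversion mentioned for Total Cost is plain arithmetic. A minimal sketch, assuming a 30-day month (24 x 30 = 720 hours); the function name monthly_cost is illustrative and not part of the product:

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days, the multiplier noted for Total Cost


def monthly_cost(hourly_cost: float) -> float:
    """Project an approximate monthly cost from an hourly rate."""
    return hourly_cost * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical example: three nodes at $0.192/hour each.
    print(f"${monthly_cost(3 * 0.192):.2f} per month")
```

Months with 31 days run 744 hours, so treat the result as an estimate rather than a billing figure.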

Per-Group Metrics:

  • nodeCount - Number of nodes in group
  • totalCostHourly - Combined hourly cost for all nodes in group
  • avgEfficiency - Average of CPU and memory efficiency
  • instanceType - EC2/GCE instance type (e.g., m5.2xlarge, t3.large)
  • region - Cloud provider region
  • zones[] - Availability zones
  • totalPods - Total pods scheduled across all nodes in group
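
The group-level numbers above can be read as roll-ups of per-node values. The sketch below shows one way that roll-up might look. It assumes each node record carries total_cost plus two hypothetical inputs, cpu_efficiency and memory_efficiency (0-100); those two names are illustrative and are not documented fields:

```python
from typing import Dict, List


def summarize_group(nodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Roll hypothetical per-node records up into group-level metrics."""
    node_count = len(nodes)
    if node_count == 0:
        return {"nodeCount": 0, "totalCostHourly": 0.0, "avgEfficiency": 0.0}

    # Combined hourly cost for all nodes in the group.
    total_cost_hourly = sum(n["total_cost"] for n in nodes)

    # Assumption: efficiency inputs are per-node percentages (0-100).
    cpu_eff = sum(n["cpu_efficiency"] for n in nodes) / node_count
    mem_eff = sum(n["memory_efficiency"] for n in nodes) / node_count

    return {
        "nodeCount": node_count,
        "totalCostHourly": total_cost_hourly,
        # Average of CPU and memory efficiency.
        "avgEfficiency": (cpu_eff + mem_eff) / 2,
    }
```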

Health Status:

  • health.healthy - Nodes in Ready state
  • health.warning - Nodes with degraded performance
  • health.critical - Nodes in NotReady or error state

Trend Data:

  • trends.nodeCountChange - Node count delta vs previous period
  • trends.costChangePercent - Cost change percentage

Individual Node Metrics

Identity Fields:

  • name - Node hostname (e.g., ip-10-0-142-18.us-east-1.compute.internal)
  • instance_type - Instance size (m5.2xlarge, t3.large, etc.)
  • arch - Architecture (amd64, arm64)
  • os - Operating system
  • age_days - Days since node creation

Resource Capacity:

  • total_cpu - Allocatable CPU cores
  • memory_total_bytes - Allocatable memory in bytes

Resource Usage:

  • cpu_usage_percent - Current CPU utilization (0-100%)
  • memory_usage_percent - Current memory utilization (0-100%)
  • gpu_usage_percent - GPU utilization if GPU-enabled (optional)
  • gpu_model - GPU hardware model (e.g., NVIDIA Tesla T4)

Cost Breakdown:

  • total_cost - Total hourly cost for this node
  • cpu_cost - CPU portion of hourly cost
  • memory_cost - Memory portion of hourly cost
  • gpu_cost - GPU cost if applicable
  • cpu_cost_per_core - Unit cost per CPU core/hour
  • memory_cost_per_gb - Unit cost per GB memory/hour
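
The unit-cost fields can be understood as the cost portions divided by the node's capacity. The source does not state the exact formula, so the following is only a sketch under two assumptions: unit costs are derived from total_cpu and memory_total_bytes, and "GB" means 1024^3 bytes. The function name unit_costs is hypothetical:

```python
BYTES_PER_GB = 1024 ** 3  # assumption: "GB" is treated as GiB


def unit_costs(cpu_cost: float, memory_cost: float,
               total_cpu: float, memory_total_bytes: int) -> dict:
    """Derive per-core and per-GB hourly costs from the cost portions."""
    memory_gb = memory_total_bytes / BYTES_PER_GB
    return {
        # Assumed relation: CPU portion spread evenly over allocatable cores.
        "cpu_cost_per_core": cpu_cost / total_cpu,
        # Assumed relation: memory portion spread evenly over allocatable GB.
        "memory_cost_per_gb": memory_cost / memory_gb,
    }
```

For a hypothetical 8-core, 32 GB node with a CPU portion of $0.20/hour, this yields $0.025 per core per hour.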

Efficiency Scores:

Location:

  • region - Cloud provider region (us-east-1, us-west-2, etc.)
  • zone - Availability zone (us-east-1a, us-east-1b, etc.)

Filtering & Sorting

Available Filters:

Search:

  • Filter by node group name or instance type

Node Count Range:

  • nodeCountMin - Minimum nodes per group
  • nodeCountMax - Maximum nodes per group

Cost Range:

  • costMin - Minimum hourly cost per group
  • costMax - Maximum hourly cost per group

Efficiency Levels:

  • high - High efficiency
  • medium - Medium efficiency
  • low - Low efficiency
  • all - No efficiency filter

GPU Enabled:

  • yes - Only GPU-enabled node groups
  • no - Only non-GPU node groups
  • all - All node groups

Location:

  • region - Filter by specific cloud region
  • zone - Filter by specific availability zone
  • instanceType - Filter by instance type

Timeframe Selection:

  • Historical data selection
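
The range, GPU, and region filters above combine as a logical AND. The sketch below illustrates that behavior on exported node group records; it is not the product's implementation. The gpuEnabled key is a hypothetical field, and the efficiency-level filter is omitted because its thresholds are not documented here:

```python
from typing import Any, Dict, List, Optional


def filter_groups(
    groups: List[Dict[str, Any]],
    node_count_min: Optional[int] = None,
    node_count_max: Optional[int] = None,
    cost_min: Optional[float] = None,
    cost_max: Optional[float] = None,
    gpu_enabled: str = "all",  # "yes", "no", or "all"
    region: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return the groups that satisfy every supplied filter."""
    result = []
    for g in groups:
        if node_count_min is not None and g["nodeCount"] < node_count_min:
            continue
        if node_count_max is not None and g["nodeCount"] > node_count_max:
            continue
        if cost_min is not None and g["totalCostHourly"] < cost_min:
            continue
        if cost_max is not None and g["totalCostHourly"] > cost_max:
            continue
        has_gpu = bool(g.get("gpuEnabled", False))  # hypothetical field
        if gpu_enabled == "yes" and not has_gpu:
            continue
        if gpu_enabled == "no" and has_gpu:
            continue
        if region is not None and g["region"] != region:
            continue
        result.append(g)
    return result
```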

Spot Instance Recommendations

Purpose: Identify workloads that can safely run on spot/preemptible instances for 70-80% cost savings.

Recommendation Fields:

  • resource_name - Workload name (Deployment, StatefulSet)
  • namespace - Kubernetes namespace
  • resource_type - Deployment, StatefulSet, DaemonSet
  • priority - High, Medium, Low

Cost Analysis:

  • current_hourly_cost - Current on-demand cost
  • target_hourly_cost - Projected spot instance cost
  • estimated_savings - Monthly savings estimate
  • savings_percentage - Percentage cost reduction
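
The cost analysis fields relate to each other through simple arithmetic. The source does not spell out the formulas, so this sketch rests on two assumptions: monthly savings use a 720-hour month, and the percentage is measured against the current on-demand cost. The function name spot_savings is hypothetical:

```python
HOURS_PER_MONTH = 720  # assumption: 24 hours x 30 days


def spot_savings(current_hourly_cost: float, target_hourly_cost: float) -> dict:
    """Estimate monthly savings and percentage reduction for a workload."""
    hourly_delta = current_hourly_cost - target_hourly_cost
    if current_hourly_cost > 0:
        percentage = hourly_delta / current_hourly_cost * 100
    else:
        percentage = 0.0  # avoid dividing by zero for zero-cost workloads
    return {
        "estimated_savings": hourly_delta * HOURS_PER_MONTH,
        "savings_percentage": percentage,
    }
```

For example, a workload moving from a hypothetical $1.00/hour on-demand to $0.25/hour on spot gives a 75% reduction.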

Migration Assessment:

  • is_migratable - Boolean; true when the workload is safe to migrate
  • controller_type - Workload controller type
  • current_replicas - Current number of replicas
  • minimum_recommended_replicas - Minimum replicas recommended for high availability
  • has_pdb - Whether a PodDisruptionBudget is configured

Compatibility Checks:

  • controller_type_ok - Controller supports spot
  • storage_compatible - Storage type works with spot
  • pdb_compatible - PDB configuration allows spot
  • rolling_update_ok - Rolling update configured
  • volume_type_ok - Volume type supports spot

Storage Configuration:

  • local_storage_enabled - Uses local storage
  • volume_type - PVC storage class
  • storage_migratable - Storage survives node termination

PDB Configuration:

  • min_available - Minimum pods available
  • threshold - Threshold percentage
  • pdb_migratable - PDB allows spot migration

Instance Type Specifications

AWS Examples (2025 Pricing):

t3.large:

  • CPU: 2 cores
  • Memory: 8 GB
  • Hourly cost: $0.0832
  • Use case: Burstable, low-traffic workloads

m5.large:

  • CPU: 2 cores
  • Memory: 8 GB
  • Hourly cost: $0.096
  • Use case: General purpose

m5.xlarge:

  • CPU: 4 cores
  • Memory: 16 GB
  • Hourly cost: $0.192
  • Use case: Moderate workloads

m5.2xlarge:

  • CPU: 8 cores
  • Memory: 32 GB
  • Hourly cost: $0.384
  • Use case: High-performance workloads

m5n.large:

  • CPU: 2 cores
  • Memory: 8 GB
  • Hourly cost: $0.119
  • Use case: Network-optimized

Node Health Status

Health States:

Healthy:

  • Node in Ready state
  • All health checks passing
  • No resource pressure

Warning:

  • Degraded performance
  • High resource utilization
  • Minor health check failures

Critical:

  • NotReady state
  • Unreachable
  • Out of memory/disk
  • System component failures

Common Workflows

Review Node Group Efficiency:

  1. Navigate to Dashboard → Nodes Tab
  2. Sort by avgEfficiency (lowest first)
  3. Identify groups with low efficiency
  4. Click group to view individual nodes
  5. Review pod allocation and resource usage

Identify Spot Instance Candidates:

  1. Navigate to cluster → Spot Recommendations
  2. Filter by priority "High"
  3. Check is_migratable = true
  4. Review compatibility checks
  5. Estimate monthly savings
  6. Implement via node selector or taints/tolerations
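
Steps 2 through 5 of this workflow can also be applied to exported recommendation records. A minimal sketch, assuming the records are plain dictionaries that use the field names listed under Spot Instance Recommendations; the function name spot_candidates is hypothetical:

```python
from typing import Any, Dict, List, Tuple


def spot_candidates(
    recommendations: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], float]:
    """Return High-priority migratable workloads and their combined monthly savings."""
    candidates = [
        r for r in recommendations
        if r["priority"] == "High" and r["is_migratable"]
    ]
    # Largest estimated monthly savings first.
    candidates.sort(key=lambda r: r["estimated_savings"], reverse=True)
    total_savings = sum(r["estimated_savings"] for r in candidates)
    return candidates, total_savings
```

The compatibility checks in step 4 still deserve a manual review; this sketch only narrows the list.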

Find Cost Optimization Opportunities:

  1. Sort node groups by totalCostHourly
  2. Check avgEfficiency for top expensive groups
  3. For low efficiency groups, consider:
    • Smaller instance types
    • Node pool consolidation
    • Autoscaler adjustments

Monitor Node Health:

  1. Check the health.critical count
  2. If it is greater than 0, click the group to view the problematic nodes
  3. Review node logs and events
  4. Consider replacing the node if the issues persist
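
The first two steps can be scripted against exported node group records. A minimal sketch, assuming each record nests the counts under a health key as described in Health Status; the name key used to label groups is hypothetical:

```python
from typing import Any, Dict, List


def groups_needing_attention(groups: List[Dict[str, Any]]) -> List[str]:
    """Return the names of node groups that report any critical nodes."""
    return [g["name"] for g in groups if g["health"]["critical"] > 0]
```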