Smart Alerting
How Kubeadapt's alerts work — threshold checks against a rolling baseline, not z-score anomaly math — and the four alert types you can configure.
You want to hear about cost surprises before finance does. Smart Alerting is Kubeadapt's way of getting a message to the right channel when one of four things happens in your environment: spend spikes, a budget runs hot, a workload sits idle, or an expensive new workload appears. This page explains the model so the rest of the section makes sense.
Alerts are threshold-based, not anomaly-based
Smart Alerting compares observed values to a baseline and fires when the difference crosses a threshold. The baseline is a rolling window of recent history. The threshold is either configured by you (budget, idle percentage, monthly dollar limit) or derived automatically from the scope's spending pattern (cost spike).
This is deliberate. Kubeadapt does not use z-scores, IQR, MAD, or any learned model of what your spend usually looks like. A threshold against a rolling baseline is simple, predictable, and easy to reason about when an alert fires at 3 a.m. If spend doubled compared to recent days, you'll know it doubled. There's no need to guess what the algorithm decided was "abnormal".
The trade-off: you'll occasionally see false positives during real changes (a new team onboards, a major deployment ships). The fix is tuning the sensitivity, not adding statistics.
The four user-configurable alert types
| Alert type | Fires when… | Common scope |
|---|---|---|
| cost_spike | Daily spend rises sharply against the rolling baseline for the same day type (weekday/weekend). | Cluster, team |
| budget_threshold | Period-to-date spend crosses 50%, 80%, 90%, or 100% of a budget you set. | Org, department |
| unused_resources | A scope's idle resources cross a configured percentage. Delivered as a weekly digest. | Cluster |
| new_expensive_workload | A workload appears that's projected to cost more than a configured monthly threshold. | Cluster, team |
You configure each type's conditions on a Rule. One Rule can enable any combination of the four. See Rules for the rule shape.
How cost-spike actually works
Of the four, Cost spike is the only one with a non-trivial baseline. The spec deserves the detail.
For each scope, Kubeadapt looks at the last 30 days of daily cost, then computes the baseline as the median of historical samples bucketed by day type — weekday samples for weekday evaluation, weekend samples for weekend evaluation. The current day's cost is compared against that bucketed median.
Two reasons for the day-type bucketing:
- Weekend traffic is structurally different from weekday traffic in most production environments. Comparing Sunday's spend to last Tuesday's baseline produces noise.
- The median within each day-type bucket holds up against a single one-off day. A Black Friday spike in a 30-day window won't poison the next 29 days' baselines, because one outlier barely moves a median while most days are still ordinary.
If the scope's day-type bucket has fewer than two historical samples (cold start, very new organizations), Kubeadapt falls back to the median across all day types. The incident detail page shows whether the fallback was used.
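The baseline described above can be sketched in a few lines. This is an illustration only, not Kubeadapt's implementation: the function names, the `history` mapping of date to daily cost, and the returned `fallback_used` flag are all assumptions made for the example.

```python
from datetime import date, timedelta
from statistics import median


def day_type(d: date) -> str:
    """Classify a date as 'weekday' or 'weekend' (Saturday and Sunday)."""
    return "weekend" if d.weekday() >= 5 else "weekday"


def bucketed_baseline(history: dict[date, float], today: date,
                      window_days: int = 30, min_samples: int = 2):
    """Return (baseline, fallback_used) for `today`.

    `history` maps each completed day to its cost (hypothetical input shape).
    The baseline is the median of same-day-type samples in the window; with
    fewer than `min_samples` of them, fall back to the median of all samples.
    """
    start = today - timedelta(days=window_days)
    window = {d: cost for d, cost in history.items() if start <= d < today}
    same_type = [cost for d, cost in window.items()
                 if day_type(d) == day_type(today)]
    if len(same_type) >= min_samples:
        return median(same_type), False
    if not window:
        return None, True  # no history at all: nothing to compare against
    return median(window.values()), True
```

With 22 weekday samples in the window, a single weekday at ten times the usual cost leaves the weekday median unchanged, which is the robustness the bucketing argument relies on.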
Sensitivity modes
Cost spike has two sensitivity modes:
- Sustained (default) — fires only if the spike holds for two consecutive completed days. The spike days are excluded from the baseline math, so they don't artificially inflate their own baseline. Quiet, fewer false positives.
- Immediate — fires on every qualifying jump, including a single-day spike. Far noisier. Useful when one bad day is already actionable (e.g., when a small dev cluster's nightly job hangs and burns the weekend).
The threshold and the floor that decide what counts as "qualifying" adapt to the scope's spending pattern — nothing to tune on the rule itself. The sensitivity is the only knob.
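The difference between the two modes can be sketched as follows. This is a simplified illustration: `threshold_ratio` is a hypothetical stand-in for the adaptive threshold that Kubeadapt derives on its own, the floor is left out, and `baseline` is assumed to have been computed with the spike days already excluded.

```python
def should_fire(daily_costs: list[float], baseline: float,
                threshold_ratio: float, mode: str = "sustained") -> bool:
    """Decide whether a cost-spike alert fires.

    `daily_costs` holds completed days, oldest first.
    `threshold_ratio` is an assumed multiplier; the real threshold is
    derived by the product and is not configurable on the rule.
    """
    def is_spike(cost: float) -> bool:
        return cost > baseline * threshold_ratio

    if not daily_costs:
        return False
    if mode == "immediate":
        # A single qualifying day is enough.
        return is_spike(daily_costs[-1])
    if mode == "sustained":
        # The two most recent completed days must both qualify.
        return len(daily_costs) >= 2 and all(is_spike(c) for c in daily_costs[-2:])
    raise ValueError(f"unknown sensitivity mode: {mode}")
```

A one-day jump fires in Immediate mode but stays silent in Sustained mode until a second consecutive day confirms it.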
The other three alert types
Budget threshold is the simplest. You set a budget for a period (monthly, quarterly, or fiscal year), pick which threshold percentages to be notified at (50%, 80%, 90%, 100% — any non-empty subset), and pick a cost mode (Fully Loaded vs Workload Only). The rule fires once when each selected threshold is first crossed; the counter resets at the start of the next period.
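The fire-once-per-threshold behavior can be sketched like this. The function name and the `already_fired` set are assumptions for the example; how Kubeadapt tracks fired thresholds internally is not documented here.

```python
def newly_crossed(spend_to_date: float, budget: float,
                  selected: set[int], already_fired: set[int]) -> list[int]:
    """Return the selected thresholds (in percent) crossed for the first time.

    `already_fired` is the caller's record of thresholds that have already
    notified in the current period; it should be cleared when a new period starts.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = spend_to_date / budget * 100
    return [t for t in sorted(selected)
            if percent_used >= t and t not in already_fired]
```

For example, a scope at 850 of a 1000 budget with 50%, 80%, and 100% selected yields `[50, 80]` on the first evaluation and nothing on the next one, until spend reaches 100%.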
Unused resources runs on a weekly schedule. It rolls up the scope's idle resources (low CPU/memory utilization combined with low network activity), filters by the minimum idle percentage you set, and delivers the top N most expensive idle items in a single digest. You configure the percentage and the cap; the cadence is fixed at weekly. Use this when you want a "did we leave money on the table this week?" recap instead of a real-time alert.
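The filter-then-cap step of the digest can be sketched as below. The `IdleItem` type and its fields are hypothetical; the sketch assumes the idle percentage and monthly cost per item have already been computed.

```python
from dataclasses import dataclass


@dataclass
class IdleItem:
    name: str            # hypothetical identifier for the idle resource
    idle_percent: float  # share of the resource considered idle, 0-100
    monthly_cost: float


def weekly_digest(items: list[IdleItem], min_idle_percent: float,
                  top_n: int) -> list[IdleItem]:
    """Keep items at or above the idle percentage, most expensive first, capped at top_n."""
    qualifying = [i for i in items if i.idle_percent >= min_idle_percent]
    return sorted(qualifying, key=lambda i: i.monthly_cost, reverse=True)[:top_n]
```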
New expensive workload watches for workloads that appear and exceed a monthly dollar projection. You configure the dollar threshold, a minimum age (so a workload has to run long enough for the cost projection to stabilize), and a list of owner Kinds to exclude. Defaults exclude short-lived built-ins (Job, CronJob) and common ML operators (Argo Workflows, Spark, Ray, Kubeflow). The auto-exclude option for custom-resource-owned workloads is on by default, which is the right setting for most teams running operators.
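The checks described above combine roughly as in this sketch. The `Workload` type, its fields, and the function name are assumptions for illustration. The default exclusion set lists only `Job` and `CronJob`, because the exact Kind names for the ML operators are not given here.

```python
from dataclasses import dataclass
from datetime import timedelta

# Only the built-in Kinds are listed; operator Kind names are not specified in this page.
DEFAULT_EXCLUDED_KINDS = frozenset({"Job", "CronJob"})


@dataclass
class Workload:
    name: str
    owner_kind: str
    age: timedelta
    projected_monthly_cost: float
    owned_by_custom_resource: bool = False


def qualifies(w: Workload, monthly_threshold: float, min_age: timedelta,
              excluded_kinds: frozenset[str] = DEFAULT_EXCLUDED_KINDS,
              auto_exclude_custom: bool = True) -> bool:
    """Return True if the workload should raise a new-expensive-workload alert."""
    if w.owner_kind in excluded_kinds:
        return False
    if auto_exclude_custom and w.owned_by_custom_resource:
        return False
    if w.age < min_age:
        return False  # too young for the projection to have stabilized
    return w.projected_monthly_cost > monthly_threshold
```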
Scope: where a rule applies
Every rule has a scope. The valid kinds:
- organization: every cluster in your org.
- cluster: a single cluster.
- namespace: a namespace within a single cluster.
- team: workloads attributed to a Team (via labels or Assignment Workbench).
- department: workloads attributed to a Department.
Team and department scopes are dynamic. Kubeadapt resolves the membership at evaluation time, and the baseline math operates on a stable set of workloads — so attribution reshuffles (a team grows, a workload moves) don't create false cost-spike alerts on their own.
Where alerts go
A Rule decides what to fire on. A Policy decides where the alert goes. A Channel is the destination (Slack, email, webhook, in-app). The three layers are separate so you can route critical-severity alerts to PagerDuty-like channels and informational alerts to a digest mailing list without duplicating rule definitions.
The full routing model is covered in Policies and Channels.
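To make the separation of layers concrete, here is a minimal sketch in which policies match on severity and point at channels by name. The matching rule, the type names, and the channel names are all assumptions for illustration; the actual policy model is the one documented in Policies and Channels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    rule_name: str
    severity: str  # e.g. "critical" or "info" (assumed severity labels)


@dataclass(frozen=True)
class Policy:
    severities: frozenset[str]  # severities this policy matches (assumed criterion)
    channels: tuple[str, ...]   # names of destination channels


def route(alert: Alert, policies: list[Policy]) -> list[str]:
    """Return the channels an alert goes to, in policy order, without duplicates."""
    destinations: list[str] = []
    for policy in policies:
        if alert.severity in policy.severities:
            for channel in policy.channels:
                if channel not in destinations:
                    destinations.append(channel)
    return destinations
```

Because the rule only produces the alert, the same rule can feed an on-call webhook and a digest mailing list simply by matching two different policies.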
Incident lifecycle
When a rule fires, an incident is created. Incidents move through five states: pending (just created), firing (notification dispatched), acknowledged (a human marked it seen), snoozed (silenced for a window), resolved (the condition cleared). The state column is visible on every incident; transitions are logged on the timeline.
A rule that's currently muted, disabled, or degraded won't fire new incidents — see Rules for the difference between those states.