Cooldown, suppression & recovery
Alert Cooldown (Spam Prevention)
Prevent alert fatigue from persistent anomalies with cooldown periods.
Default is null — no cooldown. Without a cooldown, a persisting
anomaly re-alerts on every dtk run for as long as the conditions
hold. Set alert_cooldown (e.g. "2h") for production metrics.
The Problem: Alert Spam
With frequent monitoring intervals, long-running anomalies generate excessive duplicate alerts:
Example: 10-minute interval metric with 5-hour anomaly:

    10:00 - Anomaly detected → Alert sent ✓
    10:10 - Still anomalous  → Alert sent (duplicate!)
    10:20 - Still anomalous  → Alert sent (duplicate!)
    10:30 - Still anomalous  → Alert sent (duplicate!)
    ... (26 more alerts over 5 hours)

Result: 30 identical alerts for a single issue.
The Solution: Alert Cooldown
Configure a minimum time between alerts:

    alerting:
      enabled: true
      channels:
        - mattermost_ops
      consecutive_anomalies: 3

      # Cooldown configuration
      alert_cooldown: "30min"           # Minimum 30 minutes between alerts
      cooldown_reset_on_recovery: true  # Reset timer when metric recovers

Cooldown state is stored per alert config block (in the _dtk_alert_states table). Within one block, no-data alerts and anomaly alerts share the same cooldown state: either kind of alert starts the cooldown for both.
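The shared cooldown state described above can be sketched as follows. This is a minimal illustration, not detectkit's implementation: `AlertBlockState` and `try_send` are hypothetical names standing in for one alert block's row in `_dtk_alert_states`, and the sketch assumes an alert is allowed only once strictly more than the cooldown has elapsed, which is consistent with the timelines in this guide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class AlertBlockState:
    """Hypothetical stand-in for one alert config block's cooldown state."""
    last_alert_at: Optional[datetime] = None
    sent: List[Tuple[str, datetime]] = field(default_factory=list)


def try_send(state: AlertBlockState, now: datetime,
             cooldown: Optional[timedelta], kind: str) -> bool:
    """Send an alert of any kind unless the block is still cooling down."""
    in_cooldown = (
        cooldown is not None
        and state.last_alert_at is not None
        and now - state.last_alert_at <= cooldown  # assumed: must exceed cooldown
    )
    if in_cooldown:
        return False
    # Both alert kinds write to the same timestamp, so either starts the cooldown.
    state.last_alert_at = now
    state.sent.append((kind, now))
    return True


state = AlertBlockState()
cooldown = timedelta(minutes=30)
day = datetime(2026, 4, 10)
results = [
    try_send(state, day.replace(hour=10, minute=0), cooldown, "no_data"),
    try_send(state, day.replace(hour=10, minute=10), cooldown, "anomaly"),
    try_send(state, day.replace(hour=10, minute=40), cooldown, "anomaly"),
]
print(results)  # [True, False, True]
```

The anomaly alert at 10:10 is skipped because the no-data alert at 10:00 already started the cooldown for the whole block.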
Cooldown Behavior
With Recovery Reset (Recommended)
Configuration:

    alert_cooldown: "30min"
    cooldown_reset_on_recovery: true  # Default

Timeline:

    10:00 - Anomaly detected   → Alert sent ✓
    10:10 - Persists           → Skipped (cooldown)
    10:20 - Persists           → Skipped (cooldown)
    10:30 - Persists           → Skipped (cooldown)
    10:40 - RECOVERS to normal → Cooldown timer RESETS
    10:50 - NEW anomaly        → Alert sent ✓ (recovery reset cooldown)

Advantages:
- Alert on first occurrence
- Skip duplicate alerts during persistent issue
- Alert again when new issue occurs after recovery
- Best for most use cases
Strict Cooldown (Noisy Metrics)
Configuration:

    alert_cooldown: "1hour"
    cooldown_reset_on_recovery: false  # Strict mode

Timeline:

    10:00 - Anomaly detected → Alert sent ✓
    10:10 - Persists         → Skipped (cooldown)
    10:20 - RECOVERS         → No alert (recovery doesn't reset)
    10:30 - NEW anomaly      → Skipped (30min < 1hour)
    11:00 - NEW anomaly      → Skipped (60min = 1hour, cooldown not yet exceeded)
    11:01 - NEW anomaly      → Alert sent ✓ (more than 1hour passed)

Advantages:
- Absolute minimum time between any alerts
- Useful for very noisy metrics
- Prevents alert storms even with rapid recovery/anomaly cycles
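The two timelines above can be reproduced with a small simulation. This is a sketch under stated assumptions, not detectkit's code: `simulate` is a hypothetical helper, every anomalous point is treated as already satisfying the alert conditions (so `consecutive_anomalies` is ignored), a single normal point counts as recovery, and an alert is allowed only when strictly more than the cooldown has elapsed.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def simulate(points: List[Tuple[str, bool]], cooldown: timedelta,
             reset_on_recovery: bool) -> List[str]:
    """Return the clock times at which an alert would be sent."""
    last_alert: Optional[datetime] = None
    sent: List[str] = []
    for clock, anomalous in points:
        now = datetime.strptime(clock, "%H:%M")
        if not anomalous:
            if reset_on_recovery:
                last_alert = None  # recovery clears the cooldown timer
            continue
        if last_alert is not None and now - last_alert <= cooldown:
            continue  # still inside the cooldown window
        last_alert = now
        sent.append(clock)
    return sent


recovery_reset = simulate(
    [("10:00", True), ("10:10", True), ("10:20", True),
     ("10:30", True), ("10:40", False), ("10:50", True)],
    cooldown=timedelta(minutes=30),
    reset_on_recovery=True,
)
strict = simulate(
    [("10:00", True), ("10:10", True), ("10:20", False),
     ("10:30", True), ("11:00", True), ("11:01", True)],
    cooldown=timedelta(hours=1),
    reset_on_recovery=False,
)
print(recovery_reset)  # ['10:00', '10:50']
print(strict)          # ['10:00', '11:01']
```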
Configuration Options
String Format (Human-Readable)

    alert_cooldown: "10min"   # 10 minutes
    alert_cooldown: "30min"   # 30 minutes
    alert_cooldown: "1hour"   # 1 hour
    alert_cooldown: "2hours"  # 2 hours
    alert_cooldown: "1day"    # 1 day

Integer Format (Seconds)

    alert_cooldown: 600   # 10 minutes (600 seconds)
    alert_cooldown: 1800  # 30 minutes
    alert_cooldown: 3600  # 1 hour
    alert_cooldown: 7200  # 2 hours

Recovery Behavior

    # Reset cooldown on metric recovery (default)
    cooldown_reset_on_recovery: true

    # Strict cooldown regardless of recovery
    cooldown_reset_on_recovery: false

Choosing Cooldown Settings
By Metric Criticality
Critical metrics (API availability, payment processing):

    alert_cooldown: "5min"            # Short cooldown
    cooldown_reset_on_recovery: true  # Alert on new issues quickly

Important metrics (application performance, database latency):

    alert_cooldown: "30min"           # Standard cooldown
    cooldown_reset_on_recovery: true  # Default behavior

Noisy metrics (non-critical warnings, experimental monitors):

    alert_cooldown: "2hours"           # Long cooldown
    cooldown_reset_on_recovery: false  # Strict mode

By Interval
Fast intervals (1min, 5min):

    # More aggressive cooldown needed
    alert_cooldown: "30min"

Slow intervals (1hour, 1day):

    # Less aggressive cooldown
    alert_cooldown: "1hour"

How Recovery Detection Works
detectkit detects recovery automatically by checking whether the count of consecutive anomalies has dropped below the consecutive_anomalies threshold:
Example with consecutive_anomalies: 3:

    Points:  A A A N N N A A A
             ↑ ↑ ↑ ↑ ↑ ↑
             1 2 3 Recovery detected!

Timeline:

    10:00 - 1st anomaly
    10:10 - 2nd anomaly
    10:20 - 3rd anomaly → Alert sent (threshold met)
    10:30 - Normal point
    10:40 - Normal point
    10:50 - Normal point → Recovery detected, cooldown reset
    11:00 - NEW 1st anomaly
    11:10 - NEW 2nd anomaly
    11:20 - NEW 3rd anomaly → Alert sent (new issue)

Temporary Alert Suppression
When you’ve identified the root cause of an anomaly and want to stop alerts while the fix is deployed, use suppress_until to temporarily silence alerts without disabling the metric.
The Problem
Using enabled: false requires two config edits: one to disable, another to re-enable later. If you forget the second edit, alerting stays off.
The Solution: suppress_until
Set a UTC datetime after which alerts automatically resume:

    alerting:
      enabled: true
      suppress_until: "2026-04-11 18:00:00"  # Alerts suppressed until this UTC time
      channels:
        - mattermost_ops
      consecutive_anomalies: 3

Key behavior:
- Load and detect steps continue running normally; data collection is not interrupted
- Only the alert step is skipped while now < suppress_until
- After the specified time, alerts resume automatically; no second config edit is needed
- The suppress_until value can be left in the config after it expires; it has no effect once the time has passed
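The rule in the list above (skip the alert step while now < suppress_until) can be sketched as a single comparison. `is_suppressed` is a hypothetical helper, not a detectkit function, and the sketch assumes the value uses the `YYYY-MM-DD HH:MM:SS` format shown in the config example and is interpreted as UTC.

```python
from datetime import datetime, timezone
from typing import Optional


def is_suppressed(suppress_until: Optional[str], now: datetime) -> bool:
    """Return True while alerts should be skipped (now < suppress_until, UTC)."""
    if suppress_until is None:
        return False
    deadline = datetime.strptime(suppress_until, "%Y-%m-%d %H:%M:%S")
    deadline = deadline.replace(tzinfo=timezone.utc)  # config value is UTC
    return now < deadline


until = "2026-04-11 18:00:00"
during = datetime(2026, 4, 10, 14, 0, tzinfo=timezone.utc)
after = datetime(2026, 4, 11, 18, 1, tzinfo=timezone.utc)
print(is_suppressed(until, during))  # True
print(is_suppressed(until, after))   # False
```

Because the comparison is strict, an expired value left in the config always evaluates to False, which matches the last point in the list.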
Timeline Example
Config: suppress_until: "2026-04-11 18:00:00"

    2026-04-10 14:00 - Anomaly detected → Suppressed (before 18:00 Apr 11)
    2026-04-10 15:00 - Anomaly detected → Suppressed
    2026-04-11 12:00 - Anomaly detected → Suppressed
    2026-04-11 18:01 - Anomaly detected → Alert sent ✓ (suppress period ended)
    2026-04-11 19:00 - Anomaly detected → Normal cooldown rules apply

When to Use

| Scenario | Use |
|---|---|
| Known issue being fixed, ETA ~6 hours | suppress_until: "<now + 6h>" |
| Planned maintenance window | suppress_until: "<end of window>" |
| Permanently disable alerting | enabled: false |
| Reduce alert frequency | alert_cooldown: "1hour" |
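To fill in a placeholder such as `<now + 6h>`, the timestamp can be computed rather than typed by hand. The helper below is a hypothetical convenience, not part of detectkit; it assumes the `YYYY-MM-DD HH:MM:SS` UTC format used in the suppress_until example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def suppress_until_value(hours: float, now: Optional[datetime] = None) -> str:
    """Format a UTC timestamp `hours` from now for use in suppress_until."""
    if now is None:
        now = datetime.now(timezone.utc)
    deadline = now.astimezone(timezone.utc) + timedelta(hours=hours)
    return deadline.strftime("%Y-%m-%d %H:%M:%S")


fixed_now = datetime(2026, 4, 11, 12, 0, tzinfo=timezone.utc)
print(suppress_until_value(6, fixed_now))  # 2026-04-11 18:00:00
```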
Recovery Notifications
In addition to cooldown reset, detectkit can send a separate notification when a metric returns to normal after an anomaly.
Enabling Recovery Notifications

    alerting:
      enabled: true
      channels:
        - mattermost_ops
      consecutive_anomalies: 3
      notify_on_recovery: true  # Send notification when metric recovers

Recovery Logic
A recovery notification is sent when all of the following are true:
- A previous anomaly alert was sent for this metric
- The metric has returned to normal (no blocking anomalies at the latest point)
- A recovery notification has not already been sent for this incident
Recovery is direction-aware: only anomalies matching the alert’s direction block recovery. For example, after a “down” alert a fresh “up” anomaly does not prevent the recovery notification — the original alert condition no longer holds.
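The three conditions and the direction rule can be sketched as a small predicate. The names `IncidentState` and `should_notify_recovery` are hypothetical, and the sketch assumes each anomaly at the latest point carries a direction of "up" or "down"; detectkit's internal representation is not shown in this guide.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IncidentState:
    """Hypothetical per-incident flags for one alert config block."""
    alert_sent: bool       # a previous anomaly alert was sent
    alert_direction: str   # direction of that alert: "up" or "down" (assumed)
    recovery_sent: bool    # recovery already sent for this incident


def should_notify_recovery(state: IncidentState,
                           latest_anomaly_directions: Iterable[str]) -> bool:
    """Apply the three recovery conditions, direction-aware."""
    if not state.alert_sent or state.recovery_sent:
        return False
    # Only anomalies in the alert's own direction block recovery.
    blocking = [d for d in latest_anomaly_directions
                if d == state.alert_direction]
    return not blocking


state = IncidentState(alert_sent=True, alert_direction="down", recovery_sent=False)
print(should_notify_recovery(state, []))        # True
print(should_notify_recovery(state, ["up"]))    # True
print(should_notify_recovery(state, ["down"]))  # False
```

The second call mirrors the example in the text: a fresh "up" anomaly after a "down" alert does not block the recovery notification.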
Timeline with notify_on_recovery: true and consecutive_anomalies: 3:

    10:00 - 1st anomaly
    10:10 - 2nd anomaly
    10:20 - 3rd anomaly → ALERT sent ("🔴 Alert: cpu_usage")
    10:30 - Normal point
    10:40 - Normal point → RECOVERY sent ("🟢 Alert cleared: cpu_usage")
    10:50 - Normal point
    11:00 - NEW 1st anomaly
    ...
    11:20 - NEW 3rd anomaly → ALERT sent (new incident)
    11:30 - Normal point → RECOVERY sent (new recovery)

Custom Recovery Template
Use template_recovery to customize the recovery message. It supports the same variables as anomaly templates, plus {status}:

    alerting:
      notify_on_recovery: true
      template_recovery: "{metric_name} recovered at {timestamp}\nValue: {value} | Interval: {confidence_interval}"

Available template variables:
| Variable | Description |
|---|---|
| {metric_name} | Metric name |
| {timestamp} | Timestamp of the last detection point |
| {timezone} | Configured timezone |
| {value} | Metric value at recovery point |
| {value_display} | NaN-safe value string; always renders, falls back to "no data" (used by the default recovery template) |
| {confidence_lower} | Lower confidence bound |
| {confidence_upper} | Upper confidence bound |
| {confidence_interval} | Formatted as [lower, upper] |
| {expected_range} | One-sided aware expected band (>= lo / <= hi / [lo, hi] / N/A) |
| {detector_name} | Detector that was monitoring |
| {min_detectors} / {direction_policy} / {consecutive_required} | The rule of the alert that cleared (echoed so recovery names the same condition) |
| {status} | Always "RECOVERED" in recovery messages |
| {mentions} | Formatted mentions string (e.g., @user1 @user2), empty if none |
| {mentions_line} | Same as {mentions} with leading newline, empty if none |
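To preview how a template with these variables might expand, the placeholders can be filled with Python's `str.format`, which uses the same `{name}` syntax. This is only an illustration: `build_variables` is a hypothetical helper, the sample values are made up, only a subset of the variables is populated, and detectkit's own renderer and number formatting may differ.

```python
from typing import Dict, List


def build_variables(metric_name: str, timestamp: str, value: float,
                    lower: float, upper: float,
                    mentions: List[str]) -> Dict[str, object]:
    """Assemble a subset of the recovery template variables (illustrative)."""
    mentions_str = " ".join(f"@{user}" for user in mentions)
    return {
        "metric_name": metric_name,
        "timestamp": timestamp,
        "value": value,
        "confidence_lower": lower,
        "confidence_upper": upper,
        "confidence_interval": f"[{lower}, {upper}]",
        "status": "RECOVERED",
        "mentions": mentions_str,
        # Same as mentions with a leading newline, empty if there are none.
        "mentions_line": f"\n{mentions_str}" if mentions_str else "",
    }


template = ("{metric_name} recovered at {timestamp}\n"
            "Value: {value} | Interval: {confidence_interval}")
variables = build_variables("api_response_time_p95", "2026-04-11 10:40:00",
                            182.0, 120.0, 250.0, ["user1", "user2"])
message = template.format(**variables)
print(message)
```

Appending `{mentions_line}` to a template adds the mentions on their own line only when some are configured, so messages without mentions do not end in a blank line.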
Recovery with Cooldown
Recovery notifications work independently of alert_cooldown. The cooldown only applies to anomaly alerts. Recovery is always sent once per incident regardless of cooldown settings.

    alerting:
      alert_cooldown: "30min"
      cooldown_reset_on_recovery: true  # Resets cooldown timer on recovery
      notify_on_recovery: true          # Also sends a recovery notification

Complete Example

    name: api_response_time_p95
    description: API response time 95th percentile
    interval: "5min"

    query: |
      SELECT
        timestamp,
        quantile(0.95)(response_time_ms) as value
      FROM http_requests
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      - type: mad
        params:
          threshold: 3.5
          window_size: 288  # 24 hours

    alerting:
      enabled: true
      timezone: "UTC"

      channels:
        - mattermost_ops
        - slack_incidents

      # Anomaly filtering
      min_detectors: 1
      direction: "any"
      consecutive_anomalies: 3

      # Alert cooldown
      alert_cooldown: "30min"           # No more than 1 alert per 30 minutes
      cooldown_reset_on_recovery: true  # Alert again when new issue after recovery

      # Recovery notifications
      notify_on_recovery: true          # Send notification when metric stabilizes
      template_recovery: "{metric_name} is back to normal at {timestamp}"

      # Special alerts
      no_data_alert: false

Best Practices
- Start with recovery reset: Use cooldown_reset_on_recovery: true initially
- Enable recovery notifications: notify_on_recovery: true is recommended for critical metrics
- Tune cooldown duration: Match it to your team’s response time (15min to 1hour is typical)
- Adjust for interval: Faster intervals need longer cooldowns relative to the interval
- Monitor alert frequency: Track it via _dtk_alert_states.alert_count in the database
- Use strict mode sparingly: Only for very noisy experimental metrics
Note: Alert state (last alert/recovery timestamps, alert counter) lives in the _dtk_alert_states table, keyed by metric and alert config block. The table is created automatically; no manual migration is needed.
Disabling Cooldown
Omit alert_cooldown or set it to null (this is the default):

    alerting:
      enabled: true
      channels:
        - mattermost_ops
      consecutive_anomalies: 3
      # No alert_cooldown = alert on EVERY run while conditions hold

Warning: Without a cooldown, a persistent anomaly fires a duplicate alert on every dtk run (e.g. every cron tick). Setting alert_cooldown is recommended for production metrics.
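The cost of running without a cooldown can be estimated with a short count. This is a rough sketch: `count_alerts` is a hypothetical helper that assumes one dtk run per interval, that every run during the anomaly meets the alert conditions, and that a new alert is allowed only when strictly more than the cooldown has elapsed.

```python
from datetime import timedelta
from typing import Optional


def count_alerts(duration: timedelta, interval: timedelta,
                 cooldown: Optional[timedelta] = None) -> int:
    """Count alerts sent during one uninterrupted anomaly."""
    sent = 0
    last_alert: Optional[timedelta] = None
    elapsed = timedelta(0)
    while elapsed < duration:
        if cooldown is None or last_alert is None or elapsed - last_alert > cooldown:
            sent += 1
            last_alert = elapsed
        elapsed += interval  # next run
    return sent


anomaly = timedelta(hours=5)
every_run = timedelta(minutes=10)
print(count_alerts(anomaly, every_run))                      # 30
print(count_alerts(anomaly, every_run, timedelta(hours=2)))  # 3
```

For the 5-hour anomaly on a 10-minute interval used earlier in this guide, a 2-hour cooldown reduces 30 alerts to 3 under these assumptions.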