
Examples

This directory contains practical examples for common monitoring scenarios.


Monitor server CPU usage with both hard limit and statistical detection.

name: cpu_usage
interval: 30s
query: |
  SELECT
    toStartOfInterval(timestamp, INTERVAL 30 SECOND) AS timestamp,
    AVG(cpu_percent) AS value
  FROM system_metrics
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  # Hard limit: CPU should never exceed 95%
  - type: manual_bounds
    params:
      upper_bound: 95.0
  # Statistical: detect unusual CPU patterns
  - type: zscore
    params:
      threshold: 3.0
      window_size: 2880  # 1 day of 30s intervals
      min_samples: 100
alerting:
  enabled: true
  channels:
    - slack_ops
  min_detectors: 1  # Alert if ANY detector triggers
  direction: "up"  # Only alert on high CPU (low is good)
  consecutive_anomalies: 2  # Require 2 consecutive points
  • Manual Bounds: Catches critical threshold violations immediately
  • Z-Score: Detects unusual patterns even if below 95%
  • Direction filter: Prevents alerts when CPU drops (which is good)
  • Short consecutive: CPU spikes need fast response
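
The window_size values in these examples are counts of points, not durations. The arithmetic behind comments such as "1 day of 30s intervals" can be sketched in Python as follows. The helper name points_in_window is hypothetical and is not part of detectkit.

```python
from datetime import timedelta

def points_in_window(window: timedelta, interval: timedelta) -> int:
    """Return how many interval-sized points fit in a lookback window."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    points, remainder = divmod(window, interval)
    if remainder:
        raise ValueError("window is not a whole number of intervals")
    return points

# 1 day of 30-second points, as in the cpu_usage example
print(points_in_window(timedelta(days=1), timedelta(seconds=30)))  # 2880
# 1 week of 1-minute points
print(points_in_window(timedelta(weeks=1), timedelta(minutes=1)))  # 10080
```

Recomputing window_size this way whenever interval changes keeps the lookback duration the same.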

Track memory usage with adaptive detection.

name: memory_usage_pct
interval: 1min
query: |
  SELECT
    toStartOfMinute(timestamp) AS timestamp,
    -- Aggregate within each minute; GROUP BY needs an aggregate here
    AVG(used_memory_bytes / total_memory_bytes) * 100 AS value
  FROM system_metrics
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 1440  # 1 day
      min_samples: 100
alerting:
  enabled: true
  channels:
    - mattermost_ops
  direction: "up"
  consecutive_anomalies: 5  # Memory grows slowly, wait for confirmation
  • Memory usage can have outliers (garbage collection spikes)
  • MAD is robust to these temporary spikes
  • Higher consecutive threshold avoids false positives
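
To illustrate why a median-based score tolerates spikes in the history, here is a minimal Python sketch comparing a plain Z-score with a MAD score. The 0.6745 scaling is the common "modified Z-score" convention. Whether detectkit scales its MAD detector the same way is an assumption, and both function names are hypothetical.

```python
import statistics

def zscore(value, history):
    """Distance from the mean in units of standard deviation."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return (value - mean) / std if std else 0.0

def mad_score(value, history):
    """Distance from the median in units of scaled MAD (modified Z-score)."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    return 0.6745 * (value - med) / mad if mad else 0.0

# Memory percentages with one temporary spike (95) in the history
history = [60, 61, 59, 60, 62, 58, 60, 61, 59, 95]
new_value = 70

print(round(zscore(new_value, history), 2))     # 0.62: the spike inflated the std
print(round(mad_score(new_value, history), 2))  # 6.75: median and MAD ignore it
```

With threshold 3.0, the Z-score misses the jump to 70 because the single spike widened its notion of normal, while the MAD score still flags it.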

Monitor disk space with SLA threshold.

name: disk_usage_pct
interval: 5min
query: |
  SELECT
    toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS timestamp,
    -- Aggregate within each bucket; GROUP BY needs an aggregate here
    AVG(used_space_bytes / total_space_bytes) * 100 AS value
  FROM storage_metrics
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  - type: manual_bounds
    params:
      upper_bound: 85.0  # Alert when disk > 85% full
alerting:
  enabled: true
  channels:
    - slack_critical
    - email_oncall
  consecutive_anomalies: 3  # Disk fills slowly, confirm trend
  • Clear SLA: disk should never exceed 85%
  • Disk usage grows predictably, no need for statistical detection
  • Simpler than statistical methods for this use case

Track API latency with percentile-based detection.

name: api_p95_latency_ms
interval: 1min
query: |
  SELECT
    toStartOfMinute(timestamp) AS timestamp,
    quantile(0.95)(response_time_ms) AS value
  FROM http_requests
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  # SLA: P95 latency < 1000ms
  - type: manual_bounds
    params:
      upper_bound: 1000
  # Detect degradation before hitting SLA
  - type: iqr
    params:
      threshold: 1.5
      window_size: 1440
      min_samples: 100
alerting:
  enabled: true
  channels:
    - slack_ops
  consecutive_anomalies: 3
  direction: "up"
  • Percentile metrics are skewed (heavy-tailed)
  • IQR handles skewness better than Z-Score
  • Manual bounds ensures SLA compliance
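
For reference, the IQR rule with threshold 1.5 corresponds to the classic Tukey fences. The sketch below computes them in Python. The quartile interpolation method is an assumption (detectkit's exact method is not stated here), and iqr_bounds is a hypothetical name.

```python
import statistics

def iqr_bounds(history, threshold=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(history, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

# P95 latency samples in milliseconds
history = [100, 110, 120, 130, 140, 150, 160, 170, 180]
lower, upper = iqr_bounds(history)
print(lower, upper)   # 60.0 220.0
print(230 > upper)    # True: 230 ms would be flagged as an "up" anomaly
```

Because the fences are built from quartiles rather than the mean and standard deviation, a few extreme tail values in the history move them far less than they would move Z-score bounds.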

Zero-tolerance error monitoring.

name: api_error_rate
interval: 1min
query: |
  SELECT
    toStartOfMinute(timestamp) AS timestamp,
    countIf(status_code >= 500) / count() AS value
  FROM http_requests
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  # Near-zero tolerance for server errors (5xx)
  - type: manual_bounds
    params:
      upper_bound: 0.01  # Alert if error rate > 1%
alerting:
  enabled: true
  channels:
    - slack_critical
    - email_oncall
  consecutive_anomalies: 1  # Alert immediately
  direction: "up"
  • Errors are critical, need fast response
  • No need for consecutive threshold
  • Manual bounds with low threshold (1%)

Example 6: Request Throughput with Seasonality


Monitor API traffic with daily/weekly patterns.

name: api_requests_per_minute
interval: 1min
query: |
  SELECT
    toStartOfMinute(timestamp) AS timestamp,
    count() AS value
  FROM http_requests
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
# Extract seasonality features from timestamps (built-in)
# Available: hour, day_of_week, day_of_month, month, is_weekend
seasonality_columns:
  - hour
  - day_of_week
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 10080  # 1 week of 1-min data
      min_samples: 500
      seasonality_components:
        - ["hour", "day_of_week"]
      min_samples_per_group: 10
alerting:
  enabled: true
  channels:
    - mattermost_ops
  consecutive_anomalies: 3
  • Traffic varies by hour (business hours vs night)
  • Traffic varies by day (weekday vs weekend)
  • Combined seasonality creates 168 unique patterns (24h × 7d)
  • Prevents false positives during natural low-traffic periods
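
The 168 groups can be checked with a short Python sketch that buckets one week of 1-minute timestamps by (hour, day_of_week). The key function is hypothetical, and Python's weekday() numbering (0 = Monday) is an assumption that may differ from the values detectkit produces for day_of_week.

```python
from collections import Counter
from datetime import datetime, timedelta

def season_key(ts):
    """Group key combining hour of day and day of week."""
    return (ts.hour, ts.weekday())

start = datetime(2024, 1, 1)  # a Monday
window = [start + timedelta(minutes=i) for i in range(10080)]  # 1 week
counts = Counter(season_key(ts) for ts in window)

print(len(counts))           # 168 groups (24 hours x 7 days)
print(set(counts.values()))  # {60}: every group holds 60 points
```

With window_size: 10080, each group therefore has about 60 samples, comfortably above min_samples_per_group: 10.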

Track user engagement.

name: daily_active_users
interval: 1day
query: |
  SELECT
    toDate(timestamp) AS timestamp,
    uniqExact(user_id) AS value
  FROM user_events
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 60  # 2 months
      min_samples: 30
alerting:
  enabled: true
  channels:
    - slack_analytics
  consecutive_anomalies: 2
  direction: "down"  # Alert only on drops (increases are good)
  • Increases in DAU are positive (don’t alert)
  • Decreases are concerning (alert)
  • MAD robust to occasional spikes/drops

Monitor financial metrics.

name: daily_revenue_usd
interval: 1day
query: |
  SELECT
    toDate(timestamp) AS timestamp,
    SUM(amount_usd) AS value
  FROM transactions
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
    AND status = 'completed'
  GROUP BY timestamp
  ORDER BY timestamp
# Extract day of week from timestamps (built-in)
seasonality_columns:
  - day_of_week
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 90  # 3 months
      min_samples: 30
      seasonality_components:
        - "day_of_week"  # Different revenue on weekends
alerting:
  enabled: true
  channels:
    - slack_finance
    - email_management
  consecutive_anomalies: 2
  direction: "down"  # Alert on revenue drops
  • Revenue often varies by day of week
  • Weekdays vs weekends have different patterns
  • Prevents false positives on expected low-revenue days

Track funnel metrics.

name: signup_conversion_rate
interval: 1hour
query: |
  SELECT
    toStartOfHour(timestamp) AS timestamp,
    countIf(action = 'signup') / countIf(action = 'visit') AS value
  FROM user_events
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
# Extract hour of day from timestamps (built-in)
seasonality_columns:
  - hour
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 672  # 4 weeks
      min_samples: 100
      seasonality_components:
        - "hour"
alerting:
  enabled: true
  channels:
    - slack_growth
  consecutive_anomalies: 3
  direction: "down"  # Alert on conversion drops

Example 10: Gaming Metrics with Complex Seasonality


Monitor gaming metrics with multi-dimensional seasonality.

name: group_assigned_users_pct
interval: 10min
query_file: sql/group_assigned.sql
# Seasonality columns come from the query itself (query_columns.seasonality),
# so no timestamp-based extraction (seasonality_columns) is needed.
query_columns:
  timestamp: period_time
  metric: group_assigned_users_pct
  seasonality:
    - offset_10minutes  # 0-143 (10-min offset in day)
    - league_day  # 1-3 (tournament day)
loading_start_time: "2024-01-01 00:00:00"
loading_batch_size: 2160  # 15 days
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 8640  # 60 days
      min_samples: 1000
      start_time: "2024-03-01 00:00:00"
      batch_size: 2160
      seasonality_components:
        - ["offset_10minutes", "league_day"]  # 432 unique combinations
      min_samples_per_group: 10
alerting:
  enabled: true
  timezone: "Europe/Moscow"
  channels:
    - mattermost_analytics
  consecutive_anomalies: 3
  • Gaming metric with tournament schedule (3-day leagues)
  • Different patterns for each 10-minute interval within each tournament day
  • 432 unique groups (144 intervals × 3 days)
  • Requires large window (60 days) to have enough samples per group
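
The need for a 60-day window follows from simple arithmetic, sketched below in Python. The helper is hypothetical and assumes points are spread evenly across groups, which holds only if the 3-day leagues run back to back.

```python
import math

def samples_per_group(window_size, group_cardinalities):
    """Average number of window points that land in each seasonal group."""
    groups = math.prod(group_cardinalities)
    return window_size / groups

# 144 ten-minute offsets x 3 league days = 432 groups
print(samples_per_group(8640, [144, 3]))  # 20.0 with the 60-day window
print(samples_per_group(2160, [144, 3]))  # 5.0 with a 15-day window
```

A 15-day window would leave about 5 samples per group, below min_samples_per_group: 10, while 60 days yields about 20.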

Combine multiple detectors with different sensitivities.

name: critical_service_latency
interval: 30s
query: |
  SELECT
    toStartOfInterval(timestamp, INTERVAL 30 SECOND) AS timestamp,
    AVG(latency_ms) AS value
  FROM critical_service_logs
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  # Detector 1: Conservative (fewer false positives)
  - type: mad
    params:
      threshold: 5.0  # Very high threshold
      window_size: 2880  # 1 day
      min_samples: 100
  # Detector 2: Aggressive (catch subtle issues)
  - type: zscore
    params:
      threshold: 2.5  # Lower threshold
      window_size: 1440  # 12 hours
      min_samples: 100
  # Detector 3: Hard limit (SLA)
  - type: manual_bounds
    params:
      upper_bound: 1000  # Never exceed 1s
alerting:
  enabled: true
  channels:
    - slack_critical
  min_detectors: 2  # Require 2 detectors to agree
  direction: "same"  # ...on the SAME direction (up or down)
  consecutive_anomalies: 3
  • Conservative detector (MAD with high threshold) for confidence
  • Aggressive detector (Z-Score with low threshold) for early warning
  • Hard limit for SLA compliance
  • Requiring 2 to agree reduces false positives

Alert logic (min_detectors: 2, direction: "same", consecutive_anomalies: 3):

  • MAD + Z-Score both detect “up” → counts toward the alert (high confidence)
  • Manual bounds fires “up” + a statistical detector fires “up” → both vote “up”, so the quorum is met (the votes must point the SAME way)
  • Only Z-Score detects → no alert (might be noise)
  • One detector says “up” while another says “down” → no alert: they are two anomalies in opposite directions, not two votes for one direction (disagreement is not consensus)
  • The 2-detector quorum must hold at each of the last 3 consecutive points (exactly one interval apart — a gap breaks the chain)

Note: with min_detectors: 2, a manual-bounds violation alone never alerts — it is one vote in the quorum. If the hard limit must page immediately on its own, monitor it as a separate metric (or a separate alerting entry with min_detectors: 1, consecutive_anomalies: 1 — but then any single detector can trigger that entry).
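
The alert logic described above can be sketched in Python. This is an illustration of the stated rules under simplifying assumptions, not detectkit's implementation: both function names are hypothetical, each point is evaluated on its own, and a detector's vote is modeled as "up", "down", or None.

```python
from datetime import datetime, timedelta

def quorum_met(votes, min_detectors, direction):
    """Check one point. votes holds "up", "down", or None per detector."""
    ups = votes.count("up")
    downs = votes.count("down")
    if direction == "same":   # quorum must agree on one direction
        return max(ups, downs) >= min_detectors
    if direction == "any":    # every anomaly counts
        return ups + downs >= min_detectors
    if direction in ("up", "down"):  # only that direction counts
        return (ups if direction == "up" else downs) >= min_detectors
    raise ValueError(f"unknown direction policy: {direction!r}")

def should_alert(points, interval, min_detectors, direction, consecutive):
    """points: (timestamp, votes) pairs sorted by time, newest last."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(points) < consecutive:
        return False
    tail = points[-consecutive:]
    # A gap in the time grid breaks the chain.
    for (prev_ts, _), (ts, _) in zip(tail, tail[1:]):
        if ts - prev_ts != interval:
            return False
    return all(quorum_met(votes, min_detectors, direction) for _, votes in tail)

step = timedelta(seconds=30)
t0 = datetime(2024, 1, 1, 12, 0, 0)
# Vote order per point: [mad, zscore, manual_bounds]
both_up = [(t0 + i * step, ["up", "up", None]) for i in range(3)]
zscore_only = [(t0 + i * step, [None, "up", None]) for i in range(3)]
opposite = [(t0 + i * step, ["up", "down", None]) for i in range(3)]
gapped = [(t0, ["up", "up", None]), (t0 + step, ["up", "up", None]),
          (t0 + 3 * step, ["up", "up", None])]

for name, pts in [("both_up", both_up), ("zscore_only", zscore_only),
                  ("opposite", opposite), ("gapped", gapped)]:
    print(name, should_alert(pts, step, 2, "same", 3))
```

Only both_up prints True. The other three cases fail for the reasons listed above: a single vote, opposing votes, and a broken chain.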


Fire an alert when a metric stops producing data — e.g., the source ETL hung and no rows arrive for the latest interval.

metrics/hourly_revenue.yml
name: hourly_revenue_usd
description: Total revenue per hour, sourced from the orders ETL
interval: 1hour
query: |
  SELECT
    toStartOfHour(timestamp) AS timestamp,
    SUM(amount_usd) AS value
  FROM transactions
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
    AND status = 'completed'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 720  # 30 days
      min_samples: 100
alerting:
  enabled: true
  channels:
    - mattermost_finance
    - email_oncall
  consecutive_anomalies: 2
  direction: "down"
  # No-data alert: fires if the previous full hour has no row
  no_data_alert: true
  template_no_data: |
    hourly_revenue stopped reporting
    Last expected hour: {timestamp} ({timezone})
    Likely cause: orders ETL is hung. Check the upstream job.
    {mentions}
  mentions: [oncall_data]
  alert_cooldown: "1hour"  # don't spam every cron tick
  • Hourly cron schedule means a missing hour is a real signal, not noise
  • no_data_alert independently checks the last full interval, so it fires even when there are no anomalies to evaluate against
  • Custom template_no_data makes the on-call action obvious — they don’t need to guess what “no data” means
  • alert_cooldown: "1hour" limits no-data alerts to at most one per hour, matching the hourly cron schedule, even if the ETL stays broken (anomaly and no-data alerts share the same cooldown state per alerting block)

Avoid no_data_alert for:

  • Naturally sparse metrics (events that don’t happen every interval)
  • High-cardinality slicing where empty buckets are normal
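
As a rough illustration of what "the last full interval has no row" means for the hourly example, here is a Python sketch. It is not detectkit's code: both function names are hypothetical, and it assumes timezone-aware UTC timestamps aligned to the interval grid.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_full_interval_start(now, interval):
    """Start of the most recent interval that has completely elapsed."""
    current_start = now - ((now - EPOCH) % interval)
    return current_start - interval

def no_data(now, interval, row_timestamps):
    """True when the last full interval has no row."""
    return last_full_interval_start(now, interval) not in set(row_timestamps)

now = datetime(2024, 1, 1, 14, 7, tzinfo=timezone.utc)
hour = timedelta(hours=1)
rows = [datetime(2024, 1, 1, h, tzinfo=timezone.utc) for h in (11, 12)]

print(last_full_interval_start(now, hour))  # 2024-01-01 13:00:00+00:00
print(no_data(now, hour, rows))             # True: the 13:00 row is missing
```

At 14:07 the hour in progress (14:00) is ignored, so a run is judged only on the 13:00 bucket. This is why a sparse metric with legitimately empty buckets would alert constantly.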

Catch failures at the project level — DB outages, query timeouts, lock failures — that affect every metric in the run.

detectkit_project.yml
name: my_monitoring
default_profile: prod
paths:
  metrics: metrics
  sql: sql
  templates: templates
# Catch pipeline crashes (one alert per dtk run, then abort)
error_alerting:
  enabled: true
  channels:
    - mattermost_oncall
    - email_oncall
  mentions: [oncall_engineer, here]
  timezone: "Europe/Moscow"
  template: |
    detectkit pipeline failure
    Metric: {metric_name}
    {error_type}: {error_message}
    Time: {timestamp} ({timezone})
    {mentions}
  • Without error_alerting, the run silently moves to the next metric on failure — ops only notices when expected anomaly alerts stop arriving (could be hours)
  • One alert per dtk run (subsequent failures are suppressed): if ClickHouse is down, you don’t get 30 identical alerts
  • Run aborts after the first error alert — no point loading the rest if the source is dead
  • Channels reuse profiles.yml — no config duplication
  • Pair with cron exit-code monitoring: error_alerting covers in-process crashes, cron monitoring covers dtk run not running at all
profiles.yml
alert_channels:
  mattermost_oncall:
    type: mattermost
    webhook_url: "https://mattermost.example.com/hooks/xxx"
    channel: "oncall-alerts"
  email_oncall:
    type: email
    smtp_host: smtp.gmail.com
    smtp_port: 587
    from_email: detectkit@example.com
    to_emails: [oncall@example.com]

Route the same metric to several destinations with different rules by giving alerting: as a YAML list instead of a single block. Each block is fully independent — its own channels, conditions, templates, cooldown, and alert/recovery/cooldown state.

name: api_latency_p95
interval: 5min
query: |
  SELECT timestamp, quantile(0.95)(response_time_ms) AS value
  FROM api_requests
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp
  ORDER BY timestamp
detectors:
  - type: mad
    params: {threshold: 3.0, window_size: 288, min_samples: 100}
  - type: zscore
    params: {threshold: 2.5, window_size: 288, min_samples: 100}
# alerting as a list of independent blocks
alerting:
  # Route 1: page on-call fast (any single detector, short streak, tight cooldown)
  - channels: [slack_oncall]
    min_detectors: 1
    direction: "up"
    consecutive_anomalies: 2
    alert_cooldown: "10min"
    mentions: ["oncall", "here"]
  # Route 2: calmer team summary (require both detectors to agree, longer streak)
  - channels: [mattermost_team]
    min_detectors: 2
    direction: "same"
    consecutive_anomalies: 5
    alert_cooldown: "1hour"
    notify_on_recovery: true

Because the two routes keep separate state, the on-call page and the team summary fire and recover independently. See the full config in multi-alert-routing-example.yml and the Alerting Guide for routing details.


| Use Case | Detector | Seasonality | Consecutive | Direction |
|---|---|---|---|---|
| System Resources | Manual + Z-Score | No | 2-3 | up |
| API Latency | Manual + IQR | Optional | 3 | up |
| Error Rates | Manual | No | 1 | up |
| Traffic/Throughput | MAD | Yes (hour + day_of_week) | 3 | any |
| User Engagement | MAD | Optional | 2-3 | down |
| Revenue | MAD | Yes (day_of_week) | 2 | down |
| Conversion Rate | MAD | Yes (hour) | 3 | down |

| Feature | Config | Effect |
|---|---|---|
| Basic alert | consecutive_anomalies: 3 | Alert after 3 consecutive anomalies (default: 3; gaps in the grid break the chain) |
| Detector quorum | min_detectors: 2 | Require 2 detectors per the direction policy (default: 1) |
| Direction policy | direction: "same" | same (default): quorum must agree on one direction; any: every anomaly counts; up/down: only that direction counts |
| Cooldown | alert_cooldown: "30min" | No more than 1 alert per 30 min (default: none, so a persisting anomaly re-alerts on every dtk run) |
| Cooldown reset | cooldown_reset_on_recovery: true | Cooldown resets when metric normalizes |
| Recovery notify | notify_on_recovery: true | "All clear" sent once per incident |
| Custom recovery | template_recovery: "..." | Custom message text for recovery |
| Single-anomaly template | template_single: "..." | Used when the alert has no streak (consecutive count ≤ 1); falls back to template_consecutive |
| Streak template | template_consecutive: "..." | Used for consecutive-anomaly alerts |
| Mentions | mentions: ["oncall", "here"] | @mention users/groups in alerts |
| Suppress | suppress_until: "2026-04-11 18:00:00" | Pause alerts until UTC time |
| No-data alert | no_data_alert: true | Fire when latest interval has no row |
| No-data template | template_no_data: "..." | Custom no-data message body |
| Multiple alert routes | alerting: as a YAML list | Each block independent (channels/conditions/templates/cooldown/state) |
| Rule-aware template vars (v0.9+) | {expected_range}, {min_detectors}, {direction_policy}, {consecutive_required}, {detector_count} | Surface the rule the alert fired with (the default message is now alert-centric) |
| Project errors | error_alerting: in detectkit_project.yml | Catch pipeline crashes (DB outage etc.) |