Examples
This directory contains practical examples for common monitoring scenarios.
Example Index
Infrastructure Monitoring
- CPU Usage - System resource monitoring with multiple bounds
- Memory Usage - Memory monitoring with threshold
- Disk Usage - Storage monitoring with SLA
Application Monitoring
- API Response Time - Latency monitoring with percentiles
- API Error Rate - Error tracking with zero tolerance
- Request Throughput - Traffic monitoring with hourly patterns
Business Metrics
- Daily Active Users - User engagement monitoring
- Revenue Tracking - Financial metrics monitoring
- Conversion Rate - Funnel metrics monitoring
Advanced Examples
- Gaming Metrics with Complex Seasonality - Multi-dimensional seasonality
- Multi-Detector Strategy - Combining multiple detectors
Alerting & Preprocessing Features
- No-Data Alerts - Fire when the latest interval has no datapoint
- Project-Level Error Alerting - Catch DB outages and pipeline crashes at the project level
- Multiple Alerting Blocks - Route one metric to several independent alert blocks
- Mentions Example - @mention users/groups in alerts across all channels
- Alert Cooldown Example - Prevent spam with alert cooldown
- Recovery Notifications Example - “All clear” messages when metric stabilizes
- Detector Preprocessing Example - input_type, smoothing, window weighting, detrending
Auto-tuning
- Auto-tune Incidents Example - A labels file (intervals + points) for supervised dtk autotune
- Auto-tuned Metric Example - A metric with an autotune: block constraining the search
Example 1: CPU Usage Monitoring
Monitor server CPU usage with both a hard limit and statistical detection.
Configuration

    name: cpu_usage
    interval: 30s

    query: |
      SELECT
        toStartOfInterval(timestamp, INTERVAL 30 SECOND) AS timestamp,
        AVG(cpu_percent) AS value
      FROM system_metrics
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      # Hard limit: CPU should never exceed 95%
      - type: manual_bounds
        params:
          upper_bound: 95.0

      # Statistical: detect unusual CPU patterns
      - type: zscore
        params:
          threshold: 3.0
          window_size: 2880  # 1 day of 30s intervals
          min_samples: 100

    alerting:
      enabled: true
      channels:
        - slack_ops
      min_detectors: 1          # Alert if ANY detector triggers
      direction: "up"           # Only alert on high CPU (low is good)
      consecutive_anomalies: 2  # Require 2 consecutive points

Why This Works
- Manual Bounds: Catches critical threshold violations immediately
- Z-Score: Detects unusual patterns even if below 95%
- Direction filter: Prevents alerts when CPU drops (which is good)
- Short consecutive streak (2 points at a 30s interval): CPU spikes need a fast response
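The interplay between the two detectors can be sketched in a few lines of Python. This is an illustrative sketch only, not detectkit's implementation: the function names (`zscore_flags`, `bound_flags`, `should_alert`) are hypothetical, the z-score is computed against the preceding window with the sample standard deviation, and only upward deviations are flagged to mirror direction: "up".

```python
from statistics import mean, stdev

def zscore_flags(values, window_size, threshold, min_samples):
    """Flag points whose upward z-score against the preceding window exceeds threshold."""
    flags = []
    for i, v in enumerate(values):
        history = values[max(0, i - window_size):i]
        if len(history) < min_samples:
            flags.append(False)  # not enough history to judge
            continue
        mu, sigma = mean(history), stdev(history)
        flags.append(sigma > 0 and (v - mu) / sigma > threshold)
    return flags

def bound_flags(values, upper_bound):
    """Flag points above a fixed hard limit."""
    return [v > upper_bound for v in values]

def should_alert(flags_per_detector, consecutive):
    """min_detectors: 1 semantics: a point counts if ANY detector flagged it."""
    any_flag = [any(col) for col in zip(*flags_per_detector)]
    return len(any_flag) >= consecutive and all(any_flag[-consecutive:])

cpu = [40, 42, 41, 39, 40, 41, 43, 40, 42, 41, 97, 98]
z = zscore_flags(cpu, window_size=10, threshold=3.0, min_samples=5)
b = bound_flags(cpu, upper_bound=95.0)
print(should_alert([z, b], consecutive=2))
```

In this toy series the first spike inflates the window's own standard deviation, so the z-score alone does not flag the second spike; the hard bound still does, which is one reason to pair the two.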
Example 2: Memory Usage Monitoring
Track memory usage with adaptive detection.
Configuration

    name: memory_usage_pct
    interval: 1min

    query: |
      SELECT
        toStartOfMinute(timestamp) AS timestamp,
        AVG(used_memory_bytes / total_memory_bytes) * 100 AS value
      FROM system_metrics
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 1440  # 1 day
          min_samples: 100

    alerting:
      enabled: true
      channels:
        - mattermost_ops
      direction: "up"
      consecutive_anomalies: 5  # Memory grows slowly, wait for confirmation

Why MAD
- Memory usage can have outliers (garbage collection spikes)
- MAD is robust to these temporary spikes
- Higher consecutive threshold avoids false positives
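To see why a median-based score tolerates a stray spike in the history, compare it with a plain z-score on the same window. This is a minimal sketch with hypothetical helper names; the 1.4826 scaling factor is the usual constant that makes MAD comparable to a standard deviation for normal data, and whether detectkit applies it is an assumption.

```python
from statistics import mean, median, stdev

def mad_score(history, value):
    """Distance from the median in units of scaled MAD (robust to outliers)."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        return 0.0
    return abs(value - med) / (1.4826 * mad)

def z_score(history, value):
    """Distance from the mean in units of sample standard deviation."""
    sigma = stdev(history)
    return abs(value - mean(history)) / sigma if sigma > 0 else 0.0

# Memory % history containing one garbage-collection spike (95)
history = [60, 61, 59, 60, 62, 61, 60, 95, 59, 61]
print(mad_score(history, 70), z_score(history, 70))
```

The single spike barely moves the median and MAD, so a reading of 70 still scores well above a threshold of 3.0, while the same spike inflates the mean and standard deviation enough to hide it from the z-score.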
Example 3: Disk Usage Monitoring
Monitor disk space against an SLA threshold.
Configuration

    name: disk_usage_pct
    interval: 5min

    query: |
      SELECT
        toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS timestamp,
        AVG(used_space_bytes / total_space_bytes) * 100 AS value
      FROM storage_metrics
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      - type: manual_bounds
        params:
          upper_bound: 85.0  # Alert when disk > 85% full

    alerting:
      enabled: true
      channels:
        - slack_critical
        - email_oncall
      consecutive_anomalies: 3  # Disk fills slowly, confirm trend

Why Manual Bounds
- Clear SLA: disk should never exceed 85%
- Disk usage grows predictably, no need for statistical detection
- Simpler than statistical methods for this use case
Example 4: API Response Time Monitoring
Track API latency with percentile-based detection.
Configuration

    name: api_p95_latency_ms
    interval: 1min

    query: |
      SELECT
        toStartOfMinute(timestamp) AS timestamp,
        quantile(0.95)(response_time_ms) AS value
      FROM http_requests
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      # SLA: P95 latency < 1000ms
      - type: manual_bounds
        params:
          upper_bound: 1000

      # Detect degradation before hitting SLA
      - type: iqr
        params:
          threshold: 1.5
          window_size: 1440
          min_samples: 100

    alerting:
      enabled: true
      channels:
        - slack_ops
      consecutive_anomalies: 3
      direction: "up"

Why IQR
- Percentile metrics are skewed (heavy-tailed)
- IQR handles skewness better than Z-Score
- Manual bounds ensures SLA compliance
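The IQR rule itself is short enough to sketch. The snippet below uses the classic Tukey fences (Q1 - k*IQR, Q3 + k*IQR) with Python's `statistics.quantiles`; the helper name `iqr_bounds` is hypothetical, and the exact quantile method detectkit uses is an assumption.

```python
from statistics import quantiles

def iqr_bounds(history, threshold=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

# P95 latency history in ms
history = [200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300]
lower, upper = iqr_bounds(history)
print(lower, upper)  # 130.0 370.0
```

A reading of 600 ms is far above the upper fence here while still under the 1000 ms hard limit, which is the "degradation before hitting SLA" case the second detector is meant to catch.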
Example 5: API Error Rate Monitoring
Zero-tolerance error monitoring.
Configuration

    name: api_error_rate
    interval: 1min

    query: |
      SELECT
        toStartOfMinute(timestamp) AS timestamp,
        countIf(status_code >= 500) / count() AS value
      FROM http_requests
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      # Zero tolerance for errors
      - type: manual_bounds
        params:
          upper_bound: 0.01  # Alert if error rate > 1%

    alerting:
      enabled: true
      channels:
        - slack_critical
        - email_oncall
      consecutive_anomalies: 1  # Alert immediately
      direction: "up"

Why Immediate Alerts
- Errors are critical, need fast response
- No need for consecutive threshold
- Manual bounds with low threshold (1%)
Example 6: Request Throughput with Seasonality
Monitor API traffic with daily/weekly patterns.
Configuration

    name: api_requests_per_minute
    interval: 1min

    query: |
      SELECT
        toStartOfMinute(timestamp) AS timestamp,
        count() AS value
      FROM http_requests
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    # Extract seasonality features from timestamps (built-in)
    # Available: hour, day_of_week, day_of_month, month, is_weekend
    seasonality_columns:
      - hour
      - day_of_week

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 10080  # 1 week of 1-min data
          min_samples: 500
          seasonality_components:
            - ["hour", "day_of_week"]
          min_samples_per_group: 10

    alerting:
      enabled: true
      channels:
        - mattermost_ops
      consecutive_anomalies: 3

Why Seasonality
- Traffic varies by hour (business hours vs night)
- Traffic varies by day (weekday vs weekend)
- Combined seasonality creates 168 unique patterns (24h × 7d)
- Prevents false positives during natural low-traffic periods
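The 168-group figure can be checked by bucketing one week of 1-minute timestamps by (hour, day_of_week). This sketch only illustrates the grouping idea; `seasonality_key` is a hypothetical helper, and the Monday = 0 weekday numbering is Python's convention, not necessarily detectkit's.

```python
from collections import Counter
from datetime import datetime, timedelta

def seasonality_key(ts):
    """Combined seasonality key: (hour of day, day of week with Monday = 0)."""
    return (ts.hour, ts.weekday())

start = datetime(2024, 1, 1)  # arbitrary start of a 7-day window
week = [start + timedelta(minutes=m) for m in range(10080)]
counts = Counter(seasonality_key(ts) for ts in week)

print(len(counts))           # number of distinct groups
print(set(counts.values()))  # samples per group in a 1-week window
```

With window_size: 10080 each of the 168 groups holds 60 points, comfortably above min_samples_per_group: 10; each point is then compared only against history from its own group.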
Example 7: Daily Active Users
Track user engagement.
Configuration

    name: daily_active_users
    interval: 1day

    query: |
      SELECT
        toDate(timestamp) AS timestamp,
        uniqExact(user_id) AS value
      FROM user_events
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 60  # 2 months
          min_samples: 30

    alerting:
      enabled: true
      channels:
        - slack_analytics
      consecutive_anomalies: 2
      direction: "down"  # Alert only on drops (increases are good)

Why Direction Filter
- Increases in DAU are positive (don't alert)
- Decreases are concerning (alert)
- MAD robust to occasional spikes/drops
Example 8: Daily Revenue Tracking
Monitor financial metrics.
Configuration

    name: daily_revenue_usd
    interval: 1day

    query: |
      SELECT
        toDate(timestamp) AS timestamp,
        SUM(amount_usd) AS value
      FROM transactions
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
        AND status = 'completed'
      GROUP BY timestamp
      ORDER BY timestamp

    # Extract day of week from timestamps (built-in)
    seasonality_columns:
      - day_of_week

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 90  # 3 months
          min_samples: 30
          seasonality_components:
            - "day_of_week"  # Different revenue on weekends

    alerting:
      enabled: true
      channels:
        - slack_finance
        - email_management
      consecutive_anomalies: 2
      direction: "down"  # Alert on revenue drops

Why Seasonality
- Revenue often varies by day of week
- Weekdays vs weekends have different patterns
- Prevents false positives on expected low-revenue days
Example 9: Conversion Rate Monitoring
Track funnel metrics.
Configuration

    name: signup_conversion_rate
    interval: 1hour

    query: |
      SELECT
        toStartOfHour(timestamp) AS timestamp,
        countIf(action = 'signup') / countIf(action = 'visit') AS value
      FROM user_events
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    # Extract hour of day from timestamps (built-in)
    seasonality_columns:
      - hour

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 672  # 4 weeks
          min_samples: 100
          seasonality_components:
            - "hour"

    alerting:
      enabled: true
      channels:
        - slack_growth
      consecutive_anomalies: 3
      direction: "down"  # Alert on conversion drops

Example 10: Gaming Metrics with Complex Seasonality
Monitor gaming metrics with multi-dimensional seasonality.
Configuration

    name: group_assigned_users_pct
    interval: 10min

    query_file: sql/group_assigned.sql

    # Seasonality columns come from the query itself (query_columns.seasonality),
    # so no timestamp-based extraction (seasonality_columns) is needed.
    query_columns:
      timestamp: period_time
      metric: group_assigned_users_pct
      seasonality:
        - offset_10minutes  # 0-143 (10-min offset in day)
        - league_day        # 1-3 (tournament day)

    loading_start_time: "2024-01-01 00:00:00"
    loading_batch_size: 2160  # 15 days

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 8640  # 60 days
          min_samples: 1000
          start_time: "2024-03-01 00:00:00"
          batch_size: 2160
          seasonality_components:
            - ["offset_10minutes", "league_day"]  # 432 unique combinations
          min_samples_per_group: 10

    alerting:
      enabled: true
      timezone: "Europe/Moscow"
      channels:
        - mattermost_analytics
      consecutive_anomalies: 3

Why Complex Seasonality
- Gaming metric with tournament schedule (3-day leagues)
- Different patterns for each 10-minute interval within each tournament day
- 432 unique groups (144 intervals × 3 days)
- Requires large window (60 days) to have enough samples per group
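The window sizing above is plain arithmetic and can be sanity-checked before committing a config. The helpers below are hypothetical, and the per-group estimate assumes every seasonality group is populated evenly across the window.

```python
def window_points(days, interval_minutes):
    """Number of datapoints that cover `days` at the given interval."""
    return days * 24 * 60 // interval_minutes

def samples_per_group(window_size, n_groups):
    """Average history available to each seasonality group."""
    return window_size // n_groups

window = window_points(days=60, interval_minutes=10)  # window_size
groups = 144 * 3                                      # 10-min offsets x league days
print(window, groups, samples_per_group(window, groups))  # 8640 432 20
```

At 20 samples per group the 60-day window leaves a margin of 2x over min_samples_per_group: 10; a 30-day window would sit exactly at the limit, with no slack for missing intervals.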
Example 11: Multi-Detector Strategy
Combine multiple detectors with different sensitivities.
Configuration

    name: critical_service_latency
    interval: 30s

    query: |
      SELECT
        toStartOfInterval(timestamp, INTERVAL 30 SECOND) AS timestamp,
        AVG(latency_ms) AS value
      FROM critical_service_logs
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      # Detector 1: Conservative (fewer false positives)
      - type: mad
        params:
          threshold: 5.0     # Very high threshold
          window_size: 2880  # 1 day
          min_samples: 100

      # Detector 2: Aggressive (catch subtle issues)
      - type: zscore
        params:
          threshold: 2.5     # Lower threshold
          window_size: 1440  # 12 hours
          min_samples: 100

      # Detector 3: Hard limit (SLA)
      - type: manual_bounds
        params:
          upper_bound: 1000  # Never exceed 1s

    alerting:
      enabled: true
      channels:
        - slack_critical
      min_detectors: 2          # Require 2 detectors to agree
      direction: "same"         # ...on the SAME direction (up or down)
      consecutive_anomalies: 3

Why Multiple Detectors
- Conservative detector (MAD with high threshold) for confidence
- Aggressive detector (Z-Score with low threshold) for early warning
- Hard limit for SLA compliance
- Requiring 2 to agree reduces false positives
Alert logic (min_detectors: 2, direction: "same", consecutive_anomalies: 3):
- MAD + Z-Score both detect “up” → counts toward the alert (high confidence)
- Manual bounds fires “up” + a statistical detector fires “up” → both vote “up”, so the quorum is met (the votes must point the SAME way)
- Only Z-Score detects → no alert (might be noise)
- One detector says “up” while another says “down” → no alert: they are two anomalies in opposite directions, not two votes for one direction (disagreement is not consensus)
- The 2-detector quorum must hold at each of the last 3 consecutive points (exactly one interval apart — a gap breaks the chain)
Note: with min_detectors: 2, a manual-bounds violation alone never
alerts — it is one vote in the quorum. If the hard limit must page
immediately on its own, monitor it as a separate metric (or a separate
alerting entry with min_detectors: 1, consecutive_anomalies: 1 —
but then any single detector can trigger that entry).
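The quorum rules listed above can be expressed as a small decision function. This is a sketch of the documented behavior, not detectkit's code: the names `quorum_met` and `should_alert` are hypothetical, each point carries an integer grid index, and a detector's vote is "up", "down", or None.

```python
from collections import Counter

def quorum_met(votes, min_detectors, direction):
    """votes maps detector name -> 'up' / 'down' / None for a single point."""
    dirs = [d for d in votes.values() if d is not None]
    if direction == "any":
        return len(dirs) >= min_detectors
    if direction == "same":
        counts = Counter(dirs)
        return bool(counts) and max(counts.values()) >= min_detectors
    return dirs.count(direction) >= min_detectors  # 'up' or 'down'

def should_alert(points, min_detectors, direction, consecutive, step=1):
    """points: list of (grid_index, votes). The last `consecutive` points must
    be exactly one interval apart and each must meet the quorum."""
    tail = points[-consecutive:]
    if len(tail) < consecutive:
        return False
    contiguous = all(b[0] - a[0] == step for a, b in zip(tail, tail[1:]))
    return contiguous and all(quorum_met(v, min_detectors, direction) for _, v in tail)

agree = {"mad": "up", "zscore": "up", "manual_bounds": None}
split = {"mad": "up", "zscore": "down", "manual_bounds": None}
bound_only = {"mad": None, "zscore": None, "manual_bounds": "up"}

print(should_alert([(1, agree), (2, agree), (3, agree)], 2, "same", 3))  # True
print(should_alert([(1, agree), (2, agree), (4, agree)], 2, "same", 3))  # False: gap
```

Under direction "same" the `split` votes do not form a quorum, while under "any" they would; `bound_only` never reaches a quorum of 2 on its own, matching the note above.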
Example 12: No-Data Alerts
Fire an alert when a metric stops producing data, e.g. the source ETL hung and no rows arrive for the latest interval.
Configuration

    name: hourly_revenue_usd
    description: Total revenue per hour, sourced from the orders ETL
    interval: 1hour

    query: |
      SELECT
        toStartOfHour(timestamp) AS timestamp,
        SUM(amount_usd) AS value
      FROM transactions
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
        AND status = 'completed'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 720  # 30 days
          min_samples: 100

    alerting:
      enabled: true
      channels:
        - mattermost_finance
        - email_oncall
      consecutive_anomalies: 2
      direction: "down"

      # No-data alert: fires if the previous full hour has no row
      no_data_alert: true
      template_no_data: |
        hourly_revenue stopped reporting
        Last expected hour: {timestamp} ({timezone})
        Likely cause: orders ETL is hung. Check the upstream job.
        {mentions}
      mentions: [oncall_data]
      alert_cooldown: "1hour"  # don't spam every cron tick

Why This Works
- Hourly cron schedule means a missing hour is a real signal, not noise
- no_data_alert independently checks the last full interval, so it fires even when there are no anomalies to evaluate against
- Custom template_no_data makes the on-call action obvious: responders don't need to guess what "no data" means
- alert_cooldown: "1hour" caps this block at one alert per hour, even if the ETL stays broken (anomaly and no-data alerts share the same cooldown state per alerting block)
When NOT to Use
- Naturally sparse metrics (events that don't happen every interval)
- High-cardinality slicing where empty buckets are normal
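The "previous full interval has no row" check can be sketched as follows. This is one plausible reading of the behavior described above, not detectkit's implementation: the helper names are hypothetical, and aligning the grid to the Unix epoch is an assumption.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def last_full_interval(now, interval):
    """Start of the most recent interval that has completely elapsed."""
    current_start = EPOCH + ((now - EPOCH) // interval) * interval
    return current_start - interval

def is_no_data(now, interval, timestamps):
    """True when no row exists for the last full interval."""
    return last_full_interval(now, interval) not in set(timestamps)

now = datetime(2024, 5, 1, 14, 20)
hour = timedelta(hours=1)
rows = [datetime(2024, 5, 1, 11), datetime(2024, 5, 1, 12)]

print(last_full_interval(now, hour))  # 2024-05-01 13:00:00
print(is_no_data(now, hour, rows))    # True: the 13:00 row is missing
```

The same check on a naturally sparse metric would fire on every quiet interval, which is why the cases listed above are poor fits.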
Example 13: Project-Level Error Alerting
Catch failures at the project level (DB outages, query timeouts, lock failures) that affect every metric in the run.
Configuration

    name: my_monitoring
    default_profile: prod

    paths:
      metrics: metrics
      sql: sql
      templates: templates

    # Catch pipeline crashes (one alert per dtk run, then abort)
    error_alerting:
      enabled: true
      channels:
        - mattermost_oncall
        - email_oncall
      mentions: [oncall_engineer, here]
      timezone: "Europe/Moscow"
      template: |
        detectkit pipeline failure
        Metric: {metric_name}
        {error_type}: {error_message}
        Time: {timestamp} ({timezone})
        {mentions}

Why This Works
- Without error_alerting, the run silently moves to the next metric on failure; ops only notices when expected anomaly alerts stop arriving (could be hours)
- One alert per dtk run (subsequent failures suppressed): if ClickHouse is down, you don't get 30 identical alerts
- Run aborts after the first error alert: no point loading the rest if the source is dead
- Channels reuse profiles.yml, so there is no config duplication
- Pair with cron exit-code monitoring: error_alerting covers in-process crashes, cron monitoring covers dtk run not running at all
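The "one alert, then abort" behavior can be pictured as a loop over metrics. This is a sketch of the documented behavior only: `run_metrics`, `load`, and `send_error_alert` are hypothetical names, not detectkit functions.

```python
def run_metrics(metrics, load, send_error_alert, error_alerting_enabled=True):
    """Process metrics in order. With error alerting on, the first failure
    sends a single alert and aborts the run; otherwise failures are skipped."""
    processed = []
    for name in metrics:
        try:
            load(name)
            processed.append(name)
        except Exception as exc:
            if not error_alerting_enabled:
                continue  # silently move on to the next metric
            send_error_alert(name, type(exc).__name__, str(exc))
            break  # remaining metrics would likely hit the same outage
    return processed

def failing_load(name):
    """Stand-in loader: everything after metric 'a' hits a dead database."""
    if name != "a":
        raise ConnectionError("db down")

alerts = []
done = run_metrics(["a", "b", "c"], failing_load, lambda *args: alerts.append(args))
print(done, alerts)
```

With three metrics and a dead source, exactly one alert is sent (for "b") and "c" is never attempted.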
Channel Config (in profiles.yml)

    alert_channels:
      mattermost_oncall:
        type: mattermost
        webhook_url: "https://mattermost.example.com/hooks/xxx"
        channel: "oncall-alerts"

      email_oncall:
        type: email
        smtp_host: smtp.gmail.com
        smtp_port: 587
        from_email: detectkit@example.com
        to_emails: [oncall@example.com]

Multiple Alerting Blocks
Route the same metric to several destinations with different rules by
giving alerting: as a YAML list instead of a single block. Each block
is fully independent, with its own channels, conditions, templates, cooldown,
and alert/recovery/cooldown state.

    name: api_latency_p95
    interval: 5min

    query: |
      SELECT
        timestamp,
        quantile(0.95)(response_time_ms) AS value
      FROM api_requests
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      GROUP BY timestamp
      ORDER BY timestamp

    detectors:
      - type: mad
        params: {threshold: 3.0, window_size: 288, min_samples: 100}
      - type: zscore
        params: {threshold: 2.5, window_size: 288, min_samples: 100}

    # alerting as a list of independent blocks
    alerting:
      # Route 1: page on-call fast (any single detector, short streak, tight cooldown)
      - channels: [slack_oncall]
        min_detectors: 1
        direction: "up"
        consecutive_anomalies: 2
        alert_cooldown: "10min"
        mentions: ["oncall", "here"]

      # Route 2: calmer team summary (require both detectors to agree, longer streak)
      - channels: [mattermost_team]
        min_detectors: 2
        direction: "same"
        consecutive_anomalies: 5
        alert_cooldown: "1hour"
        notify_on_recovery: true

The two routes fire and recover independently. See the full config in multi-alert-routing-example.yml and the Alerting Guide for routing details.
Common Patterns Summary
| Use Case | Detector | Seasonality | Consecutive | Direction |
|---|---|---|---|---|
| System Resources | Manual + Z-Score | No | 2-3 | up |
| API Latency | Manual + IQR | Optional | 3 | up |
| Error Rates | Manual | No | 1 | up |
| Traffic/Throughput | MAD | Yes (hour + day_of_week) | 3 | any |
| User Engagement | MAD | Optional | 2-3 | down |
| Revenue | MAD | Yes (day_of_week) | 2 | down |
| Conversion Rate | MAD | Yes (hour) | 3 | down |
Alerting Feature Comparison
| Feature | Config | Effect |
|---|---|---|
| Basic alert | consecutive_anomalies: 3 | Alert after 3 consecutive anomalies (default: 3; gaps in the grid break the chain) |
| Detector quorum | min_detectors: 2 | Require 2 detectors per the direction policy (default: 1) |
| Direction policy | direction: "same" | same (default): quorum must agree on one direction; any: every anomaly counts; up/down: only that direction counts |
| Cooldown | alert_cooldown: "30min" | No more than 1 alert per 30 min (default: none — a persisting anomaly re-alerts on every dtk run) |
| Cooldown reset | cooldown_reset_on_recovery: true | Cooldown resets when metric normalizes |
| Recovery notify | notify_on_recovery: true | “All clear” sent once per incident |
| Custom recovery | template_recovery: "..." | Custom message text for recovery |
| Single-anomaly template | template_single: "..." | Used when the alert has no streak (consecutive count ≤ 1); falls back to template_consecutive |
| Streak template | template_consecutive: "..." | Used for consecutive-anomaly alerts |
| Mentions | mentions: ["oncall", "here"] | @mention users/groups in alerts |
| Suppress | suppress_until: "2026-04-11 18:00:00" | Pause alerts until UTC time |
| No-data alert | no_data_alert: true | Fire when latest interval has no row |
| No-data template | template_no_data: "..." | Custom no-data message body |
| Multiple alert routes | alerting: as a YAML list | Each block independent (channels/conditions/templates/cooldown/state) |
| Rule-aware template vars (v0.9+) | {expected_range}, {min_detectors}, {direction_policy}, {consecutive_required}, {detector_count} | Surface the rule the alert fired with (the default message is now alert-centric) |
| Project errors | error_alerting: in detectkit_project.yml | Catch pipeline crashes (DB outage etc.) |
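The Cooldown and Cooldown reset rows can be pictured as a small gate placed in front of the notification channel. This is an illustrative sketch under stated assumptions, not detectkit's state handling: the `CooldownGate` class is hypothetical and keeps its state in memory.

```python
from datetime import datetime, timedelta

class CooldownGate:
    """Allow at most one alert per cooldown period."""

    def __init__(self, cooldown, reset_on_recovery=False):
        self.cooldown = cooldown
        self.reset_on_recovery = reset_on_recovery
        self.last_sent = None

    def allow(self, now):
        """Return True (and record the send) if no alert went out within the cooldown."""
        if self.last_sent is not None and now - self.last_sent < self.cooldown:
            return False
        self.last_sent = now
        return True

    def on_recovery(self):
        """Clear the cooldown when the metric normalizes, if configured to."""
        if self.reset_on_recovery:
            self.last_sent = None

t0 = datetime(2024, 5, 1, 12, 0)
gate = CooldownGate(timedelta(minutes=30))
print(gate.allow(t0))                          # True: first alert
print(gate.allow(t0 + timedelta(minutes=10)))  # False: inside cooldown
print(gate.allow(t0 + timedelta(minutes=30)))  # True: cooldown elapsed
```

With reset_on_recovery=True, calling on_recovery() lets a new incident alert immediately even if the previous alert was sent only minutes earlier.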
See Also
- Detectors Guide - Choosing the right detector
- Configuration Guide - All configuration options
- Alerting Guide - Alert configuration
- Quickstart Guide - Getting started