Metric Configuration

Files: metrics/*.yml

# Metric identification
name: cpu_usage
description: | # Optional: multi-line, surfaces as {description}
  CPU usage monitoring metric.
  Tracks system load over time.
tags: [critical, infrastructure] # Optional: drives `dtk run --select tag:critical`
profile: prod # Optional: override default_profile
enabled: true # Optional: disable metric

# Data loading
interval: 1min
query: |
  SELECT timestamp, cpu_percent AS value
  FROM system_metrics
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  ORDER BY timestamp
# Or use external SQL file
# query_file: sql/cpu_usage.sql

# Column mapping (optional)
query_columns:
  timestamp: timestamp
  metric: value

# Data loading options
loading_start_time: "2024-01-01 00:00:00"
loading_batch_size: 1440 # Load 1 day at a time

# Seasonality extraction (auto-extracted from timestamps)
seasonality_columns:
  - hour
  - day_of_week

# Detectors
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 1440
      min_samples: 100

# Alerting
alerting:
  enabled: true
  channels:
    - mattermost_ops
  consecutive_anomalies: 3

# Custom table names (optional)
tables:
  datapoints: _dtk_datapoints_cpu
  detections: _dtk_detections_cpu

Editing a metric after it already has data? A detector’s identity is a hash of its parameters, and each alerting block’s state is keyed by a hash of its functional fields. So changing a detector parameter (or seasonality_components), removing a detector, or changing/removing an alerting block leaves the old rows behind in _dtk_detections / _dtk_alert_states — the pipeline simply stops writing to them. Run dtk clean --select <metric> to preview and prune that orphaned data. Renamed or deleted the metric entirely? Use dtk clean --orphaned-metrics. (Datapoints are not orphaned by a parameter edit — they are keyed only by timestamp; use --full-refresh to reload those.)
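To see why a parameter edit orphans old rows, here is a minimal sketch of parameter-based identity. It is not detectkit's actual hashing scheme: the function name detector_id, the SHA-256 choice, the 12-character truncation, and the EXECUTION_ONLY set are all assumptions made for illustration. The only facts taken from this page are that the identity is a hash of the parameters and that start_time and batch_size are execution parameters outside the detector ID.

```python
import hashlib
import json

# Assumption: these two keys are the execution-only parameters, as labelled
# in the detector example on this page.
EXECUTION_ONLY = {"start_time", "batch_size"}


def detector_id(detector_type, params):
    """Return a short, stable identifier for a detector configuration (hypothetical)."""
    algorithm_params = {k: v for k, v in params.items() if k not in EXECUTION_ONLY}
    # sort_keys makes the hash independent of key order in the YAML file.
    payload = json.dumps({"type": detector_type, "params": algorithm_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


old_id = detector_id("mad", {"threshold": 3.0, "window_size": 1440})
new_id = detector_id("mad", {"threshold": 4.0, "window_size": 1440})
print(old_id != new_id)  # True: rows written under old_id are no longer updated
```

Under a scheme like this, the edited detector writes under a new identifier, so the rows stored under the old one stay in place until they are pruned with dtk clean.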

The name field is the unique metric identifier. It is used in:

  • CLI selectors (dtk run --select cpu_usage)
  • Database queries (WHERE metric_name = 'cpu_usage')
  • Logs and alerts

Must be unique across all metrics in the project.

The description field is a free-form description of the metric. It supports multi-line text (use a YAML block scalar |) and is surfaced in alert templates as the {description} and {description_line} variables.

The tags field holds labels for selecting metrics on the command line. Run all metrics carrying a tag with dtk run --select tag:<tag> (e.g., dtk run --select tag:critical). Tags may contain alphanumeric characters, underscores, and dashes; duplicates are rejected.
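The tag rules above can be checked before a config is committed. The following is a rough sketch, not detectkit's validator: validate_tags and TAG_PATTERN are hypothetical names, and treating "alphanumeric" as ASCII letters and digits is an assumption.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
TAG_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_tags(tags):
    """Raise ValueError on an invalid or duplicate tag; return the tags otherwise."""
    seen = set()
    for tag in tags:
        if not TAG_PATTERN.fullmatch(tag):
            raise ValueError(f"invalid tag: {tag!r}")
        if tag in seen:
            raise ValueError(f"duplicate tag: {tag!r}")
        seen.add(tag)
    return list(tags)


print(validate_tags(["critical", "infrastructure"]))  # ['critical', 'infrastructure']
```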

The profile field names the database profile to use for this metric. It overrides default_profile from the project config.

The enabled field controls whether the metric is active. Disabled metrics are skipped by dtk run.

The interval field sets the time interval between data points.

String format:

  • "1min", "5min", "10min"
  • "1hour", "2hours"
  • "1day", "7days"

Integer format (seconds):

  • 60 = 1 minute
  • 600 = 10 minutes
  • 3600 = 1 hour
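The two formats are equivalent, which a small conversion makes concrete. This is a sketch under stated assumptions: interval_to_seconds is a hypothetical helper, and it accepts only the unit spellings listed above, so it says nothing about which other spellings detectkit itself accepts.

```python
import re

# Only the unit spellings listed in this section are covered here.
UNIT_SECONDS = {"min": 60, "hour": 3600, "hours": 3600, "day": 86400, "days": 86400}


def interval_to_seconds(interval):
    """Convert an interval such as "10min" or 600 to a number of seconds."""
    if isinstance(interval, int):
        return interval
    match = re.fullmatch(r"(\d+)([a-z]+)", interval)
    if match is None or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unsupported interval: {interval!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(interval_to_seconds("10min"))  # 600
print(interval_to_seconds(3600))     # 3600
```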

The query field holds the inline SQL query that loads the data.

Built-in template variables (Jinja2, substituted by detectkit for every loading batch):

  • {{ dtk_start_time }} - Start of time range (inclusive), rendered as YYYY-MM-DD HH:MM:SS
  • {{ dtk_end_time }} - End of time range (exclusive), same format
  • {{ interval_seconds }} - Metric interval in seconds

Every query must constrain its time range using {{ dtk_start_time }} and {{ dtk_end_time }} — otherwise incremental and batched loading cannot work. The rendered values are plain datetime strings, so wrap them in quotes in SQL.
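The half-open range (start inclusive, end exclusive) is what lets consecutive batches meet without overlapping. The sketch below illustrates the idea only: batch_windows and render are hypothetical helpers, the way detectkit actually splits a load into batches is an assumption, and plain string replacement stands in for the Jinja2 rendering that detectkit performs.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # the rendered format documented above


def batch_windows(start, end, interval_seconds, batch_size):
    """Yield (start, end) strings covering [start, end) in half-open batches."""
    step = timedelta(seconds=interval_seconds * batch_size)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor.strftime(FMT), upper.strftime(FMT)
        cursor = upper  # the next batch starts exactly where this one ended


def render(query, start_time, end_time, interval_seconds):
    """Stand-in for Jinja2: substitute the three built-in variables."""
    return (
        query.replace("{{ dtk_start_time }}", start_time)
        .replace("{{ dtk_end_time }}", end_time)
        .replace("{{ interval_seconds }}", str(interval_seconds))
    )


QUERY = (
    "SELECT timestamp, cpu_percent AS value FROM system_metrics "
    "WHERE timestamp >= '{{ dtk_start_time }}' AND timestamp < '{{ dtk_end_time }}'"
)

windows = list(batch_windows(datetime(2024, 1, 1), datetime(2024, 1, 3, 12), 60, 1440))
for start_text, end_text in windows:
    print(render(QUERY, start_text, end_text, 60))
```

Because each upper bound is excluded from its batch and becomes the inclusive lower bound of the next one, no row is loaded twice and none is skipped.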

Required columns:

  • Timestamp column (default name: timestamp)
  • Metric value column (default name: value)
  • Optional seasonality columns (declare them in query_columns.seasonality)

Example:

SELECT
  timestamp,
  AVG(response_time_ms) AS value,
  EXTRACT(HOUR FROM timestamp) AS hour_of_day
FROM api_logs
WHERE timestamp >= '{{ dtk_start_time }}'
  AND timestamp < '{{ dtk_end_time }}'
GROUP BY timestamp, hour_of_day
ORDER BY timestamp

The query_file field is the path to an external SQL file (relative to sql_dir).

Mutually exclusive with query.

Example:

query_file: sql/complex_metric.sql

The query_columns field maps query column names to internal names.

query_columns:
  timestamp: time_interval # Query has "time_interval" column
  metric: metric_value # Query has "metric_value" column
  seasonality: # Query has these seasonality columns
    - hour_of_day
    - day_of_week

Defaults:

  • timestamp: "timestamp"
  • metric: "value"
  • seasonality: null

The loading_start_time field sets the start timestamp for the initial data load (UTC).

Format: "YYYY-MM-DD HH:MM:SS"

Used only when the metric has no saved datapoints yet. If it is not set and no --from date is passed on the command line, the initial load fails with an error — detectkit does not guess where your data begins. Once datapoints exist, subsequent runs resume from the last saved timestamp and this setting is ignored.

Example:

loading_start_time: "2024-01-01 00:00:00" # Start from Jan 1, 2024

The loading_batch_size field sets the number of rows to load per batch. This is useful for large datasets.

Example:

interval: 10min
loading_batch_size: 2160 # 15 days of 10-min intervals
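The batch size in the example follows from simple arithmetic: rows per day times the number of days. A small helper makes the calculation explicit; batch_size_for is a hypothetical name, and the sketch assumes the interval divides a day evenly.

```python
SECONDS_PER_DAY = 86400


def batch_size_for(days, interval_seconds):
    """Rows needed to cover `days` of data at the given interval."""
    rows_per_day = SECONDS_PER_DAY // interval_seconds
    return days * rows_per_day


print(batch_size_for(15, 600))  # 2160 (15 days of 10-minute intervals)
print(batch_size_for(1, 60))    # 1440 (1 day of 1-minute intervals)
```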

seasonality_columns (list of strings, optional)

Seasonality features auto-extracted from the timestamp for seasonal detection.

Available features:

  • hour: Hour of day (0-23)
  • day_of_week: Day of week (0=Monday, 6=Sunday)
  • day_of_month: Day of month (1-31)
  • month: Month (1-12)
  • is_weekend: Boolean (Saturday/Sunday)
  • is_holiday: Boolean (holiday calendar not implemented yet — always false)

Example:

seasonality_columns:
- hour
- day_of_week

These features are stored with each datapoint and can be referenced in detector seasonality_components.

Alternatively, return custom seasonality columns directly from the query and declare them in query_columns.seasonality — query-provided columns take precedence over seasonality_columns.
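For orientation, the listed features map directly onto fields of a timestamp. The sketch below is not detectkit's extraction code: seasonality_features is a hypothetical helper that follows the definitions in the list above, including is_holiday always being false.

```python
from datetime import datetime


def seasonality_features(ts):
    """Derive the documented seasonality features from one timestamp."""
    day_of_week = ts.weekday()  # Monday is 0, Sunday is 6
    return {
        "hour": ts.hour,
        "day_of_week": day_of_week,
        "day_of_month": ts.day,
        "month": ts.month,
        "is_weekend": day_of_week >= 5,  # Saturday or Sunday
        "is_holiday": False,  # no holiday calendar, per the list above
    }


features = seasonality_features(datetime(2024, 1, 6, 14, 30))  # a Saturday
print(features["hour"], features["day_of_week"], features["is_weekend"])  # 14 5 True
```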

The detectors field is a list of detector configurations. Each detector analyzes the metric independently.

Full parameter set for the windowed statistical detectors (mad, zscore, iqr — they share one implementation and accept identical parameters):

detectors:
  - type: mad # mad, zscore, iqr, manual_bounds
    params:
      # Algorithm parameters (all participate in the detector ID)
      threshold: 3.0 # defaults: mad 3.0, zscore 3.0, iqr 1.5
      window_size: 100 # trailing window in points (current point excluded)
      min_samples: 30 # min valid points in window to run detection
      seasonality_components: # default: null
        - "hour" # single component
        - ["hour", "day_of_week"] # or combined grouping
      min_samples_per_group: 10 # defaults: mad 10, zscore 3, iqr 4
      input_type: values # values | changes | absolute_changes | log_changes
      smoothing: null # null | ema | sma
      smoothing_alpha: 0.3 # EMA factor (0, 1]
      smoothing_window: 10 # SMA window in points
      window_weights: null # null (uniform) | exponential | linear
      half_life: null # for exponential weights: age at which a point's
                      # weight halves; int = points or duration string ("3d")
                      # (default when unset: max(window_size / 20, min_samples / 2))
      detrend: null # null | linear (robust in-window detrending)
      # Execution parameters (not part of the detector ID)
      start_time: "2024-01-01 00:00:00" # when detection starts
      batch_size: 500 # detection batch size
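The half_life comment packs two ideas into one line: the default value and the halving rule. The sketch below works both through for a half-life given in points. The function names are hypothetical, measuring age as "points behind the newest point in the window" is an assumption, and duration strings such as "3d" are not handled.

```python
def default_half_life(window_size, min_samples):
    """Default documented above: max(window_size / 20, min_samples / 2)."""
    return max(window_size / 20, min_samples / 2)


def exponential_weights(window_len, half_life):
    """Weights for a window ordered oldest first; a point's weight halves every half_life points of age."""
    # Assumption: the newest point in the window has age 0.
    return [0.5 ** (age / half_life) for age in range(window_len - 1, -1, -1)]


half_life = default_half_life(window_size=100, min_samples=30)
weights = exponential_weights(31, half_life)
print(half_life)     # 15.0
print(weights[-1])   # 1.0 (newest point)
print(weights[-16])  # 0.5 (15 points older)
```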

Parameter semantics — the σ-equivalent threshold scaling (MAD is scaled by 1.4826), detector-identity hashing, the deprecated weight_decay alias, and fail-fast validation timing — are documented once in Shared Detector Parameters.
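As a rough illustration of what the threshold, window_size, and min_samples parameters control, here is a bare-bones MAD check. It is not detectkit's detector, which also supports seasonality, smoothing, weighting, and detrending. The function name, the None return for "not enough data", and the zero-spread rule are assumptions; the 1.4826 scale factor and the trailing window that excludes the current point come from this page.

```python
from statistics import median

MAD_SCALE = 1.4826  # scales MAD so the threshold reads like a number of sigmas


def mad_is_anomaly(history, current, threshold=3.0, window_size=100, min_samples=30):
    """Return True/False for the current point, or None if the window is too small."""
    # Trailing window: the current point is not part of it.
    window = [v for v in history[-window_size:] if v is not None]
    if len(window) < min_samples:
        return None
    center = median(window)
    spread = MAD_SCALE * median([abs(v - center) for v in window])
    if spread == 0:
        # Assumption: with a perfectly flat window, any deviation counts.
        return current != center
    return abs(current - center) > threshold * spread


history = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10] * 5
print(mad_is_anomaly(history, 10.5))       # False
print(mad_is_anomaly(history, 50))         # True
print(mad_is_anomaly(history[:10], 50))    # None (fewer than min_samples points)
```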

See the Detectors Guide for choosing and tuning a detector.

The alerting field holds the alert configuration for the metric.

alerting:
  enabled: true # Enable/disable alerting
  suppress_until: null # Suppress alerts until UTC datetime (default: null)
  timezone: "Europe/Moscow" # Display timezone (default: UTC)
  channels: # List of channel names from profiles.yml
    - mattermost_ops
    - slack_critical
  # Dashboard / runbook links (v0.13.0)
  dashboard_url: null # Optional dashboard/runbook URL (default: null)
  links: {} # Extra "label: url" links (default: {})
  # Anomaly filtering
  min_detectors: 1 # Detectors that must satisfy the quorum per point (default: 1)
  direction: "same" # "same", "any", "up", "down" (default: "same")
  consecutive_anomalies: 3 # Consecutive quorum points to trigger (default: 3)
  # Alert cooldown - Prevent spam from persistent anomalies
  alert_cooldown: "30min" # Minimum time between alerts
                          # (default: null = re-alert on EVERY run!)
  cooldown_reset_on_recovery: true # Reset cooldown when metric recovers (default: true)
  # Recovery notifications
  notify_on_recovery: false # Send notification when metric stabilizes (default: false)
  template_recovery: null # Custom recovery message template (default: null)
  # Mentions (v0.3.8): tag users/groups in alerts
  mentions: [] # Plain usernames without @, e.g., ["oncall", "here"]
  # Missing data alert (v0.5.0)
  no_data_alert: false # Fire alert when last interval has no row (default: false)
  template_no_data: null # Custom no-data message template
  # Custom templates
  template_single: null # Used when consecutive_count <= 1
  template_consecutive: null # Used for streaks (falls back to template_single)
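The interaction between alert_cooldown and suppress_until can be pictured as a gate in front of the channels. This is a simplified sketch rather than detectkit's alert step: may_send_alert is a hypothetical function, the exact boundary comparisons are assumptions, and cooldown_reset_on_recovery is modelled only by the caller passing last_alert_at=None after a recovery.

```python
from datetime import datetime, timedelta


def may_send_alert(now, last_alert_at=None, cooldown=None, suppress_until=None):
    """Decide whether an alert that passed filtering may be sent now."""
    if suppress_until is not None and now < suppress_until:
        return False  # alerting is paused; load and detect keep running
    if cooldown is None or last_alert_at is None:
        return True  # no cooldown configured, or nothing sent yet
    return now - last_alert_at >= cooldown


now = datetime(2024, 1, 1, 12, 0)
half_hour = timedelta(minutes=30)
print(may_send_alert(now, now - timedelta(minutes=10), half_hour))  # False
print(may_send_alert(now, now - timedelta(minutes=45), half_hour))  # True
print(may_send_alert(now, now - timedelta(minutes=10), None))       # True
```

The last call shows why the default of null deserves attention: without a cooldown, every run in which the anomaly persists sends another alert.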

alerting: also accepts a list of blocks. Each block is dispatched independently and carries its own cooldown and alert state, so you can route the same metric to different channels with different rules (e.g., a noisy warning stream plus a strict on-call page).

alerting:
  - enabled: true
    channels: [slack_ops]
    consecutive_anomalies: 1 # warn early
    alert_cooldown: "15min"
  - enabled: true
    channels: [telegram_oncall]
    min_detectors: 2 # page only on a stronger signal
    consecutive_anomalies: 3
    alert_cooldown: "2h"

The single-dict form shown above is still supported and is treated as a list of one block. See the Alerting Guide for full details.

Alert filtering options (see the Alerting Guide for the full contract):

  • min_detectors: How many detectors must satisfy the direction policy at every point in the consecutive chain

    • 1 = One qualifying detector is enough
    • 2 = At least 2 detectors must qualify at each point
  • direction: Which anomalies count toward the quorum

    • "same" (default) = At least min_detectors detectors must agree on ONE direction at the latest point (up and down counted separately — disagreement is not consensus). The winning direction is locked for the whole consecutive chain.
    • "any" = Every anomaly counts regardless of direction (1 up + 1 down satisfies min_detectors: 2)
    • "up" = Only anomalies above the confidence interval count; “down” anomalies are ignored (they neither help nor block)
    • "down" = Only anomalies below the confidence interval count
  • consecutive_anomalies: Consecutive quorum points required

    • 1 = Alert on first anomaly
    • 3 = Alert after 3 consecutive anomalies (reduces false positives)
    • Points must be exactly one metric interval apart — a gap in the detection grid breaks the chain
  • alert_cooldown: Minimum time between alerts (e.g., "2h", 1800)

    • null (default) = no cooldown: a persisting anomaly re-alerts on every dtk run. Set a cooldown for production metrics.
    • No-data alerts and anomaly alerts share the same cooldown state per alert config block.
  • notify_on_recovery: Send notification when metric returns to normal

    • false = No recovery notifications (default)
    • true = Send one recovery notification per incident
  • template_recovery: Custom recovery message template

    • Supports the same variables as anomaly templates (incl. {expected_range} and the rule echo {min_detectors} / {direction_policy} / {consecutive_required}), plus {status}
    • Default template (alert-centric): "🟢 Alert cleared: {metric_name}\nThe alert condition no longer holds — the metric is back within expected bounds.\nRule: ...\n..."
  • suppress_until: Temporarily suppress alerts until a UTC datetime

    • null = No suppression (default)
    • "2026-04-11 18:00:00" = Suppress alerts until this UTC time
    • Load and detect steps continue running; only alerting is paused
    • Alerts auto-resume after the specified time — no need to edit config again
  • mentions: Users/groups to mention in alerts

    • Plain usernames without @ prefix (e.g., ["oncall_user", "here"])
    • Special keywords: here, channel, all for broadcast mentions
    • Each channel formats mentions in its native syntax
    • Available as {mentions} and {mentions_line} template variables
  • dashboard_url (v0.13.0): Optional dashboard/runbook URL

    • null (default) — no dashboard link
    • Surfaced as a first-class action on every channel: a clickable attachment title on Slack/Mattermost, an inline “Open dashboard” link on Telegram, and an “Open dashboard” button in email
    • Also available to custom templates as {dashboard_url} and {dashboard_line}
  • links (v0.13.0): Extra label: url links shown alongside dashboard_url

    • {} (default) — no extra links
    • Each entry is appended as a labelled link, e.g. {Runbook: 'https://...', Grafana: 'https://...'}
    • On webhook channels (Slack/Mattermost/generic) these render as compact clickable labels in one Links field — never raw URLs (since v0.16.1)
  • {help_url} / {help_line} (template variables, since v0.16.0): the “How to read this alert” stakeholder link carried on every alert. This is not a per-metric field — it is set project-wide via alert_help_url in detectkit_project.yml (tri-state: unset → official guide, URL → your runbook, false → hidden); see Configuration → alert_help_url. Available to custom templates as {help_url} / {help_line}, mirroring {dashboard_url} / {dashboard_line}.

alerting:
  channels: [mattermost_ops]
  dashboard_url: https://grafana.ops/d/api-errors
  links:
    Runbook: https://runbooks.ops/api-errors

  • no_data_alert (v0.5.0): Alert when the latest expected interval has no datapoint

    • false (default) — disabled
    • true — at the alert step, checks _dtk_datapoints for the last complete interval. If no row exists OR the row’s value is NULL / NaN, fires a dedicated alert with status=NO_DATA through the same channels. Honours alert_cooldown and suppress_until.
    • min_detectors and consecutive_anomalies deliberately do not apply — missing data is a single binary signal, not a per-detector vote.
    • Webhook channels render no-data alerts in amber (#F0AD4E) instead of red.
  • template_no_data (v0.5.0): Custom message body for no-data alerts

    • Default: "No data for metric: {metric_name}\n...Time: {timestamp}\nStatus: query returned no datapoint for the latest interval"
    • Variables: {metric_name}, {timestamp}, {timezone}, {description}, {description_line}, {mentions}, {mentions_line}, {status} (always "NO_DATA")
    • Avoid {value:.2f} / {confidence_interval} — there is no value for no-data alerts. The formatter falls back to the default template if your template uses a numeric format spec on a non-numeric value, but it’s cleaner not to rely on the fallback.
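The min_detectors, direction, and consecutive_anomalies rules described in the list above combine into one decision: walk back from the latest point and count how many consecutive points satisfy the quorum. The sketch below is one reading of those rules, not detectkit's code. The function names and the data layout are hypothetical, and preferring "up" when both directions reach the quorum at the latest point is an assumption the text does not settle.

```python
from datetime import datetime, timedelta


def quorum(directions, min_detectors, policy, locked=None):
    """Return the direction satisfying the quorum at one point, or None."""
    if policy == "any":
        return "any" if len(directions) >= min_detectors else None
    if policy in ("up", "down"):
        return policy if directions.count(policy) >= min_detectors else None
    # policy == "same": detectors must agree on one direction.
    candidates = [locked] if locked else ["up", "down"]  # tie-break is an assumption
    for direction in candidates:
        if directions.count(direction) >= min_detectors:
            return direction
    return None


def chain_length(points, interval, min_detectors=1, policy="same"):
    """Count consecutive quorum points ending at the latest one.

    points: (timestamp, directions of the anomalous detectors), oldest first.
    """
    length, locked, previous_ts = 0, None, None
    for ts, directions in reversed(points):
        if previous_ts is not None and previous_ts - ts != interval:
            break  # a gap in the detection grid breaks the chain
        winner = quorum(directions, min_detectors, policy, locked)
        if winner is None:
            break
        if policy == "same":
            locked = winner  # direction is locked from the latest point
        length += 1
        previous_ts = ts
    return length


t0, step = datetime(2024, 1, 1), timedelta(minutes=1)
points = [(t0, ["up"]), (t0 + step, ["up", "down"]), (t0 + 2 * step, ["up"])]
print(chain_length(points, step) >= 3)  # True: alert with consecutive_anomalies: 3
```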

The tables field overrides the default table names for this metric.

tables:
  datapoints: _dtk_datapoints_sales
  detections: _dtk_detections_sales

Use cases:

  • Separate critical metrics into dedicated tables
  • Organize metrics by team or service
  • Apply different retention policies

Note: The tasks table cannot be overridden (it is shared across all metrics).

See the Internal Tables reference for the full schema (columns, primary keys, engines) of every _dtk_* table.

The false_alert_budget field sets a target false-alert rate (FDR) for this metric: a fraction in (0, 1], e.g. 0.3 for “at most 30% of fired alerts should be false”. The dtk tune cockpit flags, gently, when the metric's false-alert rate exceeds this budget.

false_alert_budget: 0.3

Overrides the project-wide false_alert_budget; unset, the project default (then a built-in 0.5) applies. Tuning-only — it never affects the load/detect/alert pipeline, and labeling stays optional.
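The override order and the budget check amount to a few lines of arithmetic. The sketch below is illustrative only: both function names are hypothetical, and treating a metric with no fired alerts as within budget is an assumption.

```python
BUILT_IN_BUDGET = 0.5  # the built-in fallback mentioned above


def effective_budget(metric_budget=None, project_budget=None):
    """Metric setting wins, then the project default, then the built-in value."""
    for budget in (metric_budget, project_budget):
        if budget is not None:
            return budget
    return BUILT_IN_BUDGET


def exceeds_budget(false_alerts, fired_alerts, budget):
    """True when the share of false alerts among fired alerts is over budget."""
    if fired_alerts == 0:
        return False  # assumption: nothing fired, nothing to flag
    return false_alerts / fired_alerts > budget


budget = effective_budget(metric_budget=0.3, project_budget=0.4)
print(budget)                         # 0.3
print(exceeds_budget(4, 10, budget))  # True: 40% of alerts were false
print(exceeds_budget(2, 10, budget))  # False
```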

name: api_errors
interval: 1min
query: |
  SELECT
    timestamp,
    error_count AS value
  FROM logs
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  ORDER BY timestamp
detectors:
  - type: manual_bounds
    params:
      upper_bound: 10
alerting:
  enabled: true
  channels:
    - slack_critical
  consecutive_anomalies: 1 # Alert immediately

name: website_traffic
interval: 10min
query_file: sql/traffic.sql
# The query itself returns the seasonality columns
query_columns:
  timestamp: period_time
  metric: visitor_count
  seasonality:
    - hour_of_day
    - day_of_week
loading_start_time: "2024-01-01 00:00:00"
loading_batch_size: 2160 # 15 days
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 8640 # 60 days
      min_samples: 1000
      start_time: "2024-03-01 00:00:00"
      seasonality_components:
        - ["hour_of_day", "day_of_week"]
      min_samples_per_group: 10
      window_weights: exponential # favor recent data...
      half_life: "3d" # ...so gradual trends don't cause alert spam
alerting:
  enabled: true
  timezone: "Europe/Moscow"
  channels:
    - mattermost_ops
  min_detectors: 1
  direction: "same"
  consecutive_anomalies: 3

name: cpu_usage
interval: 30s
query: |
  SELECT timestamp, cpu_percent AS value
  FROM system_metrics
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  ORDER BY timestamp
detectors:
  # Hard limit: CPU should never exceed 95%
  - type: manual_bounds
    params:
      upper_bound: 95.0
  # Statistical: detect unusual patterns
  - type: mad
    params:
      threshold: 3.0
      window_size: 2880 # 1 day
      min_samples: 100
alerting:
  enabled: true
  channels:
    - slack_ops
  min_detectors: 1 # Alert if ANY detector triggers
  consecutive_anomalies: 2

1. Use External SQL Files for Complex Queries

# Good: Readable, maintainable
query_file: sql/daily_revenue.sql

# Avoid: Hard to read and maintain
query: |
  WITH daily_sales AS (
    SELECT ...
    FROM ...
    -- 50 lines of SQL
  )
  SELECT ...

# 10-minute interval, load 15 days at a time
interval: 10min
loading_batch_size: 2160 # 15 days × 144 intervals/day

Rule of thumb: load 7-30 days' worth of data per batch.

3. Use loading_start_time for Historical Metrics

# Don't load years of old data unnecessarily
loading_start_time: "2024-01-01 00:00:00"

metrics/
├── api_errors.yml
├── api_latency.yml
├── api_throughput.yml
└── database_cpu.yml

# Good
name: api_p95_latency_ms

# Avoid
name: metric1

Before adding a metric to detectkit, test its SQL query in your database client to confirm that it returns the expected data.

Add comments explaining non-obvious settings:

detectors:
  - type: mad
    params:
      threshold: 4.0 # Higher threshold due to noisy metric
      window_size: 8640 # 60 days to smooth seasonality