Metric Configuration

Files: metrics/*.yml

Basic Structure

# Metric identification
name: cpu_usage
description: |                   # Optional: multi-line, surfaces as {description}
  CPU usage monitoring metric.
  Tracks system load over time.
tags: [critical, infrastructure] # Optional: drives `dtk run --select tag:critical`
profile: prod                   # Optional: override default_profile
enabled: true                   # Optional: disable metric

# Data loading
interval: 1min
query: |
  SELECT timestamp, cpu_percent AS value
  FROM system_metrics
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  ORDER BY timestamp

# Or use external SQL file
# query_file: sql/cpu_usage.sql

# Column mapping (optional)
query_columns:
  timestamp: timestamp
  metric: value

# Data loading options
loading_start_time: "2024-01-01 00:00:00"
loading_batch_size: 1440         # Load 1 day at a time

# Seasonality extraction (auto-extracted from timestamps)
seasonality_columns:
  - hour
  - day_of_week

# Detectors
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 1440
      min_samples: 100

# Alerting
alerting:
  enabled: true
  channels:
    - mattermost_ops
  consecutive_anomalies: 3

# Custom table names (optional)
tables:
  datapoints: _dtk_datapoints_cpu
  detections: _dtk_detections_cpu

Editing a metric after it already has data? A detector's identity is a hash of its parameters, and each alerting block's state is keyed by a hash of its functional fields. Changing a detector parameter (including seasonality_components), removing a detector, or changing or removing an alerting block therefore leaves the old rows behind in _dtk_detections / _dtk_alert_states: the pipeline simply stops writing to them. Run dtk clean --select <metric> to preview and prune that orphaned data. If you renamed or deleted the metric entirely, use dtk clean --orphaned-metrics. Datapoints are not orphaned by a parameter edit because they are keyed only by timestamp; use --full-refresh to reload them.
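To see why a parameter edit orphans rows while an execution setting does not, here is a minimal sketch of a parameter-derived identity. The hash function, the canonical JSON form, the inclusion of the detector type, and the name detector_id are all assumptions made for illustration; detectkit's actual hashing scheme is not described on this page. The only grounded fact is that start_time and batch_size are execution parameters and do not take part in the detector ID.

```python
import hashlib
import json

# Documented as execution parameters that do not participate in the detector ID.
EXECUTION_PARAMS = {"start_time", "batch_size"}


def detector_id(detector_type, params):
    """Hypothetical identity hash: detector type plus its algorithm parameters."""
    algorithm_params = {k: v for k, v in params.items() if k not in EXECUTION_PARAMS}
    canonical = json.dumps(
        {"type": detector_type, "params": algorithm_params}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = {"threshold": 3.0, "window_size": 1440, "min_samples": 100}
# Changing an execution parameter keeps the same identity...
print(detector_id("mad", base) == detector_id("mad", {**base, "batch_size": 500}))  # True
# ...while changing an algorithm parameter produces a new one.
print(detector_id("mad", base) == detector_id("mad", {**base, "threshold": 4.0}))   # False
```

Under this model, rows written under the old identity are never updated again, which is the orphaned data that dtk clean prunes.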

Metric Identification

`name` (string, required)

Unique metric identifier. Used in:

CLI selectors (dtk run --select cpu_usage)
Database queries (WHERE metric_name = 'cpu_usage')
Logs and alerts

Must be unique across all metrics in the project.

`description` (string, optional)

Free-form description of the metric. Supports multi-line text (use a YAML block scalar |). Surfaced in alert templates as the {description} and {description_line} variables.

`tags` (list of strings, optional)

Labels for selecting metrics on the command line. Run all metrics carrying a tag with dtk run --select tag:<tag> (e.g., dtk run --select tag:critical). Tags allow alphanumeric characters, underscores, and dashes; duplicates are rejected.

`profile` (string, optional)

Database profile to use for this metric. Overrides default_profile from project config.

`enabled` (boolean, default: true)

Whether the metric is active. Disabled metrics are skipped by dtk run.

Data Loading

`interval` (string or int, required)

Time interval between data points.

String format:

"1min", "5min", "10min"
"1hour", "2hours"
"1day", "7days"

Integer format (seconds):

60 = 1 minute
600 = 10 minutes
3600 = 1 hour
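The two formats are interchangeable. A minimal sketch of the conversion, assuming a hypothetical helper named interval_to_seconds and only the unit spellings that appear on this page (detectkit's own parser may accept more or fewer):

```python
import re

# Unit spellings that appear on this page; the real parser may accept others.
UNIT_SECONDS = {
    "s": 1, "min": 60,
    "h": 3600, "hour": 3600, "hours": 3600,
    "d": 86400, "day": 86400, "days": 86400,
}


def interval_to_seconds(interval):
    """Convert an interval given as int seconds or as '<number><unit>'."""
    if isinstance(interval, int):
        return interval
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", interval.strip().lower())
    if match is None or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unrecognised interval: {interval!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(interval_to_seconds("10min"))  # 600
print(interval_to_seconds(3600))     # 3600
```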

`query` (string, optional)

Inline SQL query to load data.

Built-in template variables (Jinja2, substituted by detectkit for every loading batch):

{{ dtk_start_time }} - Start of time range (inclusive), rendered as YYYY-MM-DD HH:MM:SS
{{ dtk_end_time }} - End of time range (exclusive), same format
{{ interval_seconds }} - Metric interval in seconds

Every query must constrain its time range using {{ dtk_start_time }} and {{ dtk_end_time }} — otherwise incremental and batched loading cannot work. The rendered values are plain datetime strings, so wrap them in quotes in SQL.
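The sketch below shows what a rendered batch query looks like. detectkit renders with Jinja2; to stay dependency-free this stand-in uses a regular expression that handles only plain {{ name }} placeholders, and the function name render_query is hypothetical.

```python
import re
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def render_query(sql, start, end, interval_seconds):
    """Substitute the three built-in variables into a query template."""
    values = {
        "dtk_start_time": start.strftime(TIME_FORMAT),
        "dtk_end_time": end.strftime(TIME_FORMAT),
        "interval_seconds": str(interval_seconds),
    }
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: values[m.group(1)], sql)


template = (
    "SELECT timestamp, cpu_percent AS value FROM system_metrics "
    "WHERE timestamp >= '{{ dtk_start_time }}' AND timestamp < '{{ dtk_end_time }}'"
)
print(render_query(template, datetime(2024, 1, 1), datetime(2024, 1, 2), 60))
```

Because the rendered values are bare datetime strings, the quotes around each placeholder in the template are what make the result valid SQL.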

Columns the query returns:

Timestamp column (required; default name: timestamp)
Metric value column (required; default name: value)
Seasonality columns (optional; declare them in query_columns.seasonality)

Example:

SELECT
  timestamp,
  AVG(response_time_ms) AS value,
  EXTRACT(HOUR FROM timestamp) AS hour_of_day
FROM api_logs
WHERE timestamp >= '{{ dtk_start_time }}'
  AND timestamp < '{{ dtk_end_time }}'
GROUP BY timestamp, hour_of_day
ORDER BY timestamp

`query_file` (string, optional)

Path to external SQL file (relative to sql_dir).

Mutually exclusive with query.

Example:

query_file: sql/complex_metric.sql

`query_columns` (object, optional)

Map query column names to internal names.

query_columns:
  timestamp: time_interval      # Query has "time_interval" column
  metric: metric_value          # Query has "metric_value" column
  seasonality:                  # Query has these seasonality columns
    - hour_of_day
    - day_of_week

Defaults:

timestamp: "timestamp"
metric: "value"
seasonality: null

`loading_start_time` (string, optional)

Start timestamp for initial data load (UTC).

Format: "YYYY-MM-DD HH:MM:SS"

Used only when the metric has no saved datapoints yet. If it is not set and no --from date is passed on the command line, the initial load fails with an error — detectkit does not guess where your data begins. Once datapoints exist, subsequent runs resume from the last saved timestamp and this setting is ignored.

Example:

loading_start_time: "2024-01-01 00:00:00"  # Start from Jan 1, 2024

`loading_batch_size` (int, optional)

Number of rows to load per batch. Useful for large datasets.

Example:

interval: 10min
loading_batch_size: 2160  # 15 days of 10-min intervals
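As a rough illustration of how a batch size translates into time windows, the sketch below splits a load range into half-open [start, end) windows. It assumes one row per interval, so that loading_batch_size rows cover loading_batch_size × interval seconds; batch_windows is a hypothetical helper, not a detectkit function.

```python
from datetime import datetime, timedelta


def batch_windows(start, end, interval_seconds, batch_size):
    """Yield consecutive [window_start, window_end) pairs covering [start, end)."""
    span = timedelta(seconds=interval_seconds * batch_size)
    cursor = start
    while cursor < end:
        upper = min(cursor + span, end)  # the final window may be shorter
        yield cursor, upper
        cursor = upper


# 10-minute interval, 2160 rows per batch = 15 days per window
for lo, hi in batch_windows(datetime(2024, 1, 1), datetime(2024, 2, 1), 600, 2160):
    print(lo, "->", hi)
```

Each window would supply one pair of {{ dtk_start_time }} / {{ dtk_end_time }} values for the query.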

Seasonality Extraction

`seasonality_columns` (list of strings, optional)

Seasonality features auto-extracted from the timestamp for seasonal detection.

Available features:

hour: Hour of day (0-23)
day_of_week: Day of week (0=Monday, 6=Sunday)
day_of_month: Day of month (1-31)
month: Month (1-12)
is_weekend: Boolean (Saturday/Sunday)
is_holiday: Boolean (holiday calendar not implemented yet — always false)
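For reference, the features listed above can be derived from a timestamp as in this sketch. The function name seasonality_features is hypothetical and the code is not detectkit's extractor; it only mirrors the documented value ranges.

```python
from datetime import datetime


def seasonality_features(ts):
    """Derive the documented seasonality features from one timestamp."""
    return {
        "hour": ts.hour,               # 0-23
        "day_of_week": ts.weekday(),   # 0=Monday, 6=Sunday
        "day_of_month": ts.day,        # 1-31
        "month": ts.month,             # 1-12
        "is_weekend": ts.weekday() >= 5,
        "is_holiday": False,           # holiday calendar not implemented yet
    }


print(seasonality_features(datetime(2024, 1, 6, 14, 30)))  # a Saturday afternoon
```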

Example:

seasonality_columns:
  - hour
  - day_of_week

These features are stored with each datapoint and can be referenced in detector seasonality_components.

Alternatively, return custom seasonality columns directly from the query and declare them in query_columns.seasonality — query-provided columns take precedence over seasonality_columns.

Detectors

`detectors` (list, required)

List of detector configurations. Each detector independently analyzes the metric.

Full parameter set for the windowed statistical detectors (mad, zscore, iqr — they share one implementation and accept identical parameters):

detectors:
  - type: mad                     # mad, zscore, iqr, manual_bounds
    params:
      # Algorithm parameters (all participate in the detector ID)
      threshold: 3.0              # defaults: mad 3.0, zscore 3.0, iqr 1.5
      window_size: 100            # trailing window in points (current point excluded)
      min_samples: 30             # min valid points in window to run detection
      seasonality_components:     # default: null
        - "hour"                  # single component
        - ["hour", "day_of_week"] # or combined grouping
      min_samples_per_group: 10   # defaults: mad 10, zscore 3, iqr 4
      input_type: values          # values | changes | absolute_changes | log_changes
      smoothing: null             # null | ema | sma
      smoothing_alpha: 0.3        # EMA factor (0, 1]
      smoothing_window: 10        # SMA window in points
      window_weights: null        # null (uniform) | exponential | linear
      half_life: null             # for exponential weights: age at which a point's
                                  # weight halves; int = points or duration string ("3d")
                                  # (default when unset: max(window_size / 20, min_samples / 2))
      detrend: null               # null | linear (robust in-window detrending)

      # Execution parameters (not part of the detector ID)
      start_time: "2024-01-01 00:00:00"   # when detection starts
      batch_size: 500                     # detection batch size

Parameter semantics — the σ-equivalent threshold scaling (MAD is scaled by 1.4826), detector-identity hashing, the deprecated weight_decay alias, and fail-fast validation timing — are documented once in Shared Detector Parameters.
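To make threshold, window_size, and min_samples concrete, here is a minimal sketch of a trailing-window MAD check. It is not detectkit's implementation: it ignores seasonality, smoothing, weights, and detrending, and the function name mad_flags is hypothetical. The 1.4826 factor makes the MAD comparable to a standard deviation for normally distributed data, which is why the threshold reads as a σ-equivalent.

```python
from statistics import median

MAD_SCALE = 1.4826  # scales MAD to be comparable to a standard deviation


def mad_flags(values, window_size=100, min_samples=30, threshold=3.0):
    """Return True/False per point, or None where the window is too small."""
    flags = []
    for i, x in enumerate(values):
        window = values[max(0, i - window_size):i]  # trailing, current point excluded
        if len(window) < min_samples:
            flags.append(None)
            continue
        center = median(window)
        mad = median(abs(v - center) for v in window)
        half_width = threshold * MAD_SCALE * mad
        flags.append(abs(x - center) > half_width)
    return flags


series = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50]
print(mad_flags(series, window_size=8, min_samples=5))
```

In this toy series only the final value (50) falls outside the band; the first five points are skipped because fewer than min_samples points precede them.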

See the Detectors Guide for choosing and tuning a detector.
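The half_life comment in the parameter set above can be read as the following arithmetic. This is a sketch under stated assumptions: age is counted in points back from the newest point in the window (age 0), the function names are hypothetical, and how detectkit normalises or applies the weights is not covered here.

```python
def default_half_life(window_size, min_samples):
    """Documented default when half_life is unset."""
    return max(window_size / 20, min_samples / 2)


def exponential_weights(n_points, half_life):
    """Weights for a window ordered oldest to newest; weight halves every half_life points."""
    return [0.5 ** (age / half_life) for age in range(n_points - 1, -1, -1)]


print(default_half_life(window_size=100, min_samples=30))  # 15.0
weights = exponential_weights(31, half_life=15)
print(weights[-1], weights[-16], weights[0])  # 1.0 0.5 0.25
```

A duration string such as "3d" would first need converting to points; at a 10-minute interval, 3 days is 432 points.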

Alerting

`alerting` (object, optional)

Alert configuration for the metric.

alerting:
  enabled: true                  # Enable/disable alerting
  suppress_until: null           # Suppress alerts until UTC datetime (default: null)
  timezone: "Europe/Moscow"      # Display timezone (default: UTC)
  channels:                      # List of channel names from profiles.yml
    - mattermost_ops
    - slack_critical

  # Dashboard / runbook links (v0.13.0)
  dashboard_url: null            # Optional dashboard/runbook URL (default: null)
  links: {}                      # Extra "label: url" links (default: {})

  # Anomaly filtering
  min_detectors: 1               # Detectors that must satisfy the quorum per point (default: 1)
  direction: "same"              # "same", "any", "up", "down" (default: "same")
  consecutive_anomalies: 3       # Consecutive quorum points to trigger (default: 3)

  # Alert cooldown - Prevent spam from persistent anomalies
  alert_cooldown: "30min"        # Minimum time between alerts
                                 # (default: null = re-alert on EVERY run!)
  cooldown_reset_on_recovery: true  # Reset cooldown when metric recovers (default: true)

  # Recovery notifications
  notify_on_recovery: false      # Send notification when metric stabilizes (default: false)
  template_recovery: null        # Custom recovery message template (default: null)

  # Mentions (v0.3.8) — tag users/groups in alerts
  mentions: []                   # Plain usernames without @, e.g., ["oncall", "here"]

  # Missing data alert (v0.5.0)
  no_data_alert: false           # Fire alert when last interval has no row (default: false)
  template_no_data: null         # Custom no-data message template

  # Custom templates
  template_single: null          # Used when consecutive_count <= 1
  template_consecutive: null     # Used for streaks (falls back to template_single)

Multiple alerting blocks

alerting: also accepts a list of blocks. Each block is dispatched independently and carries its own cooldown and alert state, so you can route the same metric to different channels with different rules (e.g., a noisy warning stream plus a strict on-call page).

alerting:
  - enabled: true
    channels: [slack_ops]
    consecutive_anomalies: 1     # warn early
    alert_cooldown: "15min"
  - enabled: true
    channels: [telegram_oncall]
    min_detectors: 2             # page only on a stronger signal
    consecutive_anomalies: 3
    alert_cooldown: "2h"

The single-dict form shown above is still supported and is treated as a list of one block. See the Alerting Guide for full details.

Alert filtering options (see the Alerting Guide for the full contract):

min_detectors: How many detectors must satisfy the direction policy at every point in the consecutive chain
- 1 = One qualifying detector is enough
- 2 = At least 2 detectors must qualify at each point
direction: Which anomalies count toward the quorum
- "same" (default) = At least min_detectors detectors must agree on ONE direction at the latest point (up and down counted separately — disagreement is not consensus). The winning direction is locked for the whole consecutive chain.
- "any" = Every anomaly counts regardless of direction (1 up + 1 down satisfies min_detectors: 2)
- "up" = Only anomalies above the confidence interval count; "down" anomalies are ignored (they neither help nor block)
- "down" = Only anomalies below the confidence interval count
consecutive_anomalies: Consecutive quorum points required
- 1 = Alert on first anomaly
- 3 = Alert after 3 consecutive anomalies (reduces false positives)
- Points must be exactly one metric interval apart — a gap in the detection grid breaks the chain
alert_cooldown: Minimum time between alerts (e.g., "2h", 1800)
- null (default) = no cooldown: a persisting anomaly re-alerts on every dtk run. Set a cooldown for production metrics.
- No-data alerts and anomaly alerts share the same cooldown state per alert config block.
notify_on_recovery: Send notification when metric returns to normal
- false = No recovery notifications (default)
- true = Send one recovery notification per incident
template_recovery: Custom recovery message template
- Supports the same variables as anomaly templates (incl. {expected_range} and the rule echo {min_detectors} / {direction_policy} / {consecutive_required}), plus {status}
- Default template (alert-centric): "🟢 Alert cleared: {metric_name}\nThe alert condition no longer holds — the metric is back within expected bounds.\nRule: ...\n..."
suppress_until: Temporarily suppress alerts until a UTC datetime
- null = No suppression (default)
- "2026-04-11 18:00:00" = Suppress alerts until this UTC time
- Load and detect steps continue running; only alerting is paused
- Alerts auto-resume after the specified time — no need to edit config again
mentions: Users/groups to mention in alerts
- Plain usernames without @ prefix (e.g., ["oncall_user", "here"])
- Special keywords: here, channel, all for broadcast mentions
- Each channel formats mentions in its native syntax
- Available as {mentions} and {mentions_line} template variables
dashboard_url (v0.13.0): Optional dashboard/runbook URL
- null (default) — no dashboard link
- Surfaced as a first-class action on every channel: a clickable attachment title on Slack/Mattermost, an inline “Open dashboard” link on Telegram, and an “Open dashboard” button in email
- Also available to custom templates as {dashboard_url} and {dashboard_line}
links (v0.13.0): Extra label: url links shown alongside dashboard_url
- {} (default) — no extra links
- Each entry is appended as a labelled link, e.g. {Runbook: 'https://...', Grafana: 'https://...'}
- On webhook channels (Slack/Mattermost/generic) these render as compact clickable labels in one Links field — never raw URLs (since v0.16.1)
{help_url} / {help_line} (template variables, since v0.16.0): the “How to read this alert” stakeholder link carried on every alert. This is not a per-metric field — it is set project-wide via alert_help_url in detectkit_project.yml (tri-state: unset → official guide, URL → your runbook, false → hidden); see Configuration → alert_help_url. Available to custom templates as {help_url} / {help_line}, mirroring {dashboard_url} / {dashboard_line}.

alerting:
  channels: [mattermost_ops]
  dashboard_url: https://grafana.ops/d/api-errors
  links:
    Runbook: https://runbooks.ops/api-errors

no_data_alert (v0.5.0): Alert when the latest expected interval has no datapoint
- false (default) — disabled
- true — at the alert step, checks _dtk_datapoints for the last complete interval. If no row exists OR the row’s value is NULL / NaN, fires a dedicated alert with status=NO_DATA through the same channels. Honours alert_cooldown and suppress_until.
- min_detectors and consecutive_anomalies deliberately do not apply — missing data is a single binary signal, not a per-detector vote.
- Webhook channels render no-data alerts in amber (#F0AD4E) instead of red.
template_no_data (v0.5.0): Custom message body for no-data alerts
- Default: "No data for metric: {metric_name}\n...Time: {timestamp}\nStatus: query returned no datapoint for the latest interval"
- Variables: {metric_name}, {timestamp}, {timezone}, {description}, {description_line}, {mentions}, {mentions_line}, {status} (always "NO_DATA")
- Avoid {value:.2f} / {confidence_interval} — there is no value for no-data alerts. The formatter falls back to the default template if your template uses a numeric format spec on a non-numeric value, but it’s cleaner not to rely on the fallback.
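The interplay of min_detectors, direction, and consecutive_anomalies described in the filtering options can be sketched as a backwards walk over detection points. This is an illustration only: the data shape, the function names, and the tie-break when "up" and "down" both reach the quorum are assumptions, and the Alerting Guide remains the authoritative contract.

```python
from datetime import datetime, timedelta


def qualifying(directions, policy):
    """Count anomalies at one point that the policy accepts."""
    if policy == "any":
        return len(directions)
    return directions.count(policy)  # policy is "up" or "down"


def chain_length(points, interval, min_detectors=1, direction="same"):
    """Length of the quorum chain ending at the latest point.

    points: list of (timestamp, [direction of each anomalous detector]), oldest first.
    """
    if not points:
        return 0
    policy = direction
    if direction == "same":
        latest = points[-1][1]
        ups, downs = latest.count("up"), latest.count("down")
        if max(ups, downs) < min_detectors:
            return 0
        policy = "up" if ups >= downs else "down"  # tie-break is an assumption
    length = 0
    expected = points[-1][0]
    for ts, directions in reversed(points):
        # A gap in the grid or a point without quorum ends the chain.
        if ts != expected or qualifying(directions, policy) < min_detectors:
            break
        length += 1
        expected = ts - interval
    return length


t0, step = datetime(2024, 1, 1), timedelta(minutes=1)
points = [(t0, ["up"]), (t0 + step, ["up", "down"]), (t0 + 2 * step, ["up"])]
print(chain_length(points, step))  # 3
```

An alert would fire once chain_length reaches consecutive_anomalies; with the default of 3, the example above just qualifies.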

Custom Table Names

`tables` (object, optional)

Override default table names for this metric.

tables:
  datapoints: _dtk_datapoints_sales
  detections: _dtk_detections_sales

Use cases:

Separate critical metrics into dedicated tables
Organize metrics by team or service
Apply different retention policies

Note: tasks table cannot be overridden (shared across all metrics).

See the Internal Tables reference for the full schema (columns, primary keys, engines) of every _dtk_* table.

Tuning aids

`false_alert_budget` (float, optional)

A target false-alert rate (FDR) for this metric — a fraction in (0, 1], e.g. 0.3 for “at most 30% of fired alerts should be false”. The dtk tune cockpit flags, gently, when the metric’s false-alert rate exceeds this budget.

false_alert_budget: 0.3

Overrides the project-wide false_alert_budget. When unset, the project default applies, falling back to a built-in 0.5. This setting is tuning-only: it never affects the load/detect/alert pipeline, and labeling stays optional.
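The fallback order and the comparison can be summarised in a few lines. The function names are hypothetical, and the sketch assumes the false-alert rate is simply false alerts divided by fired alerts, as the definition above suggests; how dtk tune derives those counts from labels is not shown.

```python
def effective_budget(metric_budget=None, project_budget=None, builtin=0.5):
    """Per-metric value wins, then the project-wide value, then the built-in default."""
    for candidate in (metric_budget, project_budget):
        if candidate is not None:
            return candidate
    return builtin


def over_budget(false_alerts, fired_alerts, budget):
    """True when the share of false alerts among fired alerts exceeds the budget."""
    if fired_alerts == 0:
        return False  # nothing fired, nothing to flag
    return false_alerts / fired_alerts > budget


budget = effective_budget(metric_budget=0.3, project_budget=0.4)
print(budget, over_budget(4, 10, budget))  # 0.3 True
```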

Complete Examples

Simple Metric

name: api_errors
interval: 1min
query: |
  SELECT
    timestamp,
    error_count AS value
  FROM logs
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  ORDER BY timestamp

detectors:
  - type: manual_bounds
    params:
      upper_bound: 10

alerting:
  enabled: true
  channels:
    - slack_critical
  consecutive_anomalies: 1  # Alert immediately

Advanced Metric with Seasonality

name: website_traffic
interval: 10min
query_file: sql/traffic.sql

# The query itself returns the seasonality columns
query_columns:
  timestamp: period_time
  metric: visitor_count
  seasonality:
    - hour_of_day
    - day_of_week

loading_start_time: "2024-01-01 00:00:00"
loading_batch_size: 2160  # 15 days

detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 8640  # 60 days
      min_samples: 1000
      start_time: "2024-03-01 00:00:00"
      seasonality_components:
        - ["hour_of_day", "day_of_week"]
      min_samples_per_group: 10
      window_weights: exponential  # favor recent data...
      half_life: "3d"              # ...so gradual trends don't cause alert spam

alerting:
  enabled: true
  timezone: "Europe/Moscow"
  channels:
    - mattermost_ops
  min_detectors: 1
  direction: "same"
  consecutive_anomalies: 3

Multiple Detectors

name: cpu_usage
interval: 30s
query: |
  SELECT timestamp, cpu_percent AS value
  FROM system_metrics
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  ORDER BY timestamp

detectors:
  # Hard limit: CPU should never exceed 95%
  - type: manual_bounds
    params:
      upper_bound: 95.0

  # Statistical: detect unusual patterns
  - type: mad
    params:
      threshold: 3.0
      window_size: 2880  # 1 day
      min_samples: 100

alerting:
  enabled: true
  channels:
    - slack_ops
  min_detectors: 1  # Alert if ANY detector triggers
  consecutive_anomalies: 2

Best Practices

1. Use External SQL Files for Complex Queries

# Good: Readable, maintainable
query_file: sql/daily_revenue.sql

# Avoid: Hard to read and maintain
query: |
  WITH daily_sales AS (
    SELECT ...
    FROM ...
    -- 50 lines of SQL
  )
  SELECT ...

2. Set Appropriate Batch Sizes

# 10-minute interval, load 15 days at a time
interval: 10min
loading_batch_size: 2160  # 15 days × 144 intervals/day

Rule of thumb: 7-30 days' worth of data per batch.

3. Use `loading_start_time` for Historical Metrics

# Don't load years of old data unnecessarily
loading_start_time: "2024-01-01 00:00:00"

4. Group Related Metrics

metrics/
├── api_errors.yml
├── api_latency.yml
├── api_throughput.yml
└── database_cpu.yml

5. Use Descriptive Metric Names

# Good
name: api_p95_latency_ms

# Avoid
name: metric1

6. Test Queries Manually First

Before adding to detectkit, test SQL queries in your database client to ensure they return expected data.

7. Document Custom Configurations

Add comments explaining non-obvious settings:

detectors:
  - type: mad
    params:
      threshold: 4.0  # Higher threshold due to noisy metric
      window_size: 8640  # 60 days to smooth seasonality
