Metric Configuration
Files: metrics/*.yml
Basic Structure
    # Metric identification
    name: cpu_usage
    description: |                    # Optional: multi-line, surfaces as {description}
      CPU usage monitoring metric.
      Tracks system load over time.
    tags: [critical, infrastructure]  # Optional: drives `dtk run --select tag:critical`
    profile: prod                     # Optional: override default_profile
    enabled: true                     # Optional: disable metric

    # Data loading
    interval: 1min
    query: |
      SELECT
        timestamp,
        cpu_percent AS value
      FROM system_metrics
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      ORDER BY timestamp

    # Or use external SQL file
    # query_file: sql/cpu_usage.sql

    # Column mapping (optional)
    query_columns:
      timestamp: timestamp
      metric: value

    # Data loading options
    loading_start_time: "2024-01-01 00:00:00"
    loading_batch_size: 1440  # Load 1 day at a time

    # Seasonality extraction (auto-extracted from timestamps)
    seasonality_columns:
      - hour
      - day_of_week

    # Detectors
    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 1440
          min_samples: 100

    # Alerting
    alerting:
      enabled: true
      channels:
        - mattermost_ops
      consecutive_anomalies: 3

    # Custom table names (optional)
    tables:
      datapoints: _dtk_datapoints_cpu
      detections: _dtk_detections_cpu

Editing a metric after it already has data? A detector's identity is a hash of its parameters, and each alerting block's state is keyed by a hash of its functional fields. So changing a detector parameter (or seasonality_components), removing a detector, or changing/removing an alerting block leaves the old rows behind in _dtk_detections / _dtk_alert_states: the pipeline simply stops writing to them. Run dtk clean --select <metric> to preview and prune that orphaned data. Renamed or deleted the metric entirely? Use dtk clean --orphaned-metrics. (Datapoints are not orphaned by a parameter edit because they are keyed only by timestamp; use --full-refresh to reload those.)
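To see why a parameter edit orphans old rows, here is a minimal Python sketch of parameter-hash identity. It is not detectkit's actual scheme: the function name `detector_id`, the SHA-256/JSON encoding, and the 16-character truncation are all assumptions made for illustration. The only detail taken from this page is that start_time and batch_size are listed as execution parameters that do not participate in the detector ID.

```python
import hashlib
import json

# Taken from this page: start_time and batch_size are execution parameters
# and are not part of the detector ID. Everything else here is hypothetical.
EXECUTION_PARAMS = {"start_time", "batch_size"}


def detector_id(detector_type, params):
    """Hypothetical ID: hash of the detector type plus its algorithm parameters."""
    algorithm_params = {k: v for k, v in params.items() if k not in EXECUTION_PARAMS}
    payload = json.dumps({"type": detector_type, "params": algorithm_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


old_id = detector_id("mad", {"threshold": 3.0, "window_size": 1440, "batch_size": 500})
# Editing an algorithm parameter yields a new ID, so rows under old_id are orphaned.
new_id = detector_id("mad", {"threshold": 4.0, "window_size": 1440, "batch_size": 500})
# Editing an execution parameter leaves the ID unchanged.
same_id = detector_id("mad", {"threshold": 3.0, "window_size": 1440, "batch_size": 1000})
```

Under this model, rows written under `old_id` are never touched again once the threshold changes, which is the situation dtk clean is described as handling.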
Metric Identification
name (string, required)
Unique metric identifier. Used in:
- CLI selectors (dtk run --select cpu_usage)
- Database queries (WHERE metric_name = 'cpu_usage')
- Logs and alerts
Must be unique across all metrics in the project.
description (string, optional)
Free-form description of the metric. Supports multi-line text (use a YAML
block scalar |). Surfaced in alert templates as the {description} and
{description_line} variables.
tags (list of strings, optional)
Labels for selecting metrics on the command line. Run all metrics carrying a
tag with dtk run --select tag:<tag> (e.g., dtk run --select tag:critical).
Tags allow alphanumeric characters, underscores, and dashes; duplicates are
rejected.
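The tag rules above can be expressed as a short check. This is a sketch rather than detectkit's validator: `validate_tags` is a hypothetical name, and reading "alphanumeric" as ASCII letters and digits is an assumption.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
TAG_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_tags(tags):
    """Return the tags unchanged, or raise ValueError on a bad or repeated tag."""
    seen = set()
    for tag in tags:
        if not TAG_PATTERN.fullmatch(tag):
            raise ValueError(f"invalid tag: {tag!r}")
        if tag in seen:
            raise ValueError(f"duplicate tag: {tag!r}")
        seen.add(tag)
    return list(tags)
```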
profile (string, optional)
Database profile to use for this metric. Overrides default_profile from project config.
enabled (boolean, default: true)
Whether the metric is active. Disabled metrics are skipped by dtk run.
Data Loading
interval (string or int, required)
Time interval between data points.
String format:
- "1min", "5min", "10min"
- "1hour", "2hours"
- "1day", "7days"
Integer format (seconds):
- 60 = 1 minute
- 600 = 10 minutes
- 3600 = 1 hour
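The two formats describe the same quantity, as the following sketch shows by converting either form to seconds. `interval_to_seconds` is a hypothetical helper, not detectkit's parser. Its unit table is an assumption built from the strings listed above plus the 30s value used in an example later on this page; the real parser may accept more or fewer spellings.

```python
import re

# Assumed unit table, derived only from spellings that appear on this page.
UNIT_SECONDS = {
    "s": 1,
    "min": 60,
    "hour": 3600,
    "hours": 3600,
    "day": 86400,
    "days": 86400,
}


def interval_to_seconds(interval):
    """Convert an interval such as "10min" or 600 to a number of seconds."""
    if isinstance(interval, int):
        return interval  # integer form is already in seconds
    match = re.fullmatch(r"(\d+)([a-z]+)", interval.strip())
    if match is None or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unsupported interval: {interval!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]
```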
query (string, optional)
Inline SQL query to load data.
Built-in template variables (Jinja2, substituted by detectkit for every loading batch):
- {{ dtk_start_time }} - Start of time range (inclusive), rendered as YYYY-MM-DD HH:MM:SS
- {{ dtk_end_time }} - End of time range (exclusive), same format
- {{ interval_seconds }} - Metric interval in seconds
Every query must constrain its time range using {{ dtk_start_time }} and
{{ dtk_end_time }} — otherwise incremental and batched loading cannot
work. The rendered values are plain datetime strings, so wrap them in
quotes in SQL.
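The sketch below shows why a query must honour both bounds. A batched loader hands it a sequence of half-open windows, each starting where the previous one ended, so an unbounded query would return the same rows for every batch. The helper `batch_windows` and its exact splitting rule are assumptions for illustration; only the inclusive start, exclusive end, and YYYY-MM-DD HH:MM:SS rendering come from the text above.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # rendering described for dtk_start_time / dtk_end_time


def batch_windows(start, end, interval_seconds, batch_size):
    """Yield (start, end) strings for half-open windows of batch_size intervals."""
    step = timedelta(seconds=interval_seconds * batch_size)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor.strftime(TIME_FORMAT), upper.strftime(TIME_FORMAT)
        cursor = upper  # the next window starts exactly where this one ended


windows = list(
    batch_windows(datetime(2024, 1, 1), datetime(2024, 1, 3, 12), 60, 1440)
)
```

Because each window's end equals the next window's start, pairing `>=` on the start with `<` on the end keeps every row in exactly one batch.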
Required columns:
- Timestamp column (default name: timestamp)
- Metric value column (default name: value)
- Optional seasonality columns (declare them in query_columns.seasonality)
Example:
    SELECT
      timestamp,
      AVG(response_time_ms) AS value,
      EXTRACT(HOUR FROM timestamp) AS hour_of_day
    FROM api_logs
    WHERE timestamp >= '{{ dtk_start_time }}'
      AND timestamp < '{{ dtk_end_time }}'
    GROUP BY timestamp, hour_of_day
    ORDER BY timestamp

query_file (string, optional)
Path to external SQL file (relative to sql_dir).
Mutually exclusive with query.
Example:
    query_file: sql/complex_metric.sql

query_columns (object, optional)
Map query column names to internal names.

    query_columns:
      timestamp: time_interval  # Query has "time_interval" column
      metric: metric_value      # Query has "metric_value" column
      seasonality:              # Query has these seasonality columns
        - hour_of_day
        - day_of_week

Defaults:
- timestamp: "timestamp"
- metric: "value"
- seasonality: null
loading_start_time (string, optional)
Start timestamp for initial data load (UTC).
Format: "YYYY-MM-DD HH:MM:SS"
Used only when the metric has no saved datapoints yet. If it is not set and
no --from date is passed on the command line, the initial load fails with
an error — detectkit does not guess where your data begins. Once datapoints
exist, subsequent runs resume from the last saved timestamp and this setting
is ignored.
Example:
    loading_start_time: "2024-01-01 00:00:00"  # Start from Jan 1, 2024

loading_batch_size (int, optional)
Number of rows to load per batch. Useful for large datasets.
Example:
    interval: 10min
    loading_batch_size: 2160  # 15 days of 10-min intervals

Seasonality Extraction
seasonality_columns (list of strings, optional)
Seasonality features auto-extracted from the timestamp for seasonal detection.
Available features:
- hour: Hour of day (0-23)
- day_of_week: Day of week (0=Monday, 6=Sunday)
- day_of_month: Day of month (1-31)
- month: Month (1-12)
- is_weekend: Boolean (Saturday/Sunday)
- is_holiday: Boolean (holiday calendar not implemented yet, so always false)
Example:
    seasonality_columns:
      - hour
      - day_of_week

These features are stored with each datapoint and can be referenced in detector seasonality_components.
Alternatively, return custom seasonality columns directly from the query and declare them in query_columns.seasonality — query-provided columns take precedence over seasonality_columns.
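The listed features map directly onto fields of a timestamp. The sketch below shows one way to derive them in Python; `extract_seasonality` is a hypothetical helper and not detectkit's code. It follows the conventions stated above: Monday is 0, and is_holiday is always false.

```python
from datetime import datetime


def extract_seasonality(ts, columns):
    """Return the requested seasonality features for one timestamp."""
    features = {
        "hour": ts.hour,                  # 0-23
        "day_of_week": ts.weekday(),      # 0=Monday, 6=Sunday
        "day_of_month": ts.day,           # 1-31
        "month": ts.month,                # 1-12
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
        "is_holiday": False,              # no holiday calendar yet
    }
    return {name: features[name] for name in columns}


# 2024-01-06 is a Saturday.
row = extract_seasonality(datetime(2024, 1, 6, 13, 30), ["hour", "day_of_week", "is_weekend"])
```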
Detectors
detectors (list, required)
List of detector configurations. Each detector independently analyzes the metric.
Full parameter set for the windowed statistical detectors (mad, zscore, iqr — they share one implementation and accept identical parameters):
    detectors:
      - type: mad  # mad, zscore, iqr, manual_bounds
        params:
          # Algorithm parameters (all participate in the detector ID)
          threshold: 3.0             # defaults: mad 3.0, zscore 3.0, iqr 1.5
          window_size: 100           # trailing window in points (current point excluded)
          min_samples: 30            # min valid points in window to run detection
          seasonality_components:    # default: null
            - "hour"                     # single component
            - ["hour", "day_of_week"]    # or combined grouping
          min_samples_per_group: 10  # defaults: mad 10, zscore 3, iqr 4
          input_type: values         # values | changes | absolute_changes | log_changes
          smoothing: null            # null | ema | sma
          smoothing_alpha: 0.3       # EMA factor (0, 1]
          smoothing_window: 10       # SMA window in points
          window_weights: null       # null (uniform) | exponential | linear
          half_life: null            # for exponential weights: age at which a point's
                                     # weight halves; int = points or duration string ("3d")
                                     # (default when unset: max(window_size / 20, min_samples / 2))
          detrend: null              # null | linear (robust in-window detrending)

          # Execution parameters (not part of the detector ID)
          start_time: "2024-01-01 00:00:00"  # when detection starts
          batch_size: 500                    # detection batch size

Parameter semantics are documented once in Shared Detector Parameters: the
σ-equivalent threshold scaling (MAD is scaled by 1.4826), detector-identity
hashing, the deprecated weight_decay alias, and fail-fast validation timing.
See the Detectors Guide for choosing and tuning a detector.
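As a rough illustration of how threshold, window_size and min_samples interact for the mad detector, here is a textbook MAD check. It is a simplified sketch and not detectkit's implementation: the function names are hypothetical, and it ignores seasonality, smoothing, weights and detrending. The 1.4826 scale factor and the trailing window that excludes the current point are taken from this page.

```python
from statistics import median

MAD_SCALE = 1.4826  # scales MAD so the threshold reads like a sigma multiple


def mad_bounds(window, threshold=3.0):
    """Return (lower, upper) bounds: median +/- threshold * 1.4826 * MAD."""
    center = median(window)
    mad = median(abs(x - center) for x in window)
    half_width = threshold * MAD_SCALE * mad
    return center - half_width, center + half_width


def is_anomaly(history, value, window_size=100, min_samples=30, threshold=3.0):
    """Check value against the trailing window; history excludes the current point."""
    window = history[-window_size:]
    if len(window) < min_samples:
        return False  # not enough points to run detection
    lower, upper = mad_bounds(window, threshold)
    return value < lower or value > upper


history = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
lower, upper = mad_bounds(history)  # median 10, MAD 1
```

With this history the bounds sit about 4.45 either side of the median, so a value of 20 is flagged and a value of 12 is not. With the default min_samples of 30, ten points are too few and nothing is flagged.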
Alerting
alerting (object, optional)
Alert configuration for the metric.

    alerting:
      enabled: true               # Enable/disable alerting
      suppress_until: null        # Suppress alerts until UTC datetime (default: null)
      timezone: "Europe/Moscow"   # Display timezone (default: UTC)
      channels:                   # List of channel names from profiles.yml
        - mattermost_ops
        - slack_critical

      # Dashboard / runbook links (v0.13.0)
      dashboard_url: null         # Optional dashboard/runbook URL (default: null)
      links: {}                   # Extra "label: url" links (default: {})

      # Anomaly filtering
      min_detectors: 1            # Detectors that must satisfy the quorum per point (default: 1)
      direction: "same"           # "same", "any", "up", "down" (default: "same")
      consecutive_anomalies: 3    # Consecutive quorum points to trigger (default: 3)

      # Alert cooldown - Prevent spam from persistent anomalies
      alert_cooldown: "30min"     # Minimum time between alerts
                                  # (default: null = re-alert on EVERY run!)
      cooldown_reset_on_recovery: true  # Reset cooldown when metric recovers (default: true)

      # Recovery notifications
      notify_on_recovery: false   # Send notification when metric stabilizes (default: false)
      template_recovery: null     # Custom recovery message template (default: null)

      # Mentions (v0.3.8): tag users/groups in alerts
      mentions: []                # Plain usernames without @, e.g., ["oncall", "here"]

      # Missing data alert (v0.5.0)
      no_data_alert: false        # Fire alert when last interval has no row (default: false)
      template_no_data: null      # Custom no-data message template

      # Custom templates
      template_single: null       # Used when consecutive_count <= 1
      template_consecutive: null  # Used for streaks (falls back to template_single)

Multiple alerting blocks
alerting: also accepts a list of blocks. Each block is dispatched
independently and carries its own cooldown and alert state, so you can route
the same metric to different channels with different rules (e.g., a noisy
warning stream plus a strict on-call page).
    alerting:
      - enabled: true
        channels: [slack_ops]
        consecutive_anomalies: 1  # warn early
        alert_cooldown: "15min"
      - enabled: true
        channels: [telegram_oncall]
        min_detectors: 2          # page only on a stronger signal
        consecutive_anomalies: 3
        alert_cooldown: "2h"

The single-dict form shown above is still supported and is treated as a list of one block. See the Alerting Guide for full details.
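One way to picture "a single dict is treated as a list of one block" is a small normalisation step run before dispatch. The sketch below is an illustration under that reading; `normalize_alerting` is a hypothetical name, and treating a missing alerting key as an empty list is an assumption.

```python
def normalize_alerting(alerting):
    """Return the alerting config as a list of blocks."""
    if alerting is None:
        return []          # assumption: no alerting key means nothing to dispatch
    if isinstance(alerting, dict):
        return [alerting]  # single-dict form becomes a list of one block
    return list(alerting)


single = normalize_alerting({"enabled": True, "channels": ["mattermost_ops"]})
multiple = normalize_alerting([
    {"channels": ["slack_ops"], "consecutive_anomalies": 1},
    {"channels": ["telegram_oncall"], "min_detectors": 2},
])
```

After normalisation every block can be processed by the same loop, each with its own cooldown and alert state.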
Alert filtering options (see the Alerting Guide for the full contract):
- min_detectors: How many detectors must satisfy the direction policy at every point in the consecutive chain
  - 1 = One qualifying detector is enough
  - 2 = At least 2 detectors must qualify at each point
- direction: Which anomalies count toward the quorum
  - "same" (default) = At least min_detectors detectors must agree on ONE direction at the latest point (up and down counted separately; disagreement is not consensus). The winning direction is locked for the whole consecutive chain.
  - "any" = Every anomaly counts regardless of direction (1 up + 1 down satisfies min_detectors: 2)
  - "up" = Only anomalies above the confidence interval count; "down" anomalies are ignored (they neither help nor block)
  - "down" = Only anomalies below the confidence interval count
- consecutive_anomalies: Consecutive quorum points required
  - 1 = Alert on first anomaly
  - 3 = Alert after 3 consecutive anomalies (reduces false positives)
  - Points must be exactly one metric interval apart; a gap in the detection grid breaks the chain
- alert_cooldown: Minimum time between alerts (e.g., "2h", 1800)
  - null (default) = no cooldown: a persisting anomaly re-alerts on every dtk run. Set a cooldown for production metrics.
  - No-data alerts and anomaly alerts share the same cooldown state per alert config block.
- notify_on_recovery: Send notification when metric returns to normal
  - false = No recovery notifications (default)
  - true = Send one recovery notification per incident
- template_recovery: Custom recovery message template
  - Supports the same variables as anomaly templates (incl. {expected_range} and the rule echo {min_detectors}/{direction_policy}/{consecutive_required}), plus {status}
  - Default template (alert-centric): "🟢 Alert cleared: {metric_name}\nThe alert condition no longer holds — the metric is back within expected bounds.\nRule: ...\n..."
- suppress_until: Temporarily suppress alerts until a UTC datetime
  - null = No suppression (default)
  - "2026-04-11 18:00:00" = Suppress alerts until this UTC time
  - Load and detect steps continue running; only alerting is paused
  - Alerts auto-resume after the specified time, so there is no need to edit the config again
- mentions: Users/groups to mention in alerts
  - Plain usernames without @ prefix (e.g., ["oncall_user", "here"])
  - Special keywords: here, channel, all for broadcast mentions
  - Each channel formats mentions in its native syntax
  - Available as {mentions} and {mentions_line} template variables
- dashboard_url (v0.13.0): Optional dashboard/runbook URL
  - null (default): no dashboard link
  - Surfaced as a first-class action on every channel: a clickable attachment title on Slack/Mattermost, an inline "Open dashboard" link on Telegram, and an "Open dashboard" button in email
  - Also available to custom templates as {dashboard_url} and {dashboard_line}
- links (v0.13.0): Extra label: url links shown alongside dashboard_url
  - {} (default): no extra links
  - Each entry is appended as a labelled link, e.g. {Runbook: 'https://...', Grafana: 'https://...'}
  - On webhook channels (Slack/Mattermost/generic) these render as compact clickable labels in one Links field, never raw URLs (since v0.16.1)
- {help_url} / {help_line} (template variables, since v0.16.0): the "How to read this alert" stakeholder link carried on every alert. This is not a per-metric field; it is set project-wide via alert_help_url in detectkit_project.yml (tri-state: unset → official guide, URL → your runbook, false → hidden); see Configuration → alert_help_url. Available to custom templates as {help_url} / {help_line}, mirroring {dashboard_url} / {dashboard_line}.

    alerting:
      channels: [mattermost_ops]
      dashboard_url: https://grafana.ops/d/api-errors
      links:
        Runbook: https://runbooks.ops/api-errors

- no_data_alert (v0.5.0): Alert when the latest expected interval has no datapoint
  - false (default): disabled
  - true: at the alert step, checks _dtk_datapoints for the last complete interval. If no row exists OR the row's value is NULL/NaN, fires a dedicated alert with status=NO_DATA through the same channels. Honours alert_cooldown and suppress_until. min_detectors and consecutive_anomalies deliberately do not apply, because missing data is a single binary signal, not a per-detector vote.
  - Webhook channels render no-data alerts in amber (#F0AD4E) instead of red.
- template_no_data (v0.5.0): Custom message body for no-data alerts
  - Default: "No data for metric: {metric_name}\n...Time: {timestamp}\nStatus: query returned no datapoint for the latest interval"
  - Variables: {metric_name}, {timestamp}, {timezone}, {description}, {description_line}, {mentions}, {mentions_line}, {status} (always "NO_DATA")
  - Avoid {value:.2f} / {confidence_interval}: there is no value for no-data alerts. The formatter falls back to the default template if your template uses a numeric format spec on a non-numeric value, but it's cleaner not to rely on the fallback.
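The quorum and chain rules for min_detectors, direction and consecutive_anomalies can be approximated in a few lines. This is a simplified model and not detectkit's alert engine. The names `quorum_met` and `trailing_streak` are hypothetical, timestamps are assumed to be epoch seconds, and the sketch does not model the direction lock that "same" applies across a chain.

```python
def quorum_met(directions, min_detectors=1, policy="same"):
    """directions holds one "up" or "down" per detector that flagged the point."""
    ups = directions.count("up")
    downs = directions.count("down")
    if policy == "any":
        return ups + downs >= min_detectors
    if policy == "up":
        return ups >= min_detectors    # "down" anomalies neither help nor block
    if policy == "down":
        return downs >= min_detectors
    if policy == "same":
        # up and down are counted separately: disagreement is not consensus
        return ups >= min_detectors or downs >= min_detectors
    raise ValueError(f"unknown direction policy: {policy!r}")


def trailing_streak(quorum_timestamps, interval_seconds):
    """Length of the run of quorum points ending at the last timestamp.

    quorum_timestamps: ascending epoch seconds of points that met the quorum.
    """
    streak = 0
    previous = None
    for ts in reversed(quorum_timestamps):
        if previous is not None and previous - ts != interval_seconds:
            break  # a gap in the detection grid breaks the chain
        streak += 1
        previous = ts
    return streak
```

For example, one "up" and one "down" vote satisfy min_detectors: 2 under "any" but not under "same". Quorum points at 0, 60, 180, 240 and 300 seconds on a 60-second interval give a trailing streak of 3, because the gap between 60 and 180 breaks the chain.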
Custom Table Names
tables (object, optional)
Override default table names for this metric.

    tables:
      datapoints: _dtk_datapoints_sales
      detections: _dtk_detections_sales

Use cases:
- Separate critical metrics into dedicated tables
- Organize metrics by team or service
- Apply different retention policies
Note: tasks table cannot be overridden (shared across all metrics).
See the Internal Tables reference for the full
schema (columns, primary keys, engines) of every _dtk_* table.
Tuning aids
false_alert_budget (float, optional)
A target false-alert rate (FDR) for this metric: a fraction in (0, 1], e.g.
0.3 for “at most 30% of fired alerts should be false”. The dtk tune
cockpit flags, gently, when the metric’s false-alert rate exceeds this budget.
    false_alert_budget: 0.3

Overrides the project-wide false_alert_budget. When unset, the project default
applies, and if that is also unset a built-in 0.5 is used. This setting is
tuning-only: it never affects the load/detect/alert pipeline, and labeling
stays optional.
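The fallback order and the budget comparison can be summarised in a short sketch. `resolve_budget` and `exceeds_budget` are hypothetical names, and computing the rate as false alerts divided by all fired alerts is an assumption based on the "at most 30% of fired alerts should be false" wording. How dtk tune actually computes the rate from labels is not covered here.

```python
BUILT_IN_BUDGET = 0.5  # built-in fallback stated above


def resolve_budget(metric_budget=None, project_budget=None):
    """Metric-level value wins, then the project default, then the built-in 0.5."""
    for candidate in (metric_budget, project_budget):
        if candidate is not None:
            return candidate
    return BUILT_IN_BUDGET


def exceeds_budget(false_alerts, total_alerts, budget):
    """True when the share of false alerts among fired alerts is above the budget."""
    if total_alerts == 0:
        return False  # nothing fired, so there is no rate to compare
    return false_alerts / total_alerts > budget
```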
Complete Examples
Simple Metric

    name: api_errors
    interval: 1min
    query: |
      SELECT
        timestamp,
        error_count AS value
      FROM logs
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      ORDER BY timestamp

    detectors:
      - type: manual_bounds
        params:
          upper_bound: 10

    alerting:
      enabled: true
      channels:
        - slack_critical
      consecutive_anomalies: 1  # Alert immediately

Advanced Metric with Seasonality
    name: website_traffic
    interval: 10min
    query_file: sql/traffic.sql

    # The query itself returns the seasonality columns
    query_columns:
      timestamp: period_time
      metric: visitor_count
      seasonality:
        - hour_of_day
        - day_of_week

    loading_start_time: "2024-01-01 00:00:00"
    loading_batch_size: 2160  # 15 days

    detectors:
      - type: mad
        params:
          threshold: 3.0
          window_size: 8640  # 60 days
          min_samples: 1000
          start_time: "2024-03-01 00:00:00"
          seasonality_components:
            - ["hour_of_day", "day_of_week"]
          min_samples_per_group: 10
          window_weights: exponential  # favor recent data...
          half_life: "3d"              # ...so gradual trends don't cause alert spam

    alerting:
      enabled: true
      timezone: "Europe/Moscow"
      channels:
        - mattermost_ops
      min_detectors: 1
      direction: "same"
      consecutive_anomalies: 3

Multiple Detectors
    name: cpu_usage
    interval: 30s
    query: |
      SELECT
        timestamp,
        cpu_percent AS value
      FROM system_metrics
      WHERE timestamp >= '{{ dtk_start_time }}'
        AND timestamp < '{{ dtk_end_time }}'
      ORDER BY timestamp

    detectors:
      # Hard limit: CPU should never exceed 95%
      - type: manual_bounds
        params:
          upper_bound: 95.0

      # Statistical: detect unusual patterns
      - type: mad
        params:
          threshold: 3.0
          window_size: 2880  # 1 day
          min_samples: 100

    alerting:
      enabled: true
      channels:
        - slack_ops
      min_detectors: 1  # Alert if ANY detector triggers
      consecutive_anomalies: 2

Best Practices
1. Use External SQL Files for Complex Queries

    # Good: Readable, maintainable
    query_file: sql/daily_revenue.sql

    # Avoid: Hard to read and maintain
    query: |
      WITH daily_sales AS (
        SELECT ...
        FROM ...
        -- 50 lines of SQL
      )
      SELECT ...

2. Set Appropriate Batch Sizes

    # 10-minute interval, load 15 days at a time
    interval: 10min
    loading_batch_size: 2160  # 15 days × 144 intervals/day

Rule of thumb: 7-30 days worth of data per batch.
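The batch size is plain arithmetic: the number of days times the number of intervals per day. A small helper makes this easy to recompute for other intervals. `batch_size_for` is a hypothetical name used only for this illustration.

```python
SECONDS_PER_DAY = 86400


def batch_size_for(days, interval_seconds):
    """Number of intervals (rows) that cover the given number of days."""
    total_seconds = days * SECONDS_PER_DAY
    if total_seconds % interval_seconds:
        raise ValueError("the interval does not divide the period evenly")
    return total_seconds // interval_seconds
```

For a 10-minute interval (600 seconds), 15 days gives 2160 rows per batch, and the 7-30 day rule of thumb corresponds to a range of 1008 to 4320 rows.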
3. Use loading_start_time for Historical Metrics
    # Don't load years of old data unnecessarily
    loading_start_time: "2024-01-01 00:00:00"

4. Group Related Metrics

    metrics/
    ├── api_errors.yml
    ├── api_latency.yml
    ├── api_throughput.yml
    └── database_cpu.yml

5. Use Descriptive Metric Names
    # Good
    name: api_p95_latency_ms

    # Avoid
    name: metric1

6. Test Queries Manually First
Before adding a query to detectkit, test it in your database client to ensure it returns the expected data.
7. Document Custom Configurations
Add comments explaining non-obvious settings:

    detectors:
      - type: mad
        params:
          threshold: 4.0     # Higher threshold due to noisy metric
          window_size: 8640  # 60 days to smooth seasonality

See Also
- Detectors Guide - Detector-specific configuration
- Alerting Guide - Alert channels and templates
- CLI Reference - Command-line options