CLI Reference

Complete reference for the dtk command-line tool.

The dtk CLI provides dbt-like commands for managing metric monitoring:

Terminal window
dtk init <project> # Initialize new project
dtk init-claude # Set up Claude Code context for this folder
dtk run --select <selector> # Run metric pipeline
dtk autotune --select <sel> # Auto-configure a metric's detector from data
dtk tune --select <sel> # Interactively tune a detector, write it back
dtk test-alert <metric> # Test alert channels
dtk unlock --select <selector> # Clear a stuck pipeline lock
dtk clean --select <selector> # Prune data that no longer matches configs
dtk --version # Show version
dtk --help # Show help

Show the installed detectkit package version:

Terminal window
dtk --version

Output:

detectkit, version x.y.z

Show help for any command:

Terminal window
dtk --help
dtk run --help
dtk init --help

Initialize a new detectkit project.

Terminal window
dtk init <project_name> [OPTIONS]

project_name (required) Name of the project to create.

--target-dir, -d (default: .) Directory to create project in.

Create project in current directory:

Terminal window
dtk init my_monitoring

Create project in specific directory:

Terminal window
dtk init analytics --target-dir /opt/projects

The command creates the following structure (shown for my_monitoring):

my_monitoring/
├── detectkit_project.yml        # Project configuration
├── profiles.yml                 # Database connections & alert channels
├── README.md                    # Getting-started notes for the project
├── metrics/                     # Metric definitions
│   ├── .gitkeep
│   └── example_cpu_usage.yml    # Example metric to copy/edit
├── incidents/                   # Labeled incidents for supervised `dtk autotune`
│   └── example_cpu_usage.yml    # Example labels file to copy/edit
└── sql/                         # SQL query files
    └── .gitkeep

Set up Claude Code context for working with detectkit. Run it in the folder that holds your detectkit project(s) — it gives an AI assistant the context and tools to help you create metrics, tune detectors, configure alerts and run the pipeline natively.

Terminal window
dtk init-claude [OPTIONS]

--target-dir, -d (default: .) Folder holding your detectkit project(s) to set up.

<target>/
├── CLAUDE.md                      # created, or a managed detectkit block is
│                                  # injected/refreshed (your content is kept)
└── .claude/
    ├── rules/detectkit/           # reference docs the assistant reads on demand
    │   ├── overview.md
    │   ├── cli.md
    │   ├── project.md
    │   ├── metrics.md
    │   ├── detectors.md
    │   └── alerting.md
    └── skills/
        ├── dtk-setup-project/     # skill: configure profiles.yml (DB + channels)
        │   └── SKILL.md
        ├── dtk-new-metric/        # skill: scaffold a validated metric YAML
        │   └── SKILL.md
        └── dtk-feedback/          # skill: file a redacted bug/feature/feedback
            └── SKILL.md           # issue upstream (with your confirmation)

  • Idempotent. The detectkit block in CLAUDE.md lives between <!-- BEGIN detectkit … --> / <!-- END detectkit --> markers; re-running refreshes only that block and the managed files. Anything you write outside the markers is preserved. A re-run with no upstream change reports everything unchanged.
  • Versioned. The content ships with detectkit and tracks the installed version, so re-run dtk init-claude after upgrading to refresh the guidance to match the new release.
  • Works whether the folder holds one project or several side by side.
Terminal window
# Set up the current folder
dtk init-claude
# Set up a specific monitoring root
dtk init-claude --target-dir /opt/monitoring

After running, open the folder in Claude Code and ask it about your metrics, alerts or configs. Three skills come with it: dtk-setup-project (configure profiles.yml — the database connection and a first alert channel — so runs work end to end), dtk-new-metric (scaffold a validated metric YAML), and dtk-feedback (file a bug report, feature request, or feedback as a GitHub issue on the upstream repo — it collects the diagnostic context, redacts every secret, and asks you to confirm before submitting).


Run the metric processing pipeline.

Terminal window
dtk run --select <selector> [OPTIONS]

--select

Selector for metrics to run. Three selector types are supported:

1. Metric name (searches only the root metrics/ directory):

Terminal window
dtk run --select cpu_usage # Finds metrics/cpu_usage.yml
dtk run --select api_latency # Finds metrics/api_latency.yml

Note: When selecting by metric name (no path separators), do not include the .yml extension. It is added automatically.

2. Path pattern (glob - supports subdirectories):

Terminal window
# Select specific file with full path
dtk run --select "metrics/critical/cpu.yml"
# Select all metrics in a folder
dtk run --select "metrics/critical/*"
# Select all metrics recursively
dtk run --select "metrics/**/*.yml"
# Pattern matching
dtk run --select "api_*" # All metrics starting with "api_"

3. Tag selector (searches recursively):

Terminal window
# Select all metrics with "critical" tag
dtk run --select tag:critical
# Select metrics tagged as "api"
dtk run --select tag:api
# Select metrics tagged as "10min"
dtk run --select tag:10min

Tags must be configured in metric YAML files:

name: api_latency
tags: ["critical", "api", "10min"]
# ... rest of config

Uniqueness validation: All selected metrics are validated to ensure no duplicate metric names exist. If duplicates are found, an error is raised listing the conflicting files.

--exclude

Selector for metrics to exclude.

Terminal window
dtk run --select "*" --exclude "metrics/staging/*"

--steps

Pipeline steps to execute.

Available steps:

  • load - Load data from database
  • detect - Run anomaly detection
  • alert - Send alerts

Examples:

Terminal window
# All steps (default)
dtk run --select cpu_usage
# Load only
dtk run --select cpu_usage --steps load
# Detect and alert (skip load)
dtk run --select cpu_usage --steps detect,alert
# Detect only (no load, no alert)
dtk run --select cpu_usage --steps detect

--from

Start date for data loading.

Format: YYYY-MM-DD or YYYY-MM-DD HH:MM:SS

Terminal window
# Load from January 1, 2024
dtk run --select cpu_usage --from "2024-01-01"
# Load from specific timestamp
dtk run --select cpu_usage --from "2024-01-01 12:00:00"

Behavior:

  • Overrides metric’s loading_start_time config
  • Only affects load step
  • Timestamps are in UTC

--to

End date for data loading.

Format: YYYY-MM-DD or YYYY-MM-DD HH:MM:SS

Terminal window
# Load up to February 1, 2024
dtk run --select cpu_usage --from "2024-01-01" --to "2024-02-01"

Behavior:

  • Defaults to current time if not specified
  • Only affects load step
  • Timestamps are in UTC
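
The two accepted date formats can be checked before a value is passed to --from or --to. The following is a minimal sketch, not detectkit's own parser: parse_cli_timestamp is a hypothetical helper that accepts exactly the two documented formats and interprets the result as UTC.

```python
from datetime import datetime, timezone

# The two formats documented for --from / --to. The helper is illustrative only.
_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_cli_timestamp(value: str) -> datetime:
    """Parse 'YYYY-MM-DD' or 'YYYY-MM-DD HH:MM:SS' as a UTC datetime."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # try the next documented format
    raise ValueError(f"expected YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, got {value!r}")
```

A date without a time component resolves to midnight UTC of that day in this sketch.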

--full-refresh

Delete existing data and reload from scratch.

Terminal window
dtk run --select cpu_usage --full-refresh

Behavior (delete/reload is range-scoped to --from/--to):

  1. Deletes _dtk_datapoints and _dtk_detections rows in the half-open [--from, --to) window. All history is deleted only when neither --from nor --to is given. For the detect step, the upper bound is --to, or now when --to is omitted.
  2. Reloads data from --from (or loading_start_time when no --from) up to --to (or now)
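
The half-open window means a row stamped exactly at --from is deleted while a row stamped exactly at --to is kept. The sketch below illustrates that selection rule only. rows_in_refresh_window is a hypothetical name, and treating a missing bound as open is an assumption, since the text above fully specifies only the case where both bounds are absent.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def rows_in_refresh_window(
    timestamps: Iterable[datetime],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> List[datetime]:
    """Select timestamps inside the half-open window [start, end).

    A bound of None is treated as open (an assumption), so passing
    neither bound selects every row: the "all history" case.
    """
    return [
        ts
        for ts in timestamps
        if (start is None or ts >= start) and (end is None or ts < end)
    ]
```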

Use cases:

  • Fixing corrupted data
  • Changing data loading logic
  • Reprocessing with new detector configuration

Warning: This is a destructive operation. Use with caution.

--force

Ignore an existing task lock and run anyway.

Terminal window
dtk run --select cpu_usage --force

Behavior:

  • Skips the held-lock check (runs even if another lock is marked running)
  • Still takes ownership of the lock for the duration of the run and releases it on exit — so a --force run also clears a previously stuck lock
  • Allows concurrent runs (not recommended)

Warning: Can cause data corruption if multiple processes run simultaneously.

Note: You usually don’t need --force to recover from a crash. A running lock left behind by a dead process (e.g. the database restarted mid-run) auto-expires after its timeout (1 hour) and is overridden by the next normal run. To clear a stuck lock immediately, use dtk unlock instead of --force.
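
The lock rules described here (a running lock blocks a normal run, a stale one is overridden, --force skips the check) can be summarized in a small decision function. This is a sketch of the documented behavior, not detectkit's implementation. The function name, its parameters, and the choice to treat a lock aged exactly one hour as expired are assumptions.

```python
from datetime import datetime, timedelta

LOCK_TIMEOUT = timedelta(hours=1)  # timeout stated in the note above


def lock_blocks_run(
    status: str,
    locked_at: datetime,
    now: datetime,
    force: bool = False,
    timeout: timedelta = LOCK_TIMEOUT,
) -> bool:
    """Return True if an existing lock row should stop a new run."""
    if force:
        return False  # --force skips the held-lock check
    if status != "running":
        return False  # a completed lock never blocks
    # A running lock blocks only while it is younger than the timeout.
    return now - locked_at < timeout
```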

--profile

Override the default profile from project config.

Terminal window
dtk run --select cpu_usage --profile staging

Use cases:

  • Testing with different database
  • Running against multiple environments

--report

After the run, write a self-contained HTML report per selected metric. It covers values, each detector’s confidence band, the flagged anomalies, the alerts that fired (anomaly / recovery / no-data) and a summary, with a client-side period selector (24h / 7d / 30d / All + zoom/pan). The report is offline: the chart and data are inlined into one file, so nothing is fetched and nothing leaves the page.

Terminal window
# Default path: reports/<metric>.html
dtk run --select cpu_usage --report
# Into a directory: <dir>/<metric>.html
dtk run --select cpu_usage --report reports/
# Into a specific file
dtk run --select cpu_usage --report cpu.html

Behavior:

  • Bare --report → reports/<metric>.html; a directory → <dir>/<metric>.html; a .html path → that exact file.
  • Reads the persisted _dtk_datapoints / _dtk_detections, so it works even on a --steps load (or any partial) run, charting whatever is already stored.
  • Best-effort: a report failure is reported and does not fail the run.
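
The path rules for --report can be written down as a small function. This sketch is illustrative: resolve_report_path is a hypothetical name, and deciding "file vs. directory" by the .html suffix is an assumption, since the CLI's actual check is not described here.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_report_path(metric: str, report_arg: Optional[str] = None) -> str:
    """Map a --report argument to the output file for one metric."""
    if not report_arg:
        return f"reports/{metric}.html"  # bare --report
    if report_arg.endswith(".html"):
        return report_arg  # explicit file path, used as-is
    # Anything else is treated as a directory (assumption).
    return str(PurePosixPath(report_arg) / f"{metric}.html")
```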

Advanced — alerts are reconstructed, not read from state. _dtk_alert_states stores last-writer-wins cooldown/recovery bookkeeping, not an event log, so the report cannot read past alerts from it. Instead it replays the real decision logic (quorum, consecutive_anomalies, cooldown, recovery, no-data) over the stored detections to reconstruct the timeline. This is faithful to the rules, but because cooldown suppression depends on when the live pipeline ran (run cadence), the set of suppressed repeat alerts a live run dispatched can differ slightly from the replay, which evaluates every grid point causally. The anomalies, bands, and which incidents fired are unaffected.

Understanding how metric selection works is important to avoid confusion:

Two different identifiers:

  1. File name (e.g., metrics/cpu.yml) - where config is stored
  2. Metric name (e.g., name: cpu_usage in YAML) - identifier used in database

Important: detectkit uses metric name (from config) for all operations:

  • Database table rows are keyed by metric_name
  • Task locking uses metric_name
  • Display shows metric_name (not file name)

Best practice: Keep file names and metric names consistent:

# metrics/cpu_usage.yml
name: cpu_usage # Matches file name (recommended)

# metrics/cpu.yml
name: server_cpu_usage # Confusing - file name doesn't match

Metric names MUST be unique across the entire project.

Why uniqueness matters:

  • Database tables use metric_name as PRIMARY KEY component
  • Duplicate names cause data to mix from different sources
  • Task locking conflicts prevent metrics from running
  • Anomaly detection becomes invalid (mixed data)

Example of invalid configuration:

# metrics/api/cpu.yml
name: cpu_usage # Duplicate name!
query: "SELECT * FROM api_metrics"

# metrics/system/cpu.yml
name: cpu_usage # Same name causes data corruption!
query: "SELECT * FROM system_metrics"

Validation: detectkit automatically validates uniqueness when selecting metrics. If duplicates are found:

Error: Duplicate metric name 'cpu_usage' found:
- metrics/api/cpu.yml
- metrics/system/cpu.yml
Metric names must be unique across the project.
Please rename one of the metrics to avoid data corruption.
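
The uniqueness check amounts to grouping config files by their name: value and reporting any name that appears more than once. A minimal sketch follows, assuming the file-to-name mapping has already been read from the YAML files. find_duplicate_names is a hypothetical helper, not part of detectkit.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_names(names_by_file: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each metric name used by more than one file to those files."""
    files_by_name: Dict[str, List[str]] = defaultdict(list)
    for path, name in names_by_file.items():
        files_by_name[name].append(path)
    # Keep only names that are claimed by two or more config files.
    return {
        name: sorted(paths)
        for name, paths in files_by_name.items()
        if len(paths) > 1
    }
```

An empty result means every metric name in the project is unique.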

Solution - use unique names:

# metrics/api/cpu.yml
name: api_cpu_usage # Unique

# metrics/system/cpu.yml
name: system_cpu_usage # Unique

Selector type  | Example             | Searches           | Extension
Metric name    | cpu_usage           | Root metrics/ only | Auto-added
Path with /    | metrics/api/cpu.yml | Glob pattern       | Keep as-is
Pattern with * | api_*               | Glob pattern       | Keep as-is
Tag            | tag:critical        | Recursive search   | N/A

Common mistakes:

  • dtk run --select cpu_usage.yml → Won’t work (searches for metrics/cpu_usage.yml.yml)
  • dtk run --select cpu_usage → Correct (searches for metrics/cpu_usage.yml)
  • dtk run --select "metrics/cpu_usage.yml" → Also works (explicit path)
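
The selector rules can be read as a three-way classification. The sketch below mirrors the documented behavior, including the cpu_usage.yml mistake. classify_selector is a hypothetical function written for illustration and is not detectkit's resolver.

```python
from typing import Tuple


def classify_selector(selector: str) -> Tuple[str, str]:
    """Return (kind, target) for a --select value."""
    if selector.startswith("tag:"):
        return ("tag", selector[len("tag:"):])  # searched recursively
    if "/" in selector or "*" in selector:
        return ("glob", selector)  # path or pattern, kept as-is
    # Plain metric name: root metrics/ only, extension added automatically.
    return ("name", f"metrics/{selector}.yml")
```

Passing "cpu_usage.yml" yields ("name", "metrics/cpu_usage.yml.yml"), which is why the first form in the list above finds nothing.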

Run single metric:

Terminal window
dtk run --select cpu_usage

Run all metrics:

Terminal window
dtk run --select "*"

Run metrics matching pattern:

Terminal window
dtk run --select "api_*"

Load data only (skip detection):

Terminal window
dtk run --select cpu_usage --steps load

Run detection only (skip load and alert):

Terminal window
dtk run --select cpu_usage --steps detect

Run detection and alert (skip load):

Terminal window
dtk run --select cpu_usage --steps detect,alert

Load data from specific date:

Terminal window
dtk run --select cpu_usage --from "2024-01-01"

Load specific date range:

Terminal window
dtk run --select cpu_usage \
--from "2024-01-01" \
--to "2024-02-01"

Delete and reload all data:

Terminal window
dtk run --select cpu_usage --full-refresh

Full refresh with custom start date:

Terminal window
dtk run --select cpu_usage \
--full-refresh \
--from "2024-01-01"

Run multiple metrics by pattern:

Terminal window
dtk run --select "metrics/critical/*.yml"

Run all except staging:

Terminal window
dtk run --select "*" --exclude "metrics/staging/*"

Run against staging database:

Terminal window
dtk run --select cpu_usage --profile staging

Force run if previous run crashed:

Terminal window
dtk run --select cpu_usage --force

Each run renders as a load → detect → alert tree per metric:

Project root: /path/to/project
Found 1 metric(s) to process
Processing metric: cpu_usage
Config file: metrics/cpu_usage.yml
Steps: load, detect, alert
┌─ LOAD
│ Resuming from last saved: 2024-03-15 09:50:00
│ Loading from 2024-03-15 10:00:00 to 2024-03-15 10:00:00
│ Total points: ~1,440 | Batch size: 2,160
│ Loading in single batch...
└─ Loaded 1,440 datapoints
✓ Pipeline completed successfully

On failure the tree ends with a red ✗ Failed: … line instead of ✓ Pipeline completed successfully.


Automatically configure a metric’s detector from its data — and, if you supply them, from labeled incidents. Searches detector type × hyperparameters × seasonality grouping × history window (× alert window, when supervised), cross-validates each candidate with walk-forward folds, and writes a new, annotated metric YAML. It is a separate pipeline from load → detect → alert: it never edits the original config and never sends alerts.

Terminal window
dtk autotune --select <selector> [OPTIONS]

--select

Metric selector, with the same semantics as dtk run (metric name, path pattern, or tag:<name>). Tuning reads the metric’s already-loaded _dtk_datapoints; if it has none yet, load it first (optionally backfilling more history, which tunes better):

Terminal window
dtk run --select api_error_rate --steps load --from "2026-01-01"

--incidents

Path to a labels file of known incidents, which enables supervised tuning. Without it (and without an autotune.labels_file in the metric config), an interactive terminal first prompts you to enter incidents inline. Declining, or running non-interactively (cron/CI/piped input), falls back to an unsupervised objective (low false-positive rate + stable cross-fold separation). Supervised mode engages only if labeled timestamps land on loaded grid points. The file is YAML or JSON, all times UTC, each incident an interval ({start, end}) or a point ({at}):

metric: api_error_rate # optional; must match the metric being tuned
timezone: UTC # optional; interprets the naive times below
incidents:
- {start: "2026-05-02 14:00:00", end: "2026-05-02 16:30:00"}
- {at: "2026-05-11 09:05:00"}
Terminal window
dtk autotune --select api_error_rate --incidents incidents/api_error_rate.yml
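
Because the labels file may be JSON, it can be generated from a script. The sketch below builds one and checks each incident against the two documented shapes. build_labels and validate_incident are hypothetical helpers, and the assumption is that the JSON form uses the same keys as the YAML example above.

```python
import json
from typing import Any, Dict, List


def validate_incident(incident: Dict[str, Any]) -> str:
    """Return 'interval' or 'point', or raise ValueError for any other shape."""
    keys = set(incident)
    if keys == {"start", "end"}:
        return "interval"
    if keys == {"at"}:
        return "point"
    raise ValueError(f"expected {{start, end}} or {{at}}, got {sorted(keys)}")


def build_labels(metric: str, incidents: List[Dict[str, Any]]) -> str:
    """Serialize a labels document after validating every incident."""
    for incident in incidents:
        validate_incident(incident)
    document = {"metric": metric, "timezone": "UTC", "incidents": incidents}
    return json.dumps(document, indent=2)
```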

--label

Open the interactive labeler to mark incidents visually, then tune on them in the same command. By default it starts a local 127.0.0.1 browser labeler; Save & tune writes a versioned file into incidents/<metric>/ and the run continues into tuning. Mark incidents by click-drag, use Threshold capture to grab every span above/below a horizontal line at once, or Lasso capture to loop around a cloud of outliers (each grid-adjacent run, gaps bridged, becomes one incident span); remove one with its chart-side ✕ or the Delete key. It seeds from the metric’s newest saved set (or --incidents <file-or-dir>), so re-running --label keeps editing in place. --no-serve instead writes a static metrics/<metric>__labeler.html (Export downloads a labels file; Import file… loads one back); --no-open prints the URL instead of launching a browser. See the --label reference for the full walkthrough.

Terminal window
dtk autotune --select api_error_rate --label

--scoring

The metric the search maximizes across folds: mcc (default), f1, f_beta, balanced_accuracy, roc_auc, pr_auc. MCC uses the whole confusion matrix and suits rare anomalies.

Terminal window
dtk autotune --select api_error_rate \
--incidents incidents/api_error_rate.yml \
--scoring f_beta

Lower bound of the training window (YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, UTC).

Upper bound of the training window (YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, UTC).

--profile

Override the default profile from the project config.

--force

Ignore an existing task lock and run anyway (same lock semantics as dtk run --force).

--dry-run

Run the search but persist nothing: no config, no detections, no _dtk_autotune_runs row. Previews what autotune would choose.

--report

Write the same self-contained HTML report as dtk run --report for the tuned winner: values, the chosen detector’s confidence band, the flagged anomalies, the alerts that would have fired, and a summary, with the client-side period selector. It charts the winner’s detections (persisted during the run), so run without --dry-run.

Terminal window
# Default path: reports/<metric>__tuned_<id>.html
dtk autotune --select cpu_usage --report
# A directory, or a specific file
dtk autotune --select cpu_usage --report reports/
dtk autotune --select cpu_usage --report cpu_tuned.html

Bare --report → reports/<metric>__tuned_<id>.html; a directory → <dir>/<metric>.html; a .html path → that file. The same Advanced note as dtk run --report applies: alerts in the report are reconstructed by replaying the decision logic over the stored detections.

On success (without --dry-run), one run:

  • writes metrics/<name>__tuned_<id>.yml — a normal, ready-to-run config led by a # comment header explaining every decision (training period, labels, seasonality rationale, detector votes, grid-search winner + CV score + per-fold scores, window choice). The <id> is a deterministic hash of the run.
  • records one row in the _dtk_autotune_runs audit table;
  • persists the winning detector’s detections to _dtk_detections;
  • prunes the superseded winners from prior autotune runs of the same metric.

The tuned config is an ordinary metric. Hand-editing its detector changes the detector_id, orphaning the old detections — recompute and prune:

Terminal window
dtk run --select <name>__tuned_<id> --steps detect --full-refresh
dtk clean --select <name>__tuned_<id> --execute

See the Auto-tuning guide and the Auto-tune reference for the labels schema, the autotune: config block, the scoring-metrics catalog, and the _dtk_autotune_runs columns.


Interactively tune a metric’s detector on its real data, then write the chosen config back into the metric YAML. The manual, human-in-the-loop sibling of dtk autotune: it opens a browser view of the metric’s persisted series, lets you turn the detector’s knobs and watch the confidence band + flagged anomalies + would-fire alerts recompute live, and — on a click — applies the config. Where autotune searches automatically and writes a new __tuned_<id>.yml, tune is manual and edits the metric in place.

Safe by construction: the new config is validated before anything is written, the previous metric YAML is archived under metrics/.history/<metric>/, and only then is the metric overwritten. It takes no pipeline lock (it only edits a config file); re-run dtk run afterwards to recompute detections under the new config.

Terminal window
dtk tune --select <selector> [OPTIONS]

--select

Metric selector with the same semantics as dtk run, except that it must resolve to a single metric (tuning is interactive and per-metric). Tuning reads the metric’s already-loaded _dtk_datapoints; if it has none yet, load it first:

Terminal window
dtk run --select api_error_rate --steps load --from "2026-01-01"

--from, --to

Restrict the window the tuner shows and recomputes over (YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, UTC). Defaults to the recent persisted window.

--no-serve

Write a static, read-only tuner HTML file (metrics/<metric>__tuner.html) and exit instead of starting the local server. The sliders still recompute the band live and you can still mark incidents, but there is no Apply / write-back; Save incidents downloads the labels file instead of writing it.

--no-open

Don’t auto-open the browser; just print the local 127.0.0.1 URL.

--profile

Profile override (default: from the project config).

Tunable controls: detector type (MAD / Z-Score / IQR / Manual bounds), threshold, window size, recency weighting + half-life, detrend, smoothing, seasonality conditioning (per available seasonality column, optionally conjoined into one group), direction (both/up/down) and the alert consecutive_anomalies window. The “effective config” readout shows exactly what will be written. A y = 0 line toggle shows the metric relative to zero.

Chart-first cockpit: modes, alert review & metrics


The whole screen is one chart (the windshield) with the live metrics pinned in a HUD over it (the speedometer) and every control in an always-visible side rail that is mode-aware — it shows only the current mode’s panel (detector knobs + effective-config readout + Apply in Tune, verdict actions in Review, capture tools + Save in Label) and collapses to give the chart the whole width. The controls that aren’t detector-specific — the Points shown data window, the alert rule (direction + consecutive anomalies) and the y = 0 toggle — stay visible in every mode. A mode switch picks the job and dims the layers that don’t matter to it:

  • Tune — steer the band (corridor leads; incidents are read-only context; hover a point for its window).
  • Review — confirm the fired alerts: click an alert marker to cycle its verdict un-reviewed (red) → valid (green) → false alarm (slate); Confirm all unreviewed valid does the lot. Confirming an alert valid IS marking an incident — the confirmed streak becomes a first-class incident that shows in the Marked incidents list (a ”✓ confirmed alert” row; remove it to un-confirm), counts toward recall + correct (so a clean metric is validated in a few clicks without drawing spans), and is written as an incident on Save. The list, the metrics and Save share one ground-truth set (marked spans + confirmed alerts).
  • Label — mark real incidents: drag a span (edges/middle to adjust, ✕/Delete to remove), Lasso anomalies (loop a cloud of anomaly dots — each consecutive run, gaps bridged up to consecutive_anomalies, becomes one span sized to the run), or Threshold capture (grab every span past a horizontal line; set it by click or value, above/below, optional gap-bridge, optional painted time window saved as capture_windows; each span widened to a full interval so the alert lands inside).

As you tune, a metrics bar shows three figures, scored only over incidents within the loaded window:

  • Incident catch rate (recall): the share of ground-truth incidents (marked + confirmed-valid alerts) caught by an alert. An incident counts as caught when an alert’s anomaly streak overlaps it, not just the fire instant.
  • False-alert rate: the share of fired alerts outside every incident and not confirmed valid (“≈1 in N false”).
  • Reviewed N/M.

An optional false-alert budget (false_alert_budget, a fraction in (0, 1] on the metric then project, default 0.5) gently flags the false-alert chip when the rate exceeds it. It is tuning-only; labeling stays optional.

Save incidents writes a versioned incidents/<metric>/<…>.yml, the same store dtk autotune reads, so a labeling round here also feeds the next supervised tune. On open, the tuner seeds incidents and capture windows from the newest such file and anchors the budget-sized loaded window on the seeded incidents (ending just past the latest one rather than at the last datapoint), so they render and count without loading the whole history. Older incidents stay list-only; use --from/--to to tune against them. Per-alert verdicts persist as an alert_reviews metadata block and re-seed on reopen. Saving incidents does not end the session; only Apply does.
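
The catch rate and false-alert rate are both overlap counts between incident spans and alert streaks. The sketch below shows that arithmetic under simplifying assumptions: spans are inclusive integer grid positions, the incidents list already contains confirmed-valid alerts, and an empty denominator is reported as 0.0. None of the names come from detectkit.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) positions on the metric grid


def overlaps(a: Span, b: Span) -> bool:
    """True when two inclusive spans share at least one grid point."""
    return a[0] <= b[1] and b[0] <= a[1]


def catch_rate(incidents: List[Span], alert_streaks: List[Span]) -> float:
    """Share of incidents overlapped by at least one alert's anomaly streak."""
    if not incidents:
        return 0.0
    caught = sum(any(overlaps(i, s) for s in alert_streaks) for i in incidents)
    return caught / len(incidents)


def false_alert_rate(incidents: List[Span], alert_streaks: List[Span]) -> float:
    """Share of alerts whose streak lies outside every incident."""
    if not alert_streaks:
        return 0.0
    false = sum(not any(overlaps(s, i) for i in incidents) for s in alert_streaks)
    return false / len(alert_streaks)
```

With incidents at (10, 20) and (50, 60) and alert streaks at (18, 22) and (30, 32), one of two incidents is caught and one of two alerts is false, so both rates are 0.5.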

On Apply to metric the server validates the chosen detector (through the same DetectorFactory + MetricConfig the pipeline uses) — a broken or untunable config is rejected and nothing is written — then archives the current YAML verbatim to metrics/.history/<metric>/<metric>-<timestamp>.yml and re-emits the metric in place with the tuned detector (the detectors list becomes the single tuned detector; the first alerting block’s consecutive_anomalies is updated if present). The archive keeps a trackable history of chosen parameters and the original is always recoverable.
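
The ordering matters here: validate first, archive second, overwrite last, so a rejected config leaves the file untouched. The following sketch shows that sequence with the standard library only. apply_tuned_config and the validate callback are hypothetical, the timestamp format is an assumption, and the file stem stands in for the metric name.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def apply_tuned_config(
    metric_file: Path,
    new_yaml: str,
    validate: Callable[[str], None],
) -> Path:
    """Validate, archive the current YAML, then overwrite it.

    Returns the path of the archived copy.
    """
    validate(new_yaml)  # must raise before anything is written
    metric = metric_file.stem  # stand-in for the metric name
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  # assumed format
    archive = metric_file.parent / ".history" / metric / f"{metric}-{stamp}.yml"
    archive.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(metric_file, archive)  # verbatim copy of the previous config
    metric_file.write_text(new_yaml, encoding="utf-8")
    return archive
```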

Terminal window
# Tune interactively and apply on click
dtk tune --select api_error_rate
# Tune over a specific window
dtk tune --select api_error_rate --from 2026-05-01 --to 2026-06-01
# Static, read-only preview file (no write-back)
dtk tune --select api_error_rate --no-serve

See the Tuning guide for the full walkthrough and how it relates to dtk autotune.


Send a test alert for a metric.

Terminal window
dtk test-alert <metric_name> [OPTIONS]

metric_name (required) Name of the metric to test alerts for.

--profile (optional) Profile to use (overrides project default).

Test alert for single metric:

Terminal window
dtk test-alert cpu_usage

Test with specific profile:

Terminal window
dtk test-alert cpu_usage --profile production

Sends a mock alert through all configured channels with fake data:

  • Current timestamp
  • Mock anomaly value: 0.8532
  • Mock confidence interval: [0.4521, 0.6234]
  • Mock severity: 4.52
  • Rule preview: the mock mirrors the alert config’s own min_detectors, direction, and consecutive_anomalies (defaults 1 / same / 3), so the message shows the alert-centric layout a real firing would produce
  • Project label: the preview carries the project-name [name] prefix (from detectkit_project.yml), exactly as a real dtk run stamps it — so a preview on a shared multi-project channel reads identically to the real alert

Use cases:

  • Verify webhook URLs work
  • Check alert formatting
  • Test custom templates
  • Validate channel permissions

Output:

📨 Sending test alert for metric: cpu_usage
Timezone: UTC
Channels: mattermost_ops
→ Sending to mattermost_ops... ✓ SUCCESS
✓ Sent test alert to 1/1 channels
💡 Check your configured channels to verify message formatting
Mock data used: value=0.8532, confidence=[0.4521, 0.6234], severity=4.52

When the metric defines multiple enabled alerting blocks (the list form), each block is tested independently: its Timezone/Channels are printed under a [config i/N] header, followed by a combined Total: x/y channels across N alert configs line.


Clear a stuck pipeline lock for the selected metric(s).

Terminal window
dtk unlock --select <selector> [OPTIONS]

--select, -s (required) Metric selector — same semantics as dtk run (metric name, path pattern, or tag:<name>).

--profile (optional) Profile to use (overrides project default).

Terminal window
# Unlock a single metric
dtk unlock --select cpu_usage
# Unlock everything matching a tag
dtk unlock --select "tag:critical"

Every dtk run records a running lock in _dtk_tasks while it works and clears it on exit. If a run is killed without releasing its lock (most commonly when the database restarts mid-run), the running row is left behind. Until it’s cleared, every subsequent run without --force fails with:

RuntimeError: Failed to acquire lock for metric '<name>'. Another task is
running. Use --force to override.

Stuck locks auto-expire after their timeout (1 hour) — the next normal run treats the stale running row as released and overrides it, so the error clears itself. dtk unlock simply does this immediately instead of waiting for the timeout. It marks the task completed, so the next scheduled (cron) run proceeds normally without needing --force.

  • Reports, per metric, whether a lock was cleared (lock cleared) or none was held (• <name>: no active lock)
  • Clears even a not-yet-expired lock (use with the same care as --force)
  • Does not run the pipeline — only releases the lock

Output:

Project root: /path/to/project
Found 1 metric(s) to unlock
┌─ cpu_usage
└─ lock cleared
Done. Cleared 1 lock(s) of 1 metric(s).

Remove internal data that no longer matches the project’s YAML configs.

Editing metrics over time leaves stale rows behind in the internal tables. dtk clean finds and removes that drift. Both modes default to a dry-run that only reports what would be deleted; pass --execute to actually delete.

Terminal window
dtk clean --select <selector> [--execute] [OPTIONS] # drift mode
dtk clean --orphaned-metrics [--execute] [OPTIONS] # GC mode

--select

Metric selector, with the same semantics as dtk run. For each selected (still-existing) metric, removes:

  • _dtk_detections rows whose detector_id is no longer produced by the config — i.e. you changed a detector parameter or seasonality_components (which changes the detector’s hash), or removed a detector;
  • _dtk_alert_states rows whose alert_config_id is no longer produced — i.e. you changed an alerting block’s functional params (channels, min_detectors, consecutive_anomalies, cooldown) or removed the block.
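
Drift arises because an id is derived from the config itself: change a parameter and the id changes, leaving rows under the old id behind. The sketch below illustrates the idea with SHA-256 over canonical JSON. That hashing scheme, the 16-character truncation, and the function names are assumptions for illustration, not detectkit's actual id algorithm.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List, Set


def config_id(config: Dict[str, Any]) -> str:
    """Deterministic 16-hex-char id for a config (illustrative scheme only)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def stale_ids(stored_ids: Iterable[str], current_configs: List[Dict[str, Any]]) -> Set[str]:
    """Ids present in storage that no current config produces."""
    current = {config_id(c) for c in current_configs}
    return set(stored_ids) - current
```

Key order does not affect the id in this sketch, but any change to a value does, which is the property that makes old rows identifiable as stale.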

Datapoints are not touched — they are keyed only by (metric, timestamp) and are never orphaned by a parameter edit. Use dtk run --full-refresh to reload those.

--orphaned-metrics

Deletes all rows, across every internal table, for metric names present in the database but no longer defined by any YAML in the project (a renamed or deleted metric). Operates over the whole project (ignores --select).

--execute

Actually delete. Without it, the command only reports (dry-run).

--yes, -y

Skip the confirmation prompt for --orphaned-metrics --execute.

--profile

Profile to use (overrides project default).

Terminal window
# See what stale detector/alert data a metric has accumulated (dry-run)
dtk clean --select cpu_usage
# ...then actually delete it
dtk clean --select cpu_usage --execute
# Clean drift across everything matching a tag
dtk clean --select "tag:critical" --execute
# List metrics in the DB that no longer exist in the project
dtk clean --orphaned-metrics
# Purge them (asks for confirmation unless -y)
dtk clean --orphaned-metrics --execute
  • Dry-run by default; nothing is deleted without --execute.
  • --orphaned-metrics --execute asks for confirmation (skip with --yes), and refuses to run if the project defines no metrics or its configs fail to parse — so a wrong directory or a duplicate-name error can’t wipe valid data.
  • In drift mode, if a metric’s config defines no detectors/alerting at all (so every stored row counts as orphaned), the command prints a loud warning before deleting.
  • Deletes are synchronous ClickHouse mutations and idempotent — safe to re-run.

Output:

Project root: /path/to/project
DRY-RUN — nothing will be deleted. Use --execute to apply.
Found 1 metric(s) to inspect
┌─ cpu_usage
│ detector a1b2c3d4e5f6a7b8: would delete 4,320 detection row(s)
└─ alert_config 9f8e7d6c5b4a3210: would delete stale alert state
Done. Would remove 1 detector group(s) and 1 alert-state row(s).
Re-run with --execute to apply.

Code | Meaning
0    | Normal completion, including most user-facing errors (bad project dir, missing profiles.yml, config/DB connection failures), which print an error message and return
2    | Click argument error (e.g. a missing required option or an invalid --steps/--from value)

Note: detectkit does not currently exit non-zero on configuration or database errors — it reports them and returns 0. Don’t gate a scheduler on the exit code alone; check the logged output.
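
Since the exit code is not a reliable failure signal, a scheduler wrapper can scan the output instead. The sketch below looks for the ✗ Failed marker that the run output uses. It is a starting point under that single assumption: other error messages may be worded differently, and both function names are hypothetical.

```python
import subprocess
from typing import List, Sequence

FAILURE_MARKER = "✗ Failed"  # marker shown in the run output on failure


def failed_lines(output: str) -> List[str]:
    """Return every output line that carries the failure marker."""
    return [line.strip() for line in output.splitlines() if FAILURE_MARKER in line]


def run_and_check(cmd: Sequence[str]) -> List[str]:
    """Run a dtk command and return failure lines from its combined output."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return failed_lines(result.stdout + result.stderr)
```

A wrapper could call run_and_check(["dtk", "run", "--select", "tag:critical"]) and raise its own non-zero exit when the returned list is not empty.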

The CLI itself defines no special environment variables, but configuration files support environment-variable interpolation so secrets stay out of YAML. Both ${VAR} and {{ env_var('VAR') }} syntaxes are supported:

profiles.yml
profiles:
  prod:
    type: clickhouse
    host: "{{ env_var('CLICKHOUSE_HOST') }}"
    port: 9000
    password: "${CLICKHOUSE_PASSWORD}"
alert_channels:
  mattermost_ops:
    type: mattermost
    webhook_url: "{{ env_var('MATTERMOST_WEBHOOK_URL') }}"

Unresolved placeholders (variable not set) are kept as-is, so missing variables surface as configuration errors instead of empty strings.
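
Since an unset variable leaves its placeholder in the config, it can help to verify the environment before invoking dtk. This is a small sketch under stated assumptions: require_env is a hypothetical helper, not a dtk feature, and the variable names are simply the ones used in the profiles.yml example. It relies on printenv, which only sees exported variables, the same ones a child process such as dtk would see.

```shell
# Hypothetical preflight helper: return non-zero if any named variable
# is missing from the exported environment.
require_env() {
    missing=0
    for name in "$@"; do
        # printenv exits non-zero when the variable is not exported.
        if ! printenv "$name" >/dev/null; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use before a scheduled run:
# require_env CLICKHOUSE_HOST CLICKHOUSE_PASSWORD MATTERMOST_WEBHOOK_URL \
#     && dtk run --select "*"
```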

Terminal window
# 1. Initialize project
dtk init my_monitoring
cd my_monitoring
# 2. Edit profiles.yml (add database connection)
# 3. Create metric config in metrics/
# 4. Run metric
dtk run --select my_metric
Terminal window
# Run all metrics (typically in cron/scheduler)
dtk run --select "*"
# Run critical metrics only
dtk run --select "tag:critical"
# Run specific metric manually
dtk run --select cpu_usage
Terminal window
# Load everything from a fixed start date onward
dtk run --select cpu_usage --from "2024-02-01"
# Load specific range
dtk run --select cpu_usage \
  --from "2024-01-01" \
  --to "2024-02-01"
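
The --from values above are fixed dates. For a rolling window such as the last 30 days, the start date can be computed in the shell. The sketch below assumes --from accepts a plain YYYY-MM-DD string, as in the examples above, and computes the date in UTC, which may or may not match how your metrics interpret timestamps. The days_ago helper is hypothetical.

```shell
# Hypothetical helper: print the date N days ago as YYYY-MM-DD (UTC).
days_ago() {
    # Try GNU date (Linux) first, then fall back to BSD/macOS syntax.
    date -u -d "$1 days ago" +%Y-%m-%d 2>/dev/null \
        || date -u -v-"$1"d +%Y-%m-%d
}

# Example use:
# dtk run --select cpu_usage --from "$(days_ago 30)"
```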
Terminal window
# Detector config changed → rerun detection
dtk run --select cpu_usage --steps detect --full-refresh
# Query changed → reload data
dtk run --select cpu_usage --full-refresh
# Detector/alert params changed → prune the now-orphaned old results
dtk clean --select cpu_usage # preview
dtk clean --select cpu_usage --execute
Terminal window
# Test alert channels
dtk test-alert cpu_usage
# Load data only (verify query works)
dtk run --select cpu_usage --steps load
# Detect only (verify detector works)
dtk run --select cpu_usage --steps detect
Terminal window
# Clear a stuck lock left by a crashed run (e.g. DB restarted mid-run)
dtk unlock --select cpu_usage
# Force run if previous run crashed (also clears the stuck lock on exit)
dtk run --select cpu_usage --force
# Full refresh if data is corrupted
dtk run --select cpu_usage --full-refresh
Terminal window
# Run all metrics every 10 minutes
*/10 * * * * cd /path/to/project && dtk run --select "*" >> /var/log/detectkit.log 2>&1
# Run critical metrics every 5 minutes
*/5 * * * * cd /path/to/project && dtk run --select "tag:critical" >> /var/log/detectkit.log 2>&1

Create /etc/systemd/system/detectkit.service:

[Unit]
Description=detectkit metric monitoring
[Service]
Type=oneshot
WorkingDirectory=/path/to/project
ExecStart=/usr/local/bin/dtk run --select "*"
User=detectkit

Create /etc/systemd/system/detectkit.timer:

[Unit]
Description=Run detectkit every 10 minutes
[Timer]
OnBootSec=1min
OnUnitActiveSec=10min
[Install]
WantedBy=timers.target

Enable:

Terminal window
systemctl enable detectkit.timer
systemctl start detectkit.timer
Terminal window
# Create scheduled task to run every 10 minutes
$action = New-ScheduledTaskAction -Execute "dtk" -Argument "run --select *" -WorkingDirectory "C:\projects\my_monitoring"
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 10)
Register-ScheduledTask -TaskName "detectkit" -Action $action -Trigger $trigger
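FROM python:3.11-slim
# Install detectkit (quoted so the shell does not glob the extras brackets)
RUN pip install --no-cache-dir "detectkit[clickhouse]"
# Install cron
RUN apt-get update && apt-get install -y --no-install-recommends cron && rm -rf /var/lib/apt/lists/*
# Copy project files
COPY . /app
WORKDIR /app
# Add cron job (absolute path: cron's default PATH does not include /usr/local/bin)
RUN echo "*/10 * * * * cd /app && /usr/local/bin/dtk run --select '*' >> /var/log/cron.log 2>&1" | crontab -
# Start cron
CMD ["cron", "-f"]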
Terminal window
# Good: Specific selector
dtk run --select "metrics/critical/*.yml"
# Avoid: Selecting all when not needed
dtk run --select "*"
Terminal window
# Always test manually before adding to cron
dtk run --select my_metric
dtk test-alert my_metric
Terminal window
# Redirect to log file for troubleshooting
dtk run --select "*" >> /var/log/detectkit.log 2>&1
Terminal window
# Test query without detection
dtk run --select my_metric --steps load
# Test detector without alerting
dtk run --select my_metric --steps load,detect
Terminal window
# Only use --force if you're sure no other process is running
# Check processes first (the [d] stops grep from matching itself):
ps aux | grep '[d]tk'

To recover from a crashed run (no live process), prefer dtk unlock — it clears the stale lock without running the pipeline concurrently. A stuck lock also auto-expires after 1 hour, so often no manual action is needed at all.

Cause: Selector doesn’t match any metrics.

Solution: Check the metric name and file path:

Terminal window
# List metric files
ls metrics/
# Try exact match
dtk run --select cpu_usage # Not metrics/cpu_usage.yml

“Task is locked” / “Failed to acquire lock”

Cause: Previous run is still in progress, or it crashed/was killed with the running lock held. The most common crash cause is the database restarting mid-run, which leaves a stale running row in _dtk_tasks.

Solution:

Terminal window
# Check if a process is actually still running (the [d] stops grep from matching itself)
ps aux | grep '[d]tk'
# If no process is running, clear the stuck lock immediately:
dtk unlock --select cpu_usage
# (Or just wait — a stale lock auto-expires after 1 hour and the next
# normal run overrides it. --force also clears it on exit.)

Cause: Can’t connect to database.

Solution: Check profiles.yml and database connectivity:

Terminal window
# Test ClickHouse connection
clickhouse-client --host=<host> --port=<port>

Cause: Query returns empty result.

Solution: Test the query manually in a database client with sample dates.