open source · python

Catch the spike
before it pages you.

Time-series anomaly detection and alerting with a dbt-like project layout. A metric is a SQL query plus a detector in YAML — run it with one command.

[Chart: api_error_rate, last 24h at 5min resolution. Series: metric, expected range, anomaly, recovery.]
Anomaly · api_error_rate: value 4.2 · expected ≤ 1.1 → Slack
Recovered · api_error_rate: value 1.0 · back within range → Slack
Works with: ClickHouse · PostgreSQL · MySQL
dtk init-claude

AI-native: build metrics with an assistant, out of the box.

One command sets up Claude Code for your project folder — a CLAUDE.md, a .claude/rules/detectkit/ reference, and four skills: dtk-setup-project (configure your database), dtk-new-metric (scaffold a metric), dtk-autotune (auto-tune a detector against labeled incidents), and dtk-feedback (file a redacted bug report or feature request upstream). Now an assistant writes metrics, tunes detectors, wires up alerts, and reports issues with full knowledge of detectkit. Re-run it after an upgrade to refresh the context.

$ dtk init-claude
Target: ~/monitoring
┌─ CLAUDE.md
└─ detectkit section created
┌─ .claude/rules/detectkit/
│ alerting.md (created)
│ autotune.md (created)
│ cli.md (created)
│ detectors.md (created)
│ metrics.md (created)
│ overview.md (created)
└─ project.md (created)
┌─ .claude/skills/
│ dtk-autotune/SKILL.md (created)
│ dtk-feedback/SKILL.md (created)
│ dtk-new-metric/SKILL.md (created)
└─ dtk-setup-project/SKILL.md (created)
Done. Claude context ready (12 created).
configuration

From a SQL query to a caught anomaly.

A metric is just a query plus a detector in YAML. dtk run handles the corridor, the quorum and the alert — nothing else to wire up.

# metrics/api_errors.yml
name: api_error_rate
interval: "5min"
query: |
  SELECT
    toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS timestamp,
    countIf(status_code >= 500) / count() * 100 AS value
  FROM http_requests
  WHERE timestamp >= '{{ dtk_start_time }}'
    AND timestamp < '{{ dtk_end_time }}'
  GROUP BY timestamp ORDER BY timestamp
detectors:
  - type: mad
    params:
      threshold: 3.0
      window_size: 2016  # 7d of 5-min points
      window_weights: exponential
      half_life: "1d"
alerting:
  enabled: true
  channels: [mattermost_ops]
  consecutive_anomalies: 3
  direction: "up"
  mentions: [oncall_engineer, here]
$ dtk run --select api_error_rate
LOAD 12 points · resumed
DETECT 1 anomaly · detector mad
ALERT ✓ sent to mattermost_ops
api_error_rate anomaly
value 4.2 · expected ≤ 1.1 · severity 3.40
✓ pipeline completed · 1 metric · 1 detector · 0.4s · idempotent · resumable

SQL on your warehouse · detector in YAML · one command · version-controlled
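The `window_size: 2016` comment in the config above is plain arithmetic: seven days divided by the 5-minute interval. A minimal sketch of that conversion follows. The helper names `parse_duration` and `points_in_window` are hypothetical and not part of detectkit, and the parser only understands the `min`, `h` and `d` suffixes seen on this page.

```python
import re
from datetime import timedelta

# Suffixes seen in the config on this page ("5min", "1d"); "h" is an assumed extra.
_UNITS = {"min": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Turn a string like '5min' or '7d' into a timedelta (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(min|h|d)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def points_in_window(window: str, interval: str) -> int:
    """Number of whole intervals that fit into the window."""
    return parse_duration(window) // parse_duration(interval)


window_size = points_in_window("7d", "5min")       # 7 days of 5-minute points
half_life_points = points_in_window("1d", "5min")  # the "1d" half-life, in points
```

The same helper is a quick way to re-derive `window_size` when a metric's `interval` changes, since a 7-day window at a different interval needs a different point count.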
dtk run

A load → detect → alert run, in one tree.

The real output of dtk run — a load → detect → alert tree with cyan step headers and colored status lines. Idempotent: it resumes from the last saved point.

LOAD

Run the SQL on your warehouse, in batches, from the last checkpoint.

DETECT

Each detector scores points against its learned corridor of normal.

ALERT

Quorum met → post to chat with the rule up top, recovery on the way back.
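The alert step can be pictured as a small state machine over per-point anomaly flags. The sketch below assumes the alert fires once `consecutive` anomalous points arrive in a row, and that a recovery is posted at the first normal point after an alert. Cooldown, direction and no-data handling are left out, and the function name `alert_events` is hypothetical.

```python
def alert_events(flags, consecutive=3):
    """Return (index, kind) events for a sequence of anomaly flags.

    kind is "alert" when the streak first reaches `consecutive`,
    and "recovery" at the first normal point after an alert.
    """
    events = []
    streak = 0
    firing = False
    for index, is_anomaly in enumerate(flags):
        if is_anomaly:
            streak += 1
            if streak >= consecutive and not firing:
                firing = True
                events.append((index, "alert"))
        else:
            streak = 0
            if firing:
                firing = False
                events.append((index, "recovery"))
    return events


# One isolated blip at the end never reaches the streak of 3, so it stays silent.
flags = [False, True, True, True, True, False, True, False]
events = alert_events(flags, consecutive=3)
```

Requiring a streak trades a little latency (two extra intervals here) for fewer pages on single-point noise.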

$ dtk run --select api_error_rate
Project root: ~/monitoring
Found 1 metric(s) to process
Processing metric: api_error_rate
Config file: metrics/api_errors.yml
Steps: load, detect, alert
┌─ LOAD
│ Resuming from last saved: 2026-06-19 11:55:00
│ Loading from 2026-06-19 12:00:00 to 2026-06-19 12:05:00
│ Total points: ~1 | Batch size: 10,000
│ Loading in single batch...
└─ Loaded 1 datapoints
┌─ DETECT
│ Running 1 detector(s)...
│ [1/1] Detector: mad
│ Detecting from 2026-06-19 12:00:00 to 2026-06-19 12:05:00
│ Total points: ~1 | Batch size: 1,000
│ └─ Detected 1 anomalies
└─ Total anomalies: 1
┌─ ALERT
│ Checking alert conditions...
⚠ Alert triggered! Sending to 1 channel(s)...
mattermost_ops
└─ Sent 1/1 alerts
✓ Pipeline completed successfully
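The LOAD lines in the output above suggest how resuming works: start one interval after the last saved point and stop at the latest closed interval boundary. Here is a minimal sketch of that bookkeeping, assuming this reading of the log is right. `next_load_window` and `split_batches` are hypothetical names, not detectkit functions.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def next_load_window(last_saved, now, interval):
    """Return (start, end, points) still to load, or None when up to date."""
    start = last_saved + interval            # first point not stored yet
    end = now - (now - EPOCH) % interval     # floor "now" to an interval boundary
    points = (end - start) // interval
    return (start, end, points) if points > 0 else None


def split_batches(start, end, interval, batch_size):
    """Yield (batch_start, batch_end) ranges of at most batch_size points."""
    step = interval * batch_size
    cursor = start
    while cursor < end:
        batch_end = min(cursor + step, end)
        yield cursor, batch_end
        cursor = batch_end


five_min = timedelta(minutes=5)
window = next_load_window(
    last_saved=datetime(2026, 6, 19, 11, 55),
    now=datetime(2026, 6, 19, 12, 7, 30),
    interval=five_min,
)
# A second run right after saving the 12:00 point finds nothing new to load.
rerun = next_load_window(datetime(2026, 6, 19, 12, 0),
                         datetime(2026, 6, 19, 12, 7, 30), five_min)
```

Deriving the window from the stored checkpoint rather than from the wall clock alone is what makes a re-run a no-op in this sketch.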
detectors

Robust statistics, not magic.

Every detector learns a corridor of normal from recent history, then flags the moment a metric steps outside it. Switch the detector to see the kind of metric it's built for.

[Chart: metric, expected range and anomaly, 7 days ago to now. Flagged point: anomaly · 3.40 σ]
mad
Median absolute deviation

Measures the typical distance from the median. A handful of wild spikes barely move it — the most robust default.

Corridor
median ± 3 × MAD
Best for
Spiky, noisy metrics with outliers in their history.
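To make the `median ± 3 × MAD` corridor concrete, here is a minimal standard-library sketch. It uses the raw MAD exactly as the formula above is written. Whether detectkit applies a normalization constant or weights is not stated on this page, so treat the function `mad_corridor` and the sample data as illustrative assumptions.

```python
from statistics import median


def mad_corridor(history, threshold=3.0):
    """Return (lower, upper) as median ± threshold × MAD of the history."""
    center = median(history)
    mad = median([abs(x - center) for x in history])
    return center - threshold * mad, center + threshold * mad


# Hypothetical error-rate history (%) with one old spike still in the window.
history = [1.0, 0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 9.0]
lower, upper = mad_corridor(history)

# The old 9.0 spike barely widens the corridor, so a new 4.2 still stands out.
is_anomaly = not (lower <= 4.2 <= upper)
```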
[Chart: metric, expected range and anomaly, 7 days ago to now. Flagged point: anomaly · z = 4.1]
zscore
Z-score

Classic mean ± k standard deviations. Fast and simple, but one big outlier inflates the band — keep it for clean data.

Corridor
mean ± 3 × σ
Best for
Clean, roughly bell-shaped metrics.
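The warning about one big outlier inflating the band is easy to reproduce. The sketch below computes `mean ± 3 × σ` on the same hypothetical history with and without a single spike. It assumes the sample standard deviation, and `zscore_corridor` is an illustrative name rather than detectkit's API.

```python
from statistics import mean, stdev


def zscore_corridor(history, k=3.0):
    """Return (lower, upper) as mean ± k × sample standard deviation."""
    center = mean(history)
    sigma = stdev(history)
    return center - k * sigma, center + k * sigma


clean = [1.0, 0.9, 1.1, 1.0, 0.8, 1.2, 1.0]
with_spike = clean + [9.0]

_, clean_upper = zscore_corridor(clean)        # tight band on clean data
_, spiky_upper = zscore_corridor(with_spike)   # one spike stretches the band

# A value of 4.2 is outside the clean band but hides inside the inflated one.
missed_because_of_spike = clean_upper < 4.2 < spiky_upper
```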
[Chart: metric, expected range and anomaly, 7 days ago to now. Flagged point: anomaly · beyond fence]
iqr
Interquartile range

Builds the corridor from the middle 50% of values, then extends fences 1.5×IQR out. Comfortable with skewed, long-tailed data.

Corridor
[ Q1 − 1.5·IQR , Q3 + 1.5·IQR ]
Best for
Skewed distributions and one-sided outliers.
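The fences above can be sketched with the standard library's quartile function. The quartile method is an assumption here: `statistics.quantiles` defaults to the "exclusive" method, and detectkit may compute Q1 and Q3 differently. `iqr_corridor` and the latency sample are illustrative only.

```python
from statistics import quantiles


def iqr_corridor(history, k=1.5):
    """Return (lower, upper) fences: Q1 - k × IQR and Q3 + k × IQR."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical long-tailed latency sample (ms).
latencies = [110, 120, 125, 130, 140, 150, 170, 210, 260, 340, 900]
lower, upper = iqr_corridor(latencies)
flagged = [x for x in latencies if x < lower or x > upper]
```

Because the fences come from the middle half of the data, the long right tail up to 340 stays inside the corridor and only the extreme value is flagged.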
[Chart: metric, expected range and anomaly with fixed max and min lines, 7 days ago to now. Flagged point: anomaly · > max]
manual_bounds
Manual bounds

No statistics at all — you set hard floor and ceiling values. Alerts the instant a metric crosses a known SLA line.

Corridor
value < min or value > max
Best for
Known SLAs and hard business limits.

// the corridor is recomputed per window with seasonality grouping & recency weighting — newer points count more
// each detector is shown on the metric shape it handles best — robust, bell-shaped, skewed or hard-bounded
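One way to read "newer points count more" is exponential decay with a half-life, feeding a weighted median. The sketch below shows only that recency part, under the assumption that a point's weight halves every `half_life_points` steps back in time. Seasonality grouping is not covered, and both function names are hypothetical.

```python
def exponential_weights(n_points, half_life_points):
    """Weights for points ordered oldest to newest; the newest gets weight 1."""
    return [0.5 ** ((n_points - 1 - i) / half_life_points) for i in range(n_points)]


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    if not values:
        raise ValueError("weighted_median needs at least one value")
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


values = [1, 2, 10]                      # oldest to newest
weights = exponential_weights(3, half_life_points=1)
recent_center = weighted_median(values, weights)    # pulled toward the newest point
plain_center = weighted_median(values, [1, 1, 1])   # ordinary median for comparison
```

A short half-life lets the corridor follow a level shift quickly; a long one keeps it anchored to older history.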

try it · live

Shape a metric, then watch detection happen.

An interactive sandbox running the actual detectkit detector in your browser. Dial in a series that looks like one of yours, turn the detector's real knobs, and watch the corridor of normal, what gets flagged, and whether an alert would fire — nothing is sent anywhere.

Open the interactive playground → mad / zscore / iqr · live corridor · alert preview
autotune · labeling

Teach it your incidents — point, drag, done.

Autotune already tunes well with zero labels. To optimise against your real incidents, run dtk autotune --select <metric> --label — it opens a chart where you drag across each incident, add a note, and Export one self-contained file: offline, nothing leaves your browser, every round versioned.

Open the labeler demo → drag to label · scroll to zoom · export & re-tune
alerting

Alerts that lead with the rule that fired.

Direction-aware multi-detector quorum, cooldown, recovery and no-data alerts — posted to chat with the alert and its rule up top, anomaly evidence below.
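As a rough mental model of a direction-aware quorum, each detector can be thought of as voting "above", "below" or nothing for a point, and the rule passes when enough detectors agree on an allowed side. The semantics below are an assumption inferred from the `min_detectors`, `direction` and `Quorum: 1/1 · above` fields shown on this page. The `"down"` value and the `quorum` function are hypothetical.

```python
from collections import Counter


def quorum(votes, min_detectors=1, direction="same"):
    """Return the side that met the quorum ("above" or "below"), or None.

    votes holds one entry per detector: "above", "below" or None (no anomaly).
    direction="up" counts only "above", "down" (assumed) only "below",
    and "same" accepts either side as long as the detectors agree on it.
    """
    allowed = {"up": ["above"], "down": ["below"], "same": ["above", "below"]}
    if direction not in allowed:
        raise ValueError(f"unknown direction: {direction!r}")
    counts = Counter(vote for vote in votes if vote is not None)
    for side in allowed[direction]:
        if counts[side] >= min_detectors:
            return side
    return None


single = quorum(["above"], min_detectors=1, direction="same")
split_vote = quorum(["above", "below"], min_detectors=2, direction="same")
wrong_side = quorum(["below", "below"], min_detectors=2, direction="up")
```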

The same alert, posted by detectkit to each channel and rendered as that channel formats it: a fields attachment on Slack/Mattermost, escaped HTML on Telegram, a branded card in email. Each leads with the project name ([payments]) so several projects can share one channel while keeping a single branded bot identity. The dashboard_url below becomes a first-class link on every channel.

detectkit · APP · 12:04
@oncall_engineer @here
🔴 [payments] Alert: api_error_rate
Anomalous for 1h — 6 consecutive 10min intervals.
Rule: min_detectors=1 · direction=same · consecutive=3
Value: 4.2
Expected: <= 1.1
Quorum: 1/1 · above
Severity: 3.40
Started: 2026-06-19 11:14:00 (Europe/Moscow)
Latest: 2026-06-19 12:04:00 (Europe/Moscow)
Detectors: mad
Parameters: {"threshold": 3.0, "window_size": 2016, "half_life": "1d"}

detectkit · APP · 12:04
@oncall_engineer @here
🟢 [payments] Alert cleared: api_error_rate
The alert condition no longer holds — the metric is back within expected bounds. Incident lasted 1h (6 consecutive 10min intervals).
Rule: min_detectors=1 · direction=same · consecutive=3
Value: 1.0
Expected: <= 1.1
Started: 2026-06-19 11:36:00 (Europe/Moscow)
Cleared: 2026-06-19 12:36:00 (Europe/Moscow)
Detectors: mad

detectkit · 12:04
🔴 [payments] Anomaly · api_error_rate
Anomalous for 1h — 6 consecutive 10min intervals.
Rule min_detectors=1 · direction=same · consecutive=3
• Value: 4.2 · Expected: <= 1.1
• Quorum: 1/1 · above
• Severity: 3.40
• Anomaly began: 2026-06-19 11:14:00 (Europe/Moscow) · Latest reading: 2026-06-19 12:04:00 (Europe/Moscow)
• Detector: mad
• Parameters: {"threshold": 3.0, "window_size": 2016, "half_life": "1d"}
Open dashboard · How to read this alert
@oncall_engineer

detectkit · 12:04
🟢 [payments] Recovered · api_error_rate
The alert condition no longer holds — the metric is back within expected bounds. Incident lasted 1h (6 consecutive 10min intervals).
Rule min_detectors=1 · direction=same · consecutive=3
• Value: 1.0 · Expected: <= 1.1
• Anomaly began: 2026-06-19 11:36:00 (Europe/Moscow) · Alert fired: 2026-06-19 11:56:00 (Europe/Moscow) · Recovered: 2026-06-19 12:36:00 (Europe/Moscow)
• Detector: mad
Open dashboard · How to read this alert
@oncall_engineer

alerting:
  channels: [mattermost_ops]
  dashboard_url: https://grafana.ops/d/api-errors   # one line → a link on every channel

Ship your first detector in five minutes.

SQL + YAML, one command. No agents, no dashboards to babysit.