Patterns & troubleshooting
Common Patterns
Pattern 1: Immediate Alerts for Critical Metrics

    name: api_errors
    detectors:
      - type: manual_bounds
        params:
          upper_bound: 0  # Zero tolerance

    alerting:
      channels:
        - slack_critical
      consecutive_anomalies: 1  # Alert immediately
      direction: "up"  # Only alert on increases

Pattern 2: Conservative Alerts for Noisy Metrics

    name: network_latency
    detectors:
      - type: mad
        params:
          threshold: 4.0  # Higher threshold

    alerting:
      channels:
        - mattermost_ops
      consecutive_anomalies: 5  # Require 5 consecutive points
      direction: "up"  # Only alert on increases

Pattern 3: Multi-Channel Escalation

    name: service_uptime
    detectors:
      - type: manual_bounds
        params:
          lower_bound: 99.9

    alerting:
      channels:
        - mattermost_ops  # Team notification
        - slack_oncall  # On-call engineer
        - email_management  # Management notification
      consecutive_anomalies: 1

Pattern 4: Business Hours Only (via Filtering)

    # Metric runs 24/7, but only alert during business hours
    name: office_occupancy

    seasonality_columns:
      - hour

    detectors:
      - type: mad
        params:
          threshold: 3.0
          # Per-hour statistics make 9-18h anomalies meaningful
          seasonality_components:
            - "hour"

    alerting:
      channels:
        - mattermost_ops
      consecutive_anomalies: 2

Note: detectkit doesn’t have built-in time-of-day filtering. Use external tools (cron, schedulers) to control when `dtk run` executes, or filter alerts in the receiving system.
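One way to apply the note above is to gate `dtk run` behind a small wrapper that cron invokes every hour. The sketch below assumes business hours of 09:00 to 18:00, Monday to Friday; the function name `in_business_hours` and the schedule are illustrative assumptions and not part of detectkit.

```shell
#!/bin/sh
# Hypothetical helper (not part of detectkit): succeeds only during
# business hours, assumed here to be Mon-Fri, 09:00-17:59.
# $1 = hour as printed by `date +%H` (00-23)
# $2 = ISO weekday as printed by `date +%u` (1 = Monday ... 7 = Sunday)
in_business_hours() {
    hour=${1#0}   # strip a leading zero so "09" compares as 9
    day=$2
    [ "$day" -ge 1 ] && [ "$day" -le 5 ] &&
        [ "$hour" -ge 9 ] && [ "$hour" -lt 18 ]
}

# Usage in a wrapper that cron invokes hourly:
#   if in_business_hours "$(date +%H)" "$(date +%u)"; then
#       dtk run --select office_occupancy
#   fi
```

The same restriction could instead live in the scheduler itself, for example with a crontab schedule such as `0 9-17 * * 1-5`, which keeps the wrapper unnecessary but hides the rule in the crontab.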
Troubleshooting
No Alerts Received

Checklist:
- `alerting.enabled: true` in metric config
- Channels exist in `profiles.yml`
- Recent anomalies detected (check `_dtk_detections` table)
- Consecutive anomaly threshold met
- Direction filter not blocking alerts
Debug:

    # Check recent detections
    dtk run --select my_metric --steps detect

    # Test alert channel
    dtk test-alert my_metric

Alerts Not Reaching Channel

Mattermost/Slack:
- Verify webhook URL is correct
- Check webhook permissions
- Test with `curl`:

      curl -X POST -H 'Content-Type: application/json' \
        -d '{"text":"Test message"}' \
        https://mattermost.example.com/hooks/xxx

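The plain `curl` call above prints only the response body. A small sketch that also captures the HTTP status code can make failures easier to interpret. The function names and the status interpretation below are illustrative assumptions, not detectkit behavior, and the URL is the same placeholder used above.

```shell
#!/bin/sh
# Hypothetical helper: turn an HTTP status code into a troubleshooting hint.
describe_webhook_status() {
    case "$1" in
        2??) echo "accepted" ;;
        000) echo "no response: check network, DNS, or TLS" ;;
        4??) echo "rejected: check webhook URL, payload, and permissions" ;;
        5??) echo "server error: check the chat server" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}

# Post a test message to the webhook URL in $1 and print only the status code.
# curl reports 000 when it receives no HTTP response at all.
post_test_message() {
    curl -sS -o /dev/null -w '%{http_code}' \
        -X POST -H 'Content-Type: application/json' \
        -d '{"text":"Test message"}' "$1"
}

# Usage:
#   describe_webhook_status "$(post_test_message https://mattermost.example.com/hooks/xxx)"
```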
Telegram:
- Verify bot token is valid
- Check bot is member of target chat
- Test with API:

      curl "https://api.telegram.org/bot<TOKEN>/getMe"

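`getMe` only confirms that the token is valid. To check delivery to the target chat as well, the Bot API's `sendMessage` method can be called directly. In this sketch, `TOKEN` and `CHAT_ID` are placeholders you supply, and `telegram_ok` is a hypothetical helper that looks for the `"ok":true` field the Bot API includes in successful responses.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a Bot API JSON response reports "ok": true.
telegram_ok() {
    printf '%s' "$1" | grep -Eq '"ok"[[:space:]]*:[[:space:]]*true'
}

# Send a test message; TOKEN and CHAT_ID must be set by the caller.
send_test_message() {
    curl -sS "https://api.telegram.org/bot${TOKEN}/sendMessage" \
        --data-urlencode "chat_id=${CHAT_ID}" \
        --data-urlencode "text=Test message"
}

# Usage:
#   response=$(send_test_message)
#   telegram_ok "$response" || echo "Delivery failed: $response"
```

A failed response normally carries a `description` field, which usually says whether the bot is missing from the chat or the chat ID is wrong.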
Email:
- Check SMTP credentials
- Verify firewall allows outbound SMTP
- Test with manual SMTP connection
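A manual SMTP check can be split into two steps: confirm the port is reachable, then confirm the server completes a STARTTLS handshake. The sketch below assumes `nc` (netcat) with `-z` and `-w` support and `openssl` are installed; the host `smtp.example.com` and port 587 are placeholders, and `smtp_port_open` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a TCP connection to host:port opens
# within 5 seconds. Assumes a netcat build that supports -z and -w.
smtp_port_open() {
    nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}

# Usage:
#   smtp_port_open smtp.example.com 587 || echo "Port blocked or host unreachable"
#
# If the port is open, inspect the STARTTLS handshake and server banner:
#   openssl s_client -starttls smtp -connect smtp.example.com:587
```

If the port check fails, the firewall or network path is the likelier cause; if it passes but authentication still fails, the SMTP credentials are the next thing to verify.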
Too Many Alerts
Solutions:
- Increase `consecutive_anomalies` threshold
- Increase detector `threshold` parameter
- Use `min_detectors: 2` (require multiple detectors)
- Add seasonality to detector (if metric is seasonal)
- Use `direction` filter (only alert on “up” or “down”)
Alerts for Wrong Direction
Example: Alerting when CPU drops (which is good)
Solution: Add direction filter

    alerting:
      direction: "up"  # Only alert on high CPU

Missing Important Anomalies
Causes:
- `consecutive_anomalies` too high
- `min_detectors` too high
- Detector `threshold` too high
Solutions:
- Lower `consecutive_anomalies` (e.g., from 5 to 3)
- Lower `min_detectors` (e.g., from 2 to 1)
- Lower detector `threshold` (e.g., from 4.0 to 3.0)
Best Practices
1. Start Conservative, Then Tune

    # Initial setup
    alerting:
      consecutive_anomalies: 5  # Conservative
      min_detectors: 2  # Require agreement

    # After observing false positive rate, tune down
    alerting:
      consecutive_anomalies: 3  # Balanced
      min_detectors: 1  # Any detector

2. Use Different Channels for Different Severities

    # Critical metrics
    alerting:
      channels:
        - slack_oncall

    # Informational metrics
    alerting:
      channels:
        - mattermost_monitoring

3. Document Alert Rationale

    alerting:
      channels:
        - slack_ops
      consecutive_anomalies: 1  # Critical: errors should never occur
      direction: "up"  # Only alert on error increases

4. Test Alerts Before Production

    # Always test before deploying
    dtk test-alert new_metric

5. Monitor Alert Volume
If a team receives too many alerts:
- Team becomes desensitized
- Real issues get missed
- Alert fatigue sets in
Aim for: < 5 alerts per day per team
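To compare actual volume against this target, a short script can count alerts per day. The sketch below assumes a hypothetical input of one ISO 8601 timestamp per sent alert on stdin (for example, exported from your chat tool or an alert log); detectkit does not define this format, and `noisy_days` is an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: read one ISO 8601 timestamp per line on stdin and
# print each date whose alert count reached the limit (default 5 per day).
noisy_days() {
    cut -c1-10 | sort | uniq -c |
        awk -v limit="${1:-5}" '$1 + 0 >= limit + 0 { print $2 }'
}

# Usage (alerts.log is a placeholder file name):
#   noisy_days 5 < alerts.log
```

Days printed by this helper are candidates for the tuning steps described under "Too Many Alerts".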
See Also
- Configuration Guide - Alert configuration options
- Detectors Guide - Reducing false positives
- CLI Reference - `dtk test-alert` command