Observability Designer
SLO design, alert optimization, and dashboard generation.
$skillfish add ~skills/observability-designer

Real Conversations
See how the Observability Designer skill responds to actual requests. These are representative examples of what your AI agent will produce.
We need to define SLOs for our payment processing API. It handles charge creation, refunds, and webhook delivery. Our engineering team has never set SLOs before. Where do we start, and what should our targets be?
Start with the four golden signals, then translate them into user-facing SLIs that measure what actually matters to customers.
Step 1 — Define SLIs per endpoint type:
| Endpoint | SLI Type | Measurement |
|---|---|---|
| POST /charges | Availability + Latency | % of requests returning 2xx, P99 latency |
| POST /refunds | Availability + Latency | Same — refunds are user-initiated |
| Webhook delivery | Delivery success rate | % delivered within 30s of event |
Step 2 — Recommended SLO targets (starting point):
```yaml
# payments-api SLOs
charge_availability:
  target: 99.9%            # 43.2 min downtime/month
  window: 30-day rolling
  measurement: (successful_charges / total_charge_attempts) * 100

charge_latency_p99:
  target: 99%              # of requests
  threshold: 2000ms        # charges involve Stripe network calls
  window: 30-day rolling

webhook_delivery:
  target: 99.5%            # delivered within 30 seconds
  window: 30-day rolling
```
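These targets can be wired to live metrics with a Prometheus recording rule. A minimal sketch for the availability SLI, assuming the request counters used in the burn-rate alerts below (adjust metric names to your instrumentation; many teams record a short window like 5m and let their SLO tooling aggregate, since 30d range queries are expensive):

```yaml
groups:
  - name: payments-slo
    rules:
      # Precompute the charge availability ratio for dashboards and alerts
      - record: slo:charge_availability:ratio_rate5m
        expr: |
          sum(rate(payment_requests_successful_total[5m]))
          / sum(rate(payment_requests_total[5m]))
```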
Step 3 — Error budget policies:
- Error budget >50% remaining: Normal deployment velocity
- Error budget 25-50% remaining: Code freeze on payment-critical path, SRE review required
- Error budget <25% remaining: Full code freeze, incident review mandatory
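The policy tiers above can be sketched as a small lookup. This is illustrative, assuming failure counts come from your own metrics pipeline; the threshold values mirror the list above:

```python
def budget_remaining(successful: int, total: int, slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    allowed_failures = total * (1 - slo)
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures

def deployment_policy(remaining: float) -> str:
    """Map remaining budget fraction to the policy tiers above."""
    if remaining > 0.50:
        return "normal deployment velocity"
    if remaining >= 0.25:
        return "freeze payment-critical path; SRE review required"
    return "full code freeze; incident review mandatory"
```

For example, 2,000 failures out of 5M requests against a 99.9% SLO leaves 60% of the budget, so deployments proceed normally.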
Step 4 — Burn rate alerts (multi-window):
```yaml
# Page immediately: burning budget 14x too fast over 1h
- alert: PaymentSLOCritical
  expr: |
    (1 - rate(payment_requests_successful_total[1h])
       / rate(payment_requests_total[1h])) > 14 * 0.001
  labels:
    severity: page

# Ticket: burning budget 6x too fast over 6h
- alert: PaymentSLOWarning
  expr: |
    (1 - rate(payment_requests_successful_total[6h])
       / rate(payment_requests_total[6h])) > 6 * 0.001
  labels:
    severity: ticket
```
Start conservative — 99.9% is achievable. Review after 90 days of data and tighten.
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Larger models produce more detailed, production-ready outputs.
Context Window
This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.
All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Good to Know
Advanced guide and reference material for Observability Designer. Background, edge cases, and patterns worth understanding.
SLI vs SLO vs SLA
These three terms are often used interchangeably in the wild, which causes real operational confusion.
- SLI (Service Level Indicator): A specific, quantitative measurement of service behavior. Must be a ratio or percentage: (good events / total events) * 100. Examples: request success rate, P99 latency below threshold, queue drain rate.
- SLO (Service Level Objective): A target value for an SLI, with a time window. Example: "99.9% of requests succeed over a 30-day rolling window." The SLO is internal — it's the engineering commitment.
- SLA (Service Level Agreement): A contractual promise to customers, usually with financial penalties. SLAs are typically weaker than your internal SLO (e.g., SLO: 99.9%, SLA: 99.5%) to give you a buffer before breaching the contract.
Why availability SLIs alone are insufficient: A service that returns 200 OK in 30 seconds is "available" but unusable. Production systems need at minimum two SLIs per critical endpoint: availability (success ratio) and latency (P99 below a threshold). Webhook or async systems also need a delivery SLI measuring time-to-delivery, not just success rate.
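The two-SLI minimum can be sketched as a pair of ratio computations over raw request records. The `Request` shape here is illustrative, not a real client API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency

def availability_sli(requests: list[Request]) -> float:
    """% of requests returning 2xx."""
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return 100 * good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 2000) -> float:
    """% of requests completing under the latency threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100 * fast / len(requests)
```

Note the 30-second 200 OK scores 100% on availability but drags the latency SLI down, which is exactly why both are needed.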
Multi-Window Burn Rate Alerting
Single-window burn rate alerts generate too many false positives. A 1-hour spike that resolves itself still pages you — and over time, on-call engineers learn to ignore it.
The solution is requiring confirmation from two windows simultaneously:
| Window | Burn Rate Multiplier | What it catches |
|---|---|---|
| 1 hour | 14x | Fast burns — major outages |
| 6 hours | 6x | Sustained burns — slow degradation |
The formula: For a 99.9% SLO (0.1% error budget) and a 30-day window:
Burn rate = observed error rate / error budget rate (e.g., a 1.4% error rate against a 0.1% budget is a 14x burn)
Budget at 1x burn = 30 days = 720 hours
Projected exhaustion = 720 hours / burn rate; the alert fires when this falls below your threshold
A 14x burn rate on a 1h window means: if this keeps up, you'll exhaust your entire monthly error budget in 720 / 14 ≈ 51 hours. That warrants a page.
A 6x burn rate on a 6h window means: exhaustion in 720 / 6 = 120 hours (~5 days). That warrants a ticket, not a page.
Why dual-window works: the two alerts divide the problem. The 1h/14x page fires only on severe, fast burns, so brief low-grade spikes never page anyone; the 6h/6x ticket catches slow chronic burns that would never cross the aggressive 1h threshold. (The full Google SRE recipe additionally pairs each long window with a short confirmation window — e.g., 5m alongside 1h — that must exceed the same threshold simultaneously, so alerts clear quickly once the burn stops.)
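The arithmetic above can be sketched directly. The error rates are assumed inputs, e.g., the ratios computed by the PromQL expressions in the alert examples:

```python
BUDGET = 1 - 0.999       # 0.1% error budget for a 99.9% SLO
WINDOW_HOURS = 30 * 24   # 720h budget window

def burn_rate(error_rate: float) -> float:
    """How many times faster than sustainable we are spending budget."""
    return error_rate / BUDGET

def hours_to_exhaustion(error_rate: float) -> float:
    """Hours until the budget is gone if this rate persists."""
    return WINDOW_HOURS / burn_rate(error_rate)

def alert(error_rate_1h, error_rate_6h):
    """Map the two observed windows to a severity, per the table above."""
    if burn_rate(error_rate_1h) > 14:
        return "page"    # projected exhaustion in ~51h or less
    if burn_rate(error_rate_6h) > 6:
        return "ticket"  # projected exhaustion in ~120h or less
    return None
```

A 1.4% error rate sustained over 1h gives `hours_to_exhaustion(0.014)` of about 51 hours, matching the worked figure above.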
Error Budget Arithmetic
From SLO to monthly budget:
SLO: 99.9%
Error budget: 100% - 99.9% = 0.1%
Monthly request volume (example): 5,000,000 requests
Allowable failures: 5,000,000 * 0.001 = 5,000 failed requests
In time: 0.001 * 30 days * 24 hours/day * 60 min/hour = 43.2 minutes of total downtime
What "burning 5% of budget in 1 hour" means operationally:
5% of 5,000 = 250 failed requests in 1 hour
That's about 4.2 failures/minute sustained
Remaining budget: 4,750 failed requests for the rest of the month
At a current request rate of 1,000 req/min, those 250 failures are the equivalent of just 15 seconds of full outage — and a complete outage of only five minutes would consume the entire monthly budget of 5,000 failures. Error budget meetings become much more concrete when the team sees these numbers, not just percentages.
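The budget arithmetic in this section, as a sketch (the SLO, window, and volumes are the example figures above):

```python
def allowed_failures(slo: float, monthly_requests: int) -> float:
    """Failed requests the budget permits per month."""
    return monthly_requests * (1 - slo)

def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Full-outage-equivalent downtime allowed per window."""
    return (1 - slo) * window_days * 24 * 60

def outage_equivalent_minutes(failures: float, req_per_min: float) -> float:
    """Minutes of total outage that would produce this many failures."""
    return failures / req_per_min
```

For example, `outage_equivalent_minutes(250, 1000)` converts the 250-failure hour into its 0.25-minute full-outage equivalent.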
Platform Comparison
| | Prometheus + Grafana | Datadog | New Relic | OpenTelemetry |
|---|---|---|---|---|
| Data retention | Configurable (default 15d), storage cost is yours | 15 months (metrics), 30d (logs, default) | 8d (default), up to 395d paid | Vendor-dependent (it's a standard, not a backend) |
| Cardinality limits | No hard limit — your hardware is the limit | Hard limit per account; high-cardinality labels cause ingestion drops | ~100k unique time series per account before throttling | No limit (backend determines this) |
| Cost model | Infrastructure cost only; free at small scale | Per-host + per-custom-metric; costs escalate fast with many services | Data ingest (GB) + query compute; unpredictable at scale | Depends on backend chosen (often Grafana Cloud, Honeycomb, etc.) |
| Self-hosted | Yes — full control | No | No | Yes (collector) + SaaS backends |
| Best for | Teams with SRE capacity to operate infra; cost-sensitive orgs | Teams wanting full-stack APM with minimal setup; budget > $5k/mo | Application-centric monitoring; strong auto-instrumentation | Teams building vendor-portable pipelines; greenfield |
Cardinality note: Prometheus cardinality issues surface as high memory usage on the server and slow query times. A label like user_id on a high-traffic metric will OOM a Prometheus server fast. Use recording rules to pre-aggregate before storing.
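That pre-aggregation looks like the following sketch, assuming a hypothetical `http_requests_total` counter carrying a `user_id` label: record the rate with the label summed away, then point dashboards and alerts at the recorded series instead of the raw one.

```yaml
groups:
  - name: cardinality-control
    rules:
      # Keep aggregate request rates; drop the high-cardinality user_id label
      - record: job:http_requests:rate5m
        expr: sum without (user_id) (rate(http_requests_total[5m]))
```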
Ready to try Observability Designer?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$skillfish add ~skills/observability-designer