
Observability Designer

SLO design, alert optimization, and dashboard generation.

$skillfish add ~skills/observability-designer

Real Conversations

See how the Observability Designer skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

We need to define SLOs for our payment processing API. It handles charge creation, refunds, and webhook delivery. Our engineering team has never set SLOs before. Where do we start, and what should our targets be?

Agent

Start with the four golden signals, then translate them into user-facing SLIs that measure what actually matters to customers.

Step 1 — Define SLIs per endpoint type:

Endpoint            SLI Type                  Measurement
POST /charges       Availability + latency    % of requests returning 2xx; P99 latency
POST /refunds       Availability + latency    Same; refunds are user-initiated
Webhook delivery    Delivery success rate     % delivered within 30s of the event

Step 2 — Recommended SLO targets (starting point):

# payments-api SLOs
charge_availability:
  target: 99.9%   # 43.2 min downtime/month
  window: 30-day rolling
  measurement: (successful_charges / total_charge_attempts) * 100

charge_latency_p99:
  target: 99%  # of requests
  threshold: 2000ms  # charges involve Stripe network calls
  window: 30-day rolling

webhook_delivery:
  target: 99.5%   # delivered within 30 seconds
  window: 30-day rolling
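
As a sanity check, each availability target converts directly into a concrete downtime allowance; a small Python sketch (the function name is illustrative):

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Convert an availability SLO into allowed downtime per window."""
    error_budget = 1 - slo_percent / 100          # e.g. 99.9% -> 0.001
    return error_budget * window_days * 24 * 60   # minutes in the window

print(round(downtime_budget_minutes(99.9), 1))   # 43.2 min/month
print(round(downtime_budget_minutes(99.5), 1))   # 216.0 min/month
```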

Step 3 — Error budget policies:

  • Error budget >50% remaining: Normal deployment velocity
  • Error budget 25-50% remaining: Code freeze on payment critical path, SRE review required
  • Error budget <25% remaining: Full code freeze, incident review mandatory
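
If you want to enforce this policy in a deploy gate, it reduces to a threshold lookup; a minimal sketch (the function name and return labels are illustrative, the thresholds are the ones above):

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to a deployment posture."""
    if budget_remaining > 0.50:
        return "normal"           # full deployment velocity
    if budget_remaining >= 0.25:
        return "freeze-critical"  # freeze payment critical path, SRE review
    return "full-freeze"          # full code freeze, incident review

print(deployment_policy(0.80))  # normal
print(deployment_policy(0.30))  # freeze-critical
print(deployment_policy(0.10))  # full-freeze
```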

Step 4 — Burn rate alerts (multi-window):

# Page immediately: budget burning 14x faster than sustainable over 1h
alert: PaymentSLOCritical
expr: |
  (1 - rate(payment_requests_successful_total[1h])
       / rate(payment_requests_total[1h])) > 14 * 0.001
severity: page

# Ticket: budget burning 6x faster than sustainable over 6h
alert: PaymentSLOWarning
expr: |
  (1 - rate(payment_requests_successful_total[6h])
       / rate(payment_requests_total[6h])) > 6 * 0.001
severity: ticket
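
The `14 * 0.001` and `6 * 0.001` terms are the burn-rate multiplier times the 0.1% error budget; a quick sketch of how those thresholds fall out (names are illustrative):

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO  # 0.1% of requests may fail

def burn_threshold(multiplier: float) -> float:
    """Observed error ratio at which the alert should fire."""
    return multiplier * ERROR_BUDGET

print(round(burn_threshold(14), 3))  # 0.014 -> page threshold
print(round(burn_threshold(6), 3))   # 0.006 -> ticket threshold
```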

Start conservative — 99.9% is achievable. Review after 90 days of data and tighten.

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best
Claude Opus 4, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2
Good
Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

  • 8K: skill only
  • 32K+: skill + conversation
  • 100K+: skill + references + codebase

All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.

Pro tips for best results

  1. Be specific: include numbers (users, budget, RPS) so the skill can size the architecture.
  2. Share constraints: compliance needs, team size, and existing stack all improve the output.
  3. Iterate: start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
  4. Combine skills: pair with companion skills below for end-to-end coverage.

Good to Know

Advanced guide and reference material for Observability Designer. Background, edge cases, and patterns worth understanding.

Contents

SLI vs SLO vs SLA

These three terms are often used interchangeably in the wild, which causes real operational confusion.

  • SLI (Service Level Indicator): A specific, quantitative measurement of service behavior. Must be a ratio or percentage: (good events / total events) * 100. Examples: request success rate, P99 latency below threshold, queue drain rate.
  • SLO (Service Level Objective): A target value for an SLI, with a time window. Example: "99.9% of requests succeed over a 30-day rolling window." The SLO is internal — it's the engineering commitment.
  • SLA (Service Level Agreement): A contractual promise to customers, usually with financial penalties. SLAs are typically weaker than your internal SLO (e.g., SLO: 99.9%, SLA: 99.5%) to give you a buffer before breaching the contract.

Why availability SLIs alone are insufficient: A service that returns 200 OK in 30 seconds is "available" but unusable. Production systems need at minimum two SLIs per critical endpoint: availability (success ratio) and latency (P99 below a threshold). Webhook or async systems also need a delivery SLI measuring time-to-delivery, not just success rate.
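
A latency SLI of this shape is itself a good/total ratio: count the requests that finish under the threshold rather than averaging durations. A minimal sketch (sample data and names are illustrative):

```python
def latency_sli(durations_ms: list[float], threshold_ms: float = 2000) -> float:
    """Fraction of requests completing within the latency threshold."""
    good = sum(1 for d in durations_ms if d <= threshold_ms)
    return good / len(durations_ms)

samples = [120, 340, 95, 2400, 510, 1800, 3100, 220, 640, 150]
print(latency_sli(samples))  # 0.8 -> 8 of 10 requests under 2000ms
```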


Multi-Window Burn Rate Alerting

Single-window burn rate alerts generate too many false positives. A 1-hour spike that resolves itself still pages you — and over time, on-call engineers learn to ignore it.

The solution is requiring confirmation from two windows simultaneously:

Window     Burn rate multiplier    What it catches
1 hour     14x                     Fast burns: major outages
6 hours    6x                      Sustained burns: slow degradation

The formula: For a 99.9% SLO (0.1% error budget) and a 30-day window:

Burn rate multiplier = budget_window_hours / hours_to_exhaustion
Budget window at 1x burn = 30 days = 720 hours
Alert fires when projected exhaustion (720 / burn_rate) is unacceptably soon

A 14x burn rate on a 1h window means: if this keeps up, you'll exhaust your entire monthly error budget in 720 / 14 ≈ 51 hours. That warrants a page.

A 6x burn rate on a 6h window means: exhaustion in 720 / 6 = 120 hours (~5 days). That warrants a ticket, not a page.

Why dual-window works: Both windows must exceed their respective thresholds simultaneously. A brief spike exceeds the 1h threshold but not the 6h one, so it never pages; a slow chronic burn exceeds the 6h threshold but not the 1h one, so it lands in the ticket tier. Only a burn that is both fast and sustained pages a human.
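
The exhaustion arithmetic and the dual-window AND can be sketched together (the 14x/6x thresholds and the 720-hour budget window come from the table above; function names are illustrative):

```python
BUDGET_WINDOW_H = 30 * 24  # 720 hours in a 30-day budget window

def hours_to_exhaustion(burn_rate: float) -> float:
    """If this burn rate holds, hours until the monthly budget is gone."""
    return BUDGET_WINDOW_H / burn_rate

def should_page(burn_1h: float, burn_6h: float) -> bool:
    """Dual-window check: both windows must exceed their thresholds."""
    return burn_1h > 14 and burn_6h > 6

print(round(hours_to_exhaustion(14), 1))  # 51.4 hours -> page
print(hours_to_exhaustion(6))             # 120.0 hours (~5 days) -> ticket
print(should_page(20, 2))                 # False: brief spike, 6h window vetoes
print(should_page(20, 8))                 # True: sustained fast burn
```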


Error Budget Arithmetic

From SLO to monthly budget:

SLO: 99.9%
Error budget: 100% - 99.9% = 0.1%
Monthly request volume (example): 5,000,000 requests

Allowable failures: 5,000,000 * 0.001 = 5,000 failed requests
In time: 0.001 * 30 days * 24 hours * 60 minutes = 43.2 minutes of total downtime

What "burning 5% of budget in 1 hour" means operationally:

5% of 5,000 = 250 failed requests in 1 hour
That's about 4.2 failures/minute sustained
Remaining budget: 4,750 failed requests for the rest of the month

At this example's average rate of roughly 116 req/min (5M requests over 30 days), those 250 failures are the equivalent of only about 2.2 minutes of total downtime, yet they consumed 5% of the monthly budget in a single hour. Error budget meetings become much more concrete when the team sees these numbers, not just percentages.
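
The arithmetic above is easy to script, e.g. for an error-budget dashboard (the 5M monthly volume and 99.9% SLO are the example figures from this section):

```python
MONTHLY_REQUESTS = 5_000_000
SLO = 0.999

budget_requests = MONTHLY_REQUESTS * (1 - SLO)   # allowable failed requests
minutes_in_month = 30 * 24 * 60
avg_rate_per_min = MONTHLY_REQUESTS / minutes_in_month

burned = 0.05 * budget_requests                  # 5% of budget in one hour
downtime_equiv_min = burned / avg_rate_per_min   # full-outage equivalent

print(round(budget_requests))        # 5000 allowable failures
print(round(burned))                 # 250 failed requests
print(round(burned / 60, 1))         # 4.2 failures/min sustained
print(round(downtime_equiv_min, 1))  # 2.2 min of downtime equivalent
```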


Platform Comparison

Prometheus + Grafana
  • Data retention: configurable (default 15d); storage cost is yours
  • Cardinality limits: no hard limit; your hardware is the limit
  • Cost model: infrastructure cost only; free at small scale
  • Self-hosted: yes, full control
  • Best for: teams with SRE capacity to operate infra; cost-sensitive orgs

Datadog
  • Data retention: 15 months (metrics), 30d (logs, default)
  • Cardinality limits: hard limit per account; high-cardinality labels cause ingestion drops
  • Cost model: per-host + per-custom-metric; costs escalate fast with many services
  • Self-hosted: no
  • Best for: teams wanting full-stack APM with minimal setup; budget > $5k/mo

New Relic
  • Data retention: 8d (default), up to 395d paid
  • Cardinality limits: ~100k unique time series per account before throttling
  • Cost model: data ingest (GB) + query compute; unpredictable at scale
  • Self-hosted: no
  • Best for: application-centric monitoring; strong auto-instrumentation

OpenTelemetry
  • Data retention: vendor-dependent (it's a standard, not a backend)
  • Cardinality limits: none of its own; the backend determines this
  • Cost model: depends on backend chosen (often Grafana Cloud, Honeycomb, etc.)
  • Self-hosted: yes (collector) + SaaS backends
  • Best for: teams building vendor-portable pipelines; greenfield projects

Cardinality note: Prometheus cardinality issues surface as high memory usage on the server and slow query times. A label like user_id on a high-traffic metric will OOM a Prometheus server fast. Use recording rules to pre-aggregate before storing.
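
The reason user_id is so dangerous: the series count for a metric is the product of its label cardinalities, so one high-cardinality label multiplies everything else. A back-of-envelope sketch (cardinality figures are illustrative):

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Rough upper bound on time series for one metric name."""
    return prod(label_cardinalities.values())

safe = {"endpoint": 20, "method": 4, "status": 5}
risky = {**safe, "user_id": 1_000_000}

print(series_count(safe))   # 400 series: fine
print(series_count(risky))  # 400000000 series: enough to OOM Prometheus
```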

Ready to try Observability Designer?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.

$skillfish add ~skills/observability-designer