Observability Designer
SLO design, alert optimization, and dashboard generation.
$skillfish add ~skills/observability-designer

Real Conversations
See how the Observability Designer skill responds to actual requests. These are representative examples of what your AI agent will produce.
We need to define SLOs for our payment processing API. It handles charge creation, refunds, and webhook delivery. Our engineering team has never set SLOs before. Where do we start, and what should our targets be?
Start with the four golden signals, then translate them into user-facing SLIs that measure what actually matters to customers.
Step 1 — Define SLIs per endpoint type:
| Endpoint | SLI Type | Measurement |
|---|---|---|
| POST /charges | Availability + Latency | % of requests returning 2xx, P99 latency |
| POST /refunds | Availability + Latency | Same — refunds are user-initiated |
| Webhook delivery | Delivery success rate | % delivered within 30s of event |
Step 2 — Recommended SLO targets (starting point):
```yaml
# payments-api SLOs
charge_availability:
  target: 99.9%            # 43.2 min downtime/month
  window: 30-day rolling
  measurement: (successful_charges / total_charge_attempts) * 100

charge_latency_p99:
  target: 99%              # of requests
  threshold: 2000ms        # charges involve Stripe network calls
  window: 30-day rolling

webhook_delivery:
  target: 99.5%            # delivered within 30 seconds
  window: 30-day rolling
```
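These targets can be wired to live metrics with a Prometheus recording rule. A minimal sketch for the availability SLI, assuming the request counters used in the burn-rate alerts below (adjust metric names to your instrumentation; many teams record a short window like 5m and let their SLO tooling aggregate, since 30d range queries are expensive):

```yaml
groups:
  - name: payments-slo
    rules:
      # Precompute the charge availability ratio for dashboards and alerts
      - record: slo:charge_availability:ratio_rate5m
        expr: |
          sum(rate(payment_requests_successful_total[5m]))
          / sum(rate(payment_requests_total[5m]))
```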
Step 3 — Error budget policies:
- Error budget >50% remaining: Normal deployment velocity
- Error budget 25-50% remaining: Code freeze on payment-critical path, SRE review required
- Error budget <25% remaining: Full code freeze, incident review mandatory
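The policy tiers above can be sketched as a small lookup. This is illustrative, assuming failure counts come from your own metrics pipeline; the threshold values mirror the list above:

```python
def budget_remaining(successful: int, total: int, slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    allowed_failures = total * (1 - slo)
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures

def deployment_policy(remaining: float) -> str:
    """Map remaining budget fraction to the policy tiers above."""
    if remaining > 0.50:
        return "normal deployment velocity"
    if remaining >= 0.25:
        return "freeze payment-critical path; SRE review required"
    return "full code freeze; incident review mandatory"
```

For example, 2,000 failures out of 5M requests against a 99.9% SLO leaves 60% of the budget, so deployments proceed normally.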
Step 4 — Burn rate alerts (multi-window):
```yaml
# Page immediately: burning budget 14x too fast over 1h
- alert: PaymentSLOCritical
  expr: |
    (1 - rate(payment_requests_successful_total[1h])
       / rate(payment_requests_total[1h])) > 14 * 0.001
  labels:
    severity: page

# Ticket: burning budget 6x too fast over 6h
- alert: PaymentSLOWarning
  expr: |
    (1 - rate(payment_requests_successful_total[6h])
       / rate(payment_requests_total[6h])) > 6 * 0.001
  labels:
    severity: ticket
```
Start conservative — 99.9% is achievable. Review after 90 days of data and tighten.
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Larger models produce more detailed, production-ready outputs.
Context Window
This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.
All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Good to Know
Advanced guide and reference material for Observability Designer. Background, edge cases, and patterns worth understanding.
SLI vs SLO vs SLA
These three terms are often used interchangeably in the wild, which causes real operational confusion.
- SLI (Service Level Indicator): A specific, quantitative measurement of service behavior. Must be a ratio or percentage: (good events / total events) * 100. Examples: request success rate, P99 latency below threshold, queue drain rate.
- SLO (Service Level Objective): A target value for an SLI, with a time window. Example: "99.9% of requests succeed over a 30-day rolling window." The SLO is internal — it's the engineering commitment.
- SLA (Service Level Agreement): A contractual promise to customers, usually with financial penalties. SLAs are typically weaker than your internal SLO (e.g., SLO: 99.9%, SLA: 99.5%) to give you a buffer before breaching the contract.
Why availability SLIs alone are insufficient: A service that returns 200 OK in 30 seconds is "available" but unusable. Production systems need at minimum two SLIs per critical endpoint: availability (success ratio) and latency (P99 below a threshold). Webhook or async systems also need a delivery SLI measuring time-to-delivery, not just success rate.
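The two-SLI minimum can be sketched as a pair of ratio computations over raw request records. The `Request` shape here is illustrative, not a real client API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency

def availability_sli(requests: list[Request]) -> float:
    """% of requests returning 2xx."""
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return 100 * good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 2000) -> float:
    """% of requests completing under the latency threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100 * fast / len(requests)
```

Note the 30-second 200 OK scores 100% on availability but drags the latency SLI down, which is exactly why both are needed.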
Multi-Window Burn Rate Alerting
Single-window burn rate alerts generate too many false positives. A 1-hour spike that resolves itself still pages you — and over time, on-call engineers learn to ignore it.
The solution is requiring confirmation from two windows simultaneously:
| Window | Burn Rate Multiplier | What it catches |
|---|---|---|
| 1 hour | 14x | Fast burns — major outages |
| 6 hours | 6x | Sustained burns — slow degradation |
The formula: For a 99.9% SLO (0.1% error budget) and a 30-day window:
Burn rate = observed error rate / error budget rate (e.g., a 1.4% error rate against a 0.1% budget is a 14x burn)
Budget at 1x burn = 30 days = 720 hours
Projected exhaustion = 720 hours / burn rate; the alert fires when this falls below your threshold
A 14x burn rate on a 1h window means: if this keeps up, you'll exhaust your entire monthly error budget in 720 / 14 ≈ 51 hours. That warrants a page.
A 6x burn rate on a 6h window means: exhaustion in 720 / 6 = 120 hours (~5 days). That warrants a ticket, not a page.
Why dual-window works: the two alerts divide the problem. The 1h/14x page fires only on severe, fast burns, so brief low-grade spikes never page anyone; the 6h/6x ticket catches slow chronic burns that would never cross the aggressive 1h threshold. (The full Google SRE recipe additionally pairs each long window with a short confirmation window — e.g., 5m alongside 1h — that must exceed the same threshold simultaneously, so alerts clear quickly once the burn stops.)
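The arithmetic above can be sketched directly. The error rates are assumed inputs, e.g., the ratios computed by the PromQL expressions in the alert examples:

```python
BUDGET = 1 - 0.999       # 0.1% error budget for a 99.9% SLO
WINDOW_HOURS = 30 * 24   # 720h budget window

def burn_rate(error_rate: float) -> float:
    """How many times faster than sustainable we are spending budget."""
    return error_rate / BUDGET

def hours_to_exhaustion(error_rate: float) -> float:
    """Hours until the budget is gone if this rate persists."""
    return WINDOW_HOURS / burn_rate(error_rate)

def alert(error_rate_1h, error_rate_6h):
    """Map the two observed windows to a severity, per the table above."""
    if burn_rate(error_rate_1h) > 14:
        return "page"    # projected exhaustion in ~51h or less
    if burn_rate(error_rate_6h) > 6:
        return "ticket"  # projected exhaustion in ~120h or less
    return None
```

A 1.4% error rate sustained over 1h gives `hours_to_exhaustion(0.014)` of about 51 hours, matching the worked figure above.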
Error Budget Arithmetic
From SLO to monthly budget:
SLO: 99.9%
Error budget: 100% - 99.9% = 0.1%
Monthly request volume (example): 5,000,000 requests
Allowable failures: 5,000,000 * 0.001 = 5,000 failed requests
In time: 0.001 * 30 days * 24 hours/day * 60 min/hour = 43.2 minutes of total downtime
What "burning 5% of budget in 1 hour" means operationally:
5% of 5,000 = 250 failed requests in 1 hour
That's about 4.2 failures/minute sustained
Remaining budget: 4,750 failed requests for the rest of the month
At a current request rate of 1,000 req/min, those 250 failures are the equivalent of just 15 seconds of full outage — and a complete outage of only five minutes would consume the entire monthly budget of 5,000 failures. Error budget meetings become much more concrete when the team sees these numbers, not just percentages.
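The budget arithmetic in this section, as a sketch (the SLO, window, and volumes are the example figures above):

```python
def allowed_failures(slo: float, monthly_requests: int) -> float:
    """Failed requests the budget permits per month."""
    return monthly_requests * (1 - slo)

def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Full-outage-equivalent downtime allowed per window."""
    return (1 - slo) * window_days * 24 * 60

def outage_equivalent_minutes(failures: float, req_per_min: float) -> float:
    """Minutes of total outage that would produce this many failures."""
    return failures / req_per_min
```

For example, `outage_equivalent_minutes(250, 1000)` converts the 250-failure hour into its 0.25-minute full-outage equivalent.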
Platform Comparison
| | Prometheus + Grafana | Datadog | New Relic | OpenTelemetry |
|---|---|---|---|---|
| Data retention | Configurable (default 15d), storage cost is yours | 15 months (metrics), 30d (logs, default) | 8d (default), up to 395d paid | Vendor-dependent (it's a standard, not a backend) |
| Cardinality limits | No hard limit — your hardware is the limit | Hard limit per account; high-cardinality labels cause ingestion drops | ~100k unique time series per account before throttling | No limit (backend determines this) |
| Cost model | Infrastructure cost only; free at small scale | Per-host + per-custom-metric; costs escalate fast with many services | Data ingest (GB) + query compute; unpredictable at scale | Depends on backend chosen (often Grafana Cloud, Honeycomb, etc.) |
| Self-hosted | Yes — full control | No | No | Yes (collector) + SaaS backends |
| Best for | Teams with SRE capacity to operate infra; cost-sensitive orgs | Teams wanting full-stack APM with minimal setup; budget > $5k/mo | Application-centric monitoring; strong auto-instrumentation | Teams building vendor-portable pipelines; greenfield |
Cardinality note: Prometheus cardinality issues surface as high memory usage on the server and slow query times. A label like user_id on a high-traffic metric will OOM a Prometheus server fast. Use recording rules to pre-aggregate before storing.
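That pre-aggregation looks like the following sketch, assuming a hypothetical `http_requests_total` counter carrying a `user_id` label: record the rate with the label summed away, then point dashboards and alerts at the recorded series instead of the raw one.

```yaml
groups:
  - name: cardinality-control
    rules:
      # Keep aggregate request rates; drop the high-cardinality user_id label
      - record: job:http_requests:rate5m
        expr: sum without (user_id) (rate(http_requests_total[5m]))
```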
Ready to try Observability Designer?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$skillfish add ~skills/observability-designer