Engineering, Advanced, SRE, SLOs, Reliability

SRE Engineer

Site Reliability Engineering: SLOs/SLIs/error budgets, toil reduction, capacity planning, incident management, postmortem culture, and reliability automation.

A Google SRE-principles skill for teams that need to balance velocity and reliability systematically. It covers the full SRE practice: defining meaningful SLOs and SLIs, error budget policies, on-call rotation design, incident runbooks, blameless postmortem facilitation, and automating toil away. Opinionated on what to measure and what not to page on.

Added Mar 20, 2026

What This Skill Can Do

Concrete capabilities you get when you install this skill.

Define SLIs, SLOs, and error budget policies aligned to user journeys

Design on-call rotations, escalation policies, and runbooks for common failure modes

Facilitate blameless postmortems and extract actionable follow-up items

Identify and eliminate toil with automation, self-healing scripts, and operator patterns

Plan capacity using load testing results, growth projections, and headroom targets

Build reliability dashboards with burn rate alerts and error budget burn-down tracking

Real Conversations

See how the SRE Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

We run a payment processing API. How do we define SLOs? I've heard about SLIs and error budgets but not sure how to connect them to something actionable.

Agent

Here's the full chain for a payment API:

Step 1 — Choose SLIs (what to measure)

For a payment API, measure what users care about:

Availability: successful_requests / total_requests
Latency: % of requests completing in under 500ms (with a 99% target, this is equivalent to p99 < 500ms)
Correctness: % of transactions with correct outcome (requires audit log)
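As a rough illustration of the first two SLIs, the sketch below computes them from a list of request records. The `Request` type, its field names, and the choice to count any non-5xx response as successful are assumptions made for this example, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests):
    """Fraction of requests that did not fail with a 5xx status (assumed definition of success)."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests that completed under the latency threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Hypothetical sample data for illustration only.
sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),
    Request(200, 650.0),
]
print(f"availability: {availability_sli(sample):.2%}")
print(f"latency:      {latency_sli(sample):.2%}")
```

In practice these ratios would usually come from a metrics backend rather than raw records; the point is only that each SLI is a good-events over total-events fraction.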

Step 2 — Set SLOs (the target)

SLOs:
  availability:  99.9%   # allows 43.8 min downtime/month
  latency_p99:   99.0%   # 99% of requests under 500ms

Start conservative. 99.9% is usually right for internal payment APIs; 99.95% for customer-facing.

Step 3 — Error budget (the operational lever)

Monthly error budget = (1 - 0.999) * 30.44 days (average month) * 24h * 60min
                     = 43.8 minutes of allowed downtime
                       (43.2 minutes for a fixed 30-day window)
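The same arithmetic can be sketched as a small helper. The function name and the `window_days` parameter are hypothetical; note that a fixed 30-day window gives 43.2 minutes, while an average calendar month of about 30.44 days gives roughly 43.8.

```python
def error_budget_minutes(slo, window_days=30.0):
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


# Compare a few common targets over a 30-day window.
for slo in (0.99, 0.999, 0.9995):
    print(f"{slo:.2%} over 30 days -> {error_budget_minutes(slo):.1f} min")
```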

Step 4 — Error budget policy (what happens when budget burns)

> 50% consumed in first 2 weeks → freeze non-critical deploys
> 100% consumed              → freeze all feature work, incident review required
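One way to make a policy like this mechanical is a small lookup function. This is a minimal sketch: the function name, the return strings, and treating exactly 100% consumed as exhausted are assumptions for illustration.

```python
def budget_policy_action(consumed_fraction, days_elapsed):
    """Map error budget consumption to the policy action described above.

    consumed_fraction: share of the monthly budget already spent (1.0 == 100%).
    days_elapsed: days since the start of the budget window.
    """
    if consumed_fraction >= 1.0:
        # Assumption: a fully spent budget is treated the same as an overspent one.
        return "freeze all feature work; incident review required"
    if consumed_fraction > 0.5 and days_elapsed <= 14:
        return "freeze non-critical deploys"
    return "normal operations"


print(budget_policy_action(0.6, 10))
print(budget_policy_action(1.2, 25))
print(budget_policy_action(0.3, 10))
```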

Step 5 — Burn rate alerts (don't wait until 100%)

# Alert if burning budget 14x faster than sustainable (would exhaust a 30-day budget in about 2 days)
sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h])) > 0.014
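To sanity-check what a given threshold implies, the relationship between error ratio, burn rate, and time to exhaustion can be sketched as follows. The function names are hypothetical, and the calculation assumes a constant error ratio over a 30-day window.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate, window_days=30.0):
    """Hours until the whole window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


rate = burn_rate(error_ratio=0.014, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in: {hours_to_exhaustion(rate):.1f} h")
```

For a 99.9% SLO, an error ratio of 0.014 corresponds to a burn rate of about 14x, which would use up a 30-day budget in roughly 51 hours.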

The key insight: SLOs make reliability conversations concrete — instead of "the API feels slow," you have "we've burned 60% of our latency budget this week."

You

Our on-call engineer is getting paged 15+ times per week, half of which are at night. Most pages are for high CPU that auto-resolves. The team is burning out. How do we fix this?

Agent

This is a toil and alert quality problem. 15+ pages/week, most of them for a condition that resolves on its own, means your alerting is generating noise, not signal. Fix in three stages:

Stage 1 — Audit alerts this week (2 hours)

Categorize each page from the past 4 weeks:

Auto-resolved without action → delete the alert or raise the threshold
Required a known, repeatable action → write a runbook or automate the fix
Required novel investigation → keep, these are real

Target: eliminate all auto-resolving pages. An alert that fires and resolves on its own is a monitoring misconfiguration.
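The audit itself can be as simple as tagging each page and counting. The sketch below assumes page history exported as dictionaries with hypothetical `alert`, `auto_resolved`, and `known_fix` fields; real paging tools will expose different shapes.

```python
from collections import Counter

# Hypothetical page records; field names are assumptions for illustration.
pages = [
    {"alert": "HighCPU", "auto_resolved": True, "known_fix": False},
    {"alert": "HighCPU", "auto_resolved": True, "known_fix": False},
    {"alert": "DiskFull", "auto_resolved": False, "known_fix": True},
    {"alert": "PaymentErrors", "auto_resolved": False, "known_fix": False},
]


def categorize(page):
    """Assign one of the three audit categories to a page."""
    if page["auto_resolved"]:
        return "delete or raise threshold"
    if page["known_fix"]:
        return "runbook or automate"
    return "keep"


def audit(pages):
    """Count pages per (alert name, category) so the noisiest alerts stand out."""
    return Counter((p["alert"], categorize(p)) for p in pages)


for (alert, category), count in audit(pages).most_common():
    print(f"{count:3d}  {alert:15s} {category}")
```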

Stage 2 — Fix the high CPU alert specifically

# Bad: fires on any CPU spike that lasts just 1 minute
- alert: HighCPU
  expr: cpu_usage > 80
  for: 1m

# Good: fires only if sustained AND correlated with user impact
# (assumes both series carry the same labels; otherwise match with `and on(...)`)
- alert: HighCPU
  expr: cpu_usage > 80 and http_error_rate > 0.01
  for: 10m
  labels:
    severity: warning  # not page-worthy unless impacting users

Stage 3 — Automate the runbook steps

If the runbook says "restart the service," automate it:

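# Auto-remediation script triggered by alert
# --no-headers drops the column header so awk sees only the pod's CPU value (e.g. "950m")
cpu_millicores=$(kubectl top pod "$POD" --no-headers | awk '{print $2}' | tr -d 'm')
if [ "$cpu_millicores" -gt 900 ]; then
  kubectl rollout restart "deployment/$APP"
fi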

Goal: <5 actionable pages/week, zero at night for known issues. Measure page-to-action rate weekly.
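A weekly check against this goal could look like the sketch below. The record fields (`fired_at`, `action_taken`, `known_issue`), the 22:00 to 07:00 night window, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime


def is_night(ts, night_start=22, night_end=7):
    """True if the timestamp falls in the assumed night window (22:00-07:00)."""
    return ts.hour >= night_start or ts.hour < night_end


def weekly_page_report(pages):
    """Summarize one week of pages against the goal stated above."""
    actionable = [p for p in pages if p["action_taken"]]
    night_known = [p for p in pages if p["known_issue"] and is_night(p["fired_at"])]
    rate = len(actionable) / len(pages) if pages else None
    return {
        "total_pages": len(pages),
        "actionable_pages": len(actionable),
        "page_to_action_rate": rate,
        "night_pages_for_known_issues": len(night_known),
        "goal_met": len(actionable) < 5 and not night_known,
    }


# Hypothetical week of pages for illustration only.
week = [
    {"fired_at": datetime(2026, 3, 16, 3, 12), "action_taken": False, "known_issue": True},
    {"fired_at": datetime(2026, 3, 17, 14, 5), "action_taken": True, "known_issue": False},
    {"fired_at": datetime(2026, 3, 18, 23, 40), "action_taken": True, "known_issue": False},
]
print(weekly_page_report(week))
```

Tracking the rate week over week shows whether the alert cleanup is actually reducing noise.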

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Works well with Claude Sonnet or GPT-4o. SLO math and PromQL generation are reliable across frontier models.

Context Window

SKILL.md is ~9KB. Load current alert rules and SLO definitions in context for effective tuning sessions.

Pro tips for best results

1. Be specific: Include numbers (users, budget, RPS) so the skill can size the architecture.

2. Share constraints: Compliance needs, team size, and existing stack all improve the output.

3. Iterate: Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4. Combine skills: Pair with companion skills below for end-to-end coverage.

Works Great With

These skills complement SRE Engineer for end-to-end coverage. Install them together for better results.

Senior DevOps

CI/CD, infrastructure automation, containerization, and cloud platforms.

Tags: DevOps, CI/CD

Observability Designer

SLO design, alert optimization, and dashboard generation.

Tags: Observability, SLO

Incident Commander

Incident response framework from detection through resolution and post-incident review.

Tags: SRE, Incidents

Kubernetes Specialist

Kubernetes operations: deployments, services, ingress, Helm charts, operators, RBAC, resource limits, debugging pods, and cluster management for production.

Tags: Kubernetes, Helm

$ skillfish add johnefemer/skillfish --all # install all skills at once
