Engineering · Advanced · SRE · SLOs · Reliability

SRE Engineer

Site Reliability Engineering: SLOs/SLIs/error budgets, toil reduction, capacity planning, incident management, postmortem culture, and reliability automation.

A Google SRE-principles skill for teams that need to balance velocity and reliability systematically. It covers the full SRE practice: defining meaningful SLOs and SLIs, error budget policies, on-call rotation design, incident runbooks, blameless postmortem facilitation, and automating toil away. Opinionated on what to measure and what not to page on.

Added Mar 20, 2026

$ npx skills add johnefemer/skillfish --skill sre-engineer

What This Skill Can Do

Concrete capabilities you get when you install this skill.

Define SLIs, SLOs, and error budget policies aligned to user journeys

Design on-call rotations, escalation policies, and runbooks for common failure modes

Facilitate blameless postmortems and extract actionable follow-up items

Identify and eliminate toil with automation, self-healing scripts, and operator patterns

Plan capacity using load testing results, growth projections, and headroom targets

Build reliability dashboards with burn rate alerts and error budget burn-down tracking

Real Conversations

See how the SRE Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

We run a payment processing API. How do we define SLOs? I've heard about SLIs and error budgets but not sure how to connect them to something actionable.

Agent

Here's the full chain for a payment API:

Step 1 — Choose SLIs (what to measure)

For a payment API, measure what users care about:

  • Availability: successful_requests / total_requests
  • Latency: % of requests completing under 500ms (a 99% target on this SLI is equivalent to p99 < 500ms)
  • Correctness: % of transactions with correct outcome (requires audit log)
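As a rough illustration, the availability and latency SLIs above can be computed from raw request records. This is a minimal sketch: the `Request` record, the choice to count only 5xx responses as failures, and the 500 ms default threshold are assumptions made for the example, not part of any particular monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests):
    """Fraction of requests that did not fail with a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no failures
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests that completed under the latency threshold."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no slow requests
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Hypothetical sample: one server error and one slow request out of four.
sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),
    Request(201, 650.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

In production these ratios would normally come from metrics counters rather than per-request records, but the definition of "good events / total events" is the same.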

Step 2 — Set SLOs (the target)

SLOs:
  availability:  99.9%   # allows 43.2 min downtime per 30-day month
  latency_p99:   99.0%   # 99% of requests under 500ms

Start conservative. 99.9% is usually right for internal payment APIs; 99.95% for customer-facing.

Step 3 — Error budget (the operational lever)

Monthly error budget = (1 - 0.999) * 30 days * 24h * 60min
                     = 43.2 minutes of allowed downtime
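The same arithmetic can be wrapped in a small helper so the budget is recomputed whenever the SLO or the window changes. This is a sketch only: the function names and the 30-day default window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime allowed by an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(bad_minutes, slo, window_days=30):
    """Fraction of the error budget used up by the observed bad minutes."""
    return bad_minutes / error_budget_minutes(slo, window_days)


# Compare the two availability targets discussed above.
for target in (0.999, 0.9995):
    print(target, round(error_budget_minutes(target), 2))
```

Tightening the target from 99.9% to 99.95% halves the budget, which is why the choice of target should follow what users actually need rather than what sounds impressive.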

Step 4 — Error budget policy (what happens when budget burns)

> 50% consumed in first 2 weeks → freeze non-critical deploys
≥ 100% consumed (budget exhausted) → freeze all feature work, incident review required
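A policy like this is easiest to enforce when it is encoded as a function that a deploy pipeline can call. The sketch below is one possible encoding: the function name, the returned action strings, and the decision to treat exactly 100% consumption as exhausted are all assumptions.

```python
def deploy_policy(budget_consumed, days_elapsed):
    """Map error budget consumption (as a fraction) to a deploy action.

    budget_consumed: fraction of the monthly budget used so far (1.0 = 100%).
    days_elapsed: days since the start of the budget window.
    """
    if budget_consumed >= 1.0:
        # Budget exhausted: stop feature work and require an incident review.
        return "freeze_all_feature_work"
    if budget_consumed > 0.5 and days_elapsed <= 14:
        # More than half the budget gone in the first two weeks.
        return "freeze_non_critical_deploys"
    return "normal"


print(deploy_policy(0.6, 10))  # freeze_non_critical_deploys
print(deploy_policy(0.6, 20))  # normal
print(deploy_policy(1.2, 25))  # freeze_all_feature_work
```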

Step 5 — Burn rate alerts (don't wait until 100%)

# Alert if burning budget 14x faster than sustainable for a 99.9% SLO
# (at that rate a 30-day budget is exhausted in about 2 days)
sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h])) > 0.014
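To see where an alert threshold like this comes from, it helps to work through the burn rate arithmetic directly. The helpers below are a sketch under assumed names and a 30-day budget window: burn rate is the observed error ratio divided by the error ratio the SLO allows.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until the whole budget is gone if the burn rate holds."""
    return window_days * 24 / rate


def budget_fraction_used(rate, duration_hours, window_days=30):
    """Fraction of the window's budget consumed at this rate for this long."""
    return rate * duration_hours / (window_days * 24)


rate = burn_rate(0.014, 0.999)        # the threshold used in the alert query
print(round(rate, 1))
print(round(hours_to_exhaustion(rate), 1))
print(round(budget_fraction_used(rate, 1.0), 3))
```

Running the numbers this way makes it easier to choose a threshold and window pair deliberately, for example a fast-burn page alongside a slower ticket-level alert.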

The key insight: SLOs make reliability conversations concrete — instead of "the API feels slow," you have "we've burned 60% of our latency budget this week."

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Works well with Claude Sonnet or GPT-4o. SLO math and PromQL generation are reliable across frontier models.

Context Window

SKILL.md is ~9KB. Load current alert rules and SLO definitions in context for effective tuning sessions.

Pro tips for best results

1. Be specific: include numbers (users, budget, RPS) so the skill can size the architecture.

2. Share constraints: compliance needs, team size, and existing stack all improve the output.

3. Iterate: start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4. Combine skills: pair with companion skills below for end-to-end coverage.

Works Great With

These skills complement SRE Engineer for end-to-end coverage. Install them together for better results.

$ skillfish add johnefemer/skillfish --all # install all skills at once
