SRE Engineer
Site Reliability Engineering: SLOs/SLIs/error budgets, toil reduction, capacity planning, incident management, postmortem culture, and reliability automation.
A Google SRE-principles skill for teams that need to balance velocity and reliability systematically. It covers the full SRE practice: defining meaningful SLOs and SLIs, error budget policies, on-call rotation design, incident runbooks, blameless postmortem facilitation, and automating toil away. Opinionated on what to measure and what not to page on.
Added Mar 20, 2026
$ npx skills add johnefemer/skillfish --skill sre-engineer
What This Skill Can Do
Concrete capabilities you get when you install this skill.
Define SLIs, SLOs, and error budget policies aligned to user journeys
Design on-call rotations, escalation policies, and runbooks for common failure modes
Facilitate blameless postmortems and extract actionable follow-up items
Identify and eliminate toil with automation, self-healing scripts, and operator patterns
Plan capacity using load testing results, growth projections, and headroom targets
Build reliability dashboards with burn rate alerts and error budget burn-down tracking
Real Conversations
See how the SRE Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.
We run a payment processing API. How do we define SLOs? I've heard about SLIs and error budgets but not sure how to connect them to something actionable.
Here's the full chain for a payment API:
Step 1 — Choose SLIs (what to measure) For a payment API, measure what users care about:
- Availability: successful_requests / total_requests
- Latency: % of requests completing under 500ms (at a 99% target, this means p99 under 500ms)
- Correctness: % of transactions with correct outcome (requires audit log)
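As a rough illustration of the first two SLIs, here is a minimal Python sketch that computes them from a batch of request records. The `RequestSample` type, the rule that any non-5xx response counts as successful, and the choice to report 1.0 when there is no traffic are all assumptions made for this example; they are not part of the skill itself.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestSample:
    """Hypothetical record of one API request (illustrative only)."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # end-to-end latency in milliseconds


def availability_sli(samples: Sequence[RequestSample]) -> float:
    """successful_requests / total_requests, where 'successful' is
    assumed to mean any response that is not a 5xx."""
    if not samples:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for s in samples if s.status < 500)
    return good / len(samples)


def latency_sli(samples: Sequence[RequestSample],
                threshold_ms: float = 500.0) -> float:
    """Fraction of requests that completed under the latency threshold."""
    if not samples:
        return 1.0  # same no-traffic assumption as above
    fast = sum(1 for s in samples if s.latency_ms < threshold_ms)
    return fast / len(samples)


if __name__ == "__main__":
    batch = [
        RequestSample(200, 120.0),
        RequestSample(200, 480.0),
        RequestSample(500, 30.0),
        RequestSample(200, 900.0),
    ]
    print(f"availability SLI: {availability_sli(batch):.2%}")
    print(f"latency SLI:      {latency_sli(batch):.2%}")
```

In practice these ratios would normally come from your metrics system rather than raw records, but the arithmetic is the same.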
Step 2 — Set SLOs (the target)
SLOs:
availability: 99.9% # allows ~43.2 min downtime per 30-day month
latency_p99: 99.0% # 99% of requests under 500ms
Start conservative. 99.9% is usually right for internal payment APIs; 99.95% for customer-facing.
Step 3 — Error budget (the operational lever)
Monthly error budget = (1 - 0.999) * 30 days * 24h * 60min
= 43.2 minutes of allowed downtime
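The budget arithmetic can be checked with a short Python sketch. The function names and the 30-day default window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(downtime_minutes: float, slo: float,
                    window_days: float = 30.0) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    for target in (0.999, 0.9995):
        minutes = error_budget_minutes(target)
        print(f"SLO {target:.2%}: {minutes:.1f} min per 30 days")
```

For a 99.9% target over exactly 30 days this gives about 43.2 minutes. The commonly quoted figure of 43.8 minutes corresponds to an average calendar month (365.25 / 12 days) rather than exactly 30 days.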
Step 4 — Error budget policy (what happens when budget burns)
More than 50% consumed in the first 2 weeks → freeze non-critical deploys
100% or more consumed → freeze all feature work, incident review required
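A policy like this is easy to encode so that a deploy pipeline or dashboard can evaluate it automatically. The sketch below is one possible encoding in Python. The action names are hypothetical, and treating a fully spent budget (exactly 100%) as the trigger for the full freeze is an assumption.

```python
def policy_action(budget_consumed: float, days_elapsed: int) -> str:
    """Map error budget state to a policy action.

    budget_consumed: fraction of the monthly budget spent (1.0 = exhausted)
    days_elapsed:    days since the start of the budget window
    """
    if budget_consumed >= 1.0:
        # Budget exhausted: stop feature work and require an incident review.
        return "freeze_all_feature_work"
    if budget_consumed > 0.5 and days_elapsed <= 14:
        # More than half the budget gone in the first two weeks.
        return "freeze_non_critical_deploys"
    return "normal_operations"


if __name__ == "__main__":
    print(policy_action(budget_consumed=0.6, days_elapsed=10))
    print(policy_action(budget_consumed=0.6, days_elapsed=20))
    print(policy_action(budget_consumed=1.1, days_elapsed=25))
```

Checking the exhausted case first keeps the two rules from overlapping: once the budget is fully spent, the stricter action always wins.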
Step 5 — Burn rate alerts (don't wait until 100%)
# Alert if burning budget 14x faster than sustainable (would use 100% of a 30-day budget in about 2 days)
sum(rate(http_requests_total{status=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h])) > 0.014
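To see where a threshold such as 0.014 comes from, it helps to work the burn rate arithmetic by hand. The Python sketch below assumes a 30-day budget window and a 99.9% availability SLO; the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the window.
    """
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until the whole budget is gone if this burn rate persists."""
    return window_days * 24 / rate


def should_page(error_ratio: float, slo: float,
                threshold: float = 14.0) -> bool:
    """True when the observed error ratio exceeds the burn rate threshold."""
    return burn_rate(error_ratio, slo) > threshold


if __name__ == "__main__":
    slo = 0.999
    for ratio in (0.001, 0.014, 0.05):
        rate = burn_rate(ratio, slo)
        print(f"error ratio {ratio:.3f}: burn rate {rate:.1f}x, "
              f"budget lasts {hours_to_exhaustion(rate):.0f} h")
```

With a 99.9% SLO the sustainable error ratio is 0.001, so a 14x burn rate corresponds to an error ratio of about 0.014, and at that pace a 30-day budget lasts roughly 51 hours.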
The key insight: SLOs make reliability conversations concrete — instead of "the API feels slow," you have "we've burned 60% of our latency budget this week."
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Works well with Claude Sonnet or GPT-4o. SLO math and PromQL generation are reliable across frontier models.
Context Window
SKILL.md is ~9KB. Load current alert rules and SLO definitions in context for effective tuning sessions.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Works Great With
These skills complement SRE Engineer for end-to-end coverage. Install them together for better results.
Ready to try SRE Engineer?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$ npx skills add johnefemer/skillfish --skill sre-engineer