Engineering · SRE · Incidents · DevOps

Incident Commander

Incident response framework from detection through resolution and post-incident review.

$ skillfish add ~skills/incident-commander

Real Conversations

See how the Incident Commander skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

P1 — Postgres primary is down, replica is lagging 45s, app throwing 500s. 3,000 users affected. What do I do right now?

Agent

INCIDENT ACTIVE — Execute in order:

T+0 (next 2 min):

  1. Page on-call DBA + backend lead via PagerDuty
  2. Open #inc-YYYY-MM-DD-db-outage in Slack — post current status immediately
  3. Assign roles: Incident Commander (you), Tech Lead, Communications Lead, Scribe
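
The channel naming pattern in step 2 can be generated rather than typed under pressure. A minimal sketch, assuming a hypothetical helper named `incident_channel` (not part of the skill) and a UTC date:

```shell
# Hypothetical helper: build a channel name in the inc-YYYY-MM-DD-<slug> pattern.
# Usage: incident_channel <slug> [date]   (date defaults to today in UTC)
incident_channel() {
  local slug="$1"
  local day="${2:-$(date -u +%Y-%m-%d)}"
  # Lowercase the slug and turn each run of non-alphanumerics into one hyphen.
  slug="$(printf '%s' "$slug" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-')"
  # Trim a leading or trailing hyphen left over from punctuation.
  slug="${slug#-}"; slug="${slug%-}"
  printf 'inc-%s-%s\n' "$day" "$slug"
}

incident_channel "DB outage" 2025-01-15   # prints inc-2025-01-15-db-outage
```

Passing the date explicitly keeps the name stable if the incident runs past midnight.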

T+2 — Failover decision:

# Check replica lag before promoting
psql -h replica-host -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"
# 45s lag means up to ~45s of potential data loss; confirm with the engineering lead before proceeding
# With the primary down this value keeps growing, so treat it as an upper bound

# Promote (RDS)
aws rds promote-read-replica --db-instance-identifier myapp-replica
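
The failover decision above can be made explicit with a small guard. This is only a sketch: `lag_within_budget` and the 60-second budget are hypothetical, and the lag is assumed to be available in seconds (for example via `SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());`).

```shell
# Hypothetical guard: approve promotion only when replica lag is within the
# data-loss budget agreed with the engineering lead.
# Usage: lag_within_budget <lag_seconds> <max_loss_seconds>
lag_within_budget() {
  local lag="$1" budget="$2"
  # awk handles fractional seconds, which plain shell arithmetic cannot.
  awk -v lag="$lag" -v budget="$budget" 'BEGIN { exit !(lag + 0 <= budget + 0) }'
}

# Example values only: 45s observed lag against an assumed 60s budget.
if lag_within_budget 45 60; then
  echo "PROMOTE: lag within budget"
else
  echo "HOLD: lag exceeds budget, escalate before promoting"
fi
```

The guard only reports a decision. Running the promote command stays a deliberate human step.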

T+5 — Restore traffic:

# Point the app at the promoted replica; changing the env var starts a rolling update on its own
kubectl set env deployment/api DATABASE_URL="postgres://replica-host/myapp"
# Usually redundant after set env; run it only if pods still hold the old connection
kubectl rollout restart deployment/api
kubectl rollout status deployment/api  # watch recovery

T+10 — Customer communication:

"We identified a database outage at [time]. We have failed over to our standby and services are recovering. We will provide an update in 20 minutes."

Now: Set a 48h reminder for the post-incident review. Collect timeline data in this Slack channel while it's fresh — don't wait.
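
Timeline notes are easier to reuse in the review if every entry carries a timestamp. A minimal sketch, assuming a hypothetical `log_event` helper and a local file standing in for wherever the Scribe keeps notes:

```shell
# Hypothetical helper: append a UTC-timestamped entry to a timeline file.
# Usage: log_event <file> <message...>
log_event() {
  local file="$1"; shift
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
}

# Example: a temporary file is used here; a real incident would use a shared location.
timeline="$(mktemp)"
log_event "$timeline" "Paged on-call DBA and backend lead"
log_event "$timeline" "Promoted replica after engineering lead sign-off"
cat "$timeline"
```

Tab-separated lines with ISO 8601 timestamps sort correctly and paste cleanly into a post-incident document.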

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best: Claude Opus 4, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2
Good: Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K: Skill only
32K+: Skill + conversation
100K+: Skill + references + codebase

All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.
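
To see why a 3–10 KB file fits comfortably, a rough size-to-token conversion helps. The sketch below assumes about 4 bytes per token, a common rule of thumb for English text rather than an exact tokenizer count. `estimate_tokens` is a hypothetical helper.

```shell
# Rough sizing only: bytes / 4 (rounded up) approximates tokens for English text.
# Usage: estimate_tokens <size_in_bytes>
estimate_tokens() {
  local bytes="$1"
  echo $(( (bytes + 3) / 4 ))
}

estimate_tokens 3072    # prints 768
estimate_tokens 10240   # prints 2560
```

Under that assumption, even a 10 KB SKILL.md takes up roughly a third of an 8K window.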

Pro tips for best results

  1. Be specific. Include numbers (users, budget, RPS) so the skill can size the architecture.
  2. Share constraints. Compliance needs, team size, and existing stack all improve the output.
  3. Iterate. Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
  4. Combine skills. Pair with companion skills below for end-to-end coverage.

Ready to try Incident Commander?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.

$ skillfish add ~skills/incident-commander