Incident Commander

Incident response framework from detection through resolution and post-incident review.

Real Conversations

See how the Incident Commander skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

P1 — Postgres primary is down, replica is lagging 45s, app throwing 500s. 3,000 users affected. What do I do right now?

Agent

INCIDENT ACTIVE — Execute in order:

T+0 (next 2 min):

Page on-call DBA + backend lead via PagerDuty
Open #inc-YYYY-MM-DD-db-outage in Slack — post current status immediately
Assign roles: Incident Commander (you), Tech Lead, Communications Lead, Scribe

T+2 — Failover decision:

# Check replica lag before promoting
psql -h replica-host -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"
# 45s lag = ~45s of potential data loss — confirm with engineering lead before proceeding

# Promote (RDS)
aws rds promote-read-replica --db-instance-identifier myapp-replica
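
The lag check above returns an interval, which is awkward to gate on in a script. A minimal sketch of a numeric gate follows, assuming a hypothetical 60-second data-loss budget and the `replica-host` placeholder from the commands above. The function names are illustrative, and the budget itself is a decision for your engineering lead, not a recommendation.

```shell
# Sketch: gate replica promotion on measured replay lag.
# The 60s budget used below is illustrative only.

# Returns 0 when lag (whole seconds) is within the budget, non-zero otherwise.
# An empty or non-numeric lag also fails, so a NULL result never passes.
lag_within_budget() {
  lag_seconds=$1
  max_seconds=$2
  [ "$lag_seconds" -le "$max_seconds" ]
}

# Ask the replica for its replay lag, rounded to whole seconds.
# Defined but not called here: it needs network access to the replica.
replica_lag_seconds() {
  psql -h "$1" -At -c \
    "SELECT ROUND(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()))::int;"
}

# Example with the 45s lag from the scenario and a 60s budget:
if lag_within_budget 45 60; then
  echo "lag within budget: get sign-off, then promote"
else
  echo "lag exceeds budget: escalate before promoting"
fi
```

In a live incident you would pass `"$(replica_lag_seconds replica-host)"` as the first argument instead of a literal value.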

T+5 — Restore traffic:

# Point the app at the promoted replica; changing the env var triggers a rolling update by itself
kubectl set env deployment/api DATABASE_URL="postgres://replica-host/myapp"
kubectl rollout status deployment/api  # watch recovery

T+10 — Customer communication:

"We identified a database outage at [time]. We have failed over to our standby and services are recovering. We will provide an update in 20 minutes."

Now: Set a 48h reminder for the post-incident review. Collect timeline data in this Slack channel while it's fresh — don't wait.

You

Cloudflare is serving stale JS bundles after a deploy — users getting blank screens. Half the team is EU, half US. How do I structure this incident?

Agent

Incident Structure for Distributed Team:

T+0 — Activate:

Page EU lead (currently on shift) as Tech Commander; US lead on standby
Status page: "Investigating reports of blank screens for some users"
Open #inc-cdn-stale-cache, pin the incident doc

T+2 — Diagnose:

# Check whether Cloudflare is serving the bundle from its cache
curl -sI https://yourapp.com/static/main.abc123.js | grep -i cf-cache-status
# HIT = served from Cloudflare's cache (possibly stale) | MISS/BYPASS = fetched from origin

# Find the bundle filename the live page references, then compare it with the hash your deploy produced
curl -s https://yourapp.com/ | grep -oE 'main\.[a-f0-9]+\.js'
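
The comparison above can be scripted so the result is a pass/fail signal and nobody has to eyeball two hashes. This is a rough sketch: the function names are hypothetical, and it assumes bundles follow the `main.<hex>.js` naming used in the example.

```shell
# Sketch: extract the hashed bundle name referenced by an HTML page.
# Reads HTML on stdin and prints the first match, e.g. main.abc123.js.
extract_bundle() {
  grep -oE 'main\.[a-f0-9]+\.js' | head -n 1
}

# Compare the bundle the page references with the one the deploy produced.
# Both arguments are bundle filenames; returns 0 when they match.
bundle_is_current() {
  [ "$1" = "$2" ]
}

# Offline demonstration using a literal HTML fragment.
served=$(printf '<script src="/static/main.abc123.js"></script>\n' | extract_bundle)
echo "served bundle: $served"
```

Against the live site, the input would come from `curl -s https://yourapp.com/ | extract_bundle`, and the second argument to `bundle_is_current` from your build output.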

T+5 — Mitigation (fastest to slowest):

Purge by URL (~2 min): Cloudflare Dashboard → Caching → Custom Purge → paste JS/CSS asset URLs
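Purge everything (~5 min): curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" -H "Authorization: Bearer $CF_API_TOKEN" -H "Content-Type: application/json" -d '{"purge_everything":true}' (CF_API_TOKEN is a placeholder for an API token with cache purge permission)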
Fix deploy hook (next deploy): add Cloudflare cache purge API call as last deploy step
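
The deploy-hook fix could look roughly like the sketch below, which purges only the assets that changed. `CF_ZONE_ID` and `CF_API_TOKEN` are assumed environment variables supplied by your CI system, and the function names are illustrative. The payload builder is deliberately naive and assumes URLs contain no double quotes or backslashes.

```shell
# Sketch: build the JSON body for a purge-by-URL request.
# Usage: build_purge_payload URL [URL...]
build_purge_payload() {
  payload='{"files":['
  sep=''
  for url in "$@"; do
    payload="${payload}${sep}\"${url}\""
    sep=','
  done
  printf '%s]}\n' "$payload"
}

# Post-deploy hook body. Defined but not called here: it needs network
# access plus CF_ZONE_ID and CF_API_TOKEN (assumed CI-provided variables).
purge_assets() {
  curl -sS -X POST \
    "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/purge_cache" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "$(build_purge_payload "$@")"
}

# Show the request body for a single asset.
build_purge_payload "https://yourapp.com/static/main.abc123.js"
```

Calling `purge_assets` with the changed asset URLs as the last deploy step keeps the purge scoped, which avoids the cache-miss spike a full purge can cause.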

EU/US handoff:

EU runs mitigation, US monitors rollout 30 min later during their peak
Decision authority stays with the IC — timezone never splits command
Zoom bridge stays open 1h post-resolution for cross-tz debrief

Root cause question for PIR: Why didn't the deploy hook call the purge API? Check your CI pipeline's post-deploy steps.

You

We're 3 hours into an incident. Payment processing is down, 40% of customers affected, we have a fix in progress but need another 2 hours. CEO is asking for updates, sales is getting customer calls, and social media is starting to pick up. How do I manage communications?

Agent

Three hours in with 2 more to go — you're past the "acute response" phase and into the sustained-incident communication problem. The technical work is in progress; your job right now is information hygiene across five different audiences who each need something different.

Communication Hierarchy

Execute in this order — internal first, then outward:

1. Internal Slack (#inc-channel): continuous, every 30 min minimum
2. Status page: update immediately, then every 30 min
3. CEO/exec brief: concise, no technical jargon, focus on customer impact and timeline
4. Sales/Support: give them talking points, not raw status
5. Public (social/email): only if social media traction requires it

Never reverse this order. Don't post on Twitter before your own support team has context.

The 30-Minute Cadence Rule

Set a recurring 30-minute timer. On each tick:

Post an update to #inc-channel even if nothing has changed — "No change, still on track for the T+2h fix" is a valid update
Refresh the status page timestamp
Decide whether the exec brief needs an update (usually every 60 min is sufficient unless something changes)

Silence during an active incident creates rumor. A "nothing new to report" update is actively valuable.
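
The cadence amounts to simple clock arithmetic, so the "Next update" field in each message can be computed and not guessed. A small sketch with a hypothetical helper name, assuming 24-hour HH:MM input:

```shell
# Sketch: compute the next update time on a fixed cadence.
# Usage: next_update HH:MM MINUTES  (24-hour clock, wraps at midnight)
next_update() {
  hh=${1%%:*}
  mm=${1##*:}
  # Strip one leading zero so 08 and 09 are not parsed as octal.
  hh=${hh#0}
  mm=${mm#0}
  total=$(( (hh * 60 + mm + $2) % 1440 ))
  printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

next_update 14:30 30   # prints 15:00
```

The same helper covers the 60-minute exec cadence: `next_update 14:30 60` gives the time to promise in the CEO brief.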

Template Language

Internal Slack update (every 30 min):

🔴 INCIDENT UPDATE — 14:30
Status: Fix in progress, deployment expected ~16:00
Impact: Payment processing degraded, ~40% of customers affected
Transactions affected: ~3,200 since 11:00
Root cause: [one sentence — e.g. "Queue worker OOM after config deploy"]
Current work: [eng name] deploying patched queue worker to staging now
Next update: 15:00
IC: @you | Tech lead: @name | Scribe: @name

Status page update:

Identified: Payment Processing Degradation
We are experiencing degraded payment processing affecting approximately 40% of customers.
Our engineering team has identified the root cause and a fix is in progress.
We expect to restore full service by approximately 4:00 PM PT.
We will provide updates every 30 minutes.
Last updated: 2:30 PM PT

CEO summary (60 min cadence, or on material change):

Subject: Incident Update — Payment Outage [14:30]

Situation: Payment processing has been degraded since 11:00 AM (~3.5 hours).
Customer impact: ~40% of customers affected; ~3,200 transactions impacted.
Fix status: Engineering is deploying a fix to staging now. Production deployment is targeting 4:00 PM.
Customer communication: Status page is live. We have not sent a proactive customer email yet.
Recommend: If this extends past 4:30 PM, we should send a customer email. Prepared draft ready.

Next update from me: 3:30 PM or sooner if anything changes.

For Sales/Support (give them a script, not raw status):

Talking points for customer calls:
- "We are aware of a payment processing issue affecting some customers."
- "Our engineering team is actively working on a fix and we expect resolution by approximately 4:00 PM PT."
- "Your payment data is safe — this is a processing delay, not a data issue."
- "We will send a follow-up email once service is fully restored."
Do NOT quote exact customer counts. Do NOT promise a specific fix time unless confirmed by IC.

On Estimating Resolution Time

Only publish a resolution estimate if your tech lead has high confidence in it. "Fix in progress" without a time is better than a missed estimate. Missed estimates destroy trust faster than silence does.

If you do publish a time, add 30% buffer and add "approximately" — then tell your tech lead that the public estimate is set.
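
As a worked example of the buffer rule, a 2-hour engineering estimate becomes 2 hours 36 minutes once padded by 30%. A minimal sketch follows; the helper name is hypothetical and integer division rounds down to whole minutes.

```shell
# Sketch: pad an engineering estimate before publishing it.
# Usage: buffered_minutes ESTIMATE_MINUTES [BUFFER_PERCENT]
# BUFFER_PERCENT defaults to 30, matching the rule above.
buffered_minutes() {
  pct=${2:-30}
  echo $(( $1 * (100 + pct) / 100 ))
}

buffered_minutes 120   # prints 156
```

Round the result up to a clean clock time before publishing it, and keep "approximately" in front of it.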

Incident Severity and Notification Tiers

Severity	Trigger	Who Gets Notified
P1	>10% customers, payment/auth, data loss risk	On-call SRE, eng lead, VP Eng, CEO, Legal/Compliance if data involved
P2	Degraded feature, 1-10% customers	On-call SRE, eng lead, VP Eng
P3	Minor degradation, <1% customers	On-call SRE, eng lead

Payment processing down at 40% is unambiguously P1. CEO notification is correct.
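
The table can be encoded as a quick triage helper so the severity call is not made from memory at 3 a.m. This sketch rests on several assumptions: any single P1 trigger is enough, exactly 10% lands in P2, percentages are whole numbers, and the function name and `critical` flag are hypothetical.

```shell
# Sketch: map impact to a severity tier using the table's thresholds.
# Usage: classify_severity PERCENT_AFFECTED [critical]
# Pass "critical" for payment/auth outages or data-loss risk.
classify_severity() {
  pct=$1
  critical=${2:-}
  if [ "$critical" = "critical" ] || [ "$pct" -gt 10 ]; then
    echo P1
  elif [ "$pct" -ge 1 ]; then
    echo P2
  else
    echo P3
  fi
}

classify_severity 40 critical   # prints P1
```

The tier then selects the notification list from the right-hand column of the table.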

All-Clear and Follow-Up Communication

When the fix is deployed and verified:

Status page (immediate):

Resolved — Payment Processing Fully Restored
Payment processing has been fully restored as of 4:05 PM PT.
All queued transactions have been processed. We will publish a full post-incident review within 5 business days.
We apologize for the disruption.

Customer email (send within 1 hour of resolution, for P1 incidents affecting payment):

Subject: Payment Processing Issue — Resolved

Earlier today, from approximately 11:00 AM to 4:05 PM PT, payment processing was degraded for a portion of our customers.

If you attempted a payment during this window and experienced an error, your payment was not charged. You can safely retry your transaction.

We have fully resolved the issue and all systems are operating normally. We are conducting a full post-incident review and will implement safeguards to prevent recurrence.

We're sorry for the disruption. If you have any questions, contact support at [support link].

Set a reminder for the post-incident review — 48 hours is the right window while the incident is still fresh.

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best

Claude Opus 4.8, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2

Good

Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K: Skill only
32K+: Skill + conversation
100K+: Skill + references + codebase

All current frontier models support 100K+ context.

Pro tips for best results

1. Be specific: Include numbers (users, budget, RPS) so the skill can size the architecture.
2. Share constraints: Compliance needs, team size, and existing stack all improve the output.
3. Iterate: Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
4. Combine skills: Pair with companion skills below for end-to-end coverage.

Ready to try Incident Commander?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
