Performance Profiler
Node/Python/Go profiling, bundle analysis, and load testing.
$skillfish add ~skills/performance-profiler

Real Conversations
See how the Performance Profiler skill responds to actual requests. These are representative examples of what your AI agent will produce.
Our Express API has P50 at 120ms but P99 at 4,200ms. It's a task management app. The endpoint is GET /api/tasks?projectId=xxx. About 200 tasks per project, fetched with Prisma. I have no idea what's happening at the 99th percentile.
This skill includes executable scripts — I'll need your permission before running them.
A P99 that's 35× your P50 is the classic signature of an intermittent bottleneck — most requests are fast, but some hit something slow. Most likely culprits: N+1 queries, a missing index, or connection pool starvation.
Step 1 — Establish baseline with autocannon:
npx autocannon -c 50 -d 30 'http://localhost:3000/api/tasks?projectId=proj_123'
# Record: P50, P95, P99, RPS
Step 2 — Count DB queries per request (fastest diagnosis):
// Add to Prisma client config in dev
const prisma = new PrismaClient({
  log: [{ emit: 'event', level: 'query' }],
})

// Dev-only shared counter — accurate only when requests are handled one at
// a time; concurrent requests would interleave their counts.
let queryCount = 0
prisma.$on('query', () => queryCount++)

// In your route handler, log it:
app.get('/api/tasks', async (req, res) => {
  queryCount = 0
  const tasks = await getTasksWithDetails(req.query.projectId)
  console.log(`Queries for this request: ${queryCount}`) // Is this 1 or 201?
  res.json(tasks)
})
If this prints 201, you have an N+1 — one query for tasks + one per task for assignee/labels/etc.
Step 3 — Fix N+1 with Prisma include:
// Before: N+1 (separate query per task for assignee)
const tasks = await prisma.task.findMany({ where: { projectId } })
for (const task of tasks) {
  task.assignee = await prisma.user.findUnique({ where: { id: task.assigneeId } })
}

// After: 1 query with a JOIN
const tasks = await prisma.task.findMany({
  where: { projectId },
  include: { assignee: true, labels: true },
})
Step 4 — Check for missing index:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tasks WHERE project_id = 'proj_123' ORDER BY created_at DESC;
-- Look for: "Seq Scan on tasks" → needs index
If you see Seq Scan: CREATE INDEX CONCURRENTLY idx_tasks_project_created ON tasks(project_id, created_at DESC);
Expected result: P99 should drop to <300ms.
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Larger models produce more detailed, production-ready outputs.
Context Window
This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.
All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Good to Know
Advanced guide and reference material for Performance Profiler. Background, edge cases, and patterns worth understanding.
Contents
Profiler Types: Sampling vs Instrumentation
| | Sampling profiler | Instrumentation profiler |
|---|---|---|
| How it works | Interrupts the process at regular intervals and records the current call stack | Inserts measurement code at every function entry/exit |
| Overhead | Low (1–5% typical) | High (10–100% depending on depth) |
| Accuracy | Statistical — misses short-duration calls | Exact — captures every call with precise timing |
| Production use | Yes — py-spy, async-profiler, pprof in continuous mode | No — reserve for dev/staging |
| Blind spots | Functions faster than the sample interval may not appear | None — but the act of measuring changes behavior (observer effect) |
Rule of thumb: Use sampling profilers in production and during load tests. Use instrumentation profilers in development when you need precise call counts or want to find functions that are called millions of times with negligible individual duration (which sampling misses).
Reading Flame Graphs
A flame graph plots call stacks as horizontal bars. Width = time spent. The x-axis is not time — it's sorted alphabetically within each stack level. Left-right position means nothing.
What to look for:
- Wide bars at the top: The hot path — functions consuming the most CPU time. Start your optimization here.
- Tall, narrow stacks: Deep call chains with little time at each level — usually framework dispatch overhead, not your code.
- Wide plateau: A function that is wide at its level but has no children taking significant time — this is where time is actually being spent. Fix this function.
- "Staircases": Recursive calls or deep framework stacks — often not actionable, but worth checking if the recursion is bounded.
On-CPU vs Off-CPU flame graphs:
- On-CPU: Shows where the CPU is busy — pure computation, tight loops, CPU-bound serialization.
- Off-CPU: Shows where threads are waiting — I/O, lock contention, sleep calls, network. A process that's "slow" but shows low CPU in on-CPU profiles is spending time off-CPU. You need off-CPU profiling (async-profiler's --event wall, py-spy's --idle flag) to diagnose this.
Production Profiling Safety
Running a profiler in production is acceptable if you understand the overhead budget:
| Profiler | Overhead | Safe sample interval | Notes |
|---|---|---|---|
| py-spy | ~1–2% | 100ms (default) | Attaches externally via ptrace; no code changes |
| async-profiler (JVM) | ~2–5% | 10ms | Safe at 99th percentile; avoid 1ms in high-traffic |
| pprof (Go) | ~1–3% | Default 10ms | Built into the runtime; enable via HTTP endpoint |
| Node.js --prof | ~10–15% | N/A (continuous) | Not production-safe; use clinic.js or 0x instead |
| Pyroscope (continuous) | ~1–3% | Configurable | Designed for always-on production profiling |
How to profile without impacting p99 latency: Run the profiler against one instance in the load balancer pool, not all of them. Monitor p99 on the profiled instance via your APM before and after attaching. If p99 increases >5%, reduce the sample rate or detach the profiler.
Never attach an instrumentation profiler to a production instance serving real traffic.
P50/P95/P99 vs Average
The average hides everything that matters at scale.
The "average of averages" fallacy: If you have 10 servers each reporting average latency, averaging those averages is statistically wrong unless each server handles the same number of requests. Always aggregate raw request latency, not server-level averages.
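A quick illustration of the fallacy, with made-up traffic numbers:

```javascript
// Two servers with unequal traffic: averaging their averages weights the
// idle, slow server as heavily as the busy, fast one.
const serverA = { requests: 900, avgMs: 100 }   // busy, fast
const serverB = { requests: 100, avgMs: 1000 }  // idle, slow

const avgOfAvgs = (serverA.avgMs + serverB.avgMs) / 2  // 550ms — wrong

// Weight each server's average by its request count (equivalent to
// aggregating the raw request latencies).
const trueAvg =
  (serverA.requests * serverA.avgMs + serverB.requests * serverB.avgMs) /
  (serverA.requests + serverB.requests)                // 190ms — correct

console.log({ avgOfAvgs, trueAvg })
```

The unweighted figure is nearly 3× the true mean here; the gap grows with the traffic imbalance.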
Why tail latency matters:
P50: 120ms — half of users see this or better
P95: 800ms — 1 in 20 requests is this slow
P99: 4,200ms — 1 in 100 requests; at 1,000 req/min = 10 users/minute hitting this
Average: 180ms — looks fine on a dashboard
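The same effect, computed: nearest-rank percentiles over an illustrative bimodal sample show the mean sitting near the fast cluster while P99 lands on the outliers (the numbers are synthetic, not the API from the conversation above):

```javascript
// Nearest-rank percentile on a pre-sorted array.
function percentile(sorted, p) {
  const idx = Math.ceil((p / 100) * sorted.length) - 1
  return sorted[Math.max(0, idx)]
}

// 98 fast requests around 100–200ms, plus 2 slow ones at ~4s (e.g. cache misses).
const latencies = []
for (let i = 0; i < 98; i++) latencies.push(100 + i)  // 100..197ms
latencies.push(4000, 4200)
latencies.sort((a, b) => a - b)

const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length
console.log({
  p50: percentile(latencies, 50),   // 149 — middle of the fast cluster
  p95: percentile(latencies, 95),   // 194 — still fast
  p99: percentile(latencies, 99),   // 4000 — exposes the slow cluster
  mean: Math.round(mean),           // 228 — looks fine on a dashboard
})
```

Two requests out of a hundred are enough to put P99 in the slow cluster while barely moving the mean.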
Spotting a bimodal distribution: If your latency histogram shows two humps — e.g., a cluster at 80–120ms and another at 3,000–5,000ms — you have two distinct code paths. Common causes: cache hit vs miss, connection pool available vs waiting, GC pause coinciding with request. A bimodal distribution always has a specific root cause. Find which requests land in the slow cluster and trace the difference.
Common False Positives in Profiler Output
GC pauses appearing as application code: When GC runs, the profiler may record the last application frame as "running" during the pause. In Python, gc.collect() can appear as time in the function that triggered it, not in the GC frame. Use memory-specific tooling (tracemalloc, memray) to separate GC cost from application logic.
Lock contention showing up as CPU time: A thread spinning on a lock shows as CPU-bound in on-CPU profiles. The actual bottleneck is the thread holding the lock. Examine which code holds locks for the longest duration, not which code is waiting.
JIT warm-up skewing early samples: JVM and V8 JIT-compiled code runs slower for the first N executions (interpreter → compiled). Profiles taken during application startup or immediately after deployment will show methods as hot that are actually fast at steady state. Always profile after the JVM or runtime has warmed up — typically 1–2 minutes of real traffic or a synthetic warm-up phase.
Sample bias from sleep/idle: If a worker thread is mostly idle and your profiler samples it, the idle wait shows up in profiles. This is noise, not optimization opportunity. Filter out idle frames — in async-profiler, use --event cpu rather than wall if you want to see only actual CPU activity.
Ready to try Performance Profiler?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$skillfish add ~skills/performance-profiler