Performance Profiler
Node/Python/Go profiling, bundle analysis, and load testing.
$skillfish add ~skills/performance-profiler

Real Conversations
See how the Performance Profiler skill responds to actual requests. These are representative examples of what your AI agent will produce.
Our Express API has P50 at 120ms but P99 at 4,200ms. It's a task management app. The endpoint is GET /api/tasks?projectId=xxx. About 200 tasks per project, fetched with Prisma. I have no idea what's happening at the 99th percentile.
This skill includes executable scripts — I'll need your permission before running them.
A P99 that's 35× your P50 is the classic signature of an intermittent bottleneck — most requests are fast, but some hit something slow. Most likely culprits: N+1 queries, a missing index, or connection pool starvation.
Step 1 — Establish baseline with autocannon:
npx autocannon -c 50 -d 30 'http://localhost:3000/api/tasks?projectId=proj_123'
# Record: P50, P95, P99, RPS
Step 2 — Count DB queries per request (fastest diagnosis):
// Add to Prisma client config in dev
const prisma = new PrismaClient({
  log: [{ emit: 'event', level: 'query' }],
})

// Dev-only shared counter — accurate only when requests are handled one at
// a time; concurrent requests would interleave their counts.
let queryCount = 0
prisma.$on('query', () => queryCount++)

// In your route handler, log it:
app.get('/api/tasks', async (req, res) => {
  queryCount = 0
  const tasks = await getTasksWithDetails(req.query.projectId)
  console.log(`Queries for this request: ${queryCount}`) // Is this 1 or 201?
  res.json(tasks)
})
If this prints 201, you have an N+1 — one query for tasks + one per task for assignee/labels/etc.
Step 3 — Fix N+1 with Prisma include:
// Before: N+1 (separate query per task for assignee)
const tasks = await prisma.task.findMany({ where: { projectId } })
for (const task of tasks) {
  task.assignee = await prisma.user.findUnique({ where: { id: task.assigneeId } })
}

// After: 1 query with a JOIN
const tasks = await prisma.task.findMany({
  where: { projectId },
  include: { assignee: true, labels: true },
})
Step 4 — Check for missing index:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tasks WHERE project_id = 'proj_123' ORDER BY created_at DESC;
-- Look for: "Seq Scan on tasks" → needs index
If you see Seq Scan: CREATE INDEX CONCURRENTLY idx_tasks_project_created ON tasks(project_id, created_at DESC);
Expected result: P99 should drop to <300ms.
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Larger models produce more detailed, production-ready outputs.
Context Window
This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.
All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Good to Know
Advanced guide and reference material for Performance Profiler. Background, edge cases, and patterns worth understanding.
Contents
Profiler Types: Sampling vs Instrumentation
| | Sampling profiler | Instrumentation profiler |
|---|---|---|
| How it works | Interrupts the process at regular intervals and records the current call stack | Inserts measurement code at every function entry/exit |
| Overhead | Low (1–5% typical) | High (10–100% depending on depth) |
| Accuracy | Statistical — misses short-duration calls | Exact — captures every call with precise timing |
| Production use | Yes — py-spy, async-profiler, pprof in continuous mode | No — reserve for dev/staging |
| Blind spots | Functions faster than the sample interval may not appear | None — but the act of measuring changes behavior (observer effect) |
Rule of thumb: Use sampling profilers in production and during load tests. Use instrumentation profilers in development when you need precise call counts or want to find functions that are called millions of times with negligible individual duration (which sampling misses).
Reading Flame Graphs
A flame graph plots call stacks as horizontal bars. Width = time spent. The x-axis is not time — it's sorted alphabetically within each stack level. Left-right position means nothing.
What to look for:
- Wide bars at the top: The hot path — functions consuming the most CPU time. Start your optimization here.
- Tall, narrow stacks: Deep call chains with little time at each level — usually framework dispatch overhead, not your code.
- Wide plateau: A function that is wide at its level but has no children taking significant time — this is where time is actually being spent. Fix this function.
- "Staircases": Recursive calls or deep framework stacks — often not actionable, but worth checking if the recursion is bounded.
On-CPU vs Off-CPU flame graphs:
- On-CPU: Shows where the CPU is busy — pure computation, tight loops, CPU-bound serialization.
- Off-CPU: Shows where threads are waiting — I/O, lock contention, sleep calls, network. A process that's "slow" but shows low CPU in on-CPU profiles is spending time off-CPU. You need off-CPU profiling (async-profiler's --event wall, py-spy's --idle flag) to diagnose this.
Production Profiling Safety
Running a profiler in production is acceptable if you understand the overhead budget:
| Profiler | Overhead | Safe sample interval | Notes |
|---|---|---|---|
| py-spy | ~1–2% | 100ms (default) | Attaches externally via ptrace; no code changes |
| async-profiler (JVM) | ~2–5% | 10ms | Safe at 99th percentile; avoid 1ms in high-traffic |
| pprof (Go) | ~1–3% | Default 10ms | Built into the runtime; enable via HTTP endpoint |
| Node.js --prof | ~10–15% | N/A (continuous) | Not production-safe; use clinic.js or 0x instead |
| Pyroscope (continuous) | ~1–3% | Configurable | Designed for always-on production profiling |
How to profile without impacting p99 latency: Run the profiler against one instance in the load balancer pool, not all of them. Monitor p99 on the profiled instance via your APM before and after attaching. If p99 increases >5%, reduce the sample rate or detach the profiler.
Never attach an instrumentation profiler to a production instance serving real traffic.
P50/P95/P99 vs Average
The average hides everything that matters at scale.
The "average of averages" fallacy: If you have 10 servers each reporting average latency, averaging those averages is statistically wrong unless each server handles the same number of requests. Always aggregate raw request latency, not server-level averages.
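A quick illustration of the fallacy, with made-up traffic numbers:

```javascript
// Two servers with unequal traffic: averaging their averages weights the
// idle, slow server as heavily as the busy, fast one.
const serverA = { requests: 900, avgMs: 100 }   // busy, fast
const serverB = { requests: 100, avgMs: 1000 }  // idle, slow

const avgOfAvgs = (serverA.avgMs + serverB.avgMs) / 2  // 550ms — wrong

// Weight each server's average by its request count (equivalent to
// aggregating the raw request latencies).
const trueAvg =
  (serverA.requests * serverA.avgMs + serverB.requests * serverB.avgMs) /
  (serverA.requests + serverB.requests)                // 190ms — correct

console.log({ avgOfAvgs, trueAvg })
```

The unweighted figure is nearly 3× the true mean here; the gap grows with the traffic imbalance.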
Why tail latency matters:
P50: 120ms — half of users see this or better
P95: 800ms — 1 in 20 requests is this slow
P99: 4,200ms — 1 in 100 requests; at 1,000 req/min = 10 users/minute hitting this
Average: 180ms — looks fine on a dashboard
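The same effect, computed: nearest-rank percentiles over an illustrative bimodal sample show the mean sitting near the fast cluster while P99 lands on the outliers (the numbers are synthetic, not the API from the conversation above):

```javascript
// Nearest-rank percentile on a pre-sorted array.
function percentile(sorted, p) {
  const idx = Math.ceil((p / 100) * sorted.length) - 1
  return sorted[Math.max(0, idx)]
}

// 98 fast requests around 100–200ms, plus 2 slow ones at ~4s (e.g. cache misses).
const latencies = []
for (let i = 0; i < 98; i++) latencies.push(100 + i)  // 100..197ms
latencies.push(4000, 4200)
latencies.sort((a, b) => a - b)

const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length
console.log({
  p50: percentile(latencies, 50),   // 149 — middle of the fast cluster
  p95: percentile(latencies, 95),   // 194 — still fast
  p99: percentile(latencies, 99),   // 4000 — exposes the slow cluster
  mean: Math.round(mean),           // 228 — looks fine on a dashboard
})
```

Two requests out of a hundred are enough to put P99 in the slow cluster while barely moving the mean.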
Spotting a bimodal distribution: If your latency histogram shows two humps — e.g., a cluster at 80–120ms and another at 3,000–5,000ms — you have two distinct code paths. Common causes: cache hit vs miss, connection pool available vs waiting, GC pause coinciding with request. A bimodal distribution always has a specific root cause. Find which requests land in the slow cluster and trace the difference.
Common False Positives in Profiler Output
GC pauses appearing as application code: When GC runs, the profiler may record the last application frame as "running" during the pause. In Python, gc.collect() can appear as time in the function that triggered it, not in the GC frame. Use memory-specific tooling (tracemalloc, memray) to separate GC cost from application logic.
Lock contention showing up as CPU time: A thread spinning on a lock shows as CPU-bound in on-CPU profiles. The actual bottleneck is the thread holding the lock. Examine which code holds locks for the longest duration, not which code is waiting.
JIT warm-up skewing early samples: JVM and V8 JIT-compiled code runs slower for the first N executions (interpreter → compiled). Profiles taken during application startup or immediately after deployment will show methods as hot that are actually fast at steady state. Always profile after the JVM or runtime has warmed up — typically 1–2 minutes of real traffic or a synthetic warm-up phase.
Sample bias from sleep/idle: If a worker thread is mostly idle and your profiler samples it, the idle wait shows up in profiles. This is noise, not optimization opportunity. Filter out idle frames — in async-profiler, use --event cpu rather than wall if you want to see only actual CPU activity.
Ready to try Performance Profiler?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$skillfish add ~skills/performance-profiler