
TDD Guide

Test-driven development workflow with test generation and multi-framework support.

$ npx skills add johnefemer/skillfish --skill tdd-guide


Real Conversations

See how the TDD Guide skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

Starting a new Node.js shopping cart service. Team is skeptical of TDD. Can you show me the red-green-refactor cycle on a real example?

Agent

TDD Red-Green-Refactor — Shopping Cart:

RED — Write a failing test first:

// cart.test.ts
import { Cart } from './cart';

describe('Cart', () => {
  it('adds an item and returns correct total', () => {
    const cart = new Cart();
    cart.addItem({ id: 'sku-1', name: 'Widget', price: 999, quantity: 2 });
    expect(cart.total()).toBe(1998);
  });
});
// Run: npx jest -- FAILS: Cannot find module './cart'

GREEN — Write the minimum code to pass:

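// cart.ts
// CartItem shape is inferred from the test; price is in integer cents.
export interface CartItem {
  id: string;
  name?: string;
  price: number;
  quantity: number;
}

export class Cart {
  private items: CartItem[] = [];

  addItem(item: CartItem) {
    this.items.push(item);
  }

  total(): number {
    return this.items.reduce((sum, i) => sum + i.price * i.quantity, 0);
  }
}
// Run: npx jest -- PASSES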

REFACTOR — Improve without breaking:

// Inside Cart: merge quantities for duplicate SKUs.
// total() is unchanged as long as a SKU keeps a single price.
addItem(item: CartItem) {
  const existing = this.items.find(i => i.id === item.id);
  if (existing) { existing.quantity += item.quantity; }
  else { this.items.push({ ...item }); }
}
// Run: npx jest -- still PASSES

Next test (repeat cycle):

it('applies 10% discount coupon', () => {
  const cart = new Cart();
  cart.addItem({ id: 'sku-1', price: 1000, quantity: 1 });
  cart.applyCoupon({ code: 'SAVE10', type: 'percent', value: 10 });
  expect(cart.total()).toBe(900);
});
// Run: npx jest -- FAILS until applyCoupon exists (RED again)
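
One way the GREEN step for the coupon test might look is sketched below. The `Coupon` type, the optional `name` field, and the rounding rule are assumptions made for illustration; the source does not define them.

```typescript
// Hypothetical types: the source never spells out CartItem or Coupon.
export interface CartItem {
  id: string;
  name?: string;
  price: number;    // integer cents, as in the tests above
  quantity: number;
}

export interface Coupon {
  code: string;
  type: 'percent';
  value: number;    // percentage points: 10 means 10% off
}

export class Cart {
  private items: CartItem[] = [];
  private coupon: Coupon | null = null;

  addItem(item: CartItem): void {
    const existing = this.items.find(i => i.id === item.id);
    if (existing) { existing.quantity += item.quantity; }
    else { this.items.push({ ...item }); }
  }

  applyCoupon(coupon: Coupon): void {
    this.coupon = coupon;
  }

  total(): number {
    const subtotal = this.items.reduce((sum, i) => sum + i.price * i.quantity, 0);
    if (!this.coupon) return subtotal;
    // Assumption: round to whole cents so a percentage never yields fractions.
    return Math.round(subtotal * (100 - this.coupon.value) / 100);
  }
}
```

Only the percentage case is handled because only that case has a test. A fixed-amount coupon would get its own failing test first.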

The pitch to skeptics: TDD costs a little time up front, but it moves debugging from after deployment to the moment you write the code. The test suite you build becomes the spec.

You

Have a legacy Python function calculate_invoice_total(items, tax_rate, discount_code) with no tests and complex branching. Need to add tests safely before refactoring.

Agent

Characterization Testing — Safe Testing of Legacy Code:

Step 1 — Discover what it actually does (not what you think it does):

# characterization_test.py — capture existing behavior as a baseline
import pytest
from invoice import calculate_invoice_total

# Run with real production-like inputs and record outputs
class TestCharacterizationInvoiceTotal:
    def test_basic_items_no_discount(self):
        items = [{'sku': 'A1', 'price': 10.00, 'qty': 3}]
        result = calculate_invoice_total(items, tax_rate=0.10, discount_code=None)
        # Run it once and hardcode the actual output:
        assert result == pytest.approx(33.00)  # 30 + 10% tax; approx tolerates float rounding

    def test_with_discount_code_SAVE20(self):
        items = [{'sku': 'A1', 'price': 100.00, 'qty': 1}]
        result = calculate_invoice_total(items, tax_rate=0.08, discount_code='SAVE20')
        assert result == pytest.approx(86.40)  # (100 - 20%) * 1.08; verify this is right

Step 2 — Map the branching paths:

    # Test each code path independently (stubs are skipped until filled in)
    def test_empty_items_returns_zero(self): pytest.skip("TODO")
    def test_invalid_discount_code_ignored(self): pytest.skip("TODO")
    def test_negative_price_raises(self): pytest.skip("TODO")  # does it? find out
    def test_tax_applied_after_discount(self): pytest.skip("TODO")  # order matters
    def test_free_shipping_threshold(self): pytest.skip("TODO")  # if this branch exists

Step 3 — Run with coverage to find untested branches:

pytest --cov=invoice --cov-branch --cov-report=term-missing characterization_test.py
# Missing lines and partial branches reveal paths you haven't characterized yet

Rule: Only refactor after the characterization test suite hits 100% branch coverage. Then any refactor that breaks a test is a regression you can see immediately.

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best

Claude Opus 4, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2

Good

Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K: Skill only

32K+: Skill + conversation

100K+: Skill + references + codebase

All current frontier models support 100K+ context.

Pro tips for best results

1. Be specific: Include numbers (users, budget, RPS) so the skill can size the architecture.

2. Share constraints: Compliance needs, team size, and existing stack all improve the output.

3. Iterate: Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4. Combine skills: Pair with companion skills below for end-to-end coverage.

Good to Know

Advanced guide and reference material for TDD Guide. Background, edge cases, and patterns worth understanding.


When TDD Helps vs Hurts

TDD is a design tool, not a religious obligation. Apply it where the feedback loop pays off.

Scenario	TDD appropriate?	Notes
New feature with clear spec	Yes	Red-green-refactor shines; tests become the living spec
Exploratory / spike work	No	Write a spike, extract understanding, throw it away, then TDD the real implementation
Legacy code without tests	Characterization tests first	Write tests describing current behavior (even if buggy) before modifying anything
Performance-critical hot path	Profile first	TDD the correctness; benchmark separately. Don't optimize before measuring.
UI components (visual)	Maybe	TDD business logic and state; skip TDD for visual assertions — use snapshot or visual regression testing
Third-party integration wrappers	Yes	Mock the external call; TDD the wrapper's behavior and error handling
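
The last row of the table (mock the external call, then TDD the wrapper) could be sketched as follows. Every name here (`RateClient`, `CurrencyConverter`, `convertCents`, and the two fakes) is hypothetical and chosen only for illustration.

```typescript
// Hypothetical interface for the third-party call the wrapper depends on.
export interface RateClient {
  getRate(from: string, to: string): Promise<number>;
}

export class CurrencyConverter {
  // The client is injected so tests can pass in a fake.
  constructor(private readonly client: RateClient) {}

  // Converts integer cents and wraps client failures in one predictable error.
  async convertCents(amount: number, from: string, to: string): Promise<number> {
    let rate: number;
    try {
      rate = await this.client.getRate(from, to);
    } catch {
      throw new Error(`rate lookup failed for ${from}->${to}`);
    }
    return Math.round(amount * rate);
  }
}

// Hand-written fakes stand in for the real HTTP client in unit tests.
export const fixedRateClient = (rate: number): RateClient => ({
  getRate: async () => rate,
});

export const failingClient: RateClient = {
  getRate: async () => { throw new Error('timeout'); },
};
```

Because the wrapper receives its client through the constructor, a test can drive both the happy path and the error path without any network access.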

The Test Pyramid in Practice

A common rule of thumb is 70% unit / 20% integration / 10% E2E. This keeps the test suite fast and failures localized.

How teams end up with an inverted pyramid:

E2E tests are written first because they "feel like real testing"
Integration tests catch the bugs unit tests miss, so more get added
Unit tests require design discipline (DI, pure functions) that teams skip under deadline pressure

How to detect an inverted pyramid from CI metrics:

If: test suite runtime > 10 minutes
And: disabling E2E tests cuts runtime by >60%
Then: your pyramid is inverted
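
The rule above can be turned into a small check over CI timing data. This is a rough sketch: the `SuiteTimings` shape and its field names are assumptions, and it treats the E2E share of total runtime as a stand-in for how much runtime drops when E2E tests are disabled, which only holds if tests run serially.

```typescript
// Hypothetical shape for CI timing data; the field names are assumptions.
export interface SuiteTimings {
  totalSeconds: number; // wall-clock time of the full suite
  e2eSeconds: number;   // portion of that time spent in E2E tests
}

export function isPyramidInverted(t: SuiteTimings): boolean {
  const tooSlow = t.totalSeconds > 10 * 60;
  // Guard against division by zero for an empty suite.
  const e2eShare = t.totalSeconds > 0 ? t.e2eSeconds / t.totalSeconds : 0;
  return tooSlow && e2eShare > 0.6;
}
```

For example, a 15-minute suite that spends 10 minutes in E2E tests is flagged, while a 15-minute suite with 5 minutes of E2E tests is not.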

The other signal: flaky tests. E2E and integration tests are the primary source of flakiness (network, timing, shared state). If you have >2% flake rate on your suite, you probably have too many high-level tests.

Recovery: Don't delete E2E tests — add unit tests until the ratio normalizes. Deleting tests leaves undetected regressions.

Coverage Is Not Quality

100% line coverage is achievable with tests that assert almost nothing:

it('runs without throwing', () => {
  expect(() => processPayment(validOrder)).not.toThrow();
});
// Executes every line on this path, yet fails only if processPayment throws.
// A wrong amount or a skipped charge would go unnoticed.

What coverage misses:

Missing test cases (the code path you didn't think to write)
Wrong assertions (testing the wrong output)
Incorrect edge case handling that still returns a value

Mutation score is the better signal. A mutation testing tool (Stryker for JS/TS, mutmut for Python, PIT/pitest for Java) makes small code changes, such as flipping > to >= or changing a return value, and reruns your tests. If no test fails, the mutant survives, which points to a weak or missing assertion.
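
To make the idea concrete, here is a hand-rolled illustration of one mutant and two tests. It does not use any mutation tool's API; the free-shipping rule and all function names are hypothetical.

```typescript
// Hypothetical business rule: free shipping at 5000 cents or more.
export function qualifiesForFreeShipping(totalCents: number): boolean {
  return totalCents >= 5000;
}

// The kind of mutant a tool might generate: >= changed to >.
export function qualifiesForFreeShippingMutant(totalCents: number): boolean {
  return totalCents > 5000;
}

type Rule = (totalCents: number) => boolean;

// Weak test: inputs sit far from the boundary, so the mutant survives.
export const weakTest = (rule: Rule): boolean =>
  rule(10000) === true && rule(100) === false;

// Strong test: pins the boundary values, so the mutant is killed.
export const strongTest = (rule: Rule): boolean =>
  weakTest(rule) && rule(5000) === true && rule(4999) === false;
```

Both tests give the original function full line coverage, but only the strong test distinguishes it from the mutant. That gap is what the mutation score measures.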

Target: mutation score >70% on critical business logic. Don't apply it to all code — it's expensive to run and the ROI is highest on payment, auth, and calculation logic.

AAA vs BDD Test Style

Both are structuring conventions. The difference is readability for different audiences.

AAA (Arrange/Act/Assert) — clear for developers reading code:

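it('applies percentage discount to cart total', () => {
  // Arrange (same Cart API as the shopping cart example: addItem, quantity)
  const cart = new Cart();
  cart.addItem({ id: 'sku-1', price: 1000, quantity: 1 });
  const coupon = { code: 'SAVE10', type: 'percent', value: 10 } as const;
  // Act
  cart.applyCoupon(coupon);
  // Assert
  expect(cart.total()).toBe(900);
});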

BDD (Given/When/Then) — reads like a product spec:

// Jest has no built-in given/when/then; nested describe blocks read the same way.
describe('given a cart containing one item at $10.00', () => {
  describe('when a 10% discount coupon is applied', () => {
    it('then the total should be $9.00', () => {
      const cart = new Cart();
      cart.addItem({ id: 'sku-1', price: 1000, quantity: 1 });
      cart.applyCoupon({ code: 'SAVE10', type: 'percent', value: 10 });
      expect(cart.total()).toBe(900);
    });
  });
});

When to use each: AAA for team-internal tests where all readers are engineers. BDD when tests double as acceptance criteria reviewed by product or QA — the Given/When/Then maps directly to user story format. Avoid mixing styles in the same file.

"Obvious Implementation" vs "Fake It Till You Make It"

Kent Beck described these two strategies (along with a third, triangulation) in Test-Driven Development: By Example:

Obvious implementation: When you know exactly how to write the code correctly, just write it. Don't introduce artificial fakery for its own sake. Skip the stub if the real implementation is three lines.

Fake it till you make it: When the correct implementation isn't clear, start with a hardcoded return value that makes the test pass. Let subsequent tests force you to generalize.

// Test 1: passes with hardcoded return
total() { return 1998; }

// Test 2: new items — hardcoded breaks
// Forces you to write: return this.items.reduce(...)

The value of "fake it" isn't the fake code — it's the discipline of writing tests that force you toward the real implementation one constraint at a time. Use it when you're unsure of the design, not as a default.
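
The two steps can be written out in full as below. The class names `FakedCart` and `GeneralCart` are hypothetical and exist only to show both stages side by side; in practice the second would replace the first.

```typescript
export interface LineItem {
  price: number;    // integer cents
  quantity: number;
}

// Step 1: "fake it". Enough to pass a single test that expects 1998.
export class FakedCart {
  addItem(_item: LineItem): void {}
  total(): number { return 1998; }
}

// Step 2: a second test with different items fails against the constant,
// which forces the general implementation.
export class GeneralCart {
  private items: LineItem[] = [];

  addItem(item: LineItem): void {
    this.items.push(item);
  }

  total(): number {
    return this.items.reduce((sum, i) => sum + i.price * i.quantity, 0);
  }
}
```

A test that adds one item at 500 cents still gets 1998 back from `FakedCart`, and that failure is the signal to generalize.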

Ready to try TDD Guide?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.