Engineering · Advanced · Testing · QA · AI Skills

Skill Tester

Quality assurance testing for AI agent skills.

$skillfish add ~skills/skill-tester

Real Conversations

See how the Skill Tester skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

I just finished building a new skill called log-analyzer — it has SKILL.md, README.md, two Python scripts in scripts/, and a references/ directory. I want to make sure it meets POWERFUL tier requirements before I open a PR. How do I run the full quality check?

Agent

This skill includes executable scripts — I'll need your permission before running them.

Here's the full three-tool validation sequence for POWERFUL tier:

# Step 1 — Structural validation
python3 scripts/skill_validator.py skills/log-analyzer --tier POWERFUL

# Step 2 — Script testing (syntax, imports, runtime, output format)
python3 scripts/script_tester.py skills/log-analyzer --timeout 30 --sample-data

# Step 3 — Quality scoring
python3 scripts/quality_scorer.py skills/log-analyzer --detailed --recommendations
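For convenience, the three steps can be chained with a fail-fast runner; a minimal sketch (the script paths are copied from the commands above, and `run_sequence` is a hypothetical helper, not part of the skill itself):

```python
import subprocess
import sys

def run_sequence(commands):
    # Fail-fast: run each validation step in order and stop at the
    # first nonzero exit code (the same behaviour CI would enforce).
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return False
    return True

# The three POWERFUL-tier steps, paths as in the transcript above:
STEPS = [
    [sys.executable, "scripts/skill_validator.py", "skills/log-analyzer",
     "--tier", "POWERFUL"],
    [sys.executable, "scripts/script_tester.py", "skills/log-analyzer",
     "--timeout", "30", "--sample-data"],
    [sys.executable, "scripts/quality_scorer.py", "skills/log-analyzer",
     "--detailed", "--recommendations"],
]
```

Stopping at the first failure keeps the later, slower steps from running against a skill that is already structurally broken.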

What POWERFUL tier requires (checklist before running):

  • SKILL.md: minimum 300 lines
  • 2-3 Python scripts, each 500-800 lines
  • --json and human-readable output modes on all scripts
  • Comprehensive argparse with multiple modes/subcommands
  • references/ directory with documentation
  • expected_outputs/ directory for regression testing
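The dual output-mode requirement in the checklist can be met with a small argparse pattern; a minimal sketch (the `scan` subcommand and result fields are illustrative, not part of any real script):

```python
import argparse
import json

def build_parser():
    # Argparse skeleton with subcommands plus a --json flag, as the
    # POWERFUL tier checklist requires.
    parser = argparse.ArgumentParser(description="Example skill script")
    sub = parser.add_subparsers(dest="command", required=True)
    scan = sub.add_parser("scan", help="Scan input files")
    scan.add_argument("path", help="File or directory to scan")
    scan.add_argument("--json", action="store_true",
                      help="Machine-readable output instead of text")
    return parser

def report(results, as_json):
    # A single reporting path keeps the two output modes consistent.
    if as_json:
        print(json.dumps(results))
    else:
        for key, value in results.items():
            print(f"{key}: {value}")

args = build_parser().parse_args(["scan", "logs/", "--json"])
report({"path": args.path, "errors": 0}, args.json)
# prints {"path": "logs/", "errors": 0}
```

Routing both modes through one `report()` function means a new field automatically appears in the text and JSON outputs at the same time.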

Interpreting results:

=== SKILL VALIDATION REPORT ===
Skill: skills/log-analyzer
Tier: POWERFUL
Overall Score: 82/100 (B)

Structure Validation: ✓ PASS
├─ SKILL.md: ✓ EXISTS (312 lines)
├─ scripts/: ✓ EXISTS (2 files)
├─ references/: ✓ EXISTS
└─ expected_outputs/: ⚠ MISSING  ← add this

Script Testing:
├─ log_scanner.py: ✓ PASS (syntax, imports, --help, --json)
└─ log_summarizer.py: ⚠ WARN — no --json output mode

Recommendations:
• Add expected_outputs/ directory with sample runs
• Add --json flag to log_summarizer.py
• Add --format text/json to argument parser

Fix the warnings before the PR — a B score (80+) is acceptable for POWERFUL tier. Anything below 75 will be flagged in CI.

# Run with --json for CI-parseable output
python3 scripts/quality_scorer.py skills/log-analyzer --json | jq '.overall_score'
# Must be >= 75

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best: Claude Opus 4, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2
Good: Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

  • 8K: skill only
  • 32K+: skill + conversation
  • 100K+: skill + references + codebase

All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.

Pro tips for best results

1. Be specific: include numbers (users, budget, RPS) so the skill can size the architecture.

2. Share constraints: compliance needs, team size, and existing stack all improve the output.

3. Iterate: start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4. Combine skills: pair with companion skills below for end-to-end coverage.

Ready to try Skill Tester?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.

$skillfish add ~skills/skill-tester