Data Analyst

Data analysis workflows: pandas/SQL patterns, statistical analysis, visualization (matplotlib, plotly), dashboard design, A/B test analysis, and business metrics.

A practical data analysis skill that turns raw data into defensible insights. It handles the full workflow: data cleaning, exploratory analysis, statistical testing, visualization, and stakeholder-ready summaries. Opinionated on common pitfalls — Simpson's paradox, p-hacking, misleading visualizations, and metric misalignment — so analysis holds up to scrutiny.

Added Mar 20, 2026

What This Skill Can Do

Concrete capabilities you get when you install this skill.

Clean and reshape datasets using pandas, SQL window functions, and dbt transforms

Run statistical tests (t-test, chi-square, Mann-Whitney) and interpret results correctly

Design and analyze A/B experiments with proper power calculations and guardrail metrics

Build interactive dashboards with Plotly, Streamlit, or Observable

Define business metrics that align with company objectives and resist gaming

Identify and flag misleading analysis patterns: confounding, survivorship bias, and p-hacking
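As one illustration of the pitfalls named above, the following is a minimal sketch of how Simpson's paradox can show up in an A/B readout. The counts, the column names, and the mobile/desktop split are made up for illustration and do not come from the skill itself.

```python
import pandas as pd

# Illustrative counts only (hypothetical): two variants, two segments.
data = pd.DataFrame({
    "variant": ["A", "A", "B", "B"],
    "segment": ["mobile", "desktop", "mobile", "desktop"],
    "conversions": [81, 192, 234, 55],
    "users": [87, 263, 270, 80],
})

# Conversion rate within each segment, one column per variant.
by_segment = (
    data.assign(rate=data["conversions"] / data["users"])
    .pivot(index="segment", columns="variant", values="rate")
)

# Conversion rate after pooling the segments together.
totals = data.groupby("variant")[["conversions", "users"]].sum()
overall = totals["conversions"] / totals["users"]

print(by_segment.round(3))
print(overall.round(3))
```

In this toy data, variant A has the higher rate in each segment, yet variant B has the higher pooled rate, because the two variants have very different segment mixes. Comparing per-segment and pooled rates side by side is one simple way to catch this before reporting a winner.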

Real Conversations

See how the Data Analyst skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

We want to run an A/B test on our checkout button. Current conversion rate is 3.2%. We want to detect a 0.5 percentage point lift. How many users do we need?

Agent

Here's the power calculation:

import numpy as np
from scipy import stats

def sample_size_two_proportion(
    baseline: float,
    mde: float,  # minimum detectable effect
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    p1 = baseline
    p2 = baseline + mde
    p_bar = (p1 + p2) / 2

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    n = (
        (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        / (p2 - p1) ** 2
    )
    return int(np.ceil(n))

n = sample_size_two_proportion(baseline=0.032, mde=0.005)
print(f"Per variant: {n:,} users")  # roughly 20,900 per variant

Result: ~20,900 users per variant (~41,800 total).

Critical decisions to make now:

Runtime: At your traffic volume, how long to hit ~42k users? If >4 weeks, consider raising MDE to 1pp.
Primary metric: conversion rate. Guardrail metrics: revenue per user, cart abandonment — log these too.
Segment pre-registration: decide mobile vs desktop splits BEFORE launch, not after peeking.

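Do NOT peek at results early and stop: repeated unadjusted looks can push the false positive rate from the nominal 5% to 20% or more, depending on how often you look. If you need to monitor continuously, use a sequential design (e.g., group-sequential testing with O'Brien-Fleming boundaries). Note that scipy.stats does not provide these boundaries itself, so they have to come from a dedicated package or be computed separately.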
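The effect of peeking can be checked with a small A/A simulation: both arms share the same true conversion rate, so every "significant" result is a false positive. This is a sketch under assumptions, not part of the skill: the function name, the 10 evenly spaced looks, and the 5,000 users per arm are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

def peeking_false_positive_rate(
    n_sims: int = 2000,
    n_per_arm: int = 5000,
    n_looks: int = 10,
    p: float = 0.032,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Return (FPR when stopping at any significant look, FPR at the final look only)."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)

    chunk = n_per_arm // n_looks                 # users added per arm between looks
    cum_n = chunk * np.arange(1, n_looks + 1)    # cumulative users per arm at each look

    # Cumulative conversions per arm at each look; both arms have the same true rate.
    conv_a = rng.binomial(chunk, p, size=(n_sims, n_looks)).cumsum(axis=1)
    conv_b = rng.binomial(chunk, p, size=(n_sims, n_looks)).cumsum(axis=1)

    # Pooled two-proportion z statistic at every look.
    pooled = (conv_a + conv_b) / (2 * cum_n)
    se = np.sqrt(2 * pooled * (1 - pooled) / cum_n)
    diff = (conv_b - conv_a) / cum_n
    z = np.divide(diff, se, out=np.zeros_like(diff), where=se > 0)

    significant = np.abs(z) > z_crit
    return float(significant.any(axis=1).mean()), float(significant[:, -1].mean())

any_look, final_only = peeking_false_positive_rate()
print(f"Stop at first significant look: {any_look:.1%}")
print(f"Single test at the end:         {final_only:.1%}")
```

The single end-of-test rate should sit near the nominal 5%, while the stop-at-any-look rate comes out substantially higher. Exact figures depend on the seed and on the number of looks.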

You

Our revenue dropped 15% last week vs the week before. I have a transactions table with user_id, amount, created_at, product_id, and country. Where do I start?

Agent

Structured decomposition — start broad, narrow to the signal:

Step 1 — Is it volume, price, or mix?

SELECT
  DATE_TRUNC('week', created_at) AS week,
  COUNT(*) AS transactions,
  COUNT(DISTINCT user_id) AS unique_buyers,
  ROUND(AVG(amount), 2) AS avg_order_value,
  SUM(amount) AS revenue
FROM transactions
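WHERE created_at >= DATE_TRUNC('week', NOW()) - INTERVAL '4 weeks'
  AND created_at < DATE_TRUNC('week', NOW())  -- complete weeks only, so the current partial week does not look like a drop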
GROUP BY 1 ORDER BY 1;

This tells you if the drop is fewer orders, smaller orders, or fewer buyers.
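If the transactions are already in a DataFrame, the same weekly rollup can be sketched in pandas. This assumes the columns described in the question (user_id, amount, created_at) with created_at parsed as datetimes; the function name and the sample rows are hypothetical.

```python
import pandas as pd

def weekly_summary(tx: pd.DataFrame) -> pd.DataFrame:
    """Weekly transactions, unique buyers, average order value, and revenue."""
    # Weeks start on Monday, matching DATE_TRUNC('week', ...) in PostgreSQL.
    week = tx["created_at"].dt.to_period("W").dt.start_time
    summary = (
        tx.groupby(week)
        .agg(
            transactions=("amount", "size"),
            unique_buyers=("user_id", "nunique"),
            avg_order_value=("amount", "mean"),
            revenue=("amount", "sum"),
        )
        .rename_axis("week")
    )
    summary["avg_order_value"] = summary["avg_order_value"].round(2)
    # Week-over-week revenue change in percent (NaN for the first week).
    summary["revenue_wow_pct"] = (summary["revenue"].pct_change() * 100).round(1)
    return summary

# Made-up rows purely to show the call.
tx = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "amount": [10.0, 20.0, 30.0, 40.0],
    "created_at": pd.to_datetime(
        ["2026-03-16", "2026-03-17", "2026-03-23", "2026-03-24"]
    ),
})
summary = weekly_summary(tx)
print(summary)
```

Reading transactions and avg_order_value together shows whether a revenue change comes mainly from order count or from order size.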

Step 2 — Segment by product and country

SELECT
  product_id,
  country,
  -- Example week boundaries; substitute the two complete weeks you are comparing.
  SUM(CASE WHEN created_at >= '2026-03-24' AND created_at < '2026-03-31' THEN amount ELSE 0 END) AS last_week,
  SUM(CASE WHEN created_at >= '2026-03-17' AND created_at < '2026-03-24' THEN amount ELSE 0 END) AS prev_week,
  ROUND(100.0 * (SUM(CASE WHEN created_at >= '2026-03-24' AND created_at < '2026-03-31' THEN amount ELSE 0 END)
    - SUM(CASE WHEN created_at >= '2026-03-17' AND created_at < '2026-03-24' THEN amount ELSE 0 END))
    / NULLIF(SUM(CASE WHEN created_at >= '2026-03-17' AND created_at < '2026-03-24' THEN amount ELSE 0 END), 0), 1) AS pct_change
FROM transactions
-- Use the same fixed window as the CASE expressions rather than a rolling NOW() window.
WHERE created_at >= '2026-03-17' AND created_at < '2026-03-31'
GROUP BY 1, 2
ORDER BY pct_change ASC
LIMIT 20;

Most revenue drops are concentrated in 1-2 segments. Share the results and we'll drill into the root cause (payment failures, product removed, geo-specific issue, etc.).
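To see how concentrated a drop is, it can help to rank segments by absolute change and by their share of the total change, since percentage change alone tends to surface tiny segments. The sketch below is one way to do that in pandas. The function name, the parameters, and the sample rows are hypothetical, and the column names are assumed to match the transactions table described above.

```python
import pandas as pd

def segment_changes(
    tx: pd.DataFrame,
    prev_start: str,
    last_start: str,
    last_end: str,
    by: tuple[str, ...] = ("product_id", "country"),
) -> pd.DataFrame:
    """Revenue per segment for two adjacent periods, sorted by largest drop first."""
    prev_ts, last_ts, end_ts = map(pd.Timestamp, (prev_start, last_start, last_end))
    window = tx[(tx["created_at"] >= prev_ts) & (tx["created_at"] < end_ts)]
    period = (window["created_at"] >= last_ts).map(
        {True: "last_week", False: "prev_week"}
    )
    pivot = (
        window.assign(period=period)
        .pivot_table(index=list(by), columns="period", values="amount",
                     aggfunc="sum", fill_value=0.0)
        .reindex(columns=["prev_week", "last_week"], fill_value=0.0)
    )
    pivot["abs_change"] = pivot["last_week"] - pivot["prev_week"]
    total = pivot["abs_change"].sum()
    # Share of the overall change explained by each segment (NaN if nothing changed).
    pivot["share_of_total_change"] = (
        pivot["abs_change"] / total if total != 0 else float("nan")
    )
    return pivot.sort_values("abs_change")

# Made-up rows purely to show the call.
tx = pd.DataFrame({
    "product_id": ["A", "A", "B", "B"],
    "country": ["US", "US", "DE", "DE"],
    "amount": [100.0, 40.0, 50.0, 50.0],
    "created_at": pd.to_datetime(
        ["2026-03-17", "2026-03-24", "2026-03-18", "2026-03-25"]
    ),
})
changes = segment_changes(tx, "2026-03-17", "2026-03-24", "2026-03-31")
print(changes)
```

In the made-up data, the ("A", "US") segment accounts for the entire drop, which is the kind of concentration worth drilling into first.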

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Works well with any frontier model. Complex statistical reasoning benefits from GPT-4o or Claude Sonnet; pandas/SQL generation is reliable on most models.

Context Window

SKILL.md is ~8KB. Fits in 32K context. Load dataset schema and sample rows in context for best results.

Pro tips for best results

1

Be specific

Include numbers (baseline rates, traffic volume, date ranges) so the skill can size the analysis.

2

Share constraints

Compliance needs, team size, and existing stack all improve the output.

3

Iterate

Start with a high-level question, then ask follow-ups for segment cuts, statistical tests, or visualizations.

4

Combine skills

Pair with companion skills below for end-to-end coverage.

Works Great With

These skills complement Data Analyst for end-to-end coverage. Install them together for better results.

Senior Data Scientist

Statistical modeling, A/B testing, causal inference, and advanced analytics.

Data Science, Statistics

Financial Analyst

Financial ratio analysis, DCF valuation, budgeting, and forecasting.

Finance, DCF

Senior Data Engineer

Data pipelines, ETL/ELT systems, Spark, Airflow, dbt, and Kafka.

Data, Pipelines

Database Designer

Schema analysis, ERD generation, index optimization, and migrations.

Database, Schema

$ skillfish add johnefemer/skillfish --all # install all skills at once
