Senior Data Scientist

Statistical modeling, A/B testing, causal inference, and advanced analytics.

Real Conversations

See how the Senior Data Scientist skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

We want to A/B test removing the coupon code field from checkout. Baseline conversion is 3.2%. We need 95% confidence. How do I size the experiment?

Agent

Sample Size Calculation:

from scipy.stats import norm
import math

def sample_size_per_variant(p_baseline, mde, alpha=0.05, power=0.80):
    """Calculate required n per variant for a two-proportion z-test."""
    p_treatment = p_baseline * (1 + mde)  # mde is relative: 0.10 means a 10% lift over baseline

    z_alpha = norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = norm.ppf(power)
    
    n = (z_alpha + z_beta)**2 * (p_baseline*(1-p_baseline) + p_treatment*(1-p_treatment)) / (p_treatment - p_baseline)**2
    return math.ceil(n)

# Your numbers:
n = sample_size_per_variant(p_baseline=0.032, mde=0.10)  # detect 10% relative lift
# → ~49,800 users per variant ≈ 99,500 total

At what traffic volume does this complete?

1,000 checkouts/day → about 100 days
5,000 checkouts/day → about 20 days
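
To make the traffic arithmetic explicit, here is a minimal sketch that recomputes the per-variant sample size with the same formula and converts it into days at a given daily checkout volume. It uses the standard library's `statistics.NormalDist` in place of scipy so it runs without extra dependencies. `days_to_complete` is a hypothetical helper name, and it assumes every checkout visitor enters the experiment with an even split across two variants.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_baseline, mde, alpha=0.05, power=0.80):
    """Required n per variant for a two-sided two-proportion z-test (relative mde)."""
    p_treatment = p_baseline * (1 + mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_treatment - p_baseline) ** 2)


def days_to_complete(n_per_variant, daily_traffic, n_variants=2):
    """Days needed if all daily traffic is split evenly across the variants (assumption)."""
    return math.ceil(n_per_variant * n_variants / daily_traffic)


n = sample_size_per_variant(p_baseline=0.032, mde=0.10)
for daily in (1_000, 5_000):
    print(f"{daily:>6,} checkouts/day -> {days_to_complete(n, daily)} days (n per variant = {n:,})")
```

In practice it is common to round the duration up to whole weeks so that every weekday is represented equally.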

Key decisions before you launch:

MDE choice matters more than alpha: A 10% relative lift on 3.2% means detecting 3.52% vs 3.2%. Is that a meaningful business outcome? If you also need to detect smaller lifts, set MDE to 5% (roughly quadruples the sample size, because n scales with 1/MDE²).
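Randomization unit: Randomize by user_id, not session. A user who sees both variants contaminates the comparison, and correlated sessions from the same user break the test's independence assumption, which inflates false positives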
Guardrail metrics: Track add_to_cart rate and revenue_per_visitor alongside conversion — removing the coupon field could hurt AOV even if it lifts conversion
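
One common way to randomize by user rather than by session is deterministic hash-based bucketing, sketched below. The experiment name `checkout_no_coupon_field`, the `assign_variant` helper, and the 50/50 split are illustrative assumptions, not part of any particular experimentation platform.

```python
import hashlib


def assign_variant(user_id, experiment="checkout_no_coupon_field", treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing user_id together with the experiment name keeps the assignment
    stable across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same variant, whatever the session.
assert assign_variant("user_123") == assign_variant("user_123")

# Across many users the split should be close to the requested share.
assignments = [assign_variant(f"user_{i}") for i in range(10_000)]
print(assignments.count("treatment") / len(assignments))
```

Because the assignment is a pure function of the user ID, it can be recomputed at analysis time without storing a lookup table.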

Analysis (post-experiment):

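from scipy.stats import chi2_contingency
# Rows: control, treatment. Columns: converted, not converted (user counts).
contingency = [[control_converted, control_not], [treat_converted, treat_not]]
# Note: for 2x2 tables chi2_contingency applies the Yates continuity correction by default;
# pass correction=False to match an uncorrected two-proportion z-test.
chi2, p_value, _, _ = chi2_contingency(contingency)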
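
A p-value alone does not say how large the effect is. As a complement to the chi-square test, the sketch below reports the absolute lift with a 95% confidence interval using a two-proportion z-test. The counts are made-up placeholders, and `two_proportion_ztest` is a hypothetical helper written with the standard library rather than a scipy function.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates, plus a Wald CI."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error for the hypothesis test (assumes no difference under H0).
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Placeholder counts for illustration only, not real experiment results.
diff, (low, high), p_value = two_proportion_ztest(conv_c=1_600, n_c=50_000, conv_t=1_760, n_t=50_000)
print(f"absolute lift = {diff:.4%}, 95% CI = [{low:.4%}, {high:.4%}], p = {p_value:.4f}")
```

If the interval excludes zero but its lower bound is below the lift the business cares about, the result is statistically significant yet may still be too small to act on.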

You

Our churn model gets AUC 0.97 in training but 0.61 in production. XGBoost on 6 months of customer data. Something is very wrong.

Agent

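0.97 train → 0.61 prod points strongly to data leakage. Overfitting or drift can contribute, but a gap this large usually means the model saw information it will not have in production. Diagnosis steps: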

Step 1 — Check feature timestamps:

# The most common leakage: features computed AFTER the churn event
# e.g., 'last_activity_date' populated at row-creation time, not prediction time
print(df[['customer_id', 'churn_date', 'last_activity_date', 'support_tickets_total']].head(20))
# If last_activity_date > churn_date for churned customers → leakage
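
The check described in the comments can be automated instead of eyeballed. Below is a minimal sketch over plain Python records; the field names mirror the columns printed above, the sample rows are invented for illustration, and `find_post_churn_activity` is a hypothetical helper. The same comparison carries over to DataFrame columns.

```python
from datetime import date


def find_post_churn_activity(rows):
    """Return customer_ids whose last_activity_date falls after their churn_date.

    Rows without a churn_date (customers who have not churned) are skipped.
    """
    return [
        row["customer_id"]
        for row in rows
        if row["churn_date"] is not None and row["last_activity_date"] > row["churn_date"]
    ]


# Invented sample records, for illustration only.
customers = [
    {"customer_id": "c1", "churn_date": date(2024, 3, 10), "last_activity_date": date(2024, 3, 2)},
    {"customer_id": "c2", "churn_date": date(2024, 4, 1), "last_activity_date": date(2024, 4, 15)},
    {"customer_id": "c3", "churn_date": None, "last_activity_date": date(2024, 5, 20)},
]

print(find_post_churn_activity(customers))  # ['c2']
```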

Step 2 — Check for target-derived features:

# Features like 'cancellation_reason', 'offboarding_completed', 'final_invoice_amount'
# are only known AFTER churn — classic leakage
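target = df['churned'].astype(float)
numeric_cols = df.select_dtypes(include='number').columns.drop('churned', errors='ignore')
for col in numeric_cols:  # .corr() needs numeric columns; skip the target itself
    corr = df[col].corr(target)
    if abs(corr) > 0.6:
        print(f'Suspicious: {col} has {corr:.3f} correlation with target')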
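
Linear correlation can miss leaky features whose relationship to the target is non-linear. One complementary screen, offered here as an assumption rather than part of the original workflow, is to score each feature on its own: a single feature with an AUC near 1.0 (or near 0.0) deserves scrutiny. `single_feature_auc` is a hypothetical helper that computes AUC from the rank-sum (Mann-Whitney) identity using only the standard library, and the toy values are invented.

```python
def single_feature_auc(values, labels):
    """AUC of one numeric feature used directly as a score (ties get average ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Invented toy data: the first feature separates the classes perfectly (suspicious),
# the second does not.
churned = [0, 0, 0, 1, 1, 1]
final_invoice_amount = [0, 0, 0, 120, 80, 95]
tenure_months = [12, 3, 24, 6, 18, 2]

print(single_feature_auc(final_invoice_amount, churned))  # 1.0
print(single_feature_auc(tenure_months, churned))
```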

Step 3 — Enforce temporal train/test split:

# WRONG: random split leaks future into past
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# CORRECT: split by time
train = df[df['observation_date'] < '2024-06-01']
test = df[df['observation_date'] >= '2024-06-01']

Step 4 — Build features with a strict point-in-time cutoff: Every feature must be computable as of the observation_date — nothing from after that date.
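
As an illustration of a point-in-time cutoff, the sketch below counts support tickets per customer using only events dated on or before the observation date. The event layout, the `tickets_as_of` helper, and the sample data are assumptions made for the example; a real pipeline would apply the same filter to every feature source.

```python
from datetime import date


def tickets_as_of(ticket_events, customer_id, observation_date):
    """Count a customer's support tickets opened on or before observation_date.

    Anything dated after the cutoff is ignored, so the feature only uses
    information that would have been available at prediction time.
    """
    return sum(
        1
        for event in ticket_events
        if event["customer_id"] == customer_id and event["opened_on"] <= observation_date
    )


# Invented events, for illustration only.
events = [
    {"customer_id": "c1", "opened_on": date(2024, 4, 3)},
    {"customer_id": "c1", "opened_on": date(2024, 5, 28)},
    {"customer_id": "c1", "opened_on": date(2024, 6, 12)},  # after the cutoff below
    {"customer_id": "c2", "opened_on": date(2024, 5, 1)},
]

cutoff = date(2024, 6, 1)
print(tickets_as_of(events, "c1", cutoff))  # 2, not the lifetime total of 3
```

This contrasts with a column such as support_tickets_total, which, if computed at row-creation time, would also count tickets opened after the observation date.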

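With a proper temporal split and leakage-free features, train and production AUC should land much closer together, for example train AUC 0.72-0.78 and production AUC 0.68-0.74. The exact values depend on your data, but a small gap is what makes the offline number trustworthy.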

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best

Claude Opus 4.8, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2

Good

Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K: Skill only

32K+: Skill + conversation

100K+: Skill + references + codebase

All current frontier models support 100K+ context.

Pro tips for best results

1. Be specific: Include numbers (users, budget, RPS) so the skill can size the architecture.

2. Share constraints: Compliance needs, team size, and existing stack all improve the output.

3. Iterate: Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4. Combine skills: Pair with companion skills below for end-to-end coverage.

Ready to try Senior Data Scientist?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
