Engineering DataPipelinesETL

Senior Data Engineer

Data pipelines, ETL/ELT systems, Spark, Airflow, dbt, and Kafka.

$npx skills add johnefemer/skillfish --skill senior-data-engineer

Real Examples Get Started Models & Context

Real Conversations

See how the Senior Data Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

We run dbt on 200+ models in BigQuery. Full refresh takes 4 hours. I want to set up Airflow to run only models changed since the last run plus their downstream dependencies.

Agent

Incremental dbt + Airflow Pattern:

Step 1 — Get changed models from git diff:

# dags/dbt_incremental_dag.py
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator
import subprocess

@task
def get_changed_models() -> str:
    result = subprocess.run(
        ['git', 'diff', '--name-only', 'HEAD~1', 'HEAD', '--', 'models/'],
        capture_output=True, text=True, cwd='/opt/dbt/project'
    )
    changed_files = result.stdout.strip().split('\n')
    # Convert file paths to dbt model selectors
    models = [f.replace('models/', '').replace('.sql', '').replace('/', '.') for f in changed_files if f.endswith('.sql')]
    return ' '.join(f'+{m}' for m in models)  # + prefix includes downstream

@dag(schedule='@hourly', catchup=False)
def dbt_incremental():
    changed = get_changed_models()
    BashOperator(
        task_id='dbt_run',
        bash_command=f'dbt run --select {{ changed }} --profiles-dir /opt/dbt',
    )

The +model_name selector runs the model AND all downstream dependents — this is what you want for correctness.

For a more robust setup, use the dbt Cloud API or dbt-airflow provider:

pip install astronomer-cosmos  # NASA's dbt-Airflow integration

Cosmos parses the dbt manifest and creates individual Airflow tasks per model — giving you granular retry, skip, and observability.

Expected result: 4h full refresh → 20-40min for a typical PR touching 10-20 models.

You

Spark job joining two DataFrames — one 500GB events table, one 2GB lookup table — takes 45 min and spills 120GB to disk. Running on EMR with 20 r5.4xlarge workers.

Agent

Root cause: you're doing a sort-merge join when you should be doing a broadcast join.

A 2GB lookup table fits easily in executor memory — broadcast it and eliminate the shuffle entirely.

Fix:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# Force broadcast on the small table
result = events_df.join(broadcast(lookup_df), on='event_type', how='left')

If the lookup table is actually larger than Spark's autoBroadcastJoinThreshold (default 10MB), override it:

spark.conf.set('spark.sql.autoBroadcastJoinThreshold', '3g')  # 3GB threshold

Why the spill is happening without broadcast: Spark shuffles both DataFrames by join key, writing 500GB across the network and to disk when executor memory fills up. With broadcast, only the 2GB table crosses the network — once.

Additional tuning for r5.4xlarge (16 vCPU, 128GB RAM):

spark.conf.set('spark.executor.memory', '100g')
spark.conf.set('spark.executor.cores', '5')       # 5 cores per executor
spark.conf.set('spark.executor.instances', '60')  # 20 nodes × 3 executors
spark.conf.set('spark.sql.shuffle.partitions', '400')  # 2×parallelism

Expected result: 45min → 8-12min, 120GB spill → 0GB spill.

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best

Claude Opus 4 Claude Sonnet 4 GPT-4.1 Gemini 2.5 Pro Grok 3 Kimi K2

Good

Claude Haiku 4.5 GPT-4.1 mini Gemini 2.5 Flash Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K Skill only

32K+ Skill + conversation

100K+ Skill + references + codebase

All current frontier models support 100K+ context.

Pro tips for best results

1

Be specific

Include numbers — users, budget, RPS — so the skill can size the architecture.

2

Share constraints

Compliance needs, team size, and existing stack all improve the output.

3

Iterate

Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4

Combine skills

Pair with companion skills below for end-to-end coverage.

Ready to try Senior Data Engineer?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.

$npx skills add johnefemer/skillfish --skill senior-data-engineer

← Browse all skills