Senior ML Engineer
ML model deployment, MLOps pipelines, LLM integration, and cost optimization.
$skillfish add ~skills/senior-ml-engineer

Real Conversations
See how the Senior ML Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.
Fine-tuned LLaMA 3.1 8B with QLoRA on our dataset. Need to serve it with <200ms p95 latency on a single A10G GPU. What's the best serving stack?
Serving Stack: vLLM on A10G with quantization
vLLM's PagedAttention typically delivers 2–4x higher throughput than a naive HuggingFace `generate` loop, and an 8B model fits in 24GB VRAM even at FP16 — 4-bit quantization mainly buys extra room for the KV cache, which is what sustains batching and keeps p95 low under concurrent load.
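A rough back-of-envelope shows why (it ignores activation memory and quantization group metadata, so treat the numbers as indicative only):

```python
GIB = 1024 ** 3
params = 8e9  # LLaMA 3.1 8B parameter count

fp16_weights_gb = params * 2 / GIB    # 2 bytes/param at FP16, ~15 GiB
awq_weights_gb = params * 0.5 / GIB   # 0.5 bytes/param at 4-bit, ~3.7 GiB

# On a 24 GiB A10G, whatever the weights don't use is available for KV cache.
kv_room_fp16 = 24 - fp16_weights_gb
kv_room_awq = 24 - awq_weights_gb
```

The 4-bit checkpoint leaves roughly 20 GiB for KV cache versus ~9 GiB at FP16 — more concurrent sequences before preemption kicks in.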
Merge the LoRA weights first (vLLM serves a plain checkpoint, not a PEFT adapter):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B', torch_dtype='bfloat16')
model = PeftModel.from_pretrained(base, './lora-checkpoint')
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained('./llama3-finetuned-merged')

# Save the tokenizer alongside so the serving stack can load everything from one path
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B')
tokenizer.save_pretrained('./llama3-finetuned-merged')
```
vLLM server (point `--model` at the AWQ-quantized checkpoint produced in the quantization step below — passing `--quantization awq` against the plain FP16 merge will fail to load):

```shell
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ./llama3-finetuned-awq \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --port 8000
```
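The server speaks the OpenAI-compatible API, so any OpenAI client works against it. A minimal sketch of the request body for `/v1/completions` — the `model` value must match whatever path you passed to `--model`, so the one here is an assumption:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint.
# "model" must equal the --model path you launched the server with.
body = {
    "model": "./llama3-finetuned-merged",
    "prompt": "Summarize: vLLM uses PagedAttention.",
    "max_tokens": 64,
    "temperature": 0.2,
}
payload = json.dumps(body)
# POST payload to http://localhost:8000/v1/completions with
# Content-Type: application/json (via requests, curl, or the openai SDK).
```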
AWQ quantization (run once, offline, before deploying — AutoAWQ is driven through its Python API; the output path here is an assumption, adjust to taste):

```python
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained('./llama3-finetuned-merged')
tokenizer = AutoTokenizer.from_pretrained('./llama3-finetuned-merged')
model.quantize(tokenizer, quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"})
model.save_quantized('./llama3-finetuned-awq')
tokenizer.save_pretrained('./llama3-finetuned-awq')
```
Expected performance on A10G (24GB):
| Config | VRAM | p50 latency | p95 latency |
|---|---|---|---|
| FP16, no batching | 18GB | 80ms | 180ms |
| AWQ 4-bit + vLLM | 10GB | 45ms | 120ms |
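These percentiles are what you'd verify with your own load test; a minimal sketch of computing p50/p95 from measured request latencies (the latencies below are simulated placeholders, not real measurements):

```python
import random
import statistics

random.seed(0)
# Placeholder per-request latencies in ms; replace with your load-test samples.
latencies = [random.gauss(60, 20) for _ in range(1000)]

# quantiles(n=100) returns the 99 percentile cut points; index 94 is p95.
p95 = statistics.quantiles(latencies, n=100)[94]
p50 = statistics.median(latencies)
```

Always size against p95/p99, not the mean — tail latency is what your SLO and your users actually feel.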
Monitoring: track `vllm:num_requests_running` and `vllm:gpu_cache_usage_perc` in Prometheus, and alert when KV-cache usage exceeds 85% — vLLM preempts requests on its own when the cache fills, which shows up as queueing and p95 spikes.
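If you want a quick check outside Prometheus, the `/metrics` endpoint serves plain text exposition format you can parse directly — a sketch, with a hardcoded sample standing in for a live scrape (real lines may carry `{...}` labels, which this still handles):

```python
def gauge_value(metrics_text: str, name: str) -> float:
    """Pull the value of the first metric line starting with `name`
    from Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.split()[-1])
    raise KeyError(name)

# Sample scrape of vLLM's /metrics (values are illustrative)
sample = """\
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.91
"""

usage = gauge_value(sample, "vllm:gpu_cache_usage_perc")
if usage > 0.85:
    print("alert: KV cache nearly full; expect preemption and latency spikes")
```

In production you'd encode the same 85% threshold as a Prometheus alerting rule rather than polling by hand.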
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Larger models produce more detailed, production-ready outputs.
Context Window
This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.
All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Ready to try Senior ML Engineer?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$skillfish add ~skills/senior-ml-engineer