Engineering · ML · MLOps · LLM

Senior ML Engineer

ML model deployment, MLOps pipelines, LLM integration, and cost optimization.

$skillfish add ~skills/senior-ml-engineer

Real Conversations

See how the Senior ML Engineer skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

Fine-tuned LLaMA 3.1 8B with QLoRA on our dataset. Need to serve it with <200ms p95 latency on a single A10G GPU. What's the best serving stack?

Agent

Serving Stack: vLLM on A10G with quantization

vLLM's PagedAttention gives 2-4x higher throughput vs naive HuggingFace inference, and easily fits 8B in 24GB VRAM with 4-bit quantization.

Merge LoRA weights first:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FP16 base model and apply the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B')
model = PeftModel.from_pretrained(base, './lora-checkpoint')

# Fold the adapter weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained('./llama3-finetuned-merged')

# Save the tokenizer alongside the weights so vLLM can load both from one path
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B')
tokenizer.save_pretrained('./llama3-finetuned-merged')

AWQ quantization (do this once, before deploying — AutoAWQ is driven from Python, not a CLI):

pip install autoawq

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_config = {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}

# Quantize the merged checkpoint and save it to a new directory
model = AutoAWQForCausalLM.from_pretrained('./llama3-finetuned-merged')
tokenizer = AutoTokenizer.from_pretrained('./llama3-finetuned-merged')
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized('./llama3-finetuned-awq')
tokenizer.save_pretrained('./llama3-finetuned-awq')

vLLM server (point it at the quantized checkpoint):

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model ./llama3-finetuned-awq \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --port 8000
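Once the server is up, it speaks the OpenAI-compatible completions API. A minimal stdlib-only smoke test, as a sketch — the base URL assumes the `--port 8000` default above, and `model` must be whatever path you passed to `--model`:

```python
import json
import urllib.request

def build_request(prompt: str, model: str, max_tokens: int = 64) -> dict:
    """Assemble a /v1/completions payload for the vLLM server."""
    return {
        "model": model,  # must match the --model path the server was started with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output for smoke tests
    }

def complete(prompt: str, model: str, base_url: str = "http://localhost:8000") -> str:
    """POST a completion request and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, `complete("The capital of France is", "<your --model path>")` should return text within the latency budget.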

Expected performance on A10G (24GB):

Config              | VRAM  | p50 latency | p95 latency
FP16, no batching   | 18GB  | 80ms        | 180ms
AWQ 4-bit + vLLM    | 10GB  | 45ms        | 120ms

Monitoring: track vllm:num_requests_running and vllm:gpu_cache_usage_perc in Prometheus, and alert when KV-cache usage stays above 85% — sustained cache pressure triggers request preemption and degrades tail latency.
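Server-side metrics don't capture network and queueing overhead, so it's worth verifying the <200ms p95 target end to end. A minimal probe sketch, assuming the completions endpoint above is reachable on localhost:8000 (the model path is a placeholder — match your `--model` flag); `percentile` uses the nearest-rank method:

```python
import json
import time
import urllib.request

def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile over a list of latency samples (ms)."""
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, round(pct / 100 * (len(xs) - 1))))
    return xs[k]

def probe(base_url: str = "http://localhost:8000", n: int = 50) -> dict:
    """Time n short completion requests and report p50/p95 in milliseconds."""
    payload = json.dumps({
        "model": "./llama3-finetuned-awq",  # placeholder; use your served model path
        "prompt": "ping",
        "max_tokens": 16,
    }).encode()
    latencies = []
    for _ in range(n):
        req = urllib.request.Request(
            f"{base_url}/v1/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        start = time.perf_counter()
        urllib.request.urlopen(req).read()
        latencies.append((time.perf_counter() - start) * 1000)
    return {"p50": percentile(latencies, 50), "p95": percentile(latencies, 95)}
```

Run it sequentially first to get a floor, then from several processes to see how p95 moves under concurrent load.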

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best: Claude Opus 4, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2
Good: Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K: Skill only
32K+: Skill + conversation
100K+: Skill + references + codebase

All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.

Pro tips for best results

1. Be specific: include numbers — users, budget, RPS — so the skill can size the architecture.

2. Share constraints: compliance needs, team size, and existing stack all improve the output.

3. Iterate: start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.

4. Combine skills: pair with companion skills below for end-to-end coverage.

Ready to try Senior ML Engineer?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.

$skillfish add ~skills/senior-ml-engineer