LLM Architect
LLM system design: model selection, fine-tuning strategies, inference optimization, context window management, multi-model routing, and cost analysis.
A deep systems skill for teams building serious LLM infrastructure. It covers model selection trade-offs across GPT-4o, Claude, Gemini, Llama, and Mistral; fine-tuning strategies (LoRA, QLoRA, RLHF, DPO); inference optimization with vLLM and TGI; context window management with sliding windows and KV cache tuning; and multi-model routing for cost-performance optimization.
Added Mar 20, 2026
$ npx skills add johnefemer/skillfish --skill llm-architect
What This Skill Can Do
Concrete capabilities you get when you install this skill.
Select the right model for latency, cost, and capability requirements
Design fine-tuning pipelines using LoRA, QLoRA, or DPO on custom datasets
Optimize inference throughput with vLLM, TGI, or llama.cpp configurations
Architect multi-model routing with fallback, load balancing, and A/B testing (a minimal routing sketch follows this list)
Manage context windows: chunking strategies, KV cache, and prompt compression
Analyze cost-performance trade-offs across model families and hosting options
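For illustration, here is a minimal sketch of the fallback-routing pattern mentioned above. It assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model names, size threshold, and routing rule are placeholders, not the skill's built-in policy.

```python
# Illustrative multi-model routing with fallback: try a cheap primary model,
# escalate to a stronger one on failure. All names/thresholds are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROUTES = [
    {"model": "gpt-4o-mini", "max_input_chars": 8_000},   # cheap default
    {"model": "gpt-4o", "max_input_chars": 200_000},      # escalation target
]

def complete(prompt: str) -> str:
    last_error = None
    for route in ROUTES:
        if len(prompt) > route["max_input_chars"]:
            continue  # simple size-based routing rule (illustrative only)
        try:
            resp = client.chat.completions.create(
                model=route["model"],
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return resp.choices[0].message.content
        except Exception as err:  # timeout / rate limit / 5xx: try the next route
            last_error = err
    raise RuntimeError("all routes failed") from last_error
```

A production router would also log per-route latency and cost and split traffic for A/B testing; this sketch only shows the fallback skeleton.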
Real Conversations
See how the LLM Architect skill responds to actual requests. These are representative examples of what your AI agent will produce.
We have 50,000 internal support docs and want our LLM to answer questions accurately. Should we fine-tune or use RAG?
For document Q&A, RAG wins almost every time over fine-tuning. Here's why:
Fine-tuning limitations for your case:
- Fine-tuning teaches style and behavior, not facts. The model will hallucinate confidently after training on 50k docs.
- Your docs change. Retraining costs $500-2000/run and takes days.
- You lose attribution — users can't see which document answered their question.
RAG advantages:
- Ground answers in retrieved chunks — hallucination drops dramatically
- Update your knowledge base without retraining: add/remove documents instantly
- Native citations: return the source doc URL with every answer
- Cost: embeddings for 50k docs ≈ $2 one-time with text-embedding-3-small
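Back-of-the-envelope math behind that ≈ $2 figure (the average document length and per-token price are assumptions; check current pricing):

```python
# Rough one-time embedding cost estimate. Figures below are assumptions,
# not quoted pricing.
docs = 50_000
avg_tokens_per_doc = 2_000              # assumption: roughly 1,500 words per support doc
price_per_million_tokens = 0.02         # assumption: text-embedding-3-small, USD

total_tokens = docs * avg_tokens_per_doc
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens ≈ ${cost:.2f}")   # 100,000,000 tokens ≈ $2.00
```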
When fine-tuning DOES make sense:
- You need a specific output format or tone the base model doesn't naturally produce
- You have <10k high-quality instruction pairs with clear input→output mapping
- Latency is critical and you want to bake domain vocabulary into weights
Recommendation: Build RAG first with pgvector or Pinecone. If retrieval quality plateaus below 85% on your eval set, then consider fine-tuning the retriever or reranker — not the generator.
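As a starting point, here is a minimal retrieve-then-generate sketch assuming pgvector (via psycopg) and the OpenAI SDK; the connection string, table, column, and model names are illustrative, not prescribed by the skill.

```python
# Minimal RAG retrieval sketch: embed the question, pull nearest chunks from
# pgvector, and ground the answer in them. Names below are illustrative.
import psycopg
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str) -> str:
    # 1. Embed the question with the same model used to embed the docs.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    vec_literal = "[" + ",".join(map(str, emb)) + "]"  # pgvector input format

    # 2. Retrieve the top-k nearest chunks (cosine distance via pgvector's <=>).
    with psycopg.connect("dbname=support") as conn:  # illustrative DSN
        rows = conn.execute(
            "SELECT url, content FROM doc_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT 5",
            (vec_literal,),
        ).fetchall()

    # 3. Ask the generator to answer only from retrieved chunks, with citations.
    context = "\n\n".join(f"[{url}]\n{content}" for url, content in rows)
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided docs; cite the source URL."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return reply.choices[0].message.content
```

Chunking, reranking, and an eval harness come next; this only shows the core retrieve-then-generate loop you would measure against that 85% threshold.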
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Requires a strong reasoning model — Claude Opus, GPT-4o, or Gemini 2.5 Pro. Architecture decisions need multi-step trade-off analysis that smaller models handle poorly.
Context Window
SKILL.md is ~12KB. Use 100K+ context for sessions covering fine-tuning datasets, inference configs, and eval results simultaneously.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Works Great With
These skills complement LLM Architect for end-to-end coverage. Install them together for better results.
Ready to try LLM Architect?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$ npx skills add johnefemer/skillfish --skill llm-architect