LLM Architect
LLM system design: model selection, fine-tuning strategies, inference optimization, context window management, multi-model routing, and cost analysis.
A deep systems skill for teams building serious LLM infrastructure. It covers model selection trade-offs across GPT-4o, Claude, Gemini, Llama, and Mistral; fine-tuning strategies (LoRA, QLoRA, RLHF, DPO); inference optimization with vLLM and TGI; context window management with sliding windows and KV cache tuning; and multi-model routing for cost-performance optimization.
Added Mar 20, 2026
$ npx skills add johnefemer/skillfish --skill llm-architect
What This Skill Can Do
Concrete capabilities you get when you install this skill.
Select the right model for latency, cost, and capability requirements
Design fine-tuning pipelines using LoRA, QLoRA, or DPO on custom datasets
Optimize inference throughput with vLLM, TGI, or llama.cpp configurations
Architect multi-model routing with fallback, load balancing, and A/B testing (a minimal routing sketch follows this list)
Manage context windows: chunking strategies, KV cache, and prompt compression
Analyze cost-performance trade-offs across model families and hosting options
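For illustration, here is a minimal sketch of the fallback-routing pattern mentioned above. It assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model names, size threshold, and routing rule are placeholders, not the skill's built-in policy.

```python
# Illustrative multi-model routing with fallback: try a cheap primary model,
# escalate to a stronger one on failure. All names/thresholds are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROUTES = [
    {"model": "gpt-4o-mini", "max_input_chars": 8_000},   # cheap default
    {"model": "gpt-4o", "max_input_chars": 200_000},      # escalation target
]

def complete(prompt: str) -> str:
    last_error = None
    for route in ROUTES:
        if len(prompt) > route["max_input_chars"]:
            continue  # simple size-based routing rule (illustrative only)
        try:
            resp = client.chat.completions.create(
                model=route["model"],
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return resp.choices[0].message.content
        except Exception as err:  # timeout / rate limit / 5xx: try the next route
            last_error = err
    raise RuntimeError("all routes failed") from last_error
```

A production router would also log per-route latency and cost and split traffic for A/B testing; this sketch only shows the fallback skeleton.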
Real Conversations
See how the LLM Architect skill responds to actual requests. These are representative examples of what your AI agent will produce.
We have 50,000 internal support docs and want our LLM to answer questions accurately. Should we fine-tune or use RAG?
For document Q&A, RAG wins almost every time over fine-tuning. Here's why:
Fine-tuning limitations for your case:
- Fine-tuning teaches style and behavior, not facts. The model will hallucinate confidently after training on 50k docs.
- Your docs change. Retraining costs $500-2000/run and takes days.
- You lose attribution — users can't see which document answered their question.
RAG advantages:
- Ground answers in retrieved chunks — hallucination drops dramatically
- Update your knowledge base without retraining: add/remove documents instantly
- Native citations: return the source doc URL with every answer
- Cost: embeddings for 50k docs ≈ $2 one-time with text-embedding-3-small
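Back-of-the-envelope math behind that ≈ $2 figure (the average document length and per-token price are assumptions; check current pricing):

```python
# Rough one-time embedding cost estimate. Figures below are assumptions,
# not quoted pricing.
docs = 50_000
avg_tokens_per_doc = 2_000              # assumption: roughly 1,500 words per support doc
price_per_million_tokens = 0.02         # assumption: text-embedding-3-small, USD

total_tokens = docs * avg_tokens_per_doc
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens ≈ ${cost:.2f}")   # 100,000,000 tokens ≈ $2.00
```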
When fine-tuning DOES make sense:
- You need a specific output format or tone the base model doesn't naturally produce
- You have <10k high-quality instruction pairs with clear input→output mapping
- Latency is critical and you want to bake domain vocabulary into weights
Recommendation: Build RAG first with pgvector or Pinecone. If retrieval quality plateaus below 85% on your eval set, then consider fine-tuning the retriever or reranker — not the generator.
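As a starting point, here is a minimal retrieve-then-generate sketch assuming pgvector (via psycopg) and the OpenAI SDK; the connection string, table, column, and model names are illustrative, not prescribed by the skill.

```python
# Minimal RAG retrieval sketch: embed the question, pull nearest chunks from
# pgvector, and ground the answer in them. Names below are illustrative.
import psycopg
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str) -> str:
    # 1. Embed the question with the same model used to embed the docs.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    vec_literal = "[" + ",".join(map(str, emb)) + "]"  # pgvector input format

    # 2. Retrieve the top-k nearest chunks (cosine distance via pgvector's <=>).
    with psycopg.connect("dbname=support") as conn:  # illustrative DSN
        rows = conn.execute(
            "SELECT url, content FROM doc_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT 5",
            (vec_literal,),
        ).fetchall()

    # 3. Ask the generator to answer only from retrieved chunks, with citations.
    context = "\n\n".join(f"[{url}]\n{content}" for url, content in rows)
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided docs; cite the source URL."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return reply.choices[0].message.content
```

Chunking, reranking, and an eval harness come next; this only shows the core retrieve-then-generate loop you would measure against that 85% threshold.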
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Requires a strong reasoning model — Claude Opus, GPT-4o, or Gemini 2.5 Pro. Architecture decisions need multi-step trade-off analysis that smaller models handle poorly.
Context Window
SKILL.md is ~12KB. Use 100K+ context for sessions covering fine-tuning datasets, inference configs, and eval results simultaneously.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Works Great With
These skills complement LLM Architect for end-to-end coverage. Install them together for better results.
Ready to try LLM Architect?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$ npx skills add johnefemer/skillfish --skill llm-architect