RAG Architect
RAG pipeline builder, chunking optimizer, and retrieval evaluator.
$skillfish add ~skills/rag-architect

Real Conversations
See how the RAG Architect skill responds to actual requests. These are representative examples of what your AI agent will produce.
We want to build an internal knowledge base chatbot over our Confluence wiki — about 50,000 pages. Engineers should be able to ask questions about our architecture, runbooks, and processes. We're using Python. What's the best architecture for this?
At 50k pages, the architecture decisions that matter most are chunking strategy, vector store selection, and hybrid retrieval.
Recommended stack:
- Chunking: Hierarchical (page title + section heading + content block)
- Embedding model: text-embedding-3-small (1536 dim, $0.02/1M tokens; ~$0.50 to embed all pages once)
- Vector store: pgvector on PostgreSQL (you likely already have Postgres, so no new infra; a schema sketch follows this list)
- Retrieval: Hybrid (dense + BM25 sparse) with cross-encoder reranking
- LLM: Claude 3.5 Sonnet or GPT-4o for generation
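pgvector needs only the extension and one table. A minimal setup sketch, assuming psycopg 3 and pgvector 0.5+ on the server; the `wiki_chunks` table name and DSN are illustrative:

```python
# Minimal pgvector setup sketch. Assumes psycopg 3 and pgvector >= 0.5;
# the table name `wiki_chunks` and the DSN are illustrative choices.
import psycopg

with psycopg.connect("dbname=kb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS wiki_chunks (
            id bigserial PRIMARY KEY,
            text text NOT NULL,
            metadata jsonb NOT NULL,
            embedding vector(1536)  -- text-embedding-3-small dimension
        )
    """)
    # HNSW index for fast approximate nearest-neighbor search
    conn.execute("""
        CREATE INDEX IF NOT EXISTS wiki_chunks_embedding_idx
        ON wiki_chunks USING hnsw (embedding vector_cosine_ops)
    """)
```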
Chunking for Confluence pages:
```python
def chunk_confluence_page(page: dict) -> list[dict]:
    chunks = []
    # Always include page title + space in every chunk for context
    page_context = f"Page: {page['title']} | Space: {page['space']}"
    # Split on H2/H3 headings (Confluence structure); helper sketched below
    sections = split_by_headings(page['body_storage'], max_tokens=400)
    for section in sections:
        chunks.append({
            'text': f"{page_context}\n{section['heading']}\n{section['content']}",
            'metadata': {
                'page_id': page['id'],
                'title': page['title'],
                'space': page['space'],
                'url': page['_links']['webui'],
                'last_modified': page['version']['when'],
            }
        })
    return chunks
```
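split_by_headings is a helper you would supply. One rough sketch, assuming Confluence storage-format HTML with headings at the top level of the body, parsed with BeautifulSoup, and tiktoken for token counts; all of these are assumptions, not requirements of the stack:

```python
# Rough sketch of the split_by_headings helper used above. Assumes
# headings sit at the top level of the storage-format body; adjust the
# parsing to your export format.
import tiktoken
from bs4 import BeautifulSoup

enc = tiktoken.get_encoding("cl100k_base")

def split_by_headings(html: str, max_tokens: int = 400) -> list[dict]:
    soup = BeautifulSoup(html, 'html.parser')
    sections: list[dict] = []
    heading = ''
    buf: list[str] = []

    def flush() -> None:
        text = ' '.join(buf).strip()
        if text:
            # Hard truncation keeps the sketch short; production code would
            # split oversized sections on paragraph boundaries instead
            tokens = enc.encode(text)[:max_tokens]
            sections.append({'heading': heading, 'content': enc.decode(tokens)})
        buf.clear()

    for el in soup.find_all(recursive=False):  # top-level elements only
        if el.name in ('h2', 'h3'):
            flush()
            heading = el.get_text(strip=True)
        else:
            buf.append(el.get_text(' ', strip=True))
    flush()
    return sections
```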
Hybrid retrieval with RRF fusion:
```python
def hybrid_search(query: str, k: int = 10):
    # Dense: semantic similarity (embed the query with the same model
    # used at index time)
    query_embedding = embed_query(query)
    dense_results = pgvector_search(query_embedding, k=20)
    # Sparse: keyword matching
    sparse_results = bm25_search(query, k=20)
    # Fuse with Reciprocal Rank Fusion (sketched below)
    return reciprocal_rank_fusion(dense_results, sparse_results, k=k)
```
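reciprocal_rank_fusion takes only a few lines. A sketch, assuming each hit is a dict carrying a stable 'id' key (e.g. the pgvector row id):

```python
# Reciprocal Rank Fusion sketch. Assumes each result list is ordered
# best-first and each hit dict carries a stable 'id' key.
def reciprocal_rank_fusion(*result_lists: list[dict], k: int = 10) -> list[dict]:
    RRF_C = 60  # standard smoothing constant from the original RRF paper
    scores: dict = {}
    hits: dict = {}
    for results in result_lists:
        for rank, hit in enumerate(results, start=1):
            scores[hit['id']] = scores.get(hit['id'], 0.0) + 1.0 / (RRF_C + rank)
            hits[hit['id']] = hit
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [hits[doc_id] for doc_id in best]
```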
Why hybrid matters here: Engineers search for exact command names (kubectl rollout restart) and semantic concepts ('how to scale a service') — pure dense retrieval misses exact terms, pure BM25 misses semantic intent.
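The cross-encoder reranking step from the stack above fits in a few lines too. A sketch using sentence-transformers; the MiniLM model named here is one common lightweight choice, not a requirement:

```python
# Cross-encoder reranking sketch, assuming the sentence-transformers
# package; ms-marco-MiniLM-L-6-v2 is one common lightweight reranker.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # Score each (query, chunk) pair jointly; slower than embedding
    # similarity, but much better at final ordering
    scores = reranker.predict([(query, c['text']) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```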
Estimated cost: ~$0.50 for initial embedding. $0.002/query at 10 chunks retrieved. At 1k queries/day: ~$60/month.
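That ~$0.50 figure comes from one batched embedding pass over the corpus. A minimal indexing sketch, assuming the OpenAI Python SDK, psycopg 3, and the wiki_chunks table from the schema sketch above:

```python
# Batched embedding + insert sketch. Assumes the OpenAI Python SDK and
# psycopg 3, plus the wiki_chunks table sketched earlier; the embeddings
# API accepts batched inputs, so one request covers many chunks.
from openai import OpenAI
from psycopg.types.json import Jsonb
import psycopg

client = OpenAI()

def index_chunks(conn: psycopg.Connection, chunks: list[dict],
                 batch_size: int = 512) -> None:
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        resp = client.embeddings.create(
            model='text-embedding-3-small',
            input=[c['text'] for c in batch],
        )
        for chunk, item in zip(batch, resp.data):
            # pgvector accepts the '[x,y,...]' text form cast to vector
            vec = '[' + ','.join(str(x) for x in item.embedding) + ']'
            conn.execute(
                "INSERT INTO wiki_chunks (text, metadata, embedding) "
                "VALUES (%s, %s, %s::vector)",
                (chunk['text'], Jsonb(chunk['metadata']), vec),
            )
        conn.commit()
```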
Get Started
How to install and use this skill in your preferred environment.
Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase. Web-based AI can use the knowledge and frameworks, but won't have tool access.
Models & Context
Which AI models and context windows work best with this skill.
Recommended Models
Larger models produce more detailed, production-ready outputs.
Context Window
This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.
All current frontier models (Claude, GPT, Gemini) support 100K+ context. Use the full window for complex multi-service work.
Pro tips for best results
Be specific
Include numbers — users, budget, RPS — so the skill can size the architecture.
Share constraints
Compliance needs, team size, and existing stack all improve the output.
Iterate
Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
Combine skills
Pair with companion skills below for end-to-end coverage.
Ready to try RAG Architect?
Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
$skillfish add ~skills/rag-architect