GENERATIVE AI RAG REFERENCE ARCHITECTURE
Quick Answer: A production-grade retrieval-augmented generation (RAG) system has eight components: document ingestion, chunking, embedding, vector store, retrieval, re-ranking, generation with a foundation model, and evaluation. The difference between a prototype and a system is evaluation discipline and governance. This reference architecture covers each layer with opinionated defaults, an evaluation framework, and the governance controls enterprises need before shipping.
WHY RAG, AND WHERE IT BREAKS
RAG lets a foundation model answer using an organization's private knowledge without retraining the model. The original paper is Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020).
RAG works. It also breaks in predictable ways: stale indexes, poor chunking, off-target retrieval, hallucinations over thin retrieval, and evaluation theatre. This architecture is opinionated about where to spend effort.
THE EIGHT COMPONENTS
1. Document ingestion. Connectors to source systems (SharePoint, Confluence, Salesforce, Google Drive, databases). Track provenance: source, author, timestamp, access-control list. Never lose provenance metadata; it drives filtering, citation, and governance.
2. Chunking. Split documents into retrievable units. Default: semantic chunking at paragraph boundaries, 300–800 token chunks, 50-token overlap. Avoid fixed-token chunking across sentence boundaries — it destroys retrieval quality. Document type matters: code chunks by function; tables chunk by row group; transcripts chunk by speaker turn.
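As a concrete illustration, here is a minimal Python sketch of paragraph-boundary chunking with a token budget and overlap. The whitespace split is a crude stand-in for your embedding model's real tokenizer, and `chunk_paragraphs` is a hypothetical helper, not a library call:

```python
def chunk_paragraphs(text, max_tokens=800, overlap_tokens=50):
    """Greedily pack paragraphs into chunks of up to max_tokens, with overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())  # crude token proxy; use a real tokenizer
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            # carry a short tail forward so context spans chunk boundaries
            tail = " ".join(" ".join(current).split()[-overlap_tokens:])
            current, current_len = [tail], len(tail.split())
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```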
3. Embedding. Dense vector representation of each chunk. Default: domain-appropriate production embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or equivalent). Re-embed on model version change. Store raw text alongside the vector — you will need it.
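A minimal sketch of the embed step, assuming the OpenAI Python SDK (v1+) with an `OPENAI_API_KEY` in the environment; note that the raw text and the model version are stored with each vector:

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

client = OpenAI()

def embed_chunks(chunks, model="text-embedding-3-large"):
    """Return records pairing each chunk's raw text with its vector."""
    resp = client.embeddings.create(model=model, input=chunks)
    return [
        {"text": chunk, "vector": item.embedding, "embedding_model": model}
        for chunk, item in zip(chunks, resp.data)
    ]
```

Storing `embedding_model` on each record is what lets you find which records need re-embedding when the model version changes.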
4. Vector store. Default: managed (Pinecone, Weaviate, Qdrant, pgvector). Key selection criteria: metadata filtering, hybrid search (dense + BM25 sparse), access-control enforcement, and multi-region replication if latency matters.
5. Retrieval. Hybrid search by default: dense embedding + keyword (BM25). Filter by metadata before retrieval (tenant, ACL, document type, freshness). Retrieve 20–50 candidates for the re-ranker; do not feed the raw top 3 straight to the model.
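One common way to merge the dense and BM25 result lists is reciprocal rank fusion (RRF); a minimal sketch, assuming `dense_hits` and `sparse_hits` are ranked lists of chunk IDs already filtered by tenant and ACL metadata:

```python
def reciprocal_rank_fusion(dense_hits, sparse_hits, k=60, top_n=50):
    """Merge two ranked ID lists; k=60 is the conventional RRF constant."""
    scores = {}
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```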
6. Re-ranking. Feed retrieved candidates to a cross-encoder re-ranker (Cohere Rerank, or equivalent). Re-ranking is the highest-leverage quality lever in most RAG systems. Skipping it is the single most common production mistake we see.
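A sketch using the open-source sentence-transformers CrossEncoder as the re-ranker; the checkpoint name is a common public model, not a recommendation, and a hosted re-ranker such as Cohere Rerank slots in the same way:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=10):
    """Score each (query, chunk) pair jointly and keep the top_k chunks."""
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```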
7. Generation. Send top 5–10 re-ranked chunks to the foundation model with a system prompt instructing grounded citation. Default foundation model: Anthropic Claude for long-context and citation discipline; evaluate alternatives per use case. Enforce "cite source or refuse" in the system prompt.
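A grounded-generation sketch using the Anthropic SDK; the model ID and prompt wording are illustrative assumptions to adapt per use case:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY set

SYSTEM = (
    "Answer only from the numbered sources provided. Cite every claim as [n]. "
    "If the sources do not contain the answer, say so and refuse to answer."
)

def generate(query, chunks, model="claude-sonnet-4-20250514"):  # model ID is an assumption
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    msg = client.messages.create(
        model=model,
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user",
                   "content": f"Sources:\n{context}\n\nQuestion: {query}"}],
    )
    return msg.content[0].text
```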
8. Evaluation. Offline: curated eval set scored on retrieval hit rate, groundedness, answer correctness, and citation precision. Online: A/B or shadow deployment with user-feedback capture. Evaluation is not optional; it is the system.
THE ARCHITECTURE AT A GLANCE
| Layer | Component | Default Choice | Key Risk If Skipped |
|---|---|---|---|
| Source | Document ingestion | Managed connectors | Lost provenance |
| Prep | Chunking | Semantic, 300–800 tokens | Retrieval noise |
| Prep | Embedding | Production-grade model | Poor recall |
| Store | Vector store | Managed with hybrid search | Metadata filtering fails |
| Serve | Retrieval | Hybrid (dense + BM25) | Missed relevant docs |
| Serve | Re-ranker | Cross-encoder | Garbage-in-garbage-out |
| Serve | Generation | Claude or equivalent | Hallucination |
| Measure | Evaluation | Offline + online | You ship blind |
RAG VS FINE-TUNING — WHEN TO USE WHICH
Use RAG when:
- Knowledge is proprietary and changes often
- You need source citations
- Access control matters at query time
- Latency allows 1–3 seconds of retrieval
Use fine-tuning when:
- You need consistent tone, style, or format
- The task is narrow and stable
- Latency budget is tight (no retrieval hop)
- The base model keeps misinterpreting domain language
Use both when you need domain-tuned style and fresh private knowledge: fine-tune for style and reasoning, use RAG for facts.
Most enterprise use cases are RAG-first, fine-tune-maybe. Building custom foundation models is rarely justified.
EVALUATION — THE CORE DISCIPLINE
The evaluation framework separates production systems from demos.
Offline evaluation set. 100–500 curated question-answer pairs with expected citations. Re-run on every change to prompt, retrieval logic, chunking, embedding, or model.
Four metrics per question (a scoring sketch follows the list):
- Retrieval hit rate: did the correct chunk get into the top-N retrieved?
- Re-rank precision: did the correct chunk get into the top-K re-ranked?
- Groundedness: does the answer cite sources from the retrieval context?
- Answer correctness: is the answer factually right per a human rater?
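A minimal scoring sketch for the first and third metrics; the record fields (`expected_chunk_id`, `retrieved_ids`) are assumptions from the eval-set design above, not a standard schema:

```python
def retrieval_hit_rate(eval_records, top_n=50):
    """Fraction of questions whose expected chunk appears in the top-N retrieved."""
    hits = sum(r["expected_chunk_id"] in r["retrieved_ids"][:top_n] for r in eval_records)
    return hits / len(eval_records)

def groundedness(cited_ids, retrieved_ids):
    """Fraction of an answer's citations that resolve to the retrieval context."""
    if not cited_ids:
        return 0.0
    return sum(c in retrieved_ids for c in cited_ids) / len(cited_ids)
```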
Online evaluation. Shadow traffic or A/B test every material change. Capture user feedback (thumbs, corrections, escalations). Log every retrieval and generation with trace IDs.
Cadence. Run offline evaluation weekly at minimum; on every prompt or model change; on every index refresh for regulated-content use cases.
GOVERNANCE CONTROLS THAT ENTERPRISE NEEDS
- Model inventory. Every RAG system in a central registry with risk classification, business owner, and evaluation cadence. Required for alignment with the NIST AI RMF and ISO/IEC 42001.
- Access control enforcement at retrieval time. The user's permissions must filter retrievable chunks before similarity search runs, so restricted content never enters the model context. Never rely on the model to respect ACLs.
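A sketch of what the retrieval-time filter can look like; the filter-dict shape follows common vector-store metadata conventions (Pinecone-style operators here; check your store's syntax), and `tenant_id` / `groups` are assumed fields on your user object:

```python
def build_retrieval_filter(user):
    """Metadata filter applied BEFORE similarity search, never after generation."""
    return {
        "tenant_id": {"$eq": user.tenant_id},       # hard tenant isolation
        "acl_groups": {"$in": list(user.groups)},   # user must hold an allowed group
    }
```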
- PII and sensitive-data redaction. Automated redaction at ingestion plus logging controls at retrieval and generation.
- Prompt injection defences. Separate system prompt from user content. Sanitize retrieved content. Monitor for anomalous instruction-following.
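A minimal hygiene sketch for retrieved content: wrap each chunk in delimiters the system prompt declares as data-only, and flag instruction-like phrases for monitoring. The patterns and tag format are illustrative, not a complete defence:

```python
import re

SUSPECT = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|system prompt", re.I
)

def wrap_chunk(chunk_text, source_id, log):
    """Delimit retrieved text as data and log anomalous instruction-like content."""
    if SUSPECT.search(chunk_text):
        log.warning("instruction-like text in source %s", source_id)
    return f'<source id="{source_id}">\n{chunk_text}\n</source>'
```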
- Citation enforcement. The system prompt must require citation; the application layer should display citations to the user and block ungrounded answers for high-stakes use cases.
- Incident response. Documented playbook for hallucination incidents, data leaks, and model failure. Named owner. Monthly tabletop exercise.
COMMON FAILURE MODES AND FIXES
Failure: answers are often plausible but wrong. Likely cause: no re-ranker, or chunking too coarse. Fix: add cross-encoder re-ranker; tighten chunks.
Failure: system cites the wrong source. Likely cause: weak prompt, or retrieval noise. Fix: enforce citation-or-refuse in prompt; filter retrieval by metadata.
Failure: system leaks data across tenants. Likely cause: ACL enforced at generation, not retrieval. Fix: enforce ACL in the retrieval filter before embedding similarity is computed.
Failure: latency unacceptable in production. Likely cause: retrieving too many candidates or using a large re-ranker. Fix: tune top-N retrieval count; consider a smaller re-ranker; cache frequent queries.
Failure: evaluation scores great offline, badly online. Likely cause: eval set does not represent user queries. Fix: build eval set from sampled production queries (with PII stripped).
HOW WE DEPLOY THIS AT NUUN
NUUN AI builds RAG systems to this architecture for clients in financial services, health, retail, and government. A 90-day production deployment typically runs: weeks 1–3 source-connector and governance design; weeks 4–7 chunking/embedding/retrieval build and offline eval set; weeks 8–10 re-ranker and prompt hardening; weeks 11–13 shadow deployment and online eval; week 14+ phased rollout with named business owner.
Every system we ship has a documented model card, eval report, and governance artefact. We publish the eval set structure for each client so they can re-run it themselves.
FAQ
Q: Do we need a vector database? Postgres seems fine.
A: pgvector works well up to about 10M vectors and for teams already running Postgres. Above that, or when you need rich metadata filtering and hybrid search at scale, a managed vector database (Pinecone, Weaviate, Qdrant) pays back quickly.
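At that scale, a pgvector similarity query is one SQL statement; a sketch via psycopg, assuming a `chunks` table with a `vector`-typed `embedding` column and an HNSW or IVFFlat index on it:

```python
import psycopg  # assumes psycopg 3 and the pgvector extension installed

def top_k(conn, query_vec, tenant_id, k=50):
    """Cosine-distance search (pgvector's <=> operator) with a metadata pre-filter."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, text
            FROM chunks
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (tenant_id, vec_literal, k),
        )
        return cur.fetchall()
```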
Q: Which foundation model is best for RAG?
A: Per use case. Claude is strong for long-context, citation discipline, and instruction-following. GPT-4-class models are strong for general reasoning. Gemini is strong for multimodal. The architecture is model-agnostic; plan for swappability.
Q: How do we handle documents with tables or images?
A: Tables: chunk by row group, preserve headers in each chunk, use structured metadata. Images: use a vision-capable model at ingestion to produce a text description that is embedded alongside the image, and serve both at generation time.
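A sketch of row-group table chunking that keeps each chunk self-contained by repeating the header; `rows_per_chunk` is a tuning knob, not a fixed rule:

```python
def chunk_table(header_row, rows, rows_per_chunk=20):
    """Yield table chunks, each carrying the header so it retrieves standalone."""
    for i in range(0, len(rows), rows_per_chunk):
        yield "\n".join([header_row] + rows[i:i + rows_per_chunk])
```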
Q: What's the right chunk size?
A: 300–800 tokens for most English text, 50-token overlap, semantic boundaries. Technical content and code need smaller chunks; long-form narrative can go larger. Always test against your eval set.
Q: How often should we re-index?
A: Source change frequency drives this. Policy documents: weekly. Product documentation: daily. Operational data: streaming or hourly. Always track freshness as a metric.
Q: What's a reasonable offline eval budget?
A: 200 curated Q-A pairs with expected citations covers most enterprise use cases. Plan 20–40 hours of subject-matter-expert time to build it. Treat it as a first-class code artefact, version-controlled.
Q: Can we use RAG for customer-facing applications?
A: Yes, with stricter governance. Require citation-or-refuse, log every generation, implement rate limiting and prompt-injection defences, and run continuous online evaluation. Many enterprise-grade customer support deployments are RAG systems.
Q: What's the biggest mistake teams make?
A: Skipping the re-ranker. The second biggest is not building an eval set. The third is treating governance as paperwork instead of architecture.
RELATED READING
- NUUN AI Index 2026
- AI adoption benchmarking — Canada
- Top AI consultancies — mid-market retail
- Retrieval-augmented generation (glossary)