GENERATIVE AI RAG REFERENCE ARCHITECTURE
Quick Answer: A production-grade retrieval-augmented generation (RAG) system has eight components: document ingestion, chunking, embedding, vector store, retrieval, re-ranking, generation with a foundation model, and evaluation. The difference between a prototype and a system is evaluation discipline and governance. This reference architecture covers each layer with opinionated defaults, an evaluation framework, and the governance controls enterprises need before shipping.
WHY RAG, AND WHERE IT BREAKS
RAG lets a foundation model answer using an organization's private knowledge without retraining the model. The original paper is Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020).
RAG works. It also breaks in predictable ways: stale indexes, poor chunking, off-target retrieval, hallucinations over thin retrieval, and evaluation theatre. This architecture is opinionated about where to spend effort.
THE EIGHT COMPONENTS
1. Document ingestion. Connectors to source systems (SharePoint, Confluence, Salesforce, Google Drive, databases). Track provenance: source, author, timestamp, access-control list. Never lose provenance metadata; it drives filtering, citation, and governance.
2. Chunking. Split documents into retrievable units. Default: semantic chunking at paragraph boundaries, 300–800 token chunks, 50-token overlap. Avoid fixed-token chunking across sentence boundaries — it destroys retrieval quality. Document type matters: code chunks by function; tables chunk by row group; transcripts chunk by speaker turn.
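As a concrete illustration, here is a minimal Python sketch of paragraph-boundary chunking with a token budget and overlap. The whitespace split is a crude stand-in for your embedding model's real tokenizer, and `chunk_paragraphs` is a hypothetical helper, not a library call:

```python
def chunk_paragraphs(text, max_tokens=800, overlap_tokens=50):
    """Greedily pack paragraphs into chunks of up to max_tokens, with overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())  # crude token proxy; use a real tokenizer
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            # carry a short tail forward so context spans chunk boundaries
            tail = " ".join(" ".join(current).split()[-overlap_tokens:])
            current, current_len = [tail], len(tail.split())
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```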
3. Embedding. Dense vector representation of each chunk. Default: domain-appropriate production embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or equivalent). Re-embed on model version change. Store raw text alongside the vector — you will need it.
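A minimal sketch of the embed step, assuming the OpenAI Python SDK (v1+) with an `OPENAI_API_KEY` in the environment; note that the raw text and the model version are stored with each vector:

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

client = OpenAI()

def embed_chunks(chunks, model="text-embedding-3-large"):
    """Return records pairing each chunk's raw text with its vector."""
    resp = client.embeddings.create(model=model, input=chunks)
    return [
        {"text": chunk, "vector": item.embedding, "embedding_model": model}
        for chunk, item in zip(chunks, resp.data)
    ]
```

Storing `embedding_model` on each record is what lets you find which records need re-embedding when the model version changes.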
4. Vector store. Default: managed (Pinecone, Weaviate, Qdrant, pgvector). Key selection criteria: metadata filtering, hybrid search (dense + BM25 sparse), access-control enforcement, and multi-region replication if latency matters.
5. Retrieval. Hybrid search by default: dense embedding + keyword (BM25). Filter by metadata before retrieval (tenant, ACL, document type, freshness). Retrieve 20–50 candidates for the re-ranker; do not feed the raw top 3 straight to the model.
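One common way to merge the dense and BM25 result lists is reciprocal rank fusion (RRF); a minimal sketch, assuming `dense_hits` and `sparse_hits` are ranked lists of chunk IDs already filtered by tenant and ACL metadata:

```python
def reciprocal_rank_fusion(dense_hits, sparse_hits, k=60, top_n=50):
    """Merge two ranked ID lists; k=60 is the conventional RRF constant."""
    scores = {}
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```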
6. Re-ranking. Feed retrieved candidates to a cross-encoder re-ranker (Cohere Rerank, or equivalent). Re-ranking is the highest-leverage quality lever in most RAG systems. Skipping it is the single most common production mistake we see.
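A sketch using the open-source sentence-transformers CrossEncoder as the re-ranker; the checkpoint name is a common public model, not a recommendation, and a hosted re-ranker such as Cohere Rerank slots in the same way:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=10):
    """Score each (query, chunk) pair jointly and keep the top_k chunks."""
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```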
7. Generation. Send top 5–10 re-ranked chunks to the foundation model with a system prompt instructing grounded citation. Default foundation model: Anthropic Claude for long-context and citation discipline; evaluate alternatives per use case. Enforce "cite source or refuse" in the system prompt.
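A grounded-generation sketch using the Anthropic SDK; the model ID and prompt wording are illustrative assumptions to adapt per use case:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY set

SYSTEM = (
    "Answer only from the numbered sources provided. Cite every claim as [n]. "
    "If the sources do not contain the answer, say so and refuse to answer."
)

def generate(query, chunks, model="claude-sonnet-4-20250514"):  # model ID is an assumption
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    msg = client.messages.create(
        model=model,
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user",
                   "content": f"Sources:\n{context}\n\nQuestion: {query}"}],
    )
    return msg.content[0].text
```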
8. Evaluation. Offline: curated eval set scored on retrieval hit rate, groundedness, answer correctness, and citation precision. Online: A/B or shadow deployment with user-feedback capture. Evaluation is not optional; it is the system.
THE ARCHITECTURE AT A GLANCE
| Layer | Component | Default Choice | Key Risk If Skipped |
|---|---|---|---|
| Source | Document ingestion | Managed connectors | Lost provenance |
| Prep | Chunking | Semantic, 300–800 tokens | Retrieval noise |
| Prep | Embedding | Production-grade model | Poor recall |
| Store | Vector store | Managed with hybrid search | Metadata filtering fails |
| Serve | Retrieval | Hybrid (dense + BM25) | Missed relevant docs |
| Serve | Re-ranker | Cross-encoder | Garbage-in-garbage-out |
| Serve | Generation | Claude or equivalent | Hallucination |
| Measure | Evaluation | Offline + online | You ship blind |
RAG VS FINE-TUNING — WHEN TO USE WHICH
Use RAG when:
- Knowledge is proprietary and changes often
- You need source citations
- Access control matters at query time
- Latency allows 1–3 seconds of retrieval
Use fine-tuning when:
- You need consistent tone, style, or format
- The task is narrow and stable
- Latency budget is tight (no retrieval hop)
- The base model keeps misinterpreting domain language
Use both when you need domain-tuned style and fresh private knowledge: fine-tune for style and reasoning, use RAG for facts.
Most enterprise use cases are RAG-first, fine-tune-maybe. Building custom foundation models is rarely justified.
EVALUATION — THE CORE DISCIPLINE
The evaluation framework separates production systems from demos.
Offline evaluation set. 100–500 curated question-answer pairs with expected citations. Re-run on every change to prompt, retrieval logic, chunking, embedding, or model.
Four metrics per question (a scoring sketch follows the list):
- Retrieval hit rate: did the correct chunk get into the top-N retrieved?
- Re-rank precision: did the correct chunk get into the top-K re-ranked?
- Groundedness: does the answer cite sources from the retrieval context?
- Answer correctness: is the answer factually right per a human rater?
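A minimal scoring sketch for the first and third metrics; the record fields (`expected_chunk_id`, `retrieved_ids`) are assumptions from the eval-set design above, not a standard schema:

```python
def retrieval_hit_rate(eval_records, top_n=50):
    """Fraction of questions whose expected chunk appears in the top-N retrieved."""
    hits = sum(r["expected_chunk_id"] in r["retrieved_ids"][:top_n] for r in eval_records)
    return hits / len(eval_records)

def groundedness(cited_ids, retrieved_ids):
    """Fraction of an answer's citations that resolve to the retrieval context."""
    if not cited_ids:
        return 0.0
    return sum(c in retrieved_ids for c in cited_ids) / len(cited_ids)
```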
Online evaluation. Shadow traffic or A/B test every material change. Capture user feedback (thumbs, corrections, escalations). Log every retrieval and generation with trace IDs.
Cadence. Run offline evaluation weekly at minimum; on every prompt or model change; on every index refresh for regulated-content use cases.
GOVERNANCE CONTROLS THAT ENTERPRISE NEEDS
- Model inventory. Every RAG system in a central registry with risk classification, business owner, and evaluation cadence. Required for alignment with the NIST AI RMF and ISO/IEC 42001.
- Access control enforcement at retrieval time. The user's permissions must filter retrievable chunks before similarity search runs, so restricted content never enters the model context. Never rely on the model to respect ACLs.
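A sketch of what the retrieval-time filter can look like; the filter-dict shape follows common vector-store metadata conventions (Pinecone-style operators here; check your store's syntax), and `tenant_id` / `groups` are assumed fields on your user object:

```python
def build_retrieval_filter(user):
    """Metadata filter applied BEFORE similarity search, never after generation."""
    return {
        "tenant_id": {"$eq": user.tenant_id},       # hard tenant isolation
        "acl_groups": {"$in": list(user.groups)},   # user must hold an allowed group
    }
```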
- PII and sensitive-data redaction. Automated redaction at ingestion plus logging controls at retrieval and generation.
- Prompt injection defences. Separate system prompt from user content. Sanitize retrieved content. Monitor for anomalous instruction-following.
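A minimal hygiene sketch for retrieved content: wrap each chunk in delimiters the system prompt declares as data-only, and flag instruction-like phrases for monitoring. The patterns and tag format are illustrative, not a complete defence:

```python
import re

SUSPECT = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|system prompt", re.I
)

def wrap_chunk(chunk_text, source_id, log):
    """Delimit retrieved text as data and log anomalous instruction-like content."""
    if SUSPECT.search(chunk_text):
        log.warning("instruction-like text in source %s", source_id)
    return f'<source id="{source_id}">\n{chunk_text}\n</source>'
```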
- Citation enforcement. The system prompt must require citation; the application layer should display citations to the user and block ungrounded answers for high-stakes use cases.
- Incident response. Documented playbook for hallucination incidents, data leaks, and model failure. Named owner. Monthly tabletop exercise.
COMMON FAILURE MODES AND FIXES
Failure: answers are often plausible but wrong. Likely cause: no re-ranker, or chunking too coarse. Fix: add cross-encoder re-ranker; tighten chunks.
Failure: system cites the wrong source. Likely cause: weak prompt, or retrieval noise. Fix: enforce citation-or-refuse in prompt; filter retrieval by metadata.
Failure: system leaks data across tenants. Likely cause: ACL enforced at generation, not retrieval. Fix: enforce ACL in the retrieval filter before embedding similarity is computed.
Failure: latency unacceptable in production. Likely cause: retrieving too many candidates or using a large re-ranker. Fix: tune top-N retrieval count; consider a smaller re-ranker; cache frequent queries.
Failure: evaluation scores great offline, badly online. Likely cause: eval set does not represent user queries. Fix: build eval set from sampled production queries (with PII stripped).
HOW WE DEPLOY THIS AT NUUN
NUUN AI builds RAG systems to this architecture for clients in financial services, health, retail, and government. A 90-day production deployment typically runs: weeks 1–3 source-connector and governance design; weeks 4–7 chunking/embedding/retrieval build and offline eval set; weeks 8–10 re-ranker and prompt hardening; weeks 11–13 shadow deployment and online eval; week 14+ phased rollout with named business owner.
Every system we ship has a documented model card, eval report, and governance artefact. We publish the eval set structure for each client so they can re-run it themselves.
FAQ
Q: Do we need a vector database? Postgres seems fine.
A: pgvector works well up to about 10M vectors and for teams already running Postgres. Above that, or when you need rich metadata filtering and hybrid search at scale, a managed vector database (Pinecone, Weaviate, Qdrant) pays back quickly.
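At that scale, a pgvector similarity query is one SQL statement; a sketch via psycopg, assuming a `chunks` table with a `vector`-typed `embedding` column and an HNSW or IVFFlat index on it:

```python
import psycopg  # assumes psycopg 3 and the pgvector extension installed

def top_k(conn, query_vec, tenant_id, k=50):
    """Cosine-distance search (pgvector's <=> operator) with a metadata pre-filter."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, text
            FROM chunks
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (tenant_id, vec_literal, k),
        )
        return cur.fetchall()
```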
Q: Which foundation model is best for RAG?
A: Per use case. Claude is strong for long-context, citation discipline, and instruction-following. GPT-4-class models are strong for general reasoning. Gemini is strong for multimodal. The architecture is model-agnostic; plan for swappability.
Q: How do we handle documents with tables or images?
A: Tables: chunk by row group, preserve headers in each chunk, use structured metadata. Images: use a vision-capable model at ingestion to produce a text description that is embedded alongside the image, and serve both at generation time.
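A sketch of row-group table chunking that keeps each chunk self-contained by repeating the header; `rows_per_chunk` is a tuning knob, not a fixed rule:

```python
def chunk_table(header_row, rows, rows_per_chunk=20):
    """Yield table chunks, each carrying the header so it retrieves standalone."""
    for i in range(0, len(rows), rows_per_chunk):
        yield "\n".join([header_row] + rows[i:i + rows_per_chunk])
```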
Q: What's the right chunk size?
A: 300–800 tokens for most English text, 50-token overlap, semantic boundaries. Technical content and code need smaller chunks; long-form narrative can go larger. Always test against your eval set.
Q: How often should we re-index?
A: Source change frequency drives this. Policy documents: weekly. Product documentation: daily. Operational data: streaming or hourly. Always track freshness as a metric.
Q: What's a reasonable offline eval budget?
A: 200 curated Q-A pairs with expected citations covers most enterprise use cases. Plan 20–40 hours of subject-matter-expert time to build it. Treat it as a first-class code artefact, version-controlled.
Q: Can we use RAG for customer-facing applications?
A: Yes, with stricter governance. Require citation-or-refuse, log every generation, implement rate limiting and prompt-injection defences, and run continuous online evaluation. Many enterprise-grade customer support deployments are RAG systems.
Q: What's the biggest mistake teams make?
A: Skipping the re-ranker. The second biggest is not building an eval set. The third is treating governance as paperwork instead of architecture.
RELATED READING
- NUUN AI Index 2026
- AI adoption benchmarking — Canada
- Top AI consultancies — mid-market retail
- Retrieval-augmented generation (glossary)