WHAT IT IS
A RAG (retrieval-augmented generation) pipeline has five stages:
1. Document ingestion: chunking and cleaning raw sources.
2. Embedding: converting chunks to vectors with OpenAI, Cohere, Voyage, or open models.
3. Storage: writing vectors to a vector database (Pinecone, Weaviate, Qdrant, pgvector, Elastic).
4. Retrieval: semantic search, often hybrid with BM25 and reranked with Cohere Rerank or BGE.
5. Generation: an LLM answers conditioned on the retrieved context, with explicit citations.
A minimal end-to-end sketch of the five stages follows.
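The Python below is a toy sketch, not a production implementation: a hashed bag-of-words embed() stands in for a real embedding model (an OpenAI or Cohere call would replace it), and an in-memory list stands in for a vector database. All function and field names here are illustrative, not any particular library's API.

import math
import re
from collections import Counter

# Ingestion: naive fixed-size chunking of a cleaned document.
def chunk(text, size=200):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Embedding: hashed bag-of-words placeholder. A real pipeline would call
# an embedding model here and get a dense semantic vector back.
def embed(text, dim=256):
    vec = [0.0] * dim
    for token, count in Counter(re.findall(r"\w+", text.lower())).items():
        vec[hash(token) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Storage: an in-memory list playing the role of the vector database.
store = []

def ingest(doc_id, text):
    for i, c in enumerate(chunk(text)):
        store.append({"id": f"{doc_id}#{i}", "text": c, "vec": embed(c)})

# Retrieval: cosine similarity over the store (vectors are unit-norm,
# so a plain dot product suffices).
def retrieve(query, k=3):
    qv = embed(query)
    scored = [(sum(a * b for a, b in zip(qv, c["vec"])), c) for c in store]
    return [c for _, c in sorted(scored, key=lambda s: -s[0])[:k]]

# Generation: build a prompt that conditions the LLM on retrieved context
# and asks it to cite chunk ids explicitly.
def build_prompt(query, chunks):
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (f"Answer using only the context below. Cite chunk ids in brackets.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

ingest("handbook", "Employees accrue 1.5 vacation days per month. Unused days roll over.")
print(build_prompt("How many vacation days per month?", retrieve("vacation days per month")))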
HOW IT WORKS
On top of the core pipeline, production-grade RAG layers in metadata filtering, access control (user-level permissions on retrieved docs), evaluation (RAGAS, hit rate, faithfulness), and prompt design that forces citation. The pattern was formalized by Lewis et al. at Facebook AI Research (now Meta AI) in 2020 and has since become the default architecture for enterprise knowledge assistants.
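A minimal sketch of two of those layers: an access-control-plus-metadata filter applied to retrieved chunks before generation, and a simple hit-rate metric. The record schema (allowed_groups, source, updated) and function names are assumptions made up for illustration, not a real store's format, and this does not use the RAGAS library's actual API.

# Hypothetical chunk records as a vector store might return them; the
# metadata fields here are illustrative, not a real schema.
CHUNKS = [
    {"id": "policy#0", "text": "Vacation accrues at 1.5 days/month.",
     "meta": {"allowed_groups": {"hr", "all-staff"}, "source": "policy", "updated": "2024-06"}},
    {"id": "payroll#3", "text": "Salary bands for finance staff.",
     "meta": {"allowed_groups": {"finance"}, "source": "payroll", "updated": "2024-01"}},
]

def filter_chunks(chunks, user_groups, required_source=None):
    """Drop anything the user may not see, then apply metadata filters.
    Filtering before generation is what keeps the LLM from leaking
    documents the caller was never authorized to read."""
    out = []
    for c in chunks:
        if not (user_groups & c["meta"]["allowed_groups"]):
            continue  # access control: no shared group, skip
        if required_source and c["meta"]["source"] != required_source:
            continue  # metadata filter, e.g. restrict to one source type
        out.append(c)
    return out

def hit_rate(results_per_query, relevant_ids_per_query):
    """Fraction of queries whose top-k results contain at least one
    known-relevant chunk: one of the simplest retrieval metrics."""
    hits = sum(
        any(c["id"] in rel for c in res)
        for res, rel in zip(results_per_query, relevant_ids_per_query)
    )
    return hits / len(relevant_ids_per_query)

visible = filter_chunks(CHUNKS, user_groups={"all-staff"})
print([c["id"] for c in visible])           # only chunks this user may read
print(hit_rate([visible], [{"policy#0"}]))  # 1.0 if a relevant chunk survived

Faithfulness (does the answer follow from the retrieved context?) usually needs an LLM-as-judge or a framework such as RAGAS rather than a few lines of Python, which is why only hit rate is sketched here.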
WHEN TO USE
Use RAG when answers must reflect proprietary, recent, or verifiable content: internal knowledge assistants, customer support, policy Q&A, research synthesis. Skip it when the model's training data is already sufficient and current, since the retrieval stack then adds latency, cost, and failure modes without improving answers.