RAG Architecture: The Practical Guide for UK Service Businesses

Most AI agents built for UK service businesses fail at the same point: they get the answer wrong. Not because the underlying model is incapable, but because the model is guessing. It has no access to your client records, your service agreements, your pricing, your internal policies, or anything else specific to your business. RAG — Retrieval-Augmented Generation — is the engineering pattern that fixes this. It grounds your AI agents in your actual business knowledge, so when a client asks about their engagement terms or your agent drafts a deliverable, it is drawing from your documents rather than making something up. This is the practical guide to building it right.

Why AI Agents Hallucinate Without RAG

A large language model is trained on a fixed dataset with a cutoff date. Everything it knows comes from that dataset. It has no access to your company's files, your client history, your pricing structure, or the specifics of any engagement you have ever run. When you ask it a question that requires any of that knowledge, it either declines or — more dangerously — produces a confident-sounding answer that is fabricated.

This is called hallucination, and it is not a bug that future model versions will fix. It is an architectural property of how language models work. A model with no access to your documents cannot tell you what your standard contract says. It can make something up that sounds plausible. In a customer support context, a proposal drafting context, or a compliance context, that is unacceptable.

The solution is not a more capable model. The solution is a better architecture. RAG solves this by giving the model access to a curated knowledge base at inference time. Instead of relying on training data, the system retrieves the relevant parts of your actual documents and passes them to the model as context. The model generates its response based on what it can see in that context window, not what it remembers from training. The answer is grounded in your knowledge, not fabricated from statistical patterns.

In every AI agent build we have done for UK service businesses — from the Yorkshire accountancy firm to the Oxford management consultancy — the knowledge base layer is not optional. It is what separates an agent that produces reliable outputs from one that occasionally makes things up and erodes user trust over time.

RAG does not make the model smarter. It makes the model better informed: it works from your specific business knowledge, kept as current as the documents in your knowledge base, rather than from a frozen snapshot of the internet.

The Four Components of a RAG System

A RAG system has four distinct components. Understanding each one is necessary before you can make sensible decisions about how to build yours.

1. Document Ingestion

Your knowledge base is only as good as what goes into it. For a UK service business, the relevant documents typically include: standard operating procedures, service agreements and contract templates, past client deliverables (appropriately anonymised), pricing schedules, product and service descriptions, internal policy documents, and frequently asked questions. The ingestion process loads these documents into the system, parses them into plain text, and prepares them for the next stage. This sounds simple. Getting it right — handling PDFs, Word documents, email threads, and spreadsheets without losing structure — requires deliberate choices about your document parsing approach.

2. Chunking and Embedding

Raw documents are too large to pass whole into a language model's context window. They need to be split into chunks — smaller pieces of text that can be retrieved individually. Each chunk is then converted into a numerical representation called an embedding: a vector of floating-point numbers that captures the semantic meaning of the text. Two chunks that mean similar things will have embeddings that are numerically close to each other. This is what enables semantic search — the ability to find relevant content based on meaning rather than keyword matching.
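To make "numerically close" concrete, here is a minimal sketch of cosine similarity, one common way of comparing embeddings. The three-dimensional vectors are invented for illustration; real embedding models return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example only.
day_rate = [0.9, 0.1, 0.0]        # "What is the consultant day rate?"
hourly_fee = [0.8, 0.2, 0.1]      # "Our hourly fee schedule"
holiday_policy = [0.0, 0.2, 0.9]  # "Annual leave policy"

pricing_score = cosine_similarity(day_rate, hourly_fee)
unrelated_score = cosine_similarity(day_rate, holiday_policy)
```

In this toy data the two pricing texts score far higher against each other than the pricing text does against the leave policy, which is the property semantic search relies on.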

3. Vector Storage

Embeddings are stored in a vector database — a purpose-built store optimised for similarity search over high-dimensional numerical vectors. When a user query arrives, it is also converted into an embedding, and the vector database returns the chunks whose embeddings are most similar to the query embedding. This is the retrieval step: finding the relevant pieces of your knowledge base for any given question.
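The retrieval step can be sketched without any database at all. The example below is an in-memory stand-in, assuming a small list of (text, embedding) pairs with invented vectors; a real vector database does the same ranking with an index so that it stays fast at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, store, k=3):
    """Return the k (chunk_text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_embedding, emb)) for text, emb in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunks and embeddings, invented for illustration.
store = [
    ("Senior consultant day rate is set out in the rate card.", [0.9, 0.1, 0.0]),
    ("Annual leave requests need two weeks' notice.", [0.0, 0.2, 0.9]),
    ("Invoices are payable within 30 days.", [0.6, 0.7, 0.1]),
]
query = [0.85, 0.15, 0.05]  # stands in for an embedded pricing question
results = top_k(query, store, k=2)
```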

4. Generation

The retrieved chunks are assembled into a context block and prepended to the prompt sent to the language model. The model's instruction is essentially: "Answer the user's question using only the information in the context below." Because the model can see the relevant parts of your documents, it produces responses that are grounded in your actual knowledge rather than its training data. This four-stage pipeline is what makes the semantic memory layer described in our agent memory architecture post work in practice — and why a well-constructed knowledge base is the single highest-leverage component in any AI operating system.
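A minimal sketch of that assembly step follows. The function name, the numbered-chunk layout, and the extra "say so" instruction are assumptions for illustration, not a fixed template.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks to the user's question as a context block."""
    # Number each chunk so the model (and a human reviewer) can refer to it.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the user's question using only the information in the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical retrieved chunks, invented for this example.
prompt = build_prompt(
    "What are the payment terms?",
    ["Invoices are payable within 30 days.", "Late payments accrue interest."],
)
```

Telling the model what to do when the context is silent is a design choice: it gives the model an alternative to guessing.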

Chunking Strategy — Where 80% of RAG Failures Begin

Production deployment data consistently points to the same finding: roughly 80% of RAG failures trace back to the ingestion and chunking layer, not the model, not the vector database. Getting chunking right is the highest-leverage engineering decision in the entire system.

The core tension in chunking is this: chunks that are too large dilute the retrieval signal — the relevant sentence gets buried in a 2,000-word block of mixed content, and the embedding represents all of it rather than the specific point you need. Chunks that are too small lose context — a single sentence about pricing retrieved without its surrounding paragraph may be ambiguous or misleading.

The practical starting point for most business document types is recursive character splitting at 400–512 tokens with a 10–20% overlap. The overlap — repeating the last 50–100 tokens of one chunk at the start of the next — ensures that sentences spanning chunk boundaries are not split in a way that loses their meaning. For most service business documents (contracts, SOPs, deliverable templates, email threads), this approach works well and is worth testing before moving to more sophisticated strategies.
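A simplified sketch of recursive splitting with overlap, assuming whitespace-separated words as a stand-in for tokens (a real pipeline would count with the embedding model's tokenizer). The function names and separator order are illustrative, not a specific library's API.

```python
def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word approximates one token.
    return len(text.split())


def split_recursive(text, max_tokens, separators=("\n\n", "\n", ". ", " ")):
    """Break text into pieces of at most max_tokens, trying coarse separators first."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # Fallback: hard cut by word count when no separator is left.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    head, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(head):
        pieces.extend(split_recursive(part, max_tokens, rest))
    return pieces


def chunk_text(text, max_tokens=450, overlap_tokens=60):
    """Merge small pieces into chunks, repeating the tail of each chunk in the next."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    # Leave room for the carried-over overlap so no chunk exceeds max_tokens.
    pieces = split_recursive(text, max_tokens - overlap_tokens)
    chunks, current = [], []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_tokens:] if overlap_tokens else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Synthetic 1,000-word document, just to exercise the splitter.
sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(sample, max_tokens=100, overlap_tokens=20)
```

Note that this sketch drops the separators it splits on and rejoins pieces with single spaces, which is acceptable for embedding but not for displaying the original formatting.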

Where it breaks down is with structured documents: PDFs with tables, Excel sheets with column dependencies, and documents where the structure itself carries meaning. A fixed-size splitter will happily cut a table or a clause hierarchy in half. For tables and spreadsheets, keep each structural unit whole by chunking at page or table level. For long prose where the topic shifts without clear headings, semantic chunking, which uses embedding similarity between adjacent sentences to detect natural topic boundaries, produces significantly better results. The cost is higher complexity: semantic chunking requires running embeddings at ingestion time to determine where to split, rather than applying a simple size limit.
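The idea behind semantic chunking can be sketched in a few lines. The example assumes the text is already split into sentences and uses a toy word-count embedding in place of a real model; the threshold value is likewise an arbitrary illustration that would need tuning on real documents.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed, threshold=0.3):
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev_vec, vec) < threshold:
            groups.append([sentence])    # topic boundary detected
        else:
            groups[-1].append(sentence)  # same topic, extend the current chunk
    return [" ".join(group) for group in groups]


# Toy stand-in for an embedding model: counts of a few hand-picked terms.
VOCAB = ["rate", "fee", "invoice", "leave", "holiday", "notice"]


def toy_embed(sentence):
    words = sentence.lower().replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]


sentences = [
    "The day rate is the standard fee.",
    "Each invoice lists the rate and fee.",
    "Holiday leave needs notice.",
    "Leave notice is two weeks.",
]
topic_chunks = semantic_chunks(sentences, toy_embed)
```

Here the two pricing sentences stay together and the two leave sentences form a second chunk, because similarity drops to zero between the second and third sentence.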

A practical rule: start with recursive character splitting. Evaluate retrieval hit rate on a test set of 20–30 representative queries against your actual documents. If hit rate is below 75%, investigate your chunking strategy before anything else. In most cases, either the chunk size is wrong or your documents have structure that the splitter is not respecting.
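Hit rate is straightforward to compute once you have a test set. This sketch assumes each test case pairs a query with the ID of the chunk that should be retrieved, and that `retrieve` is whatever function your pipeline exposes to return ranked chunk IDs; the stub retriever and IDs below are invented for illustration.

```python
def hit_rate(test_set, retrieve, k=5):
    """Fraction of queries whose expected chunk appears in the top k results."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    hits = 0
    for query, expected_id in test_set:
        if expected_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(test_set)


# Stub retriever with canned results, standing in for the real pipeline.
CANNED_RESULTS = {
    "day rate for senior consultant": ["rate-card-3", "sow-template-1"],
    "notice period for termination": ["msa-12", "msa-13"],
    "payment terms": ["sow-template-4", "msa-7"],
    "data retention policy": ["policy-gdpr-2", "policy-it-9"],
}


def stub_retrieve(query):
    return CANNED_RESULTS.get(query, [])


test_set = [
    ("day rate for senior consultant", "rate-card-3"),
    ("notice period for termination", "msa-13"),
    ("payment terms", "msa-2"),  # expected chunk is never retrieved: a miss
    ("data retention policy", "policy-gdpr-2"),
]
score = hit_rate(test_set, stub_retrieve, k=5)
```

With three hits out of four queries the score is 0.75, which sits exactly on the threshold suggested above.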

Choosing Your Vector Database

In 2026, the vector database market has matured enough that there are clear choices for different scales. For UK service businesses operating with knowledge bases of up to 50,000 documents, the right choice is almost always pgvector — the open-source PostgreSQL extension that adds vector similarity search to a standard relational database.

pgvector's advantages at this scale are significant. If you are already running PostgreSQL (and most business applications are), pgvector adds vector search without an additional database to operate, back up, or pay for. Query performance is entirely adequate for knowledge bases up to several hundred thousand chunks. And as we described in the post on self-hosting AI agents on Google Cloud, it runs comfortably on the same Cloud SQL instance you are already paying for — no additional managed service required.

The specialist vector databases — Pinecone, Weaviate, Qdrant — become relevant at larger scales: hundreds of thousands to millions of documents, sub-10ms retrieval requirements, or multi-tenant architectures where hard isolation between different clients' knowledge bases is required. For a professional service firm with 5,000 documents, pgvector is the right answer. The specialist databases are engineering overhead you do not yet need.
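For the pgvector route, the schema and query are short. The sketch below holds the SQL as Python strings and assumes a table named `chunks`, a 1536-dimension embedding column (the default output size of text-embedding-3-small), and a driver that accepts named `%(name)s` parameters, as psycopg does. The table and column names are hypothetical.

```python
EMBEDDING_DIM = 1536  # assumption: default size of text-embedding-3-small vectors

# Enable the extension and create a table with a vector column.
CREATE_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    source_doc text NOT NULL,
    content text NOT NULL,
    embedding vector({EMBEDDING_DIM})
);
"""

# <=> is pgvector's cosine distance operator: smaller means more similar.
SEARCH_SQL = """
SELECT id, source_doc, content
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_pgvector_literal(vector):
    """Format a Python list as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


# Parameters you would pass alongside SEARCH_SQL to the database driver.
params = {"query": to_pgvector_literal([1, 2.5, -0.25]), "k": 5}
```

Running the statements requires a PostgreSQL instance with the extension installed; nothing in this block connects to a database.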

Making Retrieval Accurate — Hybrid Search and Re-ranking

Pure semantic search has a known weakness: it is excellent at finding conceptually similar content but poor at exact keyword matching. If a user asks "what is the day rate for senior consultant work?", a well-constructed vector embedding will find related pricing content, but it may rank the chunk that contains the exact phrase "senior consultant" below other pricing passages that are merely similar in meaning. A keyword-based search (BM25) scores chunks on the terms they share with the query, so it surfaces those exact matches directly.

The solution in production RAG systems is hybrid search: run both a semantic search and a BM25 keyword search, then merge the two result lists with Reciprocal Rank Fusion, which scores each chunk by its rank position in each list rather than by raw similarity scores. The hybrid approach consistently outperforms either method alone, particularly for queries that mix conceptual questions with specific terms, which describes most business knowledge queries.
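Reciprocal Rank Fusion itself is only a few lines. In this sketch each chunk receives 1 / (k + rank) from every list it appears in, and the sums decide the final order; k = 60 is the constant commonly used with the method, and the chunk IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one, using rank positions only."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two searches for the same query.
semantic_results = ["c2", "c1", "c3"]
bm25_results = ["c1", "c4", "c2"]

fused = reciprocal_rank_fusion([semantic_results, bm25_results])
```

Chunks that appear near the top of both lists (c1 and c2 here) rise above chunks that only one search found, without any need to put BM25 scores and cosine similarities on a common scale.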

The second accuracy technique is re-ranking. After the vector database returns its top 20 candidate chunks, a lightweight cross-encoder model re-scores each candidate against the specific query. Cross-encoders see the query and the candidate chunk together, making them significantly more accurate than the bi-encoder embeddings used in retrieval — but also too slow to run against the full knowledge base. Running re-ranking as a second stage over 20 candidates adds 100–300ms to your pipeline and improves precision materially. For business applications where a wrong answer has real consequences, this trade-off is almost always worth it.
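The two-stage shape is easy to sketch with a pluggable scoring function. The word-overlap scorer below is only a stand-in so the example runs without a model; in a real pipeline `score_pair` would call a cross-encoder on each (query, chunk) pair. All names and sample texts are hypothetical.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-score first-stage candidates against the query and keep the best top_n."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    # sort() is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def toy_overlap_score(query, chunk):
    """Stand-in for a cross-encoder: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


# Hypothetical first-stage candidates, in the order retrieval returned them.
candidates = [
    "our offices close on bank holidays",
    "the day rate for a senior consultant is in the rate card",
    "junior consultant rates are reviewed yearly",
]
best = rerank("senior consultant day rate", candidates, toy_overlap_score, top_n=2)
```

Because the scorer only sees the short candidate list, its per-pair cost stays bounded no matter how large the knowledge base grows.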

This is the same pipeline we described when covering AI agent observability — the retrieval quality metrics (hit rate, mean reciprocal rank, relevance score) are exactly the signals that tell you whether your hybrid search and re-ranking layer is working as intended in production.
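Mean reciprocal rank complements hit rate by rewarding results that place the right chunk higher. A minimal sketch, assuming each record pairs the ranked IDs your pipeline returned with the ID that should have been found (the IDs are invented):

```python
def mean_reciprocal_rank(results):
    """Average of 1/rank of the expected chunk; a miss contributes 0."""
    if not results:
        raise ValueError("results must not be empty")
    total = 0.0
    for retrieved_ids, expected_id in results:
        if expected_id in retrieved_ids:
            total += 1.0 / (retrieved_ids.index(expected_id) + 1)
    return total / len(results)


# Hypothetical evaluation records: (ranked IDs returned, expected ID).
records = [
    (["a", "b", "c"], "a"),  # found at rank 1 -> 1.0
    (["a", "b", "c"], "b"),  # found at rank 2 -> 0.5
    (["a", "b", "c"], "z"),  # not found       -> 0.0
]
mrr = mean_reciprocal_rank(records)
```

For these three records the result is (1.0 + 0.5 + 0.0) / 3 = 0.5.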

Building Your Business Knowledge Base: The Practical Checklist

If you are building a RAG system for a UK service business today, these are the practical decisions to make before you write a line of code.

Decide what goes in. Start with the 10–15 documents your team consults most often — the ones that currently exist as PDFs or Word files that people search for manually. Standard service agreements, your most common deliverable templates, internal rate cards, key policy documents. This is your MVP knowledge base. Resist the urge to ingest everything at once.
Choose your chunking approach. Recursive character splitting at 400–512 tokens with 15% overlap for prose documents. Page-level chunking for PDFs with tables. Test these against a realistic query set before moving forward.
Pick your embedding model. OpenAI's text-embedding-3-small is cost-effective and performant for most business document types. For deployments with UK data sovereignty requirements, check where the embedding API processes your data: Cohere's embedding models, for example, are offered through several cloud platforms, so confirm that a UK region is available before committing.
Start with pgvector. Unless you have a specific reason for a specialist database, pgvector on Cloud SQL is the right first choice for UK SME scale.
Add hybrid search from day one. BM25 plus semantic is not significantly harder to implement than semantic alone, and the accuracy improvement in business contexts is consistent enough to be worth the additional setup.
Plan your update cadence. Knowledge bases degrade when documents become outdated. Assign one person responsibility for reviewing and updating the knowledge base monthly — adding new documents, retiring superseded ones. Treat it like maintaining any other core operational tool.
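The monthly review in the last item is easier to keep up if stale documents are flagged automatically. This is a small sketch, assuming you record a last-reviewed date per document; the file names, dates, and 30-day window are invented for illustration.

```python
from datetime import date, timedelta


def stale_documents(last_reviewed, today, max_age_days=30):
    """Return names of documents not reviewed within max_age_days, sorted by name."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review log for three knowledge base documents.
review_log = {
    "rate-card.pdf": date(2026, 2, 20),
    "service-agreement.docx": date(2025, 12, 1),
    "leave-policy.pdf": date(2026, 1, 15),
}
overdue = stale_documents(review_log, today=date(2026, 3, 1))
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.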

The total cost to run a RAG-powered knowledge base for a UK service business at this scale: approximately £15–30 per month in cloud infrastructure and API usage (Cloud SQL with pgvector, embedding API calls, and language model inference), depending on query volume. This is not a meaningful operational cost relative to the admin hours it replaces, the same pattern we found at the Manchester recruitment agency and the firms using MCP to extend their agents' tool access.

The knowledge base is the foundation of your AI operating system. Build it with care, test retrieval quality from the start, and maintain it as you would any core operational asset — and it will compound in value every month.

If you are building AI agents for your UK service business and want a RAG architecture that works in production — with hybrid search, re-ranking, and an update process that does not require a data engineer — get in touch. We build knowledge-grounded AI operating systems for professional service firms, and we can scope what is right for your practice in a short call.
