Retrieval Sets the Ceiling for RAG Quality
In 2026, it's rare for a RAG project to see a dramatic improvement by switching the generation model from GPT-4 to Claude Opus 4.7. The vast majority of problems come down to Retrieval either failing to surface the needed information or returning noise that confuses the LLM. Across the RAG projects KGA has supported over the past 18 months, Retrieval improvements alone drove 30–60% gains in user-facing quality scores, while switching the LLM produced an average improvement of around 8%.
This article organizes 2026's established RAG Retrieval practices across five layers: (1) hybrid search, (2) rerankers, (3) query rewriting, (4) chunking strategy, and (5) evaluation metrics.
Layer 1: Hybrid Search (BM25 + Dense)
Pure dense retrieval frequently fails on model numbers ("ZX-450B"), abbreviations ("TLS1.3"), proper names, and specific numeric conditions. BM25, conversely, is weak on semantic paraphrasing. Fusing both using Reciprocal Rank Fusion (RRF) has become the industry standard.
RRF is simple: each result list contributes `1 / (k + rank)` for every document it contains, and the contributions are summed per document (default k=60). Because only ranks enter the formula, differences in score scale between retrievers do not matter, which makes RRF a safe default for combining heterogeneous retrieval systems.
```python
from collections import defaultdict


def rrf_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])


bm25_top = bm25_index.search(query, k=50)
dense_top = vector_db.search(query_embed, k=50)
fused = rrf_fusion([bm25_top, dense_top])[:20]
```
On a legal search project at KGA, BM25 alone achieved nDCG@10 = 0.52, dense alone 0.61, and the RRF hybrid 0.73. The technique is simple but remarkably effective.
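To make the RRF arithmetic concrete, here is a small worked sketch on two toy rankings. The document IDs are made up for illustration, and exact fractions are used instead of floats only so that the per-document sums are easy to check by hand.

```python
from collections import defaultdict
from fractions import Fraction


def rrf_scores(ranked_lists: list[list[str]], k: int = 60) -> dict[str, Fraction]:
    """Sum 1 / (k + rank) for each document across all ranked lists."""
    scores: dict[str, Fraction] = defaultdict(Fraction)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += Fraction(1, k + rank)
    return dict(scores)


# Toy rankings (hypothetical IDs): doc_b is ranked highly by both retrievers.
bm25_top = ["doc_a", "doc_b", "doc_c"]
dense_top = ["doc_b", "doc_d", "doc_a"]

scores = rrf_scores([bm25_top, dense_top])
# doc_a: 1/61 + 1/63, doc_b: 1/62 + 1/61, doc_c: 1/63, doc_d: 1/62
ranking = sorted(scores, key=lambda doc_id: -scores[doc_id])
print(ranking)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Here `doc_b` comes out on top because it sits near the top of both lists, while `doc_c` and `doc_d` each receive a contribution from only one retriever.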
Layer 2: Rerankers (Cohere Rerank-3, Voyage Rerank-2)
After hybrid search returns the top 50–200 candidates, running a cross-encoder reranker to re-score them precisely is the 2026 de facto standard. Dense retrieval uses a bi-encoder structure that vectorizes query and document independently, losing query-document interaction. A cross-encoder passes the query and document through the same Transformer together, so attention operates across both and the relevance score reflects their interaction.
- Cohere Rerank-3 (released September 2025): supports 100 languages, 128k context, $2/1k queries
- Voyage Rerank-2: 32k context, top-tier on MTEB Reranking benchmark, $0.5/1k queries
- Jina Reranker v2: OSS-friendly option; Japanese performance is one step behind Cohere
- BGE-Reranker-v2-m3: MIT-licensed OSS, the first choice for self-hosted deployments
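```python
import cohere

co = cohere.Client()
# top50: first-stage candidates, assumed to be objects with a .text attribute.
results = co.rerank(
    model="rerank-3",
    query="Requirements for cross-border transfers under data protection law",
    documents=[d.text for d in top50],
    top_n=10,
)
# Each result carries the index of the document in the input list.
reranked = [top50[r.index] for r in results.results]
```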
The effect of reranking is dramatic: across KGA projects, nDCG@10 improves on average from around 0.65 to around 0.82. However, reranking a top-50 set adds 200–400 ms of latency, so search UX needs either asynchronous processing or a design that serves fast first-stage results while reranking runs in the background.
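One way to serve first-stage results while reranking runs in the background is an async generator that yields twice. The sketch below is only an illustration of that pattern: `fake_rerank` is a hypothetical stand-in for a real cross-encoder call (it sleeps briefly and sorts by term overlap), and `two_phase_search` is a name introduced here, not part of any library.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

Reranker = Callable[[str, list[str]], Awaitable[list[str]]]


async def fake_rerank(query: str, docs: list[str]) -> list[str]:
    """Hypothetical reranker: simulates latency, then sorts by term overlap."""
    await asyncio.sleep(0.05)  # stands in for network + model latency
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))


async def two_phase_search(
    query: str, first_stage: list[str], rerank: Reranker, top_n: int = 10
) -> AsyncIterator[tuple[str, list[str]]]:
    # Start reranking immediately, but do not wait for it yet.
    task = asyncio.create_task(rerank(query, first_stage))
    # Phase 1: the UI can render the fast first-stage ordering right away.
    yield "first_stage", first_stage[:top_n]
    # Phase 2: replace the list once the reranker has finished.
    yield "reranked", (await task)[:top_n]


async def demo() -> list[tuple[str, list[str]]]:
    docs = [
        "pricing overview",
        "data transfer rules",
        "cross-border data transfer requirements",
    ]
    query = "cross-border data transfer"
    return [phase async for phase in two_phase_search(query, docs, fake_rerank, top_n=2)]


phases = asyncio.run(demo())
```

In a real service the second phase would typically be pushed to the client over a streaming response, and the background task should be cancelled if the client disconnects before it completes.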
Layer 3: Query Rewriting (HyDE, Query2Doc, Multi-Query)
User queries are often short and ambiguous ("contract termination," "memory utilization"). Expanding them with an LLM before searching has become a standard part of 2026 Retrieval design.
- HyDE (Hypothetical Document Embeddings): Have the LLM write a hypothetical answer document and use its embedding as the search key
- Query2Doc: Concatenate the query with the pseudo-document and embed the result (a HyDE variant that tends to be more stable in practice)
- Multi-Query: Have the LLM generate 3–5 alternative phrasings of the same intent, run parallel searches, and fuse with RRF
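```python
def hyde_search(query: str, vector_db, llm, embedding_model):
    # HyDE step: ask the LLM for a hypothetical answer document.
    hypothetical = llm.generate(
        f"Write a 3-sentence hypothetical answer to the following question: {query}"
    )
    # Query2Doc variant: embed the query concatenated with the pseudo-document.
    # (Plain HyDE would embed `hypothetical` alone.)
    combined = query + " " + hypothetical
    embed = embedding_model.encode(combined)
    return vector_db.search(embed, k=50)
```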
Multi-Query offers the best cost-to-performance ratio: on an internal FAQ RAG at KGA, it improved Recall@20 by 12% over single-query search. The cost is one additional LLM call per query, but running it on Claude Haiku or GPT-4.1-mini keeps the cost well under ¥0.1 per query.
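The Multi-Query flow (rewrite, search in parallel, fuse with RRF) can be sketched as follows. Everything named here is an assumption for illustration: `toy_rewrite` returns canned paraphrases in place of an LLM call, `toy_search` is a keyword-overlap lookup over a four-document corpus in place of a real retriever, and `multi_query_search` is a name introduced for this sketch.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical toy corpus standing in for a real index.
CORPUS = {
    "d1": "how to terminate a contract early",
    "d2": "contract cancellation notice period",
    "d3": "office parking rules",
    "d4": "ending an agreement before expiry",
}


def toy_search(query: str) -> list[str]:
    """Rank documents by number of shared words; drop documents with none."""
    terms = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for doc_id, text in CORPUS.items():
        overlap = len(terms & set(re.findall(r"[a-z]+", text.lower())))
        if overlap:
            scored.append((-overlap, doc_id))
    return [doc_id for _, doc_id in sorted(scored)]


def toy_rewrite(query: str, n: int) -> list[str]:
    """Stand-in for an LLM call that paraphrases the query (canned output)."""
    return ["contract cancellation", "ending an agreement"][:n]


def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])


def multi_query_search(
    query: str,
    rewrite: Callable[[str, int], list[str]],
    search: Callable[[str], list[str]],
    n_variants: int = 3,
    top_k: int = 20,
) -> list[tuple[str, float]]:
    # Keep the original query alongside the generated variants.
    variants = [query] + rewrite(query, n_variants)
    # Run one search per variant in parallel threads.
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        ranked_lists = list(pool.map(search, variants))
    return rrf(ranked_lists)[:top_k]


query = "contract termination"
single_ids = toy_search(query)
fused_ids = [doc_id for doc_id, _ in multi_query_search(query, toy_rewrite, toy_search)]
```

In this toy setup the single query never reaches `d4`, because it shares no words with the query, while the "ending an agreement" variant does. That is the recall mechanism Multi-Query relies on.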
Layer 4: Chunking Strategy
Any RAG still operating on "split every 1,000 characters" is out of date by 2026 standards. The chunking approach directly determines Retrieval quality.
- Fixed-size: The classic N-character split with 10–20% overlap
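- Recursive Character Splitting: Splits recursively on progressively finer separators (by default paragraph break, then line break, then space, then individual characters) until each chunk fits the size limit. This is the behavior of LangChain's `RecursiveCharacterTextSplitter`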
- Semantic Chunking: Embeds consecutive sentences and splits at points where cosine similarity drops sharply, producing semantically cohesive chunks
- Late Chunking (Jina 2024/2025): Passes the full long document through the embedding model first, extracts per-token embeddings, then pools them to form chunk embeddings. Because the chunks retain surrounding context, pronouns and co-references are preserved
- Hierarchical / Parent-Child: Search on small chunks; pass the parent chunk to generation. Balances precision and context length
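As a rough illustration of Semantic Chunking from the list above, the sketch below starts a new chunk whenever the cosine similarity between adjacent sentence embeddings falls below a threshold. A fixed threshold is a simplification of "drops sharply", and `toy_embed` is a hypothetical bag-of-words embedding over a six-word vocabulary, used only so the example runs without a model. A real pipeline would plug in an actual embedding function.

```python
import math
import re
from typing import Callable, Sequence

# Hypothetical vocabulary for the toy embedding below.
VOCAB = ["contract", "termination", "notice", "server", "memory", "cpu"]


def toy_embed(sentence: str) -> list[float]:
    """Bag-of-words counts over VOCAB; a stand-in for a real embedding model."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> list[str]:
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    # Compare each sentence with the one before it.
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            chunks.append([sentence])  # similarity dropped: start a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The contract allows termination with notice.",
    "Notice of termination must be written.",
    "The server ran out of memory.",
    "Memory and CPU usage spiked on the server.",
]
chunks = semantic_chunks(sentences, toy_embed)
```

With these four sentences the split falls between the second and third, where the topic changes from contracts to servers and the similarity is zero.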
Late Chunking implementation example:
```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")


def late_chunking(long_text: str, chunk_boundaries: list[tuple[int, int]]):
    # chunk_boundaries are (start, end) token indices into the tokenized document,
    # not character offsets; text beyond max_length tokens is truncated.
    inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        # Per-token embeddings computed with the whole document as context.
        token_embeds = model(**inputs).last_hidden_state[0]
    chunk_embeds = []
    for start, end in chunk_boundaries:
        # Mean-pool the token embeddings that fall inside each chunk.
        chunk_embeds.append(token_embeds[start:end].mean(dim=0))
    return torch.stack(chunk_embeds)
```
After deploying Late Chunking in a contract RAG at KGA, Recall@10 improved from 71% to 88% on documents with heavy cross-referencing (e.g., "as defined in Article 3").
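The Hierarchical / Parent-Child strategy listed earlier comes down to a small bookkeeping step: index small child chunks, then map search hits back to their parents before generation. The sketch below assumes fixed-size character children and a `parent_id#index` naming scheme, both chosen only for illustration. The sample texts and the hit list are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    child_id: str
    parent_id: str
    text: str


def build_children(parents: dict[str, str], child_size: int = 40) -> list[Child]:
    """Split each parent into fixed-size children that remember their parent."""
    children = []
    for parent_id, text in parents.items():
        for start in range(0, len(text), child_size):
            child_id = f"{parent_id}#{start // child_size}"
            children.append(Child(child_id, parent_id, text[start:start + child_size]))
    return children


def parents_for_hits(
    hit_child_ids: list[str],
    children: list[Child],
    parents: dict[str, str],
    max_parents: int = 3,
) -> list[str]:
    """Map ranked child hits to unique parent texts, keeping hit order."""
    by_id = {child.child_id: child for child in children}
    ordered: list[str] = []
    for child_id in hit_child_ids:
        parent_id = by_id[child_id].parent_id
        if parent_id not in ordered:
            ordered.append(parent_id)
        if len(ordered) >= max_parents:
            break
    return [parents[parent_id] for parent_id in ordered]


# Hypothetical parent chunks (for example, whole contract articles).
PARENTS = {
    "p1": "Article 3 defines Confidential Information as any non-public data shared by either party.",
    "p2": "Article 9 allows termination with thirty days of written notice to the other party.",
}
children = build_children(PARENTS)

# Pretend the retriever returned these child IDs, best first.
context = parents_for_hits(["p2#1", "p1#0", "p2#0"], children, PARENTS)
```

Two hits from `p2` collapse into a single parent, so the generator receives each article once, in the order of its best-ranked child.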
Layer 5: Evaluation Metrics and Offline Evaluation
Retrieval quality cannot be judged by "it seems to work." Quantitative metrics and regression tests are mandatory.
- Recall@k: The proportion of ground-truth documents appearing in the top k. The most important metric — it sets the RAG ceiling
- nDCG@10 (Normalized Discounted Cumulative Gain): Accounts for ranking quality, not just presence. The de facto standard in commercial search
- MRR (Mean Reciprocal Rank): Mean of the reciprocal of the rank of the first correct result. Closely approximates user-perceived quality
- Hit Rate@k: Binary — did the correct document appear in the top k?
Evaluation sets should have a minimum of 100 queries, ideally 500. Maintain query and ground-truth document ID pairs in CSV format and run as regression tests in CI.
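A minimal sketch of such an evaluation set and of the simpler metrics above (Recall@k, Hit Rate@k, MRR) might look like this. The CSV column names, the `|` separator for multiple ground-truth IDs, and the canned retriever are assumptions made for the example, not a prescribed format.

```python
import csv
import io
from typing import Callable


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of ground-truth documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_rate_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one ground-truth document is in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def load_eval_set(csv_text: str) -> list[tuple[str, set[str]]]:
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["query"], set(row["relevant_ids"].split("|"))) for row in reader]


def evaluate(
    eval_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 10,
) -> dict[str, float]:
    if not eval_set:
        raise ValueError("evaluation set is empty")
    totals = {"recall": 0.0, "hit_rate": 0.0, "mrr": 0.0}
    for query, relevant in eval_set:
        ranked = retrieve(query)
        totals["recall"] += recall_at_k(ranked, relevant, k)
        totals["hit_rate"] += hit_rate_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
    return {name: total / len(eval_set) for name, total in totals.items()}


# Hypothetical two-row evaluation set; a real one would be loaded from a file.
EVAL_CSV = """query,relevant_ids
contract termination,d1|d2
memory utilization,d7
"""

# Canned results standing in for a real retriever.
CANNED = {
    "contract termination": ["d3", "d1", "d9"],
    "memory utilization": ["d7", "d8"],
}

metrics = evaluate(load_eval_set(EVAL_CSV), lambda q: CANNED[q], k=2)
```

For the first query only one of two ground-truth documents is in the top 2 and the first relevant result is at rank 2, so the averages come out to Recall@2 = 0.75, Hit Rate@2 = 1.0, and MRR = 0.75.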
```python
import numpy as np


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    # Graded relevance of the top-k results (0 for unjudged documents).
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    # Exponential gain 2^rel - 1, discounted by log2(position + 1).
    dcg = sum((2 ** g - 1) / np.log2(idx + 2) for idx, g in enumerate(gains))
    # Ideal ordering: all judged documents sorted by relevance.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / np.log2(idx + 2) for idx, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0
```
Frameworks like LlamaIndex's `RetrieverEvaluator`, RAGAS, and TruLens standardize these calculations. In 2026, start by installing RAGAS rather than building from scratch.
Reference Production Configuration
The standard production RAG configuration KGA proposes in 2026:
- Ingest: Document → Late Chunking (Jina v3) → store parent/child
- Index: Store both dense and sparse representations in pgvector or Qdrant
- Query: Multi-Query (generate 3 variations with LLM)
- First stage: Hybrid BM25 + Dense → RRF fusion → top 100
- Second stage: Cohere Rerank-3 → compress to top 10
- Generation: Pass top 10 to Claude Opus 4.7 for final answer
- Eval: Run regression tests across 500 queries weekly in CI; block on nDCG@10 / MRR / Recall@20 threshold violations
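The "block on threshold violations" step in the Eval line can be as small as a function that compares computed metrics with minimum values and returns a non-zero exit code. The sketch below is one possible shape. The threshold numbers and the sample metrics are placeholders for illustration, not KGA's actual values.

```python
import sys

# Placeholder minimums; real values would come from the project's baseline.
THRESHOLDS = {"ndcg@10": 0.70, "mrr": 0.60, "recall@20": 0.85}


def find_violations(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    violations = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")
        elif value < minimum:
            violations.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return violations


def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> int:
    """Print violations to stderr and return a process exit code (0 = pass)."""
    violations = find_violations(metrics, thresholds)
    for message in violations:
        print(f"FAIL {message}", file=sys.stderr)
    return 1 if violations else 0


# Sample metrics from a hypothetical weekly run: recall has regressed.
weekly_metrics = {"ndcg@10": 0.73, "mrr": 0.65, "recall@20": 0.80}
exit_code = gate(weekly_metrics, THRESHOLDS)
```

In a CI job the script would end with `sys.exit(exit_code)`, so that a regression in any tracked metric fails the pipeline.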
Build this configuration rigorously, and "the RAG gets smarter without changing the LLM" becomes a real, felt experience. Retrieval is the core of RAG, and in 2026 it is the highest-priority layer for project investment.