Part 3: RAG Without the LLM: Building Context-First Retrieval Pipelines
When I set out to build a dataset discovery system, everyone said “just throw GPT-4 at it.” But I built something different: a retrieval system that returns structured context, formatted and ready for any LLM (or no LLM at all).
This is the architecture that emerged, and why I think it’s the right approach for most RAG applications.
The Problem with LLM-First RAG
Most RAG tutorials look like this:
def answer_question(question: str) -> str:
    # 1. Embed the question
    query_embedding = embed(question)

    # 2. Search vector database
    documents = vector_db.search(query_embedding, k=5)

    # 3. Build prompt and call LLM
    prompt = f"""Answer based on these documents:
{documents}

Question: {question}
"""
    return openai.chat(prompt)
Simple. And problematic:
- Tight coupling. Your retrieval logic is married to OpenAI’s API. Want to switch to Claude? Rewrite everything.
- No observability. What documents were retrieved? What was the actual context? You can’t see it.
- No graceful degradation. LLM rate limited? API down? Your entire system fails.
- Testing nightmare. How do you test retrieval quality when it’s bundled with generation?
The Solution: Retrieval as a First-Class Concern
Separate retrieval from generation. Build a pipeline that returns context, not answers:
class RAGResponse(BaseModel):
    query: str
    answer: str = ""  # Empty! Filled by caller, if at all
    context: RetrievedContext
    sources: List[Dict[str, Any]]
    confidence: float
Notice answer: str = "". The pipeline doesn’t generate answers. It retrieves context and leaves generation to the caller, who might use GPT-4, Claude, Llama, or nothing at all.
The Architecture
┌─────────────────┐
│ User Query │
└────────┬────────┘
│
▼
┌─────────────────┐
│ RAG Pipeline │
│ ┌───────────┐ │
│ │ Retrieve │ │
│ └─────┬─────┘ │
│ ▼ │
│ ┌───────────┐ │
│ │ Format │ │
│ └─────┬─────┘ │
│ ▼ │
│ ┌───────────┐ │
│ │ Compute │ │
│ │Confidence │ │
│ └───────────┘ │
└────────┬────────┘
│
▼
┌─────────────────┐
│ RAGResponse │
│ - context │
│ - sources │
│ - confidence │
│ - answer="" │
└────────┬────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ LLM (opt) │ OR │ Direct Display │
└─────────────────┘ └─────────────────┘
The pipeline does three things:
- Retrieve relevant documents
- Format context for consumption
- Compute confidence from retrieval quality
What it doesn’t do: generate answers. That’s the caller’s job.
Building the Retrieval Layer
Start with a clean retrieval interface:
class RetrievedContext(BaseModel):
    """Context retrieved for RAG generation."""

    documents: List[SearchResult] = Field(default_factory=list)
    query: str = ""
    formatted_context: str = ""
    sources: List[Dict[str, Any]] = Field(default_factory=list)
    retrieval_time_ms: float = 0.0

    @property
    def has_results(self) -> bool:
        return len(self.documents) > 0

    @property
    def num_documents(self) -> int:
        return len(self.documents)
The key fields:
- documents: The raw search results with full metadata
- formatted_context: A string ready for LLM consumption
- sources: Simplified citations for display
- retrieval_time_ms: Performance tracking
Now the retrieval method:
class RAGPipeline:
    def __init__(
        self,
        vector_store: VectorStore,
        embedding_service: EmbeddingService,
        default_collection: str = "datasets",
        top_k: int = 5,
        context_max_length: int = 4000,
    ):
        self.vector_store = vector_store
        self.embedding_service = embedding_service
        self.default_collection = default_collection
        self.top_k = top_k
        self.context_max_length = context_max_length

    def retrieve(
        self,
        query: str,
        collection: Optional[str] = None,
        top_k: Optional[int] = None,
        filters: Optional[Dict[str, Any]] = None,
        min_score: float = 0.0,
    ) -> RetrievedContext:
        """Retrieve relevant documents for a query."""
        start_time = time.time()
        collection = collection or self.default_collection
        top_k = top_k or self.top_k

        # Generate query embedding
        query_embedding = self.embedding_service.embed_query(query)

        # Search vector store
        search_results = self.vector_store.search(
            collection=collection,
            query_embedding=query_embedding,
            n_results=top_k,
            where=filters,
        )

        # Filter by minimum score
        filtered_docs = [
            doc for doc in search_results.results
            if doc.score >= min_score
        ]

        # Build sources list
        sources = [
            {
                "id": doc.id,
                "title": doc.metadata.get("title", "Unknown"),
                "dataset_id": doc.metadata.get("dataset_id", doc.id),
                "score": doc.score,
            }
            for doc in filtered_docs
        ]

        # Format context
        formatted_context = ContextFormatter.format_structured(
            filtered_docs,
            max_length=self.context_max_length,
        )

        retrieval_time = (time.time() - start_time) * 1000

        return RetrievedContext(
            documents=filtered_docs,
            query=query,
            formatted_context=formatted_context,
            sources=sources,
            retrieval_time_ms=retrieval_time,
        )
Clean separation. Retrieve documents, format context, return everything. No LLM in sight.
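To see what the caller actually gets back, here is a minimal sketch built on the models above; the query string, top_k, and min_score values are just illustrative:

context = pipeline.retrieve("soil moisture time series", top_k=3, min_score=0.2)

print(f"Retrieved {context.num_documents} documents in {context.retrieval_time_ms:.1f} ms")
for src in context.sources:
    # Each source is a plain dict: id, title, dataset_id, score
    print(f"  {src['score']:.2f}  {src['title']}")

if not context.has_results:
    print("Nothing relevant found; try relaxing the filters or lowering min_score")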
Context Formatting: The Underrated Art
How you format context matters as much as what you retrieve. Different use cases need different formats.
Simple format for basic retrieval:
@staticmethod
def format_simple(documents: List[SearchResult], max_length: int = 4000) -> str:
    """Basic concatenation with numbering."""
    context_parts = []
    current_length = 0

    for i, doc in enumerate(documents, 1):
        title = doc.metadata.get("title", "Unknown Dataset")
        content = doc.content

        part = f"[{i}] {title}\n{content}\n"
        if current_length + len(part) > max_length:
            break

        context_parts.append(part)
        current_length += len(part)

    return "\n".join(context_parts)
Output:
[1] Land Cover Map 2017
This dataset represents land cover classification...
[2] UK Soil Moisture Dataset
Soil moisture measurements from...
Structured format with metadata:
@staticmethod
def format_structured(documents: List[SearchResult], max_length: int = 4000) -> str:
    """Rich formatting with metadata."""
    context_parts = []
    current_length = 0

    for i, doc in enumerate(documents, 1):
        meta = doc.metadata
        title = meta.get("title", "Unknown Dataset")
        dataset_id = meta.get("dataset_id", doc.id)

        entry = f"""
---
Source [{i}]: {title}
ID: {dataset_id}
Relevance: {doc.score:.2%}
---
{doc.content}
"""
        if current_length + len(entry) > max_length:
            break

        context_parts.append(entry.strip())
        current_length += len(entry)

    return "\n\n".join(context_parts)
Output:
---
Source [1]: Land Cover Map 2017
ID: abc-123-def
Relevance: 94.23%
---
This dataset represents land cover classification...
---
Source [2]: UK Soil Moisture Dataset
ID: xyz-789-uvw
Relevance: 87.15%
---
Soil moisture measurements from...
Q&A format with instructions for the LLM:
@staticmethod
def format_for_qa(
    documents: List[SearchResult],
    query: str,
    max_length: int = 4000,
) -> str:
    """Format specifically for Q&A tasks."""
    context = ContextFormatter.format_structured(documents, max_length - 200)

    return f"""Based on the following dataset information, answer the user's question.

RELEVANT DATASETS:
{context}

USER QUESTION: {query}

Provide a helpful answer based on the datasets above. If the information is not available in the provided context, say so clearly."""
The caller chooses the format based on their needs:
response = pipeline.query(query, format_style="qa") # For LLM
response = pipeline.query(query, format_style="simple") # For display
response = pipeline.query(query, format_style="structured") # For debugging
Confidence: Quantifying Retrieval Quality
LLM outputs are notoriously hard to trust. But retrieval quality? That’s measurable.
def _compute_confidence(self, context: RetrievedContext) -> float:
    """Compute confidence score based on retrieval quality."""
    if not context.documents:
        return 0.0

    # Average of top document scores
    scores = [doc.score for doc in context.documents[:3]]
    return sum(scores) / len(scores)
This gives you a number you can act on:
response = pipeline.query("obscure topic nobody wrote about")

if response.confidence < 0.3:
    return "I don't have enough relevant information to answer this question."
elif response.confidence < 0.6:
    return f"Based on limited information: {generate_answer(response.context)}"
else:
    return generate_answer(response.context)
No hallucination-prone LLM confidence scores. Just retrieval metrics you can verify by looking at the actual documents.
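And because the inputs to that number are just documents, you can audit it whenever it looks off. A quick sketch that prints what the confidence was averaged over (the query is illustrative):

response = pipeline.query("land cover classification")

print(f"confidence = {response.confidence:.2f}")
for doc in response.context.documents[:3]:
    # The exact scores the confidence was computed from
    print(f"  {doc.score:.2f}  {doc.metadata.get('title', 'Unknown')}")
    print(f"      {doc.content[:120]}...")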
The Query Method: Bringing It Together
def query(
    self,
    query: str,
    collection: Optional[str] = None,
    top_k: Optional[int] = None,
    filters: Optional[Dict[str, Any]] = None,
    format_style: str = "structured",
) -> RAGResponse:
    """
    Execute a full RAG query.

    Retrieves context and prepares response (without LLM generation).
    The formatted context can be passed to any LLM.
    """
    # Retrieve relevant documents
    context = self.retrieve(
        query=query,
        collection=collection,
        top_k=top_k,
        filters=filters,
    )

    # Format context based on style
    if format_style == "simple":
        formatted = ContextFormatter.format_simple(
            context.documents,
            self.context_max_length,
        )
    elif format_style == "qa":
        formatted = ContextFormatter.format_for_qa(
            context.documents,
            query,
            self.context_max_length,
        )
    else:
        formatted = context.formatted_context

    context.formatted_context = formatted

    # Build response (answer would be filled by LLM)
    return RAGResponse(
        query=query,
        answer="",  # Deliberately empty
        context=context,
        sources=context.sources,
        confidence=self._compute_confidence(context),
    )
The answer="" is intentional. This is a feature, not a bug. The pipeline’s job is retrieval and formatting. Generation is someone else’s problem.
Using the Pipeline
With an LLM:
response = pipeline.query("What soil data is available?", format_style="qa")

if response.confidence > 0.5:
    llm_answer = openai.chat(response.context.formatted_context)
    response.answer = llm_answer
else:
    response.answer = "I couldn't find relevant information."

return response
Without an LLM:
response = pipeline.query("soil moisture datasets")

# Just return the sources
return {
    "query": response.query,
    "results": [
        {
            "title": src["title"],
            "id": src["dataset_id"],
            "relevance": f"{src['score']:.0%}",
        }
        for src in response.sources
    ],
    "confidence": response.confidence,
}
For testing:
def test_retrieval_quality():
    response = pipeline.query("land cover classification")

    # Assert on retrieval, not generation
    assert response.confidence > 0.7
    assert len(response.sources) >= 3
    assert "Land Cover" in response.sources[0]["title"]
You can test retrieval independently of generation. No mocking LLM APIs. No flaky tests based on generation randomness.
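You can push this further with a small golden set of queries. Here is a sketch using pytest's parametrize; the query/title pairs are invented placeholders for whatever your corpus actually contains:

import pytest

GOLDEN_QUERIES = [
    # (query, substring expected in one of the top-3 titles)
    ("land cover classification", "Land Cover"),
    ("soil moisture measurements", "Soil Moisture"),
]

@pytest.mark.parametrize("query,expected_title", GOLDEN_QUERIES)
def test_golden_queries(query, expected_title):
    response = pipeline.query(query)
    top_titles = [src["title"] for src in response.sources[:3]]

    # Still asserting on retrieval only; no generation involved
    assert response.confidence > 0.5
    assert any(expected_title in title for title in top_titles)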
Finding Similar Items
The same architecture enables “similar items” features:
def get_similar_datasets(self, dataset_id: str, top_k: int = 5) -> List[Dict[str, Any]]:
    """Find datasets similar to a given dataset."""
    # Get the source dataset
    source = self.vector_store.get_by_id(
        collection=self.default_collection,
        doc_id=dataset_id,
    )
    if not source:
        return []

    # Search using source content as query
    context = self.retrieve(
        query=source.content,  # Use document text as query
        top_k=top_k + 1,  # +1 to exclude self
    )

    # Filter out the source dataset
    results = [
        {
            "id": doc.id,
            "title": doc.metadata.get("title", "Unknown"),
            "score": doc.score,
            "metadata": doc.metadata,
        }
        for doc in context.documents
        if doc.id != dataset_id
    ]

    return results[:top_k]
Same retrieval logic. Different use case. No LLM needed.
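As a usage sketch, a “More like this” panel needs nothing more than this (the dataset ID is the placeholder from the earlier example output):

similar = pipeline.get_similar_datasets("abc-123-def", top_k=3)

for item in similar:
    print(f"{item['score']:.2f}  {item['title']}")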
The API Layer
The FastAPI endpoints expose the pipeline cleanly:
@router.get("/search")
async def search_datasets(
    query: str,
    top_k: int = 10,
    pipeline: RAGPipeline = Depends(get_rag_pipeline),
) -> SearchResponse:
    """Search for datasets using semantic similarity."""
    context = pipeline.retrieve(query=query, top_k=top_k)

    return SearchResponse(
        query=query,
        results=[
            SearchResultItem(
                id=doc.id,
                title=doc.metadata.get("title", "Unknown"),
                description=doc.content[:500],
                score=doc.score,
            )
            for doc in context.documents
        ],
        total=len(context.documents),
        search_time_ms=context.retrieval_time_ms,
    )

@router.post("/ask")
async def ask_question(
    request: AskRequest,
    pipeline: RAGPipeline = Depends(get_rag_pipeline),
) -> AskResponse:
    """Ask a question about available datasets."""
    response = pipeline.query(
        query=request.question,
        top_k=request.top_k,
        format_style="qa",
    )

    # Build answer from sources (no LLM)
    if response.context.has_results:
        titles = [src["title"] for src in response.sources[:3]]
        answer = f"Found {len(response.sources)} relevant datasets: {', '.join(titles)}."
    else:
        answer = "No relevant datasets found for your question."

    return AskResponse(
        question=request.question,
        answer=answer,
        sources=response.sources,
        context_used=response.context.num_documents,
        confidence=response.confidence,
    )
The /ask endpoint builds an answer from sources without an LLM. If you want LLM generation, add it at this layer; the pipeline doesn’t need to change.
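For example, a generation-enabled variant could sit next to /ask while the pipeline stays untouched. This is only a sketch: generate_with_llm is a hypothetical helper wrapping whichever client you prefer (OpenAI, Anthropic, a local model), and the 0.5 threshold is arbitrary.

@router.post("/ask-llm")
async def ask_question_llm(
    request: AskRequest,
    pipeline: RAGPipeline = Depends(get_rag_pipeline),
) -> AskResponse:
    """Same retrieval as /ask, with generation added at the API layer."""
    response = pipeline.query(
        query=request.question,
        top_k=request.top_k,
        format_style="qa",  # Context already carries the Q&A instructions
    )

    if response.confidence >= 0.5:
        # Hypothetical helper around your LLM client of choice
        answer = generate_with_llm(response.context.formatted_context)
    else:
        answer = "No relevant datasets found for your question."

    return AskResponse(
        question=request.question,
        answer=answer,
        sources=response.sources,
        context_used=response.context.num_documents,
        confidence=response.confidence,
    )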
Why This Architecture Works
1. Testable retrieval. You can measure and improve retrieval quality independently.
2. LLM-agnostic. Switch from GPT-4 to Claude to Llama by changing one call site. The pipeline doesn’t care (see the sketch after this list).
3. Graceful degradation. LLM API down? Return the sources anyway. Users still get value.
4. Observable. Every response includes the exact context used. Debug why the LLM said something weird by looking at what it was given.
5. Cost control. Not every query needs an LLM. Simple searches can skip generation entirely.
6. Future-proof. When better models come out, swap them in. When you want to fine-tune, you have training data (queries + retrieved contexts).
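Points 2 and 3 in one place: inject the generator as a plain callable and fall back to sources when it fails. A minimal sketch; call_gpt4 and call_claude are hypothetical wrappers around whichever SDKs you actually use:

from typing import Callable, Optional

def answer_with_fallback(
    pipeline: RAGPipeline,
    question: str,
    generate: Optional[Callable[[str], str]] = None,  # e.g. call_gpt4 or call_claude
) -> RAGResponse:
    response = pipeline.query(question, format_style="qa")

    if generate is not None and response.confidence >= 0.5:
        try:
            response.answer = generate(response.context.formatted_context)
        except Exception:
            # LLM down or rate limited: degrade to the source listing below
            pass

    if not response.answer:
        titles = [src["title"] for src in response.sources[:3]]
        response.answer = (
            "Relevant datasets: " + ", ".join(titles)
            if titles
            else "No relevant datasets found."
        )

    return response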
The Uncomfortable Truth
Here’s what I’ve learned: retrieval quality matters more than generation quality.
A perfect LLM with bad retrieval will hallucinate. A mediocre LLM with excellent retrieval will give you the right information, maybe awkwardly phrased.
Build your retrieval pipeline first. Make it observable. Make it testable. Make it good.
Then, maybe, add an LLM.
Enjoyed this post? Check out more articles on my blog.