Part 3: RAG Without the LLM: Building Context-First Retrieval Pipelines
When I set out to build a dataset discovery system, everyone said “just throw GPT-4 at it.” But I built something different: a retrieval system that returns structured context, formatted and ready for any LLM (or no LLM at all).
This is the architecture that emerged, and why I think it’s the right approach for most RAG applications.
The Problem with LLM-First RAG
Most RAG tutorials look like this:
def answer_question(question: str) -> str:
    # 1. Embed the question
    query_embedding = embed(question)

    # 2. Search vector database
    documents = vector_db.search(query_embedding, k=5)

    # 3. Build prompt and call LLM
    prompt = f"""Answer based on these documents:
{documents}

Question: {question}
"""
    return openai.chat(prompt)
Simple. And problematic:
- Tight coupling. Your retrieval logic is married to OpenAI’s API. Want to switch to Claude? Rewrite everything.
- No observability. What documents were retrieved? What was the actual context? You can’t see it.
- No graceful degradation. LLM rate limited? API down? Your entire system fails.
- Testing nightmare. How do you test retrieval quality when it’s bundled with generation?
The Solution: Retrieval as a First-Class Concern
Separate retrieval from generation. Build a pipeline that returns context, not answers:
class RAGResponse(BaseModel):
    query: str
    answer: str = ""  # Empty! Filled by caller, if at all
    context: RetrievedContext
    sources: List[Dict[str, Any]]
    confidence: float
Notice answer: str = "". The pipeline doesn’t generate answers. It retrieves context and leaves generation to the caller, who might use GPT-4, Claude, Llama, or nothing at all.
The Architecture
┌─────────────────┐
│ User Query │
└────────┬────────┘
│
▼
┌─────────────────┐
│ RAG Pipeline │
│ ┌───────────┐ │
│ │ Retrieve │ │
│ └─────┬─────┘ │
│ ▼ │
│ ┌───────────┐ │
│ │ Format │ │
│ └─────┬─────┘ │
│ ▼ │
│ ┌───────────┐ │
│ │ Compute │ │
│ │Confidence │ │
│ └───────────┘ │
└────────┬────────┘
│
▼
┌─────────────────┐
│ RAGResponse │
│ - context │
│ - sources │
│ - confidence │
│ - answer="" │
└────────┬────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ LLM (opt) │ OR │ Direct Display │
└─────────────────┘ └─────────────────┘
The pipeline does three things:
- Retrieve relevant documents
- Format context for consumption
- Compute confidence from retrieval quality
What it doesn’t do: generate answers. That’s the caller’s job.
Building the Retrieval Layer
Start with a clean retrieval interface:
class RetrievedContext(BaseModel):
    """Context retrieved for RAG generation."""

    documents: List[SearchResult] = Field(default_factory=list)
    query: str = ""
    formatted_context: str = ""
    sources: List[Dict[str, Any]] = Field(default_factory=list)
    retrieval_time_ms: float = 0.0

    @property
    def has_results(self) -> bool:
        return len(self.documents) > 0

    @property
    def num_documents(self) -> int:
        return len(self.documents)
The key fields:
- documents: The raw search results with full metadata
- formatted_context: A string ready for LLM consumption
- sources: Simplified citations for display
- retrieval_time_ms: Performance tracking
Now the retrieval method:
class RAGPipeline:
    def __init__(
        self,
        vector_store: VectorStore,
        embedding_service: EmbeddingService,
        default_collection: str = "datasets",
        top_k: int = 5,
        context_max_length: int = 4000,
    ):
        self.vector_store = vector_store
        self.embedding_service = embedding_service
        self.default_collection = default_collection
        self.top_k = top_k
        self.context_max_length = context_max_length

    def retrieve(
        self,
        query: str,
        collection: Optional[str] = None,
        top_k: Optional[int] = None,
        filters: Optional[Dict[str, Any]] = None,
        min_score: float = 0.0,
    ) -> RetrievedContext:
        """Retrieve relevant documents for a query."""
        start_time = time.time()
        collection = collection or self.default_collection
        top_k = top_k or self.top_k

        # Generate query embedding
        query_embedding = self.embedding_service.embed_query(query)

        # Search vector store
        search_results = self.vector_store.search(
            collection=collection,
            query_embedding=query_embedding,
            n_results=top_k,
            where=filters,
        )

        # Filter by minimum score
        filtered_docs = [
            doc for doc in search_results.results
            if doc.score >= min_score
        ]

        # Build sources list
        sources = [
            {
                "id": doc.id,
                "title": doc.metadata.get("title", "Unknown"),
                "dataset_id": doc.metadata.get("dataset_id", doc.id),
                "score": doc.score,
            }
            for doc in filtered_docs
        ]

        # Format context
        formatted_context = ContextFormatter.format_structured(
            filtered_docs,
            max_length=self.context_max_length,
        )

        retrieval_time = (time.time() - start_time) * 1000

        return RetrievedContext(
            documents=filtered_docs,
            query=query,
            formatted_context=formatted_context,
            sources=sources,
            retrieval_time_ms=retrieval_time,
        )
Clean separation. Retrieve documents, format context, return everything. No LLM in sight.
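To see what the caller actually gets back, here is a minimal sketch built on the models above; the query string, top_k, and min_score values are just illustrative:

context = pipeline.retrieve("soil moisture time series", top_k=3, min_score=0.2)

print(f"Retrieved {context.num_documents} documents in {context.retrieval_time_ms:.1f} ms")
for src in context.sources:
    # Each source is a plain dict: id, title, dataset_id, score
    print(f"  {src['score']:.2f}  {src['title']}")

if not context.has_results:
    print("Nothing relevant found; try relaxing the filters or lowering min_score")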
Context Formatting: The Underrated Art
How you format context matters as much as what you retrieve. Different use cases need different formats.
Simple format for basic retrieval:
@staticmethod
def format_simple(documents: List[SearchResult], max_length: int = 4000) -> str:
    """Basic concatenation with numbering."""
    context_parts = []
    current_length = 0

    for i, doc in enumerate(documents, 1):
        title = doc.metadata.get("title", "Unknown Dataset")
        content = doc.content

        part = f"[{i}] {title}\n{content}\n"
        if current_length + len(part) > max_length:
            break

        context_parts.append(part)
        current_length += len(part)

    return "\n".join(context_parts)
Output:
[1] Land Cover Map 2017
This dataset represents land cover classification...
[2] UK Soil Moisture Dataset
Soil moisture measurements from...
Structured format with metadata:
@staticmethod
def format_structured(documents: List[SearchResult], max_length: int = 4000) -> str:
    """Rich formatting with metadata."""
    context_parts = []
    current_length = 0

    for i, doc in enumerate(documents, 1):
        meta = doc.metadata
        title = meta.get("title", "Unknown Dataset")
        dataset_id = meta.get("dataset_id", doc.id)

        entry = f"""
---
Source [{i}]: {title}
ID: {dataset_id}
Relevance: {doc.score:.2%}
---
{doc.content}
"""
        if current_length + len(entry) > max_length:
            break

        context_parts.append(entry.strip())
        current_length += len(entry)

    return "\n\n".join(context_parts)
Output:
---
Source [1]: Land Cover Map 2017
ID: abc-123-def
Relevance: 94.23%
---
This dataset represents land cover classification...
---
Source [2]: UK Soil Moisture Dataset
ID: xyz-789-uvw
Relevance: 87.15%
---
Soil moisture measurements from...
Q&A format with instructions for the LLM:
@staticmethod
def format_for_qa(
    documents: List[SearchResult],
    query: str,
    max_length: int = 4000,
) -> str:
    """Format specifically for Q&A tasks."""
    context = ContextFormatter.format_structured(documents, max_length - 200)

    return f"""Based on the following dataset information, answer the user's question.

RELEVANT DATASETS:
{context}

USER QUESTION: {query}

Provide a helpful answer based on the datasets above. If the information is not available in the provided context, say so clearly."""
The caller chooses the format based on their needs:
response = pipeline.query(query, format_style="qa") # For LLM
response = pipeline.query(query, format_style="simple") # For display
response = pipeline.query(query, format_style="structured") # For debugging
Confidence: Quantifying Retrieval Quality
LLM outputs are notoriously hard to trust. But retrieval quality? That’s measurable.
def _compute_confidence(self, context: RetrievedContext) -> float:
    """Compute confidence score based on retrieval quality."""
    if not context.documents:
        return 0.0

    # Average of top document scores
    scores = [doc.score for doc in context.documents[:3]]
    return sum(scores) / len(scores)
This gives you a number you can act on:
response = pipeline.query("obscure topic nobody wrote about")

if response.confidence < 0.3:
    return "I don't have enough relevant information to answer this question."
elif response.confidence < 0.6:
    return f"Based on limited information: {generate_answer(response.context)}"
else:
    return generate_answer(response.context)
No hallucination-prone LLM confidence scores. Just retrieval metrics you can verify by looking at the actual documents.
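And because the inputs to that number are just documents, you can audit it whenever it looks off. A quick sketch that prints what the confidence was averaged over (the query is illustrative):

response = pipeline.query("land cover classification")

print(f"confidence = {response.confidence:.2f}")
for doc in response.context.documents[:3]:
    # The exact scores the confidence was computed from
    print(f"  {doc.score:.2f}  {doc.metadata.get('title', 'Unknown')}")
    print(f"      {doc.content[:120]}...")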
The Query Method: Bringing It Together
def query(
    self,
    query: str,
    collection: Optional[str] = None,
    top_k: Optional[int] = None,
    filters: Optional[Dict[str, Any]] = None,
    format_style: str = "structured",
) -> RAGResponse:
    """
    Execute a full RAG query.

    Retrieves context and prepares response (without LLM generation).
    The formatted context can be passed to any LLM.
    """
    # Retrieve relevant documents
    context = self.retrieve(
        query=query,
        collection=collection,
        top_k=top_k,
        filters=filters,
    )

    # Format context based on style
    if format_style == "simple":
        formatted = ContextFormatter.format_simple(
            context.documents,
            self.context_max_length,
        )
    elif format_style == "qa":
        formatted = ContextFormatter.format_for_qa(
            context.documents,
            query,
            self.context_max_length,
        )
    else:
        formatted = context.formatted_context

    context.formatted_context = formatted

    # Build response (answer would be filled by LLM)
    return RAGResponse(
        query=query,
        answer="",  # Deliberately empty
        context=context,
        sources=context.sources,
        confidence=self._compute_confidence(context),
    )
The answer="" is intentional. This is a feature, not a bug. The pipeline’s job is retrieval and formatting. Generation is someone else’s problem.
Using the Pipeline
With an LLM:
response = pipeline.query("What soil data is available?", format_style="qa")

if response.confidence > 0.5:
    llm_answer = openai.chat(response.context.formatted_context)
    response.answer = llm_answer
else:
    response.answer = "I couldn't find relevant information."

return response
Without an LLM:
response = pipeline.query("soil moisture datasets")

# Just return the sources
return {
    "query": response.query,
    "results": [
        {
            "title": src["title"],
            "id": src["dataset_id"],
            "relevance": f"{src['score']:.0%}",
        }
        for src in response.sources
    ],
    "confidence": response.confidence,
}
For testing:
def test_retrieval_quality():
    response = pipeline.query("land cover classification")

    # Assert on retrieval, not generation
    assert response.confidence > 0.7
    assert len(response.sources) >= 3
    assert "Land Cover" in response.sources[0]["title"]
You can test retrieval independently of generation. No mocking LLM APIs. No flaky tests based on generation randomness.
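You can push this further with a small golden set of queries. Here is a sketch using pytest's parametrize; the query/title pairs are invented placeholders for whatever your corpus actually contains:

import pytest

GOLDEN_QUERIES = [
    # (query, substring expected in one of the top-3 titles)
    ("land cover classification", "Land Cover"),
    ("soil moisture measurements", "Soil Moisture"),
]

@pytest.mark.parametrize("query,expected_title", GOLDEN_QUERIES)
def test_golden_queries(query, expected_title):
    response = pipeline.query(query)
    top_titles = [src["title"] for src in response.sources[:3]]

    # Still asserting on retrieval only; no generation involved
    assert response.confidence > 0.5
    assert any(expected_title in title for title in top_titles)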
Finding Similar Items
The same architecture enables “similar items” features:
def get_similar_datasets(self, dataset_id: str, top_k: int = 5) -> List[Dict[str, Any]]:
    """Find datasets similar to a given dataset."""
    # Get the source dataset
    source = self.vector_store.get_by_id(
        collection=self.default_collection,
        doc_id=dataset_id,
    )
    if not source:
        return []

    # Search using source content as query
    context = self.retrieve(
        query=source.content,  # Use document text as query
        top_k=top_k + 1,  # +1 to exclude self
    )

    # Filter out the source dataset
    results = [
        {
            "id": doc.id,
            "title": doc.metadata.get("title", "Unknown"),
            "score": doc.score,
            "metadata": doc.metadata,
        }
        for doc in context.documents
        if doc.id != dataset_id
    ]

    return results[:top_k]
Same retrieval logic. Different use case. No LLM needed.
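As a usage sketch, a “More like this” panel needs nothing more than this (the dataset ID is the placeholder from the earlier example output):

similar = pipeline.get_similar_datasets("abc-123-def", top_k=3)

for item in similar:
    print(f"{item['score']:.2f}  {item['title']}")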
The API Layer
The FastAPI endpoints expose the pipeline cleanly:
@router.get("/search")
async def search_datasets(
    query: str,
    top_k: int = 10,
    pipeline: RAGPipeline = Depends(get_rag_pipeline),
) -> SearchResponse:
    """Search for datasets using semantic similarity."""
    context = pipeline.retrieve(query=query, top_k=top_k)

    return SearchResponse(
        query=query,
        results=[
            SearchResultItem(
                id=doc.id,
                title=doc.metadata.get("title", "Unknown"),
                description=doc.content[:500],
                score=doc.score,
            )
            for doc in context.documents
        ],
        total=len(context.documents),
        search_time_ms=context.retrieval_time_ms,
    )

@router.post("/ask")
async def ask_question(
    request: AskRequest,
    pipeline: RAGPipeline = Depends(get_rag_pipeline),
) -> AskResponse:
    """Ask a question about available datasets."""
    response = pipeline.query(
        query=request.question,
        top_k=request.top_k,
        format_style="qa",
    )

    # Build answer from sources (no LLM)
    if response.context.has_results:
        titles = [src["title"] for src in response.sources[:3]]
        answer = f"Found {len(response.sources)} relevant datasets: {', '.join(titles)}."
    else:
        answer = "No relevant datasets found for your question."

    return AskResponse(
        question=request.question,
        answer=answer,
        sources=response.sources,
        context_used=response.context.num_documents,
        confidence=response.confidence,
    )
The /ask endpoint builds an answer from sources without an LLM. If you want LLM generation, add it at this layer; the pipeline doesn’t need to change.
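For example, a generation-enabled variant could sit next to /ask while the pipeline stays untouched. This is only a sketch: generate_with_llm is a hypothetical helper wrapping whichever client you prefer (OpenAI, Anthropic, a local model), and the 0.5 threshold is arbitrary.

@router.post("/ask-llm")
async def ask_question_llm(
    request: AskRequest,
    pipeline: RAGPipeline = Depends(get_rag_pipeline),
) -> AskResponse:
    """Same retrieval as /ask, with generation added at the API layer."""
    response = pipeline.query(
        query=request.question,
        top_k=request.top_k,
        format_style="qa",  # Context already carries the Q&A instructions
    )

    if response.confidence >= 0.5:
        # Hypothetical helper around your LLM client of choice
        answer = generate_with_llm(response.context.formatted_context)
    else:
        answer = "No relevant datasets found for your question."

    return AskResponse(
        question=request.question,
        answer=answer,
        sources=response.sources,
        context_used=response.context.num_documents,
        confidence=response.confidence,
    )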
Why This Architecture Works
1. Testable retrieval. You can measure and improve retrieval quality independently.
2. LLM-agnostic. Switch from GPT-4 to Claude to Llama by changing one call site. The pipeline doesn’t care (see the sketch after this list).
3. Graceful degradation. LLM API down? Return the sources anyway. Users still get value.
4. Observable. Every response includes the exact context used. Debug why the LLM said something weird by looking at what it was given.
5. Cost control. Not every query needs an LLM. Simple searches can skip generation entirely.
6. Future-proof. When better models come out, swap them in. When you want to fine-tune, you have training data (queries + retrieved contexts).
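Points 2 and 3 in one place: inject the generator as a plain callable and fall back to sources when it fails. A minimal sketch; call_gpt4 and call_claude are hypothetical wrappers around whichever SDKs you actually use:

from typing import Callable, Optional

def answer_with_fallback(
    pipeline: RAGPipeline,
    question: str,
    generate: Optional[Callable[[str], str]] = None,  # e.g. call_gpt4 or call_claude
) -> RAGResponse:
    response = pipeline.query(question, format_style="qa")

    if generate is not None and response.confidence >= 0.5:
        try:
            response.answer = generate(response.context.formatted_context)
        except Exception:
            # LLM down or rate limited: degrade to the source listing below
            pass

    if not response.answer:
        titles = [src["title"] for src in response.sources[:3]]
        response.answer = (
            "Relevant datasets: " + ", ".join(titles)
            if titles
            else "No relevant datasets found."
        )

    return response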
The Uncomfortable Truth
Here’s what I’ve learned: retrieval quality matters more than generation quality.
A perfect LLM with bad retrieval will hallucinate. A mediocre LLM with excellent retrieval will give you the right information, maybe awkwardly phrased.
Build your retrieval pipeline first. Make it observable. Make it testable. Make it good.
Then, maybe, add an LLM.
Enjoyed this post? Check out more articles on my blog.