Embedding Similarity Search Fails on Domain-Specific Queries — Here's the Architecture Fix

RAG Embeddings Information Retrieval LLM Context Engineering Vector Search

Your RAG system scores well on general queries. Users ask "how does our billing work?" and it returns relevant docs. Then someone asks "what's the retry behavior when a webhook delivery returns a 429 vs a 503?" — and the system confidently surfaces the wrong document. No error. No warning. Just a plausible-sounding wrong answer.

This is the silent failure mode of pure embedding-based retrieval, and it's architectural, not a tuning problem.

How Embeddings Compress Meaning — and What Gets Lost

Embedding models convert text into dense vectors by compressing semantic meaning into a fixed-dimensional space. This works remarkably well for natural language similarity: "how do I cancel my subscription" and "steps to end my membership" land close together in vector space because they mean the same thing.
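
To make "close together in vector space" concrete, here is a minimal sketch of the cosine similarity comparison a vector store performs. The three-dimensional vectors are invented for illustration; real embedding models produce hundreds or thousands of dimensions, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", made up for this example only.
cancel_subscription = [0.9, 0.1, 0.0]   # "how do I cancel my subscription"
end_membership = [0.8, 0.2, 0.1]        # "steps to end my membership"
webhook_retry = [0.0, 0.2, 0.9]         # an unrelated technical topic

paraphrase_score = cosine_similarity(cancel_subscription, end_membership)
unrelated_score = cosine_similarity(cancel_subscription, webhook_retry)

# The paraphrase pair scores far higher than the unrelated pair.
print(round(paraphrase_score, 3), round(unrelated_score, 3))
```

Nothing in this calculation looks at individual terms or at how two concepts relate structurally; it only measures how closely two vectors point in the same direction.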

But technical queries don't work on semantic closeness. They work on structural precision.

Consider the query: "what's the difference between OAuth 2.0 and OIDC?" To an embedding model, both terms are related to authentication, so documents about OAuth, OIDC, JWT, SSO, and session management all cluster nearby in vector space. The distinction you're asking about — that OIDC is an identity layer built on top of OAuth 2.0, not an alternative — is a structural relationship, not a semantic one. The embedding model has no mechanism to privilege that structural distinction in retrieval.

The same failure pattern appears across domain-specific queries:

BM25 + Embeddings: Why Hybrid Retrieval Is the Standard in Production

BM25 is a keyword-based ranking function (the default scoring in Elasticsearch and other Lucene-based search engines). It scores documents using term frequency and inverse document frequency, normalized by document length: it rewards documents that contain your exact query terms, and it weights rare terms more heavily than terms that appear in most of the corpus.

For technical queries, BM25 is often more reliable than embeddings because exact terminology matters. A document that mentions "429" and "retry" and "webhook" will rank higher than one that's semantically adjacent but uses different vocabulary.
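
To show where the term-frequency and inverse-document-frequency weighting comes from, here is a minimal from-scratch sketch of BM25 scoring. It assumes the Lucene-style IDF variant and the common defaults k1 = 1.5 and b = 0.75; library implementations such as rank_bm25 differ in details, and the three documents are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with a basic BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # only exact term matches contribute
            # Rare terms get a larger IDF than terms found in most documents.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus, made up for this example.
docs = [
    "webhook delivery retry on 429 uses exponential backoff",
    "webhook delivery overview and event types",
    "billing overview and invoice schedule",
]
tokenized = [doc.split() for doc in docs]
scores = bm25_scores("webhook 429 retry".split(), tokenized)
print(scores)
```

The first document matches all three query terms, including the rare "429" and "retry", so it outranks the document that only shares the common term "webhook". The billing document shares no terms and scores zero.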

The production pattern is to run both retrievers in parallel, then merge and rerank:

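from rank_bm25 import BM25Okapi
import numpy as np

RRF_K = 60  # conventional smoothing constant for Reciprocal Rank Fusion

def hybrid_retrieve(query: str, documents: list[str], embedding_model, top_k: int = 10):
    # BM25 retrieval. Lowercased whitespace tokens keep the example short;
    # a production system would use a proper tokenizer and a prebuilt index.
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top_k = np.argsort(bm25_scores)[::-1][:top_k]

    # Embedding retrieval. Vectors are L2-normalized so that the dot product
    # is a true cosine similarity. (Document embeddings would normally be
    # computed once and cached, not re-encoded on every query.)
    query_embedding = np.asarray(embedding_model.encode(query), dtype=float)
    doc_embeddings = np.asarray(embedding_model.encode(documents), dtype=float)
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    doc_embeddings = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    cosine_scores = doc_embeddings @ query_embedding
    embedding_top_k = np.argsort(cosine_scores)[::-1][:top_k]

    # Merge both ranked lists with Reciprocal Rank Fusion (ranks start at 1)
    rrf_scores = {}
    for ranking in (bm25_top_k, embedding_top_k):
        for rank, idx in enumerate(ranking, start=1):
            rrf_scores[int(idx)] = rrf_scores.get(int(idx), 0.0) + 1.0 / (RRF_K + rank)

    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [documents[idx] for idx, _ in ranked[:top_k]]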

Reciprocal Rank Fusion (RRF) is the standard merging strategy here — it combines ranked lists without requiring score normalization, which matters because BM25 and cosine similarity scores live on completely different scales.
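
As a worked illustration of why RRF needs only ranks, here is a minimal standalone sketch. Each document receives 1 / (k + rank) from every list it appears in, with ranks counted from 1 and k = 60 as the conventional constant. The document IDs and both rankings are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs of two retrievers for the same query.
bm25_ranking = ["doc_retry", "doc_webhooks", "doc_billing"]
embedding_ranking = ["doc_webhooks", "doc_auth", "doc_retry"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
# doc_webhooks: 1/62 + 1/61 ≈ 0.03252
# doc_retry:    1/61 + 1/63 ≈ 0.03227
# doc_auth:     1/62        ≈ 0.01613
# doc_billing:  1/63        ≈ 0.01587
```

Documents that both retrievers agree on rise to the top, and the raw BM25 and cosine scores never have to be compared directly.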

Tools and services that implement hybrid retrieval out of the box:

The Reranker Layer: Precision After Recall

Hybrid retrieval improves your candidate set. A reranker improves which candidates you actually send to the LLM.

Cross-encoder rerankers (like Cohere Rerank or open-source models like cross-encoder/ms-marco-MiniLM-L-6-v2) take a query and a document together as input and output a relevance score. Unlike bi-encoders (standard embedding models), cross-encoders see both texts simultaneously, which lets them reason about the relationship between query and document — not just their independent representations.

Rerankers are computationally expensive — you don't run them over your full corpus. The pattern is: broad retrieval (100+ candidates via hybrid search) → reranker narrows to top 5–10 → those go into context.
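
The retrieve-then-rerank step can be sketched as follows. This is a shape-of-the-pipeline example, not a real reranker: `score_pairs` is a hypothetical callable that takes (query, document) pairs and returns one score per pair, and `toy_overlap_scorer` is a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature: (query, document) pairs in, one score per pair out.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

def rerank(query: str, candidates: Sequence[str], score_pairs: PairScorer,
           top_n: int = 5) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the top_n candidates."""
    if not candidates:
        return []
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

def toy_overlap_scorer(pairs):
    """Stand-in for a cross-encoder: fraction of query tokens found in the document."""
    scores = []
    for query, doc in pairs:
        query_tokens = set(query.lower().split())
        doc_tokens = set(doc.lower().split())
        scores.append(len(query_tokens & doc_tokens) / len(query_tokens))
    return scores

# Hypothetical candidate set, as if returned by a broad hybrid search.
candidates = [
    "Billing overview and invoice schedule",
    "Webhook retry behavior for 429 and 503 responses",
    "Webhook event types",
]
top = rerank("webhook retry 429", candidates, toy_overlap_scorer, top_n=2)
print(top)
```

In a real pipeline the scorer would wrap a cross-encoder model. For example, the `predict` method of a sentence-transformers `CrossEncoder` accepts a list of text pairs, so it should fit this signature with little or no adaptation.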

Reranker options worth knowing:

Metadata Filtering Comes Before Embedding Search

For many domain-specific queries, the right answer isn't a better retrieval algorithm — it's filtering before you hit the embedding index.

If a user asks about "the v3 API rate limits," you should filter your document index by version=v3 before running any similarity search. Sending the query into a corpus that includes v1, v2, and v3 documentation and hoping the embedding model sorts it out is asking the retriever to do structural reasoning it wasn't designed for.
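
A minimal sketch of filter-then-search, assuming each document carries a metadata dictionary. The `Doc` class, the `filter_then_search` helper, and the token-overlap scorer are all hypothetical stand-ins; a real vector store would apply the filter inside its index and use embedding similarity for scoring.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_then_search(query, docs, filters, score_fn, top_k=3):
    """Keep only documents whose metadata matches every filter, then rank those."""
    scoped = [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(scoped, key=lambda doc: score_fn(query, doc.text), reverse=True)
    return ranked[:top_k]

def token_overlap(query, text):
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Hypothetical corpus with near-identical pages for three API versions.
corpus = [
    Doc("v1 API rate limits", {"version": "v1"}),
    Doc("v2 API rate limits", {"version": "v2"}),
    Doc("v3 API rate limits", {"version": "v3"}),
    Doc("v3 webhook signatures", {"version": "v3"}),
]

results = filter_then_search("api rate limits", corpus, {"version": "v3"}, token_overlap)
print([doc.text for doc in results])
```

Without the filter, all three rate-limit pages tie on this query and the version the user asked about has no advantage. With the filter, the v1 and v2 pages never become candidates.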

Vector stores with strong metadata filtering support:

Use Embeddings When

  • Queries are natural language and conceptual
  • Paraphrasing and synonyms matter
  • Semantic closeness predicts relevance

Use BM25/Filters When

  • Exact terms, codes, or identifiers matter
  • Structural distinctions define the answer
  • Metadata (version, category, date) can narrow scope

How to Detect When Your Embedding Retrieval Is Silently Failing

The failure is silent by design — the system still returns something. To catch it:

  1. Build an evaluation set of domain-specific queries with known correct source documents. Run retrieval and measure Recall@K — what percentage of the time is the correct document in your top-K results?
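  2. Log retrieval scores alongside answers. Absolute cosine values vary widely between embedding models, so calibrate a threshold against your evaluation set instead of assuming one. As a rough starting point, a top-result cosine similarity below ~0.75 on normalized vectors suggests the retriever may be guessing.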
  3. Test BM25 alone on your technical query set. If BM25 outperforms embeddings, that's diagnostic — your corpus has precision requirements that semantic search can't meet alone.
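
The Recall@K measurement from step 1 can be sketched in a few lines. This assumes an evaluation set of (query, expected document ID) pairs and a `retrieve` function that returns ranked document IDs; both the queries and the canned results below are hypothetical.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose known-correct document appears in the top k results."""
    hits = 0
    for query, expected_doc_id in eval_set:
        if expected_doc_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(eval_set)

# Hypothetical evaluation set: each query is paired with its correct source document.
eval_set = [
    ("webhook retry on 429", "doc_retry"),
    ("v3 rate limits", "doc_limits_v3"),
    ("oauth vs oidc", "doc_oidc"),
]

# Canned rankings standing in for a real retriever.
canned_results = {
    "webhook retry on 429": ["doc_webhooks", "doc_retry"],
    "v3 rate limits": ["doc_limits_v1", "doc_limits_v2", "doc_limits_v3"],
    "oauth vs oidc": ["doc_oidc", "doc_oauth"],
}

def fake_retrieve(query):
    return canned_results[query]

recall_at_2 = recall_at_k(eval_set, fake_retrieve, k=2)
recall_at_3 = recall_at_k(eval_set, fake_retrieve, k=3)
print(recall_at_2, recall_at_3)
```

Running the same function against an embedding-only retriever, a BM25-only retriever, and the hybrid pipeline gives a direct comparison for step 3.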

Evaluation tools that make this measurable:

  • Ragas — open-source RAG evaluation framework; measures context precision, context recall, and answer faithfulness against a ground-truth set
  • LangSmith — LangChain's observability and evaluation platform; lets you trace retrieval steps, log scores, and run eval datasets against your pipeline
  • TruLens — framework-agnostic RAG evaluation library with built-in retrieval quality metrics and a local dashboard for inspecting results

Takeaway

Embedding similarity is a recall mechanism, not a precision mechanism, and for domain-specific RAG, precision is exactly what you need. That means layering metadata filters, BM25 keyword search, and a cross-encoder reranker into your pipeline before you ever blame the LLM for bad answers.