Embedding Similarity Search Fails on Domain-Specific Queries — Here's the Architecture Fix

March 26, 2026

RAG Embeddings Information Retrieval LLM Context Engineering Vector Search

Your RAG system scores well on general queries. Users ask "how does our billing work?" and it returns relevant docs. Then someone asks "what's the retry behavior when a webhook delivery returns a 429 vs a 503?" — and the system confidently surfaces the wrong document. No error. No warning. Just a plausible-sounding wrong answer.

This is the silent failure mode of pure embedding-based retrieval, and it's architectural, not a tuning problem.

How Embeddings Compress Meaning — and What Gets Lost

Embedding models convert text into dense vectors by compressing semantic meaning into a fixed-dimensional space. This works remarkably well for natural language similarity: "how do I cancel my subscription" and "steps to end my membership" land close together in vector space because they mean the same thing.

But technical queries don't work on semantic closeness. They work on structural precision.

Consider the query: "what's the difference between OAuth 2.0 and OIDC?" To an embedding model, both terms are related to authentication, so documents about OAuth, OIDC, JWT, SSO, and session management all cluster nearby in vector space. The distinction you're asking about — that OIDC is an identity layer built on top of OAuth 2.0, not an alternative — is a structural relationship, not a semantic one. The embedding model has no mechanism to privilege that structural distinction in retrieval.

The same failure pattern appears across domain-specific queries:

API version differences ("v2 vs v3 rate limiting behavior")
Exact error codes or status conditions ("ETIMEDOUT vs ECONNREFUSED")
Specific configuration flags or parameter names
Numerical thresholds and SLA definitions

BM25 + Embeddings: Why Hybrid Retrieval Is the Standard in Production

BM25 is a keyword-based ranking algorithm (the default scoring function in Elasticsearch and the backbone of most traditional search). It scores documents using term frequency and inverse document frequency: it rewards documents that contain your exact query terms, and it gives more weight to rare terms than to terms that appear in most of the corpus. It also normalizes for document length, so a long document doesn't win just by repeating a term.

For technical queries, BM25 is often more reliable than embeddings because exact terminology matters. A document that mentions "429" and "retry" and "webhook" will rank higher than one that's semantically adjacent but uses different vocabulary.
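To make the scoring concrete, here is a minimal from-scratch sketch of the Okapi BM25 formula. It assumes lowercased whitespace tokenization, the commonly used defaults k1=1.5 and b=0.75, and the non-negative IDF variant; the function name and the three sample documents are invented for illustration, and a real system would use a library or search engine instead.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_counts = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts.get(term, 0)
            if tf == 0:
                continue  # a document earns nothing for terms it does not contain
            # Rare terms get a high IDF; terms found in most documents get a low one.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 saturates repeated terms; b normalizes by document length.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Illustrative documents (not from a real corpus)
docs = [
    "webhook delivery retry policy when the endpoint returns 429",
    "how billing works for annual subscription plans",
    "webhook signature verification and secret rotation",
]
scores = bm25_scores("webhook 429 retry", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The first document matches all three query terms and ranks first, the third matches only "webhook", and the billing document scores zero because it shares no exact terms with the query.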

The production pattern is to run both retrievers in parallel, then merge and rerank:

from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_retrieve(query: str, documents: list[str], embedding_model, top_k: int = 10, rrf_k: int = 60):
    # BM25 retrieval (lowercased whitespace tokens; use a real tokenizer in production)
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top_k = np.argsort(bm25_scores)[::-1][:top_k]

    # Embedding retrieval: L2-normalize so the dot product is a true cosine similarity.
    # embedding_model is assumed to expose a sentence-transformers-style encode().
    # In production, precompute and index doc_embeddings instead of encoding per query.
    query_embedding = np.asarray(embedding_model.encode(query))
    doc_embeddings = np.asarray(embedding_model.encode(documents))
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    doc_embeddings = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    cosine_scores = doc_embeddings @ query_embedding
    embedding_top_k = np.argsort(cosine_scores)[::-1][:top_k]

    # Merge both ranked lists with Reciprocal Rank Fusion (ranks start at 1)
    rrf_scores: dict[int, float] = {}
    for ranking in (bm25_top_k, embedding_top_k):
        for rank, idx in enumerate(ranking, start=1):
            rrf_scores[int(idx)] = rrf_scores.get(int(idx), 0.0) + 1.0 / (rrf_k + rank)

    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [documents[idx] for idx, _ in ranked[:top_k]]

Reciprocal Rank Fusion (RRF) is the standard merging strategy here — it combines ranked lists without requiring score normalization, which matters because BM25 and cosine similarity scores live on completely different scales.
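To see why RRF needs no score normalization, it helps to run the arithmetic on two tiny ranked lists. The sketch below is a standalone version of the fusion step that works on document IDs only; the function name and the sample IDs are made up for illustration, and the constant 60 matches the one used in the code above.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        # Only the position in each list matters, never the raw retriever score.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of the two retrievers for the same query
bm25_ranking = ["doc_retry", "doc_webhook", "doc_billing"]
embedding_ranking = ["doc_auth", "doc_retry", "doc_webhook"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here doc_retry scores 1/61 + 1/62 and doc_webhook scores 1/62 + 1/63, so both outrank doc_auth (1/61) even though doc_auth was the top embedding hit. A document that both retrievers agree on beats one that only a single retriever likes.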

Tools and services that implement hybrid retrieval out of the box:

Elasticsearch — supports hybrid search combining BM25 and kNN vector search natively since v8.x
OpenSearch — AWS's Elasticsearch fork with a dedicated hybrid search pipeline
Weaviate — vector database with built-in hybrid search and configurable alpha weighting between BM25 and embedding scores
Qdrant — supports sparse + dense vector fusion, including SPLADE-based sparse retrieval alongside embeddings
LlamaIndex — provides a QueryFusionRetriever that wires BM25 and embedding retrieval together with RRF in a few lines
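Some of these tools expose a weighting knob instead of (or alongside) rank fusion. The sketch below shows one plausible way such alpha weighting can work: min-max normalize each score list so the scales are comparable, then take a weighted average. This is an illustration under those assumptions, not the implementation of any specific product; the function names and sample scores are invented.

```python
import numpy as np


def min_max(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores, dtype=float)
    return (scores - scores.min()) / span


def alpha_fusion(bm25_scores, vector_scores, alpha: float = 0.5) -> np.ndarray:
    """alpha=1.0 uses only vector scores, alpha=0.0 uses only BM25 scores."""
    bm25 = min_max(np.asarray(bm25_scores, dtype=float))
    vec = min_max(np.asarray(vector_scores, dtype=float))
    return alpha * vec + (1 - alpha) * bm25


# Made-up scores for three documents
bm25_raw = [12.0, 3.0, 0.0]      # unbounded BM25 scores
vector_raw = [0.30, 0.80, 0.55]  # cosine similarities

balanced = alpha_fusion(bm25_raw, vector_raw, alpha=0.5)
keyword_heavy = alpha_fusion(bm25_raw, vector_raw, alpha=0.2)
```

With equal weighting the second document wins; shifting alpha toward BM25 promotes the first document, which is the kind of adjustment you might make for a corpus dominated by exact identifiers.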

The Reranker Layer: Precision After Recall

Hybrid retrieval improves your candidate set. A reranker improves which candidates you actually send to the LLM.

Cross-encoder rerankers (like Cohere Rerank or open-source models like cross-encoder/ms-marco-MiniLM-L-6-v2) take a query and a document together as input and output a relevance score. Unlike bi-encoders (standard embedding models), cross-encoders see both texts simultaneously, which lets them reason about the relationship between query and document — not just their independent representations.

Rerankers are computationally expensive — you don't run them over your full corpus. The pattern is: broad retrieval (100+ candidates via hybrid search) → reranker narrows to top 5–10 → those go into context.
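The retrieve-then-rerank pattern can be sketched as a small function that takes any pairwise scorer. To keep the example runnable without downloading a model, it uses a toy token-overlap scorer as a stand-in for a cross-encoder; `rerank`, `overlap_score`, and the sample candidates are hypothetical names and data for illustration only.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the top_n candidates."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the doc."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens)


# Pretend these came back from hybrid retrieval
candidates = [
    "webhook retries use exponential backoff after a 503 response",
    "a 429 response pauses webhook retries until the retry-after window ends",
    "billing retries failed card payments three times",
]
top = rerank("webhook retry behavior 429", candidates, overlap_score, top_n=2)
```

In a real pipeline you would replace `overlap_score` with a model call. With the sentence-transformers library, for example, `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict(pairs)` scores a list of (query, document) pairs in one batch, which is usually faster than scoring pairs one at a time.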

Reranker options worth knowing:

Cohere Rerank — managed API, easiest to integrate, strong out-of-the-box performance; charges per 1K searches
Jina Reranker — managed API alternative with competitive benchmarks and a generous free tier
Voyage AI Rerank — another managed option, particularly strong on technical and code-heavy content
cross-encoder/ms-marco-MiniLM-L-6-v2 — open-source, runs locally via HuggingFace, good baseline for self-hosted setups
FlashRank — lightweight open-source reranker library optimized for low-latency inference, easy drop-in for Python pipelines

Metadata Filtering Comes Before Embedding Search

For many domain-specific queries, the right answer isn't a better retrieval algorithm — it's filtering before you hit the embedding index.

If a user asks about "the v3 API rate limits," you should filter your document index by version=v3 before running any similarity search. Sending the query into a corpus that includes v1, v2, and v3 documentation and hoping the embedding model sorts it out is asking the retriever to do structural reasoning it wasn't designed for.
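A minimal in-memory sketch of that ordering: apply the metadata filter first, then rank only the surviving documents by cosine similarity. The two-dimensional vectors, the metadata layout, and the `filtered_search` helper are all made up for illustration; a real vector store would do the filtering inside the index.

```python
import numpy as np


def filtered_search(query_vec, doc_vecs, doc_metadata, filters, top_k=3):
    """Return indices of the top_k most similar documents among those matching every filter."""
    # Step 1: metadata filter. Documents from other versions never reach the ranker.
    allowed = [
        i for i, meta in enumerate(doc_metadata)
        if all(meta.get(key) == value for key, value in filters.items())
    ]
    if not allowed:
        return []

    # Step 2: cosine similarity over the filtered subset only.
    q = query_vec / np.linalg.norm(query_vec)
    subset = doc_vecs[allowed]
    subset = subset / np.linalg.norm(subset, axis=1, keepdims=True)
    sims = subset @ q
    order = np.argsort(sims)[::-1][:top_k]
    return [allowed[i] for i in order]


# Toy data: document 0 is the closest vector overall, but it describes v2
doc_metadata = [{"version": "v2"}, {"version": "v3"}, {"version": "v3"}, {"version": "v1"}]
doc_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.9, 0.1]])
query_vec = np.array([1.0, 0.0])

hits = filtered_search(query_vec, doc_vecs, doc_metadata, {"version": "v3"}, top_k=2)
```

Without the filter, the v2 document would be the top hit because its vector is identical to the query. With the filter, only the two v3 documents are eligible, so the wrong-version answer cannot be returned no matter how similar it looks.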

Vector stores with strong metadata filtering support:

Pinecone — filter by metadata fields at query time using a MongoDB-style filter syntax
Weaviate — supports pre-filtering (applied before ANN search) and post-filtering, with a GraphQL-based filter API
Chroma — lightweight, runs locally, good for development; supports where and where_document filter clauses
pgvector — if you're already on Postgres, pgvector lets you combine standard SQL WHERE clauses with vector similarity search in a single query, which is a natural fit for metadata scoping

Use Embeddings When

Queries are natural language and conceptual
Paraphrasing and synonyms matter
Semantic closeness predicts relevance

Use BM25/Filters When

Exact terms, codes, or identifiers matter
Structural distinctions define the answer
Metadata (version, category, date) can narrow scope

How to Detect When Your Embedding Retrieval Is Silently Failing

The failure is silent by design — the system still returns something. To catch it:

Build an evaluation set of domain-specific queries with known correct source documents. Run retrieval and measure Recall@K — what percentage of the time is the correct document in your top-K results?
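Log retrieval scores alongside answers. Absolute cosine values depend on the embedding model, so calibrate a threshold against your evaluation set rather than borrowing one. As a rough starting point, if cosine similarity for your top result is below ~0.75, or clearly below what the model returns for queries you know it answers correctly, the retriever is probably guessing.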
Test BM25 alone on your technical query set. If BM25 outperforms embeddings, that's diagnostic — your corpus has precision requirements that semantic search can't meet alone.
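The Recall@K check from the first step is only a few lines of code. This sketch assumes one known-correct document per query; the function name, queries, and document IDs are invented for illustration.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose known-correct document appears in the top k results."""
    hits = sum(
        1 for query, doc_id in relevant.items()
        if doc_id in retrieved.get(query, [])[:k]
    )
    return hits / len(relevant)


# Ground truth: query -> ID of the document that actually answers it
relevant = {
    "retry behavior on 429 vs 503": "webhooks-retries",
    "v3 rate limits": "api-v3-limits",
    "ETIMEDOUT vs ECONNREFUSED": "network-errors",
}

# What a retriever returned for each query, best-first
retrieved = {
    "retry behavior on 429 vs 503": ["webhooks-overview", "webhooks-retries", "billing"],
    "v3 rate limits": ["api-v2-limits", "api-v1-limits", "api-v3-limits"],
    "ETIMEDOUT vs ECONNREFUSED": ["network-errors", "dns-setup", "proxy-config"],
}

recall_at_1 = recall_at_k(retrieved, relevant, k=1)
recall_at_3 = recall_at_k(retrieved, relevant, k=3)
```

Running the same evaluation set through each retriever (embeddings only, BM25 only, hybrid) and comparing the numbers gives you the diagnostic described above. A large gap between Recall@1 and Recall@3, as in this toy data, suggests the right document is being found but ranked too low, which is the case a reranker is meant to fix.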

Evaluation tools that make this measurable:

Ragas — open-source RAG evaluation framework; measures context precision, context recall, and answer faithfulness against a ground-truth set
LangSmith — LangChain's observability and evaluation platform; lets you trace retrieval steps, log scores, and run eval datasets against your pipeline
TruLens — framework-agnostic RAG evaluation library with built-in retrieval quality metrics and a local dashboard for inspecting results

Takeaway

Embedding similarity is a recall mechanism, not a precision mechanism, and for domain-specific RAG, precision is exactly what you need. Layer metadata filters, BM25 keyword search, and a cross-encoder reranker into your pipeline before you blame the LLM for bad answers.