Embedding Similarity Search Fails on Domain-Specific Queries — Here's the Architecture Fix
Your RAG system scores well on general queries. Users ask "how does our billing work?" and it returns relevant docs. Then someone asks "what's the retry behavior when a webhook delivery returns a 429 vs a 503?" — and the system confidently surfaces the wrong document. No error. No warning. Just a plausible-sounding wrong answer.
This is the silent failure mode of pure embedding-based retrieval, and it's architectural, not a tuning problem.
How Embeddings Compress Meaning — and What Gets Lost
Embedding models convert text into dense vectors by compressing semantic meaning into a fixed-dimensional space. This works remarkably well for natural language similarity: "how do I cancel my subscription" and "steps to end my membership" land close together in vector space because they mean the same thing.
But technical queries don't work on semantic closeness. They work on structural precision.
Consider the query: "what's the difference between OAuth 2.0 and OIDC?" To an embedding model, both terms are related to authentication, so documents about OAuth, OIDC, JWT, SSO, and session management all cluster nearby in vector space. The distinction you're asking about — that OIDC is an identity layer built on top of OAuth 2.0, not an alternative — is a structural relationship, not a semantic one. The embedding model has no mechanism to privilege that structural distinction in retrieval.
The same failure pattern appears across domain-specific queries:
- API version differences ("v2 vs v3 rate limiting behavior")
- Exact error codes or status conditions ("ETIMEDOUT vs ECONNREFUSED")
- Specific configuration flags or parameter names
- Numerical thresholds and SLA definitions
BM25 + Embeddings: Why Hybrid Retrieval Is the Standard in Production
BM25 is a keyword-based ranking algorithm (the backbone of Elasticsearch and most traditional search). It scores documents based on term frequency and inverse document frequency: it rewards documents that contain your exact query terms, and it gives less weight to terms that appear in most of the corpus, since those terms do little to tell documents apart.
For technical queries, BM25 is often more reliable than embeddings because exact terminology matters. A document that mentions "429" and "retry" and "webhook" will rank higher than one that's semantically adjacent but uses different vocabulary.
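To make the scoring concrete, here is a minimal from-scratch sketch of BM25 in plain Python. It assumes whitespace tokenization, illustrative parameter values (k1 = 1.5, b = 0.75), and a smoothed IDF that stays non-negative; the function name and sample documents are made up for this example. A real system would rely on a library or search engine rather than this loop.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents get a larger normalization term
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # exact-match only: absent terms contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical mini-corpus for illustration
docs = [
    "webhook delivery retries with exponential backoff after a 429 response",
    "webhook delivery is paused and the endpoint is marked unhealthy after a 503 response",
    "how billing cycles and invoices work",
]
scores = bm25_scores("webhook 429", docs)
```

In this toy corpus the first document scores highest because it contains both query terms, and "429" carries more weight than "webhook" because it appears in only one document. The billing document scores zero, since it shares no terms with the query.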
The production pattern is to run both retrievers in parallel, then merge and rerank:
```python
from rank_bm25 import BM25Okapi
import numpy as np


def hybrid_retrieve(query: str, documents: list[str], embedding_model, top_k: int = 10):
    # BM25 retrieval over lowercased, whitespace-tokenized text
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top_k = np.argsort(bm25_scores)[::-1][:top_k]

    # Embedding retrieval (in production, precompute and index the document embeddings)
    query_embedding = np.asarray(embedding_model.encode(query))
    doc_embeddings = np.asarray(embedding_model.encode(documents))
    # Normalize so the dot product is a true cosine similarity
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    doc_embeddings = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    cosine_scores = doc_embeddings @ query_embedding
    embedding_top_k = np.argsort(cosine_scores)[::-1][:top_k]

    # Merge both ranked lists with Reciprocal Rank Fusion (constant 60, ranks start at 1)
    rrf_scores: dict[int, float] = {}
    for ranking in (bm25_top_k, embedding_top_k):
        for rank, idx in enumerate(ranking, start=1):
            rrf_scores[int(idx)] = rrf_scores.get(int(idx), 0.0) + 1 / (60 + rank)

    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [documents[idx] for idx, _ in ranked[:top_k]]
```
Reciprocal Rank Fusion (RRF) is the standard merging strategy here — it combines ranked lists without requiring score normalization, which matters because BM25 and cosine similarity scores live on completely different scales.
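A small worked example may help show why rank-only fusion sidesteps the scale problem. The sketch below is a standalone RRF helper that takes any number of ranked lists of document IDs; the function name, the document IDs, and the constant 60 are illustrative assumptions, not part of any particular library.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs using only rank positions."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw retriever scores are never used
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from two retrievers
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_d", "doc_b", "doc_e"]
fused = rrf_fuse([bm25_ranking, embedding_ranking])
```

Here `doc_b` is ranked second by both retrievers and scores 2/62 (about 0.032), while `doc_a` and `doc_d` are each ranked first by only one retriever and score 1/61 (about 0.016). Agreement between retrievers outweighs a single top position, and no score normalization was needed to get there.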
Tools and services that implement hybrid retrieval out of the box:
- Elasticsearch — supports hybrid search combining BM25 and kNN vector search natively since v8.x
- OpenSearch — AWS's Elasticsearch fork with a dedicated hybrid search pipeline
- Weaviate — vector database with built-in hybrid search and configurable alpha weighting between BM25 and embedding scores
- Qdrant — supports sparse + dense vector fusion, including SPLADE-based sparse retrieval alongside embeddings
- LlamaIndex — provides a QueryFusionRetriever that wires BM25 and embedding retrieval together with RRF in a few lines
The Reranker Layer: Precision After Recall
Hybrid retrieval improves your candidate set. A reranker improves which candidates you actually send to the LLM.
Cross-encoder rerankers (like Cohere Rerank or open-source models like cross-encoder/ms-marco-MiniLM-L-6-v2) take a query and a document together as input and output a relevance score. Unlike bi-encoders (standard embedding models), cross-encoders see both texts simultaneously, which lets them reason about the relationship between query and document — not just their independent representations.
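As a rough sketch of where the reranker sits, the helper below scores (query, document) pairs and keeps the best few. The names `rerank` and `load_cross_encoder_scorer` are hypothetical. The scorer is injected so the logic can be tried without downloading a model; the loader assumes the sentence-transformers package is installed and uses its `CrossEncoder` class, whose `predict` method accepts a list of text pairs.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 3,
) -> list[str]:
    """Reorder candidates by a pairwise relevance score and keep the top_n."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    order = sorted(range(len(candidates)), key=lambda i: float(scores[i]), reverse=True)
    return [candidates[i] for i in order[:top_n]]


def load_cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return a scoring function backed by a cross-encoder (downloads weights on first use)."""
    from sentence_transformers import CrossEncoder  # imported lazily; optional dependency

    return CrossEncoder(model_name).predict


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a real model: counts shared lowercase tokens."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


# Hypothetical candidates produced by a first-stage retriever
candidates = [
    "Webhooks overview and setup",
    "Retry policy: a 429 response triggers backoff",
    "Billing FAQ",
]
top = rerank("429 retry", candidates, overlap_scorer, top_n=1)
```

To use a real model, pass `load_cross_encoder_scorer()` in place of `overlap_scorer`. Because the cross-encoder runs once per candidate, it is usually applied only to the few dozen documents that survive first-stage retrieval.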
Reranker options worth knowing:
- Cohere Rerank — managed API, easiest to integrate, strong out-of-the-box performance; charges per 1K searches
- Jina Reranker — managed API alternative with competitive benchmarks and a generous free tier
- Voyage AI Rerank — another managed option, particularly strong on technical and code-heavy content
- cross-encoder/ms-marco-MiniLM-L-6-v2 — open-source, runs locally via HuggingFace, good baseline for self-hosted setups
- FlashRank — lightweight open-source reranker library optimized for low-latency inference, easy drop-in for Python pipelines
Metadata Filtering Comes Before Embedding Search
For many domain-specific queries, the right answer isn't a better retrieval algorithm — it's filtering before you hit the embedding index.
If a user asks about "the v3 API rate limits," you should filter your document index by version=v3 before running any similarity search. Sending the query into a corpus that includes v1, v2, and v3 documentation and hoping the embedding model sorts it out is asking the retriever to do structural reasoning it wasn't designed for.
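The idea can be sketched in a few lines of plain Python, assuming each record carries a metadata dictionary and a precomputed vector. The record layout, the `filtered_search` name, and the two-dimensional toy vectors are all assumptions for illustration; a real vector store would apply the filter inside its index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec: list[float], records: list[dict], filters: dict, top_k: int = 3) -> list[dict]:
    """Apply exact-match metadata filters first, then rank the survivors by similarity."""
    scoped = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in filters.items())
    ]
    scoped.sort(key=lambda record: cosine(query_vec, record["vector"]), reverse=True)
    return scoped[:top_k]


# Hypothetical records with toy 2-d embeddings
records = [
    {"id": "rate-limits-v2", "metadata": {"version": "v2"}, "vector": [0.9, 0.1]},
    {"id": "rate-limits-v3", "metadata": {"version": "v3"}, "vector": [0.8, 0.2]},
    {"id": "auth-v3", "metadata": {"version": "v3"}, "vector": [0.1, 0.9]},
]
query_vec = [1.0, 0.0]
unfiltered = filtered_search(query_vec, records, {}, top_k=1)
scoped_to_v3 = filtered_search(query_vec, records, {"version": "v3"}, top_k=1)
```

Without the filter, the v2 document wins on similarity alone. With `version` set to `v3`, it is never a candidate, so the v3 rate-limit document comes back first.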
Vector stores with strong metadata filtering support:
- Pinecone — filter by metadata fields at query time using a MongoDB-style filter syntax
- Weaviate — supports pre-filtering (applied before ANN search) and post-filtering, with a GraphQL-based filter API
- Chroma — lightweight, runs locally, good for development; supports where and where_document filter clauses
- pgvector — if you're already on Postgres, pgvector lets you combine standard SQL WHERE clauses with vector similarity search in a single query, which is a natural fit for metadata scoping
Use Embeddings When
- Queries are natural language and conceptual
- Paraphrasing and synonyms matter
- Semantic closeness predicts relevance
Use BM25/Filters When
- Exact terms, codes, or identifiers matter
- Structural distinctions define the answer
- Metadata (version, category, date) can narrow scope
How to detect when your embedding retrieval is silently failing
The failure is silent by design — the system still returns something. To catch it:
- Build an evaluation set of domain-specific queries with known correct source documents. Run retrieval and measure Recall@K — what percentage of the time is the correct document in your top-K results?
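- Log retrieval scores alongside answers. If cosine similarity for your top result falls below a threshold you have calibrated on that evaluation set, treat the retrieval as a guess. The ~0.75 figure is only a starting point, since score distributions vary considerably between embedding models.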
- Test BM25 alone on your technical query set. If BM25 outperforms embeddings, that's diagnostic — your corpus has precision requirements that semantic search can't meet alone.
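The first and third checks above can share one small helper. This is a minimal sketch, assuming the evaluation set is a list of (query, expected document ID) pairs and that a retriever is any function returning ranked document IDs; the names and the canned results are hypothetical.

```python
from typing import Callable


def recall_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected document appears in the top-k results."""
    hits = sum(1 for query, expected_id in eval_set if expected_id in retrieve(query)[:k])
    return hits / len(eval_set)


# Hypothetical evaluation set: (query, ID of the document that should be retrieved)
eval_set = [
    ("retry behavior on 429", "webhooks-retries"),
    ("v3 rate limits", "rate-limits-v3"),
]

# Canned results standing in for a real retriever
canned_results = {
    "retry behavior on 429": ["webhooks-overview", "webhooks-retries"],
    "v3 rate limits": ["rate-limits-v2", "rate-limits-v1"],
}


def toy_retriever(query: str) -> list[str]:
    return canned_results[query]


recall_top2 = recall_at_k(eval_set, toy_retriever, k=2)
recall_top1 = recall_at_k(eval_set, toy_retriever, k=1)
```

Running the same evaluation set through a BM25-only retriever and an embedding-only retriever, then comparing the two numbers, gives the diagnostic described in the last bullet.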
Evaluation tools that make this measurable:
- Ragas — open-source RAG evaluation framework; measures context precision, context recall, and answer faithfulness against a ground-truth set
- LangSmith — LangChain's observability and evaluation platform; lets you trace retrieval steps, log scores, and run eval datasets against your pipeline
- TruLens — framework-agnostic RAG evaluation library with built-in retrieval quality metrics and a local dashboard for inspecting results
Takeaway
Embedding similarity is a recall mechanism, not a precision mechanism, and for domain-specific RAG, precision is exactly what you need. That means layering metadata filters, BM25 keyword search, and a cross-encoder reranker into your pipeline before you ever blame the LLM for bad answers.