Vector Databases Demystified: What Backend Engineers Actually Need to Know Before Picking One

March 21, 2026

Tags: RAG, Vector Databases, pgvector, Embeddings, LLM, Production AI

You're building your first RAG system. You've got embeddings working, retrieval makes sense conceptually, and then you hit the tooling wall: Pinecone? Weaviate? Milvus? Qdrant? pgvector? The options multiply fast, and every vendor's homepage makes it sound like you'll be lost without them.

Here's the honest take: most teams reach for a dedicated vector database too early. Let's fix that by understanding what these tools actually do — and when they earn their place in your stack.

It's Not a Database. It's a Search Engine for Numbers.

If you come from a relational database background, your mental model for queries is exact matching: WHERE user_id = 42. Vector similarity search is fundamentally different. You're asking: "What items in this dataset are closest to this query vector?"

That sounds simple until you realize that "closest" in 1,536-dimensional space (the output size of OpenAI's text-embedding-3-small) is expensive to compute naively. Brute-force search compares the query against every stored vector, so the cost is O(n · d) per query: for 1 million vectors at 1,536 dimensions, that is roughly 1.5 billion multiply-adds every time someone asks a question.

Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms. The key word is approximate. They trade a small amount of recall accuracy for massive speed gains by building index structures (like HNSW — Hierarchical Navigable Small World graphs) that let you skip most of the comparison work.

[Figure: a 2D scatter plot of vector points, with a query point connected to its nearest neighbors, overlaid on an HNSW graph structure]

The tradeoff in plain terms:

Exact search: 100% recall, O(n) cost per query, slow at scale
ANN search: typically ~95–99% recall (tunable), roughly O(log n) cost per query with HNSW, fast at scale

For RAG, that ~1–5% miss rate almost never matters. You're retrieving context for an LLM, not running a financial audit.
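If you want to put a number on that miss rate for your own data, compare whatever your approximate index returns against brute-force results on a sample of queries. The sketch below assumes unit-normalized embeddings in a NumPy array; `exact_top_k` and `recall_at_k` are illustrative names, not functions from any library, and the "approximate" result is faked by dropping one true neighbor.

```python
import numpy as np

def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> list[int]:
    """Brute-force ground truth: indices of the k highest dot products."""
    scores = vectors @ query
    return np.argsort(-scores)[:k].tolist()

def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the true top-k that the approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Synthetic data standing in for real embeddings (an assumption for the demo).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

truth = exact_top_k(query, vectors, k=10)

# Stand-in for an ANN result: keep 9 true neighbors, swap in one wrong id.
wrong_id = next(i for i in range(len(vectors)) if i not in truth)
approx = truth[:9] + [wrong_id]

print(f"recall@10 = {recall_at_k(approx, truth):.2f}")
```

In a real evaluation you would replace `approx` with the ids your index returns and average the recall over a few hundred representative queries.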

You Probably Don't Need a Vector DB Yet

Before you provision a Pinecone index, ask yourself: how many vectors are you actually storing?

Under 10K vectors: A NumPy array in memory is fine. Seriously.
10K–500K vectors: PostgreSQL with pgvector handles this comfortably on any decent instance.
500K–10M vectors: pgvector with HNSW indexing still works, or consider FAISS if you need pure speed.
10M+ vectors, low-latency requirements, multiple tenants: Now a dedicated vector DB earns its operational cost.
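A quick back-of-the-envelope memory estimate makes these tiers more concrete. This sketch assumes 1,536-dimensional float32 embeddings and counts only the raw vectors; index structures such as HNSW graphs, plus your metadata, add overhead on top. `raw_vector_bytes` is a hypothetical helper name.

```python
def raw_vector_bytes(num_vectors: int, dim: int = 1536, bytes_per_value: int = 4) -> int:
    """Raw storage for the embeddings alone (float32 by default), no index overhead."""
    return num_vectors * dim * bytes_per_value

for n in (10_000, 500_000, 10_000_000):
    gigabytes = raw_vector_bytes(n) / 1e9
    print(f"{n:>12,} vectors ~ {gigabytes:.2f} GB")
```

Under these assumptions the tier boundaries work out to about 61 MB at 10K vectors, about 3 GB at 500K, and about 61 GB at 10M, which is roughly where a single machine's RAM stops being a comfortable home for the whole index.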

Here's a minimal working retrieval setup that gets you surprisingly far:

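import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Embed a single string with OpenAI's embeddings endpoint."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return np.array(response.data[0].embedding)

def retrieve(query: str, corpus: list[tuple[str, np.ndarray]], top_k: int = 5) -> list[tuple[str, float]]:
    """Return the top_k (text, cosine similarity) pairs, best match first.

    corpus is your stored chunks: a list of (text, embedding) tuples.
    """
    query_vec = embed(query)
    # Cosine similarity between the query and every stored chunk
    scores = [
        (text, float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))))
        for text, vec in corpus
    ]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]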

The scoring step runs in milliseconds for thousands of chunks; the network call to embed the query will dominate your latency. Don't over-engineer before you have a real scaling problem.
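When the Python loop itself starts to show up in profiles, the usual next step is to stack the embeddings into one matrix, normalize the rows once, and score everything with a single matrix-vector product. A minimal sketch under those assumptions; `build_matrix` and `top_k` are illustrative names, and the query is passed in as a vector so the example runs without an API key.

```python
import numpy as np

def build_matrix(embeddings: list[np.ndarray]) -> np.ndarray:
    """Stack embeddings into an (n, d) matrix and L2-normalize each row once."""
    matrix = np.vstack(embeddings).astype(np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)  # guard against zero vectors

def top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k best rows, best first."""
    k = min(k, matrix.shape[0])
    if k <= 0:
        return []
    q = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)
    scores = matrix @ q  # cosine similarity, because rows and query are unit length
    # argpartition finds the k best in O(n); only those k are then fully sorted.
    candidates = np.argpartition(-scores, k - 1)[:k]
    ordered = candidates[np.argsort(-scores[candidates])]
    return [(int(i), float(scores[i])) for i in ordered]

# Tiny hand-made corpus so the ranking is easy to check by eye.
matrix = build_matrix([np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])])
print(top_k(np.array([1.0, 0.1]), matrix, k=2))
```

The returned indices point back into whatever list holds your chunk texts, so the text lookup stays a plain list index.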

[Figure: side-by-side comparison of a simple in-memory array ("small scale") and a distributed database cluster with interconnected nodes ("production scale")]

When You Do Need a Vector DB: How to Choose

Once you've outgrown in-memory or pgvector, here's a practical breakdown:

Tool	Best For	Hosted?	Operational Cost
pgvector	Teams already on Postgres, up to ~10M vectors with HNSW	Self-hosted or managed Postgres (e.g., RDS)	Low (you already know Postgres)
Pinecone	Fast start, no infra team, SaaS budget	Fully managed	Medium–High (usage-based pricing)
Weaviate	Rich metadata filtering + vector search	Both	Medium (more complex config)
Milvus	High throughput, self-hosted, 10M+ vectors	Self-hosted (or Zilliz cloud)	High (Kubernetes-native)

The decision tree that actually matters:

Do you already run Postgres? Start with pgvector. Seriously, just add the extension.
Do you have fewer than two engineers and no infra time? Pinecone's managed service is worth the cost.
Do you need complex metadata filters alongside vector search? Weaviate's filtered vector search is built for this.
Are you at 10M+ vectors with strict SLA requirements? Milvus or Qdrant, and budget for the ops complexity.
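For the first branch, "just add the extension" comes down to a handful of SQL statements. The sketch below only builds the SQL strings so it runs without a database. The table and column names (`chunks`, `content`, `embedding`) are made up for illustration, the `%s` placeholders assume a psycopg-style driver, and the HNSW index type assumes pgvector 0.5.0 or later.

```python
def pgvector_setup_sql(table: str = "chunks", dim: int = 1536) -> list[str]:
    """One-time setup statements. `table` is interpolated directly, so never
    pass untrusted input here."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        f"id bigserial PRIMARY KEY, content text NOT NULL, embedding vector({dim}));",
        # HNSW index using cosine distance; requires pgvector 0.5.0+.
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
        f"ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]

def pgvector_query_sql(table: str = "chunks") -> str:
    """Top-k query. <=> is pgvector's cosine distance operator (smaller is closer),
    so similarity is 1 minus the distance."""
    return (
        f"SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity "
        f"FROM {table} ORDER BY embedding <=> %s::vector LIMIT %s;"
    )

def to_vector_literal(vec) -> str:
    """Format a sequence of floats as pgvector's text form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

for statement in pgvector_setup_sql():
    print(statement)
print(pgvector_query_sql())
```

With a driver in hand, you would execute the setup statements once, then run the query with parameters like `(literal, literal, 5)`, where `literal = to_vector_literal(query_embedding)`.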

What is HNSW and why do most vector DBs use it?

HNSW (Hierarchical Navigable Small World) is a graph-based ANN algorithm that builds a multi-layer graph where each node connects to its nearest neighbors. At query time, you start at the top layer (sparse, long-range connections) and greedily navigate toward the query vector, dropping into lower layers for finer resolution.

The result: roughly O(log n) search cost in practice, with recall typically above 95% at sensible settings. HNSW needs more memory than a flat index because it stores the graph alongside the vectors, but at query time it beats tree-based alternatives like KD-trees, which degrade badly in high dimensions.
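The greedy navigation step is easier to see in code. The sketch below is not HNSW itself: it uses a single layer, builds the neighbor graph by brute force, and follows only one candidate at a time, whereas real implementations add the layer hierarchy, smarter neighbor selection, and a wider candidate list. It is meant only to show why a graph lets you skip most distance computations. All names are illustrative.

```python
import numpy as np

def build_knn_graph(vectors: np.ndarray, m: int = 8) -> np.ndarray:
    """Brute-force neighbor lists: row i holds the indices of i's m nearest points."""
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :m]

def greedy_search(vectors: np.ndarray, graph: np.ndarray, query: np.ndarray, entry: int = 0) -> tuple[int, int]:
    """Walk the graph toward the query, always moving to the closest neighbor.

    Returns (index of the local best point, number of distance evaluations).
    The result can be a local minimum rather than the true nearest neighbor.
    """
    current = entry
    current_dist = float(np.linalg.norm(vectors[current] - query))
    evaluations = 1
    while True:
        neighbors = graph[current]
        neighbor_dists = np.linalg.norm(vectors[neighbors] - query, axis=1)
        evaluations += len(neighbors)
        best = int(np.argmin(neighbor_dists))
        if neighbor_dists[best] >= current_dist:
            return current, evaluations  # no neighbor is closer: stop here
        current, current_dist = int(neighbors[best]), float(neighbor_dists[best])

rng = np.random.default_rng(42)
points = rng.normal(size=(300, 16))
graph = build_knn_graph(points, m=8)
found, evaluations = greedy_search(points, graph, rng.normal(size=16))
print(f"stopped at point {found} after {evaluations} distance evaluations (brute force: {len(points)})")
```

The approximation comes from that early stop: the walk ends at the first point with no closer neighbor, which is where the recall loss in ANN search originates and why production indexes keep several candidates in play instead of one.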

pgvector, Weaviate, Milvus, and Qdrant all offer HNSW indexes. pgvector and Milvus also support IVF (Inverted File Index) variants for different memory/speed tradeoffs. Pinecone is fully managed and does not ask you to pick an index algorithm.

[Figure: decision-tree flowchart for choosing a vector database, starting from "How many vectors?" and branching into NumPy, pgvector, Pinecone, and Milvus paths]

The Gotcha

Don't conflate vector search with your entire retrieval strategy. Vector similarity finds semantically similar chunks, but it can miss exact keyword matches (product codes, error messages, proper nouns). Production RAG systems often need hybrid search: vector similarity plus keyword search such as BM25, with the two result lists merged via reciprocal rank fusion. pgvector plus PostgreSQL's built-in full-text search gets you most of the way there inside one database, with two caveats: Postgres ranks with ts_rank rather than true BM25, and you write the fusion step yourself. Pinecone and Weaviate also support hybrid modes. If you build on pure vector search and wonder why your chatbot can't find "error code ERR_4291", this is why.
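Reciprocal rank fusion is small enough to write yourself. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed; k = 60 is the conventional default. A minimal sketch, assuming each retriever returns document ids ordered best first (the ids here are made up):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several best-first rankings into one list of (doc_id, fused score)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]    # from vector similarity search
keyword_hits = ["doc_d", "doc_a", "doc_b"]   # from keyword search; doc_d is an exact-match hit

for doc_id, score in reciprocal_rank_fusion([vector_hits, keyword_hits]):
    print(f"{doc_id}: {score:.4f}")
```

Documents that both retrievers agree on rise to the top, while `doc_d`, which only the keyword search found, still makes the merged list instead of being lost. Because RRF uses ranks rather than raw scores, you never have to reconcile cosine similarities with keyword relevance scores.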

The Takeaway

Start with pgvector or in-memory NumPy, measure your actual scale, and only graduate to a dedicated vector database when your latency or volume numbers demand it — not because the vendor blog post told you to.