Embeddings Are Just Coordinates: The Mental Model Every RAG Engineer Needs
Embeddings are the part of RAG that engineers nod at in architecture diagrams but quietly skip over. You know they go somewhere between your text chunks and your vector database — but if you're fuzzy on what they actually are, your retrieval will be fuzzy too.
Let's fix that in five minutes.
Text Doesn't Have Coordinates. Embeddings Give It Some.
Your database knows how to find rows. Your search index knows how to match keywords. But neither knows that "the server crashed" and "the backend went down" mean the same thing. That's the gap embeddings close.
An embedding model is a deterministic function that maps a string of text to a fixed-length array of floating-point numbers — a vector. That vector is a point in high-dimensional space (typically 768 to 3072 dimensions, depending on the model). The key insight:
Semantic similarity in language ≈ geometric proximity in vector space.
Think of it like a city map. "pizza" and "pasta" land in the same neighborhood. "pizza" and "mortgage rate" are across town. The embedding model learned this geography by training on massive amounts of text — it encodes meaning as location.
This is what makes semantic search work in RAG. You embed your document chunks at index time, embed the user's query at request time, and then find the chunks whose vectors are closest to the query vector. Closest, by convention, means highest cosine similarity: the cosine of the angle between two vectors, which ranges from -1 (opposite directions) to 1 (same direction).
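To make the -1 to 1 range concrete, here is a minimal sketch using hand-made 2-D vectors rather than real embeddings (which have hundreds or thousands of dimensions). The `cosine` helper and the toy vectors are illustrative assumptions, not output from any model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors, made up for illustration.
east = [1.0, 0.0]
also_east = [3.0, 0.0]   # same direction, different length
north = [0.0, 1.0]       # perpendicular
west = [-1.0, 0.0]       # opposite direction

print(cosine(east, also_east))  # 1.0  (same direction; length is ignored)
print(cosine(east, north))      # 0.0  (perpendicular directions)
print(cosine(east, west))       # -1.0 (opposite directions)
```

Note that vector length drops out entirely: only direction matters, which is why cosine similarity is a natural fit for comparing embeddings.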
The Code You'll Actually Write
import openai
import numpy as np

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    """Return the embedding vector for a single string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("Why did the server go down?")
doc = embed("The backend service crashed due to an OOM error.")
unrelated = embed("Best pasta recipes for beginners.")

# Scores below are illustrative; exact values vary by model and input.
print(cosine_similarity(query, doc))        # ~0.87: highly similar
print(cosine_similarity(query, unrelated))  # ~0.21: semantically distant
Same query. Two chunks. The score tells your retrieval layer which one to surface. This is the core loop of every RAG system.
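In a real index, that comparison runs against every stored chunk and the top few win. The sketch below shows just the ranking step, with made-up 3-D vectors standing in for real embeddings. `top_k` and the chunk data are hypothetical names for illustration, and a production system would normally delegate this work to a vector database.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunks, k=2):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up vectors; a real pipeline would get these from an embedding model.
chunks = [
    ("The backend service crashed due to an OOM error.", [0.9, 0.1, 0.0]),
    ("Best pasta recipes for beginners.", [0.0, 0.2, 0.9]),
    ("Restart procedure for the API servers.", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded user query

for score, text in top_k(query_vec, chunks, k=2):
    print(f"{score:.2f}  {text}")
```

With these toy vectors the two server-related chunks rank first and the pasta chunk is left out, which is exactly the filtering the retrieval layer performs before anything reaches the LLM.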
Closed vs. Open Embedding Models: Pick Your Tradeoff
You have two categories of embedding models to choose from:
Closed models (OpenAI text-embedding-3-small/large, Cohere embed-v3):
- ✅ One API call, no infrastructure
- ✅ Consistently strong benchmark performance
- ✅ Versioned and stable — your index stays valid
- ❌ Per-token cost at scale
- ❌ Data leaves your environment
Open models (Sentence Transformers, nomic-embed-text, bge-large-en):
- ✅ Run locally or on your own infra — zero marginal cost at scale
- ✅ Fine-tunable on your domain data
- ✅ Data never leaves your stack
- ❌ You own the deployment, versioning, and hardware
- ❌ Quality varies widely from model to model, so you have to benchmark candidates on your own data before committing
Pragmatic default: Start with text-embedding-3-small. It's cheap (~$0.02 per million tokens), fast, and strong enough for most production RAG workloads. Reach for open models when you have privacy constraints, massive embedding volume, or a narrow enough domain that fine-tuning pays off.
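As a back-of-the-envelope check on that price, the sketch below estimates the one-time cost of embedding a corpus. The rate is the approximate figure quoted above; the corpus size and average chunk length are hypothetical, and `estimate_embedding_cost` is an illustrative helper.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, approximate figure quoted above

def estimate_embedding_cost(num_chunks: int, avg_tokens_per_chunk: int) -> float:
    """Approximate USD cost to embed a corpus once."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Hypothetical corpus: 1 million chunks of about 500 tokens each.
cost = estimate_embedding_cost(num_chunks=1_000_000, avg_tokens_per_chunk=500)
print(f"${cost:.2f}")  # $10.00
```

Re-embedding the corpus after a model change costs the same again, which is worth budgeting for up front.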
The Gotcha: Embedding Models Are Not Interchangeable
Here's where engineers get burned: you cannot mix embedding models in the same index.
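If you embed your documents with text-embedding-3-small (1536 dimensions) and then query with nomic-embed-text (768 dimensions), the comparison fails outright because the vectors have different lengths. Even when two models happen to share a dimension count, their coordinate systems are unrelated, so similarity scores between them are nonsense. It's like navigating Mars with GPS coordinates from Earth. Switching models means re-embedding every document.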
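One way to guard against this is to store the model name and dimension alongside the index and check both before every insert or query. The sketch below is a minimal in-memory illustration; `VectorIndex` and its methods are hypothetical, not the API of any particular vector database.

```python
class VectorIndex:
    """Toy in-memory index that refuses vectors from a different model."""

    def __init__(self, model_name: str, dimensions: int):
        self.model_name = model_name
        self.dimensions = dimensions
        self.vectors: list[tuple[str, list[float]]] = []

    def _check(self, model_name: str, vector: list[float]) -> None:
        # Reject anything not produced by the model the index was built with.
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got {model_name!r}"
            )
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)}"
            )

    def add(self, text: str, vector: list[float], model_name: str) -> None:
        self._check(model_name, vector)
        self.vectors.append((text, vector))

index = VectorIndex(model_name="text-embedding-3-small", dimensions=1536)
index.add("doc chunk", [0.0] * 1536, model_name="text-embedding-3-small")

try:
    # A 768-dimensional vector from another model is rejected.
    index.add("other chunk", [0.0] * 768, model_name="nomic-embed-text")
except ValueError as err:
    print(f"rejected: {err}")
```

Checking the model name as well as the dimension matters because two different models can share a dimension count, in which case a length check alone would let the mismatch through.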
Beyond dimensions, models differ in training data, tokenization, and what they optimize for. A model trained on code will outperform a general-purpose model on a codebase RAG system. A model fine-tuned on legal documents will retrieve clauses more accurately than one that's never seen a contract.
Weak embedding model = broken RAG pipeline — regardless of how powerful your LLM is. The retrieval step is the ceiling. If the wrong chunks go in, no amount of GPT-4 intelligence fixes the output.
Takeaway
Embeddings are a coordinate system for meaning — and choosing the right model for that coordinate system is the first architectural decision your RAG pipeline depends on.