Context Engineering: How to Stop Stuffing Your LLM's Brain and Start Managing It

March 21, 2026

Tags: Context Engineering, LLM, Prompt Engineering, RAG, OpenAI, Anthropic, AI Engineering

Your LLM has a 128k token context window. So you throw everything in — the full conversation history, five retrieved documents, the system prompt, the user profile, the product catalog. It fits. Ship it.

Three weeks later your AI feature is slow, expensive, and giving weirdly wrong answers. Welcome to the context window trap.

This is the problem that's spawned an entirely new engineering discipline: context engineering — the art and science of deciding what goes into your model's context, in what order, and why.

What Is a Context Window, Really?

Think of the context window as your LLM's working memory. Everything the model can "see" when generating a response lives in this window — your system prompt, conversation history, retrieved documents, tool outputs, and the user's current message.

Modern context windows are impressively large:

GPT-4o: 128k tokens (~96,000 words)
Claude 3.5 Sonnet: 200k tokens (~150,000 words)
Gemini 1.5 Pro: 1M–2M tokens (yes, really)

At first glance, this feels like the problem is solved. Just dump everything in. But here's what the research actually shows:

Bigger context ≠ better performance. Studies on the "lost in the middle" phenomenon show that LLMs attend most reliably to information at the beginning and end of the context window and tend to underweight content buried in the middle. If your most important retrieved document lands around token 45,000 of a 100k context, the model may effectively ignore it.

And that's before we talk about cost. Most providers charge per token, so input cost scales linearly with context size: a bloated context can cost 5–10x more per request than a well-engineered one.
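To make the cost point concrete, here is a back-of-the-envelope sketch. The price per million input tokens, the request volume, and the two context sizes are made-up placeholders, not any provider's actual rates, so substitute your own numbers.

```python
def monthly_input_cost(tokens_per_request: int,
                       requests_per_day: int,
                       usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Estimate monthly spend on input tokens alone (output tokens excluded)."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical numbers, for illustration only.
PRICE = 2.50               # USD per 1M input tokens (placeholder)
REQUESTS_PER_DAY = 10_000  # placeholder traffic

stuffed = monthly_input_cost(80_000, REQUESTS_PER_DAY, PRICE)
curated = monthly_input_cost(10_000, REQUESTS_PER_DAY, PRICE)

print(f"stuffed context: ${stuffed:,.0f}/month")
print(f"curated context: ${curated:,.0f}/month")
print(f"ratio: {stuffed / curated:.0f}x")
```

Under these assumed figures the stuffed context costs eight times as much as the curated one, purely because input cost is linear in tokens per request.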

The Four Layers of Context

Before you can manage context well, you need a mental model for what's actually in it. Think of context as four distinct layers, each with different characteristics:

1. Persistent Context (System Prompt): This is what stays constant across every request. Your persona definition, behavioral rules, output format instructions, and application-level constraints live here. It's the most expensive token real estate because you pay for it on every single call.

2. Background Context (Retrieved Knowledge): This is what you pull in dynamically: RAG chunks, database lookups, tool results. It's large and variable. This is where most context bloat happens.

3. Conversational Context (Chat History): The back-and-forth exchange with the user. This grows unbounded in long sessions if you're not careful.

4. Active Context (Current Turn): The user's immediate input and any real-time data attached to it. Usually small, but can spike with file uploads or large pastes.

Most engineers treat all four layers the same. They don't behave the same.
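One way to keep the four layers distinct in code is to hold them in separate fields and only flatten them at the last moment. The sketch below is a minimal illustration, not a prescribed design: the class name `ContextLayers`, its field names, and the "Reference material / Question" wrapper are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ContextLayers:
    """The four layers described above, kept separate until assembly."""
    persistent: str                                          # system prompt
    background: list[str] = field(default_factory=list)     # retrieved chunks
    conversation: list[dict] = field(default_factory=list)  # prior turns
    active: str = ""                                         # current user input

    def to_messages(self) -> list[dict]:
        """Flatten the layers into a chat-style message list."""
        messages = [{"role": "system", "content": self.persistent}]
        messages.extend(self.conversation)
        # Retrieved knowledge travels with the current turn so it sits
        # close to the question it is meant to answer.
        user_content = self.active
        if self.background:
            docs = "\n\n".join(self.background)
            user_content = f"Reference material:\n{docs}\n\nQuestion: {self.active}"
        messages.append({"role": "user", "content": user_content})
        return messages


ctx = ContextLayers(
    persistent="You are a support assistant.",
    background=["Refunds are processed within 5 days."],
    conversation=[
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
    active="How long do refunds take?",
)
messages = ctx.to_messages()
```

Keeping the layers apart like this makes it straightforward to apply a different policy to each one (audit the persistent layer, rank the background layer, compress the conversational layer) before anything is sent to the model.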

Practical Strategies That Actually Work

Strategy 1: Compress, Don't Truncate

The naive approach to a growing chat history is to cut old messages off. This works until it doesn't — users get confused when the model forgets what they said two minutes ago.

A better approach: summarize old turns into a rolling summary, then keep only the last N turns verbatim.

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def build_context_with_summary(messages: list[dict], keep_last_n: int = 6) -> list[dict]:
    """Summarize older turns and keep the last `keep_last_n` turns verbatim."""
    # Index where the verbatim tail starts; nothing to compress if it is <= 0.
    # (Slicing by index also avoids the messages[:-0] == [] pitfall.)
    split = len(messages) - keep_last_n
    if split <= 0:
        return messages

    # Summarize everything except the recent tail
    older_messages = messages[:split]
    recent_messages = messages[split:]

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older_messages)
    summary_prompt = [
        {"role": "user", "content": (
            "Summarize this conversation history in 3-5 sentences, "
            "preserving key facts and decisions:\n\n" + transcript
        )}
    ]

    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Use a cheap model for summarization
        messages=summary_prompt,
        max_tokens=200
    )

    summary_text = summary_response.choices[0].message.content

    # Inject summary as a system message at the top of recent history
    return [
        {"role": "system", "content": f"[Earlier conversation summary]: {summary_text}"}
    ] + recent_messages

Note the trick: use a cheaper, faster model (like gpt-4o-mini) for the summarization step. You're not summarizing with your main model — that's expensive and slow.
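The word "rolling" matters here. If you re-summarize the entire history on every turn, the summarization call itself keeps growing. A hedged sketch of the incremental variant follows: it folds only the newly evicted turns into the previous summary. The `Summarizer` callable and the `roll_summary` name are hypothetical, introduced for this example; in practice the callable would wrap a call to a cheap model, and the stub below only stands in for it.

```python
from typing import Callable

Summarizer = Callable[[str], str]  # hypothetical: text in, short summary out


def roll_summary(summary: str,
                 messages: list[dict],
                 summarize: Summarizer,
                 keep_last_n: int = 6) -> tuple[str, list[dict]]:
    """Fold turns that fall out of the verbatim window into the running summary.

    Returns the updated summary and the messages to keep verbatim.
    """
    split = max(len(messages) - keep_last_n, 0)
    if split == 0:
        return summary, messages  # nothing has left the window yet

    evicted, recent = messages[:split], messages[split:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in evicted)
    # Only the previous summary plus the newly evicted turns are sent to
    # the summarizer, so the cost per call stays roughly constant.
    new_summary = summarize(f"Summary so far: {summary}\n\nNew turns:\n{transcript}")
    return new_summary, recent


def fake_summarize(text: str) -> str:
    # Stand-in for a real LLM call; it only reports how much text it was given.
    return f"<summary of {len(text)} chars>"


history = [{"role": "user", "content": f"message {i}"} for i in range(8)]
summary, history = roll_summary("", history, fake_summarize, keep_last_n=6)
```

The caller stores the returned summary alongside the trimmed history and passes both back in on the next turn.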

Strategy 2: Rank Your RAG Chunks Before Injection

When you retrieve five documents to answer a question, they're not equally relevant. Don't inject them in arbitrary order.

Two techniques work well here:

Reranking: Use a cross-encoder model (like Cohere's Rerank API) to score retrieved chunks against the user's query, then inject only the top 2–3.
Position-aware injection: Put your most important chunk last, not first. Remember the lost-in-the-middle problem — recency gets attention.

import cohere

co = cohere.Client("your-api-key")

def get_ranked_chunks(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=chunks,
        top_n=top_n
    )
    # Return chunks ordered by relevance score (highest last for recency bias)
    ranked = [chunks[r.index] for r in sorted(response.results, key=lambda x: x.relevance_score)]
    return ranked  # Least relevant first, most relevant closest to the query
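Once the chunks are ordered, the prompt still has to be assembled so that ordering is preserved. The sketch below shows one possible layout, assuming the chunk list arrives least relevant first (as `get_ranked_chunks` returns it). The function name `build_rag_prompt`, the `[Document N]` labels, and the instruction wording are illustrative choices, not a required format.

```python
def build_rag_prompt(query: str, ranked_chunks: list[str]) -> str:
    """Assemble a prompt from chunks ordered least relevant first.

    The final chunk (the most relevant one) ends up directly above the
    question, away from the middle of the context.
    """
    sections = [
        f"[Document {i}]\n{chunk.strip()}"
        for i, chunk in enumerate(ranked_chunks, start=1)
    ]
    context_block = "\n\n".join(sections)
    return (
        "Answer using only the documents below.\n\n"
        f"{context_block}\n\n"
        f"Question: {query}"
    )


prompt = build_rag_prompt(
    "What is the refund window?",
    ["Shipping takes 3 days.", "Refunds are accepted within 30 days."],
)
```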

Strategy 3: Token Budget Your System Prompt

Your system prompt is a fixed cost. Audit it ruthlessly.

A practical rule: your system prompt should be under 500 tokens for most applications. If it's creeping toward 2,000 tokens, you've probably got instructions that belong in your RAG pipeline, not your system prompt.

Ask yourself for every sentence in your system prompt: Does this need to be here on every single call, or only when it's relevant? Persona and rules: yes. Product catalog details: no. That's what retrieval is for.
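A quick audit along these lines can be scripted. The sketch below estimates tokens with the rough ratio of 0.75 words per token (the same ratio behind "128k tokens ≈ 96,000 words"), which is only an approximation; for real numbers, count with your model's own tokenizer. The function names and the idea of splitting the prompt into named sections are assumptions made for this example.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English-text ratio; a real tokenizer will differ


def estimate_tokens(text: str) -> int:
    """Approximate token count from the whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def audit_system_prompt(sections: dict[str, str], budget: int = 500) -> dict:
    """Report estimated tokens per named section and whether the total fits."""
    per_section = {name: estimate_tokens(body) for name, body in sections.items()}
    total = sum(per_section.values())
    return {
        "per_section": per_section,
        "total": total,
        "over_budget": total > budget,
        # Largest sections first: the likeliest candidates to move into retrieval.
        "largest_first": sorted(per_section, key=per_section.get, reverse=True),
    }


report = audit_system_prompt({
    "persona": "You are a helpful assistant.",
    "catalog": "item " * 600,  # placeholder for pasted product catalog text
})
```

In this made-up example the catalog section dominates the estimate and pushes the prompt over budget, which is the signal to move it behind retrieval.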

The Gotcha: Long Context Doesn't Fix Bad Retrieval

Here's the mistake engineers make when they discover their model has a 200k context window: they stop investing in their retrieval layer.

"Why bother with precise chunking and reranking? I'll just throw all 50 documents in."

This is exactly backwards. A larger context window amplifies retrieval quality problems — now you have more irrelevant content for the model to get confused by, not less. The lost-in-the-middle effect gets worse, not better, as context grows.

Large context windows are a safety net, not a strategy. Use them for genuinely long documents (contracts, codebases, research papers) where you need the full text. For most Q&A and chat applications, precise retrieval into a tight context window will usually outperform context stuffing.

A Mental Model to Carry Forward

Think of your context window like RAM in a computer. You could open 47 browser tabs. But your system runs better when you close the ones you're not using.

Every token in your context is competing for the model's attention. Irrelevant tokens don't just waste money — they dilute the signal of the tokens that matter.

Context engineering is the discipline of making sure that when your model reads its working memory, the most important information is impossible to miss.

The Takeaway

A massive context window is a tool, not a strategy. Ruthlessly curate what goes in, where it sits, and how much of it you actually need, and your AI application will be faster, cheaper, and more accurate than one that stuffs everything in and hopes for the best.