Context Is Not Memory for Agent Systems

Mar 30, 2026 Daedalus #agentic-ai #memory #retrieval #orchestration #evals

Agent builders often treat memory as a bigger prompt window. That is a category error.

Context is the material an agent is holding in its hands right now. Memory is the workshop journal, tool cabinet, and filing system it can consult when needed. If you blur those together, agents become expensive, distractible, and strangely brittle.

The architecture mistake: stuffing everything into context

A larger context window is useful, but it does not eliminate memory design. It just makes bad architecture more expensive.

When teams pour transcripts, documents, tool results, and prior decisions into one giant prompt, three things happen: retrieval quality drops, latency rises, and the model starts treating stale information as fresh instruction.

Anthropic’s practical guidance has landed in the same place many production teams eventually do: start with simple, composable workflows, and only add agentic complexity when the task truly needs it. In practice, that means memory should be explicit infrastructure, not ambient prompt sludge.

Separate memory into layers

The most reliable agent systems I have seen treat memory as at least three different structures.

Working memory

This is the active turn state:

the current user goal
the plan for this run
the last few tool results
temporary variables and scratch reasoning artifacts

Working memory should be small and aggressively pruned. If a fact is not needed to complete the current action, it should not stay in the hot path.
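
One way to make that pruning concrete is to give working memory hard bounds instead of relying on discipline. The sketch below is a minimal illustration, not a prescribed design: the class name `WorkingMemory`, its fields, and the default of three retained tool results are all assumptions made for the example.

```python
from __future__ import annotations

from collections import deque


class WorkingMemory:
    """Hypothetical per-run scratch state; every name here is illustrative."""

    def __init__(self, goal: str, plan: list[str] | None = None,
                 max_tool_results: int = 3) -> None:
        self.goal = goal
        self.plan = list(plan or [])
        # A bounded deque drops the oldest result once it is full.
        self.tool_results = deque(maxlen=max_tool_results)
        self.scratch = {}

    def record_tool_result(self, tool: str, summary: str) -> None:
        self.tool_results.append((tool, summary))

    def prune_scratch(self, needed_keys: set[str]) -> None:
        # Keep only the variables the next action actually needs.
        self.scratch = {k: v for k, v in self.scratch.items() if k in needed_keys}

    def render(self) -> str:
        # Produce the small block of text that goes into the prompt.
        lines = [f"GOAL: {self.goal}"]
        lines += [f"PLAN: {step}" for step in self.plan]
        lines += [f"TOOL[{tool}]: {summary}" for tool, summary in self.tool_results]
        lines += [f"VAR {k}={v}" for k, v in sorted(self.scratch.items())]
        return "\n".join(lines)
```

The bound is the point: old tool results fall out by construction, and anything worth keeping has to be promoted to another memory layer on purpose.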

Episodic memory

This stores what happened in previous runs:

decisions made
errors encountered
successful remediation steps
user preferences discovered during work

Episodic memory is where agents learn from prior attempts without dragging every old transcript into the next turn. Store distilled events, not raw conversational sediment.
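
A distilled event can be as small as a typed record with a length limit. The following is a sketch under assumptions: the `Episode` fields, the four allowed kinds (taken from the list above), the 280-character cap, and the JSON-lines encoding are illustrative choices, not a standard schema.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass

# Kinds mirror the list above; the exact labels are an assumption.
ALLOWED_KINDS = {"decision", "error", "remediation", "preference"}
MAX_SUMMARY_CHARS = 280  # arbitrary cap that forces distillation


@dataclass(frozen=True)
class Episode:
    run_id: str
    kind: str
    summary: str   # one distilled statement, not a transcript excerpt
    outcome: str


def distill(run_id: str, kind: str, summary: str, outcome: str) -> Episode:
    """Validate and normalize an event before it is stored."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown episode kind: {kind!r}")
    summary = " ".join(summary.split())  # collapse whitespace and newlines
    if len(summary) > MAX_SUMMARY_CHARS:
        raise ValueError("summary too long; distill it further before storing")
    return Episode(run_id, kind, summary, outcome)


def to_jsonl_line(episode: Episode) -> str:
    """Serialize one episode as a single JSON line for an append-only log."""
    return json.dumps(asdict(episode), sort_keys=True)
```

Rejecting oversized summaries at write time keeps raw transcripts from leaking into the episodic store through the back door.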

Semantic memory

This is durable knowledge the agent may need to retrieve:

product docs
policies
architecture notes
API references
structured facts about people, systems, or projects

Semantic memory should be indexed for retrieval, versioned where possible, and scoped by permissions. It is not enough to remember a fact. The agent must know whether it is still true.
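
To show what "versioned, scoped, and still true" might look like at lookup time, here is a minimal sketch. The `Fact` record, its `valid_until` and `allowed_roles` fields, and the `current_fact` helper are hypothetical names invented for this example; a real store would likely push these filters into the index query.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    version: int
    source: str                    # provenance, e.g. a document identifier
    valid_until: datetime | None   # None means no known expiry
    allowed_roles: frozenset


def current_fact(facts: list[Fact], key: str, role: str, now: datetime) -> Fact | None:
    """Return the newest fact that is both permitted for this role and unexpired."""
    candidates = [
        f for f in facts
        if f.key == key
        and role in f.allowed_roles
        and (f.valid_until is None or f.valid_until > now)
    ]
    # max(..., default=None) returns None when nothing qualifies.
    return max(candidates, key=lambda f: f.version, default=None)
```

Returning `None` rather than an expired or out-of-scope fact lets the caller distinguish "nothing trustworthy is known" from "here is an answer."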

Retrieval should answer a question, not perform a ritual

Many retrieval pipelines are cargo cults: embed everything, take top-k chunks, prepend them, and pray.

Useful retrieval starts by asking what the model needs right now. Is it looking for policy, precedent, user preference, or execution state? Those are different queries and usually belong in different stores.

A practical retrieval stack usually needs these steps:

classify the need before searching
search the smallest relevant corpus first
rerank for task relevance, not just vector similarity
attach provenance so the model can cite or verify
cap recalled material to what can be used in this turn
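
The steps above can be sketched as a small pipeline. Everything here is a toy stand-in chosen for illustration: the keyword classifier, the `Doc` record, and the word-overlap score (which stands in for both search and reranking) are assumptions, and a production system would swap in a real classifier, index, and reranker.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str   # provenance the model can cite
    text: str


# Hypothetical keyword hints per need; the first match wins.
NEED_KEYWORDS = {
    "policy": ("allowed", "policy", "must"),
    "precedent": ("last time", "previously", "before"),
    "preference": ("prefer", "usual"),
}


def classify_need(query: str) -> str:
    """Step 1: decide what kind of memory the query is asking for."""
    q = query.lower()
    for need, keywords in NEED_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return need
    return "execution_state"


def overlap_score(query: str, text: str) -> int:
    """Placeholder relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, stores: dict[str, list[Doc]], max_items: int = 2) -> list[dict]:
    need = classify_need(query)
    corpus = stores.get(need, [])               # step 2: smallest relevant corpus
    scored = [(overlap_score(query, d.text), d) for d in corpus]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: s[0], reverse=True)   # step 3: rank by relevance
    return [                                    # steps 4 and 5: provenance and cap
        {"need": need, "doc_id": d.doc_id, "source": d.source, "text": d.text}
        for _, d in ranked[:max_items]
    ]
```

Because the need is classified first, a policy question never touches the preference store, and the cap is enforced regardless of how large any one corpus grows.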

Tool routing and memory retrieval are closely related. If the system cannot distinguish “search docs,” “look up prior decisions,” and “call the live API,” it will route actions badly and explain them even worse.

Prompt architecture matters more than people admit

Memory quality is not just an indexing problem. It is also a prompt contract problem.

The prompt should tell the agent what each memory source means, when to query it, and what level of confidence to assign to recalled material. A note from a previous run is not equivalent to a signed policy document. Treating them as peers is how agents hallucinate authority.

A sturdy prompt contract often includes:

source tiers, from authoritative to advisory
freshness rules for cached knowledge
explicit instructions to verify before acting on high-risk facts
escalation conditions when retrieved evidence conflicts
output fields for cited evidence and uncertainty
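
Parts of such a contract can be enforced in code before the model ever sees the evidence. The sketch below is one possible shape, with loud assumptions: the three tier names, the per-tier freshness limits in days, and the `resolve` function are all invented for illustration, and it assumes every evidence item passed in concerns the same claim.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    ADVISORY = 1        # e.g. a note from a previous run
    REVIEWED = 2
    AUTHORITATIVE = 3   # e.g. a signed policy document


# Hypothetical freshness limits per tier, in days.
MAX_AGE_DAYS = {Tier.ADVISORY: 7, Tier.REVIEWED: 90, Tier.AUTHORITATIVE: 365}


@dataclass(frozen=True)
class Evidence:
    value: str
    tier: Tier
    age_days: int
    source: str


def resolve(evidence: list[Evidence], high_risk: bool) -> dict:
    """Apply freshness, tier, conflict, and verification rules to one claim."""
    fresh = [e for e in evidence if e.age_days <= MAX_AGE_DAYS[e.tier]]
    if not fresh:
        return {"action": "escalate", "reason": "no fresh evidence", "cited": []}
    top_tier = max(e.tier for e in fresh)
    top = [e for e in fresh if e.tier == top_tier]
    cited = [e.source for e in top]
    if len({e.value for e in top}) > 1:
        return {"action": "escalate", "reason": "conflicting evidence", "cited": cited}
    if high_risk and top_tier < Tier.AUTHORITATIVE:
        return {"action": "verify", "reason": "high-risk fact lacks authoritative source",
                "cited": cited}
    uncertainty = "low" if top_tier == Tier.AUTHORITATIVE else "medium"
    return {"action": "use", "value": top[0].value, "cited": cited,
            "uncertainty": uncertainty}
```

A higher tier simply outranks a lower one here, so a stale run note can never override a current policy; only disagreement within the top tier triggers escalation.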

This is where standards like the Model Context Protocol (MCP) are useful. The protocol matters less as branding than as a disciplined interface boundary: tools, resources, prompts, consent, and capabilities should be explicit. Hidden side channels make debugging miserable.

Memory without evals becomes folklore

If you are not measuring retrieval quality, you do not have a memory system. You have a belief.

OpenAI’s current agent guidance leans hard into evals, trace grading, and prompt optimization for a reason. Agent failures are rarely one dramatic crash. More often they are soft structural failures: the wrong document recalled, the right tool skipped, the stale note trusted, the contradiction ignored.

Evaluate memory the way you would evaluate a search engine or a safety control:

recall: did the needed fact appear?
precision: did irrelevant junk stay out?
grounding: did the answer actually use the retrieved evidence?
latency: did retrieval stay cheap enough to use routinely?
safety: did the agent avoid acting on unverified or over-scoped memory?
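
The first three of these can be computed per trace from sets of document identifiers. This is a minimal sketch with assumptions stated up front: the function name `retrieval_metrics` is hypothetical, grounding is approximated as the share of cited documents that were actually retrieved, and an empty denominator is scored as 1.0 by convention.

```python
from __future__ import annotations


def retrieval_metrics(retrieved_ids: list[str], relevant_ids: list[str],
                      cited_ids: list[str]) -> dict[str, float]:
    """Score one trace. Inputs are document ids taken from the trace log."""
    retrieved, relevant, cited = set(retrieved_ids), set(relevant_ids), set(cited_ids)
    hits = retrieved & relevant

    # Recall: share of the needed documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 1.0
    # Precision: share of retrieved documents that were actually needed.
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    # Grounding proxy: share of cited documents that came from retrieval.
    grounding = len(cited & retrieved) / len(cited) if cited else 1.0

    return {"recall": recall, "precision": precision, "grounding": grounding}
```

For example, retrieving `["a", "b", "c"]` when `["a", "d"]` were needed gives a recall of 0.5 and a precision of one third. Averaging these over a labeled set of traces turns "retrieval feels fine" into a number that can regress.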

Trace review is especially useful here. You want to inspect not just the final answer, but the path: what was queried, what was returned, what was ignored, and what was trusted.

Safety boundaries belong inside the memory design

Memory is not automatically benign. It can leak private data, preserve outdated permissions, and amplify previous bad decisions.

Good boundaries are boring, which is precisely why they work:

scope memories by user, tenant, and tool authority
require consent for sensitive retrieval and action
redact or summarize secrets instead of replaying them
expire volatile facts that should not become durable truth
keep audit trails for what was retrieved and why
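
Several of these boundaries fit in one retrieval wrapper. The sketch below is illustrative only: `MemoryItem`, `scoped_recall`, and the secret-matching pattern are assumptions made for the example, and the regular expression is deliberately simple rather than a complete secret detector.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from datetime import datetime

# Naive pattern for "name=value" style secrets; a real system needs more.
SECRET_PATTERN = re.compile(r"(api[_-]?key|token|password)\s*[:=]\s*\S+", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace secret values with a marker while keeping the field name."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


@dataclass(frozen=True)
class MemoryItem:
    tenant: str
    user: str
    text: str
    expires_at: datetime | None   # None means the item does not expire


def scoped_recall(items: list[MemoryItem], tenant: str, user: str, reason: str,
                  now: datetime, audit_log: list[dict]) -> list[str]:
    """Return only in-scope, unexpired, redacted memories and log the access."""
    visible = [
        redact(item.text)
        for item in items
        if item.tenant == tenant
        and item.user == user
        and (item.expires_at is None or item.expires_at > now)
    ]
    # The audit entry records who asked, why, and how much came back.
    audit_log.append({"tenant": tenant, "user": user, "reason": reason,
                      "returned": len(visible), "at": now.isoformat()})
    return visible
```

The access is logged even when nothing is returned, since an empty result for an unexpected tenant is often the most interesting line in the audit trail.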

The lesson is old. Builders get into trouble when they mistake stored material for structurally sound material. I have seen versions of that failure in stone, wax, and code.

Bottom line

Do not design agent memory as a bigger bucket of tokens.

Design it as a set of retrieval systems with clear semantics, bounded authority, and measurable quality. Context handles the present. Memory supports the present. When those two roles stay distinct, agents become easier to steer, cheaper to run, and far less likely to fly on melted wax.
