Your Agent Doesn't Need More Context. It Needs Memory Layers
The common failure mode
When an agent starts forgetting things, most teams reach for a larger context window.
That is understandable. It is also usually the wrong fix. In production systems, the real problem is rarely that the model saw too little text. The problem is that the runtime has no disciplined way to decide what belongs in the prompt, what belongs in storage, and what should never be promoted to memory at all.
Context is not memory
Context is the working set. Memory is the persistence policy.
Those are different jobs. A prompt window is good at holding the current task, recent tool outputs, and the constraints needed for the next decision. It is not a durable store for user preferences, prior failures, or cross-session lessons.
Generative Agents made this distinction explicit early: believable long-horizon behavior depended on storing experiences, reflecting on them, and retrieving the relevant subset for planning. MemGPT pushed the same idea further with an operating-system analogy: fast memory is scarce, so the system needs active movement between memory tiers instead of pretending the top tier is infinite.
A practical memory stack
If you are building tool-using agents, a simple three-layer design goes a long way.
Layer 1: Working memory
This is the active prompt for the current turn or thread.
Keep it tight:
- the current goal
- the last few user and assistant messages
- recent tool outputs
- explicit task constraints
- the plan for the next step
If you cannot explain why a piece of text is needed for the next action, it probably does not belong here.
Layer 2: Episodic memory
This stores what happened in prior runs.
Useful examples include:
- a deployment failed because a secret was missing
- a tool returned malformed JSON on retry three
- the user prefers short summaries in Discord
- a repo requires CI verification before reporting success
This layer should be queryable, timestamped, and attached to provenance. Otherwise the agent will inherit folklore instead of facts.
Layer 3: Semantic memory
This is the distilled layer: stable preferences, operating rules, and durable facts.
Examples:
- the canonical path of a repository
- approval requirements for destructive actions
- which agent handles which class of work
- long-lived user preferences that have been confirmed
If episodic memory is the logbook, semantic memory is the flight manual. Confusing the two is how agents become confidently wrong across multiple sessions.
The write path matters more than retrieval quality
Most teams obsess over retrieval ranking. The worse failures come from bad writes.
An agent guesses that a one-off instruction is a permanent preference. It stores a transient outage as a standing rule. It promotes a tool hallucination into durable memory. By the time retrieval runs, the damage is already done.
A safer write policy is boring, which is precisely why it works:
- write only facts with downstream value
- separate observations from conclusions
- require confidence or confirmation before long-term promotion
- store provenance with every memory item
- expire or review memory that was inferred rather than confirmed
Short prompts fail loudly. Bad memory fails quietly, which is considerably more dangerous.
Retrieval should be conditional, not automatic
Not every memory deserves to be loaded just because it exists.
LangGraph’s split between thread-scoped short-term memory and cross-thread long-term memory is useful because it forces this question: what must be available now, and what should remain offstage until it is relevant. That framing reduces prompt bloat and lowers the chance that stale memories hijack the current task.
In practice, retrieval works best when tied to clear triggers:
- task domain, such as coding or support
- user identity or workspace
- recent failure class
- explicit references to prior work
- tool choice, such as deployment versus research
This is also where protocols like MCP help. Once tools and data sources are explicit interfaces instead of hidden prompt magic, the runtime can decide which memory surfaces and tools are relevant before the model begins improvising.
Bottom line
If your agent keeps “forgetting,” do not immediately buy a larger prompt.
Build memory layers. Keep working memory small, make episodic memory searchable, promote only verified facts into semantic memory, and treat memory writes as a policy surface. Reliable agents are not the ones that remember everything. They are the ones that remember the right things at the right layer.