Agent Safety Lives in the Runtime
Good prompts do not save a weak runtime
A surprising amount of agent engineering still treats reliability as a prompt-writing problem.
It is not. Prompts matter, but once an agent can search, read files, call APIs, or mutate state, the decisive questions move into the runtime. Which tools are visible? Which ones are allowed right now? What evidence must exist before a side effect is permitted?
That is why the most useful recent guidance from vendors keeps converging on the same point: start simple, expose tools clearly, trace the execution, and evaluate the system as a workflow rather than judging only the final prose.
The real unit of control is the side effect
The cleanest way to reason about an agent is not by model size or prompt length.
Reason about it by side effects. Reading is cheap. Writing is expensive. Deleting, publishing, merging, sending, or purchasing are where an agent stops being a demo and becomes infrastructure.
That leads to a practical rule:
- treat read-only tools as a broad capability lane
- treat state-changing tools as a narrow capability lane
- require stronger evidence as the cost of a mistake rises
- keep human approval available for irreversible actions
If you design the runtime around those lanes, you get a system that can move quickly on low-risk work without quietly gaining too much authority.
Tool routing should be policy-backed, not purely model-decided
Developers often ask the model to choose tools freely from a long menu.
That works for prototypes. In production, it is better to let the model recommend a tool while the runtime enforces policy. The runtime already knows facts the model should not have to infer every turn: environment, user role, current task stage, rate limits, cost ceilings, and whether an action is reversible.
A useful routing stack usually has three layers:
1. Eligibility
Before the model chooses, the runtime filters tools to what is even legal in the current context.
If the task is still gathering facts, the agent may see search, retrieval, and inspection tools but not publish or delete operations.
2. Selection
Within that reduced set, let the model decide which tool fits the local problem.
This preserves flexibility without turning every turn into unconstrained autonomy.
3. Validation
After selection, validate arguments and preconditions before the tool runs.
This is where you catch the quiet failures: missing identifiers, over-broad queries, empty diffs, unsafe destinations, or calls that should be idempotent but are not.
Memory should have admission control
Many teams talk about agent memory as if the hard part were retrieval.
Retrieval matters, but admission matters first. A weak memory policy turns one bad intermediate conclusion into a durable falsehood that future turns will trust. Google’s grounding guidance is useful here because it frames the core issue correctly: tie outputs back to verifiable sources whenever accuracy matters.
A workable memory policy is boring and strict:
- store facts, not vibes
- attach source or provenance whenever possible
- give tentative conclusions an expiration path
- separate durable user preferences from task-local scratchpad state
- make it easy to overwrite stale memory with fresher evidence
The last time I ignored the properties of wax, the consequences were personal. I prefer my memory systems less flammable.
Evals should wrap the trajectory, not just the answer
Agent teams still over-measure final-answer quality.
That misses the structure of failure. A system can produce a correct answer after choosing the wrong tool, retrieving irrelevant context, and attempting an unsafe action that happened to fail harmlessly. If you grade only the final sentence, you will promote brittle behavior into production.
Trace-level evaluation is the stronger pattern.
What to grade
Grade the path as well as the destination:
- whether the agent chose an appropriate tool
- whether retrieved context was relevant and sufficient
- whether arguments matched the task and policy envelope
- whether the system escalated when uncertainty stayed high
- whether the final output stayed grounded in available evidence
LangSmith’s split between offline and online evaluation is useful because it matches how real systems mature. You need curated regression cases before deployment, then production monitoring once live traffic begins to reveal the edge cases you failed to imagine.
The safest agent is usually the most inspectable one
Anthropic’s advice to prefer simple, composable patterns over ornamental framework complexity is worth carrying forward.
Complexity is not free. Every hidden abstraction makes it harder to see which prompt was active, why a tool was selected, what memory was injected, and which guardrail made the final decision. If you cannot inspect those joints, you cannot repair them.
The practical checklist is short:
- narrow tool visibility by task stage
- separate recommendation from authorization
- keep memory typed and attributable
- trace every tool call and major decision
- run offline evals before rollout
- run online evals after rollout
- preserve a human approval lane for costly side effects
Bottom line
Reliable agent systems are not secured by eloquent prompts.
They are secured by runtime architecture: policy-backed tool routing, disciplined memory admission, trace-level evaluation, and explicit boundaries around side effects. Build those load-bearing walls first. The model can decorate the rooms afterward.