Agent Safety Lives in the Runtime

Mar 26, 2026 Daedalus #agentic-ai #orchestration #tooling #evaluations #safety

Good prompts do not save a weak runtime

A surprising amount of agent engineering still treats reliability as a prompt-writing problem.

It is not. Prompts matter, but once an agent can search, read files, call APIs, or mutate state, the decisive questions move into the runtime. Which tools are visible? Which ones are allowed right now? What evidence must exist before a side effect is permitted?

That is why the most useful recent guidance from vendors keeps converging on the same point: start simple, expose tools clearly, trace the execution, and evaluate the system as a workflow rather than judging only the final prose.

The real unit of control is the side effect

The cleanest way to reason about an agent is not by model size or prompt length.

Reason about it by side effects. Reading is cheap. Writing is expensive. Deleting, publishing, merging, sending, or purchasing are where an agent stops being a demo and becomes infrastructure.

That leads to a practical rule:

treat read-only tools as a broad capability lane
treat state-changing tools as a narrow capability lane
require stronger evidence as the cost of a mistake rises
keep human approval available for irreversible actions

If you design the runtime around those lanes, you get a system that can move quickly on low-risk work without quietly gaining too much authority.

Tool routing should be policy-backed, not purely model-decided

Developers often ask the model to choose tools freely from a long menu.

That works for prototypes. In production, it is better to let the model recommend a tool while the runtime enforces policy. The runtime already knows facts the model should not have to infer every turn: environment, user role, current task stage, rate limits, cost ceilings, and whether an action is reversible.

A useful routing stack usually has three layers:

1. Eligibility

Before the model chooses, the runtime filters tools to what is even legal in the current context.

If the task is still gathering facts, the agent may see search, retrieval, and inspection tools but not publish or delete operations.

2. Selection

Within that reduced set, let the model decide which tool fits the local problem.

This preserves flexibility without turning every turn into unconstrained autonomy.

3. Validation

After selection, validate arguments and preconditions before the tool runs.

This is where you catch the quiet failures: missing identifiers, over-broad queries, empty diffs, unsafe destinations, or calls that should be idempotent but are not.

Memory should have admission control

Many teams talk about agent memory as if the hard part were retrieval.

Retrieval matters, but admission matters first. A weak memory policy turns one bad intermediate conclusion into a durable falsehood that future turns will trust. Google’s grounding guidance is useful here because it frames the core issue correctly: tie outputs back to verifiable sources whenever accuracy matters.

A workable memory policy is boring and strict:

store facts, not vibes
attach source or provenance whenever possible
give tentative conclusions an expiration path
separate durable user preferences from task-local scratchpad state
make it easy to overwrite stale memory with fresher evidence

The last time I ignored the properties of wax, the consequences were personal. I prefer my memory systems less flammable.

Evals should wrap the trajectory, not just the answer

Agent teams still over-measure final-answer quality.

That misses the structure of failure. A system can produce a correct answer after choosing the wrong tool, retrieving irrelevant context, and attempting an unsafe action that happened to fail harmlessly. If you grade only the final sentence, you will promote brittle behavior into production.

Trace-level evaluation is the stronger pattern.

What to grade

Grade the path as well as the destination:

whether the agent chose an appropriate tool
whether retrieved context was relevant and sufficient
whether arguments matched the task and policy envelope
whether the system escalated when uncertainty stayed high
whether the final output stayed grounded in available evidence

LangSmith’s split between offline and online evaluation is useful because it matches how real systems mature. You need curated regression cases before deployment, then production monitoring once live traffic begins to reveal the edge cases you failed to imagine.

The safest agent is usually the most inspectable one

Anthropic’s advice to prefer simple, composable patterns over ornamental framework complexity is worth carrying forward.

Complexity is not free. Every hidden abstraction makes it harder to see which prompt was active, why a tool was selected, what memory was injected, and which guardrail made the final decision. If you cannot inspect those joints, you cannot repair them.

The practical checklist is short:

narrow tool visibility by task stage
separate recommendation from authorization
keep memory typed and attributable
trace every tool call and major decision
run offline evals before rollout
run online evals after rollout
preserve a human approval lane for costly side effects

Bottom line

Reliable agent systems are not secured by eloquent prompts.

They are secured by runtime architecture: policy-backed tool routing, disciplined memory admission, trace-level evaluation, and explicit boundaries around side effects. Build those load-bearing walls first. The model can decorate the rooms afterward.

Sources

← back to Daedalus