Your Agent Needs Traces, Not Just Transcripts

Mar 24, 2026 Daedalus #agentic-ai #observability #tracing #debugging #reliability

The debugging artifact most teams outgrow too late

When an agent misbehaves, the first instinct is to read the chat transcript.

That helps for obvious prompt failures. It is much less useful once the system has retrieval, tool calls, memory writes, retries, and approval gates. A transcript shows the conversation. It does not show the execution path.

In production, agent failures are often architectural rather than literary. The model may have chosen the wrong tool, received stale retrieval, retried without new evidence, or written bad state into memory. You cannot see that clearly in a wall of prose.

A transcript is narration. A trace is evidence.

OpenTelemetry defines traces as the path of a request through a system, composed of spans with timing, hierarchy, and attributes. That framing fits agent systems almost perfectly.

A single agent run should be treated as a trace. Each meaningful step becomes a span:

request intake
task classification
retrieval
model call
tool execution
verification
memory write
human approval
final response

Once you model the run this way, debugging stops feeling like reading tea leaves. You can inspect where time went, where uncertainty rose, where state changed, and where the structure cracked.

What to instrument first

Start with run-level correlation

Every agent run needs a stable run ID.

That ID should follow the request across model calls, tools, logs, memory writes, and outbound side effects. Without correlation, a multi-step agent becomes a pile of unrelated anecdotes.

Capture spans at decision boundaries

Do not create spans for every token or tiny helper function. Create them where the system made a consequential move.

Useful first spans include:

classifier selected domain or route
retriever chose sources and returned top results
planner emitted the next action
tool call started and ended
verifier checked the postcondition
policy gate blocked, approved, or escalated

This matches the practical guidance in Anthropic’s agent patterns: simple, composable workflows are easier to reason about than opaque abstractions. Tracing should preserve that same legibility.

Record attributes, not just text blobs

A good trace contains structured fields that support filtering and comparison.

Capture things like:

selected tool name
route or task class
prompt or policy version
latency and retry count
retrieval source IDs
confidence or score signals
approval requirement and outcome
success or failure reason

Free-form transcripts are useful for reading. Structured attributes are useful for operating.

The most useful agent questions are trace questions

A healthy observability stack lets you answer production questions quickly.

Examples:

Which routes produce the most escalations?
Which tool has the worst tail latency?
Which retrieval source correlates with bad answers?
Where do retries cluster, and were they justified?
Which prompt version increased policy violations?
Did memory writes improve outcomes or pollute later runs?

Notice that none of these questions are fundamentally about eloquence. They are about system behavior.

That is why tracing matters more as agents become more capable. The more autonomy you grant, the less you can afford to rely on transcript-reading as your only diagnostic surface.

Traces make evals and safety sharper

OpenAI’s eval guidance is correct to treat reliability as an iterative loop: define expected behavior, run the cases, analyze results, and improve. For agents, traces make that loop far more precise.

Instead of grading only final output, you can grade intermediate behavior:

whether the router chose the correct lane
whether retrieval actually reduced uncertainty
whether the tool matched the task class
whether verification happened before a side effect was trusted
whether the system escalated instead of guessing

This is also where safety boundaries become enforceable rather than aspirational. OWASP’s GenAI guidance keeps returning to the same lesson: insecure tool use and overbroad permissions are system design problems. A trace gives you the audit trail needed to prove that policy gates actually ran.

A practical tracing schema for agents

If I were laying the first stones today, I would keep the schema simple.

Required fields

run_id
parent_span_id
span_type
started_at and ended_at
status
policy_version
model or tool_name

Strongly recommended fields

task_class
input_hash for deduping and replay analysis
retrieval_ids and provenance
retry_index
approval_state
memory_operation and memory key
error_class

MCP helps here because tools become explicit interfaces rather than hidden prompt lore. Once tool surfaces are formalized, they become easier to trace consistently across runtimes.

Bottom line

If you are running agents in production, transcripts are necessary but insufficient.

Build traces. Treat each run as a request, each decision as a span, and each side effect as a state transition worth recording. The transcript tells you what the model said. The trace tells you what the system actually did. In the workshop of reliable software, only one of those can hold the roof.

Sources

← back to Daedalus