Agentic AI in Production: Idempotency, Retries, and Compensating Actions
Most agent demos fail in production for a boring reason: they are distributed systems wearing an LLM costume. The model is usually not the incident root cause. The incident is duplicated side effects, retries without limits, or partial completion after a timeout.
If your agent can call tools that mutate state, you need systems reliability patterns, not just better prompts. Three patterns matter more than the rest: idempotency, bounded retries, and compensating actions.
Where agent failures actually come from
A typical failure chain looks like this:
- Agent calls
create_invoice. - Tool times out after the write succeeds, but before the response returns.
- Agent interprets timeout as failure and retries.
- You now have two invoices, one customer, and one very predictable support ticket.
This is not an “AI alignment” failure. It is an execution semantics failure. ReAct-style loops and modern tool-calling models make agent behavior more capable, but capability without execution discipline amplifies operational risk.
Pattern 1: Idempotency keys on every mutating tool
Idempotency means repeating the same request does not create additional side effects. For agents, this should be mandatory for every mutation endpoint.
Practical contract
For each mutating tool call, require:
idempotency_key: stable across retries for the same intent.operation_hash: digest of normalized arguments.request_ttl: validity window for replay safety.
On the tool side:
- First call executes and stores result by key.
- Duplicate call with same key and same hash returns cached result.
- Duplicate key with different hash is rejected as semantic conflict.
This single contract removes the majority of duplicate-write incidents.
Pattern 2: Retries need policy, not hope
Agents should never retry “until it works.” They need explicit budgets and error-aware decisions.
Retry policy that survives contact with reality
Use these defaults unless you have measured reasons not to:
- Retry only on transient classes (429, 5xx, network resets, lock timeouts).
- Never auto-retry on validation errors or permission errors.
- Exponential backoff with jitter.
- Hard cap on attempts per step.
- Wall-clock budget per objective.
Most importantly, log retries as first-class trace events. If you cannot answer “how many retries happened and why,” you cannot tune reliability.
Pattern 3: Compensating actions for partial success
Even with idempotency and good retries, multi-step workflows can still partially complete. You need compensation logic, often implemented as a saga.
Example: order workflow
If your workflow is:
- Reserve inventory
- Charge card
- Create shipment
Then compensation might be:
- If charge fails after reservation, release inventory.
- If shipment fails after charge, refund charge and release inventory.
- If refund fails, escalate to human queue with full trace context.
Compensation should be deterministic and testable. “Ask the model what to undo” is not a control strategy.
Designing tool APIs for agent reliability
If you are defining tool interfaces this week, prioritize these fields:
idempotency_keycorrelation_id(trace stitching across services)attempt_numberdeadline_ms(caller budget)side_effect_class(read,write,destructive)
Then enforce policy by class:
read: optimistic retries allowed.write: retries only with idempotency.destructive: require explicit human confirmation.
This makes the planner simpler and the safety boundary enforceable at runtime.
Evals that catch reliability regressions
Final-answer evals are insufficient for tool agents. You also need trajectory evals that score execution behavior:
- Duplicate side-effect rate
- Retry storm rate
- Compensation success rate
- Human-escalation precision
- Time-to-stable-completion
Benchmarks such as GAIA and SWE-bench are useful for capability, but production readiness depends on these operational metrics. If the model gets smarter while your duplicate-write rate rises, you are not improving.
A rollout checklist that works
Before enabling autonomous mutations in production, verify:
- All mutating tools enforce idempotency keys
- Retry matrix is explicit by error class
- Every multi-step workflow has compensation paths
- Trace events include retries, conflicts, and compensation decisions
- High-impact actions are gated by human confirmation
- Shadow traffic run shows no regression in duplicate side effects
This is less glamorous than prompt craftsmanship. It is also how systems stay online.
Bottom line
Reliable agentic AI is a distributed-systems discipline. Treat tool calls as transactions with idempotency, retries as budgeted policy decisions, and failures as expected states with compensating actions. Do this, and your agent stops being an impressive demo and starts being dependable infrastructure.