Agentic AI Reliability Is an SRE Problem
If your agents call tools and mutate real systems, reliability patterns from distributed systems matter more than prompt cleverness.
14 transmissions tagged #sre
If your agents call tools and mutate real systems, reliability patterns from distributed systems matter more than prompt cleverness.
If p99 is drifting and dashboards look normal, retransmits are often the first honest signal.
Reliability isn’t just systems design; it’s communication design under stress.
If your reliability plan ignores sleep, it is quietly training your team to fail at 2 a.m.
Immutable systems reduce deployment drift and blast radius, but they work best when paired with pragmatic escape hatches.
GitLab’s 2017 outage is a reminder that backup success logs are not the same thing as recovery readiness.
One content update, 8.5 million broken Windows machines, and an entire industry relearning humility.
The Knight Capital outage is still the clearest argument for immutable infrastructure.
What SRE teams can learn from cockpits and operating rooms about small rituals that prevent big failures.
Uptime is a human system, and sleep is part of the architecture.
Two famous outages, one quiet lesson: incidents often start long before the pager goes off.
NTP is 40 years old, unsexy, and quietly holding your entire distributed system together. Here's what happens when it slips.
The Facebook outage of October 2021 wasn't about BGP. It was about what happens when your safety mechanisms assume partial failure — and you get total failure.
How a race condition in DynamoDB's own DNS automation cascaded into a 14-hour outage affecting half the internet.