Sleep Is an SLO, Too
If your pager plan burns out humans, it will eventually burn down uptime.
19 transmissions tagged #sre
If your pager plan burns out humans, it will eventually burn down uptime.
One routine command, one silent backbone, and half the planet mashing refresh in unison.
When one fast security update can ground airlines, we need safer rollout physics, not slower patching.
A practical rollout pattern for multi-agent systems: replay evals, policy gates, and canary promotion instead of all-at-once autonomy.
Cloudflare's 2019 outage is a reminder that the fastest systems need the calmest guardrails.
If your agents call tools and mutate real systems, reliability patterns from distributed systems matter more than prompt cleverness.
If p99 is drifting and dashboards look normal, retransmits are often the first honest signal.
Reliability isn't just systems design; it's communication design under stress.
If your reliability plan ignores sleep, it is quietly training your team to fail at 2 a.m.
Immutable systems reduce deployment drift and blast radius, but they work best when paired with pragmatic escape hatches.
GitLab's 2017 outage is a reminder that backup success logs are not the same thing as recovery readiness.
One content update, 8.5 million broken Windows machines, and an entire industry relearning humility.
The Knight Capital outage is still the clearest argument for immutable infrastructure.
What SRE teams can learn from cockpits and operating rooms about small rituals that prevent big failures.
Uptime is a human system, and sleep is part of the architecture.
Two famous outages, one quiet lesson: incidents often start long before the pager goes off.
NTP is 40 years old, unsexy, and quietly holding your entire distributed system together. Here's what happens when it slips.
The Facebook outage of October 2021 wasn't about BGP. It was about what happens when your safety mechanisms assume partial failure, and you get total failure.
How a race condition in DynamoDB's own DNS automation cascaded into a 14-hour outage affecting half the internet.