Readback, Repeatback, and the 2AM Page
Most reliability failures don’t start with one giant technical mistake. They start with small misunderstandings that stack up faster than your error budget can scream.
If you want a calm lesson from outside tech, look at aviation and surgery. Those fields have spent decades learning what happens when smart people, under pressure, assume they understood each other.
Case one: the cockpit conversation that didn’t land
In 1978, United Airlines Flight 173 circled near Portland for about an hour while the crew diagnosed a landing gear problem, then crashed after running out of fuel. The NTSB's probable cause is painfully direct: the captain failed to monitor the fuel state and failed to respond to the crew's fuel advisories. A contributing factor: the other crew members never successfully communicated their urgency to the captain.
The machine worked. The team communication didn’t.
From an SRE lens, this is the outage shape we all know:
- Everyone sees a different slice of reality.
- A high-status person anchors on one failure mode.
- Critical signals are voiced, but not closed-loop confirmed.
- You run out of runway.
That incident helped accelerate Crew Resource Management (CRM): formal communication patterns, challenge-and-response language, and explicit permission for juniors to escalate assertively. Not vibes. Procedure.
Case two: a 19-item checklist that cut deaths
In 2009, the WHO Safe Surgery Saves Lives study (published in NEJM) evaluated a 19-item surgical safety checklist at eight hospitals in eight cities worldwide. The results were not subtle:
- In-hospital deaths dropped from 1.5% to 0.8%.
- Inpatient complications dropped from 11.0% to 7.0%.
Read that again: simple coordination mechanics, consistently applied, moved hard outcomes.
This is the boring magic of reliability engineering. Checklists are not about distrusting expertise. They are about protecting expertise from stress, fatigue, and context switching.
If your on-call playbook lives in one senior engineer’s memory, you do not have a playbook. You have folklore.
Why this matters for us (yes, even in cloud-native everything)
IATA’s 2024 Safety Report puts commercial aviation’s all-accident rate at 1.13 per million flights, about one accident per 880,000 flights. Not perfect, but a record steadily shaped by decades of institutionalized learning and standardization.
Infrastructure teams love to debate tools: Prometheus vs. Datadog, Kubernetes vs. Nomad, immutable vs. mutable. Fun debates. But under incident pressure, your reliability floor is set by communication quality.
The practical question isn’t “Are my engineers smart?”
It’s:
- Do we use readback during incidents?
- Do we assign clear roles (incident commander, ops, comms)?
- Do we force time checks and fuel checks (“time to exhaustion,” “error budget burn”)?
- Can a junior person interrupt a bad plan without social penalty?
If the answer is fuzzy, your architecture diagram is lying to you.
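To make the "fuel check" bullet concrete: the cockpit version is minutes of fuel remaining; the SRE version is error budget burn rate and time to exhaustion. Here is a minimal sketch in Python, assuming an error-rate SLO over a rolling window; the function and field names are illustrative, not from any monitoring library:

```python
from dataclasses import dataclass

@dataclass
class BudgetStatus:
    burn_rate: float            # 1.0 = spending budget exactly at the sustainable pace
    hours_to_exhaustion: float  # at the current burn rate; inf if not burning

def fuel_check(slo_target: float,
               window_hours: float,
               budget_consumed: float,
               current_error_rate: float) -> BudgetStatus:
    """Compute burn rate and time-to-exhaustion for an error-rate SLO.

    slo_target: e.g. 0.999 means 0.1% of requests may fail over the window.
    budget_consumed: fraction of the window's error budget already spent (0..1).
    current_error_rate: observed failing fraction of requests right now.
    """
    allowed_error_rate = 1.0 - slo_target
    # Burn rate: how many whole budgets we'd spend per window at this pace.
    burn_rate = (current_error_rate / allowed_error_rate
                 if allowed_error_rate > 0 else float("inf"))
    remaining = max(0.0, 1.0 - budget_consumed)
    if burn_rate <= 0:
        hours = float("inf")
    else:
        # Remaining budget, divided by the rate we're spending it.
        hours = (remaining * window_hours) / burn_rate
    return BudgetStatus(burn_rate, hours)

# Example: 99.9% SLO over 30 days, 40% of the budget already gone,
# and the incident is failing 1% of requests.
status = fuel_check(0.999, 30 * 24, 0.40, 0.01)
print(f"burn rate {status.burn_rate:.0f}x, "
      f"~{status.hours_to_exhaustion:.1f}h to exhaustion")
# -> burn rate 10x, ~43.2h to exhaustion
```

A 10x burn rate with 43 hours of runway is exactly the kind of "fuel state" number the captain-equivalent should be forced to hear out loud.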
A lightweight pattern that works
Try this in your next incident:
- State: “Current impact is X; started at Y.”
- Intent: “Next action is Z, owned by A, ETA B minutes.”
- Readback: Owner repeats the action and rollback condition.
- Time box: At expiry, decide: continue, roll back, or escalate.
- Comms cadence: Update every N minutes even if no change.
This feels almost too simple. Good. Simple is what survives 2AM cognition.
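For the record, the whole pattern fits in a few dozen lines. Here is a minimal sketch, assuming incident updates are plain strings posted to a chat channel; every name in it (TimeboxedAction, status_line, and so on) is made up for illustration, not taken from any incident tooling:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class TimeboxedAction:
    impact: str             # State: "Current impact is X; started at Y."
    action: str             # Intent: next action Z
    owner: str              # owned by A
    eta_minutes: int        # ETA B minutes
    rollback_condition: str
    readback_confirmed: bool = False
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def readback(self, spoken_by: str,
                 action_repeated: str, rollback_repeated: str) -> bool:
        """Closed-loop confirmation: the action counts as acknowledged only
        when the owner repeats both the action and the rollback condition."""
        ok = (spoken_by == self.owner
              and action_repeated == self.action
              and rollback_repeated == self.rollback_condition)
        self.readback_confirmed = ok
        return ok

    def expired(self, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now >= self.started_at + timedelta(minutes=self.eta_minutes)

    def status_line(self) -> str:
        """One line for the comms channel, posted every N minutes
        even if nothing has changed."""
        return (f"impact={self.impact!r} action={self.action!r} "
                f"owner={self.owner} eta={self.eta_minutes}m "
                f"readback={'yes' if self.readback_confirmed else 'NO'}")

# Usage: readback before work starts, forced decision at timebox expiry.
act = TimeboxedAction(
    impact="checkout p99 > 5s", action="roll back deploy 2024-11-02.3",
    owner="alice", eta_minutes=15,
    rollback_condition="error rate not halved in 10 minutes",
)
act.readback("alice", "roll back deploy 2024-11-02.3",
             "error rate not halved in 10 minutes")
later = act.started_at + timedelta(minutes=16)
if act.expired(now=later):
    # Timebox expired: continue, roll back, or escalate. No silent drift.
    print("timebox expired:", act.status_line())
```

The literal string comparison in readback is the point, not a shortcut: closed-loop confirmation means the owner repeats the exact action and the exact rollback condition, and a mismatch is caught before anyone touches production.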
Reliability is often sold as a scaling problem. Mostly, it’s a human factors problem wearing a scaling costume.
And the quiet truth: boring communication protocols are a lot cheaper than heroic recoveries.
Sources
- NTSB investigation DCA79AA005 (United Airlines Flight 173): https://www.ntsb.gov/investigations/Pages/DCA79AA005.aspx
- Haynes et al., A surgical safety checklist to reduce morbidity and mortality in a global population (NEJM, 2009): https://pubmed.ncbi.nlm.nih.gov/19144931/
- IATA 2024 Safety Report release: https://www.iata.org/en/pressroom/2025-releases/2025-02-26-01/