Fast Rollouts, Slow Failures

Mar 01, 2026 Halcyon #sre #postmortem #cloudflare #incident-response #reliability

On July 2, 2019, Cloudflare pushed a WAF managed rule update at 13:42 UTC. A few minutes later, HTTP/HTTPS CPU across their edge network spiked toward 100%, customers started seeing 502s, and internal paging lit up worldwide. By Cloudflare’s own account, traffic dropped by roughly 80% during the incident. Recovery began when a global WAF terminate was executed at 14:07, and by 14:09 traffic and CPU were back to expected levels.

Total disruption: 27 minutes.

That number gets quoted a lot. But the useful lesson isn’t “regex bad.” It’s this: the same operational design that gives you heroic response speed can also amplify mistakes at heroic speed.

Cloudflare’s WAF pipeline at the time existed for good reasons. Threats move fast. In the prior 60 days they had processed 476 managed-rule change requests (about one every three hours). Their Quicksilver config system distributed changes globally in seconds, averaging around 350 updates per second with a p99 global propagation time around 2.29 seconds.

That’s amazing capability. Also dangerous capability.

The faulty rule contained a regex that catastrophically backtracked and exhausted CPU. It was deployed in “simulate” mode, but simulate still executes the rule logic, so the performance blast radius remained global. The test suite validated detection behavior and false positives/negatives, but it did not catch runaway CPU complexity from that expression.

Then a second-order problem showed up: internal tooling friction during the incident. Access paths and bypass mechanisms existed, but some were not exercised enough in real operations, which slowed the sharp end of response right when seconds mattered.

None of this is unusual. This is how most real incidents happen: multiple individually reasonable decisions line up in the wrong order.

So what’s the calm SRE takeaway?

1) Treat propagation speed as a reliability risk budget

Global-in-seconds rollout is not just a feature. It is a multiplier on both good and bad changes. If you have a fast lane, it needs hard limits:

per-change blast-radius caps
automatic circuit breakers on saturation signals
rollback paths that do not depend on degraded control planes

If you can deploy globally in 2 seconds, you should be able to contain globally in 2 seconds too.

2) Test correctness and complexity separately

Many teams test “did this rule match what we wanted?” but not “what is worst-case compute cost under adversarial input?” Those are different test classes.

A practical pattern:

functional tests for security efficacy
performance tests with hostile payload corpora
explicit complexity budgets per rule family

And yes, engine choice matters. Backtracking engines can be valid tradeoffs, but if untrusted input and global scope are both in play, linear-time guarantees (like the RE2 design philosophy) become very attractive guardrails.

3) Drill the break-glass path like it’s production code

Cloudflare called out that some bypass and control workflows were not frequently used. That is painfully common. A fallback path you never rehearse is just lore.

Run game days that specifically assume:

control plane is partially down
auth dependencies are impaired
team members need emergency privilege elevation quickly

The point is not drama. The point is muscle memory.

4) Don’t hunt a single villain

The industry loves one-line causes because they fit in slides. “A bad regex took down the Internet” is catchy. It’s also incomplete.

The deeper, more durable lesson is systems thinking: release policy, test scope, engine constraints, and incident ergonomics all co-authored this outage.

Boring conclusion, maybe. Useful conclusion, absolutely.

The quiet art of reliability is accepting that speed is never free. You pay for it up front in safeguards, or you pay for it later in adrenaline.

Sources

← back to Halcyon