July 2, 2019: Cloudflare Goes Dark
At 13:42 UTC, a Cloudflare engineer deployed a new rule to the Web Application Firewall (WAF). The rule contained a regular expression designed to detect a specific attack pattern in HTTP request bodies. Within seconds, every CPU core across Cloudflare's entire global network spiked to 100%. The CDN stopped serving traffic. Websites using Cloudflare — roughly 15% of the internet — returned 502 errors.
The outage lasted 27 minutes. The root cause was a single regular expression that triggered catastrophic backtracking in the regex engine.
Cloudflare's CTO John Graham-Cumming: "A single regular expression that backtracked enormously took down Cloudflare." The offending regex contained nested quantifiers like
.*(?:.*=.*)that triggered catastrophic backtracking on certain inputs.
What Is Catastrophic Backtracking?
Most regex engines use backtracking to match patterns. When a pattern like .*.*=.* encounters a string that almost-but-doesn't-quite match, the engine tries every possible way to divide the string among the .* groups. For a string of length N, this creates 2^N possible combinations to try. The regex engine doesn't know which combination works until it has tried them all.
The offending pattern .*(?:.*=.*) was nested inside other quantifiers. On certain input strings, the regex engine explored billions of paths, consuming 100% CPU for each request. Since Cloudflare processes millions of requests per second, every core on every server was instantly consumed.
Figure 1: Catastrophic backtracking occurs when nested quantifiers create exponentially many ways to match. A 30-character input can require billions of backtracking steps.
Why Didn't Testing Catch It?
The regex worked fine in testing — on inputs that matched. Catastrophic backtracking only occurs on inputs that almost match but don't. The test suite included strings the rule was supposed to catch, not adversarial strings that would trigger backtracking. The engineer who wrote the rule had no way to know it was dangerous.
Additionally, the deployment process had a critical gap: WAF rule changes were deployed globally in one shot, not progressively rolled out. There was no canary deployment, no staged rollout, no automatic rollback on CPU spike.
What Cloudflare Changed
- Switched to re2: Google's re2 regex engine guarantees O(N) matching time by using a finite automaton instead of backtracking. It doesn't support backreferences, but WAF rules don't need them.
- Regex profiling: every new WAF rule is now profiled against adversarial inputs before deployment. Rules that exceed a CPU threshold are rejected.
- Progressive deployment: WAF rule changes now roll out to a small percentage of traffic first. If CPU spikes, the change is automatically rolled back.
- CPU watchdog: a per-request CPU time limit. If a regex takes more than X milliseconds, the request is passed through without WAF filtering (fail open).
After switching to re2, Cloudflare can guarantee that no regular expression, no matter how complex or adversarial the input, will ever cause a CPU spike. The tradeoff: re2 doesn't support some advanced regex features like backreferences, but those features are rarely needed in WAF rules.
The Broader Lesson: ReDoS
This class of attack is called ReDoS (Regular Expression Denial of Service). Any application that runs user-influenced input through a backtracking regex engine is vulnerable. Node.js (V8's Irregexp), Python (re module), Java (java.util.regex), and Ruby all use backtracking engines by default. Solutions: use re2, add regex timeout limits, or use static analysis tools like rxxr2 to detect vulnerable patterns before deployment.