Introduction
On November 19, 2025, a major outage disrupted a large part of the internet. Within minutes, millions of sites became unreachable. The incident, traced back to Cloudflare, is a reminder of how much our digital infrastructure depends on a small number of actors and a handful of critical configurations.
A global incident in a matter of minutes
The outage began around 6 a.m. (U.S. East Coast time). Major services returned errors, platforms such as X and ChatGPT became unreachable, and even outage-monitoring tools struggled to stay up. The cascading effect was immediate, because Cloudflare plays a central role in DNS, DDoS protection, and traffic management.
Why the impact is so broad
Cloudflare is a transit point for a massive share of global web traffic. Because so many sites route their requests through its network, a single internal degradation is enough to produce effects felt worldwide.
Root cause: a file that was too large
According to Cloudflare's technical explanation, a change to database permissions triggered a latent bug in the bot-mitigation service: the automatically generated configuration file grew well beyond its expected size, and the software consuming it crashed instead of rejecting the oversized file. The protection meant to filter threats ended up disrupting the entire network.
Warning
Automatically generated configuration files can become a critical point of failure if they are not tested with realistic volumes.
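One way to catch this early is a test that feeds the pipeline an input of realistic size before anything reaches production. Below is a minimal sketch using Node's built-in test runner; the generateRules helper, the rule count, and the 5 MB limit are assumptions for the example, not details from the incident.
// Exercise the generated configuration at a realistic volume before deploying
// (generateRules, the rule count, and MAX_BYTES are hypothetical example values)
import { test } from "node:test";
import assert from "node:assert";
const MAX_BYTES = 5 * 1024 * 1024;
function generateRules(count) {
  return JSON.stringify(
    Array.from({ length: count }, (_, i) => ({ id: i, pattern: `bot-${i}`, action: "block" }))
  );
}
test("generated rules stay under the size limit at production volume", () => {
  const payload = generateRules(50_000);
  assert.ok(Buffer.byteLength(payload) <= MAX_BYTES, "rules file exceeds the deployable size");
});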
Architecture lessons to avoid the domino effect
Outages of this type are not exceptional: they illustrate the risks of strong centralization. To limit the damage, systems must be resilient, testable, and able to degrade their services in a controlled way.
Key principle
Planning a deliberate degraded mode is often more reliable than a sudden shutdown imposed by a critical dependency.
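In practice, a degraded mode can be as simple as an explicit flag that switches non-essential features off while the core path keeps serving traffic. A minimal sketch, with hypothetical feature names:
// Deliberate degraded mode: core traffic keeps flowing, optional features are cut
const degraded = { active: false, since: null };
export function enterDegradedMode(reason) {
  degraded.active = true;
  degraded.since = Date.now();
  console.warn(`Degraded mode enabled: ${reason}`);
}
export function isFeatureEnabled(feature) {
  // Hypothetical split between essential and optional features
  const optional = ["recommendations", "analytics", "bot-scoring"];
  if (degraded.active && optional.includes(feature)) {
    return false; // skip optional work instead of failing the whole request
  }
  return true;
}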
Operational best practices
Prevention relies on simple guardrails: size limits, integration tests on generated rules, and the ability to roll back within minutes. Here are examples of useful practices and automations.
// Check a config file size before deployment
import fs from "fs";
const MAX_BYTES = 5 * 1024 * 1024;
const path = "./generated-rules.json";
const size = fs.statSync(path).size;
if (size > MAX_BYTES) {
  throw new Error("Configuration too large, deployment blocked.");
}
// Degrade a service when errors repeat
let failures = 0;
const MAX_FAILURES = 5;
async function fetchWithFallback(url) {
  try {
    const res = await fetch(url);
    // fetch only rejects on network errors, so treat HTTP errors as failures too
    if (!res.ok) {
      throw new Error(`HTTP ${res.status}`);
    }
    failures = 0; // reset the counter after a successful call
    return res;
  } catch (err) {
    failures += 1;
    if (failures >= MAX_FAILURES) {
      // too many consecutive failures: serve a cached offline response
      return fetch("/cache/offline.json");
    }
    throw err;
  }
}
// Minimal circuit breaker example
let state = "closed";
let openedAt = 0;
const COOLDOWN_MS = 30000;
export async function guardedCall(fn) {
  // While the breaker is open, fail fast until the cooldown expires
  if (state === "open" && Date.now() - openedAt < COOLDOWN_MS) {
    throw new Error("Service temporarily disabled");
  }
  try {
    const result = await fn();
    state = "closed"; // a successful call closes the breaker again
    return result;
  } catch (err) {
    state = "open";
    openedAt = Date.now();
    throw err;
  }
}
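In use, the breaker simply wraps the call to the fragile dependency (the URL below is a placeholder):
// Example usage: protect a fragile upstream call with the breaker
const response = await guardedCall(() => fetch("https://api.example.com/rules"));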
// Structured logging to make incident analysis easier
function logIncident(event, meta = {}) {
  console.log(JSON.stringify({
    event,
    severity: "high",
    timestamp: new Date().toISOString(),
    ...meta
  }));
}
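This logger plugs naturally into the failure paths above, for example when the circuit breaker opens (the service name is a placeholder):
// Example: record the moment the breaker opens so the incident can be traced later
logIncident("circuit_breaker_opened", { service: "rules-api", cooldownMs: COOLDOWN_MS });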
Warning
Never push a critical configuration without a tested and documented rollback scenario.
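A rollback is easier to execute in minutes when it is prepared in advance: keep the last known good version next to the active one and restore it on demand. A minimal sketch, assuming the file names used here:
// Keep the last known good configuration and restore it on demand
// (file names are assumptions for the example)
import fs from "fs";
const ACTIVE = "./generated-rules.json";
const LAST_GOOD = "./generated-rules.last-good.json";
export function markLastGood() {
  // call this once a deployment has been validated in production
  fs.copyFileSync(ACTIVE, LAST_GOOD);
}
export function rollback() {
  if (!fs.existsSync(LAST_GOOD)) {
    throw new Error("No known good configuration to roll back to.");
  }
  fs.copyFileSync(LAST_GOOD, ACTIVE);
  console.warn("Configuration rolled back to the last known good version.");
}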
Quick checklist
Size limits on generated files, load tests at realistic volumes, real-time monitoring, and a rollback plan that can be executed in under 10 minutes.
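For the monitoring part, even a small error-rate watcher gives a usable signal and can reuse the logIncident helper above; the window size and threshold below are arbitrary example values.
// Alert when the error rate over a sliding window exceeds a threshold
// (window size and threshold are arbitrary example values)
const WINDOW_MS = 60_000;
const ERROR_RATE_THRESHOLD = 0.2;
const samples = [];
export function recordResult(ok) {
  const now = Date.now();
  samples.push({ ok, at: now });
  // drop samples that have fallen out of the window
  while (samples.length && now - samples[0].at > WINDOW_MS) {
    samples.shift();
  }
  const errors = samples.filter((s) => !s.ok).length;
  if (samples.length >= 10 && errors / samples.length > ERROR_RATE_THRESHOLD) {
    logIncident("error_rate_threshold_exceeded", { errors, total: samples.length });
  }
}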
Conclusion
This outage highlights a simple reality: our internet is centralized. A single faulty configuration is enough to disrupt a large part of the web. The best response remains a resilient architecture, tested at scale and designed for controlled degradation rather than total unavailability.