



Facebook continues to share details about the 6-hour outage that took Facebook, Messenger, Instagram, and WhatsApp offline on Monday. In a new blog post, Facebook dives into some of the technical details and states that the outage was caused by a mistake in one of many “daily maintenance tasks.”

Facebook published its first summary of the outage late Monday night. It said the services went down because of a single mistake that caused a “cascade effect” in data center communications.

According to Facebook, although there is a system in place to audit commands that could bring down the entire network, a bug in that auditing tool meant the command was not properly stopped.

Data traffic between all these computing facilities is managed by routers, which determine the destination for all incoming and outgoing data. The extensive day-to-day work of maintaining this infrastructure also requires engineers to take part of the backbone offline for maintenance, for example to repair a fiber line, add capacity, or update the software on the routers themselves.

This was the cause of yesterday’s outage. One of these regular maintenance jobs issued a command intended to assess the availability of global backbone capacity. Instead, it unintentionally took down all connections in the backbone network, effectively disconnecting Facebook's data centers globally. Our systems are designed to audit such commands to prevent mistakes like this, but a bug in that auditing tool prevented it from properly stopping the command.
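Facebook has not described how its audit tool works. Purely as an illustration of the idea of auditing a command before it runs, here is a minimal sketch. The function name `audit_command`, the `max_fraction` threshold, and the rule itself are assumptions made for this example, not Facebook's method.

```python
def audit_command(links_to_disable, all_links, max_fraction=0.25):
    """Return True if the command may run, False if it must be blocked.

    Hypothetical rule (an assumption for this sketch): block any command
    that would disable more than max_fraction of the backbone links at once.
    """
    known_links = set(all_links)
    if not known_links:
        return False  # nothing is known about the network: refuse to proceed
    affected = set(links_to_disable) & known_links
    return len(affected) / len(known_links) <= max_fraction


backbone = ["link-a", "link-b", "link-c", "link-d"]  # hypothetical link names
print(audit_command(["link-a"], backbone))  # one link out of four is allowed
print(audit_command(backbone, backbone))    # taking every link down is blocked
```

A check like this only helps if it is itself correct, which is the failure Facebook describes: the audit existed, but a bug let the command through.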

This change completely severed the server connections between the data centers and the Internet. And that complete loss of connectivity caused a second problem that made things worse.

One of the tasks performed by the smaller facilities is to respond to DNS queries. DNS is the address book of the Internet: it translates the simple web names you enter into your browser into specific server IP addresses. These translation queries are answered by authoritative name servers that themselves occupy well-known IP addresses. Those addresses are in turn advertised to the rest of the Internet via another protocol called the Border Gateway Protocol (BGP).
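The dependency described here, where a name can only be translated if the name server's own address is advertised, can be sketched as a toy model. Everything in it is hypothetical: the names `ZONE`, `NAME_SERVER_IP`, `has_route`, and `resolve` are invented for illustration, and the addresses come from ranges reserved for documentation, not from Facebook's network.

```python
import ipaddress

# Hypothetical data for this sketch (documentation-range addresses).
ZONE = {"www.example.com": "203.0.113.10"}  # records held by the name server
NAME_SERVER_IP = "192.0.2.53"               # well-known address of that server


def has_route(advertised_prefixes, ip):
    """Return True if any advertised prefix covers the given IP address."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(p) for p in advertised_prefixes)


def resolve(name, advertised_prefixes):
    """Translate a name to an IP, but only if the name server is reachable."""
    if not has_route(advertised_prefixes, NAME_SERVER_IP):
        raise LookupError("no route to name server")
    return ZONE[name]


# While the prefix is advertised, the lookup succeeds.
print(resolve("www.example.com", ["192.0.2.0/24"]))
```

If the list of advertised prefixes is empty, `resolve` raises `LookupError` even though the record is still present in `ZONE`. The name server has the answer, but nobody can reach it to ask.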

To ensure reliable operation, the DNS servers disable these BGP advertisements if they themselves cannot communicate with the data centers, because that indicates an unhealthy network connection. In the recent outage, the entire backbone was removed from operation, which made these locations declare themselves unhealthy and withdraw their BGP advertisements. As a result, the DNS servers became unreachable even though they were still running. This prevented the rest of the Internet from finding the servers.
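A minimal sketch of this health-check behavior follows. It assumes each DNS location makes a simple yes/no decision based on whether it can reach the data centers. The `DnsSite` class and its fields are hypothetical, and a real deployment would drive an actual BGP speaker rather than flip a flag.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    """A hypothetical DNS location that advertises its prefix while healthy."""
    prefix: str
    advertising: bool = True

    def health_check(self, can_reach_data_centers: bool) -> None:
        # Advertise only while the data centers are reachable; otherwise
        # withdraw so traffic is not sent to a location with a bad network.
        self.advertising = can_reach_data_centers


def advertised_prefixes(sites):
    """Prefixes the rest of the Internet can currently see."""
    return [site.prefix for site in sites if site.advertising]


sites = [DnsSite("192.0.2.0/24"), DnsSite("198.51.100.0/24")]

# One location loses its connection: the other keeps answering queries.
sites[0].health_check(can_reach_data_centers=False)
print(advertised_prefixes(sites))

# The whole backbone goes down: every location withdraws at the same time.
for site in sites:
    site.health_check(can_reach_data_centers=False)
print(advertised_prefixes(sites))
```

The rule is sensible when a single location fails, because queries shift to healthy ones. When every location fails the same check at once, nothing is left advertised, which matches the situation Facebook describes.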

With all of Facebook's platforms down, the company's ability to troubleshoot was hampered because its internal tools were also affected by the outage. As a result, Facebook sent engineers to the data centers to gain physical access to the hardware. Even so, this took time because “hardware and routers are designed to be difficult to change, even if they are physically accessible.”

Facebook notes that, in this case, efforts to improve the security of its systems reduced its ability to recover from the outage, a trade-off it considers worthwhile.

We’ve done a lot of work to harden our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity but by an error of our own making. I think a trade-off like this is worth it: significantly improved everyday security in exchange for a slower recovery from a rare event like this one.

Facebook says it has already begun an “extensive review process to understand how to make the system more resilient.”

