
Update on Workplace outage on October 4, 2021
Email from Facebook Workplace | 10/8/21 | Facebook Workplace

Posted on 10/08/2021 8:47:39 AM PDT by RainMan

The following is an email Facebook sent to Workplace users regarding the outage of October 4, 2021.

***

Dear Workplace Admin,

Firstly, we’d like to apologize again for the outage on October 4, 2021. We know that your teams work incredibly hard to make the most of Workplace every day and we are truly sorry for the disruption the outage may have caused for your business. We appreciate your patience while we fully investigated this incident and its impact.

This incident was purely an internal issue and there were no malicious third parties or bad actors involved in causing it. Our investigation shows no impact to user data confidentiality or integrity.

You may have already seen the in-product notification on Workplace posted on October 5, 2021, acknowledging the outage. We know that you want to understand exactly what happened and we’re happy to provide that information with as much transparency and detail as we can via the Root Cause Analysis included below.

We’ve also used this incident to review some of our critical communication channels to ensure that you’re getting the latest, accurate information when things do go wrong.

To start, we will migrate our status page away from the Facebook domain to ensure that you can rely on it as an up-to-date source of truth. This means that the status page will work without depending on the core Facebook infrastructure, which ensures its availability if Workplace is temporarily unavailable.

This is just the beginning of us acting on lessons learned from this incident and we’ll continue to listen to your feedback. Your usual points of contact can help address any outstanding questions or issues you might have.

Thank you again for trusting Workplace with your business,

The Workplace Team


Root Cause Analysis

Summary of issue:
This incident, on October 4, 2021, impacted Facebook’s backbone network. This resulted in disruption across all Facebook systems and products globally, including Workplace from Facebook.

This incident was an internal issue and there were no malicious third parties or bad actors involved in causing the incident. Our investigation shows no impact to user data confidentiality or integrity.

The underlying cause of the outage also impacted many internal systems, making it harder to diagnose and resolve the issue quickly.

Cause of issue:
This outage was triggered by the system that manages our global backbone network capacity. The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.

During a routine maintenance job, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
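
To make the audit step concrete, here is a minimal sketch of the invariant such a pre-execution check might enforce; the names, data structures, and threshold are hypothetical and are not Facebook’s actual tooling:

```python
# Hypothetical sketch of a pre-execution audit for maintenance commands.
# Names, thresholds, and data structures are illustrative only and are not
# Facebook's actual tooling.

from dataclasses import dataclass


@dataclass
class BackboneLink:
    name: str
    in_service: bool


class AuditError(Exception):
    """Raised when a command would violate a safety invariant."""


def audit_capacity_command(links, links_to_drain):
    """Refuse any command that would leave zero backbone links in service.

    The RCA says a bug in a check like this let the fatal command through;
    this sketch only shows the invariant such a check is meant to enforce.
    """
    remaining = [l for l in links if l.in_service and l.name not in links_to_drain]
    if not remaining:
        raise AuditError(
            f"command would drain all {len(links)} backbone links; refusing to run"
        )


if __name__ == "__main__":
    backbone = [BackboneLink(f"link-{i}", in_service=True) for i in range(4)]
    try:
        # A command that (unintentionally) targets every link should be blocked.
        audit_capacity_command(backbone, {link.name for link in backbone})
    except AuditError as err:
        print("audit blocked command:", err)
```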

This change caused a complete disconnection between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.

One of the jobs performed by our smaller facilities is to respond to DNS queries. Those queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the Border Gateway Protocol (BGP).

To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves cannot speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage, the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
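
In other words, the logic amounts to "if this node cannot reach the data centers, stop advertising its route." A minimal sketch of that decision, with all names and data structures invented for illustration:

```python
# Minimal sketch of the "withdraw BGP advertisements when the backbone looks
# unhealthy" behavior described above. Everything here is illustrative; real
# name servers speak actual BGP and run far richer health checks.

def backbone_reachable(datacenters, reachable):
    """Health check: can this DNS edge node reach at least one data center?"""
    return any(dc in reachable for dc in datacenters)


def decide_bgp_advertisement(datacenters, reachable):
    """Advertise the anycast DNS prefix only while the backbone is reachable."""
    if backbone_reachable(datacenters, reachable):
        return "announce"   # keep advertising the DNS server's prefix
    return "withdraw"       # pull the route: the node assumes it is unhealthy


if __name__ == "__main__":
    dcs = ["dc-1", "dc-2", "dc-3"]
    print(decide_bgp_advertisement(dcs, reachable={"dc-2"}))  # announce
    # On October 4 the backbone itself was gone, so every node saw this case:
    print(decide_bgp_advertisement(dcs, reachable=set()))     # withdraw
```

The blind spot is that when every node fails the same health check for the same reason, they all withdraw at once and DNS disappears globally even though the name servers themselves are still running.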

Workplace timeline:
This incident was a network outage experienced globally across Facebook services, including Workplace. The outage lasted around six hours, from approximately 16:40 to 23:30 BST.

Steps to mitigate:
The nature of the outage meant it was not possible to access our data centers through our normal means because the networks were down, and the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.

Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.

Once our backbone network connectivity was restored across our data center regions, everything came back up with it. But the problem was not over — we knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.
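
One common way to avoid the surge described above is to restore traffic in stages rather than all at once. The sketch below illustrates that general idea only; the stage fractions and timings are invented for the example:

```python
# Illustrative staged ramp-up to avoid a surge in traffic and power draw when
# bringing services back after a total outage. The stage fractions and settle
# times are invented; real restoration would watch error rates, cache hit
# rates, and power metrics before advancing.

import time


def ramp_up_traffic(set_traffic_fraction, stages=(0.05, 0.15, 0.35, 0.65, 1.0),
                    settle_seconds=1.0):
    """Admit traffic gradually, letting caches warm and load settle per stage."""
    for fraction in stages:
        set_traffic_fraction(fraction)
        time.sleep(settle_seconds)


if __name__ == "__main__":
    ramp_up_traffic(lambda f: print(f"admitting {f:.0%} of normal traffic"),
                    settle_seconds=0.1)
```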

In the end, our services came back up relatively quickly without any further systemwide failures.

Prevention of recurrence:
We’ve done extensive work hardening our systems to prevent unauthorized access, and ultimately it was this hardening that slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. It is our belief that a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a rare event like this.

However, we’ll also be looking for ways to simulate events like this going forward to ensure better preparedness. We’ll take every measure to strengthen our testing, drills, and overall resilience so that events like this happen as rarely as possible.



TOPICS: Computers/Internet; Conspiracy; Weird Stuff
KEYWORDS: facebook; outage

1 posted on 10/08/2021 8:47:39 AM PDT by RainMan

To: RainMan

“It was magic…..or Trump…..or Trump magic.”


2 posted on 10/08/2021 8:56:28 AM PDT by blueunicorn6 ("A crack shot and a good dancer”)

To: RainMan

I took down my home network once in a similar fashion. I had to physically remove hard drives with virtual hard drive images on them, bring up those machines on a separate computer, reconnect everything to virtual switches, get those healthy, then bring the infrastructure back online on the physical hardware that was out of commission because of my error.

Sometimes writing redundancy and security into everything you do leaves even administrators with no backdoors to use to access their infrastructure. These episodes often help to rewrite business continuity/disaster recovery workbooks despite being exceptionally disruptive to the customer base.


3 posted on 10/08/2021 9:01:17 AM PDT by rarestia (Repeal the 17th Amendment and ratify Article the First to give the power back to the people!)

To: RainMan

Wonder if/when facebook will publicize the penalties imposed on the guilty parties?


4 posted on 10/08/2021 9:08:09 AM PDT by upchuck (The longer I remain unjabbed, the more evidence I see supporting my decision. Psalm 144:5-8)

To: RainMan; a fool in paradise; acapesket; Baynative; beef; BullDog108; Califreak; cgbg; ...
This is the Farcebook Is Evil ping list.


h/t pookie18's cartoons

Facebook is a perfect example of socialism:
You get it for free but the quality sucks.
You have no say in how it works.
The guy who runs it gets rich.
There's no real competition.
You have no privacy.
And if you say one thing they don't like
they'll shut you up.

If you'd like to be on or off this list, please click Private Reply below and drop me a FReepmail

5 posted on 10/08/2021 9:09:48 AM PDT by upchuck (The longer I remain unjabbed, the more evidence I see supporting my decision. Psalm 144:5-8)

To: RainMan
To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.

How stupid, and totally plausible. That was the very crux of the matter: an automatic DNS withdrawal when the backbone connection disappeared, leaving no administrative access path to correct the problem. Good grief.

6 posted on 10/08/2021 4:11:08 PM PDT by higgmeister ( In the Shadow of The Big Chicken )

To: upchuck; higgmeister; ShadowAce; AnotherUnixGeek; rarestia
I'm not in IT but I passed some reading comprehension tests. Something is wrong.

During a routine maintenance job, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.

In other words, something that we do routinely, without incident, this time caused a meltdown. Hmmm..ok, go on...

Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.

Well THAT is odd....this alleged bug didn't screw with the backbone the last time you performed this routine task. Is this a situation where the audit tool was updated but it wasn't tested?

Shouldn't this audit tool - as well as the capacity assessment process - have been tested in DEV, and then QA, before hitting PROD? I mean, I get that it's routine, but if it can mess with global capacity I'd hope that the Masters of the Universe have a basic grasp of testing.
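
The kind of DEV/QA gate being described could be as simple as a regression test that exercises the worst case. In the sketch below, the audited function is a made-up stand-in, not Facebook's real tool:

```python
# Hypothetical regression test of the sort a DEV/QA gate would run before an
# audit tool ever reaches PROD. The audit function is a made-up stand-in, not
# Facebook's actual tooling.

import unittest


class AuditError(Exception):
    pass


def audit_drain_command(total_links, links_to_drain):
    """Stand-in audit: refuse any command that drains every backbone link."""
    if links_to_drain >= total_links:
        raise AuditError("command would disconnect the entire backbone")


class AuditToolRegressionTest(unittest.TestCase):
    def test_partial_drain_is_allowed(self):
        audit_drain_command(total_links=10, links_to_drain=3)  # should not raise

    def test_full_drain_is_blocked(self):
        # The case the RCA says slipped through in production.
        with self.assertRaises(AuditError):
            audit_drain_command(total_links=10, links_to_drain=10)


if __name__ == "__main__":
    unittest.main()
```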

And we are worried about these guys ruling the world?


7 posted on 10/08/2021 7:24:51 PM PDT by DoodleBob (Gravity's waiting period is about 9.8 m/s^2)

To: DoodleBob

There are a few possibilities.

A junior-level engineer fat-fingered a command. In Cisco IOS, for instance, you can shorthand commands (e.g., sh instead of show). While I prefer to tab-complete to ensure I’m using the right command, a hotshot might try to do something without checking.
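
To illustrate how that kind of shorthand expansion can bite, here is a toy prefix-matching sketch; the command table is invented and does not mirror actual Cisco IOS behavior:

```python
# Toy illustration of CLI prefix-matching, and why a fat-fingered shorthand can
# be risky. The command table is invented and does not mirror Cisco IOS.

COMMANDS = ["show", "shutdown", "clear", "configure", "copy", "reload"]


def expand(abbrev):
    """Expand an abbreviation to a full command if the prefix is unambiguous."""
    matches = [c for c in COMMANDS if c.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        return f"% Unrecognized command: {abbrev!r}"
    return f"% Ambiguous command: {abbrev!r} matches {matches}"


if __name__ == "__main__":
    print(expand("sho"))  # unambiguous: "show"
    print(expand("sh"))   # ambiguous in this toy table: "show" or "shutdown"
    print(expand("rel"))  # unambiguous: "reload" -- one slip, big consequences
```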

The audit tool was likely a network security tool that parses commands. These can be notoriously disruptive to maintenance operations, and I’ve known plenty of shops to disable them for “routine” maintenance. It’s also possible they were issuing commands from a proxy system that forwards the commands to a core switch/router for processing, and the parser just let it go. It’s not likely the audit tool or the capacity assessment process were internal applications, and thus they probably weren’t subject to DEV/QA/UAT processes.
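
A proxy of the kind described above typically vets operator input before forwarding it to the core gear. A rough sketch of that pattern, with the allowlist and command strings invented for illustration:

```python
# Rough sketch of a command proxy that classifies operator input before
# forwarding it to core network devices. The allowlist and command strings are
# invented for illustration; a real parser is far more involved.

SAFE_PREFIXES = ("show ", "ping ", "traceroute ")      # read-only commands
REVIEW_REQUIRED = ("shutdown", "clear bgp", "reload")  # destructive commands


def vet_command(command):
    """Classify a command as 'forward', 'review', or 'reject'."""
    cmd = command.strip().lower()
    if cmd.startswith(SAFE_PREFIXES):
        return "forward"   # pass straight through to the device
    if any(token in cmd for token in REVIEW_REQUIRED):
        return "review"    # require a second operator to approve
    return "reject"        # unknown command: fail closed, not open


if __name__ == "__main__":
    for c in ("show ip bgp summary", "clear bgp * soft", "frobnicate backbone"):
        print(f"{c!r} -> {vet_command(c)}")
```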

I’m still under the assumption that this was an inside job. I’ve borked my own internal network messing around with BGP, but for something this large-scale, you’d have to be either completely incompetent or maliciously devious. This took planning.


8 posted on 10/09/2021 3:51:35 AM PDT by rarestia (Repeal the 17th Amendment and ratify Article the First to give the power back to the people!)

