To: srmanuel

I have network engineering knowledge. More importantly, I understand how often companies accidentally build single points of failure, which is what most of these outages come down to. In this case it was that Southwest’s weather app couldn’t get a reliable connection.
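
Roughly what I mean, as a made-up Python sketch (both endpoints are hypothetical): one hard-coded weather feed is an accidental single point of failure, and even a crude second source to fall through to removes it.

    # Hypothetical sketch: dispatch depends on one weather feed. If the
    # primary endpoint is unreachable, everything stalls -- a classic
    # accidental single point of failure. A backup source removes it.
    import urllib.request

    PRIMARY = "https://weather-primary.example.com/briefing"  # hypothetical
    BACKUP = "https://weather-backup.example.com/briefing"    # hypothetical

    def fetch_briefing(timeout=5):
        for url in (PRIMARY, BACKUP):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except OSError:
                continue  # this source failed; try the next one
        raise RuntimeError("no weather source reachable")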

Honestly, when you really think about it, we have surprisingly few outages. Basically every major company on the planet does its business through networks that grew more than they were designed, and every IT room in the country has half a dozen computers that nobody knows what they do, or even how to log in to. What’s suspicious is how much stuff works.


11 posted on 06/15/2021 3:04:32 PM PDT by discostu (Like a dog being shown a card trick)


To: discostu

You’re right, we have surprisingly few major network outages, which is why hearing about one raises my suspicions...

In the case of Delta, they shut down their worldwide network because of a power failure at a data center in Atlanta, according to Delta.

In the case of United, they claimed their worldwide outage was due to a router configuration change in Chicago...

I’m calling BS on both of those cases....

In Delta’s case, if a power failure in Atlanta really caused a worldwide outage, then a whole bunch of IT people need to be fired immediately.....

In United’s case, I’ve made enough router changes on worldwide networks to know that if their story is true, again, a whole bunch of IT people need to be fired....

I’m intimately familiar with the Change Control process at major companies; among the things required to make changes are a testing procedure to run after the changes are made and a back-out procedure if your changes fail.....
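
To make that concrete, here’s a minimal sketch of the shape a change record takes (my own illustration in Python, not any company’s actual tooling): the change itself, the testing procedure to run afterward, and the back-out, which fires automatically when the test fails...

    # Illustrative only: a change carries its own post-change test and
    # back-out procedure, and the back-out runs if verification fails.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ChangeRequest:
        description: str
        apply: Callable[[], None]     # the change itself
        verify: Callable[[], bool]    # testing procedure after the change
        back_out: Callable[[], None]  # restore prior state on failure

    def execute(change: ChangeRequest) -> bool:
        change.apply()
        if change.verify():
            return True       # change passed its test; leave it in place
        change.back_out()     # test failed: back the change out
        return False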

I’ve seen multiple times where changes went south on a worldwide network and caused outages; none shut down the entire network and none took hours to resolve.

I saw a guy put in a router change that took down the entire East Coast ATM network of a major bank; that was corrected in less than an hour...

I’ve seen a guy take down over 20,000 voicemail accounts for a major account, which took about 3-4 hours to correct....

When this type of thing happens, usually a root cause analysis is done, and when all the investigation is finished, people get disciplined.....

I was almost fired from a job one time when I was accused of operating outside a change control, which I did, but my excuse was that I did not know the change control had been input incorrectly...

I produced my documentation, which clearly described the change; the issue was that the person inputting the change did not include all the router names and changes....

Even though I operated outside of the change control and nothing was taken down, I was still almost fired.


19 posted on 06/15/2021 3:28:14 PM PDT by srmanuel

To: discostu

That’s the gist of it. System dependencies have dependencies of their own, and so on.

I’m on a site reliability engineering team at my company that ‘keeps the lights on’ for our web-based services. Monitoring, preparation, system efficiency improvements, escalation chains, etc. only work as well as they are designed and followed through. There are still surprises and combinations nobody had considered until they fail.
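
For flavor, a toy sketch of the monitor-and-escalate loop (the tiers, interval, and thresholds are made up for illustration): consecutive failed health checks page successively higher tiers of the escalation chain.

    # Toy monitor: each consecutive failure pages the next tier in the
    # chain; after max_failures it hands off to incident response.
    import time

    ESCALATION_CHAIN = ["on-call engineer", "team lead", "incident manager"]

    def check_service() -> bool:
        """Placeholder health check; a real probe would hit an endpoint."""
        return True

    def monitor(interval=60, max_failures=3):
        failures = 0
        while True:
            if check_service():
                failures = 0
            else:
                failures += 1
                tier = min(failures, len(ESCALATION_CHAIN)) - 1
                print(f"paging {ESCALATION_CHAIN[tier]}")
                if failures >= max_failures:
                    break  # hand off to incident response
            time.sleep(interval)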

I hope they regroup and bulletproof their system soon.


26 posted on 06/15/2021 3:55:44 PM PDT by Textide (Lord, grant that I may always be right, for thou knowest I am hard to turn. ~ Scotch-Irish prayer)
