
Hey Spiceheads! On October 4, Facebook was offline for about six hours due to human error. The company states that “configuration changes on our backbone routers” were the cause. Let's take a quick look at what happened and then look at a few ways you can limit the risk of similar problems.

Facebook suffered a self-inflicted wound when one of its network engineers issued a command that effectively disconnected the company’s entire server fleet from the internet. Facebook engineers noted that “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.”

In Facebook’s post-mortem post, they mention their “storm drills,” in which they stress test their computing infrastructure to ensure system-wide failures either don’t happen or can be recovered from quickly. However, they had never simulated the loss of their entire network backbone, nor had they drilled a scenario in which operator error brought down the network.

What can you do to prevent a similar situation?

  • Review how you operate your DNS and make certain it is secure.
  • Examine your network servers, and take stock of the applications (besides your website) that run on them. Consider how building services might be affected by a loss of those servers. Facebook has sophisticated physical access controls on its data centers, but they were rendered useless by the DNS issues.
  • Do some research: does your email have a single point of failure? While this might be more applicable to a natural disaster, it could also be caused by cybercriminals.
  • Identify weak spots and make certain you have backup plans. For instance, if your phone system depends on your domain being up, make certain you have alternate contact methods for necessary parties.
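As a starting point for the weak-spot audit above, here is a minimal sketch of a check that flags services whose DNS names fail to resolve. The hostnames are hypothetical placeholders; substitute the names your email, phone, and access-control systems actually depend on.

```python
import socket

# Hypothetical hostnames -- replace with the services your business depends on.
CRITICAL_HOSTS = ["mail.example.com", "voip.example.com", "vpn.example.com"]

def resolve_ok(host: str) -> bool:
    """Return True if DNS can currently resolve the host, False otherwise."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def check_critical_hosts(hosts):
    """Map each host to whether it resolved, to spot DNS weak points."""
    return {host: resolve_ok(host) for host in hosts}

if __name__ == "__main__":
    for host, ok in check_critical_hosts(CRITICAL_HOSTS).items():
        print(f"{host}: {'OK' if ok else 'FAILED to resolve'}")
```

Run on a schedule, a script like this gives you early warning that a DNS problem is taking dependent systems (phones, door access, email) down with it.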

For more on this, read the full article.
