Many Facebook users were unable to access the social networking site for up to two and a half hours on Thursday, the worst outage the website has had in more than four years, Facebook said in a post.
The problems were traced back to a change made by Facebook in one of its systems.
The change was made to a piece of data that is called upon whenever an error-checking routine finds invalid data in Facebook’s system. That piece of data was itself interpreted as invalid, so the system tried to replace it with the same piece of data, and a feedback loop began.
The loop resulted in hundreds of thousands of queries per second being sent to Facebook’s database cluster, overwhelming the system.
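The mechanism described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Facebook's actual code: the validity rule, the `Database` class, `read_with_repair` and the `"setting"` key are all hypothetical names invented for illustration. It shows how a repair routine that refetches a value whenever it looks invalid sends one query per read once the stored copy itself is invalid.

```python
def is_valid(value):
    # Hypothetical validity rule: a setting must be a positive integer.
    return isinstance(value, int) and value > 0


class Database:
    """Stands in for the database cluster holding the stored copy (assumed)."""

    def __init__(self, stored_value):
        self.stored_value = stored_value
        self.query_count = 0

    def fetch(self):
        # Every call represents one query reaching the database.
        self.query_count += 1
        return self.stored_value


def read_with_repair(cache, db, key):
    """Return the cached value, refetching it from the database if it looks invalid."""
    value = cache.get(key)
    if not is_valid(value):
        # The "repair": replace the bad value with whatever the database holds.
        value = db.fetch()
        cache[key] = value
    return value


def simulate(stored_value, reads):
    """Count how many database queries a given number of reads produces."""
    db = Database(stored_value)
    cache = {}
    for _ in range(reads):
        read_with_repair(cache, db, "setting")
    return db.query_count


healthy_queries = simulate(stored_value=10, reads=1000)  # stored copy is valid
broken_queries = simulate(stored_value=-1, reads=1000)   # stored copy is invalid
print(healthy_queries, broken_queries)  # prints: 1 1000
```

With a valid stored value, only the first read reaches the database. With an invalid one, the repair never succeeds, so every read becomes a query, which is the kind of amplification that can overwhelm a database cluster at scale.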
The result for users was a “DNS error” message and no access to the site.
“The way to stop the feedback cycle was quite painful—we had to stop all traffic to this database cluster, which meant turning off the site,” wrote Robert Johnson, director of software engineering at Facebook, in a post on the site. “Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
The problem hasn’t been entirely fixed. Johnson said Facebook had to turn off the automated error-checking system to get the website back up and running, even though that system plays an integral role in protecting the site.
Facebook is now exploring new ways to handle the situation so it won’t lead to another feedback loop.
“We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously,” he wrote.
Thursday was the second consecutive day on which Facebook was down for some users. On Wednesday, Facebook blamed a third-party networking provider for making the site inaccessible to some.