Skype post-mortem explains service outage
Skype’s CIO Lars Rabbe on Wednesday offered a frank assessment of the recent 24-hour lapse in its Internet telephony service, in a blog post that also laid out what the company is now doing to make its network more robust.
Rabbe’s post also served as a corporate mea culpa, saying “we know that we fell short in both fulfilling your expectations and communicating with you during this incident.”
The failure of Skype’s service for many of its users started at about 4 p.m. GMT on December 22 and lasted through much of December 23, Rabbe said. On December 22, a cluster of servers became overloaded, and some Skype clients received delayed responses from them. In one particular version of the Skype for Windows client, the delayed responses caused a processing misfire that led the client software to crash.
The affected version of the Skype for Windows client was 5.0.0.152, a version that Rabbe said about half of Skype’s users were running. The crashes caused about 40 percent of those clients to fail. Among the clients that failed were between 25 and 30 percent of the systems that provide important directory services in Skype’s peer-to-peer network.
Skype worked quickly to bring these so-called supernodes back online, but even after being restarted those systems remained unavailable to the network for a time. Meanwhile, the added load on the remaining supernodes pushed more of them past their limits and caused them to shut down as well. “This further increased the load on remaining supernodes and caused a positive feedback loop, which led to the near complete failures that occurred a few hours after the triggering event,” Rabbe explained.
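The feedback loop Rabbe describes can be illustrated with a toy model. The sketch below is not Skype's architecture or code: the function name `simulate_cascade`, the capacities, and the load figure are all made-up assumptions, and the model simply spreads a fixed load evenly across whichever nodes are still alive and drops any node asked to carry more than its capacity.

```python
def simulate_cascade(capacities, total_load, initially_failed):
    """Toy cascade model (hypothetical, not Skype's real behavior).

    capacities:       per-node capacity, in arbitrary load units
    total_load:       load shared equally by all nodes still alive
    initially_failed: set of node indexes knocked out by the trigger

    Returns the number of surviving nodes after each round.
    """
    alive = [c for i, c in enumerate(capacities) if i not in initially_failed]
    history = [len(alive)]
    while alive:
        per_node = total_load / len(alive)
        # A node drops out when its share of the load exceeds its capacity.
        survivors = [c for c in alive if c >= per_node]
        if len(survivors) == len(alive):
            break  # every remaining node can carry its share: stable
        alive = survivors
        history.append(len(alive))
    return history


# Assumed numbers: 100 nodes with capacities 100..199 and a load of 9000.
capacities = [100 + i for i in range(100)]

# Nothing fails: each node carries 90 units, below every capacity.
print(simulate_cascade(capacities, 9000, set()))

# Five nodes fail: the remaining 95 absorb the extra load.
print(simulate_cascade(capacities, 9000, set(range(5))))

# A quarter of the nodes fail at once: survivors are overloaded in turn.
print(simulate_cascade(capacities, 9000, set(range(0, 100, 4))))
```

With these assumed numbers, losing five nodes is absorbed, but losing a quarter of them at once leaves 75, then 60, then 38, then no nodes standing, because each round of failures raises the load on whatever remains.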
To fix the problem, Skype engineers introduced hundreds of instances of the Skype software into the peer-to-peer network to serve as dedicated supernodes, the CIO said. To do that, they drew on resources that are normally used in Group Video calling, thus taking that service offline temporarily. It was restored in time for Christmas, Rabbe wrote.
Skype is now focused on keeping its user base current on client software. (Rabbe noted that the company had previously offered users of the affected, out-of-date version of Skype for Windows an upgraded version that fixed the bug.) He added that Skype will review its “processes for providing automatic updates for our users so that we can help keep everyone on the latest Skype software.” He also pledged to improve software quality with a review of testing processes.
As for how the Skype team responded to the outage, Rabbe said that the company will look for ways to detect problems more quickly in the future and head off a major disruption. It also aims to shorten the time it takes to bring the system back up after a failure, he said.
Rabbe promised that Skype “will keep under constant review the capacity of our core systems that support the Skype user base, and continue to invest in both capacity and resilience of these systems.”