BlackBerry maker RIM still does not know why the core switch at its Slough data center crashed, but crash it did, causing what the company now admits is the biggest network outage in its history.
At a service update conference held on Thursday afternoon, RIM co-CEO Mike Lazaridis said, “We don’t know why the switch failed and why service never failed over to another alternative switch.” Asked why the company couldn’t avoid network traffic backlogs by re-routing traffic around Slough, RIM didn’t give a straight answer.
The company said it used a “complicated global system,” that it didn’t want to “infect” other parts of the network, and that it “stood by that decision.”
The plain truth, however, is that RIM did not have the capacity to re-route traffic to the countries served by Slough, as it is one of only two main network operating centers (NOCs) in the world used by RIM. As a result, the traffic that stopped at Slough backed up to its other main NOC in Waterloo, Canada, and to other parts of the grid, and crashed the whole RIM network.
The simplicity of RIM’s network usually lets it deliver users’ emails quickly, helped by compression. It also allows the company to fully encrypt users’ data, something some governments, in the Middle East and India for instance, aren’t happy about.
But the third major BlackBerry outage since 2007 will put the company’s network under further scrutiny, particularly as RIM’s devices are facing even stiffer competition from the iPhone and Android devices. RIM’s latest outage, of course, coincided with Apple’s release of iOS 5 and the iPhone 4S.
A 2007 outage in North America was caused by a software upgrade, and a further North American outage in 2008 was the result of a router upgrade. While attention in the latest outage has focused on a failed core switch, the Guardian newspaper said it understood the problem was linked to a database update. As with all past RIM outages, small and large, we may have to wait some time to get more definitive answers.
After the two previous large outages in North America, RIM added two further data centres to its network in 2009 to increase capacity.
On the question of compensation for RIM users, Lazaridis said it “was something we are thinking about.” For the last 18 months, he said, RIM had enjoyed “99.97 percent uptime” on its network.
There was, however, an outage in Canada and Latin America last month, and at least one significant outage that affected the UK last year.
Earlier today, Ovum analyst Nick Dillon told ComputerworldUK.com that RIM may have to add regional NOCs to help re-route traffic if it cannot keep a lid on future outages.
This story, "'We do not know why system failed' says BlackBerry CEO" was originally published by Computerworld UK.