ZDNet has posted a summary of Microsoft’s response after last week’s outage of their cloud-based Office 365 and Dynamics Online services. Apparently the latter was due to a configuration error that was pushed to all the data centers, while the former was apparently caused by “defective Cisco gear.”
I don’t think either reason is acceptable. For the Dynamics issue, a proper staging test environment probably exists. Clearly they missed a key point to check and test before pushing their build to production – this can happen to anyone; nobody can think of everything. We can only hope they have now taken this into account and will add such a test case to their pre-deployment scenarios. Spirent’s solution for this, beyond the actual test tools, is iTest, which is designed for exactly that kind of issue.
The second reason, “defective Cisco gear”, is more… dubious. For one, I don’t think singling Cisco out is nice. Hardware defects do happen, and pointing fingers at a specific vendor is, well, rude. Secondly, a proper network should be designed around High Availability, taking into account any single – or multiple – point(s) of failure in the network. I guess it’s possible to have several routers die at the exact same time, but it’d be interesting to run the math and figure out what the odds are. Do I have to mention that testing HA/failure scenarios is of paramount importance specifically for this reason?
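For what it’s worth, that math is easy to sketch. The MTBF figure below is a purely illustrative assumption (not a real vendor spec), and the model assumes failures are independent – which is exactly the assumption a shared configuration or power fault would break:

```python
# Back-of-the-envelope odds of N independent routers all failing in the
# same hour. The MTBF is an assumed, illustrative figure, not vendor data.
MTBF_HOURS = 100_000                    # assumed mean time between failures
P_FAIL_PER_HOUR = 1 / MTBF_HOURS        # ~1e-5 chance a given unit fails in any hour

def p_simultaneous(n_routers: int) -> float:
    """Probability that n independent routers all fail in the same hour."""
    return P_FAIL_PER_HOUR ** n_routers

for n in (1, 2, 3):
    print(f"{n} router(s): {p_simultaneous(n):.0e}")
```

Under these assumptions, two redundant routers dying in the same hour is roughly a one-in-ten-billion event – so when multiple boxes go down together, a common cause (configuration, power, software) is a far more plausible explanation than coincidental hardware defects.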
Microsoft stated they will provide a 25% credit to their impacted users. That will probably cost a whole lot more than it would have cost to properly test their network infrastructure.