Last week Windows Azure suffered an outage due to a leap year issue. It’s not the outage I want to highlight so much as the transparency in sharing what happened.
Outages happen. All the popular cloud providers have had them; it’s how they are handled that matters. Often, few details are shared about what caused the problem or what is planned to ensure it doesn’t happen again. As we all try to get comfortable with trusting cloud providers with our critical systems, this often leaves us in the dark.
Today Bill Laing, a CVP at Microsoft, posted this blog post explaining in great detail what happened, the challenges they had during the outage, and detailed plans to improve the service. It’s fairly long, but it’s a good read and will help you better understand some of the inner workings of the Azure service.
One of the commitments in the post that I really appreciate is the promise of more detailed progress updates during an outage. One of the most important things cloud providers don’t get is how to replace the fact that, if a system is hosted internally, you can walk down the hallway and get a status. If Azure can figure out how to be this transparent about what happened, and be proactive about providing more details during an outage, they are raising the bar for all cloud providers, and I think that is a really good thing!