A Follow up To: Are Cloud Failures a Possibility?
In my previous blog article, titled Are Cloud Failures a Possibility?, I discussed the fact that simply deploying to the cloud does not absolve you of your responsibilities as a business owner or IT professional. As the whole world knows, there was a Windows Azure outage on February 29th, 2012. Whatever your own responsibilities, Microsoft, as a provider of cloud services, must demonstrate a commitment to its product and accept responsibility for the outage. Wait! This outage does have a silver lining. How, you might ask? After all, this is mission critical stuff we are talking about here. The answer lies in Microsoft’s response to the Windows Azure outage. If you use Windows Azure or any of Microsoft’s Software as a Service (SaaS) products, there are 4 outcomes you need to know about: Transparency, A Service Credit, A Fix, and A Commitment.
#1 - Transparency
In today’s world, global companies often live under a microscope. Microsoft is no exception; its every move is under public scrutiny. Microsoft has brought transformational technology that has changed our world in the last century. Despite all of its contributions, Microsoft has a big target painted on its back, and I guess that is the price you pay for being a global company. Having grown up during the rise of the personal computer, I am a huge proponent of Microsoft. In my view, Microsoft has repeatedly demonstrated social, consumer, and business responsibility.
During the outage, information was scarce. Unfortunately, when money and careers are on the line, it can bring out the worst in people. If I have one criticism of Microsoft, it is that they did a poor job of communicating during the outage. I suspect Microsoft did not communicate because they simply did not want misinformation floating around in the public eye. But absolute honesty and transparency will reap big rewards from customers, especially when tensions are high.
That being said, I want to point out that Microsoft, after having the opportunity to do a post mortem, has now provided full transparency on the outage. Bill Laing, Microsoft’s Corporate Vice President for Server & Cloud, reported on the Windows Azure public blog the details of what caused the disruption on February 29th. He also gave a full account of the corrective actions and lessons learned from the service disruption experienced by customers. You can read the full account here: Windows Azure Cloud Outage Full Account
#2 - A Service Credit
In a strong commitment to its customers and the Windows Azure Platform, Microsoft has proactively issued a service credit to customers of the affected Windows Azure services, whether they were impacted by the outage or not. A 33% credit is being issued to these customers worldwide for the entire affected billing month(s).
“Microsoft recognizes that this outage had a significant impact on many of our customers. We stand behind the quality of our service and our Service Level Agreement (SLA), and we remain committed to our customers. Due to the extraordinary nature of this event, we have decided to provide a 33% credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted. These credits will be applied proactively and will be reflected on a billing period subsequent to the affected billing period. Customers who have additional questions can contact support for more information.”
I think this clearly demonstrates an extraordinary level of accountability by Microsoft. While I am sure there will be naysayers who call this a drop in the bucket, not many global companies stand by their product with the level of accountability and financial responsibility that Microsoft has shown in this situation.
#3 - A Fix
At the core, Microsoft discovered a software bug in the way that a certificate is created by a Guest Agent to communicate with the Host Agent in the substructure of the Windows Azure Fabric Controller. The Guest Agent creates a certificate that is valid for one year. On the day in question, there was an unusual scenario: it happened to be February 29th, 2012, the leap day of a leap year. The code simply added one to the year and did not take the leap year into account. One year added to February 29th, 2012 yields February 29th, 2013, which is an invalid date because 2013 is not a leap year. As a result, the certificate creation failed. The way Windows Azure handled the Guest Agent's failure to create a valid certificate, the processes Windows Azure executes in the event of a failure, and the efforts of Microsoft engineers to provide an expedited fix all exacerbated the issue and resulted in a longer outage.
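The date arithmetic described above can be illustrated with a minimal Python sketch. This is not Microsoft's code; the function names naive_add_one_year and safe_add_one_year are hypothetical, and falling back to February 28th is just one reasonable assumption about how to handle the leap day.

```python
from datetime import date


def naive_add_one_year(start: date) -> date:
    """Bump only the year field. Raises ValueError when the result is not a real date."""
    return start.replace(year=start.year + 1)


def safe_add_one_year(start: date) -> date:
    """Add one year, falling back to Feb 28 when the target year has no Feb 29."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb 29 can fail here, so clamp to the last valid day of February.
        return start.replace(year=start.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_add_one_year(leap_day)
except ValueError as err:
    # February 29th, 2013 does not exist, so the naive approach fails.
    print("naive approach failed:", err)

print("safe approach:", safe_add_one_year(leap_day))  # safe approach: 2013-02-28
```

The design choice to clamp to February 28th keeps the validity period at or just under one year; rolling forward to March 1st would be an equally defensible alternative.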
In the aftermath, Microsoft has done a full root cause analysis covering the bug and all related issues. One statement by Bill Laing sums up the truth of cloud computing and, I think, of computing and IT in general:
“The three truths of cloud computing are: hardware fails, software has bugs and people make mistakes. Our job is to mitigate all of these unpredictable issues to provide a robust service for our customers. By understanding and addressing these issues we will continue to improve the service we offer to our customers.”
A fix has been deployed for the original software bug, and beyond that Microsoft has taken a number of steps to improve communication with customers and to change its procedures should there ever be a future outage. George Santayana, a Spanish philosopher, essayist, poet, and novelist, coined the famous saying:
“Those who cannot remember the past are condemned to repeat it.”
This famous saying has had a number of variants and paraphrases. It is clear that Microsoft is committed to learning from this outage and also to taking action to ensure that the missteps that caused it are not repeated.
#4 - A Commitment
Results 1 through 3 represent something that you, as a consumer of cloud services, must have from a provider of cloud services: a commitment to those services. When you choose a cloud provider, you expect a Service Level Agreement (SLA) for those services. You need assurance that, when you choose Windows Azure, Microsoft will live up to that SLA. The fact is that every major cloud provider has outages. This was not the first outage and it certainly will not be the last, but Microsoft has demonstrated a true commitment to its Cloud Platform and the customers that use it by providing honesty and transparency, giving a financial credit to all customers of the affected services, and implementing measures to keep this problem from happening again.
The Bottom Line
Microsoft’s handling of this outage clearly demonstrates its commitment to its customers and the Windows Azure Platform. Given that commitment and accountability, I think it is safe to say that you can trust your mission critical applications to Microsoft and Windows Azure. However, don’t forget your responsibilities when deploying to the Cloud. As Bill Laing said:
“Hardware fails, software has bugs and people make mistakes.”
Like Microsoft, you too have to plan for failure and mitigate the unpredictable issues that can occur whether you are in the Cloud or On-Premise. Happy Computing!
Related articles to this topic:
On the Recent Windows Azure Leap Day Outage
David Pallmann - Neudesic
Another Cloud Outage: Insight and Reactions
Kyle Hilgendorf - Gartner - Article #1
Azure Outage: Customer Insights a Week Later
Kyle Hilgendorf - Gartner - Article #2
Azure Root Cause Analysis: Transparency Wins
Kyle Hilgendorf - Gartner - Article #3