May 3, 2013
I’d been pondering this a bit anyway, and yesterday a customer actually asked, so I was forced to dig into it a bit further. And I need to forewarn you, this news may upset a few folks.
If you are using Windows Azure Virtual Machines (IaaS), you are taking two dependencies: Windows Azure Compute (to run the VMs) and Windows Azure Storage (to persist the state of those VMs). What this means is that you don’t have a single 99.95% SLA, you actually have two SLAs. And as such, they need to be aggregated, since a failure in either could render your service temporarily unavailable.
Calculating Aggregate SLAs
Some background information before I get too far down this rabbit hole. When you have a solution that takes on multiple dependencies, the SLA you are providing is an aggregate of the underlying SLAs. For example…
Two Instances of WA Compute = 99.95% uptime or approximately 263 minutes of downtime per year
Azure Storage = 99.9% uptime or approximately 525 minutes of downtime per year
This gives us a total possible downtime of 788 minutes or availability of approximately 99.85%.
Since we have multiple dependencies, we need to take into account the total amount of downtime we could experience across all of them when determining what our availability is.
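For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. The function names are my own, and it assumes a 365-day year (525,600 minutes) and that an outage in either dependency takes the service down. It also shows the alternative of multiplying the availabilities together, which assumes the failures are independent; for SLAs this close to 100% the two approaches give nearly the same answer.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes(availability_pct):
    """Allowed downtime per year, in minutes, for a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def aggregate_by_summed_downtime(slas_pct):
    """Worst case: outages never overlap, so every dependency's downtime adds up."""
    total_downtime = sum(downtime_minutes(p) for p in slas_pct)
    return 100 * (1 - total_downtime / MINUTES_PER_YEAR)


def aggregate_by_product(slas_pct):
    """Alternative: treat failures as independent and multiply the availabilities."""
    result = 1.0
    for p in slas_pct:
        result *= p / 100
    return 100 * result


if __name__ == "__main__":
    slas = [99.95, 99.9]  # Compute (two instances) and Storage, as in the example
    for sla in slas:
        print(f"{sla}% -> {downtime_minutes(sla):.1f} minutes of downtime per year")
    print(f"Summed downtime: {aggregate_by_summed_downtime(slas):.4f}%")
    print(f"Product:         {aggregate_by_product(slas):.4f}%")
```

Unrounded, the two downtime figures come out to 262.8 and 525.6 minutes, for 788.4 minutes in total, which is where the approximately 99.85% comes from.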
But what about Windows Azure’s 99.95% SLA?
Now this is where things get a bit more… fuzzy. You actually need to reference the details of the published Windows Azure SLAs. When we look into the SLA for “Cloud Services, Virtual Machines, and Virtual Network”, we’re after three key terms that factor into the SLA.
“External Connectivity” is bi-directional network traffic over supported protocols such as UDP and TCP that can be sent and received from a public IP address.
“Maximum Connectivity Minutes” is the total accumulated minutes during a billing month for all Internet facing Virtual Machines that have two or more instances deployed in the same Availability Set. Maximum Connectivity Minutes is measured from when at least two Virtual Machines in the same Availability Set have both been started resultant from action initiated by Customer to the time Customer has initiated an action that would result in stopping or deleting the Virtual Machines.
“Connectivity Downtime” is the total accumulated minutes that are part of the Maximum Connectivity Minutes that have no External Connectivity.
So what these three items say is that IF you have two started virtual machines in an availability set, you will be able to connect to them 99.95% of the time or we owe you money back. Note the emphasis on the word ‘started’: if an external dependency (such as the dependency on Azure Storage for the VM disks) causes your virtual machine to stop or crash, that stopped instance is no longer subject to the 99.95% SLA.
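To make the three terms a bit more concrete, here is a rough Python sketch of how they could combine into a monthly uptime figure. The formula (maximum minutes less downtime, divided by maximum minutes) is my reading of the definitions above rather than a quote from the SLA, and the function name and the sample numbers are made up for illustration, so check the published SLA for the exact wording.

```python
def monthly_connectivity_uptime_pct(max_connectivity_minutes, connectivity_downtime):
    """Uptime percentage from the two SLA terms (assumed formula, see note above)."""
    if max_connectivity_minutes <= 0:
        raise ValueError("max_connectivity_minutes must be positive")
    if not 0 <= connectivity_downtime <= max_connectivity_minutes:
        raise ValueError("connectivity_downtime must be between 0 and the maximum")
    return 100 * (max_connectivity_minutes - connectivity_downtime) / max_connectivity_minutes


if __name__ == "__main__":
    SLA_TARGET_PCT = 99.95
    minutes_in_month = 30 * 24 * 60  # 43,200 minutes in a hypothetical 30-day month

    for downtime in (20, 30):  # hypothetical minutes without External Connectivity
        uptime = monthly_connectivity_uptime_pct(minutes_in_month, downtime)
        status = "meets" if uptime >= SLA_TARGET_PCT else "misses"
        print(f"{downtime} min down -> {uptime:.4f}% ({status} {SLA_TARGET_PCT}%)")
```

In a 30-day month, 99.95% leaves room for only about 21.6 minutes without connectivity.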
So where does this leave us?
Ultimately, exactly where we’ve always been. In most cases, your solution is going to take multiple dependencies, and as a result this simple example gets compounded. It also doesn’t change our commitment to you as a customer of Windows Azure. This issue has always been here, even before cloud computing. And for the majority, that extra 0.1% isn’t going to make much of a difference.
Now, if you need really high uptime, you have the math above to help you understand what level of risk your solution may be taking on. And hopefully you can leverage that knowledge to design resilient solution architectures that are capable of adjusting to these outages and continuing (even if in a degraded state) to deliver functionality to the end users of those solutions.