Aggregate SLA’s & Windows Azure IaaS

I’d been pondering this a bit anyways, and yesterday actually had a customer ask. So I was forced to dig into this a bit further. And I need for forewarn you, this news may upset a few folks.

If you are using Windows Azure Virtual Machines (IaaS), you are taking two dependencies. Windows Azure Compute (to run the VM’s), and Windows Azure Storage (to persist the state of those VM’s). What this means is that you don’t have a single 99.95% sla, you actually have two SLA’s. And as such, they need to be aggregated since a failure in either, could render your service temporarily unavailable.

Calculating Aggregate SLA’s

Some background information before I get to far down this rabbit hole. When you have a solution that takes on multiple dependencies, the SLA you are providing is an aggregate of the underlying SLA’s. For example…

Two Instances of WA Compute = 99.95% uptime or approximately 263 minutes of downtime per year

Azure Storage = 99.9% uptime or 525 minutes of downtime per year.

This gives us a total possible downtime of 788 minutes or availability of approximately 99.85%.

Since we have multiple dependencies, we need to take the total amount of downtime we could experience when determining what our availability is.

But what about Windows Azure’s 99.95% SLA?

Now this is where things get a bit more… fuzzy. You actually need tor reference the details of the published Windows Azure SLA’s. When we look into the SLA for “Cloud Services, Virtual Machines, and Virtual Network”, we’re after three key terms that factor into the SLA.

“External Connectivity” is bi-directional network traffic over supported protocols such as UDP and TCP that can be sent and received from a public IP address.

“Maximum Connectivity Minutes” is the total accumulated minutes during a billing month for all Internet facing Virtual Machines that have two or more instances deployed in the same Availability Set. Maximum Connectivity Minutes is measured from when at least two Virtual Machines in the same Availability Set have both been started resultant from action initiated by Customer to the time Customer has initiated an action that would result in stopping or deleting the Virtual Machines.

“Connectivity Downtime” is the total accumulated minutes that are part of the Maximum Connectivity Minutes that have no External Connectivity.

So what these three items say is that IF you have two started virtual machines that are in an availability set, you will be able to connect to them 99.95% of the time or we owe you money back. Note I’ve highlighted the word ‘started’. Because if you take an external dependency that causes your virtual machine to stop/crash (aka the dependency on Azure Storage for the VM disks), when an instance of your VM stops, it’s no longer subject to the 99.95% SLA.

So where does this leave us

Ultimately, exactly where we’ve always been. In most cases, your solution is going to take multiple dependencies. And as a result, this simple example get compounded. It also doesn’t change our commitment to you as a customer of Windows Azure. This issue has always been here, even before cloud computing. And for the majority, that extra 0.1% isn’t going to make much of a difference.

Now, if you need really high uptime, you have the math above to help you really understand what level of risk your solution may be taking on. And hopefully can leverage that knowledge to help you design resilient solution architectures that are capable of adjusting to these outages and continuing (even if in a degraded state) to delivery functionality to the end users of those solutions.

 

 

Avoiding the Chaos Monkey

Yesterday I was pleased (and nervous) to be presenting at the Heartland Developers Conference in Omaha, NE. I’ve been hoping to present at this event for a couple years and was really pleased that one of my submissions was accepted. Especially given that the topic was more architect/concept then code. It was only my second time presenting this material and the first time for a non-captive audiance. And given that it was the 2pm slot, and only a handful of people fell asleep or left, I’m pretty pleased with how things went.

I’ve posted the deck for my Avoiding the Chaos Monkey presentation so please feel free to take and reuse. I just ask that you give proper credit and I’d love any feedback on it. I received some great feedback from HDC on the material and will be making some updates that show some real world scenarios and how applying the principles covered in this presentation can address them. I spoke to some of these during the presentation, but agreed with my colleague Eric that it would help to have more concrete and visual examples to drive the message home. I’ve already submitted the talk to two upcoming conferences and hopefully it will get accepted at one. Meanwhile, feel free to snag a copy and drop me a comment with any feedback you have!

You don’t really want an SLA!

I don’t often to editorials (and when I do, they tend to ramble), but I felt I’m due and this is a conversation I’ve been having a lot lately. I sit to talk with clients about cloud and one of the first questions I always get is “what is the SLA”? And I hate it.

The fact is that an SLA is an insurance policy. If your vendor doesn’t provide a basic level of service, you get a check. Not unlike my home owners insurance. If something happens, I get a check. The problem is that most of us NEVER want to have to get that check. If my house burns down, the insurance company will replace it. But all those personal mementos, the memories, the “feel” of the house are gone. So that’s a situation I’d rather avoid. What I REALLY want is safety. So install a fire-alarm, I make sure I have an extinguisher in the kitchen, I keep candles away from drapes. I take measures to help reduce the risk that I’ll need to cash my insurance policy.

When building solutions, we don’t want SLA’s. What we REALLY want is availability. So we as the solution owners need to take steps to help us achieve this. We have to weight the cost vs the benefit (do I need an extinguisher or a sprinkler system?) and determine how much we’re wiling to invest in actively working to achieve our own goals.

This is why when I get asked the question, I usually respond by giving them the answer and immediately jump into a discussion about resiliency. What is a service degradation vs an outage? How can we leverage redundancy? Can we decouple components and absorb service disruptions? These are the types of things we as architects need to start considering, not just for cloud solutions but for everything we build.

I continue to tell developers that the public cloud is a stepping stone. The patterns we’re using in the public cloud are lessons learned that will eventually get applied back on premises. As the private cloud becomes less vapor and more reality, the ability to think in these new patterns is what will make the next generation of apps truly useful. If a server goes down, how quickly does your load balancer see this and take that server out of rotation? How do the servers shift workloads?

When working towards availability, we need to take several things in mind.

Failures will happen – how we deal with them is our choice. We can have the world stop, or we can figure out how to “degrade” our solution to keep anything we can going.

How are we going to recover – when things return to normal, how does the solution “catch up” with what happened during the disruption

the outage is less important than how fast we react – we need to know something has gone wrong before our clients call to tell us

We (aka solution/application architects) really need to start changing the conversation here. We need to steer away from SLA’s entirely and when we can’t manage that at least get to more meaningful, scenario based SLA’s. This can mean instead of saying “the email server will be 99% of the time” we switch to “99% of emails will be transmitted within 5 minutes”. This is much more meaningful for the end users and also gives s more flexibility in how we achieve it. And depending on how traffic.

Anyway, enough rambling for now. I need to get a deck that discusses this ready for a presentation on Thursday that only about 20 minutes ago I realized I needed to do. Fortunately, I have an earlier draft of the session and definitely have the passion and knowhow to make this happen. So time to get cracking!

Until next time!

Session State with Windows Azure Caching Preview

I’m working on a project for a client and was asked to pull together a small demo using the new Windows Azure Caching preview.  This is the “dedicated” or better yet, “self hosted” solution that’s currently available as a preview in the Windows Azure SDK 1.7, not the Caching Service that was made available early last year. So starting with a simple MVC 3 application, I set out to enable the new memory cache for session state. This is only step 1 and the next step is to add a custom cache based on the LocalStorage feature of Windows Azure Cloud Services.

Enabling the self-hosted, in-memory cache

After creating my template project, I started by following the MSDN documentation for enabling the cache co-hosted in my MVC 3 web role. I opened up the properties tab for the role (right-clicking on the role in the cloud service via the Solution Explorer) and moved to the Caching tab. I checked “Enable Caching” and set my cache to Co-located (it’s the default) and the size to 20% of the available memory.

clip_image002

Now because I want to use this for session state, I’m also going to change the Expiration Type for the default cache from “Absolute” to “Sliding”. In the current preview, we only have one eviction type, Least Recently Used (LRU) which will work just fine for our session demo. We save these changes and take a look at what’s happened with the role.

There are three changes that I could find:

  • · A new module, Caching, is imported in the ServiceDefinition.csdef file
  • · A new local resource “Microsoft.WindowsAzure.Plugins.Caching.FileStore” is declared
  • · Four new configuration settings are added, all related to the cache: NamedCaches (a JSON list of named caches), LogLevel, CacheSizePercentage, and ConfigStoreConnectionString

Yeah PaaS! A few options clicked and the Windows Azure Fabric will handle spinning up the resources for me. I just have to make the changes to leverage this new resource. That’s right, now I need to setup my cache client.

Note: While you can rename the “default” cache by editing the cscfg file, the default will always exist. There’s currently no way I found to remove or rename it.

Client for Cache

I could configure the cache manually, but folks keep telling me to I need to learn this NuGet stuff. So lets do it with the NuGet packages instead. After a bit of fumbling to clean up a previously botched NuGet install fixed (Note: must be running VS at Admin to manage plug-ins), I right-clicked on my MVC 3 Webrole and selected “Manage NuGet Packages”, then following the great documentation at MSDN, searched for windowsazure.caching and installed the “Windows Azure Caching Preview” package.

This handles updating my project references for me, adding at least 5 of them that I saw at a quick glance, as well as updating the role’s configuration file (the web.config in my case) which I now need to update with the name of my role:

<dataCacheClientname=default>
<autoDiscoverisEnabled=trueidentifier=WebMVC />
<!–<localCache isEnabled=”true” sync=”TimeoutBased” objectCount=”100000″ ttlValue=”300″ />–>
</dataCacheClient>

Now if you’re familiar with using caching in .NET, this is all I really need to do to start caching. But I want to take another step and change my MVC application so that it will use this new cache for session state. This is simply a matter of replacing the default provider “DefaultSesionProvider” in my web.config with the AppFabricCacheSessionStoreProvider. Below are both for reference:

Before:

     <addname=DefaultSessionProvider
          type=System.Web.Providers.DefaultSessionStateProvider, System.Web.Providers, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35
          connectionStringName=DefaultConnection
          applicationName=/ />

After:

<addname=AppFabricCacheSessionStoreProvider
type=Microsoft.Web.DistributedCache.DistributedCacheSessionStateStoreProvider, Microsoft.Web.DistributedCache
cacheName=default
useBlobMode=true
dataCacheClientName=default />

Its important to note that I’ve set the cacheName attribute to match the name of the named cached I set up previously, in this case “default”. If you set up a different named cache, set the value appropriately or expect issues.

But we can’t stop there, we also need to update the sessionState node’s attributes, namely mode and customProvider as follows:

<sessionStatemode=CustomcustomProvider=AppFabricCacheSessionStoreProvider>

Demo Time

Of course, all this does nothing unless we have some code that shows the functionality at work. So let’s increment a user specific page view counter. First, I’m going to go into the home controller and add in some (admittedly ugly) code in the Index method:

// create the session value if we’re starting a new session
if (Session.IsNewSession)
Session.Add(“viewcount”, 0);
// increment the viewcount
Session["viewcount"] = (int)Session["viewcount"] + 1;// set our values to display
ViewBag.Count = Session["viewcount"];
ViewBag.Instance = RoleEnvironment.CurrentRoleInstance.Id.ToString();

The first section just sets up the session value and handles incrementing them. The second block pulls the value back out to be displayed. And then alter the associated Index.cshtml page to render the values back out. So just insert the following wherever you’d like it to go.

Page view count: @ViewBag.Count<br />
Instance: @ViewBag.Instance

Now if we’ve done everything correctly, you’ll see the view count increment consistently regardless of which instance handles the request.

Session.Abandon

Now there’s some interesting stuff I’d love to dive into a bit more if I had time, but I don’t today. So instead, let’s just be happy with the fact that after more than 2 years, Windows Azure finally has “built in” session provider that is pretty darned good. I’m certain it still has its capacity limits (I haven’t tried testing to see how far it will go yet), but to have something this simple we can use for most projects is simply awesome. If you want my demo, you can snag it from here.

Oh, one last note. Since Windows Azure Caching does require Windows Azure Storage to maintain some information, don’t forget to update the connection string for it before you deploy to the cloud. If not, you’ll find instances may not start properly (not the best scenario admittedly). So be careful.

Until next time!

Follow

Get every new post delivered to your Inbox.

Join 934 other followers