You don’t really want an SLA!

I don’t often do editorials (and when I do, they tend to ramble), but I felt I was due, and this is a conversation I’ve been having a lot lately. I sit down to talk with clients about cloud, and one of the first questions I always get is “What is the SLA?” And I hate it.

The fact is that an SLA is an insurance policy. If your vendor doesn’t provide a basic level of service, you get a check. It’s not unlike my homeowner’s insurance: if something happens, I get a check. The problem is that most of us NEVER want to have to get that check. If my house burns down, the insurance company will replace it. But all those personal mementos, the memories, the “feel” of the house are gone. So that’s a situation I’d rather avoid. What I REALLY want is safety. So I install a fire alarm, I make sure I have an extinguisher in the kitchen, I keep candles away from drapes. I take measures to help reduce the risk that I’ll need to cash in my insurance policy.

When building solutions, we don’t want SLAs. What we REALLY want is availability. So we as the solution owners need to take steps to help us achieve this. We have to weigh the cost vs. the benefit (do I need an extinguisher or a sprinkler system?) and determine how much we’re willing to invest in actively working to achieve our own goals.

This is why, when I get asked the question, I usually respond by giving them the answer and immediately jumping into a discussion about resiliency. What is a service degradation vs. an outage? How can we leverage redundancy? Can we decouple components and absorb service disruptions? These are the types of things we as architects need to start considering, not just for cloud solutions but for everything we build.

I continue to tell developers that the public cloud is a stepping stone. The patterns we’re using in the public cloud are lessons learned that will eventually get applied back on premises. As the private cloud becomes less vapor and more reality, the ability to think in these new patterns is what will make the next generation of apps truly useful. If a server goes down, how quickly does your load balancer see this and take that server out of rotation? How do the servers shift workloads?
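
To make the load balancer question concrete, here is a minimal sketch of a pool that pulls a server out of rotation after a run of failed health checks. The `Rotation` class, the `unhealthy_threshold` setting, and the detection-time helper are all hypothetical names for illustration, not the behavior of any particular load balancer.

```python
from dataclasses import dataclass, field


@dataclass
class Rotation:
    """Toy pool: a server leaves rotation after `unhealthy_threshold`
    consecutive failed health checks (hypothetical policy)."""
    servers: list[str]
    unhealthy_threshold: int = 3
    failures: dict[str, int] = field(default_factory=dict)

    def record_check(self, server: str, healthy: bool) -> None:
        # A passing check resets the counter; a failing one increments it.
        self.failures[server] = 0 if healthy else self.failures.get(server, 0) + 1

    def in_rotation(self) -> list[str]:
        # Only servers below the failure threshold receive traffic.
        return [s for s in self.servers
                if self.failures.get(s, 0) < self.unhealthy_threshold]


def worst_case_detection_seconds(interval_s: float, threshold: int) -> float:
    # A server can die just after a passing check, so it takes roughly
    # `threshold` further check intervals before it is pulled
    # (check timeouts are ignored in this simplified model).
    return interval_s * threshold
```

With a 10-second check interval and a threshold of 3, this model says a dead server can keep receiving traffic for about 30 seconds, which is the number worth discussing alongside any SLA.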

When working towards availability, we need to keep several things in mind.

Failures will happen – how we deal with them is our choice. We can have the world stop, or we can figure out how to “degrade” our solution to keep whatever we can going.

How are we going to recover – when things return to normal, how does the solution “catch up” with what happened during the disruption?

The outage is less important than how fast we react – we need to know something has gone wrong before our clients call to tell us.
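
The three points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DegradableSender` and the `send` callable are hypothetical, and a real system would use a durable queue rather than an in-memory one.

```python
from collections import deque


class DegradableSender:
    """Sends messages through `send`; while the dependency is failing,
    queues them instead of failing the caller, then replays on recovery."""

    def __init__(self, send):
        self.send = send        # hypothetical callable; raises ConnectionError when down
        self.backlog = deque()  # work accepted during the disruption

    def submit(self, message) -> str:
        try:
            self.send(message)
            return "sent"
        except ConnectionError:
            # Degrade: accept the work now, deliver it later.
            self.backlog.append(message)
            return "queued"

    def catch_up(self) -> int:
        """Replay the backlog in order; stop at the first failure."""
        delivered = 0
        while self.backlog:
            try:
                self.send(self.backlog[0])
            except ConnectionError:
                break
            self.backlog.popleft()
            delivered += 1
        return delivered
```

The length of `backlog` doubles as a cheap signal for the third point: if it starts growing, something has gone wrong, and we can know that before a client calls.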

We (aka solution/application architects) really need to start changing the conversation here. We need to steer away from SLAs entirely and, when we can’t manage that, at least get to more meaningful, scenario-based SLAs. This can mean that instead of saying “the email server will be up 99% of the time” we switch to “99% of emails will be transmitted within 5 minutes”. This is much more meaningful for the end users and also gives us more flexibility in how we achieve it. And depending on how traffic.
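
A scenario-based target like the email one is also easy to measure. Here is a rough sketch, assuming we already collect a delivery delay in seconds for each message; the function names and the sample data in the usage note are made up for illustration.

```python
def fraction_within(delays_s, limit_s):
    """Share of deliveries that completed within `limit_s` seconds."""
    if not delays_s:
        raise ValueError("no deliveries to measure")
    return sum(1 for d in delays_s if d <= limit_s) / len(delays_s)


def meets_target(delays_s, limit_s=300, target=0.99):
    # Defaults mirror the "99% of emails within 5 minutes" example.
    return fraction_within(delays_s, limit_s) >= target
```

For instance, `meets_target([10] * 99 + [900])` passes (99 of 100 messages made it within 5 minutes), while two slow messages out of 100 would miss the target.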

Anyway, enough rambling for now. I need to get a deck that discusses this ready for a presentation on Thursday that I realized only about 20 minutes ago I needed to do. Fortunately, I have an earlier draft of the session and definitely have the passion and know-how to make this happen. So time to get cracking!

Until next time!

5 Responses to You don’t really want an SLA!

  1. smarx says:

    You can’t design a resilient solution if you don’t know what to expect from the platform you’re building on. For a simplified example, if you want 99.99% uptime, how many servers do you need? You can’t answer that without knowing what the uptime of an individual server will be.

    There are a few ways to figure out what to expect from a service, but the service level agreement (SLA) is a good way to get some confidence about a lower bound. The SLA is essentially the service provider saying, “I bet $X that we’ll have at least this level of service.” This economic trick forces the service provider to express a lower bound on their true confidence in a certain level of service.

    tl;dr The service level is something you actually need to know, and the SLA lets you know (a lower bound of) it.

    • Brent says:

      I think we agree, Steve. The rambling point I was trying to make was that what folks ask for is the SLA, when what they really want is availability. At least from a business standpoint. As architects, it’s our job to figure out how to achieve that using the tools we have available.

      As for the math on how much redundancy we need to achieve our availability goals, this gets into probability math that was never really my strong suit. In fact, I’ve actively avoided it since graduating college. 🙂

      • smarx says:

        Cool. I suspected that underneath we agreed. It’s really only the title I don’t agree with.

        Don’t be scared of the probability! The probability of a bunch of (independent) bad things happening at once is just the product of the probabilities of each one.

        If the probability of a single server being down at any given time is p, then the probability of two servers being down at the same time is p^2. For three servers, it’s p^3, etc. (This assumes an overly simplistic model where the servers’ uptimes are completely independent. This ignores the possibility of both failing for a common reason, like a whole datacenter failing.)

        To work through an example: if a single server is up 90% of the time, that means it’s down 10% of the time. If you have two servers, they’ll both be down 10%^2 of the time. 10%*10% = 1%, so they’ll have a combined uptime of 99%. Add another server to get to 99.9%. Easy!

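        Here’s the equation in terms of uptime, where p is now the uptime of a single server (so 1-p is the downtime used above), n is the number of servers, and x is the combined uptime:

        x = 1 - (1-p)^n

        For our example: 1 - (1-.9)^3 = 1 - .1^3 = 1 - .001 = 0.999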

        To solve for n if you know p:

        n = log(1-x) / log(1-p)

        (That’s “log base (1-p) of (1-x)”, written as a ratio of ordinary logs. Round up to the next whole server.)

        Here’s Wolfram Alpha telling us we need three 90%-uptime servers to get 99.9% uptime.
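
The arithmetic in this comment can be sketched in Python as follows. The function names are hypothetical, `p_up` is the uptime of a single server, and the model keeps the same simplifying assumption that server failures are completely independent.

```python
import math


def combined_uptime(p_up: float, n: int) -> float:
    """Uptime of n independent servers when any one being up is enough."""
    if not 0.0 <= p_up <= 1.0 or n < 1:
        raise ValueError("need 0 <= p_up <= 1 and n >= 1")
    # All n servers are down at once with probability (1 - p_up) ** n.
    return 1.0 - (1.0 - p_up) ** n


def servers_needed(p_up: float, target: float) -> int:
    """Smallest n such that combined uptime reaches `target`."""
    if not (0.0 < p_up < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_up and target must be strictly between 0 and 1")
    n = math.log(1.0 - target) / math.log(1.0 - p_up)
    # Round away floating-point noise before taking the ceiling, so an
    # exact answer of 3 does not get bumped to 4.
    return max(1, math.ceil(round(n, 9)))
```

Calling `servers_needed(0.9, 0.999)` gives 3, matching the worked example, and `combined_uptime(0.9, 3)` comes out at 0.999 up to floating-point rounding.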

      • Brent says:

        This is my eyes glazing over. 🙂

        The title was admittedly a bit intentionally provocative. Get folks thinking and all that. The key point I’m after is to truly get what you want, which IMHO is availability, not an insurance policy.

