Exceeding the SLA–It's about resilience

Last week I was in Miami presenting at Sogeti’s Windows Azure Privilege Club summit. Had a great time, talked with some smart, brave, and generally great people about cloud computing and Windows Azure. But what really struck me was how little was out there about how to properly architect solutions so that they can take advantage of the promise of cloud computing.

So I figured I’d start putting some thoughts down in advance of maybe trying to write a whitepaper on the subject.

What is an SLA?

So when folks start thinking about uptime, the first thing that generally pops to mind is the vendor service level agreement, or SLA.

An SLA, for lack of a better definition, is a contract or agreement that imposes financial penalties if specific metrics are not met. For cloud, these metrics are generally expressed as a percentage of service availability/accessibility during a given period. What an SLA is not is a promise that things will be "up"; it only says that when they aren't, the vendor/provider pays some type of penalty, usually a reimbursement of the fees you paid.

Notice I wrote that as “when” things fail, not if. Failure is inevitable. And we need to start by recognizing this.
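
To put some numbers behind that, here's a quick, purely hypothetical back-of-the-envelope sketch of what an availability percentage and a fee-reimbursement credit actually work out to. The percentages, monthly fee, and credit values below are made up for illustration and aren't tied to any specific vendor's terms.

```python
# A back-of-the-envelope look at what an availability SLA really buys you.
# All figures are hypothetical; check your provider's actual SLA terms.

HOURS_PER_MONTH = 30 * 24  # roughly 720 hours in a billing month

def allowed_downtime_hours(sla_percent):
    """Hours per month the provider can be down without violating the SLA."""
    return HOURS_PER_MONTH * (1 - sla_percent / 100.0)

def sla_credit(monthly_fee, credit_percent):
    """The penalty you'd actually get back, as a fraction of your bill."""
    return monthly_fee * (credit_percent / 100.0)

print(allowed_downtime_hours(99.9))   # ~0.72 hours, about 43 minutes a month
print(allowed_downtime_hours(99.95))  # ~0.36 hours, about 22 minutes a month
print(sla_credit(5000, 10))           # a 10% credit on a $5,000 bill is just $500
```

A few hundred dollars back doesn't do much for you if those 43 minutes happened during your busiest hour of the month. That gap between the credit and the business impact is the whole point of this post.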

What are we after?

With that out of the way, we need to look at what we're after. We're not after "the nines". What we want is to protect ourselves from the losses we could incur when our solutions are not available.

We are looking for protection from:

  • Hardware failures
  • Data corruption (malicious & accidental)
  • Failure of connectivity/networking
  • Loss of Facilities
  • <insert names of any of 10,000 faceless demons here>

And since these types of issues are inevitable, we need to make sure our solution can handle them gracefully. In other words, we need to design our solutions to be resilient.

What is resilience?

To paraphrase the Bing dictionary, resilience is the ability to recover quickly from problems or setbacks.

Namely, we need solutions that can recover from problems on their own. This ability to flex, handle outages, and easily return to full functionality when the underlying issues are resolved is what makes your solution a success, not the SLA your vendor gave you.

If you were Netflix, you'd test this with their appropriately named "chaos monkey".
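
For those who haven't seen it, the idea behind chaos-monkey-style testing is simple enough to sketch: in a test environment, randomly take something out and verify the rest of the system keeps serving. The toy below is only a concept illustration under assumed names (Node, unleash_monkey, cluster_is_healthy); it's not Netflix's actual tool.

```python
import random

# Toy illustration of the chaos monkey idea: randomly terminate an instance
# in a test environment, then check that the cluster still serves traffic.

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def terminate(self):
        self.alive = False

def unleash_monkey(nodes):
    # Pick a random healthy node and take it out.
    victim = random.choice([n for n in nodes if n.alive])
    print(f"Chaos monkey is terminating {victim.name}")
    victim.terminate()

def cluster_is_healthy(nodes):
    # Our hypothetical health rule: at least one node is still serving.
    return any(n.alive for n in nodes)

nodes = [Node(f"worker-{i}") for i in range(3)]
unleash_monkey(nodes)
assert cluster_is_healthy(nodes), "resilience test failed"
```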

How do we create resilient systems?

Now that is an overloaded question and possibly a good topic for someone's doctoral thesis. So I'm not going to answer it in today's blog post. What I'd like to do instead is explore some concepts in future posts. Yes, I know I still need to finish my PHP series. But for now, I can at least set things up.

First off, assume everything can fail. Because at some point or another it will.

Next up, handle errors gracefully. "We're having technical difficulties, please come back later" can be considered an approach to resilience. It's certainly better than a generic 404 or 500 HTTP error.
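
As a rough illustration (not production code), here's what that kind of graceful degradation might look like: retry a flaky dependency a couple of times, then fall back to a friendly message rather than letting a raw error bubble up. The fetch_product_catalog call is a hypothetical stand-in for whatever downstream service you depend on.

```python
import time

class DependencyUnavailable(Exception):
    pass

def fetch_product_catalog():
    # Hypothetical downstream call; here it always fails to simulate an outage.
    raise DependencyUnavailable("catalog service timed out")

def get_catalog_page(retries=2, delay_seconds=1):
    for attempt in range(retries + 1):
        try:
            return fetch_product_catalog()
        except DependencyUnavailable:
            if attempt < retries:
                time.sleep(delay_seconds)  # brief pause before retrying
    # Graceful degradation: a friendly page beats a generic 500.
    return "We're having technical difficulties, please come back later."

print(get_catalog_page())
```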

Lastly, determine what resilience is worth to you. While creating a system that will NEVER go down is conceivably possible, it will likely be cost prohibitive. So you need to clearly understand what you need and what you're willing to pay for.

For now, that's all I really wanted to get off my chest. I'll publish some posts over the next few weeks that focus on some 10,000-foot options for achieving resilience. Maybe after that, we can look at how these apply to Windows Azure specifically.

Until next time!

Detroit Day of Azure Keynote

Keynote is a fancy way of saying "gets to go first". But when my buddy David Giard asked me if I would come to Detroit to support his Day of Azure, I couldn't say no. So we talked a bit, tossed around some ideas, and I settled on a presentation idea I had been calling "When? Where? Why? Cloud?". This presentation isn't technical; it's about helping educate both developers and decision makers on what cloud computing is, how you can use it, what opportunities it presents, and so on. It's a way to start the conversation on cloud.

The session seemed to go pretty well. Not much feedback, but there were lots of nodding heads, a few smiles (hopefully at my jokes), and only one person who seemed to be falling asleep. Not bad for a foggy, drizzly 8am Saturday presentation. So as promised, I've uploaded the presentation here if you'd like to take a look. And if you're here because you were in the session, please leave a comment and let me know what you thought.

A Custom High-Availability Cache Solution

For a project I'm working on, we need a simple, easy-to-manage session state service. The solution needs to be highly available and low latency, but not persistent. Our session caches will also be fairly small (< 5 MB per user). But given that our projected high-end user load could be somewhere in the realm of 10,000-25,000 simultaneous users (not overly large by some standards), we have serious concerns about the quota limits present in today's version of the Windows Azure Caching service.

Now, we looked around: Memcached, Ehcache, MongoDB, NCache, to name a few. And while they all did various things we needed, they also had various pros and cons. Memcached didn't have the high availability we wanted (unless you jumped through some hoops). MongoDB has high availability, but raised issues about deployment and management. Ehcache and NCache, well… more of the same. Add to all of that, anything with an open source license would have to be vetted by the client's legal team before we could use it (a process that can be time consuming for any organization).

So I spent some time coming up with something I thought we could reasonably implement.

The approach

I started by looking at how I would handle the high availability. Taking a note from Azure Storage, I decided that when a session is started, we would assign that session to a partition, and partitions would be assigned to nodes by a controller, with a single node potentially handling multiple partitions (maybe primary for one and secondary for another, all depending on overall capacity levels).

The cache nodes would be Windows Azure worker roles, communicating over internal endpoints (to achieve low latency). Within each cache node will be three processes: a controller process, the session service process, and finally the replication process.

The important one here is the controller process. Since the controller process will attempt to run in all the cache nodes (aka role instances), we're going to use a blob to control which one actually acts as the controller. The process will attempt to lock a blob via a lease and, if successful, will write its name into that blob. It will then load the current partition/node mappings from a simple Azure Storage table (I don't predict us having more than a couple dozen nodes in a single cache) and verify that all the nodes are still alive. It then begins a regular process of polling the nodes via their internal endpoints to check on their capacity.
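
To make the election a bit more concrete, here's a rough sketch of that controller loop. It's pseudocode-level Python under assumed helper names (try_acquire_blob_lease, renew_blob_lease, and friends); these are placeholders for the real blob and table storage client calls, not an actual API.

```python
import socket
import time

LEASE_SECONDS = 30

def try_acquire_blob_lease(blob_name, duration_seconds):
    """Placeholder: would ask blob storage for a lease, returning a lease id or None."""
    return None

def renew_blob_lease(blob_name, lease_id):
    """Placeholder: would renew the lease, returning True while we still hold it."""
    return False

def write_controller_name(blob_name, lease_id, instance_name):
    """Placeholder: would record which instance is currently the controller."""
    pass

def load_partition_map():
    """Placeholder: would read the partition/node mappings from a storage table."""
    return {}

def poll_nodes(partition_map):
    """Placeholder: would hit each node's internal endpoint to check capacity."""
    pass

def run_controller_loop(blob_name="cache-controller"):
    my_name = socket.gethostname()
    while True:
        lease_id = try_acquire_blob_lease(blob_name, LEASE_SECONDS)
        if lease_id is None:
            # Someone else is the controller; check again after the lease window.
            time.sleep(LEASE_SECONDS)
            continue

        # We won the election: advertise ourselves, load the partition map,
        # then keep polling nodes for as long as we can renew the lease.
        write_controller_name(blob_name, lease_id, my_name)
        partition_map = load_partition_map()
        while renew_blob_lease(blob_name, lease_id):
            poll_nodes(partition_map)
            time.sleep(LEASE_SECONDS / 2)
        # Lost the lease (crash, recycle, network hiccup); fall through and let
        # whichever node grabs the lease next take over as controller.
```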

The controller also manages the nodes by tracking when they fall in and out of service and determining which nodes handle which partitions. If a node in a partition fails, it will assign a new node to that partition, making sure the new node is in different fault and upgrade domains from the surviving node. Internally, the two nodes in a partition will then replicate data from the primary to the secondary.
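
Here's a small sketch of how that replacement-node selection might work: prefer nodes whose fault and upgrade domains differ from the surviving node's, so one rack failure or one upgrade pass can't take out both copies of a partition. The Node shape and the capacity threshold are assumptions for illustration; the real controller would pull this data from its polling results.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    fault_domain: int
    upgrade_domain: int
    used_capacity: float  # 0.0 - 1.0

def pick_replacement(surviving, candidates, max_capacity=0.8):
    eligible = [
        n for n in candidates
        if n.fault_domain != surviving.fault_domain
        and n.upgrade_domain != surviving.upgrade_domain
        and n.used_capacity < max_capacity
    ]
    if not eligible:
        return None  # nothing safe to assign; the controller would raise an alert
    # Favor the emptiest node so partitions spread out evenly.
    return min(eligible, key=lambda n: n.used_capacity)

surviving = Node("cache-1", fault_domain=0, upgrade_domain=0, used_capacity=0.4)
candidates = [
    Node("cache-2", 0, 1, 0.2),  # same fault domain as the survivor; skipped
    Node("cache-3", 1, 1, 0.5),
    Node("cache-4", 1, 2, 0.1),
]
print(pick_replacement(surviving, candidates).name)  # cache-4
```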

Now, there will also be a hook in the role instances so that the RoleEnvironment Changing and Changed events will alert the controller process that it may need to rescan. This could be a response to the controller being torn down (in which case the other instances will determine a new controller) or to some nodes being torn down, in which case the controller needs to reassign the partitions that were assigned to those nodes.

This approach should allow us to remain online without interruptions for our end users even while we’re in the middle of a service upgrade. Which is exactly what we’re trying to achieve.

Walkthrough of a session lifetime

So here’s how we see this happening…

  1. The service starts up, and the cache role instances identify the controller.
  2. The controller attempts to load any saved partition data and validates it (scanning the service topology).
  3. The consuming tier checks the blob to get the instance ID of the controller and asks it for a new session ID (and its resulting partition and node instance ID).
  4. The controller determines if there is room in an existing partition or creates a new partition.
  5. If a new partition needs to be created, it locates two new nodes (in separate domains) and notifies them of the assignment, then returns the primary node to the requestor.
  6. If a node falls out (crashes, is being rebooted), the session requestor gets a failure message and goes back to the controller for a new node for that partition (a rough sketch of this client-side flow follows the list).
  7. The controller provides the name of the previously identified secondary node (which is of course now the primary), and also takes on the process of locating a new node.
  8. The new secondary node will contact the primary node to begin replicating its state. The new primary will start sending state change messages to the secondary.
  9. If the controller drops (crash/recycle), the other nodes will attempt to become the controller by leasing the blob. Once established as a controller, it will start over at step 2.
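
Before getting into the limits, here's a rough sketch of what steps 3, 6, and 7 look like from the consuming tier's point of view. The controller and node client objects here are hypothetical placeholders; the real calls would go over the internal endpoints described above.

```python
class NodeUnavailable(Exception):
    pass

def get_session(controller, session_id, max_retries=2):
    # Step 3: ask the controller which node currently owns this session.
    node = controller.get_primary_node(session_id)
    for _ in range(max_retries):
        try:
            return node.read_session(session_id)
        except NodeUnavailable:
            # Steps 6/7: the primary fell out; the controller hands back the
            # former secondary (now primary) and starts locating a new secondary.
            node = controller.report_failure_and_get_new_primary(session_id)
    raise NodeUnavailable(f"no healthy node for session {session_id}")
```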
Limits

So this approach does have some cons. We have to write our own synchronization process and session providers. We also have to build our own aging mechanism to get rid of old session data. However, it's my belief that none of these should be horrible to create, so they're something we can easily overcome.

The biggest limitation here is that because we're going to be managing the in-memory cache ourselves, we might have to get a bit tricky (multi-gigabyte collections in memory), and we're going to need to pay close attention to maximum session size (which we believe we can do).

Now admittedly, we're hoping all this is temporary. There have been public mentions that there's more coming to the Windows Azure Caching service, and we hope that when it arrives we can swap out our custom session provider for one built to leverage whatever the vNext of Azure Caching becomes.

So while not ideal, I think this will meet our needs and do so in a way that's not going to require months of development. But if you disagree, I'd encourage you to sound off via the site comments and let me know your thoughts.

Cloud Computing as a Financial Institution

Ok, I'm sure the title of this post is throwing you a bit, but please bear with me.

I've been travelling the last few weeks. I'm driving, as it's only about 250 miles away. And driving, unlike flying, leaves you with a lot of time to think. This train of thought dawned on me yesterday as I was cruising down I-35 south somewhere between Clear Lake and Ames in northern Iowa.

The conventional thinking

So the common theme you'll often hear in "intro to cloud" presentations is comparing the cloud to a utility. I've done this myself countless times. The story goes that as a business owner, what you need is, say, light. Not electricity, not a generator, not a power grid.

The utility company manages the infrastructure to deliver power to your door, so when you flip the switch, you get the light you wanted. You don't have to worry about how it gets there. Best of all, you only pay for what you use. No need to spend hundreds of millions on building a power plant and the infrastructure to deliver that power to your office.

I really don't have an issue with this comparison. It's easy to relate to and does a good job of illustrating the model. However, what I realized as I was driving is that it's a one-way example. I'm paying a fee and getting something in return, but there's no real trust issue beyond my provider's ability to deliver the service.

Why a financial institution?

Ok, push aside the "occupy" movement and the recent distrust of bankers. A bank is where you put assets for safekeeping. You get various services from the provider (ATM, checking account) that allow you to leverage those assets. You also pay charges for some of those services, while others are free. And you have some insurance in place (FDIC) to help protect your assets.

Lastly, and perhaps most importantly, you need to have a level of trust in the institution. You’re putting your valuables in their care. You either need to trust that they are doing what you have asked them to do, or that they have enough transparency that you can know exactly what’s being done.

What really strikes me about this example is that you have some skin in the game and need a certain level of trust in your provider. Just as you trust the airline to get you to your destination on time, you expect your financial provider to protect your assets and deliver the services they have promised.

It's the same for a cloud provider. You put your data and intellectual property in their hands, or you keep it under your mattress. Their vault is likely more secure than your box spring, but the mattress is what you're familiar with and trust. It's up to you to find a cloud provider you can trust. You need to ask the proper questions to get to that point. Ask for a tour of the vault; audit their books, so to speak. Do your homework.

Do you trust me?

So the point of all this isn't to get a group of hippies camped out on the doorstep of the nearest datacenter. Instead, the idea is to make you think about what you're afraid of, especially when you're considering a public cloud provider. Cloud computing is about trusting your provider, but also about taking responsibility for doing your homework. If you're going to trust someone with your most precious possessions, be sure you know exactly how far you can trust them.

Azure Success Inhibitors

I was recently asked to provide MSFT with a list of our top "Azure Success Inhibitors". After talking with my colleagues and compiling a list, I of course sent it in. It will get discussed, I'm sure, but I figured why not toss this list out for folks to see publicly and, heaven forbid, use the comments area of this blog to provide some feedback on. Just keep in mind that this is a short list and by no means an exhaustive one.

I'd like to publicly thank Rajesh, Samidip, Leigh, and Brian for contributing to the list below.

Startups & Commercial ISVs

Pricing – Azure is competitively priced only on regular Windows OS images. If we move to "high CPU", "high memory", or Linux-based images, Amazon is more competitive. The challenge is getting them to not focus solely on hosting costs; they would also like to see more info on plans for non-Windows OS hosting.

Perception/Culture – Microsoft is still viewed as “the man” and as such, many start-ups still subscribe to the open source gospel of avoiding the established corporate approaches whenever possible.

Cost Predictability – more controls to help protect against cost overruns, as well as easier-to-find and easier-to-calculate fixed-pricing options.

Transition/Confusion – they don't understand the PaaS model well enough to feel comfortable making a commitment, and prefer to keep doing things the way they always have. Concerns over pricing, feature needs, industry pressure, etc. In some cases, it's about not wanting to walk away from existing infrastructure investments.

Enterprise

Trust – SLAs aren't good enough. The continued outages, while minor, still create questions. This also feeds security concerns (SQL Azure encryption, please), which are greatly exaggerated the moment you start asking for forensic evidence in case you need to audit a breach. In some cases, it's just "I don't trust MSFT to run my applications". This is most visible when looking at regulatory/industry compliance (HIPAA, PCI, SOX, etc.).

Feature Parity – The differences in offerings (Server vs. Azure AppFabric, SQL Azure vs. SQL Server) create confusion. They mean loss of control as well as re-education of IT staff. Deployment model differences create challenges for SCM, and monitoring of cloud solutions creates nightmares for established DevOps organizations.

Release Cadence – We discussed this on the DL earlier. I get many clients who want to be able to test things before they get deployed to their environments and also control when they get "upgraded". This relates directly to the #1 trust issue: they just don't trust that things will always work when upgrades happen. As complexity in services and solutions grows, they see this only getting harder to guarantee.

Persistent VMs – SharePoint, a dedicated SQL Server box, Active Directory, and so on. Solutions they work with now that they would like to port. But since they can't run them all in Azure currently, they're stuck doing hybrid models, which drags down the value add of the Windows Azure platform by complicating development efforts.

Common/Shared

Value-Added Services – additional SaaS offerings for email, virus scanning, etc. Don't force customers to build these themselves or locate additional third-party providers. Some of this could be met by a more viable app/service marketplace, as long as it provides integrated billing and some assurance of provider/service quality from MSFT.

Seattle Interactive Conference–Cloud Track

In learning about Windows Azure and the cloud, I've racked up my fair share of debt to the folks who have helped me out, and I always try to make good on my debts. So when Wade Wegner asked me to help raise awareness of the Cloud Experience track at the upcoming Seattle Interactive Conference, I jumped at the chance.

Some of the top Windows Azure experts from Microsoft will be presenting on November 2nd at this conference. People like Wade, Nick, Nathan, and Steve, whom I've come to know and, with the exception of Steve, respect highly (hi Steve! *grin*). You'll also get Rob Tiffany and Scott Densmore. With this lineup, I can guarantee you'll walk away with your head so crammed full of knowledge you'll have a hard time remembering where you parked your car.

Now, registration for this event is usually $350, but if you use promo code "azure 200", you'll get in for only $150! So stop waiting and register!

A rant about the future of Windows Azure

Hello, my name is Brent and I'm a Microsoft fanboy. More pointedly, I'm a Windows Azure fanboy. I even blogged my personal feelings about why Windows Azure represents the future of cloud computing.

Well, this week at the Worldwide Partner Conference, we finally got to see more info on MSFT's private cloud solution. And unfortunately, I believe MSFT is missing the mark. While they are still talking about the Azure Appliance, their "private cloud" is really just virtualization on top of Hyper-V, aka the Hyper-V cloud.

I won't get into debating definitions of what a cloud is; instead I want to focus on what Windows Azure brings to the table: namely a stateless, role-based application architecture model, automated deployment, failover/upgrade management, and reduced infrastructure management.

The Hyper-V cloud doesn't deliver any of these (at least not well). And this, IMHO, is the opportunity MSFT is currently missing. Regardless of the underlying implementations, there is an opportunity here for, lacking a better term, a common application server model: a way for me to take my Azure roles and deploy them either on premises or in the cloud.

I realize I'd still need to manage the Hyper-V cloud's infrastructure. But it seems to me there has to be a happy middle ground where that cloud can automate the provisioning and configuration of VMs and then automate the deployment of my roles onto them. I don't necessarily need WAD monitoring my apps (I should be able to use System Center).

Additionally, having this choice of deployment locales, with the benefits of scale and failover, would be a HUGE differentiator for Microsoft. It's something neither Google nor Amazon has. Outside of a handful of smallish ISV startups, I think VMware and Cisco are the only other outfits that would be able to touch something like this.

I'm certain someone at MSFT has been thinking about this. So why I'm not seeing it on the radar yet just floors me. It is my firm belief that we need a solution for on-premises PaaS, not just another infrastructure management tool. And don't get me wrong, I'm still an Azure fanboy. But I also believe that the benefits Azure brings as a PaaS solution shouldn't be limited to just the public cloud.

Cloud Computing Evolution

I've started my week off with a couple of discussions about cloud computing. The most interesting was with one of my firm's Account Executives. It got me thinking about several emerging trends I've been seeing.

So I figured why not blog about these thoughts before they escape me.

Cloud on the Cloud

There have been a couple cloud platform providers recently that have been delivering PaaS built upon another cloud provider.

The majority of the cloud providers that currently dominate the market already had significant investments in datacenters. In some cases, these providers started out simply trying to optimize management of their own applications. But the real point here is that they either already had a significant investment in datacenters or had deep enough pockets to make one.

The thought here is that as costs continue to climb, it's going to become increasingly difficult for new providers to build their own datacenters and deliver a new generation of cloud services. So we may very well see more and more providers that build their solutions on larger providers' platforms.

Will this new generation of providers eventually move onto their own platforms? Or will they continue to find ways to add value to other providers’ offerings? Only time will tell.

Industry Specific Clouds

This is a message I've been preaching a lot lately. Stop targeting generic solutions… no more assessment and migration offerings. Instead, start looking at crafting industry-specific solutions: financial services, healthcare, public sector, and so on. These customers want a partner that understands their industry-specific challenges.

How much longer can this go on before we start seeing cloud providers that are targeting specific verticals? Some of this has already started to happen with specific solutions. But it’s only a matter of time before we see more generalized cloud providers that have tailored their cloud solutions to specific industries.

I think this approach could have some legs. It would potentially alleviate some of the trust issues organizations have. They'd have a trusted provider they could go to that knows their business challenges and has a cloud solution specific to those needs.

Note: Christian Reilly pointed out to me we already are seeing these.

Cloud Co-ops

The third option is something that I haven’t seen much of since leaving the rural community I grew up in. Namely the notion of a co-op. For those of you not familiar with this, imagine a business that’s owned by its customers and exists primarily to help those customers do things as a larger group that they could never do as efficiently on their own. I can see co-op clouds being created.

Last year, I mentioned such an alliance a couple of times over on my cloud news blog, but other than that, I haven't seen much mention of these. Combine this with the idea of industry-specific clouds, though, and the next thing you know we'll have industry-specific co-op cloud providers.

What’s Next?

Honestly, who knows. I'm just rambling here. There are folks far smarter than me pondering this same question, and I'm sure several of them already have ideas they're working on. But whatever comes next should prove interesting. And I honestly can't wait to see it.

Until next time!

Amazon's Service Outage

As some of you are likely aware, Amazon is currently experiencing a significant outage in several of its services, including EC2, EBS, and Elastic Beanstalk. Is this a strike against cloud computing? Or a cautionary tale against making assumptions about a cloud provider's SLA and your own lack of backup/recovery plans?

While cloud detractors are raising this as a prime example of why the cloud isn't ready for the enterprise, the real truth here is that this outage is a great example of what happens when you put all your faith in a provider and don't plan properly. Your provider may be giving you a financially backed SLA, but is being paid back $2,000 for SLA violations acceptable if you're losing $10,000 an hour due to down services?

Many of the sites/services that are experiencing downtime today were hosted solely at Amazon's Virginia data center and didn't have disaster recovery or failover plans. One high-profile client (I don't have permission, so I won't name names) isn't having the same issue because they were prepared: they followed Amazon's published guidance and have redundant copies of their services sitting ready in other facilities to take the load should something happen.

Does this mean you're doubling your costs? Probably not, as you can (and likely should) have those secondary sites set up at reduced capacity but prepared to quickly expand/scale to handle load. It's ultimately up to your organization to determine, much like purchasing insurance, how much coverage you need and how quickly you need to be able to adjust. And it's precisely this ability to keep the standby footprint small and scale it up when needed that is one of the cost benefits of the cloud. So you could argue that this model actually supports the whole case for cloud computing. A rough comparison is sketched below.
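
To put some very rough numbers on that insurance analogy, here's a hypothetical comparison of what a reduced-capacity standby costs versus what an outage costs without one. Every figure below is made up purely for illustration; plug in your own.

```python
# Back-of-the-envelope comparison: warm standby cost vs. uncovered outage loss.
# All values are hypothetical assumptions for illustration only.

primary_monthly_cost = 10000.0        # full production footprint
standby_fraction = 0.25               # warm standby running at 25% capacity
loss_per_hour_down = 10000.0          # revenue/productivity lost while offline
expected_outage_hours_per_year = 8.0  # your own estimate, not a vendor promise

standby_cost_per_year = primary_monthly_cost * standby_fraction * 12
uncovered_loss_per_year = loss_per_hour_down * expected_outage_hours_per_year

print(f"Warm standby:   ${standby_cost_per_year:,.0f}/year")    # $30,000
print(f"Uncovered loss: ${uncovered_loss_per_year:,.0f}/year")  # $80,000
```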

As cloud evangelists and supporters, it's counseling potential adopters on issues like these that will help us win their confidence when it comes to cloud computing. Even just raising the potential risks can get them to stop looking at us as someone with a sales pitch and instead view us as an informed partner that wants to help them be successful in this new world. Regardless of which platform they are considering, they need to fully understand what SLAs mean, but more importantly, know what the impact to their business is if the vendor violates the SLA, be it for a minute, an hour, or days on end.

Cloud Computing Barriers and Benefits

I stretched myself this week and gave a presentation to the local Cloud Computing User Group that wasn't specifically about Windows Azure but instead a general take on cloud computing adoption. The idea for this arose from a session I conducted at the 2011 Microsoft MVP Summit with two other Windows Azure MVPs that specifically targeted Windows Azure. And since much of my time lately has been spent thinking as much strategically about the cloud as technically, it made sense to put my thoughts out there via a presentation.

Feedback on the presentation was good, and the discussion was, as always, much better. And I promised those in attendance that I'd post the presentation. So here you go!

I apologize for the deck not being annotated. I also realize that this is a subject folks have written entire books on, so it just touches on complex topics. But there's not much else you can do in a 60-90 minute presentation that covers such a broad topic. That said, I'd still appreciate any feedback anyone has on the deck or on my presentation (if you were present yesterday).