Attempting to define “IOT”

NOTE: the following represents my own opinions and should NOT be considered an official viewpoint for any person or organization. These opinions are also still in flux, so if you ask me again in a month, they’re likely to have changed.

Jason, a friend and colleague, recently posted an update on his blog titled “What is Internet of Things”. In his post, he calls out a couple of great potential scenarios, but I called him out for not really defining IOT. He attempted to counter with what is, in my opinion, a marketing blurb. I like and respect Jason, so much of this back and forth was good-natured colleague rib-poking. But I realized I shouldn’t be poking at his attempt without making my own.

How do you define the undefinable?

To start with, I want to call out that attempting to define “IOT” is like attempting to define “the cloud”. Over time, “cloud” has settled on a definition that revolves around a collection of attributes: scalable, self-service, pay for what you use, and internet accessible. These still fluctuate greatly, and to some degree depend on the viewpoint of the consumer of “cloud” based services. But it gave us a starting point.

Using that as an example, I think we could define IOT solutions as also requiring a set of attributes:

Things – A ‘thing’ is a specialized, autonomous piece of technology that is capable of performing an action but does not possess a traditional human user interface. A phone or computer is a device. A security hand scanner, motion sensor, or even the GPS sensor in a phone could be considered a “thing”.

Data – Our things, being highly specialized, will have limited capabilities. But one of these will be the ability to report on the information they gather so something else can process the data, turning it into information. They *may* also be able to receive commands to alter their function (e.g. changing the sampling rate from 1s to 5s). In many cases, we can likely expect both.

Connectivity – Moving that data in and out of “the things” requires some type of connection. This could be a cellular network, location specific wi-fi, etc… In part, this refers to how the “things” interact or how the data is gathered. The connection can be persistent (always on), or transient (on/off as required), but if you have to do stuff like plug a device into it or move data via USB or SD, we’re lacking the “internet” part of IOT. A big discussion point here is whether the “thing” is directly connected, or requires assistance to connect (like the GPS sensors mentioned above).

Management – This attribute is how we track the “things”. Are they curated and highly managed (likely in a factory scenario)? Or are they anonymous and unmanaged (open sourced climate telemetry gathering)? This also covers questions like how I identify the device and separate it from “rogue” devices.

So with a set of initial attributes identified, the next step is to use them to define some scenarios.

A Factory Scenario

So Jason’s post calls out a manufacturing scenario. I’m careful to say “a” scenario, because our definition above allows for a nearly infinite set of scenarios. But here’s one possibility: a single factory that makes widgets. For this scenario, we can identify our IOT attributes.

The things: each piece of machinery has 3 sensors on it (power monitoring, pieces made, and operating temperature) and one actuator (power control).

Data: The sensors each capture a specific measurement every 1 second and report it every 5 seconds while the machine is running to a local controller.

Connectivity: The sensors are wired to a “controller” that’s mounted on the machine. This in turn is connected into the factory’s private wi-fi network. While the controller can take action on the data, it mostly just displays it and hands it back up to a central service on the network.

Management: When a machine is installed in the factory, there’s a step where the machine (and its sensors) are connected to the network, and “registered” with the factory’s services. Additionally, since the sensors are hard-wired to the machine “controller”, it has information about the sensors it can in turn share up to the factory services.

So we have a basic scenario that has all four attributes, and meets our basic criteria. The factory is capable of determining when the data is trending in the wrong direction (perhaps power consumption is increasing while the number of pieces being produced is dropping). It can then take action, like telling the machine to shut down because a technician is on the way. We can also collect and trend the data, mining it over time.
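The kind of check the factory service performs here could be sketched roughly as follows. This is only an illustration in PowerShell: the field names (PowerWatts, PiecesMade), the sample values, and the function name Test-DegradingTrend are all made up for the example, and a real service would want something more robust than comparing the first and last reports.

```powershell
# Hypothetical sketch: flag a machine whose power draw rises while its output falls.
# Field names and values are invented for illustration.
function Test-DegradingTrend {
    param([object[]]$Reports)

    # We need at least two reports before we can talk about a trend.
    if ($Reports.Count -lt 2) { return $false }

    $first = $Reports[0]
    $last  = $Reports[$Reports.Count - 1]

    # Degrading = power went up AND pieces produced went down.
    return (($last.PowerWatts -gt $first.PowerWatts) -and
            ($last.PiecesMade -lt $first.PiecesMade))
}

$reports = @(
    [pscustomobject]@{ PowerWatts = 1000; PiecesMade = 50 },
    [pscustomobject]@{ PowerWatts = 1100; PiecesMade = 47 },
    [pscustomobject]@{ PowerWatts = 1250; PiecesMade = 41 }
)

if (Test-DegradingTrend -Reports $reports) {
    Write-Host "Trend is degrading: schedule a technician and consider powering down."
}
```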

But IOT also creates some common challenges.

Ingestion of telemetry: If I only have 100 machines, this isn’t a big deal. But what if I’m in a scenario where I have several thousand or hundreds of thousands? How do I scale my factory services to ingest that many connections and messages?

Device Management: Sensors fail and have to get replaced. As machines fail, they may be parted out. So a sensor that fails may be replaced with a sensor that used to be on a different machine. So I constantly need to be able to track the relationships.

Connectivity: What happens if the factory wi-fi goes down? Or worse yet a visitor to the factory taps into the network and starts sending rogue messages that alter my factory data? What if they send commands to the machines that cause them to overheat?

It’s these challenges that all the vendors are racing to solve with their various IOT Solutions!

What are the solutions?

I recently saw slides from a session given by Alessandro Bassi at the M2M+ Industry Summary in Milano, Italy. In this session, he calls out that innovation for its own sake will usually fail. It needs to be supported by a good business model. So the vendors in this space are all attempting to offer their business solutions.

In some cases, these solutions are highly targeted. The tech startup Nest is a good example of this. It has a thing, the thing gathers data, it’s connected and transmits the data, and it’s managed (you register your device). In this case, the vendor is trying to solve a specific problem, driving down home energy costs, and marketing this solution to consumers is their business model.

In other cases, the solutions are industry focused, with the vendor providing a collection of services that work together in a more flexible way to help drive larger, more strategic initiatives. In the link I just shared, it’s a collection of solutions targeting hardware and software to help with factory scenarios like the one I listed above. This solution is positioned for organizations in that industry vertical, to address the requirements of that industry.

And lastly, we have services like the ones I’ve been working with lately: ISS, Service Bus, Project Orleans, and HDInsight. These are more building block oriented. They aren’t specific to an industry or even a set of scenarios. They are meant to allow higher level, more encompassing solutions to be created. These get marketed to either software vendors looking to build out commercial or consumer solutions, or to organizations that need to build out customized solutions for internal use.

Each approach has pros and cons, but then they are also targeting different business models. So it’s about picking the approach that best addresses your needs. Meanwhile, the vendors are all looking to capitalize on the market and find a way to sell their solution to solve the “IOT problem”.

Summary – have we accomplished anything?

This was the first time I’ve really put any of these thoughts down. Looking back over the typos and grammar errors, I have to ask if I’ve accomplished anything. I did set down a rough definition for what I think “IOT” is. This also allows me to call out some of the common challenges to related scenarios and ultimately even call out the types of solutions the industry is offering. So I guess you could say I have defined IOT.

But I can’t help but think the practical reality is more difficult. The definition of IOT will continue to evolve and there will always be shades of gray. So it’s important to keep in mind that it means different things to different people. With that in mind, I think I’ll stick to my “remain calm and ask questions” approach, and when someone comes to me with “an IOT scenario”, my first response will always be “so tell me about it”. The scenario and its challenges will always trump an arbitrary terminology definition. And in the end, it’s the solution and the business value that it brings that really matter more than the industry buzzword. Isn’t it?

Until next time!

Automating ARR Configuration

In the world of cloud, we have to become familiar with the concept of DevOps, and this means that we often need to code setups rather than write lengthy (nearly 5000 words) sets of instructions. In my last post, I walked you through manually setting up ARR. But what if we want to take this to the next level and start automating the setup?

Now there are several examples out there of automated configuration of ARR in Windows Azure. I don’t want to simply rehash those examples, but instead “teach you to fish” as it were. And while I don’t have all the answers (I am not and don’t intend to be an ARR “expert”), I do want to share with you some of the things I’ve learned on this subject lately.

Installing ARR “automatically”

In my last write-up, I talked about using the Web Platform Installer to download and install ARR and its dependencies. Fortunately, we can take this step and turn it into a fairly simple PowerShell script. The samples below are from a start-up script I created for a Windows Azure PaaS Cloud Service worker role.

First, I’m going to set up a couple of variables so we have most of the things we want to customize at the top of the script.

# temporary variables

$temppath = $env:roleroot + "\approot\startuptemp\"
$webpifile = "webpi.msi"
$tempwebpi = $temppath + $webpifile
$webplatformdownload = "http://download.microsoft.com/download/7/0/4/704CEB4C-9F42-4962-A2B0-5C84B0682C7A/WebPlatformInstaller_amd64_en-US.msi"

These four variables are, in order:

  • temppath: where we’ll put the file when it’s downloaded
  • webpifile: the name I’m going to give to the WebPI install file after we download it
  • tempwebpi: the full path with name that it will be saved as (make sure this isn’t too long or we’ll have issues)
  • webplatformdownload: the URL we are going to download the WebPI installer from

Next up, we need the code to actually create the temporary location and download the webPI install package to that location.

# if it doesn't exist, create a temp location that we can place files in
# (the parentheses make Write-Host print the concatenated string rather than a literal "+")
Write-Host ("Testing Temporary Path: " + $temppath)
if((Test-Path -PathType Container $temppath) -eq $false)
{
    New-Item -ItemType directory -Path $temppath
    Write-Host ("Created WebPI directory: " + $temppath)
}

# if it doesn't already exist, download Web Platform Installer 4.6 to the temp location
if((Test-Path $tempwebpi) -eq $false)
{
    Write-Host "Downloading WebPI installer"
    $wc = New-Object System.Net.WebClient
    $wc.DownloadFile($webplatformdownload, $tempwebpi)
}

Ideally, we may want to wrap this in some re-try logic so we can handle any transient issues related to the download, but this will get us by for the moment.
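As a rough idea of what that re-try logic might look like, here is a minimal sketch. Invoke-WithRetry is a helper name I made up for this example (it is not a built-in cmdlet), and the attempt count and delay are arbitrary defaults rather than recommendations.

```powershell
# Sketch only: Invoke-WithRetry is a made-up helper, not a built-in cmdlet.
function Invoke-WithRetry {
    param(
        [scriptblock]$Action,
        [int]$MaxAttempts = 3,
        [int]$DelaySeconds = 5
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            # Run the action and hand back whatever it returns.
            return (& $Action)
        }
        catch {
            Write-Host "Attempt $attempt of $MaxAttempts failed: $($_.Exception.Message)"
            # Out of attempts: rethrow so the caller sees the failure.
            if ($attempt -eq $MaxAttempts) { throw }
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}

# Possible usage with the download step from the script above:
# Invoke-WithRetry -Action {
#     $wc = New-Object System.Net.WebClient
#     $wc.DownloadFile($webplatformdownload, $tempwebpi)
# }
```

A fixed delay keeps the sketch short; for a start-up task that may run on many instances at once, a growing delay between attempts might be the kinder choice.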

Now, we need to install the WebPI using “/quiet” or silent install mode.

# install Web Platform Installer
Write-Host "Install WebPI"
$tempMSIParameters = "/package " + $tempwebpi + " /quiet"
(Start-Process -FilePath "msiexec.exe" -ArgumentList $tempMSIParameters -Wait -Passthru).ExitCode

Please note that I’m not testing to ensure that this installed properly. So again, for full diligence, we should likely wrap this in some error handling code.
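One light-weight way to add that check is to look at the exit code msiexec hands back. The sketch below treats the standard Windows Installer success codes (0, 1641, and 3010) as a pass; the helper name Test-MsiExitCode is my own invention for this example.

```powershell
# Sketch only: Test-MsiExitCode is a made-up helper name.
function Test-MsiExitCode {
    param([int]$ExitCode)

    # 0    = success
    # 1641 = success, and the installer initiated a reboot
    # 3010 = success, but a reboot is required
    return (@(0, 1641, 3010) -contains $ExitCode)
}

# Possible usage with the install step above:
# $exitCode = (Start-Process -FilePath "msiexec.exe" -ArgumentList $tempMSIParameters -Wait -Passthru).ExitCode
# if (-not (Test-MsiExitCode $exitCode)) {
#     throw "WebPI install failed with msiexec exit code $exitCode"
# }
```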

With all that done, all that remains is to use WebPI to install ARR.

# use WebPI to install ARR v3
Write-Host "Using WebPI to install ARR v3"
$tempPICmd = $env:programfiles + "\microsoft\web platform installer\webpicmd"
$tempPIParameters = "/install /accepteula /Products:ARRv3_0"
Write-Host $tempPICmd
(Start-Process -FilePath $tempPICmd -ArgumentList $tempPIParameters -Wait -Passthru).ExitCode

Now this is where we run into our first challenge. Note in the fourth line of this sample that I specify a product name, “ARRv3_0”. This wasn’t just some random guess. I needed to discover what the correct product ID was. For those that aren’t familiar with the Web Platform Installer, it gets its list of products from an RSS feed. There are many feeds, but after some poking around, I found the 5.0 feed at http://www.microsoft.com/web/webpi/5.0/WebProductList.xml

I right clicked the page, viewed source and searched the result for “ARR”, eventually finding the XML node for “Application Request Routing 3.0” (the version I’m after). In this node, you’ll find the productID value that I needed for this step. Below is a picture of the RSS feed with this value highlighted.

Needless to say, tracking that down the first time took a bit of digging. :)
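If you would rather not search the feed by hand, the same lookup could be scripted. The sketch below runs against a tiny hand-written stand-in for the feed, so treat the entry/productId/title layout as my assumption about its structure rather than a guarantee; Find-ProductId is likewise a name I made up.

```powershell
# Sketch only: this sample is a hand-written stand-in for the real feed, and the
# entry/productId/title layout is an assumption about its structure.
$sampleFeed = @'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <productId>ARRv3_0</productId>
    <title>Application Request Routing 3.0</title>
  </entry>
  <entry>
    <productId>SomethingElse</productId>
    <title>Some Other Product</title>
  </entry>
</feed>
'@

function Find-ProductId {
    param([xml]$Feed, [string]$TitlePattern)

    # local-name() lets the XPath ignore the XML namespace on the feed.
    foreach ($entry in $Feed.SelectNodes("//*[local-name()='entry']")) {
        $title = $entry.SelectSingleNode("*[local-name()='title']")
        $id    = $entry.SelectSingleNode("*[local-name()='productId']")
        if (($null -ne $title) -and ($null -ne $id) -and ($title.InnerText -like $TitlePattern)) {
            # Emit the product ID for every entry whose title matches.
            $id.InnerText
        }
    }
}

Find-ProductId -Feed $sampleFeed -TitlePattern "*Request Routing*"
```

To try it against the live feed, you could pass in the text returned by WebClient’s DownloadString for the feed URL, assuming that URL is still being served.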

When you put all the snippets above together, and run it, it should result in ARR being installed and running. Mind you, this assumes you installed the IIS server role, and that nothing goes wrong with the installs. But automating those two checks is a task for another day.

Generating scripts for changing our IIS configuration

So the next step is scripting our IIS configuration. If you search around, you’ll find links on using appcmd, and maybe even a few on PowerShell. But the challenge is figuring out the right steps to take if you only plan to do this for your own unique situations and don’t have days (sometimes weeks) to dig through all the documentation. I started down this path, analyzing the options available and their parameters with the intent to then spend countless hours writing and debugging my own scripts. That is, until I found the IIS Configuration Editor.

When you load the IIS Manager UI, there’s an innocent looking icon in the management section labelled “Configuration Editor”. This will allow you to edit the IIS configuration, save/reject those changes, and even…. generate scripting!

Now there is a catch… this tool assumes you have an understanding of the monstrously complex schema that is the applicationHost.config. When you launch the Configuration Editor, the first thing you’ll need to do is specify what section of the configuration you want to work with. And unless you’ve digested the docs and have a deep understanding of the schema, this can be a real “needle in a haystack” proposition.

Fortunately for us, there’s a workaround we can leverage, namely the applicationHost.config file itself. What I’ve taken to doing, is start by using the GUI to make the configuration changes I need, and making note of the unique names I give items. Once you’ve done that, you can go to the folder “%SYSTEMROOT%\System32\inetsrv\config\” and there you will find the applicationHost.config XML file. Open that file in your favorite XML editor, and have your search button ready.

In my previous article, I set up a web farm and gave it a unique name, a name I can now search on. So using text search, I located the <webFarms><webFarm… node that described “App1Farm” (my unique name). Furthermore, this helped me identify that, for setting up the web farm, the section I need to select in the Configuration Editor is “webFarms”.

Once there, I can open up the collection listed, and I’ll see any farms that have been configured. After a bit of trial and error I can even find out the specific settings needed to set up my server farm, separating my custom settings from the defaults. This is where the fun starts.

If you look at the previous screen shot, on the far right are the actions we can take: Apply, Cancel, and Generate Script. When you use this editor to start making changes, these options will be enabled. So assume I go in and add a Web Farm like I described in my last post. When I close the dialog where I edited the settings, before I click on Apply or Cancel, I instead click on Generate Script and get the following dialog box!

This shows me the code needed to make the change I just made. And I can do this via C#, JavaScript, the AppCmd utility, or Powershell! Now the sample above just creates a farm with no details, but you can start to see where this goes. We can now use this utility to model the configuration changes we want to automate and generate template code that we can then incorporate into our solutions.

Note: after you’ve generated the code you want, be sure to click on apply or cancel as appropriate. Otherwise the Generate Script option continues to track the delta of the changes you are making and will continue to generate code for ALL the changes you are making.

Writing our first C# IIS Configuration Modifications

So with samples in hand, we’re ready to start writing some code. In my case, I’m going to do so with C#. So open up Visual Studio, and create a new project (a class library will do), and paste in your sample code.

The first thing you’ll find is that you’re missing a reference to Microsoft.Web.Administration. Provided your development machine has IIS w/ the Administration tools installed, you can add a reference to %systemroot%/system32/inetsrv/Microsoft.Web.Administration.dll to your project and things should resolve nicely. If you can’t find the file, then you will likely need to add these roles/components to your dev machine first. I covered how to do this with Windows Server 2012 in my last post, but for Windows 8 (or 7 for that matter), it’s a matter of going to Programs and Features, and then turning Windows features on or off.

When you click on the highlighted option above, this will bring up the Windows Features dialog. Scroll down to “Internet Information Services” and make sure you have IIS Management Service installed, as well as any World Wide Web Services you think you may want.

With the mundane out of the way, the next step is to get back to the code. We’ll start by looking at some code I generated to create a basic web farm like I used last time.

The first step is to get a ConfigurationSection object that contains the “webFarms” section (which we selected when we were editing the configuration, remember).

// needs "using Microsoft.Web.Administration;" at the top of the file
ServerManager serverManager = new ServerManager();
Configuration config = serverManager.GetApplicationHostConfiguration();
ConfigurationSection webFarmsSection = config.GetSection("webFarms");

ServerManager allows us to access the applicationHost.config file. We use that object to retrieve the configuration, and in turn pull the “webFarms” section into a ConfigurationSection object we can then manipulate.

Next up, we need to get the collection of web farms, and create a new element in that collection for our new farm.

ConfigurationElementCollection webFarmsCollection = webFarmsSection.GetCollection();
ConfigurationElement webFarmElement = webFarmsCollection.CreateElement("webFarm");
webFarmElement["name"] = @"sample";

The collection of farms is stored in a ConfigurationElementCollection object which is populated by doing a GetCollection on the section we retrieved previously. We then use the CreateElement method to create a new element of type “webFarm”. Finally, give that new element our name, in this case ‘sample’. (Original, aren’t I *grin*)

The next logical step is to make sure we identify the affinity settings for the new web farm. In my case, I change the default timeout from 30 to 10 minutes.

ConfigurationElement applicationRequestRoutingElement =
    webFarmElement.GetChildElement("applicationRequestRouting");
ConfigurationElement affinityElement =
    applicationRequestRoutingElement.GetChildElement("affinity");
affinityElement["timeout"] = TimeSpan.Parse("00:10:00");

Using the same ConfigurationElement we created in the last snippet, we now retrieve a child element that contains the settings for application request routing. Using that element, we get the one that has details on how affinity is set and, in this case, set “timeout” to a timespan of 10 minutes.

I also want to change the load balancing behavior. The default is least request, but I prefer round robin. This is done in the same manner, but we use the “loadBalancing” element instead of the “affinity” element of the same “applicationRequestRouting” element we just used.

ConfigurationElement loadBalancingElement =
    applicationRequestRoutingElement.GetChildElement("loadBalancing");
loadBalancingElement["algorithm"] = @"WeightedRoundRobin";

Now that we’re all done, it’s time to add the new web farm element back to the farms collection, and commit our changes to the applicationHost.config file.

webFarmsCollection.Add(webFarmElement);

serverManager.CommitChanges();

And there we have it! We’ve customized the IIS configuration via code!
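Since the Configuration Editor can also emit PowerShell, here is a rough sketch of what the same three changes might look like using the WebAdministration module. It assumes IIS, the WebAdministration module, and ARR are already installed, and it reuses the ‘sample’ farm name from the C# above; treat it as a starting point to compare against whatever the editor generates for you, not as its exact output.

```powershell
# Sketch only: requires IIS, the WebAdministration module, and ARR to be installed.
# The farm name 'sample' and the settings mirror the C# walk-through above.
Import-Module WebAdministration

$apphost   = 'MACHINE/WEBROOT/APPHOST'
$arrFilter = "webFarms/webFarm[@name='sample']/applicationRequestRouting"

# Add a new <webFarm name="sample"> element to the webFarms collection.
Add-WebConfigurationProperty -PSPath $apphost -Filter "webFarms" -Name "." -Value @{ name = 'sample' }

# Change the affinity timeout to 10 minutes.
Set-WebConfigurationProperty -PSPath $apphost -Filter "$arrFilter/affinity" -Name "timeout" -Value "00:10:00"

# Switch load balancing to weighted round robin.
Set-WebConfigurationProperty -PSPath $apphost -Filter "$arrFilter/loadBalancing" -Name "algorithm" -Value "WeightedRoundRobin"
```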

What next…

As you can likely guess, I’m working on a project that will pull these techniques together. Two actually. Admittedly there’s no solid sample here, but then my intent was to share some of the learning I’ve managed to wring out of IIS.NET, MSDN, and TechNet. And as always, bring them to you in a way that’s hopefully fairly easy to digest. While my focus has admittedly been on doing this with C#, you will hopefully be able to leverage the Configuration Editor to help you with any appcmd or Powershell automation you’re looking to pull together.

If all goes well over the next couple weeks, I’ll hope to share my projects with you. These will hopefully add some nice, fairly turnkey capabilities to your Windows Azure projects, but more importantly bring all these learnings into clear focus. So bear with me a bit longer as I go back into hiding to help get the remaining work completed.

Until next time!

ARR as a highly available reverse proxy in Windows Azure

With the general availability of Windows Azure’s IaaS solution last year, we’ve seen a significant uptake in migration of legacy solutions to the Windows Azure platform. And with the even more recent announcement of our agreement with Oracle for them to support their products on Microsoft’s hypervisor technology, Hyper-V, we have a whole new category of apps we are being asked to help move to Windows Azure. One common pattern that’s been emerging is the need for Linux/Apache/Java solutions to run in Azure at the same level of “density” that is available via traditional hosting providers. If you were an ISV (Independent Software Vendor) hosting solutions for individual customers, you may choose to accomplish this by giving each customer a unique URI and binding that to a specific Apache module, sometimes based on a unique IP address that is associated with a customer specific URL and a unique SSL certificate. This results in a scenario that requires multiple IPs per server.

As you may have heard, the internet is starting to run a bit short on IP addresses. So supporting multiple public IPs per server is a difficult proposition for a cloud, as well as for some traditional hosting providers. To that end we’ve seen new technologies emerge such as SNI (Server Name Indication) and the use of more and more proxy and request routing solutions like HaProxy, FreeBSD, and Microsoft’s Application Request Routing (ARR). This is also complicated by the need to deliver highly available, fault tolerant solutions that can load balance client traffic. This isn’t always an easy problem to solve, especially using just application centric approaches. They require intelligent, configurable proxies and/or load balancers: precisely the kind of low level management the cloud is supposed to help us get away from.

But today, I’m here to share one solution I created for a customer that I think addresses some of this need: using Microsoft’s ARR modules for IIS, hosted in Windows Azure’s IaaS service, as a reverse proxy for a high-density application hosting solution.

Disclaimer: This article assumes you are familiar with creating/provisioning virtual machines in Windows Azure and then remoting into them to further alter their configurations. Additionally, you will need a basic understanding of IIS and how to make changes to it via the IIS Manager console. I’m also aware of there being a myriad of ways to accomplish what we’re trying to do with this solution. This is simply one possible solution.

Overview of the Scenario and proposed solution

Here’s the outline of a potential customer’s scenario:

  • We have two or more virtual machines hosted in Windows Azure that are configured for high availability. Each of these virtual machines is identical, and hosts several web applications.
  • The web applications consist of two types:
    • Stateful web sites, accessed by end users via a web browser
    • Stateless APIs accessed by a “rich client” running natively on a mobile device
  • The “state” of the web sites is stored in an in-memory user session store that is specific to the machine on which the session was started. So all subsequent requests made during that session must be routed to the same server. This is referred to as ‘session affinity’ or ‘sticky sessions’.
  • All client requests will be over SSL (on port 443), to a unique URL specific to a given application/customer.
  • Each site/URL has its own unique SSL certificate
  • SSL Offloading (decryption of HTTPS traffic prior to its receipt by the web application) should be enabled to reduce the load on the web servers.

As you can guess based on the title of this article my intent is to solve this problem using Application Request Routing (aka ARR), a free plug-in for Windows Server IIS. ARR is an incredibly powerful utility that can be used to do many things, including acting as a reverse proxy to help route requests in a way that is completely transparent to the requestor. Combined with other features of IIS 8.0, it is able to meet the needs of the scenario we just outlined.

For my POC, I use four virtual machines within a single Windows Azure cloud service (a cloud service is simply a container that virtual machines can be placed into that provides a level of network isolation). On-premises we had the availability provided by the “titanium eggshell” that is robust hardware, but in the cloud we need to protect ourselves from potential outages by running multiple instances configured to help minimize downtimes. To be covered by Windows Azure’s 99.95% uptime SLA, I am required to run multiple virtual machine instances placed into an availability set. But since the Windows Azure Load Balancer doesn’t support sticky sessions, I need something in the mix to deliver this functionality.

The POC will consist of two layers, the ARR based Reverse Proxy layer, and the web servers. To get the Windows Azure SLA, each layer will have two virtual machines: two running ARR with public endpoints for SSL traffic (port 443) and two set up as our web servers, but since these will sit behind our reverse proxy, they will not have any public endpoints (outside of remote desktop to help with initial setup). Requests will come in from various clients (web browsers or devices) and arrive at the Windows Azure Load Balancer. The load balancer will then distribute the traffic equally across our two reverse proxy virtual machines where the requests are processed by IIS and ARR and routed based on the rules we will configure to the proper applications on the web servers, each running on a unique port. Optionally, ARR will also handle the routing of requests to a specific web server, ensuring that “session affinity” is maintained. The following diagram illustrates the solution.

The focus of this article is on how we can leverage ARR to fulfill the scenario in a way that’s “cloud friendly”. So while the original customer scenario called for Linux/Apache servers, I’m going to use Windows Server/IIS for this POC. This is purely a decision of convenience since it has been a LONG time since I set up a Linux/Apache web server. Additionally, while the original scenario called for multiple customers, each with their own web applications/modules (as shown in the diagram), I just need to demonstrate the URI to specific application routing. So as you’ll see later in the article, I’m just going to set up a couple of web applications.

Note: While we can have more than two web servers, I’ve limited the POC to two for the sake of simplicity. If you want to run 3, 10, or 25, it’s just a matter of creating the additional servers and adding them to the ARR web farms as we’ll be doing later in this article.

Setting up the Servers in Windows Azure

If you’re used to setting up Virtual Machines in Windows Azure, this is fairly straightforward. We start by creating a cloud service and two storage accounts. The reason for the two is that I really want to try and maximize the uptime of the solution. If all the VMs had their hard-drives in a single storage account and that account experienced a sustained service interruption, my entire solution could be taken offline.

NOTE: The approach to use multiple storage accounts does not guarantee availability. This is a personal preference to help, even if in some small part, mitigate potential risk.

You can also go so far as to define a virtual network for the machines with separate subnets for the front and back end. However, this should not be required for the solution to work as the cloud service container gives us DNS resolution within its boundaries. However, the virtual network can be used to help manage visibility and security of the different virtual machine instances.

Once the storage accounts are created, I create the first of our two “front end” ARR servers by provisioning a new Windows Server 2012 virtual machine instance. I give it a meaningful name like “ARRFrontEnd01” and make sure that I also create an availability set and define an HTTPS endpoint on port 443. If you’re using the Management portal, be sure to select the “from gallery” option as opposed to ‘quick create’ as it will give you additional options when provisioning the VM instance and allow you to more easily set the cloud service, availability set, and storage account. After the first virtual machine is created, create a second, perhaps “ARRFrontEnd02”, and “attach” it to the first instance by associating it with the endpoint we created while provisioning the previous instance.

Once our “front end” machines are provisioned, we set up two more Windows Server 2012 instances for our web servers, “WebServer01” and “WebServer02”. However, since these machines will be behind our front end servers, we won’t declare any public endpoints for ports 80 or 443, just leave the defaults.

When complete, we should have four virtual machine instances: two that are load balanced via Windows Azure on port 443 and will act as our ARR front end servers, and two that will act as our web servers.

Now before we can really start setting things up, we’ll need to remote desktop into each of these servers and add a few roles. When we log on, we should see the Server Manager dashboard. Select “Add roles and features” from the “configure this local server” box.

In the “Add Roles and Features” wizard, skip over the “Before you Begin” (if you get it), and select the role-based installation type.

On the next page, we’ll select the current server from the server pool (the default) and proceed to adding the “Web Server (IIS)” server role.

This will pop-up another dialog confirming the features we want added. Namely the Management Tools and IIS Management Console. So take the defaults here and click “Add Features” to proceed.

The next page in the Wizard is “Select Features”. We’ve already selected what we needed when we added the role, so click on “Next” until you arrive at the “Select Role Services”. There are two optional role services here I’d recommend you consider adding. Health and Diagnostic Tracing will be helpful if we have to troubleshoot our ARR configuration later and The IIS Management Scripts and Tools will be essential if we want to automate the setup of any of this at a later date (but that’s another blog post for another day). Below is a composite image that shows these options selected.

It’s also a good idea to double-check here and make sure that the IIS Management Console is selected. It should be by default since it was part of the role features we included earlier. But it doesn’t hurt to be safe. :)

With all this complete, go ahead and create several sites on the two web servers. We can leave the default site on port 80, but create two more HTTP sites. I used 8080 and 8090 for the two sites, but feel free to pick available ports that meet your needs. Just be sure to go into the firewall settings of each server and enable inbound connections on these ports. I also went into the sites and changed the HTML so I could tell which server and which app I was getting results back from (something like “Web1 – SiteA” works fine).

Lastly, test the web sites from our two front end servers to make sure they can connect by logging into those servers, opening a web browser, and entering the proper address. This will be something like HTTP://<servername>:8080/iisstart.htm. The ‘servername’ parameter is simply the name we gave the virtual machine when it was provisioned. Make sure that you can hit both servers and all three apps from both of our proxy servers before proceeding. If these fail to connect, the most likely cause is an issue in the way the IIS site was defined, or an issue with the firewall configuration on the web server preventing the requests from being received.
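If clicking through a browser six times per proxy server gets old, the same check can be scripted. What follows is only a minimal sketch in Python, which is my own assumption and not something the setup above installs. The server names and ports are simply the ones used in this walkthrough, so swap in your own.

```python
import urllib.error
import urllib.request

# Assumed values: the VM names and site ports used in this walkthrough.
SERVERS = ["WebServer01", "WebServer02"]
PORTS = [80, 8080, 8090]


def build_urls(servers, ports, page="iisstart.htm"):
    """Return one test URL per server/port combination."""
    return [f"http://{s}:{p}/{page}" for s in servers for p in ports]


def check(url, timeout=5):
    """Return (url, outcome); outcome is an HTTP status or an error string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 404 or 500).
        return url, exc.code
    except (urllib.error.URLError, OSError) as exc:
        # No answer at all: name resolution, firewall, or site not running.
        return url, f"failed: {exc}"


def main():
    """Probe every site from the machine this script runs on."""
    for url in build_urls(SERVERS, PORTS):
        print(*check(url))
```

Running `main()` on each front end server should print a 200 for every URL. A "failed" line points at the firewall or the site binding, which matches the two likely causes mentioned above.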

Installing ARR and setting up for HA

With our server environment now configured, and some basic web sites we can balance traffic against, it’s time to define our proxy servers. We start by installing ARR 3.0 (the latest version as of this writing and compatible with IIS 8.0). You can download it from here, or install it via the Web Platform Installer (WebPI). I would recommend the WebPI option, as WebPI will also install any dependencies and can be scripted. Fortunately, when you open up the IIS Manager for the first time and select the server, it will ask if you want to install the “Microsoft Web Platform” and open up a browser to allow you to download it. After adding a few web sites to the ‘trusted zone’ (and enabling file downloads when in the ‘internet’ zone), you’ll be able to download and install this helpful tool. Once installed, run it and enter “Application Request” into the search bar. We want to select version 3.0.

Now that ARR is installed (which we have to do on both of our proxy servers), let’s talk about setting this up for high availability. We hopefully placed both of our proxy servers into an availability set and load balanced the 443 endpoint as mentioned above. This allows both servers to act as our proxy. But we have two possible challenges yet:

  1. How to maintain the ARR setup across two servers
  2. Ensure that session affinity (aka sticky sessions) works with multiple, load balanced ARR servers

Fortunately, there’s a couple of decent blog posts on IIS.NET about this subject. Unfortunately, these appear to have been written by folks that are familiar with IIS, networking, pings and pipes, and a host of other items. But as always, I’m here to try and help cut through all that and put this stuff in terms that we can all relate to. And hopefully in such a way that we don’t lose any important details.

To leverage Windows Azure’s compute SLA, we will need to run two instances of our ARR machines and place them into an availability set. We set up both these servers earlier, and hopefully properly placed them into an availability set with a load balanced endpoint on port 443. This allows the Windows Azure fabric to load balance traffic between the two instances. Also, should updates to the host server (where our VMs run) or the fabric components be necessary, we can minimize the risk of both ARR servers being taken offline at the same time.

This configuration leads us to the options highlighted in the blog post I linked previously, “Using Multiple Instances of Application Request Routing (ARR) Servers”. The article discusses using Shared Configuration and External Cache. A Shared Configuration allows two ARR servers to share their configurations. By leveraging a shared configuration, changes made to one ARR server will automatically be picked up by the other because both servers will share a single applicationhost.config file. The External Cache is used to allow both ARR servers to share affinity settings. So if a client’s first request is sent to a given back end web server, then all subsequent requests will be sent to that same back end server regardless of which ARR server receives the request.

For this POC, I decided not to use either option. Both require a shared network location. I could put this on either ARR server, but that creates a single point of failure. And since our objective is to ensure the solution remains as available as possible, I didn’t want to take a dependency that would ultimately reduce the potential availability of the overall solution. As for the external cache, I only wanted server affinity for one of the two web sites, since the POC mocks up two kinds of traffic: requests that are more like API calls, which can simply be round-robin load balanced, and requests that come from a web browser. For the browser requests, instead of using a shared cache, we’ll use “client affinity”. This option returns a browser cookie that contains all the routing information needed by ARR to ensure that subsequent requests are sent to the same back end server. This is the same approach used by the Windows Azure Java SDK and Windows Azure Web Sites.
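To see why cookie-based affinity removes the need for a shared cache, here is a small Python sketch of the idea. It is a toy model under my own assumptions, not ARR’s implementation: the cookie name “affinity” is made up, and a real affinity cookie carries an opaque identifier rather than the server name in plain text.

```python
import itertools


class Proxy:
    """Stand-in for one ARR server; it keeps no per-client state."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = itertools.cycle(self.backends)

    def route(self, cookies):
        """Return (backend, cookies_to_send_back) for one request."""
        chosen = cookies.get("affinity")
        if chosen not in self.backends:
            # First request (or an unknown backend): pick one round-robin.
            chosen = next(self._next)
        # The routing decision travels with the client, not the proxy.
        return chosen, {"affinity": chosen}


# Two independent proxies that share nothing with each other.
proxy_a = Proxy(["Web1", "Web2"])
proxy_b = Proxy(["Web1", "Web2"])

backend, cookies = proxy_a.route({})        # first request lands on proxy A
follow_up, _ = proxy_b.route(cookies)       # next one lands on proxy B
```

Because the cookie comes back with every request, `follow_up` is the same back end server as `backend` even though a different proxy handled it. That is the behavior we want from two load balanced ARR servers with no shared state.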

So to make a long story short, if we’ve properly set up our two ARR servers in an availability set, with load balanced endpoints, there’s no additional high level configuration necessary to set up the options highlighted in the “multiple instances” article. We can get what we need within ARR itself.

Configure our ARR Web Farms

I realize I’ve been fairly high level with my setup instructions so far. But many of these steps have been fairly well documented and up until this point we’ve been painting with a fairly broad brush. Going forward I’m going to get more detailed since it’s important that we properly set this all up. Just remember that each of the steps going forward will need to be executed on each of our ARR servers since we opted not to leverage the Shared Configuration.

The first step after our servers have been set up is to configure the web farms. Open the IIS Manager on one of our ARR servers and (provided our ARR 3.0 install was completed successfully), we should see the “Server Farm” node. Right-click on that node and select “Create Server Farm” from the pop-up menu as shown in the image at the right. A Server Farm is a collection of servers that we will have ARR route traffic to. It’s the definition of this farm that will control aspects like request affinity and load balancing behaviors as well as which servers will receive traffic.

The first step in setting up the farm is to add our web servers. Now in building my initial POC, this is the piece that caused me the most difficulty. Not because creating the server farm was difficult, but because there’s one thing that’s not apparent to those of us that aren’t intimately familiar with web hosting and server farms. Namely that we need to consider a server farm to be specific to one of our applications. It’s this understanding that helps us realize that we need the definition of the server farm to help us route requests coming to the ARR server on one port, to be routed to the proper port(s) on the destination back end servers. We’ll do this as we add each server to the farm using the following steps…

After clicking on “Create Server Farm”, provide a name for the farm. Something suitable of course…

After entering the farm name and clicking on the “Next” button, we’ll be presented with the “Add Server” dialog. In this box, we’ll enter the name of each of our back end servers, but more importantly we need to make sure we expand the “Advanced Settings” options so we can also specify the port on that server we want to target. In my case, I’m going to enter ‘Web1’, the name of the server I want to add, and set ‘httpPort’ to 8080.

We’re able to do this because Windows Azure handles DNS resolution for the servers I added to the cloud service. And since they’re all in the same cloud service, we can address each server on any ports those servers will allow. There’s no need to define endpoints for connections between servers in the same cloud service. So we’ll complete the process by clicking on the ‘Add’ button and then doing the same for my second web server, ‘Web2’. We’ll receive a prompt about the creation of a default rewrite rule; click on the “No” button to close the dialog.

It’s important to set the ‘httpPort’ when we add the servers. I’ve been unable to find a way to change this port via the IIS Manager UI once the server has been added. Yes, you can change it via appcmd, PowerShell, or even by directly editing the applicationhost.config, but that’s a topic for another day. :)

Now to set the load balancing behavior and affinity we talked about earlier, we select the newly created server farm from the tree and we’ll see the icons presented below:

If we double-click on the Load Balance icon, it will open a dialog box that allows us to select from the available load balancing algorithms. For the needs of this POC, Least Current Request and Weighted Round Robin would both work suitably. Select the algorithm you prefer and click on “Apply”. To set the cookie based client affinity I mentioned earlier, you can double-click on the “Server Affinity” option and then check the box for “Client Affinity”.
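If the difference between the algorithm choices isn’t obvious, this little Python sketch shows the two general ideas: rotating through servers in proportion to a weight, and preferring whichever server is least busy right now. It illustrates the concepts only and makes no claim about how ARR implements them internally; the server names and weights are example values.

```python
import itertools


def weighted_round_robin(weights):
    """Cycle through server names in proportion to their integer weights.

    weights: dict of server name -> positive integer weight.
    """
    schedule = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(schedule)


def least_busy(in_flight):
    """Pick the server currently handling the fewest requests.

    in_flight: dict of server name -> number of requests in progress.
    """
    return min(in_flight, key=in_flight.get)


# Example: Web1 is given twice the share of Web2.
picker = weighted_round_robin({"Web1": 2, "Web2": 1})
first_three = [next(picker) for _ in range(3)]
```

With equal weights the first function degrades to plain round robin, which is all the API style traffic in this POC really needs.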

The final item that we will make sure is enabled here is SSL Offloading. We can verify this by double-clicking on “Routing Rules” and verifying that “Enable SSL Offloading” is checked, which it should be by default.

Now it’s a matter of repeating this process for our second application (I put it on port 8090) as well as setting up the same two farms on the other ARR server.

Setting up the URL Rewrite Rule

The next step is to set up the URL rewrite rule that will tell ARR how to route requests for each of our applications to the proper web farm. But before we can do that, we need to make sure we have two unique URIs, one for each of our applications. If you scroll up and refer to the diagram that provides the overview of our solution, you’ll see that end user requests to the solution are directed at custaweb.somedomain.com and device API calls are directed to custbweb.somedomain.com. So we will need to create DNS entries for these names and alias them to the *.cloudapp.net URI that is the entry point of the cloud service where this solution resides. We can’t use just a forwarding address for this but need a true CNAME alias.

Presuming that has already been set up, we’re ready to create the URL rule for our rewrite behavior.

We’ll start by selecting the web server itself in the IIS server manager and double clicking the URL Rewrite icon as shown below.

This will open the list of URL rewrite rules, and we’ll select “add rules…” from the action menu on the right. Select the option to create a blank inbound rule. Give the rule an appropriate name, and complete the sections as shown in the following images.

Matching URL

This section details which incoming request URIs this rule should be applied to. I have set it up so that all inbound requests will be evaluated.

Conditions

Now as it stands, this rule would route nearly any request. So we need to add a condition to the rule to associate it with a specific request URL. We need to expand the “Conditions” section and click on “Add…”. We specify “{HTTP_HOST}” as the input condition (what to check) and set the condition type to a simple pattern match. And for the pattern itself, I opted to use a regular expression that looks at the first part of the domain name and makes sure it matches “^custAweb.*” (as we highlighted in the diagram at the top). In this way we ensure that the rule will only be applied to one of the two URIs in our sample.

Action

The final piece of the rule is to define the action. For our type, we’ll select “Route to Server Farm”, keep HTTP as the scheme, and specify the appropriate server farm. And for the path, we’ll leave the default value of “/{R:0}”. The final piece of this tells ARR to add any paths or parameters that were in the request URL to the forwarded request.

Lastly, we have the option of telling ARR that if we execute this rule, we should not process any subsequent rules. This can be checked or unchecked depending on your needs. You may desire to set up a “default” page for requests that don’t meet any of our other rules. In which case just make sure you don’t “stop processing of subsequent rules” and place that default rule at the bottom of the list.
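To make the moving parts of the rule a bit more concrete, here is how I picture the match, condition, and action working together, written as a Python sketch. Treat it as a mental model built on assumptions: the farm names are hypothetical, the host match is case-insensitive here because that is how I configured my rule, and the path is passed in without its leading slash.

```python
import re

# Hypothetical rule table: one rule per application, evaluated in order.
RULES = [
    {"host_pattern": r"^custAweb.*", "farm": "CustAFarm"},
    {"host_pattern": r"^custBweb.*", "farm": "CustBFarm"},
]


def route(host, path):
    """Return the URL a request is forwarded to, or None if no rule applies.

    Returning on the first hit models "stop processing of subsequent
    rules" being checked.
    """
    for rule in RULES:
        # "Matching URL": a catch-all pattern, so {R:0} is the whole path.
        r0 = re.match(r".*", path).group(0)
        # "Conditions": {HTTP_HOST} must match the rule's pattern.
        if re.search(rule["host_pattern"], host, re.IGNORECASE):
            # "Action": route to the farm over HTTP, path "/{R:0}".
            return f"http://{rule['farm']}/{r0}"
    return None


forwarded = route("custaweb.somedomain.com", "orders/42")
```

A request for a host that matches neither pattern returns `None`, which is the gap a “default” rule at the bottom of the list would fill.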

This completes the basics of setting up our ARR based reverse proxy. Only one more step remains.

Setting up SNI and SSL Offload

Now that we have the ARR URL Rewrite rules in place, we need to get all the messiness with the certificates out of the way. We’ll assume, for the sake of argument, that we’ve already created a certificate and added it to the proper local machine certificate store. If you’re unsure how to do this, you can find some instructions in this article.

We start by creating a web site for the inbound URL. Select the server in the IIS Manager and right-click it to get the pop-up menu. This opens the “Add Website” dialog which we will complete to set up the site.

Below you’ll find some settings I used. The site name is just a descriptive name that will appear in the IIS manager. For the physical path, I specified the same path as the “default” site that was created when we installed IIS. We could specify our own site, but that’s really not necessary unless you want to have a placeholder page in case something goes wrong with the ARR URL Rewrite rules. And since we’re doing SSL for this site, be sure to set the binding type to ‘https’ and specify the host name that matches the inbound URL that external clients will use (aka our CNAME). Finally, be sure to check “Require Server Name Indication” to make sure we support Server Name Indication (SNI).

And that’s really all there is to it. SSL offloading was already configured for us by default when we created the server farm (feel free to go back and look for the checkbox). So all we had to do was make sure we had a site defined in IIS that could be used to resolve the certificate. This will process the encryption duties, then ARR will pick up the request for processing against our rules.

Debugging ARR

So if we’ve done everything correctly, it should just work. But if it doesn’t, debugging ARR can be a bit of a challenge. You may recall that back when we installed ARR, I suggested also installing the tracing and logging features. If you did, these can be used to help troubleshoot some issues as outlined in this article from IIS.NET. While this is helpful, I also wanted to leave you with one other tip I ran across. If possible, use a browser on the server we’ve configured ARR on to access the various web sites locally. While this won’t do any routing unless you set up some local DNS entries to help with resolving to the local machine, it will show you more than a stock “500” error. By accessing the local IIS server from within, we can get more detailed error messages that help us understand what may be wrong with our rules. It won’t allow you to fix everything, but could sometimes be helpful.

I wish I had more for you on this, but ARR is admittedly a HUGE topic, especially for something that’s a ‘free’ add-on to IIS. This blog post is the result of several days of experimentation and self-learning. And even with this time invested, I would never presume to call myself an expert on this subject. So please forgive me if I didn’t get into enough depth.

With this, I’ll call this article to a close. I hope you find this information useful and I hope to revisit this topic again soon. One item I’m still keenly interested in is how to automate these tasks. Something that will be extremely useful for anyone that has to provision new ‘apps’ into our server farm on a regular basis. Until next time then!

Postscript

I started this post in October 2013 and apologize for the delay in getting it out. We were hoping to get it published as a full-fledged magazine article but it just didn’t work out. So I’m really happy to finally get this out “in the wild”. I’d also like to give props to Greg, Gil, David, and Ryan for helping do technical reviews. They were a great help but I’m solely responsible for any grammar or spelling issues contained herein. If you see something, please call it out in the comments or email me and I’m happy to make corrections.

This will also hopefully be the first of a few ARR related posts/projects I plan to share over the next few weeks/months. Enjoy!

Cloud Computing News Digest for September 21st, 2012

I normally publish this over at my Sogeti blog at http://blogs.us.sogeti.com/ccdigest/ but that’s down at the moment so we’re going to my backup copy. I know, the self proclaimed “cloud guy” isn’t in the cloud. Well there’s an old saying that goes something like ‘the cobbler’s children have no shoes’. :)

I’d say I’m late with this edition but this is developing into enough of a pattern that I think I’m just going to start thinking of monthly as the new weekly :) So on to the news…

The Cloud Security Alliance (CSA) and Fujitsu announced the launch of the Big Data Working Group. The intent of this organization is to help the industry by bringing forth best practices for security and privacy when working with big data. They will start focused on research across several industry verticals with their first report due sometime this fall.

At the 2012 CloudOpen conference this past August, Suse announced their OpenStack based enterprise level private cloud solution called amazingly enough “Suse Cloud”. This IaaS based solution would help organizations deploy and manage private clouds with self-service and workload standardization capabilities.

I also found an article about a competitor to OpenStack, Eucalyptus. SearchCloudComputing has published a “deep dive” into using Eucalyptus 3.1. You’ll need to register as a member (it’s free) to read the full article.

In my job, I’m often asked what skills are needed for cloud. This article by Joe McKendrick does a nice job of covering the list. Not just for individuals, but for organizations as well.

When you talk to cloud vendors, they will eventually reference PUE (Power Usage Effectiveness) statistics in some way. But as this piece by David Linthicum over at Toolbox.com explains, the real savings are in the ability to adjust to changing needs and in turn, changing our consumption.

Last month the world watched the 2012 Summer Olympics. And it turns out the cloud played a major hand in helping deliver that content around the globe. Windows Azure Media Services helped deliver live and on-demand video content to several broadcasters. Eyes weren’t just on the games as Apica, a vendor of testing and monitoring solutions, monitored various Olympics related web sites and scored them for their uptime and performance.

For this edition I also found a presentation by Adrian Cockcroft of Netflix on Cassandra (another NoSQL database solution) performance and scalability on AWS. Even if you don’t plan to use Cassandra, I highly recommend listening to this and picking up what you can of their approach and learnings. The video lasts about an hour.

Pfizer (the drug…. er… pharmaceutical company) also ventured into the world of cloud computing to help with supply chain issues. If you’ve ever worried about a critical delivery of your own, think about what it takes to get lifesaving medicine to patients.

On the Google front, they haven’t been quiet. They recently launched the Google Cloud Partner Program, giving them a way to help promote and leverage delivery partners not unlike the programs already in place at Amazon and Microsoft.

Related to topics that are close to my heart, I have a great article on resilient solution engineering from Jesse Robbins at GameDay. Having all this capacity for disaster recovery and failover doesn’t do us much good if we won’t create solutions that can take advantage of it. And on the subject of architecture, just yesterday I ran across this great list of items for architectural principles taken from Will Larson’s “Introduction to Architecting Systems for Scale”. Definitely give this a read.

And to close out this edition, I have an infographic on enterprise cloud adoption. I’m not a big fan of infographics, but I found this one useful and figured I’d share it with all of you.

Avoiding the Chaos Monkey

Yesterday I was pleased (and nervous) to be presenting at the Heartland Developers Conference in Omaha, NE. I’ve been hoping to present at this event for a couple years and was really pleased that one of my submissions was accepted. Especially given that the topic was more architecture/concept than code. It was only my second time presenting this material and the first time for a non-captive audience. And given that it was the 2pm slot, and only a handful of people fell asleep or left, I’m pretty pleased with how things went.

I’ve posted the deck for my Avoiding the Chaos Monkey presentation so please feel free to take and reuse. I just ask that you give proper credit and I’d love any feedback on it. I received some great feedback from HDC on the material and will be making some updates that show some real world scenarios and how applying the principles covered in this presentation can address them. I spoke to some of these during the presentation, but agreed with my colleague Eric that it would help to have more concrete and visual examples to drive the message home. I’ve already submitted the talk to two upcoming conferences and hopefully it will get accepted at one. Meanwhile, feel free to snag a copy and drop me a comment with any feedback you have!

You don’t really want an SLA!

I don’t often do editorials (and when I do, they tend to ramble), but I felt I’m due and this is a conversation I’ve been having a lot lately. I sit down to talk with clients about cloud and one of the first questions I always get is “what is the SLA?” And I hate it.

The fact is that an SLA is an insurance policy. If your vendor doesn’t provide a basic level of service, you get a check. Not unlike my home owners insurance. If something happens, I get a check. The problem is that most of us NEVER want to have to get that check. If my house burns down, the insurance company will replace it. But all those personal mementos, the memories, the “feel” of the house are gone. So that’s a situation I’d rather avoid. What I REALLY want is safety. So I install a fire alarm, I make sure I have an extinguisher in the kitchen, I keep candles away from drapes. I take measures to help reduce the risk that I’ll need to cash in my insurance policy.

When building solutions, we don’t want SLA’s. What we REALLY want is availability. So we as the solution owners need to take steps to help us achieve this. We have to weigh the cost vs the benefit (do I need an extinguisher or a sprinkler system?) and determine how much we’re willing to invest in actively working to achieve our own goals.

This is why when I get asked the question, I usually respond by giving them the answer and immediately jump into a discussion about resiliency. What is a service degradation vs an outage? How can we leverage redundancy? Can we decouple components and absorb service disruptions? These are the types of things we as architects need to start considering, not just for cloud solutions but for everything we build.

I continue to tell developers that the public cloud is a stepping stone. The patterns we’re using in the public cloud are lessons learned that will eventually get applied back on premises. As the private cloud becomes less vapor and more reality, the ability to think in these new patterns is what will make the next generation of apps truly useful. If a server goes down, how quickly does your load balancer see this and take that server out of rotation? How do the servers shift workloads?

When working towards availability, we need to keep several things in mind.

Failures will happen – how we deal with them is our choice. We can have the world stop, or we can figure out how to “degrade” our solution to keep anything we can going.

How are we going to recover – when things return to normal, how does the solution “catch up” with what happened during the disruption?

The outage is less important than how fast we react – we need to know something has gone wrong before our clients call to tell us.
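Those three points translate into code more directly than you might expect. Below is a minimal Python sketch, with every name in it hypothetical, of a front end that degrades by queuing work when its back end is down, raises an alert the moment that happens, and replays the backlog once service returns.

```python
from collections import deque


class OrderService:
    """Hypothetical front end that degrades instead of failing outright."""

    def __init__(self, submit):
        self.submit = submit      # callable that raises ConnectionError on outage
        self.backlog = deque()    # work to replay once the back end returns
        self.alerts = []          # stand-in for a real monitoring hook

    def place(self, order):
        """Accept the order if we can; otherwise queue it and raise an alert."""
        try:
            self.submit(order)
            return "accepted"
        except ConnectionError as exc:
            self.backlog.append(order)                    # degrade, don't stop
            self.alerts.append(f"submit failed: {exc}")   # we hear about it first
            return "queued"

    def catch_up(self):
        """Replay queued work in order; stop at the first failure."""
        replayed = 0
        while self.backlog:
            try:
                self.submit(self.backlog[0])
            except ConnectionError:
                break
            self.backlog.popleft()
            replayed += 1
        return replayed
```

The design choice worth noticing is that the caller always gets an answer. “Queued” is a worse answer than “accepted”, but it keeps the rest of the solution moving while the outage is dealt with.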

We (aka solution/application architects) really need to start changing the conversation here. We need to steer away from SLA’s entirely and, when we can’t manage that, at least get to more meaningful, scenario based SLA’s. This can mean instead of saying “the email server will be up 99% of the time” we switch to “99% of emails will be transmitted within 5 minutes”. This is much more meaningful for the end users and also gives us more flexibility in how we achieve it. And depending on how traffic.
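A scenario based target like the email one is also easy to measure. As a rough Python sketch, assuming we can get a list of delivery times in seconds out of our logs (the function names and the sample numbers are mine, purely for illustration):

```python
def fraction_within(latencies_seconds, limit_seconds=300):
    """Share of messages delivered within the limit, from 0.0 to 1.0."""
    if not latencies_seconds:
        raise ValueError("no deliveries to measure")
    on_time = sum(1 for s in latencies_seconds if s <= limit_seconds)
    return on_time / len(latencies_seconds)


def meets_target(latencies_seconds, target=0.99, limit_seconds=300):
    """True if at least `target` of messages arrived within the limit."""
    return fraction_within(latencies_seconds, limit_seconds) >= target


# Made-up sample: 99 quick deliveries and one that took ten minutes.
sample = [10] * 99 + [600]
ok = meets_target(sample)
```

Notice that nothing in this measurement cares whether a particular server was up. It only cares whether the user’s email got where it was going in time, which is the whole point.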

Anyway, enough rambling for now. I need to get a deck that discusses this ready for a presentation on Thursday that only about 20 minutes ago I realized I needed to do. Fortunately, I have an earlier draft of the session and definitely have the passion and knowhow to make this happen. So time to get cracking!

Until next time!

Session State with Windows Azure Caching Preview

I’m working on a project for a client and was asked to pull together a small demo using the new Windows Azure Caching preview.  This is the “dedicated” or better yet, “self hosted” solution that’s currently available as a preview in the Windows Azure SDK 1.7, not the Caching Service that was made available early last year. So starting with a simple MVC 3 application, I set out to enable the new memory cache for session state. This is only step 1 and the next step is to add a custom cache based on the LocalStorage feature of Windows Azure Cloud Services.

Enabling the self-hosted, in-memory cache

After creating my template project, I started by following the MSDN documentation for enabling the cache co-hosted in my MVC 3 web role. I opened up the properties tab for the role (right-clicking on the role in the cloud service via the Solution Explorer) and moved to the Caching tab. I checked “Enable Caching” and set my cache to Co-located (it’s the default) and the size to 20% of the available memory.

Now because I want to use this for session state, I’m also going to change the Expiration Type for the default cache from “Absolute” to “Sliding”. In the current preview, we only have one eviction type, Least Recently Used (LRU) which will work just fine for our session demo. We save these changes and take a look at what’s happened with the role.
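If the difference between the expiration type and the eviction type is fuzzy, this toy Python cache shows both ideas side by side. It is an illustration only and has nothing to do with how Windows Azure Caching is built internally; the class name, the injectable clock, and the numbers are all my own.

```python
from collections import OrderedDict


class SlidingLruCache:
    """Toy cache with sliding expiration and least-recently-used eviction."""

    def __init__(self, capacity, ttl_seconds, clock):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock               # callable returning the current time
        self._items = OrderedDict()      # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        now = self.clock()
        if now >= expires_at:
            del self._items[key]         # expired: behave as a miss
            return None
        # Sliding expiration: every read pushes the deadline out again.
        # An absolute expiration would leave expires_at untouched here.
        self._items[key] = (value, now + self.ttl)
        self._items.move_to_end(key)     # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used
```

For session state, sliding is what we want: a user who keeps clicking keeps their session alive, while an idle one eventually ages out.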

There are three changes that I could find:

  • A new module, Caching, is imported in the ServiceDefinition.csdef file
  • A new local resource “Microsoft.WindowsAzure.Plugins.Caching.FileStore” is declared
  • Four new configuration settings are added, all related to the cache: NamedCaches (a JSON list of named caches), LogLevel, CacheSizePercentage, and ConfigStoreConnectionString

Yeah PaaS! A few options clicked and the Windows Azure Fabric will handle spinning up the resources for me. I just have to make the changes to leverage this new resource. That’s right, now I need to set up my cache client.

Note: While you can rename the “default” cache by editing the cscfg file, the default will always exist. There’s currently no way I found to remove or rename it.

Client for Cache

I could configure the cache manually, but folks keep telling me I need to learn this NuGet stuff. So let’s do it with the NuGet packages instead. After a bit of fumbling to clean up a previously botched NuGet install (Note: you must be running VS as Admin to manage plug-ins), I right-clicked on my MVC 3 web role and selected “Manage NuGet Packages”. Then, following the great documentation at MSDN, I searched for windowsazure.caching and installed the “Windows Azure Caching Preview” package.

This handles updating my project references for me, adding at least 5 of them that I saw at a quick glance, as well as updating the role’s configuration file (the web.config in my case) which I now need to update with the name of my role:

<dataCacheClient name="default">
  <autoDiscover isEnabled="true" identifier="WebMVC" />
  <!--<localCache isEnabled="true" sync="TimeoutBased" objectCount="100000" ttlValue="300" />-->
</dataCacheClient>

Now if you’re familiar with using caching in .NET, this is all I really need to do to start caching. But I want to take another step and change my MVC application so that it will use this new cache for session state. This is simply a matter of replacing the default provider “DefaultSessionProvider” in my web.config with the AppFabricCacheSessionStoreProvider. Below are both for reference:

Before:

     <add name="DefaultSessionProvider"
          type="System.Web.Providers.DefaultSessionStateProvider, System.Web.Providers, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
          connectionStringName="DefaultConnection"
          applicationName="/" />

After:

<add name="AppFabricCacheSessionStoreProvider"
     type="Microsoft.Web.DistributedCache.DistributedCacheSessionStateStoreProvider, Microsoft.Web.DistributedCache"
     cacheName="default"
     useBlobMode="true"
     dataCacheClientName="default" />

It’s important to note that I’ve set the cacheName attribute to match the name of the named cache I set up previously, in this case “default”. If you set up a different named cache, set the value appropriately or expect issues.

But we can’t stop there, we also need to update the sessionState node’s attributes, namely mode and customProvider as follows:

<sessionState mode="Custom" customProvider="AppFabricCacheSessionStoreProvider">

Demo Time

Of course, all this does nothing unless we have some code that shows the functionality at work. So let’s increment a user specific page view counter. First, I’m going to go into the home controller and add in some (admittedly ugly) code in the Index method:

// create the session value if we're starting a new session
if (Session.IsNewSession)
    Session.Add("viewcount", 0);
// increment the viewcount
Session["viewcount"] = (int)Session["viewcount"] + 1;

// set our values to display
ViewBag.Count = Session["viewcount"];
ViewBag.Instance = RoleEnvironment.CurrentRoleInstance.Id.ToString();

The first section just sets up the session value and handles incrementing it. The second block pulls the value back out to be displayed. Then alter the associated Index.cshtml page to render the values back out. So just insert the following wherever you’d like it to go.

Page view count: @ViewBag.Count<br />
Instance: @ViewBag.Instance

Now if we’ve done everything correctly, you’ll see the view count increment consistently regardless of which instance handles the request.
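If you’re wondering why the shared cache matters here, this short Python simulation (not C#, and not the demo itself; the instance names are made up) contrasts two role instances that share one session store with two that each keep their own in-process copy.

```python
class Instance:
    """Stand-in for one web role instance handling the Index action."""

    def __init__(self, name, session_store):
        self.name = name
        self.sessions = session_store    # dict: session id -> session data

    def index(self, session_id):
        """Bump and return the view count, plus which instance served it."""
        session = self.sessions.setdefault(session_id, {"viewcount": 0})
        session["viewcount"] += 1
        return session["viewcount"], self.name


# Shared store: both instances see the same session data.
shared = {}
a, b = Instance("IN_0", shared), Instance("IN_1", shared)
shared_counts = [inst.index("user1")[0] for inst in (a, b, a)]

# Separate stores: each instance counts on its own.
c, d = Instance("IN_0", {}), Instance("IN_1", {})
separate_counts = [inst.index("user1")[0] for inst in (c, d, c)]
```

With the shared store the counts climb steadily no matter which instance answers. With separate stores the count resets whenever the load balancer sends the user to an instance that hasn’t seen them before, which is exactly what the cache based session provider saves us from.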

Session.Abandon

Now there’s some interesting stuff I’d love to dive into a bit more if I had time, but I don’t today. So instead, let’s just be happy with the fact that after more than 2 years, Windows Azure finally has a “built in” session provider that is pretty darned good. I’m certain it still has its capacity limits (I haven’t tried testing to see how far it will go yet), but to have something this simple we can use for most projects is simply awesome. If you want my demo, you can snag it from here.

Oh, one last note. Since Windows Azure Caching does require Windows Azure Storage to maintain some information, don’t forget to update the connection string for it before you deploy to the cloud. If not, you’ll find instances may not start properly (not the best scenario admittedly). So be careful.

Until next time!

Exceeding the SLA–It’s about resilience

Last week I was in Miami presenting at Sogeti’s Windows Azure Privilege Club summit. Had a great time, talked with some smart, brave, and generally great people about cloud computing and Windows Azure. But what really struck me was how little was out there about how to properly architect solutions so that they can take advantage of the promise of cloud computing.

So I figured I’d start putting some thoughts down in advance of maybe trying to write a whitepaper on the subject.

What is an SLA?

So when folks start thinking about uptime, the first thing that generally pops to mind is the vendor service level agreements, or SLAs.

An SLA, for lack of a better definition, is a contract or agreement that provides financial penalties if specific metrics are not met. For cloud, these metrics are generally expressed as a percentage of service availability/accessibility during a given period. What this isn’t is a promise that things will be “up”, only that when they aren’t, the vendor/provider has some type of penalty they will pay. This penalty is usually a reimbursement of fees you paid.

Notice I wrote that as “when” things fail, not if. Failure is inevitable. And we need to start by recognizing this.
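To make the “percentage of availability” idea concrete, here is a small sketch that converts an availability percentage into the downtime it still permits. The percentages and the 30-day period are illustrative assumptions, not the terms of any particular vendor’s SLA.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime a service can have in the period and still meet the SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Illustrative availability levels only (assumed, not taken from a real SLA).
for sla in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Even at “three nines”, that is roughly three quarters of an hour each month where the service can be down without the vendor owing you anything.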

What are we after?

With that out of the way, we need to look at what we’re after. We’re not after “the nines”. What we’re wanting is to protect ourselves from any potential losses that we could incur if our solutions are not available.

We are looking for protection from:

  • Hardware failures
  • Data corruption (malicious & accidental)
  • Failure of connectivity/networking
  • Loss of Facilities
  • <insert names of any of 10,000 faceless demons here>

And since these types of issues are inevitable, we need to make sure our solution can handle them gracefully. In other words, we need to design our solutions to be resilient.

What is resilience?

To take a quote from the Bing dictionary:

[image: Bing dictionary definition of “resilience”]

Namely, we need solutions that can self-recover from problems. This ability to flex, handle outages, and easily return to full functionality when the underlying outages are resolved is what makes your solution a success. Not the SLA your vendor gave you.

If you’re Netflix, you test this with your appropriately named “Chaos Monkey”.

How do we create resilient systems?

Now that is an overloaded question and possibly a good topic for someone’s doctoral thesis. So I’m not going to answer that in today’s blog post. What I’d like to do instead is explore some concepts in future posts. Yes, I know I still need to finish my PHP series. But for now, I can at least set things up.

First off, assume everything can fail. Because at some point or another it will.

Next up, handle errors gracefully. “We’re having technical difficulties, please come back later” can be considered an approach to resilience. It’s certainly better than a generic 404 or 500 HTTP error.

Lastly, determine what resilience is worth to you. While creating a system that will NEVER go down is conceivably possible, it will likely be cost prohibitive. So you need to clearly understand what you need and what you’re willing to pay for.
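As a rough illustration of the “assume everything can fail” and “handle errors gracefully” points above, here is a minimal sketch. The function names and the choice of a 503 status are my own assumptions for the example, not a prescribed pattern.

```python
FRIENDLY_MESSAGE = "We're having technical difficulties, please come back later."


def render_page(load_content):
    """Return (status, body) without letting a backend failure escape as a raw error."""
    try:
        return 200, load_content()
    except Exception:
        # Assume any dependency can fail; degrade to a friendly page instead.
        return 503, FRIENDLY_MESSAGE


def healthy_backend():
    # Hypothetical dependency that works.
    return "Page view count: 1"


def failing_backend():
    # Hypothetical dependency that is down.
    raise ConnectionError("cache node unreachable")


print(render_page(healthy_backend))
print(render_page(failing_backend))
```

A real system would also log the failure and probably retry before giving up, but the caller always gets an answer it can show to a user.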

For now, that’s all I really wanted to get off my chest. I’ll publish some posts over the next few weeks that focus on some 10,000 foot high options for achieving resilience. Maybe after that, we can look at how these apply to Windows Azure specifically.

Until next time!

Detroit Day of Azure Keynote

Keynote is a fancy way of saying “gets to go first”. But when my buddy David Giard asked me if I would come to Detroit to support his Day of Azure, I couldn’t say no. So we talked a bit, tossed around some ideas, and I settled on a presentation idea I had been calling “When? Where? Why? Cloud?”. This presentation isn’t technical; it’s about helping educate both developers and decision makers on what cloud computing is, how you can use it, what opportunities it offers, etc. It’s a way to start the conversation on cloud.

The session seemed to go pretty well. Not much feedback, but there were lots of nodding heads, a few smiles (hopefully at my jokes), and only one person seemed to be falling asleep. Not bad for a presentation at 8am on a foggy, drizzly Saturday. So as promised, I’ve uploaded the presentation here if you’d like to take a look. And if you’re here because you were in the session, please leave a comment and let me know what you thought.

A Custom High-Availability Cache Solution

For a project I’m working on, we need a simple, easy to manage session state service. The solution needs to be highly available and low latency, but not persistent. Our session caches will also be fairly small in size (< 5 MB per user). But given that our projected high end user load could be somewhere in the realm of 10,000-25,000 simultaneous users (not overly large by some standards), we have serious concerns about the quota limits that are present in today’s version of the Windows Azure Caching Service.

Now we looked around at Memcached, ehCache, MongoDB, and nCache, to name some. And while they all did various things we needed, each came with its own pros and cons. Memcached didn’t have the high availability we wanted (unless you jumped through some hoops). MongoDB has high availability, but raised issues about deployment and management. ehCache and nCache, well… more of the same. Add to all that the fact that anything with an open source license would have to be vetted by the client’s legal team before we could use it (a process that can be time consuming for any organization).

So I spent some time coming up with something I thought we could reasonably implement.

The approach

I started by looking at how I would handle the high availability. Taking a note from Azure Storage, I decided that when a session is started, we would assign that session to a partition. Partitions would in turn be assigned to nodes by a controller, with a single node potentially handling multiple partitions (maybe primary for one and secondary for another, all depending on overall capacity levels).

The cache nodes would be Windows Azure worker roles, running on internal endpoints (to achieve low latency). Within each cache node will be three processes: a controller process, the session service process, and the replication process.

The important one here is the controller process. Since the controller process will attempt to run in all the cache nodes (aka role instances), we’re going to use a blob to control which one actually acts as the controller. The process will attempt to lock a blob via a lease, and if successful will write its name into that blob container. It will then load the current partition/node mappings from a simple Azure Storage table (I don’t predict us having more than a couple dozen nodes in a single cache) and verify that all the nodes are still alive. It then begins a regular process of polling the nodes via their internal endpoints to check on their capacity.
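The lease-based election described above can be sketched roughly as follows. This is an in-memory simulation only: the LeaseBlob class, the 60 second lease duration, and the instance names are assumptions made for illustration, and none of it is the Azure Storage API.

```python
class LeaseBlob:
    """In-memory stand-in for a blob that one owner at a time can lease."""

    def __init__(self):
        self.owner = None
        self.expires_at = 0.0
        self.content = ""

    def try_acquire(self, instance_id, now, duration=60.0):
        """Grant the lease if it is free, expired, or already held by this instance."""
        if self.owner in (None, instance_id) or now >= self.expires_at:
            self.owner = instance_id
            self.expires_at = now + duration
            self.content = instance_id  # the winner records its name for consumers
            return True
        return False


def elect_controller(blob, instance_ids, now):
    """Every instance tries for the lease; only the holder acts as controller."""
    for instance_id in instance_ids:
        blob.try_acquire(instance_id, now)
    return blob.owner


blob = LeaseBlob()
nodes = ["cache_IN_0", "cache_IN_1", "cache_IN_2"]  # hypothetical instance IDs
print(elect_controller(blob, nodes, now=0.0))

# cache_IN_0 goes away and stops renewing; once the lease lapses, another node wins.
print(elect_controller(blob, nodes[1:], now=61.0))
```

The key property is that the lease expires on its own, so a crashed controller cannot hold the role forever. A controller that is still alive simply re-acquires before the expiry to keep the role.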

The controller also manages the nodes by tracking when they fall in and out of service, and determining which nodes handle which partitions. If a node in a partition fails, it will assign a new node to that partition, and make sure that the new node is in different fault and upgrade domains from the surviving node. Internally, the two nodes in a partition will then replicate data from the primary to the secondary.
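Here is one way the controller’s placement rule could look, as a minimal sketch. The Node type, the node names, and the domain numbers are hypothetical. A real controller would also weigh the capacity it polls from each node, which is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    fault_domain: int
    upgrade_domain: int


def pick_secondary(primary, candidates):
    """Return the first node sharing neither domain with the primary, or None."""
    for node in candidates:
        if (node.name != primary.name
                and node.fault_domain != primary.fault_domain
                and node.upgrade_domain != primary.upgrade_domain):
            return node
    return None


def handle_node_failure(partition, failed, live_nodes):
    """Promote the surviving node to primary and recruit a replacement secondary."""
    primary, secondary = partition
    survivor = secondary if failed == primary else primary
    return survivor, pick_secondary(survivor, live_nodes)


nodes = [Node("n0", 0, 0), Node("n1", 0, 1), Node("n2", 1, 1), Node("n3", 0, 2)]
partition = (nodes[0], pick_secondary(nodes[0], nodes))
print([n.name for n in partition])

# n0 fails: its secondary takes over and a new secondary is chosen.
live = [n for n in nodes if n != nodes[0]]
partition = handle_node_failure(partition, nodes[0], live)
print([n.name for n in partition])
```

If no node satisfies both domain constraints, pick_secondary returns None, and the controller would have to decide whether to relax the rule or run the partition without a secondary for a while.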

Now there will also be a hook in the role instances so that the RoleEnvironment Changing and Changed events will alert the controller process that it may need to rescan. This could be a response to the controller being torn down (in which case the other instances will determine a new controller) or some node being torn down, in which case the controller needs to reassign the partitions that were assigned to that node to new nodes.

This approach should allow us to remain online without interruptions for our end users even while we’re in the middle of a service upgrade. Which is exactly what we’re trying to achieve.

Walkthrough of a session lifetime

So here’s how we see this happening…

  1. The service starts up, and the cache role instances identify the controller.
  2. The controller attempts to load any saved partition data and validate it (scanning the service topology)
  3. The consuming tier checks the blob container to get the instance ID of the controller, and asks it for a new session ID (and its resulting partition and node instance ID)
  4. The controller determines if there is room in an existing partition or creates a new partition.
  5. If a new partition needs to be created, it locates two new nodes (in separate domains) and notifies them of the assignment, then returns the primary node to the requestor.
  6. If a node falls out (crashes, is being rebooted), the session requestor gets a failure message, and goes back to the controller for a new node for that partition.
  7. The controller provides the name of the previously identified secondary node (which is of course now the primary), and also takes on the process of locating a new node.
  8. The new secondary node will contact the primary node to begin replicating its state. The new primary will start sending state change event messages to the secondary.
  9. If the controller drops (crash/recycle), the other nodes will attempt to become the controller by leasing the blob. Once established as a controller, it will start over at step 2.

Limits

So this approach does have some cons. We do have to write our own synchronization process and session providers. We also have to have our own aging mechanism to get rid of old session data. However, it’s my belief that these shouldn’t be horrible to create, so it’s something we can overcome.

The biggest limitation here is that because we’re going to be managing the in-memory cache ourselves, we might have to get a bit tricky (multi-gigabyte collections in memory), and we’re going to need to pay close attention to maximum session size (which we believe we can do).
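To give a feel for the aging mechanism and the session size limit mentioned above, here is a minimal in-memory sketch. The class and method names are hypothetical and the 20 minute idle timeout is an assumption. The 5 MB cap comes from the per-user session size stated earlier.

```python
class SessionStore:
    """In-memory session cache with idle expiry and a per-session size cap."""

    def __init__(self, max_idle_seconds=1200, max_session_bytes=5 * 1024 * 1024):
        self.max_idle_seconds = max_idle_seconds    # assumed 20 minute idle timeout
        self.max_session_bytes = max_session_bytes  # 5 MB per session
        self._sessions = {}  # session_id -> (payload bytes, last touched time)

    def put(self, session_id, payload, now):
        if len(payload) > self.max_session_bytes:
            raise ValueError("session exceeds the maximum allowed size")
        self._sessions[session_id] = (payload, now)

    def get(self, session_id, now):
        entry = self._sessions.get(session_id)
        if entry is None or now - entry[1] > self.max_idle_seconds:
            self._sessions.pop(session_id, None)
            return None
        self._sessions[session_id] = (entry[0], now)  # reading keeps the session alive
        return entry[0]

    def evict_expired(self, now):
        """Drop every session idle past the limit and return how many were removed."""
        stale = [sid for sid, (_, touched) in self._sessions.items()
                 if now - touched > self.max_idle_seconds]
        for sid in stale:
            del self._sessions[sid]
        return len(stale)


store = SessionStore()
store.put("a", b"cart=3", now=0)
store.put("b", b"cart=1", now=1000)
print(store.evict_expired(now=1500))
```

Running evict_expired on a timer in each cache node would cover the aging requirement, while the size check in put keeps any single session from eating the node’s memory.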

Now admittedly, we’re hoping all this is temporary. There have been public mentions that there’s more coming to the Windows Azure Cache service. And we hope that we can, at that time, swap out our custom session provider for one that’s built to leverage whatever the vNext of Azure Caching becomes.

So while not ideal, I think this will meet our needs and do so in a way that’s not going to require months of development. But if you disagree, I’d encourage you to sound off via the site comments and let me know your thoughts.