A Custom High-Availability Cache Solution
February 7, 2012
For a project I’m working on, we need a simple, easy-to-manage session state service. The solution needs to be highly available and low latency, but not persistent. Our session caches will also be fairly small (< 5 MB per user). But given that our projected high-end user load could be somewhere in the realm of 10,000-25,000 simultaneous users (not overly large by some standards), we have serious concerns about the quota limits present in today’s version of the Windows Azure Caching Service.
Now we looked around: Memcached, ehCache, MongoDB, and nCache, to name a few. And while they all did various things we needed, each had its pros and cons. Memcached didn’t have the high availability we wanted (unless you jumped through some hoops). MongoDB has high availability, but raised issues about deployment and management. ehCache and nCache… more of the same. On top of all that, anything with an open source license would have to be vetted by the client’s legal team before we could use it (a process that can be time consuming for any organization).
So I spent some time coming up with something I thought we could reasonably implement.
I started by looking at how I would handle the high availability. Taking a note from Azure Storage, I decided that when a session is started, we would assign that session to a partition, and that partitions would be assigned to nodes by a controller, with a single node potentially handling multiple partitions (maybe primary for one and secondary for another, all depending on overall capacity levels).
The cache nodes would be Windows Azure worker roles, communicating over internal endpoints (to achieve low latency). Within each cache node will be three processes: a controller process, the session service process, and finally the replication process.
The important one here is the controller process. Since the controller process will attempt to run in all the cache nodes (aka role instances), we’re going to use a blob to control which one actually acts as the controller. Each process will attempt to lock the blob via a lease, and if successful will write its instance name into that blob. It will then load the current partition/node mappings from a simple Azure Storage table (I don’t predict us having more than a couple dozen nodes in a single cache) and verify that all the nodes are still alive. It then begins regularly polling the nodes via their internal endpoints to check on their capacity.
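To make the election concrete, here’s a minimal Python sketch of the blob-lease pattern. The `LeaseBlob` class is an in-memory stand-in for the real Azure blob lease (the actual service would go through the blob storage REST API); the class and function names are my own illustrations, not part of any SDK.

```python
import threading
import uuid


class LeaseBlob:
    """In-memory stand-in for an Azure blob that supports leases
    (hypothetical interface for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._lease_holder = None
        self.contents = ""  # the elected controller writes its name here

    def try_acquire_lease(self, lease_id):
        """Grant the lease only if nobody else holds it."""
        with self._lock:
            if self._lease_holder is None:
                self._lease_holder = lease_id
                return True
            return False


def try_become_controller(blob, instance_name):
    """Each cache role instance runs this at startup; only the lease
    winner becomes the controller, and it records its instance name
    in the blob so the consuming tier can find it."""
    lease_id = str(uuid.uuid4())
    if blob.try_acquire_lease(lease_id):
        blob.contents = instance_name
        return True
    return False
```

With this shape, a second instance racing for the same blob simply loses the lease and falls back to running only the session and replication processes.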
The controller also manages the nodes by tracking when they fall in and out of service, and determining which nodes handle which partitions. If a node in a partition fails, the controller will assign a new node to that partition, making sure that the new node is in different fault and upgrade domains from the surviving node. Internally, the two nodes in a partition will then replicate data from the primary to the secondary.
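The domain check is the piece that makes the pair survive both a hardware failure and a rolling upgrade. A small sketch of that selection logic (the `Node` record and `pick_replacement` helper are hypothetical names of my own):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    fault_domain: int
    upgrade_domain: int


def pick_replacement(surviving, candidates):
    """Choose a new partition member that sits in a different fault
    domain AND a different upgrade domain from the surviving node, so
    a single rack failure or service upgrade can't take out both
    replicas at once. Returns None if no candidate qualifies."""
    for node in candidates:
        if (node.fault_domain != surviving.fault_domain
                and node.upgrade_domain != surviving.upgrade_domain):
            return node
    return None
```

If no candidate qualifies, the controller would have to either wait for capacity or relax one of the constraints; the sketch just reports failure.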
Now there will also be a hook in the role instances so that the RoleEnvironment Changing and Changed events alert the controller process that it may need to rescan. This could be a response to the controller being torn down (in which case the other instances will determine a new controller), or to some nodes being torn down, in which case the controller needs to reassign the partitions that were assigned to those nodes.
This approach should allow us to remain online without interruptions for our end users even while we’re in the middle of a service upgrade. Which is exactly what we’re trying to achieve.
Walkthrough of a session lifetime
So here’s how we see this happening…
- The service starts up, and the cache role instances identify the controller.
- The controller attempts to load any saved partition data and validates it (scanning the service topology).
- The consuming tier checks the blob container to get the instance ID of the controller, and asks it for a new session ID (and its resulting partition and node instance ID).
- The controller determines if there is room in an existing partition or creates a new partition.
- If a new partition needs to be created, it locates two new nodes (in separate domains) and notifies them of the assignment, then returns the primary node to the requestor.
- If a node falls out (crashes, is being rebooted), the session requestor gets a failure message and goes back to the controller for a new node for that partition.
- The controller provides the name of the previously identified secondary node (which is of course now the primary), and also takes on the process of locating a new node.
- The new secondary node will contact the primary node to begin replicating its state. The new primary will start sending state change event messages to the secondary.
- If the controller drops (crash/recycle), the other nodes will attempt to become the controller by leasing the blob. Once established as a controller, it will start over at step 2.
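The session-placement and failover steps above can be sketched as a toy controller. Everything here is illustrative: the class, the tiny per-partition capacity, and the node names are placeholders, not part of the actual design.

```python
import itertools


class Controller:
    """Toy sketch of the controller's bookkeeping for the walkthrough
    above (names and the tiny capacity number are hypothetical)."""

    PARTITION_CAPACITY = 2  # sessions per partition, tiny for illustration

    def __init__(self, node_names):
        self.free_nodes = list(node_names)
        self.partitions = {}   # partition id -> [primary, secondary]
        self.sessions = {}     # session id -> partition id
        self._ids = itertools.count()

    def new_session(self, session_id):
        """Place the session in a partition with room, creating a new
        two-node partition if necessary; return the primary node."""
        for pid, nodes in self.partitions.items():
            count = sum(1 for p in self.sessions.values() if p == pid)
            if count < self.PARTITION_CAPACITY:
                self.sessions[session_id] = pid
                return nodes[0]
        pid = next(self._ids)
        primary, secondary = self.free_nodes.pop(0), self.free_nodes.pop(0)
        self.partitions[pid] = [primary, secondary]
        self.sessions[session_id] = pid
        return primary

    def report_failure(self, session_id):
        """On primary failure: promote the secondary, assign a fresh
        secondary, and tell the requestor who the new primary is."""
        pid = self.sessions[session_id]
        _failed, secondary = self.partitions[pid]
        new_secondary = self.free_nodes.pop(0) if self.free_nodes else None
        self.partitions[pid] = [secondary, new_secondary]
        return secondary
```

A real implementation would of course also do the domain-aware placement and kick off replication to the new secondary; the sketch only shows the bookkeeping.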
So this approach does have some cons. We do have to write our own synchronization process and session providers. We also have to have our own aging mechanism to get rid of old session data. However, it’s my belief that these shouldn’t be horrible to create, so it’s something we can easily overcome.
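The aging mechanism could be as simple as a sliding idle timeout. A minimal sketch (the class name and the TTL value are placeholders of my own, not figures from the project):

```python
import time


class AgingSessionStore:
    """Sketch of a session aging mechanism: entries expire after a
    fixed idle TTL, with lazy eviction on access plus a bulk sweep."""

    def __init__(self, ttl_seconds=1200, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable clock makes this testable
        self._data = {}              # session id -> (value, last_touched)

    def put(self, session_id, value):
        self._data[session_id] = (value, self._clock())

    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        value, touched = entry
        if self._clock() - touched > self._ttl:
            del self._data[session_id]               # lazily evict on access
            return None
        self._data[session_id] = (value, self._clock())  # sliding expiration
        return value

    def sweep(self):
        """Background pass removing idle sessions in bulk."""
        now = self._clock()
        expired = [s for s, (_, t) in self._data.items() if now - t > self._ttl]
        for sid in expired:
            del self._data[sid]
```

The sweep would run on a timer inside the session service process so memory is reclaimed even for sessions that are never read again.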
The biggest limitation here is that because we’re going to be managing the in-memory cache ourselves, we might have to get a bit tricky (multi-gigabyte collections in memory), and we’re going to need to pay close attention to maximum session size (which we believe we can do).

Now admittedly, we’re hoping all this is temporary. There have been mentions publicly that there’s more coming to the Windows Azure Caching service. And we hope that, at that time, we can swap out our custom session provider for one built to leverage whatever the vNext of Azure Caching becomes.

So while not ideal, I think this will meet our needs, and do so in a way that’s not going to require months of development. But if you disagree, I’d encourage you to sound off via the site comments and let me know your thoughts.