Azure File Depot – The BlobWatcher

Recently I was taking a look at WebJobs, the new feature added to Windows Azure Web Sites that lets you run applications continuously, at intervals, or triggered by certain events (such as a new object in Azure storage). One of the questions that popped into my head was, how does the “binding to blobs” work? If I could find an answer to that, perhaps I could add that as a feature to the File Depot project.

After poking around a bit, I found that there’s not really anything magical to what WebJobs was doing. They are leveraging the information that could be available to you or me. In the case of Azure Storage blobs, when you create a blob binding for the job, WebJobs is reaching into the storage account in question and turning on the write logs for blobs in that account and giving those logs a 7 day retention period. It’s these logs that are scanned/monitored by WebJobs so that when new blobs arrive, it can trigger your job based on the bindings you’ve set up.

FileDepotBlobWatcher-EnableLogs

 

So how are the logs scanned? Fortunately, Mike Stall has already published a great little write-up. In a nutshell, when a job is started, it does a full scan of the logs for all past data, then does incremental scans for new files. So armed with this information, I set out to create my own implementation of a blob detector, the Azure File Depot Blob Watcher!

Azure Storage Logs

Armed with the info from Mike’s post, the first step is to dig into the storage logs and figure out how they work. The storage team has a great post on using the logs and I recommend you take the time to give it a complete read. But here are the highlights…

When you enable logging, a new “$logs” container will be created. The blobs placed into this container are read only, you can read and delete them, but not alter them or their properties. The logs buffered up internally and periodically ‘flushed’ into this container as individual blobs.

In Mike’s post, he mentions that there is latency (5-10 minutes) detecting blobs, and this is because Azure storage buffers the logs for up to 5 minutes or until the buffer hits 4MB in size. At that time, they are written out, and we are able to access them. Thus the latency.

Log files are only written when there are operations we’ve indicated we want to log. But the naming convention always follows the pattern: <service>/YYYY/MM/DD/HHmm/<sequence>.log

So we’ve already identified a couple of requirements for our solution…

  • Don’t scan for new log files more than every 5 minutes
  • Get a list of logs from the $logs container that start with “blob/”
  • Don’t reprocess log files we’ve already examined

Once we have the files, we then have to parse them. I wrote a post last fall that describes using Excel parse the semi-colon delimited log entries. We’re going to need to do that in code, but fortunately it’s not that difficult. The logs are semi-colon delimited and use double-quotes to denote strings that include semi-colons that we won’t want to split/explode on. You could do this using a regular expression, but my own regex skills are so rusty that I opted to just parse the file via a bit of C# code.

int endDelim = 0;
int currentPos = 0;
while(currentPos <= logentry.Length-1)
{
 
    // if a quoted string... 
    if (logentry.Substring(currentPos,1).StartsWith("\""))
    {
        currentPos++; // skip opening quote
        endDelim = logentry.IndexOf("\";", currentPos);
        if (endDelim == -1) // if no delim, jump to end of string
            endDelim = logentry.Length - 1;
        properties.Add(logentry.Substring(currentPos, endDelim - currentPos));
        // skip ending quote and semicolon
        endDelim = endDelim + 2;
    }
    else // not quoted string
    {
        endDelim = logentry.IndexOf(';', currentPos);
        if (endDelim == -1) // if no delim, jump to end of string
            endDelim = logentry.Length - 1;
        properties.Add(logentry.Substring(currentPos, endDelim - currentPos));
        endDelim++;
    }
 
    currentPos = endDelim; // advance position
}

Not as elegant as a regex I fully admit. But with my unpracticed skills (it’s been 10+ years since I had my fingers deep in that), it would have taken me 2-3 times longer to get that working then just brute forcing it.

The final step is knowing what we want out of the logs. There’s two key values from the log that I’m after. The OperationType, and the RequestURI. The request URI is self-explanatory enough, that’s the URI of the blob that we’re trying to detect. The OperationType is the action that was performed against Azure storage. There’s only two values we’re going to monitor for, PutBlob and PutBlockList.

Now here is a bit of an issue. A small enough blob can be created or UPDATED, using just the PutBlob call. So if we detect that operation. So there is a chance that we may process the same file multiple times. We could resolve this by using a “receipt” pattern as is called out in the comments section of Mike’s post, or we could keep a list of processed blobs (perhaps in table storage). The approach really depends on your needs, so I’m going to leave it out of this implementation for now.

NOTE: It should also be noted, that since we’re only looking for PutBlob or PutBlockList operations, we’re not doing to be able to detect page blobs and will catch (via PutBlob) updates to smaller page blobs. Fixing this is definitely on my list, but will need to wait for another day.

The solution

Now that we know how to get at the log information, it’s time to start creating a solution. The first decision I made, was to separate detecting new logs files from their parsing. So we’ll have a LogScanner, and a LogParser. I also wanted to make parsing the log entries super easy, so I decided to create a LogEntry class that I can feed the string that is a log entry into and exposes the values as properties.

But I still have two issues… It’s likely, especially under high volumes, that parsing the logs will take much longer then detecting them. So under most circumstances, I can get by with a single LogScanner. So I’m going to implement a “traffic cop” or “gatekeeper” pattern so that only one LogScanner can run at a time.

My second issue is how to ensure I only alert to a new log file once. I’ll be running scans every 5 minuts or so, and listing blobs doesn’t really have an option for “give me only the new ones”. Fortunately, since I’m already using a gatekeeper, I can have it store the name of the last log file I processed for me. Making it pretty simple to keep track.

The final step of course is having both the LogScanner and LogParser use delegates so whomever is implementing them can create a method to handle when a log file is detected, or a new blob is found. Thus allowing them to control what actions are taken.

I’ll wrap the whole thing up in a reference implementation via a console app. So the final solution looks like this:

FileDepotBlobWatcher-SolutionLayout

The BlobLogEntry class expose the individual fields of the blob log entry (see the parsing code above or Codeplex for all this really does), the Gatekeeper to make sure only one LogScanner is trying to detect new log entries, and the LogScanner to parse a log once it’s been found.

Gatekeeper

I’ve blogged about the gatekeeper pattern before. I’ve known this as a “traffic cop” since long before folks started publishing design patterns on the internet, so to me that’s what it will always be. Regardless of the name, the purpose is to make sure only one process can do something at a time. We’re going to accomplish this by using a lease on an Azure storage blob as our control switch.

The Gatekeeper object needs to be able to start, stop, and renew the underlying blob lease. And because I’m also going to use it to store the last log file processed, I’m going to add SetText and GetText methods to write and retrieve strings to the underlying blob.

This class is fairly simple, so I’m not going detail code you can look at yourself on codeplex. So instead I’ll just call out a few highlights…

My gatekeeper constructor accepts a CloudBlockBlob for the blob on which we’ll place a lease. This gives the calling process full control over where that blob lives. It then creates a lease on the blob good for up to 60 seconds (the maximum allowed value), and attempts to renew that lease every 45 seconds. This gives me 15 seconds in the case of transient failures to successfully complete getting the lease before I run the risk of another scanner taking over.

In a couple places, we trap for a Storage Exception that has a 409 error code. This indicates that our attempt to get the lease has failed because somebody else already has a lease on the blob in question (aka another scanner has taken over).

Implementing the Gatekeeper is simply a matter of creating the CloudBlockBlob object, handing it off to the class constructor, and then calling start when we want to gain control. We can check periodically to see if we have the lease, optionally getting it if we don’t.

The final bit is to make sure the starting and stopping of a timer to renew the lease is put into the appropriate spots.

Take a look at the gatekeeper code and if you have you have questions, please feel free to post them in the comments.

LogParser

Also pretty straight forward is the parser. It takes the CloudBlockBlob object (which would be a log file) as a parameter for its constructor, then we the ParseFile method to inspect the log file.

public void ParseFile(FoundBlobDelegate callback)
{
    using (Stream stream = logFile.OpenRead())
    {
        // read the log file
        using (StreamReader reader = new StreamReader(stream))
        {
            string logEntry;
            while ((logEntry = reader.ReadLine()) != null)
            {
                // parse the log entry
                BlobLogEntry blobLog = new BlobLogEntry(logEntry);
 
                //NOTE: PutBlockList is the final write for a large block blob
                // PutBlob can also be used for small enough blobs, but also presents an overwrite of an existing one
                if (blobLog.OperationType.Equals("PutBlob") || blobLog.OperationType.Equals("PutBlockList"))
                    callback(blobLog.RequestUrl);
            }
        }
    }
}

This method opens a stream on the blob, and then reads through it line by line. Each line is parsed using the BlobLogEntry object and if the OperationType is “PutBlob” or “PutBlockList”.

Now I could have put this method into the LogScanner, but as I pointed out earlier, it’s highly likely it will take longer to parse the logs then to detect them. So in a real word implementation, the LogScanner may simply notify a pool of parsers, possibly via a queue. So separating the implementations made a certain amount of sense. Especially when I look ahead to having to deal with larger page blobs.

LogScanner

This is where most of my time on the project was spent. It has a few parallels with the Scanner in that we have a constructor that accepts some parameters (a CloudBlobClient and an instance of the Gatekeeper class), as well as Start and Stop methods.

Internally, the LogScanner object will be using the CloudBlobClient to create a CloudBlobContainer object that’s looking at the “$logs” container. We then use the gatekeeper to make sure that if I have multiple processes running log scans, only one of them can actually do the processing. Finally, it uses an internal timer object to make sure we’re scanning for new log files at a regular interval (which defaults to 5 minutes).

When we call the Start method, the LogScanner takes a delegate that the calling process can use to determine what action should be taken when a new log file is detected (such as using the LogParser to digest it). It then starts the gatekeeper process, and attempts to do an initial scan for logs (like Mike’s post said WebJobs does). Once that scan is complete, it will start the timer so we can do additional scans at the specified interval.

The stop just reverses these actions, stopping the scan timer and the gatekeeper. So the real meat of this class, is what happens when we scan for log files. So let’s walk through this a bit before I show you the code.

The first thing we need to be able to do is get a list of blobs in the $logs container. We have two scenarios we have to support with this, get everything (for an initial scan), and get just new stuff for incremental scans. The challenge is that Azure storage, only supports getting a list of blobs based on a filter on the name, not on any metadata or properties. The initial scan is fairly simple, we set our filter criteria to “blob/”, which will get all blob service logs in the container.

So let’s say we’ve already done a scan and we stored the last log file we found in our Gatekeeper, so I know where I left off. But how do I pick back up again? I could just filter for all logs and iterate through until we get back to the where we left off. Perfectly ok, but doesn’t strike me as particularly efficient. So if we think back to how the logs are named, I can parse the last log I found to go back to the year, month, day, and hour for which that log was produced. So when I pick back up on scanning, I scan for that hour and all the hours in between UTCNow and then.

Note: You could alternatively scan for day, month, or year. Depending on the frequency of your scans and the production of logs, these options could be more efficient then my hourly approach.

We start by extracting the datetime values from the last log file name (uri in the sample blow) we read from our gatekeeper…

int startPOS = uri.IndexOf("blob/") + 5;
int endPOS = uri.LastIndexOf('/');
 
return uri.Substring(startPOS, endPOS - startPOS);

We know all the URIs will have a “blob/” at the beginning since that’s the service we’re monitoring. Furthermore, the file names all end in a six digit sequence number with a ‘.log’ suffix. So if I find the position of the last ‘/’ character in the string, I can now extract the YYYY/MM/DD/HHmm portions from the URI. We can make all these assumptions because the log naming conventions are published and therefore somewhat immutable.

Note: Currently, the mm portion of log URI will always be zero per the published naming convention. This is a key assumption for our processing.

Next, we need to convert this substring to a datetime type

DateTime tmpDT;
// convert prefix to datetime
DateTime.TryParseExact(fileprefix, "yyyy/MM/dd/HHmm"null,
                       DateTimeStyles.None, out tmpDT);
return tmpDT;

This takes our URI substring, and converts it into a DateTime, leaving us to simply calculate the delta between the current UTC datetime and this value to know how many hour periods we need to filter for.

ScanPasses = ((DateTime.UtcNow - PrefixToDateTime(startingPrefix)).TotalHours + 1);

So now we know that we will do one filtered list for each hour from the last hour we found a file to the current datetime. Ideally, this could be optimized so that the gatekeeper stores the last scanned period so we don’t have to scan past hours for which there was no traffic. But my assumption is that if we’re scanning the logs, we expect traffic at fairly regular intervals. So repetitive scans of empty “hours” shouldn’t happen often. And when you add up the cost of those scans versus programmer time to optimizing things, I could scan a few eons of empty logs before the cost would match the programmer cost to fine tune this.

Now that we’re armed with that we need to do the scans of the logs, let’s look at some of the code…

// get last log file value from gatekeeper
string lastLog = gatekeeper.GetText();
 
// calculate starting prefix
if (!lastLog.Equals("blob/")) // we had a "last log" from previous runs
{
    startingPrefix = getPrefixFromURI(lastLog); // use that prefix as our starting point
    ScanPasses = ((DateTime.UtcNow - PrefixToDateTime(startingPrefix)).TotalHours + 1); // 
    pastPreviousLog = false// don't start raising "found log" events until we're past the last processed log
}

We start by getting the last log file we found from the gatekeeper. If that value is not “blob/”, then we’re doing a subsequent scan. We’ll get the data/time prefix from the log URI, and use that to calculate the number of scans we need to do. We also set a value that tells us we haven’t yet passed our previously found log file. We need this last part because subsequent scans will always resume in the same hour of the last log file we processed. And it’s possible that new log files have arrived.

Next we will enter into a loop that will execute once or each scan pass we calculated we need. If it’s a first time scan, we’ll only do one pass because our blob list filter will be all available logs.

// List the blobs using the prefix
IEnumerable<IListBlobItem> blobs = 
    logContainer.ListBlobs(string.Format("blob/{0}", startingPrefix), trueBlobListingDetails.Metadata, null);
 
// interate the list of log files
foreach (IListBlobItem item in blobs)
{
    CloudBlockBlob log = item as CloudBlockBlob;
    string LogURI = log.Uri.ToString();
    if (log != null)
    {
        if (pastPreviousLog)
        {
            // call Delegate to act on log file
            this.callback(log);
 
            // update gatekeeper blob 
            gatekeeper.SetText(LogURI);
            lastLog = LogURI;
        }
        if (lastLog.Equals(LogURI, StringComparison.OrdinalIgnoreCase))
            pastPreviousLog = true;
    }
}

For each log file, we look at the URI. If we’re past the last log file (as recorded by the gatekeeper, we will call the callback method handed into our object, alerting a calling process that a new log file has been found. We then ask the gatekeeper to save that URI as our new starting point for the next scan. Lastly, in case we had a previous log file recorded, we need to check and see if we’re at it, so we can process the additional logs.

And as we exit the log listing loop, we increment our filter criteria (so we can scan the next available hour), and decrement the scanpasses value so we know how many scans remain.

On either side of this, we also enable and disable the timer object. The only purpose of this is that on the off chance it takes us more than 5 minutes to scan the logs, we don’t double up on the scan operations.

Running the Sample

Hopefully you’ll find this solution pretty straightforward. With the classes in place, all that remains is to implement them, in this case as sample console application.

LogScanner and LogParser need some delegate methods. For LogScanner, we’ll use this …

public static void LogFound(CloudBlockBlob LogBlob)
{
    Console.WriteLine(string.Format("Parsing Log File: {0}", LogBlob.Uri));
 
    // Parse the Log
    LogParser myParser = new LogParser(LogBlob);
    //HINT: we could drop the log file into a queue and process asyncronously
    myParser.ParseFile(FoundBlob);
 
    //Option: delete the log once its processed
}

When the LogScanner finds a new log file, it will call this delegate. For my sample I’ve chosen to write the event to the console output, and immediately parse the file via the LogParser. Just keep in mind that the current implementation is a synchronous blocking call, so in a real production situation, you likely won’t want to do this. Instead, writing the event to a queue, where subscribers can then take and process the event.

We follow this up with a delegate for LogParser that will be called as we parse the log files that were found, and locate what we believe to be a blob.

public static void FoundBlob(string newBlobUri)
{
    // filter however you like, by container, file name, etc... 
    if (!newBlobUri.Contains("gatekeeper")) // ignore gatekeeper updates
        Console.WriteLine(string.Format("Found new blob: {0}", newBlobUri));
}

You’ll notice that in this method, I’m doing a wee bit of filtering based on the BlobURI. In a real implementation, you may only want to watch a handful of containers. In my sample implementation, the blob object that’s at the heart of the gatekeeper object will have the name “gatekeeper”, so I went for the simple approach to make sure I ignore any operations related to it. I thought about putting filter criteria (such as container) as an attribute of the LogParser, but ultimately settled on this approach as being far more flexible.

The final step was to go into the console app and set things in motion…

// set up our private variables.
string storageAccountString = Properties.Settings.Default.AzureStorageConnection;
 
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(storageAccountString);
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();

We start by retrieving the Azure Storage Account’s connection string, and using that string to get a CloudStorageAccount object, with which we create a CloudBlobClient.

// make sure we have a gatekeerp in place
CloudBlobContainer gatekeeperContainer = blobClient.GetContainerReference("gatekeeper");
gatekeeperContainer.CreateIfNotExists(); // want to make sure the container is there... 
CloudBlockBlob gatekeeperBlob = gatekeeperContainer.GetBlockBlobReference("gatekeeper");
Gatekeeper mygatekeeper = new Gatekeeper(gatekeeperBlob, "blob/");

Using the CloudBlobClient, we create a container where our gatekeeper blob will go, then get a CloudBlockBlob that will be the gatekeeper blob (the blob we’ll put leases on). Finally, using that blob, we create the gatekeeper object which also initializes the contents.

Next, we initialize the LogScanner and tell it to start processes, calling the delegate we already defined.

LogScanner myScanner = new LogScanner(blobClient, mygatekeeper);
myScanner.ScanInterval = new TimeSpan(0, 5, 0);
myScanner.Start(LogFound);

After that, all that remains is to give myself a simple loop to run in while the LogScanner and LogParser do their work. I’ve put in one that will run for up to an hour. After the loop exits, it will stop the scanner, which will release the lease on the blob. If you stop the console app forcibly, just be aware that the gatekeeper lease will persist for up to 1 minute. So your initial scan upon launching the program likely won’t have any results unless you wait at least 1 minute before restarting.

With the sample program complete, all that remains is to set the Azure Storage Account Connection string in the program’s application settings (using a storage account that has the Blob write logging enabled), then compile and run the solution. As it runs, you can upload blobs into it (perhaps using the Publishing Console project also located in the FileDepot Codeplex project), and within 5-10 minutes, you should start seeing files show up in the BlobWatcher console app.

Magic, no longer

So with this, I hope I’ve shed a bit of light on the Azure Storage logs and how they can be used. As I look back on creating this sample, I find that I almost spent more time digging into how storage logs work, then was spent actually working on the code. The final product could use some fine tuning, as well as enhancement for page blob scenarios. But as a starting point, I’m fairly happy with it.

Admittedly, if all you really want to do is monitor for new blobs and act on them, your best approach is to use Azure WebJobs. That team has far more to time and resources than I do. And as such, they can give you a solution that will be far more robust this then my simple code sample. But replacing WebJobs was never my objective, I just wanted to help highlight how Azure storage logging can be used to do more than just track errors and capacity utilization.

Please do check out BlobWatcher at the Azure File Depot on Codeplex. And more importantly, leave feedback either here or there. I want to make sure the project is fulfilling some common needs and to that end, one can never have enough feedback.

Until next time!

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: