Windows Azure Diagnostics Part 2–Options, Options, Options

It’s a hot and muggy Sunday here in Minnesota. So I’m sitting inside, writing this update while my wife and kids get their bags ready for going back to school. It’s hard to believe that summer is almost over already. Heck, I’ve barely moved my ’68 Cutlass convertible this year. But enough about my social agenda.

After 4 months I’m finally getting back to my WAD series. Sorry for the delay, folks. It hasn’t been far from my mind since I did part 1 back in April. But I’m back with a post that I hope you’ll enjoy. And I’ve taken some of the time to do testing, digging past the surface in hopes of bringing you something new.

Diagnostic Buffers

If you’ve read up on WAD at all, you’ve probably read that there are several diagnostic data sources that are collected by default. Something that’s not made very clear in the MSDN articles (or even in many of the other blogs and articles I read while preparing for this) is that this information is NOT automatically persisted to Azure Storage.

So what’s happening is that these data sources are buffers that represent files stored in the local file system. The size of these buffers is governed by a property of the DiagnosticMonitorConfiguration settings, OverallQuotaInMB. This setting represents the total space on the VM that will be used for the storage of all log file information. You can also set quotas for the various individual buffers, the sum total of which should be no greater than the overall quota.

These buffers will continue to grow until their maximum quota is reached, at which time the older entries will be aged off. Additionally, should your VM crash, you will likely lose any buffer information. So the important step is to make sure you have each of your buffers configured to persist its logs to Azure Storage in a way that helps protect the information you are most interested in.
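To make the quota behavior concrete, here is a minimal sketch in Python. It is a toy model of the idea, not the WAD API: `validate_quotas`, `BoundedLogBuffer`, and the buffer names and sizes are all hypothetical, and the real buffers are files managed by the diagnostic monitor. It only illustrates the two rules described above: per-buffer quotas should fit inside the overall quota, and a full buffer ages off its oldest entries first.

```python
from collections import deque


def validate_quotas(overall_quota_mb, buffer_quotas_mb):
    """Return True when the per-buffer quotas fit inside the overall quota."""
    return sum(buffer_quotas_mb.values()) <= overall_quota_mb


class BoundedLogBuffer:
    """Toy model of a local log buffer that ages off its oldest entries."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.entries = deque()
        self.used_bytes = 0

    def append(self, entry):
        self.entries.append(entry)
        self.used_bytes += len(entry.encode("utf-8"))
        # Age off the oldest entries until the buffer fits its quota again.
        while self.used_bytes > self.quota_bytes and self.entries:
            oldest = self.entries.popleft()
            self.used_bytes -= len(oldest.encode("utf-8"))


# Hypothetical buffer names and sizes, chosen only for illustration.
quotas = {"infrastructure_logs": 100, "trace_logs": 200, "performance_counters": 100}
print(validate_quotas(4096, quotas))

buf = BoundedLogBuffer(quota_bytes=10)
for line in ["aaaa", "bbbb", "cccc"]:
    buf.append(line)
print(list(buf.entries))
```

Anything aged off before a transfer runs is gone, which is why the transfer settings discussed next matter so much.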

When running in the development fabric, you can actually see these buffers. Launch the development fabric UI and navigate to a role instance and right click it as seen below:


Poke around in there a bit and you’ll find the various file buffers I’ll be discussing later in this update.

If you’re curious about why this information isn’t automatically persisted, I’ve been told it was a conscious decision on the part of the Azure team. If all these sources were automatically persisted, the potential costs associated with Azure Storage could present an issue. So they erred on the side of caution.

Ok, with that said, it’s time to move on to configuring the individual data sources.

Windows Azure Diagnostic Infrastructure Logs

Simply put, this data source is the chatter from the WAD processes, the role, and the Azure fabric. You can see it start up, configuration values being loaded and changed, etc… This log is collected by default but, like we just mentioned, not persisted automatically. Like most data sources, configuring it is pretty straightforward. We start by grabbing the current diagnostic configuration in whatever manner suits you best (I covered a couple ways last time), giving us an instance of DiagnosticMonitorConfiguration that we can work with.

To adjust the WAD data source, we’re going to work with the DiagnosticInfrastructureLogs property, which is of type BasicLogsBufferConfiguration. This allows us to adjust the following values:

BufferQuotaInMB – maximum size of this data source’s buffer

ScheduledTransferLogLevelFilter – this is the LogLevel threshold that is used to filter entries when entries are persisted to Azure storage.

ScheduledTransferPeriod – this TimeSpan value is the interval at which the log should be persisted to Azure Storage.
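As a rough illustration of what the log level filter does at transfer time, here is a small Python model. It is a sketch under assumptions, not the WAD implementation: the `Level` scale and `select_for_transfer` are hypothetical names, and the sketch assumes that a lower number means a more severe entry.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical severity scale for this sketch: a lower number is more severe.
    CRITICAL = 1
    ERROR = 2
    WARNING = 3
    INFORMATION = 4
    VERBOSE = 5


def select_for_transfer(entries, threshold):
    """Keep only the entries that are at least as severe as the threshold."""
    return [(level, message) for level, message in entries if level <= threshold]


buffered = [
    (Level.VERBOSE, "config value loaded"),
    (Level.WARNING, "retrying storage call"),
    (Level.ERROR, "role failed to start"),
]

# With a WARNING threshold, the verbose chatter stays behind in the local buffer.
persisted = select_for_transfer(buffered, Level.WARNING)
for level, message in persisted:
    print(level.name, message)
```

In this model the filter only decides what gets persisted. The local buffer still holds everything until the quota ages it off.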

Admittedly, this isn’t a log you’re going to have call for very often, if ever. But I have to admit, when I looked at it, it was kind of interesting to see more about what was going on under the covers when roles start up.

Windows Azure Logs

The next source that’s collected automatically is Azure Trace Listener messages. This data source is different from the previous one because it only contains what you put into it. Since it’s based on trace listeners, you have to instrument your application to take advantage of this. Proper instrumentation of any cloud hosted application is something I consider a best practice.

Tracing is a topic so huge that considerable time can be (and has been) expended discussing it. You have switches, levels, etc… So rather than diving into that extensive topic, let me just link you to another source that does it exceedingly well.

However, I do want to touch on how to get this buffer into Azure Storage. Using the Logs property of DiagnosticMonitorConfiguration, we again access an instance of the BasicLogsBufferConfiguration class, just like the Azure Diagnostics Infrastructure logs, so the same properties are available. Set them as appropriate, save your configuration, and we’re good to go.

IIS Logs (web roles only)

The last data source that is collected by default, at least for web roles, is the IIS logs. These are a bit of an odd bird in that there’s no way to schedule a transfer or set a quota for these logs. I’m also not sure if their size counts against the overall quota. What is known is that if you do an on-demand transfer for ‘Logs’, this buffer will be copied to blob storage for you.


Our next buffer, the Failed Request Event Buffering log or FREB, is closely related to the IIS Logs. It is, of course, the log of failed IIS requests. This web role only data source is configured by modifying the web.config file of your role, introducing the following section.


Unfortunately, my tests for how to extract these logs haven’t yet been completed as I write this. As soon as they are, I’ll update this post with that information. For the moment, my assumption is that once configured, an on-demand transfer will pull them in along with the IIS Logs.

Crash Dumps

Crash dumps, like the FREB logs, aren’t automatically collected or persisted. Again, I believe that doing an on-demand transfer will copy them to storage, but I’m still trying to prove it. Configuring the capture of this data also requires a different step. Fortunately, it’s the easiest of all the logs in that it’s simply an on/off switch that doesn’t even require a reference to the current diagnostic configuration. As follows:


Windows Event Logs

Do I really need to say anything about these? Actually yes, namely that the security log… forget about it. Adding custom event types/categories? Not an option. However, what we can do is gather from the other logs through a simple XPath statement as follows:


In addition to this, you can also filter the severity level.

Of course, the real challenge is formatting the XPath. Fortunately, the king of Azure evangelists, Steve Marx, has a blog post that helps us out. At this point I’d probably go on to discuss how to gather these, but you know… Steve already does that. And it would be pretty presumptuous of me to think I know better than the almighty “SMARX”. Alright, enough sucking up… I see the barometer is dropping. So let’s move on.

Performance Counters

We’re almost there. Now we’re down to performance counters. A topic most of us are likely familiar with. The catch is that as developers, you likely haven’t done much more than hear someone complain about them. Performance counters belong in the world of the infrastructure monitoring types. You know. Those folks that sit behind closed doors with the projector aimed at a wall with scrolling graphs and numbers? If things start to go badly, a mysterious email shows up in the inbox of a business sponsor warning that a transaction took 10ms longer than it was supposed to. And the next thing you know, you’re called into an emergency meeting to find out what’s gone wrong.

Well guess what, mysterious switches in the server are no longer responsible for controlling these values. Now we can control them via WAD, as follows:


We create a new PerformanceCounterConfiguration, specify what we’re monitoring, and set a sample rate. Finally, we add that to the diagnostic configuration’s PerformanceCounters data sources and set the TimeSpan for the scheduled transfer. Be careful when adding though, because if you add the same counter twice, you’ll get twice the data. So check to see if it already exists before adding it.
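The "check before adding" advice boils down to a simple guard. Here is a minimal sketch in Python that models the data sources collection as a plain list of dictionaries. The `add_counter_once` helper and its field names are hypothetical stand-ins rather than the WAD types, and the counter string is only a placeholder.

```python
def add_counter_once(data_sources, counter_specifier, sample_rate_seconds):
    """Append a counter entry unless one with the same specifier is already present."""
    if any(src["counter_specifier"] == counter_specifier for src in data_sources):
        return False  # already configured, so adding it again would double the data
    data_sources.append(
        {"counter_specifier": counter_specifier, "sample_rate_seconds": sample_rate_seconds}
    )
    return True


# Placeholder specifier for illustration only; use the strings your Guest OS expects.
CPU_COUNTER = r"\Processor(_Total)\% Processor Time"

counters = []
first = add_counter_once(counters, CPU_COUNTER, 5)
second = add_counter_once(counters, CPU_COUNTER, 5)
print(first, second, len(counters))
```

The second call is a no-op, so the counter is only sampled once no matter how many times the configuration code runs.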

Something important to note here: my example WON’T WORK. As of the release of Azure Guest OS 1.2 (April of 2010), we need to use the specific versions of the performance counters or we won’t necessarily get results. So before you go trying this, get the right strings for the CounterSpecifier.

Custom Error Logs

*sigh* Finally! We’re at the end. But not so fast! I’ve actually saved the best for last. How many of you have applications you may be considering moving to Azure? These likely already have complex file based logging in them and you’d rather not have to re-instrument them. Maybe you’re using a worker role to deploy an Apache instance and want to make sure you capture its non-Azure logs. Perhaps it’s just a matter of your having an application that captures data from another source and saves it to a file, and you want a simple way to save those values into Azure storage without having to write more code.

Yeah! You have an option through WAD’s support for custom logs. They call them logs, but I don’t want you to think like that. Think of this option as your escape clause for any time there’s a file in the VM’s local file store that you want to capture and save to Azure Storage! And yes, I speak from experience here. I LOVE this option. It’s my catch-all. And the code snippet shows how to configure a data source to capture a file. In this snippet, “AzureStorageContainerName” refers to the blob container in Azure Storage that these files will be copied to. LogFilePath is of course where the file(s) I want to save are located.

Then we add it to the diagnostic configuration’s Directories data sources. So simple, yet flexible! All that remains is to set a ScheduledTransferPeriod or do an on-demand transfer.

Yes, I’m done

Ok, I think that does it. This went on far longer than I had originally intended. I guess I just had more to say than I expected. My only regret here is that just when I’m getting some momentum going on this blog again… I’m going to have to take some time away. I’ve got another Azure related project that needs my attention and is unfortunately under NDA.

Once that is finished, I need to dive into preparing several presentations I’m giving in October concerning the Azure AppFabric. If I’m lucky, I’ll have time to share what I learn as I work on those presentations. Until then… stay thirsty my friends.


Only as strong as the weakest link

I know, you’re still waiting on me to finish my Azure Diagnostics series. I’m working on it, but it’s taking longer than I wanted for me to do the due diligence on it. Meanwhile, I’ve been having discussions lately that I wanted to share.

Now I’m sure many of us have had to build a high performance website. We build the site, toss a load test at it, then tweak it until it can handle the load. If you need more performance, you may throw more memory or more CPU at it, or performance tweak a few things. And if you’re lucky, the guesstimate that was used to come up with the projected load was either low or, on a few occasions, accurate.

Problem is, as developers and architects, we’ve grown lazy. We’re too dependent on scaling vertically (bigger and more powerful hardware). We also too often fail to consider scalability in the early stages of architectural design. The prime example of this is database design. We think about creating an effective physical DB schema, but how often do we design early on for scaling our database?

Yes, I’m talking about database partitioning.

The discussions I’ve been having lately usually start with “I need more than 50gb for SQL”. To which I usually respond “why?”. Do we have blob types stored in SQL? Can you move those objects to Azure Storage? Do you really need all 50gb in a single DB instance? Do you have a subset of that data that is mostly read only? Can it be moved to its own DB instance or better yet possibly be converted over to an in-proc cache system?
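To show what "not everything in a single DB instance" can look like in code, here is a minimal Python sketch of key-based partition routing. Everything in it is an assumption made for illustration: the database names are hypothetical, and hashing a customer ID is just one common way to pick a partition, not a recommendation specific to SQL Azure.

```python
import hashlib

# Hypothetical partition names; each would be its own, smaller database.
DATABASES = ["orders_db_0", "orders_db_1", "orders_db_2", "orders_db_3"]


def partition_for(key, partition_count):
    """Map a key to a partition index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count


def database_for(customer_id):
    """Pick the database that holds this customer's rows."""
    return DATABASES[partition_for(customer_id, len(DATABASES))]


for customer in ["customer-1", "customer-2", "customer-3"]:
    print(customer, "->", database_for(customer))
```

The same key always lands on the same database, so each partition can stay well under the size limit. The trade-off is that queries spanning customers now have to fan out across partitions, which is exactly the kind of thing worth designing for early.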

We need to rethink application data stores. Those with experience with data warehouses and operational data stores are already reading this and thinking “well, duh”, but the fact of the matter is that the average developer still needs to change their mindset. To an extent, we need to reach back 20+ years to the days when the mainframe and multi-tenant systems were common.

The rules of DBs have changed. And if we don’t change with them, we’ll continue to limit ourselves. To quote “The Matrix”, free your mind. Think about the DB not as an appliance but as yet another tool in your toolbox. If we don’t, it will remain the weakest link in our solutions.

On that note, I’d like to point you at a case study for TicketDirect out of New Zealand. Pay special attention to the “Migrating to Windows Azure” section, where they talk about using small, sometimes temporary SQL Azure partitions to deliver a high performance, nearly on-demand type solution. Imagine selling out an arena for a U2 concert in 20 minutes and the hosting costs being less than a new netbook.

If reading this blog (or better yet the case study) excites you with new ideas, then I’ve accomplished my goal. Have fun thinking outside of the box.

Security vs Integration: What is the bigger risk?

Ok, this next statement is likely to get me flamed pretty solidly. But I’m doing it anyways. I disagree that security is the greatest risk facing companies that are considering the adoption of cloud computing technologies. Before you start firing back at me about this, please keep reading.

I recognize that security is a paramount concern for anyone considering a move to the cloud. I also strongly encourage anyone moving in that direction to make sure they are fully aware of any compliance challenges they may face. However, the fact remains that over the last 10+ years, IT has developed a series of fairly robust and, for the most part, successful patterns for securing data and applications.

Our network teams, infrastructure specialists, application architects, and compliance officers have all become extremely adept at building the virtual equivalent of medieval castles whose primary purpose is to protect and safeguard access to our most sensitive assets, namely our data. Much like Fort Knox, these highly complex and closely monitored solutions keep our most precious resources safely locked behind iron gates.

I believe we’ve become too comfortable sitting behind our firewalls and appliances. We hide behind these fortifications and feel safe.

As we continue to move forward into the ever changing future that is IT, these castles are becoming a limiting factor. In today’s rapidly changing markets, we need to be able to adapt more quickly. To that end we’re starting to explore the possibilities of Cloud Computing, an approach that by its very nature is asking us to step outside the walls of our fortresses.

So now we need the IT version of an armored car. We put into it only what we need to do a specific job and make sure it has just the right mix of armor and mobility to get that job done. The truth of the matter is that the patterns and practices we’ll use to do this haven’t changed much in the last 10-20 years. Encryption, authentication, authorization, audits, etc… will all play a part in making this happen. There’s really nothing new here.

What I see as the real risk of Cloud Computing adoption is how we integrate public cloud computing platforms with traditional on-premise solutions. The technologies for accomplishing this are far less mature. We’re building new bridges, often with only partial understandings of the connections on both ends. This is the real risk and I’ve seen numerous accounts of failed cloud projects because this issue wasn’t addressed seriously enough.

So there’s my rant. I don’t expect everyone to agree with me. But hopefully there’s at least some food for thought. Flame away!