Loading it on with Event Grid

Last week, I had one of those moments that makes me really appreciate my role at Microsoft. I was fortunate enough to spend 2 days working along side colleagues and members of the Event Grid product group (PG) with the Event Grid. We played with some stuff that hasn’t been announced (sorry, not covering that here), as well as tested out a few scenarios for them and gave feedback on the service.

The scenario I picked was “Event Grid under load”, basically to try and DDOS the service. The objective of this scenario was to send load to a custom topic in Event Grid and see a) could I get throttling to kick in and b) what were the behaviors we saw.  The answer was a) yes and b) educational.

The Solution

Event Grid commits to handle the ingestion of up to 5,000 events per second, but we were encouraged to see how far we could push things. And while there are some solutions out there that can be used to generate simulated load, I went to what I knew… Service Fabric. I opted to create a simple micro-service that could generate traffic, then use a Service Fabric cluster to run multiple copies of that service. The service is pretty simple, about 20 lines of code inserted into the “run” method of a stateless service.

List<EventGridEvent> eventList = new List<EventGridEvent>();

for(int i=1; i <= 1000; i++)
    EventGridEvent myEvent = new EventGridEvent()
        Subject = $"Event {i}",
        EventType = "Type" + i.ToString()[0],
        EventTime = DateTime.UtcNow,
        Id = Guid.NewGuid().ToString(),
        Data = "sample data",
        DataVersion = "1.0"

TopicCredentials topicCredentials = new TopicCredentials(sasKey);
EventGridClient client = new EventGridClient(topicCredentials);
client.PublishEventsAsync(eventgridtopic, eventList);

await Task.Delay(TimeSpan.FromMilliseconds(100), cancellationToken);

I started running a single copy of this service on my development machine using my local development Service Fabric cluster. I played around with settings going from a batch of 100 events once per second and tweaking things until it was finally running as you see above with a batch of 1000 events about 10 times a second. At that point we stood up a 5 node Service Fabric cluster running Standard_D2_v2 instances so we could deploy the service there. I was able to batch up 1000 of my events because I didn’t exceed the 1mb maximum payload of a single publish operation. So if you opt for this same type of bulk send, please keep that in mind when designing your batch size.

Event Grid is pretty smart, so when an event comes in, it will persist it to storage, than look for subscribers that need to receive it. If there are none, it stops there. So to really help finish this scenario, we needed to have a way to consume the load. This could prove much more difficult then generating the load, so we instead opted for an easy solution… Event Hub. We created a subscriber on our Custom Topic that would just send all the event that came in to an Event Hub that was configured to automatically scale up to the maximum limits.

With everything in place, it was time to start testing.

The Results

We started our first real test with one copy of the load generating service on each node in the cluster, generating approximately 50,000 events per second. Using the Azure Portal, we monitored the Custom Topic and its sole subscriber closely and saw that everything appeared to be handling the load admirably. The only real deviation we saw was momentary degradation that we believe presented the Event Hub engaging its automatic scaling to handle the extra traffic. This is important because unlike a typical load test where you’d slowly scale up to the 50,000, we basically went from 0 to 50k in a few seconds.

Raja, a lead architect on Event Grid, was sitting next to me monitoring the cluster in Azure that hosted my custom topic. After getting pretty excited about what I was seeing, he agreed that we could double the traffic. So after a quick redeploy of the solution, we doubled our load to 100k events per second. Things started off great, but after a few minutes we noticed we started seeing steadily increasing failure rates on sends to the topic. When we looked at the data, the cause became apparent. Event Hub was throttling the Event Grid’s attempts to send events to it. This created back-pressure on the Event Grid cluster as it started executing its retry scenarios.

With this finding in mind, we realized that we need to change the test a bit and filter the Event Hub subscriber. We went with a simple option again and opted to have the Event Hub only subscribe to events where the Event Type was “Type 1”. In the code above, we take the first number of the loop (1-1000) and use that as the Event Type. Thus generating 9 different events types. So with the revised filtering, we returned to sending 50k events per second, with the Event Hub only getting slightly less than 5k of them. This worked fine, so we double it again to 100k. This time… because we have eliminated the back-pressure, we were able to achieve 100k without any failures.

With 100k successfully in place, Raja looked at the cluster and gave the thumbs up again for us to double traffic to 200k. We quickly saw that we’d hit our upper limit and we started getting publication errors on the custom topic. By our account, we were hitting the limit at just a nudge past 100k. But we let the load run for awhile and continued to monitor what was going on. As we watched, we saw that our throughput would drop, at times to as low as 30k per second but would never really get above that 100k limit. Raja explained that what was happening was that the cluster had capacity and was giving it to us up to a certain point. But as pressure from other tenants on the cluster came and went, our throughput would be throttled down to ensure that there were resources available for those tenants. In short, we had proven that the system was behaving exactly as intended.

So what did we learn…

The entire experience was great and I had alot of fun. But more importantly, I learned a few things. And in keeping with my role… part of the job is to make sure I pass these learnings along to others. So here’s what we have…

First, while the service can deliver beyond the 5,000 events per second that the PG commits too, we shouldn’t count on that. This burned folks back in the early days of Azure when they would build solutions using Azure Blobs and base their design on tests that showed it was exceeding the throughput targets that were committed to at that time. When things would get throttled down to something closer to those commitments, it would break folks solutions. So when you design and build a solution, be sure to design based on the targets for the service, not observed behaviors.

Secondly, back-pressure from subscriber egress (this include failed deliver retries) can affect ingestion throughput. So while its good to know that Event Grid will retry failures, you’ll want to take steps to keep the amount of resource drain from retries to the minimum. This can mean ensuring that the subscriber(s) can handle the load, but also that when testing your solution, you properly simulate the impact of all the subscribers that you may have in your production environment.

And finally, and I would hope this goes without saying… pick the right solution for the right job. Our simulated load in this case represented an atypical use case for Event Grid. Grid, unlike Event Hub, is meant to deliver the items we need to act on.. the actionable events. The nature of my simulated traffic was more akin to a logging solution where I’m just tossing everything at it. In reality, I’d probably have a log stream, and only send certain events to the grid. So this load test certainly didn’t use it in the matter that was intended.

So there you have it. I managed to throw enough load at both the Event Hub and Event Grid to cause both solutions to throttle my traffic. With a few tweaks, my simple service could also possibly be used to create a nice little (if admittedly overly expensive) HTTP DDOS engine. But that’s not what is important here (even if it was kind of fun). What is important is that we learned a few things about how this service works and how to design solutions that don’t end up creating more problems then they solve.

Until next time!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: