Network Isolation/Security with Azure Service Fabric

There are times you really need to take things beyond the “file new” experience and implement a more advanced scenario. And with these opportunities, there are times you realize that what you need likely isn’t a “one off” kind of thing. There are larger implications to what you need that can help solve a myriad of problems. This is the story of one of these scenarios.

I was recently working with a partner as they explored Service Fabric. They liked what they saw, but there was a “but” (there almost always is). This partner is in the government space, and one of the requirements they had is that all public facing services are isolated and secured from any “back end” services (in a DMZ). If you’ve been doing IT for any length of time, this shouldn’t come as news. But the question they had for me was how to do this with Service Fabric.

There were a couple of ways to address this that immediately came to mind. We could deploy the front end web application as an Azure Web App, hosted in an App Service Environment that was joined to the same VNet as the Service Fabric cluster. We could also set up two Service Fabric clusters, again joined by a single VNet. The issue with both of these is that the front and back ends of the solution would need to be deployed and managed separately. Not a huge deal, admittedly. But it did complicate the provisioning and deployment processes a bit, and it also seemed to run counter to the idea of a Service Fabric “application” composed of multiple services as a single entity. I was fortunate that I had previously engaged my friend and colleague Kal to bring his considerable Service Fabric experience into play with this partner, and he suggested a third option, one we all found fairly intriguing.

A Service Fabric cluster is made up of Node Types, each of which is backed by its own VM Scale Set. Taking advantage of this, we could place different node types into different subnets and put Network Security Groups (NSGs) on the subnets to provide the level of isolation the partner required. We would then use Placement Constraints to ensure that the services within an application are only hosted in the proper subnet, by using constraints specific to the node type, or types, in that subnet.

We ran the idea by Mark Fussell, the lead Program Manager of the Service Fabric team. As we talked, we realized that folks had secured a cluster from all external access before, but there didn’t appear to be a public, previously documented version of what we were proposing. Mark was supportive of the idea, and even offered up that in some of the “larger” Service Fabric clusters, the placement constraint approach has been used to ensure that the services that make up the Service Fabric cluster itself remain isolated from those that comprise the applications deployed within it.

Our mission clear, I set to work! We were going to build an Azure Resource Manager (ARM) template that creates our “DMZ’d Service Fabric Cluster”.

Network Topology 

The first step was to create the overall network topology.

[Diagram: network topology with the front end, back end, and management subnets and their load balancers]

We have the front end subnet, which has a public load balancer that handles traffic from the internet. There is a back end subnet with an internal load balancer that does not allow any connections from outside of the virtual network (using a private IP). Finally, we have a management subnet that contains the cluster services, including the web portal (on port 19080) and the TCP client API (19000). For good measure, we’re also going to toss an RDP jump box into this subnet so that if something goes wrong with any of the nodes in the cluster, we can remote in and troubleshoot (something I used the heck out of while crafting this template).
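
For reference, here’s a simplified sketch of how the virtual network might be declared in the template. The subnet address prefixes line up with the addressing scheme described later in this article; the variable names and the API version reference are illustrative, not lifted from the final template.

{
  "apiVersion": "[variables('networkApiVersion')]",
  "type": "Microsoft.Network/virtualNetworks",
  "name": "[variables('vnetName')]",
  "location": "[resourceGroup().location]",
  "properties": {
    "addressSpace": {
      "addressPrefixes": [ "10.0.0.0/16" ]
    },
    "subnets": [
      {
        "name": "FrontEnd",
        "properties": {
          "addressPrefix": "10.0.1.0/24",
          "networkSecurityGroup": { "id": "[variables('nsgFrontEnd')['Ref']]" }
        }
      },
      {
        "name": "BackEnd",
        "properties": {
          "addressPrefix": "10.0.2.0/24",
          "networkSecurityGroup": { "id": "[variables('nsgBackEnd')['Ref']]" }
        }
      },
      {
        "name": "Management",
        "properties": {
          "addressPrefix": "10.0.3.0/24",
          "networkSecurityGroup": { "id": "[variables('nsgManagement')['Ref']]" }
        }
      }
    ]
  }
}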

With this in place, we then define the VM Scale Sets, and bind their network configurations to the proper subnets as follows:

"networkInterfaceConfigurations": [ 
  { 
    "name": "[variables('nodesMgmnt')['nicName']]", 
    "properties": { 
      "ipConfigurations": [ 
        { 
          "name": "[concat(variables('nodesMgmnt')['nicName'],'-',0)]", 
          "properties": { 
            "loadBalancerBackendAddressPools": [ 
              { 
                "id": "[variables('lbMgmnt')['PoolID']]"
              } 
            ], 
            "subnet": { 
              "id": "[variables('subnetManagement')['Ref']]"
            } 
          } 
        } 
      ], 
      "primary": true
    } 
  } 
]
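
The front end and management node types are wired to their load balancer pools in the same way. For the back end, the load balancer referenced by the pool is an internal one whose front end sits on a private IP inside the back end subnet (10.0.2.4, as noted later in this article). A trimmed sketch of what that resource might look like; the variable names are illustrative:

{
  "apiVersion": "[variables('networkApiVersion')]",
  "type": "Microsoft.Network/loadBalancers",
  "name": "[variables('lbBackEnd')['Name']]",
  "location": "[resourceGroup().location]",
  "properties": {
    "frontendIPConfigurations": [
      {
        "name": "LoadBalancerIPConfig",
        "properties": {
          "privateIPAllocationMethod": "Static",
          "privateIPAddress": "10.0.2.4",
          "subnet": { "id": "[variables('subnetBackEnd')['Ref']]" }
        }
      }
    ],
    "backendAddressPools": [
      { "name": "LoadBalancerBEAddressPool" }
    ]
  }
}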

With the VM Scale Sets in place, we then moved on to the Service Fabric cluster to define each Node Type. Here’s the cluster node type definition for the management subnet node type.

{ 
  "name": "[variables('nodesMgmnt')['TypeName']]", 
  "applicationPorts": { 
    "endPort": "[variables('svcFabCluster')['applicationEndPort']]", 
    "startPort": "[variables('svcFabCluster')['applicationStartPort']]"
  }, 
  "clientConnectionEndpointPort": "[variables('svcFabCluster')['tcpGatewayPort']]", 
  "durabilityLevel": "Bronze", 
  "ephemeralPorts": { 
    "endPort": "[variables('svcFabCluster')['ephemeralEndPort']]", 
    "startPort": "[variables('svcFabCluster')['ephemeralStartPort']]"
  }, 
  "httpGatewayEndpointPort": "[variables('svcFabCluster')['httpGatewayPort']]", 
  "isPrimary": true, 
  "placementProperties": { 
    "isDMZ": "false"
  },            
  "vmInstanceCount": "[variables('nodesMgmnt')['capacity']]"
} 

The “name” of this Node Type must match the name of a VM Scale Set; that’s how the two get wired together. Since this sample is for our “management” node type, it is also the only one with the isPrimary property set to true.
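
In other words, the template uses the same value for the node type’s name and for the scale set’s resource name, roughly like this (a sketch only; the exact variable layout and the sku values are assumptions):

{
  "apiVersion": "[variables('computeApiVersion')]",
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "name": "[variables('nodesMgmnt')['TypeName']]",
  "sku": {
    "name": "[variables('nodesMgmnt')['vmSize']]",
    "tier": "Standard",
    "capacity": "[variables('nodesMgmnt')['capacity']]"
  }
}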

At this point, we debugged the template to ensure it was valid and that the cluster would come up “green”. The next (and harder) step was to start securing the cluster.

Note: If you create a cluster via the Azure portal with multiple node types, each node type will get its own subnet. However, we were after a reusable ARM template so we had to configure things ourselves.

Network Security

Unfortunately, when we set out to create this, there wasn’t much publicly available on the ports that are needed within a fabric cluster. So we had to do some guesswork, some heavy digging, and hope for a bit of good luck. In this section I’m hoping to lay out some of what we learned to save others the effort.

First off, we started by blocking all inbound connections on the three subnets. I then opened ports 19080 (used by the Service Fabric web portal) and 19000 (used by the Fabric Client and PowerShell) on the “management” subnet so I could interact with the cluster remotely. This was all done interactively via the Azure Portal so we could test the rules out, then use Resource Explorer to export them to our template. We assumed that with these rules in place, we would see some of the nodes in the cluster go “red” or unhealthy. But we didn’t!

It took a day or so, but we eventually figured out that we were seeing two separate systems collide. First, when a VM is brought up, the Service Fabric extension is installed into it. This extension then registers the node with the cluster, and as part of that process a series of connections is established. These connections are not ephemeral; they remain up for the life of the node. Our mistake was assuming these connections were only temporary, established when needed, the way we encourage most of our partners to build their applications.

Since these are established, persistent connections, they are not impacted when new NSG rules are applied. This makes sense since the NSG rules are there to interrogate any new connection requests, not look over everything that’s already been established. So the nodes would remain green until we rebooted them (tearing down their connections) and they tried (and failed) to re-establish their connection to the cluster.

That sorted out, we set about putting the remainder of the rules in place for the subnets. We knew we wanted internet connectivity to the application/service ports in the front end, as well as access to the application/service ports in the back end from within the VNet. What we were missing was the ports that Service Fabric itself needed. We found most of these in the cluster manifest:

 
<Endpoints> 
  <ClientConnectionEndpoint Port="19000" /> 
  <LeaseDriverEndpoint Port="1026" /> 
  <ClusterConnectionEndpoint Port="1025" /> 
  <HttpGatewayEndpoint Port="19080" Protocol="http" /> 
  <ServiceConnectionEndpoint Port="1027" /> 
  <ApplicationEndpoints StartPort="20000" EndPort="30000" /> 
  <EphemeralEndpoints StartPort="49152" EndPort="65534" /> 
</Endpoints> 

This worked fine at first. We stood up the cluster with these rules properly in place and the nodes were all green. However, when we tried to deploy an app to the cluster, it would always time out during the copy step. I spent a couple of hours troubleshooting this one, eventually realizing that something inside the cluster was still blocked. I spent a bit of time looking at Wireshark and netstat runs inside the nodes to determine what could still be the blocker. This could have carried on for some time had it not been for Vaishnav Kidambi pointing out that Service Fabric uses SMB to copy the application/service packages around to the nodes in the cluster. We added a rule for that, and things started to work!
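
For anyone hitting the same wall, the fix boiled down to an inbound rule allowing SMB (TCP port 445) between nodes within the virtual network. A sketch of what that rule might look like in the template; the rule name and exact priority are illustrative:

{
  "name": "allowSvcFabSMB",
  "properties": {
    "description": "allow SMB within the VNet, used by Service Fabric to copy application packages between nodes",
    "access": "Allow",
    "direction": "Inbound",
    "priority": 3930,
    "protocol": "Tcp",
    "sourceAddressPrefix": "VirtualNetwork",
    "sourcePortRange": "*",
    "destinationAddressPrefix": "*",
    "destinationPortRange": "445"
  }
}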

Note: As a result of this work, the Service Fabric product team has acknowledged that there’s a need for better documentation on the ports used by Service Fabric. So keep an eye out for additions to the official documentation.

Here’s what the final set of inbound rules for the Network Security Group (NSG) associated with the management subnet looked like.

[Table: the final inbound NSG rules for the management subnet]

A quick rundown… I’ll start at the bottom of the list (the highest priority number, which the NSG evaluates last) and work my way up. Rule 4000 blocks all traffic into the subnet. Rules 3950 and 3960 enable RDP connections within the VNet, and to the RDP jump box (at internal IP 10.0.3.4) from the internet. The next three rules (3920-3940) allow the connections needed by Service Fabric within the VNet only (thus allowing the Service Fabric agents on the nodes to communicate). And finally, the first two rules (3900 and 3910) open up external connections for ports 19080 and 19000. Rules 3960, 3900, and 3910 are unique to the management subnet. I’ll get to why 19000 and 19080 are unique to this subnet in a moment.
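
To make that rundown concrete, here’s a sketch of how a few of those rules might appear in the NSG’s securityRules collection. The ports and priorities follow the rundown above (the cluster, lease driver, and service connection ports come from the manifest shown earlier); the rule names and exact structure are illustrative:

"securityRules": [
  {
    "name": "allowSvcFabPortal",
    "properties": {
      "description": "allow access to the Service Fabric web portal",
      "access": "Allow",
      "direction": "Inbound",
      "priority": 3900,
      "protocol": "Tcp",
      "sourceAddressPrefix": "Internet",
      "sourcePortRange": "*",
      "destinationAddressPrefix": "*",
      "destinationPortRange": "19080"
    }
  },
  {
    "name": "allowSvcFabClient",
    "properties": {
      "description": "allow access to the TCP client API used by PowerShell and Visual Studio",
      "access": "Allow",
      "direction": "Inbound",
      "priority": 3910,
      "protocol": "Tcp",
      "sourceAddressPrefix": "Internet",
      "sourcePortRange": "*",
      "destinationAddressPrefix": "*",
      "destinationPortRange": "19000"
    }
  },
  {
    "name": "allowSvcFabCluster",
    "properties": {
      "description": "allow the cluster, lease driver, and service connection ports within the VNet",
      "access": "Allow",
      "direction": "Inbound",
      "priority": 3920,
      "protocol": "Tcp",
      "sourceAddressPrefix": "VirtualNetwork",
      "sourcePortRange": "*",
      "destinationAddressPrefix": "*",
      "destinationPortRange": "1025-1027"
    }
  },
  {
    "name": "denyAllInbound",
    "properties": {
      "description": "deny everything not explicitly allowed above",
      "access": "Deny",
      "direction": "Inbound",
      "priority": 4000,
      "protocol": "*",
      "sourceAddressPrefix": "*",
      "sourcePortRange": "*",
      "destinationAddressPrefix": "*",
      "destinationPortRange": "*"
    }
  }
]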

Dynamic vs Static Ports

One sidebar for a moment. Connectivity between the front and back end is restricted to a set of ports you set when you run the template (it defaults to 80 and 443). In Service Fabric terms, these are static ports. When you build services, you also have the option of asking the fabric for a port to use, a dynamic port. As of the writing of this article, the Azure load balancer does not support these dynamic ports. So to expose them via the load balancer and our network isolation rules, we’d need a way to update both each time a port is allocated or released. Not ideal.
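
To make the distinction concrete, here’s roughly how the two options look in a service manifest. Omitting the Port attribute is what asks Service Fabric to assign a dynamic port; the endpoint names here are just examples:

<Resources>
  <Endpoints>
    <!-- Static port: a fixed value that load balancer rules and NSG rules can reference -->
    <Endpoint Name="WebEndpoint" Protocol="http" Type="Input" Port="80" />
    <!-- Dynamic port: no Port attribute, so Service Fabric assigns one from the
         cluster's application port range when the service is activated -->
    <Endpoint Name="InternalEndpoint" Protocol="http" Type="Input" />
  </Endpoints>
</Resources>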

My thought is that most use of dynamic ports is likely going to be between services that have a trusted relationship. That relationship would likely result in the services being placed inside the same subnet. If you needed to expose something they were doing to the “outside world”, you would likely set up a gateway/façade service that in turn might be load balanced. It’s this gateway service that would be exposed on a static port so that it can easily be reached via a load balancer and secured with NSG rules.

Restricting Service Placement

With the network topology set, and the security rules for each of the subnets sorted, next up was ensuring that application services get placed into the proper locations. Service Fabric services can be given placement constraints. These constraints, defined in the Service Manifest, are checked against Placement Properties for each node type to determine which nodes types should host a service instance. These are commonly used for things like restricting services that require more memory to nodes that have more memory available or situations where specific types of hardware are required (a GPU for example).

Each node type gets a default placement property, NodeTypeName, which you can reference in a service manifest like so.

 
<ServiceTypes> 
  <!-- This is the name of your ServiceType. 
       This name must match the string used in RegisterServiceType call in Program.cs. -->
  <StatelessServiceType ServiceTypeName="Web2Type"> 
    <PlacementConstraints>(NodeTypeName==BackEnd)</PlacementConstraints> 
  </StatelessServiceType> 
</ServiceTypes> 

Now we may want to have other constraints beyond just NodeTypeName. Placement Properties can be assigned to the various Node Types in the cluster manifest. Or, if you’re doing this via an ARM template as I was, you can declare them directly in the template via a property within the NodeType definition/declaration.

 
"placementProperties": { 
  "isDMZ": "true" 
},

If you look at the node type definition I used earlier, you’ll see where this property collection goes. In that definition, “isDMZ” is false.
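
So a front end service could, for example, combine the default NodeTypeName property with the custom isDMZ property in its constraint expression. The service type and node type names below are just examples; the expression syntax supports simple boolean logic:

<ServiceTypes>
  <StatelessServiceType ServiceTypeName="Web1Type">
    <!-- "&amp;&amp;" is the XML-escaped form of "&&" -->
    <PlacementConstraints>(NodeTypeName==FrontEnd &amp;&amp; isDMZ==true)</PlacementConstraints>
  </StatelessServiceType>
</ServiceTypes>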

Combined, the placement properties and the placement constraints help ensure that each of the services will go into the subnet that has already been configured to securely host it. But this does pose a challenge. If we declare the placement constraint in the service manifest as I show above, it restricts which clusters we can deploy the service to. If a cluster doesn’t have our placement properties declared, the service will fail to deploy. We could address this by removing and then adding the placement constraints later (not ideal) or altering the cluster manifests (again not ideal). But there are two other options. First, we could craft our own definition of the application/service types and register them with the cluster, then copy the packages to the cluster.

Note: For more on Placement Constraints, please check out my new blog post.

This article contains a section that talks about doing this via C# or PowerShell. Another option, and one I think I actually prefer (but admittedly haven’t tried), is to use a build event to alter the manifest. You can then trigger this event based on various parameters to control whether it happens for a local build vs. a cloud build, perhaps even going so far as reading a value from the Application Parameters or Publish Profile files. But for now, I’ll need to set these aside. There’s also a third option I’m investigating, but I’m not confident enough to bring it up yet. I hope to eventually circle back on these.

There is one other placement consideration (I mentioned I’d get to this). There are two things unique to the management node type/subnet. The first is that it’s the only subnet where I would open ports 19000 and 19080. The reason is that this is the only node type in the cluster manifest that is marked as “isPrimary”. A Service Fabric cluster can only have one “primary” node type; this node type is the one where all the “system” services will be placed (Naming, FileStore, Cluster Manager, etc.). So setting “isPrimary” ensures that these services will be placed into this subnet, allowing me to keep them separate from any application services. I previously mentioned that this approach was proposed by Mark Fussell of the Service Fabric team. It’s a pattern that’s used by some larger clusters to help ensure that fabric management resource demands can be scaled independently of application needs.

Between placement of the management services on the primary node type, and restricting application placement via constraints, we can now put each of our services only where we want them to be.

Using the JumpBox

A common technique in cloud solutions is to leverage a “jump box”. Allowing direct, remote access to a virtual machine is sensitive and risky. To help manage this risk, there are usually one or more restricted access points that act as gatekeepers. You access one of these gatekeepers as a jumping-off point to reach resources inside the security boundary. We’ve set up this approach, allowing you to RDP into a jump box from which you can then RDP into the other boxes within the VNet.

Using this template, you’ll need to address all your VM instances via IP. Since we’re using dynamic IPs within the VNet, you can RDP into a box using a fairly simple address scheme. The third octet of the IP address represents the subnet you want to access (1 = front end, 2 = back end, 3 = management) and the final octet is the specific machine. Azure reserves the first few addresses in each subnet range for its own use, so the VMs in the front end and management subnets start at 4. For the back end subnet, I’ve used 10.0.2.4 as the private IP for the internal load balancer, so the nodes in that subnet start at 5.

The next step would be to adapt the “allowJumpBoxRDP” security rule on the management subnet so that it only allows connections from trusted sources (say, your on-premises network).

Many diet colas died to bring you this information

So there you have it. I’ll admit that on the surface it may not seem like much. But if you’ve ever built an ARM template, you know how much effort it requires. Add to this all the stuff I had to learn/discover to get it to a functional state and validate it by deploying apps to it (which required more debugging and bug fixes) and… well… we’re talking quite a bit of effort. So I’m hoping that this article and the template will help a few folks avoid what I had to go through.

The entire template (complete with jump box) can be found in my GitHub repo. I also have a version that is a secure cluster using certificates and Azure AD. I’m going to continue to try and polish it, and I’m also looking at getting it published (with additional guidance on usage) in the Azure QuickStart Templates repository. So be sure to let me know of any suggestions or bugs you find. I’ll do my best to get them worked in.

Until next time!

PS – thank you to everyone that helped contribute to this effort: Kal, Jason, Corey, Patrick, Mike, Shenlong, Vaishnav, Chacko, and Mikkel

15 Responses to Network Isolation/Security with Azure Service Fabric

  1. Huge thanks for sharing this ARM template! I have to create a similar solution soon and this will be a big help!

  2. Mark Lauter says:

    Great article and many thanks! This got me headed down the right path. I’m a little stuck on trying to deploy services to ASF now that my ASF cluster is only accessible from inside the VNet. Obviously can’t connect directly with Visual Studio anymore.

    • Brent says:

      When you deploy from Visual Studio, it runs a PowerShell script that’s in the /scripts folder of the application project called “Deploy-FabricApplication.ps1”. This script attempts to connect to the TCP client port on the target Service Fabric cluster (port 19000).

      If that location is not accessible from your build machine, you’ll need to package the application and copy it to somewhere that can reach that port and run the script there.

      In the scenarios I’ve worked on so far, this is usually handled by an automated build/deploy machine that’s on the same VNet, with the proper NSG rules applied to allow it to target the private IP that the TCP client endpoint is associated with. Optionally, it could directly target the endpoint on any given primary node within the cluster if there’s no load balancer involved.

      • Mark Lauter says:

        Thanks, Brent! I decided to go to a dual load balancer setup with 19000 being served by the external LB. I control access to it with a security group – keeping the port blocked until we need to publish and then opening it up with IP restrictions.

        Thanks again for the article. I’ve been sharing it with everyone I know who’s interested in ASF.

  3. Ryan Elfman says:

    How come you use the private load balancer instead of the built in reverse proxy?

    • Brent says:

      It’s definitely an acceptable approach. I opted to show the internal/private LB for two reasons. First, it’s something that’s easy for folks new to Service Fabric to wrap their heads around. Secondly, if they are setting up the cluster in a VNet that’s connected to a VPN gateway and will only be accessed from inside of that VNet, the template has an example of a non-public load balancer. Now you could easily put an internal LB in front of the reverse proxy, but that could be problematic since it would allow ANY service on the cluster to be accessible via the reverse proxy (assuming no NSG rules block access).

  4. jonlanceley says:

    This is the closest article I’ve found to what I’m trying to do. However on the FrontEnd Load Balancer I’d like to make it an Internal Load Balancer and then have Application Gateway in front for SSL offloading etc. Or replace the ILB with Application Gateway. I think both of these are possible?

    But I also need to be using an outbound static IP which doesn’t change so a third party can grant access in their firewall as we will need to call their api’s to do stuff. And Application Gateway does not support a static public IP.

    So do I need to put another Load Balancer in front of Application Gateway which can have a public static IP on it. This doesn’t sound right to me.

    Any thoughts would be very welcome.

    • Brent says:

      You can configure the Application Gateway to use a private IP and put your front end Service Fabric services behind an internal load balancer. You could even look at removing the internal load balancer and instead leverage the Service Fabric reverse proxy. As for forcing outbound traffic through a fixed IP, that gets into network tunneling/routing. This approach usually involves a virtual appliance and User Defined Routes. I would think in your scenario, that virtual appliance (or appliances) would likely live in another subnet of your virtual network.

  5. jonlanceley says:

    In case anyone else has the same problem I did, I’ve documented how I set up Service Fabric with an Application Gateway at the front end, and how I needed a public static outbound IP so a third party firewall could white-list it.

    In essence it involved hosting 2 services in service fabric, 1 in the front end, and 1 in the management node type.

    My website (currently hosted as a web app service) called the front end node type service and that then made a call to the management node type where there was a stateless service running which then made an api call to a web service hosted at a third party.

    http://jonlanceley.blogspot.co.uk/2017/04/3-node-service-fabric-environment-with.html

  6. chrisseroka says:

    Great article, but could you please explain why you need a separate node type for management purposes only? Why do I have to pay extra money for VMs that I don’t deploy anything to (please correct me if I got it wrong)?

    • Brent says:

      You don’t “need” a separate management node type, it’s entirely optional. This pattern isolates the cluster system services away from your application/services, thus providing greater resource isolation between the two. This reduces the risk of performance “jitter” in the application services by ensuring that system operations have a minimal impact on service resources. By having separate nodes just for the system services, you are also now able to designate different compute skus and instance counts for those resources. So while you may opt to run D15’s for your application nodes, you may opt instead for D3’s for the management nodes. 🙂

      • chrisseroka says:

        What are the most demanding system operations in your experience? How can I decide whether it’s a good moment to extract an additional management node type, or save some $ because the cluster is small or doesn’t make much use of system operations?

      • Brent says:

        My recommendation would be that if you’re doing a small cluster, you likely don’t need the extra isolation. However, if you plan on resizing the cluster often, hitting the naming service a LOT, moving services around (especially stateful ones), or doing other operations that rely on the cluster system services, you may want to test it.

  7. Jim K says:

    Do you have any more updates on hosting Service Fabric in an environment outside Azure that requires an internet-facing DMZ secured by isolating the nodes?

    • Brent says:

      Service Fabric is really just a series of management services running on VMs on some infrastructure. All I’ve done is illustrate how you’d do it using Azure’s IaaS features. The same logical processes would apply to hosting it with another cloud provider’s infrastructure, or on-prem. All that changes is the way the Service Fabric bits are installed and how they stitch the nodes together into the cluster.
