With all due respect to cotton, the fabric of our lives (or at least our careers in IT) is Service Fabric.
In my last post, I touched on the use of Azure Service Fabric as the underlying substrate of some of our services, and today I’ll dive deeper into how this works and how our decision to architect our products in this way impacts you.
To start, let me state the obvious: Building distributed services at any scale (let alone cloud scale) is hard. There is no shortage of challenges when it comes to building a system like this, but here are the big ones:
- Redundancy is required since nodes in the system can fail, and these failures are often at the mercy of the commodity hardware employed in cloud services.
- Replication and quorum management are hard enough on their own, and they become more challenging when the number of nodes is really large.
- Detecting failures is incredibly difficult due to the complexity of distinguishing between node and link failures. You further have to ask yourself if you actually need to make this distinction between the two, and, if so, how do you do it efficiently?
- Large scale requires partitioning of your service. How should you implement partitioning?
- How do you manage leader election in a set of nodes of a service?
- How do you manage zero-downtime upgrades of the service?
- And, the biggest and most difficult question of all: How do you manage all this efficiently?
At a company the size of Microsoft, we encounter all of these issues hundreds of times a day. The solution we use is Azure Service Fabric (ASF). Long before we made this publicly available, it was being used to support our most demanding services – things like Skype for Business, Event Hubs, DocumentDB, Azure SQL Database, and Bing Cortana. And, of course, Intune.
ASF is a platform that addresses the complexity of managing the challenges noted above by encapsulating them for use by other services. You can dive deep into how this service operates here.
A quick primer on Service Fabric:
Service Fabric is a distributed systems infrastructure that organizes machines into a dynamic federation and provides scalable, reliable, and efficient solutions to the aforementioned problems. With Service Fabric, we, as Intune developers, can concentrate on the higher-level business logic and avoid having to repeatedly solve the hard, low-level problems of distributed systems.
Taking a dependency on a platform like this is a decision we did not want to take lightly, so, when we started to look into this a few years back, we engaged deeply with the ASF team, as we continue to do. We also worked closely with teams across Microsoft, like Skype (back then it was called Lync) and Azure SQL Database. Our work with these teams gave us confidence that the underlying technology and the supportability of the platform would do everything we needed. At this point in the re-architecting process, we went all in and began using it for many of our most critical components.
One really important thing to point out here is this: As we have built Intune (and all the services within Intune) we have relied heavily on the Azure team for their expertise as well as for the Azure cloud platform itself. I cannot even begin to imagine how hard it would be to build all of the infrastructure required for a cloud-scale service all on my own – which is exactly what the Intune competitors are trying desperately to do.
This is just one area out of many where I am confident the EMS solutions will scale and perform far better than anything else on the market. The EMS solutions really are the only solutions built from the ground up as pure cloud services on a public cloud platform. This is a place where the architecture really, really matters.
In more concrete terms, here’s how we use Azure Service Fabric:
An example of a stateless service is what we call the Information Worker Service (IW Service). This is the service that our Information Worker UXs (both the Company Portal applications from the different platform stores and our web portal UX) talk to. The IW Service then talks to other services to act on those requests, often aggregating across a number of other internal services. For example, it will talk to one service to determine the applications made available to the user, then to another service to get the details of those applications, and then to another to determine the devices on which the user has already installed them. Each of these services is a micro-service that can be updated independently of the others – this is one of the reasons why we are able to innovate and improve so quickly. The IW Service itself is purely stateless. The services that it talks to are stateful; for example, one of them holds the application details relevant to the information worker (e.g. name, icon, publisher content, etc.).
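To make the aggregation pattern concrete, here is a minimal sketch of a stateless front-end that fans out to three back-end services. Everything in it is hypothetical: the function names, the sample data, and the in-memory dictionaries standing in for the stateful services are illustrations only, not Intune's actual code or APIs.

```python
# Hypothetical in-memory stand-ins for three stateful back-end services.
AVAILABLE_APPS = {"alice": ["mail", "notes"]}
APP_DETAILS = {
    "mail": {"name": "Mail", "publisher": "Contoso"},
    "notes": {"name": "Notes", "publisher": "Contoso"},
}
INSTALLED_DEVICES = {("alice", "mail"): ["alice-phone"]}


def get_available_apps(user):
    """Service 1: which applications are made available to this user?"""
    return AVAILABLE_APPS.get(user, [])


def get_app_details(app_id):
    """Service 2: details (name, publisher, ...) for one application."""
    return APP_DETAILS[app_id]


def get_installed_devices(user, app_id):
    """Service 3: devices on which the user already installed the app."""
    return INSTALLED_DEVICES.get((user, app_id), [])


def information_worker_view(user):
    """Stateless aggregation: keeps nothing between calls, so any
    instance of this service could answer any request."""
    view = []
    for app_id in get_available_apps(user):
        details = get_app_details(app_id)
        view.append({
            "app_id": app_id,
            "name": details["name"],
            "publisher": details["publisher"],
            "installed_on": get_installed_devices(user, app_id),
        })
    return view
```

Because `information_worker_view` owns no data, it can be scaled out or replaced without any replication; all of the durable state lives behind the three calls it makes.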
There are also things that you get from ASF whether a micro-service is stateful or stateless. For example, you get a consistent deployment model, where ASF does a lot of the hard work for you. You also get consistent health monitoring. In future posts, I’ll examine how and when we deploy and how we monitor, but, for the purposes of this discussion, an important element to consider is Load Balancing.
To address the need for load balancing, we provide a cluster of nodes (VMs) to ASF for allocating our micro-services, and Intune leaves the resource management to ASF. Via a set of defined APIs, Intune can tell ASF if a micro-service is running hot, and with that information ASF can re-allocate the micro-service (specifically a replica of a particular partition) to another node that has more resources.
This setup means that a node is not dedicated to a single micro-service; instead, multiple micro-services may share the same node at the discretion of the ASF load-balancing algorithm. The critical element is that if there are ever bottlenecks, more nodes can be added. This does not remove the need for the code itself to scale linearly, with no shared data resources.
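As a rough illustration of the idea (and only that), the sketch below moves a replica that reports high load to the least-loaded other node. This is a toy greedy rule under assumed inputs; it is not ASF's load-balancing algorithm, and the node names, replica names, and load numbers are all made up.

```python
def node_loads(nodes, placement, loads):
    """Sum the reported load of every replica placed on each node."""
    totals = {node: 0 for node in nodes}
    for replica, node in placement.items():
        totals[node] += loads[replica]
    return totals


def move_hot_replica(nodes, placement, loads, replica):
    """Return a new placement with `replica` moved to the least-loaded
    other node, but only if the move leaves that node less loaded than
    the replica's current node is now."""
    totals = node_loads(nodes, placement, loads)
    current = placement[replica]
    candidates = [n for n in nodes if n != current]
    if not candidates:
        return dict(placement)
    # Break ties by node name so the result is deterministic.
    target = min(candidates, key=lambda n: (totals[n], n))
    if totals[target] + loads[replica] >= totals[current]:
        return dict(placement)  # moving would not relieve the hot spot
    new_placement = dict(placement)
    new_placement[replica] = target
    return new_placement


# Hypothetical cluster: several replicas share node-a, node-c is idle.
nodes = ["node-a", "node-b", "node-c"]
placement = {"apps/p0": "node-a", "apps/p1": "node-a", "devices/p0": "node-b"}
loads = {"apps/p0": 70, "apps/p1": 20, "devices/p0": 30}
rebalanced = move_hot_replica(nodes, placement, loads, "apps/p0")
```

In this example the hot replica `apps/p0` ends up on the idle `node-c`. Adding a node to the `nodes` list is all it takes to give the placement rule more room to work with.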
As we have added more features in production we have actually increased ring size and seen this scaling in action.
What about partitioning and replication for Stateful services?
Our architecture has partitioned the application service into 4 partitions per scale unit. For other services it is 16, and we are analyzing other areas to determine the need to add even more. We’re also considering the need to do dynamic partitioning.
For reference, a “scale unit” is a cluster of servers and we have multiple scale units in Intune.
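For readers who have not worked with partitioned services, here is a minimal sketch of how a record might be routed to one of a fixed number of partitions. The post does not say how Intune maps keys to partitions, so the stable-hash scheme and the `partition_for` helper are assumptions for illustration; only the partition counts (4 and 16) come from the text.

```python
import hashlib

PARTITIONS_PER_SCALE_UNIT = 4  # the application service figure quoted above


def partition_for(key, partition_count=PARTITIONS_PER_SCALE_UNIT):
    """Map a key to a partition index in [0, partition_count).

    Uses SHA-256 rather than Python's built-in hash(), which is salted
    per process and would route the same key differently after a restart.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# The same key always lands on the same partition.
first = partition_for("app-12345")
second = partition_for("app-12345")
```

One design consequence worth noting: with a plain modulo scheme like this, changing the partition count remaps most keys, which is part of why dynamic partitioning takes careful thought.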
Intune then has 5 replicas per partition.
That’s not a typo: Intune stores each piece of information 5 times.
When we write information about the application, we write to the primary replica of the appropriate partition. When quorum is achieved (i.e. we successfully write to 3 replicas), then the write is deemed successful.
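The arithmetic behind those numbers can be sketched in a few lines, assuming that quorum means a simple majority of replicas (which matches the 3-of-5 figure above). The function names are illustrative only.

```python
def quorum_size(replica_count):
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def tolerated_failures(replica_count):
    """Replicas that can be lost while a majority is still reachable."""
    return replica_count - quorum_size(replica_count)


def write_succeeds(acks):
    """`acks` holds one boolean per replica: did it persist the write?"""
    return sum(acks) >= quorum_size(len(acks))
```

With 5 replicas, `quorum_size(5)` is 3 and `tolerated_failures(5)` is 2, so a write acknowledged by three replicas is accepted even if the other two are slow or down.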
If, for any reason, 2 of the replicas of a partition fail, we can still carry on, and replicas will automatically be spun up to replace the ones that have failed. If the primary fails, a new primary will be elected to take the writes. We have 5 upgrade domains and 5 fault domains in Intune, so an upgrade happening to a node is also no cause for concern: there are four other domains to rely upon during this process.
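To see why spreading replicas across domains matters, consider this small sketch. It assumes one replica per domain via round-robin placement and a simple-majority quorum; the placement rule is a simplification for illustration, not a description of how ASF actually places replicas.

```python
def place_replicas(replica_count, domain_count):
    """Round-robin placement: replica index -> domain index."""
    return {r: r % domain_count for r in range(replica_count)}


def available_replicas(placement, down_domains):
    """Replicas whose domain is neither being upgraded nor failed."""
    return [r for r, d in placement.items() if d not in down_domains]


def has_quorum(placement, down_domains):
    """True if a majority of replicas is still reachable."""
    majority = len(placement) // 2 + 1
    return len(available_replicas(placement, down_domains)) >= majority


spread = place_replicas(5, 5)   # 5 replicas over 5 domains
packed = place_replicas(5, 2)   # same replicas squeezed into 2 domains
```

With the `spread` placement, taking one domain down for an upgrade while a second one fails still leaves three replicas, which is a quorum. With the `packed` placement, losing domain 0 alone removes three of the five replicas and writes would stall.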
This architecture is incredibly powerful for our customers – it provides a very high level of fault tolerance and availability.
All of this functionality is provided by ASF. The Intune code reacts to certain events by implementing a set of APIs that ASF invokes and we have to package up our data for replication – but ASF is doing the hard work of detecting failures, managing quorum, electing primaries etc. Doing this efficiently and at scale is phenomenally difficult – and Intune gets this for free because we chose to build on a public cloud platform.
In Intune, we have done some interesting things to optimize the building of a secondary replica when it comes up fresh. We have exhaustively tested this to ensure that, even when the system gets hit with bursts of data, the extra resources consumed while waiting for quorum on writes do not affect throughput. The way we’ve achieved this is by architecting everything to be asynchronous.
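Here is one way the asynchronous idea could look in miniature, using Python's `asyncio`. It is a sketch under stated assumptions, not Intune's or ASF's replication code: `replicate` simulates a replica with a made-up delay, and `quorum_write` returns as soon as enough replicas have acknowledged instead of waiting for the slowest one.

```python
import asyncio


async def replicate(replica_id, delay):
    """Simulated replica: acknowledges the write after `delay` seconds."""
    await asyncio.sleep(delay)
    return replica_id


async def quorum_write(delays, quorum):
    """Send the write to every replica concurrently and return the ids
    of the first `quorum` replicas that acknowledge it."""
    tasks = [asyncio.create_task(replicate(i, d)) for i, d in enumerate(delays)]
    acked = []
    for finished in asyncio.as_completed(tasks):
        acked.append(await finished)
        if len(acked) >= quorum:
            break
    # The sketch cancels stragglers so it exits cleanly; a real
    # replicator would let the slower replicas catch up instead.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return acked


# Five replicas, two of them slow; the write completes once three ack.
acked = asyncio.run(quorum_write([0.03, 0.01, 0.2, 0.02, 0.2], quorum=3))
```

The caller is unblocked after roughly the third-fastest acknowledgement, so the two slow replicas do not set the pace for the write.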
We continue, of course, to tweak the performance of the system, including service partitioning and inter-service calls as we go.
On the stateless side of things, there are no partitions, and any instance of a stateless service can handle any call, which is key to performance and availability. And, of course, even with stateless services you still get all the benefits of failure detection, load balancing, monitoring, etc.
* * *
We treat Azure Service Fabric as a very close partner, and we stay very close to their release cycles so that we are always within a month of their latest update. As a service we can afford to do that.
Thus far, we have been very happy with the decision to use this technology, and the benefits of this architecture are simply amazing. The innovations of my team – combined with the rapid innovations from the Azure Service Fabric team – are compounding the ways users can apply and scale with this technology.
To learn more about Azure Service Fabric, visit: https://aka.ms/servicefabric.