vSAN and Fault Domains, aka Rack Awareness

Keeping your virtual workloads up and running at all times while also providing back-end data resiliency is key to any VMware vSphere deployment. This is true whether your shared-storage model consists of a “traditional” three-tier architecture (host/fabric/storage) or you leverage Hyper-Converged Infrastructure (HCI) to consolidate and provide compute/storage resources. How you accomplish this task, though, is different. With a traditional storage array you have redundant controllers front-ending your disk subsystem, or, when scaling, you might place multiple controllers across cabinets in a “cluster” configuration. With HCI/vSAN the concepts are still basically the same, but you are now leveraging both the hardware (compute/storage nodes) and the software to logically place your data across cabinets. In vSAN this means leveraging Fault Domains for rack awareness.

Make it RAIN

When thinking about how vSAN protects virtual machine availability, picture the fundamental RAID (Redundant Array of Independent Disks) concepts spanning the nodes that make up your vSAN cluster, aka RAIN (Redundant Array of Independent Nodes). These concepts are applied via storage policies, specifically the Failures to Tolerate (FTT) and Failure Tolerance Method (FTM) settings. The combination of these options provides either mirroring (RAID1), or RAID5 or RAID6 when FTM is configured for Erasure Coding and FTT is set to one or two, respectively.
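To make the FTT/FTM combinations concrete, here is a minimal Python sketch that maps a policy to a RAID level and the raw capacity consumed per unit of usable data. The function name `raid_layout` and the FTM strings are hypothetical, chosen for illustration. The erasure coding widths (3 data + 1 parity for RAID5, 4 data + 2 parity for RAID6) are an assumption based on the classic vSAN layout, so treat the numbers as a rough guide rather than a sizing tool.

```python
from fractions import Fraction

# Illustrative layouts; the 3+1 and 4+2 erasure coding widths are assumptions.
ERASURE_CODING = {1: ("RAID5", 3, 1), 2: ("RAID6", 4, 2)}


def raid_layout(ftt: int, ftm: str) -> dict:
    """Return the RAID level and raw-to-usable capacity ratio for a policy."""
    if ftm == "mirroring":
        if ftt not in (1, 2, 3):
            raise ValueError("mirroring is modeled for FTT=1, 2 or 3")
        # FTT failures are tolerated by keeping FTT + 1 full copies of the data.
        return {"raid": "RAID1", "copies": ftt + 1,
                "raw_per_usable": Fraction(ftt + 1)}
    if ftm == "erasure_coding":
        if ftt not in ERASURE_CODING:
            raise ValueError("erasure coding is modeled for FTT=1 or 2")
        raid, data, parity = ERASURE_CODING[ftt]
        # One stripe holds `data` data segments plus `parity` parity segments.
        return {"raid": raid, "copies": 1,
                "raw_per_usable": Fraction(data + parity, data)}
    raise ValueError(f"unknown FTM: {ftm!r}")


if __name__ == "__main__":
    for ftt, ftm in [(1, "mirroring"), (1, "erasure_coding"), (2, "erasure_coding")]:
        layout = raid_layout(ftt, ftm)
        print(ftt, ftm, layout["raid"], float(layout["raw_per_usable"]))
```

Under these assumptions, mirroring trades capacity for simplicity (each extra failure tolerated costs another full copy), while erasure coding tolerates the same number of failures with less raw capacity but needs more nodes to spread the stripe across.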

Figure 1 – Logical FTT=1 with Mirroring

With our four-node cluster example outlined above, vSAN keeps a replica copy of the data so that the virtual machine stays available if a single ESXi host fails. While the “logical” example helps explain the concepts, in practice you have some additional considerations to account for. In the next example, we take our four-node cluster and place it into a possible real-world scenario. Still staying with FTT=1 with mirroring, the primary/replica copies of data remain the same, but now all the servers (and thus the data) are contained in a single rack. While the software will protect against a node failure, imagine what happens if power to the whole rack is lost.

Figure 2 – Physical FTT=1 with Mirroring

vSAN Fault Domains

Way back in the vSAN 6.0 release (when it was called VSAN), VMware introduced the ability to create Fault Domains to provide rack awareness and a bit of control over where vSAN places data objects. Without Fault Domains, vSAN could place all or most of the objects needed to keep a virtual machine protected and running on hosts in the same rack, which is an obvious concern. By creating Fault Domains we give vSAN the intelligence to spread these objects across nodes in multiple racks, thus providing higher availability. Continuing our example from above, using FTT=1 with Mirroring and then providing vSAN with rack awareness, we get a better distribution of our objects:

Figure 3 – Logical and Physical w/Fault Domains

With the above configuration, we can lose a host or a full rack of hosts and keep our workloads up and running until the host/rack is restored or vSAN recreates the failed objects in a different Fault Domain. For a demonstration, I have configured four Fault Domains in my lab, each consisting of a single node. The next screen grabs show the Fault Domain configuration and how the vSAN components of my TUK-SRM01 virtual machine (protected by FTT=1 w/Mirroring) are placed across the Fault Domains:

Figure 4 – vSAN Fault Domains

Figure 5 – Virtual Machine Object Placement
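The difference between the single-rack layout and the rack-aware layout can be sketched in a few lines of Python. This is a deliberately simplified model, not how vSAN actually places components: it assumes an FTT=1 mirrored object is made of exactly two replicas and one witness, that each component carries one vote, and that the object stays available while at least one replica and more than half of the components are reachable. The host and rack names and both function names are hypothetical.

```python
def place_ftt1_mirror(host_to_domain: dict[str, str]) -> dict[str, str]:
    """Pick one host from each of three distinct fault domains."""
    chosen = {}  # fault domain -> first host seen in that domain
    for host, domain in sorted(host_to_domain.items()):
        chosen.setdefault(domain, host)
    if len(chosen) < 3:
        raise ValueError("FTT=1 mirroring needs at least three fault domains")
    roles = ("replica-1", "replica-2", "witness")
    return dict(zip(roles, list(chosen.values())[:3]))


def available_after(placement: dict[str, str],
                    host_to_rack: dict[str, str],
                    failed_rack: str) -> bool:
    """Check availability after every host in `failed_rack` goes down."""
    alive = [role for role, host in placement.items()
             if host_to_rack[host] != failed_rack]
    has_replica = any(role.startswith("replica") for role in alive)
    return has_replica and 2 * len(alive) > len(placement)


if __name__ == "__main__":
    hosts = ["esx01", "esx02", "esx03", "esx04"]

    # Single rack, no Fault Domains: each host is its own implicit fault domain.
    single_rack = {h: "rack-A" for h in hosts}
    placement = place_ftt1_mirror({h: h for h in hosts})
    print("single rack survives rack-A loss:",
          available_after(placement, single_rack, "rack-A"))

    # One host per rack, with each rack defined as a Fault Domain.
    racks = dict(zip(hosts, ["rack-A", "rack-B", "rack-C", "rack-D"]))
    placement = place_ftt1_mirror(racks)
    print("rack-aware survives rack-A loss:",
          available_after(placement, racks, "rack-A"))
```

In the single-rack case every component disappears with the rack, so the object is lost no matter how carefully the components were spread across hosts. With one Fault Domain per rack, any single rack failure takes out at most one of the three components.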

Things to Consider

While the configuration/implementation of Fault Domains is easy and straightforward, here are a few things to keep in mind:

  • A minimum of three (3) Fault Domains is required; standard practice is to configure four (4)
  • When creating Fault Domains, give each Fault Domain the same number of hosts
  • Configure enough Fault Domains to meet the requirements of the Failures to Tolerate policy setting (see table below)

Failures to Tolerate Setting      Minimum Number FD    Suggested Number FD
FTT=1 w/Mirroring                 3                    4
FTT=2 w/Mirroring                 5                    6
FTT=3 w/Mirroring                 7                    8
FTT=1 w/Erasure Coding (R5)       4                    5
FTT=2 w/Erasure Coding (R6)       6                    7
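The considerations above can be turned into a quick sanity check. The sketch below is only an illustration with hypothetical function names. It assumes the usual 2 x FTT + 1 rule for mirroring, a 3+1 layout for RAID5 (four Fault Domains) and a 4+2 layout for RAID6 (six Fault Domains), and it treats "suggested" as the minimum plus one spare Fault Domain for rebuilds.

```python
def min_fault_domains(ftt: int, ftm: str) -> int:
    """Minimum Fault Domains for a policy, under the assumptions stated above."""
    if ftm == "mirroring":
        # FTT + 1 replicas plus FTT witnesses, each in its own Fault Domain.
        return 2 * ftt + 1
    if ftm == "erasure_coding":
        # Assumed stripe widths: RAID5 = 3+1, RAID6 = 4+2.
        return {1: 4, 2: 6}[ftt]
    raise ValueError(f"unknown FTM: {ftm!r}")


def suggested_fault_domains(ftt: int, ftm: str) -> int:
    """One spare Fault Domain so vSAN has somewhere to rebuild after a failure."""
    return min_fault_domains(ftt, ftm) + 1


def check_cluster(hosts_per_fd: dict[str, int], ftt: int, ftm: str) -> list[str]:
    """Return a list of warnings for a proposed Fault Domain layout."""
    warnings = []
    needed = min_fault_domains(ftt, ftm)
    if len(hosts_per_fd) < needed:
        warnings.append(f"policy needs {needed} Fault Domains, found {len(hosts_per_fd)}")
    elif len(hosts_per_fd) == needed:
        warnings.append("no spare Fault Domain available for rebuilds")
    if len(set(hosts_per_fd.values())) > 1:
        warnings.append("Fault Domains have unequal host counts")
    return warnings


if __name__ == "__main__":
    lab = {"FD1": 1, "FD2": 1, "FD3": 1, "FD4": 1}
    print(check_cluster(lab, ftt=1, ftm="mirroring"))
    print(check_cluster(lab, ftt=2, ftm="mirroring"))
```

For a four-node lab with one host per Fault Domain, the check comes back clean for FTT=1 with mirroring and flags FTT=2 with mirroring as needing more Fault Domains.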

 

Thanks for Reading!

-Jason