For this objective I used the following resources:
- vSphere Availability documentation
- vSphere Resource Management documentation
- vCenter Server and Host Management documentation
- vSphere Troubleshooting documentation
Objective 7.5 – Troubleshoot HA and DRS Configurations and Fault Tolerance
Identify HA/DRS and vMotion Requirements
- All hosts must be licensed for vSphere HA
- You need at least two hosts in the cluster
- All hosts need to be configured with static IP addresses. If you are using DHCP, you must ensure that the address for each host persists across reboots
- There should be at least one management network in common among all hosts, and best practice is to have at least two. Management networks differ depending on the version of host you are using.
- To ensure that any virtual machine can run on any host in the cluster, all hosts should have access to the same virtual machine networks and datastores
- For VM Monitoring to work, VMware tools must be installed
- vSphere HA supports both IPv4 and IPv6. A cluster that mixes the use of both protocol versions, however, is more likely to result in a network partition
For further information see page 32 of the vSphere Availability documentation
- Shared Storage
- Storage can be either SAN or NAS
- Shared VMFS volumes
- Place the disks of all virtual machines on VMFS volumes that are accessible by all hosts
- Set access mode for the shared VMFS to public
- Ensure that the VMFS volumes on the source and destination hosts use volume names, and that all virtual machines use those volume names when specifying their virtual disks
- Processor Compatibility – Processors of both the source and destination host must be from the same vendor (AMD or Intel) and the same processor family. This requirement exists primarily for vMotion, which must allow a VM to continue executing its instructions after moving from one host to another. vCenter provides advanced features to help ensure that processor compatibility requirements are met:
- Enhanced vMotion Compatibility (EVC) – You can use EVC to help ensure vMotion compatibility for the hosts in a cluster. EVC ensures that all hosts in a cluster present the same CPU feature set to virtual machines, even if the actual CPUs on the hosts differ. This prevents migration with vMotion from failing due to incompatible CPUs.
- CPU Compatibility Masks – vCenter Server compares the CPU features available to a virtual machine with the CPU features of the destination host to determine whether to allow or disallow migrations with vMotion. By applying a CPU compatibility mask to individual virtual machines, you can hide certain CPU features from the virtual machine and potentially prevent migrations with vMotion from failing due to incompatible CPUs.
For further information see pages 63 thru 64 of the vSphere Resource Management documentation
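The masking idea can be sketched with plain bitmasks. This is only an illustration of the comparison logic, not vCenter's actual implementation; the feature names and bit positions are invented for the example (real masks operate on CPUID register bits):

```python
# Hypothetical feature bits for illustration only.
SSE3   = 1 << 0
AES_NI = 1 << 1
AVX    = 1 << 2

def visible_features(host_features: int, mask: int) -> int:
    """Features the VM sees: the host's features minus anything the mask hides."""
    return host_features & ~mask

def vmotion_allowed(vm_features: int, dest_host_features: int) -> bool:
    """Migration is allowed only if every feature exposed to the VM
    is also present on the destination host."""
    return vm_features & ~dest_host_features == 0

source = SSE3 | AES_NI | AVX
dest   = SSE3 | AES_NI          # destination lacks AVX

# Without a mask the VM may have used AVX, so vMotion is blocked:
assert not vmotion_allowed(visible_features(source, 0), dest)

# Masking AVX from the VM makes the two hosts compatible:
assert vmotion_allowed(visible_features(source, AVX), dest)
```

The trade-off is the same one EVC makes cluster-wide: hiding a feature restores mobility at the cost of the VM never using it.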
- The virtual machine configuration file for ESXi hosts must reside on a VMware Virtual Machine File System (VMFS)
- vMotion does not support raw disks or migration of applications clustered using Microsoft Cluster Service (MSCS)
- vMotion requires a private Gigabit Ethernet (minimum) migration network between all of the vMotion enabled managed hosts. When vMotion is enabled on a managed host, configure a unique network identity object for the managed host and connect it to the private migration network
- You cannot use migration with vMotion to migrate a virtual machine that uses a virtual device backed by a device that is not accessible on the destination host
- You cannot use migration with vMotion to migrate a virtual machine that uses a virtual device backed by a device on the client computer
For further information see page 56 of the vSphere Resource Management documentation and pages 123 thru 124 of the vCenter Server and Host Management documentation
Verify vMotion/Storage vMotion Configuration
See the sections above for DRS and vMotion requirements. Key areas of focus are proper networking (a VMkernel interface enabled for vMotion), CPU compatibility, and shared storage access across all hosts.
Verify HA Network Configuration
- On legacy ESX hosts in the cluster, vSphere HA communications travel over all networks that are designated as service console networks. VMkernel networks are not used by these hosts for vSphere HA communications
- On ESXi hosts in the cluster, vSphere HA communications, by default, travel over VMkernel networks, except those marked for use with vMotion. If there is only one VMkernel network, vSphere HA shares it with vMotion, if necessary. With ESXi 4.x and later hosts, you must also explicitly enable the Management Network checkbox for vSphere HA to use this network
For further information see page 40 of the vSphere Availability documentation
Verify HA/DRS Cluster Configuration
Configuration issues and other errors can occur for your cluster or its hosts that adversely affect the proper operation of vSphere HA. You can monitor these errors by looking at the Cluster Operational Status and Configuration Issues screens, which are accessible in the vSphere Client from the vSphere HA section of the cluster’s Summary tab.
For further information see page 30 of the vSphere Availability documentation
Troubleshoot HA Capacity Issues
To troubleshoot HA capacity issues, first be familiar with the three Admission Control Policies:
- Host failures the cluster tolerates (default) – You can configure vSphere HA to tolerate a specified number of host failures. Uses a “slot” size to display cluster capacity
- Percentage of cluster resources reserved as failover spare capacity – You can configure vSphere HA to perform admission control by reserving a specific percentage of cluster CPU and memory resources for recovery from host failure
- Specify failover hosts – You can configure vSphere HA to designate specific hosts as the failover hosts
- Things to look out for when troubleshooting HA issues:
- Failed or disconnected hosts
- Oversized VMs with high CPU/memory reservations, which will affect slot sizes
- Lack of capacity/resources if you are using “Specify Failover Hosts”, i.e. not enough hosts set as failover hosts
See Section 5 – Troubleshooting Availability in the vSphere Troubleshooting documentation, which outlines common failover scenarios for each of the three Admission Control Policies. For further reading on the three admission control policies see pages 22 thru 28 of the vSphere Availability documentation.
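The slot-size effect of oversized reservations is easy to see with a little arithmetic. The following is a simplified sketch of the “Host failures the cluster tolerates” math; real HA also adds per-VM memory overhead and applies default minimums when no reservations are set, and all numbers below are invented for the example:

```python
# Simplified vSphere HA slot-size sketch (overhead and default minimums omitted).

def slot_size(vms):
    """Slot size = largest CPU reservation and largest memory reservation
    across all powered-on VMs."""
    cpu = max(vm["cpu_mhz"] for vm in vms)
    mem = max(vm["mem_mb"] for vm in vms)
    return cpu, mem

def host_slots(host, slot_cpu, slot_mem):
    """A host holds as many slots as fit in BOTH its CPU and its memory."""
    return min(host["cpu_mhz"] // slot_cpu, host["mem_mb"] // slot_mem)

vms = [
    {"cpu_mhz": 500,  "mem_mb": 1024},
    {"cpu_mhz": 2000, "mem_mb": 512},   # one oversized CPU reservation...
]
hosts = [{"cpu_mhz": 8000, "mem_mb": 16384} for _ in range(3)]

slot_cpu, slot_mem = slot_size(vms)                       # (2000, 1024)
slots = [host_slots(h, slot_cpu, slot_mem) for h in hosts]
print(slots)   # [4, 4, 4] — the single 2000 MHz reservation caps every host at 4 slots
```

Drop that one 2000 MHz reservation to 500 MHz and each host jumps to 16 slots, which is exactly why a single oversized VM can make HA report far less failover capacity than you expect.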
Troubleshoot HA Redundancy Issues
Like all other components in a vSphere design, you want redundancy for a cluster’s HA network traffic. You can go about this in one of two ways, or both. NIC teaming (two physical NICs, preferably connected to separate physical switches) is the most common method; it allows either of the two links to fail while the host can still communicate on the network. The second option is to create a secondary management network. This second interface needs to be attached to a different virtual switch and a different subnet from the primary network, which allows HA traffic to be communicated over both networks.
Interpret the DRS Resource Distribution Graph and Target/Current Host Load Deviation
The DRS Resource Distribution Chart displays both memory and CPU metrics for each host in the cluster. Each resource can be displayed either as a percentage or as a size in megabytes for memory or megahertz for CPU. In the chart, each box/section represents a VM running on that host and the resources it is currently consuming. The chart is accessed from the Summary tab at the cluster level, under the VMware DRS section. Click the View Resource Distribution Chart hyperlink.
The target/current host load deviation is a representation of the balance of resources across the hosts in your cluster. The DRS process runs every 5 minutes and analyzes resource metrics on each host across the cluster. For each host, those metrics are plugged into an equation:
(sum of VM entitlements) / (host capacity)
The standard deviation of these per-host values is the “Current host load standard deviation”. If this number is higher than the “Target host load standard deviation”, your cluster is imbalanced and DRS will make recommendations on which VMs to migrate to re-balance the cluster.
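The balance check can be sketched in a few lines. All entitlement and capacity numbers below are hypothetical, and the target deviation is hard-coded for illustration (in practice DRS derives it from the migration threshold slider):

```python
# Sketch of the DRS balance metric: per-host load is the sum of VM
# entitlements divided by host capacity; the "current host load standard
# deviation" is the standard deviation of those per-host loads.
from statistics import pstdev

def host_load(vm_entitlements_mhz, host_capacity_mhz):
    return sum(vm_entitlements_mhz) / host_capacity_mhz

# Hypothetical 3-host cluster, 10000 MHz capacity each:
loads = [
    host_load([2000, 3000], 10000),   # 0.5
    host_load([1000],       10000),   # 0.1
    host_load([4000, 2000], 10000),   # 0.6
]

current_deviation = pstdev(loads)     # ~0.216
target_deviation = 0.2                # illustrative threshold

if current_deviation > target_deviation:
    print("Cluster imbalanced; DRS will recommend migrations")
```

Moving roughly 2000 MHz of entitlement off the third host onto the second would pull the current deviation back under the target, which is the kind of move DRS recommends.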
Troubleshoot DRS Load Imbalance Issues
DRS clusters become imbalanced/overcommitted for several reasons:
- A cluster might become overcommitted if a host fails
- A cluster becomes invalid if vCenter Server is unavailable and you power on virtual machines using a vSphere Client connected directly to a host
- A cluster becomes invalid if the user reduces the reservation on a parent resource pool while a virtual machine is in the process of failing over
- If changes are made to hosts or virtual machines using a vSphere Client connected to a host while vCenter Server is unavailable, those changes take effect. When vCenter Server becomes available again, you might find that clusters have turned red or yellow because cluster requirements are no longer met.
Troubleshoot vMotion/Storage vMotion Migration Issues
For vMotion refer to section above for DRS and vMotion requirements. Make sure all requirements are being met.
For Storage vMotion be aware of the following requirements and limitations
- Virtual machine disks must be in persistent mode or be raw device mappings (RDMs). For virtual compatibility mode RDMs, you can migrate the mapping file or convert to thick-provisioned or thin-provisioned disks during migration as long as the destination is not an NFS datastore. If you convert the mapping file, a new virtual disk is created and the contents of the mapped LUN are copied to this disk. For physical compatibility mode RDMs, you can migrate the mapping file only.
- Migration of virtual machines during VMware Tools installation is not supported
- The host on which the virtual machine is running must have a license that includes Storage vMotion
- The host on which the virtual machine is running must have access to both the source and target datastores
Interpret vMotion Resource Maps
vMotion resource maps provide a visual representation of hosts, datastores, and networks associated with the selected virtual machine.
vMotion resource maps also indicate which hosts in the virtual machine’s cluster or datacenter are compatible targets. To be compatible, a host must meet the following criteria:
- Connect to all the same datastores as the virtual machine
- Connect to all the same networks as the virtual machine
- Have compatible software with the virtual machine
- Have a compatible CPU with the virtual machine
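The first two criteria are simple set-containment checks, which can be sketched as follows (the CPU and software checks are omitted for brevity, and all host/datastore/network names are invented for the example):

```python
# A host is a valid vMotion target only if it can see every datastore
# and every network the VM uses.

def compatible(vm: dict, host: dict) -> bool:
    return (vm["datastores"] <= host["datastores"]      # subset test
            and vm["networks"] <= host["networks"])

vm     = {"datastores": {"ds-shared"}, "networks": {"vm-net-10"}}
host_a = {"datastores": {"ds-shared", "ds-local-a"}, "networks": {"vm-net-10", "mgmt"}}
host_b = {"datastores": {"ds-local-b"}, "networks": {"vm-net-10"}}

print(compatible(vm, host_a))  # True: host_a sees all of the VM's datastores and networks
print(compatible(vm, host_b))  # False: host_b cannot see the shared datastore
```

This mirrors what the resource map shows visually: hosts missing a datastore or network connection to the VM are flagged as incompatible.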
Identify the Root Cause of a DRS/HA Cluster or Migration Issue Based on Troubleshooting Information
Use information from the topics above to help isolate the issue based on HA/DRS requirements, as well as the pages from the reference documents listed.
Verify Fault Tolerance Configuration
Identify Fault Tolerance Requirements
When VMware Fault Tolerance was originally announced back in the ESXi/ESX 4.x days, it received a lukewarm reception. While the concept of protecting tier 1 workloads with a synchronized shadow VM was appealing, the requirement that a protected virtual machine have a single vCPU limited the feature’s use cases. In vSphere 6, VMware has lifted the vCPU limitation from 1 vCPU to up to 4 vCPUs (based on licensing). With this increase, I would assume this feature will now be leveraged in more environments.
Beyond the increase of support for multi processor, there are other requirements/features you should know for the exam:
- Physical CPUs must be compatible with vSphere vMotion or Enhanced vMotion Compatibility (EVC)
- Physical CPUs must support hardware MMU virtualization (Intel EPT or AMD RVI)
- Use a dedicated 10 Gb network for FT logging
- vSphere Standard and Enterprise allow up to 2 vCPUs for FT
- vSphere Enterprise Plus allows up to 4 vCPUs for FT
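The licensing limits above boil down to a simple lookup, sketched here (the edition strings and the zero default for unknown editions are my own illustrative choices):

```python
# FT vCPU limits per vSphere 6 edition, per the list above.
ft_vcpu_limit = {"Standard": 2, "Enterprise": 2, "Enterprise Plus": 4}

def ft_allowed(edition: str, vcpus: int) -> bool:
    """Can a VM with this many vCPUs be FT-protected under this license?"""
    return vcpus <= ft_vcpu_limit.get(edition, 0)

print(ft_allowed("Standard", 2))         # True
print(ft_allowed("Standard", 4))         # False: needs Enterprise Plus
print(ft_allowed("Enterprise Plus", 4))  # True
```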
While FT provides a higher level of availability, there are a few features that are NOT supported if a VM is protected via Fault Tolerance:
- Virtual machine snapshots
- Storage vMotion
- Linked Clones
- Virtual SAN (VSAN)
- VM Component Protection (VMCP)
- Virtual Volume datastores
- Storage-based policy management
- I/O filters