Enhanced vMotion Compatibility (EVC) Best Practices

During an infrastructure provisioning conversation the other day, a comment was made that, as a best practice, EVC should be avoided if possible. I was taken aback by this statement, as the value of EVC is clear to me. Unfortunately I haven’t had a chance to speak with the commenter yet to understand their point of view.
My opinion is that EVC should always be enabled on vSphere clusters, provided you understand what enabling it means for that cluster. To understand where I’m coming from, let’s revisit what EVC does for us.
What is it Good For?
When VMware introduced vMotion, it was a significant milestone for virtualization adoption. Of course we’re all very familiar with the concept now, to the point of taking it for granted. While successful in furthering the ability to manage virtual workloads by allowing them to move between hosts, this “VM freedom” did not come without limits.
vMotion is limited to hosts of a particular CPU architecture. This is the familiar Intel vs. AMD boundary that needs to be identified on a cluster-by-cluster basis. Within those CPU-bound clusters, however, a VM can run into problems if it is vMotioned to a host with a differing set of CPU instructions. From the VM’s point of view, the CPU’s instruction sets have effectively changed in an instant, gaining or losing certain instructions based on the underlying hardware. This could have disastrous effects on the VM and its workload.
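To make that risk concrete, here’s a minimal sketch, assuming a Linux guest, of how a workload might check which instruction sets it can currently see; “aes”, “avx”, and “sse4_2” are standard flag names the Linux kernel reports in /proc/cpuinfo.

```python
# A minimal sketch: which instruction sets can this (Linux) guest see right now?
# After a vMotion to a host with a different CPU, flags like 'aes' or 'avx'
# could effectively appear or disappear from the VM's point of view.

def cpu_flags():
    """Return the set of CPU feature flags reported by the guest kernel."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("aes", "avx", "sse4_2"):
    print(f"{feature}: {'present' if feature in flags else 'missing'}")
```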
An example of the error vSphere throws when a VM’s CPU requirements aren’t met by the underlying host CPU can be found on Lindsay Hill’s blog, where he describes how he addressed exactly this situation.
One way to deal with this type of issue is to make all CPUs the same across all hosts within a cluster. This quickly becomes an operational challenge in any real environment. Unless your business needs are such that you will never need to add more hosts, you’re going to have to grow your clusters at some point. The nature of IT, in this case specifically the nature of enterprise hardware manufacturers, means that this becomes infeasible quickly. Within a relatively short window, typically on the order of 12-18 months, it becomes increasingly difficult to buy the same equipment as you have today. So besides building a brand new cluster with the same CPUs every time you need to scale, what can we do?
Enter Enhanced vMotion Compatibility, or EVC. EVC establishes a baseline, or level, for the cluster. This level is effectively a definition of the CPU instruction sets that will be “allowed” within the cluster. Hosts whose CPUs can meet this level, by having all of these instruction sets available, can be added to the cluster; hosts whose CPUs cannot meet this level cannot. This gives the VMs a consistent set of instructions so that we avoid the risk of impacting their workloads, and it provides the best of both worlds. There are still some higher-level constraints, such as Intel vs. AMD, but we end up with much more flexibility in our cluster hardware design.
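For the curious, a cluster’s EVC state is visible through the vSphere API. Here’s a minimal pyVmomi sketch; the vCenter address, credentials, and cluster name are placeholders, and certificate checking is disabled purely for brevity.

```python
# A minimal pyVmomi sketch: inspect a cluster's EVC baseline. The vCenter
# address, credentials, and cluster name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab convenience; verify certs in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster01")
view.Destroy()

evc_state = cluster.EvcManager().evcState
print("Current EVC mode:", evc_state.currentEVCModeKey)  # None while EVC is off
print("Modes this cluster could be set to:",
      [m.key for m in evc_state.supportedEVCMode])
# Disconnect(si) when you're finished with the session.
```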
Does it Hurt?
At first glance it would seem reasonable to expect that a process that adds a bit of inspection could potentially impact performance. VMware evaluated the impact of EVC on the performance of certain workloads and published the results in the white paper “Impact of Enhanced vMotion Compatibility on Application Performance”. The paper found that most workloads performed the same regardless of EVC level, while workloads that take advantage of newer instruction sets ran slower at the lower EVC levels that mask those instructions.
Does this mean that EVC itself was the cause of the performance differences VMware observed? Not really, no. The two workloads in the paper that benefited from newer CPU instruction sets, AES encryption and video encoding, could have their needs met in other ways. If there’s a strong business need for these workloads to run more quickly, alternatives such as dedicated clusters or raising the EVC level (if all hosts are compatible) will meet that need. If it’s not critical to the business that these workloads execute quickly, then perhaps they’re fine within a general cluster at a lower EVC level.
So it becomes clear that the issue is not really about whether to enable EVC or not, but what “level” of EVC is right for your workloads. This consideration should already be part of your design’s system requirements, so selection of the right EVC level should be straightforward.
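Finding the ceiling for an existing cluster can be scripted as well. The sketch below reuses the si and cluster objects from the earlier snippet and assumes all hosts are from the same CPU vendor; maxEVCModeKey and the vendorTier ordering come from the vSphere API.

```python
# A minimal sketch: find the highest EVC level every host in the cluster
# supports. Assumes 'si' and 'cluster' from the previous snippet and a
# single CPU vendor across all hosts.

# Map every EVC mode key vCenter knows about to its metadata, so keys can
# be ranked by vendorTier (a higher tier means newer instruction sets).
modes = {m.key: m for m in si.capability.supportedEVCMode}

host_max_keys = [h.summary.maxEVCModeKey for h in cluster.host]
print("Per-host maximum EVC modes:", host_max_keys)

# The cluster baseline can only be as high as its least-capable host.
ceiling = min((modes[k] for k in host_max_keys if k in modes),
              key=lambda m: m.vendorTier)
print("Highest EVC level common to all hosts:", ceiling.key)
```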
Best Practice
In summary, establish a CPU baseline whenever a new cluster is created by choosing the proper CPUs and enabling EVC. Remember that EVC means you can’t move backwards through CPU generations, but you can always move forwards (often without impacting running VMs). The performance impact of EVC in and of itself is practically non-existent. And the longer you go without enabling EVC on a cluster, the harder it will be to take the cluster outage needed to enable it.
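As a starting point, here’s a minimal pyVmomi sketch of enabling EVC on a cluster, again reusing the cluster lookup from the earlier snippet; “intel-broadwell” is only an example baseline key, and the dry-run check is optional.

```python
# A minimal sketch: enable EVC on a (preferably new and empty) cluster.
# Assumes the 'cluster' object from the earlier snippet; 'intel-broadwell'
# is only an example baseline key.
from pyVim.task import WaitForTask

evc_manager = cluster.EvcManager()

# Optional dry run: vCenter validates the level against the cluster's hosts
# and VMs; the task result lists anything that would block the change.
check = evc_manager.CheckConfigureEvcMode_Task("intel-broadwell")
WaitForTask(check)
if check.info.result:
    print("Issues blocking this EVC level:", check.info.result)
else:
    WaitForTask(evc_manager.ConfigureEvcMode_Task("intel-broadwell"))
    print("Cluster EVC mode is now", evc_manager.evcState.currentEVCModeKey)
```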
So always turn on EVC on your new, empty clusters, and plan an EVC Enable Party for your existing clusters sooner rather than later.
Hi,
Thanks for publishing this. I’m actually starting to think about EVC in our environment. We’ve been going through a lot of growing pains (100 VMs to 650 VMs in a matter of two years). I’ve typically built non-EVC clusters, mostly out of ignorance, as I realize after reading your article.
That said, for me, I see EVC as more useful for the migration of resources than for augmenting our cluster, at least to some degree. For example, we have a cluster with dual-socket 8-core procs, and we’re going to new HW with dual-socket 18-core procs. One of the wins with the 18-core procs is that we’re hoping to reduce VMware and MS licensing in addition to providing more resources. While keeping the 8-core boxes in the cluster would allow more resources, it ultimately comes at the cost of less efficient licensing.
My question for you would be: have you ever noticed any reliability issues with EVC? I think that would ultimately be a bigger concern to me than squeezing every last instruction set out of my procs.
Also, are there any concerns to be wary of if vCenter is part of the same EVC cluster and using vDS? I can’t recall where I read it, but I read somewhere about that being a concern.
The situation you’ve described, where you’re introducing new hosts with differing CPU specs to a cluster, is exactly the type of scenario EVC is made for. By breaking your approach down into its component steps, you can see that you’re extending your cluster with the new host(s) and then ultimately removing one or more of the old ones. Assuming your goal is to completely remove the 8-core systems and leave only the 18-core systems, you could potentially increase your EVC level afterwards if the 18-core procs are of a newer generation.
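As a sketch of that final step, once the 8-core hosts are out of the cluster you could verify and raise the level along these lines; the cluster object is from the earlier snippets, and “intel-haswell” is only an illustrative key.

```python
# A minimal sketch: once the old hosts are gone, check what the remaining
# hosts allow and raise the baseline. Assumes the 'cluster' object from the
# earlier snippets; 'intel-haswell' is only an illustrative key.
from pyVim.task import WaitForTask

evc_manager = cluster.EvcManager()
print("Current level:", evc_manager.evcState.currentEVCModeKey)
print("Levels the remaining hosts allow:",
      [m.key for m in evc_manager.evcState.supportedEVCMode])

# Raising the level doesn't disturb running VMs; each VM picks up the new
# baseline at its next power cycle.
WaitForTask(evc_manager.ConfigureEvcMode_Task("intel-haswell"))
```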
Regarding stability, I’ve enabled and used EVC since it was introduced back in the ESX 3.5U2 days and haven’t ever encountered a problem with it. When you consider the tactic it’s taking, filtering the CPU instruction sets, pretty much your only point of concern is how well that filter is maintained. VMware’s had many years now to make sure the wizard stays behind the curtain.
I haven’t heard any concerns about deploying vCenter on an EVC enabled cluster. There can be situations, typically during vCenter recovery, where vDS can get in the way. I’d recommend reading Chris Wahl’s post on using Ephemeral Binding to reduce the risk (http://wahlnetwork.com/2015/01/30/vds-ephemeral-binding/).
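For completeness, here’s a minimal, hypothetical pyVmomi sketch of creating the kind of ephemeral-binding portgroup Chris describes; the dvs object and the portgroup name are placeholders.

```python
# A minimal, hypothetical sketch: create an ephemeral-binding portgroup on a
# vDS. Assumes 'dvs' is a vim.DistributedVirtualSwitch already looked up;
# the portgroup name is a placeholder.
from pyVmomi import vim
from pyVim.task import WaitForTask

spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec(
    name="Recovery-Ephemeral",
    type="ephemeral",  # ports are created on demand, even if vCenter is down
)
WaitForTask(dvs.CreateDVPortgroup_Task(spec))
```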