Enhanced vMotion Compatibility (EVC) Best Practices

During an infrastructure provisioning conversation the other day, a comment was made that, as a best practice, EVC should be avoided if possible. I was taken aback by this statement, as the value of EVC is clear to me. Unfortunately I haven’t had a chance to speak with the commenter yet to understand their point of view.
My opinion is that EVC should always be enabled on vSphere clusters, provided you understand what enabling it means for that cluster. To understand where I’m coming from, let’s revisit what EVC does for us.
What is it Good For?
When VMware introduced vMotion, it was a significant milestone for virtualization adoption. Of course we’re all very familiar with the concept now, to the point of taking it for granted. While successful in furthering the ability to manage virtual workloads by allowing them to move between hosts, this “VM freedom” did not come without limits.
vMotion is limited to hosts of a particular CPU architecture. This is the familiar Intel vs. AMD boundary that needs to be identified on a cluster-by-cluster basis. Within those CPU-bound clusters, however, a VM can run into problems if it is vMotioned to a host with a differing set of CPU instructions. From the VM’s point of view, the CPU’s instruction sets have effectively changed in an instant, gaining or losing certain instructions based on the underlying hardware. This could have disastrous effects on the VM and its workload.
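To make that risk concrete, here’s a minimal sketch, assuming a Linux guest, of how a workload might check which instruction sets it can currently see; “aes”, “avx”, and “sse4_2” are standard flag names the Linux kernel reports in /proc/cpuinfo.

```python
# A minimal sketch: which instruction sets can this (Linux) guest see right now?
# After a vMotion to a host with a different CPU, flags like 'aes' or 'avx'
# could effectively appear or disappear from the VM's point of view.

def cpu_flags():
    """Return the set of CPU feature flags reported by the guest kernel."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("aes", "avx", "sse4_2"):
    print(f"{feature}: {'present' if feature in flags else 'missing'}")
```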
An example of the error vSphere throws when a VM’s CPU requirements aren’t met by the underlying host CPU can be found on Lindsay Hill’s blog, where he describes how he addressed exactly this situation.
One way to deal with this type of issue is to make all CPUs the same across all hosts within a cluster. This quickly becomes an operational challenge in any real environment. Unless your business needs are such that you will never need to add more hosts, you’re going to have to grow your clusters at some point. The nature of IT, in this case specifically the nature of enterprise hardware manufacturers, means that this becomes infeasible quickly. Within a relatively short window, typically on the order of 12-18 months, it becomes increasingly difficult to buy the same equipment as you have today. So besides building a brand new cluster with the same CPUs every time you need to scale, what can we do?
Enter Enhanced vMotion Compatibility, or EVC. EVC establishes a baseline, or level, for the cluster. This level is effectively a definition of the CPU instruction sets that will be “allowed” within the cluster. Hosts whose CPUs can meet this level, by having all of these instruction sets available, can be added to the cluster; hosts whose CPUs cannot meet this level cannot. This gives the VMs a consistent set of instructions so that we avoid the risk of impacting their workloads, and it provides the best of both worlds. There are still some higher-level constraints, such as Intel vs. AMD, but we end up with much more flexibility in our cluster hardware design.
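For the curious, a cluster’s EVC state is visible through the vSphere API. Here’s a minimal pyVmomi sketch; the vCenter address, credentials, and cluster name are placeholders, and certificate checking is disabled purely for brevity.

```python
# A minimal pyVmomi sketch: inspect a cluster's EVC baseline. The vCenter
# address, credentials, and cluster name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab convenience; verify certs in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster01")
view.Destroy()

evc_state = cluster.EvcManager().evcState
print("Current EVC mode:", evc_state.currentEVCModeKey)  # None while EVC is off
print("Modes this cluster could be set to:",
      [m.key for m in evc_state.supportedEVCMode])
# Disconnect(si) when you're finished with the session.
```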
Does it Hurt?
At first glance it would seem reasonable to expect that a process that adds a bit of inspection could potentially impact performance. VMware evaluated the impact of EVC on the performance of certain workloads and published the results in the white paper “Impact of Enhanced vMotion Compatibility on Application Performance”. The paper found that most workloads performed the same regardless of EVC level, while workloads that take advantage of newer instruction sets ran slower at the lower EVC levels that mask those instructions.
Does this mean that EVC itself was the cause of the performance differences VMware observed? Not really, no. The two workloads in the paper that benefited from newer CPU instruction sets, AES encryption and video encoding, could have their needs met in other ways. If there’s a strong business need for these workloads to run more quickly, alternatives such as dedicated clusters or raising the EVC level (if all hosts are compatible) will meet that need. If it’s not critical to the business that these workloads execute quickly, then perhaps they’re fine within a general cluster at a lower EVC level.
So it becomes clear that the issue is not really about whether to enable EVC or not, but what “level” of EVC is right for your workloads. This consideration should already be part of your design’s system requirements, so selection of the right EVC level should be straightforward.
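Finding the ceiling for an existing cluster can be scripted as well. The sketch below reuses the si and cluster objects from the earlier snippet and assumes all hosts are from the same CPU vendor; maxEVCModeKey and the vendorTier ordering come from the vSphere API.

```python
# A minimal sketch: find the highest EVC level every host in the cluster
# supports. Assumes 'si' and 'cluster' from the previous snippet and a
# single CPU vendor across all hosts.

# Map every EVC mode key vCenter knows about to its metadata, so keys can
# be ranked by vendorTier (a higher tier means newer instruction sets).
modes = {m.key: m for m in si.capability.supportedEVCMode}

host_max_keys = [h.summary.maxEVCModeKey for h in cluster.host]
print("Per-host maximum EVC modes:", host_max_keys)

# The cluster baseline can only be as high as its least-capable host.
ceiling = min((modes[k] for k in host_max_keys if k in modes),
              key=lambda m: m.vendorTier)
print("Highest EVC level common to all hosts:", ceiling.key)
```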
Best Practice
In summary, establish a CPU baseline whenever a new cluster is created by choosing the proper CPUs and enabling EVC. Remember that EVC means you can’t move backwards through CPU generations, but you can always move forwards (often without impacting running VMs). The performance impact of EVC in and of itself is practically non-existent. And the longer you go without enabling EVC on a cluster, the harder it will be to take the cluster outage needed to enable it.
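As a starting point, here’s a minimal pyVmomi sketch of enabling EVC on a cluster, again reusing the cluster lookup from the earlier snippet; “intel-broadwell” is only an example baseline key, and the dry-run check is optional.

```python
# A minimal sketch: enable EVC on a (preferably new and empty) cluster.
# Assumes the 'cluster' object from the earlier snippet; 'intel-broadwell'
# is only an example baseline key.
from pyVim.task import WaitForTask

evc_manager = cluster.EvcManager()

# Optional dry run: vCenter validates the level against the cluster's hosts
# and VMs; the task result lists anything that would block the change.
check = evc_manager.CheckConfigureEvcMode_Task("intel-broadwell")
WaitForTask(check)
if check.info.result:
    print("Issues blocking this EVC level:", check.info.result)
else:
    WaitForTask(evc_manager.ConfigureEvcMode_Task("intel-broadwell"))
    print("Cluster EVC mode is now", evc_manager.evcState.currentEVCModeKey)
```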
So always turn on EVC on your new, empty clusters, and plan an EVC Enable Party for your existing clusters sooner rather than later.
Hi,
Thanks for publishing this. I’m actually starting to think about EVC in our environment. We’ve been going through a lot of growing pains (100 VMs to 650 VMs in a matter of two years). I’ve typically built non-EVC clusters, mostly out of ignorance, as I realize after reading your article.
That said, for me, I see EVC as more useful for the migration of resources than for augmenting our cluster, at least to some degree. For example, we have a cluster with dual-socket 8-core procs, and we’re going to new HW with dual-socket 18-core procs. One of the wins with the 18-core procs is that we’re hoping to reduce VMware and MS licensing in addition to providing more resources. While keeping the 8-core boxes in the cluster would allow more resources, it ultimately comes at the cost of less efficient licensing.
My question for you would be: have you ever noticed any reliability issues with EVC? I think that would ultimately be a bigger concern to me than squeezing every last instruction set out of my procs.
Also, are there any concerns to be wary of if vCenter is part of the same EVC cluster and using vDS? I can’t recall where I read it, but I read somewhere about that being a concern.
The situation you’ve described, where you’re introducing new hosts with differing CPU specs to a cluster, is exactly the type of scenario EVC is made for. By breaking your approach down into its component steps, you can see that you’re extending your cluster with the new host(s) and then ultimately removing one or more of the old ones. Assuming your goal is to completely remove the 8-core systems and leave only the 18-core systems, you could potentially increase your EVC level afterwards if the 18-core procs are of a newer generation.
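As a sketch of that final step, once the 8-core hosts are out of the cluster you could verify and raise the level along these lines; the cluster object is from the earlier snippets, and “intel-haswell” is only an illustrative key.

```python
# A minimal sketch: once the old hosts are gone, check what the remaining
# hosts allow and raise the baseline. Assumes the 'cluster' object from the
# earlier snippets; 'intel-haswell' is only an illustrative key.
from pyVim.task import WaitForTask

evc_manager = cluster.EvcManager()
print("Current level:", evc_manager.evcState.currentEVCModeKey)
print("Levels the remaining hosts allow:",
      [m.key for m in evc_manager.evcState.supportedEVCMode])

# Raising the level doesn't disturb running VMs; each VM picks up the new
# baseline at its next power cycle.
WaitForTask(evc_manager.ConfigureEvcMode_Task("intel-haswell"))
```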
Regarding stability, I’ve enabled and used EVC since it was introduced back in the ESX 3.5U2 days and haven’t ever encountered a problem with it. When you consider the tactic it’s taking, filtering the CPU instruction sets, pretty much your only point of concern is how well that filter is maintained. VMware’s had many years now to make sure the wizard stays behind the curtain.
I haven’t heard any concerns about deploying vCenter on an EVC enabled cluster. There can be situations, typically during vCenter recovery, where vDS can get in the way. I’d recommend reading Chris Wahl’s post on using Ephemeral Binding to reduce the risk (http://wahlnetwork.com/2015/01/30/vds-ephemeral-binding/).
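For completeness, here’s a minimal, hypothetical pyVmomi sketch of creating the kind of ephemeral-binding portgroup Chris describes; the dvs object and the portgroup name are placeholders.

```python
# A minimal, hypothetical sketch: create an ephemeral-binding portgroup on a
# vDS. Assumes 'dvs' is a vim.DistributedVirtualSwitch already looked up;
# the portgroup name is a placeholder.
from pyVmomi import vim
from pyVim.task import WaitForTask

spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec(
    name="Recovery-Ephemeral",
    type="ephemeral",  # ports are created on demand, even if vCenter is down
)
WaitForTask(dvs.CreateDVPortgroup_Task(spec))
```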