Planning for Failure with DRS Groups

You’ve spent many hours on your virtualization infrastructure design. There are multiple network paths, at least two of each physical device, and, heck, even multiple upstream Internet providers. The multi-tiered application has multiple VM nodes at each tier for distributing load and boosting service availability.

3:47am and your buzzing phone jolts you awake. There’s been a UPS failure in the data center and a row of equipment has gone down hard. Good thing you put in all that redundancy. Only – crap – the application’s down too. Looks like all the nodes in the app tier died when the power dropped. HA’s bringing the VMs back up, but this drops the service availability below what you and the business agreed upon.

In the complex environment that is the modern data center, it's a challenge to remember and identify all the possible fault domains, those ever-pervasive single points of failure, that can take an application down. We tend to focus on the layers we most closely interact with.

It's all too easy to forget that the data center itself is not immune to failure. And you’d be right in pointing out that modern data centers typically account for their fault domains by implementing redundancy and controls. What is overlooked in our story is how fault domains overlap and impact each other.

Let’s look at an example of how we can plan for these failures.

Host DRS Groups

Let’s say we have a modest data center with two rows of equipment. Each row consists of a number of racks with the requisite storage, network, and server gear to support a single virtual environment. There are redundant room-level UPSs, each protecting one of the rows. If there is ever a UPS failure, only one row, roughly half the equipment, is affected.

Here’s a simplified diagram of what our data center looks like. Since we’re going to focus on the virtual infrastructure, we’ll only depict hosts within the rows.

How do we represent this physical layout in vCenter? We can group these hosts together into host DRS groups. These groups will logically show the physical grouping of hosts so that we can make decisions based on these groups.
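One way to picture a host DRS group is as a named set of hosts, one set per row. The sketch below is plain Python rather than vCenter API code, and the host names and the two helper functions are hypothetical, made up for illustration.

```python
# Hypothetical inventory: host names are illustrative only.
HOST_GROUPS = {
    "Host DRS Group 1": {"esx-r1-01", "esx-r1-02", "esx-r1-03"},  # row 1
    "Host DRS Group 2": {"esx-r2-01", "esx-r2-02", "esx-r2-03"},  # row 2
}


def group_of(host, host_groups=HOST_GROUPS):
    """Return the name of the host group (row) that contains the host."""
    for name, members in host_groups.items():
        if host in members:
            return name
    raise KeyError(f"{host} is not in any host group")


def surviving_hosts(failed_group, host_groups=HOST_GROUPS):
    """Return the hosts still running after every host in one group fails."""
    survivors = set()
    for name, members in host_groups.items():
        if name != failed_group:
            survivors |= members
    return survivors


print(group_of("esx-r2-02"))
print(sorted(surviving_hosts("Host DRS Group 1")))
```

Keeping the group membership aligned with the physical rows is the whole point: a UPS failure then maps to the loss of exactly one group.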

We've now grouped the hosts together along our row boundaries, but ultimately this is about our application. How do we deal with the app's VMs?

Virtual Machine DRS Groups

Following our story as an example, let's say that we have a typical three-tiered application. There's a web tier, an app tier, and a database tier. For simplicity's sake, each tier has two VMs: an A node and a B node.

We're all used to using VM affinity rules to keep VMs together or apart from each other. In our example we're often tempted to use VM-VM anti-affinity rules to keep the A nodes away from the B nodes. As our story illustrated, however, this doesn't guarantee that the VMs will be distributed along our physical row boundaries.
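The gap is easy to see in a toy model. The sketch below, in plain Python with hypothetical host and VM names, checks a placement that satisfies a VM-VM anti-affinity rule (the two app nodes are on different hosts) while both nodes still sit in the same row.

```python
# Hypothetical hosts and VMs; names are illustrative only.
HOST_ROW = {
    "esx-r1-01": "row1", "esx-r1-02": "row1",
    "esx-r2-01": "row2", "esx-r2-02": "row2",
}

# A placement that a VM-VM anti-affinity rule would accept.
placement = {"app-a": "esx-r1-01", "app-b": "esx-r1-02"}


def satisfies_anti_affinity(vms, placement):
    """True if no two of the given VMs share a host."""
    hosts = [placement[vm] for vm in vms]
    return len(hosts) == len(set(hosts))


def rows_used(vms, placement, host_row=HOST_ROW):
    """Return the set of rows the given VMs are running in."""
    return {host_row[placement[vm]] for vm in vms}


tier = ["app-a", "app-b"]
print(satisfies_anti_affinity(tier, placement))  # rule is satisfied
print(rows_used(tier, placement))                # yet only one row is in use
```

The anti-affinity rule only knows about hosts, so it has no way to object when both hosts share a UPS.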

Conceptually, not only are we interested in keeping the A nodes away from the B nodes to allow the application to survive a host failure, but we now need to make sure the application can survive a row failure. Let's group our VMs along the A and B nodes into two virtual machine DRS groups.

Now we have host DRS groups that logically group hosts along our physical row boundaries, and we have virtual machine DRS groups that logically group VMs along our node boundaries. How do we tie them together?

Like Peanut Butter and Chocolate

Turns out that our two different sets of group definitions are very complementary. Let's choose to associate our VM DRS Group A with Host DRS Group 1, and VM DRS Group B with Host DRS Group 2. Logically it looks something like this.

In our DRS settings, we simply create rules that match our choices:

VM DRS Group A must run on hosts in Host DRS Group 1.
VM DRS Group B must run on hosts in Host DRS Group 2.

DRS will now place the VMs according to our rules.

What will happen now if we have a row failure? We'll only lose one node out of each application tier, meaning the application stands a good chance of staying up. How about our original concern, a host failure? Because the rules we've created keep the A and B nodes on different groups of hosts, they also make sure that the two nodes of a tier never run on the same host at the same time.
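The reasoning above can be checked with a small model. This is a minimal sketch in plain Python, not a call into vCenter; the host names, VM names, sample placement, and helper functions are all hypothetical. It verifies a placement against the two rules and then removes one host group to see what is left in each tier.

```python
# Hypothetical names throughout; this models the rules, it does not call vCenter.
HOST_GROUPS = {
    "Host DRS Group 1": {"esx-r1-01", "esx-r1-02"},
    "Host DRS Group 2": {"esx-r2-01", "esx-r2-02"},
}
VM_GROUPS = {
    "VM DRS Group A": {"web-a", "app-a", "db-a"},
    "VM DRS Group B": {"web-b", "app-b", "db-b"},
}
# VM group -> host group it must run on.
RULES = {
    "VM DRS Group A": "Host DRS Group 1",
    "VM DRS Group B": "Host DRS Group 2",
}
TIERS = {
    "web": {"web-a", "web-b"},
    "app": {"app-a", "app-b"},
    "db": {"db-a", "db-b"},
}


def violations(placement):
    """Return VMs placed on a host outside the group their rule requires."""
    bad = []
    for vm_group, host_group in RULES.items():
        for vm in sorted(VM_GROUPS[vm_group]):
            if placement[vm] not in HOST_GROUPS[host_group]:
                bad.append(vm)
    return bad


def survivors_per_tier(placement, failed_host_group):
    """Return, per tier, the VMs still running after one host group fails."""
    down = HOST_GROUPS[failed_host_group]
    return {
        tier: {vm for vm in vms if placement[vm] not in down}
        for tier, vms in TIERS.items()
    }


# One placement that satisfies both rules.
placement = {
    "web-a": "esx-r1-01", "app-a": "esx-r1-02", "db-a": "esx-r1-01",
    "web-b": "esx-r2-01", "app-b": "esx-r2-01", "db-b": "esx-r2-02",
}
print(violations(placement))
print(survivors_per_tier(placement, "Host DRS Group 1"))
```

Any placement that passes the rule check leaves exactly one node per tier running after a row failure, which is the property the anti-affinity rules alone could not give us.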

Late night alerts are now a lot less likely to cause you grief the next morning.

Featured image photo by Librarian Avenger