Planning for Failure with DRS Groups
You’ve spent many hours on your virtualization infrastructure design. There are multiple network paths, at least two of each physical device, and, heck, even multiple upstream Internet providers. The multi-tiered application has multiple VM nodes at each tier for distributing load and boosting service availability.
It’s 3:47 a.m. and your buzzing phone jolts you awake. There’s been a UPS failure in the data center and a row of equipment has gone down hard. Good thing you put in all that redundancy. Only – crap – the application’s down too. Looks like all the nodes in the app tier died when the power dropped. HA’s bringing the VMs back up, but the outage drops service availability below what you and the business agreed upon.
In the complex environment that is the modern data center, it’s a challenge to identify and remember all the possible fault domains, those ever-pervasive single points of failure, that can take an application down. We tend to focus on the layers we most closely interact with.
It’s all too easy to take for granted that the data center itself is immune to failure. And you’d be right in pointing out that modern data centers typically account for their fault domains by implementing redundancy and controls. What is overlooked in our story is how fault domains overlap and impact each other.
Let’s look at an example of how we can plan for these failures.
Host DRS Groups
Let’s say we have a modest data center with two rows of equipment. Each row consists of a number of racks with the requisite storage, network and server gear to support a single virtual environment. There are redundant room-level UPSs, each protecting one of the rows. If there is ever a UPS failure, only one row, roughly half the equipment, is affected.
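To make the fault domain concrete, here is a minimal Python sketch of that layout. The UPS, row and host names are made up for illustration, and the three-hosts-per-row count is an assumption; the only thing taken from the scenario is that each UPS feeds exactly one row.

```python
# Hypothetical inventory: each room-level UPS feeds exactly one row.
UPS_TO_ROW = {"ups-a": "row-a", "ups-b": "row-b"}

# Hypothetical hosts per row (names and counts are assumptions).
ROW_TO_HOSTS = {
    "row-a": ["esx-a1", "esx-a2", "esx-a3"],
    "row-b": ["esx-b1", "esx-b2", "esx-b3"],
}


def hosts_lost(failed_ups):
    """Return the hosts that lose power when the given UPS fails."""
    row = UPS_TO_ROW[failed_ups]
    return list(ROW_TO_HOSTS[row])


def surviving_hosts(failed_ups):
    """Return every host that is NOT fed by the failed UPS."""
    lost = set(hosts_lost(failed_ups))
    return [
        host
        for hosts in ROW_TO_HOSTS.values()
        for host in hosts
        if host not in lost
    ]
```

Calling `hosts_lost("ups-a")` returns the three row-a hosts, and `surviving_hosts("ups-a")` returns the three row-b hosts: one UPS, one row, roughly half the capacity.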
Here’s a simplified diagram of what our data center looks like. Since we’re going to focus on the virtual infrastructure, we’ll only depict hosts within the rows.
How do we represent this physical layout in vCenter? We can group these hosts together into host DRS groups. Each group mirrors a physical grouping of hosts, one group per row, so that we can make placement decisions based on where the hosts actually sit.
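As a rough sketch of the bookkeeping involved, the Python below derives one host group per row from a flat host inventory. The host names, the row labels and the `hosts-` naming prefix are all assumptions for illustration. The function only computes group membership; it does not talk to vCenter, and creating the groups there is a separate step not shown.

```python
from collections import defaultdict

# Hypothetical inventory: host name -> the row it is racked in.
HOST_ROWS = {
    "esx-a1": "row-a",
    "esx-a2": "row-a",
    "esx-b1": "row-b",
    "esx-b2": "row-b",
}


def build_host_groups(host_rows, prefix="hosts-"):
    """Group hosts by row, returning {group name: sorted host names}."""
    groups = defaultdict(list)
    # Iterating over sorted items keeps each group's member list sorted.
    for host, row in sorted(host_rows.items()):
        groups[prefix + row].append(host)
    return dict(groups)
```

With the inventory above, `build_host_groups(HOST_ROWS)` yields two groups, `hosts-row-a` and `hosts-row-b`, each listing the hosts in that row. Keeping the row label as the single source of truth means a host that gets re-racked only needs its label changed.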
We’ve now grouped the hosts together along our row boundaries, but ultimately this is about our application. How do we deal with the app’s VMs?