Planning for Failure with DRS Groups

You’ve spent many hours on your virtualization infrastructure design. There are multiple network paths, at least two of each physical device, and, heck, even multiple upstream Internet providers. The multi-tiered application has multiple VM nodes at each tier for distributing load and boosting service availability.

3:47am and your buzzing phone jolts you awake. There’s been a UPS failure in the data center and a row of equipment has gone down hard. Good thing you put in all that redundancy. Only – crap – the application’s down too. Looks like all the nodes in the app tier died when the power dropped. HA’s bringing the VMs back up, but this drops the service availability below what you and the business agreed upon.

In the complex environment that is the modern data center, it’s a challenge to identify and remember all the possible fault domains, those ever-pervasive single points of failure, that can take an application down. We tend to focus on the layers we most closely interact with.

It’s all too easy to forget that the data center itself is not immune to failure. And you’d be right in pointing out that modern data centers typically account for their fault domains by implementing redundancy and controls. What is overlooked in our story is how fault domains overlap and impact each other: the application’s redundancy lived in the virtual layer, yet every node in the app tier sat in the row that lost power.

Let’s look at an example of how we can plan for these failures.

Host DRS Groups

Let’s say we have a modest data center with two rows of equipment. Each row consists of a number of racks with the requisite storage, network and server gear to support a single virtual environment. There are redundant room-level UPSs, each protecting one of the rows. If there is ever a UPS failure, only one row, roughly half the equipment, is affected.
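To make that fault domain concrete, here is a minimal Python sketch of the layout. The host names, row labels, UPS labels, and the count of four hosts per row are all hypothetical; they only illustrate the "one UPS, one row" relationship described above.

```python
# Hypothetical inventory: names and host counts are illustrative assumptions.
ROW_HOSTS = {
    "row-a": ["esx-a1", "esx-a2", "esx-a3", "esx-a4"],
    "row-b": ["esx-b1", "esx-b2", "esx-b3", "esx-b4"],
}

# Each room-level UPS protects exactly one row.
UPS_PROTECTS = {"ups-1": "row-a", "ups-2": "row-b"}


def hosts_lost(failed_ups: str) -> list[str]:
    """Return the hosts that lose power when the given UPS fails."""
    return ROW_HOSTS[UPS_PROTECTS[failed_ups]]


def fraction_lost(failed_ups: str) -> float:
    """Return the share of all hosts taken down by the given UPS failure."""
    total = sum(len(hosts) for hosts in ROW_HOSTS.values())
    return len(hosts_lost(failed_ups)) / total


if __name__ == "__main__":
    print(hosts_lost("ups-1"))
    print(fraction_lost("ups-1"))
```

With equal-sized rows, a single UPS failure removes half the hosts, which is the capacity the surviving row has to absorb.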

Here’s a simplified diagram of what our data center looks like. Since we’re going to focus on the virtual infrastructure, we’ll only depict hosts within the rows.

[Image 1: simplified diagram of the data center, showing the hosts within each of the two rows]

How do we represent this physical layout in vCenter? We can group these hosts together into host DRS groups. These groups mirror the physical grouping of hosts, so we can make decisions based on where a host physically sits.

[Image 2: the hosts collected into host DRS groups along the row boundaries]
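As a rough sketch of the grouping step, the Python below derives one host group per row from a host-to-row inventory. The host names, row labels, and the "<row> Hosts" naming scheme are assumptions made for illustration, and the code does not talk to vCenter; it only produces the membership lists you would then enter as host DRS groups.

```python
from collections import defaultdict

# Hypothetical inventory mapping each host to the row it physically sits in.
HOST_ROW = {
    "esx-a1": "Row A",
    "esx-a2": "Row A",
    "esx-b1": "Row B",
    "esx-b2": "Row B",
}


def build_host_groups(host_row: dict[str, str]) -> dict[str, list[str]]:
    """Return one host group per row, keyed by an assumed '<row> Hosts' name."""
    groups: dict[str, list[str]] = defaultdict(list)
    # Sort so that group membership is listed in a stable order.
    for host, row in sorted(host_row.items()):
        groups[f"{row} Hosts"].append(host)
    return dict(groups)


if __name__ == "__main__":
    for name, members in build_host_groups(HOST_ROW).items():
        print(f"{name}: {', '.join(members)}")
```

Because each host maps to exactly one row, every host lands in exactly one group, which keeps the groups aligned with the UPS fault domains.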

We’ve now grouped the hosts together along our row boundaries, but ultimately this is about our application. How do we deal with the app’s VMs?

Dee Abson

Dee Abson is a technical architect from Alberta, Canada. He's been working in the field of technology for over 20 years and specializes in server and virtualization infrastructure. Working with VMware products since ESX 2, he holds several VMware certifications. He is a 9x VMware vExpert. You can find him on Twitter and Mastodon.
