VMworld 2016 Roundup: Day 5

Architecting Site Recovery Manager to Meet Your Recovery Goals [STO7973]

One of the last sessions of VMworld 2016 was delivered by Ivan Jordanov and Gurusimran (GS) Khalsa. SRM depends on two important things: protection groups and recovery plans. Protection groups (PGs) are groups of VMs that fail over together to the recovery site. You can have different PGs depending on your replication types. A VM can only belong to one PG. Recovery plans (RPs) can contain many PGs, and a PG can belong to more than one RP. It may be a good idea to define array-based protection groups, where the collection of VMs is tied to a common piece of underlying storage.
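The relationships described above can be sketched as a small data model. This is an illustration only, not SRM's API: the class names, the helper `find_vms_in_multiple_groups`, and the inventory names are all hypothetical. It shows a PG being reused across two RPs, and a check for the rule that a VM can only belong to one PG.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectionGroup:
    name: str
    vms: set[str] = field(default_factory=set)


@dataclass
class RecoveryPlan:
    name: str
    groups: list[ProtectionGroup] = field(default_factory=list)


def find_vms_in_multiple_groups(groups):
    """Return {vm: [group names]} for every VM that appears in more than one PG."""
    seen = {}
    for pg in groups:
        for vm in pg.vms:
            seen.setdefault(vm, []).append(pg.name)
    return {vm: names for vm, names in seen.items() if len(names) > 1}


# Hypothetical inventory; the names are illustrative only.
web = ProtectionGroup("pg-web", {"web01", "web02"})
db = ProtectionGroup("pg-db", {"db01"})

# The same PGs can appear in several RPs, for example a per-application
# plan for partial failover and a whole-site plan.
app_plan = RecoveryPlan("rp-app", [web, db])
site_plan = RecoveryPlan("rp-site", [web, db])

# An empty result means no VM violates the one-PG-per-VM rule.
conflicts = find_vms_in_multiple_groups([web, db])
```

Running a check like this against an exported inventory before building plans is one way to catch overlapping groups early.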

A new style of PG is the storage policy protection group. These have a high level of autonomy, being policy based. It’s simpler to provision, migrate and decommission VMs, as the SRM requirements are handled via storage policy.

Bear in mind that the most common data centre failure scenario calls for a partial failover. This is when a critical application or service fails, but the entire data centre does not. You need to design for this scenario when figuring out your PGs and RPs. The latest version of SRM supports up to 250 protection groups, so use them wisely.

There are a variety of topologies supported by SRM:

  • Active-Passive Failover
  • Active-Active Failover
  • Bi-Directional Failover
  • Multi-Site Failover

SRM version 6.1 introduced support for stretched clusters, which only require one vCenter and will cross-vMotion VMs instead of using the typical power-off/power-on approach. This fits nicely with VSAN stretched clusters.

SRM also supports enhanced linked mode. Enhanced linked mode requires a standalone Platform Services Controller (PSC) in each environment. You then join all the PSC nodes into a single SSO domain. One of the benefits of this approach is that you can share a single SRM license across all of your sites. Plus it gives you a central point of management for all of your vCenters.

One of the biggest impacts on a business’ RTO is how long it takes to decide to fail over. After all, neither SRM nor any other DR plan can necessarily determine on its own whether a failover is appropriate. IP customization of the VMs contributes significantly to recovery time as well. Ideally a stretched layer-2 network would allow for the fastest technical recovery time, but it comes with increased complexity and operational overhead. An alternative to a stretched layer-2 network is to move the VLAN/subnet upon failover. This would require making routing changes at the time of failover, and of course would impact all systems in the moved subnet.

Remember when designing your DR plan to take priorities and dependencies into account. Depending on how you organize your groups, you may find that you have differing recovery times. The more LUNs and datastores there are, the more work SRM has to do, so try to make do with a reasonable minimum if recovery performance is a priority. Don’t replicate VM swap files. They will be rebuilt when the VMs come up at the recovery site. The fewer RPs you have, the faster your recovery times.
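To make the point about priorities and dependencies concrete, here is a minimal sketch of working out a start-up order from a dependency map. It uses Python's standard `graphlib` module rather than anything from SRM, and the service names and the `startup_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the services in its set.
depends_on = {
    "database": set(),
    "app-server": {"database"},
    "web-frontend": {"app-server"},
    "reporting": {"database"},
}


def startup_waves(deps):
    """Group services into waves; everything in one wave can start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = startup_waves(depends_on)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Mapping the waves out on paper like this before assigning VM priorities makes it easier to spot a long dependency chain that will stretch your recovery time.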

Make sure that all of your protected VMs have VMware Tools installed and running, ideally the latest version. This is needed for SRM to communicate with and make changes to the VMs. Set the VMs to suspend on recovery, and power off VMs at the source site at the start of failover. Both of these will improve your failover success, and allow you to choose how to bring up the recovery site.

Something that is easy to overlook is the size of your vCenter at the recovery site. It should be the same size as, or bigger than, the one at the protected site, so that it’s capable of handling all the recovered VMs. The more hosts you have at your recovery site, the better recovery time you’ll have, as SRM will be able to take advantage of parallelism. Enable DRS at the recovery site so that resource contention is addressed as the site comes up. Ideally have different recovery plans that target different clusters, if possible, to streamline the recovery process.

If you rely on scripts to make changes via SRM during a failover, make sure you set a script timeout. If you don’t and your script hangs, your recovery process will hang as well. A lot of best practices and guidance can be found in the SRM Installation and Configuration Guide, so give it a read. One of the best practices is to have a separate database for SRM rather than sharing the vCenter DB.
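The script timeout advice also applies inside your own scripts: any external command a recovery script calls should be bounded too. The following is a generic sketch using Python's standard `subprocess` module, not an SRM feature. The `run_with_timeout` helper is hypothetical, and the child commands simply stand in for a real recovery step.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or None if it exceeded the timeout."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode


# Stand-in for a step that hangs: sleeps far longer than the timeout allows.
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1)

# Stand-in for a healthy step that finishes promptly.
healthy = run_with_timeout([sys.executable, "-c", "print('done')"], 30)

print("hung step:", "timed out" if hung is None else hung)
print("healthy step exit code:", healthy)
```

Returning a distinct value for a timeout lets the calling script log the problem and decide whether to continue, instead of blocking the whole recovery.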

Most of all, be clear and forthcoming with the business. Make sure that you’ve mutually established your RTO, RPO, cost of downtime, application priorities, units of failure and any associated externalities. You should seek executive buy-in for these metrics as well as the DR approach. Try to have documented SLAs within the business for DR.

vRealize Infrastructure Navigator can be a useful tool for discovering and mapping technical dependencies.

SRM can be used for non-disruptive DR testing. Frequent DR testing reduces the risk of failure during a failover. Use VLANs or isolated networks in your test environment to keep your DR test from impacting production. You could specify different port groups for the test failovers versus the real failover. This is specified in the network mapping and/or recovery plans.
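The idea of separate port groups for test and real failovers can be sketched as a simple lookup table. This is not SRM's configuration format. The port group names, `NetworkMapping`, `target_network` and `leaky_test_networks` are all hypothetical, and are only meant to show the shape of the mapping and a sanity check on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkMapping:
    recovery_network: str  # port group used for a real failover
    test_network: str      # isolated port group used for test failovers


# Hypothetical mappings, keyed by protected-site port group.
MAPPINGS = {
    "prod-app-vlan10": NetworkMapping("dr-app-vlan10", "dr-test-isolated-10"),
    "prod-db-vlan20": NetworkMapping("dr-db-vlan20", "dr-test-isolated-20"),
}


def target_network(protected_network, is_test):
    """Pick the recovery-site port group for a test or a real failover."""
    mapping = MAPPINGS[protected_network]
    return mapping.test_network if is_test else mapping.recovery_network


def leaky_test_networks(mappings):
    """Return test networks that are also used as a real recovery network."""
    recovery = {m.recovery_network for m in mappings.values()}
    return sorted(
        m.test_network for m in mappings.values() if m.test_network in recovery
    )


print(target_network("prod-app-vlan10", is_test=True))
print(target_network("prod-app-vlan10", is_test=False))
print("leaky test networks:", leaky_test_networks(MAPPINGS))
```

A check like `leaky_test_networks` is a cheap way to confirm that no test failover can land on a network that carries real recovery traffic.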

Finally, don’t try to protect everything. Applications that have their own built-in resiliency or availability, such as Microsoft Active Directory Domain Services, are a good example. To support non-disruptive DR testing, you could use scripts during the test to clone the applications that aren’t protected by SRM, such as AD.

She’s Done

That’s it for VMworld US for another year. While some expected announcements didn’t drop, such as an announcement about the next version of vSphere, we still have VMworld 2016 Europe to look forward to. It’s with a heavy heart and tired feet that I say farewell to friends new and old. We’ll all see each other soon enough. In the meantime, there’s always the Interwebs. Stay classy, always.

Dee Abson

Dee Abson is a technical architect from Alberta, Canada. He's been working in the field of technology for over 20 years and specializes in server and virtualization infrastructure. Working with VMware products since ESX 2, he holds several VMware certifications. He was awarded VMware vExpert for 2014-2020. You can find him on Twitter.
