VMworld 2016 Roundup: Day 5

The last (half) day of VMworld 2016. This day always incorporates one last, non-VMware specific, general session, and offers breakout sessions for about half the day until the conference closes down. Off we go.

General Session

For general session commentary, check out the Live Blog. It was encouraging to see a full lineup of women engaging the audience in talks about their science and technology jobs. It was clear that these jobs were also their passion. As is typical for the Thursday general session, these talks weren't focused on VMware so much as they were on topics of general interest. We were presented with thoughtful and inspirational notions and ideas, and I hope this continues at future VMworlds.

Virtual SAN Management – Time to Level Up Your Ops Game! [MGT7770]

Jeff Godfrey & Rawlinson Rivera delivered a (relatively) slide-deck-free presentation, opting to demo almost the entire talk. Since it was freeform, I didn't have the benefit of a presented agenda or other structure to draw from. As such, this summary's going to consist mostly of bits and pieces of information I was able to capture from the talk. Note that the presentation was prefaced with a disclaimer that some or all of the topics presented may be forward-looking and there's no guarantee that VMware will deliver on any or all of them. Basically.

Analytics collected by Virtual SAN (VSAN) are stored in a storage object on VSAN, whose entries are overwritten on a 90-day cycle. If you want to keep the data longer (and you do), you need to roll up the data to vRealize Operations Manager. Remember that VSAN is just one part of an SDDC, so having the metrics in vROps is ideal. Also note that the performance metrics stored on VSAN storage are separate from the vCenter database, so there is no impact on vCenter performance (apparently this is a common question). When viewing VSAN data through vCenter, you can view the 90 days' worth of information through what is effectively a sliding window, where you can see 24 hours' worth at a time. When you roll up your VSAN data to vROps, you'll be able to keep up to 6 months' worth of data.

The VSAN management pack, needed to view VSAN data and access the default dashboards, is installed with all versions of vROps. You don't have to have Enterprise; any version of vROps will do. Customized dashboards can be easily created by copying components from the out-of-the-box VSAN dashboards. You should install the management pack(s) for your storage devices so that you can have visibility across your entire storage fabric.

VSAN has a limit of 9,000 components per host. vROps can provide alerts well in advance of an error, so proactive actions can be taken before a host hits that limit. The benefits of having vROps monitor those components in your VSAN clusters should be fairly evident.
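To make the idea of alerting ahead of the limit concrete, here's a minimal sketch in Python. It is not how vROps does it; the 80% warning threshold, the host names, and the component counts are all assumptions made up for illustration. Only the 9,000 figure comes from the session.

```python
COMPONENT_LIMIT_PER_HOST = 9000  # per-host limit quoted in the session
WARN_PERCENT = 80                # assumed warning threshold, not a VMware default


def hosts_near_limit(component_counts, warn_percent=WARN_PERCENT):
    """Return the hosts whose component count is at or above the warning threshold.

    component_counts maps a (hypothetical) host name to its current component count.
    """
    threshold = COMPONENT_LIMIT_PER_HOST * warn_percent // 100
    return sorted(
        host for host, count in component_counts.items() if count >= threshold
    )


if __name__ == "__main__":
    # Hypothetical inventory; real numbers would come from your monitoring tool.
    counts = {"esx01": 7300, "esx02": 3000, "esx03": 8999}
    print(hosts_near_limit(counts))
```

The point is simply that a warning fires while there is still headroom, rather than when the hard limit is reached.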

If you have vRealize Log Insight (and you should, since a 25-OSI pack is available with vCenter), you should have your VSAN trace logs feed into it. As of VSAN 6.2, the trace logs that are forwarded to vRLI can be made into human-readable information, which will help in understanding and troubleshooting your VSAN clusters.

The audience was treated to a demonstration of a customized multi-cloud monitoring dashboard in vROps, which was built by Sunny Dua of VMware. The dashboard allowed you to drill down through the VSAN components to get further information during your troubleshooting efforts. Note that vROps can automate DRS actions, and you can migrate a VM across clusters via XvMotion. Make sure that you're running vROps 6.3 or better to get full functionality.

Successfully Virtualize and Operate Your Microsoft Skype for Business Infrastructure on the VMware vSphere Platform [VIRT7620]

I was interested in attending this session on Skype for Business, delivered by Adam Ball, Rakesh Gajwani, and Hemal Doshi, as this seems to be a hot topic as of late. First, what is Skype for Business? Skype for Business (SfB) is a real-time communications platform. It requires negligible latency, as "real-time" is synonymous with "live", as far as end-users are concerned.

If it's not tuned properly, it can cause issues. High latency (greater than 250ms), packet jitter (the deviation in latency between packets), or a lack of infrastructure resources (compute, memory, storage IOPS) can all contribute to a poor experience.
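Since jitter is easy to confuse with latency itself, here's a rough sketch of the difference in Python. It treats jitter as the average absolute change in latency between consecutive packets, which is one common simplification and not necessarily what SfB's own tooling reports. The sample values are made up; the 250ms threshold is the one from the session.

```python
from statistics import mean

HIGH_LATENCY_MS = 250.0  # threshold mentioned in the session


def mean_jitter(latencies_ms):
    """Average absolute difference in latency between consecutive packets."""
    if len(latencies_ms) < 2:
        return 0.0
    deltas = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return mean(deltas)


def summarize(latencies_ms):
    """Report average latency, jitter, and whether latency exceeds the threshold."""
    avg = mean(latencies_ms)
    return {
        "avg_latency_ms": avg,
        "jitter_ms": mean_jitter(latencies_ms),
        "high_latency": avg > HIGH_LATENCY_MS,
    }


if __name__ == "__main__":
    samples = [40.0, 55.0, 42.0, 300.0, 48.0]  # hypothetical measurements
    print(summarize(samples))
```

Note that a stream can have an acceptable average latency and still suffer from high jitter, which is why the two are tracked separately.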

SfB has native high availability features, such as application pools and SQL mirroring & AlwaysOn Availability Groups. It also has a native disaster recovery feature, in the form of pool pairing. Note that vMotion can be enabled on SfB VMs, but you have to be mindful of when you execute it. Best practices for vMotion and DRS for SfB:

Enable vMotion.
Set to manual.
Perform vMotion only when needed and in line with change control.
Set DRS to partially automated.
Use anti-affinity rules to keep your SfB pool members apart.

A video demonstration was shown of what happens to a Skype for Business video conference when the SfB server is vMotioned. The call quality was affected, with the audio and video becoming interrupted for several seconds. This would be considered a problem by most SfB users, especially if the audio becomes interrupted.

Some of the potential pitfalls of virtualizing SfB on vSphere include:

Forgetting that virtualization is not the same as installing on physical.
- Take advantage of resource pooling, resource abstraction, and utilization "fairness", to make sure SfB gets the resources it needs.
More is not always better.
- Size your SfB VMs appropriately, and set up a performance baseline.
There are more "potential" choke points.
- Virtualization does add complexity, so make sure you know your platform.

Make sure that you use paravirtual SCSI adapters (PVSCSI). They have less CPU overhead and a larger queue depth. If necessary, increase your ring pages, and don't forget to configure Windows to match your queue depth and ring pages to take full advantage. Ensure that you have VMware Tools installed in your guest VMs, as they're necessary to support PVSCSI adapters. Additional SCSI controllers improve concurrency, so make sure you distribute your PVSCSI disks across your SCSI controllers (up to four controllers per VM).
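As a planning aid, spreading disks across controllers can be sketched as a simple round-robin assignment. This Python snippet only models the layout on paper; it doesn't touch vSphere, and the disk names are hypothetical. In practice you may prefer to group disks by workload (data versus logs, for example) rather than strictly round-robin.

```python
MAX_SCSI_CONTROLLERS = 4  # per-VM limit mentioned in the session


def distribute_disks(disks, controllers=MAX_SCSI_CONTROLLERS):
    """Assign disks to controller numbers 0..controllers-1 in round-robin order."""
    if not 1 <= controllers <= MAX_SCSI_CONTROLLERS:
        raise ValueError(
            f"controllers must be between 1 and {MAX_SCSI_CONTROLLERS}"
        )
    layout = {number: [] for number in range(controllers)}
    for index, disk in enumerate(disks):
        layout[index % controllers].append(disk)
    return layout


if __name__ == "__main__":
    # Hypothetical disk names for an SfB back-end VM.
    for controller, assigned in distribute_disks(
        ["os", "db", "log", "tempdb", "backup"]
    ).items():
        print(f"SCSI controller {controller}: {assigned}")
```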

When troubleshooting performance issues in SfB, you should prioritize audio over video quality. This may seem somewhat counterintuitive at first, but for the user experience it doesn't matter if you can be seen if you can't be heard.

VMware itself is in the midst of deploying Skype for Business for their own internal use. Why are they doing this? They have over 19,000 people, 30,000 devices, and 90 locations worldwide. They were looking to unify their communications platforms so that they had a single solution for the entire corporation. Their technical requirements included a need for dial-in capabilities, PRI/TDM support for non-SIP areas, integration with an existing Avaya PBX, and the ability to work with existing short dial codes. Skype for Business fit the bill.

So far VMware has two DCs deployed, serving 10,000 users. They have SIP trunks set up for their dial-in numbers. In the future they will be deployed to 8 data centres, serving 30,000 users with enterprise voice service reaching those 90 locations.

VMware released a white paper during VMworld: Best Practices Guide: Virtualizing Microsoft Skype for Business Server on VMware vSphere. If you're running or are interested in running SfB on vSphere, you need to read this white paper.

Architecting Site Recovery Manager to Meet Your Recovery Goals [STO7973]

This was one of the last sessions of VMworld 2016, delivered by Ivan Jordanov and Gurusimran (GS) Khalsa. SRM depends on two important things: protection groups and recovery plans. Protection groups (PGs) are groups of VMs that fail over together to the recovery site. You can have different PGs depending on your replication types. A VM can only belong to one PG. Recovery plans (RPs) can have many PGs, and PGs can belong to more than one RP. It may be a good idea to define array-based protection groups, where the collection of VMs is tied to a common piece of underlying storage.
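The relationships between VMs, PGs, and RPs are easier to see when written down as data. Here's a minimal sketch in Python that checks the "a VM can only belong to one PG" rule and lists which VMs a recovery plan would fail over. The group, plan, and VM names are invented, and this is only a paper model, not the SRM API.

```python
from collections import defaultdict


def find_vm_conflicts(protection_groups):
    """Return VMs that appear in more than one protection group.

    protection_groups maps a PG name to the list of VM names it contains.
    """
    membership = defaultdict(set)
    for pg_name, vms in protection_groups.items():
        for vm in vms:
            membership[vm].add(pg_name)
    return {vm: sorted(pgs) for vm, pgs in membership.items() if len(pgs) > 1}


def vms_in_plan(plan_pgs, protection_groups):
    """Return the sorted set of VMs that fail over when a recovery plan runs."""
    vms = set()
    for pg_name in plan_pgs:
        vms.update(protection_groups[pg_name])
    return sorted(vms)


if __name__ == "__main__":
    # Hypothetical layout: the same PG may appear in more than one recovery plan.
    pgs = {"pg-web": ["web01", "web02"], "pg-db": ["db01"]}
    plans = {"rp-app-only": ["pg-web", "pg-db"], "rp-full-site": ["pg-web", "pg-db"]}
    print(find_vm_conflicts(pgs))
    print(vms_in_plan(plans["rp-app-only"], pgs))
```

Keeping one small plan per application alongside a full-site plan is one way to cover the partial-failure case without duplicating protection groups.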

A new style of PG is the storage policy protection group. These have a high level of autonomy, being policy based. It's simpler to provision, migrate, and decommission VMs, as the SRM requirements are handled via storage policy.

Bear in mind that the most common data centre failure scenario is a partial failover. This is when a critical application or service fails, but the entire data centre does not. You need to design for this scenario when figuring out your PGs and RPs. The latest version of SRM supports up to 250 protection groups, so use them wisely.

There are a variety of topologies supported by SRM:

Active-Passive Failover
Active-Active Failover
Bi-Directional Failover
Multi-Site Failover

SRM version 6.1 introduced support for stretched clusters, which requires only one vCenter, and will cross-vMotion VMs instead of using the typical power-off/power-on approach. This fits nicely with VSAN stretched clusters.

SRM also supports enhanced linked mode. Enhanced linked mode requires a standalone Platform Services Controller (PSC) in each environment. You then join all the PSC nodes into a single SSO domain. One of the benefits of this approach is that you can share a single SRM license across all of your sites. Plus it gives you a central point of management for all of your vCenters.

One of the biggest impacts on a business's RTO is how long it takes to decide to fail over. After all, SRM or any other DR plan can't necessarily determine on its own whether a failover is appropriate. IP customization of the VMs significantly contributes to recovery time as well. Ideally a stretched layer-2 network would allow for the fastest technical recovery time, but it comes with increased complexity and operational overhead. An alternative to having stretched layer-2 is to move the VLAN/subnet upon failover. This would require making routing changes at the time of failover, and of course would impact all systems in the moved subnet.

Remember when designing your DR plan to take priorities and dependencies into account. Depending on how you organize your groups, you may find that you have differing recovery times. The more LUNs and datastores there are, the more work SRM has to do, so try to make do with a reasonable minimum if recovery performance is a priority. Don't replicate VM swap files. They will be rebuilt when the VMs come up at the recovery site. The fewer RPs you have, the faster your recovery times.

Make sure that all of your protected VMs have VMware Tools installed and running, ideally the latest version. This is needed for SRM to communicate with and make changes to the VMs. Set the VMs to suspend on recovery, and power off VMs at the source site at the start of failover. Both of these will improve your failover success, and allow you to choose how to bring up the recovery site.

Something that is easy to overlook is the size of your vCenter at the recovery site. It should be the same size as, or bigger than, the one at the protected site, to make sure that it's capable of handling all the recovered VMs. The more hosts you have at your recovery site, the better recovery time you'll have, as SRM will be able to take advantage of parallelism. Make sure you enable DRS at the recovery site, so that resource contention is addressed as the site comes up. Ideally have different recovery plans that target different clusters, if possible, to streamline the recovery process.

If you rely on scripts to make changes via SRM during a failover, make sure you set a script timeout. If you don't and your script hangs, your recovery process will hang as well. A lot of best practices and guidance can be found in the SRM Installation and Configuration Guide, so give it a read. One of the best practices is to have a separate database for SRM; don't share the vCenter DB.
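The same defensive habit applies inside your own scripts: any external command a recovery step calls should have an upper bound on how long it can run. Here's a small Python sketch of that idea using the standard library's subprocess timeout. The function name and the sample commands are hypothetical, and this complements SRM's own timeout setting rather than replacing it.

```python
import subprocess
import sys


def run_step(command, timeout_seconds):
    """Run one recovery step; return True on success, False on failure or timeout."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed if it runs past this
        )
    except subprocess.TimeoutExpired:
        # A hung step is reported as a failure instead of blocking forever.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical steps: one that finishes quickly, one that would hang.
    quick = [sys.executable, "-c", "print('dns updated')"]
    hung = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_step(quick, timeout_seconds=30))
    print(run_step(hung, timeout_seconds=2))
```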

Most of all, be clear and forthcoming with the business. Make sure that you've mutually established your RTO, RPO, cost of downtime, application priorities, units of failure and any associated externalities. You should seek executive buy-in for these metrics as well as the DR approach. Try to have documented SLAs within the business for DR.

vRealize Infrastructure Navigator can be a useful tool for discovering and mapping technical dependencies.

SRM can be used for non-disruptive DR testing. Frequent DR testing reduces the risk of failure during a failover. Use VLANs or isolated networks in your test environment to keep your DR test from impacting production. You could specify different port groups for the test failovers versus the real failover. This is specified in the network mapping and/or recovery plans.

Finally, don't try to protect everything. For example, applications that have their own built-in resiliency or availability, such as Microsoft Active Directory Domain Services, don't need SRM protection. To support non-disruptive DR testing, you could clone those applications not protected by SRM, such as AD, by using scripts during the test.

She's Done

That's it for VMworld US for another year. While some expected announcements didn't drop, such as an announcement about the next version of vSphere, we still have VMworld 2016 Europe to look forward to. It's with a heavy heart and tired feet that I say farewell to friends new and old. We'll all see each other soon enough. In the meantime, there's always the Interwebs. Stay classy, always.
