VMworld 2015 Roundup: Day 5

The final VMworld catchup post, promise! In the spirit of No Post Left Behind, here is my Thursday summary for VMworld 2015 US.

General Session

For General Session commentary, check out the Live Blog. Looking back over the live commentary, the ImageNet presentation was a good example of machine learning, which is an even more salient topic in IT these days. Greg Gage's converged demos (converged cockroach, converged human), while perhaps meant to mirror the topic of infrastructure convergence, seem to have more relevance to IoT. Finally, David Eagleman's talk on the limitations of human sensory input reminds me of how we truly don't know what we don't know.

It certainly is interesting how perceptions and lessons ebb and warp when reflected in time's rear-view mirror.

Physical Cluster to Virtual Failover Using Site Recovery Manager [STO5053]

Brad Pinkston, Joe Doss, and Dave Hellman gave us an inside look at disaster recovery at Levi Strauss. Levi Strauss needed to set up a disaster recovery solution for thousands of VMs. Those VMs ran Tier 1 workloads, so the DR needed to be highly automated. As part of this effort, their physical SQL server clusters, which provided services to VMs, needed to be supported by a 100% virtual disaster recovery solution.

To accomplish this, a virtual SQL server node was added to each cluster, and VMware Site Recovery Manager was used to protect those virtual nodes. Some lessons they learned during this exercise:

The SRM post-power-on tasks needed to force SQL to start with the /forcequorum flag so that it could come up without a quorum.
The quorum mode had to be set to Disk Only.
Windows Server 2008 and Windows Server 2008 R2 have slightly different locations and parameters, and both had to be accounted for.
The Perennially Reserved flag had to be set on all recovered RDMs.
PowerCLI was installed on the SRM server to execute scripts to make these changes.
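
As a rough illustration of the Perennially Reserved step, here is a minimal sketch that only builds the esxcli command lines to run on each host. It is not the team's actual tooling: they used PowerCLI scripts, the session did not show the scripts themselves, and Python is used here purely for illustration. The command syntax follows VMware KB1016106; the device IDs and the function name are hypothetical placeholders.

```python
import shlex


def perennially_reserved_commands(device_ids, reserved=True):
    """Build one esxcli command line per RDM device ID.

    This only builds strings. Each command still has to be run on
    every ESXi host that can see the RDM LUN.
    """
    flag = "true" if reserved else "false"
    return [
        "esxcli storage core device setconfig "
        f"-d {shlex.quote(device_id)} --perennially-reserved={flag}"
        for device_id in device_ids
    ]


# Hypothetical device IDs; substitute the naa IDs of your own RDM LUNs.
rdm_devices = [
    "naa.60000000000000000000000000000001",
    "naa.60000000000000000000000000000002",
]

for command in perennially_reserved_commands(rdm_devices):
    print(command)
```

Generating the commands from a list of device IDs keeps the change repeatable across hosts, which matters when the flag has to be set for every RDM LUN on every host.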

Levi Strauss ran into some technical challenges during their DR testing as well:

Hosts hang when cluster LUNs are masked.
- Per VMware KB1016106 the Perennially Reserved flag had to be set for all RDM LUNs on each host.
RDMs were not mapping on the recovered VMs due to timeouts.
- Had to make sure hosts were at a minimum of ESXi 5.5.
- Storage timeouts were increased and rescan parameters had to be tuned. This may be environment specific and your mileage may vary.
Cluster network name resources were not coming online.
- Recovered VMs couldn't find DNS.
- The primary DNS zone must be reachable.
- Their workaround was to disable the "DNS registration must succeed" parameter.
Active Directory computer account passwords changed on recovery.
- Their workaround was to run a Repair AD object on the cluster's network name resource.

Ultimately, Levi Strauss was able to complete a full DR restore in 4 hours versus the 72 hours it had taken them before. The infrastructure components of that were up in 2 hours. A significant improvement.

For future DR, they noted that vSphere 6 allowed them to vMotion shared disk WSFC nodes, and that ESXi 6.0 Update 1 with vCenter Server 5.5 Update 3 supports SQL 2012 Always-On Availability Groups (AAGs).

vSphere Distributed Switch 6.0 – Technical Deep Dive [NET4976]

https://youtu.be/IJCbqxELrfg

Jason Nash and Chris Wahl brought back an updated version of their popular vDS technical deep dive session (sadly, it didn't get accepted for VMworld 2016). As you can see, the session was recorded and made available on YouTube for everyone to watch. So, rather than recap the information from the session, I'll highlight some takeaways I found especially useful or interesting.

Remember the Nexus 1000v? VMware stopped selling it, but Cisco continues to sell it and it's still supported in vSphere 6. I do wonder how many customers actually rely on it, though.

Based on comments about versions of network components and available features, it's a good idea to upgrade to the most current version of the components. This was driven home by the fact that vDS 4.0 is no longer an available option in vSphere 5 and up. Upgrading should be non-disruptive, but take a soft maintenance window, just in case.

The Traffic Placement Engine in NIOC 3.0 is quite powerful. However, make sure that you're solving an existing problem when choosing to implement NIOC. As always, keep things as simple as possible.

I found the discrete, multiple TCP/IP stacks within vSphere to be quite interesting. This session was where I was introduced to this feat of engineering. It makes sense, as it isolates traffic types for both performance and fault isolation purposes.

Whether you're a vSphere admin or a network admin, I urge you to watch this session to get a better understanding of networking within vSphere.