Upgrading vSphere 5.5 ‘Simple Install’ with SRM and Linked Mode to vSphere 6

A fairly common deployment topology with vSphere 5.5 was to use the ‘Simple Install’ method which placed all the individual vCenter components onto a virtual or physical vCenter Server.

This would then hook into an external virtual or physical SRM server, with Linked Mode used for ease of management.

An example vSphere 5.5 topology is shown below.

vSphere 5.5 Simple Install

As well as the normal considerations with vSphere upgrades around:

  • Hardware compatibility and firmware versions
  • Component interoperability
  • Database compatibility
  • vCenter Plugins
  • VM Hardware & Tools
  • Backup interoperability
  • Storage interoperability

We now have to consider the Platform Services Controller.

Platform Services Controller

The Platform Services Controller is a group of infrastructure services containing vCenter Single Sign-On, License Service, Lookup Service and VMware Certificate Authority.

vCenter SSO: Provides secure authentication services between components using secure token exchange, rather than relying on a third party such as Active Directory.

vSphere License: Provides a common license inventory and management capabilities.

VMware Certificate Authority: Provides signed certificates for each component.

The issue arises with the vCenter SSO component, as most people would have opted for the vSphere 5.5 ‘Simple Install’.  This means you end up with an embedded Platform Services Controller; see ‘How vCenter Single Sign-On Affects Upgrades’.

The embedded Platform Services Controller topology has been deprecated by VMware; see ‘List of Recommended Topologies for VMware vSphere 6.0.x’.  This is also confirmed in the VMware Site Recovery Manager 6.1 documentation under ‘Site Recovery Manager in a Two-Site Topology with One vCenter Instance per Platform Services Controller’.

What Does This Mean?

Due to the architectural changes between vSphere 5.5 and 6, you cannot perform an in-place upgrade from vSphere 5.5 to vSphere 6 if you originally selected ‘Simple Install’, as you will end up with a deprecated topology.

The only choice will be a new vCenter 6 using the topology shown below.

vSphere 6 PSC with SRM

This also means you will need to deploy an extra two virtual machines to support this configuration.

How To: Perform a SRM Unplanned Failover & Maintain ‘Business As Usual’ Operations

SRM Logical

Purpose

The purpose of this blog post is to provide the steps required to perform a Site Recovery Manager unplanned failover and maintain business as usual operations.  I performed these steps twice on a client’s live production environment with users accessing production virtual machines at the ‘source’ site.  The users noticed no impact to their daily work activities.

Pre-Requisites

The pre-requisites listed below had been discussed with the client and change control invoked for the following items:

  • vCenter and Site Recovery Manager would not be accessible during the unplanned failover
  • vSphere Client 5.5 U2 is used to enable editing of virtual machines with hardware level 10
  • Source vCenter and Site Recovery Manager ‘pinned’ to an ESXi Host using DRS Groups Manager ‘should’ rules to enable easy location of virtual machines
  • Replication stopped for the production remote copy virtual volumes for the duration of the test
  • Test virtual volume created and presented to ESXi Hosts using an existing Host Set
  • Test virtual machine created using Mike Brown’s Tiny VM to minimise inter-site link bandwidth consumption.  Note this doesn’t have VMware Tools installed.
  • Remote Copy IP and Management Interfaces for 3PAR StoreServ had been located on upstream switch

Step One – Isolate Storage

Isolate the 3PAR StoreServ at the ‘source’ site by issuing the ‘shutdown’ command on the Management and Remote Copy IP (RCIP) interfaces on the upstream switch.

If RCIP traffic and Management traffic are on the same subnet, RCIP traffic will traverse the Management interfaces.

Verify that you can no longer ping the RCIP interfaces and that your Remote Copy Groups are in a ‘Stopped’ status.
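The ‘Stopped’ check can also be scripted. The block below is a minimal sketch, assuming the text of `showrcopy groups` has already been captured (for example over SSH) and is laid out like the sample taken during these tests: a group row followed by one row per volume pair. The function names and the parsing rules are illustrative assumptions, not part of the 3PAR CLI.

```python
# Sample text in the shape of `showrcopy groups` output from these tests.
# The column layout is an assumption based on that single capture.
SAMPLE = """\
Name                   Target          Status   Role       Mode      Options
SCC_SRMTEST01.r398979  StoreServ-7200  Stopped  Secondary  Periodic  Period 15m, over_per_alert
  LocalVV       ID     RemoteVV      ID     SyncStatus  LastSyncTime
  SRMTEST01_DR  14096  SRMTEST01_PR  16598  Stopped     2015-04-21 14:22:57 BST
"""


def volume_sync_states(output):
    """Map each LocalVV name to the value in its SyncStatus column."""
    states = {}
    in_volumes = False
    for line in output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Name":
            # A blank line or a new group header ends the volume rows.
            in_volumes = False
        elif fields[0] == "LocalVV":
            # The volume header: the rows that follow are volume pairs.
            in_volumes = True
        elif in_volumes and len(fields) >= 5:
            # Columns: LocalVV, ID, RemoteVV, ID, SyncStatus, ...
            states[fields[0]] = fields[4]
    return states


def all_stopped(output):
    """True only if at least one volume was found and every one is Stopped."""
    states = volume_sync_states(output)
    return bool(states) and all(s == "Stopped" for s in states.values())


if __name__ == "__main__":
    print(volume_sync_states(SAMPLE))
    print(all_stopped(SAMPLE))
```

Returning False when no volumes are parsed is deliberate: an empty or unexpected capture should not be read as confirmation that replication has stopped.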

Step Two – vCenter & SRM

Connect to the ESXi Host that runs the vCenter and Site Recovery Manager virtual machines and manually disconnect their virtual NICs.

Result

Using the above process, we have isolated the 3PAR StoreServ, vCenter and Site Recovery Manager virtual machines.  This simulates having an inter site link failure, but enables users to continue to access virtual machines at the source site.

Perform your unplanned failover on the Test Virtual Volume and then issue the ‘no shutdown’ command against your 3PAR StoreServ Remote Copy and Management interfaces.  Then finally reconnect the virtual NICs on your vCenter and Site Recovery Manager virtual machines.

3PAR StoreServ & Site Recovery Manager Expected Behaviour

Purpose

The purpose of this post is to document the expected behaviour of the 3PAR StoreServ 7x00 and VMware Site Recovery Manager in both a ‘planned failover’ and ‘unplanned failover’.

Environment

The tests were performed on two different environments, each containing the same infrastructure.

  • vCenter 5.5 Update 2 (Build 2001466)
  • Site Recovery Manager 5.8.0 (Build 2056894)
  • HP 3PAR SRA 5.5.2.285
  • HP 3PAR Inform OS 3.1.3 (MU1) P03, P07, P09

3PAR Details

Prominent details about the 3PAR configuration are highlighted below.

  • Single common provisioning group used for virtual volumes and remote copy space
  • Auto LUN ID used
  • Auto Recover enabled
  • Asynchronous periodic replication using a 15 minute interval schedule
  • Virtual Volumes are presented to a Host Set at Source Site and are ‘Exported’
  • Virtual Volumes are presented to a Host Set at Target Site and are ‘Un-Exported’

During the tests, I was logged into the source and destination 3PAR StoreServ’s and issued the following 3PAR CLI commands to observe behaviour state.

  • showrcopy groups SRMTEST01*
    • Shows state of the remote copy group at each location
  • showrcopy links
    • Shows the status of the remote copy links at each location
  • showvv
    • Shows the virtual volume information at each location

Planned Failover

Planned Failover is when the source and target sites are both up.

The table below shows the observed behaviour on the 3PAR StoreServ at both the source and target sites along with the SRM workflow step.

SRM Workflow Step | Source RG Name | Source Role   | Destination RG Name | Destination Role | Sync State
Pre Failover      | SRMTEST01      | Primary       | SRMTEST01.r398979   | Secondary        | Synced/Syncing
Planned Failover  | SRMTEST01      | Primary       | SRMTEST01.r398979   | Primary-Rev      | Stopped
Reprotect         | SRMTEST01      | Secondary-Rev | SRMTEST01.r398979   | Primary-Rev      | Synced/Syncing
Planned Failback  | SRMTEST01      | Primary       | SRMTEST01.r398979   | Secondary        | Stopped
Reprotect         | SRMTEST01      | Primary       | SRMTEST01.r398979   | Secondary        | Synced/Syncing
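The observed sequence in the table above can be kept as data and compared with what the arrays report during a later test. This is a minimal sketch: the tuples are transcribed from the table, while `first_deviation` and the idea of feeding it manually recorded states are illustrative assumptions.

```python
# (workflow step, source role, destination role, sync state), transcribed
# from the planned failover table. Nothing here queries the arrays.
PLANNED_SEQUENCE = [
    ("Pre Failover",     "Primary",       "Secondary",   "Synced/Syncing"),
    ("Planned Failover", "Primary",       "Primary-Rev", "Stopped"),
    ("Reprotect",        "Secondary-Rev", "Primary-Rev", "Synced/Syncing"),
    ("Planned Failback", "Primary",       "Secondary",   "Stopped"),
    ("Reprotect",        "Primary",       "Secondary",   "Synced/Syncing"),
]


def first_deviation(observed):
    """Return the index of the first step that differs from the table, or None."""
    for index, (expected, seen) in enumerate(zip(PLANNED_SEQUENCE, observed)):
        if tuple(seen) != expected:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical run in which the first Reprotect left the source as Primary.
    run = list(PLANNED_SEQUENCE)
    run[2] = ("Reprotect", "Primary", "Primary-Rev", "Stopped")
    print(first_deviation(run))
```

A deviation only means the run differed from these particular tests; it is a prompt to investigate, not proof of a fault.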

Unplanned Failover – Source Site Down

An Unplanned Failover is when the source site is down and the target site is up.

Before the Unplanned Failover workflow is instigated, the 3PAR StoreServ, vCenter and SRM virtual machines are isolated in the source site.

Note: This particular test was performed during production hours with users accessing the source virtual machines for business as usual activities.  I will create a further blog post on how to achieve this.

The table below shows the observed behaviour on the 3PAR StoreServ at both source and target sites along with the SRM workflow step when the inter-site link is down.

SRM Workflow Step  | Source RG Name | Source Role           | Destination RG Name | Destination Role | Sync State
Unplanned Failover | SRMTEST01      | Primary (Unconfirmed) | SRMTEST01.r398979   | Primary-Rev      | Stopped

The details below describe the behaviour observed and any error messages encountered.

  • 60 seconds is the timeout value for 3PAR remote copy to see the inter site link as down
  • showrcopy groups SRMTEST01* command run to verify that the SyncStatus field displays ‘Stopped’

StoreServ-7200 cli% showrcopy groups SRMTEST01*

Name                   Target          Status   Role       Mode      Options
SCC_SRMTEST01.r398979  StoreServ-7200  Stopped  Secondary  Periodic  Period 15m, over_per_alert
  LocalVV       ID     RemoteVV      ID     SyncStatus  LastSyncTime
  SRMTEST01_DR  14096  SRMTEST01_PR  16598  Stopped     2015-04-21 14:22:57 BST

  • showrcopy links command run on target 3PAR StoreServ to verify partner link is down

StoreServ-7200 cli% showrcopy links

Remote Copy System Information
Status: Started, Normal

Link Information

Target          Node   Address      Status  Options
StoreServ-7200  0:3:1  172.16.1.10  Down
StoreServ-7200  1:3:1  172.16.1.11  Down
receive         0:3:1  receive      Up
receive         1:3:1  receive      Up

  • Target SRM Server Error Message displayed

SRM Error Message

  • Target SRM logs checked, which show this is expected behaviour as part of the SRM workflow: the target SRA tries to contact the source SRA but fails because the site is down.

Message [2015-04-21 14:35:47.272 ‘arrayMgm.GetRCTargetSysInfo’ 3PAR_3031 verbose (Process id=1652) (Thread id=1)] Complete: Info. Call. –> [2015-04-21 14:35:47.272 ‘discoverDevices.Run’ 3PAR_1013 error (Process id=1652) (Thread id=1)] Error. Peer array id <39897> is not a valid entry in the connected HP 3PAR Storage Server.
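The `showrcopy links` check above can be scripted in the same spirit. The sketch below assumes the link table has the column layout seen in the capture from these tests and treats rows whose target is `receive` as inbound links; both points are assumptions drawn from that single sample, and the function names are illustrative.

```python
# Link table in the shape captured on the target array during the test.
SAMPLE_LINKS = """\
Target          Node   Address      Status  Options
StoreServ-7200  0:3:1  172.16.1.10  Down
StoreServ-7200  1:3:1  172.16.1.11  Down
receive         0:3:1  receive      Up
receive         1:3:1  receive      Up
"""


def parse_links(output):
    """Return (target, node, address, status) tuples from the link table."""
    rows = []
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if fields[:4] == ["Target", "Node", "Address", "Status"]:
            in_table = True
        elif in_table and len(fields) >= 4:
            rows.append(tuple(fields[:4]))
    return rows


def outbound_links_down(output):
    """True when every link to a named partner array reports Down."""
    outbound = [row for row in parse_links(output) if row[0] != "receive"]
    return bool(outbound) and all(row[3] == "Down" for row in outbound)


if __name__ == "__main__":
    for row in parse_links(SAMPLE_LINKS):
        print(row)
    print(outbound_links_down(SAMPLE_LINKS))
```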

Unplanned Failover – Source Site Up

The inter-site link is re-established and source site checks are performed, which entail:

  • Services checked on source vCenter and SRM Server
    • SRM Service is stopped, which is expected behaviour as it cannot communicate with vCenter. SRM Service started

The next step is CRITICAL in the SRM workflow.   At this point the source and target sites both hold primary read/write copies of data.

SRM at the source site believes that replication is continuing and that nothing has changed!

A device refresh is needed so that SRM can use the HP 3PAR SRA to discover the state of the 3PAR StoreServ arrays.  Once done, ‘Failover in Progress’ should be displayed.

Failover In Progress

The table below shows the observed behaviour on the 3PAR StoreServ at both source and target sites along with the SRM workflow step when the inter-site link is up.

SRM Workflow Step | Source RG Name | Source Role   | Destination RG Name | Destination Role | Sync State
Source Site Up    | SRMTEST01      | Primary       | SRMTEST01.r398979   | Primary-Rev      | Stopped
Planned Failover  | SRMTEST01      | Primary       | SRMTEST01.r398979   | Primary-Rev      | Stopped
Reprotect         | SRMTEST01      | Secondary-Rev | SRMTEST01.r398979   | Primary-Rev      | Synced/Syncing
Planned Failback  | SRMTEST01      | Primary       | SRMTEST01.r398979   | Secondary        | Stopped
Reprotect         | SRMTEST01      | Primary       | SRMTEST01.r398979   | Secondary        | Synced/Syncing
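The ‘Source Site Up’ row is the state described as critical earlier: both arrays hold a writable copy and replication is stopped. A minimal sketch of flagging that condition from the role names in the table follows. Treating both `Primary` and `Primary-Rev` as writable is an assumption taken from this post's description, and the function name is illustrative.

```python
# Roles assumed to mean the array holds a read/write copy of the data.
WRITABLE_ROLES = {"Primary", "Primary-Rev"}


def both_sides_writable(source_role, destination_role, sync_state):
    """True when both sites hold a writable copy and replication is stopped."""
    return (
        source_role in WRITABLE_ROLES
        and destination_role in WRITABLE_ROLES
        and sync_state == "Stopped"
    )


if __name__ == "__main__":
    # 'Source Site Up' row from the table.
    print(both_sides_writable("Primary", "Primary-Rev", "Stopped"))
    # First 'Reprotect' row from the table.
    print(both_sides_writable("Secondary-Rev", "Primary-Rev", "Synced/Syncing"))
```

The same condition is also true straight after a planned failover, so the flag identifies the array state rather than which workflow produced it.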

Final Thoughts

Using 3PAR StoreServ with Site Recovery Manager provides easy to use workflow orchestration.  However, it is critical to understand the behaviour of each dependency and to identify and remediate any behaviour which is not expected.

The key step in an unplanned failover is to refresh your devices once the inter-site link is re-established.  If this is not done, you will be asking SRM to perform a workflow which is out of sync with the 3PAR StoreServ, which will result in a rebuild of your SRM environment and a call to HP and VMware support.

SRM: Reprotect Unsupported

When VMware Site Recovery Manager 5.0 was launched back in September 2011, a new feature set was added to give you the ability to perform ‘automated re-protection’ and ‘automated failback’ using array based replication.

The release notes for Site Recovery Manager 5.0 describe this feature set in more detail.

  • Automated Re-Protection.
    • Re-protection is a new extension to recovery plans for use only with array-based replication. Automated re-protect enables the environment at the recovery site to establish replication and protection of the environment back to the original protected site through a single click.
  • Automated Failback
    • Automated failback returns the entire environment to the originally protected primary site. This can only happen after re-protection has ensured that data replication and synchronization have been established to the original primary site. Failback will run the same workflow that was used to migrate the environment to the protected site, ensuring that the critical systems encapsulated by the recovery plan are returned to their original environment. Automated failback, like re-protection, is only available for use with array-based replication protected virtual machines.

SRM Conceptual Diagram v0.1

Background

Since the release of SRM 5.0 I have performed a number of production installations using ‘array based replication’.  As part of the verification of the platform, clients have requested that the following functional tests be performed with ‘test virtual machines’:

  1. Test Failover
    • Provide documented evidence that, in a planned or unplanned event, the business should be able to recover within defined SLAs.
  2. Planned Failover and Failback
    • Verify that, for an upcoming known event such as office refurbishment or other maintenance work, a planned failover to the disaster recovery site and a planned failback to the original protected site will work within SLAs.
  3. Unplanned Failover and Failback
    • Verify that, for an unknown event such as a power outage or WAN failure, an unplanned failover to the disaster recovery site and a planned failback to the original protected site (once service has been restored) can be achieved within SLAs.

All of these tests have passed, with a number of minor issues resolved along the way.  That’s the point of the test, right?

Reprotect Warning

During a recent installation of SRM using HP 3PAR StoreServ 7200 with ‘asynchronous’ protection across two remote copy groups, the first and second tests passed without issue.  It was when we performed the ‘unplanned failover and failback’ that the issue arose.

Unplanned Failover Process

The first step is to sever the inter-site link between the protected and recovery sites.  Once complete, you perform a Disaster Recovery failover in SRM at the recovery site.  This leaves the following tasks unresolved, as shown in the screenshot below.

  • Pre-Synch Storage
    • Replicate recent changes
  • Shutdown VM’s at Protected Site
    • Ensure virtual machine data is consistent
  • Prepare Protected VMs for Migration
    • Create a final snapshot of the volume on which the protected VM’s reside
  • Synchronize Storage
    • Perform a final storage synchronisation to cover all changes

DC02 When DC01 Back Online

When you bring the original protected site back online, a ‘Recovery’ is required, which performs the operations above that could not be completed.  In the screenshot below this has been completed successfully.

DC01 Recovery Performed AKA Planned Migration

This is the point at which a ‘Reprotect’ can be performed so that the original protected site becomes the recovery site.  At this moment we started to experience issues, with the following failure notification:

Failed to reverse replication for failed devices.   Cause: A storage operation on unknown consistency group ‘PG01’

A call was logged with HP and VMware, as the SRM logs showed that it was a storage provider fault and that the reverse replication command could not be issued.

2015-01-27T11:01:50.894Z [01664 error 'Recovery' ctxID=69310807 opID=bbdef04] Plan execution (reprotect workflow) failed; plan id: recovery-plan-1234, plan name: RP01, error: (dr.storageProvider.fault.StorageReverseReplicationFailed) {
-->    dynamicType = <unset>,
-->    faultCause = (dr.storage.fault.UnknownDeviceGroup) {
-->       dynamicType = <unset>,
-->       faultCause = (vmodl.MethodFault) null,
-->       id = "RP01",
-->       msg = "",
-->    },
-->    msg = "",
--> }
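When reading an entry like the one above, the useful part is the chain of nested fault types. The sketch below pulls that chain out with a regular expression. It assumes each nested fault appears as a parenthesised type name followed by an opening brace, as in this excerpt; the pattern and function name are illustrative and not part of any SRM tooling.

```python
import re

# Excerpt of the log entry above, trimmed to the fault structure.
LOG_EXCERPT = """\
Plan execution (reprotect workflow) failed; plan id: recovery-plan-1234, plan name: RP01, error: (dr.storageProvider.fault.StorageReverseReplicationFailed) {
-->    dynamicType = <unset>,
-->    faultCause = (dr.storage.fault.UnknownDeviceGroup) {
-->       dynamicType = <unset>,
-->       faultCause = (vmodl.MethodFault) null,
-->       id = "RP01",
-->       msg = "",
-->    },
-->    msg = "",
--> }
"""

# A dotted type name in parentheses, immediately followed by "{".
FAULT_TYPE = re.compile(r"\(([A-Za-z_][\w.]*)\)\s*\{")


def fault_chain(log_text):
    """Return nested fault type names, outermost first."""
    return FAULT_TYPE.findall(log_text)


if __name__ == "__main__":
    for depth, name in enumerate(fault_chain(LOG_EXCERPT)):
        print("  " * depth + name)
```

Here the innermost fault, `dr.storage.fault.UnknownDeviceGroup`, is the one that lines up with the ‘unknown consistency group’ wording in the failure notification.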

This is when things got interesting and in my opinion VMware decided to hide behind some rather ambiguous text.

Ambiguous Text

The text below is taken from the VMware Site Recovery Manager 5.8 Documentation Center:

‘If you performed a disaster recovery operation, you must perform a planned migration when both sites are running again. If errors occur during the attempted planned migration, you must resolve the errors and rerun the planned migration until it succeeds’

How do you perform a planned migration if you have performed a disaster recovery operation? There is no option for this, only ‘Recovery’, so what do they actually mean?  Well, the next paragraph states the following:

Reprotect is not available under certain conditions:

  • Recovery plans cannot finish without errors. For reprotect to be available, all steps of the recovery plan must finish successfully.
  • You cannot restore the original site, for example if a physical catastrophe destroys the original site. To unpair and recreate the pairing of protected and recovery sites, both sites must be available. If you cannot restore the original protected site, you must reinstall Site Recovery Manager on the protected and recovery sites.

So in our case all steps of the ‘Recovery’ operation had finished and we expected to be able to failback, considering that the same documentation under Reprotect Virtual Machines After a Recovery states:

‘After a recovery, the recovery site becomes the new protected site, but it is not protected yet. If the original protected site is operational, you can reverse the direction of protection to use the original protected site as a new recovery site to protect the new protected site.

Manually reestablishing protection in the opposite direction by recreating all protection groups and recovery plans is time consuming and prone to errors. Site Recovery Manager provides the reprotect function, which is an automated way to reverse protection.’

VMware Support Statement

After numerous back-and-forth exchanges, VMware’s answer was that, to perform a supported reprotect after an unplanned failover, you must carry out the following steps:

  • Delete your Recovery Plans
  • Delete your Protection Groups
  • Manually reverse replication on your storage
  • Re-create your Protection Groups
  • Re-create your Recovery Plans

Really VMware?

Final Thoughts

SRM is a mature, intelligent product that understands when a Disaster Recovery failover has been performed.

  • Why then do we have the options for ‘Recovery’ and ‘Reprotect’ if these are not supported in this scenario?
  • Why does SRM documentation not clearly state what is and isn’t supported?
  • Why is SRM not able to cope with this scenario?  Surely it should be supported.

This was new to me and my use cases for SRM have now reduced.  One of the key components of the product is to remove manual administration to mitigate risk of human errors.

The positive is that with this newfound knowledge I will be looking at alternative products such as Zerto to meet customer requirements.

HP StoreVirtual & SRM – Case Of The Missing Datastores

Problem Statement

Datastores do not show under Array Managers > Devices and therefore you cannot create Protection Groups.

No Datastores

Replicated datastores have virtual machines within them and replication has completed within the Centralized Management Console.

CMC Console

vSphere Console

Methodology

  • HP StoreVirtual SRA installed from HP StoreVirtual Storage
  • SRM server has an interface on the iSCSI subnet
  • .NET Framework 3.5.1 installed on SRM Server as without this you won’t be able to discover the Storage Replication Adapter

Solution

Even though your datastores are showing correctly and are replicating, the iSCSI initiator name (IQN) in vSphere and the initiator node name in the CMC must match character for character, in lower case.

vSphere IQN

CMC IQN

In my case the vSphere IQN contained DC01-ESXi01 in capitals, whereas the CMC IQN contained dc01-esxi01.

  • Change the IQN in vSphere to the same name but in lowercase characters; connectivity remains
  • Perform a rescan of Storage Devices and VMFS Volumes
  • Verify that datastores are now showing under Array Managers > Devices
  • Create Protection Groups

Datastores Working
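A quick way to spot this kind of mismatch before digging through SRM is to compare the two names as exact strings and again with case folded. The block below is a minimal sketch; the IQN values are hypothetical examples built around the host name in this post, not the actual names from this environment.

```python
def iqn_case_mismatch(vsphere_iqn, cmc_iqn):
    """True when the two names are the same apart from letter case."""
    return vsphere_iqn != cmc_iqn and vsphere_iqn.lower() == cmc_iqn.lower()


if __name__ == "__main__":
    # Hypothetical IQNs for illustration only.
    vsphere_iqn = "iqn.1998-01.com.vmware:DC01-ESXi01"
    cmc_iqn = "iqn.1998-01.com.vmware:dc01-esxi01"

    if iqn_case_mismatch(vsphere_iqn, cmc_iqn):
        print("Names differ only by case; suggested vSphere IQN:", vsphere_iqn.lower())
```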