SRM: Reprotect Unsupported

When VMware Site Recovery Manager 5.0 was launched back in September 2011, a new feature set was added to give you the ability to perform ‘automated re-protection’ and ‘automated failback’ using array-based replication.

The release notes for Site Recovery Manager 5.0 describe this feature set in more detail.

  • Automated Re-Protection.
    • Re-protection is a new extension to recovery plans for use only with array-based replication. Automated re-protect enables the environment at the recovery site to establish replication and protection of the environment back to the original protected site through a single click.
  • Automated Failback
    • Automated failback returns the entire environment to the originally protected primary site. This can only happen after re-protection has ensured that data replication and synchronization have been established to the original primary site. Failback will run the same workflow that was used to migrate the environment to the protected site, ensuring that the critical systems encapsulated by the recovery plan are returned to their original environment. Automated failback, like re-protection, is only available for use with array-based replication protected virtual machines.

SRM Conceptual Diagram v0.1

Background

Since the release of SRM 5.0 I have performed a number of production installations using ‘array based replication’.  As part of the verification of the platform, clients have requested that the following functional tests be performed with ‘test virtual machines’:

  1. Test Failover
    • Provide documented evidence that, in a planned or unplanned event, the business can recover within defined SLAs.
  2. Planned Failover and Failback
    • Verify that, for a known upcoming event such as an office refurbishment or other maintenance work, a planned failover to the disaster recovery site and a planned failback to the original protected site will work within SLAs.
  3. Unplanned Failover and Failback
    • Verify that, after an unforeseen event such as a power outage or WAN failure, an unplanned failover to the disaster recovery site and a planned failback to the original protected site (once service has been restored) can be achieved within SLAs.

All of these tests have passed, with a number of minor issues which were resolved along the way.  That’s the point of the test, right?

Reprotect Warning

The issue arose during a recent installation of SRM using HP 3PAR StoreServ 7200 with ‘a synchronous’ protection across two remote copy groups.  The first and second tests passed without issue.  It was when we performed the ‘unplanned failover and failback’ test that the problem appeared.

Unplanned Failover Process

The first step is to sever the intersite link between the protected and recovery sites.  Once complete, you perform a Disaster Recovery Failover in SRM at the Recovery Site.  This leaves the following tasks unresolved, as shown in the screenshot below.

  • Pre-Synch Storage
    • Replicate recent changes
  • Shutdown VMs at Protected Site
    • Ensure virtual machine data is consistent
  • Prepare Protected VMs for Migration
    • Create a final snapshot of the volume on which the protected VMs reside
  • Synchronize Storage
    • Perform a final storage synchronisation to cover all changes

DC02 When DC01 Back Online

When you bring the original protected site back online, a ‘Recovery’ is required to perform the operations above that could not be completed.  In the screenshot below this has completed successfully.

DC01 Recovery Performed AKA Planned Migration

This is the point at which a ‘Reprotect’ can be performed so that the original Protected site becomes the Recovery site.  It was here that we started to experience issues, with the following failure notification:

Failed to reverse replication for failed devices.   Cause: A storage operation on unknown consistency group ‘PG01’

A call was logged with HP and VMware, as the SRM log showed that it was a storage provider fault and that the reverse replication command could not be issued.

2015-01-27T11:01:50.894Z [01664 error 'Recovery' ctxID=69310807 opID=bbdef04] Plan execution (reprotect workflow) failed; plan id: recovery-plan-1234, plan name: RP01, error: (dr.storageProvider.fault.StorageReverseReplicationFailed) {
-->    dynamicType = <unset>,
-->    faultCause = (dr.storage.fault.UnknownDeviceGroup) {
-->       dynamicType = <unset>,
-->       faultCause = (vmodl.MethodFault) null,
-->       id = "RP01",
-->       msg = "",
-->    },
-->    msg = "",
--> }
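When reading an entry like this, the useful part is the chain of nested fault types: the outer fault says what failed, the inner faultCause says why. The following is a minimal sketch for pulling that chain out of a log excerpt. The function names and the regular expression are my own illustration, not part of SRM or any VMware tooling, and the excerpt is pasted with plain ASCII arrows and quotes.

```python
import re

# Copy of the log entry above; continuation lines start with "-->".
LOG_EXCERPT = """\
2015-01-27T11:01:50.894Z [01664 error 'Recovery' ctxID=69310807 opID=bbdef04] Plan execution (reprotect workflow) failed; plan id: recovery-plan-1234, plan name: RP01, error: (dr.storageProvider.fault.StorageReverseReplicationFailed) {
-->    dynamicType = <unset>,
-->    faultCause = (dr.storage.fault.UnknownDeviceGroup) {
-->       dynamicType = <unset>,
-->       faultCause = (vmodl.MethodFault) null,
-->       id = "RP01",
-->       msg = "",
-->    },
-->    msg = "",
--> }
"""

# An expanded fault is a dotted type name in parentheses followed by "{".
# "(vmodl.MethodFault) null" has no "{" and is skipped: it ends the chain.
# "(reprotect workflow)" contains a space, so it is not mistaken for a type.
FAULT_TYPE = re.compile(r"\(([A-Za-z_][\w.]*)\)\s*\{")


def fault_chain(log_text: str) -> list[str]:
    """Return the nested fault types, outermost first."""
    return FAULT_TYPE.findall(log_text)


def short_name(fault_type: str) -> str:
    """Strip the package prefix from a fault type name."""
    return fault_type.rsplit(".", 1)[-1]


if __name__ == "__main__":
    # Prints: StorageReverseReplicationFailed -> UnknownDeviceGroup
    print(" -> ".join(short_name(f) for f in fault_chain(LOG_EXCERPT)))
```

Read this way, the entry says that reversing replication failed because the storage layer did not recognise the device group it was asked about.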

This is when things got interesting and in my opinion VMware decided to hide behind some rather ambiguous text.

Ambiguous Text

The text below is taken from the VMware Site Recovery Manager 5.8 Documentation Center

‘If you performed a disaster recovery operation, you must perform a planned migration when both sites are running again. If errors occur during the attempted planned migration, you must resolve the errors and rerun the planned migration until it succeeds’

How do you perform a planned migration if you have performed a disaster recovery option? There is no option for this, only ‘Recovery’. What do they actually mean?  Well, the next paragraph states the following:

Reprotect is not available under certain conditions:

  • Recovery plans cannot finish without errors. For reprotect to be available, all steps of the recovery plan must finish successfully.
  • You cannot restore the original site, for example if a physical catastrophe destroys the original site. To unpair and recreate the pairing of protected and recovery sites, both sites must be available. If you cannot restore the original protected site, you must reinstall Site Recovery Manager on the protected and recovery sites.
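To make the first condition concrete, here is a toy model of how I read it. It is a sketch only: the class and method names are hypothetical and have nothing to do with the SRM API. It assumes that the four steps listed earlier are the ones that cannot run while the protected site is unreachable, and that reprotect is offered only once every step has succeeded.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    INCOMPLETE = "incomplete"  # could not run: protected site unreachable


# Steps from this post that need the protected site to be reachable.
NEEDS_PROTECTED_SITE = (
    "Pre-Synch Storage",
    "Shutdown VMs at Protected Site",
    "Prepare Protected VMs for Migration",
    "Synchronize Storage",
)


@dataclass
class RecoveryPlanRun:
    """Hypothetical record of the latest result of each plan step."""

    steps: dict[str, Status] = field(default_factory=dict)

    def run(self, protected_site_reachable: bool) -> None:
        """Run (or rerun) the plan, recording each step's outcome."""
        outcome = Status.SUCCESS if protected_site_reachable else Status.INCOMPLETE
        for step in NEEDS_PROTECTED_SITE:
            self.steps[step] = outcome

    def incomplete_steps(self) -> list[str]:
        return [s for s, st in self.steps.items() if st is Status.INCOMPLETE]

    def reprotect_available(self) -> bool:
        # Assumption taken from the documentation quoted above:
        # every step of the recovery plan must have finished successfully.
        return bool(self.steps) and not self.incomplete_steps()


if __name__ == "__main__":
    plan = RecoveryPlanRun()
    plan.run(protected_site_reachable=False)  # disaster recovery, link severed
    print(plan.reprotect_available())         # False
    plan.run(protected_site_reachable=True)   # rerun once both sites are up
    print(plan.reprotect_available())         # True
```

The model only covers whether reprotect is offered. It says nothing about whether the reprotect workflow then succeeds at the storage layer, which is where we hit the failure.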

So in our case all steps of the ‘Recovery’ operation had finished and we expected to be able to failback, considering that the same documentation under Reprotect Virtual Machines After a Recovery states:

‘After a recovery, the recovery site becomes the new protected site, but it is not protected yet. If the original protected site is operational, you can reverse the direction of protection to use the original protected site as a new recovery site to protect the new protected site.

Manually reestablishing protection in the opposite direction by recreating all protection groups and recovery plans is time consuming and prone to errors. Site Recovery Manager provides the reprotect function, which is an automated way to reverse protection.’

VMware Support Statement

After numerous exchanges back and forth, VMware’s answer was that, to perform a supported reprotect after an unplanned failover, you must carry out the following steps:

  • Delete your Recovery Plans
  • Delete your Protection Groups
  • Manually reverse replication on your storage
  • Re-create your Protection Groups
  • Re-create your Recovery Plans

Really VMware?

Final Thoughts

SRM is a mature, intelligent product that understands when a Disaster Recovery failover has been performed.

  • Why then do we have the options for ‘Recovery’ and ‘Reprotect’ if these are not supported in this scenario?
  • Why does SRM documentation not clearly state what is and isn’t supported?
  • Why is SRM not able to cope with this scenario?  Surely it should be supported.

This was new to me, and my use cases for SRM have now reduced.  One of the key purposes of the product is to remove manual administration and so mitigate the risk of human error.

The positive is that, with this newfound knowledge, I will be looking at alternative products such as Zerto to meet customer requirements.

7 thoughts on “SRM: Reprotect Unsupported”

  1. Hi, sorry to hear of this experience. Do you have an SR number for this case? Also, when you say:

    “How do you perform a planned migration if you have performed a disaster recovery option? There is no option for this only ‘Recovery’ what do they actually mean? ”

    To run a recovery plan in “planned migration” mode you simply run a ‘Recovery’ and one of the options you should see as a radio button selection is ‘Planned Migration’ or ‘Disaster Recovery’ (below which is a checkbox for ‘Forced Recovery’). Step 4 here: http://pubs.vmware.com/srm-58/topic/com.vmware.srm.admin.doc/GUID-D8125217-825C-426F-B275-225F99AAD16B.html

    When you perform a DR migration once the two sites are connected there are certain tasks SRM needs to carry out before reprotect is available. To get to that point once the sites are connected the first thing you do is simply re-run your recovery plan choosing “planned migration” and those tasks are executed. Successfully failed over vm’s are not affected.

    Are you saying you did that and were still unable to either see the reprotect button activate (sounds like it did as you seem to be running reprotect) or you saw the reprotect button from the outset but the workflow always failed with the errors you mentioned?

    Just trying to work out what happened in what order.

    Disclaimer: I work for VMware. Whilst I have seen the odd issue running reprotects over the last few years, they have always been down to configuration anomalies at the storage layer in the cases I’ve worked. Once resolved, the reprotect workflow completed successfully. In some cases bug fixes to the SRAs in use have been needed as well, so it’s also important to know what SRA version you are running along with the relevant SRM build.

    1. Thank you for the reply Lee, appreciate you taking the time out to respond.

      We performed a Disaster Recovery Failover. Then when the communication was re-established between both sites a Recovery task was executed (no planned migration option is available) which completes successfully. Then you have the radial button to Reprotect which is the point of failure.

      Note that during Planned Migration Failover, Reprotect and Failback work as expected.

      I’m not able to share the SR number on the internet due to the confidential nature of the client. I should be able to send this to you via your work email address if that is acceptable?

      1. Hi Craig
        I am a bit confused as the reprotect option is greyed out until a successful planned migration is run once the sites are connected. Someone must have run this. It is the default so maybe it was not noticed but when you click recovery as I said the next screen you see after acknowledging the warning is the option to choose whether to run in planned migration mode or DR mode. Default is planned migration and SRM will always attempt this. You should be able to see this was done if you look back at the recovery plan history tab. Once completed successfully you will then notice reprotect is available.

        It sounds like the reprotect prepare phase was hitting errors with your devices on the 3PAR array, but to figure out why we need the log bundles.

        Was the service request logged with VMware or HP support? I don’t have access to tickets logged with HP directly, but we may be able to request the logs if that was the case, unless you still have the ones you supplied? If it was a VMware ticket that’s a lot simpler. Just need the number. I would expect an issue like this should have reached our SRM escalation team, who are more than capable of root causing issues like this, but in your article it does not sound like a root cause was given and you were simply told to delete and recreate objects.

      2. A support ticket was raised with HP and VMware. In the end we had to orchestrate a conference call between HP Support and VMware Support and the outcome of this was the information I supplied.

        The SRM log shows that the reprotect failed due to an unknown consistency group. My belief was that this failure was caused due to the 3PAR StoreServ renaming the Remote Copy Group to PG01.rxxxxxx whereas SRM was looking for the original protection group name.

        However I wasn’t able to push this any further as I was informed it was an unsupported operation and we needed to manually reverse replication, delete recovery plans etc.

        More than happy to share the details with you via email if you can provide an address?

      3. I think you’ve supplied details to my colleague (Ben) so we have those. Will take a look and get back to you. The “unsupported” statement is certainly not the way I implement and use SRM; in fact, doing what you are doing has always worked. Need to look at the tickets to see what’s going on here. Thanks for your patience Craig.

  2. Craig, thanks for your assistance working with VMware support to iron out what the issue was that was encountered. I also appreciated the follow-up blog post that discussed how to successfully perform a reprotect and failback operation with SRM and 3PAR after an unplanned failover https://vmfocus.com/2015/04/28/3par-storeserv-site-recovery-manager-expected-behaviour/

    I’m glad our SRM support specialists were able to work with you to solve the problem.

    1. Ben, no problem at all, it was great working with the GSS team.

      I look forward to the updated SRM documentation, let me know when it’s released 🙂
