Carefully Plan Filer Names and Volume Names when using VMware Site Recovery Manager with NetApp / IBM N series SRA

I have been struggling with an issue when reprotecting a Protection Group in VMware Site Recovery Manager using the IBM N series (NetApp) Storage Replication Adapter (SRA).

The first step, Configure Storage to Reverse Direction, would fail with “Error – Failed to reverse replication for failed over devices. SRA command ‘prepareReverseReplication’ failed. Peer array ID provided in the SRM input is incorrect. Check the SRM logs to verify correct peer array ID.”

The setup was as follows:

Site-A
Array Manager – NSERIES01
Local device – /vol/NFS_VMware_Test

Site-B
Array Manager – DRNSERIES01
Local device – /vol/NFS_VMware_Test

NSERIES01:NFS_VMware_Test was SnapMirrored to DRNSERIES01:NFS_VMware_Test 

The volume had a single virtual machine in it with a single vmdk and was running at Site-A.

I had a protection group for this replication pair with the single virtual machine in it.

I could successfully recover the protection group at Site-B and reprotect it, so that the SnapMirror now ran from DRNSERIES01:NFS_VMware_Test to NSERIES01:NFS_VMware_Test and the Recovery Plan was created at Site-A. I could then recover the protection group back to Site-A (i.e. the new fail-back functionality in SRM 5). It was when I tried to reprotect the protection group again, so that it was ready for a subsequent recovery to Site-B, that I hit this issue.

I also had the issue if I started with a Protection Group at Site-B: after recovering it at Site-A, I could not perform a reprotect back to Site-B.

Checking the SRM logs showed the following:

validating if path NFS_VMware_Test is valid mirrored device for given peerArrayId

curr state = invalid, prev state = ,source name:DRNSERIES01:NFS_VMware_Test, destination:NSERIES01:NFS_VMware_Test, path=NFS_VMware_Test, arrayID:NSERIES01
Skipping /vol/NFS_VMware_Test as peerArrayId DRNSERIES01  is not valid

I placed a support call with VMware, who pointed out that the error message was coming from the SRA and that I would therefore need to place a call with IBM. IBM were struggling to find a reason for this error, and after IBM Level 2 support could not find the cause they passed it to NetApp to investigate. The first response from NetApp was to wait for the resync to complete before doing the Reprotect; however, it is the Reprotect that does the resync, so that response was obviously rubbish.

While I was waiting for IBM to come back to me I posted the issue on both the NetApp and VMware forums and reached out to the two best guys I know at VMware for SRM, Mike Laverick and Lee Dilworth. Both had good suggestions about making sure everything was set up exactly as detailed in the documentation, but I still did not have a solution. Links to the forum posts: NetApp and VMware.

So I decided to dig into the Perl scripts that the SRA uses and find the problem myself. After a lot of investigation and debugging of the Perl scripts used by the NetApp (IBM N series) SRA, I discovered the following:

The issue was that the filer name at the Protected Site was the filer name at the Recovery Site with a prefix on it: in my case the Recovery Site filer was named NSERIES01 and the Protected Site filer was DRNSERIES01. Remember I had already performed a fail-over and fail-back, so the Protected Site was my original Recovery Site; so yes, the filer on the Protected Site for this Protection Group was DRNSERIES01 and the Recovery Site had NSERIES01 on it.

When the Reprotect task runs, the first step calls the SRA with the command prepareReverseReplication, which in turn runs reverseReplication.pl to check that the SnapMirror is broken off. The script gets the status of all of the SnapMirrors from the filer at the Recovery Site (NSERIES01 in this case) and then goes through each of them looking for local-filer-name:volume-name in the source of the SnapMirror; for my test group it was looking for NSERIES01:NFS_VMware_Test. At this point the source of the SnapMirror is DRNSERIES01:NFS_VMware_Test, which is correct, but because the script uses an unanchored pattern match it matches NSERIES01:NFS_VMware_Test against DRNSERIES01:NFS_VMware_Test, since the former is contained within the latter. It then checks whether the destination of the SnapMirror matches the peerArrayID (DRNSERIES01 in this case), which it does not, because the destination is, correctly, NSERIES01, and so it reports that the peerArrayID is incorrect. Only if there is no match on local-filer-name:volume-name in the source does the script go on to check the destinations of the SnapMirrors; when it finds a match there it checks whether the peerArrayID matches the source of the SnapMirror and, if it does, verifies that the status of the SnapMirror is broken-off.
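
To make the matching behaviour concrete, here is a minimal Perl sketch of the check as I understand it. This is my own illustration of the logic, with my own variable names; it is not the actual code from reverseReplication.pl.

#!/usr/bin/perl
use strict;
use warnings;

# State after fail-over and fail-back: the broken-off SnapMirror runs from
# DRNSERIES01 (now the Protected Site) to NSERIES01 (now the Recovery Site).
my $sm_source      = 'DRNSERIES01:NFS_VMware_Test';   # SnapMirror source
my $sm_destination = 'NSERIES01:NFS_VMware_Test';     # SnapMirror destination
my $local_device   = 'NSERIES01:NFS_VMware_Test';     # local-filer-name:volume-name
my $peer_array_id  = 'DRNSERIES01';                   # peerArrayId passed in by SRM

# Unanchored pattern match: NSERIES01:NFS_VMware_Test is a substring of
# DRNSERIES01:NFS_VMware_Test, so the local device is wrongly treated as
# the SnapMirror source ...
if ($sm_source =~ /\Q$local_device\E/) {
    # ... and the destination (NSERIES01) then fails to match the
    # peerArrayId (DRNSERIES01), producing the "peer array ID is incorrect" error.
    if ($sm_destination !~ /^\Q$peer_array_id\E:/) {
        print "Skipping as peerArrayId $peer_array_id is not valid\n";
    }
}
else {
    # Only if the source does not match does the script move on to look for
    # the local device in the destination instead, which is the path the
    # first reprotect takes.
    print "would check the destination of the SnapMirror instead\n";
}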

I never hit the issue with the first reprotect because DRNSERIES01:NFS_VMware_Test is not contained within the source of the SnapMirror (NSERIES01:NFS_VMware_Test), so the script goes on to the next test of checking for DRNSERIES01:NFS_VMware_Test in the destination of the SnapMirror, which it finds; the peerArrayID check then also passes, and the script finally confirms that the SnapMirror relationship is broken-off.

I had changed the volume name on DRNSERIES01 a while ago because I thought the issue might be due to the volume names being the same, but I had changed it by putting a suffix of _Repl on the end, so the script was still matching NSERIES01:NFS_VMware_Test within DRNSERIES01:NFS_VMware_Test_Repl.
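
A couple of quick Perl checks, again my own illustration rather than the SRA’s actual code, show why the suffix on the volume name did not help:

use strict;
use warnings;

my $sm_source = 'DRNSERIES01:NFS_VMware_Test_Repl';   # source after the volume rename

# The unanchored pattern still matches, because only the end of the volume name changed ...
print "still matches\n" if $sm_source =~ /NSERIES01:NFS_VMware_Test/;

# ... whereas an exact comparison would not be fooled.
print "exact match\n" if $sm_source eq 'NSERIES01:NFS_VMware_Test';   # never prints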

I have now configured my test group as follows:

Protected Site
Filer Name = NSERIES01
Volume Name = NFS_VMware_Test_HMR

Recovery Site
Filer Name = DRNSERIES01
Volume Name = NFS_VMware_Test_EWC 

I can now recover to the Recovery Site, Reprotect, Fail-Back, Reprotect and Fail-Over again, and can keep performing recoveries and reprotects over and over again, as often as you can re-record on a Scotch VHS tape!

If the filer at Site-B had a suffix on its name instead of a prefix, e.g. if it were named NSERIES01DR (see the quick check below), or had a completely different name, then I would never have hit this bug in the SRA. I will be waiting for NetApp to fix the SRA; in the meantime I will be renaming all of my volumes at the recovery site to avoid this issue.
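
For example, a hypothetical suffixed filer name would never trigger the false source match in the first place:

use strict;
use warnings;

# NSERIES01:NFS_VMware_Test is not a substring of NSERIES01DR:NFS_VMware_Test,
# so even the unanchored pattern falls through to the destination check.
my $sm_source = 'NSERIES01DR:NFS_VMware_Test';
print "no false source match\n" unless $sm_source =~ /NSERIES01:NFS_VMware_Test/;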

UPDATE FROM VAUGHN STEWART: This is a known issue and will be fixed in an SRA 2.1.x release currently undergoing VMware certification.


10 Responses to Carefully Plan Filer Names and Volume Names when using VMware Site Recovery Manager with NetApp / IBM N series SRA

  1. Pingback: Carefully Plan Filer Names and Volume Names when using VMware Site Recovery Manager with NetApp / IBM N series SRA (Pelicano Hints & Tips) | VMware News

  2. Sam says:

    Hi – Nice blog. I experienced the same issue today because my volume names were both the same and now I am unable to Reprotect. Now that my VM is in my Recovery site, how did you manage to get it back to your protected site?
    Thanks, Sam

    • This is the method I used:
      Delete the recovery plan
      Delete the protection group
      Reverse Resync the SnapMirror relationship(s)
      You may need to refresh the devices on the array managers
      Recreate the protection group
      Recreate the recovery plan
      You should now be able to do a failover back to your original site.

      However, this alternative may also work:
      I guess the SnapMirror relationship is broken off
      Rename your volume(s) at the original site (the volumes that are not in use, as they are being used at the other site)
      Recreate SnapMirror relationships from the site the VMs are running at to the renamed volumes (as there is a common snapshot you should be able to do a resync)
      Remove the old SnapMirror relationship(s) that use the original volume name(s) – you may have to manually delete snapshots left over from these relationships
      Refresh the devices in the array manager
      You may now be able to reprotect – although it may expect to see the SnapMirror relationship broken off and in the other direction!

      • Sam says:

        Thanks, I went down the path of managing to break vCenter altogether and having to rebuild!
        New naming convention and new name. Both Test and Recovery were successful, but I am still unable to Reprotect.

        Error – Failed to reverse replication for failed over devices. SRA command ‘prepareReverseReplication’ failed. Address of the storage array is not reachable. Storage array might be down or IP address entered might be incorrect. Ensure that the storage array is up and running and the IP address of the storage array is reachable through the command line interface.

        I’ve logged a call with both HP (VMware) and Fujitsu (NetApp) to see if they have any more insight.

        • vCenter name should not matter.
          What is the name of your storage controllers at the Protected Site and the Recovery Site?
          What are the names of your volumes at the Protected and Recovery Site?

  3. Sam says:

    Protected Site
    Filer Name = NETAPP01W
    Volume Name = NETAPP01W_NFS_1

    Recovery Site
    Filer Name = NETAPP01A
    Volume Name = NETAPP01A_NFS_1

    I think I have missed out a vital step. Before I click Reprotect in SRM, do I have to manually go into my Recovery Site’s SnapMirror and click Reverse ReSync?

  4. VMCreator says:

    Hi Pelicano,

    Just a quick question. I am hoping you may help me.

    Say you have two NetApp NFS filers in an HA pair at the protected site (each with its own IP address) and two NetApp NFS filers in an HA pair at the recovery site (each with its own IP address).

    When you configure a SINGLE SRM array manager for each site, it asks for only one IP address for one of the filers at each site, and you can pair them accordingly.

    If a NetApp NFS filer HA failover occurs, the secondary filer then becomes the primary, SRM cannot reference this other IP address, and the array pairing will be disconnected.

    The field for several NFS addresses separated by commas is still tied to the controller IP address, so this does not help.

    I cannot find any information anywhere about how SRM works with a NetApp NFS-based HA pair.

    Other storage systems use a Virtual IP address (VIP) to do this, as it is a cluster.

    Cheers

    Rob

    • When the secondary filer does a takeover of the primary filer it should take over the IP address you use for NFS from the primary filer; otherwise your ESXi hosts would lose connectivity to the NFS volumes, as they should be mounted via IP address. Check the /etc/rc file on your filers to confirm that the partner filer takes over the IP address when a takeover occurs.

      I am not sure how well SRM will work in an HA takeover situation, as I have never tried this.

      • VMcreator says:

        Thanks for such a prompt reply. Yes, you are correct about the HA filer failover, and all of this works as designed.

        SRM, I believe, will break, as its Array Manager references the primary’s IP address. When the filer fails over, the secondary takes the primary role and SRM will not have this IP address. I have thought about DNS round robin.

        I have posted to the NetApp community; let’s see what transpires. I’ll let you know the outcome. A good one to think about, but I have inherited this infrastructure design and I may well be the bringer of bad tidings.

          • Rob, the primary filer’s IP address should move across to the secondary filer as part of the HA. If you can send me the /etc/rc file from each of the filers I will have a look for you; I will send you my email address. In the /etc/rc file, for each interface on the primary filer you can configure which interface on the secondary the IP address will be brought up on in a takeover situation, e.g.:

          ifgrp create lacp vif1 -b ip e0a e0b
          ifconfig vif1 192.168.100.101 netmask 255.255.255.0 partner vif1 trusted -wins
