
HOW-TO: Recover from failed Storage VMotion

A while ago I received a request from the storage department to move a whole ESX cluster to another storage I/O group. This would be a disruptive action.
I wondered whether Storage VMotion could help me out here if they assigned me some new storage on the other I/O group instead of moving the current LUNs.

If you are going to use sVMotion, I would strongly suggest using the sVMotion plug-in from lostcreations.

Well, sVMotion made it possible to move the cluster to the other I/O group online. This saved me a lot of time explaining to the customer why the entire cluster had to go offline. We are talking about 175+ VMs here.

So that was the path I chose, and it worked out fine, but I got so excited about sVMotion that I didn’t pay attention to the available storage space left on my new LUNs. So after a while a LUN filled up and the sVMotion process failed.

Whenever an sVMotion fails, you will probably end up in a situation where the config files have been moved to the new location while the .vmdk files and their accompanying snapshot files (sVMotion creates a snapshot in order to copy the .vmdk files) are still in the old location.
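You can see this split from the COS. A minimal sketch, assuming hypothetical datastore and VM names (`new_lun` and `myvm` are placeholders for your environment):

```shell
# List the registered .vmx paths; after the failed sVMotion the .vmx
# shows up under the NEW datastore (paths are hypothetical):
vmware-cmd -l

# The leftover migration state is visible inside that .vmx -- the
# DMotionParent entries still point at the disks in the OLD location:
grep -i DMotionParent /vmfs/volumes/new_lun/myvm/myvm.vmx
```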

Issuing another sVMotion will generate this error: “ERROR: A specified parameter was not correct. spec”. If you power off the VM, you get an extra option, “Complete Migration”. This option actually makes a copy of the .vmdk files to the same LUN, so it requires twice the space of the .vmdks on your LUN and, most importantly, it requires downtime.

Here’s what I did to resolve this split:

  • Create a snapshot of the VM. Since this option isn’t available via the vCenter GUI in this state, you have to do it in the COS or connect your VIC directly to the ESX host.

    • Through SSH console session:

      • Find the config_file_path of the VM:
        vmware-cmd -l
      • Create a snapshot of the vm
        vmware-cmd <config_file_path> createsnapshot snapshot_name snapshot_description 1 1

    • Through VIC:
      • Use GUI as normal.

  • Remove (Commit) the snapshots:
    vmware-cmd <config_file_path> removesnapshots
    This will remove the newly created snapshot AND the snapshot created by sVMotion.

  • vCenter still thinks the VM is in dmotion state, so you can’t edit settings, perform a VMotion, or do anything else via vCenter. To fix this, we need to clear the DMotionParent parameters in the .vmx file with the following commands from the COS:
    vmware-cmd <config_file_path> setconfig scsi0:0.DMotionParent ""
    vmware-cmd <config_file_path> setconfig scsi0:1.DMotionParent ""
    Do this for every DMotionParent entry in the .vmx file, so be sure to check your .vmx file to get the right SCSI IDs. Note that editing the .vmx file directly will not trigger a reload of the .vmx config file!

  • Now perform a new storage migration to move the .vmx configuration file back to its original location.

  • Clean up the destination LUN and remove any files/folders created by the failed sVMotion. We’re done and back in business without downtime!
    We can now retry the sVMotion.
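The recovery steps above can be sketched as one small COS script. The VMX path is a placeholder; substitute the config file path reported by `vmware-cmd -l`, and note the snapshot name and description are arbitrary:

```shell
#!/bin/sh
# Sketch of the recovery above, run on the ESX COS.
# Hypothetical path -- take the real one from `vmware-cmd -l`.
VMX="/vmfs/volumes/new_lun/myvm/myvm.vmx"

# 1. Create a throwaway snapshot (not possible via vCenter in this state).
vmware-cmd "$VMX" createsnapshot cleanup "pre-commit snapshot" 1 1

# 2. Commit ALL snapshots -- the new one AND the one left by sVMotion.
vmware-cmd "$VMX" removesnapshots

# 3. Clear every DMotionParent entry found in the .vmx; setconfig also
#    triggers a config reload, which a direct edit of the file would not.
for dev in $(grep -i DMotionParent "$VMX" | cut -d. -f1 | sort -u); do
    vmware-cmd "$VMX" setconfig "${dev}.DMotionParent" ""
done
```

The loop simply pulls the SCSI IDs (e.g. `scsi0:0`) off the front of each DMotionParent line, so it covers however many disks the VM has without hardcoding them.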

Of course I didn’t find all of this out by myself. All credit goes to Argyle from the VMTN Forum. You can read his original thread here.

Someone will probably say “Don’t try this at home”, but if you’re curious and do want to try this at home, use the following procedure to reproduce the split situation:

  • perform an sVMotion of a TEST vm
  • on the COS of the ESX host, issue:
    service mgmt-vmware restart
  • Have fun!!

