HOW-TO: Recover from a failed Storage VMotion
A while ago I received a request from the storage department to move a whole ESX cluster to another storage I/O-Group. This would be a disruptive action.
I was wondering if storage VMotion would help me out here if they assigned me some new storage on the other I/O-Group instead of moving the current LUNs.
If you are going to use sVMotion I would strongly suggest using the sVMotion plug-in from lostcreations.
Well, sVMotion made it possible to move the cluster to the other I/O-Group online. This would save me a lot of time explaining to the customer why the entire cluster had to go offline. We are talking about 175+ vms here.
So that was the path I chose, and it worked out fine, but I got so excited about sVMotion that I didn’t pay attention to the storage space left on my new LUNs. So after a while the LUN filled up and the sVMotion process failed.
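A pre-flight check along these lines could catch the full-LUN case before starting a migration. This is only a minimal sketch: the function names and the 20% headroom default are made up for illustration, and it relies on plain `du` and `df -P`. On an ESX 3.x service console, plain `df` may not report VMFS volumes (`vdf -h` is the VMFS-aware variant there), so treat the free-space helper as an assumption to verify on your own host.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any VMware tooling.

# fits_with_margin NEEDED_KB FREE_KB [MARGIN_PCT]
# Succeeds (exit 0) when NEEDED_KB plus a safety margin fits in FREE_KB.
fits_with_margin() {
    needed_kb="$1"
    avail_kb="$2"
    margin_pct="${3:-20}"   # assumed headroom for snapshot growth
    required_kb=$(( needed_kb + needed_kb * margin_pct / 100 ))
    [ "$required_kb" -le "$avail_kb" ]
}

# dir_used_kb DIR: space used by a VM folder, in KB.
dir_used_kb() {
    du -sk "$1" | awk '{print $1}'
}

# fs_free_kb DIR: free space in KB on the filesystem that holds DIR.
# -P forces one line per filesystem so column 4 is always "Available".
fs_free_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}
```

It could be used as `fits_with_margin "$(dir_used_kb "$vm_dir")" "$(fs_free_kb "$target_dir")" || echo "not enough room"`, where `vm_dir` and `target_dir` are placeholders for the VM folder and the mount point of the new LUN.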
Whenever a sVMotion fails you probably end up in a situation where the config files are moved to the new location and the .vmdk files and their accompanying snapshot files (sVMotion creates a snapshot in order to copy the .vmdk files) are still in the old location.
Issuing another sVMotion will generate this error: “ERROR: A specified parameter was not correct. spec” and if you turn off the vm you get an extra option “Complete Migration”. This option actually makes a copy of the .vmdk files to the same LUN, and hence it requires twice the space of the .vmdks on your LUN and most importantly it requires downtime.
Here’s what I did to resolve this split:
- Create a snapshot of the vm. Since it’s not available via the vCenter GUI in this state, you have to do this in the COS or connect your VIC directly to the ESX host.
  - Through an SSH console session:
    - Find the config_file_path of the VM.
    - Create a snapshot of the vm. Code:
vmware-cmd <config_file_path> createsnapshot snapshot_name snapshot_description 1 1
  - Through VIC:
    - Use the GUI as normal.
- Remove (commit) the snapshots. This will remove the newly created snapshot AND the snapshot created by sVMotion. Code:
vmware-cmd <config_file_path> removesnapshots
- vCenter still thinks the vm is in DMotion state, so you can’t edit settings, perform VMotion or anything else via vCenter. To fix this we need to clear the DMotionParent parameters in the .vmx file with the commands below, run from the COS. Do this for every DMotionParent entry in the .vmx file, so be sure to check your .vmx file to get the right SCSI IDs. Note that editing the .vmx file directly will not trigger a reload of the .vmx config file! Code:
vmware-cmd <config_file_path> setconfig scsi0:0.DMotionParent ""
vmware-cmd <config_file_path> setconfig scsi0:1.DMotionParent ""
- Now perform a new storage migration to move the .vmx configuration file back to its original location.
- Clean up the destination LUN and remove any files/folders created by the failed sVMotion.
We’re done and back in business again without downtime! We can retry the sVMotion now.
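Since the DMotionParent step has to be repeated for every disk, a small dry-run helper can list the exact commands for a given .vmx file. This is a sketch, not a tested recovery tool: the function name is made up, and it assumes the entries appear in the .vmx as lines of the form `scsiX:Y.DMotionParent = "..."`. It only prints the `vmware-cmd` lines so they can be reviewed before anything is run.

```shell
#!/bin/sh
# print_dmotion_cleanup VMX_FILE   (hypothetical helper)
# Prints, without executing, one "setconfig <key> \"\"" command for every
# DMotionParent entry found in the given .vmx file.
print_dmotion_cleanup() {
    vmx="$1"
    # Select the DMotionParent lines (case-insensitive), then keep only
    # the key to the left of the "=" sign, e.g. scsi0:0.DMotionParent.
    grep -i '\.DMotionParent[[:space:]]*=' "$vmx" |
    sed 's/[[:space:]]*=.*//; s/^[[:space:]]*//' |
    while read -r key; do
        printf 'vmware-cmd "%s" setconfig %s ""\n' "$vmx" "$key"
    done
}
```

Calling `print_dmotion_cleanup <config_file_path>` shows one command per entry; if the output looks right, it can be pasted into the COS session. The config file path itself should be obtainable with `vmware-cmd -l`, which lists the registered VMs on the host.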
Of course I didn’t find all this out by myself. All credit goes to Argyle from the VMTN Forum. You can read his original thread here.
Someone would probably say “Don’t try this at home”, but if you’re curious and do want to try this at home, use the following procedure to reproduce this split situation:
- perform a sVMotion of a TEST vm
- on the COS of the ESX host issue:
service mgmt-vmware restart
- Have fun!!