
Storage troubleshooting commands

 



General host info and storage issues


Host Info:

echo -e "Host Info \n ==========" && hostname -f && vmware -vl && date && uptime 


SCSI sense codes (ignoring noise as per KB1031221):

cd /var/run/log;SENSE=$(grep -vE ' 0x85| 0x4d| 0x1a| 0x12' vmkernel.log|egrep -oi 'H:+.*+sense data+.*' |sort |uniq -c |sort -nr |head -20) ; echo -e "Host & Plug-in\n====================" ; echo "$SENSE" |grep "D:0x0" ; echo ;echo -e "Device\n====================" ; echo "$SENSE" |grep -v "D:0x0" ;


Common issues: 

echo -e "vmkernel common errors\n====================" ; egrep -i 'snapshot|doubt|medium|apd|perm|non-responsive|offline|marked|corrupt|abort|timeout|frame|lock|splitter|zdriver|heap|admission|Rank violation' vmkernel.log|cut -d" " -f3- |sort|uniq -c |sort -nr |head -30


Hardware issues:

echo -e "ipmi-sel Messages:\n====================";localcli hardware ipmi sel list -p -i -n all |grep Message


NOTE: If you can't read the logs and you get something like:

[root@Server-dr1:~] less /var/log/vmkernel.log

/var/log/vmkernel.log: Input/output error

# less /var/run/log/vmkernel.log

Input/output error

... try:

watch "dmesg | tail -20"


Driver logs


Driver logs:

echo -e "HBA drivers logs\n====================" ; egrep $( localcli storage core adapter list |awk 'FNR >2 {print $2}' |sort |uniq |sed ':a;N;$!ba;s/\n/|/g') vmkernel.log| egrep -v 'INFO|PCI' |cut -d " " -f2- | cut -d ")" -f2- |sort |uniq -c |sort -nr |less

Network Health


Ethernet issues:

echo -e "Ethernet Issues:\n====================";/usr/lib/vmware/vm-support/bin/nicinfo.sh |egrep 'errors|dropped' |grep -v ": 0"


To find if there is a network congestion by checking packets retransmission:

one=$(vsish -e cat /net/tcpip/instances/defaultTcpipStack/stats/tcp |grep sndrexmitpack |cut -d : -f2) ;sleep 10;two=$(vsish -e cat /net/tcpip/instances/defaultTcpipStack/stats/tcp |grep sndrexmitpack |cut -d : -f2);let "dif= two - one";echo "$dif packets retransmitted during 10 seconds"


Adapter and connectivity


To get all the information for vmhbas:

for name in `vmkchdev -l | grep vmhba | awk '{print$5}'`;do echo $name ; echo "VID :DID  SVID:SDID"; vmkchdev -l | grep -w $name | awk '{print $2 , $3}';printf "Driver: ";echo `esxcfg-scsidevs -a | grep -w $name |awk '{print $2}'`;vmkload_mod -s `esxcfg-scsidevs -a | grep -w $name|awk '{print $2}'` |grep -i version;echo `lspci -vvv | grep -w $name | awk '{print $1=$NF="",$0}'`;printf "\n";done


NIC/HBA info:

lspci

esxcli storage core adapter list

vmkchdev -l| egrep 'nic|hba'

vmkchdev -l| egrep 'nic|hba' | awk {'print $2 " " $3'} |sort | uniq

/usr/lib/vmware/vm-support/bin/nicinfo.sh

esxcfg-info -a


Device/Adapter stats:


localcli storage core device stats get

localcli storage core adapter stats get


OR vsish:


/> get /storage/scsifw/devices/naa.600605b00e02a5a02290472e030b89c5/stat


/> get /storage/scsifw/adapters/vmhba64/stats


Get the used vmhbas and the number of paths associated with each one:

esxcfg-mpath -L | grep -i vmhba | awk -F ":" '{print $1}'| sort | uniq -c



Paths status:

esxcli storage core path list |grep "State: " |sort | uniq -c


FC


FC status:

echo -e "FC status:\n====================";localcli storage san fc list;localcli storage san fc stats get |egrep 'Error|Failure|Loss|Invalid' |grep -v ": 0";echo;echo -e "FC Events\n===================="  localcli storage san fc events get


fcoe status:

echo -e "FCOE Status \n====================";localcli storage san fcoe list && localcli storage  san fcoe stats get


ISCSI

iSCSI status:

echo -e "ISCSI Status \n====================";localcli storage san iscsi list && localcli storage san iscsi stats get


iSCSI connection tests:


nc -z -s [host's port IP address] [iSCSI server's IP address] [Port ID]

vmkping -I vmk1 -d -s 8972 x.x.x.x


Note: If you have more than one vmkernel port on the same network (such as a heartbeat vmkernel port for iSCSI), all vmkernel ports on that network must be configured with jumbo frames (MTU 9000) too. If other vmkernel ports on the same network have a lower MTU, the vmkping command will fail with the -s 8972 option. In the command, the -d option sets the DF (Don't Fragment) bit on the IPv4 packet.
To test 1500 MTU, run the command: vmkping -I vmkX x.x.x.x -d -s 1472.
You can specify which vmkernel port to use for outgoing ICMP traffic with the -I option
-d option sets DF (Don't Fragment) bit on the IPv4 packet.
-s 8972 in case you are using jumbo frames MTU 9000
-s 1472 in case you are using normal MTU 1500
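
For example, the two MTU tests look like this (vmk2 and 10.10.10.20 are hypothetical; substitute your port-binding vmknic and target IP):

#Jumbo frames: 8972 = 9000 - 20 (IP header) - 8 (ICMP header)
vmkping -I vmk2 -d -s 8972 10.10.10.20

#Standard frames: 1472 = 1500 - 28
vmkping -I vmk2 -d -s 1472 10.10.10.20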


iSCSI sessions/connections/targets/more:

esxcli  iscsi  session  list

esxcli iscsi session connection list

esxcli iscsi adapter target portal list

esxcli iscsi networkportal list

esxcli iscsi adapter param get -A vmhba34


Script to set MTU to 1500 for the VMKs used for iSCSI port binding:


for x in `localcli iscsi networkportal list |grep vmk |cut -d ":" -f2 | cut -d " " -f2`;do localcli network ip interface set -m 1500 -i $x;done


/etc/init.d/hostd restart


 esxcfg-rescan --all


SAS status:

echo -e "SAS Status \n====================";localcli storage san sas list && localcli storage san sas stats get


RDM, Devices, Partitions, Datastores


Print the partition table:

partedUtil getptbl /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


Check the partition data:

hexdump -c /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:1 |less


Search for VMFS signature:

offset="128 2048"; for dev in `esxcfg-scsidevs -l | grep "Console Device:" | awk {'print $3'}`; do disk=$dev; echo $disk; partedUtil getptbl $disk; { for i in `echo $offset`; do echo "Checking offset found at $i:"; hexdump -n4 -s $((0x100000+(512*$i))) $disk; hexdump -n4 -s $((0x1300000+(512*$i))) $disk; hexdump -C -n 128 -s $((0x130001d + (512*$i))) $disk; done; } | grep -B 1 -A 5 d00d; echo "---------------------"; done


Finding & creating VMFS Partition table  link


  1. Download the attached find_vmfs_partition_boundaries.sh script and upload it to the ESXi host's /tmp directory.
  2. Change permissions:
    chmod 777 /tmp/find_vmfs_partition_boundaries.sh
  3. Change directory to /sbin:
    cd /sbin
  4. Run the script against the device (naa.60060160729025007628b54969f4e211 in this example; do not use the /vmfs/devices/disks path):
    ../tmp/find_vmfs_partition_boundaries.sh naa.60060160729025007628b54969f4e211
    Using naa.60060160729025007628b54969f4e211 ...
    Starting offset is 1048576 and LVM majorVersion is 05. Assuming VMFS5 and GPT.
    Done. Check the /tmp/partitioncmds.txt file for partition-creation syntax.
  5. "less /tmp/partitioncmds.txt" provides the command to run:
    partedUtil setptbl /vmfs/devices/disks/naa.60060160729025007628b54969f4e211 gpt "1 2048 1048562549 AA31E02A400F11DB9590000C2911D1B8 0"




find_vmfs_partition_boundaries.sh:
#!/bin/sh
 
# A tool to automatically find the partition offsets of a VMFS volume, in sectors.
# NOTE: This tool should only be run on ESXi. ESX Classic will return sane information,
# but the generated partedUtil commands won't work as they don't use the COS devpath.
 
# NOTE: This tool performs little sanity checking and may potentially return invalid data!
# As with all things, you should "trust but verify."
 
# Make sure we are using a valid device file.
checkin=`echo "$1" |grep -E "^naa|^eui|^t10|^mpx"`
if "$checkin" != "" ];then
    echo "Using "$1" ..."
    else
    echo "$1 is not an NAA, T10, EUI or MPX identifier."
    echo "Usage: find_vmfs_partition_boundaries.sh <NAA, T10, EUI or MPX identifier>"
    echo
    exit 1
fi
 
# Once we validate that the input is an identifier, make sure it actually exists.
if [ ! -f /vmfs/devices/disks/$1 ];then
    echo "Could not find \"/vmfs/devices/disks/$1\""
    echo "Please verify that the identifier exists!"
    echo
    exit 2
fi
 
# Use dd to grab the first 8 MB of a disk so metadata can be quickly parsed
dd if=/vmfs/devices/disks/$1 of=/tmp/part.txt bs=1M count=8 > /dev/null 2>&1
 
# Set the "hexoff" environmental variable to the beginning of the LVM header
hexoff=`hexdump -C /tmp/part.txt |grep "0d d0 01 c0" |awk '{print $1}'`
 
# Check to make sure that we actually found LVM magic.
if "$hexoff" "" ];then
    echo "LVM magic not found on $1!"
    echo "Quitting ..."
    echo
    exit 3
fi
 
# Set the "startdec" variable to the decimal form of the hex offset found in the previous step
startdec=`echo $((0x$hexoff))`
 
# Set the "startoff" variable to the decimal offset less the 1 MB (1048576 bytes) pad.
# This determines the decimal value of the beginning of the partition, in bytes.
startoff=`expr $startdec - 1048576`
 
# Determine the starting sector by dividing the "startoff" variable by 512 (512-byte sectors)
startsect=`expr $startoff \/ 512`
 
# Determine the LVM majorVersion. This will be used later to try to infer the correct partition type.
veroff=`expr $startdec \+ 4`
lvmver=`hexdump -C /tmp/part.txt -s $veroff -n 1 |awk '{print $2}'`
 
# Define the location of the partition size within LVM.
lvmsize=`expr $startdec \+ 94`
 
# Determine the LVM partition size by grabbing the size in hex (8 bytes), reversing endianess.
hexsize=`hexdump -v /tmp/part.txt -s $lvmsize -n 8 |awk '{print $5 $4 $3 $2}'`
 
# Convert this figure to decimal to get the raw size of the volume, in bytes and set the "decsize" variable
decsize=`echo $((0x$hexsize))`
 
# Add this figure to the $startoff variable to get the gross size
dectotal=`expr $startoff \+ $decsize`
 
# Use $dectotal and expr to find the size of the volume in GB, rounded down to the nearest whole integer.
sizegb=`expr $dectotal \/ 1024 \/ 1024 \/ 1024`
 
# Divide $dectotal by 512 to get the raw size of the volume, in sectors and set a variable
sectotal=`expr $dectotal \/ 512`
 
#Find the ending sector by subtracting one from the total
endsect=`expr $sectotal - 1`
 
# Determine what type of partition we ought to create, depending on some conditions.
# The logic here isn't elegant and could use work.
# We simply use a handful of ifs to determine how to proceed.
if "$startsect" "2048" ] && [ "$lvmver" "05" ]; then
    echo "Starting offset is $startoff and LVM majorVersion is $lvmver. Assuming VMFS5 and GPT."
# It is important to note here that theoretically, this could be a case where VMFS3 was created under vSphere5 and then subsequently upgraded to VMFS5.
# Depending on size, it is conceivable that the appropriate partition type here actually could be MBR. Despite this potential corner case, setting the type
# to be GPT won't break anything because, as it is VMFS5, only non-vSphere4 hosts will be accessing it so there shouldn't be any trouble handling GPT.
    echo "partedUtil setptbl /vmfs/devices/disks/$1 gpt \"1 $startsect $endsect AA31E02A400F11DB9590000C2911D1B8 0\"" >> /tmp/partitioncmds.txt
fi
if "$sizegb" -gt "2048" ] && [ "$lvmver" "05" ];then
    echo "Partition size is >2TB and LVM majorVersion is $lvmver. Assuming VMFS5 and GPT."
    echo "partedUtil setptbl /vmfs/devices/disks/$1 gpt \"1 $startsect $endsect AA31E02A400F11DB9590000C2911D1B8 0\"" >> /tmp/partitioncmds.txt
fi
if "$lvmver" != "05" ]; then
    echo "LVM majorVersion is $lvmver. Assuming VMFS3 and MBR."
    echo "partedUtil set /vmfs/devices/disks/$1 \"1 $startsect $endsect 251 0\"" >> /tmp/partitioncmds.txt
fi
if "$lvmver" "05" ] && [ "$startsect" "128" ] && [ "$sizegb" -lt "2048" ]; then
    echo "LVM majorVersion is $lvmver but the starting offset is $startsect and total size is < 2TB."
    echo "Assuming upgraded VMFS3 with MBR."
    echo "partedUtil set /vmfs/devices/disks/$1 \"1 $startsect $endsect 251 0\"" >> /tmp/partitioncmds.txt
fi
echo "Done. Check the /tmp/partitioncmds.txt file for partition-creation syntax."
exit 0




Browse datastore/filesystems:

localcli storage vmfs extent list

localcli storage filesystem list

esxcli storage vmfs snapshot list


Find the detailed information about the file system:

vmkfstools -v10 -Ph /vmfs/volumes/datastore1/


Get the UUID of the ESXi installation partition:

esxcfg-info -b


Force mount:

esxcfg-volume -M xxxxx


Get the Perennially Reserved flag value for NON-VMFS LUNs:

esxcli storage vmfs extent list| grep -iEoh "naa.*|eui.*|t10.*" | awk '{print $1}'  > /tmp/vmfs.txt;esxcli storage core device list | awk '{print $1}' | grep -iE "naa|eui|t10" | grep -v -f  /tmp/vmfs.txt > /tmp/nonvmfs.txt ;printf "%10s %s\n" "UID                                 " "Perennially Reserved";printf "%10s %s\n" "---                                 " "--------------------";for x in `grep -iE "naa|eui|t10" /tmp/nonvmfs.txt`;do printf "%10s %s\n" $x `esxcli storage core device list -d $x |grep -i "perennially" | awk '{print $NF}' `; done

 

List RDMs:

vim-cmd vmsvc/getallvms|grep 'vmnamehere'| awk '{print $1}'|grep [0-9]|while read a; do vim-cmd vmsvc/device.getdevices $a|grep -v parent|grep -A8 RawDisk;done

OR

find /vmfs/volumes/ -type f -name '*.vmdk' -size -1024k -exec grep -l '^createType=.*RawDeviceMap' {} \; > /tmp/rdmsluns.txt

for i in `cat /tmp/rdmsluns.txt`; do vmkfstools -q $i; done

OR

List RDM vml & vmdk 

for i in `vm-support -V | awk '{print $1}' |  cut -d "/" -f 1-5`; do find $i -name '*.vmdk' ;done | grep -v "\-flat.vmdk" | grep -v "rdm" > /tmp/vmdk.txt ; for i in `grep vmdk /tmp/vmdk.txt`; do echo $i ; grep -i RawDeviceMap $i; done > /tmp/tempfile.txt; grep -iB 1 "RawDeviceMap" /tmp/tempfile.txt | grep vmdk > /tmp/rdmvmdk.txt; for i in `grep vmdk /tmp/rdmvmdk.txt`; do echo $i; esxcfg-scsidevs -u |grep `vmkfstools -q $i | grep vml | awk '{print $3}'`; done | grep naa | awk '{print $1}' > /tmp/rdmnaa.txt; for i in `grep naa /tmp/rdmnaa.txt`; do esxcli storage core device setconfig -d $i --perennially-reserved=true ; done


Manually grow datastore:


partedUtil getptbl /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

partedUtil getUsableSectors /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

partedUtil resize /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1 2048 xxxxxxxxxx

vmkfstools --growfs /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:1 /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:1

vmkfstools -V
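
A worked example of the resize step (the device name and sector numbers below are illustrative only): getUsableSectors prints the first and last usable sectors, and that last usable sector is what you pass as the new end sector to partedUtil resize before growing the VMFS:

#getUsableSectors returned, for example: 34 4294967262
partedUtil resize /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1 2048 4294967262
vmkfstools --growfs /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:1 /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:1
vmkfstools -V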


Unmount DS:


esxcli storage filesystem unmount -u 55a670c0-474d3873-c2c6-ac162dbd3bc8


Detach device:


esxcli storage core device detached list --> shows all manually detached devices
esxcli storage core device set --state=off -d <naa.id> --> manually detach a device
esxcli storage core device detached remove -d <naa.id> --> permanently remove the device configuration


SCSI reservation conflicts:


esxcfg-info | egrep -B5 "s Reserved|Pending" 




|----Console Device....................../dev/sda
|----DevfsPath........................../vmfs/devices/disks/vml.02000000006001c230d8abfe000ff76c198ddbc13e504552432035
|----SCSI Level..........................6
|----Queue Depth.........................128
|----Is Pseudo...........................false
|----Is Reserved.........................false
|----Pending Reservations................ 1




Note: The host that shows a Pending Reservations value larger than 0 is holding the lock.
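
A quick filter to show only the devices reporting a non-zero pending-reservation count (a sketch that assumes the field labels shown in the output above; verify against your own esxcfg-info output):

esxcfg-info | grep -E "Console Device|Pending Reservations" | grep -B1 "Pending Reservations.*[1-9]"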



Remove the reservation (can be run from any host):

vmkfstools --lock lunreset /vmfs/devices/disks/vml.02000000006001c230d8abfe000ff76c198ddbc13e504552432035



Manually delete a datastore from the vCenter database (VCDB):


/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres

VCDB=# select ID from VPX_ENTITY where name = 'SF-VMDS01';
 id
-----
 367
(1 row)

delete from VPX_DS_ASSIGNMENT where DS_ID=367;
delete from VPX_VM_DS_SPACE where DS_ID=367;
delete from VPX_DATASTORE where ID=367;
delete from VPX_ENTITY where ID=367;


Rescan HBA/FS:

esxcfg-rescan --all    

vmkfstools -V  


VAAI primitives status:

esxcli storage core device vaai status get -d naa.600601603aa029002cedc7f8b356e311


Verify unmap info:

esxcli storage vmfs reclaim config get -u <Datastore_UUID>

# vsish

> cd /vmkModules/vmfs3/auto_unmap/volumes/

> ls

This should list all the volumes that the system is watching for auto_unmap.
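
The same check can be run non-interactively (a sketch using vsish -e, as elsewhere in this page):

vsish -e ls /vmkModules/vmfs3/auto_unmap/volumes/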



Mask LUN By Path KB1009449:

esxcli storage core claimrule add -r 300 -t location -A vmhba33 -C 0 -T 3 -L 0 -P MASK_PATH

esxcli storage core claimrule load

esxcli storage core claiming reclaim -d t10.F405E46494C4542596D477279794D24364A7A4D233F69713
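
To undo the masking later, a sketch based on reversing the same procedure (confirm the rule number and the C:T:L coordinates against your own claimrule list before running):

esxcli storage core claimrule remove -r 300
esxcli storage core claimrule load
esxcli storage core claiming unclaim -t location -A vmhba33 -C 0 -T 3 -L 0
esxcli storage core claimrule run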



NFS 


Connection test KB1003967:


vmkping -I vmkN -s nnnn xxx.xxx.xxx.xxx

vmkN is vmk0, vmk1, etc, depending on which vmknic is assigned to NFS.

Note: The -I option to select the vmkernel interface is available only in ESXi 5.1. Without this option in 4.x/5.0, the host uses the vmkernel associated with the destination network being pinged in the host routing table. The host routing table can be viewed by running the esxcfg-route -l command.

nnnn is the MTU size minus 28 bytes for overhead. For example, for an MTU size of 9000, use 8972.

xxx.xxx.xxx.xxx is the IP address of the target NFS storage.
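
For example, to test jumbo frames to a hypothetical NFS array at 192.168.20.50 over vmk3 (adding -d to prevent fragmentation, as in the iSCSI section above):

vmkping -I vmk3 -d -s 8972 192.168.20.50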


nc -z array-IP 2049


Remove and mount NFS3:


esxcli storage nfs remove -v nfs_datastore
esxcli storage nfs add --host=dir42.eng.vmware.com --share=/<mount_dir> --volume-name=nfsstore-dir42
esxcli storage nfs add -H 192.168.5.2 -s /ctnr-async-metro-oltp-1-siteA -v ctnr-async-metro-oltp-1-siteA


Note: for NFS 4.1 replace nfs with nfs41 

Note: The NFS mount configuration is stored in /etc/vmware/esx.conf
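
For example, a hypothetical NFS 4.1 mount of the same share (nfs41 add also accepts a comma-separated list of server addresses in -H for multipathing):

esxcli storage nfs41 add -H 192.168.5.2,192.168.5.3 -s /ctnr-async-metro-oltp-1-siteA -v ctnr-async-metro-oltp-1-siteA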

NFS Heap


memstats -r heap-stats -s name:pctFreeOfMax | grep nfs


NFS DAVG

#!/bin/bash
 
#################################################################
# Description: Computes Read DAVG(readIssueTime) for NFS DS
# In    : NFS share name
# Out   :
#         ReadTime -> Total time for a Read
#         ReadIssueTime(DAVG) -> Time for a split Read
#                             ->RTT(TCPIP+pnic+path+Array)
# Author: Shah Pahelwan Shahnawaz Hussain shussain@vmware.com
###################################################################
count=0
cur_read=0
cur_readsplits=0
prev_read=0
prev_readsplits=0
reads=0
prev_rit=0
prev_rt=0
prev_rbytes=0
avg_rt=0
 
cur_writes=0
cur_writesplits=0
prev_writes=0
prev_writesplits=0
writes=0
prev_wit=0
prev_wt=0
prev_wbytes=0
avg_wt=0
 
share=$1
RDFMT="%-9d %-9d %-8d %-10d %-12d %-9d %s %-10d %-10d %-8d %-10d %-13d %-8d\n"
#WRFMT="%-9d %-12d %-12d %-12d  %-6d\n"
RD_OP_PARAMS=""$reads" "$avg_rt" "$avg_rit" "$avg_rbytes" "$RKBps""
 
if "$#" -ne 1 ]; then
   echo "Usage: $0 <NFS Share>"
   exit 1
fi
 
time=`date`;
echo "$time: Running $0 $1"
 
vsish_path="vsish -e get /vmkModules/nfsclient/mnt/"$share"/properties"
 
echo "vsish path $vsish_path"
 
echo "======================================================================================================= ======================"
echo "Reads  || Split  ||  Time || Issue   ||  Bytes   ||  ReadKB/s  ## Writes || Split  ||  Time  || Issue   ||  Bytes    ||WritKB/s"
echo "       || Reads  || /Read || Time/RD || /Read(KB)||            ##        || Writes || /Write || Time/WR || /Write(KB)||        "
echo "       ||        ||  (Avg)|| (DAVG)  ||  (Avg)   ||            ##        ||        ||  (Avg) || (DAVG)  ||  (Avg)    ||        "
echo "=============================================================================================================================="
while [ $count -le 1000 ]
do
 
sleep 1
   cur_read=`$vsish_path  | grep reads | cut -d ":" -f 2`;  
   cur_readsplits=`$vsish_path  | grep  readSplitsIssued | cut -d ":" -f 2`
   cur_rt=`$vsish_path  | grep readTime | cut -d ":" -f 2`
   cur_rit=`$vsish_path  | grep readIssueTime | cut -d ":" -f 2`
   cur_rbytes=`$vsish_path  | grep readBytes | cut -d ":" -f 2`
    
   cur_writes=`$vsish_path  | grep writes | cut -d ":" -f 2`;  
   cur_writesplits=`$vsish_path  | grep writeSplitsIssued | cut -d ":" -f 2`;  
   cur_wt=`$vsish_path  | grep writeTime | cut -d ":" -f 2`
   cur_wit=`$vsish_path  | grep writeIssueTime | cut -d ":" -f 2`
   cur_wbytes=`$vsish_path  | grep writeBytes | cut -d ":" -f 2`
   #echo $cur_rt
    
##Compute Read Stats##
   reads=$((cur_read - prev_read))
   readsplits=$((cur_readsplits - prev_readsplits))
   readTime=$((cur_rt - prev_rt))
   readIssueTime=$((cur_rit - prev_rit))
   readBytes=$((cur_rbytes - prev_rbytes))
#   echo "Total reads $reads"
#   echo "Total read time $readTime readIssueTime $readIssueTime"
 
#   avg_rt=$(echo "scale=8; $readTime/($reads*1000)" |bc)
#   avg_rit=$(echo "scale=8; $readIssueTime/($reads*1000)" |bc)
   if [ $reads -ne 0 ]
   then
      avg_rt=$(($readTime/($reads*1000)))
      avg_rit=$(($readIssueTime/($reads*1000)))
      avg_rbytes=$(($readBytes/($reads*1024)))
   fi
   RKBps=$(($readBytes/(1024)))
    
    
##Compute write Stats
   writes=$((cur_writes - prev_writes))
   writesplits=$((cur_writesplits - prev_writesplits))
   writeTime=$((cur_wt - prev_wt))
   writeIssueTime=$((cur_wit - prev_wit))
   writeBytes=$((cur_wbytes - prev_wbytes))
    
   if [ $writes -ne 0 ]
   then
      avg_wt=$(($writeTime/($writes*1000)))
      avg_wit=$(($writeIssueTime/($writes*1000)))
      avg_wbytes=$(($writeBytes/($writes*1024)))
   fi
   WKBps=$(($writeBytes/(1024)))
 
##print Stats##
#   echo "$reads  $avg_rt   $avg_rit           $avg_rbytes  $KBps "
   printf "$RDFMT" "$reads" "$readsplits" "$avg_rt" "$avg_rit" "$avg_rbytes" "$RKBps" "##" "$writes" "$writesplits" "$avg_wt" "$avg_wit" "$avg_wbytes" "$WKBps"
 
prev_read=$cur_read
   prev_readsplits=$cur_readsplits
   prev_rt=$cur_rt
   prev_rit=$cur_rit
   prev_rbytes=$cur_rbytes
   prev_writes=$cur_writes
   prev_writesplits=$cur_writesplits
   prev_wt=$cur_wt
   prev_wit=$cur_wit
   prev_wbytes=$cur_wbytes
    
#reset the values
   avg_rt=0
   avg_rit=0
   avg_rbytes=0
   avg_wt=0
   avg_wit=0
   avg_wbytes=0
   RKBps=0
   WKBps=0
   readBytes=0
   writeBytes=0
 
let count+=1
   
done


Capturing network dump 

tcpdump-uw host array-ip -w /vmfs/volumes/datastorex/capture.pcap

Corruptions


VMDK

vmkfstools  -x  check /vmfs/volumes/iSCSI-T1/VCSA1/VCSA1_11.vmdk

vmkfstools  -x  repair /vmfs/volumes/iSCSI-T1/VCSA1/VCSA1_11.vmdk


LUN Partition table checker

voma -m ptbl -d /dev/disks/mpx.vmhba0:C0:T0:L0


LUN LVM Checker

voma -m lvm -d  /dev/disks/mpx.vmhba0:C0:T0:L0:3


LUN VMFS

voma -m vmfs -f check -d /vmfs/devices/disks/naa.600a0980006c1123000001cf58413cdc:1

voma -m vmfs -f fix -d /vmfs/devices/disks/naa.600a0980006c1123000001cf58413cdc:1


For VMFS 6 (ESXi 6.7 U2 and later):

voma -m vmfs -f advfix -d /vmfs/devices/disks/naa.xxxxxx0000b8:1 -p /tmp/voma.txt


(Note: you have to unmount the DS from all hosts and run voma from one host only)

Locks


VMFS:

ls | while read x; do vmfsfilelockinfo -p $x| grep -i "is locked"; done

OR

vmkfstools -D xxx.vmdk


NFS3:

ls -la | grep .lck


#Find the Host that created the lock

hexdump -C .lck-e003090001000000


00000010 01 00 00 00 64 68 69 6e 67 2d 65 73 78 2e 76 6d |..........esx.vm|
00000020 77 61 72 65 2e 63 6f 6d 00 00 00 00 00 00 00 00 |ware.com........|

#Find the file that is being locked

stat * | grep -B2 `v2=$(v1=.lck-e003090001000000;echo ${v1:13:2}${v1:11:2}${v1:9:2}${v1:7:2}${v1:5:2});printf "%d\n" 0x$v2` | grep File
  File: Win2k8-flat.vmdk
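
To sweep a whole NFS datastore directory and see which host owns each lock file (a simple sketch; the owning hostname is readable in the ASCII column of the hexdump):

for l in .lck-*; do echo "== $l =="; hexdump -C "$l" | head -5; done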


VSAN:

ls *.vmdk | xargs grep "vsan://" | awk -F'//'  '{print $2}' | awk -F'"'  '{print $1}' | while read x; do vmfsfilelockinfo -p .$x.lck ; done


Find out which files are open by which process:

lsof | grep -i "VM name"


Which VM is using this process (xxxxx = process ID):

esxcli vm process list  | grep -iC4 xxxxx


VVOL


VVOL status:

echo -e "VVOL Status \n====================";echo -e "VVOL VASS";esxcli storage vvol vasaprovider list;echo -e "VVOL protocolendpoint "; esxcli storage vvol protocolendpoint list;echo -e "VVOL storagecontainer ";esxcli storage vvol storagecontainer list


Test VASA SSL:

openssl s_client -connect vasa.local:443

OR

openssl s_client -host vasa.local -port 443

Refresh VASA cert

1. Browse to vCenter Server in the vSphere Web Client navigator.
2. Click the Configure tab, and click Storage Providers.
3. From the list, select the storage provider and click Refresh the certificate.


Refresh ESXi certificate:


Browse to the host in the vSphere Web Client inventory.

Click the Manage tab and click Settings.

Select System, and click Certificate. You can view detailed information about the selected host's certificate.

Click Renew or Refresh CA Certificates.

OR

cd /etc/vmware/ssl

mv rui.crt orig.rui.crt

mv rui.key orig.rui.key

/sbin/generate-certificates

/etc/init.d/hostd restart

/etc/init.d/vpxa restart
Reconnect ESXi host to vCenter server.
Run the following commands on the ESXi host:
/etc/init.d/vvold ssl_reset

/etc/init.d/vvold restart


Publish certificate to vCenter:

/usr/lib/vmware-vmafd/bin/dir-cli trustedcert publish --chain --cert /tmp/vasa.crt


Trust certificate from ESXi (CA store):

/etc/vmware/ssl/castore.pem


Find VVOL Objects [details]


 localcli --plugin-dir /usr/lib/vmware/esxcli/int/ storage internal vvol virtualvolume get --container-id 8954ad916c2b4520-9bd5519866e14fcf --uuid naa.60060160a9c9b824cf9bdb4e38fb4dfe

localcli --plugin-dir /usr/lib/vmware/esxcli/int/ storage internal vvol virtualvolume metadata list  --container-id 8954ad916c2b4520-9bd5519866e14fcf   --uuid naa.60060160a9c9b824cf9bdb4e38fb4dfe


localcli --plugin-dir /usr/lib/vmware/esxcli/int/ storage internal vvol daemon set --dump-objects ; less /var/log/vvold.log


Add VASA from ESXi without vCenter (shared by Lav Tiwari)

localcli --plugin-dir /usr/lib/vmware/esxcli/int/ storage internal vvol vasaprovider add --vp-name YYYYY --vp-url https://x.x.x.x:xxxx/vasa/version.xml

/etc/init.d/hostd restart


Trivia logs for vvold and sps:


For the host vvold, there is a log level in /etc/vmware/vvold/config.xml which overrides the local vvold verbosity setting:

<config>
  <log>
    <!-- default log level -->
    <level>verbose</level>


You have to change this level to trivia and restart vvold before it is possible to change the vvold log level to trivia using esxcli.
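
A sketch of that edit from the ESXi shell (it assumes the <level> element looks exactly as above; back up config.xml first):

cp /etc/vmware/vvold/config.xml /etc/vmware/vvold/config.xml.bak
sed -i 's#<level>verbose</level>#<level>trivia</level>#' /etc/vmware/vvold/config.xml
/etc/init.d/vvold restart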

"BTW, the is a log level in /etc/vmware/vvold/config.xml which overrides the local vvold verbose setting:


config>
  <log>
    <!-- default log level -->
    <level>verbose</level>


you have to change this log to trivia, and restart the vvold before it is possible to change the vvold log-level to trivia using esxcli."


And for vCenter SPS (sometimes other components also need to be set to TRACE, depending on the logs/case):


- Edit file /usr/lib/vmware-vpx/sps/conf/log4j.properties, and change the following lines
  log4j.appender.file.Threshold=TRACE
  log4j.logger.com.vmware.vim.storage.common.vc=TRACE
- Restart sps with the command "vmon-cli -r sps"
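
A sketch of the same edit from the vCenter appliance shell (it assumes both properties already exist in log4j.properties; back the file up first and revert after collecting logs):

cp /usr/lib/vmware-vpx/sps/conf/log4j.properties /usr/lib/vmware-vpx/sps/conf/log4j.properties.bak
sed -i -e 's/^log4j.appender.file.Threshold=.*/log4j.appender.file.Threshold=TRACE/' -e 's/^log4j.logger.com.vmware.vim.storage.common.vc=.*/log4j.logger.com.vmware.vim.storage.common.vc=TRACE/' /usr/lib/vmware-vpx/sps/conf/log4j.properties
vmon-cli -r sps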


Example "1. The error observed is that SPS is not able to invoke a Property Collector call on vpxd to find the datastore on which the virtual disk is placed.

2019-11-18T12:28:01.551Z [pool-17-thread-4] INFO opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.vim.storage.common.vc.impl.VcInventoryImpl - No entities of type VirtualMachine found in VC inventory
2019-11-18T12:28:01.551Z [pool-17-thread-4] WARN opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.vim.storage.common.vc.impl.VcQueryImpl - No virtual devices found for given vm : vm-346390
2019-11-18T12:28:01.551Z [pool-17-thread-4] ERROR opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.pbm.prov.impl.PreProvisionServiceImpl - Not able to find current placement hub of entity: (pbm.ServerObjectRef) {
   dynamicType = null,
   dynamicProperty = null,
   objectType = virtualDiskId,
   key = vm-346390:2000,
   serverUuid = ED65BB82-CDFD-4B35-8986-FB9C0B5EC307
}
2019-11-18T12:28:01.551Z [pool-17-thread-4] WARN opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.pbm.prov.impl.PreProvisionServiceImpl - Error when trying to run pre-provision validation
java.lang.RuntimeException: Not able to find current placement hub of entity: (pbm.ServerObjectRef) {
   dynamicType = null,
   dynamicProperty = null,
   objectType = virtualDiskId,
   key = vm-346390:2000,
   serverUuid = ED65BB82-CDFD-4B35-8986-FB9C0B5EC307
}
   at com.vmware.pbm.prov.impl.PreProvisionServiceImpl.findCurrentPlacementHub(PreProvisionServiceImpl.java:599)
   at com.vmware.pbm.prov.impl.PreProvisionServiceImpl.fillExistingPolicyAssociations(PreProvisionServiceImpl.java:512)
   at com.vmware.pbm.prov.impl.PreProvisionServiceImpl.preProvisionValidate(PreProvisionServiceImpl.java:437)
   at com.vmware.pbm.profile.impl.ProfileManagerImpl.preProvisionProcess(ProfileManagerImpl.java:4119)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at com.vmware.vim.vmomi.server.impl.InvocationTask.run(InvocationTask.java:65)
   at com.vmware.vim.vmomi.server.common.impl.RunnableWrapper$1.run(RunnableWrapper.java:47)
   at com.vmware.vim.storage.common.task.opctx.RunnableOpCtxDecorator.run(RunnableOpCtxDecorator.java:38)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
2019-11-18T12:28:01.552Z [pool-17-thread-4] INFO opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.pbm.profile.impl.ProfileManagerImpl - Timer stopped: preProvisionProcess, Time taken: 60 ms.

2. However, just a few milli secs before this SPS was able to fetch this information from vpxd.

2019-11-18T12:28:01.536Z [pool-6-thread-10] DEBUG opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.spbm.domain.policy.Entity - Datastore for vm-346390:2000 : ManagedObjectReference{type = Datastore, value = datastore-323358}
2019-11-18T12:28:01.536Z [pool-6-thread-10] DEBUG opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.spbm.domain.policy.Entity - Datastore type of vm-346390:2000 : VVOL
2019-11-18T12:28:01.540Z [pool-6-thread-10] DEBUG opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.spbm.domain.policy.Entity - Backing object id for vm-346390:2000 : naa.60002AC00000000000007B430001A419
2019-11-18T12:28:01.544Z [pool-6-thread-10] DEBUG opId=k2xnped9-138464-auto-2yuc-h5:70026758-3-01 com.vmware.spbm.domain.policy.Entity - Storage id for vm-346390:2000 : [vvol:482a38f0f78a48a8-a937d1f8ae4b64d3]

3. There is no error in the vpxd logs corresponding to the property collector query.

4. The vc support bundle has TRACE level logging for parts of SPBM. Thanks for attempting to reproduce the issue with this turned on. However, this did not enable TRACE level logging in the part of code that connects to VC for the property collector query. So at this point, it is not clear why the property collector call from SPS to vpxd failed to fetch the datastore information for the virtual disk.

Mohamed,
Is this a one-off failure or is it easily reproducible? If reproducible, could you increase the logging level to TRACE for the property collector calls alone? This will unfortunately log a lots of lines and the log files will churn faster. So you would have to turn this on, reproduce the issue, collect the logs, and then turn it off quickly. Here is how it can be done:

- Edit file /usr/lib/vmware-vpx/sps/conf/log4j.properties, and change the following lines
  log4j.appender.file.Threshold=TRACE
  log4j.logger.com.vmware.vim.storage.common.vc=TRACE
- Restart sps with the command "vmon-cli -r sps"
- Also turn on TRIVIA logging in vpxd, and restart with the command "vmon-cli -r vpxd"

- Repro issue, collect vc support, revert these log lines in sps and restart sps once again, revert vpxd to INFO level logging and restart vpxd again."

Network commands


#List NICs

esxcfg-nics -l

#List vmknics

esxcfg-vmknic -l

#List routes
esxcfg-route -l

#List neighbor-list
esxcfg-route -n

#List vSwitches
esxcfg-vswitch -l

#List all ports
net-stats -l

#Errors and statistics for a network adapter:
esxcli network nic stats get -n <vmnicX>

#Get the vmk tag
esxcli network ip interface tag get -i vmk1



iSCSI:

esxcli iscsi adapter list
esxcli iscsi adapter target portal list --> shows all iscsi target portals
esxcli iscsi networkportal list --> lists the details for iscsi adapters (excluding vmnics)
esxcli iscsi adapter param get -A vmhba34 --> lists parameters of iscsi adapters

Nmp commands:

esxcli storage nmp satp list --> lists the default PSP SATP combinations
esxcli storage nmp psp list --> lists PSPs available in the host
esxcli storage nmp path list -d <NAA ID>
esxcli storage nmp device list -d <NAA ID> | grep PSP --> shows the PSP of this LUN
esxcli storage nmp device set -d <NAA ID> --psp=VMW_PSP_RR --> changes the PSP of this LUN to be RR
esxcli storage core claimrule list
esxcli storage core claimrule load --> load the new claim rule you added
esxcli storage core claimrule run
esxcli storage core claiming reclaim -d <NAA ID> --> unclaim and then re-claim the LUN
esxcli storage nmp satp rule list

VAAI:

esxcli storage core device vaai status get
esxcli storage core device vaai status get -d naa.600601603aa029002cedc7f8b356e311 --> to know which VAAI primitives are supported, helpful in Unmap issues
