A special thanks goes out to felipef for all the help today.

History:
    (4) host pool - one in a failed state due to hardware failure
    (1) 3.2T data LUN - SR-UUID = aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a

The issue:
The 3.2T data LUN was presenting as 91% utilized but only 33% virtually allocated.

Work log:
Results were confirmed via the XC GUI and via the command line as identified below:

    xe sr-list params=all uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a

        physical-utilisation ( RO): 3170843492352
        physical-size ( RO): 3457918435328
        virtual size: 1316940152832
        type ( RO): lvmohba
        sm-config (MRO): allocation: thick; use_vhd: true
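As a quick sanity check (my own arithmetic from the byte counts above, not part of the original output), the 91% utilization figure follows directly from those two fields:

    # physical-utilisation / physical-size, works out to roughly 91.7% used
    echo "3170843492352 / 3457918435328 * 100" | bc -l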
Further digging found that summing all the VDIs on the SR matches the virtual allocation number exactly.

Commands + results:
    xe vdi-list sr-uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a params=physical-utilisation --minimal | sed 's/,/ + /g' | bc -l
        physical utilization: 1,210,564,214,784

    xe vdi-list sr-uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a params=virtual-size --minimal | sed 's/,/ + /g' | bc -l
        virtual size: 1,316,940,152,832
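Worth calling out explicitly (my arithmetic, not from the original notes): the SR reports far more physical utilisation than the VDIs known to xapi account for, roughly 1.96 TB of unexplained space:

    # SR physical-utilisation minus the summed VDI physical utilization
    echo "3170843492352 - 1210564214784" | bc
        1960279277568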
At this point we started looking at the VG to see if there were LVs taking up space but not known to xapi.

Command + result:

    vgs

        VG                                                  #PV #LV #SN Attr   VSize VFree
        VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a    1  33   0 wz--n- 3.14T 267.36G
    (lvs --units B | grep aa15042e | while read lv vg attr size; do echo -n "$size +" | sed 's/B//g'; done; echo 0) | bc -l
        3170843492352
So at this point we have confirmed that there are in fact LVs not accounted for by xapi, so we went looking for them:

    lvs | grep aa15042e | grep VHD | cut -c7-42 | while read uuid; do [ "$(xe vdi-list uuid=$uuid --minimal)" == "" ] && echo $uuid; done

This returned a long list of UUIDs that did not have a matching entry in xapi.
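A hypothetical follow-up, not part of the original session, to put a number on how much space those orphaned LVs are holding; it assumes the same VHD-<uuid> LV naming used above and is run from dom0:

    # sum the sizes of only those LVs that have no matching VDI in xapi (sketch)
    (lvs --units B --noheadings -o lv_name,lv_size VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a \
       | grep VHD- | while read name size; do
           uuid=${name#VHD-}                                                  # strip the VHD- prefix to get the VDI UUID
           [ -z "$(xe vdi-list uuid=$uuid --minimal)" ] && echo -n "${size%B} + "   # keep only the orphans
         done; echo 0) | bc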
Grabbing one of the UUIDs at random and searching back in the xensource.log, we found something strange:

    [20121113T09:05:32.654Z|debug|xcp-nc-bc1b8|1563388 inet-RPC|SR.scan R:b7ff8ccc6566|dispatcher] Server_helpers.exec exception_handler: Got exception SR_BACKEND_FAILURE_181: [ ; Error in Metadata volume operation for SR. [opterr=VDI delete operation failed for parameters: /dev/VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a/MGT, c866d910-f52f-4b16-91be-f7c646c621a5. Error: Failed to read file with params [3, 0, 512, 512]. Error: Input/output error]; ]
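The failing read in that exception is against the SR's metadata volume (MGT). A hypothetical way to confirm the underlying I/O error directly (not part of the original session; read-only, and it assumes the MGT LV is active in dom0):

    # try to read the first 512-byte sector of the metadata LV; an I/O error here
    # lines up with the SR_BACKEND_FAILURE_181 seen in xensource.log
    dd if=/dev/VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a/MGT of=/dev/null bs=512 count=1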
A little googling around finally turned up a thread on the Citrix forums (http://forums.citrix.com/thread.jspa?threadID=299275) that pointed me at a process to rebuild the metadata for that specific SR without having to blow away the SR and start fresh.

Commands:

    lvrename /dev/VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a/MGT /dev/VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a/OLDMGT
    xe sr-scan uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a
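A hypothetical check, not recorded in the original notes, that the rescan recreated a fresh metadata volume alongside the renamed copy:

    # expect to see both a newly created MGT and the renamed OLDMGT
    lvs VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a | grep MGT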
This got rid of the SR_BACKEND_FAILURE errors, but the orphaned LVs persisted. Looking in the SMlog, we started seeing lines that pointed at the pool not being ready, with the process exiting:

    <25168> 2012-11-14 12:27:24.195463      Pool is not ready, exiting
At this point I manually forced the offline node out of the pool, and the SMlog then reported success in the purge process.

    xe host-forget uuid=<down host>
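A hypothetical final check, not recorded in the original notes, that the orphaned LVs actually got reclaimed once the down host was forgotten and the cleanup could run:

    # rescan the SR, then re-run the earlier orphan check; it should now come back empty
    xe sr-scan uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a
    lvs | grep aa15042e | grep VHD | cut -c7-42 | while read uuid; do [ "$(xe vdi-list uuid=$uuid --minimal)" == "" ] && echo $uuid; done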