A special thanks goes out to felipef for all the help today.
History:
(4) host pool – one in a failed state due to hardware failure
(1) 3.2T data lun – SR-UUID = aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a
The issue:
The 3.2T data LUN was presenting as 91% utilized but only 33% virtually allocated.
Work log:
The numbers were confirmed both in the XC GUI and on the command line, as shown below:
xe sr-list params=all uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a
physical-utilisation ( RO): 3170843492352
physical-size ( RO): 3457918435328
virtual size: 1316940152832
type ( RO): lvmohba
sm-config (MRO): allocation: thick; use_vhd: true
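Further digging showed that summing the virtual-size of every VDI on the SR reproduced the SR's virtual allocation number, while summing their physical-utilisation came to only about 1.2T, far short of the roughly 3.17T the SR reported as used.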
Commands + results:
xe vdi-list sr-uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a params=physical-utilisation --minimal | sed 's/,/ + /g' | bc -l
physical utilization: 1,210,564,214,784
xe vdi-list sr-uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a params=virtual-size --minimal | sed 's/,/ + /g' | bc -l
virtual size: 1,316,940,152,832
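To make the size of the discrepancy concrete, here is a minimal sketch that subtracts the summed VDI physical utilisation from the physical-utilisation reported by the SR. The two values are copied from the output above; the variable names are only illustrative.

```shell
# Values copied from the command output above (bytes).
sr_physical_utilisation=3170843492352   # from xe sr-list
vdi_physical_sum=1210564214784          # sum of xe vdi-list physical-utilisation

# Space the SR reports as used that no VDI known to xapi accounts for.
gap_bytes=$((sr_physical_utilisation - vdi_physical_sum))

# Integer GiB (rounded down), just for readability.
gap_gib=$((gap_bytes / 1024 / 1024 / 1024))

echo "unaccounted: ${gap_bytes} bytes (~${gap_gib} GiB)"
```

That is roughly 1.8T of used space on the LUN that xapi cannot attribute to any VDI.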
At this point we started looking at the VG to see whether there were LVs taking up space that xapi did not know about.
Command + result:
vgs
  VG                                                 #PV #LV #SN Attr   VSize VFree
  VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a   1  33   0 wz--n- 3.14T 267.36G
(lvs --units B | grep aa15042e | while read lv vg flags size; do echo -n "$size +" | sed 's/B//g'; done; echo 0) | bc -l
3170843492352
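As a sanity check that LVM and xapi agree at the SR level, the free space implied by the xe sr-list figures can be compared with the VFree column from vgs. This is a small sketch using the numbers above, assuming the "G" that vgs prints means GiB (1024-based).

```shell
# Values copied from the xe sr-list output above (bytes).
sr_physical_size=3457918435328
sr_physical_utilisation=3170843492352

# Free space according to xapi's view of the SR.
free_bytes=$((sr_physical_size - sr_physical_utilisation))

# Convert to GiB with two decimals for comparison with vgs VFree.
free_gib=$(LC_ALL=C awk -v b="$free_bytes" 'BEGIN { printf "%.2f", b / (1024 * 1024 * 1024) }')

echo "free: ${free_bytes} bytes (${free_gib} GiB)"
```

The result lines up with the 267.36G that vgs reports as VFree, so the SR totals are consistent with LVM; the mismatch is only in which LVs xapi has VDI records for.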
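The LV total matches the SR's physical-utilisation exactly, yet the VDIs known to xapi account for only about 1.2T of it. That confirms there are LVs on the volume group that xapi does not know about, so the next step was to find them: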
lvs | grep aa15042e | grep VHD | cut -c7-42 | while read uuid; do [ "$(xe vdi-list uuid=$uuid --minimal)" == "" ] && echo $uuid ; done
This returned a long list of UUIDs that had no matching entry in xapi.
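The loop above calls xe vdi-list once per LV. A possible variation, sketched here under some assumptions, is to dump both lists once and compare them offline. The function name find_orphan_uuids and the two input files are hypothetical. It assumes one file holds LV names one per line and the other holds the comma-separated output of xe vdi-list with --minimal. Like the original loop, it only considers LVs named VHD-<uuid>.

```shell
# find_orphan_uuids LV_NAMES_FILE VDI_UUIDS_FILE
# Prints the UUID of every VHD-<uuid> LV that has no matching VDI UUID.
find_orphan_uuids() {
    lv_names_file=$1
    vdi_uuids_file=$2
    known=$(mktemp)

    # xe --minimal output is one comma-separated line; put one UUID per line.
    tr ',' '\n' < "$vdi_uuids_file" | sed '/^$/d' > "$known"

    # Strip the VHD- prefix (non-VHD LVs such as MGT are skipped),
    # then drop every UUID that appears as a whole line in the known list.
    sed -n 's/^VHD-//p' "$lv_names_file" | grep -vxFf "$known"

    rm -f "$known"
}

# Example of how the input files might be produced on a pool member:
#   lvs --noheadings -o lv_name VG_XenStorage-<sr-uuid> | tr -d ' ' > /tmp/lv_names
#   xe vdi-list sr-uuid=<sr-uuid> params=uuid --minimal > /tmp/vdi_uuids
#   find_orphan_uuids /tmp/lv_names /tmp/vdi_uuids
```

Working from two snapshots keeps the comparison to a single xe call, at the cost of the lists possibly being slightly stale if VDIs are created or deleted in between.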
Grabbing one of those UUIDs at random and searching back through xensource.log turned up something strange:
[20121113T09:05:32.654Z|debug|xcp-nc-bc1b8|1563388 inet-RPC|SR.scan R:b7ff8ccc6566|dispatcher] Server_helpers.exec exception_handler: Got exception SR_BACKEND_FAILURE_181: [ ; Error in Metadata volume operation for SR. [opterr=VDI delete operation failed for parameters: /dev/VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a/MGT, c866d910-f52f-4b16-91be-f7c646c621a5. Error: Failed to read file with params [3, 0, 512, 512]. Error: Input/output error]; ]
After a little googling I found a thread on the Citrix forums (http://forums.citrix.com/thread.jspa?threadID=299275) that pointed me at a process for rebuilding the metadata for that specific SR without having to blow away the SR and start fresh.
Commands:
lvrename /dev/VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a/MGT /dev/VG_XenStorage-aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a/OLDMGT
xe sr-scan uuid=aa15042e-2cdd-5ebc-9f0e-3d189c5cb56a
This got rid of the SR_BACKEND_FAILURE errors, but the orphaned LVs were still there. Looking in the SMlog, I started seeing lines saying that the pool was not ready and that the process was exiting:
<25168> 2012-11-14 12:27:24.195463 Pool is not ready, exiting
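A quick way to see how often this is happening is to count the matching lines in the log. The helper below is a hypothetical sketch; the function name is made up, and the log path in the usage comment assumes the usual /var/log/SMlog location.

```shell
# count_pool_not_ready LOGFILE
# Prints how many lines in LOGFILE contain the "Pool is not ready" message.
count_pool_not_ready() {
    grep -c 'Pool is not ready' "$1"
}

# Example (path is an assumption):
#   count_pool_not_ready /var/log/SMlog
```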
At this point I manually forced the offline node out of the pool, after which the SMlog reported that the purge process succeeded:
xe host-forget uuid=<down host>