Xen project Mailing List

Re: [Xen-users] how to start VMs in a particular order

Date: Mon, 30 Jun 2014 13:43:16 +0200

Delivery-date: Mon, 30 Jun 2014 11:57:03 +0000

List-id: Xen user discussion <xen-users.lists.xen.org>

Mail-followup-to: xen-users@xxxxxxxxxxxxx

Joost Roeleveld <joost@xxxxxxxxxxxx> writes:

> On Sunday 29 June 2014 17:35:17 lee wrote:
>> "J. Roeleveld" <joost@xxxxxxxxxxxx> writes:
>> > Try to read the SMART-values of the disk.
>>
>> I'm not sure how to do that, and what would they tell me?
>
> Either connect the disks directly to a sata port on a mainboard (normal
> desktop would suffice). Disabling the raid-functionality of the card might
> also suffice.
> Then use (assuming the disk is /dev/sda)
> # smartctl --all /dev/sda

IIRC, there is a way to somehow display the smart info, probably with
arcconf.  I'd rather not use that if it might cause problems.
Connecting them to SATA ports would be going to lengths.

In any case, I'd get some numbers that won't tell me anything, and that
three disks would suddenly go bad only because they are connected to a
different controller seems very unlikely.

>> I know, they aren't suited for this purpose. Yet they have been working
>> fine on the P800, and that three disks should decide to go bad in a way
>> that blocks the controller (or whatever happens) every now and then
>> seems unlikely.
>
> No, it doesn't.

Why not?  These disks might never work with this controller.  That
doesn't mean that they have gone bad.

> Does the error occur after the server has been idle for a while? Or when the
> disks are being stressed?

I haven't seen any relation between disk usage and crashes.  There seem
to have been different reasons for crashing, i.e. first it would crash
with "swiotlb is full", then with "arcconf seems to hang" and now with
"scsi bus hanging?".

I upgraded the kernel with one from Debian backports, then a couple of
days later there was another kernel upgrade when I removed the status
checking.  So it crashed with "scsi bus hanging?", and I changed the PHY
setting of the controller again:

The controller has PHY settings, ranging from 0--5, which can be changed
for each disk individually.
They were all on 5 to begin with, and the controller had trouble
detecting the SATA disks on 5.  I changed them all to 0 because it's the
default, and the docs say that's supposed to work best.  Since then, it
hasn't had problems detecting the disks.

It still crashed with PHY on 1, and I'm on 2 now.  It hasn't crashed in
over a day yet [knocks on wood].  If it works now, I'll leave it at 2;
if it crashes again, I'll increase to 3 ...

Apparently this PHY setting is at the lowest level of the SATA protocol
and has something to do with how the link between the devices is
established.

So what happens when the link between a disk and the controller suddenly
goes down and cannot be re-established?  I'd expect the controller to
handle that gracefully, especially since it's hot-plug capable with SAS
drives.  Perhaps it blocks, trying to re-establish the link because the
disk is still present, and is unsuccessful until rebooted.

> If the former, then you need to figure out how to AVOID the disks to enter
> powersaving mode. It takes time for the disks to spin up again afterwards.
> The raid controller is timing out on access to the disks.
>
> If the latter, then you might have issues on the drives themselves which the
> drives are trying to solve themselves.
>
> My guess is that it is the former. (eg. when the server has been idle for a
> while)

It's been idle overnight and didn't crash.  Since the disks are
data-only, there isn't anything accessing them unless I do something
with the data.

If it was powersaving causing problems, chances are that I'd have had
problems with it before.  But then, it seems that an SATA link goes down
or can go down when a disk saves power.  So you might be right: disk
goes to sleep, controller cannot re-establish link because of PHY
settings, and then things hang.

Is it even possible to disable the power management of WD20EARS in such
a way that the SATA link remains up at all times?  I never did anything
about power management with these disks.

>> So I think it's more likely an incompatibility of these disks with the
>> ServeRaid controller than the disks being bad, and I'd have to replace
>> all of them. Or this controller just sucks.
>
> Yep, incompatibility. Not necessarily with these disks, but with the
> powersaving settings in the disks firmware. I believe there are tools
> available you could use to adjust those settings. But I have no experience
> with them and you need to connect the disks directly to a standard sata
> port and use ms windows. (As I think those are ms windows tools)

Hm, I don't have windoze.  And won't the settings be lost once the
computer/disk is turned off?

>> IBM has supposedly fixed such issues with firmware updates, and
>> I updated everything I could even before installing the disks.
>
> Check the settings on the raid card for powersaving/spindown/powerup
> timeouts/....

IIRC, arcconf said power management for the disks is disabled, and I
think the controller might have spin-up settings to spin up the disks
one after the other when booting.

For now, I don't want to touch anything and see if it crashes again.  If
it does, I'll see what I can find out about power management.  That pm
causes problems seems to make the most sense now.

> You could try changing the raid controller?

Maybe, over time, if I can get one that fits and which doesn't have the
2TB limit.  I'd have to connect it somehow to the drive enclosure.  The
P800 is a rather big card, and even if I can plug it into the server,
how would I connect it?

>> So there I'm stuck :( The plan was to have my data on the server.
>> Perhaps I'll have to declare the experiment as failed and sell the
>> server.
>
> Not necessarily, but I would advise against using green drives in a server
> when using hardware raid cards.

I'd advise against that, too.  None of this was planned when I bought
the WD20EARS; they were bought to be used with software raid.  Suitable
disks would have cost 2.5 times as much.

>> I could probably run the disks as JBOD. If they are incompatible with
>> the controller, that won't help.
>
> Try putting the disks through individually to the OS. Then use Linux software
> raid (mdadm) to do the RAID. That should work better as the RAID-software on
> the card won't end up with timeout issues after powersaving kicks in.

If it was merely timing issues, the controller should, at worst, fail
the disk, shouldn't it?  If it's issues with the SATA link going away
and not coming back, the problem would persist with JBOD.

I might try JBOD, though, because I'm tempted to switch to ZFS.  But
first the hardware needs to be stable.

>> Perhaps the controller is broken. Or it's something that xen does.
>
> Xen has nothing to do with this.
> Most likely: raid-controller <-> disks incompatibility.

Well, "swiotlb is full" looks more like kernel/xen than anything else.
Or it's a symptom caused by an underlying problem like disk
incompatibility.  I'm still undecided about whether this is the kind of
problem that has multiple causes or not.

>> I wish it was a feature of xen --- that would make sense, but how would
>> xen know when a VM is fully up ...
>
> It can, actually.
>
> If you have client-utilities running inside the VM, those can check easily
> when the VM is fully booted. (put those to start last, for instance)
>
> Then those utilities use the xen-api to inform the host.
> Read up on xenfs, it is usable to communicate between the guest and the host.

Client utilities?  Xenfs?  Hmmm ...  Why don't ppl use that?

--
Knowledge is volatile and fluid.  Software is power.
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
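
The guest-to-host signalling suggested in the quoted reply (a utility
inside the VM tells the host when the VM is fully booted) could be
sketched roughly as below.  This is only an illustration under several
assumptions that are not from the thread: the xl toolstack is in use,
the last boot script in each guest runs "xenstore-write data/ready 1",
the guest is allowed to write to that path, and the key name
"data/ready", the VM names and the config paths are all made up.

```python
import subprocess
import time


def xenstore_ready(domid, run=subprocess.run, key="data/ready"):
    """Return True once the guest has written "1" to its ready key.

    The key name is hypothetical; the guest is assumed to set it from
    its last boot script.
    """
    path = f"/local/domain/{domid}/{key}"
    res = run(["xenstore-read", path], capture_output=True, text=True)
    return res.returncode == 0 and res.stdout.strip() == "1"


def start_in_order(vms, run=subprocess.run, timeout=300.0, poll=2.0,
                   sleep=time.sleep):
    """Start each (name, config_path) pair, waiting for the ready flag.

    The next VM is created only after the previous one reported ready.
    Raises TimeoutError if a guest never sets its flag.
    """
    started = []
    for name, cfg in vms:
        run(["xl", "create", cfg], check=True)
        res = run(["xl", "domid", name], capture_output=True, text=True,
                  check=True)
        domid = int(res.stdout.strip())
        waited = 0.0
        while not xenstore_ready(domid, run=run):
            if waited >= timeout:
                raise TimeoutError(
                    f"{name} did not report ready within {timeout}s")
            sleep(poll)
            waited += poll
        started.append(name)
    return started
```

On the host this might be called as
start_in_order([("db", "/etc/xen/db.cfg"), ("web", "/etc/xen/web.cfg")]).
The run and sleep parameters exist only so the logic can be tried
without a Xen host; polling is used for simplicity, and a watch on the
key would avoid the delay between polls.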
