Re: [win-pv-devel] Problems with xenvbd
On Fri, 4 Sep 2015, Paul Durrant wrote:
> > -----Original Message-----
> > From: Stefano Stabellini [mailto:stefano.stabellini@xxxxxxxxxxxxx]
> > Sent: 04 September 2015 17:25
> > To: Paul Durrant
> > Cc: Fabio Fantoni; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx; Stefano Stabellini
> > Subject: RE: [win-pv-devel] Problems with xenvbd
> >
> > On Fri, 4 Sep 2015, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx [mailto:win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx] On Behalf Of Paul Durrant
> > > > Sent: 02 September 2015 10:00
> > > > To: Fabio Fantoni; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > Cc: Stefano Stabellini
> > > > Subject: Re: [win-pv-devel] Problems with xenvbd
> > > >
> > > > > -----Original Message-----
> > > > > From: Fabio Fantoni [mailto:fabio.fantoni@xxxxxxx]
> > > > > Sent: 02 September 2015 09:54
> > > > > To: Paul Durrant; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > Cc: Stefano Stabellini
> > > > > Subject: Re: [win-pv-devel] Problems with xenvbd
> > > > >
> > > > > Il 01/09/2015 16:41, Paul Durrant ha scritto:
> > > > > >> -----Original Message-----
> > > > > >> From: Fabio Fantoni [mailto:fabio.fantoni@xxxxxxx]
> > > > > >> Sent: 21 August 2015 14:14
> > > > > >> To: Rafał Wojdyła; Paul Durrant; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > >> Subject: Re: [win-pv-devel] Problems with xenvbd
> > > > > >>
> > > > > >> Il 21/08/2015 10:12, Fabio Fantoni ha scritto:
> > > > > >>> Il 21/08/2015 00:03, Rafał Wojdyła ha scritto:
> > > > > >>>> On 2015-08-19 23:25, Paul Durrant wrote:
> > > > > >>>>>> -----Original Message-----
> > > > > >>>>>> From: win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx [mailto:win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx] On Behalf Of Rafal Wojdyla
> > > > > >>>>>> Sent: 18 August 2015 14:33
> > > > > >>>>>> To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > >>>>>> Subject: [win-pv-devel] Problems with xenvbd
> > > > > >>>>>>
> > > > > >>>>>> Hi,
> > > > > >>>>>>
> > > > > >>>>>> I've been testing the current pvdrivers code in preparation for creating upstream patches for my xeniface additions, and I noticed that xenvbd seems to be very unstable for me. I'm not sure if it's a problem with xenvbd itself or with my code, because it only seemed to manifest when the full suite of our guest tools was installed along with xenvbd. In short, most of the time the system crashed with kernel memory corruption in seemingly random processes shortly after start. Driver Verifier didn't seem to catch anything. You can see a log from one such crash in the attachment crash1.txt.
> > > > > >>>>>>
> > > > > >>>>>> Today I tried to perform some more tests, but this time without our guest tools (only pvdrivers and our shared libraries were installed). To my surprise, Driver Verifier was now crashing the system every time in xenvbd (see crash2.txt). I don't know why it didn't catch that previously... If adding some timeout to the offending wait doesn't break anything, I'll try that to see if I can reproduce the previous memory corruptions.
> > > > > >>>>>>
> > > > > >>>>> Those crashes do look odd.
> > > > > >>>>> I'm on PTO for the next week but I'll have a look when I get back to the office. I did run Verifier on all the drivers a week or so back (while running vbd plug/unplug tests), but there have been a couple of changes since then.
> > > > > >>>>>
> > > > > >>>>> Paul
> > > > > >>>>>
> > > > > >>>> No problem. I attached some more logs. The last one was during system shutdown; after that the OS failed to boot (probably a corrupted filesystem, since the BSOD itself seemed to indicate that). I think every time there is a BLKIF_RSP_ERROR somewhere, but I'm not yet familiar with the Xen PV device interfaces, so I'm not sure what that means.
> > > > > >>>>
> > > > > >>>> In the meantime I've run more tests on my modified xeniface driver to make sure it's not contributing to these issues, but everything seemed to be fine there.
> > > > > >>>>
> > > > > >>> I also had disk corruption, on Windows 10 Pro 64-bit with the PV drivers build of 11 August, but I'm not sure it is related to the winpv drivers: on the same domU I had also started testing snapshots with a qcow2 disk overlay. For this case I don't have useful information, because Windows didn't try to boot at all, but if it happens again I'll try to collect more useful details.
> > > > > >> It happened another time, and again I was unable to determine the exact cause. Windows seemed fine and did a clean shutdown, but on the next boot SeaBIOS found no bootable disk, and the qemu log shows nothing useful. qemu-img check shows errors:
> > > > > >>> /usr/lib/xen/bin/qemu-img check W10.disk1.cow-sn1
> > > > > >>> ERROR cluster 143 refcount=1 reference=2
> > > > > >>> Leaked cluster 1077 refcount=1 reference=0
> > > > > >>> ERROR cluster 1221 refcount=1 reference=2
> > > > > >>> Leaked cluster 2703 refcount=1 reference=0
> > > > > >>> Leaked cluster 5212 refcount=1 reference=0
> > > > > >>> Leaked cluster 13375 refcount=1 reference=0
> > > > > >>>
> > > > > >>> 2 errors were found on the image.
> > > > > >>> Data may be corrupted, or further writes to the image may corrupt it.
> > > > > >>>
> > > > > >>> 4 leaked clusters were found on the image.
> > > > > >>> This means waste of disk space, but no harm to data.
> > > > > >>> 27853/819200 = 3.40% allocated, 22.65% fragmented, 0.00% compressed clusters
> > > > > >>> Image end offset: 1850736640
> > > > > >> I created it with:
> > > > > >> /usr/lib/xen/bin/qemu-img create -o backing_file=W10.disk1.xm,backing_fmt=raw -f qcow2 W10.disk1.cow-sn1
> > > > > >> and changed the xl domU configuration:
> > > > > >> disk=['/mnt/vm2/W10.disk1.cow-sn1,qcow2,hda,rw',...
> > > > > >> Dom0 runs xen 4.6-rc1 and qemu 2.4.0.
> > > > > >> DomU is Windows 10 Pro 64-bit with the PV drivers build of 11 August.
> > > > > >>
> > > > > >> How can I know for sure whether this is a winpv problem, a qemu problem, or something else, and collect useful information to report?
> > > > > >>
> > > > > >> Thanks for any reply, and sorry for my bad English.
> > > > > > This sounds very much like a lack of synchronization somewhere. I recall seeing other problems of this ilk when someone was messing around with O_DIRECT for opening images. I wonder if we are missing a flush operation on shutdown.
> > > > > >
> > > > > > Paul
> > > > > >
> > > > > Thanks for the reply.
> > > > > I did a fast search but did not find O_DIRECT by grepping in libxl; I found it only in the qemu code. I then looked at the patch that appears to have added the setting for xen:
> > > > > http://git.qemu.org/?p=qemu.git;a=commitdiff;h=454ae734f1d9f591345fa78376435a8e74bb4edd
> > > > > In libxl it seems disabled by default, and some old xen posts suggest that O_DIRECT creates problems.
> > > > > Should I try enabling direct-io-safe on the domUs' qcow2 disks?
> > > > > I have also added Stefano Stabellini as cc.
> > > > > @Stefano Stabellini: What is the currently known status of direct-io-safe?
> >
> > O_DIRECT should be entirely safe to use, at least on ide and qdisk. I haven't done the analysis on ahci emulation in qemu to know whether that would be true for ahci disks, but that doesn't matter because unplug is not implemented for ahci disks.
> >
> > > > > Sorry if the question is stupid or my English is too bad, but many posts from recent years are confusing, and in some cases seem contradictory, about the stability, integrity and performance of using it or not. In particular, it seems to crash with some kernels, but I could not work out exactly which versions and/or which patches.
> > > > >
> > > > > @Paul Durrant: have you seen my other mail, where I wrote that in my latest tests with xen 4.6 without the udev file, windows domUs with the new pv drivers don't boot, and to make them boot correctly I must re-add the udev file? Could this cause unexpected behaviour related to this problem, or is it something different?
> > > > > http://lists.xen.org/archives/html/win-pv-devel/2015-08/msg00033.html
> > > > >
> > > > I'm not sure why udev would be an issue here. The problem you have appears to be QEMU ignoring the request to unplug emulated disks. I've not seen this behaviour on my test box so I'll need to dig some more.
> > > >
> > > I notice you have 6 IDE channels? Are you using AHCI by any chance? If you are, then it looks like QEMU is not honouring the unplug request... that would be where the bug is. I'll try to repro myself.
> >
> > Unplug on ahci is actually unimplemented, see hw/i386/xen/xen_platform.c:
> >
> > static void unplug_disks(PCIBus *b, PCIDevice *d, void *o)
> > {
> >     /* We have to ignore passthrough devices */
> >     if (pci_get_word(d->config + PCI_CLASS_DEVICE) ==
> >             PCI_CLASS_STORAGE_IDE
> >         && strcmp(d->name, "xen-pci-passthrough") != 0) {
> >         pci_piix3_xen_ide_unplug(DEVICE(d));
> >     }
> > }
> >
> > The function specifically only unplugs IDE disks.
> > I am not sure what to do about ahci unplug, given that we don't implement scsi disk unplug either. After all, if the goal is to unplug the disk, why choose a faster emulated protocol?
>
> I think we should unplug the disk regardless of type, if we support configuring disks of that type through libxl. The reason, in this case AFAIU, for wanting ahci is to speed up Windows boot, where the initial driver load is still done through int13 and hence an emulated disk.
I would be happy to take a patch which makes QEMU unplug all kinds of disks, as long as it is able to skip passed-through devices (see the comment in the code).
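For anyone who wants to follow up on Fabio's direct-io-safe question, the flag is set per disk in the xl disk specification (see the xl-disk-configuration documentation). A minimal sketch based on Fabio's disk line above; placing the flag after the positional parameters is my reading of the docs and is untested here:

    disk=['/mnt/vm2/W10.disk1.cow-sn1,qcow2,hda,rw,direct-io-safe']

Per the documentation, this declares that direct I/O is safe for this image, which lets the backend open it with O_DIRECT instead of applying the default cache workaround.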
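To make the patch discussed above concrete, here is a rough sketch of how unplug_disks() in hw/i386/xen/xen_platform.c could skip passed-through devices first and then dispatch on the PCI storage class. This is illustrative only, not a submitted patch: of the helpers named, only pci_piix3_xen_ide_unplug() exists in QEMU today; the AHCI and SCSI branches are placeholders for helpers that would still have to be written. PCI_CLASS_STORAGE_SATA and PCI_CLASS_STORAGE_SCSI come from QEMU's include/hw/pci/pci_ids.h.

    /* Sketch only: unplug any emulated storage device, still skipping
     * passed-through devices. The AHCI/SCSI branches are placeholders. */
    static void unplug_disks(PCIBus *b, PCIDevice *d, void *o)
    {
        /* We have to ignore passthrough devices */
        if (strcmp(d->name, "xen-pci-passthrough") == 0) {
            return;
        }

        switch (pci_get_word(d->config + PCI_CLASS_DEVICE)) {
        case PCI_CLASS_STORAGE_IDE:
            pci_piix3_xen_ide_unplug(DEVICE(d));
            break;
        case PCI_CLASS_STORAGE_SATA:
            /* hypothetical: an AHCI equivalent of the IDE unplug helper */
            break;
        case PCI_CLASS_STORAGE_SCSI:
            /* hypothetical: likewise for an emulated SCSI controller */
            break;
        default:
            break;
        }
    }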