
Re: [win-pv-devel] Problems with xenvbd





2015-09-04 18:31 GMT+02:00 Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>:
On Fri, 4 Sep 2015, Paul Durrant wrote:
> > -----Original Message-----
> > From: Stefano Stabellini [mailto:stefano.stabellini@xxxxxxxxxxxxx]
> > Sent: 04 September 2015 17:25
> > To: Paul Durrant
> > Cc: Fabio Fantoni; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx; Stefano Stabellini
> > Subject: RE: [win-pv-devel] Problems with xenvbd
> >
> > On Fri, 4 Sep 2015, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx [mailto:win-pv-devel-
> > > > bounces@xxxxxxxxxxxxxxxxxxxx] On Behalf Of Paul Durrant
> > > > Sent: 02 September 2015 10:00
> > > > To: Fabio Fantoni; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > Cc: Stefano Stabellini
> > > > Subject: Re: [win-pv-devel] Problems with xenvbd
> > > >
> > > > > -----Original Message-----
> > > > > From: Fabio Fantoni [mailto:fabio.fantoni@xxxxxxx]
> > > > > Sent: 02 September 2015 09:54
> > > > > To: Paul Durrant; Rafał Wojdyła; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > Cc: Stefano Stabellini
> > > > > Subject: Re: [win-pv-devel] Problems with xenvbd
> > > > >
> > > > > On 01/09/2015 16:41, Paul Durrant wrote:
> > > > > >> -----Original Message-----
> > > > > >> From: Fabio Fantoni [mailto:fabio.fantoni@xxxxxxx]
> > > > > >> Sent: 21 August 2015 14:14
> > > > > >> To: Rafał Wojdyła; Paul Durrant; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > >> Subject: Re: [win-pv-devel] Problems with xenvbd
> > > > > >>
> > > > > >> On 21/08/2015 10:12, Fabio Fantoni wrote:
> > > > > >>> On 21/08/2015 00:03, Rafał Wojdyła wrote:
> > > > > >>>> On 2015-08-19 23:25, Paul Durrant wrote:
> > > > > >>>>>> -----Original Message-----
> > > > > >>>>>> From: win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx [mailto:win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx] On Behalf Of Rafal Wojdyla
> > > > > >>>>>> Sent: 18 August 2015 14:33
> > > > > >>>>>> To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx
> > > > > >>>>>> Subject: [win-pv-devel] Problems with xenvbd
> > > > > >>>>>>
> > > > > >>>>>> Hi,
> > > > > >>>>>>
> > > > > >>>>>> I've been testing the current pvdrivers code in preparation for
> > > > > >>>>>> creating upstream patches for my xeniface additions and I noticed
> > > > > >>>>>> that xenvbd seems to be very unstable for me. I'm not sure if it's
> > > > > >>>>>> a problem with xenvbd itself or my code because it seemed to only
> > > > > >>>>>> manifest when the full suite of our guest tools was installed along
> > > > > >>>>>> with xenvbd. In short, most of the time the system crashed with
> > > > > >>>>>> kernel memory corruption in seemingly random processes shortly
> > > > > >>>>>> after start. Driver Verifier didn't seem to catch anything. You can
> > > > > >>>>>> see a log from one such crash in the attachment crash1.txt.
> > > > > >>>>>>
> > > > > >>>>>> Today I tried to perform some more tests but this time without our
> > > > > >>>>>> guest tools (only pvdrivers and our shared libraries were
> > > > > >>>>>> installed). To my surprise now Driver Verifier was crashing the
> > > > > >>>>>> system every time in xenvbd (see crash2.txt). I don't know why it
> > > > > >>>>>> didn't catch that previously... If adding some timeout to the
> > > > > >>>>>> offending wait doesn't break anything I'll try that to see if I can
> > > > > >>>>>> reproduce the previous memory corruptions.
> > > > > >>>>>>
> > > > > >>>>> Those crashes do look odd. I'm on PTO for the next week but I'll
> > > > > >>>>> have a look when I get back to the office. I did run verifier on
> > > > > >>>>> all the drivers a week or so back (while running vbd plug/unplug
> > > > > >>>>> tests) but there have been a couple of changes since then.
> > > > > >>>>>
> > > > > >>>>> Paul
> > > > > >>>>>
> > > > > >>>> No problem. I attached some more logs. The last one was during
> > > > > >>>> system shutdown, after that the OS failed to boot (probably
> > > > > >>>> corrupted filesystem since the BSOD itself seemed to indicate
> > > > > >>>> that). I think every time there is a BLKIF_RSP_ERROR somewhere but
> > > > > >>>> I'm not yet familiar with Xen PV device interfaces so not sure what
> > > > > >>>> that means.
> > > > > >>>>
> > > > > >>>> In the meantime I've run more tests on my modified xeniface driver
> > > > > >>>> to make sure it's not contributing to these issues but everything
> > > > > >>>> seemed to be fine there.
> > > > > >>>>
> > > > > >>>>
> > > > > >>> I also had a disk corruption on Windows 10 Pro 64-bit with the PV
> > > > > >>> drivers build of 11 August, but I'm not sure it is related to the
> > > > > >>> winpv drivers: on the same domU I had also started testing snapshots
> > > > > >>> with a qcow2 disk overlay. For this case I don't have useful
> > > > > >>> information because Windows does not try to boot at all, but if it
> > > > > >>> happens again I'll try to collect more useful information.
> > > > > >> It happened another time, but again I was unable to understand what
> > > > > >> exactly the cause is. On Windows reboot everything seemed ok and it
> > > > > >> did a clean shutdown, but on the next boot SeaBIOS did not find a
> > > > > >> bootable disk and the qemu log does not show useful information.
> > > > > >> qemu-img check shows errors:
> > > > > >>> /usr/lib/xen/bin/qemu-img check W10.disk1.cow-sn1
> > > > > >>> ERROR cluster 143 refcount=1 reference=2
> > > > > >>> Leaked cluster 1077 refcount=1 reference=0
> > > > > >>> ERROR cluster 1221 refcount=1 reference=2
> > > > > >>> Leaked cluster 2703 refcount=1 reference=0
> > > > > >>> Leaked cluster 5212 refcount=1 reference=0
> > > > > >>> Leaked cluster 13375 refcount=1 reference=0
> > > > > >>>
> > > > > >>> 2 errors were found on the image.
> > > > > >>> Data may be corrupted, or further writes to the image may corrupt it.
> > > > > >>>
> > > > > >>> 4 leaked clusters were found on the image.
> > > > > >>> This means waste of disk space, but no harm to data.
> > > > > >>> 27853/819200 = 3.40% allocated, 22.65% fragmented, 0.00% compressed clusters
> > > > > >>> Image end offset: 1850736640
> > > > > >> I created it with:
> > > > > >> /usr/lib/xen/bin/qemu-img create -o backing_file=W10.disk1.xm,backing_fmt=raw -f qcow2 W10.disk1.cow-sn1
> > > > > >> and changed the xl domU configuration:
> > > > > >> disk=['/mnt/vm2/W10.disk1.cow-sn1,qcow2,hda,rw',...
> > > > > >> Dom0 runs xen 4.6-rc1 and qemu 2.4.0.
> > > > > >> DomU is Windows 10 Pro 64-bit with the PV drivers build of 11 August.
> > > > > >>
> > > > > >> How can I know for sure whether it is a winpv, qemu or other problem,
> > > > > >> and collect useful information to report?
> > > > > >>
> > > > > >> Thanks for any reply, and sorry for my bad English.
> > > > > > This sounds very much like a lack of synchronization somewhere. I
> > > > > > recall seeing other problems of this ilk when someone was messing
> > > > > > around with O_DIRECT for opening images. I wonder if we are missing a
> > > > > > flush operation on shutdown.
> > > > > >
> > > > > >    Paul
> > > > > >
> > > > > Thanks for the reply.
> > > > > I did a quick search but did not find O_DIRECT by grepping in libxl;
> > > > > I found it only in the qemu code.
> > > > > Then I tried the patch that seems to have added support for it for xen:
> > > > > http://git.qemu.org/?p=qemu.git;a=commitdiff;h=454ae734f1d9f591345fa78376435a8e74bb4edd
> > > > > Checking in libxl, it seems disabled by default, and from some old xen
> > > > > posts it seems that O_DIRECT creates problems.
> > > > > Should I try enabling direct-io-safe on the domU's qcow2 disks?
> > > > > I have also added Stefano Stabellini as cc.
> > > > > @Stefano Stabellini: what is the currently known status and result of
> > > > > direct-io-safe?
> >
> > O_DIRECT should be entirely safe to use, at least on ide and qdisk. I
> > haven't done the analysis on ahci emulation in qemu to know whether that
> > would be true for ahci disks, but that doesn't matter because unplug is
> > not implemented for ahci disks.
> >
> >
> > > > > Sorry if the question is stupid or my English is too bad, but many
> > > > > posts of recent years are confusing and in some cases also seem
> > > > > contradictory about stability/integrity/performance with it enabled or
> > > > > not. In particular it seems to crash with some kernels, but I do not
> > > > > understand exactly which versions and/or with which patches.
> > > > >
> > > > > @Paul Durrant: have you seen my other mail where I wrote that, based on
> > > > > my latest tests with xen 4.6 without the udev file, Windows domUs with
> > > > > the new PV drivers don't boot, and to make them boot correctly I must
> > > > > re-add the udev file? Could this cause unexpected behaviour related to
> > > > > this problem, or is it something different?
> > > > > http://lists.xen.org/archives/html/win-pv-devel/2015-08/msg00033.html
> > > > >
> > > >
> > > > I'm not sure why udev would be an issue here. The problem you have
> > > > appears to be QEMU ignoring the request to unplug emulated disks. I've
> > > > not seen this behaviour on my test box so I'll need to dig some more.
> > > >
> > >
> > > I notice you have 6 IDE channels? Are you using AHCI by any chance? If
> > > you are then it looks like QEMU is not honouring the unplug request...
> > > that would be where the bug is. I'll try to repro myself.
> >
> > Unplug on ahci is actually unimplemented, see hw/i386/xen/xen_platform.c:
> >
> > static void unplug_disks(PCIBus *b, PCIDevice *d, void *o)
> > {
> >     /* We have to ignore passthrough devices */
> >     if (pci_get_word(d->config + PCI_CLASS_DEVICE) ==
> >             PCI_CLASS_STORAGE_IDE
> >             && strcmp(d->name, "xen-pci-passthrough") != 0) {
> >         pci_piix3_xen_ide_unplug(DEVICE(d));
> >     }
> > }
> >
> > the function specifically only unplugs IDE disks.
> > I am not sure what to do about ahci unplug, given that we don't
> > implement scsi disk unplug either. After all, if the goal is to unplug
> > the disk, why choose a faster emulated protocol?
>
> I think we should unplug the disk regardless of type, if we support configuring disks of that type through libxl. The reason, in this case AFAIU, for wanting ahci is to speed up Windows boot where initial driver load is still done through int13 and hence emulated disk.

I would be happy to take a patch which makes QEMU unplug all kinds of
disks, as long as it is able to skip passed-through devices (see the
comment in the code).
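
As a rough illustration of the kind of change being discussed (not an
actual implementation), here is a minimal sketch of how unplug_disks()
could be generalised, assuming it stays in xen_platform.c. The IDE branch
(pci_piix3_xen_ide_unplug) is the existing code path quoted above; the
AHCI and SCSI helpers named below are hypothetical placeholders that
would still need to be written.

static void unplug_disks(PCIBus *b, PCIDevice *d, void *o)
{
    uint16_t class_id = pci_get_word(d->config + PCI_CLASS_DEVICE);

    /* We have to ignore passthrough devices */
    if (strcmp(d->name, "xen-pci-passthrough") == 0) {
        return;
    }

    switch (class_id) {
    case PCI_CLASS_STORAGE_IDE:
        /* existing behaviour, kept as-is */
        pci_piix3_xen_ide_unplug(DEVICE(d));
        break;
    case PCI_CLASS_STORAGE_SATA:
        /* hypothetical helper: would detach the block backends behind
         * the emulated AHCI controller's ports */
        xen_ahci_unplug(DEVICE(d));
        break;
    case PCI_CLASS_STORAGE_SCSI:
        /* hypothetical helper: same idea for emulated SCSI disks */
        xen_scsi_disk_unplug(DEVICE(d));
        break;
    default:
        break;
    }
}

Checking the passthrough name up front keeps the skip requirement in one
place, so every controller type honours it rather than only the IDE branch.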

Faster boot is useful not only on Windows domUs but also on Linux HVM domUs; for example, if I remember correctly, on Fedora 21 with LXDE booting with AHCI took only about 20% of the time of the IDE test.

I have been testing AHCI for months and it seems strange that I never noticed unplug not working. In the log from my previous mail, where the domU booted correctly, can you take a look at whether the unplug happened correctly? (It should have been with AHCI, if I'm not wrong.)

In any case, any possible improvement/bugfix is appreciated.

About direct-io-safe: I tried it today with a qcow2 disk and dom0 instantly rebooted when Windows booted. The dom0 kernel is 3.2 from the official wheezy repository, and there was nothing useful in the dom0 or domU logs. I have not had time to retry with a full dom0 and serial-over-LAN on dom0, but if it is needed I'll do it when possible.
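
For reference, direct-io-safe is a per-disk flag in the xl disk specification; based on the disk line quoted earlier, the configuration presumably tried would look something like the line below (syntax per xl-disk-configuration; this is a hedged sketch, not verified here):

disk = [ '/mnt/vm2/W10.disk1.cow-sn1,qcow2,hda,rw,direct-io-safe' ]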
_______________________________________________
win-pv-devel mailing list
win-pv-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/win-pv-devel

 

