[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [Qemu-devel] Question about xen disk unplug support for ahci missed in qemu



CC'ing Kevin O'Connor

On 10/14/15 13:27, Ian Campbell wrote:
> On Wed, 2015-10-14 at 12:06 +0100, Stefano Stabellini wrote:
>>> Can't you just teach SeaBIOS how to deal with your PV disks and then
>>> only add that to your VM and forget about IDE/AHCI? I mean, that's how
>>> it's done for virtio-blk, and it doesn't involve any insanities like
>>> ripping out non-hotpluggable devices.
>>
>> Teaching SeaBIOS to deal with PV disks can be done, in fact we already
>> support PV disks in OVMF. It is possible to boot Windows with OVMF
>> without any IDE disks (patch pending for libxl to create a VM without
>> emulated IDE disks).
> 
> One stumbling block in the past has been how to know when the PV drivers in
> the BIOS are no longer required, such that the ring can be torn down and/or
> the connection etc handed over to the OS driver.
> 
> I think we deal with this in OVMF using ExitBootServices? (TBH I'm not sure 
> how).

Search "XenBusDxe/XenBusDxe.c" in edk2 for "EVT_SIGNAL_EXIT_BOOT_SERVICES".

TBH, the code in NotifyExitBoot() doesn't seem valid. If you check the
UEFI spec (for example, v2.5, but the requirement I'm about to quote is
very old), in the specification of EFI_BOOT_SERVICES.CreateEvent() you find:

EVT_SIGNAL_EXIT_BOOT_SERVICES

    This event is to be notified by the system when ExitBootServices()
    is invoked. This event is of type EVT_NOTIFY_SIGNAL and should not
    be combined with any other event types. The notification function
    for this event is not allowed to use the Memory Allocation
    Services, or call any functions that use the Memory Allocation
    Services and must only call functions that are known not to use
    Memory Allocation Services, because these services modify the
    current memory map.The notification function must not depend on
    timer events since timer services will be deactivated before any
    notification functions are called.

NotifyExitBoot() in "XenBusDxe/XenBusDxe.c" calls the
DisconnectController() boot service. That in turn leads to calls to
EFI_DRIVER_BINDING_PROTOCOL.Stop() functions (speaking generally), which
inevitably free memory as part of unbinding the device, thereby breaking
the above requirement.

The right solution is the following:
- when a driver binds a device (a "handle"), a piece of the resources
  allocated for that binding should be a new event, to be signaled at
  ExitBootServices() time. The handler function can be shared by all
  such devices. The context passed to the handler should be the (driver-
  specific) structure that represents the binding and the state of the
  device in general.

- When the driver unbinds the device, the event should be closed. This
  will automatically unregister the callback as well.

- Now, when the callback is entered at all, you can be sure that the
  binding still exists. In this case, you should probe into the various
  fields of the context (the device state, practically), to figure out
  if this device "lives" or is dormant. For simpler devices, the answer
  is always "alive", but some devices could have configuration states
  where they are bound yet not configured (using no hw resources etc).

- In case the device is alive, the action to take is to make it abort
  any in-flight transfers or other operations, and re-set / deconfigure
  it *without* touching any memory allocations.

You can see this in the virtio-net driver in OVMF. In the
"OvmfPkg/VirtioNetDxe" directory, see the "TechNotes.txt",
"DriverBinding.c" and "Events.c" files. The callback function is
VirtioNetExitBoot(), and the event registration / deregistration happens in:

VirtioNetDriverBindingStart()
  VirtioNetSnpPopulate()
    gBS->CreateEvent(EVT_SIGNAL_EXIT_BOOT_SERVICES)

vs.

VirtioNetDriverBindingStop()
  VirtioNetSnpEvacuate()
    gBS->CloseEvent()

What VirtioNetExitBoot() does is very simple: resetting the virtio
device is a small action, and it covers the responsibilities.


I'll admit that the virtio-scsi and virtio-block drivers play a bit
dirty here. (I've known this for a long time, but been silent about it.)
They should have similar callbacks, but don't (In theory, all devices
that are bound & alive at ExitBootServices() should be re-set, without
touching the memory services.)

There are two mitigating factors here:
- unlike with virtio-net, the scsi and block drivers in OVMF support
  only synchronous operations. When you are not calling their
  functions, there are no transfers in flight. And when something calls
  ExitBootServices(), that thing is not calling virtio-block /
  virtio-scsi functions.

- The first thing the OS-level virtio drivers do (certainly in Linux,
  hopefully in Windows) is a virtio-reset on each virtio device found.
  (This is actually required by the virtio specification, both old and
  new.)

Now, there's one small window for issues here. If something in the guest
scribbled over the memory that, according to QEMU, still hosts a live
virtio ring, between ExitBootServices() and said OS-level virtio-reset,
QEMU *could* interpret that -- even without an explicit virtio kick --
as garbage virtio operations, and abort the guest (or the device would
enter a failed state, not sure what happens in today's QEMU).

This window is very very small in practice however. OVMF allocates the
virtio rings in EfiBootServicesData type memory. The OS is allowed to
reuse that kind of memory right after ExitBootServices(). However, Linux
developers have found that many buggy UEFI drivers :) allow those areas
to be used by hardware even after ExitBootServices(), at least for a
while. So they have some reservation code in place to deal with that.
(I'm fuzzy on the details, sorry. CC'ing Ard and Matt.)

In any case, this decreases the chance that virtio ring corruption could
occur between ExitBootServices() and the OS-level virtio-reset, for
OVMF's virtio-block and virtio-scsi drivers. For purity, I should really
write those few tens of lines of ExitBootServices() handler code for
both drivers... but you know how it works; you make time for what really
matters. I have never seen (occur, or be reported) such a problem with OVMF.

However, what I *have* seen is a similar ring corruption in SeaBIOS.
There was a quirky error path in the code that enabled SeaBIOS to
internally release (and later, reuse) the memory that from QEMU's POV
still hosted a live virtio ring. As you can imagine, the release and the
"reuse" (-> abort by QEMU) were quite distanced from each other. Fixed
it in SeaBIOS commit 5f2d17d35b23. See also later commit 5e03881f98b3.

> AFAIK the BIOS interfaces do not have anything as reliable as that.
> 
> How does virtio deal with this in the BIOS case?

It doesn't, as far as I can tell.

I don't think it has to, though! On a BIOS box, you can always boot DOS,
or another operating system that continues to use the BIOS interfaces
forever. (Same as if you never call ExitBootServices() in UEFI.)

Given that no starter pistol gets fired between the firmware and the OS
on such a platform, they must always respect each other. I guess this
could occur through the E820 map, or some such.

No clue in what kind of E820 memory SeaBIOS allocates the virtio rings,
but I guess the Linux kernel stays away from those areas until it's past
device probing and binding.

In other words, I guess it is exactly the existence of
ExitBootServices() that *gives rise* to this issue. Without
ExitBootServices(), the BIOS can never know when its services are no
longer needed, so it keeps them alive. For which reason, the OS --
without means to tell the BIOS to go away *now* -- must stay away for a
reasonably long time too. (I'm speculating of course.)

With ExitBootServices(), there's a clear separation, but when only one
side adheres to it, things can break badly.

Long story short, here's my unwashed recommendation for Xen PV:
- In OVMF, fix the current ExitBootServices() handler. Only
  de-configure the device using the XenBus protocol, without allocating
  or freeing any dynamic memory, directly or indirectly.

  I'll also add writing ExitBootServices() handlers for virtio-block
  and virtio-scsi to my TODO list. (Now that you've forced me to own up
  to their lack.) Sigh. :)

- In SeaBIOS, allocate the ring in normal memory, and trust that the
  kernel stays away until the Xen guest driver binds the device, and
  resets it (as the very first step)

  If you're worried (or else, if I'm simply wrong!), allocate the ring
  in reserved memory. A few lost pages are a small price to pay for
  XenPV in SeaBIOS! ;)

... It's 1AM again. Sigh. Someone steal my keyboard, please. (And hit me
with it too if I wrote many incorrect statements above...)

Cheers
Laszlo

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.