
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.



On Tue, 2010-11-16 at 07:17 -0500, Stefano Stabellini wrote:
> On Tue, 16 Nov 2010, Daniel Stodden wrote:
> > Let's say we create an extension to tapdisk which speaks blkback's
> > datapath in userland. We'd basically put one of those tapdisks on every
> > storage node, independent of the image type, such as a bare LUN or a
> > VHD. We'd add a couple of additional IPC calls to make it directly
> > connect/disconnect to/from (ring-ref, event-channel) pairs.
> > 
> > That means it doesn't even need to talk xenstore; the control plane could
> > all be left to some single daemon, which knows how to instruct the right
> > tapdev (via libblktapctl) by looking at the physical-device node. I
> > guess getting the control stuff out of the kernel is always a good idea.
> > 
> > There are some important parts which would go missing. Such as
> > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
> > gntmap 352 pages simultaneously isn't so good, so there still needs to
> > be some bridge arbitrating them. I'd rather keep that in kernel space --
> > is it okay to cram stuff like that into gntdev? It'd be much more
> > straightforward than IPC.
> > 
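
(To illustrate the kind of arbiter I mean: a global page budget inside gntdev
that blocks mappers once too many foreign pages are mapped at once. Sketch
only, not a patch -- the names and the GNTDEV_PAGE_BUDGET value are made up.)

/*
 * Sketch: block grant mappers until their pages fit under a global budget.
 */
#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

#define GNTDEV_PAGE_BUDGET 4096         /* arbitrary example limit */

static DEFINE_SPINLOCK(budget_lock);
static DECLARE_WAIT_QUEUE_HEAD(budget_wq);
static unsigned int pages_in_flight;

static bool budget_try_reserve(unsigned int nr)
{
        bool ok = false;

        spin_lock(&budget_lock);
        if (pages_in_flight + nr <= GNTDEV_PAGE_BUDGET) {
                pages_in_flight += nr;
                ok = true;
        }
        spin_unlock(&budget_lock);
        return ok;
}

/* Called before mapping nr grant refs; sleeps until they fit. */
static int budget_reserve(unsigned int nr)
{
        return wait_event_interruptible(budget_wq, budget_try_reserve(nr));
}

/* Called after the corresponding unmap. */
static void budget_release(unsigned int nr)
{
        spin_lock(&budget_lock);
        pages_in_flight -= nr;
        spin_unlock(&budget_lock);
        wake_up(&budget_wq);
}
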
> > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev...
> > I can't find it now -- what happened? Without it, there's presently still
> > no zero-copy.
> > 
> > Once the issues were solved, it'd be kinda nice. Simplifies stuff like
> > memshr for blktap, which depends on getting hold of original grefs.
> > 
> > We'd presumably still need the tapdev nodes, for qemu, etc. But those
> > can stay non-xen aware then.
> > 
> 
> Considering that there is a blkback implementation in qemu already, why
> don't we use it? I certainly don't feel the need for yet another blkback
> implementation.
> A lot of people are working on qemu nowadays and this would let us
> exploit some of that work and contribute to it ourselves.
> We would only need to write a vhd block driver in qemu (even though a
> "vdi" driver is already present, I assume it is not actually compatible?)
> and everything else is already there.
> We could reuse their qcow and qcow2 drivers that honestly are better
> maintained than ours (we receive a bug report per week about qcow/qcow2
> not working properly).
> Finally qemu needs to be able to do I/O anyway because of the IDE
> emulation, so it has to be in the picture in one way or another. One day
> not far from now, when we make virtio work on Xen, even the fast PV
> data path might go through qemu, so we might as well optimize it.
> After talking to the xapi guys to better understand their requirements,
> I am pretty sure that the new upstream qemu with QMP support would be
> able to satisfy them without issues.
> Of all the possible solutions, this is certainly the one that requires
> the fewest lines of code and would allow us to reuse resources that
> would otherwise just remain untapped.
> 
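
(For reference, the vhd driver you mention would start out as roughly the
skeleton below. Header names and the BlockDriver callback signatures depend
on the qemu tree being targeted, so they are only indicated in comments, and
BDRVVHDState is a made-up placeholder.)

#include "block_int.h"   /* "block/block_int.h" in newer trees */
#include "module.h"      /* "qemu/module.h" in newer trees */

typedef struct BDRVVHDState {
    /* footer/BAT/parent-locator state for the open image */
    int dummy;
} BDRVVHDState;

static BlockDriver bdrv_vhd = {
    .format_name   = "vhd",
    .instance_size = sizeof(BDRVVHDState),
    /*
     * .bdrv_probe, .bdrv_open, .bdrv_close, .bdrv_getlength and the
     * (aio_)read/write callbacks go here; their exact signatures follow
     * the block layer of the qemu version being targeted.
     */
};

static void bdrv_vhd_init(void)
{
    bdrv_register(&bdrv_vhd);
}

block_init(bdrv_vhd_init);
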
> I backported the upstream xen_disk implementation to qemu-xen
> and ran a test on the upstream 2.6.37-rc1 kernel as dom0: VMs boot fine
> and the performance seems interesting. For the moment I am thinking
> about enabling the qemu blkback implementation as a fallback in case
> blktap2 is not present in the system (i.e. 2.6.37 kernels).

I'm not against reducing code and effort. But in order to switch to a
different base we would need a drop-in match for VHD and at least a good
match for all the control machinery on which xen-sm presently depends.
There's also a lot of investment in filter drivers etc.

Then there is SM control, stuff like pause/unpause to get guests off the
storage nodes for snapshot/coalesce, more recently calls for statistics
and monitoring, tweaking some physical I/O details, etc. It used to be a
bitch; nowadays it's somewhat simpler, but that's all stuff we
completely depend on.

Moving blkback out of kernel space, into tapdisk, is predictable in size
and complexity. Replacing tapdisks altogether would be quite a different
story.
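
To put a rough size on "predictable": per VBD, the datapath attach is little
more than mapping the ring page and binding the event channel. A sketch
against libxenctrl's gnttab/evtchn interfaces follows -- handle types and the
open() signatures differ between Xen releases, and the struct and function
names here are made up for illustration.

#include <sys/mman.h>
#include <xenctrl.h>
#include <xen/io/ring.h>
#include <xen/io/blkif.h>

struct ublkback {
    xc_gnttab        *xcg;
    xc_evtchn        *xce;
    blkif_sring_t    *sring;   /* frontend's shared ring page, foreign-mapped */
    blkif_back_ring_t ring;    /* our consumer/producer view of it */
    int               port;    /* local event-channel port */
};

static int connect_vbd(struct ublkback *b, domid_t domid,
                       grant_ref_t ring_ref, evtchn_port_t remote_port)
{
    b->xcg = xc_gnttab_open(NULL, 0);
    b->xce = xc_evtchn_open(NULL, 0);
    if (!b->xcg || !b->xce)
        return -1;

    /* Map the single page the frontend granted for the ring. */
    b->sring = xc_gnttab_map_grant_ref(b->xcg, domid, ring_ref,
                                       PROT_READ | PROT_WRITE);
    if (!b->sring)
        return -1;

    BACK_RING_INIT(&b->ring, b->sring, XC_PAGE_SIZE);

    /* Bind our end of the frontend's event channel. */
    b->port = xc_evtchn_bind_interdomain(b->xce, domid, remote_port);
    return b->port < 0 ? -1 : 0;
}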

The remainder below isn't fully qualified, just random bits coming to
mind, assuming you're not talking about sharing code/libs and
frameworks, but about actual processes.

First, what's the rationale for fully PV'd guests on Xen? (That argument
might not count if we just take qemu as the container process and strip
the emulation for those.)

Related, there's the question of memory footprint. Kernel blkback is
extremely lightweight. Moving the datapath into userland can create
headaches, especially on 32-bit dom0s with a lot of guests and disks on
backends which used to be bare LUNs under blkback. That's a problem
tapdisk has to face too; I'm just wondering about the size of the issue
in qemu.
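
Back-of-envelope, assuming the standard blkif ring geometry (32 requests per
one-page ring, up to 11 segments each -- which is where the 352 pages above
come from); without zero-copy that data goes through the backend's own
buffers:

#include <stdio.h>

#define PAGE_SIZE     4096u
#define RING_REQUESTS 32u        /* __RING_SIZE(blkif_sring, 4096) */
#define SEGS_PER_REQ  11u        /* BLKIF_MAX_SEGMENTS_PER_REQUEST */

int main(void)
{
    unsigned pages_per_ring = RING_REQUESTS * SEGS_PER_REQ;   /* 352 */
    unsigned rings = 200;

    printf("per ring, fully loaded: %u pages = %u KiB\n",
           pages_per_ring, pages_per_ring * PAGE_SIZE / 1024);
    printf("%u busy rings: %u pages = ~%u MiB\n",
           rings, rings * pages_per_ring,
           rings * pages_per_ring * (PAGE_SIZE / 1024) / 1024);
    return 0;
}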

Related, Xapi depends a lot on dom0 plugs, where the datapath can be
somewhat hairy when it comes to blocking I/O and resource allocation.

Then there is sharing. Storage activation normally doesn't operate in a
specific VM context. It presently doesn't even relate to a particular
VBD, much less a VM. For qemu alone, putting storage virtualization into
the same address space is an obvious choice. For Xen, enforcing that
sounds like a step backward.

From the shared-framework perspective, and the amount of code involved:
the ring path alone is too small to consider, and the more difficult
parts on top of that, like the state machines for write ordering and
syncing, are hard to share because they depend on the queue
implementation and the image driver interface.
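
For scale, the bare ring path is essentially the loop below, continuing the
connect sketch above; submit_request() is a made-up hand-off into tapdisk's
existing queue, and xen_rmb() is the read barrier the ring macros expect (the
tools normally get it from xenctrl.h).

#include <string.h>

static void submit_request(struct ublkback *b, blkif_request_t *req);

static void consume_ring(struct ublkback *b)
{
    blkif_request_t req;
    RING_IDX rc, rp;
    int more;

    do {
        rc = b->ring.req_cons;
        rp = b->ring.sring->req_prod;
        xen_rmb();      /* ensure we see queued requests up to rp */

        while (rc != rp) {
            memcpy(&req, RING_GET_REQUEST(&b->ring, rc), sizeof(req));
            b->ring.req_cons = ++rc;
            submit_request(b, &req);
        }
        /* Re-arm the event and pick up requests that raced in. */
        RING_FINAL_CHECK_FOR_REQUESTS(&b->ring, more);
    } while (more);
}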

Control might be a different story. As far as frontend/backend IPC via
xenstore goes, right now I still feel like those backends could be
managed by a single daemon, similar to what blktapctrl did (let's just
make it stateless/restartable this time). I guess qemu processes already
run their xenstore trees fine, but each one internally?
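
Roughly the shape I have in mind for such a daemon, sketched against
libxenstore -- error handling and the actual libblktapctl calls are left out,
the header is <xs.h> on older releases and <xenstore.h> later, and a real
version would parse the VBD out of the event path rather than just log it:

#include <stdio.h>
#include <stdlib.h>
#include <xenstore.h>

#define BACKEND_DIR "/local/domain/0/backend/vbd"

int main(void)
{
    struct xs_handle *xs = xs_daemon_open();
    char **ev;
    unsigned int num, len;

    if (!xs || !xs_watch(xs, BACKEND_DIR, "vbd"))
        return 1;

    /* No state is kept across events: everything needed is re-read from
     * xenstore, so the daemon can be killed and restarted at any time. */
    while ((ev = xs_read_watch(xs, &num))) {
        char *val = xs_read(xs, XBT_NULL, ev[XS_WATCH_PATH], &len);

        printf("changed: %s = %s\n", ev[XS_WATCH_PATH],
               val ? val : "(removed)");
        free(val);
        free(ev);
    }
    return 0;
}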

Daniel




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

