
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.



On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote:
> I'll throw an idea out there and you educate me why it's lame.
>
> Going back to the primary issue of dropping zero-copy, you want the block 
> backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because 
> you run into all sorts of quirkiness otherwise: magical VM_FOREIGN 
> incantations to back granted mfn's with fake page structs that make 
> get_user_pages happy, quirky grant PTEs, etc.
>
> Ok, so how about something along the lines of GNTTABOP_swap? Eerily 
> reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
>
> The observation is that for a blkfront read, you could do the read all along 
> on a regular dom0 frame, and when stuffing the response into the ring, swap 
> the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then 
> the algorithm unfolds as follows:
>
> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, 
> creates a pool of reserved regular pages via get_free_page(s). These pages 
> have their refcount pumped, no one in dom0 will ever touch them.
>
> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap 
> immediately. One of the backend-reserved mfn's is swapped with the domU mfn. 
> Pfn's and page struct's on both ends remain untouched.

Would GNTTABOP_swap also require the domU to have already unmapped the
page from its own pagetables?  Presumably it would fail if it didn't,
otherwise you'd end up with a domU mapping the same mfn as a
dom0-private page.
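
To make that constraint concrete, here is a toy user-space model of the check I'd expect. GNTTABOP_swap does not exist, and every name below (toy_domain, toy_gnttab_swap, the 8-entry p2m) is made up for illustration; it is a sketch of the semantics, not hypervisor code.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PFNS 8

typedef unsigned long mfn_t;

/* Hypothetical, heavily simplified view of a domain. */
struct toy_domain {
    mfn_t p2m[TOY_NR_PFNS];     /* pfn -> mfn translation */
    bool  mapped[TOY_NR_PFNS];  /* pfn still mapped in the domain's pagetables? */
};

/*
 * Exchange the machine frames behind a dom0 backend pfn and a domU pfn.
 * Pfns stay put on both sides; only the mfns change hands.
 * Returns 0 on success, -1 if the swap is refused.
 */
int toy_gnttab_swap(struct toy_domain *dom0, size_t backend_pfn,
                    struct toy_domain *domU, size_t guest_pfn)
{
    if (backend_pfn >= TOY_NR_PFNS || guest_pfn >= TOY_NR_PFNS)
        return -1;

    /* Refuse while domU still maps the page: otherwise domU would keep
     * a mapping of an mfn that has become a dom0-private page. */
    if (domU->mapped[guest_pfn])
        return -1;

    mfn_t tmp = dom0->p2m[backend_pfn];
    dom0->p2m[backend_pfn] = domU->p2m[guest_pfn];
    domU->p2m[guest_pfn] = tmp;
    return 0;
}
```

In this model the guest has to tear down its mapping before every request, which is the extra per-request work in question.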

> 3. For blkfront reads, call swap when stuffing the response back into the ring.
>
> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much 
> like balloon and others do, without fear of races. More importantly, b) you 
> don't have a weirdo granted PTE, or work with a frame from another domain. 
> It's your page all along, dom0.
>
> 5. One assumption for domU is that pages allocated as blkfront buffers won't 
> be touched by anybody, so a) it's safe for them to swap async with another 
> frame with undef contents and b) domU can fix its p2m (and kvaddr) when 
> pulling responses from the ring (the new mfn should be put on the response by 
> dom0 directly or through an opaque handle)
>
> 6. Scatter-gather vectors in ring requests give you a natural multicall 
> batching for these GNTTABOP_swap's. I.e. all these hypercalls won't happen as 
> often, or at as fine a granularity, as skbuff's demanded for GNTTABOP_transfer.
>
> 7. Potentially domU may want to use the contents in a blkfront write buffer 
> later for something else. So it's not really zero-copy. But the approach 
> opens a window to async memcpy. From the point of swap when pulling the req 
> to the point of pushing the response, you can do memcpy at any time. Don't 
> know about how practical that is though.

I think that will be the common case - the kernel will always attempt to
write dirty pagecache pages to make clean ones, and it will still want
them around to access.  So it can't really give up the page altogether;
if it hands it over to dom0, it needs to make a local copy first.
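
Roughly, the domU write path under a swap scheme would look like the sketch below. This is an illustration under my own assumptions: toy_page, toy_prepare_write and the bounce page are hypothetical names, not anything in blkfront.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for one page of data. */
struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
};

/*
 * domU side of a write under the proposed swap scheme. The frame handed
 * to dom0 comes back with undefined contents, so a pagecache page that
 * must stay readable cannot be handed over itself. Copy it into a bounce
 * page and hand the bounce page over instead.
 * Returns the page that is safe to give away.
 */
struct toy_page *toy_prepare_write(const struct toy_page *pagecache_page,
                                   struct toy_page *bounce)
{
    memcpy(bounce->data, pagecache_page->data, TOY_PAGE_SIZE);
    return bounce;
}
```

So each written page still costs one memcpy, with the swap hypercall and the p2m fixups on top.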

> Problems at first glance:
> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and 
> blkback.
> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like 
> balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken 
> care of. domU will probably need to neuter its kvaddr before granting, and 
> then re-establish it when the response arrives. Weren't all these hypercalls 
> ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
> 3. Managing the pool of backend reserved pages may be a problem?
>
> So in the end, perhaps more of an academic exercise than a palatable answer, 
> but nonetheless I'd like to hear other problems people may find with this 
> approach.

It's not clear to me that it's any improvement over just directly copying
the data up front.

    J
