
[Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code



Keir (and xen physical memory management experts) --

Alright, I think I am ready for the final step of plugging
tmem into the existing xen physical memory management
code. **

This is a bit long, but I'd appreciate some design feedback
before I proceed with this.  And that requires a bit of
background explanation... if this isn't enough background,
I'll be happy to answer any questions.

(Note that tmem is not intended to be deployed on a 32-bit
hypervisor -- due to xenheap constraints -- and should port
easily (though it hasn't been ported yet) to ia64.  It is
currently controlled by a xen command-line option, default
off; and it requires tmem-modified guests.)

Tmem absorbs essentially all free memory on the machine
for its use, but the vast majority of that memory can be
easily freed, synchronously and on demand, for other uses.
Tmem now maintains its own page list, tmem_page_list,
which holds tmem pages when they (temporarily) don't contain
data.  (There's no sense scrubbing and freeing these to
xenheap or domheap, when tmem is just going to grab them
again and overwrite them anyway.)  So tmem holds three
types of memory:

(1) Machine-pages (4K) on the tmem_page_list
(2) Pages containing "ephemeral" data managed by tmem
(3) Pages containing "persistent" data managed by tmem

Pages regularly move back and forth between ((2) or (3))
and (1) as part of tmem's normal operations. When a page
is moved "involuntarily" from (2) to (1), we call this
an "eviction".  Note that, due to compression, evicting
a tmem ephemeral data page does not necessarily free up
a raw machine page (4K) of memory... partial pages
are kept in a tmem-specific tlsf pool, and tlsf frees
up the machine page when all allocations on it are freed.
(tlsf is the mechanism underlying the new highly-efficient
xmalloc added to xen-unstable late last year.)
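
To make the page movement concrete, here's a rough,
self-contained sketch of the three pools and of a single
eviction.  struct page_info, the list helpers, and
tmem_evict_one_page() below are simplified stand-ins, not
the real Xen/tmem code, and the compression/tlsf
bookkeeping for partial pages is omitted:

/*
 * Illustrative stand-ins only: a trivial singly-linked page list,
 * the three pools tmem holds, and an eviction that moves one
 * ephemeral page's raw frame back to tmem_page_list.  Compression
 * (partial pages kept in the tlsf pool) is ignored here.
 */
#include <stddef.h>

struct page_info {
    struct page_info *next;            /* simplified list linkage */
};

struct page_list {
    struct page_info *head;
    unsigned long count;
};

struct page_list tmem_page_list;       /* (1) clean raw 4K pages   */
struct page_list tmem_ephemeral_list;  /* (2) ephemeral tmem data  */
struct page_list tmem_persistent_list; /* (3) persistent tmem data */

struct page_info *page_list_pop(struct page_list *l)
{
    struct page_info *pg = l->head;

    if (pg != NULL) {
        l->head = pg->next;
        l->count--;
    }
    return pg;
}

void page_list_push(struct page_list *l, struct page_info *pg)
{
    pg->next = l->head;
    l->head = pg;
    l->count++;
}

/*
 * Discard the data in one ephemeral page (the guest can refetch it
 * from disk) and return the raw frame to tmem_page_list.  Returns 1
 * on success, 0 if there was nothing left to evict.
 */
int tmem_evict_one_page(void)
{
    struct page_info *pg = page_list_pop(&tmem_ephemeral_list);

    if (pg == NULL)
        return 0;
    page_list_push(&tmem_page_list, pg);
    return 1;
}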

Now let's assume that Xen has need of memory but tmem
has absorbed it all.  Xen's demand is always one of
the following: (here, a page is a raw machine page (4K))

A) a page
B) a large number of individual non-consecutive pages
C) a block of 2**N consecutive pages (order N > 0)

Of these:
(A) eventually finds its way to alloc_heap_pages()
(B) happens in (at least) two circumstances:
 (i) when a new domain is created, and
 (ii) when a domain makes a balloon request.
(C) happens mostly at system startup and then rarely
    after that (when? why? see below)

Tmem will export this API:

a) struct page_info *tmem_relinquish_page(void)
b) struct page_info *tmem_relinquish_pageblock(int order)
c) uint32_t tmem_evict_npages(uint32_t npages)
d) uint32_t tmem_relinquish_pages(uint32_t npages)

(a) and (b) are internal to the hypervisor.  (c) and
(d) are internal and also accessible via a privileged
hypercall.

(a) is fairly straightforward and synchronous, though it
may be a bit slow since it has to scrub the page before
returning.  If there is a page in tmem_page_list, it will
(scrub and) return it.  If not, it will evict tmem ephemeral
data until there is a page freed to tmem_page_list and
then it will (scrub and) return it.  If tmem has no
more ephemeral pages to evict and there's nothing in
tmem_page_list, it will return NULL.  (a) can be used,
for example, in alloc_heap_pages when "No suitable memory
blocks" can be found, so as to avoid failing the request.
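
Roughly, (a) could then look like the sketch below, reusing
the stand-in helpers from the sketch above.  scrub_one_page()
is a placeholder for whatever scrubbing primitive is used,
and a real version would also have to cope with compressed
pages, where one eviction doesn't necessarily yield a whole
machine page:

#include <stddef.h>

struct page_info;
struct page_list;

extern struct page_list tmem_page_list;
extern struct page_info *page_list_pop(struct page_list *l);
extern int tmem_evict_one_page(void);
extern void scrub_one_page(struct page_info *pg);   /* placeholder */

struct page_info *tmem_relinquish_page(void)
{
    struct page_info *pg;

    /* Evict ephemeral data until a clean raw page is available. */
    while ((pg = page_list_pop(&tmem_page_list)) == NULL) {
        if (!tmem_evict_one_page())
            return NULL;   /* nothing left to evict; caller must cope */
    }

    /* Scrub before handing the page back so no tmem data leaks. */
    scrub_one_page(pg);
    return pg;
}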

(b) is similar but is used if order > 0 (i.e. a bigger chunk
of pages is needed).  It works the same way except that,
due to fragmentation, it may have to evict MANY pages,
in fact possibly ALL ephemeral data.  Even then it still may
not find enough consecutive pages to satisfy the request.
Further, tmem doesn't use a buddy allocator... because it
uses nothing larger than a machine page, it never needs one
internally. So all of those
pages need to be scrubbed and freed to the xen heap before
it can be determined if the request can be satisfied.
As a result, this is potentially VERY slow and still has
a high probability of failure.  Fortunately, requests for
order>0 are, I think, rare.
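
A sketch of (b) along those lines is below.  Here
free_heap_page() and try_alloc_order() are NOT real Xen
functions, just stand-ins for "scrub and free one page to
the xen heap" and "retry an order-N allocation from the
heap":

#include <stddef.h>

struct page_info;
struct page_list;

extern struct page_list tmem_page_list;
extern struct page_info *page_list_pop(struct page_list *l);
extern int tmem_evict_one_page(void);
extern void scrub_one_page(struct page_info *pg);        /* placeholder */
extern void free_heap_page(struct page_info *pg);        /* placeholder */
extern struct page_info *try_alloc_order(int order);     /* placeholder */

struct page_info *tmem_relinquish_pageblock(int order)
{
    struct page_info *pg, *block;

    for ( ; ; ) {
        /* Scrub and free everything tmem holds clean to the heap... */
        while ((pg = page_list_pop(&tmem_page_list)) != NULL) {
            scrub_one_page(pg);
            free_heap_page(pg);
        }

        /* ...and see if the heap can now build a 2**order block. */
        block = try_alloc_order(order);
        if (block != NULL)
            return block;

        /* Still too fragmented: evict more ephemeral data, if any. */
        if (!tmem_evict_one_page())
            return NULL;   /* evicted everything and still can't do it */
    }
}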

(c) and (d) are intentionally not combined.  (c) evicts
tmem ephemeral pages until it has added at least npages
(machine pages) into the tmem_page_list. This may be slow.
For (d), I'm thinking it will transfer npages from
tmem_page_list to the scrub_list, where the existing
page_scrub_timer will eventually scrub them and free
them to xen's heap. (c) will return the number of pages
it successfully added to tmem_page_list.  And (d) will
return the number of pages it successfully moved from
tmem_page_list to scrub_list.
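
Sketched out, (c) and (d) might look like this, again with
the stand-ins from above; scrub_list here just stands in
for the list that the existing page_scrub_timer drains.
Both return the amount of work actually done, which may be
less than npages:

#include <stddef.h>
#include <stdint.h>

struct page_info;
struct page_list;

extern struct page_list tmem_page_list;
extern struct page_list scrub_list;                      /* stand-in */
extern struct page_info *page_list_pop(struct page_list *l);
extern void page_list_push(struct page_list *l, struct page_info *pg);
extern int tmem_evict_one_page(void);

/* (c): evict ephemeral data, adding up to npages raw pages to
 * tmem_page_list (ignoring compression, so here one eviction ==
 * one raw page).  Returns the number actually added. */
uint32_t tmem_evict_npages(uint32_t npages)
{
    uint32_t evicted = 0;

    while (evicted < npages && tmem_evict_one_page())
        evicted++;
    return evicted;
}

/* (d): move up to npages from tmem_page_list to scrub_list; the
 * existing page_scrub_timer will scrub them and free them to
 * xen's heap later.  Returns the number actually moved. */
uint32_t tmem_relinquish_pages(uint32_t npages)
{
    struct page_info *pg;
    uint32_t moved = 0;

    while (moved < npages &&
           (pg = page_list_pop(&tmem_page_list)) != NULL) {
        page_list_push(&scrub_list, pg);
        moved++;
    }
    return moved;
}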

So this leaves some design questions:

1) Does this design make sense?
2) Are there places other than in alloc_heap_pages
   in Xen where I need to add "hooks" for tmem
   to relinquish a page or a block of pages?
3) Are there any other circumstances I've
   forgotten where large npages are requested?
4) Does anybody have a list of alloc requests of
   order > 0 that occur after xen startup (e.g. when
   launching a new domain) and the consequences of
   failing the request?  I'd consider not providing
   interface (b) at all if it never happens or if
   multi-page requests always fail gracefully (e.g. get
   broken into smaller order requests).  I'm thinking
   for now that I may not implement this, just fail it,
   printk, and see if any bad things happen.

Thanks for taking the time to read through this... any
feedback is appreciated.

Dan

** tmem has been working for months but the code has
until now allocated (and freed) to (and from)
xenheap and domheap.  This has been a security hole
as the pages were released unscrubbed and so data
could easily leak between domains.  Obviously this
needed to be fixed :-)  And scrubbing data at every
transfer from tmem to domheap/xenheap would be a huge
waste of CPU cycles, especially since the most likely
next consumer of that same page is tmem again.
