
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support



Dan,

I agree with the ordering: A > C >= B > D. In general, super pages should perform better than small pages.

In practice, the difference between B and C is subtle. It depends on how the TLB is designed and how frequently TLB flushes happen.

-Wei

Dan Magenheimer wrote:
Interesting.  And non-intuitive.  I think you are saying
that, at least theoretically (and using your ABCD, not
my ABC below), A is always faster than
(B | C), and (B | C) is always faster than D.  Taking into
account the fact that the TLB size is fixed (I think),
C will always be faster than B and never slower than D.

So if the theory proves true, that does seem to eliminate
my objection.

Thanks,
Dan

-----Original Message-----
From: George Dunlap [mailto:george.dunlap@xxxxxxxxxxxxx]
Sent: Friday, March 20, 2009 3:46 AM
To: Dan Magenheimer
Cc: Wei Huang; xen-devel@xxxxxxxxxxxxxxxxxxx; Keir Fraser; Tim Deegan
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support


Dan,

Don't forget that this is about the p2m table, which is (if I understand correctly) orthogonal to what the guest pagetables are doing. So the scenario, if HAP is used, would be:

A) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 2MB pages
 - A TLB miss requires 3 * 3 = 9 reads (assuming a 64-bit guest)
B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4KB pages
 - A TLB miss requires 3 * 4 = 12 reads
C) DB code uses 4KB pages, OS uses 4KB pages, p2m uses 2MB pages
 - A TLB miss requires 4 * 3 = 12 reads
D) DB code uses 4KB pages, OS uses 4KB pages, p2m uses 4KB pages
 - A TLB miss requires 4 * 4 = 16 reads

And adding 1GB p2m entries will change the p2m multiplier from 3 to 2 (i.e., 3*2 = 6 reads for 2MB guest pages, 4*2 = 8 reads for 4KB guest pages).

(Those who are more familiar with the hardware, please correct me if I've made some mistakes or oversimplified things.)

So adding 1GB pages to the p2m table shouldn't change the guest OS's expectations in any case. Using them will benefit the guest to the same degree whether the guest is using 4KB, 2MB, or 1GB pages. (If I understand correctly.)
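
To make the arithmetic above concrete, here is a tiny sketch (my own simplification, not code from the patches) of the read-count model, treating a nested TLB miss as roughly guest-walk levels times p2m levels:

    /* Sketch of the simplified read-count model: a nested TLB miss
     * costs roughly guest_levels * p2m_levels memory reads. */
    #include <stdio.h>

    static int walk_reads(int guest_levels, int p2m_levels)
    {
        return guest_levels * p2m_levels;
    }

    int main(void)
    {
        /* 4 levels for 4KB mappings, 3 for 2MB, 2 for 1GB */
        printf("A (2MB guest, 2MB p2m): %d\n", walk_reads(3, 3)); /* 9  */
        printf("B (2MB guest, 4KB p2m): %d\n", walk_reads(3, 4)); /* 12 */
        printf("C (4KB guest, 2MB p2m): %d\n", walk_reads(4, 3)); /* 12 */
        printf("D (4KB guest, 4KB p2m): %d\n", walk_reads(4, 4)); /* 16 */
        printf("1GB p2m, 2MB guest:     %d\n", walk_reads(3, 2)); /* 6  */
        printf("1GB p2m, 4KB guest:     %d\n", walk_reads(4, 2)); /* 8  */
        return 0;
    }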

 -George

Dan Magenheimer wrote:
Hi Wei --

I'm not worried about the overhead of the splintering itself; I'm
worried about the "hidden overhead" every time a "silent
splinter" is used.

Let's assume three scenarios (and for now use 2MB pages though
the same concerns can be extended to 1GB and/or mixed 2MB/1GB):

A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
   only 2MB pages (no splintering occurs)
B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
   only 4KB pages (because of fragmentation, all 2MB pages have
   been splintered)
C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides
   4KB pages

Now run some benchmarks.  Clearly one would assume that A is
faster than both B and C.  The question is: Is B faster or slower
than C?

If B is always faster than C, then I have less objection to
"silent splintering".  But if B is sometimes (or maybe always?)
slower than C, that's a big issue because a user has gone through
the effort of choosing a better-performing system configuration
for their software (2MB DB on 2MB OS), but it actually performs
worse than if they had chosen the "lower performing" configuration.
And, worse, it will likely degrade over time, so performance
might be fine when the 2MB-DB-on-2MB-OS guest is launched
but get much worse after it is paused, saved/restored, migrated,
or hot-failed.  So even if B is only slightly faster than C,
if B is much slower than A, this is a problem.

Does that make sense?

Some suggestions:
1) If it is possible for an administrator to determine how many
   large pages (both 2MB and 1GB) were requested by each domain
   and how many are currently whole-vs-splintered, that would help.
2) We may need some form of memory defragmenter.

-----Original Message-----
From: Wei Huang [mailto:wei.huang2@xxxxxxx]
Sent: Thursday, March 19, 2009 12:52 PM
To: Dan Magenheimer
Cc: George Dunlap; xen-devel@xxxxxxxxxxxxxxxxxxx;
keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support


Dan,

Thanks for your comments. I am not sure which splintering overhead you are referring to. I can think of three areas:

1. Splintering in page allocation
In this case, Xen fails to allocate the requested page order, so it falls back to smaller pages to set up the p2m table. The overhead is O(guest_mem_size), and it is a one-time cost.

2. P2M splits a large page into smaller pages
This is one-directional because we don't merge smaller pages back into large ones. The worst case is splitting all guest large pages, so the overhead is O(total_large_page_mem). In the long run the overhead converges to 0 because splitting is one-directional. Note that this overhead also covers the case when the PoD feature is enabled.

3. CPU splintering
If the CPU does not support 1GB pages, it automatically splinters them into smaller ones (such as 2MB). In this case, the overhead is always there. But 1) this only happens on a small number of older chips; 2) I believe it is still faster than 4KB pages. CPUID (the 1GB page feature flag and the 1GB TLB entry counts) can be used to detect and avoid this problem, if we really don't like it.
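
For reference, a minimal user-space sketch of that kind of check (my own illustration, not part of the patches; the leaf/bit numbers are from the AMD manuals, so please double-check them): CPUID leaf 0x80000001 EDX bit 26 is the 1GB-page feature flag, and leaf 0x80000019 reports 1GB TLB entry information on AMD parts.

    /* Illustrative sketch only: detect 1GB page support via CPUID.
     * Leaf 0x80000001, EDX bit 26 = 1GB pages (PDPE1GB); leaf
     * 0x80000019 = 1GB TLB information on AMD.  Verify against the
     * manuals before relying on this. */
    #include <stdint.h>
    #include <stdio.h>

    static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                      uint32_t *c, uint32_t *d)
    {
        __asm__ __volatile__("cpuid"
                             : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                             : "a"(leaf), "c"(0));
    }

    int main(void)
    {
        uint32_t a, b, c, d, max_ext;

        cpuid(0x80000000, &a, &b, &c, &d);
        max_ext = a;

        if (max_ext >= 0x80000001) {
            cpuid(0x80000001, &a, &b, &c, &d);
            printf("1GB pages supported: %s\n",
                   (d & (1u << 26)) ? "yes" : "no");
        }
        if (max_ext >= 0x80000019) {
            cpuid(0x80000019, &a, &b, &c, &d);
            printf("1GB TLB info (raw): eax=%08x ebx=%08x\n",
                   (unsigned)a, (unsigned)b);
        }
        return 0;
    }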

I agree with your concerns. Customers should have the right to make their own decision, but that requires the new feature to be enabled in the first place. For a lot of benchmarks, the splintering overhead can be offset by the benefits of huge pages. SPECjbb is a good example of benefiting from large pages (see Ben Serebrin's presentation at Xen Summit). With that said, I agree with the idea of adding a new option to the guest configuration file.

-Wei


Dan Magenheimer wrote:
I'd like to reiterate my argument raised in a previous
discussion of hugepages:  Just because this CAN be made
to work, doesn't imply that it SHOULD be made to work.
Real users use larger pages in their OS for the sole
reason that they expect a performance improvement.
If it magically works, but runs slowly (and possibly
slower than if the OS had just used small pages to
start with), this is likely to lead to unsatisfied
customers, and perhaps allegations such as "Xen sucks
when running databases".

So, please, let's think this through before implementing
it just because we can.  At a minimum, an administrator
should be somehow warned if large pages are getting splintered.

And if it's going in over my objection, please tie it to
a boot option that defaults off so administrator action
is required to allow silent splintering.

My two cents...
Dan

-----Original Message-----
From: Huang2, Wei [mailto:Wei.Huang2@xxxxxxx]
Sent: Thursday, March 19, 2009 2:07 AM
To: George Dunlap
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Here are patches using the middle approach. It handles 1GB pages in PoD by remapping 1GB with 2MB pages and retrying. I also added code for 1GB detection. Please comment.

Thanks a lot,

-Wei

-----Original Message-----
From: dunlapg@xxxxxxxxx [mailto:dunlapg@xxxxxxxxx] On
Behalf Of George
Dunlap
Sent: Wednesday, March 18, 2009 12:20 PM
To: Huang2, Wei
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Thanks for doing this work, Wei -- especially all the
extra effort for
the PoD integration.

One question: How well would you say you've tested the PoD
functionality?  Or to put it the other way, how much do I need to
prioritize testing this before the 3.4 release?

It wouldn't be a bad idea to do as you suggested, and break things
into 2MB pages for the PoD case.  In order to take the best
advantage of this in a PoD scenario, you'd need to have a balloon
driver that could allocate 1GB of contiguous *guest* p2m space, which
seems a bit optimistic at this point...

 -George

2009/3/18 Huang2, Wei <Wei.Huang2@xxxxxxx>:
Current Xen supports 2MB super pages for NPT/EPT. The attached patches
extend this feature to support 1GB pages. The PoD (populate-on-demand)
support introduced by George Dunlap makes P2M modification harder. I
tried to preserve the existing PoD design by introducing a 1GB PoD
cache list.


Note that 1GB PoD can be dropped if we don't care about 1GB pages when
PoD is enabled. In that case, we can just split the 1GB PDPE into 512
2MB PDE entries and grab pages from the PoD super-page list. That would
pretty much make 1gb_p2m_pod.patch go away.



Any comments or suggestions on the design will be appreciated.



Thanks,



-Wei





The following is the description:

=== 1gb_tools.patch ===

Extends the existing setup_guest() function. Basically, it tries to
allocate 1GB pages whenever possible. If this request fails, it falls
back to 2MB pages. If both fail, 4KB pages will be used.
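
A rough sketch of that fallback order (hypothetical helper names; not the actual tools code):

    /* Illustrative sketch of the 1GB -> 2MB -> 4KB fallback described
     * above.  try_populate() is a hypothetical stand-in for the real
     * allocation call. */
    #include <stdio.h>

    #define ORDER_4K   0U
    #define ORDER_2M   9U
    #define ORDER_1G  18U

    /* Stub allocator that pretends 1GB requests fail, purely so the
     * fallback path is visible in this self-contained sketch. */
    static int try_populate(unsigned long gfn, unsigned int order)
    {
        (void)gfn;
        return (order <= ORDER_2M) ? 0 : -1;
    }

    static void populate_guest(unsigned long nr_4k_pages)
    {
        unsigned long gfn = 0;

        while (gfn < nr_4k_pages) {
            unsigned int order;

            /* Prefer the largest order that fits and is gfn-aligned. */
            if (nr_4k_pages - gfn >= (1UL << ORDER_1G) &&
                !(gfn & ((1UL << ORDER_1G) - 1)))
                order = ORDER_1G;
            else if (nr_4k_pages - gfn >= (1UL << ORDER_2M) &&
                     !(gfn & ((1UL << ORDER_2M) - 1)))
                order = ORDER_2M;
            else
                order = ORDER_4K;

            /* On failure, fall back to the next smaller order. */
            while (try_populate(gfn, order) != 0 && order != ORDER_4K)
                order = (order == ORDER_1G) ? ORDER_2M : ORDER_4K;

            gfn += 1UL << order;
        }
    }

    int main(void)
    {
        populate_guest(1UL << 20);   /* 4GB worth of 4KB pages */
        printf("done\n");
        return 0;
    }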



=== 1gb_p2m.patch ===

* p2m_next_level()

Checks the PSE bit of the L3 page table entry. If a 1GB page is found
(PSE=1), we split it into 512 2MB pages.
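
Conceptually, the split looks something like this (simplified placeholder types, not the real Xen p2m structures):

    /* Illustrative only: splitting a 1GB superpage entry into a new
     * table of 512 2MB entries.  pte_t and the flag bit are simplified
     * placeholders for the real p2m types. */
    #define ENTRIES_PER_TABLE  512
    #define PAGES_PER_2MB      512UL      /* 4KB frames per 2MB page */
    #define FLAG_PSE           (1UL << 7)

    typedef struct { unsigned long mfn; unsigned long flags; } pte_t;

    static void split_1gb_entry(pte_t *l3e,
                                pte_t l2_table[ENTRIES_PER_TABLE],
                                unsigned long l2_table_mfn)
    {
        unsigned long base_mfn = l3e->mfn;
        int i;

        /* Each new L2 entry covers 2MB of the original 1GB range and
         * keeps PSE set so it remains a superpage mapping. */
        for (i = 0; i < ENTRIES_PER_TABLE; i++) {
            l2_table[i].mfn   = base_mfn + i * PAGES_PER_2MB;
            l2_table[i].flags = l3e->flags | FLAG_PSE;
        }

        /* Re-point the L3 entry at the new L2 table and clear PSE, so
         * it is now a table pointer rather than a 1GB mapping. */
        l3e->mfn   = l2_table_mfn;
        l3e->flags &= ~FLAG_PSE;
    }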



* p2m_set_entry()

Configures the PSE bit of the L3 p2m entry if the page order == 18 (1GB).



* p2m_gfn_to_mfn()

Adds support for the 1GB case when doing gfn-to-mfn translation. When
the L3 entry is marked as POPULATE_ON_DEMAND, we call
p2m_pod_demand_populate(). Otherwise, we do the regular address
translation (gfn ==> mfn).



* p2m_gfn_to_mfn_current()

This is similar to p2m_gfn_to_mfn(). When the L3 entry is marked as
POPULATE_ON_DEMAND, it demands a populate using
p2m_pod_demand_populate(). Otherwise, it does a normal translation.
1GB pages are taken into consideration.



* set_p2m_entry()

Requests 1GB pages.



* audit_p2m()

Supports 1GB pages while auditing the p2m table.



* p2m_change_type_global()

Deals with 1GB pages when changing the global page type.



=== 1gb_p2m_pod.patch ===

* xen/include/asm-x86/p2m.h

A minor change to deal with PoD. It separates the super-page cache list
into 2MB and 1GB lists. Similarly, we record the last gpfn of sweeping
for both 2MB and 1GB.



* p2m_pod_cache_add()

Checks the page order and adds a 1GB super page to the PoD 1GB cache list.



* p2m_pod_cache_get()

Grabs a page from the cache list. It tries to break a 1GB page into 512
2MB pages if the 2MB PoD list is empty. Similarly, 4KB pages can be
obtained by breaking up super pages; the breaking order is 2MB first,
then 1GB.
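
A simplified model of that break-down order, using counters to stand in for the real cache lists (illustrative only, not the actual PoD code):

    /* Illustrative model: counters stand in for the 1GB/2MB/4KB PoD
     * cache lists; pod_cache_get() breaks larger pages down when the
     * requested list is empty. */
    struct pod_cache { unsigned long nr_1g, nr_2m, nr_4k; };

    static int pod_cache_get(struct pod_cache *c, unsigned int order)
    {
        if (order == 18) {                       /* want a 1GB page */
            if (!c->nr_1g) return -1;
            c->nr_1g--; return 0;
        }
        if (order == 9) {                        /* want a 2MB page */
            if (!c->nr_2m && c->nr_1g) {         /* break 1GB -> 512 x 2MB */
                c->nr_1g--; c->nr_2m += 512;
            }
            if (!c->nr_2m) return -1;
            c->nr_2m--; return 0;
        }
        /* want a 4KB page: break a 2MB page first, breaking a 1GB page
         * into 2MB pages beforehand if necessary */
        if (!c->nr_4k) {
            if (!c->nr_2m && c->nr_1g) { c->nr_1g--; c->nr_2m += 512; }
            if (c->nr_2m)              { c->nr_2m--; c->nr_4k += 512; }
        }
        if (!c->nr_4k) return -1;
        c->nr_4k--; return 0;
    }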



* p2m_pod_cache_target()

This function is used to set the PoD cache size. To increase the PoD
target, we try to allocate 1GB pages from the Xen domheap. If this
fails, we try 2MB. If both fail, we try 4KB, which is guaranteed to
work.



To decrease the target, we use a similar approach. We first try to free
1GB pages from the 1GB PoD cache list. If that fails, we try the 2MB
PoD cache list. If both fail, we try the 4KB list.



* p2m_pod_zero_check_superpage_1gb()

This adds a new function to zero-check a 1GB page. The function is
similar to p2m_pod_zero_check_superpage_2mb().



* p2m_pod_zero_check_superpage_1gb()

We add a new function to sweep 1GB pages from guest memory. This is the
same approach as p2m_pod_zero_check_superpage_2mb().



* p2m_pod_demand_populate()

The trick in this function is to remap and retry if
p2m_pod_cache_get() fails. When p2m_pod_cache_get() fails, this
function splits the p2m table entry into smaller ones (e.g. 1GB ==> 2MB
or 2MB ==> 4KB). That guarantees that populate demands always succeed.
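
In the same spirit as the sketches above (hypothetical helpers, not the actual code), the remap-and-retry idea is roughly:

    /* Illustrative only: if the PoD cache cannot supply a page of the
     * requested order, split the p2m entry to the next smaller order
     * and let the fault retry.  pod_cache_get() is the counter model
     * sketched earlier; the other helpers are stubs for illustration. */
    static int map_gfn(unsigned long gfn, unsigned int order)
    { (void)gfn; (void)order; return 0; }   /* stub: map the page */

    static int split_entry_and_retry(unsigned long gfn, unsigned int order)
    { (void)gfn; (void)order; return 1; }   /* stub: caller re-faults */

    static int demand_populate(struct pod_cache *c, unsigned long gfn,
                               unsigned int order)
    {
        if (pod_cache_get(c, order) == 0)
            return map_gfn(gfn, order);

        if (order == 18)
            return split_entry_and_retry(gfn, 9);   /* 1GB -> 512 x 2MB */
        if (order == 9)
            return split_entry_and_retry(gfn, 0);   /* 2MB -> 512 x 4KB */

        return -1;   /* even 4KB failed: genuinely out of memory */
    }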











_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

