
[PATCH 0/2] xen/mm: Fix offlining pages to avoid corrupting the heap



This series fixes a bug where offlining pages could lead to unaligned
buddies being merged back onto the free list. The result is a chain of
events that can corrupt the heap and trigger a Xen panic after a few
allocations and frees.

For example, an MCE caused by faulty RAM may mark pages as offline.
When a buddy containing offlined pages is freed, those pages
are moved to dedicated isolated page lists.

reserve_offline_page() lacks alignment checks and may grow adjacent
healthy spans into unaligned buddies that violate the fundamental buddy
invariant: buddies of a given order must be aligned to their size.
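
As a minimal sketch of that invariant (not code from the patch;
buddy_is_aligned() is a hypothetical helper name), a buddy of 2^order
pages may only start at an MFN whose low `order` bits are all zero:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: returns true if a buddy
 * of 2^order pages may legally start at mfn, i.e. the low `order` bits
 * of mfn are all zero.
 */
static bool buddy_is_aligned(uint64_t mfn, unsigned int order)
{
    return (mfn & ((UINT64_C(1) << order) - 1)) == 0;
}
```

With the MFNs used in the example that follows, an order-2 buddy at
MFN 4 and an order-1 buddy at MFN 6 pass this check, while an order-1
buddy at MFN 5 does not.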

Consider a valid order-2 buddy (4 pages) with this layout:

   +---------------+-----------------+-----------------+----------------+
   | head page     | tail page 1     | tail page 2     | tail page 3    |
   +---------------+-----------------+-----------------+----------------+

If the head page is offlined, reserve_offline_page() regroups the
remaining tail pages without checking their alignment:

   +---------------+-----------------+-----------------+----------------+
   | offlined page |     head page with a tail page    | single page    |
   +---------------+-----------------+-----------------+----------------+

This leads to a Xen panic, demonstrated by the test case:

1. When a single page is allocated from this buddy, MFN 7 is allocated:

        MFN 4             MFN 5             MFN 6             MFN 7
  +---------------+-----------------+-----------------+----------------+
  | offlined page |    head page        tail page     | allocated page |
  |               |       Unaligned buddies are       |                |
  |               |      an invariant violation!      |                |
  +---------------+-----------------+-----------------+----------------+

2. When MFN 7 is freed, the predecessor merge in free_heap_pages()
   kicks in, merging MFN 7 with its naturally aligned predecessor MFN 6:

        MFN 4             MFN 5             MFN 6            MFN 7
  +---------------+-----------------+-----------------+
  | offlined page |    head page         tail page    |
  |               |       Unaligned buddies are       |
  |               |      an invariant violation!      |
  +---------------+-----------------+-----------------+----------------+
                                    |    head page        tail page    |
                                    +-----------------+----------------+

  As shown, MFN 6 is double-freed. It is in two buddies:
  - As the tail page of the unaligned order-1 buddy starting at MFN 5.
  - As the head page of the aligned order-1 buddy starting at MFN 6.

3. The next allocations hand out MFN 7 again, and MFN 6 as well:

   Because of the double-free, after the first allocation MFN 6 remains
   on the free list (as the tail of the unaligned buddy at MFN 5) even
   though its PGC_state marks it as in-use.

        MFN 4             MFN 5             MFN 6            MFN 7
  +---------------+-----------------+-----------------+
  | offlined page |    head page         tail page    |
  |               |       Unaligned buddies are       |
  |               |      an invariant violation!      |
  +---------------+-----------------+-----------------+----------------+
                                    |   in-use page   |   in-use page  |
                                    +-----------------+----------------+

4. When the next page from this buddy is allocated, get_free_page()
   returns the buddy head MFN 5.  If the allocation is for order-0,
   alloc_heap_pages() splits off MFN 6; otherwise, it keeps the buddy
   whole. Either way, the allocator checks the PGC_state of each page
   and expects none of them to be in-use. Because MFN 6 is already
   in-use, Xen panics (example panic log):

   pg[0] MFN 842adc c=0x4000000000000000 o=0 v=0 t=0
   Xen BUG at common/page_alloc.c:1324
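
The arithmetic behind steps 2 to 4 can be sketched as follows. This is
an illustrative model, not the code in free_heap_pages(); both helper
names are hypothetical. The merge partner of a free block is found by
flipping the order bit of its MFN, and a page is double-freed when it
lies inside two buddies at once:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: head MFN of the block that a free block of
 * 2^order pages at mfn would merge with. If the order bit is set in
 * mfn this is the predecessor, otherwise the successor.
 */
static uint64_t merge_partner(uint64_t mfn, unsigned int order)
{
    return mfn ^ (UINT64_C(1) << order);
}

/*
 * Hypothetical helper: true if mfn lies inside the buddy covering
 * [head, head + 2^order).
 */
static bool buddy_contains(uint64_t head, unsigned int order, uint64_t mfn)
{
    return mfn >= head && mfn - head < (UINT64_C(1) << order);
}
```

For the example above, merge_partner(7, 0) is 6, so freeing MFN 7
merges it with MFN 6. Afterwards both buddy_contains(5, 1, 6) and
buddy_contains(6, 1, 6) hold, which is the double membership that the
state check in alloc_heap_pages() eventually trips over.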

I reproduced this while running intensive NUMA claim tests combined
with page offlining. The test case in this series demonstrates the
cascading corruption that leads to the panic, without having to
crash a running Xen instance to test for the bug.

Running the test produces the following output (trimmed):

   $ make -C tools/tests/native test TARGETS=offline-unaligned |
     grep -v ' xen/'
   |   The buddy #5 is not aligned to order-1!
   | <0>pg[0] MFN 00006 c=0x8000000000000001 o=1213 v=0 t=0
   | xen/common/page_alloc.c:1324: WE INVOKED a XEN BUG in alloc_heap_pages()

The second patch fixes the root cause and updates the test case to
serve as a regression test.
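
The fix itself is in patch 2. Purely as an illustration of the property
it has to uphold (this is not the patch, and largest_aligned_order() is
a hypothetical helper), the healthy pages around an offlined page can
be carved into naturally aligned buddies by picking, at each MFN, the
largest order that is both aligned and fits into the remaining span:

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: largest order (capped at
 * max_order) such that a buddy of 2^order pages starting at mfn is
 * naturally aligned and does not exceed npages.
 * Assumes npages >= 1.
 */
static unsigned int largest_aligned_order(uint64_t mfn, uint64_t npages,
                                          unsigned int max_order)
{
    unsigned int order = 0;

    while (order < max_order &&
           (mfn & ((UINT64_C(1) << (order + 1)) - 1)) == 0 &&
           (UINT64_C(1) << (order + 1)) <= npages)
        order++;

    return order;
}
```

Applied to the three healthy pages MFN 5..7 from the example, this
yields an order-0 buddy at MFN 5 followed by an order-1 buddy at MFN 6,
instead of the unaligned order-1 buddy at MFN 5.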

This series is based on the native test environment v3 for NUMA claims:
https://lists.xen.org/archives/html/xen-devel/2026-05/msg01163.html

That series in turn depends on the NUMA claim sets v7 series:
https://lists.xen.org/archives/html/xen-devel/2026-05/msg00363.html

You can pull the series with dependencies for review and testing:

$ git pull git@xxxxxxxxxx:bernhardkaindl/xen.git offline-unaligned-buddies-v1
$ make -C tools/tests/native TARGETS=offline-unaligned test

Fixes: e4865c2315 ('Page offline support in Xen side')
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@xxxxxxxxxx>

Bernhard Kaindl (2):
  tools/tests/native: Test for Xen Panic after memory offlining
  xen/mm: Fix offlining pages only make aligned buddies, fixes Xen crash

 tools/tests/native/offline-unaligned.c | 79 ++++++++++++++++++++++++++
 xen/common/page_alloc.c                |  5 ++
 2 files changed, 84 insertions(+)
 create mode 100644 tools/tests/native/offline-unaligned.c

-- 
2.39.5




 

