[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] x86/PV: assert page state in mark_pv_pt_pages_rdonly()

  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 17 Aug 2021 11:30:43 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=x2JEO6iOFuxODv7ul/SyrfjtIyD7oZ+Z+zvWXec/sw0=; b=g5LxNC9x+go9+v5nNA9m2W9nkoRr6rv+QRhYybDUXFCF9vR/8a0PECqHUA4WuiQs03NAEweK3lyI5fNbCPcfWX1hvbb6FFK1HAidxfHxfGqpdpj9T+2+cdpPh14IPcSW0wMGQI9VjpkBHB1ttf8ne+3LKR5whHtR8gmJ4mJMP0t0bady5ALtEeOGNtcRQlZRawPs6Y4N5Yj7PZ20aCeS6vjEjVcXFO7a/5RuE6R8F8IKPu8D5qzo1UCWvV5n8wuqkPOSGSM0qFfCiMfwRffDXTfxjE13O8YQLpxqU3+vug56suQKnn2ZlCJWPjv9VWp0yuHYwdER/qsJCS294zehDw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=VL2jNV6Um5SZwHK74oknahHIeDRQTflQ2gd3gQq5FHuEY5dEq3ypgOgKXguwd/43B0fpEtZ2akW5dXdd656NnmnU1MiwdeQiwtoavCsF+JUvBuBgPNxtvZSyzoV/TXk9ilHbZlcm8+8s5k2KT9+9ksMvTa4HMq+6L+2+O8yhHhGnoAcGFXVb5l7e8rr3m0VZeEvuO/0jY0cbeDmy88p1h8jIDCRiFLQfRpwQW2hxDKMha8I2RE2bpicEGG3F3W6PreIWBnh0Dt7zWSiEvF24bt3hfN4t0bSM7YpWHE+x1KauPttRib+h+TZ8o2vGEtmQak0rA8r6g3Ad7sFOVWO1zg==
  • Authentication-results: esa1.hc3370-68.iphmx.com; dkim=pass (signature verified) header.i=@citrix.onmicrosoft.com
  • Cc: Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Tue, 17 Aug 2021 10:31:17 +0000
  • Ironport-hdrordr: A9a23:ERPZlqBLvD5NRTzlHejssceALOsnbusQ8zAXPhhKOGVom7+j5r iTdZUgpGPJYVMqMk3I9urwXZVoLUmzyXcx2/h2AV7AZniYhILLFvAH0WKK+VSJcECTmdK1l5 0QFJSWY+eRMbEOt7eZ3ODOKadC/DDoysGVbKzlvgxQpElRGttdxjY8LgOVVmBNaE14CYEiFJ yaj/A32QaISDAya8v+Kn4bU+3EvtGOu4nhZXc9dm0awTjLqTamrJv7GRSexBt2aUI7/Z4StU zBnEjD+qCuqbWFxgTH12nVhq4m6efJ+59mAcPJsMwcMSjqhhvtW4h7Qb2Fu1kO0ZGSwWdvtN zC5ysmP9xu51PdF1vF1SfF6k3F1Tlr1HP401+fhhLY0L7ErRwBerd8ub4=
  • Ironport-sdr: y4Wc/yA+s+qKDbxyNmHXHhvul/tAEdUziDdNEuHmfVQ01LxeLxyDMjeJmYsGzoDjoRFz+1rhel iacecm3YnSscfXa6qMiQPTyL/4T+sKSTLTzINAgwLy8e88xoVmKD1XxE6KrBgCzU4HYrjVep1r 9YVKEIUoUCYhFB0AodCuBRil4+hPzkSrhI4JifKEoX/ijlzVFyXUDSHchkrCXSZZPn3madV6L1 HUUH+PCbsy/5W+Ic0uAWbSM0LNpEeURnBWxZ8PD1g/6/g2NFhlULs4ju3qCA0+kZ1YfSMSUBem hBSptAz9NG1cNaiQpPhr9f+G
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 17/08/2021 09:54, Jan Beulich wrote:
> On 16.08.2021 21:25, Andrew Cooper wrote:
>> On 16/08/2021 16:29, Jan Beulich wrote:
>>> About every time I look at dom0_construct_pv()'s "calculation" of
>>> nr_pt_pages I question (myself) whether the result is precise or merely
>>> an upper bound. I think it is meant to be precise, but I think we would
>>> be better off having some checking in place. Hence add ASSERT()s to
>>> verify that
>>> - all pages have a valid L1...Ln (currently L4) page table type and
>>> - no other bits are set, in particular the type refcount is still zero.
>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>>> ---
>>> There are (at least) two factors supporting my uncertainty about the
>>> "calculation" being precise: The loop starting from 2 (which clearly is
>>> too small for a possible result)
>> 2 was the correct absolute minimum for 2-level guests.
> Which has been history for how many years?

Yeah, but I'd expect to phrase that as "a remnant of 32bit non-PAE guests".

> The minimum for the current implementation is 4 afaict,

I'm not sure the monitor table for PV32 is intended to count.

>  and ...
>> XTF kernels don't exceed the 2M boundary (at least, not currently), so
>> they can be mapped with only 3 or 4 pagetables, except:
>> * 3-level guests are created with 4 L2's for no obvious reason.  This is
>> nothing to do with legacy PAE paging, nor with how a typical Linux/BSD
>> kernel works.  The requirement to make 3-level guests work (and even
>> then, only under 32bit Xen) is to create a PGT_pae_xen_l2 if not already
>> covered by the other mappings.  Any non-toy kernel discards these
>> pagetables in favour of its own idea of pagetables.
> ... could be 3 for 32-bit Dom0.

Right, and starting from (compat ? 6 : 4) is probably a good move, along
with an explanation of these totally magic numbers.

And now I've thought about this some more, I'm pretty certain we do
better than an "start at 2, inc 1 and retry" loop.  We can calculate the
pagetables needed for the virtual layout statically, possibly even
including the default 6/4, and use that for a lower bound.  At most, we
need to loop once per possibly-unpopulated pagetable level (so 1 for
32bit, 3 for 64bit) to cover the cases of extra pagetables tipping over
a page size boundary.

>> * v_end is rounded up to 4MB.
>> Most XTF guests will operate entirely happily in a few hundred kb of
>> space, and the same will be true of other microservices.  The rounding
>> up of memory might be helpful for the traditional big VMs case, but it
>> isn't correct or useful for other usecases.
>>> and an apparently wrong comment stating
>>> that not only v_end but also v_start would be superpage aligned
>> Which comment?  The only one I see about 4M has nothing to do with
>> superpages.
> The one immediately ahead of the related variable declarations:
>     /*
>      * This fully describes the memory layout of the initial domain. All
>      * *_start address are page-aligned, except v_start (and v_end) which are
>      * superpage-aligned.
>      */
> I see nothing forcing v_start to be superpage-aligned, while I
> do suspect that the "calculation" of the number of page tables
> will be wrong when it isn't.

This is an XTF test booting as dom0

(d2) (XEN) *** Building a PV Dom0 ***
(d2) (XEN) ELF: phdr: paddr=0x100000 memsz=0x12000
(d2) (XEN) ELF: memory: 0x100000 -> 0x112000
(d2) (XEN) ELF: note: GUEST_OS = "XTF"
(d2) (XEN) ELF: note: GUEST_VERSION = "0"
(d2) (XEN) ELF: note: LOADER = "generic"
(d2) (XEN) ELF: note: HYPERCALL_PAGE = 0x106000
(d2) (XEN) ELF: note: XEN_VERSION = "xen-3.0"
(d2) (XEN) ELF: note: FEATURES = "!writable_page_tables|pae_pgdir_above_4gb"
(d2) (XEN) ELF: note: PAE_MODE = "yes"
(d2) (XEN) ELF: using notes from SHT_NOTE section
(d2) (XEN) ELF: VIRT_BASE unset, using 0
(d2) (XEN) ELF_PADDR_OFFSET unset, using 0
(d2) (XEN) ELF: addresses:
(d2) (XEN)     virt_base        = 0x0
(d2) (XEN)     elf_paddr_offset = 0x0
(d2) (XEN)     virt_offset      = 0x0
(d2) (XEN)     virt_kstart      = 0x100000
(d2) (XEN)     virt_kend        = 0x112000
(d2) (XEN)     virt_entry       = 0x100000
(d2) (XEN)     p2m_base         = 0xffffffffffffffff
(d2) (XEN)  Xen  kernel: 64-bit, lsb, compat32
(d2) (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100000 -> 0x112000
(d2) (XEN)  Dom0 alloc.:   000000003e800000->000000003ec00000 (240731
pages to be allocated)
(d2) (XEN)  Loaded kernel: 0000000000100000->0000000000112000
(d2) (XEN)  Init. ramdisk: 0000000000112000->0000000000112000
(d2) (XEN)  Phys-Mach map: 0000000000112000->00000000002ea2d8
(d2) (XEN)  Start info:    00000000002eb000->00000000002eb4b8
(d2) (XEN)  Xenstore ring: 0000000000000000->0000000000000000
(d2) (XEN)  Console ring:  0000000000000000->0000000000000000
(d2) (XEN)  Page tables:   00000000002ec000->00000000002f1000
(d2) (XEN)  Boot stack:    00000000002f1000->00000000002f2000
(d2) (XEN)  TOTAL:         0000000000000000->0000000000400000
(d2) (XEN)  ENTRY ADDRESS: 0000000000100000

It would appear that v_start comes directly and unmodified from the

Other observations:  The ramdisk start/end aren't zero despite one being
absent, and the M2P and start info ends aren't aligned.

>>>  (in fact
>>> v_end is 4MiB aligned, which is the superpage size only on long
>>> abandoned [by us] non-PAE x86-32).
>> Tangentially, that code needs some serious work to use ROUNDUP/DOWN
>> macros for clarity.
> Agreed.
>>> --- a/xen/arch/x86/pv/dom0_build.c
>>> +++ b/xen/arch/x86/pv/dom0_build.c
>>> @@ -59,6 +59,10 @@ static __init void mark_pv_pt_pages_rdon
>>>          l1e_remove_flags(*pl1e, _PAGE_RW);
>>>          page = mfn_to_page(l1e_get_mfn(*pl1e));
>>> +        ASSERT(page->u.inuse.type_info & PGT_type_mask);
>>> +        ASSERT((page->u.inuse.type_info & PGT_type_mask) <= 
>>> PGT_root_page_table);
>> This is an obfuscated
>> ASSERT((page->u.inuse.type_info & PGT_type_mask) >= PGT_l1_page_table &&
>>        (page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
> I can certainly switch to this yet longer piece of code,

Improved clarity is substantially more important than conciseness.

>  and ...
>> and
>>> +        ASSERT(!(page->u.inuse.type_info & ~PGT_type_mask));
>> this has no context.
>> At a bare minimum, you need a comment stating what properties we're
>> looking for, so anyone suffering an assertion failure has some clue as
>> to what may have gone wrong.
> ... I can certainly transform the respective parts of the
> description into a code comment.

Thanks.  It doesn't need to be much, but it does need to be something.




Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.