Re: xen-unstable linux-5.14: 1 of 2 multicall(s) failed: cpu 0
On 07.09.21 10:11, Jan Beulich wrote:
> On 07.09.2021 09:58, Juergen Gross wrote:
>> On 06.09.21 23:35, Sander Eikelenboom wrote:
>>> L.S.,
>>>
>>> On my AMD box running:
>>> xen-unstable changeset: Fri Sep 3 15:10:43 2021 +0200 git:2d4978ead4
>>> linux kernel: 5.14.1
>>>
>>> With this setup I'm encountering some issues in dom0, see below.
>>>
>>> --
>>> Sander
>>>
>>> xl dmesg gives:
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 63b936 already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 6a0622 already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 6b63da already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 638dd9 already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 68a7bc already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 63c27d already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 6a04f2 already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 690d49 already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 6959a0 already pinned
>>> (XEN) [2021-09-06 18:15:04.089] mm.c:3506:d0v0 mfn 6a055e already pinned
>>> (XEN) [2021-09-06 18:15:04.090] mm.c:3506:d0v0 mfn 639437 already pinned
>>>
>>> dmesg gives:
>>> [34321.304270] ------------[ cut here ]------------
>>> [34321.304277] WARNING: CPU: 0 PID: 23628 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x176/0x1a0
>>> [34321.304288] Modules linked in:
>>> [34321.304291] CPU: 0 PID: 23628 Comm: apt-get Not tainted 5.14.1-20210906-doflr-mac80211debug+ #1
>>> [34321.304294] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640) , BIOS V1.8B1 09/13/2010
>>> [34321.304296] RIP: e030:xen_mc_flush+0x176/0x1a0
>>> [34321.304300] Code: 89 45 18 48 c1 e9 3f 48 89 ce e9 20 ff ff ff e8 60 03 00 00 66 90 5b 5d 41 5c 41 5d c3 48 c7 45 18 ea ff ff ff be 01 00 00 00 <0f> 0b 8b 55 00 48 c7 c7 10 97 aa 82 31 db 49 c7 c5 38 97 aa 82 65
>>> [34321.304303] RSP: e02b:ffffc90000a97c90 EFLAGS: 00010002
>>> [34321.304305] RAX: ffff88807d416398 RBX: ffff88807d416350 RCX: ffff88807d416398
>>> [34321.304306] RDX: 0000000000000001 RSI: 0000000000000001 RDI: deadbeefdeadf00d
>>> [34321.304308] RBP: ffff88807d416300 R08: aaaaaaaaaaaaaaaa R09: ffff888006160cc0
>>> [34321.304309] R10: deadbeefdeadf00d R11: ffffea000026a600 R12: 0000000000000000
>>> [34321.304310] R13: ffff888012f6b000 R14: 0000000012f6b000 R15: 0000000000000001
>>> [34321.304320] FS:  00007f5071177800(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>> [34321.304322] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [34321.304323] CR2: 00007f506f542000 CR3: 00000000160cc000 CR4: 0000000000000660
>>> [34321.304326] Call Trace:
>>> [34321.304331]  xen_alloc_pte+0x294/0x320
>>> [34321.304334]  move_pgt_entry+0x165/0x4b0
>>> [34321.304339]  move_page_tables+0x6fa/0x8d0
>>> [34321.304342]  move_vma.isra.44+0x138/0x500
>>> [34321.304345]  __x64_sys_mremap+0x296/0x410
>>> [34321.304348]  do_syscall_64+0x3a/0x80
>>> [34321.304352]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [34321.304355] RIP: 0033:0x7f507196301a
>>
>> I can see why this failure is occurring, but I'm not sure which way is
>> the best to fix it.
>>
>> The problem is that a pinned page table is moved: the pmd entry
>> referencing it is cleared and a new reference is put into the pmd.
>> This is done by getting the old pmd entry, clearing that entry, and
>> then using pmd_populate() to write the new pmd entry. pmd_populate()
>> will lead to a call of xen_alloc_pte() trying to pin the referenced
>> page table, which fails, as it is already pinned.
>>
>> The problem was introduced by commit 0881ace292b662d2 in kernel 5.14.
>>
>> The following solutions would be possible:
>>
>> 1. When running as a PV guest, skip the optimization of
>>    move_pgt_entry() by letting arch_supports_page_table_move() return
>>    false. This will result in a performance drop in some cases.
>>
>> 2. Unpin the page table before calling pmd_populate(). This adds an
>>    unneeded hypercall, and without flushing the TLB I feel uneasy
>>    about doing that.
>
> I agree as far as the "unneeded hypercall" aspect goes, but I don't see
> any connection to the TLB (or a need to flush it): pinning has nothing
> to do with insertion into a live page table; a pinned page table can be
> entirely free floating. It's the removal from a (possibly) live page
> table which would require a flush.

And this removal is happening:

	/* Clear the pmd */
	pmd = *old_pmd;
	pmd_clear(old_pmd);

	VM_BUG_ON(!pmd_none(*new_pmd));

	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));

So unpinning after calling pmd_clear() seems risky.

>> 3. Add a check in xen_alloc_pte() whether the page table is already
>>    pinned, and if so, don't do the pinning. This is a rather clean
>>    solution, but it will result in other failures if a page table is
>>    used multiple times (a case which would be caught today, as in the
>>    failure above).
>>
>> My tendency is towards solution 3, as it is local to Xen code and has
>> the best performance.
>
> I agree 3 looks most promising. I can't judge how big of a risk there
> is for a page table to get used in more than one place, and hence how
> important it is to be able to detect that case.

Thanks. I'm going that route then.


Juergen
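For reference, a rough sketch of what solution 1 could look like. It assumes an x86-specific override of arch_supports_page_table_move() placed in arch/x86/include/asm/pgtable.h, with xen_pv_domain() from <xen/xen.h> available there; it is only meant to illustrate the idea, not an actual patch:

	/* Hypothetical x86 override, illustrating solution 1 only. */
	#define arch_supports_page_table_move arch_supports_page_table_move
	static inline bool arch_supports_page_table_move(void)
	{
		/*
		 * A PV guest must not take the move_normal_pmd()/move_normal_pud()
		 * fast path: it re-parents a (pinned) page table via
		 * pmd_populate(), which would try to pin it a second time.
		 */
		return !xen_pv_domain();
	}

With this, move_page_tables() would fall back to copying individual PTEs on PV guests, which is exactly the performance drop mentioned above.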
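Solution 3, the route picked above, could look roughly like the sketch below. It assumes the pinning triggered by pmd_populate() goes through a helper along the lines of xen_alloc_ptpage() in arch/x86/xen/mmu_pv.c, and that PagePinned on the page table's struct page can serve as the "already pinned" marker; details such as the trace point and the xen_struct_pages_ready check are left out, so treat this as an illustration rather than the final fix:

	static void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn,
				     unsigned level)
	{
		bool pinned = xen_page_pinned(mm->pgd);

		if (pinned) {
			struct page *page = pfn_to_page(pfn);
			/* Remember whether this page table was pinned before. */
			bool already_pinned = PagePinned(page);

			SetPagePinned(page);

			xen_mc_batch();

			/* PV page tables must always be mapped read-only. */
			__set_pfn_prot(pfn, PAGE_KERNEL_RO);

			/* Skip the pin hypercall if the page table is pinned already. */
			if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS && !already_pinned)
				__pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);

			xen_mc_issue(PARAVIRT_LAZY_MMU);
		}
	}

A page table that really is used in more than one place would then silently skip the second pin instead of triggering the "already pinned" message, which is the trade-off Jan mentions above.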