Xen project Mailing List

Re: [Xen-devel] race condition in xen-gntdev (was: Re: gntdev/gntalloc and fork? - crash in gntdev)

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: Marek Marczykowski-GÃrecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>

Date: Mon, 22 Jun 2015 20:13:35 +0200

Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, dgdegra@xxxxxxxxxxxxx, David Vrabel <david.vrabel@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Mon, 22 Jun 2015 18:13:52 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Mon, Jun 22, 2015 at 01:46:27PM -0400, Konrad Rzeszutek Wilk wrote: > On Wed, Jun 17, 2015 at 09:42:11PM +0200, Marek Marczykowski-GÃrecki wrote: > > On Thu, May 28, 2015 at 01:45:08AM +0200, Marek Marczykowski-GÃrecki wrote: > > > On Thu, Apr 30, 2015 at 04:47:44PM +0200, Marek Marczykowski-GÃrecki > > > wrote: > > > > Hi, > > > > > > > > What is the proper way to handle shared pages (either side - using > > > > gntdev or gntalloc) regarding fork and possible exec later? The child > > > > process do not need to access those pages in any way, but will map > > > > different one(s), using newly opened FD to the gntdev/gntalloc device. > > > > Should it unmap them and close FD to the device manually just after the > > > > fork? Or the process using gntdev or gntalloc should prevent using fork > > > > at all? > > > > > > > > I'm asking because I get kernel oops[1] in context of such process. This > > > > process uses both gntdev and gntalloc. The PID reported there is a > > > > child, which maps additional pages (using newly opened FD to > > > > /dev/xen/gnt*), but I'm not sure if the crash happens before, after or > > > > at this second mapping (actually vchan connection), or maybe even at > > > > cleanup of this second mapping. The parent process keeps its mappings > > > > for the whole lifetime of its child. I don't have a 100% reliable way > > > > to reproduce this problem, but it happens quite often when I run such > > > > operations in a loop. > > > > > > Any ideas? > > > > I've done some further debugging, below is what I get. > > Woot! Thank you! > > > > > > The kernel is vanilla 3.19.3, running on Xen 4.4.2. > > > > > > > > The kernel message: > > > > [74376.073464] general protection fault: 0000 [#1] SMP > > > > [74376.073475] Modules linked in: fuse xt_conntrack ipt_MASQUERADE > > > > nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 > > > > nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl > > > > iosf_mbi x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul > > > > crc32c_intel pcspkr xen_netfront ghash_clmulni_intel nfsd auth_rpcgss > > > > nfs_acl lockd grace xenfs xen_privcmd dummy_hcd udc_core xen_gntdev > > > > xen_gntalloc xen_blkback sunrpc u2mfn(O) xen_evtchn xen_blkfront > > > > [74376.073522] CPU: 1 PID: 9377 Comm: qrexec-agent Tainted: G > > > > O 3.19.3-4.pvops.qubes.x86_64 #1 > > > > [74376.073528] task: ffff880002442e40 ti: ffff88000032c000 task.ti: > > > > ffff88000032c000 > > > > [74376.073532] RIP: e030:[<ffffffffa00952c5>] [<ffffffffa00952c5>] > > > > unmap_if_in_range+0x15/0xd0 [xen_gntdev] > > > > static void unmap_if_in_range(struct grant_map *map, > > unsigned long start, unsigned long end) > > { > > unsigned long mstart, mend; > > int err; > > > > if (!map->vma) > > return; > > > > The above crash is at first access to "map"... > > > > > > [74376.073543] RSP: e02b:ffff88000032fc08 EFLAGS: 00010292 > > > > [74376.073546] RAX: 0000000000000000 RBX: dead000000100100 RCX: > > > > 00007fd8616ea000 > > > > [74376.073550] RDX: 00007fd8616ea000 RSI: 00007fd8616e9000 RDI: > > > > dead000000100100 > > > > ... which is 0xdead000000100100 (LIST_POISON1). > > > > > > > > [74376.073554] RBP: ffff88000032fc48 R08: 0000000000000000 R09: > > > > 0000000000000000 > > > > [74376.073557] R10: ffffea000021bb00 R11: 0000000000000000 R12: > > > > 00007fd8616e9000 > > > > [74376.073561] R13: 00007fd8616ea000 R14: ffff880012702e40 R15: > > > > ffff880012702e70 > > > > [74376.073569] FS: 00007fd8616ca700(0000) GS:ffff880013c80000(0000) > > > > knlGS:0000000000000000 > > > > [74376.073574] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [74376.073577] CR2: 00007fd8616e9458 CR3: 00000000e7af5000 CR4: > > > > 0000000000042660 > > > > [74376.073582] Stack: > > > > [74376.073584] ffff8800188356c0 00000000000000d0 ffff88000032fc68 > > > > 00000000c64ef797 > > > > [74376.073590] 0000000000000220 dead000000100100 00007fd8616e9000 > > > > 00007fd8616ea000 > > > > [74376.073596] ffff88000032fc88 ffffffffa00953c6 ffff88000032fcc8 > > > > ffff880012702e70 > > > > [74376.073603] Call Trace: > > > > [74376.073610] [<ffffffffa00953c6>] mn_invl_range_start+0x46/0x90 > > > > [xen_gntdev] > > > > [74376.073620] [<ffffffff811e88fb>] > > > > __mmu_notifier_invalidate_range_start+0x5b/0x90 > > > > [74376.073627] [<ffffffff811c2a59>] do_wp_page+0x769/0x820 > > > > [74376.074031] [<ffffffff811c4f5c>] handle_mm_fault+0x7fc/0x10c0 > > > > [74376.074031] [<ffffffff813864cd>] ? radix_tree_lookup+0xd/0x10 > > > > [74376.074031] [<ffffffff81061e1c>] __do_page_fault+0x1dc/0x5a0 > > > > [74376.074031] [<ffffffff817560a6>] ? mutex_lock+0x16/0x37 > > > > [74376.074031] [<ffffffffa0008928>] ? evtchn_ioctl+0x118/0x3c0 > > > > [xen_evtchn] > > > > [74376.074031] [<ffffffff812209d8>] ? do_vfs_ioctl+0x2f8/0x4f0 > > > > [74376.074031] [<ffffffff811cafdf>] ? do_munmap+0x29f/0x3b0 > > > > [74376.074031] [<ffffffff81062211>] do_page_fault+0x31/0x70 > > > > [74376.074031] [<ffffffff81759e28>] page_fault+0x28/0x30 > > > > > > static void mn_invl_range_start(struct mmu_notifier *mn, > > struct mm_struct *mm, > > unsigned long start, unsigned long end) > > { > > struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn); > > struct grant_map *map; > > > > spin_lock(&priv->lock); > > list_for_each_entry(map, &priv->maps, next) { > > unmap_if_in_range(map, start, end); > > } > > > > mn_invl_range_start+0x46 is the first call to unmap_if_in_range, so > > something is wrong with priv->maps list. > > > > So I've searched for all the list_del calls on this list and found one not > > guarded by spinlock: > > > > static int gntdev_release(struct inode *inode, struct file *flip) > > { > > struct gntdev_priv *priv = flip->private_data; > > struct grant_map *map; > > > > pr_debug("priv %p\n", priv); > > > > while (!list_empty(&priv->maps)) { > > map = list_entry(priv->maps.next, struct grant_map, next); > > list_del(&map->next); > > gntdev_put_map(NULL /* already removed */, map); > > } > > WARN_ON(!list_empty(&priv->freeable_maps)); > > > > if (use_ptemod) > > mmu_notifier_unregister(&priv->mn, priv->mm); > > kfree(priv); > > return 0; > > } > > > > So I see this as: > > P1(parent) P2(child) > > 1. gntdev_release called > > 2. list destroyed (above loop) > > > > 3. page fault occurs, gntdev mmu notifier called > > 4. priv->lock taken > > 5. iterate over priv->maps > > > Could we check the map->users? I don't have an easy way to reproduce the problem. It happens randomly once a day/two in some random domain. I've tried to fiddle with gnttab+fork+page faults, but failed to reliably reproduce the problem. > > 6. crashed since map is already destroyed > > Oh, indeed. gntdev_release calls gntdev_put_map which frees the whole > thing! Ouch. > > > > 7. mmu_notifier_unregister calls: > > 8. mn_release, which tries to take priv->lock > > 9. this process hangs > > > > So I'd guess the fix would be to move mmu_notifier_unregister before > > releasing priv->maps list. Could someone more familiar with this code > > confirm this? > > Hey Marek, > > Could we just add an lock around the 'gntdev_release' loop? I don't fully understand what is going there, especially when gntdev_release is called (in case of forked process, still some memory mapped etc), and when mmu notifier is called. But just looking at this crash IMO adding a lock should be sufficient. What will happen when mmu notifier would not find the mapping? I guess nothing, because the mapping is just removed for a reason, right? > Daniel added this code to guard against a similar race: > > commit 16a1d0225e22e4e273e6b60a21db95decde666c2 > Author: Daniel De Graaf <dgdegra@xxxxxxxxxxxxx> > Date: Wed Jan 2 22:57:12 2013 +0000 > > xen/gntdev: correctly unmap unlinked maps in mmu notifier > > If gntdev_ioctl_unmap_grant_ref is called on a range before unmapping > it, the entry is removed from priv->maps and the later call to > mn_invl_range_start won't find it to do the unmapping. Fix this by > creating another list of freeable maps that the mmu notifier can search > and use to unmap grants. > > Signed-off-by: Daniel De Graaf <dgdegra@xxxxxxxxxxxxx> > > Lets see what he thinks. > > > > > > [74376.074031] Code: e9 dd fd ff ff 31 c9 31 db e9 20 fe ff ff 0f 1f > > > > 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 89 fb > > > > 48 83 ec 28 <48> 8b 47 10 48 85 c0 74 4e 4c 8b 00 49 39 d0 73 46 4c > > > > 8b 48 08 > > > > [74376.074031] RIP [<ffffffffa00952c5>] unmap_if_in_range+0x15/0xd0 > > > > [xen_gntdev] > > > > [74376.074031] RSP <ffff88000032fc08> > > > > [74376.091682] ---[ end trace 2b21c5b714eb1071 ]--- > > > > And the other process: > > > > > > [74404.069009] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! > > > > [qrexec-agent:9379] > > > > [74404.069009] Modules linked in: fuse xt_conntrack ipt_MASQUERADE > > > > nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 > > > > nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables > > > > intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp crct10dif_pclmul > > > > crc32_pclmul crc32c_intel pcspkr xen_netfront ghash_clmulni_intel > > > > nfsd auth_rpcgss nfs_acl lockd grace xenfs xen_privcmd dummy_hcd > > > > udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc u2mfn(O) > > > > xen_evtchn xen_blkfront > > > > [74404.069009] CPU: 2 PID: 9379 Comm: qrexec-agent Tainted: G D > > > > O 3.19.3-4.pvops.qubes.x86_64 #1 > > > > [74404.069009] task: ffff880010e24a00 ti: ffff880002470000 task.ti: > > > > ffff880002470000 > > > > [74404.069009] RIP: e030:[<ffffffff81757b11>] [<ffffffff81757b11>] > > > > _raw_spin_lock+0x21/0x30 > > > > [74404.069009] RSP: e02b:ffff880002473e18 EFLAGS: 00000297 > > > > [74404.069009] RAX: 0000000000000040 RBX: ffff880002345c00 RCX: > > > > 0000000000018cf8 > > > > [74404.069009] RDX: 0000000000000041 RSI: ffff880002345c00 RDI: > > > > ffff880012702e60 > > > > [74404.069009] RBP: ffff880002473e18 R08: ffff880012702240 R09: > > > > 00000001802a0019 > > > > [74404.069009] R10: ffffea000049c080 R11: ffffffffa00955bf R12: > > > > ffff880012702e70 > > > > [74404.069009] R13: ffff880012702e40 R14: ffff8800132c6f20 R15: > > > > ffff880012b163c0 > > > > [74404.069009] FS: 00007fd8616ca700(0000) GS:ffff880013d00000(0000) > > > > knlGS:0000000000000000 > > > > [74404.069009] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [74404.069009] CR2: 00007fd8610be098 CR3: 000000000b971000 CR4: > > > > 0000000000042660 > > > > [74404.069009] Stack: > > > > [74404.069009] ffff880002473e48 ffffffffa0095452 ffff880002473e48 > > > > ffff880002345c00 > > > > [74404.069009] ffff880012702e70 0000000000000000 ffff880002473e78 > > > > ffffffff811e8c2e > > > > [74404.069009] ffff880002473e78 ffff880012702e40 ffff880012702e40 > > > > ffff880012d123c8 > > > > [74404.069009] Call Trace: > > > > [74404.069009] [<ffffffffa0095452>] mn_release+0x22/0x130 > > > > [xen_gntdev] > > > > [74404.069009] [<ffffffff811e8c2e>] > > > > mmu_notifier_unregister+0x4e/0xe0 > > > > [74404.069009] [<ffffffffa00957c0>] gntdev_release+0x60/0xa0 > > > > [xen_gntdev] > > > > [74404.069009] [<ffffffff8120ec0f>] __fput+0xdf/0x1e0 > > > > [74404.069009] [<ffffffff8120ed5e>] ____fput+0xe/0x10 > > > > [74404.069009] [<ffffffff810b56df>] task_work_run+0xbf/0x100 > > > > [74404.069009] [<ffffffff81014c47>] do_notify_resume+0x97/0xb0 > > > > [74404.069009] [<ffffffff81758127>] int_signal+0x12/0x17 > > > > [74404.069009] Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 > > > > 48 89 e5 b8 00 01 00 00 f0 66 0f c1 07 0f b6 d4 38 c2 75 04 5d c3 f3 > > > > 90 0f b6 07 <38> d0 75 f7 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 > > > > 66 90 55 > > > > -- > > Best Regards, > > Marek Marczykowski-GÃrecki > > Invisible Things Lab > > A: Because it messes up the order in which people normally read text. > > Q: Why is top-posting such a bad thing? > > -- Best Regards, Marek Marczykowski-GÃrecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

Attachment: pgprAacMlVFUH.pgp
Description: PGP signature

_______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.