
Re: [Xen-devel] race condition in xen-gntdev



On 06/22/2015 02:37 PM, Konrad Rzeszutek Wilk wrote:
On Mon, Jun 22, 2015 at 08:13:35PM +0200, Marek Marczykowski-Górecki wrote:
On Mon, Jun 22, 2015 at 01:46:27PM -0400, Konrad Rzeszutek Wilk wrote:
On Wed, Jun 17, 2015 at 09:42:11PM +0200, Marek Marczykowski-Górecki wrote:
On Thu, May 28, 2015 at 01:45:08AM +0200, Marek Marczykowski-Górecki wrote:
On Thu, Apr 30, 2015 at 04:47:44PM +0200, Marek Marczykowski-Górecki wrote:
Hi,

What is the proper way to handle shared pages (on either side - using
gntdev or gntalloc) with respect to fork and a possible exec later? The child
process does not need to access those pages in any way, but it will map
different one(s), using a newly opened FD to the gntdev/gntalloc device.
Should it unmap them and close the FD to the device manually just after the
fork? Or should a process using gntdev or gntalloc avoid using fork
at all?
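
To make the first option concrete, I mean something roughly like this in the
child right after fork() (just a sketch; parent_fd, parent_map and MAP_LEN
stand for whatever the parent actually set up):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define MAP_LEN 4096   /* size of the parent's gntdev mapping (assumed) */

        /* Child side: drop the inherited gntdev mapping and FD, then open
         * a fresh FD for the child's own grant mappings. */
        static int child_reopen_gntdev(int parent_fd, void *parent_map)
        {
            munmap(parent_map, MAP_LEN);
            close(parent_fd);

            return open("/dev/xen/gntdev", O_RDWR | O_CLOEXEC);
        }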

I'm asking because I get a kernel oops[1] in the context of such a process.
The process uses both gntdev and gntalloc. The PID reported there is the
child, which maps additional pages (using a newly opened FD to
/dev/xen/gnt*), but I'm not sure whether the crash happens before, after or
at this second mapping (actually a vchan connection), or maybe even at the
cleanup of this second mapping. The parent process keeps its mappings
for the whole lifetime of its child. I don't have a 100% reliable way
to reproduce this problem, but it happens quite often when I run such
operations in a loop.

Any ideas?

I've done some further debugging; below is what I found.

Woot! Thank you!

The kernel is vanilla 3.19.3, running on Xen 4.4.2.

The kernel message:
[74376.073464] general protection fault: 0000 [#1] SMP
[74376.073475] Modules linked in: fuse xt_conntrack ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl iosf_mbi 
x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr 
xen_netfront ghash_clmulni_intel nfsd auth_rpcgss nfs_acl lockd grace xenfs 
xen_privcmd dummy_hcd udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc 
u2mfn(O) xen_evtchn xen_blkfront
[74376.073522] CPU: 1 PID: 9377 Comm: qrexec-agent Tainted: G           O   
3.19.3-4.pvops.qubes.x86_64 #1
[74376.073528] task: ffff880002442e40 ti: ffff88000032c000 task.ti: 
ffff88000032c000
[74376.073532] RIP: e030:[<ffffffffa00952c5>]  [<ffffffffa00952c5>] 
unmap_if_in_range+0x15/0xd0 [xen_gntdev]

        static void unmap_if_in_range(struct grant_map *map,
                          unsigned long start, unsigned long end)
        {
            unsigned long mstart, mend;
            int err;

            if (!map->vma)
                return;

The above crash is at the first access to "map"...

[74376.073543] RSP: e02b:ffff88000032fc08  EFLAGS: 00010292
[74376.073546] RAX: 0000000000000000 RBX: dead000000100100 RCX: 00007fd8616ea000
[74376.073550] RDX: 00007fd8616ea000 RSI: 00007fd8616e9000 RDI: dead000000100100

... which is 0xdead000000100100 (LIST_POISON1).
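
(For reference: list_del() writes LIST_POISON1 into the removed entry's next
pointer. From include/linux/poison.h, abridged; on x86_64
CONFIG_ILLEGAL_POINTER_VALUE is 0xdead000000000000, hence the 0xdead...
prefix.)

        /* include/linux/poison.h (abridged) */
        #ifdef CONFIG_ILLEGAL_POINTER_VALUE
        # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
        #else
        # define POISON_POINTER_DELTA 0
        #endif

        /* Non-NULL pointers that fault on dereference, used to catch
         * use of list entries that have already been deleted. */
        #define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
        #define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)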


[74376.073554] RBP: ffff88000032fc48 R08: 0000000000000000 R09: 0000000000000000
[74376.073557] R10: ffffea000021bb00 R11: 0000000000000000 R12: 00007fd8616e9000
[74376.073561] R13: 00007fd8616ea000 R14: ffff880012702e40 R15: ffff880012702e70
[74376.073569] FS:  00007fd8616ca700(0000) GS:ffff880013c80000(0000) 
knlGS:0000000000000000
[74376.073574] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[74376.073577] CR2: 00007fd8616e9458 CR3: 00000000e7af5000 CR4: 0000000000042660
[74376.073582] Stack:
[74376.073584]  ffff8800188356c0 00000000000000d0 ffff88000032fc68 
00000000c64ef797
[74376.073590]  0000000000000220 dead000000100100 00007fd8616e9000 
00007fd8616ea000
[74376.073596]  ffff88000032fc88 ffffffffa00953c6 ffff88000032fcc8 
ffff880012702e70
[74376.073603] Call Trace:
[74376.073610]  [<ffffffffa00953c6>] mn_invl_range_start+0x46/0x90 [xen_gntdev]
[74376.073620]  [<ffffffff811e88fb>]
__mmu_notifier_invalidate_range_start+0x5b/0x90
[74376.073627]  [<ffffffff811c2a59>] do_wp_page+0x769/0x820
[74376.074031]  [<ffffffff811c4f5c>] handle_mm_fault+0x7fc/0x10c0
[74376.074031]  [<ffffffff813864cd>] ? radix_tree_lookup+0xd/0x10
[74376.074031]  [<ffffffff81061e1c>] __do_page_fault+0x1dc/0x5a0
[74376.074031]  [<ffffffff817560a6>] ? mutex_lock+0x16/0x37
[74376.074031]  [<ffffffffa0008928>] ? evtchn_ioctl+0x118/0x3c0
[xen_evtchn]
[74376.074031]  [<ffffffff812209d8>] ? do_vfs_ioctl+0x2f8/0x4f0
[74376.074031]  [<ffffffff811cafdf>] ? do_munmap+0x29f/0x3b0
[74376.074031]  [<ffffffff81062211>] do_page_fault+0x31/0x70
[74376.074031]  [<ffffffff81759e28>] page_fault+0x28/0x30


        static void mn_invl_range_start(struct mmu_notifier *mn,
                        struct mm_struct *mm,
                        unsigned long start, unsigned long end)
        {
            struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
            struct grant_map *map;

            spin_lock(&priv->lock);
            list_for_each_entry(map, &priv->maps, next) {
                unmap_if_in_range(map, start, end);
            }

mn_invl_range_start+0x46 is the first call to unmap_if_in_range, so
something is wrong with the priv->maps list.

So I've searched for all the list_del calls on this list and found one that is
not guarded by the spinlock:

        static int gntdev_release(struct inode *inode, struct file *flip)
        {
            struct gntdev_priv *priv = flip->private_data;
            struct grant_map *map;

            pr_debug("priv %p\n", priv);

            while (!list_empty(&priv->maps)) {
                map = list_entry(priv->maps.next, struct grant_map, next);
                list_del(&map->next);
                gntdev_put_map(NULL /* already removed */, map);
            }
            WARN_ON(!list_empty(&priv->freeable_maps));

            if (use_ptemod)
                mmu_notifier_unregister(&priv->mn, priv->mm);
            kfree(priv);
            return 0;
        }

So I see this as:
P1(parent)                 P2(child)
                            1. gntdev_release called
                            2. list destroyed (above loop)

3. page fault occurs, gntdev mmu notifier called
4. priv->lock taken
5. iterate over priv->maps


Could we check the map->users?

I don't have an easy way to reproduce the problem. It happens randomly,
once every day or two, in some random domain. I've tried fiddling with
gnttab+fork+page faults, but failed to reproduce the problem reliably.

6. crashed since map is already destroyed

Oh, indeed. gntdev_release calls gntdev_put_map which frees the whole
thing! Ouch.

                            7. mmu_notifier_unregister calls:
                            8.   mn_release, which tries to take priv->lock
                            9. this process hangs

So I'd guess the fix would be to move mmu_notifier_unregister before
releasing the priv->maps list. Could someone more familiar with this code
confirm this?
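
Something like this (completely untested, just to illustrate the idea; it
only reorders the gntdev_release() quoted above):

        static int gntdev_release(struct inode *inode, struct file *flip)
        {
            struct gntdev_priv *priv = flip->private_data;
            struct grant_map *map;

            pr_debug("priv %p\n", priv);

            /* Unregister first, so mn_invl_range_start() can no longer
             * walk priv->maps while we tear it down below. */
            if (use_ptemod)
                mmu_notifier_unregister(&priv->mn, priv->mm);

            while (!list_empty(&priv->maps)) {
                map = list_entry(priv->maps.next, struct grant_map, next);
                list_del(&map->next);
                gntdev_put_map(NULL /* already removed */, map);
            }
            WARN_ON(!list_empty(&priv->freeable_maps));

            kfree(priv);
            return 0;
        }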

Hey Marek,

Could we just add a lock around the 'gntdev_release' loop?

I don't fully understand what is going on there, especially when
gntdev_release is called (in the case of a forked process, with some memory
still mapped etc.) and when the mmu notifier is called. But just looking at
this crash, IMO adding a lock should be sufficient. What will happen when the
mmu notifier does not find the mapping? I guess nothing, because the mapping
is being removed for a reason, right?

That was my thought. But then why didn't we add this lock in the first place?
Hopefully Daniel can remember a ~2yr-old patch :-)

The reason that gntdev_release didn't have a lock is because there are not
supposed to be any references to the areas pointed to by priv->maps when it
is called.  However, since the MMU notifier has not yet been unregistered,
it is apparently possible to race here; the comment on mmu_notifier_unregister
seems to confirm this as a possibility (as do the backtraces).

I think adding the lock will be sufficient.
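
Roughly like this (an untested sketch based on the gntdev_release() quoted
earlier; the lock is dropped around gntdev_put_map() here since that path may
sleep, or alternatively priv->lock could be converted to a mutex):

        static int gntdev_release(struct inode *inode, struct file *flip)
        {
            struct gntdev_priv *priv = flip->private_data;
            struct grant_map *map;

            pr_debug("priv %p\n", priv);

            /* Hold priv->lock while walking priv->maps so that
             * mn_invl_range_start() cannot see a half-removed entry. */
            spin_lock(&priv->lock);
            while (!list_empty(&priv->maps)) {
                map = list_entry(priv->maps.next, struct grant_map, next);
                list_del(&map->next);
                /* gntdev_put_map() may sleep, so drop the spinlock around
                 * it; the entry is already off the list at this point. */
                spin_unlock(&priv->lock);
                gntdev_put_map(NULL /* already removed */, map);
                spin_lock(&priv->lock);
            }
            WARN_ON(!list_empty(&priv->freeable_maps));
            spin_unlock(&priv->lock);

            if (use_ptemod)
                mmu_notifier_unregister(&priv->mn, priv->mm);
            kfree(priv);
            return 0;
        }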

