Re: [Xen-devel] dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set
Wednesday, September 5, 2012, 4:06:01 PM, you wrote:

> On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
>> Ben,
>>
>> You have asked me to provide the rationale behind the gnttab_old_mfn
>> patch, which you emailed to Sander earlier today. Here are my findings.
>>
>> I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has
>> changed from our previous version. It now calls gnttab_map_refs() in
>> drivers/xen/grant-table.c.
>>
>> That function first calls
>> HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ...) and then calls
>> m2p_add_override() in p2m.c,

> And HYPERVISOR_grant_table_op would populate map_ops[i].bus_addr with the
> machine address.

>> which is where I made my change.
>>
>> The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.
>>
>> kmap_op is of type struct gnttab_map_grant_ref. That data type is used to
>> record grant table mappings so that later they can be unmapped correctly.

> Right, but blkback makes a distinction by passing NULL as kmap_op, which
> means it should use the old mechanism. Meaning that once the hypercall is
> done, map_ops[i].bus_addr is not used anymore.

>> The problem with saving the old mfn in kmap_op->dev_bus_addr is that it
>> is later overwritten by __gnttab_map_grant_ref() in
>> xen/common/grant_table.c.

> Uh, so the problem of saving the old mfn in dev_bus_addr has been there
> for a long, long time then? Even before this patch set?

>> Since the storage holding the old mfn got overwritten, the unmapping was
>> being done incorrectly. The balloon code detected that and bugged at
>> drivers/xen/balloon.c:359.

> Hmm, I believe the storage for holding the old mfn was/is page->index.

>> My patch simply adds another member called old_mfn to struct
>> gnttab_map_grant_ref rather than trying to overload dev_bus_addr.
>>
>> I don't know if Sander's bug is the same or related. The BUG_ON at
>> drivers/xen/balloon.c:359 is quite general.
>> It simply asserts that we are not trying to re-map a valid mapping.

> Right. Somehow he ends up with valid mappings where there should be none.
> And lots of them.

It's something between kernel v3.4.1 and v3.5.3; I haven't had time to
narrow it down yet. Any suggestions for specific commits I could try, to
quickly bisect this one?

>> -- Robert Phillips
>>
>> -----Original Message-----
>> From: Sander Eikelenboom [mailto:linux@xxxxxxxxxxxxxx]
>> Sent: Tuesday, September 04, 2012 3:35 PM
>> To: Ben Guthro
>> Cc: Konrad Rzeszutek Wilk; xen-devel@xxxxxxxxxxxxx; Robert Phillips
>> Subject: Re: [Xen-devel] dom0 linux 3.6.0-rc4, crash due to ballooning
>> althoug dom0_mem=X, max:X set
>>
>> Tuesday, September 4, 2012, 8:07:11 PM, you wrote:
>>
>> > We ran into the same issue in newer kernels, but had not yet submitted
>> > this fix.
>>
>> > One of the developers here came up with a fix (attached, and CC'ed
>> > here) for an issue where the p2m code reuses a structure member where
>> > it shouldn't.
>> > The patch adds a new "old_mfn" member to the gnttab_map_grant_ref
>> > structure, instead of re-using dev_bus_addr.
>>
>> > If this also works for you, I can re-submit it with a Signed-off-by
>> > line, if you prefer, Konrad.
>>
>> Hi Ben,
>>
>> This patch doesn't work for me. When starting the PV guest I get:
>>
>> (XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (68b69070).
>> (XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (0).
>> (XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (0).
>> and from the dom0 kernel:
>>
>> [  374.425727] BUG: unable to handle kernel paging request at ffff8800fffd9078
>> [  374.428901] IP: [<ffffffff81336e4e>] gnttab_map_refs+0x14e/0x270
>> [  374.428901] PGD 1e0c067 PUD 0
>> [  374.428901] Oops: 0000 [#1] PREEMPT SMP
>> [  374.428901] Modules linked in:
>> [  374.428901] CPU 0
>> [  374.428901] Pid: 4308, comm: qemu-system-i38 Not tainted 3.6.0-rc4-20120830+ #70 System manufacturer System Product Name/P5Q-EM DO
>> [  374.428901] RIP: e030:[<ffffffff81336e4e>]  [<ffffffff81336e4e>] gnttab_map_refs+0x14e/0x270
>> [  374.428901] RSP: e02b:ffff88002f185ca8  EFLAGS: 00010206
>> [  374.428901] RAX: ffff880000000000 RBX: ffff88001471cf00 RCX: 00000000fffd9078
>> [  374.428901] RDX: 0000000000000050 RSI: 40000000000fffd9 RDI: 00003ffffffff000
>> [  374.428901] RBP: ffff88002f185d08 R08: 0000000000000078 R09: 0000000000000000
>> [  374.428901] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
>> [  374.428901] R13: ffff88001471c480 R14: 0000000000000002 R15: 0000000000000002
>> [  374.428901] FS:  00007f6def9f2740(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
>> [  374.428901] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [  374.428901] CR2: ffff8800fffd9078 CR3: 000000002d30e000 CR4: 0000000000042660
>> [  374.428901] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  374.428901] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [  374.428901] Process qemu-system-i38 (pid: 4308, threadinfo ffff88002f184000, task ffff8800376f1040)
>> [  374.428901] Stack:
>> [  374.428901]  ffffffffffffffff 0000000000000050 00000000fffd9078 00000000000fffd9
>> [  374.428901]  0000000001000000 ffff8800382135a0 ffff88002f185d08 ffff880038211960
>> [  374.428901]  ffff88002f11d2c0 0000000000000004 0000000000000003 0000000000000001
>> [  374.428901] Call Trace:
>> [  374.428901]  [<ffffffff8134212e>] gntdev_mmap+0x20e/0x520
>> [  374.428901]  [<ffffffff8111c502>] ? mmap_region+0x312/0x5a0
>> [  374.428901]  [<ffffffff810ae0a0>] ? lockdep_trace_alloc+0xa0/0x130
>> [  374.428901]  [<ffffffff8111c5be>] mmap_region+0x3ce/0x5a0
>> [  374.428901]  [<ffffffff8111c9e0>] do_mmap_pgoff+0x250/0x350
>> [  374.428901]  [<ffffffff81109e88>] vm_mmap_pgoff+0x68/0x90
>> [  374.428901]  [<ffffffff8111a5b2>] sys_mmap_pgoff+0x152/0x170
>> [  374.428901]  [<ffffffff812b29be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [  374.428901]  [<ffffffff81011f29>] sys_mmap+0x29/0x30
>> [  374.428901]  [<ffffffff8184b939>] system_call_fastpath+0x16/0x1b
>> [  374.428901] Code: 0f 84 e7 00 00 00 48 89 f1 48 c1 e1 0c 41 81 e0 ff 0f 00 00 48 b8 00 00 00 00 00 88 ff ff 48 bf 00 f0 ff ff ff 3f 00 00 4c 01 c1 <48> 23 3c 01 48 c1 ef 0c 49 8d 54 15 00 4d 85 ed b8 00 00 00 00
>> [  374.428901] RIP  [<ffffffff81336e4e>] gnttab_map_refs+0x14e/0x270
>> [  374.428901]  RSP <ffff88002f185ca8>
>> [  374.428901] CR2: ffff8800fffd9078
>> [  374.428901] ---[ end trace 0e0a5a49f6503c0a ]---

>> > Ben

>> > On Tue, Sep 4, 2012 at 1:19 PM, Sander Eikelenboom <linux@xxxxxxxxxxxxxx> wrote:
>> >>
>> >> Tuesday, September 4, 2012, 6:33:47 PM, you wrote:
>> >>
>> >>> On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
>> >>>> Hi Konrad,
>> >>>>
>> >>>> This seems to happen only on an Intel machine I'm trying to set up as
>> >>>> a development machine (I haven't seen it on my AMD).
>> >>>> It boots fine; I have dom0_mem=1024M,max:1024M set, and the machine
>> >>>> has 2G of mem.
>>
>> >>> Is this only with Xen 4.2? As in, does Xen 4.1 work?
>> >>>>
>> >>>> Dom0 and guest kernel are 3.6.0-rc4 with config:
>>
>> >>> If you back out:
>>
>> >>> commit c96aae1f7f393387d160211f60398d58463a7e65
>> >>> Author: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> >>> Date:   Fri Aug 17 16:43:28 2012 -0400
>> >>>
>> >>>     xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.
>>
>> >>> Do you see this bug? (Either with Xen 4.1 or Xen 4.2)?
>> >> With c96aae1f7f393387d160211f60398d58463a7e65 reverted I still see
>> >> this bug (with Xen 4.2).
>> >>
>> >> Will use the debug patch you mailed and send back the results ...
>>
>> >>>> [*] Xen memory balloon driver
>> >>>> [*]   Scrub pages before returning them to system
>> >>>>
>> >>>> From
>> >>>> http://wiki.xen.org/wiki/Dom0_Memory_%E2%80%94_Where_It_Has_Not_Gone
>> >>>> I thought this should be okay.
>> >>>>
>> >>>> But when trying to start a PV guest with 512MB mem, the machine
>> >>>> (dom0) crashes with the stacktrace below (complete serial-log.txt
>> >>>> attached).
>> >>>>
>> >>>> From the:
>> >>>> "mapping kernel into physical memory
>> >>>>  about to get started..."
>> >>>>
>> >>>> I would almost say it's trying to reload dom0?
>> >>>>
>> >>>> [  897.161119] device vif1.0 entered promiscuous mode
>> >>>> mapping kernel into physical memory
>> >>>> about to get started...
>> >>>> [  897.696619] xen_bridge: port 1(vif1.0) entered forwarding state
>> >>>> [  897.716219] xen_bridge: port 1(vif1.0) entered forwarding state
>> >>>> [  898.129465] ------------[ cut here ]------------
>> >>>> [  898.132209] kernel BUG at drivers/xen/balloon.c:359!
>> >>>> [  898.132209] invalid opcode: 0000 [#1] PREEMPT SMP
>>
>> >> _______________________________________________
>> >> Xen-devel mailing list
>> >> Xen-devel@xxxxxxxxxxxxx
>> >> http://lists.xen.org/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
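[Editor's note] Sander's v3.4.1 to v3.5.3 window above is exactly the situation git bisect handles. The sketch below demonstrates the mechanics on a throwaway repository (eight commits, with the "regression" planted in commit 5); the kernel-tree command in the note afterwards is the real-world equivalent, and its path restriction is only a guess at the likely suspect directories.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect

# Eight commits; pretend the bug lands when the file's value reaches 5.
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > state
    git add state
    git commit -q -m "commit $i"
done

# Mark HEAD bad and the oldest commit good, then let git drive the search.
# The run script exits 0 (good) while the bug is absent, non-zero once present.
git bisect start HEAD HEAD~7 > /dev/null
git bisect run sh -c '[ "$(cat state)" -lt 5 ]' > /dev/null
echo "first bad: $(git show -s --format=%s refs/bisect/bad)"
```

Against a kernel tree the same flow would be something like `git bisect start v3.5.3 v3.4.1 -- drivers/xen arch/x86/xen drivers/block/xen-blkback`, with a boot-and-start-a-guest test as the run script; limiting the paths (an assumption about where the regression lives) shrinks the number of kernels to build and boot considerably.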