Re: [Xen-devel] Crashing kernel with dom0/libxc gnttab/gntshr
On Fri, 2 Aug 2013, Jeremy Fitzhardinge wrote:
> On 08/02/2013 06:50 AM, Stefano Stabellini wrote:
> > On Tue, 30 Jul 2013, Daniel De Graaf wrote:
> >> On 07/30/2013 12:58 PM, David Vrabel wrote:
> >> [...]
> >>> [ 902.729307] BUG: Bad page map in process vchan-node1 pte:12bfff167 pmd:b9b5c067
> >>> [ 902.729312] page:ffffea0004afffc0 count:1 mapcount:-1 mapping: (null) index:0xffffffffffffffff
> >>>
> >>> I think this is the test for page_mapcount(page) < 0 in zap_pte_range().
> >>> This has looked up the page using the PTE it is trying to clear. Has
> >>> it found the correct page? Since the MFN is currently mapped into the
> >>> same domain, has the m2p_override stuff confused the look up and it is
> >>> checking the grantee page not the granter?
> >>>
> >>> David
> >> I think something like this is happening, since while reproducing this
> >> on my test system, some linked list corruption was found that I believe
> >> to be the cause of this problem. The gnttab_map_refs function on PV uses
> >> m2p_add_override on the page, which threads page->lru to an
> >> m2p_overrides list. However, something else is using page->lru during
> >> the use of gntdev, as shown by the following debug patch:
> > I have never managed to prove that something else is trying to use
> > page->lru while the m2p_override is using it.
> >
> > Jeremy, at the time the code was written, you were pretty confident
> > that page->lru couldn't be used by anybody else.
> > Why was that?
>
> Hm. Probably the reasoning was that page->lru was only used for pages
> which in the pagecache, mapped from files, and m2p pages are never
> mapped from files. But maybe something else has decided to use lru for
> non-mapped pages (transparent hugepage? page dedup?), or are m2p pages
> getting into the pagecache somehow?
>

I think it could be the latter. For example we have recently changed
QEMU not to use O_DIRECT on foreign grants to work around a network bug
in the kernel.
It might be possible that these pages end up in the pagecache after they
have been already added to the m2p.

> >
> >
> >
> >> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> >> index 3c8803f..198e57e 100644
> >> --- a/drivers/xen/gntdev.c
> >> +++ b/drivers/xen/gntdev.c
> >> @@ -294,6 +294,11 @@ static int map_grant_pages(struct grant_map *map)
> >>  	if (err)
> >>  		return err;
> >> +	printk("map page0 lru: %p prev=%p:%p next=%p:%p\n",
> >> +		&map->pages[0]->lru,
> >> +		map->pages[0]->lru.prev, map->pages[0]->lru.prev->next,
> >> +		map->pages[0]->lru.next, map->pages[0]->lru.next->prev);
> >> +
> >>  	for (i = 0; i < map->count; i++) {
> >>  		if (map->map_ops[i].status)
> >>  			err = -EINVAL;
> >> @@ -320,6 +325,10 @@ static int __unmap_grant_pages(struct grant_map *map, int offset, int pages)
> >>  		}
> >>  	}
> >> +	printk("unmap page0 lru: %p prev=%p:%p next=%p:%p\n",
> >> +		&map->pages[0]->lru,
> >> +		map->pages[0]->lru.prev, map->pages[0]->lru.prev->next,
> >> +		map->pages[0]->lru.next, map->pages[0]->lru.next->prev);
> >>  	err = gnttab_unmap_refs(map->unmap_ops + offset,
> >>  			use_ptemod ? map->kmap_ops + offset : NULL, map->pages + offset,
> >>  			pages);
> >>
> >> Output:
> >> [ 88.610644] map page0 lru: ffffea0001dee160 prev=ffffffff82f2d510:ffffea0001dee160 next=ffffffff82f2d510:ffffea0001dee160
> >> [ 88.611515] BUG: Bad page map in process a.out pte:8000000077b85167 pmd:2541a067
> >> [ 88.611525] page:ffffea0001dee140 count:1 mapcount:-1 mapping: (null) index:0xffffffffffffffff
> >> [ 88.611532] page flags: 0x1000000000000814(referenced|dirty|private)
> >> [ 88.611541] addr:00007f1adaef3000 vm_flags:140400fb anon_vma: (null) mapping:ffff8800692974a0 index:0
> >> [ 88.611547] vma->vm_ops->fault: (null)
> >> [ 88.611555] vma->vm_file->f_op->mmap: gntalloc_mmap+0x0/0x1d0
> >> [...backtrace cropped...]
> >> [ 88.614301] unmap page0 lru: ffffea0001dee160 prev=ffff8800254c9d08:ffff88001ea0b120 next=ffff8800254c9d08:ffff88001ea0b938
> >>
> >> The initial map is a linked list with only that element, so the address
> >> 0xffffffff82f2d510 is the m2p_overrides entry. This means the page being
> >> found by zap_pte_range is not a valid struct page.
> >>
> >> The struct page* being used by the gntalloc device was 0xffffea0000952740,
> >> for reference; it's not a direct collision between the page used by the
> >> gntdev and gntalloc devices.
> >>
> >> Not sure what the best fix is for this at the moment.
> >>
> >> --
> >> Daniel De Graaf
> >> National Security Agency
> >>
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
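
For readers following the debug output quoted above: the two printk lines check whether the neighbours of page->lru still point back at it (prev->next and next->prev should both equal the node's own address). The following is only a userspace sketch of that invariant, not kernel code. The struct and the list_init/list_add helpers are simplified stand-ins for the kernel's list.h, and list_node_consistent, clobber_links and lru_reuse_demo are hypothetical names invented for illustration. The "second user" that overwrites the links is an assumption made to reproduce the pattern in the log; the thread has not established what actually touches page->lru.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Make head an empty circular list (it points at itself). */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right after head, mirroring the kernel's list_add(). */
static void list_add(struct list_head *node, struct list_head *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* The invariant the debug printk exposes: both neighbours point back. */
static bool list_node_consistent(const struct list_head *node)
{
    return node->prev->next == node && node->next->prev == node;
}

/*
 * Hypothetical second user (an assumption for illustration): it stores
 * pointers to a foreign list head in the node's links without linking
 * the node into that list.
 */
static void clobber_links(struct list_head *node, struct list_head *foreign)
{
    node->next = foreign;
    node->prev = foreign;
}

/* Returns true if the corruption is detected by the invariant check. */
static bool lru_reuse_demo(void)
{
    struct list_head overrides, foreign, a, b, lru;

    list_init(&overrides);
    list_init(&foreign);
    list_add(&a, &foreign);
    list_add(&b, &foreign);

    /* First user: thread lru onto an otherwise empty overrides list. */
    list_add(&lru, &overrides);
    assert(list_node_consistent(&lru)); /* like the "map page0" line */

    /* Second user overwrites the links while the first still owns them. */
    clobber_links(&lru, &foreign);

    /* Like the "unmap page0" line: prev == next, but neither points back. */
    return !list_node_consistent(&lru) && overrides.next == &lru;
}
```

In this sketch the overrides list still references the node after the overwrite, so a later removal through that list would walk stale pointers, which is consistent with the kind of linked list corruption Daniel describes.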