
Re: [Xen-devel] Xen unstability on HP Moonshot m400



On Mon, 2015-03-23 at 23:58 +0000, Stefano Stabellini wrote:
> On Mon, 23 Mar 2015, Christoffer Dall wrote:
> > On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell <ian.campbell@xxxxxxxxxx> 
> > wrote:
> >       On Sat, 2015-03-21 at 13:34 +0100, Christoffer Dall wrote:
> >       > Hi,
> >       >
> >       > I have been experiencing a problematic crash running Xen on m400
> >       > over the last few days.  I already spoke to Ian and Stefano about
> >       > this, but thought I'd summarize what I've seen so far and loop in
> >       > a wider audience.
> >       >
> >       > The basic setup is this:
> >       >  - Two m400 nodes, one running Linux bare-metal, the other running
> >       > Xen.
> >       >  - The Xen node runs Dom0 and 1 DomU
> >       >  - The m400 has a Mellanox ConnectX-3 PCIe 10G ethernet card with
> >       > two ports on it
> >       >  - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected to
> >       > the internet) and regular bridging to eth1 which is connected to a
> >       > private VLAN to the bare-metal node
> >       >  - Dom0 and DomU are configured with 14GB of ram, 4 cpus each
> >       >  - DomU runs apache2 serving the GCC manual (see
> >       > https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh)
> >       >
> >       > The bare-metal node runs apache bench, like this:
> >       > "ab -n 100000 -c 100 http://10.10.1.120/gcc/index.html"
> >       >
> >       > (10.10.1.120 is the DomU IP address of the bridged interface to
> >       > eth1)
> >       >
> >       > What happens now is that the entire Xen node goes down.  I see
> >       > various errors in the kernel log, some examples:
> >       > http://pastebin.ubuntu.com/10642148/
> >       > http://pastebin.ubuntu.com/10642177/
> >       > http://pastebin.ubuntu.com/10642181/
> >       > http://pastebin.ubuntu.com/10635573/
> >       >
> >       >
> >       > All Linux kernels are 3.18 plus some tweaks for the m400 cartridge:
> >       > https://github.com/columbia/linux-kvm-arm/tree/columbia-armvirt-3.18
> > 
> >       Is it worth adding
> >       https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
> >       to your kernel? It isn't Xen specific but it's perhaps possible that
> >       Xen opens the window wider.

You definitely want that one. Without it, the page table walker could
end up using a stale pointer to a page being used for something other
than page tables.

> > 
> >       How confident are you in
> >       https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb
> >       ?
> >       (although I suppose you aren't running in ACPI mode if you are running
> >       Xen?)
> > 
> > 
> > I'm not confident at all, but Linux (last I checked was v3.19) doesn't boot
> > without it, so not sure if there's an alternative?  Mark?
> 
> This patch is key: it doesn't look like it is setting
> dev->archdata.dma_coherent appropriately, see the implementation of
> set_arch_dma_coherent_ops.

You'd want this if booting with ACPI. You might also need it for
enumerated PCI devices even if booting with devicetree.

> 
> 
> > 
> >       If we think the issue might be to do with coherency of foreign
> >       mappings undergoing i/o from dom0 and we've already ruled out disk
> >       (by using a loopback mounted rootfs) then it might be worth bodging
> >       netback to always copy too.
> > 
> >       Adding a call to skb_orphan_frags right before the netif_receive_skb
> >       in drivers/net/xen-netback/netback.c:xenvif_tx_submit is a simple but
> >       rather inefficient way of doing that (so I hope it doesn't perturb the
> >       issue).
> > 
> > 
> > I'll be happy to try this.
> 
> If we are right and the problem is due to the commit above not setting
> dma_coherent to true (the kernel will think that actually the network
> card is not coherent), then Ian's workaround should hide the problem.
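
To make the "always copy" bodge concrete, here is a rough user-space
model of the idea. It is not netback code: model_frag, model_skb,
model_orphan_frags, model_free_frags and model_submit are all made-up
names for illustration, and the real change would be the
skb_orphan_frags call ahead of netif_receive_skb that Ian describes. The
sketch only shows the principle: any fragment still pointing at
guest-owned (foreign) memory is replaced by a private copy before the
packet is passed on, and the packet is dropped if a copy cannot be
allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_MAX_FRAGS 4

/* Hypothetical stand-in for one packet fragment. */
struct model_frag {
    unsigned char *data; /* payload bytes */
    size_t len;          /* payload length in bytes */
    bool foreign;        /* true while data points at guest-owned memory */
};

/* Hypothetical stand-in for a packet with up to MODEL_MAX_FRAGS fragments. */
struct model_skb {
    struct model_frag frags[MODEL_MAX_FRAGS];
    size_t nr_frags;
};

/*
 * Replace every foreign fragment with a heap-allocated private copy.
 * Returns 0 on success, -1 if an allocation fails. On failure, fragments
 * already copied stay copied and the remaining ones are left untouched.
 */
static int model_orphan_frags(struct model_skb *skb)
{
    for (size_t i = 0; i < skb->nr_frags; i++) {
        struct model_frag *f = &skb->frags[i];

        if (!f->foreign)
            continue;

        /* Allocate at least one byte so a zero-length fragment is not
         * mistaken for an allocation failure. */
        unsigned char *copy = malloc(f->len ? f->len : 1);
        if (copy == NULL)
            return -1;

        memcpy(copy, f->data, f->len);
        f->data = copy;
        f->foreign = false;
    }
    return 0;
}

/*
 * Release private copies. Model assumption: every non-foreign fragment
 * was heap-allocated by model_orphan_frags.
 */
static void model_free_frags(struct model_skb *skb)
{
    for (size_t i = 0; i < skb->nr_frags; i++) {
        struct model_frag *f = &skb->frags[i];

        if (!f->foreign) {
            free(f->data);
            f->data = NULL;
            f->len = 0;
        }
    }
}

/*
 * Copy first, then "deliver". Returns the number of payload bytes
 * delivered, or -1 if the packet was dropped because a copy failed.
 */
static long model_submit(struct model_skb *skb)
{
    if (model_orphan_frags(skb) != 0)
        return -1; /* drop rather than pass foreign memory onwards */

    long total = 0;
    for (size_t i = 0; i < skb->nr_frags; i++)
        total += (long)skb->frags[i].len;
    return total;
}
```

After model_submit succeeds, later writes to the original guest buffer
no longer affect what was delivered, which is the property the
workaround relies on. The price is one extra copy per fragment, hence
"simple but rather inefficient".
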
> 
> 
> > 
> >       Stefano (who is more familiar with the Linux swiotlb side of things
> >       than me) is travelling this week so he'll be on West coast time, not
> >       sure when he gets off a plane nor if he's on email anyway (he's at
> >       ELC + this ARM ACPI thing)
> > 
> > 
> > ok, we'll see what happens.
> > 
> > -Christoffer
> > 
> > 



