Re: [Xen-devel] Xen unstability on HP Moonshot m400
On Tue, 2015-03-24 at 09:54 -0400, Mark Salter wrote:
> On Mon, 2015-03-23 at 23:58 +0000, Stefano Stabellini wrote:
> > On Mon, 23 Mar 2015, Christoffer Dall wrote:
> > > On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell <ian.campbell@xxxxxxxxxx>
> > > wrote:
> > > On Sat, 2015-03-21 at 13:34 +0100, Christoffer Dall wrote:
> > > > Hi,
> > > >
> > > > I have been experiencing a problematic crash running Xen on m400
> > > over
> > > > the last few days. I already spoke to Ian and Stefano about
> > > this, but
> > > > thought I'd summarize what I've seen so far and loop in a wider
> > > > audience.
> > > >
> > > > The basic setup is this:
> > > > - Two m400 nodes, one running Linux bare-metal, the other running
> > > > Xen.
> > > > - The Xen node runs Dom0 and 1 DomU
> > > > - The m400 has a Mellanox Connectx-3 PCIe 10G ethernet card with
> > > two
> > > > parts on it
> > > > - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected
> > > to
> > > > the internet) and regular bridging to eth1 which is connected to a
> > > > private VLAN to the bare-metal node
> > > > - Dom0 and DomU are configured with 14GB of ram, 4 cpus each
> > > > - DomU runs apache2 serving the GCC manual (see
> > > >
> > > https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh)
> > > >
> > > > The bare-metal node runs apache bench, like this: "ab -n 100000
> > > -c 100
> > > >
> > >http://secure-web.cisco.com/1r5tZ8-7RF8gHRANwFdizEZzgeMsjxVO0yKbYiV4zy7LeiUfYBXMkFq7FGW_SZ1x-VxdzyK-ErDsOUiQ9z2x-N
> > > y7XkL_loHP8ene_BuNFscGyWmQ3r6CtXAYaZCY4xRmmPT1uJOsZDLMu7j-LfCOGmQDSdBwgW7QYukI2bCtTrXM/http%3A%2F%2F10.10.1.120%2F
> > > gcc%2Findex.html"
> > > >
> > > > (10.10.1.120 is the DomU IP address of the bridged interface to
> > > eth1)
> > > >
> > > > What happens now is that the entire Xen node goes down. I see
> > > various
> > > > errors in the kernel log, some examples:
> > > > http://pastebin.ubuntu.com/10642148/
> > > > http://pastebin.ubuntu.com/10642177/
> > > > http://pastebin.ubuntu.com/10642181/
> > > > http://pastebin.ubuntu.com/10635573/
> > > >
> > > >
> > > > All Linux kernels are 3.18 plus some tweaks for the m400
> > > cartridge:
> > > >
> > > https://github.com/columbia/linux-kvm-arm/tree/columbia-armvirt-3.18
> > >
> > > Is it worth adding
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
> > > to your kernel? It isn't Xen specific but it's perhaps possible
> > > that Xen opens the window wider.
>
> You definitely want that one. Without it, the page table walker could
> end up using a stale pointer to a page being used for something other
> than page tables.
>
> > >
> > > How confident are you in
> > >
> > > https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb
> > > ?
> > > (although I suppose you aren't running in ACPI mode if you are
> > > running
> > > Xen?)
> > >
> > >
> > > I'm not confident at all, but Linux (last I checked was v3.19) doesn't
> > > boot without it, so not sure if there's an
> > > alternative? Mark?
> >
> > This patch is key: it doesn't look like it is setting
> > dev->archdata.dma_coherent appropriately, see the implementation of
> > set_arch_dma_coherent_ops.
>
> You'd want this if booting with ACPI. You might also need it for
> enumerated PCI devices even if booting with devicetree.

There's an updated version of this patch for newer kernels in the devel
branch of git.fedorahosted.org/git/kernel-arm64.git

There is also this one in Linus' tree which may be of interest to you:

commit 7132813c384515c9dede1ae20e56f3895feb7f1e
Author: Suzuki K. Poulose <suzuki.poulose@xxxxxxx>
Date: Thu Mar 19 18:17:09 2015 +0000

    arm64: Honor __GFP_ZERO in dma allocations

>
> > >
> > >
> > > If we think the issue might be to do with coherency of foreign
> > > mappings
> > > undergoing i/o from dom0 and we've already ruled out disk (by using
> > > a
> > > loopback mounted rootfs) then it might be worth bodging netback to
> > > always copy too.
> > >
> > > Adding a call to skb_orphan_frags right before the
> > > netif_receive_skb in
> > > drivers/net/xen-netback/netback.c:xenvif_tx_submit is a simple but
> > > rather inefficient way of doing that (so I hope it doesn't perturb
> > > the
> > > issue).
> > >
> > >
> > > I'll be happy to try this.
> >
> > If we are right and the problem is due to the commit above not setting
> > dma_coherent to true (the kernel will think that actually the network
> > card is not coherent), then Ian's workaround should hide the problem.
> >
> >
> > >
> > > Stefano (who is more familiar with the Linux swiotlb side of things
> > > than
> > > me) is travelling this week so he'll be on West coast time, not sure
> > > when he gets off a plane nor if he's on email anyway (he's at ELC +
> > > this
> > > ARM ACPI thing)
> > >
> > >
> > > ok, we'll see what happens.
> > >
> > > -Christoffer
> > >
> >
>
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel