Re: [Xen-users] Xen 4.10: domU crashes during/after live-migrate
Hi,

(my previous reply was eaten by the list, maybe it was too big with the
attachments, maybe because it was posted from the wrong email address,
but the text is in here:)

On 09/12/2018 10:44 PM, Sarah Newman wrote:
> On 09/12/2018 01:21 PM, Hans van Kranenburg wrote:
>> On 09/12/2018 08:55 PM, Sarah Newman wrote:
>>> On 09/04/2018 08:41 AM, Hans van Kranenburg wrote:
>>>
>>>>> We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3
>>>>> (Debian Stretch) and 4.15.11-1 (Debian Buster).
>>>>>
>>>>> [...]
>>>>
>>>> So... flash forward *whoosh*:
>>>>
>>>> For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
>>>> dom0 as well as domU) if you want to use live migration, or maybe even
>>>> in general together with Xen.
>>>>
>>>> A few of the things I could cause to happen with recent Linux 4.9 in
>>>> dom0/domU:
>>>>
>>>> 1) blk-mq related Oops
>>>>
>>>> Oops in the domU while resuming after live migrate (blkfront_resume ->
>>>> blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
>>>> blk_mq_insert_requests). A related fix might be
>>>> https://patchwork.kernel.org/patch/9462771/ but that's only present in
>>>> later kernels.
>>>>
>>>> Apparently having this happen upsets the dom0 side of it, since any
>>>> subsequent domU that is live migrated to the same dom0, also using
>>>> blk-mq will immediately crash with the same Oops, after which it starts
>>>> raining general protection faults inside. But, at the same time, I can
>>>> still live migrate 3.16 kernels, but also 4.17 domU kernels on and off
>>>> that dom0.
>>>
>>> Do you see any errors at all on the dom0?
>>
>> Nope.
>
> What is your storage stack?

iSCSI ----> dm_multipath -> dm_crypt --,
iSCSI --'                               \---> LVM
                                        /
iSCSI ----> dm_multipath -> dm_crypt --'
iSCSI --'

An LVM logical volume is the block device for e.g. a domU xvda.

>>> You said you tested with both 4.9 and 4.15 kernels, does this depend only
>>> on a 4.9 kernel in the domU?
>>
>> I don't know for sure (about 4.15 and if it has the mentioned patch or
>> not). We (exploratory style) tested a few combinations of things some
>> time ago, when 4.15 was in stretch-backports. At the end of the day the
>> results were so unpredictable that we put doing testing in a more
>> structured way on the todo-list (6-dimensional matrix of possibilities
>> D: ). What I did recently is again just randomly trying things for a few
>> hours, and then I started to see the pattern that whenever 4.9 was in
>> the mix anywhere, bad things happened. Doing the reverse, eliminating
>> 4.9 in dom0 as well as domU resulted in not being able to reproduce
>> anything bad any more.
>>
>> So, very pragmatic. :)
>
> So to rephrase you don't know if you saw failures with a 4.15 domU and a 4.9
> dom0?

Correct, I don't have notes about that, so I can't say for sure.

> The mentioned patch is d1b1cea1e58477dad88ff769f54c0d2dfa56d923 and was added
> in 4.10. I assume you think it should be added to 4.9? Why do you think
> it is related?

I'm not an expert here. What happens feels like some sort of race
condition or wrong order of doing things, where a function runs before
something it depends on is there yet.

I do not think the mentioned patch is the fix. It is not a good match
for the behavior shown here. I meant that the fix is probably of a
similar kind, related to doing IO while onlining/offlining a cpu,
setting up queues etc., just like what this one is about...

>>>> 2) Dom0 crash on live migration with multiple active nics
>>>>
>>>> I actually have to do more testing for specifically this, but at least
>>>> I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
>>>> tested a few months ago, Debian Jessie) by live migrating a domU that
>>>> has multiple network interfaces, actively routing traffic over them, to
>>>> it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
>>>> - not rebooting.' *BOOM* everything gone.
>>>
>>> Can you post a full backtrace? Did you ever test with anything other than
>>> 4.9 kernel + 4.4 hypervisor?
>>
>> Did not re-test yet.
>>
>> Ah, I found my notes. It's a bit different. When just doing live
>> migrate, it would upset the bnx2x driver or network card itself and I
>> would lose network connectivity to the machine (and all other domUs).
>> See attached bnx2x-crash.txt for console output while the poor thing is
>> drowning and gasping for air.
>>
>> When disabling SR-IOV (which I do not use, but which was listed
>> somewhere as a workaround for a similar problem, related to HP Shared
>> Memory blah, so why not try it to see what happens) in the BIOS for the
>> 10G card and then trying the same, the dom0 crashed immediately when the
>> live migrated domU was resumed. See dom0-crash.txt. No trace or anything,
>> it just disappears.
>
> This shared memory is an HP only thing, right?

I think so yes.

> I think I saw some recommendations to the reverse, to disable shared memory
> and enable SR-IOV.
>
>
>>> What does "actively routing traffic" mean in terms of packet frequency, and
>>> did you test when there was no network traffic but the interface was up?
>>
>> A linux domU doing NAT with 1 external and 6 internal interfaces, having
>> a conntrack table with ~20k entries of active traffic flows. However,
>> not doing many pps and not using much bandwidth (between 0 and 100 Mbit/s).
>>
>> Without any traffic it doesn't explode immediately. I think I could live
>> migrate the inactive router of a stateful (conntrackd) pair.
>>
>>> A quick test with a 4.9 kernel + xen 4.8 but not terribly heavy network
>>> traffic did not duplicate this.
>> I'll get around to reproducing this (or not being able to with Xen 4.11+
>> Linux 4.17+ with maybe newer bnx2x).
>>
>> Currently the network infra related domUs are still on Jessie (Xen 4.4
>> Linux 3.16 dom0) hardware, also because of this one:
>>
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
>>
>> And while speaking of that, we've not seen this happen again with 4.17+
>> in the dom0, and same openvswitch and Xen 4.11 version.
>>
>
> Have you ever rebuilt your kernel with options such as DEBUG_PAGEALLOC? I
> found some errors almost immediately with one of our network drivers after
> doing so.

No, thanks for the hint.

Right now the top of the todo list is to reinstall some HP dl360 gen8 as
well as gen9 machines with the latest BIOS + Stretch/Linux 4.17+ dom0 +
Xen 4.11, and then start testing different scenarios to see if it's as
stable as the same setup on the g7 and if I can still reproduce things
like the above.

Hans

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users