Xen project Mailing List

Re: [Xen-users] Xen 4.10: domU crashes during/after live-migrate

To: Sarah Newman <srn@xxxxxxxxx>, Pim van den Berg <pim.van.den.berg@xxxxxxxxxx>, xen-users@xxxxxxxxxxxxxxxxxxxx

From: Hans van Kranenburg <hans@xxxxxxxxxxx>

Date: Thu, 13 Sep 2018 00:12:08 +0200

Delivery-date: Wed, 12 Sep 2018 22:13:27 +0000

List-id: Xen user discussion <xen-users.lists.xenproject.org>

Openpgp: preference=signencrypt

Hi,

(my previous reply was eaten by the list, maybe it was too big with the
attachments, maybe because posted from wrong email address, but text is
in here:)

On 09/12/2018 10:44 PM, Sarah Newman wrote:
> On 09/12/2018 01:21 PM, Hans van Kranenburg wrote:
>> On 09/12/2018 08:55 PM, Sarah Newman wrote:
>>> On 09/04/2018 08:41 AM, Hans van Kranenburg wrote:
>>>
>>>>> We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3
>>>>> (Debian Stretch) and 4.15.11-1 (Debian Buster).
>>>>>
>>>>> [...]
>>>>
>>>> So... flash forward *whoosh*:
>>>>
>>>> For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
>>>> dom0 as well as domU) if you want to use live migration, or maybe even
>>>> in general together with Xen.
>>>>
>>>> A few of the things I could cause to happen with recent Linux 4.9 in
>>>> dom0/domU:
>>>>
>>>> 1) blk-mq related Oops
>>>>
>>>> Oops in the domU while resuming after live migrate (blkfront_resume ->
>>>> blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
>>>> blk_mq_insert_requests). A related fix might be
>>>> https://patchwork.kernel.org/patch/9462771/ but that's only present in
>>>> later kernels.
>>>>
>>>> Apparently having this happen upsets the dom0 side of it, since any
>>>> subsequent domU that is live migrated to the same dom0, also using
>>>> blk-mq will immediately crash with the same Oops, after which it starts
>>>> raining general protection faults inside. But, at the same time, I can
>>>> still live migrate 3.16 kernels, but also 4.17 domU kernels on and off
>>>> that dom0.
>>>
>>> Do you see any errors at all on the dom0?
>>
>> Nope.
>
> What is your storage stack?

iSCSI ----> dm_multipath -> dm_crypt --,
iSCSI --'                               \---> LVM
                                        /
iSCSI ----> dm_multipath -> dm_crypt --'
iSCSI --'

An LVM logical volume is the block device for e.g. a domU xvda.

>>> You said you tested with both 4.9 and 4.15 kernels, does this depend only
>>> on a 4.9 kernel in the domU?
>>
>> I don't know for sure (about 4.15 and if it has the mentioned patch or
>> not). We (exploratory style) tested a few combinations of things some
>> time ago, when 4.15 was in stretch-backports. At the end of the day the
>> results were so unpredictable that we put doing testing in a more
>> structured way on the todo-list (6-dimensional matrix of possibilities
>> D: ). What I did recently is again just randomly trying things for a few
>> hours, and then I started to see the pattern that whenever 4.9 was in
>> the mix anywhere, bad things happened. Doing the reverse, eliminating
>> 4.9 in dom0 as well as domU resulted in not being able to reproduce
>> anything bad any more.
>>
>> So, very pragmatic. :)
>
> So to rephrase you don't know if you saw failures with a 4.15 domU and a 4.9
> dom0?

Correct, I don't have notes about that, so I can't say for sure.

> The mentioned patch is d1b1cea1e58477dad88ff769f54c0d2dfa56d923 and was added
> in 4.10. I assume you think it should be added to 4.9? Why do you think
> it is related?

I'm not an expert here. What happens feels like some sort of race
condition or wrong order of doing things, where a function runs before
something it depends on is there yet.

I do not think the mentioned patch is the fix. It is not a good match
for the shown behavior here. I meant that it's probably a similar kind
of fix related to doing IO and onlining/offlining a cpu, setting up
queues etc? just like what's this one about...

>>>> 2) Dom0 crash on live migration with multiple active nics
>>>>
>>>> I actually have to do more testing for specifically this, but at least
>>>> I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
>>>> tested a few months ago, Debian Jessie) by live migrating a domU that
>>>> has multiple network interfaces, actively routing traffic over them, to
>>>> it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
>>>> - not rebooting.' *BOOM* everything gone.
>>>
>>> Can you post a full backtrace? Did you ever test with anything other than
>>> 4.9 kernel + 4.4 hypervisor?
>>
>> Did not re-test yet.
>>
>> Ah, I found my notes. It's a bit different. When just doing live
>> migrate, it would upset the bnx2x driver or network card itself and I
>> would lose network connectivity to the machine (and all other domUs).
>> See attached bnx2x-crash.txt for console output while the poor thing is
>> drowning and gasping for air.
>>
>> When disabling SR-IOV (which I do not use, but which was listed
>> somewhere as a workaround for a similar problem, related to HP Shared
>> Memory blah, so why not try it to see what happens) in the BIOS for the
>> 10G card and then trying the same, the dom0 crashed immediately when the
>> live migrated domU was resumed. See dom0-crash.txt No trace or anything,
>> it just disappears.
>
> This shared memory is an HP only thing, right?

I think so yes.

> I think I saw some recommendations to the reverse, to disable shared memory
> and enable SR-IOV.
>
>
>>> What does "actively routing traffic" mean in terms of packet frequency, and
>>> did you test when there was no network traffic but the interface was up?
>>
>> A linux domU doing NAT with 1 external and 6 internal interfaces, having
>> a conntrack table with ~20k entries of active traffic flows. However,
>> not doing many pps and not using much bandwidth (between 0 and 100 Mbit/s).
>>
>> Without any traffic it doesn't explode immediately. I think I could live
>> migrate the inactive router of a stateful (conntrackd) pair.
>>
>>> A quick test with a 4.9 kernel + xen 4.8 but not terribly heavy network
>>> traffic did not duplicate this.
>> I'll get around to reproducing this (or not being able to with Xen 4.11+
>> Linux 4.17+ with maybe newer bnx2x).
>>
>> Currently the network infra related domUs are still on Jessie (Xen 4.4
>> Linux 3.16 dom0) hardware, also because of this one:
>>
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
>>
>> And while speaking of that, we've not seen this happen again with 4.17+
>> in the dom0, and same openvswitch and Xen 4.11 version.
>>
>
> Have you ever rebuilt your kernel with options such as DEBUG_PAGEALLOC? I
> found some errors almost immediately with one of our network drivers after
> doing so.

No, thanks for the hint.

Right now the top of the todo list is to reinstall some HP dl360 gen8 as
well as gen9 to latest BIOS + Stretch/Linux 4.17+ dom0 + Xen 4.11 and
then start testing different scenarios to see if it's as stable as the
same on the g7 and if I can still reproduce things like above.

Hans

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users
