
Re: [Xen-users] Xen 4.10: domU crashes during/after live-migrate


  • To: Sarah Newman <srn@xxxxxxxxx>, Pim van den Berg <pim.van.den.berg@xxxxxxxxxx>, xen-users@xxxxxxxxxxxxxxxxxxxx
  • From: Hans van Kranenburg <hans@xxxxxxxxxxx>
  • Date: Thu, 13 Sep 2018 00:12:08 +0200
  • Delivery-date: Wed, 12 Sep 2018 22:13:27 +0000
  • List-id: Xen user discussion <xen-users.lists.xenproject.org>
  • Openpgp: preference=signencrypt

Hi,

(My previous reply was eaten by the list, maybe because it was too big
with the attachments, maybe because I posted it from the wrong email
address, but the text is included here:)

On 09/12/2018 10:44 PM, Sarah Newman wrote:
> On 09/12/2018 01:21 PM, Hans van Kranenburg wrote:
>> On 09/12/2018 08:55 PM, Sarah Newman wrote:
>>> On 09/04/2018 08:41 AM, Hans van Kranenburg wrote:
>>>
>>>>> We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 
>>>>> (Debian
>>>>> Stretch) and 4.15.11-1 (Debian Buster).
>>>>>
>>>>> [...]
>>>>
>>>> So... flash forward *whoosh*:
>>>>
>>>> For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
>>>> dom0 as well as domU) if you want to use live migration, or maybe even
>>>> in general together with Xen.
>>>>
>>>> A few of the things I could cause to happen with recent Linux 4.9 in
>>>> dom0/domU:
>>>>
>>>> 1) blk-mq related Oops
>>>>
>>>> Oops in the domU while resuming after live migrate (blkfront_resume ->
>>>> blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
>>>> blk_mq_insert_requests). A related fix might be
>>>> https://patchwork.kernel.org/patch/9462771/ but that's only present in
>>>> later kernels.
>>>>
>>>> Apparently having this happen upsets the dom0 side of it, since any
>>>> subsequent domU that is live migrated to the same dom0, also using
>>>> blk-mq will immediately crash with the same Oops, after which it starts
>>>> raining general protection faults inside. But, at the same time, I can
>>>> still live migrate 3.16 kernels, but also 4.17 domU kernels on and off
>>>> that dom0.
>>>
>>> Do you see any errors at all on the dom0?
>>
>> Nope.
> 
> What is your storage stack?

iSCSI ----> dm_multipath -> dm_crypt --,
iSCSI --'                               \---> LVM
                                        /
iSCSI ----> dm_multipath -> dm_crypt --'
iSCSI --'

An LVM logical volume is the block device for e.g. a domU xvda.

>>> You said you tested with both 4.9 and 4.15 kernels, does this depend only 
>>> on a 4.9 kernel in the domU?
>>
>> I don't know for sure (about 4.15 and if it has the mentioned patch or
>> not). We (exploratory style) tested a few combinations of things some
>> time ago, when 4.15 was in stretch-backports. At the end of the day the
>> results were so unpredictable that we put doing testing in a more
>> structured way on the todo-list (6-dimensional matrix of possibilities
>> D: ). What I did recently is again just randomly trying things for a few
>> hours, and then I started to see the pattern that whenever 4.9 was in
>> the mix anywhere, bad things happened. Doing the reverse, eliminating
>> 4.9 in dom0 as well as domU resulted in not being able to reproduce
>> anything bad any more.
>>
>> So, very pragmatic. :)
> 
> So to rephrase you don't know if you saw failures with a 4.15 domU and a 4.9 
> dom0?

Correct, I don't have notes about that, so I can't say for sure.
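
A minimal sketch of how such a test matrix could be enumerated, so that
every combination gets an explicit note instead of relying on memory. The
dimension names and the choice of four dimensions are assumptions for
illustration only (the six dimensions mentioned above are not listed in
this thread); the version numbers are the ones that appear in this thread.

```python
from itertools import product

# Hypothetical dimensions: names and grouping are illustrative assumptions.
# Only the version strings themselves are taken from the thread.
DIMENSIONS = {
    "source_dom0_kernel": ["3.16", "4.9", "4.17"],
    "target_dom0_kernel": ["3.16", "4.9", "4.17"],
    "domu_kernel": ["3.16", "4.9", "4.15", "4.17"],
    "xen": ["4.4", "4.8", "4.10", "4.11"],
}


def combinations(dimensions):
    """Yield one dict per point in the cartesian product of all dimensions."""
    names = list(dimensions)
    for values in product(*(dimensions[name] for name in names)):
        yield dict(zip(names, values))


def involves_kernel(combo, version):
    """True if any kernel dimension (dom0 or domU) uses the given version."""
    return any(
        value == version
        for name, value in combo.items()
        if name.endswith("_kernel")
    )


if __name__ == "__main__":
    all_combos = list(combinations(DIMENSIONS))
    with_49 = [c for c in all_combos if involves_kernel(c, "4.9")]
    print(f"{len(all_combos)} combinations, {len(with_49)} involve Linux 4.9")
    # Each combination can then be recorded with a result, e.g. in a CSV.
    for combo in with_49[:3]:
        print(combo)
```

Splitting the list into "4.9 anywhere" and "4.9 nowhere" mirrors the
pattern described above, and makes it easy to see which cells, such as a
4.15 domU on a 4.9 dom0, have not been tested yet.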

> The mentioned patch is d1b1cea1e58477dad88ff769f54c0d2dfa56d923 and was added 
> in 4.10. I assume you think it should be added to 4.9? Why do you think
> it is related?

I'm not an expert here. What happens feels like some sort of race
condition or wrong ordering, where a function runs before something it
depends on is there yet.

I do not think the mentioned patch is the fix. It is not a good match
for the behavior shown here. I meant that the actual fix is probably a
similar kind of change, related to doing IO while onlining/offlining a
CPU, setting up queues, etc., just like what this one is about.

>>>> 2) Dom0 crash on live migration with multiple active nics
>>>>
>>>> I actually have to do more testing for specifically this, but at least
>>>> I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
>>>> tested a few months ago, Debian Jessie) by live migrating a domU that
>>>> has multiple network interfaces, actively routing traffic over them, to
>>>> it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
>>>> - not rebooting.' *BOOM* everything gone.
>>>
>>> Can you post a full backtrace? Did you ever test with anything other than 
>>> 4.9 kernel + 4.4 hypervisor?
>>
>> Did not re-test yet.
>>
>> Ah, I found my notes. It's a bit different. When just doing live
>> migrate, it would upset the bnx2x driver or network card itself and I
>> would lose network connectivity to the machine (and all other domUs).
>> See attached bnx2x-crash.txt for console output while the poor thing is
>> drowning and gasping for air.
>>
>> When disabling SR-IOV (which I do not use, but which was listed
>> somewhere as a workaround for a similar problem, related to HP Shared
>> Memory blah, so why not try it to see what happens) in the BIOS for the
>> 10G card and then trying the same, the dom0 crashed immediately when the
>> live migrated domU was resumed. See dom0-crash.txt. No trace or anything,
>> it just disappears.
> 
> This shared memory is an HP only thing, right?

I think so, yes.

> I think I saw some recommendations to the reverse, to disable shared memory 
> and enable SR-IOV.
> 
> 
>>> What does "actively routing traffic" mean in terms of packet frequency, and 
>>> did you test when there was
>>> no network traffic but the interface was up?
>>
>> A linux domU doing NAT with 1 external and 6 internal interfaces, having
>> a conntrack table with ~20k entries of active traffic flows. However,
>> not doing many pps and not using much bandwidth (between 0 and 100 Mbit/s).
>>
>> Without any traffic it doesn't explode immediately. I think I could live
>> migrate the inactive router of a stateful (conntrackd) pair.
>>
>>> A quick test with a 4.9 kernel + xen 4.8 but not terribly heavy network 
>>> traffic did not duplicate this.
>> I'll get around to reproducing this (or not being able to with Xen 4.11+
>> Linux 4.17+ with maybe newer bnx2x).
>>
>> Currently the network infra related domUs are still on Jessie (Xen 4.4
>> Linux 3.16 dom0) hardware, also because of this one:
>>
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
>>
>> And while speaking of that, we've not seen this happen again with 4.17+
>> in the dom0, and same openvswitch and Xen 4.11 version.
>>
> 
> Have you ever rebuilt your kernel with options such as DEBUG_PAGEALLOC? I 
> found some errors almost immediately with one of our network drivers after
> doing so.

No, thanks for the hint.

Right now the top of the todo list is to reinstall some HP dl360 gen8 as
well as gen9 machines with the latest BIOS + Stretch/Linux 4.17+ dom0 +
Xen 4.11, and then start testing different scenarios to see if that is as
stable as the same setup on the g7, and whether I can still reproduce
things like the above.

Hans


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users

 

