
Re: [Xen-users] Xen 4.10: domU crashes during/after live-migrate


  • To: Hans van Kranenburg <hans.van.kranenburg@xxxxxxxxxx>, Pim van den Berg <pim.van.den.berg@xxxxxxxxxx>, xen-users@xxxxxxxxxxxxxxxxxxxx
  • From: Sarah Newman <srn@xxxxxxxxx>
  • Date: Wed, 12 Sep 2018 13:44:20 -0700
  • Delivery-date: Wed, 12 Sep 2018 20:45:29 +0000
  • Dkim-filter: OpenDKIM Filter v2.11.0 mail.prgmr.com 162CE28C002
  • List-id: Xen user discussion <xen-users.lists.xenproject.org>
  • Openpgp: preference=signencrypt

On 09/12/2018 01:21 PM, Hans van Kranenburg wrote:
> On 09/12/2018 08:55 PM, Sarah Newman wrote:
>> On 09/04/2018 08:41 AM, Hans van Kranenburg wrote:
>>
>>>> We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3
>>>> (Debian Stretch) and 4.15.11-1 (Debian Buster).
>>>>
>>>> [...]
>>>
>>> So... flash forward *whoosh*:
>>>
>>> For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
>>> dom0 as well as domU) if you want to use live migration, or maybe even
>>> in general together with Xen.
>>>
>>> A few of the things I could cause to happen with recent Linux 4.9 in
>>> dom0/domU:
>>>
>>> 1) blk-mq related Oops
>>>
>>> Oops in the domU while resuming after live migrate (blkfront_resume ->
>>> blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
>>> blk_mq_insert_requests). A related fix might be
>>> https://patchwork.kernel.org/patch/9462771/ but that's only present in
>>> later kernels.
>>>
>>> Apparently having this happen upsets the dom0 side of it, since any
>>> subsequent domU that is live migrated to the same dom0, also using
>>> blk-mq, will immediately crash with the same Oops, after which it starts
>>> raining general protection faults inside. But, at the same time, I can
>>> still live migrate 3.16 kernels, and also 4.17 domU kernels, on and off
>>> that dom0.
>>
>> Do you see any errors at all on the dom0?
> 
> Nope.

What is your storage stack?

> 
>> You said you tested with both 4.9 and 4.15 kernels. Does this depend only on
>> a 4.9 kernel in the domU?
> 
> I don't know for sure (about 4.15 and if it has the mentioned patch or
> not). We (exploratory style) tested a few combinations of things some
> time ago, when 4.15 was in stretch-backports. At the end of the day the
> results were so unpredictable that we put doing testing in a more
> structured way on the todo-list (6-dimensional matrix of possibilities
> D: ). What I did recently is again just randomly trying things for a few
> hours, and then I started to see the pattern that whenever 4.9 was in
> the mix anywhere, bad things happened. Doing the reverse, eliminating
> 4.9 in dom0 as well as domU resulted in not being able to reproduce
> anything bad any more.
> 
> So, very pragmatic. :)

So, to rephrase: you don't know whether you saw failures with a 4.15 domU and
a 4.9 dom0?
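
For the more structured testing you mention, a rough sketch of enumerating the
combinations might look like the following. The three dimensions and the
version lists are my assumptions, built only from versions mentioned in this
thread; your real matrix has six dimensions that I don't know, so treat the
names as placeholders.

```python
from itertools import product

# Assumed dimensions: only versions named in this thread, not the real
# 6-dimensional matrix.
DIMENSIONS = {
    "xen": ["4.4", "4.8", "4.10", "4.11"],
    "dom0_kernel": ["3.16", "4.9", "4.17"],
    "domU_kernel": ["3.16", "4.9", "4.15", "4.17"],
}


def build_matrix(dimensions):
    """Return every combination of the given dimensions as a list of dicts."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


def involves(case, version, keys=("dom0_kernel", "domU_kernel")):
    """True if the given kernel version is used in dom0 or domU for this case."""
    return any(case[key] == version for key in keys)


if __name__ == "__main__":
    for case in build_matrix(DIMENSIONS):
        marker = "4.9 in the mix" if involves(case, "4.9") else ""
        print(case, marker)
```

Recording a pass/fail result per case would make it easier to see whether the
failures really follow 4.9 or one of the other dimensions.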

The mentioned patch is d1b1cea1e58477dad88ff769f54c0d2dfa56d923 and was added
in 4.10. I assume you think it should be added to 4.9? Why do you think it is
related?
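
If you want to check a tree for that commit yourself, `git merge-base
--is-ancestor` is one way to do it. A minimal sketch, assuming a local clone of
the mainline kernel; the repository path and the tag name in the usage comment
are placeholders. Note that a stable backport gets a new commit ID, so this
only answers the question for mainline history and says nothing about whether
a 4.9.y release carries a backported copy.

```python
import subprocess

COMMIT = "d1b1cea1e58477dad88ff769f54c0d2dfa56d923"


def ancestor_check_command(commit, ref):
    """Build the git command that tests whether commit is an ancestor of ref."""
    return ["git", "merge-base", "--is-ancestor", commit, ref]


def ref_contains_commit(repo_path, commit, ref):
    """Return True if ref contains commit, False if not; raise on git errors."""
    result = subprocess.run(ancestor_check_command(commit, ref), cwd=repo_path)
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    raise RuntimeError("git failed with exit status %d" % result.returncode)


# Hypothetical usage, with a placeholder path to a kernel clone:
#   ref_contains_commit("/path/to/linux", COMMIT, "v4.10")
```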

> 
>>> 2) Dom0 crash on live migration with multiple active nics
>>>
>>> I actually have to do more testing for specifically this, but at least
>>> I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
>>> tested a few months ago, Debian Jessie) by live migrating a domU that
>>> has multiple network interfaces, actively routing traffic over them, to
>>> it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
>>> - not rebooting.' *BOOM* everything gone.
>>
>> Can you post a full backtrace? Did you ever test with anything other than 
>> 4.9 kernel + 4.4 hypervisor?
> 
> Did not re-test yet.
> 
> Ah, I found my notes. It's a bit different. When just doing live
> migrate, it would upset the bnx2x driver or network card itself and I
> would lose network connectivity to the machine (and all other domUs).
> See attached bnx2x-crash.txt for console output while the poor thing is
> drowning and gasping for air.
> 
> When disabling SR-IOV (which I do not use, but which was listed
> somewhere as a workaround for a similar problem, related to HP Shared
> Memory blah, so why not try it to see what happens) in the BIOS for the
> 10G card and then trying the same, the dom0 crashed immediately when the
> live migrated domU was resumed. See dom0-crash.txt. No trace or anything,
> it just disappears.

This shared memory is an HP-only thing, right? I think I saw some
recommendations to do the reverse: disable shared memory and enable SR-IOV.


>> What does "actively routing traffic" mean in terms of packet frequency, and
>> did you test when there was no network traffic but the interface was up?
> 
> A linux domU doing NAT with 1 external and 6 internal interfaces, having
> a conntrack table with ~20k entries of active traffic flows. However,
> not doing many pps and not using much bandwidth (between 0 and 100 Mbit/s).
> 
> Without any traffic it doesn't explode immediately. I think I could live
> migrate the inactive router of a stateful (conntrackd) pair.
> 
>> A quick test with a 4.9 kernel + xen 4.8 but not terribly heavy network 
>> traffic did not duplicate this.
> 
> I'll get around to reproducing this (or not being able to) with Xen 4.11+,
> Linux 4.17+ and maybe a newer bnx2x.
> 
> Currently the network infra related domUs are still on Jessie (Xen 4.4
> Linux 3.16 dom0) hardware, also because of this one:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
> 
> And while speaking of that, we've not seen this happen again with 4.17+
> in the dom0, and same openvswitch and Xen 4.11 version.
> 

Have you ever rebuilt your kernel with options such as DEBUG_PAGEALLOC? I
found some errors almost immediately with one of our network drivers after
doing so.
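
To see whether a running kernel was already built with it, something like the
sketch below could be used. It assumes the distribution installs the build
configuration as /boot/config-<release>, as Debian does; the helper name is
made up for illustration. As far as I know, when
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT is not set the checks also need
debug_pagealloc=on on the kernel command line to become active.

```python
import os
import platform
import re


def config_value(config_text, option):
    """Return the value of CONFIG_<option> ('y', 'm', ...) or None if not set."""
    pattern = re.compile(r"^CONFIG_%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Assumed location of the kernel build configuration (Debian convention).
    path = "/boot/config-" + platform.release()
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
        for name in ("DEBUG_PAGEALLOC", "DEBUG_PAGEALLOC_ENABLE_DEFAULT"):
            print(name, "=", config_value(text, name))
    else:
        print("no kernel config found at", path)
```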


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users

 

