Re: [Xen-users] Xen 4.10: domU crashes during/after live-migrate
Hi,

On 04/13/2018 09:38 AM, Pim van den Berg wrote:
> Hi all,
>
> We (at Mendix) are upgrading our dom0s to Xen 4.10 (PV) running on Debian
> Stretch (Linux 4.9), but we are running into an issue regarding
> live-migration.
>
> We are experiencing domU crashes while live-migrating and in the seconds after
> the live-migration has been completed. This doesn't happen all the time. But we
> are able to reproduce the issue within 1 to max 10 times live migrating between
> 2 dom0s.
>
> We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 (Debian
> Stretch) and 4.15.11-1 (Debian Buster).
>
> [...]

So... flash forward *whoosh*:

For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for dom0 as well as domU) if you want to use live migration, or maybe even in general together with Xen.

A few of the things I could cause to happen with recent Linux 4.9 in dom0/domU:

1) blk-mq related Oops

Oops in the domU while resuming after live migrate (blkfront_resume -> blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit -> blk_mq_insert_requests). A related fix might be https://patchwork.kernel.org/patch/9462771/ but that's only present in later kernels.

Apparently having this happen upsets the dom0 side of it, since any subsequent domU that also uses blk-mq and is live migrated to the same dom0 will immediately crash with the same Oops, after which it starts raining general protection faults inside. But, at the same time, I can still live migrate 3.16 domU kernels, and also 4.17 domU kernels, on and off that dom0.

2) Dom0 crash on live migration with multiple active nics

I actually have to do more testing for specifically this, but at least I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last tested a few months ago, Debian Jessie) by live migrating to it a domU that has multiple network interfaces and is actively routing traffic over them. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.' *BOOM* everything gone.
3) xenconsoled disappearing

When live migration errors happen, the xenconsoled process in dom0 regularly just disappears. I have no idea why. No segfault message in dmesg or anything, it's just gone.

These are just examples. There are more errors that I ran into, and that I still have to re-test. If someone is interested in more details, I have a collection of errors and stack traces etc.

What did I end up with now?

* Xen 4.11 (latest stable-4.11)
* Linux 4.17.17 in (Debian Stretch) dom0 and in (Stretch, Buster) domUs
* Linux 3.16.57 for old Jessie domUs is not a problem.

In a small test environment, I just completed about 2000 random live migrations of ~20 domUs over 6 dom0s, throwing 10 concurrent at it, without anything bad happening. To generate at least some extra load, I was continuously running puppet on them, while the puppet masters were also in the domU mix. With 4.9 anywhere, it would only take a few minutes for everything to explode.

To be continued...

Hans
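
For anyone wanting to repeat the random live migration test described above, a minimal sketch of a driver could look like the following. The post does not say how the migrations were actually started, so everything here is an assumption: the host and domU names are hypothetical, and the script assumes passwordless ssh to each dom0 and the plain `xl migrate <domain> <host>` form of the command. It runs in dry-run mode by default and only prints what it would do.

```python
import random
import subprocess
from concurrent.futures import ThreadPoolExecutor


def plan_round(placement, hosts, concurrency, rng):
    """Pick up to `concurrency` distinct domUs and a random new dom0 for each."""
    domus = rng.sample(sorted(placement), min(concurrency, len(placement)))
    moves = []
    for domu in domus:
        source = placement[domu]
        target = rng.choice([h for h in hosts if h != source])
        moves.append((domu, source, target))
    return moves


def migrate(domu, source, target, dry_run=True):
    """Start `xl migrate` on the source dom0 over ssh (assumed setup)."""
    cmd = ["ssh", source, "xl", "migrate", domu, target]
    if dry_run:
        print(" ".join(cmd))
        return True
    return subprocess.run(cmd).returncode == 0


def stress(placement, hosts, total, concurrency=10, seed=None, dry_run=True):
    """Run `total` random migrations in rounds of at most `concurrency`."""
    if not placement or len(hosts) < 2:
        raise ValueError("need at least one domU and two dom0s")
    rng = random.Random(seed)
    placement = dict(placement)  # do not modify the caller's mapping
    done = failed = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        while done < total:
            moves = plan_round(placement, hosts,
                               min(concurrency, total - done), rng)
            results = list(pool.map(
                lambda m: migrate(*m, dry_run=dry_run), moves))
            for (domu, _source, target), ok in zip(moves, results):
                if ok:
                    placement[domu] = target  # domU now lives on the target
                else:
                    failed += 1
            done += len(moves)
    return placement, failed


if __name__ == "__main__":
    # Hypothetical names, sized like the test environment in the post.
    hosts = [f"dom0-{i}" for i in range(6)]
    placement = {f"domu-{i}": hosts[i % len(hosts)] for i in range(20)}
    final, failed = stress(placement, hosts, total=2000, concurrency=10)
    print(f"failed migrations: {failed}")
```

Each round picks distinct domUs so that the same guest is never migrated twice at once. A failed `xl migrate` leaves the recorded placement unchanged, which may not match reality if the domU crashed halfway, so a real run would want to check `xl list` on both sides afterwards.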