[Xen-devel] Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
Hi,

We've been tracking down a live migration bug during the last three days here at work, and here's what we found so far.

1. Xen version and dom0 Linux kernel version don't matter.
2. DomU kernel is Linux >= 4.13.

When using live migrate to move such a domU to another dom0, this often happens:

[ 37.511305] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 37.513316] OOM killer disabled.
[ 37.513323] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 37.514837] suspending xenstore...
[ 37.515142] xen:grant_table: Grant tables using version 1 layout
[18446744002.593711] OOM killer enabled.
[18446744002.593726] Restarting tasks ... done.
[18446744002.604527] Setting capacity to 6291456

As a side effect, all open TCP connections stall, because the timestamp counters of packets sent to the outside world are affected:

https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png

"The problem seems to occur after the domU is resumed. The first packet (#90) has a wrong timestamp value (far in the past), marked red in the image. Green is the normal sequence of timestamps from the server (domU), acknowledged by the client. Once the client receives the packet from the past, it attempts to re-send everything from the start. As the timestamp never reaches a normal value, the client goes crazy thinking that the server has not received anything, keeping the loop going. But they just exist in different ages."
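For anyone trying to reproduce this: the stall is easy to see on the wire, because the TSval carried in the TCP timestamp option of packets coming from the domU jumps at the moment of the resume (that's the red packet in the image above). Something along these lines should show it; eth0 and port 80 are of course just examples for wherever the affected connection lives:

  # watch the TCP timestamp option of the affected connection,
  # on the client or on the domU itself
  tcpdump -l -ni eth0 'tcp port 80' | grep -o 'TS val [0-9]* ecr [0-9]*'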
----------- >8 -----------

Ad 1. We reproduced this on different kinds of HP DL360 G7/G8/G9 gear, both with Xen 4.11 / Linux 4.19.9 as dom0 kernel and with Xen 4.4 / Linux 3.16 as dom0 kernel.

Ad 2. This was narrowed down by just grabbing old Debian kernel images from https://snapshot.debian.org/binary/?cat=l and trying them:

OK    linux-image-4.12.0-2-amd64_4.12.13-1_amd64.deb
FAIL  linux-image-4.13.0-rc5-amd64_4.13~rc5-1~exp1_amd64.deb
FAIL  linux-image-4.13.0-trunk-amd64_4.13.1-1~exp1_amd64.deb
FAIL  linux-image-4.13.0-1-amd64_4.13.4-1_amd64.deb
FAIL  linux-image-4.13.0-1-amd64_4.13.13-1_amd64.deb
FAIL  linux-image-4.14.0-3-amd64_4.14.17-1_amd64.deb
FAIL  linux-image-4.15.0-3-amd64_4.15.17-1_amd64.deb
FAIL  linux-image-4.16.0-2-amd64_4.16.16-2_amd64.deb
FAIL  ... everything up to 4.19.9 here

So there seems to be a change introduced in 4.13 that makes this behaviour appear. We haven't started compiling old kernels to bisect it further yet (rough bisect plan at the end of this mail).

----------- >8 -----------

For the rest of the info, I'm focussing on a test environment for reproduction, which is 4x identical HP DL360 G7, named sirius, gamma, omega and flopsy.

They're running the Xen 4.11 packages from Debian, rebuilt for Stretch:
4.11.1~pre.20180911.5acdd26fdc+dfsg-5~bpo9+1
https://salsa.debian.org/xen-team/debian-xen/commits/stretch-backports

The dom0 kernel is 4.19.9 from Debian, rebuilt for Stretch:
https://salsa.debian.org/knorrie-guest/linux/commits/debian/4.19.9-1_mxbp9+1

xen_commandline : placeholder dom0_max_vcpus=1-4 dom0_mem=4G,max:4G com2=115200,8n1 console=com2,vga noreboot xpti=no-dom0,domu smt=off

vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
stepping        : 2
microcode       : 0x1f
cpu MHz         : 3066.727

----------- >8 -----------

There are some interesting additional patterns:

1. Consistent success / failure paths. After rebooting all 4 physical servers, starting a domU with a 4.19 kernel and then live migrating it, the first migration might fail or it might succeed. However, from the first time a migration fails, that specific direction of movement keeps showing the failure every single time it is used. The same goes for successful live migrations. E.g.:

sirius -> flopsy  OK
sirius -> gamma   OK
flopsy -> gamma   OK
flopsy -> sirius  OK
gamma  -> flopsy  FAIL
gamma  -> sirius  FAIL
omega  -> flopsy  FAIL

After rebooting all of the servers again and restarting the whole test procedure, the combinations and results change, but they are again consistent from the moment we start live migrating and seeing results.

2. TCP connections only hang when they were opened while the "timestamp value in dmesg is low", followed by a "time is 18 gazillion" situation. When a TCP connection is opened to the domU while it's at 18 gazillion seconds of uptime, that connection keeps working across all subsequent live migrations, even when the time jumps up and down while following the OK and FAIL paths.

3. Since this is related to time and clocks, the last thing we tried today was, instead of using the default settings, putting "clocksource=tsc tsc=stable:socket" on the Xen command line and "clocksource=tsc" on the domU Linux kernel command line. What we observed is that the failure happens less often, but it still happens. Everything else above still applies.

----------- >8 -----------

Additional question: it's 2018, should we be using "clocksource=tsc tsc=stable:socket" for Xen and "clocksource=tsc" for the domU kernel anyway now, with Xen 4.11 and Linux 4.19 domUs? All our hardware has 'TscInvariant = true'.

Related: https://news.ycombinator.com/item?id=13813079
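In case it helps anyone looking into this, checking what a domU actually ended up using, and what the hardware and hypervisor advertise, is just the usual sysfs/cpuinfo poking, roughly:

  # inside the domU: the clocksource actually in use, and the alternatives
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource

  # on the dom0: TSC-related CPU flags (constant_tsc / nonstop_tsc)
  grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep tsc

  # and what Xen itself reports about the TSC
  xl dmesg | grep -i tsc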
----------- >8 -----------

I realize this problem might not be caused by Xen itself, but this list is the most logical place to start asking for help. Reproducing this in other environments should be pretty easy: 9 out of 10 times it already happens on the first live migrate after the domU is started. We're available to test other stuff or provide more info if needed.
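In case anyone wants to beat us to the bisect: what we have in mind is just the standard mainline bisect between v4.12 and v4.13-rc5, building each step and booting it as the domU kernel (treating plain v4.12 as good is an assumption on our side, our known-good kernel is Debian's 4.12.13). Roughly:

  git bisect start v4.13-rc5 v4.12
  cp /boot/config-4.12.0-2-amd64 .config   # e.g. start from the Debian config
  make olddefconfig
  make -j"$(nproc)" deb-pkg
  # install the image in the domU, reboot it, live migrate a few times,
  # then mark the result and repeat:
  git bisect good   # or: git bisect bad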
Thanks,
Hans

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel