
Re: [Xen-users] Xen 4.6 Live Migration and Hotplugging Issues



About the CPU hotplug issue: I am able to reproduce it as well.

I think the lockup is due to the following code in xen_cpu_up()
(arch/x86/xen/smp.c), which spins until the cpu_hotplug_state of the new vCPU
is CPU_ONLINE:

    while (cpu_report_state(cpu) != CPU_ONLINE)
        HYPERVISOR_sched_op(SCHEDOP_yield, NULL);

cpu_hotplug_state is set to CPU_ONLINE with cpu_set_state_online().
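
To make the handshake concrete, here is a minimal userspace sketch of the same
idea (hypothetical names, pthreads instead of real vCPUs; this is not the
kernel code): one thread plays the boot CPU and spins, yielding, until the
other thread marks itself online. If the "new vcpu" side never gets to run,
which seems to be what happens here after migration, the spinning side never
leaves the loop, and that matches the soft lockup in the report below.

    /* Minimal userspace analogue of the online handshake (hypothetical names). */
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    enum { CPU_UP_PREPARE, CPU_ONLINE };

    static atomic_int cpu_hotplug_state = CPU_UP_PREPARE;

    /* Stand-in for the new vcpu's bring-up path. */
    static void *new_vcpu_start(void *arg)
    {
        /* ... per-cpu setup would happen here ... */
        atomic_store(&cpu_hotplug_state, CPU_ONLINE);  /* cpu_set_state_online() */
        return NULL;
    }

    int main(void)
    {
        pthread_t vcpu;
        pthread_create(&vcpu, NULL, new_vcpu_start, NULL);

        /* xen_cpu_up()-style wait: spin, yielding, until the new vcpu is online.
         * If new_vcpu_start() never runs, this loop never terminates. */
        while (atomic_load(&cpu_hotplug_state) != CPU_ONLINE)
            sched_yield();  /* takes the place of HYPERVISOR_sched_op(SCHEDOP_yield, NULL) */

        pthread_join(vcpu, NULL);
        puts("new vcpu reported online");
        return 0;
    }

(Compile with "cc -pthread" if you want to play with it.)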

Have you tried the latest mainline Linux? As far as I remember, when I tried
the latest mainline Linux I got a warning related to blk-mq when I onlined a
vCPU.

I am not sure if the patch below would help:

   commit ae039001054b34c4a624539b32a8b6ff3403aaf9
   Author: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
   Date:   Fri Jun 2 17:06:02 2017 -0700

       xen/vcpu: Handle xen_vcpu_setup() failure at boot

       On PVH, PVHVM, at failure in the VCPUOP_register_vcpu_info hypercall
       we limit the number of cpus to MAX_VIRT_CPUS. However, if this
       failure had occurred for a cpu beyond MAX_VIRT_CPUS, we continue
       to function with > MAX_VIRT_CPUS.

       This leads to problems at the next save/restore cycle when there
       are > MAX_VIRT_CPUS threads going into stop_machine() but coming
       back up there's valid state for only the first MAX_VIRT_CPUS.

       This patch pulls the excess CPUs down via cpu_down().

       Reviewed-by: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
       Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
       Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
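
My rough reading of that patch, as a simplified sketch rather than the actual
diff (the helper name below is made up), is that after such a registration
failure it walks the online CPUs and offlines everything at or above
MAX_VIRT_CPUS via cpu_down(), so the next save/restore cycle only sees vCPUs
with valid state:

    /* Simplified sketch of the idea, not the real patch. */
    #include <linux/cpu.h>            /* cpu_down(), for_each_online_cpu() */
    #include <linux/printk.h>
    #include <xen/interface/xen.h>    /* MAX_VIRT_CPUS */

    static void xen_prune_excess_vcpus(void)   /* hypothetical helper name */
    {
        unsigned int cpu;

        for_each_online_cpu(cpu) {
            if (cpu < MAX_VIRT_CPUS)
                continue;                      /* this vcpu has valid state */
            if (cpu_down(cpu))                 /* as named in the commit log */
                pr_warn("xen: failed to bring down cpu %u\n", cpu);
        }
    }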

Dongli Zhang

On 10/31/2017 12:14 AM, Tim Evers wrote:
> Hi,
> 
> I am trying to set up two Ubuntu 16.04 / Xen 4.6 machines to perform live
> migration and CPU / memory hotplug. So far I have encountered several
> catastrophic issues. They are so severe that I am thinking I might be on the
> wrong track altogether.
> 
> Any input is highly appreciated!
> 
> The setup:
> 
> 2 Dell M630 with Ubuntu 16.04 and Xen 4.6, 64bit Dom0 (node1 + node2)
> 
> 2 DomUs, Debian Jessie 64bit PV and Debian Jessie 64bit HVM
> 
> Now create a PV DomU on node1 with 1 CPU core and 2 GB RAM and plenty of room
> for hot-add / hotplug:
> 
> Config excerpt:
> 
> kernel       = "/home/xen/shared/boot/tests/vmlinuz-3.16.0-4-amd64"
> ramdisk      = "/home/xen/shared/boot/tests/initrd.img-3.16.0-4-amd64"
> maxmem       = 16384
> memory       = 2048
> maxvcpus     = 8
> vcpus        = 1
> cpus         = "18"
> 
> xm list:
> 
> root1823     97  2048     1     -b----      15.1
> 
> All is fine. Now migrate to node2. Immediately after the migration we see:
> 
> xm list:
> 
> root182      360 16384     1     -b----      10.5
> 
> So the DomU immediately ballooned to its maxmem after the migration, and even
> better, inside the DomU we see all CPUs are suddenly hotplugged (but not online
> due to missing udev rules):
> 
> root@debian8:~# ls /sys/devices/system/cpu/ | grep cpu
> cpu0
> cpu1
> cpu2
> cpu3
> cpu4
> cpu5
> cpu6
> cpu7
> 
> So this is already not how it is supposed to be (DomU should look the same
> before and after migration).
> 
> Now we take cpu1 online:
> 
> echo 1 > /sys/devices/system/cpu/cpu1/online
> 
> Result as seen through hvc on the Dom0:
> 
> [  373.360949] installing Xen timer for CPU 1
> [  400.032003] BUG: soft lockup - CPU#0 stuck for 22s! [bash:733]
> [  400.032003] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc evdev pcspkr x86_pkg_temp_thermal thermal_sys coretemp crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd autofs4 ext4 crc16 mbcache jbd2 crct10dif_pclmul crct10dif_common xen_netfront xen_blkfront crc32c_intel
> [  400.032003] CPU: 0 PID: 733 Comm: bash Not tainted 3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u3
> [  400.032003] task: ffff88000470e1d0 ti: ffff88006acec000 task.ti: ffff88006acec000
> [  400.032003] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
> [  400.032003] RSP: e02b:ffff88006acefdd0  EFLAGS: 00000246
> [  400.032003] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff810013aa
> [  400.032003] RDX: ffff88007d640000 RSI: 0000000000000000 RDI: 0000000000000000
> [  400.032003] RBP: ffff88006bcf6000 R08: ffff88007d03d5c8 R09: 0000000000000122
> [  400.032003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [  400.032003] R13: 000000000000cd60 R14: ffff88006d1dca20 R15: 000000000007d649
> [  400.032003] FS:  00007fe4b215e700(0000) GS:ffff88007d600000(0000) knlGS:0000000000000000
> [  400.032003] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  400.032003] CR2: 00000000016de6d0 CR3: 0000000004a67000 CR4: 0000000000042660
> [  400.032003] Stack:
> [  400.032003]  ffff88006acefb3e 0000000000000000 ffffffff81010dc1 0000000001323d35
> [  400.032003]  0000000000000000 0000000000000000 0000000000000001 0000000000000001
> [  400.032003]  ffff88006d1dca20 0000000000000000 ffffffff81068cac 000000306aceff3c
> [  400.032003] Call Trace:
> [  400.032003]  [<ffffffff81010dc1>] ? xen_cpu_up+0x211/0x500
> [  400.032003]  [<ffffffff81068cac>] ? _cpu_up+0x12c/0x160
> [  400.032003]  [<ffffffff81068d59>] ? cpu_up+0x79/0xa0
> [  400.032003]  [<ffffffff8150b615>] ? cpu_subsys_online+0x35/0x80
> [  400.032003]  [<ffffffff813a608d>] ? device_online+0x5d/0xa0
> [  400.032003]  [<ffffffff813a6145>] ? online_store+0x75/0x80
> [  400.032003]  [<ffffffff8121b56a>] ? kernfs_fop_write+0xda/0x150
> [  400.032003]  [<ffffffff811aaf32>] ? vfs_write+0xb2/0x1f0
> [  400.032003]  [<ffffffff811aba72>] ? SyS_write+0x42/0xa0
> [  400.032003]  [<ffffffff8151a48d>] ? system_call_fast_compare_end+0x10/0x15
> [  400.032003] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc 
> cc
> cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 
> 59
> c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> 
> The same happens on the HVM DomU, but always only _after_ live migration.
> Hotplugging works flawlessly if done on the Dom0 where the DomU was started.
> 
> Any idea what might be happening here? Has anyone managed to migrate and
> afterwards hotplug a DomU?
> 
> Thanks
> 
> Tim
> 

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
https://lists.xen.org/xen-users

 

