
Re: [Xen-devel] [BUG] repeated live migration for VM failed



On Mon, May 22, 2017 at 11:18 AM, George Dunlap
<george.dunlap@xxxxxxxxxx> wrote:
> On 22/05/17 07:35, Hao, Xudong wrote:
>> Bug detailed description:
>>
>> ----------------
>>
>> Create one RHEL7.3 HVM guest and live-migrate it continuously. After 200+ or
>> 300+ live migrations, the toolstack reports an error and the migration
>> fails.
>>
>>
>>
>> Environment :
>>
>> ----------------
>>
>> HW: Skylake server
>>
>> Xen: Xen 4.9.0 RC4
>>
>> Dom0: Linux 4.11.0
>>
>>
>>
>> Reproduce steps:
>>
>> ----------------
>>
>> 1.      Compile Xen 4.9 RC4 and dom0 kernel 4.11.0, boot to dom0
>>
>> 2.      Boot RHEL7.3 HVM guest
>>
>> 3.      Migrate guest to localhost, sleep 10 seconds
>>
>> 4.      Repeat step 3.
>>
>>
>>
>> Current result:
>>
>> ----------------
>>
>> VM migration fails.
>>
>>
>>
>> Base error log:
>>
>> ----------------
>>
>> xl migrate 24hrs_lm_guest_2 localhost
>>
>> root@localhost's password:
>>
>> migration target: Ready to receive domain.
>>
>> Saving to migration stream new xl format (info 0x3/0x0/1761)
>>
>> Loading new save file <incoming migration stream> (new xl fmt info 
>> 0x3/0x0/1761)
>>
>> Savefile contains xl domain config in JSON format
>>
>> Parsing config from <saved>
>>
>> xc: info: Saving domain 273, type x86 HVM
>>
>> xc: info: Found x86 HVM domain from Xen 4.9
>>
>> xc: info: Restoring domain
>>
>> xc: error: set HVM param 12 = 0x00000000feffe000 (85 = Interrupted system 
>> call should ): Internal error
>>
>> xc: error: Restore failed (85 = Interrupted system call should ): Internal 
>> error
>
> Interesting -- it appears that setting HVM_PARAM_IDENT_PT (#12) can fail
> with -ERESTART.  But the comment for ERESTART makes it explicit that it
> should be internal only -- it should cause a hypercall continuation (so
> that the hypercall restarts automatically), rather than returning to the
> guest.
>
> But the hypercall continuation code seems to have disappeared from
> do_hvm_op() at some point?
>
> /me digs a bit more...

The problem turns out to be commit ae20ccf ("dm_op: convert
HVMOP_set_mem_type"), which says:

    This patch removes the need for handling HVMOP restarts, so that
    infrastructure is removed.

It's true that no operations need iteration information restored any
more, but two operations may still need to be restarted to avoid
deadlocks with other operations.

Attached is a patch which restores hypercall continuation checking.
Xudong, can you give it a test?
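
For anyone following along, here is a rough user-space model of what
"continuation checking" buys us. It is not the patch and not Xen code:
every name below (model_lock, model_set_param, the two dispatch_*
functions) is made up for illustration, and the "lock is busy for N
attempts" behaviour is an assumption standing in for whatever contention
makes the real operation back off. The only value taken from the report
is 85, the error number in the log.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the internal restart code; 85 is the number in the log. */
#define MODEL_ERESTART 85

/* Hypothetical lock that another operation holds for a few attempts. */
struct model_lock {
    int busy_for;   /* how many upcoming trylock attempts will fail */
};

/* Try to take the lock without blocking; fail while it is still busy. */
bool model_trylock(struct model_lock *lock)
{
    if (lock->busy_for > 0) {
        lock->busy_for--;
        return false;
    }
    return true;
}

/*
 * Hypothetical operation: rather than block (and risk a deadlock), it
 * backs off with -MODEL_ERESTART and expects to be re-issued.
 */
int model_set_param(struct model_lock *lock, unsigned long *param,
                    unsigned long value)
{
    if (!model_trylock(lock))
        return -MODEL_ERESTART;
    *param = value;
    return 0;
}

/* Without the check: the internal restart code leaks to the caller. */
int dispatch_without_continuation(struct model_lock *lock,
                                  unsigned long *param, unsigned long value)
{
    assert(lock && param);
    return model_set_param(lock, param, value);
}

/*
 * With the check: the dispatcher re-issues the operation itself, so the
 * caller only ever sees the final result. A plain loop models the
 * restart here; it is not how a hypervisor re-enters a hypercall.
 */
int dispatch_with_continuation(struct model_lock *lock,
                               unsigned long *param, unsigned long value)
{
    int rc;

    assert(lock && param);
    do {
        rc = model_set_param(lock, param, value);
    } while (rc == -MODEL_ERESTART);
    return rc;
}
```

With a lock that is busy for a couple of attempts, the first dispatcher
hands -85 back to its caller and leaves the parameter unset, which is the
shape of the failure in the log; the second one retries until the
operation goes through and returns 0.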

Thanks,
 -George

Attachment: 0001-Restore-HVM_OP-hypercall-continuation-partial-revert.patch
Description: Text Data
