
Re: [Xen-devel] question about migration

To: Wen Congyang <wency@xxxxxxxxxxxxxx>
From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
Date: Tue, 29 Dec 2015 12:46:25 +0000
Cc: Wei Liu <wei.liu2@xxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, "Ian.Campbell@xxxxxxxxxx" <Ian.Campbell@xxxxxxxxxx>, xen devel <xen-devel@xxxxxxxxxxxxx>
Delivery-date: Tue, 29 Dec 2015 12:46:34 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 25/12/2015 03:06, Wen Congyang wrote:

On 12/25/2015 09:45 AM, Wen Congyang wrote:

On 12/24/2015 08:36 PM, Andrew Cooper wrote:

On 24/12/15 02:29, Wen Congyang wrote:

Hi Andrew Cooper:

I rebased the COLO code onto the newest upstream Xen and tested it. I found
a problem during testing, and I can reproduce it with a plain migration.

How to reproduce:
1. xl cr -p hvm_nopv
2. xl migrate hvm_nopv 192.168.3.1

You are the very first person to try a usecase like this.

It works as much as it does because of your changes to the uncooperative HVM
domain logic.  As I have said repeatedly during review, this is not necessarily a
safe change to make without an in-depth analysis of the knock-on effects; it
looks as if you have found the first knock-on effect.

The migration succeeds, but the VM doesn't run on the target machine.
You can see the reason in 'xl dmesg':
(XEN) HVM2 restore: VMCE_VCPU 1
(XEN) HVM2 restore: TSC_ADJUST 0
(XEN) HVM2 restore: TSC_ADJUST 1
(d2) HVM Loader
(d2) Detected Xen v4.7-unstable
(d2) Get guest memory maps[128] failed. (-38)
(d2) *** HVMLoader bug at e820.c:39
(d2) *** HVMLoader crashed.

The reason is that we don't call xc_domain_set_memory_map() on the target
machine. When we create an HVM domain, the call chain is:
libxl__domain_build()
      libxl__build_hvm()
          libxl__arch_domain_construct_memmap()
              xc_domain_set_memory_map()

Should we migrate the guest memory map from the source machine to the target machine?

This bug specifically is because HVMLoader is expected to have run and turned
the hypercall information into an E820 table in the guest before a migration
occurs.
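The failure mode can be sketched with a toy model. This is not Xen code: `toy_domain`, `toy_set_memory_map()` and `toy_get_memory_map()` are hypothetical names invented for illustration. The assumption is that a memory-map query on a domain which never had a map set fails with ENOSYS, which is 38 on Linux and so would match the (-38) in the log above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these types or functions are Xen's. */
struct toy_e820_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

#define TOY_MAX_E820 8

struct toy_domain {
    struct toy_e820_entry e820[TOY_MAX_E820];
    size_t nr_e820;   /* 0 means the toolstack never set a map */
};

/* Stands in for the "set memory map" step done at domain build time. */
int toy_set_memory_map(struct toy_domain *d,
                       const struct toy_e820_entry *entries, size_t nr)
{
    assert(d != NULL);
    if (entries == NULL || nr == 0 || nr > TOY_MAX_E820)
        return -EINVAL;
    for (size_t i = 0; i < nr; i++)
        d->e820[i] = entries[i];
    d->nr_e820 = nr;
    return 0;
}

/* Stands in for the guest-side query that HVMLoader issues at boot. */
int toy_get_memory_map(const struct toy_domain *d,
                       struct toy_e820_entry *out, size_t max)
{
    assert(d != NULL);
    if (d->nr_e820 == 0)
        return -ENOSYS;   /* assumed behaviour: no map was ever provided */
    if (out == NULL || max < d->nr_e820)
        return -EINVAL;
    for (size_t i = 0; i < d->nr_e820; i++)
        out[i] = d->e820[i];
    return (int)d->nr_e820;
}
```

In this model, a domain restored on the target without the set step behaves like a freshly zeroed `toy_domain`: the first query fails, and it only succeeds once something on the restore side supplies the map.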

Unfortunately, the current codebase is riddled with such assumptions and
expectations (e.g. the HVM save code assumes that the FPU context is valid when
it is saving register state), which is a direct side effect of how it was
developed.


Having said all of the above, I agree that your example is a usecase which 
should work.  It is the ultimate test of whether the migration stream contains 
enough information to faithfully reproduce the domain on the far side.  Clearly 
at the moment, this is not the case.

I have an upcoming project to work on the domain memory layout logic, because
it is unsuitable for a number of XenServer usecases. Part of that will require
moving it into the migration stream.

I found another migration problem during testing:
If the migration fails, we resume the guest on the source side,
but the HVM guest no longer responds.

In my test environment the migration always succeeds, so I
use a hack to reproduce the failure:
1. modify the target xen tools:

diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 258dec4..da95606 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -767,6 +767,8 @@ void libxl__xc_domain_restore_done(libxl__egc *egc, void *dcs_void,
         goto err;
     }

+    rc = ERROR_FAIL;
+
  err:
     check_all_finished(egc, stream, rc);

2. xl cr hvm_nopv, and wait some time (until you can log in to the guest)

3. xl migrate hvm_nopv 192.168.3.1

Another problem:
If migration fails after the guest is suspended, we resume it on the
source.
In this case, we cannot shut it down, because no process handles the shutdown
event.
The log in /var/log/xen/xl-hvm_nopv.log:
Waiting for domain hvm_nopv (domid 1) to die [pid 5508]
Domain 1 has shut down, reason code 2 0x2
Domain has suspended.
Done. Exiting now

The xl has exited...
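As a reading aid for the log above, here is a minimal sketch of the shutdown reason codes. The numeric values are an assumption based on my understanding of Xen's public SHUTDOWN_* constants; the log itself only confirms that code 2 is reported as a suspend. The `toy_` names and the keep-watching policy are hypothetical, not what xl implements.

```c
#include <stdbool.h>

/* Assumed to mirror Xen's public SHUTDOWN_* values; only 2 == suspend
 * is confirmed by the log in this thread. */
enum toy_shutdown_reason {
    TOY_SHUTDOWN_POWEROFF = 0,
    TOY_SHUTDOWN_REBOOT   = 1,
    TOY_SHUTDOWN_SUSPEND  = 2,
    TOY_SHUTDOWN_CRASH    = 3,
    TOY_SHUTDOWN_WATCHDOG = 4
};

/* Map a reason code to a short human-readable name. */
const char *toy_shutdown_reason_name(int code)
{
    switch (code) {
    case TOY_SHUTDOWN_POWEROFF: return "poweroff";
    case TOY_SHUTDOWN_REBOOT:   return "reboot";
    case TOY_SHUTDOWN_SUSPEND:  return "suspend";
    case TOY_SHUTDOWN_CRASH:    return "crash";
    case TOY_SHUTDOWN_WATCHDOG: return "watchdog";
    default:                    return "unknown";
    }
}

/* Hypothetical policy: a suspended domain may be resumed after a failed
 * migration, so a monitor would need to keep watching it. */
bool toy_monitor_should_keep_watching(int code)
{
    return code == TOY_SHUTDOWN_SUSPEND;
}
```

Under this policy, a monitor seeing reason code 2 would stay alive so that a later shutdown of the resumed guest still has a process to handle it.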

Thanks
Wen Congyang

Hmm yes. This is a libxl bug in libxl_evenable_domain_death(). CC'ing the toolstack maintainers.

It waits for the @releasedomain watch, but doesn't interpret the results correctly. In particular, if it can still make successful hypercalls with the provided domid, that domain was not the subject of @releasedomain. (I also observe that domain_death_xswatch_callback() is very inefficient. It only needs to make a single hypercall, not query the entire state of all domains.)
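The check described here can be sketched as a toy model. This is not libxl code: `toy_domain_exists()` is a hypothetical stand-in for a single per-domain hypercall, and the array of live domids stands in for the hypervisor's state. The idea is only that a @releasedomain firing concerns the watched domain if, and only if, a query for that one domid no longer succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one hypercall that asks about one domid. */
bool toy_domain_exists(const uint32_t *live, size_t n, uint32_t domid)
{
    assert(live != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (live[i] == domid)
            return true;
    return false;
}

/* Called when @releasedomain fires. The watch does not say which domain
 * was released, so ask about the one we care about and nothing else. */
bool toy_released_is_ours(const uint32_t *live, size_t n, uint32_t watched)
{
    /* The query still succeeds: some other domain was released. */
    if (toy_domain_exists(live, n, watched))
        return false;
    /* The query fails: the watched domain is gone. */
    return true;
}
```

With live domids {0, 1, 3}, a firing is ignored by a watcher of domid 1 and treated as a death by a watcher of domid 2.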


~Andrew

