Re: [Xen-devel] balloon driver broken in 3.12+ after save+restore



On 27.06.2014 11:51, David Vrabel wrote:
> On 22/05/14 02:31, Marek Marczykowski-Górecki wrote:
>> Hi,
>>
>> I have a problem with balloon driver after/during restoring a saved domain.
>> There are two symptoms:
>> 1. When 'xl mem-set <some size smaller than initial>' was run on the domain
>> just before save, it still needs the initial memory size to restore.
>> Details below.
>>
>> 2. The restored domain sometimes (most of the time) does not want to balloon
>> down. For example, when the domain has 3300MB and I mem-set it to 2800MB,
>> nothing changes immediately (only "target" in sysfs): both 'xl list' and
>> 'free' inside the VM report the same size (with plenty of free memory in
>> the VM). After some time it gets ballooned down to ~3000MB, still not
>> 2800MB. I haven't found any pattern here.
>>
>> Both of the above worked perfectly in 3.11.
>>
>> I'm running Xen 4.1.6.1.
>>
>> Details for the first problem:
>> Preparation:
>> I start the VM with the config at the end of this email (memory=400,
>> maxmem=4000), wait some time, then 'xl mem-set' it to roughly the amount of
>> memory actually in use (about 200MB in most cases). Then 'sleep 1' and
>> 'xl save'.
>> When I want to restore that domain, I take the initial config file, replace
>> the memory setting with the size used in 'xl mem-set' above and call
>> 'xl restore' with that config. It fails with this error:
>> ---
>> Loading new save file /var/run/qubes/current-savefile (new xl fmt info 0x0/0x0/849)
>>  Savefile contains xl domain config
>> xc: detail: xc_domain_restore start: p2m_size = fa800
>> xc: detail: Failed allocation for dom 51: 1024 extents of order 0
>> xc: error: Failed to allocate memory for batch.!: Internal error
>> xc: detail: Restore exit with rc=1
>> libxl: error: libxl_dom.c:313:libxl__domain_restore_common restoring domain: Resource temporarily unavailable
>> cannot (re-)build domain: -3
>> libxl: error: libxl.c:713:libxl_domain_destroy non-existant domain 51
>> ---
>> When memory is set back to 400 (or slightly lower, like 380), the restore
>> succeeds, but the second problem still occurs.
>>
>> I've bisected the first problem down to this commit:
>> commit cd9151e26d31048b2b5e00fd02e110e07d2200c9
>>     xen/balloon: set a mapping for ballooned out pages
> 
> Sorry for the delay. I somehow missed this.
> 
> This is likely caused by the balloon driver creating multiple entries
> in the p2m all pointing to the MFNs of the scratch pages. These
> duplicates are de-duped on save/restore.
> 
> I suspect your 2nd issue may also be caused by this.
> 
> Can you try this patch, please?

Looks to be the right fix, thanks!

> 
> 8<----------------------------------------------
> xen/balloon: set ballooned out pages as invalid in p2m
> 
> Since cd9151e26d31048b2b5e00fd02e110e07d2200c9 (xen/balloon: set a
> mapping for ballooned out pages), a ballooned out page had its entry
> in the p2m set to the MFN of one of the scratch pages.  This means that
> the p2m will contain many entries pointing to the same MFN.
> 
> During a domain save, these many-to-one entries are not considered and
> the scratch page is saved multiple times. On restore the ballooned
> pages are populated with new frames and the domain may use up its
> allocation before all pages can be restored.
> 
> Set ballooned out pages as INVALID_P2M_ENTRY in the p2m (as they
> were before), preventing them from being saved and re-populated on
> restore.
> 
> Signed-off-by: David Vrabel <david.vrabel@xxxxxxxxxx>
> ---
>  drivers/xen/balloon.c |   12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index b7a506f..5c660c7 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -426,20 +426,18 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
>                * p2m are consistent.
>                */
>               if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> -                     unsigned long p;
> -                     struct page   *scratch_page = get_balloon_scratch_page();
> -
>                       if (!PageHighMem(page)) {
> +                             struct page *scratch_page = get_balloon_scratch_page();
> +
>                               ret = HYPERVISOR_update_va_mapping(
>                                               (unsigned long)__va(pfn << PAGE_SHIFT),
>                                               pfn_pte(page_to_pfn(scratch_page),
>                                                       PAGE_KERNEL_RO), 0);
>                               BUG_ON(ret);
> -                     }
> -                     p = page_to_pfn(scratch_page);
> -                     __set_phys_to_machine(pfn, pfn_to_mfn(p));
>  
> -                     put_balloon_scratch_page();
> +                             put_balloon_scratch_page();
> +                     }
> +                     __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
>               }
>  #endif
>  
> 
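
To make the reasoning in the commit message concrete, here is a rough toy model, not kernel or libxc code. It assumes a restore that allocates one fresh frame per valid p2m entry without noticing that several entries share one MFN. All toy_* names, TOY_SCRATCH_MFN and the made-up MFN values are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define TOY_INVALID_P2M_ENTRY (~0UL)
#define TOY_SCRATCH_MFN       0x1000UL

enum toy_balloon_policy {
	TOY_MAP_TO_SCRATCH,	/* behaviour since cd9151e2 */
	TOY_MARK_INVALID	/* behaviour with the patch above */
};

/*
 * Fill a toy p2m of nr_pfns entries in which the first nr_ballooned pfns
 * have been ballooned out and the rest are in use.
 */
void toy_balloon_out(unsigned long *p2m, size_t nr_pfns,
		     size_t nr_ballooned, enum toy_balloon_policy policy)
{
	size_t pfn;

	assert(nr_ballooned <= nr_pfns);
	for (pfn = 0; pfn < nr_pfns; pfn++) {
		if (pfn >= nr_ballooned)
			p2m[pfn] = 0x2000UL + pfn; /* made-up unique MFN */
		else if (policy == TOY_MAP_TO_SCRATCH)
			p2m[pfn] = TOY_SCRATCH_MFN; /* many-to-one entry */
		else
			p2m[pfn] = TOY_INVALID_P2M_ENTRY;
	}
}

/*
 * Frames needed by a restore that populates every valid p2m entry with a
 * new frame (assumed behaviour: duplicates of one MFN are not detected).
 */
size_t toy_frames_needed_on_restore(const unsigned long *p2m, size_t nr_pfns)
{
	size_t pfn, frames = 0;

	for (pfn = 0; pfn < nr_pfns; pfn++)
		if (p2m[pfn] != TOY_INVALID_P2M_ENTRY)
			frames++;
	return frames;
}
```

With 8 pfns of which 5 are ballooned out, the scratch-page policy makes the restore ask for all 8 frames, while marking the entries invalid needs only the 3 that are really in use. That would match the first symptom, where the restore needs the initial memory size rather than the mem-set target.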


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
