[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/4] x86/PoD: Identify when a domain has already been killed from PoD exhaustion

On Mon, Nov 2, 2015 at 2:32 PM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
> On 02/11/15 14:07, George Dunlap wrote:
>> On 30/10/15 18:33, Andrew Cooper wrote:
>>> p2m_pod_demand_populate() can be entered repeatedly during a single path
>>> through the hypervisor, e.g. on a toolstack batch map operation.
>>> The domain might be crashed, but the interface currently lacks a way of
>>> passing an error back through the generic p2m layer.
>>> Longterm the p2m layer needs reworking to allow errors to be returned, but 
>>> in
>>> the short term, avoid repeatedly re-sweeping the domain after it has already
>>> been crashed from PoD exhaustion.
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>>> ---
>>> CC: Jan Beulich <JBeulich@xxxxxxxx>
>>> CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
>>> ---
>>>  xen/arch/x86/mm/p2m-pod.c | 3 ++-
>>>  xen/include/asm-x86/p2m.h | 2 ++
>>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>> diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
>>> index be15cf3..6fb054f 100644
>>> --- a/xen/arch/x86/mm/p2m-pod.c
>>> +++ b/xen/arch/x86/mm/p2m-pod.c
>>> @@ -1048,7 +1048,7 @@ p2m_pod_demand_populate(struct p2m_domain *p2m, 
>>> unsigned long gfn,
>>>      /* This check is done with the pod lock held.  This will make sure that
>>>       * even if d->is_dying changes under our feet, p2m_pod_empty_cache()
>>>       * won't start until we're done. */
>>> -    if ( unlikely(d->is_dying) )
>>> +    if ( unlikely(d->is_dying) || p2m->pod.dead )
>> So after getting lost in a maze of twisty passages, it looks like
>> "d->is_dying" might be the wrong thing to check here.  d->is_dying is
>> *only* set, AFAICT, in two places:
>>  - in domain_kill(), which is only called for XEN_DOMCTL_destroydomain
>>  - in domain_create(), if the creation failed for some reason.
> d->is_dying indicates that the domains resources are starting to be torn
> down.  It is the correct check here.
>> Would it make more sense to check d->is_shutting_down instead?
> A domain resume hypercall can clear d->is_shutting_down, making it an
> inappropriate check.

Well if domain_resume() cleared it, then the sweep would be run one
more time and then crashed again, which wouldn't hurt too much.

The real reason d->is_shutting_down is an inappropriate check is that
there are lots of *other* reasons why a domain might shut down -- or
even crash -- which should not prevent p2m sweeps from happening on
the domain.  The only legitimate reason to do no more sweeps, other
than the final tear-down, is a failed PoD sweep.

So I think I've convinced myself that something like this is
(unfortunately) necessary.

I do think it should have a more descriptive name -- "out_of_memory"
or "sweep_failed" would be better than "dead".

In some imaginary future where p2m functions have proper error
handling, one could imagine the toolstack deciding to repopulate the
PoD cache and clear this flag, allowing the domain to continue rather
than crashing it.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.