
Re: [Xen-devel] [PATCH 2/2] domain: use PGC_extra domheap page for shared_info



> -----Original Message-----
> From: Julien Grall <julien@xxxxxxx>
> Sent: 26 February 2020 11:33
> To: Durrant, Paul <pdurrant@xxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; Volodymyr Babchuk
> <Volodymyr_Babchuk@xxxxxxxx>; Andrew Cooper <andrew.cooper3@xxxxxxxxxx>;
> George Dunlap <george.dunlap@xxxxxxxxxx>; Ian Jackson
> <ian.jackson@xxxxxxxxxxxxx>; Jan Beulich <jbeulich@xxxxxxxx>; Konrad
> Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>; Wei Liu <wl@xxxxxxx>
> Subject: Re: [PATCH 2/2] domain: use PGC_extra domheap page for
> shared_info
> 
> Hi Paul,
> 
> On 25/02/2020 09:53, Paul Durrant wrote:
> > There's no particular reason shared_info need use a xenheap page.
> 
> AFAICT, a side-effect of this change is that shared_info is now going to
> be part of the migration stream. One issue with this is that, on the
> restore side, it will be counted against the number of pages allocated to
> the domain.
> 
> I am fairly certain dirty logging on pages with PGC_extra set would not
> work properly either.

True, those two do need fixing before expanding the use of PGC_extra. I guess 
this may already be a problem with the current VMX use case.

> 
> As the pages will be recreated on the restore side, I would suggest
> skipping them in XEN_DOMCTL_getpageframeinfo3.
>

At some point in the future I guess it may be useful to migrate PGC_extra pages, 
but they would need to be marked as such in the stream. Skipping sounds like 
the right approach for now.
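
For concreteness, I think the skip could sit alongside the existing xenheap 
check in the XEN_DOMCTL_getpageframeinfo3 handler, which already reports 
pages the tools should not touch as XEN_DOMCTL_PFINFO_XTAB. A minimal sketch 
(the surrounding code is from memory, so treat the detail as assumed):

    /* xen/arch/x86/domctl.c: XEN_DOMCTL_getpageframeinfo3 per-page loop */
    page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);

    if ( unlikely(!page) ||
         unlikely(is_xen_heap_page(page)) ||
         unlikely(page->count_info & PGC_extra) ) /* skip extra pages */
        type = XEN_DOMCTL_PFINFO_XTAB;

The saving side already treats XTAB pages as "not present in the stream", so 
that should deal with the restore-side accounting problem too.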

> > Its
> > only purpose is to be mapped by the guest, so use a PGC_extra domheap
> > page instead. This does entail freeing shared_info during
> > domain_relinquish_resources() rather than domain_destroy(), so care is
> > needed to avoid dereferencing a NULL shared_info pointer, hence the
> > extra checks of 'is_dying'.
> > ASSERTions are added before apparently vulnerable dereferences in the
> > event channel code. These should not be hit because domain_kill() calls
> > evtchn_destroy() before calling domain_relinquish_resources() but are
> > added to catch any future re-ordering.
> 
> IMHO, the ASSERTions in the event channel pending/mask/... helpers will
> not protect against re-ordering. You may never hit them even if there is
> a re-ordering. It would be better to add an ASSERT()/BUG_ON() in
> evtchn_destroy() and possibly a comment in domain_kill() to mention
> ordering.

I'm not sure about that. Why would they not be hit?
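
For the avoidance of doubt, I read the suggestion as something along these 
lines (sketch only):

    int evtchn_destroy(struct domain *d)
    {
        ...
        /*
         * domain_kill() must call us before it frees shared_info in
         * domain_relinquish_resources(); catch any re-ordering here.
         */
        BUG_ON(!d->shared_info.virt);
        ...
    }

If the concern is that the pending/mask/... helpers may simply never run 
after a re-ordering then fair enough; a check here would fire 
deterministically.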

> 
> > For Arm, the call to free_shared_info() in arch_domain_destroy() is left
> > in place since it is called in the error path of arch_domain_create().
> >
> > NOTE: A modification in p2m_alloc_table() is required to avoid a false
> >        positive when checking for domain memory.
> >        A fix is also needed in dom0_construct_pv() to avoid
> >        automatically adding PGC_extra pages to the physmap.
> 
> I am not entirely sure how this is related to this patch. Was it a
> latent bug? If so, I think it would make sense to split it from this
> patch.
> 

It's related because the check relies on the fact that no PGC_extra pages are 
created before it is called, which is indeed true until this patch is applied. 
PGC_extra pages sit on the domain's page list but are excluded from 
domain_tot_pages(), so switching the check to the latter avoids the false 
positive.

> >
> > Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
> > ---
> > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > Cc: Julien Grall <julien@xxxxxxx>
> > Cc: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
> > Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > Cc: George Dunlap <george.dunlap@xxxxxxxxxx>
> > Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> > Cc: Jan Beulich <jbeulich@xxxxxxxx>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > Cc: Wei Liu <wl@xxxxxxx>
> > ---
> >   xen/arch/arm/domain.c        |  2 ++
> >   xen/arch/x86/domain.c        |  3 ++-
> >   xen/arch/x86/mm/p2m.c        |  3 +--
> >   xen/arch/x86/pv/dom0_build.c |  4 ++++
> >   xen/common/domain.c          | 25 +++++++++++++++++++------
> >   xen/common/event_2l.c        |  4 ++++
> >   xen/common/event_fifo.c      |  1 +
> >   xen/common/time.c            |  3 +++
> >   8 files changed, 36 insertions(+), 9 deletions(-)
> >
> > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> > index 2cbcdaac08..3904519256 100644
> > --- a/xen/arch/arm/domain.c
> > +++ b/xen/arch/arm/domain.c
> > @@ -1006,6 +1006,8 @@ int domain_relinquish_resources(struct domain *d)
> >           BUG();
> >       }
> >
> > +    free_shared_info(d);
> > +
> >       return 0;
> >   }
> >
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index eb7b0fc51c..3ad532eccf 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -691,7 +691,6 @@ void arch_domain_destroy(struct domain *d)
> >           pv_domain_destroy(d);
> >       free_perdomain_mappings(d);
> >
> > -    free_shared_info(d);
> >       cleanup_domain_irq_mapping(d);
> >
> >       psr_domain_free(d);
> > @@ -2246,6 +2245,8 @@ int domain_relinquish_resources(struct domain *d)
> >       if ( is_hvm_domain(d) )
> >           hvm_domain_relinquish_resources(d);
> >
> > +    free_shared_info(d);
> > +
> >       return 0;
> >   }
> >
> > diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> > index c5f428d67c..787d97d85e 100644
> > --- a/xen/arch/x86/mm/p2m.c
> > +++ b/xen/arch/x86/mm/p2m.c
> > @@ -695,8 +695,7 @@ int p2m_alloc_table(struct p2m_domain *p2m)
> >
> >       p2m_lock(p2m);
> >
> > -    if ( p2m_is_hostp2m(p2m)
> > -         && !page_list_empty(&d->page_list) )
> > +    if ( p2m_is_hostp2m(p2m) && domain_tot_pages(d) )
> >       {
> >           P2M_ERROR("dom %d already has memory allocated\n", d->domain_id);
> >           p2m_unlock(p2m);
> > diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> > index dc16ef2e79..f8f1bbe2f4 100644
> > --- a/xen/arch/x86/pv/dom0_build.c
> > +++ b/xen/arch/x86/pv/dom0_build.c
> > @@ -792,6 +792,10 @@ int __init dom0_construct_pv(struct domain *d,
> >       {
> >           mfn = mfn_x(page_to_mfn(page));
> >           BUG_ON(SHARED_M2P(get_gpfn_from_mfn(mfn)));
> > +
> > +        if ( page->count_info & PGC_extra )
> > +            continue;
> 
> I would add a comment explaining why we skip pages with PGC_extra set.
> 
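
Sure. Something like the following (sketch):

        /*
         * PGC_extra pages (e.g. shared_info) are allocated behind the
         * guest's back and mapped explicitly later, so they must not be
         * automatically added to the physmap here.
         */
        if ( page->count_info & PGC_extra )
            continue;
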
> > +
> >           if ( get_gpfn_from_mfn(mfn) >= count )
> >           {
> >               BUG_ON(is_pv_32bit_domain(d));
> > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > index ba7a905258..1d42fbcc0f 100644
> > --- a/xen/common/domain.c
> > +++ b/xen/common/domain.c
> > @@ -128,9 +128,9 @@ static void vcpu_info_reset(struct vcpu *v)
> >   {
> >       struct domain *d = v->domain;
> >
> > -    v->vcpu_info = ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
> > +    v->vcpu_info = (!d->is_dying && v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
> >                       ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
> > -                    : &dummy_vcpu_info);
> > +                    : &dummy_vcpu_info;
> 
> Without holding domain_lock(), I don't think there is a guarantee that
> is_dying (and therefore the shared_info) is not going to change behind
> your back. So v->vcpu_info may point to garbage.
> 
> Looking at the callers, XEN_DOMCTL_soft_reset will not hold the lock.
> 

True, it does need locking in some way.
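
A minimal sketch, assuming domain_lock() is safe to take in all callers of 
vcpu_info_reset():

    static void vcpu_info_reset(struct vcpu *v)
    {
        struct domain *d = v->domain;

        /* Stop d->is_dying (and hence shared_info) changing under our feet. */
        domain_lock(d);

        v->vcpu_info = (!d->is_dying && v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
                       ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
                       : &dummy_vcpu_info;
        v->vcpu_info_mfn = INVALID_MFN;

        domain_unlock(d);
    }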

> As an aside, it would be good to explain in a comment why we are using
> dummy_vcpu_info when the domain is dying.
> 

Simply because it is guaranteed to exist, but I'll add a comment to that effect.
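
Something like (sketch):

    /*
     * Once the domain is dying, shared_info may already have been freed,
     * so fall back to dummy_vcpu_info: it is statically allocated and
     * hence guaranteed to exist.
     */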

> >       v->vcpu_info_mfn = INVALID_MFN;
> >   }
> >
> > @@ -1650,24 +1650,37 @@ int continue_hypercall_on_cpu(
> >
> >   int alloc_shared_info(struct domain *d, unsigned int memflags)
> >   {
> > -    if ( (d->shared_info.virt = alloc_xenheap_pages(0, memflags)) == NULL )
> > +    struct page_info *pg;
> > +
> > +    pg = alloc_domheap_page(d, MEMF_no_refcount | memflags);
> > +    if ( !pg )
> >           return -ENOMEM;
> >
> > -    d->shared_info.mfn = virt_to_mfn(d->shared_info.virt);
> > +    if ( !get_page_and_type(pg, d, PGT_writable_page) )
> > +        return -ENODATA;
> 
> I think the page will never be freed if this fails.
>

Good spot.
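
One option is to mirror what hvm_alloc_ioreq_mfn() does when 
get_page_and_type() fails on a freshly allocated MEMF_no_refcount page, i.e. 
treat it as fatal for the domain rather than trying to unwind the allocation 
by hand. A sketch (the __map_domain_page_global() check is an addition on 
top of what was posted):

    if ( !get_page_and_type(pg, d, PGT_writable_page) )
    {
        /*
         * The domain cannot know about this page yet, so failure here
         * is a clear indication of something fishy going on.
         */
        domain_crash(d);
        return -ENODATA;
    }

    d->shared_info.mfn = page_to_mfn(pg);
    d->shared_info.virt = __map_domain_page_global(pg);
    if ( !d->shared_info.virt )
    {
        /* Drop both references taken above so the page gets freed. */
        put_page_alloc_ref(pg);
        put_page_and_type(pg);
        return -ENOMEM;
    }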

> > +
> > +    d->shared_info.mfn = page_to_mfn(pg);
> > +    d->shared_info.virt = __map_domain_page_global(pg);
> >
> >       clear_page(d->shared_info.virt);
> > -    share_xen_page_with_guest(mfn_to_page(d->shared_info.mfn), d, SHARE_rw);
> >
> >       return 0;
> >   }
> >
> >   void free_shared_info(struct domain *d)
> >   {
> > +    struct page_info *pg;
> > +
> >       if ( !d->shared_info.virt )
> >           return;
> >
> > -    free_xenheap_page(d->shared_info.virt);
> > +    unmap_domain_page_global(d->shared_info.virt);
> >       d->shared_info.virt = NULL;
> > +
> > +    pg = mfn_to_page(d->shared_info.mfn);
> > +
> > +    put_page_alloc_ref(pg);
> > +    put_page_and_type(pg);
> >   }
> >
> >   /*
> > diff --git a/xen/common/event_2l.c b/xen/common/event_2l.c
> > index e1dbb860f4..a72fe0232b 100644
> > --- a/xen/common/event_2l.c
> > +++ b/xen/common/event_2l.c
> > @@ -27,6 +27,7 @@ static void evtchn_2l_set_pending(struct vcpu *v, struct evtchn *evtchn)
> >        * others may require explicit memory barriers.
> >        */
> >
> > +    ASSERT(d->shared_info.virt);
> >       if ( guest_test_and_set_bit(d, port, &shared_info(d, evtchn_pending)) )
> >           return;
> >
> > @@ -54,6 +55,7 @@ static void evtchn_2l_unmask(struct domain *d, struct evtchn *evtchn)
> >        * These operations must happen in strict order. Based on
> >        * evtchn_2l_set_pending() above.
> >        */
> > +    ASSERT(d->shared_info.virt);
> >       if ( guest_test_and_clear_bit(d, port, &shared_info(d, evtchn_mask)) &&
> >            guest_test_bit(d, port, &shared_info(d, evtchn_pending)) &&
> >            !guest_test_and_set_bit(d, port / BITS_PER_EVTCHN_WORD(d),
> > @@ -67,6 +69,7 @@ static bool evtchn_2l_is_pending(const struct domain *d, evtchn_port_t port)
> >   {
> >       unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
> >
> > +    ASSERT(d->shared_info.virt);
> >       ASSERT(port < max_ports);
> >       return (port < max_ports &&
> >               guest_test_bit(d, port, &shared_info(d, evtchn_pending)));
> > @@ -76,6 +79,7 @@ static bool evtchn_2l_is_masked(const struct domain *d, evtchn_port_t port)
> >   {
> >       unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
> >
> > +    ASSERT(d->shared_info.virt);
> >       ASSERT(port < max_ports);
> >       return (port >= max_ports ||
> >               guest_test_bit(d, port, &shared_info(d, evtchn_mask)));
> > diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
> > index 230f440f14..e8c6045d72 100644
> > --- a/xen/common/event_fifo.c
> > +++ b/xen/common/event_fifo.c
> > @@ -497,6 +497,7 @@ static void setup_ports(struct domain *d)
> >
> >           evtchn = evtchn_from_port(d, port);
> >
> > +        ASSERT(d->shared_info.virt);
> >           if ( guest_test_bit(d, port, &shared_info(d, evtchn_pending)) )
> >               evtchn->pending = 1;
> >
> > diff --git a/xen/common/time.c b/xen/common/time.c
> > index 58fa9abc40..938226c7b1 100644
> > --- a/xen/common/time.c
> > +++ b/xen/common/time.c
> > @@ -99,6 +99,9 @@ void update_domain_wallclock_time(struct domain *d)
> >       uint32_t *wc_version;
> >       uint64_t sec;
> >
> > +    if ( d->is_dying )
> > +        return;
> 
> This is another instance where I think the use of d->is_dying is not
> safe. I looked at how other places in Xen dealt with d->is_dying.
> 
> We don't seem to have a common way to deal with it:
>     1) It may be checked under domain_lock() -> No issue with that
>     2) It may be checked under d->page_alloc_lock (e.g. assign_pages()).
> The assign_pages() case is fine because it will act as a full barrier.
> So if the call happens after relinquish_memory() then we will surely
> have observed d->is_dying. I haven't checked the others.
> 
> Some of the instances use neither of the two locks above. We probably
> ought to investigate them (I will add a TODO to my list).
> 
> Regarding the two cases here, domain_lock() might be the solution. If
> we are concerned about the contention (it is a spinlock), then we could
> switch domain_lock() from a spinlock to an rwlock.
> 

I'll see if there is any other suitable lock guaranteed to be taken during 
domain_kill(), but it may indeed have to be domain_lock().
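
If it does end up being domain_lock() and contention proves to be a problem, 
the rwlock conversion you suggest would make readers cheap. E.g. (sketch, 
assuming the lock were converted to an rwlock):

    void update_domain_wallclock_time(struct domain *d)
    {
        ...
        read_lock(&d->domain_lock);

        if ( !d->is_dying )
        {
            /* shared_info cannot be freed while the lock is held. */
            ... update the wallclock fields in shared_info ...
        }

        read_unlock(&d->domain_lock);
        ...
    }

with domain_kill() taking the lock for writing around the point where it 
sets d->is_dying.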

  Paul

> Cheers,
> 
> --
> Julien Grall