
Re: [Xen-devel] [RFC] [PATCH] x86/mm: use wait queues for mem_paging



Hi, 

I'm about to repost this patch, with a few other changes and fixups. 
Further comments below.

At 12:57 -0800 on 17 Feb (1329483430), Andres Lagar-Cavilla wrote:
> > At 08:05 -0800 on 17 Feb (1329465959), Andres Lagar-Cavilla wrote:
> >> Well, thanks for taking a stab at it. Looks fairly good. My main
> >> observation is that we do not want to unconditionally go on a wait queue
> >> everywhere. For example, the grant table code pretty reliably unwinds
> >> itself and correctly tells the grant mapper to retry, in the presence of
> >> a paged page.
> >
> > The grant table code is unlikely to hit a paged-out gfn in the caller's
> > address space.
> 
> Oh yeah they do. At least for the target gfn's behind the grants. That's
> why we need the retry loops in the PV backends in dom0.

The target gfns aren't in the caller's p2m, though. 

> > The balloon code should not be trying to populate at all!
> 
> Agreed. get_gfn_query is needed there. Nothing bad happens because we
> immediately send drop_page. But it's wrong-ish.

Fixed

> >> Hence I had proposed introducing get_gfn_sleep, and I'd be happy if the
> >> wait queue code is invoked only there. And then we can call
> >> get_gfn_sleep in vmx_load_pdptrs, in guest page table walks, and in
> >> __hvm_copy.
> >
> > I see.  In the longer term I definitely do not want get_gfn()'s callers
> > to have to deal with p2m_is_paged(), and PoD, and sharing, and every
> > other p2m hack we come up with.  I want all that stuff hidden behind
> > get_gfn_*(), with at most a flag to say 'fix up magic MM tricks'.
> >
> > I don't like get_gfn_sleep() very much but I'll see if I can find a way
> > to do the equivalent with a flag to get_gfn_*().  But I suspect it will
> > end up with a tri-state argument of:
> >  a) just give me the current state;
> >  b) try (without sleeping) to fix up paging, PoD and sharing but
> >     still maybe make me handle it myself by error propagation; and
> >  c) just fix it up.
> >
> > 'b' is a pretty weird interface.  Especially given that in corner cases,
> > 'make me handle it' might involve hitting a wait queue anyway.  If we
> > intend in the fullness of time to get rid of it (as I hope we would)
> > then probably we shouldn't introduce it.  (And if that means pushing
> > this whole change out past 4.2, we should consider doing that.)
> >
> It's weird. b) is an evolutionary side-effect. It should disappear. I
> guess what I'm arguing for is "it should disappear in the long term, not
> on a patch so close to 4.2"

Well, I tried making the 'but don't go to a waitq' flag; the
implementation is trivial on top of the patches I'm about to post but
the question of when to set it was so ugly I gave up.
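
For concreteness, such a flag would look roughly like this (the names
are made up for illustration, and the get_gfn_type_access() call is only
assumed to keep its current shape; this is not what the posted patches
do):

    /* Sketch only: three lookup modes matching (a), (b) and (c) above. */
    typedef enum {
        P2M_LOOKUP_QUERY,    /* (a) just report the current entry       */
        P2M_LOOKUP_TRY_FIX,  /* (b) populate/unshare if possible without
                              *     sleeping, else propagate an error   */
        P2M_LOOKUP_FIX,      /* (c) always fix up, sleeping on a wait
                              *     queue for paged-out gfns if needed  */
    } p2m_lookup_mode_t;

    /* A caller that must end up with a usable frame would then say: */
    mfn = get_gfn_type_access(p2m, gfn, &t, &a, P2M_LOOKUP_FIX, NULL);

It's mode (b) that I'd rather not expose to callers in the long run,
for the reasons above.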

I have hit another stumbling block though - get_two_gfns() can't
safely ask get_gfn() to populate or unshare if that might involve a waitq,
since by definition one of its two calls is made with a lock held.  I
can't see a nice way of having it retry (not one that's guaranteed to
make progress) without pulling the fixup-and-retry up into that
function.  I might do that in the next revision.
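
For reference, pulling it up would look something like this (heavily
simplified: the lock ordering between the two p2ms, the type/order
plumbing and the struct two_gfns packaging are all left out, so treat
it as a sketch of the retry structure only):

    /* Sketch: fixup-and-retry hoisted into a get_two_gfns()-style
     * helper.  Both lookups are done in query mode; if either entry
     * turns out to be paged we drop *both* gfn references, so nothing
     * is held while we kick the pager (which may sleep on a wait
     * queue), and then start again from the top. */
 retry:
    first_mfn  = get_gfn_query(first_d, first_gfn, &first_t);
    second_mfn = get_gfn_query(second_d, second_gfn, &second_t);

    if ( p2m_is_paging(first_t) || p2m_is_paging(second_t) )
    {
        struct domain *pd = p2m_is_paging(first_t) ? first_d : second_d;
        unsigned long  pg = p2m_is_paging(first_t) ? first_gfn : second_gfn;

        put_gfn(second_d, second_gfn);
        put_gfn(first_d, first_gfn);

        /* No references held here, so waiting is safe. */
        p2m_mem_paging_populate(pd, pg);
        goto retry;
    }

The part I'm not happy with is that this loop has no guarantee of
forward progress if the pager keeps evicting whichever gfn we're not
currently looking at.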

> > Can't be soon enough for me.  I've been sharpening the axe for that one
> > for years now. :)
> 
> What's stopping the death blow?

People said they still need it.

> >> You want to make sure the mfn is not valid. For p2m_ram_paging_out we
> >> still have the mfn in place.
> >
> > Doesn't p2m_mem_paging_populate already DTRT in all these cases?  If
> > not, it really should.
> 
> It's not about populate, it's about killing the domain unnecessarily in
> the in_atomic check.

Oh I see, thanks.  Fixed.
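
(To spell the point out for the archives: the slow path wants to be
gated on the frame actually being absent, i.e. something like

    /* Only wait -- or, if we can't wait, crash the VM -- when the mfn
     * really is gone.  For p2m_ram_paging_out the frame is still in
     * place, so the caller can keep using it while the pager finishes. */
    if ( p2m_is_paging(*t) && !mfn_valid(mfn) )
    {
        /* request page-in and wait, as in the hunk quoted below */
    }

rather than keying the whole thing off the p2m type alone.)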

> >> >> +
> >> >> +        /* Safety catch: it _should_ be safe to wait here but if
> >> >> +         * it's not, crash the VM, not the host */
> >> >> +        if ( in_atomic() )
> >> >> +        {
> >> >> +            WARN();
> >> >> +            domain_crash(p2m->domain);
> >> >> +        }
> >> >> +        else
> >> >> +        {
> >> >> +            current->arch.mem_paging_gfn = gfn;
> >> >> +            wait_event(current->arch.mem_paging_wq, ({
> >> >> +                        mfn = p2m->get_entry(p2m, gfn, t, a,
> >> >> +                                             p2m_query, page_order);
> >> >> +                        (*t != p2m_ram_paging_in);
> >>
> >> Well, first of all p2m->get_entry will not happen in a locked context,
> >> so you will exit the wait loop not holding the p2m lock and crash later.
> >
> > Yes - we always exit the loop without the lock; then we 'goto again' and
> > do things properly.  I realise it's a bit inefficient, but on the
> > page-in path the p2m lookups shouldn't be the bottleneck. :)  I did
> > write the version that handled the locking in this loop but it got a bit
> > twisty, plus we need the 'goto again' for other reasons (see below).
> 
> Yes, but, you'll be reading the p2m in a racy manner.

I don't think there are any real races here.  Either we see
p2m_paging_in (and someone will eventually wake us when it pages in) 
or we don't (and we go back and do the lookup safely).
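
To spell the shape out (this is just the structure of the hunk quoted
above, not new logic):

    again:
        /* Authoritative lookup, done with the gfn held/locked. */
        mfn = p2m->get_entry(p2m, gfn, t, a, p2m_query, page_order);

        if ( *t == p2m_ram_paging_in )
        {
            /* The lock is dropped before sleeping (not shown); the
             * lookup inside wait_event() is only a lockless hint for
             * when it's worth waking up... */
            current->arch.mem_paging_gfn = gfn;
            wait_event(current->arch.mem_paging_wq, ({
                        mfn = p2m->get_entry(p2m, gfn, t, a,
                                             p2m_query, page_order);
                        (*t != p2m_ram_paging_in); }));
            /* ...and whatever that saw, we redo the lookup properly. */
            goto again;
        }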

> > As a follow-up I ought to make the page-sharing fixup code that's just
> > below this use 'goto again' as well; we shouldn't really leave this
> > function until we get a lookup where all these cases are dealt with.

On closer inspection the page-sharing version is OK because it doesn't
release the lock, and the unshare function won't leave us with a
paged-out page.

> I think we won't be able to have a magic
> get_gfn_solve_everything_silently() call until we find a solution to the
> fact that get_gfn happens in locked contexts.

Yes, I agree, and I'm coming round to the opinion that we might be too
late to fix that in 4.2. :|

> I've thought about a wacky
> idea that
> 1. schedules a sleep-n-retry softirq
> 2. fails the call
> 3. avoids crashing the domain on the unwind path
> 4. when executing softirqs pauses the vcpu as per 1.
> 5. When woken up retries the last failed hypercall/vmexit/whatever
> 
> I shouldn't have said that out loud...

The first time I approached the idea of paging guest RAM (before Patrick
started on the current system) I prototyped something not a million
miles from this (basically, a sort of setjmp() on hypervisor entry) but
the question of making all hypercalls and VMEXITs idempotent scared me
off. :)

> >> >> void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
> >> >> {
> >> >>     struct vcpu *v = current;
> >> >> -    mem_event_request_t req;
> >> >> +    mem_event_request_t req = {0};
> >> >>     p2m_type_t p2mt;
> >> >>     p2m_access_t a;
> >> >>     mfn_t mfn;
> >> >>     struct p2m_domain *p2m = p2m_get_hostp2m(d);
> >> >> +    int send_request = 0;
> >> >>
> >> >>     /* We're paging. There should be a ring */
> >> >>     int rc = mem_event_claim_slot(d, &d->mem_event->paging);
> >>
> >> Given all the optimizations around the send_request flag, it might be
> >> worth moving the claim_slot call to an if(send_request) block at the
> >> bottom, and get rid of cancel_slot altogether. Claim_slot could
> >> potentially go (or cause someone else to go) to a wait queue, and it's
> >> best to not be that rude.
> >
> > Sure.  It might be tricky to make sure we unwind properly in the failure
> > case, but I'll have a go at a follow-up patch that looks at that.
> >
> > (This is not really a regression in this patch AFAICS - in the existing
> > code the get_gfn() callers end up calling this function anyway.)
> 
> *After* they've put gfn. Not a regression, just cleaner code imo.

This is also called after putting the gfn - but in any case I've
reshuffled the mem_event_claim_slot in the new version.
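
Concretely, the reshuffle means the ring slot is only claimed once we
know a request will actually be sent, so the cancel path goes away.
Roughly (a sketch of the shape, not the exact code in the repost):

    void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
    {
        mem_event_request_t req = {0};
        int send_request = 0;

        /* ... look up the p2m entry, update the paging state, fill in
         * req, and set send_request only on the paths that really need
         * to wake the pager (not, e.g., when a page-in is already under
         * way) ... */

        if ( !send_request )
            return;

        /* Claiming a slot can block (or make someone else block) on a
         * wait queue, so do it only after all the early-exit paths;
         * mem_event_cancel_slot() is then never needed. */
        if ( mem_event_claim_slot(d, &d->mem_event->paging) != 0 )
            return;

        mem_event_put_request(d, &d->mem_event->paging, &req);
    }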

Cheers,

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
