Re: [Xen-devel] Domain relinquish resources racing with p2m access



At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
> So we've run into this interesting (race?) condition while doing
> stress-testing. We pummel the domain with paging, sharing and mmap
> operations from dom0, and concurrently we launch a domain destruction.
> Often we see something along these lines in the logs:
> 
> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1
> entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
> 
> We're using the synchronized p2m patches just posted, so my analysis is as
> follows:
> 
> - the domain destroy domctl kicks in. It calls relinquish resources. This
> disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p
> entries
> 
> - In parallel, a do_mmu_update is making progress; it has no issues
> performing a p2m lookup because the p2m has not been torn down yet (we
> haven't gotten to the RCU callback). Eventually, the mapping fails in
> page_get_owner in get_page_from_l1e.
> 
> The mapping fails, as expected, but what makes me uneasy is that there
> is a still-active p2m lurking around, with seemingly valid translations
> to valid mfns, while all the domain pages are gone.

Yes.  That's OK as long as we know that any user of that page will
fail, but I'm not sure that we do.   
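
(For concreteness, a condensed, hypothetical sketch of the check that
fails, loosely modelled on get_page_from_l1e() in xen/arch/x86/mm.c;
the real function handles many more cases than shown here:)

    static int get_page_from_l1e_sketch(l1_pgentry_t l1e,
                                        struct domain *l1e_owner,
                                        struct domain *pg_owner)
    {
        unsigned long mfn = l1e_get_pfn(l1e);
        struct page_info *page = mfn_to_page(mfn);

        /* Once relinquish_resources has dropped the domain's
         * references, page_get_owner(page) no longer matches pg_owner
         * (and the m2p slot is invalid, hence pfn ffffffffffffffff in
         * the log), so taking a ref on behalf of pg_owner fails. */
        if ( unlikely(!get_page(page, pg_owner)) )
        {
            MEM_LOG("Error getting mfn ..."); /* the message quoted above */
            return 0;
        }
        return 1;
    }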

At one point we talked about get_gfn() taking a refcount on the
underlying MFN, which would fix this more cleanly.  ISTR the problem was
how to make sure the refcount was moved when the gfn->mfn mapping
changed. 
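
(A rough, untested sketch of that idea, using the get_gfn()/put_gfn()
names from the synchronized p2m series; get_gfn_refcounted() is a
hypothetical wrapper, not anything in the tree:)

    /* Take a page reference under the p2m lock, so the MFN cannot be
     * freed while the caller holds the gfn.  Callers would balance
     * with put_page() before their usual put_gfn(). */
    static mfn_t get_gfn_refcounted(struct domain *d, unsigned long gfn,
                                    p2m_type_t *t)
    {
        mfn_t mfn = get_gfn(d, gfn, t);      /* locks the p2m entry */

        if ( mfn_valid(mfn_x(mfn)) &&
             unlikely(!get_page(mfn_to_page(mfn_x(mfn)), d)) )
            /* Owner already changed (e.g. the domain is dying); report
             * no mapping.  The caller's put_gfn() still balances. */
            return _mfn(INVALID_MFN);

        /* The unsolved part: when the gfn->mfn mapping changes while a
         * reference is held, the p2m update path would have to move
         * the ref to the new MFN (or drop it). */
        return mfn;
    }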

Can you stick a WARN() in mm.c to get the actual path that leads to the
failure?
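
(E.g., a one-liner next to the existing log in get_page_from_l1e():

    /* Immediately after the "Error getting mfn" MEM_LOG, around
     * mm.c:958 in this tree: */
    WARN();   /* dumps a stack trace showing how we got here */

That should tell us which hypercall path is racing the destroy.)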

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
