[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v1] domctl: hold domctl lock while domain is destroyed



Hi Jan,

On 17/09/2021 14:47, Jan Beulich wrote:
On 17.09.2021 11:41, Andrew Cooper wrote:
On 17/09/2021 10:27, Julien Grall wrote:
Hi,

(+ some AWS folks)

On 17/09/2021 11:17, Jan Beulich wrote:
On 16.09.2021 19:52, Andrew Cooper wrote:
On 16/09/2021 13:30, Jan Beulich wrote:
On 16.09.2021 13:10, Dmitry Isaikin wrote:
From: Dmitry Isaykin <isaikin-dmitry@xxxxxxxxx>

This significantly speeds up concurrent destruction of multiple
domains on x86.
This effectively is a simplistic revert of 228ab9992ffb ("domctl:
improve locking during domain destruction"). There it was found to
actually improve things;

Was it?  I recall that it was simply an expectation that performance
would be better...

My recollection is that it was, for one of our customers.

Amazon previously identified 228ab9992ffb as a massive perf hit, too.

Interesting. I don't recall any mail to that effect.

Here we go:

https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fxen-devel%2Fde46590ad566d9be55b26eaca0bc4dc7fbbada59.1585063311.git.hongyxia%40amazon.com%2F&amp;data=04%7C01%7CAndrew.Cooper3%40citrix.com%7C8cf65b3fb3324abe7cf108d979bd7171%7C335836de42ef43a2b145348c2ee9ca5b%7C0%7C0%7C637674676843910175%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=si7eYIxSqsJY77sWuwsad5MzJDMzGF%2F8L0JxGrWTmtI%3D&amp;reserved=0


We have been using the revert for quite a while in production and didn't
notice any regression.


Clearly some of the reasoning behind 228ab9992ffb was flawed and/or
incomplete, and it appears as if it wasn't necessarily a wise move in
hindsight.

Possible; I continue to think though that the present observation wants
properly understanding instead of more or less blindly undoing that
change.

To be honest, I think this is the other way around. You wrote and merged
a patch with the following justification:

"
     There is no need to hold the global domctl lock across domain_kill() -
     the domain lock is fully sufficient here, and parallel cleanup after
     multiple domains performs quite a bit better this way.
"

Clearly, the original commit message is lacking details on the exact
setups and numbers. But we now have two stakeholders with proof that
your patch is harmful to the setup you claim perform better with your
patch.

To me this is enough justification to revert the original patch. Anyone
against the revert, should provide clear details of why the patch should
not be reverted.

I second a revert.

I was concerned at the time that the claim was unsubstantiated, and now
there is plenty of evidence to counter the claim.

Well, I won't object to a proper revert. I still think we'd better get to
the bottom of this, not the least because I thought there was agreement
that mid to long term we should get rid of global locking wherever
possible. Or are both of you saying that using a global lock here is
obviously fine? And does either of you have at least a theory to explain
the observation? I can only say that I find it puzzling.

I will quote what Hongyan wrote back on the first report:

"
The best solution is to make the heap scalable instead of a global
lock, but that is not going to be trivial.

Of course, another solution is to keep the domctl lock dropped in
domain_kill() but have another domain_kill lock so that competing
domain_kill()s will try to take that lock and back off with hypercall
continuation. But this is kind of hacky (we introduce a lock to reduce
spinlock contention elsewhere), which is probably not a solution but a
workaround.
"

Cheers,

--
Julien Grall



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.