[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Recent upgrade of 4.13 -> 4.14 issue


  • To: "marmarek@xxxxxxxxxxxxxxxxxxxxxx" <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>
  • From: Frédéric Pierret <frederic.pierret@xxxxxxxxxxxx>
  • Date: Tue, 3 Nov 2020 14:23:23 +0100
  • Arc-authentication-results: i=1; mx.zohomail.com; dkim=pass header.i=qubes-os.org; spf=pass smtp.mailfrom=frederic.pierret@xxxxxxxxxxxx; dmarc=pass header.from=<frederic.pierret@xxxxxxxxxxxx> header.from=<frederic.pierret@xxxxxxxxxxxx>
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1604409810; h=Content-Type:Cc:Date:From:In-Reply-To:MIME-Version:Message-ID:References:Subject:To; bh=ZlTOgByCt7DOl2He7UDNW7nUhQJH71ZDkAVL6UNd+jg=; b=Glc/+f0pw5k08U6xZlygS0D6GHGZE2kM9RO3zmiIESrm8kvesadfUtGRuUuYnOALCLQWTHDmXFzTQI2IHxu2cvxXnQR3nLTJZnLPIAIHdhZLgOH7xZWKNDfjLVrDajGZMm5TvJYBI2jvnuLRUQSsuobA/d8rrLOaqOmcPoUlQVs=
  • Arc-seal: i=1; a=rsa-sha256; t=1604409810; cv=none; d=zohomail.com; s=zohoarc; b=KGF0NMy1kAv0vXrL4Tw8mkirUZk3GIV6779qQyYE4TIq1EcVG+pQ2fSQy+3+fYPby/lWHK8juXe4YpW0NKhUidGVc5h+qNP5d/digAi9Au6nLGK1LlRRLsKn8nM2LPjZPULwX14AdR2QN4PI97gszCnjNn9XEUYliWFEgnTIz14=
  • Autocrypt: addr=frederic.pierret@xxxxxxxxxxxx; keydata= xsFNBFwkq3EBEADcfyaOkeuf+g96S1ieq05tJ8vTGsQrNXQ5RDE7ffagL0+EpfIP3x73x5Q0 Dy2rUVQ+oN1DHcueNL70RtNs9BFnoW0KZnskbT4nEJ9wQCQa22lQaIk9kCNVddh2HJKljtd8 vtovi97sWIjtzxx5Qwc2md0DY9AHhNC4KqKIW3tSPC17UsI8fASoNAHItYtyn2bO67p8pCIv ltoBrYnElD1Pyp5IGWiD2/YD325iPl2+qHVkUSWmb92hRRU19Rg+Uds8bVHqhz4cOqIE7jpX gYzTN/kq8sxBMh2OrQ/bSxLaccaNApIVSZVSAasVJfdscNDL9fjkHERK/AiSTleHrsgLf4PL w5koqPs/6JEIVI+t0pyg+Pa8uwFoeYTPrLSlw0f7bXSmlVfv8g7M7RWmk3T5QIpeHA0j3lEZ NbYRXzkI91HCt40X2bTb2jTKgvB9jQjEarpk6euvGs2Ig/U4MlUy3pG5Ehd2Ebn8Rz31JXpa A/GPaJ5DjzV0q9mkYkGDLYI3J/J+s2u0Kr0VswLaIN3WJn7kKEDwfc4s2kaAYfblE/p0zVir EVBum723MFH4DxhTrOoWgta2nyRHOoi0z0EVhYA+D86mFPWKb9roWvtnmFlssggGmqbJEMvt LbYnlSt3v32nfUXh12aQPwU/LCGIzq4oFNVrNp3aWPnSajLPpQARAQABzTxGcsOpZMOpcmlj IFBpZXJyZXQgKGZlcGl0cmUpIDxmcmVkZXJpYy5waWVycmV0QHF1YmVzLW9zLm9yZz7CwXgE EwECACIFAlwkq3ECGwMGCwkIBwMCBhUIAgkKCwQWAgMBAh4BAheAAAoJEEhAELXNxXbiPLkQ AI6kEDyLl0TpvRDOanuD5YkVHLEYVuG62CJNwMjFoFRgZJnl+Fb5HBgthU9lBdMqNySg+s8y ekM9KRlUHKYjwAsyjPIjRtca4bH3V11/waKpvPBgPsC75CxSZ9uITprfEqX7V2OLbrYW94qw R8jX+n/wlEGG3pbfXG7FTnjxQWM0E0aSvO0Yb5EkjiJ7cwEiqvL04Uekt5I2Zc8iRDF9kneI NiNhzRtvrR1UN6KtiZNSk2NsLOptrUQ/1AU5jwH4mnQQymtYDsWddlRoDRC/bsAow7cBudj+ lekM3cNRZOazKZx5UPnN8nqvD7FqeAcZBVyrHZ4hcWqABaJEPv6CCHRiLQnGR9ze2O5Yh+/B unrOJdjdsib1ZECH9GtIcj4mmPAN84NO4r8a6Sn9jsXkd2Wj2N5wNrZMPslhfiaW2VHTfLmA Ot+wRwLRsFfqLykF8hMlNXXE4frxotwa6+PTd48Ws9H9aalSs0lebsG0623b4mBjy1coxFUw eclPInXsPEdu/Yu2r7xrgGouXH8KgDhqlqq60UaA5n/0XhIeZ8tBTYs+1B5/C9TjvNAUsBko b1EpfW3J4Gq14GqwK+eodOTL5t2f2PWN/IQyop/j0FMgVU5/PUS0pciz5ybyIJBLhbsJBvKb xM/NyxHrmNwGEknpoeq+XT8rEJ+/Ag8Wnjl0zsFNBFwkq3EBEADAPJdyFy4KeYpuGATWwWCN He8XNVqBplV0yVlT5pSiCyA3UK34JlGX9YJOj/FlMZGgh61vbiK+piRjm/lyb128wpMjnoOm qpbSLbra8NP8Mu5FZMcv8OxrSIr/RHq2heFg1j11QOMGwe6vPC918qpzmiaYj2qpKY/RYsG8 V+9+dpLEU75+mpHU7GlECfPmHYbnsismL/4+xH+8BG56yg0UFbfrNYonIQFSn5k/w6i7jt7M ++ZmWfEV5nCP2qvzeYDGAL6BbWVOjuDhrKsAIKnomCyy+MjcVP955PVdN2+OlPJng07oKtQr 5aNCaNpv/i4gLO1IScdfDwm6gdfB2Zg/7jTJrKw0kWPFl9rHfN7dLAR28u3uT8Rhicjdd7hg YlDWdbImhNL/Z7iL3eayH7T9qAVNU587MhWvIREyE1gj22cs0e1m6qMFpbFYG0709N2UwlpA H+Pd35bTi9q2o1pH91xBYH6QvvrwsuVYHwuc3xXLRVRXWXY8xvNFSlY1LB8A46JOtV/ZodYD yhxVGbeWp820cb0s1f689XCXqFYAzTfCit+EeboYORN5CGioXzS+z0S9IhPbdUuvqs7xvC24 8bM7nm84YdgVM7HWybOtpRpWpycwGs73IvbxyLE9aPe/Zw4PTKWvbJlcFioofLwTQE1XvWom FPD9LLrBl5NUjQARAQABwsFfBBgBAgAJBQJcJKtxAhsMAAoJEEhAELXNxXbilSkP/2NcazvU DGyQLm7tFp4HNqSQfFJ3+chzxfOOdNtdWE+RFetyx9R8DBGrPX8hjITWD9ZA2bbZZ+J+a39v yY7bNZkCGbWzPGK//O1cInL4Ecmj7Xm8DXjk3E2Xzv1YrZk/GBz9xK8mWXwhn90SHNadEf28 ghMXcmUJSqT+KTxQQjUVaEtQDdzQnYQKh/dHxs760QSAnXkWr0YVYxk8q8aa+G8iAkNJcb+W x5gWEw4ft3HpKMRq74OQvWayy0fXpTlusdnvZs0VVMeRpCW6iCt9UmsbfG6Nyf2MKKbWRJnt jy8mjJiFjiJ2j9s4yNIookRv8IfocULuhnx5FWsvIzX2Vwcd7G5objnY1DlCNQrhJUs/geoC UBjBJp7sfbHakWfTKxZjFsuCXT1dCEN7JXX6ABOshzDTwB0kq7Bq/EkOzPDQGfOPoX2h1KjH uvGWw5cBe8WLnEuhIyf/DWfMS1LbjFB4JlMUEcood5xvE4owpfZog+0a9gpBS6cg9bMgRUex 1C+w3fudJdPQwIRAjJgac0jTT6uDY8re9RhBDv83PRSM7AzxqEFvDj8K46dg1XvJcKs7K5PX pm5Pw4stVEAxIks5uR62wxygImkdvgjQRzJe4JWwAniBWsZG+cNYj6xcItqkupIb4PeOWgNQ QMhGv8DnbAdOOOnumAXWq0+wl5uP
  • Cc: Juergen Gross <JGross@xxxxxxxx>, "George.Dunlap@xxxxxxxxxx" <George.Dunlap@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>
  • Delivery-date: Tue, 03 Nov 2020 13:23:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>



Le 10/31/20 à 5:08 AM, marmarek@xxxxxxxxxxxxxxxxxxxxxx a écrit :
On Sat, Oct 31, 2020 at 04:27:58AM +0100, Dario Faggioli wrote:
On Sat, 2020-10-31 at 03:54 +0100, marmarek@xxxxxxxxxxxxxxxxxxxxxx
wrote:
On Sat, Oct 31, 2020 at 02:34:32AM +0000, Dario Faggioli wrote:
(XEN) *** Dumping CPU7 host state: ***
(XEN) Xen call trace:
(XEN)    [<ffff82d040223625>] R _spin_lock+0x35/0x40
(XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
(XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
(XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
(XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

And so on, for (almost?) all CPUs.

Right. So, it seems like a live (I would say) lock. It might happen on
some resource which his shared among domains. And introduced (the
livelock, not the resource or the sharing) in 4.14.

Just giving a quick look, I see that vmx_do_resume() calls
vmx_clear_vmcs() which calls on_selected_cpus() which takes the
call_lock spinlock.

And none of these seems to have received much attention recently.

But this is just a really basic analysis!

I've looked at on_selected_cpus() and my understanding is this:
1. take call_lock spinlock
2. set function+args+what cpus to be called in a global "call_data" variable
3. ask CPUs to execute that function (smp_send_call_function_mask() call)
4. wait for all requested CPUs to execute the function, still holding
the spinlock
5. only then - release the spinlock

So, if any CPU does not execute requested function for any reason, it
will keep the call_lock locked forever.

I don't see any CPU waiting on step 4, but also I don't see call traces
from CPU3 and CPU8 in the log - that's because they are in guest (dom0
here) context, right? I do see "guest state" dumps from them.
The only three CPUs that do logged xen call traces and are not waiting on that
spin lock are:

CPU0:
(XEN) Xen call trace:
(XEN)    [<ffff82d040240f89>] R vcpu_unblock+0x9/0x50
(XEN)    [<ffff82d0402e0171>] S vcpu_kick+0x11/0x60
(XEN)    [<ffff82d0402259c8>] S tasklet.c#do_tasklet_work+0x68/0xc0
(XEN)    [<ffff82d040225a59>] S tasklet.c#tasklet_softirq_action+0x39/0x60
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

CPU4:
(XEN) Xen call trace:
(XEN)    [<ffff82d040227043>] R set_timer+0x133/0x220
(XEN)    [<ffff82d040234e90>] S credit.c#csched_tick+0/0x3a0
(XEN)    [<ffff82d04022660f>] S timer.c#timer_softirq_action+0x9f/0x300
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

CPU14:
(XEN) Xen call trace:
(XEN)    [<ffff82d040222dc0>] R do_softirq+0/0x10
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

I'm not sure if any of those is related to that spin lock,
on_selected_cpus() call, or anything like that...


Hi,
Some new info :). Marek has sent me a patch 
(https://gist.github.com/marmarek/810ae5c079d218928535514b08a07716) to help in 
debugging this issue. With default credit2 as scheduler, here are some log from 
two successive crashes:
- https://gist.github.com/fepitre/76a1e154249c7326c743d6a6d004a67b
- https://gist.github.com/fepitre/ab00091980cb8110fb3d349aecc3a644

Also, I've been testing "dom0=pvh" option and the system remains stable since 
more than 24h. At least, I've not experimented any freeze since I'm using this option.

I hope that would give some hints.

Regards,
Frédéric

Attachment: OpenPGP_0x484010B5CDC576E2.asc
Description: application/pgp-keys

Attachment: OpenPGP_signature
Description: OpenPGP digital signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.