
Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 15 Jun 2021 10:17:24 +0100
  • Cc: Igor Druzhinin <igor.druzhinin@xxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxx>, <boris.ostrovsky@xxxxxxxxxx>, <stephen.s.brennan@xxxxxxxxxx>
  • Delivery-date: Tue, 15 Jun 2021 09:17:57 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15/06/2021 09:12, Roger Pau Monné wrote:
> On Mon, Jun 14, 2021 at 06:01:17PM +0200, Jan Beulich wrote:
>> On 14.06.2021 15:27, Roger Pau Monné wrote:
>>> On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
>>>> On 14.06.2021 13:15, Igor Druzhinin wrote:
>>>>> Hi, Boris, Stephen, Roger,
>>>>>
>>>>> We have stress tested the recent changes on staging-4.13, which
>>>>> include a backport of the patch named in the subject. Since the
>>>>> backport is identical to what is on the master branch and all of the
>>>>> prerequisites are in place, we have no reason to believe master is
>>>>> not affected in the same way.
>>>>>
>>>>> Here is what we got by running heavy stress testing including multiple
>>>>> repeated VM lifecycle operations with storage and network load:
>>>>>
>>>>>
>>>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>>>> ----[ Xen-4.13.3-10.7-d  x86_64  debug=y   Not tainted ]----
>>>>> CPU:    17
>>>>> RIP:    e008:[<ffff82d080246b65>] common/timer.c#active_timer+0xc/0x1b
>>>>> RFLAGS: 0000000000010046   CONTEXT: hypervisor (d675v0)
>>>>> rax: 0000000000000000   rbx: ffff83137a8ed300   rcx: 0000000000000000
>>>>> rdx: ffff83303fff7fff   rsi: ffff83303fff2549   rdi: ffff83137a8ed300
>>>>> rbp: ffff83303fff7cf8   rsp: ffff83303fff7cf8   r8:  0000000000000001
>>>>> r9:  0000000000000000   r10: 0000000000000011   r11: 0000168b0cc08083
>>>>> r12: 0000000000000000   r13: ffff82d0805cf300   r14: ffff82d0805cf300
>>>>> r15: 0000000000000292   cr0: 0000000080050033   cr4: 00000000000426e0
>>>>> cr3: 00000013c1a32000   cr2: 0000000000000000
>>>>> fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>>>>> ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>>>>> Xen code around <ffff82d080246b65> (common/timer.c#active_timer+0xc/0x1b):
>>>>>   0f b6 47 2a 84 c0 75 02 <0f> 0b 3c 04 76 02 0f 0b 3c 02 0f 97 c0 5d c3 55
>>>>> Xen stack trace from rsp=ffff83303fff7cf8:
>>>>>     ffff83303fff7d48 ffff82d0802479f1 0000168b0192b846 ffff83137a8ed328
>>>>>     000000001d0776eb ffff83137a8ed2c0 ffff83133ee47568 ffff83133ee47000
>>>>>     ffff83133ee47560 ffff832b1a0cd000 ffff83303fff7d78 ffff82d08031e74e
>>>>>     ffff83102d898000 ffff83133ee47000 ffff83102db8d000 0000000000000011
>>>>>     ffff83303fff7dc8 ffff82d08027df19 0000000000000000 ffff83133ee47060
>>>>>     ffff82d0805d0088 ffff83102d898000 ffff83133ee47000 0000000000000011
>>>>>     0000000000000001 0000000000000011 ffff83303fff7e08 ffff82d0802414e0
>>>>>     ffff83303fff7df8 0000168b0192b846 ffff83102d8a4660 0000168b0192b846
>>>>>     ffff83102d8a4720 0000000000000011 ffff83303fff7ea8 ffff82d080241d6c
>>>>>     ffff83133ee47000 ffff831244137a50 ffff83303fff7e48 ffff82d08031b5b8
>>>>>     ffff83133ee47000 ffff832b1a0cd000 ffff83303fff7e68 ffff82d080312b65
>>>>>     ffff83133ee47000 0000000000000000 ffff83303fff7ee8 ffff83102d8a4678
>>>>>     ffff83303fff7ee8 ffff82d0805d6380 ffff82d0805d5b00 ffffffffffffffff
>>>>>     ffff83303fff7fff 0000000000000000 ffff83303fff7ed8 ffff82d0802431f5
>>>>>     ffff83133ee47000 0000000000000000 0000000000000000 0000000000000000
>>>>>     ffff83303fff7ee8 ffff82d08024324a 00007ccfc00080e7 ffff82d08033930b
>>>>>     ffffffffb0ebd5a0 000000000000000d 0000000000000062 0000000000000097
>>>>>     000000000000001e 000000000000001f ffffffffb0ebd5ad 0000000000000000
>>>>>     0000000000000005 000000000003d91d 0000000000000000 0000000000000000
>>>>>     00000000000003d5 000000000000001e 0000000000000000 0000beef0000beef
>>>>> Xen call trace:
>>>>>     [<ffff82d080246b65>] R common/timer.c#active_timer+0xc/0x1b
>>>>>     [<ffff82d0802479f1>] F stop_timer+0xf5/0x188
>>>>>     [<ffff82d08031e74e>] F pt_save_timer+0x45/0x8a
>>>>>     [<ffff82d08027df19>] F context_switch+0xf9/0xee0
>>>>>     [<ffff82d0802414e0>] F common/schedule.c#sched_context_switch+0x146/0x151
>>>>>     [<ffff82d080241d6c>] F common/schedule.c#schedule+0x28a/0x299
>>>>>     [<ffff82d0802431f5>] F common/softirq.c#__do_softirq+0x85/0x90
>>>>>     [<ffff82d08024324a>] F do_softirq+0x13/0x15
>>>>>     [<ffff82d08033930b>] F vmx_asm_do_vmentry+0x2b/0x30
>>>>>
>>>>> ****************************************
>>>>> Panic on CPU 17:
>>>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>>>> ****************************************
>>>> Since this suggests a timer was found on the list without ever having been
>>>> initialized, I've spotted a case where this indeed could now happen. Could
>>>> you give the patch below a try?
>>>>
>>>> Jan
>>>>
>>>> x86/vpt: fully init timers before putting onto list
>>>>
>>>> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
>>>> iterating the list and acting on the timers of the list entries will no
>>>> longer be kept from entering their loops by create_periodic_time()'s
>>>> holding of that lock. Therefore at least init_timer() needs to be
>>>> called ahead of list insertion, but keep it and set_timer() together.
>>>>
>>>> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
>>>> Reported-by: Igor Druzhinin <igor.druzhinin@xxxxxxxxxx>
>>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>>> Thanks for looking into this so quickly, and sorry for not realizing
>>> it myself when relaxing the locking. Adding the timer to the list
>>> before it was fully initialized was a latent issue even while the
>>> lock still provided protection.
>>>
>>> Provided testing shows the issue is fixed:
>> I guess the change here is needed anyway, even if testing finds there's
>> still something amiss?
> Indeed, just wondered whether there might be other instances using a
> similar pattern, but I'm not able to spot any.
>
> It might even be better to fix other issues (if any) on a different
> commit.

To be honest, this change is clearly good, and necessary.  I'd be
tempted to commit it now, as is, irrespective of whether there are
further bugs in this area.

~Andrew



 

