[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] x86/mm: Short circuit damage from "fishy" ref/typecount failure


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 25 Jan 2021 17:59:02 +0000
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=7oor4WSttdgqgfmJHFltiV8nncWX5IH/D45bE5zwbhg=; b=mMcU6Y28f7ZVs4qpZC3qBefxAoaD0QxFn66JGGo0eZ9FseYKY/UkCbf+4qT3Ml4ZDOnjaeAtv6v3EzuzZsHA2ZMm+5ro+ZZu4/smVSluDc4UKHcxyIz1lwBFdv5fSjCJUpREPCI6mYCiCK2y5Uh8O0IJ3b7kLNd/fv1d7duopHw9TjbWlPOzh1goXzaBwm2B6h4RqJ0WUKjA4G4/grbVs/ghIUgmZRbxy0l4eUKlXOzZIoF5j2JThjHdb/ThT4pCWJDSb8sJJh1Dp2HdxHpH/zYWIE3NKpnhI/7RSjEjoz1wyahNu6FUQ3HTvSPrB3GBbVQKayUsdaw/JAS/jqClaA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=KqJxvNS6exAM1P9tDNi527rBRu20f+5WwxYFk9Q3NMQdKj1IsxwAoiAgotK0SpP8fyi+6MY0wjJoyP4BV4Fc6gEybffDTFnb7vgLQ9wnOf6n67bw0ekj6kaogQ8bvH1MYR8intp2xRWx1zRhAlt+e5pzkey242PgR481WdvZg/CCJBExZN4LNUaAyvyms2L3mmKp2HlHyiaR6ZtK5Gto/5p3zGyg2dEkUyTUH6eE3EnTIo3JbJyEnqX1GrVkn1lt7ZCX7czJxBRZVuRZfM2VbqD8VT/VBMguUWPruJTeGqJrDuXSb856SC+3I27vSlHeEqsSLqS7xMeYfZ9GOZ22Yw==
  • Authentication-results: esa4.hc3370-68.iphmx.com; dkim=pass (signature verified) header.i=@citrix.onmicrosoft.com
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Paul Durrant <paul@xxxxxxx>, Tamas K Lengyel <tamas@xxxxxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Mon, 25 Jan 2021 17:59:14 +0000
  • Ironport-sdr: qsBeJ5z774Ajkvi7nyUysousjv3pLHJWqRoSg0TW/yAE6In9JJiCaMWzthQ7STVH03/enx4B45 ilfl2baRFG/B4pMhLf4/Zhv9YgAKC66kUIzIYz03nSG02OWMC0VtZu9Ga8MJCk3iJ4U6SSPf8e CROXY5yt0iFuQXKFKnsYBaPy3BRKIK+FsYyFcwmggdkV9yuLsNmbgUq+g35PbsdVtL+809xlZO /RdDIbeeSTXf07vk6KNYuGDVV+vFP7k0ziALQhqbom+PwX9IzMjJPuviRxRFs9Mqa/exkxTsm1 0/0=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 20/01/2021 08:06, Jan Beulich wrote:
> On 19.01.2021 19:09, Andrew Cooper wrote:
>> On 19/01/2021 16:48, Jan Beulich wrote:
>>> On 19.01.2021 14:02, Andrew Cooper wrote:
>>>> This code has been copied in 3 places, but it is problematic.
>>>>
>>>> All cases will hit a BUG() later in domain teardown, when a the missing
>>>> type/count reference is underflowed.
>>> I'm afraid I could use some help with this: Why would there
>>> be a missing reference, when the getting of one failed?
>> Look at the cleanup logic for the associated fields.
>>
>> Either the plain ref fails (impossible without other fatal refcounting
>> errors AFAICT), or the typeref fails (a concern, but impossible AFAICT).
> In principle I would agree, if there wasn't the question of
> count overflows. The type count presently is 56 bits wide,
> while the general refcount has 54 bits. It'll be a long time
> until they overflow, but it's not impossible. The underlying
> problem there that I see is - where do we draw the line
> between "can't possibly overflow in practice" (as we would
> typically assume for 64-bit counters) and "is to be expected
> to overflow (as we would typically assume for 32-bit
> counters)?

Ok fine - I was treating 54 bits as "not going to happen in practice".

A PV guest needs 2^43 pages of RAM to turn into pagetables to approach
the general refcount limit.  This is more RAM than most people can
accord, and this is way in excess of our security supported limits.


Errors in this area are already hit BUGs in loads of cases, because that
is less bad than the alternatives.

In principle, and as previously discussed, some issues in this area
could be fixed by porting refcount_t from PaX/Linux KSPP which will turn
refcount overflows into memory leaks, which is an even less bad alternative.

>
> Also, as far as "impossible" here goes - the constructs all
> anyway exist only to deal with what we consider impossible.
> The question therefore really is of almost exclusively
> theoretical nature, and hence something like a counter
> possibly overflowing imo needs to be accounted for as
> theoretically possible, albeit impossible with today's
> computers and realistic timing assumptions. If a counter
> overflow occurred, it definitely wouldn't be because of a
> bug in Xen, but because of abnormal behavior elsewhere.
> Hence I remain unconvinced it is appropriate to deal with
> the situation by BUG().

I'm not sure how to be any clearer.

I am literally not changing the current behaviour.  Xen *will* hit a
BUG() if any of these domain_crash() paths are taken.

If you do not believe me, then please go and actually check what happens
when simulating a ref-acquisition failure.


What I am doing is removing complexity (the point of the change) which
gives a false sense of the error being survivable.

If you want to do something other than BUG() in these cases, then you
need to figure some way for the teardown logic to identify which ref
went missing, but this would be a different, follow-on patch.

> But yes, if otoh we assume the failure here to be the result
> of a bug elsewhere in Xen (and not an overflow), then BUG()
> may be warranted. Yet afaic these constructs weren't meant
> to deal with bugs elsewhere in Xen, but with the
> "impossible". So if we change our collective mind here, I
> think the conversion to BUG() would then need accompanying
> by respective commentary.

BUG() is, and has always been, Xen's way of dealing with impossibles,
particularly when it comes to memory handling.

This isn't a "changing minds" occasion.  Removals of BUG()s elsewhere
pertains to logical error based on guest state, which is indeed
inappropriate error handling.

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.