[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] x86/mm: Short circuit damage from "fishy" ref/typecount failure

On 25.01.2021 18:59, Andrew Cooper wrote:
> On 20/01/2021 08:06, Jan Beulich wrote:
>> On 19.01.2021 19:09, Andrew Cooper wrote:
>>> On 19/01/2021 16:48, Jan Beulich wrote:
>>>> On 19.01.2021 14:02, Andrew Cooper wrote:
>>>>> This code has been copied in 3 places, but it is problematic.
>>>>> All cases will hit a BUG() later in domain teardown, when a the missing
>>>>> type/count reference is underflowed.
>>>> I'm afraid I could use some help with this: Why would there
>>>> be a missing reference, when the getting of one failed?
>>> Look at the cleanup logic for the associated fields.
>>> Either the plain ref fails (impossible without other fatal refcounting
>>> errors AFAICT), or the typeref fails (a concern, but impossible AFAICT).
>> In principle I would agree, if there wasn't the question of
>> count overflows. The type count presently is 56 bits wide,
>> while the general refcount has 54 bits. It'll be a long time
>> until they overflow, but it's not impossible. The underlying
>> problem there that I see is - where do we draw the line
>> between "can't possibly overflow in practice" (as we would
>> typically assume for 64-bit counters) and "is to be expected
>> to overflow (as we would typically assume for 32-bit
>> counters)?
> Ok fine - I was treating 54 bits as "not going to happen in practice".
> A PV guest needs 2^43 pages of RAM to turn into pagetables to approach
> the general refcount limit.  This is more RAM than most people can
> accord, and this is way in excess of our security supported limits.
> Errors in this area are already hit BUGs in loads of cases, because that
> is less bad than the alternatives.
> In principle, and as previously discussed, some issues in this area
> could be fixed by porting refcount_t from PaX/Linux KSPP which will turn
> refcount overflows into memory leaks, which is an even less bad alternative.
>> Also, as far as "impossible" here goes - the constructs all
>> anyway exist only to deal with what we consider impossible.
>> The question therefore really is of almost exclusively
>> theoretical nature, and hence something like a counter
>> possibly overflowing imo needs to be accounted for as
>> theoretically possible, albeit impossible with today's
>> computers and realistic timing assumptions. If a counter
>> overflow occurred, it definitely wouldn't be because of a
>> bug in Xen, but because of abnormal behavior elsewhere.
>> Hence I remain unconvinced it is appropriate to deal with
>> the situation by BUG().
> I'm not sure how to be any clearer.
> I am literally not changing the current behaviour.  Xen *will* hit a
> BUG() if any of these domain_crash() paths are taken.
> If you do not believe me, then please go and actually check what happens
> when simulating a ref-acquisition failure.

Well, okay, if you don't want to share the knowledge you've
gained, I will have to go this route. Will likely take
longer though than if you had tried to clarify how a ref
would go "missing" and hence allow for an underflow. This
isn't a matter of believing you, but one of wanting to
understand the situation you say you've run into.


> What I am doing is removing complexity (the point of the change) which
> gives a false sense of the error being survivable.
> If you want to do something other than BUG() in these cases, then you
> need to figure some way for the teardown logic to identify which ref
> went missing, but this would be a different, follow-on patch.
>> But yes, if otoh we assume the failure here to be the result
>> of a bug elsewhere in Xen (and not an overflow), then BUG()
>> may be warranted. Yet afaic these constructs weren't meant
>> to deal with bugs elsewhere in Xen, but with the
>> "impossible". So if we change our collective mind here, I
>> think the conversion to BUG() would then need accompanying
>> by respective commentary.
> BUG() is, and has always been, Xen's way of dealing with impossibles,
> particularly when it comes to memory handling.
> This isn't a "changing minds" occasion.  Removals of BUG()s elsewhere
> pertains to logical error based on guest state, which is indeed
> inappropriate error handling.
> ~Andrew



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.