
Re: HVM/PVH Balloon crash


  • To: Elliott Mitchell <ehem+xen@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Tue, 7 Sep 2021 17:57:10 +0200
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Tue, 07 Sep 2021 15:57:26 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 07.09.2021 17:03, Elliott Mitchell wrote:
> On Tue, Sep 07, 2021 at 10:03:51AM +0200, Jan Beulich wrote:
>> On 06.09.2021 22:47, Elliott Mitchell wrote:
>>> On Mon, Sep 06, 2021 at 09:52:17AM +0200, Jan Beulich wrote:
>>>> On 06.09.2021 00:10, Elliott Mitchell wrote:
>>>>> I brought this up a while back, but it still appears to be present and
>>>>> the latest observations appear rather serious.
>>>>>
>>>>> I'm unsure of the entire set of conditions for reproduction.
>>>>>
>>>>> Domain 0 on this machine is PV (I think the BIOS enables the IOMMU, but
>>>>> this is an older AMD IOMMU).
>>>>>
>>>>> This has been confirmed with Xen 4.11 and Xen 4.14.  This includes
>>>>> Debian's patches, but those are mostly backports or environment
>>>>> adjustments.
>>>>>
>>>>> Domain 0 is presently using a 4.19 kernel.
>>>>>
>>>>> The trigger is creating an HVM or PVH domain where memory does not
>>>>> equal maxmem.
>>>>
>>>> I take it you refer to "[PATCH] x86/pod: Do not fragment PoD memory
>>>> allocations" submitted very early this year? There you said the issue
>>>> was with a guest's maxmem exceeding host memory size. Here you seem to
>>>> be talking of PoD in its normal form of use. Personally, I use this
>>>> all the time (unless enabling PCI pass-through for a guest, as the
>>>> two are incompatible). I've not observed any badness as severe as
>>>> you've described.
>>>
>>> I've got very little idea what is occurring as I'm expecting to be doing
>>> ARM debugging, not x86 debugging.
>>>
>>> I was starting to wonder whether this was widespread or not.  As such I
>>> was reporting the factors which might be different in my environment.
>>>
>>> The one which sticks out is that the computer has an older AMD
>>> processor (are you a 100% Intel shop?).
>>
>> No, AMD is as relevant to us as is Intel.
>>
>>>  The processor has the AMD NPT feature, but a very
>>> early/limited IOMMU (according to Linux "AMD IOMMUv2 functionality not
>>> available").
>>>
>>> Xen 4.14 refused to load the Domain 0 kernel as PVH (not enough of an
>>> IOMMU).
>>
>> That sounds odd at the first glance - PVH simply requires that there be
>> an (enabled) IOMMU. Hence the only thing I could imagine is that Xen
>> doesn't enable the IOMMU in the first place for some reason.
> 
> Doesn't seem that odd to me.  I don't know the differences between the
> first and second versions of the AMD IOMMU, but could well be v1 was
> judged not to have enough functionality to bother with.
> 
> What this does make me wonder is, how much testing was done on systems
> with functioning NPT, but disabled IOMMU?

No idea. During development it may happen (rarely) that one disables
the IOMMU on purpose. Beyond that - can't tell.

>  Could be this system is in an
> intergenerational hole, and some spot in the PVH/HVM code assumes that
> the presence of NPT guarantees the presence of an operational IOMMU.
> Otherwise, if there was some copy-and-paste while writing the IOMMU
> code, some portion of it might be checking for the presence of NPT
> instead of the presence of an IOMMU.

This is all very speculative; I consider what you suspect not very
likely, but also not entirely impossible - not least because for a long
time we've been running without shared page tables on AMD.

I'm afraid without technical data and without knowing how to repro, I
don't see a way forward here.

Jan
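[For reference, the PoD trigger described earlier in the thread - an HVM
or PVH guest whose memory does not equal maxmem - corresponds to an xl
guest configuration along these lines; the domain name and sizes below
are illustrative, not taken from the reporter's setup:]

```
# Illustrative xl.cfg fragment: any HVM/PVH guest started with
# memory < maxmem boots in Populate-on-Demand (PoD) mode, with
# the gap between the two values backed by the PoD cache.
name   = "pod-test"   # hypothetical domain name
type   = "hvm"        # or "pvh"
memory = 1024         # MiB actually populated at boot
maxmem = 4096         # MiB of guest-visible RAM; the difference is PoD
```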
