
Re: PCI pass-through vs PoD


  • To: Andrew Cooper <amc96@xxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 18 Nov 2021 09:08:34 +0100
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Anthony Perard <anthony.perard@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Thu, 18 Nov 2021 08:09:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 17.11.2021 14:07, Andrew Cooper wrote:
> On 17/11/2021 11:23, Jan Beulich wrote:
>> On 17.11.2021 12:09, Andrew Cooper wrote:
>>> On 17/11/2021 10:13, Jan Beulich wrote:
>>>> On 17.11.2021 09:55, Roger Pau Monné wrote:
>>>>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote:
>>>>>> On 13.09.2021 11:02, Jan Beulich wrote:
>>>>>>> libxl__domain_config_setdefault() checks whether PoD is going to be
>>>>>>> enabled and fails domain creation if at the same time devices would get
>>>>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed.
>>>>> I'm unsure whether allowing enabling the IOMMU with PoD is the right
>>>>> thing to do, at least for our toolstack.
>>>> May I ask about the reasons of you being unsure?
>>> PoD and passthrough is a total nonsense.  You cannot have IOMMU mappings
>>> to bits of the guest physical address space which don't exist.
>>>
>>> It is now the case that IOMMU (or not) must be specified at domain
>>> creation time, which is ahead of creating PoD pages.  Certainly as far
>>> as Xen is concerned, the logic probably wants reversing to have
>>> add_to_physmap&friends reject PoD if an IOMMU was configured.
>>>
>>> A toolstack could, in principle, defer the decision to first device
>>> assignment.
>> Right, which is what I consider the preferred approach.
> 
> Why?
> 
> Just because something is technically possible, does not mean it is an 
> appropriate or clever thing to do.
> 
> In this case, we're talking about extra complexity in Xen and the 
> toolstack, which in the very best case comes with unattractive user 
> experience properties, to "fix" an issue which doesn't happen in practice.

IOW you're suggesting we wait for the first report of this being a problem.

>>> and liable to suffer -ENOMEM,
>> Not if (as suggested) we first check that the PoD cache is large enough
>> to cover all PoD entries.
> 
> Just because at this instant we have enough free RAM to force-populate 
> all PoD entries doesn't mean the same is true in 2 minutes time after 
> we've been slowly force-populating a massive VM.
> 
> Yes, there are heuristics we can use to short-circuit the failure early, 
> but that's still spelt -ENOMEM and reported to the user as such.
> 
> The only way to succeed here is to force populate the VM and to have not 
> suffered -ENOMEM by the end of this task.

I'm afraid I can't follow you here at all. The PoD cache is memory already
owned by the guest. As long as no new PoD entries get made out of thin air
(i.e. other than taking the backing page and placing it in the PoD cache),
there's no -ENOMEM possible here. That's precisely why entry count wants
to be checked against count of "PoD cache" pages to be sure.
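
To make the comparison concrete, here is a minimal sketch of such a check.
The struct and function names are hypothetical stand-ins for illustration
only, not Xen's actual types or interfaces; the one assumption carried over
from the text is that PoD cache pages are already owned by the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-domain PoD accounting; not Xen's real type. */
struct pod_state {
    unsigned long entry_count; /* P2M entries still of PoD type */
    unsigned long cache_pages; /* pages in the PoD cache, already guest-owned */
};

/*
 * Returns true if every outstanding PoD entry could be backed purely from
 * the PoD cache, i.e. without allocating any new memory. When this holds,
 * force-populating cannot run out of memory as long as no new PoD entries
 * are created in the meantime.
 */
static bool pod_cache_covers_entries(const struct pod_state *pod)
{
    assert(pod != NULL);
    return pod->entry_count <= pod->cache_pages;
}
```

If the predicate is false (for example before the balloon driver has run),
force-populating from the cache alone cannot succeed, so the request would
have to be refused up front rather than attempted.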

>>> or we have
>>> to reject a control operation with -EBUSY for a task which is dependent
>>> on the guest kernel actions in a known-buggy area.
>> Why reject anything?
> 
> Because the guest kernel has no knowledge of nor the ability to query 
> the PoD status of a page, the only way to not have things malfunction is 
> to enforce that there are no P2M entries of type PoD when devices are 
> assigned.
> 
> If you don't want to / can't force-populate the entire VM prior to 
> having device assigned, then the assign operation needs to fail.

Well, yes, that's what I have been saying from the beginning. All we
appear to disagree on is whether tool stack or hypervisor should
actually put effort into doing such a force-populate.

>>> There is no point trying to make this work.  If a user wants a device,
>>> they don't get to have PoD.  Anything else is a waste of time and effort
>>> on our behalf for a usecase that doesn't exist in practice.
>> Not sure where you take the latter from. I suppose I'll submit the patch
>> as I have it now (once I have properly resolved dependencies on other
>> patches I have queued and/or pending), and if that's not deemed acceptable
>> plus if at the same time I don't really agree with proposed alternatives,
>> I'll leave fixing the bug to someone else. Of course the expectation then
>> is that such a bug fix come forward within a reasonable time frame ...
> 
> What bug?  PoD and PCI Passthrough are mutually exclusive technologies.

I wonder to what extent you've read my earlier mails properly. After initially
only suspecting this might be possible, I did _verify_ that I can assign a
device with the guest still in PoD mode, including before the balloon
driver has kicked in (in which case even force-populate wouldn't help, i.e.
assignment ought to fail no matter what). While initially I thought this
would have been an unintended side effect of f89f555827a6 ("remove late
(on-demand) construction of IOMMU page tables"), I now think this has been
an issue even before. There's no check in the hypervisor (in particular 
arch_iommu_use_permitted() hasn't been checking for PoD so far, which is
used during assignment only anyway), while the tool stack checks only
during domain construction afaics (in libxl__domain_config_setdefault()).
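
As an illustration of the kind of assignment-time check that is missing, a
minimal sketch follows. Everything in it is hypothetical (the struct, the
function name, and the choice of -EBUSY, which is merely the error code
mentioned earlier in this thread); it is not the hypervisor's code and not
the patch under discussion.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical view of a guest; only the field needed for the sketch. */
struct guest_state {
    unsigned long pod_entries; /* P2M entries still of PoD type */
};

/*
 * Gate to be consulted when a device is about to be assigned, not just
 * at domain construction. Returns 0 if assignment may proceed, or a
 * negative errno value if PoD entries are still outstanding.
 */
static int check_assign_allowed(const struct guest_state *g)
{
    assert(g != NULL);
    return g->pod_entries ? -EBUSY : 0;
}
```

The point of the sketch is only the placement: the condition is evaluated
at the time of assignment, so a guest still in PoD mode gets refused even
if domain creation itself passed the tool stack's check.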

Jan




 

