
Re: Proposal for virtual IOMMU binding b/w vIOMMU and passthrough devices


  • To: Julien Grall <julien@xxxxxxx>, Rahul Singh <Rahul.Singh@xxxxxxx>
  • From: Michal Orzel <michal.orzel@xxxxxxx>
  • Date: Thu, 27 Oct 2022 19:18:20 +0200
  • Cc: Xen developer discussion <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, Michal Orzel <Michal.Orzel@xxxxxxx>, Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>
  • Delivery-date: Thu, 27 Oct 2022 17:18:41 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Rahul,

On 27/10/2022 18:33, Julien Grall wrote:
> 
> 
> On 27/10/2022 17:08, Rahul Singh wrote:
>> Hi Julien,
> 
> Hi Rahul,
> 
>>> On 26 Oct 2022, at 8:48 pm, Julien Grall <julien@xxxxxxx> wrote:
>>>
>>>
>>>
>>> On 26/10/2022 15:33, Rahul Singh wrote:
>>>> Hi Julien,
>>>
>>> Hi Rahul,
>>>
>>>>> On 26 Oct 2022, at 2:36 pm, Julien Grall <julien@xxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 26/10/2022 14:17, Rahul Singh wrote:
>>>>>> Hi All,
>>>>>
>>>>> Hi Rahul,
>>>>>
>>>>>> At Arm, we started to implement the POC to support 2 levels of page 
>>>>>> tables/nested translation in SMMUv3.
>>>>>> To support nested translation for guest OS Xen needs to expose the 
>>>>>> virtual IOMMU. If we passthrough the
>>>>>> device to the guest that is behind an IOMMU and virtual IOMMU is enabled 
>>>>>> for the guest there is a need to
>>>>>> add IOMMU binding for the device in the passthrough node as per [1]. 
>>>>>> This email is to get an agreement on
>>>>>> how to add the IOMMU binding for guest OS.
>>>>>> Before I will explain how to add the IOMMU binding let me give a brief 
>>>>>> overview of how we will add support for virtual
>>>>>> IOMMU on Arm. In order to implement virtual IOMMU Xen needs SMMUv3 Nested 
>>>>>> translation support. SMMUv3 hardware
>>>>>> supports two stages of translation. Each stage of translation can be 
>>>>>> independently enabled. An incoming address is logically
>>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2 
>>>>>> which translates the IPA to the output PA. Stage 1 is
>>>>>> intended to be used by a software entity (guest OS) to provide isolation 
>>>>>> or translation to buffers within the entity, for example,
>>>>>> DMA isolation within an OS. Stage 2 is intended to be available in 
>>>>>> systems supporting the Virtualization Extensions and is
>>>>>> intended to virtualize device DMA to guest VM address spaces. When both 
>>>>>> stage 1 and stage 2 are enabled, the translation
>>>>>> configuration is called nesting.
>>>>>> Stage 1 translation support is required to provide isolation between 
>>>>>> different devices within the guest OS. XEN already supports
>>>>>> Stage 2 translation but there is no support for Stage 1 translation for 
>>>>>> guests. We will add support for guests to configure
>>>>>> the Stage 1 translation via virtual IOMMU. XEN will emulate the SMMU 
>>>>>> hardware and expose the virtual SMMU to the guest.
>>>>>> Guest can use the native SMMU driver to configure the stage 1 
>>>>>> translation. When the guest configures the SMMU for Stage 1,
>>>>>> XEN will trap the access and configure the hardware accordingly.
>>>>>> Now back to the question of how we can add the IOMMU binding between the 
>>>>>> virtual IOMMU and the master devices so that
>>>>>> guests can configure the IOMMU correctly. The solution that I am 
>>>>>> suggesting is as below:
>>>>>> For dom0, while handling the DT node(handle_node()) Xen will replace the 
>>>>>> phandle in the "iommus" property with the virtual
>>>>>> IOMMU node phandle.
>>>>> Below, you said that each IOMMUs may have a different ID space. So 
>>>>> shouldn't we expose one vIOMMU per pIOMMU? If not, how do you expect the 
>>>>> user to specify the mapping?
>>>> Yes you are right we need to create one vIOMMU per pIOMMU for dom0. This 
>>>> also helps in the ACPI case
>>>> where we don’t need to modify the tables to delete the pIOMMU entries and 
>>>> create one vIOMMU.
>>>> In this case, no need to replace the phandle as Xen creates the vIOMMU with 
>>>> the same pIOMMU
>>>> phandle and same base address.
>>>> For domU guests one vIOMMU per guest will be created.
>>>
>>> IIRC, the SMMUv3 is using a ring like the GICv3 ITS. I think we need to be 
>>> open here because this may end up to be tricky to security support it (we 
>>> have N guest ring that can write to M host ring).
>>
>> If xl wants to create one vIOMMU per pIOMMU for domU then xl needs to 
>> know the below information:
>>   -  Find the number of holes in guest memory same as the number of vIOMMU 
>> that needs the creation to create the vIOMMU DT nodes. (Think about a big 
>> system that has 50+ IOMMUs)
>>      Yes, we will create vIOMMU for only those devices that are assigned to 
>> guests but still we need to find the hole in guest memory.
> 
> I agree this is a problem with the one vIOMMU per pIOMMU.
> 
>>   -  Find the pIOMMU attached to the assigned device and create mapping b/w 
>> vIOMMU -> pIOMMU to register the MMIO handler.
>>      Either we need to modify the current hypercall or need to implement a 
>> new hypercall to find this information.
> 
> Adding hypercalls is not a big problem.
> 
>>
>> Because of the above reason I thought of creating one vIOMMU for domU. Yes 
>> you are right this may end up to be tricky to security support
>> but as per my understanding one vIOMMU  per domU guest is easy to implement 
>> and simple to handle as compared to one vIOMMU per pIOMMU
> 
> I am not sure about this. My gut feeling is the code in Xen will end up
> to be tricky (even more so given that Xen doesn't support preemption). So I
> think we will trade-off complexity in Xen over simplicity in libxl.
> 
> That said, I haven't looked deeper in the code. So I may be wrong. I
> will need to see the code to confirm.
> 
>>>>>> For domU guests, when passthrough the device to the guest as per [2],  
>>>>>> add the below property in the partial device tree
>>>>>> node that is required to describe the generic device tree binding for 
>>>>>> IOMMUs and their master(s)
>>>>>> "iommus = < &magic_phandle 0xvMasterID>"
>>>>>>   • magic_phandle will be the phandle ( vIOMMU phandle in xl)  that will 
>>>>>> be documented so that the user can set that in partial DT node (0xfdea).
>>>>>
>>>>> Does this mean only one IOMMU will be supported in the guest?
>>>> Yes.
>>>>>
>>>>>>   • vMasterID will be the virtual master ID that the user will provide.
>>>>>> The partial device tree will look like this:
>>>>>> /dts-v1/;
>>>>>>   / {
>>>>>>      /* #*cells are here to keep DTC happy */
>>>>>>      #address-cells = <2>;
>>>>>>      #size-cells = <2>;
>>>>>>        aliases {
>>>>>>          net = &mac0;
>>>>>>      };
>>>>>>        passthrough {
>>>>>>          compatible = "simple-bus";
>>>>>>          ranges;
>>>>>>          #address-cells = <2>;
>>>>>>          #size-cells = <2>;
>>>>>>          mac0: ethernet@10000000 {
>>>>>>              compatible = "calxeda,hb-xgmac";
>>>>>>              reg = <0 0x10000000 0 0x1000>;
>>>>>>              interrupts = <0 80 4  0 81 4  0 82 4>;
>>>>>>             iommus = <0xfdea 0x01>;
>>>>>>          };
>>>>>>      };
>>>>>> };
>>>>>>   In xl.cfg we need to define a new option to inform Xen about vMasterId 
>>>>>> to pMasterId mapping and to which IOMMU device the 
>>>>>> master device is connected so that Xen can configure the right 
>>>>>> IOMMU. This is required if the system has devices that have
>>>>>> the same master ID but behind a different IOMMU.
>>>>>
>>>>> In xl.cfg, we already pass the device-tree node path to passthrough. So 
>>>>> Xen should already have all the information about the IOMMU and 
>>>>> Master-ID. So it doesn't seem necessary for Device-Tree.
>>>>>
>>>>> For ACPI, I would have expected the information to be found in the IOREQ.
>>>>>
>>>>> So can you add more context why this is necessary for everyone?
>>>> We have information for IOMMU and Master-ID but we don’t have information 
>>>> for linking vMaster-ID to pMaster-ID.
>>>
>>> I am confused. Below, you are making the virtual master ID optional. So 
>>> shouldn't this be mandatory if you really need the mapping with the virtual 
>>> ID?
>>
>> vMasterID is optional if user knows pMasterID is unique on the system. But 
>> if pMasterId is not unique then user needs to provide the vMasterID.
> 
> So the expectation is the user will be able to know that the pMasterID
> is unique. This may be easy with a couple of SMMUs, but if you have 50+
> (as suggested above), this will become a pain on larger systems.
> 
> IMHO, it would be much better if we can detect that in libxl (see below).
> 
>>
>>>
>>>> The device tree node will be used to assign the device to the guest and 
>>>> configure the Stage-2 translation. Guest will use the
>>>> vMaster-ID to configure the vIOMMU during boot. Xen needs information to 
>>>> link vMaster-ID to pMaster-ID to configure
>>>> the corresponding pIOMMU. As I mention we need vMaster-ID in case a system 
>>>> could have 2 identical Master-ID but
>>>> each one connected to a different SMMU and assigned to the guest.
>>>
>>> I am afraid I still don't understand why this is a requirement. Libxl could 
>>> have enough knowledge (which will be necessary for the PCI case) to know 
>>> the IOMMU and pMasterID associated with a device.
>>>
>>> So libxl could allocate the vMasterID, tell Xen the corresponding mapping 
>>> and update the device-tree.
>>>
>>> IOW, it doesn't seem to be necessary to involve the user in the process 
>>> here.
>>
>> Yes, libxl could allocate the vMasterID but there is no way we can find the 
>> link b/w vMasterID created to pMasterID from dtdev.
>>
>> What I understand from the code is that there is no link between the 
>> passthrough node and dtdev config option. The passthrough
>> node is directly copied to guest DT without any modification. Dtdev is used 
>> to add and assign the device to IOMMU.
>>
>> Let's take an example if the user wants to assign two devices to the guest 
>> via passthrough node.
>>
>> /dts-v1/;
>>
>> / {
>>     /* #*cells are here to keep DTC happy */
>>     #address-cells = <2>;
>>     #size-cells = <2>;
>>
>>     aliases {
>>         net = &mac0;
>>     };
>>
>>     passthrough {
>>         compatible = "simple-bus";
>>         ranges;
>>         #address-cells = <2>;
>>         #size-cells = <2>;
>>
>>         mac0: ethernet@10000000 {
>>             compatible = "calxeda,hb-xgmac";
>>             reg = <0 0x10000000 0 0x1000>;
>>             interrupts = <0 80 4  0 81 4  0 82 4>;
>>         };
>>
>>         mac1: ethernet@20000000 {
>>             compatible = "r8169";
>>             reg = <0 0x20000000 0 0x1000>;
>>             interrupts = <0 80 4  0 81 4  0 82 4>;
>>         };
>>
>>     };
>> };
>>
>> dtdev = [ "/soc/ethernet@10000000", "/soc/ethernet@f2000000" ]
>>
>> There is no link showing which dtdev entry belongs to which node. Therefore there is 
>> no way to link the vMasterID created to pMasterID.
> 
> I agree there is no link today. But we could add a property in the
> partial device-tree to mention which physical device is associated.
+1

And we already have this property in partial device trees for dom0less domUs:
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/arm/passthrough.txt;h=219d1cca571b01bc8f0afbbe64435299547fed75;hb=HEAD#l104

AFAIK, the solution proposed in this thread was chosen because, at the
moment, we do not parse the partial device tree in libxl.
But if this is the way to go (to reduce the complexity in Xen), then it will
allow us to drop the need for specifying both vMasterID and iommu_devid_map.
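
To make the idea concrete, here is a minimal sketch of what a partial device
tree node could look like if the dom0less "xen,path" property from the
document linked above were reused for xl-created guests. This is only an
assumption for illustration: today libxl does not parse the partial device
tree, and the host path below is simply taken from the dtdev example earlier
in the thread.

```dts
/dts-v1/;

/ {
    /* #*cells are here to keep DTC happy */
    #address-cells = <2>;
    #size-cells = <2>;

    passthrough {
        compatible = "simple-bus";
        ranges;
        #address-cells = <2>;
        #size-cells = <2>;

        mac0: ethernet@10000000 {
            compatible = "calxeda,hb-xgmac";
            reg = <0 0x10000000 0 0x1000>;
            interrupts = <0 80 4  0 81 4  0 82 4>;
            /*
             * Assumption: path of the corresponding node in the host
             * device tree, as used for dom0less passthrough. With this
             * link, libxl could look up the pIOMMU and pMasterID itself.
             */
            xen,path = "/soc/ethernet@10000000";
        };
    };
};
```

With such a link, the user would not need to write an "iommus" property or a
vMasterID by hand; the toolstack could allocate the vMasterID and generate
the property when building the guest device tree.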

> 
> With that, I think all the complexity is moved to libxl and it will be
> easier for the user to use vIOMMU.
> 
> [...]
> 
>>>>>>   iommu_devid_map = [ "PMASTER_ID[@VMASTER_ID],IOMMU_BASE_ADDRESS" , 
>>>>>> "PMASTER_ID[@VMASTER_ID],IOMMU_BASE_ADDRESS"]
>>>>>>   • PMASTER_ID is the physical master ID of the device from the physical 
>>>>>> DT.
>>>>>>   • VMASTER_ID is the virtual master Id that the user will configure in 
>>>>>> the partial device tree.
>>>>>>   • IOMMU_BASE_ADDRESS is the base address of the physical IOMMU device 
>>>>>> to which this device is connected.
>>>>>
>>>>> Below you give an example for Platform device. How would that fit in the 
>>>>> context of PCI passthrough?
>>>> In PCI passthrough case, xl will create the "iommu-map" property in vpci 
>>>> host bridge node with phandle to vIOMMU node.
>>>> vSMMUv3 node will be created in xl.
>>>
>>> This means that libxl will need to know the associated pMasterID to a PCI 
>>> device. So, I don't understand why you can't do the same for platform 
>>> devices.
>>
>> For the PCI passthrough case, we don’t need to provide the MasterID to 
>> create the "iommu-map" property, as for a
>> PCI device the MasterID is the RID (BDF). For non-PCI devices, the MasterID is 
>> required to create the "iommus" property.
> 
> Are you talking about the physical MasterID or virtual one? If physical
> MasterID then I don't think this is always the RID (see [1]). But for
> the virtual Master ID we could make this association.
> 
> This still means that in some way the toolstack needs to let Xen know (or
> the other way around) the mapping between the pMasterID and vMasterID.
> 
> [1] Documentation/devicetree/bindings/pci/pci-iommu.txt.
> 
> Cheers,
> 
> --
> Julien Grall
> 

~Michal



 

