[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] RMRR Fix Design for Xen



Jan,

Thanks for your time but I'm not going to address your comments here. Because I heard this design is totally not satisfied your expectation. But this really was reviewed with several revisions by Kevin and Yang before sending in public...

Anyway, I guess the only thing what I can do is that, Kevin and Yang, or other appropriate guys should finish this design as you expect. So now I'd better not say anything to avoid bringing any inconvenience.

Tiejun

On 2014/12/19 23:13, Jan Beulich wrote:
On 19.12.14 at 02:21, <tiejun.chen@xxxxxxxxx> wrote:
#4 Something like USB, is still restricted to current RMRR implementation. We
    should work out this case.

This can mean all or nothing. My understanding is that right now code
assumes that USB devices won't use their RMRR-specified memory
regions post-boot (kind of contrary to your earlier statement that in
such a case the regions shouldn't be listed in RMRRs in the first place).

Design Overview
===============

First of all we need to make sure all resources don't overlap RMRR. And then
in case of shared ept, we can set these identity entries. And Certainly we
will
group all devices associated to one same RMRR entry, then make sure all
group
devices should be assigned to same VM.

1. Setup RMRR identity mapping

current status:
     * identity mapping only setup in non-shared ept case

proposal:

In non-shared ept case, IOMMU stuff always set those entries and RMRR is
already marked reserved in host so its fine enough.

Is it? Where? Or am I misunderstanding the whole statement, likely
due to me silently replacing "host" by "guest" (since reservation in
host address spaces is of no interest here afaict)?

But in shared ept case, we need to
check any conflit, so we should follow up

   - gfn space unoccupied
         -> insert mapping: success.
             gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, p2m_access_rw
   - gfn space already occupied by 1:1 RMRR mapping
         -> do nothing; success.
   - gfn space already occupied by other mapping
         -> fail.

expectation:
     * only devices w/ non-conflicting RMRR can be assigned
     * fortunately this achieves the very initial intention to support IGD
       pass-through on BDW

Are you trying to say here that doing the above is all you need for
your specific machine? If so, that's clearly not something to go into
a design document.

Also there's clearly an alternative proposal: Drop support for sharing
page tables. Your colleagues will surely have told you that we've
been considering this for quite some time, and had actually hoped
for them to do the necessary VT-d side work to allow for this without
causing performance regressions.

2.1 Expose RMRR to user space

current status:
     * Xen always record RMRR info into one list, acpi_rmrr_units, while
parsing
       acpi. So we can retrieve these info by lookup that list.

proposal:
     * RMRR would be exposed by a new hypercall, which Jan already finished
in
       current version but just expose all RMRR info unconditionally.
     * Furthermore we can expose RMRR on demand to diminish shrinking guest
       RAM/MMIO space.
     * So we will introduce a new parameter, 'rdm_forcecheck' and to
collaborate
       with SBDFs to control which RMRR should be exposed:

         - We can set this parameter in .cfg file like,

             rdm_forcecheck = 1 => Of course this should be 0 by default.

         '1' means we should force check to reserve all ranges
unconditionally.
         and if failed VM wouldn't be created successfully. This also can
give
         user a chance to work well with later hotplug, even if not a device
         assignment while creating VM.

         If 0, we just check those assigned pci devices. As you know we already

assigned? Wasn't the plan to have a separate potentially-to-be-
assigned list? And I can only re-iterate that the name
"rdm_forcecheck" doesn't really express what you mean. Since
your intention is to check all devices (rather than do a check that
otherwise wouldn't be done), "rdm_all" or "rdm_check_all" would
seem closer to the intended behavior.

         have such an existing hypercall to assign PCI devices, looks we can
work
         directly under this hypercall to get that necessary SBDF to sort
which
         RMRR should be handled. But obviously, we need to get these info
before
         we populate guest memory to make sure these RMRR ranges should be
         excluded from guest memory. But unfortunately the memory populating
         takes place before a device assignment, so we can't live on that
         directly.

         But as we discussed it just benefit that assigned case to reorder
that
         order, but not good to hotplug case. So we have to introduce a new
         DOMCTL to pass that global parameter with SBDF at the same time.

         For example, if we own these two RMRR entries,

"own" is confusing here, I assume you mean if there are such entries.

         [00:14.0]    RMRR region: base_addr ab80a000 end_address ab81dfff
         [00:02.0]    RMRR region: base_addr ad000000 end_address af7fffff

         If 'rdm_forcecheck = 1', any caller to that hypercall may get these two

s/may/will/ I suppose.

         entries. But if 'rdm_forcecheck = 0', and in .cfg file,

         #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two entries.
         #2 'pci=[00:02.0]' -> The caller just get one entry,
0xad000000:0xaf7fffff
         #2 'pci=[00:14.0]' -> The caller just get one entry,
0xab80a000:0xab81dfff
         #4 'pci=[others]' or no any pci configuration -> The caller get
nothing.

Oh, interesting, you dropped the idea of a separate list of possibly-
to-be-assigned devices. Why that?

         And ultimately, if we expose any RMRR entry at gfn,

         in non-shared ept case,

         p2m table:  valid non-identity mapping, gfn:mfn != 1:1

This would mean ...

         VT-D table: no such a mapping until set identity mapping,
                     gfn:_mfn(gfn) == 1:1 when we have a associated device
                     assignment.

... the two mappings to go out of sync when such a 1:1 mapping gets
established. I think we should avoid such a situation if at all possible.

         in shared ept case,

         p2m table\VT-D table:  always INVALID until set identity mapping,
                                gfn:_mfn(gfn) == 1:1 when we have a
associated
                                device assignment.

         But if we don't expose any RMRR entry,

         in non-shared ept case,

         p2m table:  valid non-identity mapping, gfn:mfn != 1:1
         VT-D table: no such a mapping until set identity mapping,
                     gfn:_mfn(gfn) == 1:1 when we have a associated device
                     assignment.

         in shared ept case,

         p2m table\VT-D table:  If guest RAM already cover gfn, we have sunc a
                                valid non-identity mapping, gfn:mfn != 1:1,
but
                                we can't set any identity mapping again then
                                that associated device can't be assigned
                                successfully. If not, we'll set identity
mapping,
                                gfn:_mfn(gfn) == 1:1, to work well.

So in the end we'd still have various cases of different behavior,
when really it would be much better (if not a requirement) if guest
observable behavior wouldn't depend on implementation details
like (non-)shared page tables.

2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO

current status:
     * Currently Xen doesn't detect anything to avoid any conflict.

proposal:
     * We need to cover all points to check any conflict. Below lists places
       where reserved region conflictions should be detected:

         1>. libxc:setup_guest():modules_init()

         There are some modules, like acpi_module and smbios_module. They may
         occupy some ranges so we need to check if they're conflicting with
         all ranges from that new hypercall above.

I'm missing of what conflict resolution you expect to do here.
Following earlier discussion in the context of patch review made
pretty clear that this isn't really straightforward, having the
potential to prevent a guest from booting (e.g. when the virtual
BIOS conflicts with an RMRR).

         2>. libxc:setup_guest():xc_domain_populate_physmap()

         There are two ways to exclude RMRR ranges here:

             #1 Before we populate guest RAM without any RMRR range from that
                new hypercall to skip RMRR ranges when populating guest RAM.
             #2 In Xen we can fliter RMRR range while calling
                XENMEM_populate_physmap, its no change to libxc, but skip
                setting p2m entry for RMRR ranges in Xen hypervisor

         But to compare #1, #2 is not better since Xen still allocate those
         range, and we have to sync somewhere else in Xen. And #1 is cleaner
         because #2 actually shrinks the available memory size to the guest.

         3>. hvmloader:pci_setup()

         Here we should populate guest MMIO excluding all ranges from that
new
         hypercall to detect RMRR conflictions for allocating PCI MMIO BAR.

         4>. hvmloader:mem_hole_alloc()

         Here we should populate mem holw excluding all ranges from that new
         hypercall to detect RMRR conflictions for dynamic allocation in
         hvmloader, e.g. as used for IGD opregion

         5>. hvmloader:build_e820_table():

         Finally we need let VM know that RMRR regions are reserved through
e820
         table

     Its a little bit tricky to handle this inside hvmloader since as you
know,
     struct hvm_info_table is only a approach between libxc and hvmloader.
But
     however, making up all above places will bring some duplicated logic.
     Especially between libxc and hvmloader, which skip same regions in guest
     physical layer(one in populating guest RAM, the other in constructing
e820)
     But current design has some limitation to pass information between libxc
and
     hvmloader,

struct hvm_info_table {
     ...
     /*
      *  0x0 to page_to_phys(low_mem_pgend)-1:
      *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
      */
     uint32_t    low_mem_pgend;
     ...
     /*
      *  0x100000000 to page_to_phys(high_mem_pgend)-1:
      *    RAM above 4GB
      */
     uint32_t    high_mem_pgend;
     ...
}

     nothing valuable is passed to hvmloader so we have to figure out a way
to
     handle those points in hvmloader. Currently,

             #1 Reconstruct hvm_info_table{}

             We can store all necessary RMRR info into hvm_info_table as
well,
             then we can pick them conveniently but oftentimes these RMRR
             entries are scattered and the number is also undertimined, so
its
             a little bit ugly to do.

             #2 Xenstore

             We may store all necessary RMRR info as Xenstore then pick them
in
             hvmloader.

             #3 A hypercall

             We may have to call our new hypercall again to pick them, but
             obviously this may introduce some duplicated codes.

             #4 XENMEM_{set_,}memory_map pair of hypercalls

             As Jan commented it "could be used(and is readily available to
be
             extended that way, since for HVM domains XENMEM_set_memory_map
             returns -EPERM at present). The only potentially problematic
aspect
             I can see with using it might be its limiting of the entry count
to
             E820MAX." So a preparation patch is required before RMRR, and
             hvmloader still needs to query RMRR information.

Somewhere above I'm missing the mentioning of the option to reorder
operations of hvmloader, to e.g. base PCI BAR assignment on the
previously set up E820 table, rather than assuming a (mostly) fixed
size window where to put these.

3. Group multiple devices sharing same RMRR entry

current status:
     * Xen doesn't distinguish if multiple devices share same RMRR entry.
This
       may lead a leak to corruption, e.g. Device A and device B share same
entry
       and then device A is assigned to domain A, device B is assigned to
domain
       B. So domain A can read something from that range, even rewrite
       maliciously that range to corrupt device B inside domain B.

proposal:
     * We will group all devices by means of one same RMRR entry.
Theoretically,
       we should make sure all devices in one group are allowed to assign to
one
       VM. But in Xen side the hardware domain owns all devices at first, we
       should unbound all group devies before assign one of group device. But
its
       hard to do such thing in current VT-d stuff. And actually its rare to
have
       the group device in real world so we just prevent assigning any group
       device simply.
     * We will introduce two field, gid, in struct, acpi_rmrr_unit:
         gid: indicate if this device belongs a group.
       Then when we add or attach device in iommu side, we will check this
field
       if we're assigning a group device then determine if we assign that.


expectation:
     * Make all device associated to one RMRR to be assigned same VM.

A few lines up you say you're intending to refuse such assignments
rather than making them happen in a consistent way - what's the
plan in the end?

4. Handle in case of force accessing to RMRR regions

proposal:
     * Although we already reserve these ranges in guest e820 table, its
       possible to access these ranges.
       In non-shared ept case it will issue such EPT violation since we have
no
       p2m table actually. But its same to access other reserved range so
just
       live on Xen's default behavior. In shared-ept case guest can read or
       write anything from this range, but such a leak or corruption just
       happens inside same domain so this behavior is also same as a native
       case, it should be fine.

expectation:
     * Don't broken normal RMRR usage.

I'm not getting the purpose of this whole section.

5. Re-enable USB

current status:
     * Currently Xen doesn't allow USB passthrough.

???

6. Hotplug case

current status:
     * only work well in non-shared ept case or in case of shared ept & all
       associated gfns are free.

proposal:
     * If user ensure there'll be a hotplug device in advace,
'rdm_forcecheck'
       can be set to reserve all RMRR range as we describe above. Then any
       device can be hotplugged successfully.
     * If not, there are still two scenarios here:
       in non-shared ept case it still work well as original;

So you continue to assume that the two tables going out of sync is
acceptable.

       in shared ept case, its just going a case of "1. Setup RMRR identity
       mapping"
         - gfn space unoccupied -> insert mapping: success.
         - gfn space already occupied by 1:1 RMRR mapping -> do nothing;
success.
         - gfn space already occupied by other mapping -> fail.

expectation:
     * provide a way to let hotplug work in case of shared ept totally

Open Issue
==========

When other stuffs like ballon mechanism, to populate memory accessing RMRR
range, or others like qemu, force map RMRR range, we may need to avoid such
a
possibility but we're not 100% sure this so just open this to double check.

I think a HVM guest can be expected to play by what E820 table
says is reserved. That includes not using such ranges for ballooning.
If it still does, it'll get what it deserves.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.