Re: [Xen-devel] RMRR Fix Design for Xen
I'll work out a new design proposal based on the content below and the
previous discussions. Thanks Tiejun for your hard work and Jan for your
careful reviews so far.
Regarding the comment below:
> > Also there's clearly an alternative proposal: Drop support for sharing
> > page tables. Your colleagues will surely have told you that we've
> > been considering this for quite some time, and had actually hoped
> > for them to do the necessary VT-d side work to allow for this without
> > causing performance regressions.
let's separate that from the RMRR discussion, because the RMRR issues are
about the general p2m and thus orthogonal to the implementation difference
between shared and non-shared page tables (though that difference did lead
to different behaviors with the current bogus logic).
Thanks
Kevin
> From: Chen, Tiejun
> Sent: Monday, December 22, 2014 10:12 AM
>
> Jan,
>
> Thanks for your time, but I'm not going to address your comments here,
> because I heard this design totally failed to meet your expectations.
> But it really was reviewed through several revisions by Kevin and Yang
> before being sent out publicly...
>
> Anyway, I guess the only thing I can do is let Kevin and Yang, or other
> appropriate people, finish this design as you expect. So for now I'd
> better not say anything more, to avoid causing any inconvenience.
>
> Tiejun
>
> On 2014/12/19 23:13, Jan Beulich wrote:
> >>>> On 19.12.14 at 02:21, <tiejun.chen@xxxxxxxxx> wrote:
> >> #4 Something like USB is still restricted by the current RMRR
> >> implementation. We should work out this case.
> >
> > This can mean all or nothing. My understanding is that right now code
> > assumes that USB devices won't use their RMRR-specified memory
> > regions post-boot (kind of contrary to your earlier statement that in
> > such a case the regions shouldn't be listed in RMRRs in the first place).
> >
> >> Design Overview
> >> ===============
> >>
> >> First of all we need to make sure no resources overlap any RMRR. Then,
> >> in the shared EPT case, we can set these identity entries. We will also
> >> group all devices associated with the same RMRR entry, and make sure all
> >> devices in a group are assigned to the same VM.
> >>
> >> 1. Setup RMRR identity mapping
> >>
> >> current status:
> >> * identity mapping only setup in non-shared ept case
> >>
> >> proposal:
> >>
> >> In the non-shared EPT case the IOMMU code always sets those entries, and
> >> RMRR is already marked reserved in the host, so that is fine.
> >
> > Is it? Where? Or am I misunderstanding the whole statement, likely
> > due to me silently replacing "host" by "guest" (since reservation in
> > host address spaces is of no interest here afaict)?
> >
> >> But in the shared EPT case we need to check for any conflict, so we
> >> should follow these rules:
> >>
> >> - gfn space unoccupied
> >>      -> insert mapping: success.
> >>           gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, p2m_access_rw
> >> - gfn space already occupied by 1:1 RMRR mapping
> >> -> do nothing; success.
> >> - gfn space already occupied by other mapping
> >> -> fail.
> >>
> >> expectation:
> >> * only devices w/ non-conflicting RMRR can be assigned
> >> * fortunately this achieves the original intention, namely supporting
> >>   IGD pass-through on BDW
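
To make the three cases listed above concrete, here is a minimal standalone
sketch of the intended check. The toy p2m array and helper name are
assumptions for illustration only, not the real Xen p2m interfaces:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy p2m model for illustration only -- not the real Xen p2m. */
    enum p2m_type { P2M_INVALID = 0, P2M_RAM, P2M_MMIO_DIRECT };

    struct p2m_entry { enum p2m_type type; uint64_t mfn; };

    #define NR_GFNS 16
    static struct p2m_entry p2m[NR_GFNS];

    /* Apply the three cases above to a single gfn of an RMRR range. */
    static int rmrr_identity_map_one(uint64_t gfn)
    {
        struct p2m_entry *e = &p2m[gfn];

        if (e->type == P2M_INVALID) {              /* unoccupied: map 1:1 */
            e->type = P2M_MMIO_DIRECT;             /* p2m_mmio_direct, rw */
            e->mfn  = gfn;                         /* gfn:_mfn(gfn) */
            return 0;
        }
        if (e->type == P2M_MMIO_DIRECT && e->mfn == gfn)
            return 0;                          /* already the 1:1 RMRR mapping */
        return -1;                             /* occupied by another mapping */
    }

    int main(void)
    {
        p2m[3] = (struct p2m_entry){ P2M_RAM, 0x1234 }; /* simulate a conflict */

        for (uint64_t gfn = 2; gfn <= 4; gfn++) /* pretend the RMRR covers 2-4 */
            printf("gfn %llu: %s\n", (unsigned long long)gfn,
                   rmrr_identity_map_one(gfn) ? "fail" : "success");
        return 0;
    }

On a conflict the device assignment would simply be refused, which matches
the expectation that only devices with non-conflicting RMRRs can be assigned.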
> >
> > Are you trying to say here that doing the above is all you need for
> > your specific machine? If so, that's clearly not something to go into
> > a design document.
> >
> > Also there's clearly an alternative proposal: Drop support for sharing
> > page tables. Your colleagues will surely have told you that we've
> > been considering this for quite some time, and had actually hoped
> > for them to do the necessary VT-d side work to allow for this without
> > causing performance regressions.
> >
> >> 2.1 Expose RMRR to user space
> >>
> >> current status:
> >> * Xen always records RMRR info in one list, acpi_rmrr_units, while
> >>   parsing ACPI, so we can retrieve that info by looking up the list.
> >>
> >> proposal:
> >> * RMRR info would be exposed by a new hypercall, which Jan already
> >>   implemented in the current version, but it just exposes all RMRR info
> >>   unconditionally.
> >> * Furthermore we can expose RMRRs on demand, to diminish the shrinking
> >>   of guest RAM/MMIO space.
> >> * So we will introduce a new parameter, 'rdm_forcecheck', which
> >>   collaborates with the SBDFs to control which RMRRs should be exposed:
> >>
> >>   - We can set this parameter in the .cfg file like
> >>
> >>         rdm_forcecheck = 1    => Of course this should be 0 by default.
> >>
> >>     '1' means we force-check and reserve all ranges unconditionally; if
> >>     that fails the VM won't be created successfully. This also gives the
> >>     user a chance to prepare for later hotplug, even if no device is
> >>     assigned while creating the VM.
> >>
> >>     If 0, we just check those assigned PCI devices. As you know we already
> >
> > assigned? Wasn't the plan to have a separate potentially-to-be-
> > assigned list? And I can only re-iterate that the name
> > "rdm_forcecheck" doesn't really express what you mean. Since
> > your intention is to check all devices (rather than do a check that
> > otherwise wouldn't be done), "rdm_all" or "rdm_check_all" would
> > seem closer to the intended behavior.
> >
> >>     have such an existing hypercall to assign PCI devices, so it looks
> >>     like we could work directly under that hypercall to get the necessary
> >>     SBDFs and sort out which RMRRs should be handled. But obviously we
> >>     need to get this info before we populate guest memory, to make sure
> >>     the RMRR ranges are excluded from guest memory. Unfortunately the
> >>     memory population takes place before device assignment, so we can't
> >>     rely on that directly.
> >>
> >>     As we discussed, reordering those operations would only benefit the
> >>     assigned-at-boot case, not the hotplug case. So we have to introduce
> >>     a new DOMCTL to pass that global parameter together with the SBDFs
> >>     at the same time.
> >>
> >> For example, if we own these two RMRR entries,
> >
> > "own" is confusing here, I assume you mean if there are such entries.
> >
> >>         [00:14.0] RMRR region: base_addr ab80a000 end_address ab81dfff
> >>         [00:02.0] RMRR region: base_addr ad000000 end_address af7fffff
> >>
> >>     If 'rdm_forcecheck = 1', any caller of that hypercall may get these two
> >
> > s/may/will/ I suppose.
> >
> >>     entries. But if 'rdm_forcecheck = 0', then depending on the .cfg file:
> >>
> >>     #1 'pci=[00:14.0, 00:02.0]' -> The caller still gets both entries.
> >>     #2 'pci=[00:02.0]' -> The caller gets just one entry,
> >>        0xad000000:0xaf7fffff
> >>     #3 'pci=[00:14.0]' -> The caller gets just one entry,
> >>        0xab80a000:0xab81dfff
> >>     #4 'pci=[others]' or no pci configuration at all -> The caller gets
> >>        nothing.
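
The selection in the four examples above can be sketched as follows. This is
standalone C for illustration; the structure, field names and BDF encoding
are assumptions, not the actual hypercall ABI:

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative RMRR record: one reserved range plus the owning device. */
    struct rmrr_entry {
        uint16_t bdf;        /* bus:dev.fn encoded as (bus << 8 | dev << 3 | fn) */
        uint64_t base, end;  /* inclusive reserved range */
    };

    static const struct rmrr_entry host_rmrrs[] = {
        { 0x00a0 /* 00:14.0 */, 0xab80a000, 0xab81dfff },
        { 0x0010 /* 00:02.0 */, 0xad000000, 0xaf7fffff },
    };

    /* Decide whether one entry is exposed, per the rules quoted above. */
    static bool expose_entry(const struct rmrr_entry *e, bool rdm_forcecheck,
                             const uint16_t *assigned, int nr_assigned)
    {
        if (rdm_forcecheck)                /* rdm_forcecheck = 1: expose all */
            return true;
        for (int i = 0; i < nr_assigned; i++)
            if (assigned[i] == e->bdf)     /* = 0: only assigned devices */
                return true;
        return false;
    }

    int main(void)
    {
        const uint16_t assigned[] = { 0x0010 };    /* pci = [ '00:02.0' ] */

        for (unsigned i = 0; i < sizeof(host_rmrrs) / sizeof(host_rmrrs[0]); i++)
            if (expose_entry(&host_rmrrs[i], false, assigned, 1))
                printf("expose 0x%" PRIx64 ":0x%" PRIx64 "\n",
                       host_rmrrs[i].base, host_rmrrs[i].end);
        return 0;
    }

Run against the pci=['00:02.0'] configuration it prints only the
0xad000000:0xaf7fffff entry, matching case #2 above.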
> >
> > Oh, interesting, you dropped the idea of a separate list of possibly-
> > to-be-assigned devices. Why that?
> >
> >> And ultimately, if we do expose an RMRR entry at a gfn:
> >>
> >>   in the non-shared EPT case,
> >>
> >>     p2m table:  valid non-identity mapping, gfn:mfn != 1:1
> >
> > This would mean ...
> >
> >>     VT-d table: no such mapping until the identity mapping is set,
> >>                 gfn:_mfn(gfn) == 1:1, when there is an associated device
> >>                 assignment.
> >
> > ... the two mappings to go out of sync when such a 1:1 mapping gets
> > established. I think we should avoid such a situation if at all possible.
> >
> >>   in the shared EPT case,
> >>
> >>     p2m table\VT-d table: always INVALID until the identity mapping is
> >>                 set, gfn:_mfn(gfn) == 1:1, when there is an associated
> >>                 device assignment.
> >>
> >> But if we don't expose any RMRR entry:
> >>
> >>   in the non-shared EPT case,
> >>
> >>     p2m table:  valid non-identity mapping, gfn:mfn != 1:1
> >>     VT-d table: no such mapping until the identity mapping is set,
> >>                 gfn:_mfn(gfn) == 1:1, when there is an associated device
> >>                 assignment.
> >>
> >>   in the shared EPT case,
> >>
> >>     p2m table\VT-d table: if guest RAM already covers the gfn, we have a
> >>                 valid non-identity mapping, gfn:mfn != 1:1, and we can't
> >>                 set an identity mapping on top of it, so the associated
> >>                 device can't be assigned successfully. If not, we set the
> >>                 identity mapping, gfn:_mfn(gfn) == 1:1, and things work.
> >
> > So in the end we'd still have various cases of different behavior,
> > when really it would be much better (if not a requirement) if guest
> > observable behavior wouldn't depend on implementation details
> > like (non-)shared page tables.
> >
> >> 2.2 Detect and avoid RMRR conflicts with guest RAM/MMIO
> >>
> >> current status:
> >> * Currently Xen doesn't detect anything to avoid any conflict.
> >>
> >> proposal:
> >> * We need to cover all the points where a conflict can arise. Below are
> >>   the places where reserved region conflicts should be detected:
> >>
> >>   1>. libxc:setup_guest():modules_init()
> >>
> >>       There are some modules, like acpi_module and smbios_module, that
> >>       may occupy some ranges, so we need to check whether they conflict
> >>       with any of the ranges from the new hypercall above.
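
The check needed at this point (and at the other places below) boils down to
a range-overlap test; a minimal sketch, with illustrative names and a
hypothetical module placement:

    #include <stdbool.h>
    #include <stdint.h>

    struct range { uint64_t start, end; };   /* inclusive guest-physical range */

    /* True if the two inclusive ranges overlap. */
    static bool ranges_overlap(struct range a, struct range b)
    {
        return a.start <= b.end && b.start <= a.end;
    }

    /* Check one module placement against every exposed reserved range. */
    static bool module_conflicts(struct range module,
                                 const struct range *rdm, int nr_rdm)
    {
        for (int i = 0; i < nr_rdm; i++)
            if (ranges_overlap(module, rdm[i]))
                return true;
        return false;
    }

    int main(void)
    {
        const struct range rdm[] = { { 0xab80a000, 0xab81dfff } };
        struct range acpi_mod = { 0xab800000, 0xab80ffff };  /* hypothetical */

        return module_conflicts(acpi_mod, rdm, 1); /* exit status 1 on conflict */
    }

What to do on a detected conflict (relocate the module or fail domain
creation) is exactly the resolution question Jan raises below.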
> >
> > I'm missing of what conflict resolution you expect to do here.
> > Following earlier discussion in the context of patch review made
> > pretty clear that this isn't really straightforward, having the
> > potential to prevent a guest from booting (e.g. when the virtual
> > BIOS conflicts with an RMRR).
> >
> >>   2>. libxc:setup_guest():xc_domain_populate_physmap()
> >>
> >>       There are two ways to exclude RMRR ranges here:
> >>
> >>       #1 In libxc, use the ranges from the new hypercall to skip the
> >>          RMRR ranges when populating guest RAM.
> >>       #2 In Xen, filter the RMRR ranges while handling
> >>          XENMEM_populate_physmap; no change to libxc, but skip setting
> >>          p2m entries for RMRR ranges in the hypervisor.
> >>
> >>       Compared with #1, #2 is worse, since Xen still allocates those
> >>       ranges and we would have to synchronize somewhere else in Xen; #1
> >>       is cleaner because #2 actually shrinks the memory size available
> >>       to the guest.
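
A simplified sketch of option #1 follows. It is pure illustration: the real
libxc flow around xc_domain_populate_physmap() is more involved, and the
reserved pfn ranges here are just the example RMRRs quoted earlier:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    struct pfn_range { uint64_t start, end; }; /* inclusive, sorted, disjoint */

    /*
     * Walk guest RAM pfns [0, nr_pages) and emit the extents that should be
     * populated, skipping the reserved pfn ranges.  A real implementation
     * would issue the populate call per extent instead of printing, and would
     * account the skipped pages elsewhere so the guest RAM size is preserved.
     */
    static void populate_skipping_reserved(uint64_t nr_pages,
                                           const struct pfn_range *rsv,
                                           int nr_rsv)
    {
        uint64_t cur = 0;

        for (int i = 0; i < nr_rsv && cur < nr_pages; i++) {
            uint64_t stop = rsv[i].start < nr_pages ? rsv[i].start : nr_pages;

            if (cur < stop)
                printf("populate pfns [0x%" PRIx64 ", 0x%" PRIx64 ")\n",
                       cur, stop);
            if (rsv[i].end + 1 > cur)
                cur = rsv[i].end + 1;
        }
        if (cur < nr_pages)
            printf("populate pfns [0x%" PRIx64 ", 0x%" PRIx64 ")\n",
                   cur, nr_pages);
    }

    int main(void)
    {
        /* The two example RMRRs above, expressed as pfn ranges. */
        const struct pfn_range rsv[] = {
            { 0xab80a, 0xab81d },
            { 0xad000, 0xaf7ff },
        };

        populate_skipping_reserved(0xb0000 /* ~2.75GB of RAM */, rsv, 2);
        return 0;
    }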
> >>
> >>   3>. hvmloader:pci_setup()
> >>
> >>       Here we should place guest MMIO excluding all ranges from the new
> >>       hypercall, to detect RMRR conflicts when allocating PCI MMIO BARs.
> >>
> >>   4>. hvmloader:mem_hole_alloc()
> >>
> >>       Here we should allocate from the memory hole excluding all ranges
> >>       from the new hypercall, to detect RMRR conflicts for dynamic
> >>       allocations in hvmloader, e.g. as used for the IGD opregion.
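
Both of these hvmloader allocations can sit on one primitive that bumps a
candidate address past any conflicting reserved range; a simplified sketch
(illustrative only, not hvmloader's real allocator, and assuming a
power-of-two alignment):

    #include <stdint.h>

    struct range { uint64_t start, end; };       /* inclusive reserved range */

    /*
     * Bump-allocate 'size' bytes at 'align' alignment within [base, limit],
     * skipping past any reserved range the candidate overlaps.
     * Returns 0 on failure.
     */
    uint64_t alloc_avoiding_reserved(uint64_t base, uint64_t limit,
                                     uint64_t size, uint64_t align,
                                     const struct range *rsv, int nr_rsv)
    {
        uint64_t addr = (base + align - 1) & ~(align - 1);

        while (addr + size - 1 <= limit) {
            int conflict = -1;

            for (int i = 0; i < nr_rsv; i++)
                if (addr <= rsv[i].end && rsv[i].start <= addr + size - 1) {
                    conflict = i;
                    break;
                }
            if (conflict < 0)
                return addr;                              /* no RMRR overlap */
            addr = (rsv[conflict].end + align) & ~(align - 1); /* skip past it */
        }
        return 0;
    }

The reserved-range list would come from whichever channel 5> below settles
on for getting RMRR information into hvmloader.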
> >>
> >>   5>. hvmloader:build_e820_table():
> >>
> >>       Finally we need to let the VM know that the RMRR regions are
> >>       reserved, through the e820 table.
> >>
> >>       It's a little bit tricky to handle this inside hvmloader since, as
> >>       you know, struct hvm_info_table is the only channel between libxc
> >>       and hvmloader. Moreover, patching up all the places above brings
> >>       some duplicated logic, especially between libxc and hvmloader,
> >>       which skip the same regions in the guest physical layout (one when
> >>       populating guest RAM, the other when constructing the e820 table).
> >>       And the current design is limited in what information it can pass
> >>       between libxc and hvmloader:
> >>
> >>           struct hvm_info_table {
> >>               ...
> >>               /*
> >>                * 0x0 to page_to_phys(low_mem_pgend)-1:
> >>                *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
> >>                */
> >>               uint32_t low_mem_pgend;
> >>               ...
> >>               /*
> >>                * 0x100000000 to page_to_phys(high_mem_pgend)-1:
> >>                *    RAM above 4GB
> >>                */
> >>               uint32_t high_mem_pgend;
> >>               ...
> >>           };
> >>
> >>       Nothing of value here is passed to hvmloader, so we have to figure
> >>       out a way to handle those points in hvmloader. Currently:
> >>
> >>       #1 Reconstruct hvm_info_table{}
> >>
> >>          We could store all the necessary RMRR info in hvm_info_table as
> >>          well and then pick it up conveniently, but these RMRR entries
> >>          are often scattered and their number is undetermined, so this
> >>          is a little bit ugly.
> >>
> >>       #2 Xenstore
> >>
> >>          We could store all the necessary RMRR info in Xenstore and then
> >>          pick it up in hvmloader.
> >>
> >>       #3 A hypercall
> >>
> >>          We could call our new hypercall again to pick it up, but this
> >>          obviously introduces some duplicated code.
> >>
> >>       #4 The XENMEM_{set_,}memory_map pair of hypercalls
> >>
> >>          As Jan commented, this "could be used (and is readily available
> >>          to be extended that way, since for HVM domains
> >>          XENMEM_set_memory_map returns -EPERM at present). The only
> >>          potentially problematic aspect I can see with using it might be
> >>          its limiting of the entry count to E820MAX." So a preparation
> >>          patch is required before the RMRR series, and hvmloader still
> >>          needs to query the RMRR information.
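
Whichever of #1-#4 is chosen as the transport, the hvmloader side of 5> then
reduces to appending reserved entries while building the table. A minimal
sketch; the struct layout mirrors the usual e820 format, and the rdm_range
array stands in for whatever the chosen channel delivers:

    #include <stdint.h>

    #define E820_RESERVED 2
    #define E820MAX       128

    struct e820entry {
        uint64_t addr;
        uint64_t size;
        uint32_t type;
    };

    struct rdm_range { uint64_t base, end; };    /* inclusive RMRR range */

    /* Append one reserved e820 entry per RMRR; new count, or -1 if full. */
    int e820_add_rmrr_entries(struct e820entry *e820, int nr,
                              const struct rdm_range *rdm, int nr_rdm)
    {
        for (int i = 0; i < nr_rdm; i++) {
            if (nr >= E820MAX)
                return -1;      /* the E820MAX limit mentioned for option #4 */
            e820[nr].addr = rdm[i].base;
            e820[nr].size = rdm[i].end - rdm[i].base + 1;
            e820[nr].type = E820_RESERVED;
            nr++;
        }
        return nr;
    }

A real implementation would also need to keep the resulting table sorted and
non-overlapping with the RAM and MMIO entries built around it.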
> >
> > Somewhere above I'm missing the mentioning of the option to reorder
> > operations of hvmloader, to e.g. base PCI BAR assignment on the
> > previously set up E820 table, rather than assuming a (mostly) fixed
> > size window where to put these.
> >
> >> 3. Group multiple devices sharing the same RMRR entry
> >>
> >> current status:
> >>   * Xen doesn't distinguish whether multiple devices share the same RMRR
> >>     entry. This may lead to a leak or corruption, e.g. device A and
> >>     device B share the same entry, and then device A is assigned to
> >>     domain A while device B is assigned to domain B. Domain A can then
> >>     read data from that range, or even maliciously rewrite it to corrupt
> >>     device B inside domain B.
> >>
> >> proposal:
> >>   * We will group all devices that share one and the same RMRR entry.
> >>     Theoretically we should make sure all devices in one group can only
> >>     be assigned to one VM. But on the Xen side the hardware domain owns
> >>     all devices at first, so we would have to unbind all group devices
> >>     before assigning any one of them, and that is hard to do in the
> >>     current VT-d code. Such device groups are actually rare in the real
> >>     world, so we simply prevent assigning any group device.
> >>   * We will introduce a new field, gid, in struct acpi_rmrr_unit:
> >>       gid: indicates whether this device belongs to a group.
> >>     Then when we add or attach a device on the IOMMU side, we check this
> >>     field; if we're assigning a group device, we decide whether to allow
> >>     that assignment.
> >>
> >> expectation:
> >>   * Make all devices associated with one RMRR be assigned to the same VM.
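
For illustration, the gid check could look roughly like this. It is a
standalone sketch: the structure and function name are assumptions rather
than the existing VT-d code, and it shows the "same VM" policy instead of
the simpler "refuse any group device" fallback described above:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Illustrative per-device state; the real code would hang gid off the
     * RMRR/device bookkeeping (e.g. the proposed field in acpi_rmrr_unit).
     */
    struct dev_state {
        uint32_t sbdf;
        int      gid;     /* 0: no shared RMRR; > 0: id of the shared group */
        int      owner;   /* -1: not assigned to any guest */
    };

    /*
     * Allow assigning 'dev' to 'domid' only if every other member of its
     * RMRR group is unassigned or already owned by the same domain.
     */
    bool rmrr_group_assign_ok(const struct dev_state *devs, int nr,
                              const struct dev_state *dev, int domid)
    {
        if (dev->gid == 0)
            return true;

        for (int i = 0; i < nr; i++) {
            if (&devs[i] == dev || devs[i].gid != dev->gid)
                continue;
            if (devs[i].owner != -1 && devs[i].owner != domid)
                return false;    /* group member already assigned elsewhere */
        }
        return true;
    }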
> >
> > A few lines up you say you're intending to refuse such assignments
> > rather than making them happen in a consistent way - what's the
> > plan in the end?
> >
> >> 4. Handle forced accesses to RMRR regions
> >>
> >> proposal:
> >>   * Although we already reserve these ranges in the guest e820 table,
> >>     it is still possible for the guest to access them.
> >>     In the non-shared EPT case that triggers an EPT violation, since we
> >>     actually have no p2m entry there; but that is the same as accessing
> >>     any other reserved range, so we just rely on Xen's default behavior.
> >>     In the shared EPT case the guest can read or write anything in this
> >>     range, but such a leak or corruption happens only inside the same
> >>     domain, so this behavior is the same as the native case; it should
> >>     be fine.
> >>
> >> expectation:
> >>   * Don't break normal RMRR usage.
> >
> > I'm not getting the purpose of this whole section.
> >
> >> 5. Re-enable USB
> >>
> >> current status:
> >> * Currently Xen doesn't allow USB passthrough.
> >
> > ???
> >
> >> 6. Hotplug case
> >>
> >> current status:
> >>   * only works well in the non-shared EPT case, or in the shared EPT
> >>     case when all the associated gfns are free.
> >>
> >> proposal:
> >>   * If the user knows in advance that there will be a hotplug device,
> >>     'rdm_forcecheck' can be set to reserve all RMRR ranges as described
> >>     above. Then any such device can be hotplugged successfully.
> >>   * If not, there are still two scenarios here:
> >>     in the non-shared EPT case it still works as it does today;
> >
> > So you continue to assume that the two tables going out of sync is
> > acceptable.
> >
> >>     in the shared EPT case it becomes just another instance of
> >>     "1. Setup RMRR identity mapping":
> >>       - gfn space unoccupied -> insert mapping: success.
> >>       - gfn space already occupied by 1:1 RMRR mapping -> do nothing;
> >>         success.
> >>       - gfn space already occupied by other mapping -> fail.
> >>
> >> expectation:
> >>   * provide a way to let hotplug fully work in the shared EPT case
> >>
> >> Open Issue
> >> ==========
> >>
> >> When other mechanisms such as ballooning populate memory that touches an
> >> RMRR range, or others such as qemu forcibly map an RMRR range, we may
> >> need to prevent that; but we're not 100% sure, so we leave this open for
> >> double-checking.
> >
> > I think a HVM guest can be expected to play by what E820 table
> > says is reserved. That includes not using such ranges for ballooning.
> > If it still does, it'll get what it deserves.
> >
> > Jan
> >
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel