Re: [Xen-devel] RMRR Fix Design for Xen
- To: Tiejun Chen <tiejun.chen@xxxxxxxxx>, JBeulich@xxxxxxxx, konrad.wilk@xxxxxxxxxx, tim@xxxxxxx, kevin.tian@xxxxxxxxx, yang.z.zhang@xxxxxxxxx, stefano.stabellini@xxxxxxxxxxxxx, ian.campbell@xxxxxxxxxx, ian.jackson@xxxxxxxxxxxxx, wei.liu2@xxxxxxxxxx
- From: Fabio Fantoni <fabio.fantoni@xxxxxxx>
- Date: Fri, 19 Dec 2014 10:49:59 +0100
- Cc: xen-devel@xxxxxxxxxxxxx
- Delivery-date: Fri, 19 Dec 2014 09:50:02 +0000
- List-id: Xen developer discussion <xen-devel.lists.xen.org>
On 19/12/2014 02:21, Tiejun Chen wrote:
RMRR Fix Design for Xen
This design aims to fix RMRR handling in Xen. It covers four sections,
as follows:
* Background
* What is RMRR
* Current RMRR Issues
* Design Overview
We hope this helps everyone understand the current problem, so we can agree
on a clean solution and move forward.
Background
==========
We first identified this RMRR defect when trying to pass through the IGD
device, which can be simply fixed by adding an identity mapping in the shared
EPT case. However, through the community discussion it boiled down to a more
general RMRR problem: the identity mapping is force-added by the hypervisor
without checking whether it conflicts with existing guest PFN ranges. As a
general solution we need to invent a new mechanism so that reserved ranges
known to the hypervisor can be exported to the user-space toolstack and
hvmloader. Conflicts can then be detected when constructing the guest PFN
layout, with best-effort avoidance policies to further help.
What is RMRR
============
RMRR is an acronym for Reserved Memory Region Reporting.
The BIOS may report each such reserved memory region through the RMRR
structures, along with the devices that require access to the specified
reserved memory region. Reserved memory ranges that are either not DMA
targets, or memory ranges that may be the target of BIOS-initiated DMA only
during the pre-boot phase (such as from a boot disk drive), must not be
included in the reserved memory region reporting. The base address of each
RMRR region must be 4KB aligned and the size must be an integer multiple of
4KB. The BIOS must report the RMRR-reported memory addresses as reserved in
the system memory map returned through methods such as INT15, EFI
GetMemoryMap, etc. The reserved memory region reporting structures are
optional. If there are no RMRR structures, the system software concludes that
the platform does not have any reserved memory ranges that are DMA targets.
The RMRR regions are expected to be used for legacy usages (such as USB, UMA
Graphics, etc.) requiring reserved memory. Platform designers should avoid or
limit the use of reserved memory regions, since these require system software
to create holes in the DMA virtual address range available to system software
and its drivers.
The following is grabbed from my BDW:
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679: RMRR region: base_addr ab80a000 end_address ab81dfff
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679: RMRR region: base_addr ad000000 end_address af7fffff
Here USB occupies 0xab80a000:0xab81dfff, IGD owns 0xad000000:0xaf7fffff.
Note that a given platform may report zero or more RMRR regions, that
multiple devices may share one RMRR range, and that an RMRR can be located
anywhere in the physical address space.
Current RMRR Issues
===================
#1 RMRR may conflict with RAM, MMIO or other ranges at the guest physical level.
Sorry if my question is off-topic; I don't have good knowledge of this area.
Do Xen domUs require that all memory regions be correctly defined in
hvmloader, or are they handled automatically and correctly for any emulated
devices assigned to domUs?
I'm unable to get qxl vga working in Linux domUs and haven't been able to
find the problem so far. I saw a memory warning in the system logs of a domU
with qxl, which makes me think that perhaps the memory regions of qxl are not
handled properly in hvmloader.
The warning I found in one Fedora domU with qxl is:
ioremap error for 0xfc001000-0xfc002000, requested 0x10, got 0x0
Here is a post about it with logs, compared also with stdvga tests and a kvm
(with qxl fully working) test:
http://lists.xen.org/archives/html/xen-devel/2013-10/msg00016.html
Here is the qxl support for libxl patch, updated to the latest Xen, if someone
wants to quickly try it:
https://github.com/Fantu/Xen/commit/fadecf8d6ee0e8c7e421fafba67aa11879e8b8fe
Can someone tell me whether this could be an hvmloader memory problem, or
whether it is unrelated?
Thanks for any reply, and sorry for my bad English.
#2 Xen doesn't create the RMRR mapping in the shared EPT case, so the assigned
device can't work well.
#3 Xen doesn't consider the case where multiple devices share one RMRR entry.
This can also cause damage across VMs when we assign such devices to
different VMs.
#4 Some devices, such as USB, are still restricted by the current RMRR
implementation. We should address this case.
Design Overview
===============
First of all, we need to make sure no guest resources overlap any RMRR. Then,
in the shared EPT case, we can set the identity entries. Finally, we will
group all devices associated with the same RMRR entry and make sure all
devices in a group are assigned to the same VM.
1. Setup RMRR identity mapping
current status:
* identity mapping only setup in non-shared ept case
proposal:
In the non-shared EPT case, the IOMMU code always sets these entries, and the
RMRR is already marked reserved on the host, so that is fine. But in the
shared EPT case we need to check for conflicts, so we should proceed as follows:
- gfn space unoccupied
-> insert mapping: success.
gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, p2m_access_rw
- gfn space already occupied by 1:1 RMRR mapping
-> do nothing; success.
- gfn space already occupied by other mapping
-> fail.
expectation:
* only devices w/ non-conflicting RMRR can be assigned
* fortunately this achieves the very initial intention to support IGD
pass-through on BDW
* provide last level protection in hypervisor to ensure a secure assignment
2. Check to reserve RMRR in guest physical level
current status:
* Xen doesn't reserve RMRR ranges at the guest physical level
proposal:
* Get RMRR info as necessary
* Then check all potential points to reserve or skip RMRR
expectation:
* Make sure no guest resources overlap any RMRR range.
2.1 Expose RMRR to user space
current status:
* Xen always records RMRR info in a list, acpi_rmrr_units, while parsing the
ACPI tables, so we can retrieve the info by walking that list.
proposal:
* RMRRs would be exposed by a new hypercall, which Jan already finished in
the current version, but it just exposes all RMRR info unconditionally.
* Furthermore, we can expose RMRRs on demand, to avoid unnecessarily
shrinking the guest RAM/MMIO space.
* So we will introduce a new parameter, 'rdm_forcecheck', which collaborates
with SBDFs to control which RMRRs should be exposed:
- We can set this parameter in the .cfg file, e.g.
rdm_forcecheck = 1
Of course this should be 0 by default. '1' means we force-check and
reserve all ranges unconditionally; if that fails, the VM is not created.
This also gives the user a chance to work well with later hotplug, even
if no device is assigned while creating the VM.
If 0, we just check the assigned PCI devices. As you know, we already
have an existing hypercall to assign PCI devices, so it looks like we
could work directly under that hypercall to get the necessary SBDFs to
sort out which RMRRs should be handled. But we need this info before we
populate guest memory, to make sure the RMRR ranges are excluded from
guest memory; unfortunately, memory population takes place before device
assignment, so we can't rely on that directly.
And as we discussed, reordering would only benefit the
assignment-at-creation case, not the hotplug case. So we have to
introduce a new DOMCTL to pass that global parameter together with the
SBDFs.
For example, if we own these two RMRR entries,
[00:14.0] RMRR region: base_addr ab80a000 end_address ab81dfff
[00:02.0] RMRR region: base_addr ad000000 end_address af7fffff
If 'rdm_forcecheck = 1', any caller of that hypercall gets these two
entries. But if 'rdm_forcecheck = 0', then in the .cfg file,
#1 'pci=[00:14.0, 00:02.0]' -> The caller still gets both entries.
#2 'pci=[00:02.0]' -> The caller gets just one entry,
0xad000000:0xaf7fffff
#3 'pci=[00:14.0]' -> The caller gets just one entry,
0xab80a000:0xab81dfff
#4 'pci=[others]' or no PCI configuration -> The caller gets nothing.
And ultimately, if we expose an RMRR entry at a given gfn:
in the non-shared EPT case,
p2m table: valid non-identity mapping, gfn:mfn != 1:1
VT-d table: no such mapping until we set the identity mapping,
gfn:_mfn(gfn) == 1:1, when an associated device is assigned.
in the shared EPT case,
p2m/VT-d table: always INVALID until we set the identity mapping,
gfn:_mfn(gfn) == 1:1, when an associated device is assigned.
But if we don't expose the RMRR entry:
in the non-shared EPT case,
p2m table: valid non-identity mapping, gfn:mfn != 1:1
VT-d table: no such mapping until we set the identity mapping,
gfn:_mfn(gfn) == 1:1, when an associated device is assigned.
in the shared EPT case,
p2m/VT-d table: if guest RAM already covers the gfn, we have a valid
non-identity mapping, gfn:mfn != 1:1, and we can't set an
identity mapping on top of it, so the associated device can't
be assigned successfully. If not, we set the identity mapping,
gfn:_mfn(gfn) == 1:1, and everything works.
expectation:
* Provide a way to get the necessary RMRR info.
Note that the new DOMCTL is already finished, but some code style issues
still need fixing.
2.2 Detect and avoid RMRR conflicts with guest RAM/MMIO
current status:
* Currently Xen doesn't detect or avoid any such conflicts.
proposal:
* We need to cover all the points where conflicts can arise. Below are the
places where reserved region conflicts should be detected:
1>. libxc:setup_guest():modules_init()
Some modules, like acpi_module and smbios_module, may occupy certain
ranges, so we need to check whether they conflict with any of the
ranges returned by the new hypercall above.
2>. libxc:setup_guest():xc_domain_populate_physmap()
There are two ways to exclude RMRR ranges here:
#1 Before populating guest RAM, obtain the RMRR ranges from the new
hypercall and skip them while populating.
#2 Filter the RMRR ranges inside Xen while handling
XENMEM_populate_physmap; this requires no change to libxc, but
skips setting p2m entries for RMRR ranges in the hypervisor.
Compared with #1, #2 is worse: Xen still allocates those ranges, we
would have to synchronize state elsewhere in Xen, and #2 effectively
shrinks the memory available to the guest. So #1 is cleaner.
3>. hvmloader:pci_setup()
Here we should allocate guest MMIO excluding all ranges from the new
hypercall, to avoid RMRR conflicts when allocating PCI MMIO BARs.
4>. hvmloader:mem_hole_alloc()
Here we should allocate memory holes excluding all ranges from the new
hypercall, to avoid RMRR conflicts for dynamic allocations in
hvmloader, e.g. as used for the IGD opregion.
5>. hvmloader:build_e820_table():
Finally, we need to let the VM know that the RMRR regions are reserved,
through the e820 table.
It's a little tricky to handle this inside hvmloader since, as you know,
struct hvm_info_table is the only channel between libxc and hvmloader.
Covering all the places above will also introduce some duplicated logic,
especially between libxc and hvmloader, which skip the same regions at the
guest physical layer (one while populating guest RAM, the other while
constructing the e820). But the current design is limited in what it can pass
between libxc and hvmloader:
struct hvm_info_table {
    ...
    /*
     * 0x0 to page_to_phys(low_mem_pgend)-1:
     *   RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
     */
    uint32_t low_mem_pgend;
    ...
    /*
     * 0x100000000 to page_to_phys(high_mem_pgend)-1:
     *   RAM above 4GB
     */
    uint32_t high_mem_pgend;
    ...
};
Nothing else of value is passed to hvmloader, so we have to figure out a way
to handle those points in hvmloader. Current options:
#1 Extend hvm_info_table{}
We could store all necessary RMRR info in hvm_info_table as well, and
pick it up conveniently; but these RMRR entries are scattered and their
number is undetermined, so this is somewhat ugly.
#2 Xenstore
We could store all necessary RMRR info in Xenstore and read it back in
hvmloader.
#3 A hypercall
We could call our new hypercall again to retrieve the info, but
obviously this introduces some duplicated code.
#4 XENMEM_{set_,}memory_map pair of hypercalls
As Jan commented it "could be used(and is readily available to be
extended that way, since for HVM domains XENMEM_set_memory_map
returns -EPERM at present). The only potentially problematic aspect
I can see with using it might be its limiting of the entry count to
E820MAX." So a preparation patch is required before RMRR, and
hvmloader still needs to query RMRR information.
expectation:
* Cover all points to make sure the guest doesn't occupy any RMRR range.
3. Group multiple devices sharing same RMRR entry
current status:
* Xen doesn't check whether multiple devices share the same RMRR entry. This
may lead to leaks or corruption between VMs: e.g. device A and device B
share the same entry, device A is assigned to domain A and device B to
domain B. Domain A can then read from that range, or even maliciously
rewrite it to corrupt device B inside domain B.
proposal:
* We will group all devices that share the same RMRR entry. In theory, we
should make sure all devices in a group are assigned to the same VM. But
on the Xen side the hardware domain owns all devices at first, so we would
have to unbind all devices of a group before assigning any one of them,
and that is hard to do in the current VT-d code. Since grouped devices are
rare in the real world, we simply prevent assigning any grouped device.
* We will introduce a new field, gid, in struct acpi_rmrr_unit:
gid: indicates whether this device belongs to a group.
Then when we add or attach a device on the IOMMU side, we check this field;
if we're assigning a grouped device, we decide whether to allow the
assignment.
expectation:
* Make sure all devices associated with one RMRR are assigned to the same VM.
4. Handle forced accesses to RMRR regions
proposal:
* Although we already reserve these ranges in the guest e820 table, it's
still possible for the guest to access them.
In the non-shared EPT case such an access triggers an EPT violation, since
we have no p2m entry; this is the same as accessing any other reserved
range, so we just rely on Xen's default behavior. In the shared EPT case
the guest can read or write anything in this range, but such a leak or
corruption happens inside the same domain, so the behavior matches the
native case and should be fine.
expectation:
* Don't break normal RMRR usage.
5. Re-enable USB
current status:
* Currently Xen doesn't allow USB passthrough.
proposal:
* Clean up some legacy code related to RMRR to accommodate our RMRR
support.
expectation:
* Make USB work well after we handle RMRRs properly.
6. Hotplug case
current status:
* Hotplug only works well in the non-shared EPT case, or in the shared EPT
case when all associated gfns are free.
proposal:
* If the user knows in advance that a device will be hotplugged,
'rdm_forcecheck' can be set to reserve all RMRR ranges as described above.
Then any device can be hotplugged successfully.
* If not, there are two scenarios:
in the non-shared EPT case, hotplug still works as before;
in the shared EPT case, it follows the same logic as "1. Setup RMRR
identity mapping":
- gfn space unoccupied -> insert mapping: success.
- gfn space already occupied by a 1:1 RMRR mapping -> do nothing;
success.
- gfn space already occupied by another mapping -> fail.
expectation:
* Provide a way to make hotplug work fully in the shared EPT case.
Open Issue
==========
When other mechanisms, such as ballooning, populate memory touching an RMRR
range, or others, like qemu, forcibly map an RMRR range, we may need to
prevent this; but we're not 100% sure, so we raise it here for double-checking.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel