Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only
On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki wrote:
> > In some cases, only a few registers on a page need to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either), although
> > in the latter case the spec forbids placing other registers on the same
> > page. Current API allows only marking whole pages read-only,
> > which sometimes may cover other registers that guest may need to
> > write into.
> >
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > from userspace (like, /dev/mem), it will try to fixup by updating page
> > tables (that Xen again will force to read-only) and will hit that #PF
> > again (looping endlessly). Both behaviors are undesirable if guest could
> > actually be allowed the write.
> >
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables (they are in EPT,
> > but not granular enough), do this via emulation (or simply page fault
> > handler for PV) that handles writes that are supposed to be allowed.
> > The new subpage_mmio_ro_add() takes a start physical address and the
> > region size in bytes. Both start address and the size need to be 8-byte
> > aligned, as a practical simplification (allows using smaller bitmask,
> > and a smaller granularity isn't really necessary right now).
> > It will internally add relevant pages to mmio_ro_ranges, but if either
> > start or end address is not page-aligned, it additionally adds that page
> > to a list for sub-page R/O handling. The list holds a bitmask which
> > qwords are supposed to be read-only and an address where page is mapped
> > for write emulation - this mapping is done only on the first access. A
> > plain list is used instead of more efficient structure, because there
> > isn't supposed to be many pages needing this precise r/o control.
> >
> > The mechanism this API is plugged in is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > violation on p2m_mmio_direct page results in a direct domain_crash() for
> > non hardware domains. To reach mmio_ro_emulated_write(), change how
> > write violations for p2m_mmio_direct are handled - specifically, check
> > if they relate to such partially protected page via
> > subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
> > them too. This decodes what guest is trying to write and finally calls
> > mmio_ro_emulated_write(). The EPT write violation is detected as
> > npfec.write_access and npfec.present both being true (similar to other
> > places), which may cover some other (future?) cases - if that happens,
> > emulator might get involved unnecessarily, but since it's limited to
> > pages marked with subpage_mmio_ro_add() only, the impact is minimal.
> > Both of those paths need an MFN to which guest tried to write (to check
> > which part of the page is supposed to be read-only, and where
> > the page is mapped for writes). This information currently isn't
> > available directly in mmio_ro_emulated_write(), but in both cases it is
> > already resolved somewhere higher in the call tree. Pass it down to
> > mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> >
> > This may give a bit more access to the instruction emulator to HVM
> > guests (the change in hvm_hap_nested_page_fault()), but only for pages
> > explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
> > passed-through device partially used by Xen.
> > As of the next patch, it applies only to a configuration explicitly
> > documented as not security supported.
> >
> > The subpage_mmio_ro_add() function cannot be called with overlapping
> > ranges, and on pages already added to mmio_ro_ranges separately.
> > Successful calls would result in correct handling, but error paths may
> > result in incorrect state (like pages removed from mmio_ro_ranges too
> > early). Debug build has asserts for relevant cases.
> >
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> > ---
> > Shadow mode is not tested, but I don't expect it to work differently than
> > HAP in areas related to this patch.
> >
> > Changes in v4:
> > - rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
> > - guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
> > there
> > - rename ro_qwords to ro_elems
> > - use unsigned arguments for subpage_mmio_ro_remove_page()
> > - use volatile for __iomem
> > - do not set mmio_ro_ctxt.mfn for mmcfg case
> > - comment where fields of mmio_ro_ctxt are used
> > - use bool for result of __test_and_set_bit
> > - do not open-code mfn_to_maddr()
> > - remove leftover RCU
> > - mention hvm_hap_nested_page_fault() explicitly in the commit message
> > Changes in v3:
> > - use unsigned int for loop iterators
> > - use __set_bit/__clear_bit when under spinlock
> > - avoid ioremap() under spinlock
> > - do not cast away const
> > - handle unaligned parameters in release build
> > - comment fixes
> > - remove RCU - the add functions are __init and actual usage is only
> > much later after domains are running
> > - add checks for overlapping ranges in debug build and document the
> > limitations
> > - change subpage_mmio_ro_add() so the error path doesn't potentially
> > remove pages from mmio_ro_ranges
> > - move printing message to avoid one goto in
> > subpage_mmio_write_emulate()
> > Changes in v2:
> > - Simplify subpage_mmio_ro_add() parameters
> > - add to mmio_ro_ranges from within subpage_mmio_ro_add()
> > - use ioremap() instead of caller-provided fixmap
> > - use 8-bytes granularity (largest supported single write) and a bitmap
> > instead of a rangeset
> > - clarify commit message
> > - change how it's plugged in for HVM domain, to not change the behavior for
> > read-only parts (keep it hitting domain_crash(), instead of ignoring
> > write)
> > - remove unused subpage_mmio_ro_remove()
> > ---
> > xen/arch/x86/hvm/emulate.c | 2 +-
> > xen/arch/x86/hvm/hvm.c | 4 +-
> > xen/arch/x86/include/asm/mm.h | 25 +++-
> > xen/arch/x86/mm.c | 273 +++++++++++++++++++++++++++++++++-
> > xen/arch/x86/pv/ro-page-fault.c | 6 +-
> > 5 files changed, 305 insertions(+), 5 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> > index ab1bc516839a..e98513afc69b 100644
> > --- a/xen/arch/x86/hvm/emulate.c
> > +++ b/xen/arch/x86/hvm/emulate.c
> > @@ -2735,7 +2735,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned
> > long gla)
> > .write = mmio_ro_emulated_write,
> > .validate = hvmemul_validate,
> > };
> > - struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> > + struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
> > struct hvm_emulate_ctxt ctxt;
> > const struct x86_emulate_ops *ops;
> > unsigned int seg, bdf;
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 9594e0a5c530..73bbfe2bdc99 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -2001,8 +2001,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned
> > long gla,
> > goto out_put_gfn;
> > }
> >
> > - if ( (p2mt == p2m_mmio_direct) && is_hardware_domain(currd) &&
> > - npfec.write_access && npfec.present &&
> > + if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> > + (is_hardware_domain(currd) || subpage_mmio_write_accept(mfn, gla)) &&
> > (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> > {
> > rc = 1;
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index 98b66edaca5e..d04cf2c4165e 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,34 @@ extern struct rangeset *mmio_ro_ranges;
> > void memguard_guard_stack(void *p);
> > void memguard_unguard_stack(void *p);
> >
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Range specified here
> > + * will still be R/O, but the rest of the page (not marked as R/O via
> > another
> > + * call) will have writes passed through.
> > + * The start address and the size must be aligned to MMIO_RO_SUBPAGE_GRAN.
> > + *
> > + * This API cannot be used for overlapping ranges, nor for pages already
> > added
> > + * to mmio_ro_ranges separately.
> > + *
> > + * Since there is currently no subpage_mmio_ro_remove(), relevant device
> > should
> > + * not be hot-unplugged.
> > + *
> > + * Return values:
> > + * - negative: error
> > + * - 0: success
> > + */
> > +#define MMIO_RO_SUBPAGE_GRAN 8
> > +int subpage_mmio_ro_add(paddr_t start, size_t size);
> > +#ifdef CONFIG_HVM
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla);
> > +#endif
> > +
> > struct mmio_ro_emulate_ctxt {
> > unsigned long cr2;
> > + /* Used only for mmcfg case */
> > unsigned int seg, bdf;
> > + /* Used only for non-mmcfg case */
> > + mfn_t mfn;
> > };
> >
> > int cf_check mmio_ro_emulated_write(
> > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > index d968bbbc7315..dab7cc018c3f 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -150,6 +150,17 @@ bool __read_mostly machine_to_phys_mapping_valid;
> >
> > struct rangeset *__read_mostly mmio_ro_ranges;
> >
> > +/* Handling sub-page read-only MMIO regions */
> > +struct subpage_ro_range {
> > + struct list_head list;
> > + mfn_t mfn;
> > + void __iomem *mapped;
> > + DECLARE_BITMAP(ro_elems, PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN);
> > +};
> > +
> > +static LIST_HEAD(subpage_ro_ranges);
> > +static DEFINE_SPINLOCK(subpage_ro_lock);
> > +
> > static uint32_t base_disallow_mask;
> > /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings.
> > */
> > #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) &
> > ~_PAGE_GLOBAL)
> > @@ -4910,6 +4921,265 @@ long arch_memory_op(unsigned long cmd,
> > XEN_GUEST_HANDLE_PARAM(void) arg)
> > return rc;
> > }
> >
> > +/*
> > + * Mark part of the page as R/O.
> > + * Returns:
> > + * - 0 on success - first range in the page
> > + * - 1 on success - subsequent range in the page
> > + * - <0 on error
> > + *
> > + * This needs subpage_ro_lock already taken.
> > + */
> > +static int __init subpage_mmio_ro_add_page(
> > + mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
>
> Nit: parameters here seem to be indented differently than below.
>
> > +{
> > + struct subpage_ro_range *entry = NULL, *iter;
> > + unsigned int i;
> > +
> > + list_for_each_entry(iter, &subpage_ro_ranges, list)
> > + {
> > + if ( mfn_eq(iter->mfn, mfn) )
> > + {
> > + entry = iter;
> > + break;
> > + }
> > + }
>
> AFAICT you could put the search logic into a separate function and use
> it here, plus in subpage_mmio_ro_remove_page(),
> subpage_mmio_write_emulate() and subpage_mmio_write_accept() possibly.

Good idea.

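A minimal sketch of what such a lookup helper could look like. It uses a
plain singly-linked list and a uint64_t MFN instead of Xen's list_head and
mfn_t so that it stands alone; the struct ro_entry type and the name
subpage_mmio_find_page() are hypothetical and not taken from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct subpage_ro_range (assumption, not Xen code). */
struct ro_entry {
    struct ro_entry *next;
    uint64_t mfn;
};

/* Hypothetical helper: return the entry tracking @mfn, or NULL if none. */
static struct ro_entry *subpage_mmio_find_page(struct ro_entry *head,
                                               uint64_t mfn)
{
    struct ro_entry *iter;

    for ( iter = head; iter; iter = iter->next )
        if ( iter->mfn == mfn )
            return iter;

    return NULL;
}
```

Each of the four call sites would then start with a single call and a NULL
check instead of open-coding the loop.
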
> > + if ( !entry )
> > + {
> > + /* iter == NULL marks it was a newly allocated entry */
> > + iter = NULL;
> > + entry = xzalloc(struct subpage_ro_range);
> > + if ( !entry )
> > + return -ENOMEM;
> > + entry->mfn = mfn;
> > + }
> > +
> > + for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > + {
> > + bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > + entry->ro_elems);
> > + ASSERT(!oldbit);
> > + }
> > +
> > + if ( !iter )
> > + list_add(&entry->list, &subpage_ro_ranges);
> > +
> > + return iter ? 1 : 0;
> > +}
> > +
> > +/* This needs subpage_ro_lock already taken */
> > +static void __init subpage_mmio_ro_remove_page(
> > + mfn_t mfn,
> > + unsigned int offset_s,
> > + unsigned int offset_e)
> > +{
> > + struct subpage_ro_range *entry = NULL, *iter;
> > + unsigned int i;
> > +
> > + list_for_each_entry(iter, &subpage_ro_ranges, list)
> > + {
> > + if ( mfn_eq(iter->mfn, mfn) )
> > + {
> > + entry = iter;
> > + break;
> > + }
> > + }
> > + if ( !entry )
> > + return;
> > +
> > + for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > + __clear_bit(i / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems);
> > +
> > + if ( !bitmap_empty(entry->ro_elems, PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN) )
> > + return;
> > +
> > + list_del(&entry->list);
> > + if ( entry->mapped )
> > + iounmap(entry->mapped);
> > + xfree(entry);
> > +}
> > +
> > +int __init subpage_mmio_ro_add(
> > + paddr_t start,
> > + size_t size)
> > +{
> > + mfn_t mfn_start = maddr_to_mfn(start);
> > + paddr_t end = start + size - 1;
> > + mfn_t mfn_end = maddr_to_mfn(end);
> > + unsigned int offset_end = 0;
> > + int rc;
> > + bool subpage_start, subpage_end;
> > +
> > + ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > + ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > + if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > + size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> > +
> > + if ( !size )
> > + return 0;
> > +
> > + if ( mfn_eq(mfn_start, mfn_end) )
> > + {
> > + /* Both starting and ending parts handled at once */
> > + subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > + subpage_end = false;
>
> Given the intended usage of this, don't we want to limit to only a
> single page? So that PFN_DOWN(start + size) == PFN_DOWN(start), as
> that would simplify the logic here?

I have considered that, but I haven't found anything in the spec
mandating that the XHCI DbC registers not cross a page boundary. Currently
(on the system I test this on) they don't cross a page boundary, but I
don't want to assume extra constraints, to avoid issues like before (on
the older system I tested, the DbC registers didn't share a page with
other registers, but on newer hardware they did).

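To illustrate the multi-page case, here is a standalone sketch of how a
byte range is classified into partially covered first and last pages. It
mirrors the subpage_start/subpage_end logic quoted above, but struct split
and split_range() are hypothetical names, 4k pages are assumed, and size
is assumed to be non-zero:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_OFFSET(a) ((unsigned int)((a) & (PAGE_SIZE - 1)))

/* Hypothetical result type: which boundary pages are only partly R/O. */
struct split {
    uint64_t mfn_start, mfn_end;
    bool subpage_start, subpage_end;
};

/* Classify the range [start, start + size - 1]; size must be non-zero. */
static struct split split_range(uint64_t start, uint64_t size)
{
    uint64_t end = start + size - 1;
    struct split s = {
        .mfn_start = start >> PAGE_SHIFT,
        .mfn_end = end >> PAGE_SHIFT,
    };

    if ( s.mfn_start == s.mfn_end )
    {
        /* Single page: start and end parts are handled at once. */
        s.subpage_start = PAGE_OFFSET(start) ||
                          PAGE_OFFSET(end) != PAGE_SIZE - 1;
        s.subpage_end = false;
    }
    else
    {
        s.subpage_start = PAGE_OFFSET(start) != 0;
        s.subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
    }

    return s;
}
```

For example, a 0x80-byte range starting at 0x1fc0 ends at 0x203f, so both
the first and the last page need sub-page handling.
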
> Mostly asking because I think for the usage of XHCI the registers that
> need to be marked RO are all inside the same page, and hence would
> like to avoid introducing logic to handle multipage ranges if that's
> not tested at all.
>
> > + }
> > + else
> > + {
> > + subpage_start = PAGE_OFFSET(start);
> > + subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > + }
> > +
> > + spin_lock(&subpage_ro_lock);
>
> Do you really need the lock if modifications can only happen during
> init? Xen initialization is single threaded, so you can likely avoid
> the lock during boot.

With adding (and removing) firmly tied to init (via __ro_after_init), I
think I'm okay with dropping the spinlock here. Yet, it's still needed
for mapping the page.

> > +
> > + if ( subpage_start )
> > + {
> > + offset_end = mfn_eq(mfn_start, mfn_end) ?
> > + PAGE_OFFSET(end) :
> > + (PAGE_SIZE - 1);
> > + rc = subpage_mmio_ro_add_page(mfn_start,
> > + PAGE_OFFSET(start),
> > + offset_end);
> > + if ( rc < 0 )
> > + goto err_unlock;
> > + /* Check if not marking R/W part of a page intended to be fully
> > R/O */
> > + ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> > + mfn_x(mfn_start)));
>
> I think it would be better if this check was done ahead, and an error
> was returned. I see no point in delaying the check until the region
> has already been registered.

I need the return value from subpage_mmio_ro_add_page() for this check,
because currently it's okay to mark further regions read-only (at which
point the page is already on mmio_ro_ranges). Theoretically I could
limit the scope of this API even further, to just one R/O region per
page, but even in the XHCI driver I can imagine needing to mark more
regions (which might share a page, depending on hardware layout) in
some future version that gains more features.

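As a standalone illustration of why the return value matters, here is a
sketch of a per-page bitmap with one bit per 8-byte element, where adding
a region reports whether it was the first one on the page. The struct
ro_page type and the ro_page_add()/ro_page_is_ro() names are hypothetical;
the patch itself uses DECLARE_BITMAP() and __test_and_set_bit():

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE            4096u
#define MMIO_RO_SUBPAGE_GRAN 8u
#define RO_ELEMS             (PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN) /* 512 bits */

/* Hypothetical per-page state: one bit per 8-byte element. */
struct ro_page {
    bool in_use;
    uint64_t ro_elems[RO_ELEMS / 64];
};

/*
 * Mark bytes [offset_s, offset_e] of the page R/O.
 * Returns 0 for the first region on the page, 1 for any later one.
 */
static int ro_page_add(struct ro_page *pg, unsigned int offset_s,
                       unsigned int offset_e)
{
    bool first = !pg->in_use;
    unsigned int i;

    for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
    {
        unsigned int bit = i / MMIO_RO_SUBPAGE_GRAN;

        pg->ro_elems[bit / 64] |= UINT64_C(1) << (bit % 64);
    }
    pg->in_use = true;

    return first ? 0 : 1;
}

/* Is the 8-byte element containing @offset marked R/O? */
static bool ro_page_is_ro(const struct ro_page *pg, unsigned int offset)
{
    unsigned int bit = offset / MMIO_RO_SUBPAGE_GRAN;

    return (pg->ro_elems[bit / 64] >> (bit % 64)) & 1;
}
```

A caller seeing 0 knows the page should not yet be on mmio_ro_ranges,
while 1 means an earlier call has already put it there.
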
> > + }
> > +
> > + if ( subpage_end )
> > + {
> > + rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
> > + if ( rc < 0 )
> > + goto err_unlock_remove;
> > + /* Check if not marking R/W part of a page intended to be fully
> > R/O */
> > + ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> > + mfn_x(mfn_end)));
> > + }
> > +
> > + spin_unlock(&subpage_ro_lock);
> > +
> > + rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start),
> > mfn_x(mfn_end));
> > + if ( rc )
> > + goto err_remove;
> > +
> > + return 0;
> > +
> > + err_remove:
> > + spin_lock(&subpage_ro_lock);
> > + if ( subpage_end )
> > + subpage_mmio_ro_remove_page(mfn_end, 0, PAGE_OFFSET(end));
> > + err_unlock_remove:
> > + if ( subpage_start )
> > + subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start),
> > offset_end);
> > + err_unlock:
> > + spin_unlock(&subpage_ro_lock);
> > + return rc;
> > +}
> > +
> > +static void __iomem *subpage_mmio_map_page(
> > + struct subpage_ro_range *entry)
> > +{
> > + void __iomem *mapped_page;
> > +
> > + if ( entry->mapped )
> > + return entry->mapped;
> > +
> > + mapped_page = ioremap(mfn_to_maddr(entry->mfn), PAGE_SIZE);
> > +
> > + spin_lock(&subpage_ro_lock);
> > + /* Re-check under the lock */
> > + if ( entry->mapped )
> > + {
> > + spin_unlock(&subpage_ro_lock);
> > + if ( mapped_page )
> > + iounmap(mapped_page);
> > + return entry->mapped;
> > + }
> > +
> > + entry->mapped = mapped_page;
> > + spin_unlock(&subpage_ro_lock);
> > + return entry->mapped;
> > +}
> > +
> > +static void subpage_mmio_write_emulate(
> > + mfn_t mfn,
> > + unsigned int offset,
> > + const void *data,
> > + unsigned int len)
> > +{
> > + struct subpage_ro_range *entry;
> > + volatile void __iomem *addr;
> > +
> > + list_for_each_entry(entry, &subpage_ro_ranges, list)
> > + {
> > + if ( mfn_eq(entry->mfn, mfn) )
> > + {
> > + if ( test_bit(offset / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems) )
> > + {
> > + write_ignored:
> > + gprintk(XENLOG_WARNING,
> > + "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
> > + mfn_x(mfn), offset, len);
> > + return;
> > + }
> > +
> > + addr = subpage_mmio_map_page(entry);
>
> Given the very limited usage of this subpage RO infrastructure, I
> would be tempted to just map the mfn when the page is registered, in
> order to simplify the logic here. The only use-case we have is XHCI,
> and further usage of this are likely to be limited to similar hardware
> that's shared between Xen and the hardware domain.

In an earlier similar series (which was about 1 or 2 pages in practice
per device) Jan requested lazy mapping, so I did it similarly in
this series too.

> > + if ( !addr )
> > + {
> > + gprintk(XENLOG_ERR,
> > + "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
> > + mfn_x(mfn), offset);
> > + return;
> > + }
> > +
> > + switch ( len )
> > + {
> > + case 1:
> > + writeb(*(const uint8_t*)data, addr);
> > + break;
> > + case 2:
> > + writew(*(const uint16_t*)data, addr);
> > + break;
> > + case 4:
> > + writel(*(const uint32_t*)data, addr);
> > + break;
> > + case 8:
> > + writeq(*(const uint64_t*)data, addr);
> > + break;
> > + default:
> > + /* mmio_ro_emulated_write() already validated the size */
> > + ASSERT_UNREACHABLE();
> > + goto write_ignored;
> > + }
> > + return;
> > + }
> > + }
> > + /* Do not print message for pages without any writable parts. */
> > +}
> > +
> > +#ifdef CONFIG_HVM
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> > +{
> > + unsigned int offset = PAGE_OFFSET(gla);
> > + const struct subpage_ro_range *entry;
> > +
> > + list_for_each_entry(entry, &subpage_ro_ranges, list)
> > + if ( mfn_eq(entry->mfn, mfn) &&
> > + !test_bit(offset / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems) )
> > + {
> > + /*
> > + * We don't know the write size at this point yet, so it could
> > be
> > + * an unaligned write, but accept it here anyway and deal with
> > it
> > + * later.
> > + */
> > + return true;
>
> For accesses that fall into the RO region, I think you need to accept
> them here and just terminate them? I see no point in propagating
> them further in hvm_hap_nested_page_fault().

If a write hits an R/O region on a page with some writable regions, the
handling should be the same as it would be for a page that is just on
mmio_ro_ranges. This is what the patch does.
There may be an opportunity to simplify mmio_ro_ranges handling
somewhere, but I don't think it belongs in this patch.

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab