
Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only



On Tue, Mar 28, 2023 at 04:04:47PM +0200, Roger Pau Monné wrote:
> On Mon, Mar 27, 2023 at 12:09:15PM +0200, Marek Marczykowski-Górecki wrote:
> > In some cases, only a few registers on a page need to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either).
> > The current API allows only marking whole pages read-only, which
> > sometimes may cover other registers that the guest may need to write to.
> > 
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on an EPT violation - if
> > that's HVM - or, if PV, it gets a #PF. In the case of Linux PV, if the
> > access was from userspace (like /dev/mem), it will try to fix it up by
> > updating the page tables (which Xen will again force to read-only) and
> > will hit that #PF again (looping endlessly). Both behaviors are
> > undesirable if the guest could actually be allowed the write.
> > 
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables, do this via
> > emulation (or simply the page fault handler for PV) that handles writes
> > that are supposed to be allowed. Those writes require the page to be
> > mapped to Xen, so the subpage_mmio_ro_add() function takes the fixmap
> > index of the page. The page needs to be added to mmio_ro_ranges first
> > anyway. Sub-page ranges are stored using a rangeset for each added page,
> > and those pages are stored on a plain list (as there aren't supposed to
> > be many pages needing this precise r/o control).
> 
> Since mmio_ro_ranges is x86 only, it is possible to mutate
> it to track address ranges instead of page frames.  The current type
> is unsigned long, so that should be fine, and would avoid having to
> create a per-page rangeset just to track offsets.

I was thinking about it, but rangeset doesn't allow attaching extra data
(a fixmap index, or the mapped address as you propose with ioremap()).
Changing all the places where mmio_ro_ranges is used would be a bit
tedious, but that isn't really a problem.
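
To illustrate what I mean: even with a byte-granular mmio_ro_ranges, the
write handler would still need a separate per-page lookup to find where
the page is mapped for the allowed writes, roughly something like this
(hypothetical sketch, names made up):

    /* Hypothetical side structure still needed next to a byte-granular
     * mmio_ro_ranges: maps a protected page to where Xen can write it. */
    struct subpage_rw_mapping {
        struct list_head list;
        mfn_t mfn;          /* page the mapping belongs to */
        int fixmap_idx;     /* or a void __iomem * from ioremap() */
    };
    static LIST_HEAD(subpage_rw_mappings);

    /* Look up the fixmap entry for a page; -ENOENT if not registered. */
    static int subpage_rw_mapping_find(mfn_t mfn)
    {
        struct subpage_rw_mapping *m;

        list_for_each_entry( m, &subpage_rw_mappings, list )
            if ( mfn_eq(m->mfn, mfn) )
                return m->fixmap_idx;

        return -ENOENT;
    }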

> > The mechanism this API plugs into is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > that is already called on a #PF for a read-only MMIO page. For HVM
> > however, an EPT violation on a p2m_mmio_direct page results in a direct
> > domain_crash(). To reach mmio_ro_emulated_write(), change how write
> > violations for p2m_mmio_direct are handled - specifically, treat them
> > similarly to p2m_ioreq_server. This makes the relevant ioreq handler be
> > called, which finally ends up calling mmio_ro_emulated_write().
> > Both of those paths need the MFN to which the guest tried to write (to
> > check which part of the page is supposed to be read-only, and where
> > the page is mapped for writes). This information currently isn't
> > available directly in mmio_ro_emulated_write(), but in both cases it is
> > already resolved somewhere higher in the call tree. Pass it down to
> > mmio_ro_emulated_write() via the new mmio_ro_emulate_ctxt.mfn field.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> > ---
> > Shadow mode is not tested, but I don't expect it to behave differently
> > from HAP in areas related to this patch.
> > The locking used should make it safe to use similarly to mmio_ro_ranges,
> > but frankly the only user (introduced in the next patch) could go without
> > locking at all, as subpage_mmio_ro_add() is called only before any
> > domain is constructed and subpage_mmio_ro_remove() is never called.
> > ---
> >  xen/arch/x86/hvm/emulate.c      |   2 +-
> >  xen/arch/x86/hvm/hvm.c          |   3 +-
> >  xen/arch/x86/include/asm/mm.h   |  22 ++++-
> >  xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
> >  xen/arch/x86/pv/ro-page-fault.c |   1 +-
> >  5 files changed, 207 insertions(+), 2 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> > index 95364deb1996..311102724dea 100644
> > --- a/xen/arch/x86/hvm/emulate.c
> > +++ b/xen/arch/x86/hvm/emulate.c
> > @@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
> >          .write      = mmio_ro_emulated_write,
> >          .validate   = hvmemul_validate,
> >      };
> > -    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> > +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
> >      struct hvm_emulate_ctxt ctxt;
> >      const struct x86_emulate_ops *ops;
> >      unsigned int seg, bdf;
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index d326fa1c0136..f1c928e3e4ee 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> >       */
> >      if ( (p2mt == p2m_mmio_dm) ||
> >           (npfec.write_access &&
> > -          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
> > +          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
> > +           p2mt == p2m_mmio_direct)) )
> >      {
> >          if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
> >              hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index db29e3e2059f..91937d556bac 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> > + * will still be R/O, but the rest of the page (nor marked as R/O via another
> > + * call) will have writes passed through. The write passthrough requires
> > + * providing fixmap entry by the caller.
> > + * Since multiple callers can mark different areas of the same page, they might
> > + * provide different fixmap entries (although that's very unlikely in
> > + * practice). Only the one provided by the first caller will be used. Return value
> > + * indicates whether this fixmap entry will be used, or a different one
> > + * provided earlier (in which case the caller might decide to release it).
> 
> Why not use ioremap() to map the page instead of requiring a fixmap
> entry?

In all the cases where this feature is used (for now), I do have a fixmap
anyway. So I don't need to worry about whether ioremap() can be called at
that boot stage (I think it's okay in console_init_postirq(), but that may
not be obvious in other places).
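
For reference, the caller pattern I have in mind looks roughly like this
(just a sketch; FIX_EXAMPLE_MMIO is a made-up fixmap index, base is assumed
page-aligned, and error handling is omitted):

    /* Map the page via a fixmap entry, mark the whole page r/o for guests,
     * then restrict the r/o part to the first 64 bytes of the page. */
    static void __init example_mark_partial_ro(paddr_t base)
    {
        mfn_t mfn = maddr_to_mfn(base);

        set_fixmap_nocache(FIX_EXAMPLE_MMIO, base);
        rangeset_add_singleton(mmio_ro_ranges, mfn_x(mfn));
        subpage_mmio_ro_add(mfn, 0x00, 0x3f, FIX_EXAMPLE_MMIO);
    }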

> > + *
> > + * Return values:
> > + *  - negative: error
> > + *  - 0: success, fixmap entry is claimed
> > + *  - 1: success, fixmap entry set earlier will be used
> > + */
> > +int subpage_mmio_ro_add(mfn_t mfn, unsigned long offset_s,
> > +                        unsigned long offset_e, int fixmap_idx);
> > +int subpage_mmio_ro_remove(mfn_t mfn, unsigned long offset_s,
> > +                           unsigned long offset_e);
> > +
> >  struct mmio_ro_emulate_ctxt {
> >          unsigned long cr2;
> >          unsigned int seg, bdf;
> > +        mfn_t mfn;
> >  };
> >  
> >  int cf_check mmio_ro_emulated_write(
> > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > index 0fe14faa5fa7..b50bdee40b6b 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
> >  
> >  struct rangeset *__read_mostly mmio_ro_ranges;
> >  
> > +/* Handling sub-page read-only MMIO regions */
> > +struct subpage_ro_range {
> > +    struct list_head list;
> > +    mfn_t mfn;
> > +    int fixmap_idx;
> > +    struct rangeset *ro_bytes;
> > +    struct rcu_head rcu;
> > +};
> > +
> > +static LIST_HEAD(subpage_ro_ranges);
> > +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> > +static DEFINE_SPINLOCK(subpage_ro_lock);
> > +
> >  static uint32_t base_disallow_mask;
> >  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
> >  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> > @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      return 0;
> >  }
> >  
> > +int subpage_mmio_ro_add(
> > +    mfn_t mfn,
> > +    unsigned long offset_s,
> > +    unsigned long offset_e,
> 
> Since those are page offset, you can likely use unsigned int rather
> than long?
> 
> I also wonder why not use [paddr_t start, paddr_t end] (or start and
> size) and drop the mfn parameter?  You can certainly get the mfn from
> the full address, and it seems more natural that having the caller
> pass an mfn and two offsets.

That would indeed work for the function parameters, regardless of what's
really stored.
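
So something like this on top of the current implementation (sketch only;
subpage_mmio_ro_add_page() is a made-up name for what the current function
does internally):

    /* Possible public signature: a byte-granular physical address range.
     * Internally it would still be split into an mfn plus page offsets. */
    int subpage_mmio_ro_add(paddr_t start, paddr_t end, int fixmap_idx)
    {
        mfn_t mfn = maddr_to_mfn(start);

        /* Only ranges within a single page are supported for now. */
        if ( !mfn_eq(mfn, maddr_to_mfn(end)) )
            return -EINVAL;

        return subpage_mmio_ro_add_page(mfn, start & ~PAGE_MASK,
                                        end & ~PAGE_MASK, fixmap_idx);
    }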

> > +    int fixmap_idx)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    int rc;
> > +
> > +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> > +    ASSERT(offset_s < PAGE_SIZE);
> > +    ASSERT(offset_e < PAGE_SIZE);
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +
> > +    list_for_each_entry( iter, &subpage_ro_ranges, list )
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +    {
> > +        /* iter==NULL marks it was a newly allocated entry */
> > +        iter = NULL;
> > +        entry = xmalloc(struct subpage_ro_range);
> > +        rc = -ENOMEM;
> > +        if ( !entry )
> > +            goto err_unlock;
> > +        entry->mfn = mfn;
> > +        entry->fixmap_idx = fixmap_idx;
> > +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> > +                                       RANGESETF_prettyprint_hex);
> > +        rc = -ENOMEM;
> 
> rc will already be -ENOMEM, albeit doing error handling this way is
> tricky...
> 
> > +        if ( !entry->ro_bytes )
> > +            goto err_unlock;
> > +    }
> > +
> > +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> > +    if ( rc < 0 )
> > +        goto err_unlock;
> > +
> > +    if ( !iter )
> > +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> > +
> > +    spin_unlock(&subpage_ro_lock);
> > +
> > +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> > +        return 0;
> > +    else
> > +        return 1;
> > +
> > +err_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    if ( !iter )
> > +    {
> > +        if ( entry )
> > +        {
> > +            if ( entry->ro_bytes )
> > +                rangeset_destroy(entry->ro_bytes);
> > +            xfree(entry);
> > +        }
> > +    }
> > +    return rc;
> > +}
> > +
> > +static void subpage_mmio_ro_free(struct rcu_head *rcu)
> > +{
> > +    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);
> > +
> > +    rangeset_destroy(entry->ro_bytes);
> > +    xfree(entry);
> > +}
> > +
> > +int subpage_mmio_ro_remove(
> > +    mfn_t mfn,
> > +    unsigned long offset_s,
> > +    unsigned long offset_e)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    int rc;
> > +
> > +    ASSERT(offset_s < PAGE_SIZE);
> > +    ASSERT(offset_e < PAGE_SIZE);
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +
> > +    list_for_each_entry_rcu( iter, &subpage_ro_ranges, list )
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    rc = -ENOENT;
> > +    if ( !entry )
> > +        goto out_unlock;
> > +
> > +    rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e);
> > +    if ( rc < 0 )
> > +        goto out_unlock;
> > +
> > +    rc = 0;
> 
> You can use `if ( rc ) goto out_unlock;` and that avoids having to
> explicitly set rc to 0.
> 
> > +
> > +    if ( !rangeset_is_empty(entry->ro_bytes) )
> > +        goto out_unlock;
> > +
> > +    list_del_rcu(&entry->list);
> > +    call_rcu(&entry->rcu, subpage_mmio_ro_free);
> > +
> > +out_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    return rc;
> > +}
> > +
> > +static void subpage_mmio_write_emulate(
> > +    mfn_t mfn,
> > +    unsigned long offset,
> > +    void *data,
> > +    unsigned int len)
> > +{
> > +    struct subpage_ro_range *entry;
> 
> const.
> 
> > +    void __iomem *addr;
> 
> Do we care about the access being aligned?

I don't think Xen cares about that when the page is mapped R/W to the guest,
so why should it care when the page is only partially R/W?

> 
> > +
> > +    rcu_read_lock(&subpage_ro_rcu);
> > +
> > +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> > +    {
> > +        if ( mfn_eq(entry->mfn, mfn) )
> > +        {
> 
> You need to take the spinlock at this point, since the contents of
> entry->ro_bytes are not protected by the RCU.  The RCU only provides
> safe iteration over the list, but not the content of the elements on
> the list.

The mfn is never supposed to change for a given list element, and IIUC
rangeset does its own locking. Am I missing something?
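
To be concrete: all the write handler does with entry->ro_bytes is a single
rangeset query, and the actual write goes through the fixmap mapping.
Roughly (simplified sketch, not the exact patch code):

    /* Inside the mfn_eq() match above: drop the write if it overlaps a
     * read-only byte range, otherwise pass it through via the fixmap. */
    if ( rangeset_overlaps_range(entry->ro_bytes, offset,
                                 offset + len - 1) )
        goto out_unlock;

    addr = fix_to_virt(entry->fixmap_idx) + offset;
    switch ( len )
    {
    case 1: writeb(*(const uint8_t  *)data, addr); break;
    case 2: writew(*(const uint16_t *)data, addr); break;
    case 4: writel(*(const uint32_t *)data, addr); break;
    case 8: writeq(*(const uint64_t *)data, addr); break;
    }

 out_unlock:
    rcu_read_unlock(&subpage_ro_rcu);

I.e. the only access to entry->ro_bytes here is through a rangeset call,
which does its own locking.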

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
