
Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 28 Mar 2023 17:31:07 +0200
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Paul Durrant <paul@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Tue, 28 Mar 2023 15:31:50 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Mar 28, 2023 at 04:49:36PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Mar 28, 2023 at 04:04:47PM +0200, Roger Pau Monné wrote:
> > On Mon, Mar 27, 2023 at 12:09:15PM +0200, Marek Marczykowski-Górecki wrote:
> > > In some cases, only a few registers on a page need to be write-protected.
> > > Examples include the USB3 console (64 bytes worth of registers) or MSI-X's
> > > PBA table (which doesn't need to span the whole table either).
> > > The current API allows only marking whole pages read-only, which
> > > sometimes covers other registers that the guest may need to write to.
> > > 
> > > Currently, when a guest tries to write to an MMIO page on
> > > mmio_ro_ranges, it is either crashed immediately on the EPT violation
> > > (if HVM), or it gets #PF (if PV). In the Linux PV case, if the access
> > > came from userspace (e.g. via /dev/mem), it will try to fix it up by
> > > updating its page tables (which Xen will again force to read-only) and
> > > will hit that #PF again, looping endlessly. Both behaviors are
> > > undesirable if the guest could actually be allowed the write.
> > > 
> > > Introduce an API that allows marking only part of a page read-only.
> > > Since sub-page permissions are not a thing in page tables, do this via
> > > emulation (or simply the page fault handler, for PV) of the writes that
> > > are supposed to be allowed. Those writes require the page to be mapped
> > > in Xen, so the subpage_mmio_ro_add() function takes the fixmap index of
> > > the page. The page needs to be added to mmio_ro_ranges first, anyway.
> > > Sub-page ranges are stored using a rangeset for each added page, and
> > > those pages are kept on a plain list (as there aren't supposed to be
> > > many pages needing this precise r/o control).
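
Just to make sure I'm reading this right, the layout boils down to something
like the below? (A sketch only - names guessed from the code fragments quoted
further down, not taken verbatim from the patch.)

struct subpage_ro_range {
    struct list_head list;     /* entry on the global subpage_ro_ranges list */
    mfn_t mfn;                 /* MMIO page this entry covers */
    int fixmap_idx;            /* fixmap slot where the page is mapped in Xen */
    struct rangeset *ro_bytes; /* byte offsets within the page that stay R/O */
    struct rcu_head rcu;       /* for deferred freeing on removal */
};

static LIST_HEAD(subpage_ro_ranges);
static DEFINE_SPINLOCK(subpage_ro_lock);     /* serializes add/remove only */
static DEFINE_RCU_READ_LOCK(subpage_ro_rcu); /* read-side lookup protection */
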
> > 
> > Since mmio_ro_ranges is x86 only, it is possible to mutate
> > it to track address ranges instead of page frames.  The current type
> > is unsigned long, so that should be fine, and it would avoid having to
> > create a per-page rangeset just to track offsets.
> 
> I was thinking about it, but rangeset doesn't allow attaching extra data
> (fixmap index, or mapped address as you propose with ioremap()).
> Changing all the places where mmio_ro_ranges is used will be a bit
> tedious, but that isn't really a problem.

Yes, you would still need to keep a separate structure in order to
store the maps; a simple list that contains the physical mfn and the
virtual address where it's mapped would suffice, I think.  That would
avoid creating a separate rangeset for each mfn that you need to
track.
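
In sketch form, with made-up names (subpage_ro_map) and assuming
mmio_ro_ranges gets switched to byte granularity:

/* mmio_ro_ranges itself would track byte addresses, so registration is
 * just a rangeset_add_range(mmio_ro_ranges, start, end) call, and only
 * the Xen mapping needs a side structure: */
struct subpage_ro_map {
    struct list_head list;
    mfn_t mfn;        /* physical page containing a protected range */
    void __iomem *va; /* where it is mapped in Xen */
};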

> > > The mechanism this API plugs into is slightly different for PV and
> > > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > > that is already called on #PF for a read-only MMIO page. For HVM however,
> > > an EPT violation on a p2m_mmio_direct page results in a direct
> > > domain_crash(). To reach mmio_ro_emulated_write(), change how write
> > > violations for p2m_mmio_direct are handled - specifically, treat them
> > > similarly to p2m_ioreq_server. This makes the relevant ioreq handler be
> > > called, which finally ends up calling mmio_ro_emulated_write().
> > > Both of those paths need the MFN the guest tried to write to (to check
> > > which part of the page is supposed to be read-only, and where
> > > the page is mapped for writes). This information currently isn't
> > > available directly in mmio_ro_emulated_write(), but in both cases it is
> > > already resolved somewhere higher in the call tree. Pass it down to
> > > mmio_ro_emulated_write() via the new mmio_ro_emulate_ctxt.mfn field.
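
IOW (a sketch; not the full struct, just the relevant fields, and possibly
not exact):

struct mmio_ro_emulate_ctxt {
    unsigned long cr2; /* existing: faulting linear address */
    mfn_t mfn;         /* new: MFN the guest attempted to write to */
};
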
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> > > ---
> > > Shadow mode is not tested, but I don't expect it to behave differently
> > > from HAP in the areas related to this patch.
> > > The locking used should make it safe to use similarly to mmio_ro_ranges,
> > > but frankly the only user (introduced in the next patch) could go without
> > > locking at all, as subpage_mmio_ro_add() is called only before any
> > > domain is constructed and subpage_mmio_ro_remove() is never called.
> > > ---
> > >  xen/arch/x86/hvm/emulate.c      |   2 +-
> > >  xen/arch/x86/hvm/hvm.c          |   3 +-
> > >  xen/arch/x86/include/asm/mm.h   |  22 ++++-
> > >  xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
> > >  xen/arch/x86/pv/ro-page-fault.c |   1 +-
> > >  5 files changed, 207 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> > > index 95364deb1996..311102724dea 100644
> > > --- a/xen/arch/x86/hvm/emulate.c
> > > +++ b/xen/arch/x86/hvm/emulate.c
> > > @@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
> > >          .write      = mmio_ro_emulated_write,
> > >          .validate   = hvmemul_validate,
> > >      };
> > > -    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> > > +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
> > >      struct hvm_emulate_ctxt ctxt;
> > >      const struct x86_emulate_ops *ops;
> > >      unsigned int seg, bdf;
> > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > > index d326fa1c0136..f1c928e3e4ee 100644
> > > --- a/xen/arch/x86/hvm/hvm.c
> > > +++ b/xen/arch/x86/hvm/hvm.c
> > > @@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
> > >       */
> > >      if ( (p2mt == p2m_mmio_dm) ||
> > >           (npfec.write_access &&
> > > -          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
> > > +          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
> > > +           p2mt == p2m_mmio_direct)) )
> > >      {
> > >          if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
> > >              hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > > index db29e3e2059f..91937d556bac 100644
> > > --- a/xen/arch/x86/include/asm/mm.h
> > > +++ b/xen/arch/x86/include/asm/mm.h
> > > @@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges;
> > >  void memguard_guard_stack(void *p);
> > >  void memguard_unguard_stack(void *p);
> > >  
> > > +/*
> > > + * Add more precise r/o marking for an MMIO page. The byte range specified
> > > + * here will still be R/O, but the rest of the page (not marked as R/O via
> > > + * another call) will have writes passed through. The write passthrough
> > > + * requires the caller to provide a fixmap entry.
> > > + * Since multiple callers can mark different areas of the same page, they
> > > + * might provide different fixmap entries (although that's very unlikely in
> > > + * practice). Only the one provided by the first caller will be used. The
> > > + * return value indicates whether this fixmap entry will be used, or a
> > > + * different one provided earlier (in which case the caller might decide to
> > > + * release it).
> > 
> > Why not use ioremap() to map the page instead of requiring a fixmap
> > entry?
> 
> In all the cases this feature is used for (so far), I do have a fixmap
> anyway. So I don't need to worry about whether I can call ioremap() at
> that boot stage (I think it's okay in console_init_postirq(), but that
> may not be obvious in other places).

If we moved to a model where mmio_ro_ranges tracks at address
granularity, I would suggest that the underlying addresses are only
mapped when there's a guest access to handle, so that we don't waste
fixmap entries for no reason.
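
A sketch (subpage_ro_map being the hypothetical side structure from above;
serialization of the first mapping elided for brevity):

static void __iomem *subpage_mmio_map(struct subpage_ro_map *map)
{
    /* Map on the first trapped guest write rather than at registration
     * time, so no fixmap entry is consumed up front. */
    if ( !map->va )
        map->va = ioremap(mfn_to_maddr(map->mfn), PAGE_SIZE);

    return map->va;
}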

> > > +        if ( !entry->ro_bytes )
> > > +            goto err_unlock;
> > > +    }
> > > +
> > > +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> > > +    if ( rc < 0 )
> > > +        goto err_unlock;
> > > +
> > > +    if ( !iter )
> > > +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> > > +
> > > +    spin_unlock(&subpage_ro_lock);
> > > +
> > > +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> > > +        return 0;
> > > +    else
> > > +        return 1;
> > > +
> > > +err_unlock:
> > > +    spin_unlock(&subpage_ro_lock);
> > > +    if ( !iter )
> > > +    {
> > > +        if ( entry )
> > > +        {
> > > +            if ( entry->ro_bytes )
> > > +                rangeset_destroy(entry->ro_bytes);
> > > +            xfree(entry);
> > > +        }
> > > +    }
> > > +    return rc;
> > > +}
> > > +
> > > +static void subpage_mmio_ro_free(struct rcu_head *rcu)
> > > +{
> > > +    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);

Overly long line.

As a precaution I would also add an extra
ASSERT(rangeset_is_empty(entry->ro_bytes)) here.
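
I.e. something like this (the destroy/free part is my guess at what the
function body does):

static void subpage_mmio_ro_free(struct rcu_head *rcu)
{
    struct subpage_ro_range *entry =
        container_of(rcu, struct subpage_ro_range, rcu);

    /* Removal must only be scheduled once the rangeset is empty. */
    ASSERT(rangeset_is_empty(entry->ro_bytes));

    rangeset_destroy(entry->ro_bytes);
    xfree(entry);
}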

> > > +
> > > +    if ( !rangeset_is_empty(entry->ro_bytes) )
> > > +        goto out_unlock;
> > > +
> > > +    list_del_rcu(&entry->list);
> > > +    call_rcu(&entry->rcu, subpage_mmio_ro_free);
> > > +
> > > +out_unlock:
> > > +    spin_unlock(&subpage_ro_lock);
> > > +    return rc;
> > > +}
> > > +
> > > +static void subpage_mmio_write_emulate(
> > > +    mfn_t mfn,
> > > +    unsigned long offset,
> > > +    void *data,
> > > +    unsigned int len)
> > > +{
> > > +    struct subpage_ro_range *entry;
> > 
> > const.
> > 
> > > +    void __iomem *addr;
> > 
> > Do we care about the access being aligned?
> 
> I don't think Xen cares about it when the page is mapped R/W to the
> guest, so why should it care when it's only partially R/W?

It's kind of the same issue we have with adjacent MSI-X accesses: we
don't really know whether the device might choke on unaligned accesses
and generate some kind of exception that results in an NMI to the CPU.
Handling those is likely more feasible if they happen in guest
context, rather than in Xen context.
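
If we wanted to be conservative, a simple gate in the write handler would
do. A sketch (subpage_mmio_write_allowed() is a made-up helper, not
something in the patch):

static bool subpage_mmio_write_allowed(unsigned long offset,
                                       unsigned int len)
{
    /* Forward only naturally aligned 1/2/4/8 byte writes to the device. */
    if ( len != 1 && len != 2 && len != 4 && len != 8 )
        return false;

    return !(offset & (len - 1));
}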

> > 
> > > +
> > > +    rcu_read_lock(&subpage_ro_rcu);
> > > +
> > > +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> > > +    {
> > > +        if ( mfn_eq(entry->mfn, mfn) )
> > > +        {
> > 
> > You need to take the spinlock at this point, since the contents of
> > entry->ro_bytes are not protected by the RCU.  The RCU only provides
> > safe iteration over the list, but not the content of the elements on
> > the list.
> 
> The mfn is never supposed to change on a given list element, and IIUC,
> rangeset does its own locking. Am I missing something?

Oh, sorry, that was my mistake - I didn't recall, and didn't check,
that rangesets have internal locking.
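
So the read side can indeed stay lock-free; roughly (a sketch -
subpage_mmio_write_raw() stands in for the actual write-through helper):

rcu_read_lock(&subpage_ro_rcu);

list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
{
    if ( !mfn_eq(entry->mfn, mfn) )
        continue;

    /* The rangeset query takes the rangeset's own lock internally. */
    if ( !rangeset_overlaps_range(entry->ro_bytes, offset,
                                  offset + len - 1) )
        /* No byte of this write hits an R/O range: pass it through. */
        subpage_mmio_write_raw(entry, offset, data, len);

    break;
}

rcu_read_unlock(&subpage_ro_rcu);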

Thanks, Roger.



 

