
Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 29 Mar 2023 10:50:20 +0200
  • Cc: Paul Durrant <paul@xxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 29 Mar 2023 08:50:41 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote:
> In some cases, only a few registers on a page need to be write-protected.
> Examples include the USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either).

Yet like the MSI-X table, the PBA is not permitted to share a page with
other registers (the table itself excluded). So a need there would
(again) only arise for devices violating the spec.

> The current API allows only marking whole pages read-only, which
> sometimes may cover other registers that the guest may need to write to.
> 
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on an EPT violation (if
> HVM), or it gets #PF (if PV). In the case of Linux PV, if the access was
> from userspace (like /dev/mem), it will try to fix up by updating page
> tables (which Xen will again force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if the guest
> could actually be allowed the write.
> 
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables, do this via
> emulation (or simply the page fault handler for PV) that handles writes
> that are supposed to be allowed. Those writes require the page to be
> mapped into Xen, so the subpage_mmio_ro_add() function takes the fixmap
> index of the page. The page needs to be added to mmio_ro_ranges first
> anyway. Sub-page ranges are stored using a rangeset for each added page,
> and those pages are stored on a plain list (as there aren't supposed to
> be many pages needing this precise r/o control).

While, unlike Roger, I don't want to see mmio_ro_ranges' granularity
changed, I wonder if a bitmap isn't going to be easier to use (even
if perhaps requiring a little more memory, but whether that's the case
depends on the intended granularity - I'm not convinced we need byte-
granular ranges). I'm also worried about this tying the USB3 debug
console driver (as of the next patch) yet more heavily to x86 (i.e.
unlike Roger I wouldn't take anything that's x86-only as justification
for making wider changes).

As to sub-page permissions: EPT has, as one of its extensions, support
for this. It might be worth at least mentioning, even if the feature
isn't intended to be used right here.

Taking both remarks together, limiting granularity to 16(?) bytes
would allow using the advanced EPT functionality down the road, and
would at the same time limit the suggested bitmap to just 256 bits /
32 bytes, which I think gets us below what even an empty rangeset
would require. Plus lookup would also be quite a bit more lightweight.
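What I have in mind is roughly this (just a sketch, with the names made
up here, assuming 16-byte granularity):

    #define SUBPAGE_MMIO_RO_ALIGN 16

    struct subpage_ro_range {
        struct list_head list;
        mfn_t mfn;
        int fixmap_idx;
        /* One bit per 16-byte chunk: 4096 / 16 = 256 bits = 32 bytes. */
        DECLARE_BITMAP(ro_elems, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN);
        struct rcu_head rcu;
    };

Marking a range read-only would then be just a bitmap_set() over the
covered chunks, and the lookup in the write handler a single test_bit().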

> The mechanism this API is plugged into is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on a read-only MMIO page. For HVM however,
> an EPT violation on a p2m_mmio_direct page results in a direct
> domain_crash(). To reach mmio_ro_emulated_write(), change how write
> violations for p2m_mmio_direct are handled - specifically, treat them
> similarly to p2m_ioreq_server. This makes the relevant ioreq handler be
> called, which finally ends up calling mmio_ro_emulated_write().

I wonder whether that isn't too heavy a change to existing behavior.
I understand that taking this path is necessary for the purpose of the
patch, but I don't think it is necessary for all p2m_mmio_direct pages.
Checking at least mmio_ro_ranges right in hvm_hap_nested_page_fault()
looks reasonable to me.
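
I.e. something along these lines (sketch only, to be placed near the
existing p2m type checks in hvm_hap_nested_page_fault()):

    /* Send writes to r/o MMIO pages through the emulation path. */
    if ( (p2mt == p2m_mmio_direct) && npfec.write_access &&
         rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
    {
        /* ... hand to hvm_emulate_one_mmio() ... */
    }

leaving all other p2m_mmio_direct faults to take their present path.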

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
>  
>  struct rangeset *__read_mostly mmio_ro_ranges;
>  
> +/* Handling sub-page read-only MMIO regions */
> +struct subpage_ro_range {
> +    struct list_head list;
> +    mfn_t mfn;
> +    int fixmap_idx;
> +    struct rangeset *ro_bytes;
> +    struct rcu_head rcu;
> +};
> +
> +static LIST_HEAD(subpage_ro_ranges);
> +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> +static DEFINE_SPINLOCK(subpage_ro_lock);
> +
>  static uint32_t base_disallow_mask;
>  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
>  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return 0;
>  }
>  
> +int subpage_mmio_ro_add(

As long as patch 2 is going to add the only users, __init please, and
there's no need for a "remove" counterpart.

> +    mfn_t mfn,
> +    unsigned long offset_s,
> +    unsigned long offset_e,
> +    int fixmap_idx)

enum fixed_addresses?

> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc;
> +
> +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> +    ASSERT(offset_s < PAGE_SIZE);
> +    ASSERT(offset_e < PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    list_for_each_entry( iter, &subpage_ro_ranges, list )

Nit: Style (either you view list_for_each_entry as a [pseudo]keyword
for the purposes of what blanks should be present/absent, or you don't;
a mixture isn't reasonable; also elsewhere).
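I.e. either

    list_for_each_entry ( iter, &subpage_ro_ranges, list )

or

    list_for_each_entry(iter, &subpage_ro_ranges, list)

but not the mixed form used above.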

> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter==NULL marks it was a newly allocated entry */
> +        iter = NULL;
> +        entry = xmalloc(struct subpage_ro_range);
> +        rc = -ENOMEM;
> +        if ( !entry )
> +            goto err_unlock;
> +        entry->mfn = mfn;
> +        entry->fixmap_idx = fixmap_idx;
> +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> +                                       RANGESETF_prettyprint_hex);
> +        rc = -ENOMEM;
> +        if ( !entry->ro_bytes )
> +            goto err_unlock;
> +    }
> +
> +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> +    if ( rc < 0 )
> +        goto err_unlock;
> +
> +    if ( !iter )
> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> +        return 0;
> +    else
> +        return 1;

If this case is really needed, this special return value (as documented
in the header) probably needs special handling in the callers - it's
not necessarily an error after all. But I wonder whether it wouldn't be
easier to check earlier on and fail right away (with e.g. -EBUSY), rather
than adding the range and _then_ (kind of, as per patch 2) failing.
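I.e. something like (sketch):

    if ( entry && entry->fixmap_idx != fixmap_idx )
    {
        rc = -EBUSY;
        goto err_unlock;
    }

ahead of the rangeset_add_range() call, rather than reporting the
mismatch only once the range was already inserted.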

> +err_unlock:

Nit: Style (labels indented by at least one space please, also elsewhere).

> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned long offset,

unsigned int

> +    void *data,

const

> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;
> +    void __iomem *addr;
> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {
> +            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
> +                goto out_unlock;

This case ...

> +            addr = fix_to_virt(entry->fixmap_idx) + offset;
> +            switch ( len )
> +            {
> +            case 1:
> +                writeb(*(uint8_t*)data, addr);
> +                break;
> +            case 2:
> +                writew(*(uint16_t*)data, addr);
> +                break;
> +            case 4:
> +                writel(*(uint32_t*)data, addr);
> +                break;
> +            case 8:
> +                writeq(*(uint64_t*)data, addr);
> +                break;
> +            default:
> +                /* mmio_ro_emulated_write() already validated the size */
> +                ASSERT_UNREACHABLE();

... as well as, in a release build, this one wants to ...

> +            }
> +            goto out_unlock;

... not use this path but ...

> +        }
> +    }
> +    gdprintk(XENLOG_WARNING,
> +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len 
> %u\n",
> +             mfn_x(mfn), offset, len);

... make it here. (By implication: I'm not convinced this wants to be a
gdprintk(), as I think at least one such warning would better surface in
release builds as well.)

At the same time I don't think any message should be issued for write
attempts to pages which don't have any part of them marked writable. We
deal with such writes silently right now, and this shouldn't change. In
fact even for ranges which don't overlap the writable portion of a page
we may want to consider remaining silent. The log message ought to be of
interest mainly for writes which we meant to permit, but which for
whatever reason we elect to suppress nevertheless.

As to the message - why do you split MFN from offset, rather than simply
logging an address ("... %" PRI_mfn "%03x ...")?
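I.e. e.g. (assuming offset also becomes unsigned int, as requested
further up):

    gprintk(XENLOG_WARNING,
            "ignoring write to R/O MMIO address %" PRI_mfn "%03x, len %u\n",
            mfn_x(mfn), offset, len);

(gprintk() rather than gdprintk() per the earlier remark).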

Like (iirc) Roger I think that misaligned accesses should be refused (and
hence also make it here).
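E.g. (sketch, using the existing IS_ALIGNED() helper):

    if ( !IS_ALIGNED(offset, len) )
        goto out_unlock; /* i.e. take the "ignoring write" path */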

Jan