
Re: [PATCH v3 2/2] x86/ioreq: Extend ioreq server to support multiple ioreq pages


  • To: Julian Vetter <julian.vetter@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 26 Feb 2026 16:53:58 +0100
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Thu, 26 Feb 2026 15:54:20 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 23.02.2026 10:38, Julian Vetter wrote:
> A single shared ioreq page provides PAGE_SIZE/sizeof(ioreq_t) = 128
> slots, limiting HVM guests to 128 vCPUs. To support more vCPUs, extend
> the ioreq server to use xvzalloc_array() for allocating a contiguous
> virtual array of ioreq_t slots sized to d->max_vcpus, backed by
> potentially non-contiguous physical pages.
> 
> For the GFN-mapped path (x86), page and writable type references are
> obtained directly via check_get_page_from_gfn() and get_page_type() for
> each GFN. The pages are then combined into a single contiguous VA using
> vmap(). The number of ioreq pages is computed at runtime via
> nr_ioreq_pages(d) = DIV_ROUND_UP(d->max_vcpus, IOREQS_PER_PAGE), so
> small VMs only allocate one page. All existing single-page paths
> (bufioreq, legacy clients) remain unchanged.
> 
> Mark the now-unused shared_iopage_t in the public header as deprecated.

For this I think we need to settle on one of two options. Either it was a
mistake that this was used in the hypervisor (and added to the public
interface), in which case the removal of the use may want to be separate
(without, imo, any need to mark the item deprecated in the public header,
as the property remains). Or we deem it legitimate / useful, in which case
you would want to continue using it (in struct ioreq_server).

> @@ -89,6 +91,39 @@ static gfn_t hvm_alloc_ioreq_gfn(struct ioreq_server *s)
>      return hvm_alloc_legacy_ioreq_gfn(s);
>  }
>  
> +static gfn_t hvm_alloc_ioreq_gfns(struct ioreq_server *s,
> +                                  unsigned int nr_pages)
> +{
> +    struct domain *d = s->target;
> +    unsigned long mask;
> +    unsigned int i, run;
> +
> +    if ( nr_pages == 1 )
> +        return hvm_alloc_ioreq_gfn(s);
> +
> +    /* Find nr_pages consecutive set bits */
> +    mask = d->arch.hvm.ioreq_gfn.mask;
> +
> +    for ( i = 0, run = 0; i < BITS_PER_LONG; i++ )
> +    {
> +        if ( !test_bit(i, &mask) )
> +            run = 0;
> +        else if ( ++run == nr_pages )
> +        {
> +            /* Found a run - clear all bits and return base GFN */
> +            unsigned int start = i - nr_pages + 1;
> +            unsigned int j;
> +
> +            for ( j = start; j <= i; j++ )
> +                clear_bit(j, &d->arch.hvm.ioreq_gfn.mask);

Using clear_bit() here doesn't make the whole operation atomic. There will
need to be synchronization (also with hvm_alloc_ioreq_gfn()), and once that's
there (or if things are suitably synchronized already) __clear_bit() ought to
suffice here.

> +            return _gfn(d->arch.hvm.ioreq_gfn.base + start);
> +        }
> +    }
> +
> +    return INVALID_GFN;
> +}

Did you consider whether fragmentation could get in the way here, as is usually
the case when doing mixed-size allocations from a single pool? To what extent
is it necessary for the GFNs used to be consecutive?
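FWIW, the run search in the hunk above can be illustrated standalone (plain C,
hypothetical names, not hypervisor code), which also makes the fragmentation
concern concrete: plenty of bits may be set without any sufficiently long run
existing.

```c
#include <limits.h>

/*
 * Standalone illustration of the run search in hvm_alloc_ioreq_gfns():
 * find 'nr' consecutive set bits in 'mask' and return the index of the
 * first bit of the run, or -1 if no such run exists.  In the hypervisor
 * the found bits would additionally be cleared, which is where the
 * synchronization question comes in.
 */
static int find_set_run(unsigned long mask, unsigned int nr)
{
    unsigned int i, run = 0;

    for ( i = 0; i < sizeof(mask) * CHAR_BIT; i++ )
    {
        if ( !(mask & (1UL << i)) )
            run = 0;
        else if ( ++run == nr )
            return i - nr + 1;
    }

    return -1;
}
```

With mask 0x0000ff0f (bits 0-3 and 8-15 set) a 4-run is found at bit 0 and an
8-run at bit 8, but no 12-run exists even though 12 bits are set in total -
exactly the fragmentation hazard of mixed-size allocations from one pool.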

> @@ -134,16 +181,41 @@ static void hvm_unmap_ioreq_gfn(struct ioreq_server *s, bool buf)
>  
>          hvm_free_ioreq_gfn(s, iorp->gfn);
>          iorp->gfn = INVALID_GFN;
> +        return;
>      }
> +
> +    if ( gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> +        return;
> +
> +    nr_pages = nr_ioreq_pages(s->target);
> +
> +    for ( i = 0; i < nr_pages; i++ )
> +    {
> +        struct page_info *pg = vmap_to_page((char *)s->ioreq +

Please can you cast to const void * (or void *) instead? But perhaps you
had better latch the pointer into a local variable anyway, for ...

> +                                            i * PAGE_SIZE);
> +
> +        put_page_and_type(pg);
> +        put_page(pg);
> +    }
> +    vunmap(s->ioreq);
> +    s->ioreq = NULL;

... this clearing to move up ahead of any teardown done.

> +    hvm_free_ioreq_gfns(s, s->ioreq_gfn, nr_pages);
> +    s->ioreq_gfn = INVALID_GFN;

This similarly may want moving up.

As to the loop body, destroy_ring_for_helper() has put_page_and_type(), but
no put_page(). Why is this different here?

> @@ -173,30 +245,132 @@ static int hvm_map_ioreq_gfn(struct ioreq_server *s, bool buf)
>  
>          return rc;
>      }
> +
> +    /* ioreq: multi-page with contiguous VA */
> +    if ( s->ioreq )
> +    {
> +        if ( gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> +            return -EPERM;

The comment on the other path at least wants referencing here.

> +        return 0;
> +    }
> +
> +    if ( d->is_dying )
> +        return -EINVAL;
> +
> +    nr_pages = nr_ioreq_pages(d);
> +    base_gfn = hvm_alloc_ioreq_gfns(s, nr_pages);
> +
> +    if ( gfn_eq(base_gfn, INVALID_GFN) )
> +        return -ENOMEM;
> +
> +    mfns = xmalloc_array(mfn_t, nr_pages);

Especially when there are only a few MFNs, this dynamic allocation would be
nice to avoid. If there was a not overly large upper bound, using a per-CPU
array might also be an option.

If the dynamic allocation stays, new code wants to use xvmalloc() and
friends.
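As a standalone sketch of the idea (hypothetical names; malloc() standing in
for xvmalloc_array(), and SMALL_NR_PAGES an arbitrary bound):

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned long mfn_t; /* stand-in for Xen's typesafe mfn_t */

#define SMALL_NR_PAGES 8 /* arbitrary bound for the on-stack fast path */

/*
 * Avoid the dynamic allocation for small counts: use an on-stack array
 * when nr_pages is small, falling back to the heap otherwise.  The MFN
 * "lookup" is a placeholder for the real per-GFN work.
 */
static int collect_mfns_demo(unsigned int nr_pages, mfn_t *out)
{
    mfn_t stack_mfns[SMALL_NR_PAGES];
    mfn_t *mfns = stack_mfns;
    unsigned int i;

    if ( nr_pages > SMALL_NR_PAGES )
    {
        mfns = malloc(nr_pages * sizeof(*mfns));
        if ( !mfns )
            return -1;
    }

    for ( i = 0; i < nr_pages; i++ )
        mfns[i] = 0x1000 + i; /* placeholder for page_to_mfn() */

    memcpy(out, mfns, nr_pages * sizeof(*mfns));

    if ( mfns != stack_mfns )
        free(mfns);

    return 0;
}
```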

> +    if ( !mfns )
> +    {
> +        hvm_free_ioreq_gfns(s, base_gfn, nr_pages);
> +        return -ENOMEM;
> +    }
> +
> +    /*
> +     * Obtain a page reference and writable type reference for each GFN.
> +     * No per-page VA is needed; all pages are mapped as a single contiguous
> +     * VA via vmap() below.
> +     */

I think this comment wants to mention prepare_ring_for_helper(), so that if
updates are done there, there's at least a fair chance that people might spot
that changes need doing here as well.

> +    for ( i = 0; i < nr_pages; i++ )
> +    {
> +        struct page_info *pg;
> +        p2m_type_t p2mt;
> +
> +        rc = check_get_page_from_gfn(d, _gfn(gfn_x(base_gfn) + i),

gfn_add() please

> +                                     false, &p2mt, &pg);
> +        if ( rc )
> +        {
> +            if ( rc == -EAGAIN )
> +                rc = -ENOENT;
> +            goto fail;
> +        }
> +
> +        if ( !get_page_type(pg, PGT_writable_page) )
> +        {
> +            put_page(pg);
> +            rc = -EINVAL;
> +            goto fail;
> +        }
> +
> +        mfns[i] = page_to_mfn(pg);
> +    }
> +
> +    /* Map all mfns as single contiguous VA */
> +    s->ioreq = vmap(mfns, nr_pages);
> +    if ( !s->ioreq )
> +    {
> +        rc = -ENOMEM;
> +        goto fail;
> +    }
> +
> +    s->ioreq_gfn = base_gfn;
> +    xfree(mfns);
> +
> +    return 0;
> +
> + fail:
> +    while ( i-- > 0 )
> +    {
> +        struct page_info *pg = mfn_to_page(mfns[i]);
> +
> +        put_page_and_type(pg);
> +        put_page(pg);

Same issue here - I can't spot which page reference you're dropping. You
obtained only one above.

> @@ -208,6 +382,32 @@ static int hvm_add_ioreq_gfn(struct ioreq_server *s, bool buf)
>  
>          return rc;
>      }
> +
> +    if ( gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> +        return 0;
> +
> +    nr_pages = nr_ioreq_pages(d);
> +    memset(s->ioreq, 0, nr_pages * PAGE_SIZE);
> +
> +    for ( i = 0; i < nr_pages; i++ )
> +    {
> +        gfn_t gfn = gfn_add(s->ioreq_gfn, i);
> +        struct page_info *pg = vmap_to_page((char *)s->ioreq +
> +                                            i * PAGE_SIZE);
> +
> +        rc = p2m_add_page(d, gfn, page_to_mfn(pg), 0, p2m_ram_rw);
> +        if ( rc )
> +            /*
> +             * No rollback of previously added pages: The caller
> +             * (arch_ioreq_server_disable) has no error handling path,
> +             * and partial failure here will be cleaned up when the
> +             * ioreq server is eventually destroyed.
> +             */
> +            return rc;

Shouldn't you continue the loop, to try to add back as many pages as
you can, so the domain encountering problems later is as unlikely as
possible (albeit a single missing page is already bad enough)?
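A standalone sketch of the suggested pattern (add_one() is a hypothetical
stand-in for p2m_add_page(), with page 2 simulating a failure):

```c
/*
 * Keep going on failure: attempt every page, latch the first error,
 * and report it only at the end, so as many pages as possible get
 * added back.
 */
static int add_one(unsigned int i)
{
    return i == 2 ? -1 : 0; /* simulated failure on page 2 */
}

static int add_all(unsigned int nr_pages, unsigned int *added)
{
    unsigned int i;
    int rc = 0;

    *added = 0;

    for ( i = 0; i < nr_pages; i++ )
    {
        int err = add_one(i);

        if ( err )
        {
            if ( !rc )
                rc = err; /* remember the first failure only */
            continue;     /* but still try the remaining pages */
        }

        ++*added;
    }

    return rc;
}
```

Here a 4-page attempt still adds 3 pages despite page 2 failing, while the
first error is still propagated to the caller.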

> @@ -260,9 +259,32 @@ bool vcpu_ioreq_handle_completion(struct vcpu *v)
>  
>  static int ioreq_server_alloc_mfn(struct ioreq_server *s, bool buf)
>  {
> -    struct ioreq_page *iorp = buf ? &s->bufioreq : &s->ioreq;
> +    struct ioreq_page *iorp;
>      struct page_info *page;
>  
> +    if ( !buf )
> +    {
> +        if ( s->ioreq )
> +        {
> +            /*
> +             * If a guest frame has already been mapped (which may happen
> +             * on demand if ioreq_server_get_info() is called), then
> +             * allocating a page is not permitted.
> +             */
> +            if ( !gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> +                return -EPERM;
> +
> +            return 0;
> +        }
> +
> +        s->ioreq = xvzalloc_array(ioreq_t, s->target->max_vcpus);
> +
> +        return s->ioreq ? 0 : -ENOMEM;

At this point you haven't fulfilled what the function is supposed to be
doing. The pages you allocate also aren't associated with the domain,
and you haven't obtained writable references.

> @@ -812,26 +861,30 @@ int ioreq_server_get_frame(struct domain *d, ioservid_t id,
>      if ( rc )
>          goto out;
>  
> -    switch ( idx )
> +    if ( idx == XENMEM_resource_ioreq_server_frame_bufioreq)
>      {
> -    case XENMEM_resource_ioreq_server_frame_bufioreq:
>          rc = -ENOENT;
>          if ( !HANDLE_BUFIOREQ(s) )
>              goto out;
>  
>          *mfn = page_to_mfn(s->bufioreq.page);
>          rc = 0;
> -        break;
> -
> -    case XENMEM_resource_ioreq_server_frame_ioreq(0):
> -        *mfn = page_to_mfn(s->ioreq.page);
> -        rc = 0;
> -        break;
> +    }
> +    else if (( idx >= XENMEM_resource_ioreq_server_frame_ioreq(0) ) &&
> +             ( idx < XENMEM_resource_ioreq_server_frame_ioreq(nr_ioreq_pages(d)) ))
> +    {
> +        unsigned int page_idx = idx - XENMEM_resource_ioreq_server_frame_ioreq(0);
>  
> -    default:
>          rc = -EINVAL;
> -        break;
> +        if ( idx >= XENMEM_resource_ioreq_server_frame_ioreq(0) &&

You checked this above already, didn't you?

> +             page_idx < nr_ioreq_pages(d) && s->ioreq )

The former part of this check also looks redundant with the earlier one.
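A standalone sketch of what the de-duplicated logic could look like
(frame_ioreq() mimics XENMEM_resource_ioreq_server_frame_ioreq() under the
assumption that the bufioreq frame is index 0 and ioreq frames follow; the
error values are illustrative only):

```c
#define FRAME_BUFIOREQ 0u
#define frame_ioreq(n) (1u + (n))

/*
 * Once the range test has passed, the body only needs the derived
 * page index - no repeated bounds checks.
 */
static int get_page_idx(unsigned int idx, unsigned int nr_pages)
{
    if ( idx == FRAME_BUFIOREQ )
        return -1; /* handled on the earlier path */

    if ( idx >= frame_ioreq(0) && idx < frame_ioreq(nr_pages) )
        return idx - frame_ioreq(0);

    return -2; /* invalid index */
}
```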

> --- a/xen/include/xen/ioreq.h
> +++ b/xen/include/xen/ioreq.h
> @@ -19,9 +19,19 @@
>  #ifndef __XEN_IOREQ_H__
>  #define __XEN_IOREQ_H__
>  
> +#include <xen/macros.h>
>  #include <xen/sched.h>
>  
>  #include <public/hvm/dm_op.h>
> +#include <public/hvm/ioreq.h>
> +
> +/* 4096 / 32 = 128 ioreq slots per page */
> +#define IOREQS_PER_PAGE  (PAGE_SIZE / sizeof(ioreq_t))

I think the comment wants dropping, as it would end up wrong / useless for
page sizes different from 4k. I'm also not quite sure the #define is overly
useful, as ...

> +static inline unsigned int nr_ioreq_pages(const struct domain *d)
> +{
> +    return DIV_ROUND_UP(d->max_vcpus, IOREQS_PER_PAGE);

... this is its only use.
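For concreteness, a standalone sketch of the arithmetic (the 128 slots/page
figure assumes 4k pages and a 32-byte ioreq_t, as the dropped comment implies):

```c
/* DIV_ROUND_UP as in Xen's xen/macros.h. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* 4096 / 32 = 128 ioreq slots per page, for 4k pages only. */
#define IOREQS_PER_PAGE_DEMO 128u
```

So a 128-vCPU guest needs one page, while 129 vCPUs already need two.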

Jan



 

