Re: [PATCH v3 2/2] x86/ioreq: Extend ioreq server to support multiple ioreq pages
On 23.02.2026 10:38, Julian Vetter wrote:
> A single shared ioreq page provides PAGE_SIZE/sizeof(ioreq_t) = 128
> slots, limiting HVM guests to 128 vCPUs. To support more vCPUs, extend
> the ioreq server to use xvzalloc_array() for allocating a contiguous
> virtual array of ioreq_t slots sized to d->max_vcpus, backed by
> potentially non-contiguous physical pages.
>
> For the GFN-mapped path (x86), page and writable type references are
> obtained directly via check_get_page_from_gfn() and get_page_type() for
> each GFN. The pages are then combined into a single contiguous VA using
> vmap(). The number of ioreq pages is computed at runtime via
> nr_ioreq_pages(d) = DIV_ROUND_UP(d->max_vcpus, IOREQS_PER_PAGE), so
> small VMs only allocate one page. All existing single-page paths
> (bufioreq, legacy clients) remain unchanged.
>
> Mark the now-unused shared_iopage_t in the public header as deprecated.
For this I think we need to settle on one of two options: Either it was a
mistake that this type was used in the hypervisor (and added to the public
interface), in which case removing the use may want to be a separate change
(without, imo, any need to mark the item deprecated in the public header,
as it remains part of the interface). Or we deem it legitimate / useful, in
which case you would want to continue using it (in struct ioreq_server).
> @@ -89,6 +91,39 @@ static gfn_t hvm_alloc_ioreq_gfn(struct ioreq_server *s)
> return hvm_alloc_legacy_ioreq_gfn(s);
> }
>
> +static gfn_t hvm_alloc_ioreq_gfns(struct ioreq_server *s,
> + unsigned int nr_pages)
> +{
> + struct domain *d = s->target;
> + unsigned long mask;
> + unsigned int i, run;
> +
> + if ( nr_pages == 1 )
> + return hvm_alloc_ioreq_gfn(s);
> +
> + /* Find nr_pages consecutive set bits */
> + mask = d->arch.hvm.ioreq_gfn.mask;
> +
> + for ( i = 0, run = 0; i < BITS_PER_LONG; i++ )
> + {
> + if ( !test_bit(i, &mask) )
> + run = 0;
> + else if ( ++run == nr_pages )
> + {
> + /* Found a run - clear all bits and return base GFN */
> + unsigned int start = i - nr_pages + 1;
> + unsigned int j;
> +
> + for ( j = start; j <= i; j++ )
> + clear_bit(j, &d->arch.hvm.ioreq_gfn.mask);
Your use of clear_bit() here doesn't make the whole operation atomic. There
will need to be synchronization (also with hvm_alloc_ioreq_gfn()), and once
that's in place (or if things are suitably synchronized already),
__clear_bit() ought to suffice here.
> + return _gfn(d->arch.hvm.ioreq_gfn.base + start);
> + }
> + }
> +
> + return INVALID_GFN;
> +}
Did you consider whether fragmentation could get in the way here, as is usually
the case when doing mixed-size allocations from a single pool? And to what
extent is it necessary for the GFNs used to be consecutive?
> @@ -134,16 +181,41 @@ static void hvm_unmap_ioreq_gfn(struct ioreq_server *s, bool buf)
>
> hvm_free_ioreq_gfn(s, iorp->gfn);
> iorp->gfn = INVALID_GFN;
> + return;
> }
> +
> + if ( gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> + return;
> +
> + nr_pages = nr_ioreq_pages(s->target);
> +
> + for ( i = 0; i < nr_pages; i++ )
> + {
> + struct page_info *pg = vmap_to_page((char *)s->ioreq +
Please can you cast to const void * (or void *) instead? But perhaps it would
be better to latch the pointer into a local variable anyway, for ...
> + i * PAGE_SIZE);
> +
> + put_page_and_type(pg);
> + put_page(pg);
> + }
> + vunmap(s->ioreq);
> + s->ioreq = NULL;
... this clearing to move up ahead of any teardown done.
> + hvm_free_ioreq_gfns(s, s->ioreq_gfn, nr_pages);
> + s->ioreq_gfn = INVALID_GFN;
This similarly may want moving up.
As to the loop body, destroy_ring_for_helper() has put_page_and_type(), but
no put_page(). Why is this different here?
> @@ -173,30 +245,132 @@ static int hvm_map_ioreq_gfn(struct ioreq_server *s, bool buf)
>
> return rc;
> }
> +
> + /* ioreq: multi-page with contiguous VA */
> + if ( s->ioreq )
> + {
> + if ( gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> + return -EPERM;
The comment on the other path at least wants referencing here.
> + return 0;
> + }
> +
> + if ( d->is_dying )
> + return -EINVAL;
> +
> + nr_pages = nr_ioreq_pages(d);
> + base_gfn = hvm_alloc_ioreq_gfns(s, nr_pages);
> +
> + if ( gfn_eq(base_gfn, INVALID_GFN) )
> + return -ENOMEM;
> +
> + mfns = xmalloc_array(mfn_t, nr_pages);
Especially when there are only a few MFNs, this dynamic allocation would be
nice to avoid. If there was a not overly large upper bound, using a per-CPU
array might also be an option.
If the dynamic allocation stays, new code wants to use xvmalloc() and
friends.
> + if ( !mfns )
> + {
> + hvm_free_ioreq_gfns(s, base_gfn, nr_pages);
> + return -ENOMEM;
> + }
> +
> + /*
> + * Obtain a page reference and writable type reference for each GFN.
> + * No per-page VA is needed; all pages are mapped as a single contiguous
> + * VA via vmap() below.
> + */
I think this comment wants to mention prepare_ring_for_helper(), so that if
updates are done there, there's at least a fair chance that people might spot
that changes need doing here as well.
> + for ( i = 0; i < nr_pages; i++ )
> + {
> + struct page_info *pg;
> + p2m_type_t p2mt;
> +
> + rc = check_get_page_from_gfn(d, _gfn(gfn_x(base_gfn) + i),
gfn_add() please
> + false, &p2mt, &pg);
> + if ( rc )
> + {
> + if ( rc == -EAGAIN )
> + rc = -ENOENT;
> + goto fail;
> + }
> +
> + if ( !get_page_type(pg, PGT_writable_page) )
> + {
> + put_page(pg);
> + rc = -EINVAL;
> + goto fail;
> + }
> +
> + mfns[i] = page_to_mfn(pg);
> + }
> +
> + /* Map all mfns as single contiguous VA */
> + s->ioreq = vmap(mfns, nr_pages);
> + if ( !s->ioreq )
> + {
> + rc = -ENOMEM;
> + goto fail;
> + }
> +
> + s->ioreq_gfn = base_gfn;
> + xfree(mfns);
> +
> + return 0;
> +
> + fail:
> + while ( i-- > 0 )
> + {
> + struct page_info *pg = mfn_to_page(mfns[i]);
> +
> + put_page_and_type(pg);
> + put_page(pg);
Same issue here - I can't spot which page reference you're dropping. You
obtained only one above.
> @@ -208,6 +382,32 @@ static int hvm_add_ioreq_gfn(struct ioreq_server *s, bool buf)
>
> return rc;
> }
> +
> + if ( gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> + return 0;
> +
> + nr_pages = nr_ioreq_pages(d);
> + memset(s->ioreq, 0, nr_pages * PAGE_SIZE);
> +
> + for ( i = 0; i < nr_pages; i++ )
> + {
> + gfn_t gfn = gfn_add(s->ioreq_gfn, i);
> + struct page_info *pg = vmap_to_page((char *)s->ioreq +
> + i * PAGE_SIZE);
> +
> + rc = p2m_add_page(d, gfn, page_to_mfn(pg), 0, p2m_ram_rw);
> + if ( rc )
> + /*
> + * No rollback of previously added pages: The caller
> + * (arch_ioreq_server_disable) has no error handling path,
> + * and partial failure here will be cleaned up when the
> + * ioreq server is eventually destroyed.
> + */
> + return rc;
Shouldn't you continue the loop, to try to add back as many pages as
you can, so that the domain encountering problems later is as unlikely as
possible (albeit even a single missing page is already bad enough)?
> @@ -260,9 +259,32 @@ bool vcpu_ioreq_handle_completion(struct vcpu *v)
>
> static int ioreq_server_alloc_mfn(struct ioreq_server *s, bool buf)
> {
> - struct ioreq_page *iorp = buf ? &s->bufioreq : &s->ioreq;
> + struct ioreq_page *iorp;
> struct page_info *page;
>
> + if ( !buf )
> + {
> + if ( s->ioreq )
> + {
> + /*
> + * If a guest frame has already been mapped (which may happen
> + * on demand if ioreq_server_get_info() is called), then
> + * allocating a page is not permitted.
> + */
> + if ( !gfn_eq(s->ioreq_gfn, INVALID_GFN) )
> + return -EPERM;
> +
> + return 0;
> + }
> +
> + s->ioreq = xvzalloc_array(ioreq_t, s->target->max_vcpus);
> +
> + return s->ioreq ? 0 : -ENOMEM;
At this point you haven't fulfilled what the function is supposed to be
doing. The pages you allocate also aren't associated with the domain,
and you haven't obtained writable references.
> @@ -812,26 +861,30 @@ int ioreq_server_get_frame(struct domain *d, ioservid_t id,
> if ( rc )
> goto out;
>
> - switch ( idx )
> + if ( idx == XENMEM_resource_ioreq_server_frame_bufioreq)
> {
> - case XENMEM_resource_ioreq_server_frame_bufioreq:
> rc = -ENOENT;
> if ( !HANDLE_BUFIOREQ(s) )
> goto out;
>
> *mfn = page_to_mfn(s->bufioreq.page);
> rc = 0;
> - break;
> -
> - case XENMEM_resource_ioreq_server_frame_ioreq(0):
> - *mfn = page_to_mfn(s->ioreq.page);
> - rc = 0;
> - break;
> + }
> + else if (( idx >= XENMEM_resource_ioreq_server_frame_ioreq(0) ) &&
> + ( idx < XENMEM_resource_ioreq_server_frame_ioreq(nr_ioreq_pages(d)) ))
> + {
> + unsigned int page_idx = idx - XENMEM_resource_ioreq_server_frame_ioreq(0);
>
> - default:
> rc = -EINVAL;
> - break;
> + if ( idx >= XENMEM_resource_ioreq_server_frame_ioreq(0) &&
You checked this above already, didn't you?
> + page_idx < nr_ioreq_pages(d) && s->ioreq )
The former part of this check also looks redundant with the earlier one.
> --- a/xen/include/xen/ioreq.h
> +++ b/xen/include/xen/ioreq.h
> @@ -19,9 +19,19 @@
> #ifndef __XEN_IOREQ_H__
> #define __XEN_IOREQ_H__
>
> +#include <xen/macros.h>
> #include <xen/sched.h>
>
> #include <public/hvm/dm_op.h>
> +#include <public/hvm/ioreq.h>
> +
> +/* 4096 / 32 = 128 ioreq slots per page */
> +#define IOREQS_PER_PAGE (PAGE_SIZE / sizeof(ioreq_t))
I think the comment wants dropping, as it would end up wrong / useless for a
page size different from 4k. I'm also not quite sure the #define is overly
useful, as ...
> +static inline unsigned int nr_ioreq_pages(const struct domain *d)
> +{
> + return DIV_ROUND_UP(d->max_vcpus, IOREQS_PER_PAGE);
... this is its only use.
Jan