
Re: [Xen-devel] Scalable Event Channel ABI design (draft A)



On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> The pages within the event array need not be physically or virtually
> contiguous, but the guest or Xen may make them virtually contiguous for
> ease of implementation.  e.g., by using vmap() in Xen or vmalloc() in
> Linux.  Pages are added by the guest as required by the number of
> bound event channels.

Strictly speaking the size is related to the maximum bound evtchn, which
need not be the same as the number of bound evtchns.

> The state of an event is tracked using 3 bits within the event word:
> the M (masked), P (pending) and L (linked) bits.  Only state
> transitions that change a single bit are valid.

The diagram shows transitions B<->P and P<->L, which involve two bits in
each case. So do BM<->PM and LM<->BM now I think about it.

Is the L bit redundant with the LINK field == 0 or != 0?

> 
> Event Queues
> ------------
> 
> The event queues use a singly-linked list of event array words (see
> figures \ref{fig_event-word} and \ref{fig_event-queue}).
> 
> ![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)
> 
> Each event queue has a head and tail index stored in the control
> block.  The head index is the index of the first element in the queue.
> The tail index is the index of the last element in the queue.  Every
> element within the queue has the L bit set.
> 
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
> 
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
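
For my own reference, the layout I take from this section is roughly the
following (a sketch only: field names and widths are my guesses from the
text, and a real implementation would presumably use explicit masks and
atomic ops rather than C bit-fields):

    #include <stdint.h>

    /* Sketch of one 32-bit event array word as I read the text above;
     * the width of the LINK field is an assumption, not from the draft. */
    struct event_word {
        uint32_t pending:1;   /* P: event is pending */
        uint32_t masked:1;    /* M: event is masked */
        uint32_t linked:1;    /* L: event is on a queue */
        uint32_t link:29;     /* index of next event in the queue, 0 = end */
    };

    /* Per-priority queue indexes held in the control block. */
    struct event_queue {
        uint32_t head;        /* index of the first event, 0 = empty queue */
        uint32_t tail;        /* index of the last event */
    };
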
> 
> Hypercalls
> ----------
> 
> Four new EVTCHNOP hypercall sub-operations are added:
> 
> * `EVTCHNOP_init`
> 
> * `EVTCHNOP_expand`
> 
> * `EVTCHNOP_set_priority`
> 
> * `EVTCHNOP_set_limit`
> 
> ### `EVTCHNOP_init`
> 
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.

Isn't the event array per-domain?

> A guest should call this during initial VCPU bring up (and on resume?).
> 
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };

I think this will have a different layout on 32 and 64 bit x86, if you
care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).
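
e.g. adding explicit padding would give the same layout on both (just a
sketch):

    struct evtchnop_init {
        uint32_t vcpu;
        uint32_t pad;        /* explicit padding: array_pfn then sits at
                                offset 8 and sizeof is 16 on both 32- and
                                64-bit x86 */
        uint64_t array_pfn;
    };
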

> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> ENOMEM      Insufficient memory to allocate internal structures.
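
FWIW the guest-side call during VCPU bring-up would then presumably look
something like the below (a sketch only: the helper name is made up,
though HYPERVISOR_event_channel_op is the existing entry point):

    /* Sketch: initialise the new ABI for one VCPU.  'array_frame' is the
     * PFN/GMFN of a zeroed page the guest has set aside for the first
     * page of its event array. */
    static int evtchn_init_vcpu(uint32_t vcpu, uint64_t array_frame)
    {
        struct evtchnop_init init = {
            .vcpu      = vcpu,
            .array_pfn = array_frame,
        };

        return HYPERVISOR_event_channel_op(EVTCHNOP_init, &init);
    }
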
> 
> ### `EVTCHNOP_expand`
> 
> This call expands the event array for a VCPU by appending an additional
> page.

This doesn't seem all that different to _init, except that _init handles
the transition from 0->1 event pages and this handles N->N+1?
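
Presumably its argument struct is more or less identical too, something
like (my guess, not from the draft):

    struct evtchnop_expand {
        uint32_t vcpu;
        uint32_t pad;         /* same 32/64-bit layout caveat as _init */
        uint64_t array_pfn;   /* PFN/GMFN of the page to append */
    };

in which case folding the two into a single "append a page" operation,
where the first call on a VCPU also does the initialisation, might be
simpler.
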

> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.



> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.
> 
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> 
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.

EBUSY?

> EINVAL      `priority` is outside the range 0 - 15.
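
So the guest-side flow is something like the below, called before the
port is bound? (sketch; the helper name is made up)

    /* Sketch: set the priority of an as-yet-unbound event channel. */
    static int evtchn_set_priority(uint32_t port, uint32_t priority)
    {
        struct evtchnop_set_priority op = {
            .port     = port,
            .priority = priority,   /* 0..15, default 7 */
        };

        return HYPERVISOR_event_channel_op(EVTCHNOP_set_priority, &op);
    }
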
> 
> ### `EVTCHNOP_set_limit`
> 
> This privileged call sets the maximum number of event channels a
> domain can bind.  The default for dom0 is unlimited.  Other domains
> default to 1024 events (requiring only a single page for their event
> array).
> 
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
> 
> Field         Purpose
> -----         -------
> `domid`       [in] The domain ID.
> `max_events`  [in] The maximum number of event channels that may be bound
>               to the domain.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
> 
> Memory Usage
> ------------
> 
> ### Event Arrays
> 
> Xen needs to map every domain's event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
> 
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration.  We can make some assumptions and
> derive an approximate limit.
> 
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.  Since event channels
> typically come in pairs,

I'm not sure this is the case, since evtchns are bidirectional (i.e.
either end can use one to signal the other) so one is often shared for
both request and response notifications.

> the upper bound on the total number of pages
> is $2 \times\text{number of VMs}$.
> 
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
> 
> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
> 
> Using only half the available pages and limiting guests to 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
>   & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
> 
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
> 
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>    & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}

This accounts for the driver domains and dom0 but not the domains which
they are serving, doesn't it?
 
> In summary, there is space to map the event arrays for over 100,000
> VMs.  This is more than the limit imposed by the 16 bit domain ID
> (~32,000 VMs).

Is there scope to reduce the maximum then?

> 
> ### Control Block
> 
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
> 
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.

There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel
which I suppose becomes redundant now.

I don't think it would be a problem to predicate use of this new
interface on the use of the VCPU_placement API and therefore give scope
to expand the vcpu_info.
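
For concreteness, I read those 128 bytes as one head/tail pair per
priority, i.e. something like (names are mine):

    #define EVTCHN_NUM_PRIORITIES 16

    struct evtchn_control_block {
        struct {
            uint32_t head;
            uint32_t tail;
        } queue[EVTCHN_NUM_PRIORITIES];   /* 16 * 2 * 4 = 128 bytes */
    };
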

> Low Level Design
> ================
> 
> In the pseudo code in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.

> These variables have a standard meaning:
> 
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
> 
> These memory barriers are required:
> 
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier
> 
> Raising an Event
> ----------------
> 
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it to the tail of the event queue.

What are the conditions for actually performing the upcall when
returning from the hypervisor to the guest?

>     E[p].pending = 1
>     if not E[p].linked and not E[p].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p

Do you not need a barrier towards the end here to ensure that a consumer
who is currently processing interrupts sees the updated link when they
get there?
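
To show where I mean, a C-ish sketch of the same steps (E and queue as in
the earlier sketches; the whole sequence runs under the per-queue lock
mentioned below):

    static void raise_event(struct event_queue *queue, struct event_word *E,
                            uint32_t p)
    {
        E[p].pending = 1;
        if ( !E[p].linked && !E[p].masked )
        {
            E[p].linked = 1;
            E[p].link = 0;
            mb();
            if ( queue->head == 0 )
                queue->head = p;
            else
                E[queue->tail].link = p;
            /* (*) a barrier here, so a consumer already walking the
             *     queue sees the new link before the tail moves on and
             *     the upcall is delivered? */
            queue->tail = p;
        }
    }
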

> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
> 
> Consuming Events
> ----------------
> 
> The guest consumes events starting at the head until it reaches the
> tail.  Events in the queue that are not pending or are masked are
> consumed but not handled.
> 
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()

What is this mb() needed for?

>             H = E[p].link
>         E[H].linked = 0

Did you mean E[p].linked here?

If at this point the interrupt is re-raised then the if in the raising
pseudo code becomes true, linked will be set again and don't we also race
with the clearing of E[p].pending below?

Is a barrier needed around here due to Xen also reading E[p].pending?

>         if not E[p].masked
>             handle(p)
> 
> handle() clears `E[p].pending` 

What barriers are required for this and where? Is it done at the start
or the end of the handler? (early or late EOI)

> and EOIs level-triggered PIRQs.
> 
> > Note: When the event queue contains a single event, removing the
> > event may race with Xen appending another event because the load of
> > `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> > the guest must recheck `E[p].link` if the list appeared empty.

It appears that both "Raising an Event" and "Consuming Events" can write
H? Is that correct? Likewise for the linked bit.

It's a bit unclear because the pseudo-code doesn't make it explicit
which variables are part of the shared data structure and which are
private to the local routine.
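
To make that last point concrete, this is how I read it, with the shared
fields spelled out and the only local variable marked (sketch; queue and
E as in the earlier sketches, mb() and handle() as in the draft):

    static void consume_events(struct event_queue *queue,
                               struct event_word *E)
    {
        uint32_t p;                       /* local to the guest */

        while ( queue->head != 0 )        /* shared */
        {
            p = queue->head;
            queue->head = E[p].link;      /* shared: Xen also writes this */
            if ( queue->head == 0 )
            {
                mb();
                queue->head = E[p].link;  /* re-check, per the race note */
            }
            E[p].linked = 0;              /* draft has E[H] -- see above */
            if ( !E[p].masked )
                handle(p);                /* handle() clears E[p].pending */
        }
    }
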

> Masking Events
> --------------
> 
> Events are masked by setting the masked bit.  If the event is pending
> and linked it does not need to be unlinked.
> 
>     E[p].masked = 1
> 
> Unmasking Events
> ----------------
> 
> Events are unmasked by the guest by clearing the masked bit.  If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
> 
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)

Can the hypercall do the E[p].masked = 0?
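
To make sure I follow, the comparison I have in mind (sketch, reusing the
notation above):

    /* As drafted: the guest clears the bit itself, then maybe calls Xen. */
    E[p].masked = 0;
    if ( E[p].pending )
        hypercall(EVTCHN_unmask);

    /* vs. having EVTCHN_unmask also do the E[p].masked = 0, under Xen's
     * queue lock, so the clear and any re-linking cannot be interleaved
     * with a new event being raised.  The guest would presumably still
     * test E[p].pending first to keep the common case hypercall-free. */
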

> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
> 
> > Note: after clearing the mask bit, the event may be raised and
> > thus it may already be linked by the time the hypercall is done.
> 