
[Xen-devel] FW: VT-d async invalidation for Device-TLB.



> On 12.06.15 at 14:47, <JBeulich@xxxxxxxx> wrote:
> >>> On 12.06.15 at 04:40, <quan.xu@xxxxxxxxx> wrote:
> >> > >>> On 10.06.15 at 16:05, <JBeulich@xxxxxxxx> wrote:
> >> >>> On 03.06.15 at 09:49, <quan.xu@xxxxxxxxx> wrote:
> >> >     For Context Invalidation and IOTLB invalidation without
> >> > Device-TLB invalidation, the Invalidation Queue still flushes
> >> > synchronously, as before. (This is a tradeoff: the cost of an
> >> > interrupt would be pure overhead there.)
> >>
> >> DMAR_OPERATION_TIMEOUT being 1s, are you saying that you're not
> >> intending to replace the current spinning for the non-ATS case?
> >
> > Yes, we are not intending to replace the current spinning for the
> > non-ATS case.
> 
> I'm not really happy about that.

Jan, could you share more about what you expect? We can enable this step
by step. Do you want to replace the current spinning for all queued
invalidations?

> 
> >> Considering that expiring these loops results in panic()s, I would
> >> expect these to become asynchronous _and_ contained to the affected
> >> VM alongside the ATS-induced changed behavior. You're talking of
> >> overhead - can you quantify that?
> >>
> >
> > I tested it with a Myri-10G Dual-Protocol NIC, which is an ATS device.
> > For a single invalidation:
> >  the sync way takes about 1.4 ms;
> >  the async way takes about 4.3 ms.
> 
> What's the theory on why this is? After all, it shouldn't matter how the
> completion of the invalidation gets signaled.
>

Let me explain in more detail how I obtained these test data.

For the sync way, I take a timestamp with the NOW() macro when I update the
Queue Tail Register (qinval_update_qtail()). I take a second timestamp with
NOW() in the current spinning loop, once hardware has written the value of
the Status Data field to the address specified in Status Address. The
difference between these two values is the time of a sync invalidation.
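
To make this concrete, below is a minimal sketch of the sync-path
measurement in C. The poll-loop shape follows the existing code in
xen/drivers/passthrough/vtd/qinval.c, but the variable names and the
placement of the timestamps are illustrative, not the exact patch:

    volatile u32 poll_slot = QINVAL_STAT_INIT;
    s_time_t start, end;

    start = NOW();                      /* just before the tail update */
    qinval_update_qtail(iommu, index);

    /* Current spinning: wait until hardware writes the Status Data
     * value to the address given in the Status Address field. */
    while ( poll_slot != QINVAL_STAT_DONE )
    {
        if ( NOW() > start + DMAR_OPERATION_TIMEOUT )
            panic("queue invalidate wait descriptor was not executed\n");
        cpu_relax();
    }

    end = NOW();                        /* sync latency = end - start */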


For the async way, as described in the previous email, I use the IF bit of
the Invalidation Wait Descriptor.
(IF: indicate invalidation wait descriptor completion by generating an
invalidation completion event, per the programming of the Invalidation
Completion Event Registers.)
I have implemented an interrupt handler for the invalidation completion
event. As before, I take a timestamp with the NOW() macro when I update the
Queue Tail Register (qinval_update_qtail()), and a second timestamp with
NOW() in the invalidation completion event handler. The difference between
these two values is the time of an async invalidation.
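
In code, the async measurement looks roughly like the two fragments below.
The handler name, its registration, and the global timestamp are
assumptions for illustration only:

    static s_time_t qi_submit_time;

    /* Submission side: the wait descriptor is queued with the IF bit
     * set, so hardware raises a completion event instead of being
     * polled; take the first timestamp at the tail update. */
    qi_submit_time = NOW();
    qinval_update_qtail(iommu, index);

    /* Hypothetical handler for the invalidation completion event,
     * programmed via the Invalidation Completion Event Registers. */
    static void qi_completion_interrupt(int irq, void *dev_id,
                                        struct cpu_user_regs *regs)
    {
        /* Second timestamp: async latency = NOW() - qi_submit_time. */
        s_time_t latency = NOW() - qi_submit_time;
    }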



> Apart from that, measuring the ATS case (in which case we're set to use
> async mode anyway) is kind of pointless here - we'd need to know the
> overhead of non-ATS async compared to non-ATS sync.
> 

I will send out those data next Monday; I am out of the office right now
and can't reach that test machine to modify the Xen source code and collect
the non-ATS async data.


> >> > More details:
> >> >
> >> > 1. Invalidation table. We define a struct iommu_invl in struct domain:
> >> >
> >> > struct iommu_invl {
> >> >     volatile u64 iommu_invl_poll_slot:62;
> >> >     domid_t dom_id;
> >> >     u64 iommu_invl_status_data:32;
> >> > } __attribute__ ((aligned (64)));
> >> >
> >> >    iommu_invl_poll_slot: set it equal to the Status Address of the
> >> > wait descriptor when the invalidation queue is used with Device-TLB.
> >> >    dom_id: keep the id of the domain.
> >> >    iommu_invl_status_data: keep the count of in-flight Device-TLB
> >> > invalidations.
> >>
> >> Without further explanation above/below I don't think I really
> >> understand the purpose of this structure, nor its organization: Is
> >> this something imposed by the VT-d specification? If so, a reference
> >> to the respective section in the spec would be useful. If not, I
> >> can't see why the structure is laid out the (odd) way it is.
> >>
> >
> > Refer to the explanation above. If it is still not clear, I will
> > continue to explain in the next email.
> 
> The explanation above helped for what I asked above, but didn't make clear to
> me what the structure here is, how it relates to hw defined structures, and
> hence (as said) why it is laid out the way it is.
> 

Below are the structures:

--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h

+struct iommu_invl {
+    /* Status Address of the wait descriptor for the in-flight flush */
+    volatile u64 iommu_invl_poll_slot;
+    /* Count of in-flight Device-TLB invalidations */
+    u32 iommu_invl_status_data;
+};
+
 
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
+#include <xen/iommu.h>
@@ -364,6 +365,10 @@ struct domain
+
+    struct iommu_invl iommu_qinval_invl;
+    /* The domain is under Device-TLB invalidation flush */
+    s8               iommu_invl_pending_flush;
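
As a sketch of how these fields are intended to be used when a Device-TLB
flush is submitted (the function name and its call site below are
assumptions for illustration, not the final patch):

    /* Hypothetical submission path for an ATS (Device-TLB) flush. */
    static void queue_device_iotlb_flush(struct domain *d,
                                         u64 status_addr, u32 status_data)
    {
        struct iommu_invl *invl = &d->iommu_qinval_invl;

        /* Status Address of the wait descriptor for this flush. */
        invl->iommu_invl_poll_slot = status_addr;
        /* Count of in-flight Device-TLB invalidations. */
        invl->iommu_invl_status_data = status_data;

        /* Mark the domain as blocked on a Device-TLB flush; the
         * completion-event soft-irq clears this and wakes it up. */
        d->iommu_invl_pending_flush = 1;
    }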


> >> > 4. New interrupt handler for invalidation completion:
> >> >     - When hardware completes the invalidations with Device-TLB,
> >> > it generates an interrupt to notify the hypervisor.
> >> >     - In the interrupt handler, we will schedule a soft-irq to handle
> >> > the finished invalidations.
> >> >     - Soft-irq to handle finished invalidations:
> >> >         scan the pending flush list;
> >> >         for each entry in the list, check the values of
> >> > iommu_invl_poll_slot and iommu_invl_status_data in that domain's
> >> > invalidation table;
> >> >         if they match, clear iommu_invl_pending_flush and the
> >> > invalidation table, then wake up the domain.
> >>
> >> Did you put some consideration into how long this list may get, and
> >> hence how long it may take you to iterate through the entire list?
> >>
> >
> > Only domains which have an ATS device assigned will be tracked in
> > this list, so the list shouldn't be very long.
> 
> Okay, if this is a list of domains (or of devices), that would hopefully
> be acceptable (albeit on a huge system this could still be dozens). If
> this were a list of pending flush requests, it might be worse.


Yes, this is a list of domains.
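
For clarity, the soft-irq handler described above would look roughly like
the sketch below; the list head, the per-domain list field, and the wakeup
primitive are assumptions for illustration:

    static void qi_completion_softirq(void)
    {
        struct domain *d, *tmp;

        /* One entry per domain with an ATS device assigned. */
        list_for_each_entry_safe ( d, tmp, &pending_flush_list,
                                   iommu_pending_flush_list )
        {
            struct iommu_invl *invl = &d->iommu_qinval_invl;

            /* Has hardware written the expected Status Data value to
             * the recorded Status Address? */
            if ( *(volatile u32 *)(unsigned long)invl->iommu_invl_poll_slot
                 != invl->iommu_invl_status_data )
                continue;

            /* Flush finished: clear the tracking state, drop the domain
             * from the pending list, and wake it up. */
            d->iommu_invl_pending_flush = 0;
            memset(invl, 0, sizeof(*invl));
            list_del(&d->iommu_pending_flush_list);
            domain_unpause(d);
        }
    }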

> 
> > Besides, Device-TLB invalidation doesn't happen frequently, so the
> > cost should be acceptable.
> 
> That's not a valid consideration: At no time must any processing inside
> the hypervisor take arbitrarily long. This requirement is entirely
> independent of how frequently such cases may occur.
>

Agreed.


 
> Jan


Quan



 

