Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> -----Original Message-----
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.cooper3@xxxxxxxxxx; Dong, Eddie; ian.campbell@xxxxxxxxxx;
> ian.jackson@xxxxxxxxxxxxx; jbeulich@xxxxxxxx; Nakajima, Jun; keir@xxxxxxx;
> Tian, Kevin; tim@xxxxxxx; Zhang, Yang Z; george.dunlap@xxxxxxxxxxxxx
> Cc: xen-devel@xxxxxxxxxxxxx; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>
> Introduction
> ============
>
> VT-d code currently has a number of cases where completion of certain
> operations is waited for by spinning. The majority of instances use
> DMAR_OPERATION_TIMEOUT indirectly through the IOMMU_WAIT_OP() macro,
> allowing for loops of up to 1 second. While this may be acceptable in many
> of the cases, the invalidation case seems particularly problematic.
>
> Currently the hypervisor polls the status address of the wait descriptor for
> up to 1 second to get the invalidation flush result. When the invalidation
> queue includes a Device-TLB invalidation, using 1 second for the invalidation
> sync is a mistake: the 1 second timeout relates to response times of the
> IOMMU engine, not to Device-TLB invalidation with PCI-e Address Translation
> Services (ATS) in use. The ATS specification mandates a timeout of 1 _minute_
> for cache flush. The ATS case needs to be taken into consideration when doing
> invalidations. Obviously we can't spin for a minute, so invalidation
> absolutely needs to be converted to a non-spinning model.
>
> This also raises a memory security issue that I need to fix.
> A page freed from the domain should be held until the Device-TLB flush is
> completed (ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA should not be
> reallocated for another purpose until the appropriate invalidations have been
> performed. Otherwise, the original page owner can still access the freed page
> through DMA.
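To illustrate why the 1 second budget cannot simply be raised to 1 minute, here is a minimal sketch of a bounded spin-wait. It is not the Xen IOMMU_WAIT_OP() macro: every name prefixed with sketch_ is hypothetical, and the clock and the "hardware" status write are simulated so that the code runs stand-alone. The point is only that the CPU does nothing else for as long as the loop runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the Xen definitions. */
#define SKETCH_IOMMU_TIMEOUT_MS 1000u   /* 1 second budget for the IOMMU engine */
#define SKETCH_ATS_TIMEOUT_MS   60000u  /* 1 minute budget mandated by ATS */

/* Fake monotonic clock so the sketch needs no hardware. */
static uint64_t sketch_now_ms;

/*
 * Spin until *status == expected or timeout_ms has elapsed.
 * Each iteration advances the fake clock by 1 ms; the simulated device
 * writes the status word once completes_at_ms have passed.
 * Returns true on completion, false on timeout; *spun_ms reports how long
 * the CPU was kept busy.
 */
static bool sketch_spin_wait(volatile uint32_t *status, uint32_t expected,
                             uint64_t timeout_ms, uint64_t completes_at_ms,
                             uint64_t *spun_ms)
{
    uint64_t start = sketch_now_ms;

    assert(status != NULL && spun_ms != NULL);

    while (*status != expected) {
        if (sketch_now_ms - start >= timeout_ms) {
            *spun_ms = sketch_now_ms - start;
            return false;                 /* gave up: timed out */
        }
        sketch_now_ms++;                  /* busy: no other work gets done */
        if (sketch_now_ms - start >= completes_at_ms)
            *status = expected;           /* simulated status write */
    }

    *spun_ms = sketch_now_ms - start;
    return true;
}
```

With a device that answers after 30 seconds, the 1 second budget reports a spurious timeout, while the 1 minute budget succeeds only after keeping the CPU busy for the whole 30 seconds. Neither outcome is acceptable, which is the motivation for the interrupt-based model.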
>
> Why RFC
> =======
> Patch 0001--0005, 0013 are IOMMU related.
> Patch 0006 is about the new flag (vCPU / MMU related).
> Patch 0007 is vCPU related.
> Patch 0008--0012 are MMU related.
>
> 1. Xen MMU is very complicated. Could Xen MMU experts help me verify
>    whether I have covered all of the cases?
>
> 2. For gnttab_transfer, if the Device-TLB flush has still not completed when
>    the transferring page is to be mapped to a remote domain, schedule and
>    wait on a waitqueue until the Device-TLB flush is completed. Is this
>    correct?
>
>    (I have tested the waitqueue in decrease_reservation() [do_memory_op()
>    hypercall]. I wake up the domain (with only one vCPU) with the debug-key
>    tool, and the domain is still working after waiting 60s in a waitqueue.)

More information:
I ran a service in this domain and tested this waitqueue case. The domain is still working after 60s, but dmesg prints the following call trace:

[ 161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]
[ 161.978621] Modules linked in: crct10dif_pclmul(F) crc32_pclmul(F) joydev(F) ghash_clmulni_intel(F) cryptd(F) xen_kbdfront(F) microcode(F) cirrus ttm drm_kms_helper drm psmouse(F) serio_raw(F) syscopyarea(F) sysfillrect(F) sysimgblt(F) i2c_piix4 ext2(F) mac_hid lp(F) parport(F) myri10ge dca floppy(F)
[ 161.978626] CPU: 0 PID: 272 Comm: kworker/0:1 Tainted: GF 3.11.0-12-generic #19-Ubuntu
[ 161.978628] Hardware name: Xen HVM domU, BIOS 4.6.0-rc 08/03/2015
[ 161.978638] Workqueue: events balloon_process
[ 161.978640] task: ffff88007a1b4650 ti: ffff88007a1f2000 task.ti: ffff88007a1f2000
[ 161.978650] RIP: 0010:[<ffffffff81001185>] [<ffffffff81001185>] xen_hypercall_memory_op+0x5/0x20
[ 161.978652] RSP: 0018:ffff88007a1f3d60 EFLAGS: 00000246
[ 161.978653] RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000000000001568
[ 161.978654] RDX: 0000000000000000 RSI: ffff88007a1f3d70 RDI: 0000000000000041
[ 161.978656] RBP: ffff88007a1f3db8 R08: ffffffff81d04888 R09: 000000000006a5dd
[ 161.978657] R10: 0000000000003690 R11: ffff88007f7fa750 R12: ffff880036671000
[ 161.978658] R13: ffffffff810c6176 R14: ffff88007a1f3d20 R15: 0000000000000000
[ 161.978660] FS: 0000000000000000(0000) GS:ffff88007be00000(0000) knlGS:0000000000000000
[ 161.978661] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 161.978662] CR2: 00007f8d3e97e000 CR3: 0000000001c0e000 CR4: 00000000001406f0
[ 161.978669] Stack:
[ 161.978673] ffffffff8141a16a 00000000365b0048 ffffffff81fb1520 00000000000000ff
[ 161.978676] 0000000000000000 0000000000007ff0 ffffffff81c97301 0000160000000000
[ 161.978678] ffff88007be13e00 ffff88007be17e00 0000000000000000 ffff88007a1f3e20
[ 161.978679] Call Trace:
[ 161.978684] [<ffffffff8141a16a>] ? decrease_reservation+0x29a/0x2e0
[ 161.978688] [<ffffffff8141a513>] balloon_process+0x333/0x430
[ 161.978695] [<ffffffff8107d05c>] process_one_work+0x17c/0x430
[ 161.978699] [<ffffffff8107dcac>] worker_thread+0x11c/0x3c0
[ 161.978702] [<ffffffff8107db90>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 161.978710] [<ffffffff810847b0>] kthread+0xc0/0xd0
[ 161.978715] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120
[ 161.978722] [<ffffffff816f516c>] ret_from_fork+0x7c/0xb0
[ 161.978727] [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120

Thanks
Quan

>
> Design Overview
> ===============
>
> This design implements a non-spinning model for Device-TLB invalidation,
> using an interrupt based mechanism. The Device-TLB invalidation status is
> tracked in a per-domain invalidation table. The invalidation table keeps the
> count of in-flight Device-TLB invalidation requests, and also provides a
> per-domain global polling parameter for in-flight Device-TLB invalidation
> requests.
> When invalidation requests are submitted, update the invalidation table's
> count of in-flight Device-TLB invalidation requests and assign the address of
> the per-domain global polling parameter to the Status Address of each
> invalidation wait descriptor.
>
> For example:
> .
> .
> |invl |
> |wait | Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
> |dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
> .
> .
> |invl |
> |wait | Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
> |dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
> .
> .
> |invl |
> |wait | Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
> |dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
> .
> .
>
> For more information about the VT-d Invalidation Wait Descriptor, please
> refer to
>
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
>   Status Address and Data: Status address and data is used by hardware to
>   perform wait descriptor completion status write when the Status Write (SW)
>   field is Set. Hardware behavior is undefined if the Status Address
>   specified is not an address route-able to memory (such as peer address,
>   interrupt address range of 0xFEEX_XXXX etc.). The Status Address and Data
>   fields are ignored by hardware when the Status Write field is Clear.
>
> The invalidation completion event interrupt is generated only after the
> invalidation wait descriptor completes. The invalidation interrupt handler
> schedules a soft-irq to do the following check:
>
>   if invalidation table's count of in-flight Device-TLB invalidation
>   requests == polling parameter:
>       This domain has no in-flight Device-TLB invalidation requests.
>   else
>       This domain has in-flight Device-TLB invalidation requests.
>
> Track domain status:
>   A vCPU is NOT allowed to enter guest mode, and is put into the
>   SCHEDOP_yield list, if its domain has in-flight Device-TLB invalidation
>   requests.
>
> Memory security issue:
>   With PCI-e Address Translation Services (ATS) in use, the ATS spec
>   mandates a timeout of 1 minute for cache flush.
>   A page freed from the domain should be held until the Device-TLB flush is
>   completed. The page previously associated with the freed portion of GPA
>   should not be reallocated for another purpose until the appropriate
>   invalidations have been performed. Otherwise, the original page owner can
>   still access the freed page through DMA.
>
>   *Hold the page until the Device-TLB flush is completed.
>     - Unlink the page from the original owner.
>     - Remove the page from the page_list of the domain.
>     - Decrease the total pages count of the domain.
>     - Add the page to qi_hold_page_list.
>
>   *Put the page in the Queued Invalidation (QI) interrupt handler once the
>    Device-TLB flush is completed.
>
> Invalidation Fault:
>   A fault event will be generated if an invalidation fails. We can then
>   disable the devices.
>
> For Context Invalidation and IOTLB invalidation without Device-TLB
> invalidation, Queued Invalidation (QI) submits invalidation requests as
> before. (This is a tradeoff, since the interrupt adds overhead. It will be
> modified in a coming patch series.)
>
> More details
> ============
>
> 1. Invalidation table. We define a qi_table structure per domain.
> +struct qi_talbe {
> +    u64 qi_table_poll_slot;
> +    u32 qi_table_status_data;
> +};
>
> @ struct hvm_iommu {
> +    /* IOMMU Queued Invalidation(QI) */
> +    struct qi_talbe talbe;
> }
>
> 2. Modification to Device IOTLB invalidation:
>   - Enable interrupt notification when hardware completes the
>     invalidations: set the FN, IF and SW bits in the Invalidation Wait
>     Descriptor. The SW bit is also set because the notification interrupt is
>     global, not per domain, so the QI interrupt handler still needs to poll
>     the status address to know which Device-TLB invalidation request has
>     completed.
>   - A new per-domain flag (*qi_flag) is used to track the status of the
>     Device-TLB invalidation request.
>     The *qi_flag will be set before submitting the Device-TLB invalidation
>     requests.
>     The vCPU is NOT allowed to enter guest mode, and is put into the
>     SCHEDOP_yield list, if the *qi_flag is Set.
>   - New logic to synchronize:
>       if no Device-TLB invalidation:
>           Back to current invalidation logic.
>       else
>           Set IF, SW, FN bits in the wait descriptor and prepare the Status
>           Data.
>           Set *qi_flag.
>           Put the domain in the pending flush list. (The vCPU is NOT allowed
>           to enter guest mode, and is put into SCHEDOP_yield, if the
>           *qi_flag is Set.)
>       Return
>
> For more information about the VT-d Invalidation Wait Descriptor, please
> refer to
>
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
>   SW: Indicate the invalidation wait descriptor completion by performing a
>       coherent DWORD write of the value in the Status Data field to the
>       address specified in the Status Address.
>   FN: Indicate the descriptors following the invalidation wait descriptor
>       must be processed by hardware only after the invalidation wait
>       descriptor completes.
>   IF: Indicate the invalidation wait descriptor completion by generating an
>       invalidation completion event per the programming of the Invalidation
>       Completion Event Registers.
>
> 3. Modification to domain running lifecycle:
>   - When the *qi_flag is set, i.e. there are in-flight Device-TLB
>     invalidation requests, the domain is not allowed to enter guest mode and
>     is put into the SCHEDOP_yield list.
>
> 4. New interrupt handler for invalidation completion:
>   - When hardware completes the Device-TLB invalidation requests, it
>     generates an interrupt to notify the hypervisor.
>   - The interrupt handler schedules a tasklet to handle it.
>   - The tasklet does the following:
>     *Clear the IWC field in the Invalidation Completion Status register. If
>      the IWC field in the Invalidation Completion Status register was
>      already Set at the time of setting this field, it is not treated as a
>      new interrupt condition.
>     *Scan the domain list. (Only domains with VT-d passthrough devices;
>      scan 'iommu->domid_bitmap'.)
>        for each domain:
>            check the values of the invalidation table (qi_table_poll_slot
>            and qi_table_status_data) of the domain.
>            if equal:
>                Put the on-hold pages.
>                Clear the invalidation table.
>                Clear *qi_flag.
>
>     *If the IP field of the Invalidation Event Control Register is Set, try
>      to *Clear IWC and *Scan the domain list again, instead of generating
>      another interrupt.
>     *Clear the IM field of the Invalidation Event Control Register.
>
> ((
>   Logic of IWC / IP / IM as below:
>
>          Interrupt condition (An invalidation wait descriptor
>          with Interrupt Flag (IF) field Set completed.)
>                              ||
>                              v
>       ----------------------(IWC)----------------------
>       (IWC is Set)                      (IWC is not Set)
>            ||                                  ||
>            V                                   V
>   (Not treated as a new                  (Set IWC / IP)
>    interrupt condition)                        ||
>                                                V
>                        ---------------------(IM)---------------------
>                        (IM is Set)                      (IM not Set)
>                            ||                                ||
>                            V                                 V
>               (interrupt is held pending;        (cause interrupt message,
>                clearing IM causes the             then hardware clears IP)
>                interrupt message)
>
>   * If the IWC field is being cleared, the IP field is cleared.
> ))
>
> 5. Invalidation failed.
>   - A fault event will be generated if invalidation fails. We can disable
>     the devices if we receive an invalidation fault event.
>
> 6. Memory security issue:
>
>   A page freed from the domain should be held until the Device-TLB flush is
>   completed. The page previously associated with the freed portion of GPA
>   should not be reallocated for another purpose until the appropriate
>   invalidations have been performed. Otherwise, the original page owner can
>   still access the freed page through DMA.
>
>   *Hold the page until the Device-TLB flush is completed.
>     - Unlink the page from the original owner.
>     - Remove the page from the page_list of the domain.
>     - Decrease the total pages count of the domain.
>     - Add the page to qi_hold_page_list.
>
>   *Put the page in the Queued Invalidation (QI) interrupt handler once the
>    Device-TLB flush is completed.
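The bookkeeping described in sections 1 to 4 and 6 can be sketched in a few lines of stand-alone C. This is only an illustration under assumptions, not the patch code: all sketch_ names are hypothetical, the "hardware" status write is simulated by a plain function, and the held-page list is a fixed array instead of a real page list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_HELD 8

/* Hypothetical per-domain state, loosely modelled on the qi_table idea. */
struct sketch_domain {
    volatile uint32_t poll_slot;  /* written on wait-descriptor completion */
    uint32_t status_data;         /* count of Device-TLB flushes submitted */
    bool qi_flag;                 /* set while a flush is in flight */
    unsigned long held[SKETCH_MAX_HELD]; /* frames parked until completion */
    size_t nr_held;
};

/* Submit path: bump the count and mark the domain as pending. The return
 * value is what the wait descriptor would carry as Status Data. */
static uint32_t sketch_submit_flush(struct sketch_domain *d)
{
    d->qi_flag = true;
    return ++d->status_data;
}

/* Park a freed page instead of handing it back to the allocator. */
static bool sketch_hold_page(struct sketch_domain *d, unsigned long mfn)
{
    if (d->nr_held == SKETCH_MAX_HELD)
        return false;
    d->held[d->nr_held++] = mfn;
    return true;
}

/* Simulated hardware: the status write of a completed wait descriptor. */
static void sketch_hw_complete(struct sketch_domain *d, uint32_t status_data)
{
    d->poll_slot = status_data;
}

/* Completion path (the tasklet in the design). Returns how many held pages
 * were released; 0 means flushes are still in flight. */
static size_t sketch_scan_domain(struct sketch_domain *d)
{
    size_t released;

    assert(d != NULL);
    if (!d->qi_flag || d->poll_slot != d->status_data)
        return 0;

    released = d->nr_held;
    d->nr_held = 0;        /* "put" the on-hold pages */
    d->poll_slot = 0;      /* clear the invalidation table */
    d->status_data = 0;
    d->qi_flag = false;    /* vCPUs may enter guest mode again */
    return released;
}

/* Gate checked before a vCPU is allowed to enter guest mode. */
static bool sketch_may_enter_guest(const struct sketch_domain *d)
{
    return !d->qi_flag;
}
```

Because the completion interrupt is global, the scan compares the polled value with the submitted count per domain; a domain whose latest wait descriptor has not yet written its Status Data keeps its pages held and its vCPUs out of guest mode.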
>
>
> ----
> There are 3 reasons to submit Device-TLB invalidation requests:
>     *VT-d initialization.
>     *Reassign device ownership.
>     *Memory modification.
>
> 6.1 *VT-d initialization
> When VT-d is initializing, there is no guest domain running, so there is no
> memory security issue.
> iotlb(iotlb/device-tlb)
> |-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
>                                                |--init_vtd_hw()
> 6.2 *Reassign device ownership
> Reassigning device ownership is invoked by 2 hypercalls: do_physdev_op() and
> arch_do_domctl().
> While the *qi_flag is Set, the domain is not allowed to enter guest mode. If
> the appropriate invalidations have not yet been performed, the *qi_flag is
> still Set, and these devices are not ready for guest domains to launch DMA
> with them. So with the *qi_flag introduced, there is no memory security
> issue.
>
> iotlb(iotlb/device-tlb)
> |-iommu_flush_iotlb_dsi()
>    |--domain_context_mapping_one() ...
>    |--domain_context_unmap_one() ...
>
> |-iommu_flush_iotlb_psi()
>    |--domain_context_mapping_one() ...
>    |--domain_context_unmap_one() ...
>
> 6.3 *Memory modification.
> When memory is modified, there are a lot of invoker flows for updating EPT,
> but not all of them will update IOMMU page tables. IOMMU page tables are
> updated only when all of the following three conditions are met:
>   * P2M is hostp2m. ( p2m_is_hostp2m(p2m) )
>   * Previous mfn is not equal to new mfn. (prev_mfn != new_mfn)
>   * This domain needs IOMMU. (need_iommu(d))
>
> ##
> |--iommu_pte_flush()--ept_set_entry()
>
> #PoD (populate on demand) is not supported while IOMMU passthrough is
> enabled. So ignore the PoD invoker flows below.
>   |--p2m_pod_zero_check_superpage() ...
>   |--p2m_pod_zero_check() ...
>   |--p2m_pod_demand_populate() ...
>   |--p2m_pod_decrease_reservation() ...
>   |--guest_physmap_mark_populate_on_demand() ...
>
> #Xen paging is not supported while IOMMU passthrough is enabled. So ignore
> the Xen paging invoker flows below.
>   |--p2m_mem_paging_evict() ...
>   |--p2m_mem_paging_resume()...
>   |--p2m_mem_paging_prep()...
>   |--p2m_mem_paging_populate()...
>   |--p2m_mem_paging_nominate()...
>   |--p2m_alloc_table()--shadow_enable()--paging_enable()--shadow_domctl()
>      --paging_domctl()--arch_do_domctl()--do_domctl()
>      |--paging_domctl_continuation()
>
> #Xen sharing is not supported while IOMMU passthrough is enabled. So ignore
> the Xen sharing invoker flow below.
>   |--set_shared_p2m_entry()...
>
> #The domain is paused, so the domain can't launch DMA.
>   |--relinquish_shared_pages()--domain_relinquish_resources( case
>      RELMEM_shared: )--domain_kill()--do_domctl()
>
> #The p2m below is not hostp2m. It is L2 to L0. So ignore the invoker flow
> below.
>   |--nestedhap_fix_p2m()--nestedhvm_hap_nested_page_fault()
>      --hvm_hap_nested_page_fault()--ept_handle_violation()
>      --vmx_vmexit_handler()
>
> #If prev_mfn == new_mfn, IOMMU page tables will not be updated. So ignore
> the invoker flows below.
>   |--p2m_mem_access_check()--hvm_hap_nested_page_fault()
>      --ept_handle_violation()--vmx_vmexit_handler()
>      (L1 --> L0 / but only checks p2m_type_t)
>   |--p2m_set_mem_access() ...
>   |--guest_physmap_mark_populate_on_demand() ...
>   |--p2m_change_type_one() ...
>
> #The previous page is not put and allocated for Xen or other guest domains.
> So there is no memory security issue. Ignore the invoker flows below.
>   |--p2m_remove_page()--guest_physmap_remove_page() ...
>   |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
>   |--map_mmio_regions()--do_domctl()
>
> #Hold the pages which are removed in guest_remove_page(), and put them in
> the QI interrupt handler when there are no in-flight Device-TLB invalidation
> requests.
>   |--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
>   |--xenmem_add_to_physmap_one()--xenmem_add_to_physmap()
>      / xenmem_add_to_physmap_batch() .. --do_memory_op()
>   |--p2m_add_foreign()--xenmem_add_to_physmap_one() ..--do_memory_op()
>   |--guest_physmap_add_entry()--create_grant_p2m_mapping() ...
>      --do_grant_table_op()
>
> ((
>   Further explanation:
>   The previous pages may be mapped from the Xen heap for guest domains in
>   decrease_reservation() / xenmem_add_to_physmap_one() / p2m_add_foreign(),
>   but they are not mapped in the IOMMU table. The following 4 cases map Xen
>   heap pages for guest domains:
>     * shared page for Xen Oprofile.
>     * vLAPIC mapping.
>     * grant table shared page.
>     * domain share_info page.
> ))
>
> #For grant_unmap*, ignore it at this point, as we can hold the page when
> the domain frees a Xen ballooned page.
>   |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref()
>      --gnttab_unmap_grant_ref()--do_grant_table_op()
>   |--__gnttab_unmap_and_replace()--gnttab_unmap_and_replace()
>      --do_grant_table_op()
>
> #For grant_map*, ignore it as there is no pfn<--->mfn in Device-TLB.
>
> #For grant_transfer:
>   |--p2m_remove_page()--guest_physmap_remove_page()
>   |--gnttab_transfer() ... --do_grant_table_op()
>
>   If the Device-TLB flush has still not completed when the transferring page
>   is to be mapped to a remote domain, schedule and wait on a waitqueue until
>   the Device-TLB flush is completed.
>
>   Plan B:
>   ((If the Device-TLB flush has still not completed before adding the
>   transferring page to the target domain, allocate a new page for the target
>   domain and hold the old transferring page, which will be put in the QI
>   interrupt handler when there are no in-flight Device-TLB invalidation
>   requests.))
>
>
> Quan Xu (13):
>   vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
>   vt-d: Register MSI for async invalidation completion interrupt.
>   vt-d: Track the Device-TLB invalidation status in an invalidation table.
>   vt-d: Clear invalidation table in invaidation interrupt handler
>   vt-d: Clear the IWC field of Invalidation Event Control Register in
>   vt-d: Introduce a new per-domain flag - qi_flag.
>   vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
>   vt-d: Held on the freed page until the Device-TLB flush is completed.
>   vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
>   vt-d: Held on the removed page until the Device-TLB flush is completed.
>   vt-d: If the Device-TLB flush is still not completed when
>   vt-d: For gnttab_transfer, If the Device-TLB flush is still
>   vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB
>
>  xen/arch/x86/hvm/vmx/entry.S         |  10 ++
>  xen/arch/x86/x86_64/asm-offsets.c    |   1 +
>  xen/common/domain.c                  |  15 ++
>  xen/common/grant_table.c             |  16 ++
>  xen/common/memory.c                  |  16 +-
>  xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
>  xen/drivers/passthrough/vtd/iommu.h  |  18 +++
>  xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
>  xen/include/xen/hvm/iommu.h          |  42 +++++
>  9 files changed, 443 insertions(+), 16 deletions(-)
>
> --
> 1.8.3.2
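For readers following the IF / SW / FN discussion in "More details", section 2, here is a minimal sketch of how such a wait descriptor could be encoded. The names are hypothetical and the bit positions are an assumption based on my reading of the Invalidation Wait Descriptor layout in the VT-d specification (type in bits 3:0, IF bit 4, SW bit 5, FN bit 6, Status Data in bits 63:32 of the low qword, Status Address in bits 63:2 of the high qword); please check them against the spec before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding; verify against VT-d spec section 6.5.2.8. */
#define SKETCH_TYPE_INVAL_WAIT 0x5ull       /* descriptor type, bits 3:0 */
#define SKETCH_IWD_IF          (1ull << 4)  /* Interrupt Flag */
#define SKETCH_IWD_SW          (1ull << 5)  /* Status Write */
#define SKETCH_IWD_FN          (1ull << 6)  /* Fence */

/* Hypothetical 128-bit descriptor as two 64-bit halves. */
struct sketch_wait_dsc {
    uint64_t lo;
    uint64_t hi;
};

/*
 * Build a wait descriptor that both raises the completion interrupt (IF)
 * and writes status_data to status_maddr (SW), fenced against later
 * descriptors (FN). status_maddr must be DWORD aligned.
 */
static struct sketch_wait_dsc sketch_make_wait_dsc(uint32_t status_data,
                                                   uint64_t status_maddr)
{
    struct sketch_wait_dsc dsc;

    assert((status_maddr & 0x3) == 0);

    dsc.lo = SKETCH_TYPE_INVAL_WAIT | SKETCH_IWD_IF | SKETCH_IWD_SW |
             SKETCH_IWD_FN | ((uint64_t)status_data << 32);
    dsc.hi = status_maddr & ~0x3ull;
    return dsc;
}
```

In the design above, status_data would be the per-domain count of in-flight requests and status_maddr the machine address of the per-domain polling slot.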