
[Xen-devel] [PATCH] dpci: Replace tasklet with a softirq (v4)



From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot: we
take it when do_tasklet_work is called (via a
softirq and from the idle loop), and on any other
operation on the tasklet_list.
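
Schematically, the scheduling side looks like this
(a simplified sketch of the pattern in
xen/common/tasklet.c, not the verbatim code):

    /* One system-wide lock guards every tasklet operation. */
    static DEFINE_SPINLOCK(tasklet_lock);
    static LIST_HEAD(tasklet_list);

    void tasklet_schedule(struct tasklet *t)
    {
        unsigned long flags;

        spin_lock_irqsave(&tasklet_lock, flags);   /* global contention */
        list_add_tail(&t->list, &tasklet_list);
        spin_unlock_irqrestore(&tasklet_lock, flags);
        raise_softirq(TASKLET_SOFTIRQ);
    }

and do_tasklet_work takes the same lock again to
dequeue and run each entry.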

The problem we are facing is that quite a lot of
tasklets are scheduled. The most commonly invoked
one is the tasklet injecting VIRQ_TIMER into the
guest. Guests are not insane and don't set their
one-shot or periodic clocks to sub-1ms intervals
(which would cause said tasklet to be scheduled
at such small intervals).

The problem appears when PCI passthrough devices
are used across many sockets and we have a mix of
heavy-interrupt guests and idle guests. The idle
guests end up seeing 1/10 of their RUNNING
timeslice eaten by the hypervisor (and 40% steal
time).

We inject PCI interrupts via hvm_do_IRQ_dpci,
which schedules the hvm_dirq_assist tasklet every
time an interrupt is received.
The call chain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on, schedule_tail is invoked, which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in the idle
loop and be woken to deliver a VIRQ_TIMER, which
also ends up taking the lock twice: first to
schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet
(accounted to the guest's BLOCKED_state); then to
execute it - which is accounted to the guest's
RUNTIME_state.
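
Schematically (a sketch in the style of the chains
above, not verbatim code):

[idle CPU, delivering VIRQ_TIMER]
tasklet_schedule(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet)
    [takes the lock to put the tasklet on]
...
do_softirq
 -> do_tasklet_work
    [takes the lock again to run it]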

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with
interrupts and the other sockets have idle guests,
the idle guests end up with around 40% steal time
and 1/10 of their timeslice (3ms out of 30ms) tied
up taking the lock. The latency of PCI interrupts
delivered to the guest also suffers.

With this patch the problem disappears completely:
we remove the lock from the PCI passthrough
use-case (the 'hvm_dirq_assist' case) by not using
tasklets at all.

The patch is simple - instead of scheduling a
tasklet we raise our own softirq, HVM_DPCI_SOFTIRQ,
which takes care of running 'hvm_dirq_assist'. The
information we need on each CPU is which 'struct
domain' 'hvm_dirq_assist' needs to run for. That is
simply solved by threading the 'struct domain'
through a per-CPU linked list. The rule of running
at most one 'hvm_dirq_assist' per 'struct domain'
is also preserved by having 'schedule_dpci_for'
ignore any subsequent calls for a domain that has
already been scheduled, as sketched below.
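
In essence (a sketch only; the real
'schedule_dpci_for' is in the diff below and
additionally disables interrupts around the list
manipulation):

    /* Only the first caller queues the domain; later calls are
     * no-ops until dpci_softirq has taken it off the per-CPU list. */
    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
    {
        get_knownalive_domain(d);      /* put by dpci_softirq */
        list_add_tail(&d->list, &this_cpu(dpci_list));
        raise_softirq(HVM_DPCI_SOFTIRQ);
    }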

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
---
v2: On top of ref-cnts, also have a wait loop for the outstanding
    'struct domain's that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu/this_cpu logic
---
 xen/arch/x86/domain.c         |    4 +-
 xen/drivers/passthrough/io.c  |  168 ++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/pci.c |   27 ++++++--
 xen/include/xen/hvm/irq.h     |    2 +-
 xen/include/xen/pci.h         |    2 +-
 xen/include/xen/sched.h       |    6 ++
 xen/include/xen/softirq.h     |    3 +
 7 files changed, 194 insertions(+), 18 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index f7e0e78..f14c611 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1921,7 +1921,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ef75b94..284640d 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,123 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long _d);
+static void hvm_dirq_assist(struct domain *d);
+
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These fancy bit manipulations (bit 0 and bit 1) along with using a lock
+ * operation allow us to have four stages in the dpci life-time.
+ *   a) 0x0: Completely empty (not scheduled nor running).
+ *   b) 0x1: Scheduled but not running. Used as a guard in 'schedule_dpci_for'
+ *      so that we will only schedule one. If it is scheduled and has never
+ *      run (hence never clearing the STATE_SCHED bit), dpci_kill will spin
+ *      forever on said domain. However 'schedule_dpci_for' raises the
+ *      softirq so it will run, albeit there might be a race (dpci_kill
+ *      spinning until the softirq handler runs).
+ *   c) 0x2: it is running (only on one CPU) and can be scheduled on any
+ *      CPU. Bit 0 - scheduled - is cleared at this stage, allowing
+ *      'schedule_dpci_for' to successfully schedule.
+ *   d) 0x3: scheduled and running - only possible if the running hvm_dirq_assist
+ *      calls 'schedule_dpci_for'. It does not do that, but it could in the
+ *      future.
+ *
+ *  The two bits play a vital role in assuring that 'hvm_dirq_assist' is
+ *  scheduled once and runs only once for a domain. The steps are:
+ *
+ *   1) schedule_dpci_for: STATE_SCHED bit set (0x1), domain added to the per-CPU list.
+ *   2) dpci_softirq picks one domain from the list. It schedules itself
+ *      later if there are more domains on it. It tries to set the STATE_RUN
+ *      bit (0x3) - if that fails it adds the domain back to the per-CPU
+ *      list. If it succeeds it clears the STATE_SCHED bit (0x2). Once
+ *      hvm_dirq_assist for a domain completes, it unsets STATE_RUN (0x0, or
+ *      0x1 if 'schedule_dpci_for' is called from 'hvm_dirq_assist').
+ */
+enum {
+    STATE_SCHED, /* Bit 0 */
+    STATE_RUN
+};
+
+static void schedule_dpci_for(struct domain *d)
+{
+    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        unsigned long flags;
+        struct list_head *list;
+        int cpu = smp_processor_id();
+
+        INIT_LIST_HEAD(&d->list);
+
+        get_knownalive_domain(d); /* To be put by 'dpci_softirq' */
+
+        local_irq_save(flags);
+        list = &per_cpu(dpci_list, cpu);
+        list_add_tail(&d->list, list);
+        local_irq_restore(flags);
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
+int dpci_kill(struct domain *d)
+{
+    if ( test_and_set_bit(STATE_SCHED, &d->state) )
+        return -EAGAIN;
+    if ( test_bit(STATE_RUN, &d->state) )
+    {
+        clear_bit(STATE_SCHED, &d->state); /* We set it above; undo it. */
+        return -EAGAIN;
+    }
+    clear_bit(STATE_SCHED, &d->state);
+    return 0;
+}
+
+static void dpci_softirq(void)
+{
+    struct domain *d;
+    int cpu = smp_processor_id();
+    struct list_head *list;
+    struct list_head our_list;
+
+    INIT_LIST_HEAD(&our_list);
+
+    local_irq_disable();
+    list = &per_cpu(dpci_list, cpu);
+    list_splice_init(list, &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        d = list_entry(our_list.next, struct domain, list);
+        list_del(&d->list);
+
+        if ( !test_and_set_bit(STATE_RUN, &d->state) )
+        {
+            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
+                BUG();
+            hvm_dirq_assist(d);
+            clear_bit(STATE_RUN, &d->state);
+            put_domain(d);
+            continue;
+        }
+
+        INIT_LIST_HEAD(&d->list);
+
+        local_irq_disable();
+        list = &per_cpu(dpci_list, cpu);
+        list_add_tail(&d->list, list);
+        local_irq_enable();
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -114,9 +223,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
+
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -459,7 +566,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    schedule_dpci_for(d);
     return 1;
 }
 
@@ -568,10 +675,8 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     return 0;
 }
 
-static void hvm_dirq_assist(unsigned long _d)
+static void hvm_dirq_assist(struct domain *d)
 {
-    struct domain *d = (struct domain *)_d;
-
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
@@ -631,3 +736,48 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is dead
+         * and there are no pending softirqs for us to handle, so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..655c97c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -770,37 +770,50 @@ static int pci_clean_dpci_irq(struct domain *d,
     return 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ESRCH;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
+        int ret = dpci_kill(d);
 
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -EAGAIN )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
+
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +824,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..ccd2966 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -109,6 +108,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+int dpci_kill(struct domain *d);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..7984263 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -456,6 +456,12 @@ struct domain
     /* vNUMA topology accesses are protected by rwlock. */
     rwlock_t vnuma_rwlock;
     struct vnuma_info *vnuma;
+
+#ifdef HAS_PASSTHROUGH
+    /* For HVM_DPCI_SOFTIRQ. We use refcnt when scheduled to run. */
+    struct list_head list;
+    unsigned long state;
+#endif
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..64ae25f 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,9 @@ enum {
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ,
+#ifdef HAS_PASSTHROUGH
+    HVM_DPCI_SOFTIRQ,
+#endif
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.7.7.6


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel