Re: Issue: Networking performance in Xen VM on Arm64
Hi Stefano and all,

On Mon, Oct 17, 2022 at 04:50:05PM -0700, Stefano Stabellini wrote:

[...]

> > We can see DomU sends a notification with timestamp (raw counter)
> > 4989078592 and Dom0 receives the interrupt with timestamp 4989092169.
> > Since Dom0 and DomU use the same time counter and the counter
> > frequency is 25MHz, we can get the delta value (in microseconds):
> >
> >   (4989092169 - 4989078592) / 25000000 * 1000 * 1000
> >   = 543us
> >
> > Which means it takes 543us for Dom0 to receive the notification.
> > You can see DomU runs on CPU3 and Dom0 runs on CPU13, so there should
> > be no contention for CPU resources. It seems to me the Xen hypervisor
> > likely takes a long time to deliver the interrupt; note that it does
> > not take that long for every skb transfer, sometimes the time to
> > respond to a notification is short (about ~10us).
>
> Good find. I think this is worth investigating further. Do you have
> vwfi=native in your Xen command line as well?
>
> After that, I would add printk also in Xen with the timestamp. The event
> channel notification code path is the following:
>
> # domU side
> xen/arch/arm/vgic-v2.c:vgic_v2_to_sgi
> xen/arch/arm/vgic.c:vgic_to_sgi
> xen/arch/arm/vgic.c:vgic_inject_irq
> xen/arch/arm/vgic.c:vcpu_kick
> xen/arch/arm/gic-v2.c:gicv2_send_SGI
>
> # dom0 side
> xen/arch/arm/gic.c:do_sgi
> xen/arch/arm/traps.c:leave_hypervisor_to_guest
>
> It would be good to understand why sometimes it takes ~10us and some
> other times it takes ~540us

Here is an update on why it takes several hundred microseconds for the
Xen backend driver to respond to the interrupt. The short answer is that
the vCPU running the Xen backend driver needs to switch context, even
though I have set the options "sched=null vwfi=native" on the Xen
command line. Please see the detailed logs below for how this happens.

Let's take the timestamp 3842008681 as the start point; this is when the
Xen frontend driver sends out the notification (xennet_notify_tx_irq).
At timestamp 3842008885 the Xen hypervisor injects the interrupt (about
~8us after the start point). Then at timestamp 3842008935 it invokes
vcpu_kick() to kick the virtual CPU that runs the Xen backend driver;
you can see VCPU_PROCESSOR is 11 and VCPU_ID is 9 for dom0, and the
duration from the start point is 10.16us. The key point is that at this
moment the vCPU's is_running flag is 0, which is different from the
low-latency case, where the vCPU's is_running is 1. IIUC, the Xen
hypervisor needs time to restore the vCPU's context, and thus we can see
that virtual CPU 9 in Dom0 only starts to run at timestamp 3842016505.
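(For reference, the raw counter deltas above and below are converted to
microseconds as in this small standalone sketch, assuming the 25MHz
counter frequency mentioned above; the helper is only for illustration:)

  #include <stdint.h>
  #include <stdio.h>

  /* Arm system counter frequency on this platform: 25MHz. */
  #define CNT_FREQ_HZ 25000000ULL

  /* Convert a raw counter delta into microseconds. */
  static double tsc_delta_us(uint64_t start, uint64_t end)
  {
          return (double)(end - start) * 1000000.0 / CNT_FREQ_HZ;
  }

  int main(void)
  {
          /* DomU notify -> Dom0 interrupt (quoted above): 543us */
          printf("%.2f us\n", tsc_delta_us(4989078592ULL, 4989092169ULL));
          /* notify -> vgic_inject_irq below: ~8us */
          printf("%.2f us\n", tsc_delta_us(3842008681ULL, 3842008885ULL));
          /* notify -> vcpu_kick below: 10.16us */
          printf("%.2f us\n", tsc_delta_us(3842008681ULL, 3842008935ULL));
          /* notify -> xenvif_tx_interrupt below: ~313us */
          printf("%.2f us\n", tsc_delta_us(3842008681ULL, 3842016505ULL));
          return 0;
  }

The detailed trace log follows: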
3842008548 pub-310 [001] 67.352980: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842008548
3842008652 pub-310 [001] 67.352984: bprint: xennet_tx_setup_grant: id=52 ref=820 offset=2 len=1514 TSC: 3842008652
3842008681 pub-310 [001] 67.352985: bprint: xennet_start_xmit: xennet_notify_tx_irq: TSC: 3842008681
3842008689 (XEN) leave_hypervisor_to_guest: CPU_ID: 0 TSC: 3842008689
3842008766 (XEN) EVTCHNOP_send: CPU_ID: 2 TSC: 3842008766
3842008885 (XEN) vgic_inject_irq: CPU_ID: 2 TSC: 3842008885
3842008929 (XEN) leave_hypervisor_to_guest: CPU_ID: 14 TSC: 3842008929
3842008935 (XEN) vcpu_kick: VCPU_PROCESSOR: 11 VCPU_ID: 9 is_running 0 TSC: 3842008935
3842009049 (XEN) leave_hypervisor_to_guest: CPU_ID: 2 TSC: 3842009049
3842009322 pub-310 [001] 67.353011: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842009322
3842009374 pub-310 [001] 67.353013: bprint: xennet_tx_setup_grant: id=12 ref=780 offset=2050 len=1514 TSC: 3842009374
3842009584 pub-310 [001] 67.353021: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842009584
3842009625 (XEN) leave_hypervisor_to_guest: CPU_ID: 15 TSC: 3842009625
3842009633 pub-310 [001] 67.353023: bprint: xennet_tx_setup_grant: id=83 ref=851 offset=2 len=1514 TSC: 3842009633
3842009853 pub-310 [001] 67.353032: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842009853
3842009899 pub-310 [001] 67.353034: bprint: xennet_tx_setup_grant: id=5 ref=773 offset=2050 len=1514 TSC: 3842009899
3842010080 pub-310 [001] 67.353041: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842010080
3842010121 pub-310 [001] 67.353043: bprint: xennet_tx_setup_grant: id=85 ref=853 offset=2 len=1514 TSC: 3842010121
3842010316 pub-310 [001] 67.353050: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842010316
3842010359 pub-310 [001] 67.353052: bprint: xennet_tx_setup_grant: id=9 ref=777 offset=2050 len=1514 TSC: 3842010359
3842010553 pub-310 [001] 67.353060: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842010553
3842010599 pub-310 [001] 67.353062: bprint: xennet_tx_setup_grant: id=35 ref=803 offset=2 len=1514 TSC: 3842010599
3842010792 pub-310 [001] 67.353069: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842010792
3842010838 pub-310 [001] 67.353071: bprint: xennet_tx_setup_grant: id=17 ref=785 offset=2 len=1514 TSC: 3842010838
3842011032 pub-310 [001] 67.353079: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842011032
3842011072 pub-310 [001] 67.353081: bprint: xennet_tx_setup_grant: id=18 ref=786 offset=2 len=1514 TSC: 3842011072
3842011124 (XEN) leave_hypervisor_to_guest: CPU_ID: 12 TSC: 3842011124
3842011265 pub-310 [001] 67.353088: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842011265
3842011309 pub-310 [001] 67.353090: bprint: xennet_tx_setup_grant: id=86 ref=854 offset=3074 len=606 TSC: 3842011309
3842012807 (XEN) leave_hypervisor_to_guest: CPU_ID: 6 TSC: 3842012807
3842013038 (XEN) leave_hypervisor_to_guest: CPU_ID: 10 TSC: 3842013038
3842013568 (XEN) vgic_inject_irq: CPU_ID: 23 TSC: 3842013568
3842013639 (XEN) vcpu_kick: VCPU_PROCESSOR: 23 VCPU_ID: 10 is_running 1 TSC: 3842013639
3842013769 (XEN) leave_hypervisor_to_guest: CPU_ID: 4 TSC: 3842013769
3842013955 (XEN) leave_hypervisor_to_guest: CPU_ID: 8 TSC: 3842013955
3842014327 (XEN) leave_hypervisor_to_guest: CPU_ID: 5 TSC: 3842014327
3842014381 pub-310 [001] 67.353213: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842014381
3842014433 pub-310 [001] 67.353215: bprint: xennet_tx_setup_grant: id=89 ref=857 offset=2050 len=1514 TSC: 3842014433
3842014621 pub-310 [001] 67.353223: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842014621
3842014663 pub-310 [001] 67.353224: bprint: xennet_tx_setup_grant: id=38 ref=806 offset=2 len=1514 TSC: 3842014663
3842014852 pub-310 [001] 67.353232: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 3842014852
3842014909 pub-310 [001] 67.353234: bprint: xennet_tx_setup_grant: id=4 ref=772 offset=2 len=1422 TSC: 3842014909
3842015153 (XEN) leave_hypervisor_to_guest: CPU_ID: 1 TSC: 3842015153
3842015345 (XEN) leave_hypervisor_to_guest: CPU_ID: 3 TSC: 3842015345
3842015823 (XEN) leave_hypervisor_to_guest: CPU_ID: 13 TSC: 3842015823
3842016505 <idle>-0 [009] 95.098859: bprint: xenvif_tx_interrupt: xenvif_tx_interrupt: TSC: 3842016505
3842016635 <idle>-0 [009] 95.098863: bprint: xenvif_tx_build_gops.constprop.0: id=52 ref=820 offset=2 len=1514 TSC: 3842016635
3842016648 (XEN) leave_hypervisor_to_guest: CPU_ID: 9 TSC: 3842016648
3842016671 <idle>-0 [009] 95.098865: bprint: xenvif_tx_build_gops.constprop.0: id=12 ref=780 offset=2050 len=1514 TSC: 3842016671
3842016700 <idle>-0 [009] 95.098866: bprint: xenvif_tx_build_gops.constprop.0: id=83 ref=851 offset=2 len=1514 TSC: 3842016700
3842016725 <idle>-0 [009] 95.098867: bprint: xenvif_tx_build_gops.constprop.0: id=5 ref=773 offset=2050 len=1514 TSC: 3842016725

I did two more experiments. In the first one, I appended the option
"dom0_max_vcpus=16" to the Xen command line; the purpose is to allocate
16 virtual CPUs for dom0 and 4 virtual CPUs for domU. My system has 32
physical CPUs, so in theory the physical CPUs are sufficient for dom0
and domU, but I didn't see any performance improvement with
"dom0_max_vcpus=16". In the second experiment I appended the option
"dom0_vcpus_pin" to the Xen command line, which pins the dom0 virtual
CPUs to physical CPUs so that context switching can be avoided. With
this option the throughput improves from 128.26 Mbits/s to 180.3
Mbits/s, but the result is still far from the ideal throughput
(800+ Mbits/s).

One thing I still need to explore is measuring how long it takes to copy
data via the grant table (based on the log above, the xennet driver sets
up the grant entries and then the xenvif driver fetches the data through
them). I will compare the duration for the two directions (from dom0 to
domU and from domU to dom0); this part might impact the performance
heavily.
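As a rough, untested sketch (not actual xen-netback/xen-netfront code),
I plan to reuse the same trace_printk() + raw counter approach as in the
logs above and wrap the grant-copy call sites with a small helper like
the one below; the tag name and the exact call sites (e.g. around
gnttab_batch_copy() in xen-netback) are placeholders I still need to
confirm:

  /*
   * Sketch of the instrumentation: record the raw counter before and
   * after a grant copy and emit the delta with trace_printk() so it
   * lands in the same ftrace log as the TSC values above. On arm64,
   * get_cycles() reads the generic timer, i.e. the same 25MHz counter.
   */
  #include <linux/kernel.h>
  #include <linux/timex.h>    /* get_cycles(), cycles_t */

  static inline void trace_grant_copy_cost(const char *tag,
                                           cycles_t t0, cycles_t t1)
  {
          trace_printk("%s: grant copy cost: %llu cycles TSC: %llu\n",
                       tag, (unsigned long long)(t1 - t0),
                       (unsigned long long)t1);
  }

  /*
   * Intended usage around the copy, e.g.:
   *
   *      cycles_t t0 = get_cycles();
   *      gnttab_batch_copy(...);    // or the netfront RX copy path
   *      trace_grant_copy_cost("xenvif_tx", t0, get_cycles());
   */

Thanks,
Leo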