
Re: AMD EPYC VM to VM performance investigation



On Thu, 4 Jan 2024, David Morel wrote:
> Hello,
> 
> We have a customer and multiple users on our forum reporting VM to VM
> networking performance on AMD EPYC Zen hosts that seems quite low relative
> to the general performance of the machines.

By "VM to VM networking" I take you mean VM-to-VM on the same host using
PV network?


> Below you'll find a write up about what we had a look at and what's in the
> TODO on our side, but in the meantime we would like to ask here for some
> feedback, suggestions and possible leads.
> 
> To sum up, the VM to VM performance on Zen generation server CPUs seems
> quite low, and only scales minimally when adding threads. They are
> outperformed by a 10 year old AMD desktop cpu and a pretty low frequency
> Xeon Bronze from 2014. CPU usage does not seem to be the limiting factor,
> as neither the VM threads nor the kthreads on the host seem to reach 100%
> cpu usage.
> 
> As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a dom0
> kernel 4.19. I did try Xen 4.18-rc2 and kernel 6.1.56 on a Zen4 epyc, but
> as it was borrowed from a colleague I was unsure of the setup, so although
> it was actually worse than on my other test setups, I would not consider
> that a complete validation that the issue is also present on recent Xen
> versions.

I think it might be difficult to triage this if you are working on a
Xen/Linux version that is so different from upstream.


> 1. Has anybody else noticed a similar behavior?
> 2. Has anybody done any kind of investigation about it beside us?
> 3. Any insight and suggestions of other points to look at would be welcome :)
> 
> And now the lengthy part about what we tested; I tried to make it shorter
> and more legible than a full report…
> 
> Investigated
> ------------
> 
> - Bench various cpus with iperf2 (iperf3 is not actually multithreaded):
>   - amd fx8320e, xeon 3106: not impacted.
>   - epyc 7451, 7443, 7302p, 7313p, 9124: impacted, but the zen4 one scales
>     a bit more than zen 1, 2 and 3.
>   - ryzen 5950x, ryzen 7600: performance is likely lower than it should
>     be, but still way better than the epycs, and it scales nicely with
>     more threads.
> - Bench with tinymembench[1]: performance was as expected and didn't show
>   issues with rep movsb as discussed in this article[2] and issue[3].
>   Which makes sense, as that issue looks to be related to ERMS support,
>   which is not present on Zen 1 and 2 where it was raised.
> - Bench skb allocation with a small kernel module measuring cycles: the
>   cost is actually the same or lower on epyc than on the higher-frequency
>   xeon, so it can be considered faster and is likely not related to our
>   issue.
> - mitigations: we tried disabling what can be disabled through boot
>   parameters, for xen, dom0 and guests alike, but this made no difference.
> - disabling AVX: Zen cpus before Zen4 are known to limit boost and cpu
>   scaling when doing heavy AVX load on one core; there was no reason to
>   think this was related, but it was a quick test and, as expected, it had
>   no effect.
> - localhost iperf bench on dom0 and guests: we noticed that on other
>   machines host and guest results with 1 thread are almost 1:1, while with
>   4 threads guests generally do not scale as well. On epyc machines, host
>   tests were significantly slower than guests with both 1 and 4 threads;
>   a first pass at profiling has not found a cause yet. More in the
>   profiling section and TODO.

Wait, are you saying that the localhost iperf benchmark is faster in a
VM compared to the host ("host" I take to mean baremetal Linux without a
hypervisor)?  Maybe you meant the other way around?


> - cpu load: top/htop/xentop all seem to indicate that the machines are
>   not under full load; queue allocations on dom0 for the VIFs are at the
>   default (1 per vcpu) and all queues seem to be in use when traffic is
>   running, but at below 100% usage per core/thread.
> - pinning: manually pinning dom0 and guests to the same node and avoiding
>   sharing cpu "threads" between host and guests gives a minimal increase
>   of a few percent, but nothing drastic. Note, we do not know the
>   ccd/ccx/node mapping on these cpus, so we are not sure all memory
>   accesses are "local".
> - sched weight: playing with sched weight to prioritize dom0 did not make
>   a difference either, which makes sense as the systems are not under full
>   load.
> - cpu scaling: it is unlikely to be the core of the issue, but cpu scaling
>   indeed does not take advantage of boost, never going above the base
>   clock of these cpus. It also seems that fewer cores than the number of
>   working kthreads/vcpus are reaching base clock; this may be normal given
>   that the system is not fully loaded, to be determined.
>   - QUESTION: is the powernow support in the xen cpufreq implementation
>     sufficient for zen cpus? Recent kernels/distributions use acpi_cpufreq
>     and can use amd_pstate or even amd_pstate_epp. More concerning than
>     the turbo boost could be the handling of the package power limit used
>     in Zen CPUs, which could prevent even all cores from reaching base
>     clock, to be checked…
> 
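
Going back to the skb allocation measurement a few bullets up: what exactly
was timed there? For comparing notes, below is a minimal sketch of the kind
of module I would expect for that measurement (buffer size, GFP flags and
iteration count are arbitrary choices on my part, not taken from your test):

/*
 * Minimal sketch: time alloc_skb()/kfree_skb() with get_cycles().
 * The 2048-byte size and 100000 iterations are arbitrary.
 */
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/timex.h>	/* get_cycles() */

static int __init skb_cycles_init(void)
{
	const int iters = 100000;
	cycles_t start, end;
	int i;

	start = get_cycles();
	for (i = 0; i < iters; i++) {
		struct sk_buff *skb = alloc_skb(2048, GFP_KERNEL);

		if (!skb)
			return -ENOMEM;
		kfree_skb(skb);
	}
	end = get_cycles();

	pr_info("skb alloc+free: ~%llu cycles/iter\n",
		(unsigned long long)(end - start) / iters);
	return 0;
}

static void __exit skb_cycles_exit(void) { }

module_init(skb_cycles_init);
module_exit(skb_cycles_exit);
MODULE_LICENSE("GPL");

If your numbers came from something of that shape, then I agree the
allocation path itself does not look like the bottleneck.
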
> Profiling
> ---------
> 
> We profiled iperf on dom0 and guests on epyc, older amd desktop, and xeon
> machines and gathered profiling traces, but analysis is still ongoing.
> 
> - localhost:
> Client and server were profiled on both dom0 and guest runs for a xeon,
> an old FX and a zen platform, to analyze the discrepancy shown by the
> localhost tests earlier. It shows we spend a larger chunk of time in the
> copyout() or copyin() functions on epyc and fx. This is likely related to
> the use of copy_user_generic_string() on epyc (zen1) and the old FX,
> whereas the xeon uses copy_user_enhanced_fast_string(), as it has ERMS
> support. But on the same machine, guests are going way faster, and the
> implementation of copy_user_generic_string() is the same between dom0 and
> the guests, so this is likely related to other changes in kernel and
> userland, and not only to these functions. Therefore it likely isn't
> directly linked to the issue.
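
For what it's worth, the difference between those two routines is
essentially whether the bulk of the copy is a single rep movsb instruction
(which, as far as I recall, is what copy_user_enhanced_fast_string() boils
down to on ERMS-capable parts like that xeon) or the rep movsq based string
variant with a byte tail. A rough userspace illustration of the rep movsb
form, as a sketch only (the real kernel routines also deal with faults,
alignment and tail handling):

/* Userspace illustration of a "rep movsb" bulk copy.  Sketch only. */
#include <stdio.h>
#include <string.h>

static void *copy_rep_movsb(void *dst, const void *src, unsigned long len)
{
	void *ret = dst;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     :
		     : "memory");
	return ret;
}

int main(void)
{
	char src[4096], dst[4096];

	memset(src, 0xab, sizeof(src));
	copy_rep_movsb(dst, src, sizeof(dst));
	printf("copied %zu bytes, dst[0]=0x%02x\n", sizeof(dst),
	       (unsigned char)dst[0]);
	return 0;
}
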
> 
> - vm to vm: server, client & dom0 -> profiling traces to be analysed.
> 
> TODO
> ----
> 
> - More analysis of profiling traces in the VM to VM case
> - X2APIC (not enabled on the machines and setup we are using)
> - Profiling at xen level / hypercalls
> - Tests on a clean install of a newer Xen version
> - Dig some more into cpu scaling; likely not the root of the problem, but
>   there could be some gains to be made.
> 
> [1] https://github.com/ssvb/tinymembench
> [2] https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
> [3] https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
> 
> -- 
> David Morel
> 

 

