AMD EPYC VM to VM performance investigation
Hello,

We have a customer, and multiple users on our forum, reporting VM to VM
networking performance that seems quite low relative to the general
performance of the machines on AMD EPYC (Zen) hosts. Below you'll find a
write-up of what we have looked at so far and what is still on our TODO
list, but in the meantime we would like to ask here for feedback,
suggestions and possible leads.

To sum up, VM to VM performance on Zen-generation server CPUs seems quite
low, and only scales minimally when adding threads. These machines are
outperformed by a 10-year-old AMD desktop CPU and a pretty low-frequency
Xeon Bronze from 2014. CPU usage does not seem to be the limiting factor,
as neither the VM threads nor the kthreads on the host appear to reach
100% CPU usage.

As we're Vates, I'm talking about XCP-ng here, so Xen 4.13.5 and a 4.19
dom0 kernel. I did try Xen 4.18-rc2 and kernel 6.1.56 on a Zen 4 EPYC,
but as the machine was borrowed from a colleague I was unsure of its
setup, so although the results were actually worse than on my other test
setups, I would not consider that a complete validation that the issue is
also present on recent Xen versions.

1. Has anybody else noticed a similar behavior?
2. Has anybody done any kind of investigation into it besides us?
3. Any insights and suggestions of other points to look at would be
   welcome :)

And now the lengthy part about what we tested; I tried to make it shorter
and more legible than a full report…

Investigated
------------

- Bench various CPUs with iperf2 (iperf3 is not actually multithreaded):
  - AMD FX-8320E, Xeon 3106: not impacted.
  - EPYC 7451, 7443, 7302P, 7313P, 9124: impacted, but the Zen 4 one
    scales a bit more than Zen 1, 2 and 3.
  - Ryzen 5950X, Ryzen 7600: performance is likely lower than it should
    be, but still way better than on the EPYCs, and scales nicely with
    more threads.
- Bench with tinymembench[1]: performance was as expected and didn't show
  the rep movsb issues discussed in this article[2] and bug report[3].
  That makes sense, as that problem looks related to ERMS support, which
  is not present on Zen 1 and 2, where the issue was raised.
- Bench skb allocation with a small kernel module measuring cycles: the
  cycle counts were actually the same or lower on EPYC than on the
  higher-frequency Xeon, so allocation can be considered faster there and
  is likely not related to our issue (a sketch of this kind of module is
  appended after the TODO section).
- mitigations: we tried disabling what can be disabled through boot
  parameters, for Xen, dom0 and guests, but this made no difference.
- disabling AVX: Zen CPUs before Zen 4 are known to limit boost and CPU
  scaling when doing heavy AVX load on one core. There was no reason to
  think this was related, but it was a quick test and, as expected, it
  had no effect.
- localhost iperf bench on dom0 and guests: we noticed that on other
  machines, host and guest results with 1 thread are almost 1:1, while
  with 4 threads guests generally do not scale quite as well as the host.
  On EPYC machines, host tests were significantly slower than guest tests
  with both 1 and 4 threads; a first pass at profiling hasn't found a
  cause yet. More in the Profiling and TODO sections.
- cpu load: top/htop/xentop all seem to indicate that the machines are
  not under full load. VIF queue allocation on dom0 is left at the
  default (1 queue per vCPU); all queues seem to be in use when traffic
  is running, but each stays below 100% of a core/thread.
- pinning: manually pinning dom0 and guests to the same node, and
  avoiding sharing CPU "threads" between host and guests, gives a minimal
  increase of a few percent, but nothing drastic.
  Note: we do not know the CCD/CCX/node mapping on these CPUs, so we are
  not sure all memory accesses are "local".
- sched weight: playing with scheduler weights to prioritize dom0 did not
  make a difference either, which makes sense as the systems are not
  under full load.
- cpu scaling: it is unlikely to be the core of the issue, but CPU
  frequency scaling indeed does not take advantage of boost, never going
  above the base clock of these CPUs. It also seems that fewer cores than
  the number of working kthreads/vCPUs reach base clock; this may be
  normal given the system is not fully loaded, still to be determined.
- QUESTION: is the powernow support in Xen's cpufreq implementation
  sufficient for Zen CPUs? Recent kernels/distributions use acpi-cpufreq
  and can use amd_pstate or even amd_pstate_epp. More concerning than
  turbo boost could be the handling of the package power limits used on
  Zen CPUs, which could prevent even all cores from reaching base clock;
  to be checked…

Profiling
---------

We profiled iperf on dom0 and guests on EPYC, older AMD desktop, and Xeon
machines and gathered profiling traces, but analysis is still ongoing.

- localhost: client and server were profiled on both dom0 and guest runs
  for a Xeon, an old FX and a Zen platform, to analyze the discrepancy
  shown by the earlier localhost tests. The traces show we spend a larger
  chunk of time in the copyout()/copyin() functions on EPYC and FX. This
  is likely related to the use of copy_user_generic_string() on the EPYC
  (Zen 1) and the old FX, whereas the Xeon uses
  copy_user_enhanced_fast_string(), as it has ERMS support (two small
  snippets to check for ERMS and to compare rep movsb throughput are
  appended after the TODO section). But on the same machine, guests are
  going way faster, and the implementation of copy_user_generic_string()
  is the same between dom0 and guests, so this is likely related to other
  differences in kernel and userland, and not only to these functions.
  Therefore it likely isn't directly linked to the issue.
- vm to vm: server, client & dom0 -> profiling traces still to be
  analyzed.

TODO
----

- More analysis of the profiling traces in the VM to VM case.
- x2APIC (not enabled on the machines and setups we are using).
- Profiling at the Xen level / hypercalls.
- Tests on a clean install of a newer Xen version.
- Dig some more into CPU scaling; likely not the root of the problem, but
  there could be some gains to make.
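For reference, below are a few illustrative snippets for the checks
mentioned above. They are simplified sketches rather than the exact code
we used, and the file and symbol names are made up for this mail. First,
whether a CPU advertises ERMS (the feature that makes the kernel pick
copy_user_enhanced_fast_string()) can be read from CPUID leaf 7,
sub-leaf 0, EBX bit 9; the "erms" flag in /proc/cpuinfo carries the same
information.

    /* erms_check.c - report whether the CPU advertises ERMS.
     * Illustrative only; equivalent to grepping "erms" in /proc/cpuinfo.
     * Build: gcc -O2 -o erms_check erms_check.c
     */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 7, sub-leaf 0: structured extended feature flags. */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 7 not supported\n");
            return 1;
        }

        /* ERMS (Enhanced REP MOVSB/STOSB) is EBX bit 9. */
        printf("ERMS: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
        return 0;
    }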
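Along the same lines, a rough userland comparison of rep movsb against
memcpy() throughput can be done as below. This is only a quick sketch
(x86-64 only, single-threaded, fixed 64 MiB buffers, no NUMA or affinity
handling), not the benchmark we actually ran, but it is handy for
comparing a dom0 and a guest on the same host.

    /* repmovsb_bench.c - rough rep movsb vs memcpy throughput check.
     * Illustrative only; mostly measures DRAM bandwidth at this size.
     * Build: gcc -O2 -o repmovsb_bench repmovsb_bench.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE   (64UL * 1024 * 1024)  /* 64 MiB per buffer */
    #define ITERATIONS 20

    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        /* rep movsb copies RCX bytes from [RSI] to [RDI]. */
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

    static void copy_memcpy(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);
    }

    static double elapsed_s(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    static void run(const char *name,
                    void (*copy)(void *, const void *, size_t),
                    void *dst, const void *src)
    {
        struct timespec t0, t1;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERATIONS; i++)
            copy(dst, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%-10s %.2f GiB/s\n", name,
               (double)BUF_SIZE * ITERATIONS /
               (1024.0 * 1024 * 1024) / elapsed_s(t0, t1));
    }

    int main(void)
    {
        char *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);

        if (!src || !dst)
            return 1;
        memset(src, 0xa5, BUF_SIZE);  /* fault the pages in */
        memset(dst, 0x5a, BUF_SIZE);

        run("memcpy", copy_memcpy, dst, src);
        run("rep movsb", copy_rep_movsb, dst, src);

        free(src);
        free(dst);
        return 0;
    }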
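Finally, the skb allocation measurement mentioned in the Investigated
list was done with a small module along these lines. This is a simplified
sketch (constants and the module name are invented for this mail), not
the exact module we used; it just times alloc_skb() with get_cycles() and
prints an average to dmesg.

    /* skb_cycles.c - rough measurement of alloc_skb() cost in cycles.
     * Simplified sketch; prints one line to dmesg on load, then the
     * module can be removed with rmmod.
     */
    #include <linux/module.h>
    #include <linux/skbuff.h>
    #include <linux/timex.h>    /* get_cycles() */

    #define ITERATIONS 100000
    #define SKB_SIZE   1500

    static int __init skb_cycles_init(void)
    {
        cycles_t start, total = 0;
        unsigned int i;

        for (i = 0; i < ITERATIONS; i++) {
            struct sk_buff *skb;

            start = get_cycles();
            skb = alloc_skb(SKB_SIZE, GFP_KERNEL);
            total += get_cycles() - start;

            if (!skb)
                return -ENOMEM;
            kfree_skb(skb);
        }

        pr_info("skb_cycles: ~%llu cycles per alloc_skb(%d) over %d runs\n",
                (unsigned long long)(total / ITERATIONS),
                SKB_SIZE, ITERATIONS);
        return 0;
    }

    static void __exit skb_cycles_exit(void)
    {
    }

    module_init(skb_cycles_init);
    module_exit(skb_cycles_exit);
    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Rough alloc_skb() cycle measurement");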
[1] https://github.com/ssvb/tinymembench
[2] https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
[3] https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515

--
David Morel