Re: AMD EPYC virtual network performances
On Tue, Aug 13, 2024 at 01:16:06PM +0200, Jürgen Groß wrote:
> On 13.08.24 03:10, Elliott Mitchell wrote:
> > On Tue, Jul 09, 2024 at 11:37:07AM +0200, Jürgen Groß wrote:
> > >
> > > In both directories you can see the number of spurious events by
> > > looking into the spurious_events file.
> > >
> > > In the end the question is why so many spurious events are happening.
> > > Finding the reason might be hard, though.
> >
> > Hopefully my comments on this drew your attention, yet the lack of
> > response suggests otherwise. I'm wondering whether this is an APIC
> > misprogramming issue, similar to the x2APIC issue which was causing
> > trouble with recent AMD processors.
> >
> > Trying to go after the Linux software RAID1 issue, my current attempt
> > is "iommu=debug iommu=no-intremap". I'm seeing *lots* of messages from
> > spurious events in `xl dmesg`. So many that I have a difficult time
> > believing they are related to hardware I/O.
>
> Seeing them in `xl dmesg` means those spurious events are seen by the
> hypervisor, not by the Linux kernel.

Indeed. Yet this seems to be pointing at a problem, whereas most other
information sources merely indicate that there is a problem.

I'm unable to resolve those to hardware. This could mean they are being
synthesized by software and get reinterpreted as hardware events when
crossing some interface. Or it could mean they are genuine hardware
events, but somewhere inside Xen the information is corrupted, so what is
displayed is unrelated to the original event (x2APIC misinterpretation?).

> > In which case, could the performance problem observed by Andrei Semenov
> > be due to misprogramming of the [x2]APIC triggering spurious events?
>
> I don't see a connection here, as spurious interrupts (as seen by the
> hypervisor in your case) and spurious events (as seen by Andrei) are
> completely different (hardware vs. software level).

The entries appear at an average rate of about 1/hour. It could be that
most events are being dropped and 10x that number are occurring. If so,
those extras could be turning into spurious events seen by various
domains. There is a possibility spurious interrupts are being turned into
spurious events by the back-end drivers.

Jürgen Groß, what is the performance impact of "iommu=debug"? It seems to
mostly cause more reporting and to have minimal/no performance effect.

Andrei Semenov, if you're allowed to release information about your
systems:

What is your mix of guest types (<1% PV, 90% PVH, 9% HVM?)? (A rough way
to count this is appended at the end of this mail.)

Is there any pattern of PIC type and driver on affected/unaffected
systems? I would suspect most/all of your systems to have an x2APIC.
Which of cluster mode, mixed mode, physical mode, or other do they use?
("Using APIC driver x2apic_mixed" => x2APIC, mixed mode)

Assuming Jürgen Groß confirms "iommu=debug" has minimal/no performance
impact: would it be possible to try booting one or more affected systems
with "iommu=debug" on Xen's command line? I'm wondering whether you
observe "No irq handler for vector" lines or spurious-interrupt lines in
Xen's dmesg. (Example commands for the boot change and the checks are
appended at the end of this mail.)

The pattern of events in `xl dmesg` seems to better match spurious events
being sent to/from my single HVM domain, rather than the RAID1 issue. In
particular there are too many distinct interrupt numbers to match the
hardware used for RAID. Whereas there aren't enough distinct interrupt
numbers to account for every single event channel with spurious events
(several PVH domains).
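
In case it helps with the guest-type question: this is only a rough
sketch, assuming `xl list -l` on your toolstack dumps each domain's
configuration as JSON containing a "type" field (the exact field layout
can differ between Xen versions):

    # Count domains per guest type (pv/pvh/hvm) from the JSON dump.
    xl list -l | grep -o '"type": *"[a-z]*"' | sort | uniq -c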
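
For the spurious-interrupt and APIC-mode checks, something along these
lines should do; the grep patterns are only approximations of the
messages, whose exact wording may vary between Xen versions:

    # Look for spurious interrupts and unhandled vectors in Xen's console log.
    xl dmesg | grep -iE 'spurious|No irq handler for vector'

    # Which APIC driver/mode Xen picked at boot.
    xl dmesg | grep -i 'APIC driver'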
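
And for getting "iommu=debug" onto Xen's command line, a sketch for a
Debian-style GRUB setup (the file path, variable name and update command
are distro-specific, so adjust for your boot loader):

    # /etc/default/grub.d/xen.cfg  (or /etc/default/grub)
    GRUB_CMDLINE_XEN_DEFAULT="iommu=debug"

    # regenerate the GRUB configuration, then reboot into the new entry
    update-grub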
-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445