Re: Serious AMD-Vi(?) issue
On Thu, Jul 04, 2024 at 03:08:00PM -0700, Elliott Mitchell wrote:
> On Mon, Jul 01, 2024 at 11:07:57AM -0700, Elliott Mitchell wrote:
> > On Thu, Jun 27, 2024 at 05:18:15PM -0700, Elliott Mitchell wrote:
> >
> > Most processors were mentioned roughly equally. Several had fewer
> > mentions, but not enough to seem significant. I discovered processor 1
> > did NOT show up. Whereas processor 0 had an above average number of
> > occurrences. This seems notable as these 2 processors are both reserved
> > exclusively for domain 0.
>
> All of the patterns continue. There are more reports on processor 0 than
> any other processor, but not enough to look particularly suspicious.
> What *does* look suspicious is the complete absence of reports from
> processor 1.

Bit more work with sort/uniq here and there is more of a pattern.
Odd-numbered processors (1,3,5) are seeing fewer reports, with CPU1 being
an outlier for having none. Even-numbered processors (0,2,4) are seeing
more reports, with CPU0 displaying the most of any processor.

There is also a pattern of lower-numbered processors seeing more of the
reports and higher-numbered ones seeing fewer (CPU1 being an outlier). If
my reading of `xl dmesg` is correct, then the lower-numbered processors
are the first die and higher-numbered processors are the second die.

My guess is that 0 and 1 are the first conjoined pair, which share more
of their silicon with each other.

> > There have also been a few "spurious 8259A interrupt" lines. So far
> > there haven't been very many of these. The processor and IRQ listed
> > don't yet appear to show any patterns. So far no IRQ has been listed
> > twice.
>
> IRQs 3-7 and 9-15 have each shown up once. 1-2 and 8 haven't shown up
> so far.

#8 has now shown up, so 8259A interrupts 3-15 have now all shown up
*once*. 0-2 haven't shown up at all.

Certain MSI IRQs are showing up.
The complete list is:

IRQ70     2
IRQ71    82
IRQ72   368
IRQ73    81
IRQ90    22
IRQ107   27
IRQ108   92
IRQ109   23
IRQ111   29
IRQ117    1

I'm unsure whether this actually works, but looking at /proc/interrupts,
all of these are associated with Xen according to Domain 0. 68-91 are
all listed as "xen-percpu", 105-120 are listed as "xen-dyn-lateeoi".

*IF* I am understanding this correctly, this *might* be the same problem:
https://lists.xenproject.org/archives/html/xen-devel/2024-07/msg00454.html

Domain 0 is reporting plenty of spurious events.

I'm starting to wonder if this isn't a Linux software RAID1 on AMD
hardware issue, but instead a more generalized issue towards the core of
Xen's interrupt handling. Just AMD hardware gets hit harder.

> Things look different enough to try reenabling Linux software RAID1. I'm
> going to continue monitoring closely, but so far it seems
> "iommu=no-intremap" may in fact mitigate the issue with software RAID1.

At this point I've monitored for problems and not found any for long
enough to declare this a tentative mitigation.

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
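
For anyone wanting to cross-check the list above, the reported counts can
be grouped by the two /proc/interrupts ranges named in the message
(68-91 "xen-percpu", 105-120 "xen-dyn-lateeoi"). This is only a rough
sketch: the names irq_counts, irq_ranges, classify and totals_by_type are
made up for illustration, and the numbers are copied by hand from the
message rather than parsed from any log.

```python
# Reported counts copied from the list in the message (IRQ number -> reports).
irq_counts = {
    70: 2, 71: 82, 72: 368, 73: 81, 90: 22,
    107: 27, 108: 92, 109: 23, 111: 29, 117: 1,
}

# Ranges as described in the message from Domain 0's /proc/interrupts.
# Assumption: both ranges are inclusive at each end.
irq_ranges = {
    "xen-percpu": range(68, 92),        # 68-91
    "xen-dyn-lateeoi": range(105, 121),  # 105-120
}


def classify(irq):
    """Return the handler type name for an IRQ, or "other" if unlisted."""
    for name, irqs in irq_ranges.items():
        if irq in irqs:
            return name
    return "other"


def totals_by_type(counts):
    """Sum the report counts per handler type."""
    totals = {}
    for irq, n in counts.items():
        kind = classify(irq)
        totals[kind] = totals.get(kind, 0) + n
    return totals


if __name__ == "__main__":
    for kind, total in sorted(totals_by_type(irq_counts).items()):
        print(f"{kind:16s} {total}")
```

Grouping this way only shows how the reports split between the two
handler types; it says nothing about which CPUs the reports landed on.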