Re: Serious AMD-Vi(?) issue
On Mon, Jul 01, 2024 at 11:07:57AM -0700, Elliott Mitchell wrote:
> On Thu, Jun 27, 2024 at 05:18:15PM -0700, Elliott Mitchell wrote:
> > I'm rather surprised it was so long before the next system restart.
> > Seems a quiet period as far as security updates go. Good news is I made
> > several new observations, but I don't know how valuable these are.
> >
> > On Mon, May 13, 2024 at 10:44:59AM +0200, Roger Pau Monné wrote:
> > >
> > > Does booting with `iommu=no-intremap` lead to any issues being
> > > reported?
> >
> > On boot there was in fact less. Notably the "AMD-Vi" messages haven't
> > shown up at all. I haven't stressed it very much yet, but previous
> > boots a message showed up the moment the MD-RAID1 driver was loaded.
> >
> >
> > I am though seeing two different messages now:
> >
> > (XEN) CPU#: No irq handler for vector # (IRQ -#, LAPIC)
> > (XEN) IRQ# a=#[#,#] v=#[#] t=PCI-MSI s=#
> >
> > These seem to be appearing in pairs. Multiple values show for each field,
> > though each field appears to vary between 2-3 different values. There
> > are thousands of these messages showing up.
>
> Some lucky timing so I've done some more experimentation and sampling.
>
> The "(XEN) IRQ" line almost always shows up with the "(XEN) CPU" line.
> I notice it is possible to generate the first without the second, so this
> seems notable. Every single "(XEN) CPU" line mentioned "LAPIC".
>
> In the small number (20) of lines where "(XEN) IRQ" did not show up, the
> "(XEN) CPU" line always ended with "(IRQ -2147483648, LAPIC)"
>
> For the "t=" value out of 316 samples, 94 listed "PCI-MSI" while 222
> listed "PCI-MSI/-X".
>
> For the IRQ, 72 occurred 126 times. 71, 73 and 108 occurred roughly 50
> times each. 109 and 111 occurred under 10 times. Almost no other IRQ
> values appeared.
>
> The "s=" value was "00000030" slightly more often than "00000010". No
> other values have been observed so far.
>
> The other values didn't show too many patterns.
>
> Most processors were mentioned roughly equally. Several had fewer
> mentions, but not enough to seem significant. I discovered processor 1
> did NOT show up. Whereas processor 0 had an above average number of
> occurrences. This seems notable as these 2 processors are both reserved
> exclusively for domain 0.

All of the patterns continue. There are more reports on processor 0
than any other processor, but not enough to look particularly
suspicious. What *does* look suspicious is the complete absence of
reports from processor 1.

> There have also been a few "spurious 8259A interrupt" lines. So far
> there haven't been very many of these. The processor and IRQ listed
> don't yet appear to show any patterns. So far no IRQ has been listed
> twice.

IRQs 3-7 and 9-15 have each shown up once. 1-2 and 8 haven't shown up
so far.

Things look different enough to try reenabling Linux software RAID1.

I'm going to continue monitoring closely, but so far it seems
"iommu=no-intremap" may in fact mitigate the issue with software RAID1.
This seems odd, but I'm simply reporting what I observe. I would have
expected to see problem indications by now, yet there aren't any.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
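
The per-field tallies discussed in this message (reports per processor, per IRQ, per "t=" type and per "s=" value) could be gathered with a short script along the following lines. This is only a minimal sketch and not necessarily how the numbers above were produced. The regular expressions are assumptions based on the masked message templates quoted in the thread, so the exact field formats (decimal CPU and IRQ numbers, hexadecimal vector and "s=" values) may need adjusting against real log output. The function name tally is hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: field formats are guessed from the masked templates
#   (XEN) CPU#: No irq handler for vector # (IRQ -#, LAPIC)
#   (XEN) IRQ# a=#[#,#] v=#[#] t=PCI-MSI s=#
# and may not match the real hypervisor output exactly.
CPU_RE = re.compile(
    r"\(XEN\) CPU(?P<cpu>\d+): No irq handler for vector "
    r"(?P<vector>[0-9a-fA-F]+) \(IRQ (?P<irq>-?\d+), LAPIC\)"
)
IRQ_RE = re.compile(
    r"\(XEN\) IRQ\s*(?P<irq>\d+) .*?t=(?P<type>\S+) s=(?P<status>[0-9a-fA-F]+)"
)


def tally(lines):
    """Count CPUs, IRQs, "t=" types and "s=" values over an iterable of log lines."""
    counts = {name: Counter() for name in ("cpu", "irq", "type", "status")}
    for line in lines:
        match = CPU_RE.search(line)
        if match:
            # "(XEN) CPU" line: record which processor reported it.
            counts["cpu"][int(match.group("cpu"))] += 1
            continue
        match = IRQ_RE.search(line)
        if match:
            # "(XEN) IRQ" line: record the IRQ number, type and status word.
            counts["irq"][int(match.group("irq"))] += 1
            counts["type"][match.group("type")] += 1
            counts["status"][match.group("status")] += 1
    # Lines matching neither pattern are ignored.
    return counts
```

Passing an open log file (for example, saved output of "xl dmesg") to tally() returns one Counter per field. Calling most_common() on the "cpu" counter would then show whether a given processor, such as processor 1, is absent from the reports.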