Re: Serious AMD-Vi(?) issue
On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > On 22.03.2024 20:22, Elliott Mitchell wrote:
> > > On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
> > >>
> > >> I can see you've recently engaged with our community with some issues you'd
> > >> like help with.
> > >> We love the fact you are participating in our project, however, our
> > >> developers aren't able to help if you do not provide the specific details.
> > >
> > > Please point to specific details which have been omitted. Fairly little
> > > data has been provided as fairly little data is available. The primary
> > > observation is large numbers of:
> > >
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> > >
> > > Lines in Xen's ring buffer.
> >
> > Yet this is (part of) the problem: By providing only the messages that appear
> > relevant to you, you imply that you know that no other message is in any way
> > relevant. That's judgement you'd better leave to people actually trying to
> > investigate. Unless of course you were proposing an actual code change, with
> > suitable justification.
>
> Honestly, I forgot about the very small number of messages from the SATA
> subsystem. The question of whether the current mitigation actions are
> effective right now was a bigger issue. As such monitoring `xl dmesg`
> was a priority to looking at SATA messages which failed to reliably
> indicate status.
>
> I *thought* I would be able to retrieve those via other slow means, but a
> different and possibly overlapping issue has shown up. Unfortunately
> this means those are no longer retrievable. :-(

With some persistence I was able to retrieve them. There are other pieces of software with worse UIs than Xen.
> > In fact when running into trouble, the usual course of action would be to
> > increase verbosity in both hypervisor and kernel, just to make sure no
> > potentially relevant message is missed.
>
> More/better information might have been obtained if I'd been engaged
> earlier.

This is still true; things are in full mitigation mode and I'll be quite unhappy to go back to experiments at this point.

I now see why I left those out. The messages from the SATA subsystem came from a kernel built from an LTS branch into which a bad patch had leaked. It looks like the SATA subsystem was significantly broken, and I'm unsure whether any useful information can be retrieved from it. Notably, there is quite a bit of noise from SATA devices not affected by this issue. Some of the messages /might/ be useful, but the amount of noise is quite high.

Do messages from a broken kernel interest you?

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-  _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445