
Re: Serious AMD-Vi(?) issue



On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > On 22.03.2024 20:22, Elliott Mitchell wrote:
> > > On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
> > >>
> > >> I can see you've recently engaged with our community with some issues
> > >> you'd like help with.
> > >> We love the fact you are participating in our project, however, our
> > >> developers aren't able to help if you do not provide the specific 
> > >> details.
> > > 
> > > Please point to specific details which have been omitted.  Fairly little
> > > data has been provided as fairly little data is available.  The primary
> > > observation is large numbers of:
> > > 
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> > > 
> > > Lines in Xen's ring buffer.
> > 
> > Yet this is (part of) the problem: By providing only the messages that
> > appear relevant to you, you imply that you know that no other message is
> > in any way relevant. That's judgement you'd better leave to people
> > actually trying to investigate. Unless of course you were proposing an
> > actual code change, with suitable justification.
> 
> Honestly, I forgot about the very small number of messages from the SATA
> subsystem.  The question of whether the current mitigation actions are
> effective right now was a bigger issue.  As such monitoring `xl dmesg`
> was a priority to looking at SATA messages which failed to reliably
> indicate status.
> 
> I *thought* I would be able to retrieve those via other slow means, but a
> different and possibly overlapping issue has shown up.  Unfortunately
> this means those are no longer retrievable.   :-(

With some persistence I was able to retrieve them.  There are other
pieces of software with worse UIs than Xen.

> > In fact when running into trouble, the usual course of action would be to
> > increase verbosity in both hypervisor and kernel, just to make sure no
> > potentially relevant message is missed.
> 
> More/better information might have been obtained if I'd been engaged
> earlier.

This is still true.  Things are in full mitigation mode and I'd be
quite unhappy to go back to experimenting at this point.


I now see why I left those out.  The messages from the SATA subsystem
were from a kernel where a bad patch had leaked into an LTS branch.  It
looks like the SATA subsystem was significantly broken, and I'm unsure
whether any useful information could be retrieved.  Notably there is quite
a bit of noise from SATA devices not affected by this issue.

Some of the messages /might/ be useful, but the amount of noise is quite
high.  Do messages from a broken kernel interest you?


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





 

