Re: Serious AMD-Vi issue
On Mon, Feb 12, 2024 at 03:23:00PM -0800, Elliott Mitchell wrote:
> On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > Apparently this was first noticed with 4.14, but more recently I've been
> > able to reproduce the issue:
> >
> > https://bugs.debian.org/988477
> >
> > The original observation features MD-RAID1 using a pair of Samsung
> > SATA-attached flash devices. The main line shows up in `xl dmesg`:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> >
> > Where the device points at the SATA controller. I've ended up
> > reproducing this with some noticeable differences.
> >
> > A major goal of RAID is to have different devices fail at different
> > times. Hence my initial run had a Samsung device plus a device from
> > another reputable flash manufacturer.
> >
> > I initially noticed this due to messages in domain 0's dmesg about
> > errors from the SATA device. It wasn't until rather later that I noticed
> > the IOMMU warnings in Xen's dmesg (perhaps post-domain 0 messages should
> > be duplicated into domain 0's dmesg?).
> >
> > All of the failures consistently pointed at the Samsung device. Due to
> > the expectation it would fail first (lower quality offering with
> > lesser guarantees), I proceeded to replace it with an NVMe device.
> >
> > With some monitoring I discovered the NVMe device was now triggering
> > IOMMU errors, though not nearly as many as the Samsung SATA device did.
> > As such, it looks like AMD-Vi plus MD-RAID1 is exposing some sort
> > of IOMMU issue with Xen.
> >
> >
> > All I can do is offer speculation about the underlying cause. There
> > does seem to be a pattern of higher-performance flash storage devices
> > being more severely affected.
> >
> > I was speculating about the issue being the MD-RAID1 driver abusing
> > Linux's DMA infrastructure in some fashion.
> >
> > Upon further consideration, I'm wondering if this is perhaps a latency
> > issue. I imagine there is some sort of flush after the IOMMU tables are
> > modified. Perhaps the Samsung SATA (and all NVMe) devices were trying to
> > execute commands before reloading the IOMMU tables is complete.
>
> Ping!
>
> The recipe seems to be Linux MD RAID1, plus Samsung SATA or any NVMe.
>
> To make it explicit: when I tried Crucial SATA + Samsung SATA, the IOMMU
> errors matched the Samsung SATA (a number of times the SATA driver
> complained).
>
> As stated, I'm speculating that lower-latency devices are starting to
> execute commands before the IOMMU tables have finished reloading. When
> this was originally implemented, fast flash devices were rare.

I guess I'm lucky I ended up with some slightly higher-latency hardware.
This is a very serious issue as data loss can occur.

AMD needs to fund their Xen engineers more, otherwise soon AMD hardware
may no longer be viable with Xen.

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445