Re: Serious AMD-Vi issue
On Fri, Jan 24, 2025 at 01:26:23PM -0800, Elliott Mitchell wrote:
> On Fri, Jan 24, 2025 at 03:31:30PM +0100, Roger Pau Monné wrote:
> > On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > > Apparently this was first noticed with 4.14, but more recently I've
> > > been able to reproduce the issue:
> > >
> > > https://bugs.debian.org/988477
> > >
> > > The original observation features MD-RAID1 using a pair of Samsung
> > > SATA-attached flash devices.  The main line shows up in `xl dmesg`:
> > >
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> >
> > I think I've figured out the cause for those faults, and posted a fix
> > here:
> >
> > https://lore.kernel.org/xen-devel/20250124120112.56678-1-roger.pau@xxxxxxxxxx/
> >
> > Fix is patch 5/5, but you likely want to take them all to avoid
> > context conflicts.
>
> I haven't tested yet, but here is some analysis from looking at the
> series.
>
> This seems a plausible explanation for the interrupt IOMMU messages.  As
> such I think there is a good chance the reported messages will disappear.
>
> Nothing in here looks plausible for solving the real problem, that of
> RAID1 mirrors diverging (almost certainly getting zeroes during DMA, but
> there is a chance stale data is being read).
>
> Worse, since it removes the observed messages, the next person will
> almost certainly have severe data loss by the time they realize there is
> a problem.  Notably those messages led me to Debian #988477, so I was
> able to take action before things got too bad.

I think this is the first time I've gotten complaints from the reporter
of a bug after attempting to fix it.  Maybe my original message wasn't
clear enough.

So far I consider the IOMMU faults and the disk issues to be different
bugs, hence my asking specifically whether the posted series makes any
difference for either of them.  I would be surprised if it also fixed the
data loss issue, but wanted to ask regardless.
> I'm not absolutely certain this is a pure Xen bug.  There is a
> possibility the RAID1 driver is reusing DMA buffers in a fashion which
> violates the DMA interface.  Yet there is also a good chance Xen isn't
> implementing its layer properly either.
>
> There is one pattern emerging at this point.  Samsung hardware is badly
> affected, other vendors are either unaffected or mildly affected.
> Notably the estimated age of the devices meant to be handed off to
> someone able to diagnose the issue is >10 years.  The unaffected
> Crucial/Micron SATA device *should* drastically outperform these, yet
> instead it is unaffected.  The Crucial/Micron NVMe is very mildly
> affected, yet should be more than an order of magnitude faster.
>
> The simplest explanation is that the flash controller on the Samsung
> devices is lower latency than the one used by Micron.
>
> Both present reproductions feature AMD processors and ASUS motherboards.
> I'm doubtful of this being an ASUS issue.  This seems more likely a case
> of people who use RAID with flash tending to go with a motherboard
> vendor who reliably supports ECC on all their motherboards.
>
> I don't know whether this is confined to AMD processors or not.  The
> small number of reproductions suggests few people are doing RAID with
> flash storage, in which case no one may have tried RAID1 with flash on
> Intel processors.  On Intel hardware the referenced message would be
> absent, and people might think their problem was distinct from Debian
> #988477.

As said above, my current hypothesis is that the IOMMU fault message is
just a red herring, and has nothing to do with the underlying data loss
issue that you are seeing.

I expect there will be no similar IOMMU fault message on Intel hardware,
as the updating of interrupt remapping entries was already done
atomically on VT-d.

> In fact what seems a likely reproduction on Intel hardware is the Intel
> sound card issue.  I notice that issue occurs when sound *starts*
> playing.
> When a sound device starts, its buffers would be empty and the first
> DMA request would be turned around with minimal latency.  In that case
> this matches the Samsung SATA devices handling DMA with low latency.

Can you reproduce the data loss issue without using RAID in Linux?  You
can use fio with verify, or similar, to stress test it.

Can you reproduce if dom0 is PVH instead of PV?

Can you reproduce with dom0-iommu=strict mode in the Xen command line?

> > Can you give it a try and see if it fixes the fault messages, plus
> > your issues with the disk devices?
>
> Ick.  I was hoping to avoid reinstalling the known problematic devices
> and simply send them to someone better set up for analyzing x86
> problems.
>
> Looking at the series, it seems likely to remove the fault messages and
> turn this into silent data loss.  I doubt any AMD processors have an
> IOMMU yet omit cmpxchg16b (the older system lacked a full IOMMU, yet
> did have cmpxchg16b; the newer system has both).  Even guests have
> cmpxchg16b available.

The data loss might or might not be there, regardless of whether IOMMU
faults are being reported.  IMO it's unhelpful to make this kind of
comment, as you seem to suggest a preference for leaving the IOMMU fault
bug unfixed, which I'm sure is not the case.

> If you really want this tested, it will be a while before the next
> potential downtime window.

No worries, I already have confirmation from someone else who was seeing
the same IOMMU faults and has tested the fix.  I was mostly wondering
whether it would affect your data loss issues in any way, as for that I
have no one else who can reproduce.

Thanks, Roger.
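
[Editor's note: the fio-with-verify stress test suggested in the thread
could look like the sketch below.  This invocation is not from the
original messages; the target path, size, and job parameters are
placeholder assumptions to adjust for the system under test.  Pointing
--filename at a raw block device instead of a file is destructive to its
contents.]

```shell
# Hypothetical fio job: write random blocks with embedded checksums, then
# read them back and verify, aborting on the first mismatch.  The target
# file and sizes below are illustrative placeholders.
fio --name=dma-verify \
    --filename=/var/tmp/fio-verify.dat \
    --size=256M \
    --direct=1 \
    --rw=randwrite \
    --bs=4k \
    --verify=crc32c \
    --verify_fatal=1
```

With --verify=crc32c fio stores a checksum in each written block and
re-reads the data to check it, so silently dropped or zeroed DMA writes
show up as verification failures rather than going unnoticed.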