
Re: Serious AMD-Vi(?) issue


  • To: Elliott Mitchell <ehem+xen@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 18 Apr 2024 09:09:51 +0200
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Kelly Choi <kelly.choi@xxxxxxxxx>
  • Delivery-date: Thu, 18 Apr 2024 07:10:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 18.04.2024 08:45, Elliott Mitchell wrote:
> On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote:
>> On 11.04.2024 04:41, Elliott Mitchell wrote:
>>> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
>>>> On 27.03.2024 18:27, Elliott Mitchell wrote:
>>>>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
>>>>>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
>>>>>>>
>>>>>>> In fact when running into trouble, the usual course of action would be to
>>>>>>> increase verbosity in both hypervisor and kernel, just to make sure no
>>>>>>> potentially relevant message is missed.
>>>>>>
>>>>>> More/better information might have been obtained if I'd been engaged
>>>>>> earlier.
>>>>>
>>>>> This is still true; things are in full mitigation mode and I'll be
>>>>> quite unhappy to go back to experimenting at this point.
>>>>
>>>> Well, it very likely won't work without further experimenting by someone
>>>> able to observe the bad behavior. Recall we're on xen-devel here; it is
>>>> kind of expected that without clear (and practical) repro instructions
>>>> experimenting as well as info collection will remain with the reporter.
>>>
>>> After looking at the situation and considering the issues, I /may/ be
>>> able to set up for doing more testing.  I guess I should confirm: which
>>> of those criteria do you think the currently provided information fails
>>> to meet?
>>>
>>> AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) +
>>> dbench; seems a pretty specific setup.
>>
>> Indeed. If that's the only way to observe the issue, it suggests to me
>> that it'll need to be mainly you to do further testing, and perhaps even
>> debugging. Which isn't to say we're not available to help, but from all
>> I have gathered so far we're pretty much in the dark even as to which
>> component(s) may be to blame. As can still be seen at the top in reply
>> context, some suggestions were given as to obtaining possible further
>> information (or confirming the absence thereof).
> 
> There may be other ways which haven't yet been found.
> 
> I've been left with the suspicion AMD was to some degree sponsoring
> work to ensure Xen works on their hardware.  Given the severity of this
> problem I would kind of expect them not to want to gain a reputation for
> having data loss issues.  Assuming a suitable pair of devices weren't
> already on-hand, I would kind of expect this to be well within their
> budget.

You've got to talk to AMD then. Plus I assume it's clear to you that
even if the (presumably) necessary hardware were available, it would
still require the corresponding setup, leaving open whether the issue
could then indeed be reproduced.

>> I'd also like to come back to the vague theory you did voice, in that
>> you're suspecting flushes to take too long. I continue to have trouble
>> with this, and I would therefore like to ask that you put this down in
>> more technical terms, making connections to actual actions taken by
>> software / hardware.
> 
> I'm trying to figure out a pattern.
> 
> Nominally all the devices are roughly on par (only a very cheap flash
> device will be unable to overwhelm SATA's bandwidth).  Yet why did the
> Crucial SATA device /seem/ not to have the issue?  Why did a Crucial NVMe
> device demonstrate the issue?
> 
> My guess is the flash controllers Samsung uses may be able to start
> executing commands faster than the ones Crucial uses.  Meanwhile NVMe
> has lower overhead and latency than SATA (SATA's overhead isn't an issue
> for actual disks).  Perhaps the IOMMU is still flushing its TLB, or
> hasn't loaded the new tables.

Which would be an IOMMU issue then, that software at best may be able to
work around.
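
For concreteness, the handshake that theory hinges on looks roughly as
follows. Per the AMD IOMMU spec, software queues an INVALIDATE_IOMMU_PAGES
command followed by a COMPLETION_WAIT into the command buffer, then polls a
semaphore that the completion stores to; until that store is observed,
stale IOTLB entries may still be used. Below is a minimal, compilable C
sketch of the pattern (the structures, helpers, and encodings are
simplified stand-ins, not Xen's actual code; real hardware consumes the
command ring asynchronously):

#include <stdint.h>
#include <stdio.h>

enum {
    CMD_COMPLETION_WAIT        = 0x1, /* fence: signals when prior commands finish */
    CMD_INVALIDATE_IOMMU_PAGES = 0x3, /* drop IOTLB entries for a domain's range */
};

struct iommu_cmd {
    uint8_t  opcode;
    uint16_t domain_id;
    uint64_t addr;
};

static volatile uint64_t completion_sema;

/* Stand-in for writing a command into the ring and ringing the doorbell.
 * Real hardware consumes the ring asynchronously; this model "executes"
 * immediately so the example runs. */
static void submit(const struct iommu_cmd *c)
{
    switch (c->opcode) {
    case CMD_INVALIDATE_IOMMU_PAGES:
        printf("IOTLB flush: domain %u, addr %#llx\n",
               (unsigned)c->domain_id, (unsigned long long)c->addr);
        break;
    case CMD_COMPLETION_WAIT:
        completion_sema = 1; /* the spec allows a store to a semaphore */
        break;
    }
}

static void flush_and_wait(uint16_t domain_id, uint64_t addr)
{
    struct iommu_cmd inv  = { CMD_INVALIDATE_IOMMU_PAGES, domain_id, addr };
    struct iommu_cmd wait = { CMD_COMPLETION_WAIT, 0, 0 };

    completion_sema = 0;
    submit(&inv);
    submit(&wait);

    /* Until this poll succeeds, the IOMMU may still serve translations
     * from stale IOTLB entries; a device DMA arriving in that window is
     * the hazard being hypothesised. */
    while (!completion_sema)
        ; /* cpu_relax() in real code */
}

int main(void)
{
    flush_and_wait(1, 0x100000);
    puts("flush completed; old mappings safe to recycle");
    return 0;
}

If a controller can get a DMA onto the bus inside that window, the result
would look much like the corruption described in this thread.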

Jan

> I suspect when the MD-RAID1 subsystem issues block requests to a pair of devices,
> it likely sends the block to one device and then reuses most/all of the
> structures for the second device.  As a result the second request would
> likely get a command to the device rather faster than the first request.
> 
> Perhaps look into what structures the MD-RAID1 subsystem reuses.
> Then see whether doing early setup of those structures triggers the
> issue?
> 
> (okay I'm deep into speculation here, but this seems the simplest
> explanation for what could be occurring)
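
To make that speculation concrete, here is a toy model of the submission
pattern being described: one RAID1 write duplicated to both mirror legs
back-to-back, with the second leg reusing the first leg's cache-hot state.
All names here are invented for illustration; this is not the Linux md
code:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a block-layer request; not the kernel's struct bio. */
struct bio_like {
    uint64_t sector;
    size_t   len;
    int      target_dev;
};

static void submit_to_dev(const struct bio_like *b)
{
    /* In the real stack this queues a command to the controller.  Because
     * the second leg reuses cache-hot state, it can reach the hardware
     * only microseconds after the first. */
    printf("dev%d: write sector %llu, len %zu\n",
           b->target_dev, (unsigned long long)b->sector, b->len);
}

static void raid1_write(uint64_t sector, size_t len)
{
    struct bio_like leg0 = { sector, len, 0 };
    struct bio_like leg1;

    submit_to_dev(&leg0);

    /* The speculation above: the clone for the second mirror leg reuses
     * most of the first leg's structures, so its command may reach the
     * device while IOMMU work for the mapping is still in flight. */
    memcpy(&leg1, &leg0, sizeof(leg1));
    leg1.target_dev = 1;
    submit_to_dev(&leg1);
}

int main(void)
{
    raid1_write(2048, 4096);
    return 0;
}

The experiment suggested above would then amount to building the second
leg's structures earlier or later, and seeing whether widening or narrowing
the gap between the two submissions changes how readily the issue triggers.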