Xen project Mailing List

Re: Crashes under Xen with Radeon graphics card

To: "Deucher, Alexander" <Alexander.Deucher@xxxxxxx>, lkml <linux-kernel@xxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "amd-gfx@xxxxxxxxxxxxxxxxxxxxx" <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>

From: Juergen Gross <jgross@xxxxxxxx>

Date: Fri, 15 Dec 2023 17:12:37 +0100

Cc: "Koenig, Christian" <Christian.Koenig@xxxxxxx>, "Pan, Xinhui" <Xinhui.Pan@xxxxxxx>

Delivery-date: Fri, 15 Dec 2023 16:12:43 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15.12.23 17:04, Deucher, Alexander wrote:

[Public]

-----Original Message-----
From: Juergen Gross <jgross@xxxxxxxx>
Sent: Friday, December 15, 2023 6:57 AM
To: lkml <linux-kernel@xxxxxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>
Subject: Crashes under Xen with Radeon graphics card

Hi,

I recently stumbled over a test system which showed crashes probably
resulting from memory being overwritten randomly.

The problem occurs only in Dom0 when running under Xen. It seems to have
been present since at least kernel 6.3 (I haven't gone back further yet),
and it seems NOT to be present in kernel 5.14.

I tracked the problem down to the initialization of the graphics card (the
problem might surface only later, but at least an early initialization
failure made the problem go away).

# lspci
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos XTX [Radeon HD 8490 / R5 235X OEM]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6450 / 7450/8450/8490 OEM / R5 230/235/235X OEM]

I had one working .config and one that produced the crashes, so I narrowed
down the relevant difference between them: it was in the area of firmware
loading (the working .config didn't have CONFIG_FW_LOADER_COMPRESS_XZ set,
causing firmware loading for the card to fail). This was of course not the
real problem, but it caused the card initialization to fail.

I manually decompressed the firmware files one by one to see whether the
problem was in the decompressor or rather in the driver of the card.
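
For readers who want to repeat the one-file-at-a-time step, here is a minimal sketch of decompressing a single .xz firmware file next to its compressed copy, so that it can be loaded without CONFIG_FW_LOADER_COMPRESS_XZ. The helper name decompress_firmware and the directory in the usage note are assumptions for illustration, not taken from the original mail.

```python
import lzma
import shutil
from pathlib import Path


def decompress_firmware(xz_path: Path) -> Path:
    """Decompress e.g. CAICOS_mc.bin.xz to CAICOS_mc.bin in the same directory.

    Hypothetical helper: the compressed file is left in place.
    """
    if xz_path.suffix != ".xz":
        raise ValueError(f"not an .xz file: {xz_path}")
    # with_suffix("") strips only the trailing ".xz", keeping ".bin".
    out_path = xz_path.with_suffix("")
    with lzma.open(xz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling it once per file, for example decompress_firmware(Path("/lib/firmware/radeon/CAICOS_mc.bin.xz")) (the path is an assumption and varies by distribution), and rebooting between files reproduces the stepwise approach described above.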

The last step without crash was:

# dmesg | grep radeon
[   10.106405] [drm] radeon kernel modesetting enabled.
[   10.106455] radeon 0000:01:00.0: vgaarb: deactivate vga console
[   10.222944] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
[   10.252921] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
[   10.278255] [drm] radeon: 1024M of VRAM memory ready
[   10.295828] [drm] radeon: 1024M of GTT memory ready.
[   10.295867] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
[   10.330846] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
[   10.330858] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
[   10.330870] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin failed with error -2
[   10.380979] ni_cp: Failed to load firmware "radeon/CAICOS_mc.bin"
[   10.381006] [drm:evergreen_init [radeon]] *ERROR* Failed to load firmware!
[   10.405765] radeon 0000:01:00.0: Fatal error during GPU init
[   10.432107] [drm] radeon: finishing device.
[   10.439179] [drm] radeon: ttm finalized
[   10.463203] radeon: probe of 0000:01:00.0 failed with error -2

And with decompressing radeon/CAICOS_mc.bin I got:

# dmesg | grep radeon
[   10.266491] [drm] radeon kernel modesetting enabled.
[   10.266552] radeon 0000:01:00.0: vgaarb: deactivate vga console
[   10.456047] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
[   10.470270] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
[   10.566946] [drm] radeon: 1024M of VRAM memory ready
[   10.576891] [drm] radeon: 1024M of GTT memory ready.
[   10.586971] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
[   10.611886] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
[   10.611909] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
[   10.611938] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin succeeded
[   10.660599] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_smc.bin failed with error -2
[   10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"


You also need to make sure CAICOS_smc.bin is available.

Of course. But with all firmware files loadable the system is crashing, too. I thought it might help to see after which firmware the crashes are starting.

[   10.661676] [drm] radeon: power management initialized
[   10.713666] radeon 0000:01:00.0: Direct firmware load for radeon/SUMO_uvd.bin failed with error -2
[   10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware "radeon/SUMO_uvd.bin"
[   10.713669] radeon 0000:01:00.0: failed UVD (-2) init.


And SUMO_uvd.bin.

Sure.

[   10.714787] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[   10.809213] radeon 0000:01:00.0: WB enabled
[   10.817528] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
[   10.833755] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
[   10.850330] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
[   10.862154] radeon 0000:01:00.0: radeon: using MSI.
[   10.871930] [drm] radeon: irq initialized.
[   11.062028] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 0
[   11.119723] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[   11.411370] fbcon: radeondrmfb (fb0) is primary device
[   11.507252] radeon 0000:01:00.0: [drm] fb0: radeondrmfb frame buffer device
[   11.674028] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[   11.834317] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[   28.313041] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops radeon_audio_component_bind_ops [radeon])
[   44.371991] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[   44.428068] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID

followed by a crash some seconds after the system was up.

The crashes vary, but often the kernel accesses non-canonical addresses or
tries to map illegal physical addresses. Sometimes the system is just hanging,
either with softlockups or without any further signs of being alive.

I can easily reproduce the problem, so any debug patches to narrow down the
problem are welcome.


There is still firmware missing that is required for proper operation.
Please fix that up.

That was the starting point, of course!

BTW, meanwhile I have tested kernel 5.19, which is working. I suspected that the patch series merging swiotlb and swiotlb-xen could be to blame, but that went into v5.19.

Juergen
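
To see at a glance which firmware files loaded and which did not, the "Direct firmware load for ..." lines in the dmesg excerpts above can be summarized with a short script. This is only a sketch under stated assumptions: the function names are hypothetical, and the pattern is based solely on the message wording shown in this thread. It tolerates the mail-wrapped form of those lines.

```python
import re

# Matches the "Direct firmware load for <name> succeeded" and
# "... failed with error <n>" lines as they appear in the logs above.
# \s+ also accepts a line break, since the mail wrapped some log lines.
FW_LINE = re.compile(
    r"Direct firmware load for\s+(?P<name>\S+)\s+"
    r"(?:succeeded|failed with error\s+(?P<errno>-?\d+))"
)


def firmware_results(dmesg_text):
    """Map each firmware name to 0 on success or its error code on failure."""
    results = {}
    for match in FW_LINE.finditer(dmesg_text):
        errno = match["errno"]  # None when the line says "succeeded"
        results[match["name"]] = int(errno) if errno is not None else 0
    return results


def missing_firmware(dmesg_text):
    """List the firmware names whose load failed, in log order."""
    return [name for name, rc in firmware_results(dmesg_text).items() if rc != 0]
```

Fed the second log from the original report, missing_firmware would list radeon/CAICOS_smc.bin and radeon/SUMO_uvd.bin, the two files mentioned in the replies.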
