
Re: NULL pointer dereference in xenbus_thread->...


  • To: Jason Andryuk <jason.andryuk@xxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Jason Andryuk <jandryuk@xxxxxxxxx>
  • From: Jürgen Groß <jgross@xxxxxxxx>
  • Date: Wed, 30 Apr 2025 17:43:58 +0200
  • Cc: Julien Grall <julien@xxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 30 Apr 2025 15:44:05 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 30.04.25 16:29, Jason Andryuk wrote:
On 2025-04-30 06:56, Marek Marczykowski-Górecki wrote:
On Tue, Apr 29, 2025 at 08:59:45PM -0400, Jason Andryuk wrote:
Hi Marek,

On Wed, Apr 23, 2025 at 8:42 AM Marek Marczykowski-Górecki
<marmarek@xxxxxxxxxxxxxxxxxxxxxx> wrote:

I've got some more reports confirming it's still happening on Linux
6.12.18. Is there anything I can do to help fix this? Maybe ask users
to enable some extra logging?

Have you been able to capture a crash with debug symbols and run it
through scripts/decode_stacktrace.sh?

Not really, as I don't have debug symbols for this kernel. And I can't
reliably reproduce it myself (for me it happens about once a
month...). I can try to reproduce the debug symbols; theoretically I should
have all the ingredients for it.

I'm curious what process_msg+0x18e/0x2f0 is.  process_writes() has a
direct call to wake_up(), but process_msg() calling req->cb(req) may
be xs_wake_up() which is a thin wrapper over wake_up().

There is a code dump in the crash message, does it help?

That's a little deeper in the call chain.  If you have a vmlinux or bzImage with a matching stacktrace, that would work to look up the address in the disassembly.  So if you don't have a matching pair, maybe try to catch it the next time.

They make me wonder if req has been free()ed and at least partially
zeroed, but still has wake_up() called on it.  The call stack here is
reminiscent of the one here
https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/ and the
unexpected value there is 0.

That's an interesting idea; the one above I've seen only on 6.15-rc1 (and
no later rc). But maybe?

I am guessing, so I could be wrong.  A NULL pointer and an unexpected zero value are both 0, at least.  Also, Whonix looks like it may use init_on_free=1 to zero memory at free time.

I have looked at this issue multiple times now.

Just some remarks on what IMO could go wrong (I didn't find any proof that
this really happened, though), in case someone wants to double-check:

The most probable candidate for something going wrong is a use-after-free
of a struct xb_req_data element (normally named "req" in the related code).

Some words about the not-so-obvious locking scheme used for those
elements: A "req" is owned by a thread as long as it isn't in any of the
lists it can live in (xs_reply_list or xb_write_list). Putting it into one
of the lists or removing it again requires holding the xb_write_mutex.

A "req" needs to be in a certain state when either in one of the lists or
when being owned by a worker thread.
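
To make the ownership rule above concrete, here is a minimal userspace sketch. It is not the kernel code: the list and mutex names are borrowed from this mail, while struct req, the loc field, and the helpers queue_request(), mark_written() and take_reply() are hypothetical and exist only for illustration. The assumed flow (write list first, then reply list) is a simplification.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; only the list/mutex names mirror this mail. */
enum req_loc { OWNED_BY_THREAD, ON_WRITE_LIST, ON_REPLY_LIST };

struct req {
    struct req *next;
    enum req_loc loc;   /* where the request currently lives */
};

static pthread_mutex_t xb_write_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct req *xb_write_list;   /* requests waiting to be written */
static struct req *xs_reply_list;   /* requests waiting for a reply */

/* Push r onto a list; the caller must hold xb_write_mutex. */
static void list_push(struct req **head, struct req *r, enum req_loc loc)
{
    r->next = *head;
    *head = r;
    r->loc = loc;
}

/* Unlink r from a list; the caller must hold xb_write_mutex. */
static bool list_remove(struct req **head, struct req *r)
{
    for (struct req **p = head; *p; p = &(*p)->next) {
        if (*p == r) {
            *p = r->next;
            r->next = NULL;
            r->loc = OWNED_BY_THREAD;
            return true;
        }
    }
    return false;
}

/* Hand a thread-owned request over to the write list. */
void queue_request(struct req *r)
{
    pthread_mutex_lock(&xb_write_mutex);
    assert(r->loc == OWNED_BY_THREAD);
    list_push(&xb_write_list, r, ON_WRITE_LIST);
    pthread_mutex_unlock(&xb_write_mutex);
}

/* Move a request from the write list to the reply list. */
bool mark_written(struct req *r)
{
    pthread_mutex_lock(&xb_write_mutex);
    bool ok = list_remove(&xb_write_list, r);
    if (ok)
        list_push(&xs_reply_list, r, ON_REPLY_LIST);
    pthread_mutex_unlock(&xb_write_mutex);
    return ok;
}

/* Take a request off the reply list; afterwards the caller owns it. */
bool take_reply(struct req *r)
{
    pthread_mutex_lock(&xb_write_mutex);
    bool ok = list_remove(&xs_reply_list, r);
    pthread_mutex_unlock(&xb_write_mutex);
    return ok;
}
```

In this model a request is only ever touched without the mutex while its loc is OWNED_BY_THREAD, which is the invariant a use-after-free would have to violate.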

I'm wondering whether it could happen that a thread waiting for a "req"
could be woken up and the "req" is being freed before the waiting thread
can react. Normally this shouldn't be possible, but "never say never".
What caught my eye today is the test of req->state == xb_req_state_wait_reply
in process_msg() just after dropping the xb_write_mutex. This looks a little
bit fishy, but OTOH the request has just been removed from the xs_reply_list,
so no mutex should be needed for that test.
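
The pattern in question can be sketched as a toy userspace model. This is not a copy of process_msg(): REQ_WAIT_REPLY stands in for xb_req_state_wait_reply, while the other states, the struct layout, process_reply(), add_to_reply_list() and count_wakeup() are assumptions made up for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only. REQ_WAIT_REPLY stands in for xb_req_state_wait_reply;
 * the other states and all field names are assumptions for illustration. */
enum req_state { REQ_WAIT_REPLY, REQ_GOT_REPLY, REQ_ABORTED, REQ_DROPPED };

struct req {
    struct req *next;
    uint32_t id;
    enum req_state state;
    void (*cb)(struct req *);   /* wakes whoever waits for this request */
};

static pthread_mutex_t xb_write_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct req *xs_reply_list;

int wakeups;                    /* counts callback invocations */

/* Hypothetical callback: just records that a wake-up happened. */
void count_wakeup(struct req *r)
{
    (void)r;
    wakeups++;
}

/* Add a request to the reply list; the list update happens under the mutex. */
void add_to_reply_list(struct req *r)
{
    pthread_mutex_lock(&xb_write_mutex);
    r->next = xs_reply_list;
    xs_reply_list = r;
    pthread_mutex_unlock(&xb_write_mutex);
}

/* Handle a reply for id. Returns 0 if a request matched, -1 otherwise. */
int process_reply(uint32_t id)
{
    struct req *r = NULL;

    pthread_mutex_lock(&xb_write_mutex);
    for (struct req **p = &xs_reply_list; *p; p = &(*p)->next) {
        if ((*p)->id == id) {
            r = *p;
            *p = r->next;       /* unlinked: no longer reachable via the list */
            r->next = NULL;
            break;
        }
    }
    pthread_mutex_unlock(&xb_write_mutex);

    if (!r)
        return -1;

    /* This test runs after the mutex was dropped, as discussed above. */
    if (r->state == REQ_WAIT_REPLY) {
        assert(r->cb != NULL);
        r->state = REQ_GOT_REPLY;
        r->cb(r);
    } else {
        r->state = REQ_DROPPED; /* stands in for freeing the request */
    }
    return 0;
}
```

The unlocked test is safe in this model only as long as nothing else can still reach the request once it has left the list. If a waiter could somehow get hold of it and free it in that window, the accesses after the unlock would hit freed memory.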

Possible candidates for such an "impossible" scenario include a wrap of
xs_request_id (not very probable, though, as having 4 billion Xenstore
requests "in flight" is rather unlikely IMHO).
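
For reference, the arithmetic behind that estimate, as a small sketch. It assumes the id counter is an unsigned 32-bit value (inferred from the "4 billion" figure above); next_request_id(), ids_between() and demo_wrap() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the request id counter; the 32-bit width is an assumption
 * based on the "4 billion" figure in this mail. */
static uint32_t xs_request_id;

/* Hand out the next id; unsigned overflow wraps to 0 by definition in C. */
uint32_t next_request_id(void)
{
    return xs_request_id++;
}

/* Number of ids handed out between an older and a newer id, modulo 2^32. */
uint32_t ids_between(uint32_t older, uint32_t newer)
{
    return newer - older;
}

/* Usage: ids handed out around the wrap point. */
void demo_wrap(void)
{
    xs_request_id = UINT32_MAX;
    assert(next_request_id() == UINT32_MAX);
    assert(next_request_id() == 0);            /* wrapped around */
    assert(ids_between(UINT32_MAX, 0) == 1);   /* only one id apart */
}
```

An id can only be handed out a second time after 2^32 further ids have been issued, so a collision would need the old request to still be pending at that point.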


Juergen



 

