Re: Xen 4.18 release: Reminder about code freeze


  • To: Julien Grall <julien@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • From: Juergen Gross <jgross@xxxxxxxx>
  • Date: Fri, 13 Oct 2023 17:04:08 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Henry Wang <Henry.Wang@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, "community.manager@xxxxxxxxxxxxxx" <community.manager@xxxxxxxxxxxxxx>
  • Delivery-date: Fri, 13 Oct 2023 15:04:17 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 13.10.23 16:51, Julien Grall wrote:
Hi Juergen,

On 13/10/2023 15:47, Juergen Gross wrote:
On 13.10.23 13:22, Julien Grall wrote:
Hi George,

On 13/10/2023 11:16, George Dunlap wrote:
On Thu, Oct 12, 2023 at 11:36 PM Stefano Stabellini
<sstabellini@xxxxxxxxxx> wrote:

On Thu, 12 Oct 2023, George Dunlap wrote:
Stop tinkering in the hope that it hides the problem.  You're only
making it harder to fix properly.

Making it harder to fix properly would be a valid reason not to commit
the (maybe partial) fix. But looking at the fix again:

diff --git a/tools/xenstored/domain.c b/tools/xenstored/domain.c
index a6cd199fdc..9cd6678015 100644
--- a/tools/xenstored/domain.c
+++ b/tools/xenstored/domain.c
@@ -989,6 +989,7 @@ static struct domain *introduce_domain(const void *ctx,
                 talloc_steal(domain->conn, domain);

                 if (!restore) {
+                       domain_conn_reset(domain);
                         /* Notify the domain that xenstore is available */
                         interface->connection = XENSTORE_CONNECTED;
                         xenevtchn_notify(xce_handle, domain->port);
@@ -1031,8 +1032,6 @@ int do_introduce(const void *ctx, struct connection *conn,
         if (!domain)
                 return errno;

-       domain_conn_reset(domain);
-
         send_ack(conn, XS_INTRODUCE);

It is a 1-line movement. Textually small. Easy to understand and to
revert. It doesn't seem to be making things harder to fix? We could
revert it any time if a better fix is offered.

Maybe we could have a XXX note in the commit message or in-code
comment?

It moves a line from one function (do_domain_introduce()) into a
completely different function (introduce_domain()), nested inside two
if() statements; with no analysis on how the change will impact
things.

I am not the original author of the patch, and I am not the maintainer
of the code, so I don't feel I have the qualifications to give you the
answers you are seeking. Julien as author of the patch and xenstore
reviewer might be in a better position to answer. Or Juergen as xenstore
maintainer.

I understand that; my main point is that the change is more complex
than you're characterizing it.  This is information necessary to
understand whether the patch is correct, but it's not in the patch
description, nor in the subsequent thread back in May.

Are there any paths through do_domain_introduce() that now *won't* get
a domain_conn_reset() call?  Is that OK?

Yes, the already-introduced and the restore code paths. The operations in
the already-introduced or the restore code paths seem simple enough not
to require a domain_conn_reset. Julien and Juergen should confirm.

There is no "restore" codepath through do_domain_introduce(); it
passes "false" for the "restore" argument.  So we only have two paths
to consider through do_domain_introduce(): The "not introduced and not
restoring" path, and the "already-introduced" path.

I'm not sure what the "simple" elements on the branch in
introduce_domain() have to do with whether the content of the page
needs to be cleaned up.  As I said, I don't 100% understand this code,
but it seems like if anything, the reset would be *more* important to
have in the "reintroduce" case than in the "initial introduction"
case, since I'd expect the "initial introduction" case to be empty
already.
Indeed, there should be no watches/transactions/buffered I/O for the initial introduction. However, the function also clears part of the interface, because we can't guarantee it was zeroed.

The latter matters for the initial introduction. I believe the rest is just called for simplicity.


Doesn't it seem weird to you that we set a connection to CONNECTED,
notify the domain that it is ready to go, and only *after* that we reset
the connection to zero?

What happens if a domain starts using the connection as soon as it
receives the event channel notification and before domain_conn_reset is
called?

Yes, it does seem weird, which is why I said the following. :-)

I mean, it certainly seems strange to set the state to CONNECTED, send
off an event channel, and then after that delete all watches /
transactions / buffered data and so on;

But just because the current code is probably wrong, doesn't mean that
the modified code is probably correct.

If the problem is the delay between the xenevtchn_notify() in
introduce_domain() and the domain_conn_reset() afterwards in
do_introduce(), would it make sense instead to move the notification into
do_introduce(), after the domain_conn_reset()?  It is, after all, in
response to XS_INTRODUCE that we want to send the notification, not in
dom0_init() or read_state_connection() (which seems to be more about
restoring a domain).
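
To make the ordering concern concrete, here is a minimal, self-contained sketch. It is not xenstored code: `struct demo_conn` and every `demo_*` function are hypothetical stand-ins, and the guest is assumed to queue one request as soon as it is notified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one connection; not the real xenstored structures. */
struct demo_conn {
    uint32_t req_prod;  /* number of requests queued by the guest */
    int connected;      /* non-zero once marked as connected */
};

/* Stand-in for domain_conn_reset(): drop everything queued so far. */
static void demo_reset(struct demo_conn *c)
{
    c->req_prod = 0;
}

/* Assumed guest behaviour: queue a request as soon as it is notified. */
static void demo_guest_on_notify(struct demo_conn *c)
{
    if (c->connected)
        c->req_prod++;
}

/* Notify first, reset afterwards: the guest's request is wiped. */
static uint32_t demo_notify_then_reset(struct demo_conn *c)
{
    assert(c != NULL);
    c->connected = 1;
    demo_guest_on_notify(c);
    demo_reset(c);
    return c->req_prod;
}

/* Reset first, notify afterwards: the guest's request survives. */
static uint32_t demo_reset_then_notify(struct demo_conn *c)
{
    assert(c != NULL);
    demo_reset(c);
    c->connected = 1;
    demo_guest_on_notify(c);
    return c->req_prod;
}
```

In this model, starting from a connection with stale entries, the first ordering ends with zero queued requests and the second with one. It only illustrates the window being discussed; it says nothing about which call site is right for the real code.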

I understand that the event channel notification was specifically added for dom0less. But I don't see why we don't want to send it to dom0 as well.

Technically, dom0 has exactly the same problem as dom0less domains: it boots before Xenstored is running and therefore it may need to know when it is ready to receive commands.

Umm, no, not really.

The main difference between dom0 and a dom0less domU is that xenstored
introduces dom0 by itself via a call of dom0_init(), while the dom0less
domUs get introduced by Xen tools in case a dom0 is coming up later. And
that XS_INTRODUCE will clobber any ring page contents, while a call of
dom0_init() won't do that.

Dom0 (especially the kernel) is fine to start filling the ring page with
requests even before xenstored is running. It just shouldn't expect to
receive any responses right away.
I am not sure what you mean by fine. You will see hang notifications if Xenstored is not started in time. Isn't that why we decided to go with a different way for dom0less?

The main difference is that dom0 tells xenstored the connection parameters for
itself, so dom0 _knows_ that the ring page is set up correctly when xenstored
starts looking at it (it is dom0 which needs to do the ring page init).

A dom0less domU doesn't have that negotiation with xenstored, as xenstored just
uses the pre-defined grant for looking at the ring page. For the domU there is
no way to tell that xenstored has initialized the ring page (it is not the domU
that does the initialization, as the XS_INTRODUCE might be sent before the domU
even starts running), other than the "connected" indicator in the page itself.
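
As a rough illustration of the dom0less case, the sketch below models a guest that refuses to use the ring until the page itself says "connected". All names (`struct demo_ring`, `DEMO_CONNECTED`, the `demo_*` functions) are hypothetical; the real interface layout and constants live in Xen's public xs_wire.h, and real code would also need memory barriers and an event channel, which are left out here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values for this sketch only. */
enum { DEMO_CONNECTED = 0, DEMO_RECONNECT = 1 };

/* Hypothetical, simplified stand-in for the shared ring page. */
struct demo_ring {
    char req[16];         /* request buffer */
    uint32_t req_cons;    /* consumer index (xenstored side) */
    uint32_t req_prod;    /* producer index (guest side) */
    uint32_t connection;  /* the "connected" indicator in the page */
};

/* xenstored side: initialise the page, then publish readiness. */
static void demo_introduce(struct demo_ring *ring, int *notified)
{
    assert(ring != NULL && notified != NULL);
    memset(ring->req, 0, sizeof(ring->req));
    ring->req_cons = 0;
    ring->req_prod = 0;
    ring->connection = DEMO_CONNECTED;  /* only now is the page usable */
    *notified = 1;                      /* models the event channel kick */
}

/* domU side: queue one byte only if the page says it is connected. */
static int demo_guest_send(struct demo_ring *ring, char byte)
{
    assert(ring != NULL);
    if (ring->connection != DEMO_CONNECTED)
        return -1;  /* page not initialised yet: do not touch the ring */
    ring->req[ring->req_prod % sizeof(ring->req)] = byte;
    ring->req_prod++;
    return 0;
}
```

Here a send attempted before `demo_introduce()` is rejected, and one attempted afterwards lands in a clean ring. The indicator in the page is the only thing the guest consults.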


Juergen

