[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Xen 4.18 release: Reminder about code freeze

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
From: Stefano Stabellini <sstabellini@xxxxxxxxxx>
Date: Wed, 27 Sep 2023 17:18:07 -0700 (PDT)
Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>, Henry Wang <Henry.Wang@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, "jgross@xxxxxxxx" <jgross@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, "community.manager@xxxxxxxxxxxxxx" <community.manager@xxxxxxxxxxxxxx>
Delivery-date: Thu, 28 Sep 2023 00:18:35 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, 28 Sep 2023, Andrew Cooper wrote:
> On 28/09/2023 12:04 am, Stefano Stabellini wrote:
> > On Mon, 25 Sep 2023, Henry Wang wrote:
> >> 3. dom0less vs xenstored setup race Was: xen | Failed pipeline for staging 
> >> | 6a47ba2f
> >>
> >> https://marc.info/?l=xen-devel&m=168312468808977
> >>
> >> https://marc.info/?l=xen-devel&m=168312687610283
> > For this issue I suggest to go with this fix unless someone can produce
> > a better fix before RC2. I don't think we should cut RC2 with this issue
> > unsolved.
> >
> > ---
> > [PATCH] xenstored: reset domain connection before XENSTORE_CONNECTED
> >
> > From: Julien Grall <julien@xxxxxxx>
> >
> > xenstored will set interface->connection to XENSTORE_CONNECTED before
> > finalizing the connection which can cause initialization errors on the
> > guest side. Make sure to call domain_conn_reset(domain) before setting
> > XENSTORE_CONNECTED.
> >
> > Signed-off-by: Julien Grall <julien@xxxxxxx>
> > [stefano: rebase, commit message]
> > Signed-off-by: Stefano Stabellini <stefano.stabellini@xxxxxxx>
> > Acked-by: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> 
> No - this hasn't got any better at fixing the problem than the last time
> it failed to fix the problem.

It does solve one of the issues: the sporadic failure of the gitlab-ci
job. Even if the fix is not complete, if nothing else, it does that.

Of course, if someone comes up with a better fix all the better!

There hasn't been a lot of movement on this issue so I am being
pessimistic and I am offering the (maybe partial) solution we have
today. I don't mean to take anything away from the value of doing a
better fix.


> You cannot have 3 entities in parallel fight for control in a 2-way
> communication channel.
> 
> Failure to understand this is what created the problem to begin with.
> 
> You took an existing ABI from oxenstored, and implemented it
> incompatibly in other entities, had init-dom0less corrupt a shared comms
> buffer that it isn't the producer or consumer of, and added bug in Linux
> because you didn't write down the behaviour you wanted, let alone the
> behaviour you actually provided.

I think I could write a document covering the intended behavior at the
time of contributing the original feature. That might be good to have
regardless of the bug.


> Stop tinkering in the hope that it hides the problem.  You're only
> making it harder to fix properly.
 
Making it harder to fix properly would be a valid reason not to commit
the (maybe partial) fix. But looking at the fix again:

diff --git a/tools/xenstored/domain.c b/tools/xenstored/domain.c
index a6cd199fdc..9cd6678015 100644
--- a/tools/xenstored/domain.c
+++ b/tools/xenstored/domain.c
@@ -989,6 +989,7 @@ static struct domain *introduce_domain(const void *ctx,
                talloc_steal(domain->conn, domain);
 
                if (!restore) {
+                       domain_conn_reset(domain);
                        /* Notify the domain that xenstore is available */
                        interface->connection = XENSTORE_CONNECTED;
                        xenevtchn_notify(xce_handle, domain->port);
@@ -1031,8 +1032,6 @@ int do_introduce(const void *ctx, struct connection *conn,
        if (!domain)
                return errno;
 
-       domain_conn_reset(domain);
-
        send_ack(conn, XS_INTRODUCE);

It is a 1-line movement. Textually small. Easy to understand and to
revert. It doesn't seem to be making things harder to fix? We could
revert it any time if a better fix is offered.

Maybe we could have a XXX note in the commit message or in-code
comment?


> Tell me, when was the last time this failed...

I went through all the gitlab reports and this is the last one I found
which is from 1 month ago:
https://gitlab.com/xen-project/xen/-/pipelines/953569140

Even if the heinsenbug manifests only once a month I think it is bad.
Of course it is up to Henry but I don't think we should go far into the
release process without this problem being (at least partially) fixed.

References:
- Xen 4.18 release: Reminder about code freeze
  - From: Henry Wang
- Re: Xen 4.18 release: Reminder about code freeze
  - From: Stefano Stabellini
- Re: Xen 4.18 release: Reminder about code freeze
  - From: Andrew Cooper

Prev by Date: Re: [PATCH v3] docs/misra: add 14.3
Next by Date: Re: [PATCH] docs/misra: add rule 11.9
Previous by thread: Re: Xen 4.18 release: Reminder about code freeze
Next by thread: [linux-linus test] 183146: tolerable FAIL - PUSHED
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.