[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Xen 4.18 release: Reminder about code freeze



On Thu, 28 Sep 2023, Andrew Cooper wrote:
> On 28/09/2023 12:04 am, Stefano Stabellini wrote:
> > On Mon, 25 Sep 2023, Henry Wang wrote:
> >> 3. dom0less vs xenstored setup race Was: xen | Failed pipeline for staging 
> >> | 6a47ba2f
> >>
> >> https://marc.info/?l=xen-devel&m=168312468808977
> >>
> >> https://marc.info/?l=xen-devel&m=168312687610283
> > For this issue I suggest to go with this fix unless someone can produce
> > a better fix before RC2. I don't think we should cut RC2 with this issue
> > unsolved.
> >
> > ---
> > [PATCH] xenstored: reset domain connection before XENSTORE_CONNECTED
> >
> > From: Julien Grall <julien@xxxxxxx>
> >
> > xenstored will set interface->connection to XENSTORE_CONNECTED before
> > finalizing the connection which can cause initialization errors on the
> > guest side. Make sure to call domain_conn_reset(domain) before setting
> > XENSTORE_CONNECTED.
> >
> > Signed-off-by: Julien Grall <julien@xxxxxxx>
> > [stefano: rebase, commit message]
> > Signed-off-by: Stefano Stabellini <stefano.stabellini@xxxxxxx>
> > Acked-by: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> 
> No - this hasn't got any better at fixing the problem than the last time
> it failed to fix the problem.

It does solve one of the issues: the sporadic failure of the gitlab-ci
job. Even if the fix is not complete, if nothing else, it does that.

Of course, if someone comes up with a better fix all the better!

There hasn't been a lot of movement on this issue so I am being
pessimistic and I am offering the (maybe partial) solution we have
today. I don't mean to take anything away from the value of doing a
better fix.


> You cannot have 3 entities in parallel fight for control in a 2-way
> communication channel.
> 
> Failure to understand this is what created the problem to begin with.
> 
> You took an existing ABI from oxenstored, and implemented it
> incompatibly in other entities, had init-dom0less corrupt a shared comms
> buffer that it isn't the producer or consumer of, and added bug in Linux
> because you didn't write down the behaviour you wanted, let alone the
> behaviour you actually provided.

I think I could write a document covering the intended behavior at the
time of contributing the original feature. That might be good to have
regardless of the bug.


> Stop tinkering in the hope that it hides the problem.  You're only
> making it harder to fix properly.
 
Making it harder to fix properly would be a valid reason not to commit
the (maybe partial) fix. But looking at the fix again:

diff --git a/tools/xenstored/domain.c b/tools/xenstored/domain.c
index a6cd199fdc..9cd6678015 100644
--- a/tools/xenstored/domain.c
+++ b/tools/xenstored/domain.c
@@ -989,6 +989,7 @@ static struct domain *introduce_domain(const void *ctx,
                talloc_steal(domain->conn, domain);
 
                if (!restore) {
+                       domain_conn_reset(domain);
                        /* Notify the domain that xenstore is available */
                        interface->connection = XENSTORE_CONNECTED;
                        xenevtchn_notify(xce_handle, domain->port);
@@ -1031,8 +1032,6 @@ int do_introduce(const void *ctx, struct connection *conn,
        if (!domain)
                return errno;
 
-       domain_conn_reset(domain);
-
        send_ack(conn, XS_INTRODUCE);

It is a 1-line movement. Textually small. Easy to understand and to
revert. It doesn't seem to be making things harder to fix? We could
revert it any time if a better fix is offered.

Maybe we could have a XXX note in the commit message or in-code
comment?


> Tell me, when was the last time this failed...

I went through all the gitlab reports and this is the last one I found
which is from 1 month ago:
https://gitlab.com/xen-project/xen/-/pipelines/953569140

Even if the heinsenbug manifests only once a month I think it is bad.
Of course it is up to Henry but I don't think we should go far into the
release process without this problem being (at least partially) fixed.

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.