[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: Xen 4.18 release: Reminder about code freeze
On Thu, 28 Sep 2023, Andrew Cooper wrote: > On 28/09/2023 12:04 am, Stefano Stabellini wrote: > > On Mon, 25 Sep 2023, Henry Wang wrote: > >> 3. dom0less vs xenstored setup race Was: xen | Failed pipeline for staging > >> | 6a47ba2f > >> > >> https://marc.info/?l=xen-devel&m=168312468808977 > >> > >> https://marc.info/?l=xen-devel&m=168312687610283 > > For this issue I suggest to go with this fix unless someone can produce > > a better fix before RC2. I don't think we should cut RC2 with this issue > > unsolved. > > > > --- > > [PATCH] xenstored: reset domain connection before XENSTORE_CONNECTED > > > > From: Julien Grall <julien@xxxxxxx> > > > > xenstored will set interface->connection to XENSTORE_CONNECTED before > > finalizing the connection which can cause initialization errors on the > > guest side. Make sure to call domain_conn_reset(domain) before setting > > XENSTORE_CONNECTED. > > > > Signed-off-by: Julien Grall <julien@xxxxxxx> > > [stefano: rebase, commit message] > > Signed-off-by: Stefano Stabellini <stefano.stabellini@xxxxxxx> > > Acked-by: Stefano Stabellini <sstabellini@xxxxxxxxxx> > > No - this hasn't got any better at fixing the problem than the last time > it failed to fix the problem. It does solve one of the issues: the sporadic failure of the gitlab-ci job. Even if the fix is not complete, if nothing else, it does that. Of course, if someone comes up with a better fix all the better! There hasn't been a lot of movement on this issue so I am being pessimistic and I am offering the (maybe partial) solution we have today. I don't mean to take anything away from the value of doing a better fix. > You cannot have 3 entities in parallel fight for control in a 2-way > communication channel. > > Failure to understand this is what created the problem to begin with. > > You took an existing ABI from oxenstored, and implemented it > incompatibly in other entities, had init-dom0less corrupt a shared comms > buffer that it isn't the producer or consumer of, and added bug in Linux > because you didn't write down the behaviour you wanted, let alone the > behaviour you actually provided. I think I could write a document covering the intended behavior at the time of contributing the original feature. That might be good to have regardless of the bug. > Stop tinkering in the hope that it hides the problem. You're only > making it harder to fix properly. Making it harder to fix properly would be a valid reason not to commit the (maybe partial) fix. But looking at the fix again: diff --git a/tools/xenstored/domain.c b/tools/xenstored/domain.c index a6cd199fdc..9cd6678015 100644 --- a/tools/xenstored/domain.c +++ b/tools/xenstored/domain.c @@ -989,6 +989,7 @@ static struct domain *introduce_domain(const void *ctx, talloc_steal(domain->conn, domain); if (!restore) { + domain_conn_reset(domain); /* Notify the domain that xenstore is available */ interface->connection = XENSTORE_CONNECTED; xenevtchn_notify(xce_handle, domain->port); @@ -1031,8 +1032,6 @@ int do_introduce(const void *ctx, struct connection *conn, if (!domain) return errno; - domain_conn_reset(domain); - send_ack(conn, XS_INTRODUCE); It is a 1-line movement. Textually small. Easy to understand and to revert. It doesn't seem to be making things harder to fix? We could revert it any time if a better fix is offered. Maybe we could have a XXX note in the commit message or in-code comment? > Tell me, when was the last time this failed... I went through all the gitlab reports and this is the last one I found which is from 1 month ago: https://gitlab.com/xen-project/xen/-/pipelines/953569140 Even if the heinsenbug manifests only once a month I think it is bad. Of course it is up to Henry but I don't think we should go far into the release process without this problem being (at least partially) fixed.
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |