Re: dom0less vs xenstored setup race (Was: xen | Failed pipeline for staging | 6a47ba2f)
On Wed, 3 May 2023, andrew.cooper3@xxxxxxxxxx wrote:
> On 03/05/2023 10:53 pm, Stefano Stabellini wrote:
> > On Wed, 3 May 2023, Julien Grall wrote:
> >> On 03/05/2023 15:38, andrew.cooper3@xxxxxxxxxx wrote:
> >>> Hello,
> >>>
> >>> After what seems like an unreasonable amount of debugging, we've
> >>> tracked down exactly what is going wrong here.
> >>>
> >>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
> >>>
> >>> Of note is the smoke.serial log around:
> >>>
> >>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
> >>> obj: CREATE connection 0xffff90fff1f0
> >>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
> >>> 00000000, rsp_prod 00000000
> >>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
> >>> (@introduceDomain domlist )
> >>>
> >>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
> >>> clobbers the ring pointers.  The added pressure on dom0 that
> >>> xenconsoled adds with its 4M hypercall bounce buffer occasionally
> >>> defers xenstored long enough that the XS_INTRODUCE clobbers the
> >>> first message that dom1 wrote into the ring.
> >>>
> >>> The other behaviour seen was xenstored observing a header looking
> >>> like this:
> >>>
> >>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
> >>> 0x6d726f66 }
> >>>
> >>> which was rejected as being too long.  That's "control/platform" in
> >>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the
> >>> header and writing the payload.
> >>>
> >>>
> >>> Anyway, it is buggy for XS_INTRODUCE to be called on a live and
> >>> unsuspecting connection.  It is ultimately init-dom0less's fault
> >>> for telling dom1 it's good to go before having waited for
> >>> XS_INTRODUCE to complete.
> >>
> >> So the problem is xenstored will set interface->connection to
> >> XENSTORE_CONNECTED before finalizing the connection. Can you try
> >> the following, for now, very hackish patch:
> >>
> >> diff --git a/tools/xenstore/xenstored_domain.c
> >> b/tools/xenstore/xenstored_domain.c
> >> index f62be2245c42..bbf85bbbea3b 100644
> >> --- a/tools/xenstore/xenstored_domain.c
> >> +++ b/tools/xenstore/xenstored_domain.c
> >> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
> >>          talloc_steal(domain->conn, domain);
> >>
> >>          if (!restore) {
> >> +                domain_conn_reset(domain);
> >>                  /* Notify the domain that xenstore is available */
> >>                  interface->connection = XENSTORE_CONNECTED;
> >>                  xenevtchn_notify(xce_handle, domain->port);
> >> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection
> >> *conn,
> >>          if (!domain)
> >>                  return errno;
> >>
> >> -        domain_conn_reset(domain);
> >> -
> >>          send_ack(conn, XS_INTRODUCE);
> >
> > Following Juergen's suggestion, I made this slightly modified
> > version of the patch. With it, the problem is solved:
> >
> > https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703
>
> This fails to solve 3(?) of the 4(?) bugs pointed out between this
> email thread and IRC.
>
> Stop with the bull-in-a-china-shop approach.  There is no acceptable
> fix to this mess which starts with anything other than corrections to
> the documentation, and a plan for how to make startup work robustly
> given all the bugs introduced previously by failing to do it properly
> the first time around.

I am not suggesting this is the fix (I didn't add any Signed-off-by or
commit message on purpose).
I think it is useful to know whether a theoretical proposal would work
in practice, as it helps us understand the problem. In the little time
I had between meetings, I thought I could help a bit by providing this
update. Like you, I would appreciate a comprehensive fix that includes
a documentation update. Genuine question: how would you like to
proceed, in the project-management sense of who does what and on what
timeline?
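
[For readers following the thread: the shared page that the
"CONN RESET" log line above refers to is the standard per-domain
xenstore ring. A minimal sketch of its layout, modelled on
xen/include/public/io/xs_wire.h -- field names reproduced from memory,
so verify against the tree before relying on exact names or offsets:]

/* Sketch of the per-domain xenstore shared page, modelled on
 * xen/include/public/io/xs_wire.h.  Not authoritative -- check the
 * real header. */
#include <stdint.h>

typedef uint32_t XENSTORE_RING_IDX;
#define XENSTORE_RING_SIZE 1024

struct xenstore_domain_interface {
    char req[XENSTORE_RING_SIZE];   /* requests to the xenstore daemon  */
    char rsp[XENSTORE_RING_SIZE];   /* replies and async watch events   */
    XENSTORE_RING_IDX req_cons, req_prod;
    XENSTORE_RING_IDX rsp_cons, rsp_prod;
    uint32_t server_features;       /* features supported by xenstored  */
    uint32_t connection;            /* XENSTORE_CONNECTED / _RECONNECT  */
};

[The "CONN RESET req_cons 00000000, req_prod 0000003a" line in the log
shows dom1 had already pushed 0x3a bytes of request when the reset
zeroed both indices, silently discarding that message.]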
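[The '"control/platform" in ASCII' observation can be checked
mechanically. A throwaway decoder -- hypothetical helper, assuming a
little-endian host as on the Arm test box:]

/* Check that the four "header" words xenstored logged are really the
 * bytes of a node path.  Assumes a little-endian host. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    const uint32_t hdr[4] = { 0x746e6f63, 0x2f6c6f72,
                              0x74616c70, 0x6d726f66 };
    char s[sizeof(hdr) + 1];

    memcpy(s, hdr, sizeof(hdr));  /* u32s in memory order == byte stream */
    s[sizeof(hdr)] = '\0';
    puts(s);                      /* prints "control/platform" */
    return 0;
}

[In other words, the ring indices were rewound while dom1 was
mid-message, so xenstored re-read dom1's payload bytes as if they were
a fresh struct xsd_sockmsg header, and a "len" of 0x6d726f66 ("form")
is far larger than any permitted payload.]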
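[Why the one-line move in Julien's patch matters: in the dom0less
late-init flow the guest keys off the connection field that
introduce_domain() flips. A hypothetical guest-side view --
wait_for_xenstore() is illustrative, not code from the tree:]

/* Hypothetical guest-side view of the dom0less xenstore handshake.
 * dom1 waits for connection to become XENSTORE_CONNECTED and then
 * starts writing into req[].  If xenstored performs its ring reset
 * *after* flipping the field (as do_introduce() did before the patch
 * above), anything the guest wrote in that window is clobbered. */
#define XENSTORE_CONNECTED 0  /* value as in xs_wire.h */

static void wait_for_xenstore(volatile struct xenstore_domain_interface *intf)
{
    while (intf->connection != XENSTORE_CONNECTED)
        ;  /* a real guest would block on the event channel, not spin */

    /* Only now is it safe to advance req_prod -- provided the reset
     * happened before the flip, which is what the patch enforces. */
}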