[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f



On Wed, 3 May 2023, andrew.cooper3@xxxxxxxxxx wrote:
> On 03/05/2023 10:53 pm, Stefano Stabellini wrote:
> > On Wed, 3 May 2023, Julien Grall wrote:
> >> On 03/05/2023 15:38, andrew.cooper3@xxxxxxxxxx wrote:
> >>> Hello,
> >>>
> >>> After what seems like an unreasonable amount of debugging, we've tracked
> >>> down exactly what is going wrong here.
> >>>
> >>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
> >>>
> >>> Of note is the smoke.serial log around:
> >>>
> >>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
> >>> obj: CREATE connection 0xffff90fff1f0
> >>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
> >>> 00000000, rsp_prod 00000000
> >>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
> >>> (@introduceDomain domlist )
> >>>
> >>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
> >>> clobbers the ring pointers.  The added pressure on dom0 that the
> >>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally
> >>> defers xenstored long enough that the XS_INTRODUCE clobbers the first
> >>> message that dom1 wrote into the ring.
> >>>
> >>> The other behaviour seen was xenstored observing a header looking like 
> >>> this:
> >>>
> >>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
> >>> 0x6d726f66 }
> >>>
> >>> which was rejected as being too long.  That's "control/platform" in
> >>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
> >>> and writing the payload.
> >>>
> >>>
> >>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an
> >>> unsuspecting connection.  It is ultimately init-dom0less's fault for
> >>> telling dom1 it's good to go before having waited for XS_INTRODUCE to
> >>> complete.
> >> So the problem is xenstored will set interface->connection to
> >> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the
> >> following, for now, very hackish patch:
> >>
> >> diff --git a/tools/xenstore/xenstored_domain.c
> >> b/tools/xenstore/xenstored_domain.c
> >> index f62be2245c42..bbf85bbbea3b 100644
> >> --- a/tools/xenstore/xenstored_domain.c
> >> +++ b/tools/xenstore/xenstored_domain.c
> >> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
> >>                 talloc_steal(domain->conn, domain);
> >>
> >>                 if (!restore) {
> >> +                       domain_conn_reset(domain);
> >>                         /* Notify the domain that xenstore is available */
> >>                         interface->connection = XENSTORE_CONNECTED;
> >>                         xenevtchn_notify(xce_handle, domain->port);
> >> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection 
> >> *conn,
> >>         if (!domain)
> >>                 return errno;
> >>
> >> -       domain_conn_reset(domain);
> >> -
> >>         send_ack(conn, XS_INTRODUCE);
> > Following Jurgen's suggestion, I made this slightly modified version of
> > the patch. With it, the problem is solved:
> >
> > https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703
> 
> This fails to solve 3(?) of the 4(?) bugs pointed out between this email
> thread and on IRC.
> 
> Stop with the bull-in-a-china-shop approach.  There is no acceptable fix
> to this mess which starts with anything other than corrections to the
> documentation, and a plan for how to make startup work robustly given
> all the bugs introduced previously by failing to do it properly the
> first time around.

I am not suggesting this is the fix (I didn't add any Signed-off-by or
commit message on purpose). I think it is useful to know if a
theoretical proposal would work in practice as it helps us understand
the problem. In the little time I had in-between meetings I thought I
could help a bit by providing this update.

Like you, I would appreciate a comprehensive fix which includes a
documentation update.

Genuine question: how would you like to proceed? In the project
management sense of who does what and what is the suggested timeline.

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.