[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f



On Wed, 3 May 2023, Julien Grall wrote:
> On 03/05/2023 15:38, andrew.cooper3@xxxxxxxxxx wrote:
> > Hello,
> > 
> > After what seems like an unreasonable amount of debugging, we've tracked
> > down exactly what is going wrong here.
> > 
> > https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
> > 
> > Of note is the smoke.serial log around:
> > 
> > io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
> > obj: CREATE connection 0xffff90fff1f0
> > *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
> > 00000000, rsp_prod 00000000
> > io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
> > (@introduceDomain domlist )
> > 
> > XS_INTRODUCE (in C xenstored at least, not checked O yet) always
> > clobbers the ring pointers.  The added pressure on dom0 that the
> > xensconsoled adds with it's 4M hypercall bounce buffer occasionally
> > defers xenstored long enough that the XS_INTRODUCE clobbers the first
> > message that dom1 wrote into the ring.
> > 
> > The other behaviour seen was xenstored observing a header looking like this:
> > 
> > *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
> > 0x6d726f66 }
> > 
> > which was rejected as being too long.  That's "control/platform" in
> > ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
> > and writing the payload.
> > 
> > 
> > Anyway, it is buggy for XS_INTRODUCE to be called on a live an
> > unsuspecting connection.  It is ultimately init-dom0less's fault for
> > telling dom1 it's good to go before having waited for XS_INTRODUCE to
> > complete.
> 
> So the problem is xenstored will set interface->connection to
> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the
> following, for now, very hackish patch:
> 
> diff --git a/tools/xenstore/xenstored_domain.c
> b/tools/xenstore/xenstored_domain.c
> index f62be2245c42..bbf85bbbea3b 100644
> --- a/tools/xenstore/xenstored_domain.c
> +++ b/tools/xenstore/xenstored_domain.c
> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
>                 talloc_steal(domain->conn, domain);
> 
>                 if (!restore) {
> +                       domain_conn_reset(domain);
>                         /* Notify the domain that xenstore is available */
>                         interface->connection = XENSTORE_CONNECTED;
>                         xenevtchn_notify(xce_handle, domain->port);
> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn,
>         if (!domain)
>                 return errno;
> 
> -       domain_conn_reset(domain);
> -
>         send_ack(conn, XS_INTRODUCE);

Following Jurgen's suggestion, I made this slightly modified version of
the patch. With it, the problem is solved:

https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703


diff --git a/tools/xenstore/xenstored_domain.c 
b/tools/xenstore/xenstored_domain.c
index f62be2245c..5ca160d9f2 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -688,6 +688,8 @@ static struct domain *introduce_domain(const void *ctx,
                talloc_steal(domain->conn, domain);
 
                if (!restore) {
+                       if (!is_master_domain)
+                               domain_conn_reset(domain);
                        /* Notify the domain that xenstore is available */
                        interface->connection = XENSTORE_CONNECTED;
                        xenevtchn_notify(xce_handle, domain->port);
@@ -730,8 +732,6 @@ int do_introduce(const void *ctx, struct connection *conn,
        if (!domain)
                return errno;
 
-       domain_conn_reset(domain);
-
        send_ack(conn, XS_INTRODUCE);
 
        return 0;

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.