
Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f


  • To: Stefano Stabellini <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>
  • From: andrew.cooper3@xxxxxxxxxx
  • Date: Wed, 3 May 2023 23:20:02 +0100
  • Cc: alejandro.vallejo@xxxxxxxxx, committers@xxxxxxxxxxxxxx, michal.orzel@xxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, Julien Grall <jgrall@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Edwin Török <edwin.torok@xxxxxxxxx>
  • Delivery-date: Wed, 03 May 2023 22:20:41 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 03/05/2023 10:53 pm, Stefano Stabellini wrote:
> On Wed, 3 May 2023, Julien Grall wrote:
>> On 03/05/2023 15:38, andrew.cooper3@xxxxxxxxxx wrote:
>>> Hello,
>>>
>>> After what seems like an unreasonable amount of debugging, we've tracked
>>> down exactly what is going wrong here.
>>>
>>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
>>>
>>> Of note is the smoke.serial log around:
>>>
>>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
>>> obj: CREATE connection 0xffff90fff1f0
>>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons 00000000, rsp_prod 00000000
>>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT (@introduceDomain domlist )
>>>
>>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
>>> clobbers the ring pointers.  The added pressure on dom0 that
>>> xenconsoled adds with its 4M hypercall bounce buffer occasionally
>>> defers xenstored long enough that the XS_INTRODUCE clobbers the first
>>> message that dom1 wrote into the ring.
>>>
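
To illustrate the race described above, here is a toy model in C. It is only a sketch under stated assumptions: `toy_ring`, `toy_guest_write`, `toy_conn_reset` and `toy_pending` are hypothetical names, not xenstored code. Only the `req_cons`/`req_prod` index names and the 0x3a value come from the log above.

```c
#include <stdint.h>

/* Toy stand-in for the request half of a shared ring (hypothetical). */
struct toy_ring {
    uint32_t req_cons;  /* consumer index, advanced by the daemon */
    uint32_t req_prod;  /* producer index, advanced by the guest  */
};

/* Guest side: queue len bytes by advancing the producer index. */
static void toy_guest_write(struct toy_ring *r, uint32_t len)
{
    r->req_prod += len;
}

/* Daemon side: a reset that unconditionally zeroes both indices. */
static void toy_conn_reset(struct toy_ring *r)
{
    r->req_cons = 0;
    r->req_prod = 0;
}

/* Number of request bytes the daemon still has to consume. */
static uint32_t toy_pending(const struct toy_ring *r)
{
    return r->req_prod - r->req_cons;
}
```

In this model, if the guest queues 0x3a bytes and the reset runs afterwards, `toy_pending()` drops from 0x3a to 0 and the message is silently lost. If the reset runs before the guest is told to start, the same 0x3a bytes stay pending.
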
>>> The other behaviour seen was xenstored observing a header looking like this:
>>>
>>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len 0x6d726f66 }
>>>
>>> which was rejected as being too long.  That's "control/platform" in
>>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
>>> and writing the payload.
>>>
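
The ASCII reading of that header can be checked with a small helper. This is a sketch, assuming the four header fields are 32-bit little-endian words. `words_to_ascii_le`, `toy_len_too_long` and `TOY_PAYLOAD_MAX` are hypothetical names, and the 4096-byte limit is an assumption about the usual xenstore payload maximum, not something stated in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed payload limit, for illustration only. */
#define TOY_PAYLOAD_MAX 4096u

/*
 * Unpack n 32-bit words into bytes, least significant byte first,
 * and NUL-terminate. out must have room for 4 * n + 1 bytes.
 */
static void words_to_ascii_le(const uint32_t *words, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        for (size_t b = 0; b < 4; b++)
            out[4 * i + b] = (char)((words[i] >> (8 * b)) & 0xff);
    out[4 * n] = '\0';
}

/* Nonzero if a header length field exceeds the assumed limit. */
static int toy_len_too_long(uint32_t len)
{
    return len > TOY_PAYLOAD_MAX;
}
```

Feeding it the words 0x746e6f63, 0x2f6c6f72, 0x74616c70 and 0x6d726f66 gives the bytes "cont", "rol/", "plat" and "form". The last word, read as a length, is far above any plausible payload limit, which is consistent with the header being rejected as too long.
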
>>>
>>> Anyway, it is buggy for XS_INTRODUCE to be called on a live and
>>> unsuspecting connection.  It is ultimately init-dom0less's fault for
>>> telling dom1 it's good to go before having waited for XS_INTRODUCE to
>>> complete.
>> So the problem is xenstored will set interface->connection to
>> XENSTORE_CONNECTED before finalizing the connection. Can you try the
>> following, for now, very hackish patch:
>>
>> diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
>> index f62be2245c42..bbf85bbbea3b 100644
>> --- a/tools/xenstore/xenstored_domain.c
>> +++ b/tools/xenstore/xenstored_domain.c
>> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
>>                 talloc_steal(domain->conn, domain);
>>
>>                 if (!restore) {
>> +                       domain_conn_reset(domain);
>>                         /* Notify the domain that xenstore is available */
>>                         interface->connection = XENSTORE_CONNECTED;
>>                         xenevtchn_notify(xce_handle, domain->port);
>> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn,
>>         if (!domain)
>>                 return errno;
>>
>> -       domain_conn_reset(domain);
>> -
>>         send_ack(conn, XS_INTRODUCE);
> Following Juergen's suggestion, I made this slightly modified version of
> the patch. With it, the problem is solved:
>
> https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703

This fails to solve 3(?) of the 4(?) bugs pointed out across this email
thread and IRC.

Stop with the bull-in-a-china-shop approach.  There is no acceptable fix
to this mess which starts with anything other than corrections to the
documentation, and a plan for how to make startup work robustly given
all the bugs introduced previously by failing to do it properly the
first time around.

~Andrew



 

