
Re: [Xen-devel] [PATCH 1/3] libxl: Fix libxl_postfork_child_noexec deadlock etc.



George Dunlap wrote:
> On Mon, Feb 24, 2014 at 3:47 PM, George Dunlap
> <George.Dunlap@xxxxxxxxxxxxx> wrote:
>   
>> On Mon, Feb 24, 2014 at 3:17 PM, George Dunlap
>> <george.dunlap@xxxxxxxxxxxxx> wrote:
>>     
>>> On 02/24/2014 02:19 PM, Ian Jackson wrote:
>>>       
>>>> libxl_postfork_child_noexec would nestedly reacquire the non-recursive
>>>> "no_forking" mutex: atfork_lock uses it, as does sigchld_user_remove.
>>>> The result on Linux is that the process always deadlocks before
>>>> returning from this function.
>>>>
>>>> This is used by xl's console child.  So, the ultimate effect is that
>>>> xl with pygrub does not manage to connect to the pygrub console.
>>>> This behaviour was reported by Michael Young in Xen 4.4.0 RC5.
>>>>
>>>> Also, the use of sigchld_user_remove in libxl_postfork_child_noexec is
>>>> not correct with SIGCHLD sharing.  libxl_postfork_child_noexec is
>>>> documented to suffice if called only on one ctx.  So deregistering the
>>>> ctx it's called on is not sufficient.  Instead, we need a new approach
>>>> which discards the whole sigchld_user list and unconditionally removes
>>>> our SIGCHLD handler if we had one.
>>>>
>>>> Prompted by this, clarify the semantics of
>>>> libxl_postfork_child_noexec.  Specifically, expand on the meaning of
>>>> "quickly" by explaining what operations are not permitted; and
>>>> document the fact that the function doesn't reclaim the resources in
>>>> the ctxs.
>>>>
>>>> And add a comment in libxl_postfork_child_noexec explaining the
>>>> internal concurrency situation.
>>>>
>>>> This is an important bugfix.  IMO the bug is a blocker for Xen 4.4.
>>>>
>>>> Signed-off-by: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
>>>> Reported-by: M A Young <m.a.young@xxxxxxxxxxxx>
>>>> CC: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
>>>> CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
>>>>         
>>> So it looks like this path gets called from a number of other places in xl:
>>>
>>> libxl_postfork_child_noexec() is called by xl.c:postfork().
>>>
>>> postfork() is called in xl_cmdimpl.c by autoconnect_vncviewer(),
>>> autoconnect_console(), and do_daemonize().
>>>
>>> do_daemonize() is called during "xl create", and "xl devd".
>>>
>>> Was this deadlock not triggered for those, or was it triggered and nobody
>>> noticed?
>>>       
>> In any case, I do think we need to fix this; the main question is, do
>> we need to delay the release a bit further to make sure it gets
>> sufficient testing?
>>     
>
> Also, it would be nice to get a Tested-by: from someone using it with
> libvirt (before the release at least, if not before the check-in).
>
> Jim / Dario?
>   

I'll update my test system to rc6 tomorrow and restart my tests.

FYI, the tests were running over the weekend on rc5 + libvirt 1.2.2
rc1.  Over 25,000 domains started, shut down, created, saved, restored,
etc. with no problems noted.

Regards,
Jim
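
For readers following the thread, the deadlock described in the quoted
commit message (a function holding a non-recursive mutex calls a helper
that takes the same mutex) can be sketched roughly as below. This is an
illustration only, not libxl code: the names no_forking,
helper_remove_handler, postfork_buggy and postfork_fixed are
hypothetical stand-ins, Python's threading.Lock stands in for the
pthread mutex, and a timeout is used so the sketch reports the hang
instead of actually hanging.

```python
import threading

# Hypothetical stand-in for the non-recursive "no_forking" mutex.
# threading.Lock is non-reentrant: the owning thread cannot take it twice.
no_forking = threading.Lock()


def helper_remove_handler(timeout=0.1):
    """Hypothetical helper that takes the lock itself, in the way the
    commit message says sigchld_user_remove does."""
    got = no_forking.acquire(timeout=timeout)
    if not got:
        # With a plain blocking acquire this call would never return.
        return False
    try:
        return True
    finally:
        no_forking.release()


def postfork_buggy():
    """Holds the lock, then calls a helper that wants the same lock."""
    with no_forking:
        return helper_remove_handler()


def postfork_fixed():
    """Assumed shape of a fix: do the cleanup while taking the lock
    only once, without calling back into a helper that locks again."""
    with no_forking:
        # cleanup would be performed inline here
        return True


if __name__ == "__main__":
    print("buggy path completed:", postfork_buggy())
    print("fixed path completed:", postfork_fixed())
```

In the buggy path the inner acquire can never succeed because the same
thread already owns the lock, which matches the "always deadlocks before
returning" symptom in the report. The fixed variant shows only one
possible restructuring; the actual patch should be consulted for what
libxl does.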

