
Re: [Xen-devel] [PATCH 2/2] xenbus: bypass xenbus frontend resume if xenstored is not running



On 02/05/13 11:30, Ian Campbell wrote:
> On Thu, 2013-05-02 at 11:10 +0100, Aurélien Chartier wrote:
>> On 02/05/13 10:24, Ian Campbell wrote:
>>> On Thu, 2013-05-02 at 10:21 +0100, Jan Beulich wrote:
>>>>>>> On 02.05.13 at 10:24, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote:
>>>>> On Wed, 2013-05-01 at 13:57 +0100, Aurelien Chartier wrote:
>>>>>> If the xenbus frontend runs in a domain that also runs xenstored, or in
>>>>>> dom0, device resume hangs because it happens before processes are
>>>>>> resumed. This patch adds extra logic to the resume code to check whether
>>>>>> we are dom0 or the domain running xenstored.
>>>>>>
>>>>>> The frontend will be reconnected later, when the backend resumes from S3.
>>>>>> This logic works when xenstored is running in dom0, but has not been
>>>>>> tested with a xenstore stub domain.
>>>>>> ---
>>>>>>  drivers/xen/xenbus/xenbus_probe_frontend.c |   15 ++++++++++++++-
>>>>>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/xen/xenbus/xenbus_probe_frontend.c b/drivers/xen/xenbus/xenbus_probe_frontend.c
>>>>>> index 3159a37..8583afe 100644
>>>>>> --- a/drivers/xen/xenbus/xenbus_probe_frontend.c
>>>>>> +++ b/drivers/xen/xenbus/xenbus_probe_frontend.c
>>>>>> @@ -89,9 +89,22 @@ static void backend_changed(struct xenbus_watch *watch,
>>>>>>          xenbus_otherend_changed(watch, vec, len, 1);
>>>>>>  }
>>>>>>  
>>>>>> +static int xenbus_frontend_dev_resume(struct device *dev)
>>>>>> +{
>>>>>> +        /*
>>>>>> +         * If xenstored is running in that domain, we cannot access the backend
>>>>>> +         * state at the moment. If we are running in dom0, the domain running
>>>>>> +         * xenstored is still suspended at that point
>>>>>> +         */
>>>>>> +        if (xen_initial_domain() || (xen_store_domain == XS_LOCAL))
>>>>>> +                return 0;
>>>>>> +
>>>>>> +        return xenbus_dev_resume(dev);
>>>>> When or where does this eventually get called for the init domain or
>>>>> XS_LOCAL cases?
>>>> I was about to ask the same question. Plus I don't think the
>>>> description here or in the overview mail really makes clear how
>>>> specifically a deadlock would occur here. That's pretty relevant to
>>>> understand in the light that so far we had no indication of there
>>>> being any special treatment necessary here, and resume from S3
>>>> had been working quite fine without that (at least as long as
>>>> xenstored is running in Dom0 and at least with the traditional/
>>>> forward-port/non-pvops kernels).
>>> I think the unusual feature here is that dom0 has a netfront attached.
>>> Netfront resume is therefore hanging because it is trying to talk to the
>>> still frozen xenstored process in dom0.
>>>
>>> Ian.
>>>
>> Yes, the unusual feature of having a netfront driver in dom0 is
>> triggering the S3 issue I described. Ian made me realize this issue
>> could also happen in Xenstore stub domains.
>>
>> The root cause of the issue is the assumption that a xenstored process
>> is running in another domain when the xenbus frontend is being resumed
>> from S3. This assumption is incorrect if xenstored and the xenbus
>> frontend are running in the same domain. As the Linux kernel waits for
>> all devices to be resumed before resuming userland tasks, the xenbus
>> frontend resume blocks the userland process resume while waiting for
>> xenstored (which cannot run, as it is a userland process).
>>
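The ordering problem described above can be reduced to a small model. This is a minimal, hypothetical C sketch: the phase and location names are illustrative assumptions, not kernel APIs, and the model ignores everything except the dependency between the device-resume phase and the thawing of userland.

```c
#include <stdbool.h>

/* Hypothetical model of the S3 resume sequence; names are illustrative,
 * not kernel APIs. */
enum resume_phase {
    PHASE_DEVICE_RESUME, /* drivers' resume callbacks run; userland frozen */
    PHASE_THAW_TASKS,    /* userland processes are thawed */
    PHASE_DONE
};

/* Where xenstored runs relative to the resuming domain. */
enum xenstored_location {
    XENSTORED_REMOTE, /* another domain, assumed to be running already */
    XENSTORED_LOCAL   /* a userland process in this very domain */
};

/* Can a xenstore request issued during 'phase' get an answer? */
static bool xenstored_can_answer(enum xenstored_location loc,
                                 enum resume_phase phase)
{
    if (loc == XENSTORED_REMOTE)
        return true;
    /* A local xenstored is a frozen task until userland is thawed. */
    return phase >= PHASE_THAW_TASKS;
}

/* The frontend's resume callback runs in PHASE_DEVICE_RESUME and blocks
 * until xenstored answers; thawing only starts once every device callback
 * has returned. If no answer is possible now, the PM core never reaches
 * PHASE_THAW_TASKS, i.e. the resume deadlocks. */
static bool resume_completes(enum xenstored_location loc)
{
    return xenstored_can_answer(loc, PHASE_DEVICE_RESUME);
}
```

Skipping or deferring the frontend callback when xenstored is local is what breaks the cycle; the patch above takes the skipping route (and also skips in dom0).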
>> With this patch, the xenbus_dev_resume function will not be called at all
>> for frontend devices such as netfront. I am relying on the fact that the
>> network backend domain will be resumed after dom0 resume is complete.
>> When that resume happens, it will trigger a call to netback_changed
>> in dom0 netfront. This call will end up resuming the xenbus states in netfront.
>>
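The reconnection path relied on here can be sketched as a reaction to the backend's xenbus state. The enum values below follow Xen's public io/xenbus.h header; the frontend structure and the handler are hypothetical simplifications in the spirit of netfront's netback_changed, not the driver's actual code. The sketch assumes the frontend has already been put back into XenbusStateInitialising before the backend reappears.

```c
#include <stdbool.h>

/* Values as defined in Xen's public io/xenbus.h header. */
enum xenbus_state {
    XenbusStateUnknown = 0,
    XenbusStateInitialising = 1,
    XenbusStateInitWait = 2,
    XenbusStateInitialised = 3,
    XenbusStateConnected = 4,
    XenbusStateClosing = 5,
    XenbusStateClosed = 6,
    XenbusStateReconfiguring = 7,
    XenbusStateReconfigured = 8
};

/* Hypothetical, stripped-down frontend; not struct xenbus_device. */
struct model_frontend {
    enum xenbus_state state;
    int reconnects; /* how many times the rings were (re)published */
};

/* Simplified reaction to a change in the backend's state. */
static void model_backend_changed(struct model_frontend *fe,
                                  enum xenbus_state backend)
{
    switch (backend) {
    case XenbusStateInitWait:
        /* Backend waits for our ring details: connect, but only once
         * per reset of the frontend to Initialising. */
        if (fe->state != XenbusStateInitialising)
            break;
        fe->reconnects++;
        fe->state = XenbusStateConnected;
        break;
    case XenbusStateClosing:
        fe->state = XenbusStateClosed;
        break;
    default:
        /* Other backend states need no action in this model. */
        break;
    }
}
```

A repeated InitWait notification is ignored once the frontend is connected, so a spurious watch event does not reconnect twice.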
>> That logic is working for a dom0 netfront, as we can safely rely on the
>> fact that the network backend domain will be resumed after dom0 resume
>> is complete. I don't have a Xen configuration with a Xenstore stub domain,
>> but it would probably need some extra logic to reconnect the frontend
>> after xenstored has been resumed. The main goal of this patch is to fix
>> the S3 resume of domains running both a xenbus frontend and xenstored.
> Is the assumption that other domains are all suspended over S3 a valid
> one in the general case?
>
> In principle there is nothing stopping the toolstack from leaving
> domains running over S3, is there?
>
That seems a valid assumption for dom0. It probably is not for Xenstore
stub domains, even though I fail to see a use case for leaving them running.

Another solution would be to defer the call to xenbus_dev_resume until
userland processes have been resumed. Any opinions on that?
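A rough userspace model of that alternative: the resume callback hands the blocking work to another thread and returns immediately, so the PM core can go on to thaw userland (and thus xenstored), after which the deferred work completes. Everything here is hypothetical and built on POSIX threads; in the kernel the equivalent would presumably be a work item queued from the resume callback (for example with schedule_work()) on a workqueue that is not frozen during resume, but that mapping is an assumption, not something settled in this thread.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of "defer xenbus_dev_resume until userland runs".
 * None of these names are kernel APIs. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool xenstored_thawed; /* set once "userland" is resumed */
static bool frontend_resumed; /* set by the deferred work */

/* Stand-in for xenbus_dev_resume(): blocks until xenstored can answer. */
static void *deferred_frontend_resume(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!xenstored_thawed)
        pthread_cond_wait(&cond, &lock);
    frontend_resumed = true;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Resume callback: instead of blocking, hand the work to another thread
 * and return at once so the PM core can carry on. */
static int frontend_resume_cb(pthread_t *worker)
{
    return pthread_create(worker, NULL, deferred_frontend_resume, NULL);
}

/* Models the PM core: device resume, then thaw userland, then wait for
 * the deferred work. Returns true if the frontend ended up resumed. */
static bool model_s3_resume(void)
{
    pthread_t worker;

    /* Device resume phase: the callback returns without blocking. */
    if (frontend_resume_cb(&worker) != 0)
        return false;

    /* Thaw userland: xenstored can now answer requests. */
    pthread_mutex_lock(&lock);
    xenstored_thawed = true;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(worker, NULL);
    return frontend_resumed;
}
```

Unlike skipping the callback, deferring it would not depend on the backend domain resuming after dom0, which may matter for the Xenstore stub domain case.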

Aurelien

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
