
Re: [Xen-devel] [PATCH v2] Fix scheduler crash after s3 resume



>>> On 25.01.13 at 11:35, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx> wrote:
> On 25.01.2013 11:31, Jan Beulich wrote:
>>>>> On 25.01.13 at 11:23, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx> wrote:
>>> On 25.01.2013 11:15, Jan Beulich wrote:
>>>>>>> On 25.01.13 at 10:45, Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx> wrote:
>>>>
>>>>>> I think I had already raised the question of the placement of
>>>>>> this rcu_barrier() here, and the lack of a counterpart in the
>>>>>> suspend portion of the path. Keir? Or should
>>>>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>>>>> while still resuming, and instead call __do_softirq() with all but
>>>>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>>>>> or alternatively by open-coding its effect)?
>>>>>>
>>>>> Though I recall these vcpu_wake crashes also happen from entry points
>>>>> in enter_state other than rcu_barrier, so I don't think removing that
>>>>> helps much. I was just unable to get a proper log of them today, as
>>>>> most of them were cut in half. Will try a bit more.
>>>>
>>>> In which case making __do_softirq() itself honor being in the
>>>> suspend/resume path might still be an option.
>>>>
>>>>> My belief is that as long as vcpu_migrate is not called in
>>>>> cpu_disable_scheduler, vcpu->processor will continue to point to an
>>>>> offline cpu, which will crash if vcpu_wake is called for that vcpu.
>>>>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>>>>> frequency, but since vcpu->processor will point to an online cpu, it
>>>>> won't crash. So avoiding the wakes here completely is likely not the
>>>>> goal, just the ones targeting offline cpus.
>>>>
>>>> But you neglect the fact that waking vCPU-s at this point is
>>>> unnecessary anyway (they have nowhere to run).
>>>
>>> What about adding a global scheduler_disable() in freeze_domains() and a
>>> scheduler_enable() in thaw_domains() which will switch scheduler locking to
>>> a global lock (or disable it altogether?). This should solve all problems
>>> without any complex changes to current behavior.
>>
>> I don't see how this would address the so far described
>> shortcomings.
> 
> The crash happens due to an access to the scheduler percpu area which isn't
> allocated at the moment. The accessed element is the address of the
> scheduler lock for this cpu. Disabling the percpu locking scheme of the
> scheduler while the non-boot cpus are offline will avoid the crash.

Ah, okay. But that wouldn't prevent other bad effects that could
result from vCPU-s pointing to offline pCPU-s. Hence I think such
a solution, even if sufficient for now, would set us up for future
similar (and similarly hard to debug) issues.

Jan
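
For illustration only, and not taken from the patch under discussion: a
minimal toy model of the proposal quoted above, in which the lock lookup
falls back to a single global lock while per-CPU locking is switched off.
Only scheduler_disable() and scheduler_enable() are names from the thread,
and their behaviour here is an assumption. Everything else (struct lock,
struct sched_pcpu, cpu_set_online(), vcpu_schedule_lock_ptr(), NR_CPUS = 4)
is hypothetical, and a NULL return stands in for the point where the real
code would dereference an unallocated percpu area.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4                       /* hypothetical, for the sketch only */

struct lock { int unused; };            /* stand-in for a spinlock */
struct sched_pcpu { struct lock lock; };/* hypothetical per-CPU scheduler data */
struct vcpu { unsigned int processor; };/* only the field discussed above */

static struct sched_pcpu pcpu_storage[NR_CPUS];
static struct sched_pcpu *pcpu_area[NR_CPUS]; /* NULL: area not allocated */
static struct lock global_sched_lock;
static bool percpu_locking_enabled = true;

/* Assumed behaviour of the proposed hooks: toggle the locking scheme. */
void scheduler_disable(void) { percpu_locking_enabled = false; }
void scheduler_enable(void)  { percpu_locking_enabled = true; }

/* Model a CPU coming online (area allocated) or going offline (area freed). */
void cpu_set_online(unsigned int cpu, bool online)
{
    if (cpu < NR_CPUS)
        pcpu_area[cpu] = online ? &pcpu_storage[cpu] : NULL;
}

/*
 * Return the lock to take for v. NULL marks the case where real code would
 * fault, because v->processor names a CPU whose percpu area is gone.
 */
struct lock *vcpu_schedule_lock_ptr(const struct vcpu *v)
{
    if (!percpu_locking_enabled)
        return &global_sched_lock;      /* percpu area is never touched */
    if (v->processor >= NR_CPUS || pcpu_area[v->processor] == NULL)
        return NULL;                    /* the crash scenario */
    return &pcpu_area[v->processor]->lock;
}
```

Note that in this model v->processor still names the offline CPU after
scheduler_disable(); the global lock only hides the failing lookup, which
is the concern raised in the reply above.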


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

