
RE: [Xen-devel] [Patch] continue_hypercall_on_cpu rework using tasklets



BTW, I wonder whether the cpu online/offline lock should really wrap 
stop_machine_run(). I remember Criping questioned this as well. Will have a 
look at it.

--jyh

>-----Original Message-----
>From: Jiang, Yunhong
>Sent: Friday, April 16, 2010 2:43 PM
>To: Keir Fraser; Juergen Gross
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Yu, Ke
>Subject: RE: [Xen-devel] [Patch] continue_hypercall_on_cpu rework using 
>tasklets
>
>
>
>>-----Original Message-----
>>From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
>>Sent: Thursday, April 15, 2010 7:07 PM
>>To: Jiang, Yunhong; Juergen Gross
>>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Yu, Ke
>>Subject: Re: [Xen-devel] [Patch] continue_hypercall_on_cpu rework using 
>>tasklets
>>
>>On 15/04/2010 10:59, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:
>>
>>>> Actually that's a good example because it now won't work, but for other
>>>> reasons! The hypercall continuation can interrupt another vcpu's execution,
>>>> and then try to synchronously pause that vcpu. Which will deadlock.
>>>>
>>>> Luckily I think we can re-jig this code to freeze_domains() before doing 
>>>> the
>>>> continue_hypercall_on_cpu(). I've cc'ed one of the CPU RAS guys. :-)
>>>
>>> Hmm, I have cc'ed one of the PM guys because it is enter_state :-)
>>> Can we add a check in vcpu_sleep_sync() for current? It is meaningless to
>>> cpu_relax() for the current vcpu in that situation, especially if we are
>>> not in irq context.
>>> I'm not sure why freeze_domains() only checks dom0's vcpus for current,
>>> instead of all domains'.
>>
>>Well actually pausing any vcpu from within the hypercall continuation is
>>dangerous. The softirq handler running the hypercall continuation may have
>>interrupted some running VCPU X. And the VCPU Y that the continuation is
>>currently trying to pause may itself be trying to pause X. So we can get a
>>deadlock that way. The freeze_domains() *has* to be pulled outside of the
>>hypercall continuation.
>>
>>It's a little bit similar to the super-subtle stop_machine_run deadlock
>>possibility I just emailed to you a second ago. :-)
>
>Thanks for pointing out the stop_machine_run deadlock issue.
>
>After more consideration and internal discussion, the key point seems to be
>that the tasklet softirq effectively holds a lock on the current vcpu's state
>(i.e. no one else can change that state until the softirq finishes). So any
>blocking action in softirq context, not just vcpu_pause_sync(), is dangerous
>and should be avoided (we can't take a lock and then block, per my
>understanding).
>This is because a vcpu's state can only be changed by the schedule softirq
>(am I right on this?), while the schedule softirq can't preempt other
>softirqs. So, more generally, anything that is updated in softirq context and
>synchronously waited on in Xen's vcpu context is in fact an implicit lock
>held by the softirq.
>
>As for the tricky bug in stop_machine_run(), I think it is in fact similar
>to the cpu_add_remove_lock. stop_machine_run() is a blocking action, so we
>must make sure no one blocks waiting for a lock that stop_machine_run()
>already holds. At the time, we changed all components that take the
>cpu_add_remove_lock to use try_lock instead.
>
>The change introduced by the tasklet is that a new implicit lock is added,
>i.e. the vcpu's state.
>The straightforward method is like the cpu_add_remove_lock: make everything
>that waits for a vcpu state change process softirqs between checks. Maybe
>the cleaner way is your previous suggestion, i.e. to run stop_machine_run()
>in the idle vcpu; another way is to turn back to the original method, i.e.
>do it in schedule_tail.
>
>Also, we are not sure why continue_hypercall_on_cpu was changed to use a
>tasklet. What is the benefit? At least I find it quite confusing, because
>per our understanding a hypercall is usually assumed to execute in the
>current context, while this change breaks that assumption. So any hypercall
>that may use this _c_h_o_c, and any function called by that hypercall,
>should be aware of this. I'm not sure this is really correct; at least it
>may cause trouble if someone uses it without realizing the limitation. From
>Juergen Gross's mail, it seems to be for cpupools, but I have no idea about
>cpupools :-(
>
>--jyh
>
>
>>
>> -- Keir
>>


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

