[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61



>From: MaoXiaoyun
>Sent: Monday, April 25, 2011 11:15 AM
>> Date: Fri, 15 Apr 2011 14:22:29 -0700
>> From: jeremy@xxxxxxxx
>> To: tinnycloud@xxxxxxxxxxx
>> CC: giamteckchoon@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx; 
>> konrad.wilk@xxxxxxxxxx
>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>> 
>> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
>> > Hiï
>> >
>> > Could the crash related to this patch ?
>> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
>> >
>> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
>> > cpumask_clear_cpu(line 49).
>> > Could it possible that right after execute line 40 of mmu_context.h,
>> > CPU revice IPI from other CPU to
>> > flush the mm, and when in interrupt, find the TLB state happened to be
>> > TLBSTATE_OK. Which conflicts.
>> 
>> Does reverting it help?
>> 
>> J
>Â
>Hi Jeremy:
>Â
>ÂÂÂ The lastest test result shows the reverting didn't help.
>ÂÂÂÂKernel panic exactly at the same place in tlb.c.
>Â
>ÂÂÂ I have question about TLB state, from the stack, 
>ÂÂÂ xen_do_hypervisor_callback-> xen_evtchn_do_upcall->... ->drop_other_mm_ref
>Â
>ÂÂÂÂWhatÂÂcpu_tlbstate.state should be, couldÂÂTLBSTATE_OK or TLBSTATE_LAZY 
>all be possible? 
>ÂÂÂÂThat is after a hypercall from userspace, state will be TLBSTATE_OK, and
>ÂÂÂÂÂ if from kernel space, state will be TLBSTATE_LAZE ? 
>Â
>ÂÂÂÂÂÂÂthanks.

it looks a bug in drop_other_mm_ref implementation, that current TLB state 
should be checked
before invoking leave_mm(). There's a window between below lines of code:

<xen_drop_mm_ref>
       /* Get the "official" set of cpus referring to our pagetable. */
        if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
                for_each_online_cpu(cpu) {
                        if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
                            && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
                                continue;
                        smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
                }
                return;
        }

there's chance that when smp_call_function_single is invoked, actual TLB state 
has been
updated in the other cpu. The upstream kernel patch you referred to earlier 
just makes
this bug exposed more easily. But even without this patch, you may still suffer 
such issue
which is why reverting the patch doesn't help.

Could you try adding a check in drop_other_mm_ref?

        if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
                leave_mm(smp_processor_id());

once the interrupted context has TLBSTATE_OK, it implicates that later it will 
handle 
the TLB flush and thus no need for leave_mm from interrupt handler, and that's 
the
assumption of doing leave_mm.

Thanks
Kevin
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.