
Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors



On 07/05/2010 03:52 PM, Joanna Rutkowska wrote:
> On 07/06/10 00:43, Jeremy Fitzhardinge wrote:
>   
>> On 07/05/2010 03:07 PM, Joanna Rutkowska wrote:
>>     
>>> On 07/05/10 23:28, Joanna Rutkowska wrote:
>>>> On 07/05/10 12:38, Joanna Rutkowska wrote:
>>>>> I'm experiencing very reproducible DomU lockups that occur after I
>>>>> resume the system from an S3 sleep. Strangely, this seems to happen
>>>>> only on my Core i5 systems (tested on two different machines), but
>>>>> not on older Core 2 Duo systems.
>>>>>
>>>>> Usually this causes the apps (e.g. Firefox) running in DomUs to become
>>>>> unresponsive, but sometimes I see that some very limited functionality
>>>>> of the app is still available (e.g. I can open/close tabs in Firefox,
>>>>> but cannot do much more). Also, when I log in to the DomU via
>>>>> xm console, I usually can see the login prompt and can enter the
>>>>> username, but then the console hangs.
>>>>>
>>>>> I tried to attach to such a hung DomU using gdbserver-xen, but when I
>>>>> subsequently try to attach to the server from gdb (via the "target
>>>>> remote 127.0.0.1:9999" command), my gdb segfaults (how funny!).
>>>>>
>>>>> I'm running Xen 3.4.3, and a fairly recent pvops0 kernel in DomU. In
>>>>> Dom0 I run a 2.6.34-xenlinux kernel (opensuse patches), but I doubt
>>>>> it is relevant in any way.
>>>>>
>>>>> This seems like a scheduling problem, and, because it affects Core i5
>>>>> processors but not Core 2 Duos, perhaps it has something to do with
>>>>> Hyperthreading?
>>>>>
>>>> Ok, finally got the gdbserver working. This is the backtrace I get when
>>>> attaching to a locked-up DomU after resume:
>>>>
>>>> #0  0xffffffff810093aa in ?? ()
>>>> #1  0xffffffff8168be18 in ?? ()
>>>> #2  0xffff880003a21600 in ?? ()
>>>> #3  0xffffffff8100ee63 in HYPERVISOR_sched_op ()
>>>>     at
>>>> /usr/src/debug/kernel-2.6.32/linux-2.6.32.x86_64/arch/x86/include/asm/xen/hypercall.h:292
>>>> #4  xen_safe_halt () at arch/x86/xen/irq.c:104
>>>> #5  0xffffffff8100c33e in raw_safe_halt () at
>>>> /usr/src/debug/kernel-2.6.32/linux-2.6.32.x86_64/arch/x86/include/asm/paravirt.h:110
>>>> #6  xen_idle () at arch/x86/xen/setup.c:193
>>>> #7  0xffffffff81011cdd in cpu_idle () at arch/x86/kernel/process_64.c:143
>>>> #8  0xffffffff8144b997 in rest_init () at init/main.c:445
>>>> #9  0xffffffff81824ddc in start_kernel () at init/main.c:695
>>>> #10 0xffffffff818242c1 in x86_64_start_reservations
>>>> (real_mode_data=<value optimized out>) at arch/x86/kernel/head64.c:123
>>>> #11 0xffffffff81828160 in xen_start_kernel () at
>>>> arch/x86/xen/enlighten.c:1300
>>>> #12 0xffffffff838f3000 in ?? ()
>>>> #13 0xffffffff838f4000 in ?? ()
>>>> #14 0xffffffff838f5000 in ?? ()
>>>>
>>>> Any ideas?
>>>>
>>> ... and when I disabled Hyperthreading in the BIOS, the problem seems
>>> to be gone. Obviously this is not a desired solution...
>> HT has historically been very good at flushing out race conditions which
>> would normally be tricky to hit on SMP systems.  I assume your domain is
>> single CPU?
> Actually, no. It used to be, but then I thought that might be the issue
> and assigned 2 vcpus to it, but they were still locking up.

Does the other cpu have the same backtrace into idle?

>> Do you know what's going on in it that it might be waiting for?
> No idea. My guess is that it would be a different kernel subsystem each
> time -- e.g. when I'm lucky and the apps get only "partially" locked up,
> I can e.g. open new tabs in Google Chrome and see some thumbnails of my
> popular websites, but without their contents. This would suggest the
> networking subsystem is dead, but at the same time Chrome is apparently
> communicating fine with the X server in the DomU (which in turn talks
> fine with Dom0 over Xen shared memory/event channels).
>
> I experienced the above behavior also when I had only one VCPU per DomU.

I've seen similar things with just normal domain save/restore, where the
timer interrupt seems to be failing.  Can you ssh into the domain?  I
found that I couldn't do an interactive ssh (hung at the prompt), but a
non-interactive command would work, so I could cat /proc/interrupts.
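The stalled-timer theory is easy to check from inside the affected domU
(over a non-interactive ssh invocation, since interactive logins hang).
A minimal sketch, comparing two snapshots of the timer counters in
/proc/interrupts a second apart; the "root@domu" host in the usage note
below is a placeholder:

```shell
# If the timer counters in /proc/interrupts have not advanced after a
# second, the domain is no longer receiving timer interrupts.
a=$(grep -i timer /proc/interrupts)
sleep 1
b=$(grep -i timer /proc/interrupts)
if [ "$a" = "$b" ]; then
    echo "timer interrupts stalled"
else
    echo "timer interrupts advancing"
fi
```

Something like: ssh root@domu 'sh -s' < check-timer.sh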

This was on my non-HT i7 box, and it affected both pvops domUs, and
CentOS 5 ones.

>>  Is it no longer getting timer events or something?  Does the Xen
>> 'q' debug-key make it do anything?
> Ah, that's some secret option I've never heard of... Is it in gdb when
> used with gdbserver-xen?

No, on the xen console: type ^A^A^A to switch input to Xen, then press q
(h gets a list of other magic keys).  ^A^A^A switches the console back
to dom0.  You can also trigger it with "xm debug-key q" and look at "xm
dmesg" to see the results if you can't get to the Xen console.
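Spelled out as a dom0 session (this is only a sketch and obviously needs
a Xen dom0 with the xm toolstack; "tail -n 40" is an arbitrary window):

```shell
# Ask Xen to act on the 'q' debug key, then read the resulting dump
# from the hypervisor's message buffer:
xm debug-key q
xm dmesg | tail -n 40
```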

    J

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

