RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console problem on domU on tip?


  • To: "Xu, Anthony" <anthony.xu@xxxxxxxxx>, "Magenheimer, Dan \(HP Labs Fort Collins\)" <dan.magenheimer@xxxxxx>, "Tian, Kevin" <kevin.tian@xxxxxxxxx>, <xen-ia64-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Yang, Fred" <fred.yang@xxxxxxxxx>
  • Date: Tue, 27 Dec 2005 11:40:52 -0800
  • Delivery-date: Tue, 27 Dec 2005 19:44:40 +0000
  • List-id: Discussion of the ia64 port of Xen <xen-ia64-devel.lists.xensource.com>

Dan,

Please keep drilling on this issue to root-cause the bug behind it.
Although Linux doesn't use PAL_CACHE_FLUSH, as I recall another major OS does
use it at runtime.

As Anthony pointed out, please make sure the PAL ITR mapping is in place before making the call!

-Fred

Xu, Anthony wrote:
> Dan,
> Could you double-check that the ITR mapping the PAL code is in place
> just before invoking ia64_pal_call_static?
> 
> Thanks
> -Anthony
> 
>> -----Original Message-----
>> From: Magenheimer, Dan (HP Labs Fort Collins)
>> [mailto:dan.magenheimer@xxxxxx] Sent: December 24, 2005 2:22 To: Xu,
>> Anthony; Yang, Fred; Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console
>> problem on domU on tip?  
>> 
>> I got up early and spent several hours trying to debug
>> this further.  By adding timing loops and other debug code
>> and moving all the relevant PAL macros around, I proved
>> conclusively that the ia64_pal_call_static assembly routine
>> is not returning.  Next I added an infinite loop to the ivt
>> nested TLB handler (which isn't used by Xen except by some
>> fast paths that are currently off).  With this loop, the
>> error message goes away and Xen "freezes".  I think this
>> proves that the PAL call is inappropriately accessing some
>> (unpinned) data location with psr.ic off.
>> 
>> You should note that this is the only PAL call that requires
>> psr.ic to be off.  I suspect that OSes need to be prepared
>> for the possibility that a fault occurs.  Linux is not,
>> so it never calls the routine.  Xen is not prepared either.
>> 
>> Happy holidays!
>> 
>> Dan
>> 
>>> -----Original Message-----
>>> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
>>> Sent: Thursday, December 22, 2005 7:29 PM
>>> To: Magenheimer, Dan (HP Labs Fort Collins); Yang, Fred;
>>> Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>>> Subject: RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel]
>>> Console problem on domU on tip?
>>> 
>>>> With CONFIG_IA64_SPLIT_CACHE on, a new user may encounter
>>>> the problem on a shipping machine and the symptom is that
>>>> the machine immediately crashes when a domU is launched.
>>> 
>>> Dan,
>>> That means dom0 can boot with CONFIG_IA64_SPLIT_CACHE on, and
>>> PAL_CACHE_FLUSH has been invoked successfully during the dom0
>>> boot. So this is not a PAL_CACHE_FLUSH issue; there must be
>>> some other issue. Could you provide more information about
>>> the crash, since we can't reproduce this issue?
>>> 
>>> Thanks.
>>> 
>>> -Anthony
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Magenheimer, Dan (HP Labs Fort Collins)
>>> [mailto:dan.magenheimer@xxxxxx]
>>>> Sent: December 22, 2005 21:26
>>>> To: Yang, Fred; Xu, Anthony; Tian, Kevin;
>>> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>>>> Subject: RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console
>>>> problem on domU on tip? 
>>>> 
>>>> With CONFIG_IA64_SPLIT_CACHE on, a new user may encounter
>>>> the problem on a shipping machine and the symptom is that
>>>> the machine immediately crashes when a domU is launched.
>>>> 
>>>> With CONFIG_IA64_SPLIT_CACHE off, a developer may encounter
>>>> a different problem on an unreleased machine.
>>>> 
>>>> I know that you are focused primarily on the unreleased machine,
>>>> but in this case, I think we should be cautious for the new user
>>>> as the developer knows to change the option when running
>>>> on the unreleased machine.
>>>> 
>>>> I will spend some more time on this when I have a chance.
>>>> I think it is a real bug (probably PAL accessing some address
>>>> which isn't pinned) that occurs only on some boxes due
>>>> to some factor like memory configuration.
>>>> 
>>>> Thanks,
>>>> Dan
>>>> 
>>>> P.S. The debug output just before the crash was:
>>>> ia64_fault: General Exception: IA-64 Reserved Register/Field fault
>>>> (data access): reflecting 
>>>> 
>>>>> -----Original Message-----
>>>>> From: Yang, Fred [mailto:fred.yang@xxxxxxxxx]
>>>>> Sent: Wednesday, December 21, 2005 10:34 PM
>>>>> To: Magenheimer, Dan (HP Labs Fort Collins); Xu, Anthony;
>>>>> Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>>>>> Subject: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel]
>>>>> Console problem on domU on tip?
>>>>> 
>>>>> Dan,
>>>>> 
>>>>> Can we always turn on #CONFIG_IA64_SPLIT_CACHE in the
>>>>> default build configuration?  People may not be aware of
>>>>> this build flag and miss it on each new build.
>>>>> 
>>>>> All the newer-generation ia64 processors will come with
>>>>> split I/D caches, as discussed in the previous mail thread,
>>>>> and the Itanium architecture documents that future
>>>>> implementations may have split caches.  With the option off
>>>>> by default, it is a potential bug for all Tiger4 systems used
>>>>> for daily development and for future platforms to come.
>>>>> 
>>>>> Your mail also indicates that only the HP rx2620 system has
>>>>> the issue, not the other HP boxes.  Can you track down this
>>>>> issue rather than putting in a kludge for the rx2620 box?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> -Fred
>>>>> 
>>>>> 
>>>>> Magenheimer, Dan (HP Labs Fort Collins) wrote:
>>>>>> Committed (but without removal of ifdefs until we
>>>>>> track down this problem).
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
>>>>>>> Sent: Monday, December 19, 2005 7:15 PM
>>>>>>> To: Magenheimer, Dan (HP Labs Fort Collins); Tian, Kevin;
>>>>>>> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
>>>>>>> 
>>>>>>> I guess the firmware on your machine may not implement
>>>>>>> this PAL call, because there was no split I/D cache at that
>>>>>>> time, so when you make this PAL call it will return
>>>>>>> PAL_STATUS_UNIMPLEMENTED. Could you please turn on
>>>>>>> CONFIG_IA64_SPLIT_CACHE and try this new patch to see
>>>>>>> whether your machine can boot domain0?
>>>>>>> If this patch works, could you please remove all the
>>>>>>> CONFIG_IA64_SPLIT_CACHE macros?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> -Anthony
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Magenheimer, Dan (HP Labs Fort Collins)
>>>>>>> [mailto:dan.magenheimer@xxxxxx]
>>>>>>>> Sent: December 19, 2005 23:48
>>>>>>>> To: Xu, Anthony; Tian, Kevin;
>>> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
>>>>>>>> 
>>>>>>>> I have been distracted tracking another bug...
>>>>>>>> 
>>>>>>>> Here's where I got:
>>>>>>>> 
>>>>>>>> The machine is a new (April 2005) HP rx2620 so it is
>>>>>>>> not old firmware.   I can't reproduce it on a machine
>>>>>>>> with an ITP (which does have older firmware).
>>>>>>>> 
>>>>>>>> This PAL call is never used in Linux, though there is a
>>>>>>>> routine coded for it.  It is the only
>>>>>>>> PAL call coded in Linux that occurs with psr.ic off.
>>>>>>>> 
>>>>>>>> The crash I am seeing occurs either during the PAL call or
>>>>>>>> immediately upon return. 
>>>>>>>> 
>>>>>>>> Is it OK to
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
>>>>>>>>> Sent: Monday, December 19, 2005 2:02 AM
>>>>>>>>> To: Tian, Kevin; Magenheimer, Dan (HP Labs Fort Collins);
>>>>>>>>> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>>>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
>>>>>>>>> 
>>>>>>>>> Dan,
>>>>>>>>> Have you had time to verify the discussion below?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> -Anthony
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Tian, Kevin
>>>>>>>>>> Sent: December 16, 2005 10:16
>>>>>>>>>> To: Xu, Anthony; 'Magenheimer, Dan (HP Labs Fort Collins)';
>>>>>>>>>> 'xen-ia64-devel@xxxxxxxxxxxxxxxxxxx'
>>>>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
>>>>>>>>>> 
>>>>>>>>>>> From: Xu, Anthony
>>>>>>>>>>> Sent: December 16, 2005 9:54
>>>>>>>>>>> 
>>>>>>>>>>>> Also, why panic if it fails?
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> A panic is not required here, and we could just print out a
>>>>>>>>>> warning message. The panic was kept there earlier to help our
>>>>>>>>>> debugging at an early stage.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> Does the problem happen only on VTI?  Or both VTI and
>>>>>>>>>>>> non-VTI on split-cache machines?
>>>>>>>>>>> 
>>>>>>>>>>> Sometimes it makes domain0 crash at the very beginning of
>>>>>>>>>>> the domain0 boot process, especially on MP machines.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> -Anthony
>>>>>>>>>> 
>>>>>>>>>> One addition: that problem definitely exists on new
>>>>>>>>>> split-cache processors, for dom0/domU. For VTI domains, we
>>>>>>>>>> have logic within the device model to ensure consistency.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Kevin
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Magenheimer, Dan (HP Labs Fort Collins)
>>>>>>>>>>> [mailto:dan.magenheimer@xxxxxx]
>>>>>>>>>>>> Sent: December 16, 2005 1:39
>>>>>>>>>>>> To: Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> Cc: Xu, Anthony
>>>>>>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on
>>>>>>>>>>>> tip? 
>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is this code fragment necessary for VTI to boot domU
>>>>>>>>>>>>>> or is it OK to remove?
>>>>>>>>>>>>> 
>>>>>>>>>>>>>   The comment is inaccurate and it should say domU. That I/D
>>>>>>>>>>>>> cache sync step is mandatory to boot domU on new IA64
>>>>>>>>>>>>> processors which have a split L2 I/D cache. Without such
>>>>>>>>>>>>> an I/D cache sync, the control panel loads domU's kernel
>>>>>>>>>>>>> image, which only affects the D-side cache. If there are
>>>>>>>>>>>>> stale entries in the I-side cache within the same range of
>>>>>>>>>>>>> the dom0 image, people will see the machine behave oddly.
>>>>>>>>>>>> 
>>>>>>>>>>>> I don't understand... how can there be stale entries in the
>>>>>>>>>>>> I-cache? The instructions have just been written to memory
>>>>>>>>>>>> (through D-cache) and no instructions in this domain have
>>>>>>>>>>>> yet been executed. I do see that the D-cache needs to be
>>>>>>>>>>>> flushed so that memory is coherent but are there better
>>>>>>>>>>>> ways to do that without a pal call? 
>>>>>>>>>>>> 
>>>>>>>>>>>>>   Normally an I/D cache sync shouldn't cause any problem.
>>>>>>>>>>>>> Possibly there's some problem with the PAL calling code,
>>>>>>>>>>>>> like an incorrect ITLB mapping for PAL or a similar issue...
>>>>>>>>>>>> 
>>>>>>>>>>>> Although the ia64_pal_cache_flush routine is defined in
>>>>>>>>>>>> linux's pal.h, it doesn't appear to be used anywhere in
>>>>>>>>>>>> Linux so there is no use model to copy.  I suspect there
>>>>>>>>>>>> is some use model for the call that we don't understand,
>>>>>>>>>>>> for example maybe it should only be called with physical
>>>>>>>>>>>> &progress?  It definitely fails every time on one of my
>>>>>>>>>>>> (newer) machines and disabling the pal call makes the
>>>>>>>>>>>> problem go away. 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Though it's intermittent, please
>>>>>>>>>>>>> keep this code
>>>>>>>>>>>>> there for correctness.
>>>>>>>>>>>> 
>>>>>>>>>>>> Since the call is definitely failing under some
>>>>>>>>>>>> circumstances that we don't understand, I'm inclined to at
>>>>>>>>>>>> least put the code in an #ifdef CONFIG_SPLIT_CACHE
>>>>>>>>>>>> 
>>>>>>>>>>>> Does the problem happen only on VTI?  Or both VTI and
>>>>>>>>>>>> non-VTI on split-cache machines? 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Dan
>>>>>>>>>>>> 
>>>>>>>>>>>> P.S. I tried Anthony's patch (which moves the PAL call
>>>>>>>>>>>> after new_thread()) but it still crashes.
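
The handling discussed in the thread (keep the flush behind a build option, and print a warning instead of panicking when the call fails) might be sketched roughly as below. This is only an illustration and not the Xen code: `sync_icache_for_new_domain`, `pal_flush_fn`, and the two stub functions are hypothetical names, the real PAL call is replaced by a function pointer so the sketch runs anywhere, and the status values are defined locally for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Status names as mentioned in the thread; values defined locally
 * for this sketch only. */
#define PAL_STATUS_SUCCESS        0
#define PAL_STATUS_UNIMPLEMENTED (-1)

/* Build option from the thread, defined here so the sketch compiles. */
#define CONFIG_IA64_SPLIT_CACHE 1

/* Hypothetical stand-in for the firmware cache-flush call. */
typedef long (*pal_flush_fn)(void);

/* Hypothetical stubs simulating firmware that does / does not
 * implement the call. */
long stub_flush_ok(void)            { return PAL_STATUS_SUCCESS; }
long stub_flush_unimplemented(void) { return PAL_STATUS_UNIMPLEMENTED; }

/*
 * Hypothetical helper: attempt the I/D cache sync before starting a
 * new domain. Returns 0 if the flush succeeded or was compiled out,
 * 1 if the call reported a failure (warned about, not fatal).
 */
int sync_icache_for_new_domain(pal_flush_fn flush)
{
#ifdef CONFIG_IA64_SPLIT_CACHE
    long status;

    assert(flush != NULL);
    status = flush();
    if (status != PAL_STATUS_SUCCESS) {
        /* Warn and carry on rather than panic. */
        fprintf(stderr, "warning: cache flush returned %ld, continuing\n",
                status);
        return 1;
    }
#else
    (void)flush; /* option off: no flush attempted */
#endif
    return 0;
}
```

Calling `sync_icache_for_new_domain(stub_flush_unimplemented)` prints the warning and returns 1. Note that this only covers a call that returns a bad status; it does not help with the case Dan reports, where the PAL call never returns.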


_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel


 

