
Re: [Xen-devel] dom0 / hypervisor hang on dom0 boot



On Tue, May 21, 2013 at 09:39:14AM +0200, Dietmar Hahn wrote:
> On Friday, 17 May 2013, 18:28:16, Konrad Rzeszutek Wilk wrote:
> > On Thu, May 16, 2013 at 01:07:05PM +0200, Dietmar Hahn wrote:
> > > On Wednesday, 15 May 2013, 10:42:17, Jan Beulich wrote:
> > > > >>> On 15.05.13 at 11:12, Dietmar Hahn <dietmar.hahn@xxxxxxxxxxxxxx> wrote:
> > > > > On Wednesday, 15 May 2013, 09:35:46, Jan Beulich wrote:
> > > > >> >>> On 15.05.13 at 08:53, Dietmar Hahn <dietmar.hahn@xxxxxxxxxxxxxx> wrote:
> > > > >> > I tried iommu=debug and I can't see any faulting messages, but I am
> > > > >> > not familiar with this code.
> > > > >> > I attached the logging; maybe someone can have a look at it.
> > > > 
> > > > Perhaps only (if at all) by instrumenting the hypervisor. The
> > > > question of course is how easily/quickly you can narrow down the
> > > > code region that it might be dying in. And whether it's a hypervisor
> > > > action at all that causes the hang (as opposed to something the
> > > > DRM code in Dom0 does).
> > > 
> > > I added some debug code to the Linux kernel and was able to track down
> > > the point of the hang. I used the openSUSE kernel 3.7.10-1.4, but I looked
> > > at newer kernels and found that the code is similar.
> > > 
> > > i915_gem_init_global_gtt(...)
> > >   ...
> > >   intel_gtt_clear_range(start / PAGE_SIZE, (end - start) / PAGE_SIZE);
> > >   ...
> > > 
> > > void intel_gtt_clear_range(unsigned int first_entry, unsigned int num_entries)
> > > {
> > >         unsigned int i;
> > > 
> > >     ---> A printk(...) here is seen on serial line!
> > > 
> > >         for (i = first_entry; i < (first_entry + num_entries); i++) {
> > >                 intel_private.driver->write_entry(intel_private.base.scratch_page_dma,
> > >                                                   i, 0);
> > >         }
> > > 
> > >     ---> A printk(...) here is never seen!
> > > 
> > >         readl(intel_private.gtt + i - 1);
> > > }
> > > 
> > > The function behind the pointer intel_private.driver->write_entry is
> > > i965_write_entry(). And the interesting instruction seems to be:
> > >   writel(addr | pte_flags, intel_private.gtt + entry);
> > > 
> > > I added another printk() at the start of the function i965_write_entry().
> > > Surprisingly, after printing a lot of messages the kernel came up!
> > > But then I had other problems, like losing the audio device (maybe
> > > timeouts). So maybe the hang is a timing problem?
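(For reference, a minimal sketch of that kind of instrumentation, assuming
i965_write_entry() as it appears in mainline drivers/char/agp/intel-gtt.c; the
exact printk used in the report above may have differed:)

        static void i965_write_entry(dma_addr_t addr, unsigned int entry,
                                     unsigned int flags)
        {
                /* Debug instrumentation: log every PTE write so the serial
                 * console shows how far the GTT clearing loop gets before
                 * the hang. */
                printk(KERN_DEBUG "i965_write_entry: entry=%u addr=0x%llx\n",
                       entry, (unsigned long long)addr);

                /* ... original body: build pte_flags and writel() the PTE
                 * into intel_private.gtt + entry ... */
        }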
> > > 
> > > What I wanted to check is what the hypervisor is doing while the system
> > > hangs. Does anybody have an idea, maybe a timer that after 30s prints a
> > > dump of the stacks of all CPUs?
> > 
> > Yes. Can you try the two attached patches, please?
> 
> I tried both, but neither helped. I think that was to be expected, as the
> first patch handles an error case, and the line touched by the second patch,
> the call to pci_dma_sync_single_for_device(), is never reached.

OK, perhaps move the pci_dma_sync_single_for_device() call into the while loop?

The idea behind that flush code is to kick the GTT into doing its job. But if
the SWIOTLB is used and a bounce page is in play, then the writes don't end up
in the area the flush code touches at all until the pci_unmap_page() call.

Or the pci_dma_sync_single() call.
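Roughly along these lines (just a sketch of the idea against the
intel_gtt_clear_range() shown above, not one of the attached patches;
intel_private.pcidev and the PAGE_SIZE sync length are assumptions):

        void intel_gtt_clear_range(unsigned int first_entry, unsigned int num_entries)
        {
                unsigned int i;

                for (i = first_entry; i < (first_entry + num_entries); i++) {
                        intel_private.driver->write_entry(intel_private.base.scratch_page_dma,
                                                          i, 0);
                        /* If the scratch page went through a SWIOTLB bounce
                         * buffer, push it out to the device on every iteration
                         * instead of waiting for pci_unmap_page() or a later
                         * pci_dma_sync_single_*() call. */
                        pci_dma_sync_single_for_device(intel_private.pcidev,
                                                       intel_private.base.scratch_page_dma,
                                                       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
                }
                readl(intel_private.gtt + i - 1);
        }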

The other option would be to use pci_alloc_coherent() so that we would not
need the streaming PCI DMA API at all. But first I would like to verify that
the theory is correct.
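For illustration only, a rough sketch of what the pci_alloc_coherent() route
could look like for the scratch page (the helper name is made up and this is
not a proposed patch):

        /* Allocate the GTT scratch page from coherent DMA memory so it never
         * goes through a SWIOTLB bounce buffer and needs no explicit sync. */
        static int intel_gtt_setup_scratch_page_coherent(void)
        {
                void *vaddr;
                dma_addr_t dma;

                vaddr = pci_alloc_coherent(intel_private.pcidev, PAGE_SIZE, &dma);
                if (!vaddr)
                        return -ENOMEM;

                memset(vaddr, 0, PAGE_SIZE);
                intel_private.base.scratch_page_dma = dma;
                return 0;
        }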

> 
> Dietmar.
> 
> -- 
> Company details: http://ts.fujitsu.com/imprint.html

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

