Re: [Xen-devel] Radeon DRM dom0 issues



On Wed, Feb 19, 2014 at 2:57 PM, Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
> On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
>> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
>> <konrad.wilk@xxxxxxxxxx> wrote:
>> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
>> >> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/24/2014
>> >> 09:49:38 AM:
>> >>
>> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> >> > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> >> > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx, xen-devel-
>> >> > bounces@xxxxxxxxxxxxx
>> >> > Date: 01/24/2014 09:50 AM
>> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> >
>> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
>> >> > > xen-devel-bounces@xxxxxxxxxxxxx wrote on 01/21/2014 04:59:05 PM:
>> >> > >
>> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> >> > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> >> > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
>> >> > > > Date: 01/21/2014 04:59 PM
>> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > Sent by: xen-devel-bounces@xxxxxxxxxxxxx
>> >> > > >
>> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
>> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/20/2014 10:38:27 AM:
>> >> > > > >
>> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> >> > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> >> > > > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
>> >> > > > > > Date: 01/20/2014 10:38 AM
>> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > > >
>> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
>> >> > > > > > > Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> wrote on 01/20/2014 10:14:36 AM:
>> >> > > > > > >
>> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
>> >> > > > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> >> > > > > > > > Cc: xen-devel@xxxxxxxxxxxxx, michael.d.labriola@xxxxxxxxx
>> >> > > > > > > > Date: 01/20/2014 10:14 AM
>> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > > > > >
>> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
>> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent
>> >> > > > > > > > > crashes with multiple older R600 series (HD 6470 and HD 6570) and
>> >> > > > > > > > > unusably slow graphics with a newer HD7000 (can see each line
>> >> > > > > > > > > refresh individually on radeonfb tty).  All 3 systems seem to
>> >> > > > > > > > > work fine bare metal.
>> >> > > > > > > >
>> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you mean?
>> >> > > > > > >
>> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.  The
>> >> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device for
>> >> > > > > > > a plain text console login.
>> >> > > > > >
>> >> > > > > > So sluggish is probably due to the PAT not being enabled. This
>> >> > > > > > patch should be applied:
>> >> > > > > >
>> >> > > > > > lkml.org/lkml/2011/11/8/406
>> >> > > > > >
>> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
>> >> > > > > >
>> >> > > > > > and these two reverted:
>> >> > > > > >
>> >> > > > > >  "xen/pat: Disable PAT support for now."
>> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
>> >> > > > > >
>> >> > > > > > Which is to say do:
>> >> > > > > >
>> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
>> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
>> >> > > > >
>> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree,
>> >> > > > > reverted those 2 commits, recompiled and installed.  Definitely
>> >> > > > > fixed the HD 7000 sluggishness and appears to have fixed the R600
>> >> > > > > crashes (although it's only been running a few hours).
>> >> > > > >
>> >> > > > > How come that patch didn't get into mainline?  It looks pretty
>> >> > > > > innocuous to me...
>> >> > > >
>> >> > > > <Sigh> the x86 maintainers wanted a different route. And I hadn't
>> >> > > > had the chance nor time to implement it.
>> >> > >
>> >> > > I see.  Well, I've got a handful of boxes in my lab that need that
>> >> > > patch to be usable.  If you do come up with a more mainline-able
>> >> > > solution, I'd gladly test it for you.  ;-)
>> >> >
>> >> > Thank you!
>> >>
>> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
>> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
>> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
>> >> workstation, X has crashed a half dozen or so times so far this week. I've
>> >> been in Xen with 2 paravirtual linux guests running almost constantly for
>> >> this whole period.  I don't understand what's changed, but my system has
>> >> been entirely unstable now.  I did recompile my kernel... but all I did
>> >> was merge the v3.13.1 stable commit into my working tree and turn a few
>> >> things on (netfilter, wifi, a couple drivers turned on here and there).  I
>> >> just went and verified that those patches are still applied in my tree
>> >> (i.e., I didn't accidentally undo them).  I'm scratching my head (and
>> >> staring at a TTY login).
>> >>
>> >> When X crashes, my kernel log prints a couple dozen iterations of this. 3d
>> >> acceleration no longer functions unless I reboot.  If memory serves, the
>> >> unpatched behavior upon X crash was that the kernel continued to spew
>> >> these errors until the whole box locked up.  At least that's not happening
>> >> any more... ;-)
>> >>
>> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
>> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
>> >>
>> >> and here's a slightly different variant that happened while I was typing
>> >> this email (on a different machine, luckily):
>> >>
>> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
>> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
>> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
>> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [64348.297561] [TTM] Buffer eviction failed
>> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
>> >>
>> >> Any ideas?
>> >
>> > Yes. I believe you have a memory leak. As in, some driver (or X) is
>> > eating up the memory and not giving up enough. That means the TTM
>> > layer is hitting its ceiling of how much memory it can allocate.
>> >
>> > Now finding the culprit is going to be a bit hard.
>> >
>> > You could use:
>> >
>> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
>> >          pool      refills   pages freed    inuse available     name
>> >            wc          259           224      808        4 nouveau 0000:05:00.0
>> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
>> >        cached           25             0       96        4 nouveau 0000:05:00.0
>> >
>> > to figure out if my thinking is really true. You should have a huge
>> > 'inuse' count and almost no 'available'.
>>
>> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
>> always have the same contents.  Is that normal?
>
> Yes.
>>
>> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
>> metal... only in Xen.  Is that normal?
>
> It would show up on baremetal if you boot with 'iommu=soft'
>
>>
>>          pool      refills   pages freed    inuse available     name
>>        cached        15190         59551     1205        4 radeon 0000:01:00.0
>>
>> If I watch that file while creating xterms, moving them around, etc, I can
>> see the number available fluctuate between 3 and 6.  This is true, even on
>> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
>> message (yet?).
>
> OK, so let's see what happens when the error shows. Incidentally, what amount
> of memory does your initial domain have? And is it different than when you
> boot it on bare metal?

I've got the problem very reproducible on 3 boxes.  All three are
booting the dom0 with as much RAM as Xen will give them, then giving
up some of their RAM as needed when I create domUs.  The 3 boxes have
4G, 8G, and 16G if memory serves.

Does the amount of RAM on the actual video cards matter?  All the
older cards (that crash all the time) have 2G, whereas the R7 that
hasn't crashed yet only has 1G.

I've been reproducing the crash by just logging in and out of fluxbox
via XDM over and over again right after booting my dom0 in Xen w/ no
guests running.  That makes it happen within a few minutes.  Otherwise
it randomly crashes while I'm in the middle of trying to work... ;-)
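
A minimal sketch of how the pool counters could be logged while repeating
the login/logout cycle, assuming the six-column layout shown in the
ttm_dma_page_pool output quoted above.  The names parse_ttm_dma_page_pool,
watch_pool and PoolStats are hypothetical, made up for illustration; only
the debugfs path comes from this thread.

```python
import time
from collections import namedtuple

# One row of ttm_dma_page_pool: pool type, counters, and "driver pci-address".
PoolStats = namedtuple("PoolStats", "pool refills freed inuse available name")


def parse_ttm_dma_page_pool(text):
    """Parse the text of a ttm_dma_page_pool debugfs file into PoolStats rows."""
    rows = []
    for line in text.splitlines():
        # Split into at most 6 fields so the name ("radeon 0000:01:00.0")
        # keeps its embedded space.
        fields = line.split(None, 5)
        # Skip the header, blank lines, and anything that is not a data row.
        if len(fields) < 6 or not all(f.isdigit() for f in fields[1:5]):
            continue
        pool, refills, freed, inuse, available, name = fields
        rows.append(PoolStats(pool, int(refills), int(freed),
                              int(inuse), int(available), name.strip()))
    return rows


def watch_pool(path="/sys/kernel/debug/dri/0/ttm_dma_page_pool",
               interval=5.0, samples=10):
    """Print a timestamped line per pool every `interval` seconds (needs root)."""
    for _ in range(samples):
        with open(path) as f:
            for row in parse_ttm_dma_page_pool(f.read()):
                print("%s %s %s inuse=%d available=%d" % (
                    time.strftime("%H:%M:%S"), row.name, row.pool,
                    row.inuse, row.available))
        time.sleep(interval)
```

Calling watch_pool() as root while reproducing the crash would print one
line per pool every few seconds.  A steadily growing inuse count with
available stuck near zero would be consistent with the leak theory.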

>
> Thank you.
>
>>
>>
>> >
>> > But that will get us just to confirm that yes - you have a big usage
>> > of memory and it is hitting the ceiling.
>> >
>> > Now to actually figure out which application is hanging on these - that
>> > I am not sure about. I think there is some drm info tool to investigate
>> > how many pages each application is using. You can leave it running and
>> > see which app is gulping up the memory. But I am not sure which
>> > tool that is (if there was one).
>> >
>> > Well, let's take one step at a time - see if my theory is correct first.

-- 
Michael D Labriola
21 Rip Van Winkle Cir
Warwick, RI 02886
401-316-9844 (cell)
401-848-8871 (work)
401-234-1306 (home)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

