
Re: [Xen-devel] can't create any more pv-on-hvm domains after ~38 under 3.3-testing

Keir Fraser wrote:
> On 03/12/2008 11:27, "James Harper" <james.harper@xxxxxxxxxxxxxxxx> wrote:
>> Alternatively it could be a combination of the gplpv drivers and netback
>> or blkback. I'm pretty sure that I had the problem before I started
>> testing pvscsi...
>> The machine I am testing on will be busy for the rest of the night, but
>> tomorrow I'll do some testing and see what happens, unless you can
>> suggest a way I could discover what those pages belong to in the
>> meantime?
> Unfortunately it's a bit of a pain in the butt since we don't have full page
> tracking in Xen -- we only know that *someone* *somewhere* has that page
> mapped for *some* purpose. Indeed even with more tracking Xen can only
> really tell you which domain holds the reference, and that's bound to be
> dom0 (unless this is a bogus refcounting bug in Xen itself).

We have been investigating a similar-sounding bug (hung pages with elevated
reference counts) that occurs when blkback requests are issued over an iSCSI
backend device.  The block requests appear to be running afoul of the lazy copy 
optimization added for netback.  In this path, foreign pages (assumed to be 
netback pages?) are manipulated specially by the dma layer of the dom0 network 
stack.  On return to netback, the page refs are cleaned up.

In our case, the foreign pages actually originate from blkback, are passed to 
iSCSI for processing, and are abused by the ref manipulation in the dom0 
network stack.  On return to blkback, the page refs are off.  What we haven't
been able to do yet is identify the exact circumstances that trigger the
issue.  We have a fairly elaborate reproducer involving running a pool of 
domains and continuously rebooting them.  Eventually, one domain will hang on 
exit with a stuck page with elevated ref counts.

In our case, the stuck page is always a blkback I/O page.

Running the same test on a FC SAN or local SCSI backend device doesn't hang.
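
To make the suspected accounting problem concrete, here is a toy model in
Python.  It is not Xen or Linux code, and every name in it is hypothetical.  It
assumes the network stack takes one extra reference on a foreign page and
relies on the owning backend's completion path to drop it, which is only our
reading of the symptoms so far.

```python
from dataclasses import dataclass


@dataclass
class Page:
    owner: str         # hypothetical label for the backend that owns the page
    refcount: int = 1  # the owning backend's own reference


def network_stack_touch(page: Page) -> None:
    # Assumption: the stack takes an extra ref on any foreign page it handles.
    page.refcount += 1


def netback_complete(page: Page) -> None:
    page.refcount -= 1  # undo the extra ref taken by the network stack
    page.refcount -= 1  # drop netback's own ref


def blkback_complete(page: Page) -> None:
    page.refcount -= 1  # blkback only knows about its own ref


def run_request(backend: str, via_network_stack: bool) -> int:
    """Return the refcount left on the page once the request completes."""
    page = Page(owner=backend)
    if via_network_stack:
        network_stack_touch(page)
    if backend == "netback":
        netback_complete(page)
    else:
        blkback_complete(page)
    return page.refcount
```

In this model a blkback request that goes through the network stack (the iSCSI
case) finishes with one reference left over, while the same request on a path
that bypasses the network stack (FC SAN, local SCSI) finishes at zero.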

- Steve

> I would suggest dumping addresses of interesting control pages in your
> backend drivers (some can log that already if built with debugging I think),
> then match up the address of the remaining page in the zombie domain.
>  -- Keir
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
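
The matching step Keir suggests above could be scripted roughly as follows.
This is only a sketch: the log line format ("driver: description addr=0x...")
is invented for illustration and is not what blkback or netback actually print,
so the regular expression would need adapting to whatever the debug build logs.

```python
import re

# Hypothetical log format: "<driver>: <description> addr=0x<hex>"
LOG_LINE = re.compile(
    r"^(?P<driver>\w+): (?P<what>.+?) addr=(?P<addr>0x[0-9a-fA-F]+)$"
)


def index_logged_pages(lines):
    """Map each logged page address to the (driver, description) pairs that logged it."""
    index = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not page-address records
        addr = int(match.group("addr"), 16)
        index.setdefault(addr, []).append(
            (match.group("driver"), match.group("what"))
        )
    return index


def who_logged(index, stuck_addr):
    """Return the (driver, description) pairs recorded for the stuck page, if any."""
    return index.get(stuck_addr, [])
```

Feed it the backend debug output, then look up the address of the page still
held in the zombie domain.  An empty result means none of the logged control
pages match, which would point at a data page instead.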
