
[Xen-devel] Re: [PATCH] Blktap: Userspace file-based image support.(RFC)

On Tue, 20 Jun 2006 14:10:30 -0700, Dan Smith wrote:

> IP> It doesn't bypass the buffer cache (so all bets are off for data
> IP> integrity) and can end up consuming all of dom0 memory with dirty
> IP> buffers -- just create a few loop devices and do a few parallel
> IP> dd's to them and watch the oomkiller go on the rampage. It's even
> IP> worse if the filesystem the file lives on is slow e.g. NFS.
> Ok, it seems like this should be addressed in the upstream loop
> driver.  I imagine quite a few people are depending on the loop driver
> right now, expecting it to maintain data integrity.

It's probably worth spending some cycles trying to improve the loop driver.

> Could the loop driver make use of the routines that do direct IO
> instead of the normal routines to solve this when it's an issue?

It appears that the loop driver is split between two threads using a
producer/consumer queue.  The main thread gets the bio requests and queues
them for the consumer thread.

The consumer thread can do a number of things depending on properties of
the fd.  It may use address space ops, use fops->write, or do a transform
of the data.  If the fd is opened with O_DIRECT and fops has a valid
aio_{read,write}, it should be possible to use proper aio calls to queue
the requests.  You'll probably have to get clever about how the thread
blocks (it has to wake up either on the queue mutex or when an aio request
completes).
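
To make the dispatch idea a bit more concrete, here is a rough sketch in
plain C.  It is not the loop driver's code: struct backing_model,
choose_path() and the order of the checks are all made up for
illustration, and I'm assuming a transform rules out the aio path because
the data has to be copied through it first.

```c
#include <stdbool.h>

/* Hypothetical model of the backing file; not the kernel's struct file. */
struct backing_model {
    bool o_direct;      /* fd was opened with O_DIRECT */
    bool has_aio;       /* fops provides both aio_read and aio_write */
    bool has_aops;      /* address space ops are usable */
    bool has_transform; /* a data transform is configured */
};

enum io_path {
    PATH_AIO,        /* proposed: queue the request with aio */
    PATH_TRANSFORM,  /* copy the data through the transform */
    PATH_AOPS,       /* go through address space ops */
    PATH_FOPS_WRITE  /* fall back to fops->write */
};

/* Pick how the consumer thread would service one request. */
enum io_path choose_path(const struct backing_model *f)
{
    /* Assumption: aio is only attempted when no transform is needed. */
    if (f->o_direct && f->has_aio && !f->has_transform)
        return PATH_AIO;
    if (f->has_transform)
        return PATH_TRANSFORM;
    if (f->has_aops)
        return PATH_AOPS;
    return PATH_FOPS_WRITE;
}
```

The blocking problem (waiting on the queue and on aio completions at the
same time) is deliberately left out of this sketch.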

I suspect that this would give a pretty noticeable performance improvement
in the loop driver (especially on SCSI/SATA storage).

The loop driver still has issues though.  It cannot grow and it has a
pretty odd hardcoded limit (256 devices) which quickly becomes a
scalability issue.

The former problem could possibly be addressed by having a parameter for
SET_STATUS that lets you set the size of the device to be greater than
the size of the underlying file.  If a bio comes in for an offset beyond
the end of the underlying file, the driver would have to be smart enough
to ftruncate the file.  The error handling is a bit tough (you'll have to
make sure that if ftruncate fails, you fail the read/write--extra points
if the failure is temporary, such that later on, if space is freed up, you
succeed).
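
Roughly what I have in mind for the error handling, as a userspace model
rather than driver code.  The names (struct loop_model,
loop_model_prepare(), fake_grow()) are hypothetical, and the grow callback
is only a stand-in for ftruncate on the backing file:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a loop device that advertises more than its file holds. */
struct loop_model {
    uint64_t file_size;    /* current size of the backing file, in bytes */
    uint64_t device_size;  /* size advertised for the device, in bytes */
    int (*grow)(void *ctx, uint64_t new_size); /* stand-in for ftruncate: 0 or -errno */
    void *ctx;
};

/* Call before servicing a request that covers [offset, offset + len). */
int loop_model_prepare(struct loop_model *lo, uint64_t offset, uint64_t len)
{
    uint64_t end = offset + len;
    int err;

    if (end < offset || end > lo->device_size)
        return -EINVAL;   /* overflow, or past the advertised device size */
    if (end <= lo->file_size)
        return 0;         /* already backed by the file */

    err = lo->grow(lo->ctx, end);
    if (err)
        return err;       /* fail this request; file_size is untouched,
                             so a later request simply tries again */

    lo->file_size = end;
    return 0;
}

/* Fake filesystem used to exercise the model: growth fails past capacity. */
struct fake_fs {
    uint64_t capacity;
};

int fake_grow(void *ctx, uint64_t new_size)
{
    const struct fake_fs *fs = ctx;

    return new_size <= fs->capacity ? 0 : -ENOSPC;
}
```

Because a failed grow leaves file_size alone, the "extra points" case
falls out for free: once space is freed up, the next request past the end
of the file retries the grow and succeeds.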

The hardcoded limit is a somewhat bigger problem.  The driver would likely
need a bit of reworking.  Since 256 is the limit based on minor number
allocation, you would have to either get some more device number space for
it or just have the ability to allocate dynamic numbers and rely on
udev/hotplug for folks that want more than 256.
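
For the dynamic-number option, the allocation side is not the hard part.
A toy userspace sketch (struct index_pool and its functions are invented
here, and this says nothing about how the numbers would map onto real
device nodes) that hands out the lowest free index and grows its table
instead of stopping at a fixed count:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator for device indices with no fixed upper bound. */
struct index_pool {
    unsigned char *used;  /* used[i] != 0 means index i is taken */
    size_t cap;           /* number of slots currently tracked */
};

/* Returns the lowest free index, or -1 if the table could not be grown. */
long index_pool_get(struct index_pool *p)
{
    size_t i, new_cap;
    unsigned char *grown;

    for (i = 0; i < p->cap; i++) {
        if (!p->used[i]) {
            p->used[i] = 1;
            return (long)i;
        }
    }

    /* Every tracked slot is taken: double the table rather than fail. */
    new_cap = p->cap ? p->cap * 2 : 8;
    grown = realloc(p->used, new_cap);
    if (!grown)
        return -1;
    memset(grown + p->cap, 0, new_cap - p->cap);

    i = p->cap;           /* first slot of the newly added range */
    p->used = grown;
    p->cap = new_cap;
    p->used[i] = 1;
    return (long)i;
}

/* Releases an index so that it can be handed out again. */
void index_pool_put(struct index_pool *p, long idx)
{
    if (idx >= 0 && (size_t)idx < p->cap)
        p->used[idx] = 0;
}
```

Start from a zeroed struct index_pool and free() the used array when
done.  Freed indices get reused before the table grows, so the numbers
stay dense.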

> This brings me to another question: Will people really be using
> file-based images for their VMs?  It seems to me that the performance
> of using a block device overshadows the convenience of a file image.

If the performance of the loop driver could be better (and fundamentally,
there's no reason it can't be pretty good), then I see no reason why using
file images wouldn't be the most common approach.

Files are quite a lot easier to manage than partitions.  Of course, I see
no reason why someone couldn't write a FUSE front-end to LVM :-)


Anthony Liguori

Xen-devel mailing list


