
Re: [Xen-users] Storage Systems for Virtual Disk Images

On 29/10/14 09:15, Eric wrote:
Thanks, again, Adam!

I'm sure that your input is going to help me out as we begin tuning our SAN!  :P

I'm curious: IIUC, you're using LVs as DRBD backing devices. Is that correct? Wouldn't it be more versatile to use empty partitions as DRBD backing devices, import them to the initiators as PVs (using iSCSI/AoE/GNBD/etc.), and then allocate LVs on the initiators (after coordinating the clients using cLVM)?

Well, I'm not sure what you mean by versatile. I allocate a single LV per VM on each SAN node; these LVs are used by DRBD (one LV on each SAN node) and exported with iSCSI. Therefore, any Xen node can boot any VM (as long as all the Xen cfg files are replicated). The danger is the lack of an oversight tool to ensure you don't start the same VM on multiple nodes at the same time.

However, I imagine using cLVM would require the PV to sit on top of the DRBD (which is what I originally had), and this causes performance issues. I forget the details, but it was along the lines of DRBD not using its cache properly because it is spread over such a large area.

Ideally, I would like to upgrade from Heartbeat to whatever the current tool is, and use it to do more than just keep one (and only one) of the two SANs running as primary. I.e., it would be nice if it also ensured that every VM was running all the time, restricted each VM to a single instance, etc.


Eric Pretorious

From: Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx>
To: xen-users@xxxxxxxxxxxxx
Sent: Sunday, October 19, 2014 7:13 PM
Subject: Re: [Xen-users] Storage Systems for Virtual Disk Images

On 19/10/14 10:08, Eric wrote:
Thanks, Adam:

That's all tremendously helpful information!

I do have two questions:

Layering - Are you using logical volumes (LVs) as DRBD backing devices and then using the DRBD resources as iSCSI LUNs? This seems like a fairly labor-intensive approach that might not work at all with automated provisioning.
Yes, using:
LVM2 (one LV per domU), with DRBD on top of each LV, exported via iSCSI.

Previously we were doing a single large DRBD with LVM on top of it, but we were advised by the DRBD authors/support to split into multiple DRBD devices to reduce the I/O load.

I don't use any sort of automated provisioning, as the config here is very static. However, it should be relatively easy to automate: add an LV on both primary and secondary, create the DRBD config file on both, connect and do the initial sync, then create the iSCSI export on both. You will probably also want to remember to adjust your failover system to promote the extra DRBD device (change to primary) and take over the new iSCSI export.
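Those provisioning steps could be scripted roughly as follows. Everything here (volume group, resource and host names, IPs, ports, sizes) is hypothetical, and the resource file is written to /tmp for illustration rather than /etc/drbd.d:

```shell
# 1. On both SAN nodes, create the backing LV (requires LVM + root):
#      lvcreate -L 20G -n vm01 vg0

# 2. On both nodes, write a per-VM DRBD resource file (DRBD 8.4 syntax):
cat > /tmp/vm01.res <<'EOF'
resource vm01 {
    device    /dev/drbd10;
    disk      /dev/vg0/vm01;
    meta-disk internal;
    on san1 { address 10.0.0.1:7810; }
    on san2 { address 10.0.0.2:7810; }
}
EOF

# 3. Initialise metadata and start the initial sync:
#      drbdadm create-md vm01 && drbdadm up vm01    (both nodes)
#      drbdadm primary --force vm01                 (primary node only)

# 4. Export /dev/drbd10 via the iSCSI target on both nodes, and add the new
#    resource to the failover scripts so promotion and export follow the cluster.
echo "wrote $(wc -l < /tmp/vm01.res)-line resource file"
```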

Networking - What mode are you using to bond the 2 x 1G dom0 interfaces? e.g., balance-rr, active-backup, balance-xor, broadcast, balance-tlb, or balance-alb?

Not using bonding at all; I went through all sorts of configs and variations there. Started with 8 x 1G on the SAN and 1 x 1G on the dom0. Eventually I ended up with 1 x 10G on each SAN, plus 1 x 10G for DRBD (crossover). Each dom0 has 3 x 1G ethernet: two used for iSCSI and one for the "user" LAN. The iSCSI interfaces are configured as two discrete ethernet devices on the same LAN subnet (e.g., … and …); the primary SAN server is … and the secondary ….
iSCSI uses multipath to make one connection over each interface to the same destination (a floating IP configured on the SAN servers).
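On the multipath side, a minimal /etc/multipath.conf fragment for this kind of two-path setup might look like the following; these options are an illustration of the approach, not the actual config from this thread:

```
defaults {
    path_grouping_policy  multibus          # both iSCSI sessions in one path group
    path_selector         "round-robin 0"   # alternate I/O across the two 1G links
    failback              immediate         # resume using a path as soon as it recovers
}
```

With open-iscsi, the two sessions come from binding one iface to each NIC (iscsiadm -m iface, setting iface.net_ifacename) and logging in to the floating portal IP through both; multipath -ll should then show a single map with two active paths.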

I had considered other options, such as creating four connections from each dom0: two to … (primary) and two to … (secondary). This would remove the need for a floating IP, etc., but in practice I've not had any issues with the floating IP.

I can forcibly shut down the primary and all VMs proceed without interruption (a few seconds of stalled I/O), or I can cleanly shut down the primary with no noticeable downtime or delay.

I would be interested to hear more about your configuration/setup/etc. For me, the system is working well; the I/O layer is still underperforming, but I no longer get I/O stalls, and I get "good" performance (i.e., the users are happy). I suspect some parts could be tweaked further, but I haven't had the time to work on that.

If you want a lot more information on the problems I had, and the various configurations (both hardware + software/etc) please search on the linux-raid mailing list, and on this list (archives of both). Most of my efforts were over a period of 15+ months starting around January 2013.


Thanks, again!

Eric Pretorious

From: Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx>
To: xen-users@xxxxxxxxxxxxx
Sent: Thursday, October 16, 2014 6:36 PM
Subject: Re: [Xen-users] Storage Systems for Virtual Disk Images

Apologies for my blank response...

I personally use a Linux-HA + DRBD + LVM + Linux iSCSI solution, and it works very well. Some things that took a lot of time to solve include:
1) A whole bunch of network cabling/config issues; now using 10G for DRBD, 10G for the iSCSI server, and 2 x 1G with multipath for the dom0s.
2) Unexpectedly poor performance with HDDs: concurrent random access from multiple domUs does not work well with HDDs. My solution was to upgrade to SSDs.
3) Unexpectedly poor performance with SSDs. This came down to testing the wrong thing when estimating the expected performance level. Test with small (e.g., 4k) random reads/writes and use those results; unless your VMs only do large reads/writes, and these really do get merged, you will find performance limited by the 4k request size.
4) Still poor performance from the SSDs (DRBD). Change LVM so that it is below DRBD: i.e., one LV for each domU, then DRBD on top for each domU, and finally iSCSI exporting the DRBD devices.
5) Still poor performance from the SSDs (DRBD). DRBD needs to do its own write for every domU write, plus LVM does its own, etc.; each layer adds overhead. The solution for me was to disable DRBD's disk-barrier, disk-flushes, and md-flushes options.
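For point 5: in DRBD 8.4 those three options live in the disk section of the resource (or common) configuration. A fragment like this is only safe when the storage has battery/flash-backed caches, or when you accept the crash-consistency risk; treat it as an illustration of the setting names rather than a recommendation:

```
disk {
    disk-barrier no;   # don't issue write barriers to the backing device
    disk-flushes no;   # don't force the backing device's volatile cache to flush
    md-flushes   no;   # likewise for DRBD's own metadata writes
}
```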

Other things that helped along the way include:
echo noop > /sys/block/${disk}/queue/scheduler   # FIFO scheduling; seek-sorting buys nothing on SSDs/iSCSI
echo 128 > /sys/block/${disk}/queue/nr_requests  # block-layer queue depth
echo 4096 > /sys/block/md1/md/stripe_cache_size # RAID5, test the correct value for your array
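The small-random-I/O testing mentioned in point 3 can be approximated with nothing but dd. This is a rough sketch (file name, sizes, and iteration count are all arbitrary); a proper benchmark tool such as fio will give far more meaningful numbers:

```shell
# Crude 4k random synchronous-write test using only dd.
f=/tmp/randio.bin
dd if=/dev/zero of="$f" bs=1M count=32 status=none   # 32 MiB file = 8192 x 4k blocks
start=$(date +%s%N)
for i in $(seq 1 200); do
  off=$((RANDOM % 8192))                             # pick a random 4k block
  dd if=/dev/urandom of="$f" bs=4k count=1 seek=$off \
     conv=notrunc oflag=dsync status=none            # one synchronous 4k write
done
end=$(date +%s%N)
echo "200 x 4k random synchronous writes in $(( (end - start) / 1000000 )) ms"
```

Comparing this against a large sequential dd makes the gap described in point 3 obvious on most hardware.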

Currently, I have 8 dom0s with about 25 domUs, and it is working well, including transparent failover on iSCSI server failure. If anyone wants more details, I'm happy to share.

Most of the above is perhaps not specific to Xen but to storage in general; still, I hope it is relevant here.

I'd also ask that, if you get any direct responses, you please summarise them back to the list and/or update the wiki so others can find the information more easily.


On 17/10/14 11:34, Eric wrote:
Hello, All:

I'd built a highly-available, redundant iSCSI SAN in our lab a while back as a proof of concept (using Linux-HA, DRBD, and the Linux iSCSI Target Framework), and it worked pretty well. As I get ready to build the infrastructure for our [production] cloud, I want to re-examine the topic, but I just haven't got enough time to sift through all of the outdated or speculative information on the Internet, so I'm reaching out to the list for guidance on hosting virtual disk images.

e.g., I'm curious about other distributed, clustered storage systems (e.g., Gluster, Ceph, Sheepdog); other SAN technologies besides iSCSI (e.g., AoE); and the various targets. e.g., There are at least four different iSCSI targets available for Linux:
And, there are currently five different AoE targets available for Linux:
  • vblade, a userspace daemon that is part of the aoetools package.
  • kvblade, a Linux kernel module.
  • ggaoed, a userspace daemon that takes advantage of Linux-specific performance features.
  • qaoed, a multithreaded userspace daemon.
  • aoede, a userspace daemon with experimental protocol extensions.
I know that it's a lot to ask, but I really need help with this enormous topic and I'd be thankful for any experience, knowledge, or guidance here.

Eric Pretorious

Adam Goryachev
Website Managers
P: +61 2 8304 0000                    adam@xxxxxxxxxxxxxxxxxxxxxx
F: +61 2 8304 0001                     www.websitemanagers.com.au