
Re: [Xen-users] Storage Systems for Virtual Disk Images

On 29/10/14 09:15, Eric wrote:
Thanks, again, Adam!

I'm sure that your input is going to help me out as we begin tuning our SAN!  :P

I'm curious: IIUC, you're using LVs as DRBD backing devices. Is that correct? Wouldn't it be more versatile to use empty partitions as DRBD backing devices, import them to the initiators as PVs (using iSCSI/AoE/GNBD/etc.), and then allocate LVs on the initiators (after coordinating the clients using cLVM)?

Well, I'm not sure what you mean by versatile. I allocate a single LV per VM on each SAN node; these LVs are used by DRBD (one LV on each SAN node) and exported with iSCSI. Therefore, any Xen node can boot any VM (as long as all the Xen cfg files are replicated). The danger is the lack of an oversight tool to ensure you don't start the same VM on multiple nodes at the same time.

However, I imagine using cLVM would require the PV to sit on top of the DRBD (which is what I originally had), and this causes performance issues. I forget the details, but it was along the lines of DRBD not using its cache properly because it is spread over such a large area.

Ideally, I would like to upgrade from Heartbeat to whatever the current tool is, and use it to do more than just keep one (and only one) of the two SANs running as primary. I.e., it would be nice if it also ensured that every VM was running all the time, restricted each VM to a single instance, etc.


Eric Pretorious

From: Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx>
To: xen-users@xxxxxxxxxxxxx
Sent: Sunday, October 19, 2014 7:13 PM
Subject: Re: [Xen-users] Storage Systems for Virtual Disk Images

On 19/10/14 10:08, Eric wrote:
Thanks, Adam:

That's all tremendously helpful information!

I do have two questions:

Layering - Are you using logical volumes (LVs) as DRBD backing devices and then using the DRBD resources as iSCSI LUNs? This seems like a fairly labor-intensive approach that might not work at all with automated provisioning.
Yes, using:
LVM2 (one LV per domU), with DRBD on top of each LV, exported via iSCSI.

Previously we were doing a single large DRBD with LVM on top of it, but we were advised by the DRBD authors/support to split into multiple DRBD devices to reduce the I/O load.

I don't use any sort of automated provisioning, as the config here is very static. However, it should be relatively easy to automate: add an LV on both primary and secondary, create the DRBD config file on both, connect and do the initial sync, then create the iSCSI export on both. You will probably also want to remember to adjust your failover system to promote the extra DRBD device (change to primary) and take over the new iSCSI export.
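Those provisioning steps could be scripted roughly as follows. Everything here (volume group, resource and host names, IPs, ports, sizes) is hypothetical, and the resource file is written to /tmp for illustration rather than /etc/drbd.d:

```shell
# 1. On both SAN nodes, create the backing LV (requires LVM + root):
#      lvcreate -L 20G -n vm01 vg0

# 2. On both nodes, write a per-VM DRBD resource file (DRBD 8.4 syntax):
cat > /tmp/vm01.res <<'EOF'
resource vm01 {
    device    /dev/drbd10;
    disk      /dev/vg0/vm01;
    meta-disk internal;
    on san1 { address 10.0.0.1:7810; }
    on san2 { address 10.0.0.2:7810; }
}
EOF

# 3. Initialise metadata and start the initial sync:
#      drbdadm create-md vm01 && drbdadm up vm01    (both nodes)
#      drbdadm primary --force vm01                 (primary node only)

# 4. Export /dev/drbd10 via the iSCSI target on both nodes, and add the new
#    resource to the failover scripts so promotion and export follow the cluster.
echo "wrote $(wc -l < /tmp/vm01.res)-line resource file"
```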

Networking - What mode are you using to bond the 2 x 1G dom0 interfaces? e.g., balance-rr, active-backup, balance-xor, broadcast, balance-tlb, or balance-alb?

Not using bonding at all; I went through all sorts of configs and variations there. Started with 8 x 1G on the SAN and 1 x 1G on the dom0. Eventually I ended up with 1 x 10G on each SAN, plus 1 x 10G for DRBD (crossover). Each dom0 has 3 x 1G ethernet: two used for iSCSI and one for the "user" LAN. The iSCSI interfaces are configured as two discrete ethernet devices on the same LAN subnet (e.g., … and …); the primary SAN server is … and the secondary ….
iSCSI uses multipath to make one connection over each interface to the same destination (a floating IP configured on the SAN servers).
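On the multipath side, a minimal /etc/multipath.conf fragment for this kind of two-path setup might look like the following; these options are an illustration of the approach, not the actual config from this thread:

```
defaults {
    path_grouping_policy  multibus          # both iSCSI sessions in one path group
    path_selector         "round-robin 0"   # alternate I/O across the two 1G links
    failback              immediate         # resume using a path as soon as it recovers
}
```

With open-iscsi, the two sessions come from binding one iface to each NIC (iscsiadm -m iface, setting iface.net_ifacename) and logging in to the floating portal IP through both; multipath -ll should then show a single map with two active paths.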

I had considered other options, such as creating four connections from each dom0: two to … (primary) and two to … (secondary). This would remove the need for a floating IP, etc., but in practice I've not had any issues with the floating IP.

I can forcibly shut down the primary and all VMs proceed without interruption (a few seconds of stalled I/O), or I can cleanly shut down the primary with no noticeable downtime or delay.

I would be interested to hear more about your configuration/setup/etc. For me, the system is working well; the I/O layer is still underperforming, but I no longer get I/O stalls, and I get "good" performance (i.e., the users are happy). I suspect some parts could be tweaked further, but I haven't had the time to work on that.

If you want a lot more information on the problems I had, and the various configurations (both hardware + software/etc) please search on the linux-raid mailing list, and on this list (archives of both). Most of my efforts were over a period of 15+ months starting around January 2013.


Thanks, again!

Eric Pretorious

From: Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx>
To: xen-users@xxxxxxxxxxxxx
Sent: Thursday, October 16, 2014 6:36 PM
Subject: Re: [Xen-users] Storage Systems for Virtual Disk Images

Apologies for my blank response...

I personally use a Linux-HA + DRBD + LVM + Linux iSCSI solution, and it works very well. Some things that took a lot of time to solve include:
1) A whole bunch of network cabling/config issues; now using 10G for DRBD, 10G for the iSCSI server, and 2 x 1G with multipath for the dom0s.
2) Unexpectedly poor performance with HDDs: concurrent random access from multiple domUs does not work well with HDDs. My solution was to upgrade to SSDs.
3) Unexpectedly poor performance with SSDs. This came down to testing the wrong thing when estimating the expected performance level. Test with small (e.g., 4k) random reads/writes and use those results; unless your VMs only do large reads/writes, and these really do get merged, you will find performance limited by the 4k request size.
4) Still poor performance from the SSDs (DRBD). Change LVM so that it is below DRBD: i.e., one LV for each domU, then DRBD on top for each domU, and finally iSCSI exporting the DRBD devices.
5) Still poor performance from the SSDs (DRBD). DRBD needs to do its own write for every domU write, plus LVM does its own, etc.; each layer adds overhead. The solution for me was to disable DRBD's disk-barrier, disk-flushes, and md-flushes options.
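For point 5: in DRBD 8.4 those three options live in the disk section of the resource (or common) configuration. A fragment like this is only safe when the storage has battery/flash-backed caches, or when you accept the crash-consistency risk; treat it as an illustration of the setting names rather than a recommendation:

```
disk {
    disk-barrier no;   # don't issue write barriers to the backing device
    disk-flushes no;   # don't force the backing device's volatile cache to flush
    md-flushes   no;   # likewise for DRBD's own metadata writes
}
```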

Other things that helped along the way include:
echo noop > /sys/block/${disk}/queue/scheduler   # FIFO scheduling; seek-sorting buys nothing on SSDs/iSCSI
echo 128 > /sys/block/${disk}/queue/nr_requests  # block-layer queue depth
echo 4096 > /sys/block/md1/md/stripe_cache_size # RAID5, test the correct value for your array
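The small-random-I/O testing mentioned in point 3 can be approximated with nothing but dd. This is a rough sketch (file name, sizes, and iteration count are all arbitrary); a proper benchmark tool such as fio will give far more meaningful numbers:

```shell
# Crude 4k random synchronous-write test using only dd.
f=/tmp/randio.bin
dd if=/dev/zero of="$f" bs=1M count=32 status=none   # 32 MiB file = 8192 x 4k blocks
start=$(date +%s%N)
for i in $(seq 1 200); do
  off=$((RANDOM % 8192))                             # pick a random 4k block
  dd if=/dev/urandom of="$f" bs=4k count=1 seek=$off \
     conv=notrunc oflag=dsync status=none            # one synchronous 4k write
done
end=$(date +%s%N)
echo "200 x 4k random synchronous writes in $(( (end - start) / 1000000 )) ms"
```

Comparing this against a large sequential dd makes the gap described in point 3 obvious on most hardware.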

Currently, I have 8 dom0s with about 25 domUs, and it is working well, including transparent failover on iSCSI server failure. If anyone wants more details, I'm happy to share.

Most of the above is perhaps not specific to Xen but to storage in general; still, I hope it is relevant here.

I'd also ask that, if you get any direct responses, you please summarise them back to the list and/or update the wiki so others can find the information more easily.


On 17/10/14 11:34, Eric wrote:
Hello, All:

I'd built a highly-available, redundant iSCSI SAN in our lab a while back as a proof of concept (using Linux-HA, DRBD, and the Linux iSCSI Target Framework), and it worked pretty well. As I get ready to build the infrastructure for our [production] cloud, I want to re-examine the topic, but I just haven't got enough time to sift through all of the outdated or speculative information on the Internet, so I'm reaching out to the list for guidance on hosting virtual disk images.

e.g., I'm curious about other distributed, clustered storage systems (e.g., Gluster, Ceph, Sheepdog); other SAN technologies besides iSCSI (e.g., AoE); and the various targets. e.g., There are at least four different iSCSI targets available for Linux:
And, there are currently five different AoE targets available for Linux:
  • vblade, a userspace daemon that is part of the aoetools package.
  • kvblade, a Linux kernel module.
  • ggaoed, a userspace daemon that takes advantage of Linux-specific performance features.
  • qaoed, a multithreaded userspace daemon.
  • aoede, a userspace daemon with experimental protocol extensions.
I know that it's a lot to ask, but I really need help with this enormous topic and I'd be thankful for any experience, knowledge, or guidance here.

Eric Pretorious

Adam Goryachev
Website Managers
P: +61 2 8304 0000                    adam@xxxxxxxxxxxxxxxxxxxxxx
F: +61 2 8304 0001                     www.websitemanagers.com.au