List-id: Xen user discussion <xen-users.lists.xen.org>
Adam:
All I mean is that, in your infrastructure, the LV's must be created by issuing LVM commands and DRBD commands on the *target* instead of being created on the *initiators*. AFAICT, most CMP's (e.g., CloudStack, OpenNebula, etc) allocate storage by issuing commands on the initiator - not on the target.
Thanks, again!
Eric Pretorious
From: Adam Goryachev <adam@xxxxxxxxxxxxxxxxxxxxxx>
To: xen-users@xxxxxxxxxxxxx
Sent: Tuesday, October 28, 2014 5:52 PM
Subject: Re: [Xen-users] Storage Systems for Virtual Disk Images
On 29/10/14 09:15, Eric wrote:
Thanks, again, Adam! I'm sure that your input is gonna help me out as we begin tuning our SAN! :P
I'm curious: IIUC, you're using LV's as DRBD backing devices. Is that correct? Wouldn't it be more versatile to use empty partitions as DRBD backing devices; import them to the initiators as PV's [using iSCSI/AoE/GNBD/etc] and then allocate LV's on the initiators (after coordinating the clients using cLVM)?
Well, I'm not sure what you mean by versatile. I allocate a single LV on each SAN per VM; these LV's are used by DRBD (one LV on each SAN node) and exported with iSCSI. Therefore, any xen node can boot any VM (as long as all the xen cfg files are replicated). The danger is the lack of some oversight tool ensuring you don't start the same VM on multiple nodes at the same time.
However, I imagine using cLVM would require the PV to sit on top of the DRBD (which is what I originally had), and this causes performance issues. I forget the details, but it was along the lines of DRBD not using its cache properly because it is spread over such a large area.
Ideally, I would like to upgrade from heartbeat to whatever the current tool is, and use it to do more than just keep one (and only one) of the two SANs running as primary. ie, it would be nice if it would also ensure that every VM was running all the time, restricted to only one instance of each VM, etc.
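As a rough sketch of that idea (not something I actually run here; the resource name and config path are made up), Pacemaker can model each VM as a cluster resource via the ocf:heartbeat:Xen agent, which restarts a stopped VM and guarantees only one instance runs at a time:

crm configure primitive vm_example ocf:heartbeat:Xen \
    params xmfile=/etc/xen/vm_example.cfg \
    op monitor interval=30s timeout=60s \
    op start timeout=120s \
    op stop timeout=300s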
Layering - Are you using logical volumes (LV's) as DRBD backing devices and then using the DRBD resources as iSCSI LUN's? This seems like a fairly labor-intensive approach that might not work at all with automated provisioning.
Yes, using:
RAID5
LVM2 (one LV per domU)
DRBD
iSCSI
Previously we were doing:
RAID5
DRBD
LVM2
iSCSI
However, I was advised by the DRBD authors/support to split into multiple DRBD's to reduce the IO load.
I don't use any sort of automated provisioning, as the config here is very static. However, it should be relatively easy to automate: simply add an LV on both primary/secondary, create the DRBD config file on both primary/secondary, connect/initial sync, and then create the iSCSI export on both primary/secondary (a rough sketch is below). You will probably also want to remember to adjust your failover system to add the extra DRBD (change to primary) and iSCSI export.
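For illustration only, a minimal sketch of those steps, assuming a volume group vg0, a new LV/resource called vm01, SAN hostnames san1/san2, placeholder replication addresses, and tgt as the iSCSI target (none of these names are from the real setup):

# On BOTH SAN nodes: create the backing LV and the DRBD resource definition
lvcreate -L 20G -n vm01 vg0
cat > /etc/drbd.d/vm01.res <<'EOF'
resource vm01 {
    device    /dev/drbd10;
    disk      /dev/vg0/vm01;
    meta-disk internal;
    on san1 { address 10.2.2.11:7710; }
    on san2 { address 10.2.2.12:7710; }
}
EOF
drbdadm create-md vm01
drbdadm up vm01
# On the intended initial sync source only:
drbdadm primary --force vm01
# On BOTH SAN nodes: export the DRBD device over iSCSI (tgt syntax as an example)
tgtadm --lld iscsi --op new --mode target --tid 10 -T iqn.2014-10.example.san:vm01
tgtadm --lld iscsi --op new --mode logicalunit --tid 10 --lun 1 -b /dev/drbd10
tgtadm --lld iscsi --op bind --mode target --tid 10 -I ALL

Then remember to add the new resource to the failover config so the surviving node promotes it and keeps the export available.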
Networking - What mode are you using to bond the 2 x 1G dom0 interfaces? e.g., balance-rr, active-backup, balance-xor, broadcast, balance-tlb, or balance-alb?
Not using bonding at all; I went through all sorts of configs and variations there. Started with 8 x 1G on the SAN and 1 x 1G on the dom0. Eventually I've ended up with 1 x 10G on each SAN, plus 1 x 10G for DRBD (crossover). Each dom0 has 3 x 1G ethernet, 2 used for iSCSI and one used for the "user" LAN. The iSCSI is configured as two discrete ethernet devices on the same LAN subnet (eg, 10.1.1.21/24 and 10.1.1.31/24); the primary SAN server is 10.1.1.11 and the secondary 10.1.1.12.
iSCSI uses multipath to make one connection over each interface to the same destination (floating IP configured on the SAN servers). I had considered other options, such as creating 4 connections from each dom0, two to 10.1.1.11 (primary) and two to 10.1.1.12 (secondary); this would remove the need for a floating IP, etc, but in practice, I've not had any issue with the floating IP.
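Purely as an illustration of that layout (the floating portal address 10.1.1.10 and the interface names below are placeholders, not the real values), the initiator side with open-iscsi plus dm-multipath looks roughly like:

iscsiadm -m iface -I iface-eth1 --op new
iscsiadm -m iface -I iface-eth1 --op update -n iface.net_ifacename -v eth1
iscsiadm -m iface -I iface-eth2 --op new
iscsiadm -m iface -I iface-eth2 --op update -n iface.net_ifacename -v eth2
iscsiadm -m discovery -t sendtargets -p 10.1.1.10
iscsiadm -m node -p 10.1.1.10 -I iface-eth1 --login
iscsiadm -m node -p 10.1.1.10 -I iface-eth2 --login
multipath -ll   # both sessions should show up as paths of a single dm device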
I can forcibly shut down the primary, and all VM's proceed without interruption (a few seconds of stalled IO), or else I can nicely shut down the primary, and there is no noticeable downtime/delay.
I would be interested to hear more about your configuration/setup/etc. For me, the system is working well; the IO layer is still underperforming, but I no longer get IO stalls, and I get "good" performance (ie, the users are happy). I suspect some parts could be tweaked further, but I haven't had the time to work on that.
If you want a lot more information on the problems I
had, and the various configurations (both hardware +
software/etc) please search on the linux-raid
mailing list, and on this list (archives of both).
Most of my efforts were over a period of 15+ months
starting around January 2013.
I personally use a Linux HA + DRBD + LVM + Linux iSCSI solution, and it works very well. Some things I took a lot of time to solve include:
1) A whole bunch of network cabling/config issues; now using 10G for the DRBD link, 10G for the iscsi server, and 2 x 1G on the dom0's with multipath.
2) Unexpected poor performance with HDD's: concurrent random access from multiple domU's does not work well with HDD. My solution was to upgrade to SSD.
3) Unexpected poor performance with SSD. This came down to testing the wrong thing when calculating the expected performance level. Test with small (eg 4k) random read/write and use those results; unless your VM's are only doing large read/write, and these really do get merged, you will find performance limited by the small (4k) request size (see the fio sketch after this list).
4) Still poor performance from SSD (DRBD). Change LVM so that it is below DRBD; ie, one LV for each domU, then on top is DRBD for each domU, and finally iscsi exports the DRBD devices.
5) Still poor performance from SSD (DRBD). DRBD needs to do its own write for every domU write, plus lvm does its own, etc; each layer adds overhead. The solution for me was to disable DRBD disk-barrier, disk-flushes and md-flushes (see the config sketch after this list).
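On point 3, a hedged illustration of the kind of test I mean - the fio parameters and target device below are made-up examples, not the exact benchmark I ran:

# WARNING: this writes to the named device - only point it at a scratch LV
fio --name=4k-randrw --filename=/dev/vg0/scratch --direct=1 \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 \
    --ioengine=libaio --runtime=60 --time_based --group_reporting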
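On point 5, the knobs live in the disk section of each DRBD resource. This is only a sketch in DRBD 8.4 syntax (resource name is hypothetical), and disabling flushes is only sensible if you understand the data-loss risk, eg with battery/capacitor-backed storage:

resource vm01 {
    disk {
        disk-barrier no;
        disk-flushes no;
        md-flushes   no;
    }
    # device/disk/on sections as usual
}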
Other things that helped along the way include:
echo noop > /sys/block/${disk}/queue/scheduler
echo 128 > /sys/block/${disk}/queue/nr_requests
echo 4096 > /sys/block/md1/md/stripe_cache_size  # RAID5, test the correct value for your array
Currently, I have 8 dom0's with about 25 domU's and it is working well, including transparently failing over on iscsi server failure. If anyone wants more details, I'm happy to share. Most of the above is perhaps not specific to xen but to storage in general; I hope it will still be relevant here.
I'd also ask that if you get any direct response, you please summarise and send it back to the list, and/or update the wiki so others can more easily find the information.
Regards,
Adam
On 17/10/14 11:34, Eric wrote:
Hello, All:
I'd built a highly-available, redundant iSCSI SAN in our lab a while back as a proof of concept (using Linux-HA, DRBD, and the Linux iSCSI Target Framework) and it worked pretty well. As I'm getting ready to build the infrastructure for our [production] cloud, I'm wanting to re-examine the topic, but I just haven't got enough time to sift through all of the outdated or speculative information on the Internet, so I'm reaching out to the list for some guidance on hosting virtual disk images.