
Re: [Xen-users] Solving the DRBD resync issue



Hi!

I am sorry for not answering sooner. My spam filter wasn't your friend and I failed to check it in time... :-(

On 05/23/2011 10:28 AM, Fajar A. Nugraha wrote:
On Mon, May 23, 2011 at 2:46 PM, Daniel Brockmann wrote:

The more I think about my DRBD issue, and the more I research on the net, the
more I tend to attribute the issue to limited CPU time for dom0.

First things first.
By "sync problems" in your previous post, did you mean both nodes
experienced split brain for the DRBD resource?

One machine (luckily the one that didn't have any important guests running) lost sync, so the other machine was the "newer" DRBD partner. But when the sync started again, all virtual guests of the "newer" DRBD partner became unavailable until I stopped the sync forcefully. Besides this, the sync speed was far below what I had when I first set up the DRBD pair on both XenServer machines.
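For the archives: the way I read the DRBD 8.3 docs, a split brain is resolved manually by picking a victim whose changes get discarded. A sketch, assuming the resource is called r0 (a placeholder; mine may be named differently):

    # on the node whose changes are thrown away (the split brain "victim")
    drbdadm secondary r0
    drbdadm -- --discard-my-data connect r0

    # on the surviving node (only needed if it is StandAlone)
    drbdadm connect r0

    # watch connection state and resync progress
    cat /proc/drbd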


When set up properly, you should NOT experience it, regardless of how
much CPU resource dom0 has. You should only experience SLOW disk I/O.
Split brain usually occurs if you don't set up fencing properly.
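If I understand the DRBD manual correctly, "fencing properly" would mean something like the following in drbd.conf. This is only a sketch: the crm-* handler scripts shipped with DRBD assume a Pacemaker cluster, which I don't run, so I would need my own handler there:

    resource r0 {
      disk {
        fencing resource-only;   # the docs suggest resource-and-stonith for dual-primary
      }
      handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }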

The funny thing: it worked fine in the beginning. I was playing around with XenMotion on a production mail server machine without any problems. It continuously synced at rates between 250 and 350 MByte/s, which is more than okay for our needs.


It would be
better to resolve _this_ instead of possibly reaching the same stage again
later with another replication technique, wouldn't it?

Reasons why I think it is an I/O and/or CPU time issue:

1. It worked properly when I still did not have 8 virtual guest systems
installed.
2. As soon as I start a DRBD resync, my virtual guests produce kernel error
messages like "INFO: task exim4:2336 blocked for more than 120 seconds.".
3. When starting both XenServer machines and syncing before starting the
virtual guests, a startup that usually completes in <5 minutes takes up to
60 minutes.

... which is exactly the SLOW I/O I mentioned above.

What I cannot understand is why this occurred weeks after the setup had been tested successfully. :-/

Okay, one thing changed: a few machines were added to the virtual pool. But even with these machines halted I could not sync the DRBD pair anymore.
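In the meantime I will at least try to throttle the resync so the guests stay usable while it runs. According to the DRBD 8.3 user guide, the rate can be changed at runtime; the device and resource names below are just my values:

    # temporarily cap the background resync rate
    drbdsetup /dev/drbd0 syncer -r 40M

    # afterwards, revert to the rate configured in drbd.conf
    drbdadm adjust r0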


I checked the XenWiki accordingly and found two promising entries that I'd
like to follow, if it's possible to apply them under a Citrix XenServer 5.6
system:

http://wiki.xensource.com/xenwiki/XenCommonProblems#head-413e1d74442772fd5a0a94f0655a009744096627

1. How can I limit the number of vcpus my dom0 has?
2. Can I dedicate a cpu core (or cores) only for dom0?

Especially the 2nd one appears to be exactly what I'm looking for, so I am
going to check whether I can configure that. What do _you_ think about it?
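As far as I can tell, both wiki entries boil down to hypervisor boot parameters. A sketch of what I would try: on open-source Xen these go on the xen.gz line in GRUB, and I assume XenServer keeps them in /boot/extlinux.conf:

    # appended to the xen.gz line of the boot entry:
    dom0_max_vcpus=2 dom0_vcpus_pin
    # dom0_max_vcpus limits the number of dom0 vcpus,
    # dom0_vcpus_pin pins them to physical cores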

This thread might be able to help you:
http://xen.1045712.n5.nabble.com/XenServer-adding-additional-CPUs-to-the-control-domain-td3325379.html

I then tried a different approach by installing the DRBD "loser" with XenServer 5.6 SP2, which reserves 4 VCPUs for dom0. I additionally carved out 4 GB of RAM for dom0. And I compiled DRBD from the latest sources this time (on the old systems I had used precompiled binaries). But this time I couldn't even create a XenServer storage repository on the DRBD device.
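What I tried was roughly the following (host UUID elided; /dev/drbd0 is just my device name):

    # create an LVM-backed storage repository on top of the DRBD device
    xe sr-create host-uuid=<host-uuid> type=lvm content-type=user \
       device-config:device=/dev/drbd0 name-label="DRBD SR"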


Personally, I suggest you step back and evaluate several things:
- do you REALLY need an active-active setup?
Active-active DRBD mandates protocol C (synchronous replication), which can
GREATLY slow down your throughput. If you can afford a small downtime,
better stick with async replication.

I could afford it in the worst case. And I will keep your suggestion in mind in case my alternative tests fail again.
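For reference, the protocol choice is just a line in drbd.conf (DRBD 8.3 syntax; r0 and the rate are placeholders for my values):

    resource r0 {
      protocol C;              # synchronous; mandatory for dual-primary
      net {
        allow-two-primaries;   # only needed for the active-active case
      }
      syncer {
        rate 300M;             # cap for the background resync
      }
    }

Dropping active-active would mean removing allow-two-primaries and, since a small downtime is acceptable for me, switching to protocol A (async).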

Currently I am running Debian Squeeze on the former DRBD "loser", with a DRBD device that still isn't syncing. However, now I am trying KVM ... and it appears to run at least as well as my XenServer setup did before ... let's see what happens the moment I wipe the XenServer machine to install Squeeze on it, too ... and start syncing both.

(At moments like this I wish I had more than just two hosts.)


- do you KNOW how much IOPS you need?
[...]

Hmm, it's not going to be a lot more than what my current machines need. The most performance-hungry one would be a virtualized Samba machine, I guess. I'll check that next week when I am at the machines again. Thanks for the suggestion.
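My plan for measuring it is simply iostat from the sysstat package; a quick sketch:

    # extended per-device statistics every 5 seconds
    iostat -xk 5
    # r/s + w/s gives the IOPS per device;
    # %util close to 100 means the disk is saturated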


All things considered, it might be that your best option would be
something like:
- get a separate server with lots of disks, set up RAID10, install a
storage appliance OS on top (e.g.
http://www.napp-it.org/index_en.html), then export it to your XenServer
either as NFS or iSCSI. While NFS/iSCSI induce some overhead, it should
be lower compared to using DRBD, OR

Hmm, my experience with NFS is that it's a lot slower than DRBD - ignoring, however, the syncer rates I've had lately. ;-)

iSCSI would be an option if I had the budget for that separate storage server. But unfortunately I don't have it. :-(


- drop active-active requirement, OR

Dual-primary worked for a while. I first want to check whether I can get it running again the way it was (but on a different platform). But if that were the only way, I could live with a "usual" single-primary DRBD setup, too. :-)


- beef up your XenServer (e.g. use fast storage like SSDs), upgrade your
XenServer/XCP version to get dom0 to use multiple CPU cores,
upgrade DRBD to the latest version, and set up proper fencing.

I am no longer confident that XenServer and my Magny-Cours CPUs work well together at all. That's why I am trying Squeeze and KVM now ... working around the limitations that Citrix built into their product.

Thank you very much for your suggestions. Regardless of whether it's KVM or Xen ... they will help me a lot.

CU,
Mészi.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users


 

