
Re: [Xen-API] XCP Storage resiliency

  • To: xen-api@xxxxxxxxxxxxx
  • From: George Shuklin <george.shuklin@xxxxxxxxx>
  • Date: Sat, 22 Jun 2013 04:01:55 +0400
  • Delivery-date: Sat, 22 Jun 2013 00:02:10 +0000
  • List-id: User and development list for XCP and XAPI <xen-api.lists.xen.org>

Well, you make me curious.

Here is a simple test. I've just done it on XCP 1.6:

1. Run fio (any kind of config, just to create some load; in my case it was 16 concurrent operations).
2. Put the domain into the paused state (xl pause), to catch it with some IO in flight.
3. Wait 150 seconds or more.
4. Resume the domain.
Amazingly, I got some nasty lags after resume, but no IO errors. I repeated the operation a few times:
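For reference, the pause part of the test above can be sketched as a small script. The domain name "testvm" and the DRY_RUN switch are my additions for illustration, not from the original test; fio is assumed to already be running inside the guest:

```shell
#!/bin/sh
# Sketch of the pause test above. DOMAIN is a placeholder name.
# DRY_RUN defaults to 1 (print the commands only); set DRY_RUN=0
# on a real dom0 to actually execute them.
DOMAIN="${DOMAIN:-testvm}"
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run xl pause "$DOMAIN"     # freeze the guest mid-IO
run sleep 150              # wait longer than the suspected 120 s timeout
run xl unpause "$DOMAIN"   # resume; then check the guest's dmesg for IO errors
```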

1. No problems.
2. Stale RCU trace: [5040301.442715] INFO: rcu_sched_state detected stall on CPU 0 (t=50594 jiffies)
3. No problems.

That's really strange, because I have seen IO errors on virtual machines a few times without any sign of problems in the dom0.

I'll research this topic more next week and report the results here.

On 22.06.2013 02:55, Nathan March wrote:
On 6/21/2013 1:16 AM, George Shuklin wrote:
I'm talking mostly not about dom0 but about the domU kernel. If IO takes more than 120 seconds, it will be processed as an 'IO timeout'. And this timeout is hardcoded (no /sys or /proc variables).

If you are getting an IO timeout in less than 2 minutes, that's a different question.
Hi George,

Sorry if I'm misunderstanding, but I don't believe it's a domU issue, as I've run identical virtual machines on our existing xen cluster and can take storage away from the dom0 for over 45 minutes without a problem. If the domU kernel was responsible for timing out the IO requests I'd be seeing some sort of kernel error on my domU's in this situation. Instead they just hang waiting for the IO and gracefully recover once it comes back (albeit, with very very high load averages as requests back up). I've done no patching/changes to our existing systems to get it to work like this, it just ended up that way. We're running stock 3.2.28 dom0's and domU's, so having to hack a domU kernel on XCP to achieve the same thing seems strange?

That being said, it is a 120s timeout that I'm hitting (the "NFS" line is me echoing to kmsg when I pull connectivity, for easy timestamping):

[ 2594.069594] NFS
[ 2609.574285] nfs: server not responding, timed out
[ 2717.464716] end_request: I/O error, dev tda, sector 18882056

[82688.790260] NFS
[82812.678888] end_request: I/O error, dev xvda, sector 18882056

So here the dom0 is timing out and the I/O error is returned back to the domU and then it goes read only.

If I manually unmount + remount the SR on the dom0 with "-o hard", I would expect the timeout to go away as nfs is no longer returning the timeout back to xcp. Instead what I see are the same 120s timeouts, making me think that this timeout is coming from some other layer instead?
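One way to narrow this down is to check which options the SR's NFS mount actually ended up with after the remount. A minimal sketch, assuming the usual /proc/mounts format; the MOUNTS override and the sample path below are mine, for illustration:

```shell
#!/bin/sh
# Sketch: list NFS mounts and their options to confirm hard vs soft,
# timeo and retrans. MOUNTS defaults to /proc/mounts; point it at a
# sample file for testing.
MOUNTS="${MOUNTS:-/proc/mounts}"
if [ -r "$MOUNTS" ]; then
    # field 2 = mount point, field 3 = fstype (nfs or nfs4), field 4 = options
    awk '$3 ~ /^nfs/ { print $2 ": " $4 }' "$MOUNTS"
fi
```

With -o hard the client should retry indefinitely instead of returning an error, so if the 120s errors persist after a hard remount, that would support the idea that the timeout comes from a layer above NFS.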


- Nathan

Xen-api mailing list



