[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-API] sudden grow of poll time, mem and CPU for openvswitch (XCP 1.1)



Good day.

Yesterday I hit a really strange case (with OVS 1.0 from XCP 1.1):

All networking in the pool became laggy: pings were sometimes nice and fast, sometimes arrived late, and sometimes rose above 500 ms (on the local network). I did some digging and found that the usual messages in ovs-vswitchd.log had changed:

normal operation:
Oct 15 19:51:36|03767|timeval|WARN|6 ms poll interval (0 ms user, 0 ms system) is over 9 times the weighted mean interval 1 ms (3989440 samples)
Oct 15 19:51:36|03768|coverage|INFO|Skipping details of duplicate event coverage for hash=2992b753 in epoch 3989440
Oct 15 19:51:41|03769|timeval|WARN|7 ms poll interval (0 ms user, 0 ms system) is over 11 times the weighted mean interval 1 ms (3989817 samples)
Oct 15 19:51:41|03770|coverage|INFO|Skipping details of duplicate event coverage for hash=c519633d in epoch 3989817

during the problem:

Oct 14 23:10:07|3470469|timeval|WARN|context switches: 0 voluntary, 401 involuntary
Oct 14 23:10:07|3470470|coverage|INFO|Skipping details of duplicate event coverage for hash=e8f1626f in epoch 261138151
Oct 14 23:10:08|3470471|timeval|WARN|673 ms poll interval (0 ms user, 660 ms system) is over 10 times the weighted mean interval 64 ms (261138152 samples)
Oct 14 23:10:08|3470472|timeval|WARN|context switches: 0 voluntary, 789 involuntary
Oct 14 23:10:08|3470473|coverage|INFO|Skipping details of duplicate event coverage for hash=e9c7938c in epoch 261138152
Oct 14 23:10:09|3470474|timeval|WARN|1343 ms poll interval (0 ms user, 1310 ms system) is over 14 times the weighted mean interval 93 ms (261138153 samples)


I've drawn a plot with 'Y' as the poll interval and 'X' as the timeline (actually, the log line number):

http://img696.imageshack.us/img696/315/20121015193023.png
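For reference, the plot data can be extracted from ovs-vswitchd.log with a few lines of Python. This is just a sketch of the approach (the helper name and regex are mine, not from any OVS tool), matching the timeval WARN lines quoted above:

```python
import re

# Matches ovs-vswitchd timeval warnings such as:
#   ...|timeval|WARN|673 ms poll interval (0 ms user, 660 ms system) ...
POLL_RE = re.compile(r"\|timeval\|WARN\|(\d+) ms poll interval")

def poll_intervals(lines):
    """Yield (line_number, interval_ms) for each poll-interval WARN line."""
    for n, line in enumerate(lines, 1):
        m = POLL_RE.search(line)
        if m:
            yield n, int(m.group(1))

# Example with two lines from the log excerpt above; in practice read
# the lines from /var/log/openvswitch/ovs-vswitchd.log instead.
sample = [
    "Oct 14 23:10:08|3470471|timeval|WARN|673 ms poll interval "
    "(0 ms user, 660 ms system) is over 10 times the weighted mean "
    "interval 64 ms (261138152 samples)",
    "Oct 14 23:10:08|3470472|timeval|WARN|context switches: "
    "0 voluntary, 789 involuntary",
]
print(list(poll_intervals(sample)))  # [(1, 673)]
```

Feeding the (line number, interval) pairs to any plotting tool gives the graph linked above.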

Most interesting: this behavior appeared on many hosts at almost the same time (all of them with different uptimes and different loads).
Second interesting point: after migrating the VMs off a host and rebooting it, some hosts again showed poll intervals of about 180 ms, and only a second reboot helped.

During that time ovs-vswitchd consumed an unrealistic amount of CPU (about 107%) and memory (over 780 MB). After the first reboot, CPU dropped to 70% and memory to 230 MB; the second reboot returned them to a normal state.

What could be the cause of this issue? Is it an OVS bug, or did something go wrong in the network?

Thanks.
_______________________________________________
Xen-api mailing list
Xen-api@xxxxxxxxxxxxx
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api
