
Re: [Xen-users] [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

On Wed, May 17, 2017 at 11:10 AM, George Dunlap
<george.dunlap@xxxxxxxxxx> wrote:
> On 17/05/17 10:45, Roger Pau Monné wrote:
>> On Wed, May 17, 2017 at 10:04:40AM +0100, George Dunlap wrote:
>>> cc'ing xen-devel & some relevant people
>> Please bear with me, my knowledge of iptables is 0.
>>> On Tue, May 16, 2017 at 4:21 PM, Antony Saba <awsaba@xxxxxxxxx> wrote:
>>>> Hello xen-users,
>>>> We are seeing the following errors repeatedly while trying to create
>>>> domains using a script, with the end result that 2 or 3 out of about
>>>> 20 VMs fail to start, and there are stale entries in the iptables for
>>>> domains that have been destroyed.
>>>>    2017-05-10 11:45:40 UTC libxl: error:
>>>> libxl_exec.c:118:libxl_report_child_exitstatus:
>>>> /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
>>>>    2017-05-10 11:50:52 UTC libxl: error:
>>>> libxl_exec.c:118:libxl_report_child_exitstatus:
>>>> /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4
>>>> I've been testing the following patch of vif-common.sh over the last
>>>> day and it appears to resolve the issue.  iptables exits with status 4
>>>> when "Another app is currently holding the xtables lock."
>> So, an iptables command can fail randomly because someone else is
>> holding an iptables internal lock?
>> Isn't there any way to tell the iptables command to just block until it
>> can get the lock? This seems extremely racy; aren't people then forced
>> to use something like:
>> while true; do
>>       iptables <...>
>>       rc=$?   # save it: the next test would overwrite $?
>>       if [ "$rc" -eq 0 ]; then
>>               break
>>       elif [ "$rc" -ne 4 ]; then
>>               error ...
>>       fi
>> done
>> When dealing with iptables?
> This seems to be a common problem ([1][2][3] come up right away).
> The basic solution seems to be to add the '-w' option to have it wait
> for the lock.  It does seem like that should be the default though.
> Having commands normally run inside of scripts randomly fail unless you
> add the special "don't randomly fail" option seems a bit mad.

Hmm, looking more into it:

* The -w option was introduced at the same time that the locking was
introduced [1].  So any version that has locking will have the -w
option.

* The bare -w option doesn't set a timeout, so if the xtables lock is
never released, the script will hang indefinitely.
A '-W' option was added in 2016 to provide a timeout, but this
is on even fewer systems than the -w option.  (My desktop, running
Debian Jessie, doesn't seem to have the -W option for instance.)

* The return code, RESOURCE_PROBLEM, is returned for other reasons too;
but it looks like for our purposes retrying might not be a bad
strategy in most of those cases either.

* But the -w option was only introduced in 2013, so it's likely there
are still old versions of iptables around that don't have it.

The good news is that versions without the -w option will *also* not
fail with error code 4 (although they may fail in other ways in the
case of concurrent accesses instead).
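
Whether a given iptables has -w could probably be probed from its help
text.  A minimal sketch, assuming that builds with the option list
"--wait" in their --help output (not verified against every version);
iptables_has_wait and IPTABLES_W are made-up names, not anything in
vif-common.sh:

```shell
# Hypothetical helper: succeed if the given iptables command advertises
# a --wait/-w option in its --help text.
# Assumption: builds that support -w mention "--wait" in that text.
iptables_has_wait() {
    ipt=${1:-iptables}
    "$ipt" --help 2>&1 | grep -q -e '--wait'
}

# Decide once which extra flag (if any) to pass to later iptables calls.
# IPTABLES_W is a hypothetical variable name.
if iptables_has_wait iptables; then
    IPTABLES_W="-w"
else
    IPTABLES_W=""
fi
```

If the probe fails for any reason (including iptables not being
installed), the sketch falls back to not passing -w, which matches the
behaviour of the older versions.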

So we have three options:

1. Always add -w.  This will effectively drop support for systems
which don't have iptables -w.  It also wouldn't allow us to reliably
set a timeout.

2. Always do a loop.  This should work on all systems, but is
redundant for systems with -w and unnecessary on systems without.  On
the other hand, it would allow us to implement our own timeout even on
systems without the -W option.

3. Try to check to see if the version of iptables we have supports -w,
and use it if available.  This should also work on all systems, but
introduces a bit of complication.  It also doesn't allow us to
reliably use a timeout.
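
To make option 2 concrete, here is a minimal sketch of a retry wrapper
with its own timeout.  It assumes only what is stated above, namely that
exit status 4 is the one worth retrying; retry_on_status4, RETRY_TIMEOUT
and RETRY_DELAY are made-up names, and the defaults are arbitrary:

```shell
# Hypothetical helper: run a command, retrying while it exits with
# status 4, until roughly RETRY_TIMEOUT seconds of sleeping have passed.
# RETRY_TIMEOUT and RETRY_DELAY are whole seconds.
retry_on_status4() {
    timeout=${RETRY_TIMEOUT:-10}
    delay=${RETRY_DELAY:-1}
    waited=0
    while :; do
        rc=0
        "$@" || rc=$?
        # Success, or any error other than 4, is final.
        if [ "$rc" -ne 4 ]; then
            return "$rc"
        fi
        # Budget spent: give up and pass status 4 back to the caller.
        if [ "$waited" -ge "$timeout" ]; then
            return 4
        fi
        sleep "$delay"
        waited=$((waited + delay))
    done
}
```

The existing iptables calls would then be prefixed with the wrapper.
Since status 4 is also returned for other resource problems, the
timeout is what keeps a persistent failure from looping forever.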

Any thoughts?


