[Xen-users] Possible locking issue with network-bridge


  • To: xen-users@xxxxxxxxxxxxx
  • From: Andrew Davidoff <davidoff@xxxxxxxxx>
  • Date: Sat, 12 Oct 2013 03:30:09 -0400
  • Delivery-date: Sat, 12 Oct 2013 07:31:36 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

Hi,

I'm running into a problem that I think has uncovered an issue with
how network-bridge does its locking. There may be a lower-level root
cause behind my problem, but the locking looks wrong to me either way.

I just installed Xen 4.2.3-23.el6 on a Scientific Linux 6.4 server.
Xen came from the CentOS xen4 repo that centos-release-xen sets up.
The server has two ethernet ports configured in an LACP bond. I have
Xen configured to use network bridging.

During boot, while xend was setting up bridging, the network link
went down and came back up as bond0 was renamed to pbond0 and so on,
but then it dropped for good and the xend bridging setup ended with
this error:

RTNETLINK answers: File exists

I narrowed this down to network-bridge running multiple times, with
the instances stomping on each other. The fact that it runs multiple
times may be the root cause of my issues, but even if it should only
be called once, there seems to be a problem with where claim_lock is
called.

claim_lock is called after the checks that would make network-bridge
exit early:

    if [ "${bridge}" = "null" ] ; then
        return
    fi

    if [ `brctl show | wc -l` != 1 ]; then
        return
    fi

    if link_exists "$pdev"; then
        # The device is already up.
        return
    fi

This seems problematic: if a second instance of network-bridge starts
while another instance holds the lock, but before the lock holder has
created any bridges or brought up $pdev, the late-comer passes all of
the checks above, waits for the first script to complete, takes the
lock for itself, and then proceeds to break networking.
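
To make the interleaving concrete, here is a minimal sketch that
replays it deterministically in a single shell. It does not touch real
locks or network devices; pdev_up, passes_checks and locked_setup are
hypothetical stand-ins for the early-exit checks and for the work done
while the lock is held.

```shell
#!/bin/sh
# Deterministic replay of the interleaving; no real locking or networking.
# All names here are illustrative stand-ins, not part of network-bridge.
pdev_up=0   # 0 = pdev has not been created yet
setups=0    # how many times the bridge setup ran

# Stand-in for the early-exit checks: succeed only if nothing is set up yet.
passes_checks() { [ "$pdev_up" -eq 0 ]; }

# Stand-in for the work done between claim_lock and release_lock.
locked_setup() { setups=$((setups + 1)); pdev_up=1; }

# Instance A runs its checks, then claims the lock.
passes_checks && a_go=1 || a_go=0
# Instance B starts before A has created anything, so its checks pass too.
passes_checks && b_go=1 || b_go=0

# A does its work and releases the lock.
[ "$a_go" -eq 1 ] && locked_setup
# B now gets the lock and repeats the setup without re-checking.
[ "$b_go" -eq 1 ] && locked_setup

echo "setup ran $setups times"
```

Because B evaluated its checks before A changed anything, the setup
runs a second time even though the lock itself worked as intended.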

I moved the call to claim_lock to the beginning of op_start and
dropped in calls to release_lock before the possible early-exit
returns, and this seems to have solved the problem. Does this seem
like the right thing to do? And either way, if network-bridge
shouldn't be running more than once, what do you think might be
causing that? At a glance I think it should just be getting called
once, when XendPIF.py is first loaded, but maybe I'm overlooking
something.
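
For reference, the reordering I describe looks roughly like the sketch
below. This is not the actual patch: claim_lock, release_lock,
link_exists and brctl are replaced here by simplified stand-ins (a
mkdir-based lock and a marker file) so the ordering can be tried
without touching real interfaces, and the bridge and pdev values are
just examples.

```shell
#!/bin/sh
# Sketch of op_start with the lock taken before the early-exit checks.
# The helpers below are hypothetical stand-ins, not the real Xen helpers.
LOCKDIR="${TMPDIR:-/tmp}/nb-sketch-lock.$$"
STATE="${TMPDIR:-/tmp}/nb-sketch-state.$$"   # exists once setup has run

claim_lock()   { until mkdir "$LOCKDIR" 2>/dev/null; do sleep 1; done; }
release_lock() { rmdir "$LOCKDIR"; }
link_exists()  { [ -e "$STATE" ]; }
# Fake brctl: a header line, plus one bridge line once setup has run.
brctl() { echo "bridge name"; if [ -e "$STATE" ]; then echo "xenbr0"; fi; }

bridge="xenbr0"   # example values only
pdev="pbond0"
setup_count=0

op_start() {
    claim_lock "network-bridge"

    if [ "${bridge}" = "null" ]; then
        release_lock "network-bridge"
        return
    fi

    if [ $(brctl show | wc -l) != 1 ]; then
        release_lock "network-bridge"
        return
    fi

    if link_exists "$pdev"; then
        # The device is already up.
        release_lock "network-bridge"
        return
    fi

    # ... bridge creation and device renaming would happen here ...
    : > "$STATE"
    setup_count=$((setup_count + 1))

    release_lock "network-bridge"
}

op_start   # first caller does the setup
op_start   # second caller sees the bridge under the lock and returns
echo "setup ran $setup_count time(s)"
rm -f "$STATE"
```

With the checks evaluated while the lock is held, a late-comer sees
the state the first instance left behind and exits instead of redoing
the setup.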

Thanks.
Andy

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
