
Re: [Xen-devel] Xen 4.7 crash



On 6/6/2016 10:19 AM, Wei Liu wrote:
On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:
(CC Ian, Stefano and Wei)

Hello Aaron,

On 06/06/16 14:58, Aaron Cornelius wrote:
On 6/2/2016 5:07 AM, Julien Grall wrote:
Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:
This is with a custom application, we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.

Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
dom:(0)

Which suggests that some grants are still mapped in DOM0.


The CubieTruck only has 2GB of RAM. I allocate 512MB for dom0, which
means that only 48 of the Mirage domains (with 32MB of RAM each) would
work at the same time anyway, and that doesn't account for the various
inter-domain resources or the RAM used by Xen itself.

All the pages that belong to the domain could have been freed except the
ones referenced by DOM0. So the footprint of this domain will be limited
at that time.

I would recommend checking how many domains are running at that time
and whether DOM0 actually released all the resources.

If the p2m_teardown() function checked for NULL it would prevent the
crash, but I suspect Xen would be just as broken since all of my
resources have leaked away.  More broken in fact, since if the board
reboots at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is
going on.

The leakage could also happen from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.


We've done some more testing regarding this issue.  And further testing
shows that it doesn't matter if we delete the vchans before the domains
are deleted.  Those appear to be cleaned up correctly when the domain is
destroyed.

What does stop this issue from happening (using the same version of Xen
that the issue was detected on) is removing any non-standard xenstore
references before deleting the domain.  In this case our application
allocates permissions for created domains to non-standard xenstore
paths.  Making sure to remove those domain permissions before deleting
the domain prevents this issue from happening.

I am not sure I understand what you mean here. Could you give a quick
example?

So we have a custom xenstore path for our tool (/tool/custom/ for the sake of this example), and we then allow every domain created using this tool to read that path. When the domain is created, the domain is explicitly given read permissions using xs_set_permissions(). More precisely we:
1. retrieve the current list of permissions with xs_get_permissions()
2. realloc the permissions list to increase it by 1
3. update the list of permissions to give the new domain read only access
4. then set the new permissions list with xs_set_permissions()
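
The list manipulation in steps 2 and 3 can be sketched roughly as follows. This is only an illustration, not the application's actual code: struct perm_entry, enum perm_type and perm_list_add_reader() are hypothetical stand-ins so that the snippet builds without the Xen headers. In a real tool the array would come from xs_get_permissions() and be written back with xs_set_permissions().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the xenstore permission entry type,
 * defined locally so the sketch compiles without the Xen headers. */
enum perm_type { PERM_NONE = 0, PERM_READ = 1, PERM_WRITE = 2 };

struct perm_entry {
    unsigned int id;        /* domain ID the entry applies to */
    enum perm_type perms;   /* access granted to that domain */
};

/*
 * Grow the list by one entry and append a read-only entry for domid.
 * On success returns the (possibly moved) list and increments *count.
 * On allocation failure returns NULL and leaves the original list
 * and *count untouched, so the caller can still free the old list.
 */
struct perm_entry *perm_list_add_reader(struct perm_entry *list,
                                        unsigned int *count,
                                        unsigned int domid)
{
    assert(count != NULL);

    struct perm_entry *grown = realloc(list, (*count + 1) * sizeof(*grown));
    if (grown == NULL)
        return NULL;

    grown[*count].id = domid;
    grown[*count].perms = PERM_READ;
    (*count)++;
    return grown;
}
```

Note that a list built this way grows by one entry per created domain, so it keeps growing for as long as nothing removes entries again.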

We saw errors logged because this list of permissions was getting prohibitively large, but this error did not appear to be directly connected to the Xen crash I submitted last week. Or so we thought at the time.

We realized that we had forgotten to remove the domain from the permissions list when the domain is deleted (which would cause the error we saw). The application was updated to remove the domain from the permissions list:
1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain from the permissions list
4. update the permissions with xs_set_permissions()
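
Steps 2 and 3 of this removal can be sketched roughly as follows. Again this is an illustration under assumptions rather than the application's actual code: struct perm_entry, enum perm_type and perm_list_remove() are hypothetical names, defined locally so the snippet builds without the Xen headers. In a real tool the array would come from xs_get_permissions() and the shortened list would be passed to xs_set_permissions().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the xenstore permission entry type. */
enum perm_type { PERM_NONE = 0, PERM_READ = 1, PERM_WRITE = 2 };

struct perm_entry {
    unsigned int id;        /* domain ID the entry applies to */
    enum perm_type perms;   /* access granted to that domain */
};

/*
 * Remove the first entry whose id matches domid by shifting the
 * entries after it down by one slot. Returns true and decrements
 * *count if an entry was removed, false if domid was not found.
 */
bool perm_list_remove(struct perm_entry *list, unsigned int *count,
                      unsigned int domid)
{
    assert(count != NULL);

    for (unsigned int i = 0; i < *count; i++) {
        if (list[i].id != domid)
            continue;

        /* Move the tail of the array over the deleted slot. */
        memmove(&list[i], &list[i + 1],
                (*count - i - 1) * sizeof(list[0]));
        (*count)--;
        return true;
    }
    return false;
}
```

The function does not shrink the allocation; it only reduces the count that would be handed to xs_set_permissions().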

After we made that change, a load test over the weekend confirmed that the Xen crash no longer happens. We checked this morning first thing and confirmed that without this change the crash reliably occurs.

It does not appear to matter if we delete the standard domain xenstore
path (/local/domain/<id>) since libxl handles removing this path when
the domain is destroyed.

Based on this I would guess that the xenstore is hanging onto the VMID.


This is a somewhat strange conclusion. I guess the root cause is still
unclear at this point.

We originally tested a fix that explicitly cleaned up the vchans (created to communicate with the domains) before the libxl_domain_destroy() function is called, and there was no change. We have confirmed that the vchans do not appear to cause issues when they are not deleted prior to the domain being destroyed.

Our application did delete them eventually, but last week they were only deleted _after_ the domain was destroyed. I would guess that if they are not explicitly deleted they could cause this same problem.

Is it possible that something else relies on those xenstore nodes to
free up resources?

It was stated earlier in this thread that the VMID is only deleted once all references to it are destroyed. I would speculate that the xenstore permissions list is one of these references that could prevent a domain reference (and VMID) from being completely cleaned up.

- Aaron Cornelius

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

