Re: [Xen-devel] Xen 4.7 crash
On 6/6/2016 10:19 AM, Wei Liu wrote:
> On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:
>> (CC Ian, Stefano and Wei)
>>
>> Hello Aaron,
>>
>> On 06/06/16 14:58, Aaron Cornelius wrote:
>>> On 6/2/2016 5:07 AM, Julien Grall wrote:
>>>> Hello Aaron,
>>>>
>>>> On 02/06/2016 02:32, Aaron Cornelius wrote:
>>>>> This is with a custom application; we use the libxl APIs to
>>>>> interact with Xen. Domains are created using the
>>>>> libxl_domain_create_new() function, and domains are destroyed
>>>>> using the libxl_domain_destroy() function. The test in this case
>>>>> creates a domain, waits a minute, then deletes/creates the next
>>>>> domain, waits a minute, and so on. So I wouldn't be surprised to
>>>>> see the VMID occasionally indicate there are 2 active domains,
>>>>> since there could be one being created and one being destroyed
>>>>> within a very short window. However, I wouldn't expect to ever
>>>>> have 256 domains.
>>>>
>>>> Your log has:
>>>>
>>>> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
>>>> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
>>>>
>>>> Which suggests that some grants are still mapped in DOM0.
>>>>
>>>>> The CubieTruck only has 2GB of RAM, and I allocate 512MB for dom0,
>>>>> which means that only 48 of the Mirage domains (with 32MB of RAM
>>>>> each) would fit at the same time anyway.
>>>>
>>>> Which doesn't account for the various inter-domain resources or the
>>>> RAM used by Xen itself. All the pages that belong to the domain
>>>> could have been freed except the ones referenced by DOM0, so the
>>>> footprint of this domain will be limited at that point. I would
>>>> recommend you check how many domains are running at that time and
>>>> whether DOM0 has effectively released all the resources.
>>>>
>>>>> If the p2m_teardown() function checked for NULL it would prevent
>>>>> the crash, but I suspect Xen would be just as broken, since all of
>>>>> my resources have leaked away. More broken, in fact, since if the
>>>>> board reboots at least the applications will restart and domains
>>>>> can be recreated. It certainly appears that some resources are
>>>>> leaking when domains are deleted (possibly only on the ARM or
>>>>> ARM32 platforms). We will try to add some debug prints and see if
>>>>> we can discover exactly what is going on.
>>>>
>>>> The leakage could also happen from DOM0. FWIW, I have been able to
>>>> cycle 2000 guests overnight on an ARM platform.
>>>
>>> We've done some more testing regarding this issue. Further testing
>>> shows that it doesn't matter if we delete the vchans before the
>>> domains are deleted; those appear to be cleaned up correctly when
>>> the domain is destroyed.
>>>
>>> What does stop this issue from happening (using the same version of
>>> Xen that the issue was detected on) is removing any non-standard
>>> xenstore references before deleting the domain. In this case our
>>> application grants created domains permissions on non-standard
>>> xenstore paths. Making sure to remove those domain permissions
>>> before deleting the domain prevents this issue from happening.
>>
>> I am not sure to understand what you mean here. Could you give a
>> quick example?

So we have a custom xenstore path for our tool (/tool/custom/ for the
sake of this example), and we allow every domain created using this
tool to read that path. When the domain is created, the domain is
explicitly given read permission using xs_set_permissions(). More
precisely, we:

1. retrieve the current list of permissions with xs_get_permissions()
2. realloc() the permissions list to grow it by one entry
3. update the list of permissions to give the new domain read-only
   access
4. set the new permissions list with xs_set_permissions()
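In C, against the libxenstore API, the add side looks roughly like the
sketch below. This is a simplified illustration rather than our actual
application code: the helper name grant_read_access() is made up, error
handling is trimmed, and xsh is assumed to be a handle from xs_open(0).

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <xenstore.h>

  /* Grant the new domain read-only access to our custom path. */
  static int grant_read_access(struct xs_handle *xsh, uint32_t new_domid)
  {
      const char *path = "/tool/custom";
      unsigned int num = 0;

      /* 1. retrieve the current list of permissions */
      struct xs_permissions *perms =
          xs_get_permissions(xsh, XBT_NULL, path, &num);
      if (!perms)
          return -1;

      /* 2. grow the malloc'd list by one entry */
      struct xs_permissions *bigger =
          realloc(perms, (num + 1) * sizeof(*perms));
      if (!bigger) {
          free(perms);
          return -1;
      }
      perms = bigger;

      /* 3. give the new domain read-only access */
      perms[num].id = new_domid;
      perms[num].perms = XS_PERM_READ;

      /* 4. write the updated list back */
      bool ok = xs_set_permissions(xsh, XBT_NULL, path, perms, num + 1);
      free(perms);
      return ok ? 0 : -1;
  }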
We saw errors logged because this list of permissions was getting
prohibitively large, but those errors did not appear to be directly
connected to the Xen crash I submitted last week. Or so we thought at
the time.

We then realized that we had forgotten to remove the domain from the
permissions list when the domain is deleted (which would cause the
errors we saw). The application was updated to remove the domain from
the permissions list:

1. retrieve the permissions with xs_get_permissions()
2. find the entry for the domain ID that is being deleted
3. memmove() the remaining entries down by one to "delete" the old
   domain from the permissions list
4. update the permissions with xs_set_permissions()
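The removal side, continuing the same sketch (same includes, plus
<string.h> for memmove(); revoke_read_access() is likewise a made-up
name):

  /* Drop a dying domain's entry from the ACL of our custom path.
   * Assumes the domain appears at most once in the list. */
  static int revoke_read_access(struct xs_handle *xsh, uint32_t dead_domid)
  {
      const char *path = "/tool/custom";
      unsigned int num = 0, i;

      /* 1. retrieve the current permissions */
      struct xs_permissions *perms =
          xs_get_permissions(xsh, XBT_NULL, path, &num);
      if (!perms)
          return -1;

      /* 2. find the entry for the domain being deleted */
      for (i = 0; i < num; i++)
          if (perms[i].id == dead_domid)
              break;

      if (i < num) {
          /* 3. memmove() the remaining entries down by one */
          memmove(&perms[i], &perms[i + 1],
                  (num - i - 1) * sizeof(*perms));

          /* 4. write the shortened list back */
          if (!xs_set_permissions(xsh, XBT_NULL, path, perms, num - 1)) {
              free(perms);
              return -1;
          }
      }

      free(perms);
      return 0;
  }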
After we made that change, a load test over the weekend confirmed that
the Xen crash no longer happens. We checked first thing this morning
and confirmed that without this change the crash reliably occurs.

>>> It does not appear to matter if we delete the standard domain
>>> xenstore path (/local/domain/<id>), since libxl handles removing
>>> this path when the domain is destroyed. Based on this I would guess
>>> that the xenstore is hanging onto the VMID.
>
> This is a somewhat strange conclusion. I guess the root cause is
> still unclear at this point.
>
>>> We originally tested a fix that explicitly cleaned up the vchans
>>> (created to communicate with the domains) before the
>>> xen_domain_destroy() function is called, and there was no change.

We have confirmed that the vchans do not appear to cause issues when
they are not deleted prior to the domain being destroyed. Our
application did delete them eventually, but last week they were only
deleted _after_ the domain was destroyed. I would guess that if they
are never explicitly deleted they could cause this same problem.

> Is it possible that something else might rely on those xenstore
> nodes to free up resources?

It was stated earlier in this thread that the VMID is only deleted once
all references to it are destroyed. I would speculate that the xenstore
permissions list is one of those references, and that it can prevent a
domain reference (and its VMID) from being completely cleaned up.
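A quick way to test that speculation (a diagnostic sketch under the
same assumptions as the snippets above, additionally needing <stdio.h>)
is to check whether a destroyed domain's ID still shows up in the ACL
of the custom path; xenstore-ls -p should show the same information
from the command line:

  /* After the destroy completes, warn if the dead domid is still
   * listed in the permissions of our custom path. */
  static void check_stale_perm(struct xs_handle *xsh, uint32_t dead_domid)
  {
      unsigned int num = 0, i;
      struct xs_permissions *perms =
          xs_get_permissions(xsh, XBT_NULL, "/tool/custom", &num);

      if (!perms)
          return;

      for (i = 0; i < num; i++)
          if (perms[i].id == dead_domid)
              fprintf(stderr,
                      "stale xenstore permission for dead domain %u\n",
                      dead_domid);

      free(perms);
  }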
- Aaron Cornelius

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel