Re: [Xen-devel] Altp2m use with PML can deadlock Xen

On 5/10/19 5:42 PM, Tamas K Lengyel wrote:
On Thu, May 9, 2019 at 10:19 AM Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:

On 09/05/2019 14:38, Tamas K Lengyel wrote:
Hi all,
I'm investigating an issue with altp2m that can easily be reproduced
and leads to a hypervisor deadlock when PML is available in hardware.
I haven't been able to trace down where the actual deadlock occurs.

The problem seems to stem from hvm/vmx/vmcs.c:vmx_vcpu_flush_pml_buffer,
which calls p2m_change_type_one on all gfns that were recorded in the PML
buffer. The problem occurs when the PML-buffer-full vmexit happens
while the active p2m is an altp2m. Switching p2m_change_type_one to
work with the altp2m instead of the hostp2m, however, results in EPT
misconfiguration crashes.
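
For readers unfamiliar with the flush step described above, here is a rough user-space model of draining a PML buffer. It is only a sketch and not the Xen code: every toy_* name is hypothetical, the callback stands in for whatever per-GFN work the hypervisor does (such as the p2m_change_type_one call mentioned above), and the index convention (512 entries, index counting down from 511, an out-of-range index meaning "full") is my understanding of the hardware behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PML_ENTRIES 512   /* assumed: one 4 KiB page of 64-bit entries */
#define TOY_PAGE_SHIFT  12    /* assumed: 4 KiB pages */

/* Hypothetical per-GFN hook; stands in for the real per-entry work. */
typedef void (*toy_gfn_cb)(uint64_t gfn, void *ctx);

/* Example callback: remember the most recent GFN handed to us. */
void toy_record_last(uint64_t gfn, void *ctx)
{
    *(uint64_t *)ctx = gfn;
}

/*
 * Walk the logged entries and hand each GFN to cb.
 * idx is the next free slot; the index counts down as entries are logged.
 * Returns the number of entries processed.
 */
size_t toy_drain_pml(const uint64_t *buf, uint16_t idx,
                     toy_gfn_cb cb, void *ctx)
{
    size_t first, i, n = 0;

    assert(buf != NULL && cb != NULL);

    /* Index still at its initial value: nothing has been logged. */
    if (idx == TOY_PML_ENTRIES - 1)
        return 0;

    /* Out-of-range index means the buffer is full; otherwise skip the free slot. */
    first = (idx >= TOY_PML_ENTRIES) ? 0 : (size_t)idx + 1;

    for (i = first; i < TOY_PML_ENTRIES; i++) {
        cb(buf[i] >> TOY_PAGE_SHIFT, ctx);   /* logged address -> GFN */
        n++;
    }
    return n;
}
```

Note that the model has a single translation context. Which p2m the per-GFN hook should act on when an altp2m is active is exactly the open question in this thread, and the sketch does not try to answer it.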

Adding to the issue is that it seems to occur only when the altp2m has
remapped GFNs. Since PML records entries based on GFN, this leads me to
question whether it is safe at all to use PML when altp2m is used with
GFN remapping. However, AFAICT the GFNs in the PML buffer are not the
remapped GFNs and my understanding is that it should be safe as long
as the GFNs being tracked by PML are never the remapped GFNs.

Booting Xen with ept=pml=0 resolves the issue.

If anyone has any insight into what might be happening, please let me know.

I could have sworn that George spotted a problem here and fixed it.  I
shouldn't be surprised if we have more.

The problem that PML introduced (and this is mostly my fault, as I
suggested the buggy solution) is that the vmexit handler from one vcpu
pauses others to drain the PML queue into the dirty bitmap.  Overall I
wasn't happy with the design and I've got some ideas to improve it, but
within the scope of how altp2m was engineered, I proposed

As it turns out, that is vulnerable to deadlocks when you get two vcpus
trying to pause each other and waiting for each other to become

Makes sense.

I see this has been reused by the altp2m code, but it *should* be safe
from deadlocks now that it takes the hypercall_deadlock_mutex.
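
To illustrate why taking a domain-wide mutex with a trylock avoids the mutual-pause deadlock, here is a minimal user-space model. It is a sketch under stated assumptions, not the hypervisor implementation: all toy_* names are hypothetical, an atomic flag stands in for the mutex, and the model keeps the flag held for the whole paused period purely to make the contention easy to see.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_MAX_VCPUS 4       /* hypothetical fixed vcpu count */

struct toy_domain {
    atomic_flag pausing;                 /* stands in for the deadlock mutex */
    int pause_count[TOY_MAX_VCPUS];      /* > 0 means the vcpu is paused */
};

void toy_domain_init(struct toy_domain *d)
{
    atomic_flag_clear(&d->pausing);
    for (int i = 0; i < TOY_MAX_VCPUS; i++)
        d->pause_count[i] = 0;
}

/*
 * Try to pause every vcpu except self.
 * Returns false without blocking if another vcpu is already pausing;
 * the caller is expected to back off and retry later instead of waiting,
 * which is what breaks the "each waits for the other" cycle.
 */
bool toy_pause_others_begin(struct toy_domain *d, int self)
{
    assert(self >= 0 && self < TOY_MAX_VCPUS);

    if (atomic_flag_test_and_set(&d->pausing))
        return false;                    /* someone else got there first */

    for (int i = 0; i < TOY_MAX_VCPUS; i++)
        if (i != self)
            d->pause_count[i]++;
    return true;
}

/* Undo a successful toy_pause_others_begin() and release the flag. */
void toy_pause_others_end(struct toy_domain *d, int self)
{
    assert(self >= 0 && self < TOY_MAX_VCPUS);

    for (int i = 0; i < TOY_MAX_VCPUS; i++)
        if (i != self)
            d->pause_count[i]--;
    atomic_flag_clear(&d->pausing);
}
```

In this model, if vcpu 0 and vcpu 1 both try to pause the others, only one attempt succeeds; the loser gets false back and can let itself be paused before retrying, rather than spinning while it waits for the winner to stop.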

Is that already in staging or your x86-next branch? I would like to
verify whether the problem is still present with that change. I
tested with the Xen 4.12 release and that definitely still deadlocks.

I don't know if Andrew is talking about this patch (probably not, but it looks at least related):


Since there's a "Release-acked" tag on it, I think it's in 4.12.


