
Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs



Hi Andrew,

> > The bug is still here, so we can exclude a microcode issue.
>
> Good - that is one further angle excluded.  Always make sure you are
> running with up-to-date microcode, but it looks like we back to
> investigating a logical bug in libvmi or Xen.


I reimplemented a small test, without the Drakvuf/LibVMI layers, that injects
traps on one Windows API (NtCreateUserProcess), in the same way that Drakvuf
does.

I did some quick testing yesterday with a Python script that repeatedly starts
the binary to monitor the API and, at the same time, starts Ansible to run
"c:\Windows\system32\reg.exe /?" via WinRM, to trigger some process creation.
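
A minimal sketch of such a stress loop, not the actual stress-test.py from
the gist. The function names, the timeout, and the placeholder command are
assumptions for illustration; the real monitor binary and Ansible invocation
would be passed in as argument lists.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Placeholder command that exits immediately with status 0 (assumption).
# Replace it with the real monitor binary and the real Ansible command line.
PLACEHOLDER_CMD = [sys.executable, "-c", "pass"]


def run_once(monitor_cmd, trigger_cmd, timeout):
    """Start monitor and trigger together; True if both exit with status 0."""
    def run(cmd):
        try:
            return subprocess.run(cmd, timeout=timeout).returncode == 0
        except subprocess.TimeoutExpired:
            # A frozen guest would typically show up as a trigger timeout.
            return False

    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(run, [monitor_cmd, trigger_cmd]))
    return all(results)


def stress(monitor_cmd, trigger_cmd, iterations, timeout=60):
    """Return the index of the first failing iteration, or None if all pass."""
    for i in range(iterations):
        if not run_once(monitor_cmd, trigger_cmd, timeout):
            return i
    return None
```

Returning the index of the first failure, rather than a plain boolean, makes
it easy to compare how long runs with 1 and 4 VCPUs survive.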

The traps are working: I see the software breakpoint hit, the switch to the
default view for singlestepping, and the switch back to the execution view, so
that's already good.
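
The sequence described above can be modelled as a small state machine. This
is a pure-Python sketch and does not call Xen: the view ids, the Vcpu class,
and the handler names are all assumptions, and it tracks the state per VCPU,
which may not match how the real code switches views.

```python
from dataclasses import dataclass

DEFAULT_VIEW = 0    # assumed id: unmodified guest memory
EXECUTION_VIEW = 1  # assumed id: view that exposes the breakpoints


@dataclass
class Vcpu:
    """Hypothetical per-VCPU bookkeeping for the trap handler."""
    view: int = EXECUTION_VIEW
    singlestep: bool = False


def on_breakpoint(vcpu):
    """Breakpoint hit: step over the original instruction in the default view."""
    if vcpu.view != EXECUTION_VIEW or vcpu.singlestep:
        raise RuntimeError("breakpoint event while already singlestepping")
    vcpu.view = DEFAULT_VIEW
    vcpu.singlestep = True


def on_singlestep(vcpu):
    """Singlestep done: return to the execution view so the trap is armed again."""
    if vcpu.view != DEFAULT_VIEW or not vcpu.singlestep:
        raise RuntimeError("singlestep event without a pending breakpoint")
    vcpu.view = EXECUTION_VIEW
    vcpu.singlestep = False
```

Raising on an out-of-order event is a deliberate choice here: with several
VCPUs, an unexpected ordering is exactly the kind of symptom a race would
produce, so it is better surfaced than silently ignored.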

After a series of tests on 1 or 4 VCPUs, my domain ends up in one of 2 possible
states:
- frozen: the mouse doesn't move, so I would guess the VCPUs are blocked.

I'm calling the xc_(un)pause_domain APIs multiple times when I write to the
shadow copies, but it's always synchronous, so I doubt that they interfered and
"paused" the domain.

Also, the log output I have before I detect that Ansible failed to execute
shows that the resume succeeded and that Xen is ready to process VMI events.

- BSOD: that's the second possibility. Apparently I'm corrupting critical data
structures in the operating system, and the WinDbg analysis is inconclusive, so
I can't tell much.

Either way, I can't execute this test sequentially 10 000 times without a crash.
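
One way to rule out an unbalanced pause around the shadow-copy writes is to
tie every pause to exactly one unpause. The sketch below is an assumption
about how that could be structured, not the code in the gist: paused() is a
hypothetical helper, and pause and unpause stand in for whatever wrappers
call the real Xen pause and unpause functions.

```python
from contextlib import contextmanager


@contextmanager
def paused(pause, unpause):
    """Call pause() on entry and unpause() on exit, even if the body raises."""
    pause()
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so the calls stay balanced.
        unpause()
```

Usage would look like "with paused(do_pause, do_unpause): write_shadow_page()",
where all three callables are placeholders for the real operations.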

-> Could you look at the implementation and tell me if I misused the APIs
somewhere?
https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7

I used the compat APIs, like Drakvuf does.

@Tamas, could you check the traps implementation?

You also have stress-test.py, the small test suite that I used, and the
screenshot showing the stdout preceding a test failure, when Ansible couldn't
contact the WinRM service because the domain was frozen.

Note: I stole some code from libvmi, to handle page read/write in Xen.

PS: when the domain is frozen and I destroy it, a (null) entry remains in
xl list, even though my stress-test.py process is already dead.

I have 4 of these entries in my xl list right now.
It might be worth looking into that as well.

Best regards,
Mathieu


 

