
Re: [Xen-devel] Woes of NMIs and MCEs, and possibly how to fix



At 17:34 +0000 on 30 Nov (1354296851), Andrew Cooper wrote:
> Hello,
> 
> Yesterday, Tim and I spent a very long time in front of a whiteboard
> trying to develop a fix which covers all of the problems, and sadly
> it is very hard.  We did manage to come up with a lengthy solution
> which we think has no race conditions, but it relies on very large
> sections of reentrant code which can't use the stack or trash
> registers.  As such, it is not practical at all (assuming that any of
> us could actually code it).

For the record, we also came up with a much simpler solution, which I
prefer:
 - The MCE handler should never return to Xen with IRET.
 - The NMI handler should always return with IRET.
 - There should be no faulting code in the NMI or MCE handlers.

That covers all the interesting cases except (3), (4) and (7) below, and
a simple per-cpu {nmi,mce}-in-progress flag will be enough to detect
(and crash) on _almost_ all cases where that bites us (the other cases
will crash less politely from their stacks being smashed).
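
A minimal sketch of what such a per-cpu in-progress flag might look like,
assuming hypothetical names (illustrative C only, not existing Xen code);
the same pattern would apply to an mce_in_progress flag on the MCE path:

    /* Toy sketch of the per-cpu "handler already running" flag: its only
     * job is to detect re-entry and crash politely.  All names here are
     * illustrative stand-ins, not real Xen symbols. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* would be DEFINE_PER_CPU(bool, nmi_in_progress) in the hypervisor */
    static bool nmi_in_progress;

    static void crash_politely(const char *msg)
    {
        fprintf(stderr, "FATAL: %s\n", msg);
        exit(1);
    }

    static void nmi_work(void)
    {
        /* the real handler body: must contain no faulting code */
    }

    void nmi_entry(void)
    {
        if (nmi_in_progress)
            crash_politely("re-entered NMI handler; frame may be corrupt");
        nmi_in_progress = true;

        nmi_work();

        nmi_in_progress = false;
        /* ...and always return with IRET, per the rule above. */
    }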

Even if we go on to build some more bulletproof solution, I think we
should consider implementing that now, as the baseline.

Tim.

> As a result, I thought instead that I would outline all the issues we
> currently face.  We can then:
>  * Decide which issues need fixing
>  * Decide which issues need at least to be detected so we can crash
>    gracefully
>  * Decide which issues we are happy (or perhaps at least willing, if
>    not happy) to ignore
> 
> So, the issues are as follows.  (I have tried to list them in a
> logical order, with one individual problem per number, but please do
> point out if I have missed or misattributed any entries.)
> 
> 1) Faults on the NMI path will re-enable NMIs before the handler
> returns, leading to reentrant behaviour.  We should audit the NMI
> path to try to remove any needless cases which might fault, but
> getting a fault-free path will be hard (and would not, by itself,
> solve the reentrancy problem).
> 
> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> MCE itself if an MCE interrupts an NMI.
> 
> 3) SMM code executing an iret will re-enable NMIs.  There is nothing
> we can do to prevent this, and as an SMI can interrupt NMIs and MCEs,
> there is no way to predict if or when it may happen.  The best we can
> do is accept that it might happen, and try to deal with the
> after-effects.
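
To make (1)-(3) concrete, here is a toy model (not CPU-accurate) of the
NMI-blocking latch: delivery of an NMI sets it, and any subsequent iret
clears it, whether that iret comes from the NMI handler itself, from a
fault handler invoked on the NMI path, or from SMM code:

    #include <stdbool.h>
    #include <stdio.h>

    static bool nmi_blocked;                 /* the hardware latch */

    static void deliver_nmi(void)  { nmi_blocked = true;  }
    static void execute_iret(void) { nmi_blocked = false; }

    int main(void)
    {
        deliver_nmi();                       /* NMI handler starts */
        printf("NMIs blocked: %d\n", nmi_blocked);   /* 1 */

        execute_iret();                      /* e.g. a #PF handler returns,
                                              * or SMM executes an iret */
        printf("NMIs blocked: %d\n", nmi_blocked);   /* 0: a second NMI can
                                              * now re-enter the handler */
        return 0;
    }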
> 
> 4) "Fake NMIs" can be caused by hardware with access to the INTR pin
> (very unlikely in modern systems with the LAPIC supporting virtual wire
> mode), or by software executing an `int $0x2`.  This can cause the NMI
> handler to run on the NMI stack, but without the normal hardware NMI
> cessation logic being triggered.
> 
> 5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
> any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
> could only be caused by a bug in Xen, although it is also possible on a
> system with out interrupt remapping. (Where the host administrator has
> accepted the documented security issue, and decided still to pass-though
> a device to a trusted VM, and the VM in question has a buggy driver for
> the passed-through hardware)
> 
> 6) Because of interrupt stack tables, real NMIs/MCEs can race with
> their fake alternatives, where the real interrupt interrupts the fake
> one and corrupts the fake one's exception frame, losing the original
> context to return to.  (This is one of the two core problems of
> reentrancy with NMIs and MCEs.)
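
A toy illustration of the frame-clobbering in (6): every delivery
through an IST entry starts from the same fixed stack top, so a real NMI
arriving mid-way through a fake one simply writes its exception frame
over the first one's (the frame contents below are only symbolic):

    #include <inttypes.h>
    #include <stdio.h>

    struct exception_frame {
        uint64_t rip, cs, rflags, rsp, ss;
    };

    /* a single fixed slot, standing in for the top of the IST stack */
    static struct exception_frame ist_frame;

    static void deliver_via_ist(uint64_t interrupted_rip)
    {
        ist_frame.rip = interrupted_rip;   /* overwrites whatever was there */
    }

    int main(void)
    {
        deliver_via_ist(0x1000);           /* fake NMI (int $0x2)        */
        deliver_via_ist(0x2000);           /* real NMI lands mid-handler */

        /* the 0x1000 return context is gone for good */
        printf("saved rip = %#" PRIx64 "\n", ist_frame.rip);
        return 0;
    }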
> 
> 7) Real MCEs can race with each other.  If two real MCEs occur too
> close together, the processor shuts down (we can't avoid this).
> However, there is a large race window between the MCE handler
> clearing the MCIP bit of IA32_MCG_STATUS and the handler returning,
> during which a new MCE can occur and corrupt the exception frame.
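
A sketch of the window in (7), using hypothetical MSR-accessor stubs
rather than real Xen helpers:

    #include <stdint.h>

    #define MCG_STATUS_MCIP (1ULL << 2)      /* architectural MCIP bit */

    /* hypothetical stand-ins for the real MSR accessors */
    static uint64_t mcg_status = MCG_STATUS_MCIP;
    static uint64_t rdmsr_mcg_status(void)       { return mcg_status; }
    static void     wrmsr_mcg_status(uint64_t v) { mcg_status = v;    }

    static void handle_banks(void)
    {
        /* log/recover the per-bank errors */
    }

    void mce_handler(void)
    {
        handle_banks();

        /* clearing MCIP re-arms #MC delivery... */
        wrmsr_mcg_status(rdmsr_mcg_status() & ~MCG_STATUS_MCIP);

        /*
         * RACE WINDOW: from here until the iret that ends this handler,
         * a new MCE no longer shuts the machine down, but it will be
         * delivered through the same IST slot and corrupt this frame.
         */
    }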
> 
> 
> In addition to the above issues, we have two NMI-related bugs in Xen
> which need fixing (and which shall be fixed as part of the series
> addressing the above).
> 
> 8) The VMEXIT-reason-NMI path on Intel calls self_nmi() while NMIs
> are latched, causing the PCPU to fall into a loop of VMEXITs until
> the VCPU's timeslice has expired, at which point the return-to-guest
> path decides to schedule instead of resuming the guest.
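
One possible shape of a fix for (8), sketched with hypothetical names
(this is not the actual Xen patch): run the NMI work directly on the
VMEXIT path instead of sending a self-NMI that cannot be delivered while
NMIs are latched.

    /* Hypothetical sketch only: names do not correspond to real Xen code. */
    static void do_nmi_work(void)
    {
        /* whatever the IDT-based NMI handler would have done */
    }

    void vmexit_reason_nmi(void)
    {
        /*
         * NMIs are still blocked by hardware here, so self_nmi() would
         * just leave an NMI pending and bounce us straight back out of
         * the guest.  Call the handler body directly instead, and make
         * sure the eventual return uses iret so the latch is cleared.
         */
        do_nmi_work();
    }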
> 
> 9) The NMI handler, when returning to ring 3, will leave NMIs
> latched, as it uses the sysret path.
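
For (9), the rule is simply that a return which ends NMI handling must
go via iret, even when the destination is ring 3; a trivial sketch of
that decision (illustrative names only):

    #include <stdbool.h>

    enum exit_path { EXIT_SYSRET, EXIT_IRET };

    /* if this context entered via the NMI vector, only iret clears the
     * hardware NMI latch; otherwise the faster sysret path is fine */
    enum exit_path choose_exit(bool entered_via_nmi)
    {
        return entered_via_nmi ? EXIT_IRET : EXIT_SYSRET;
    }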
> 
> 
> As for one possible solution which we can't use:
> 
> If it were not for the sysret stupidness[1] of requiring the
> hypervisor to move to the guest stack before executing the `sysret`
> instruction, we could do away with the stack tables for NMIs and MCEs
> altogether, and the above craziness would be easy to fix.  However,
> the overhead of always using iret to return to ring 3 is not likely
> to be acceptable, meaning that we cannot "fix" the problem by
> discarding interrupt stacks and doing everything properly on the main
> hypervisor stack.
> 
> 
> Looking at the above problems, I believe there is a solution, based
> on the Linux NMI solution, if we are willing to ignore the problem of
> SMM re-enabling NMIs, and if we are happy to crash gracefully when
> mixes of NMIs and MCEs interrupt each other and trash their exception
> frames (in situations where we could technically fix things up
> correctly).
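
As a very rough sketch of the Linux-style idea referred to above, only
the in-progress/repeat flags are modelled here; the real Linux
implementation also preserves a copy of the interrupted frame in
assembly, which is not shown:

    #include <stdbool.h>

    static bool nmi_running;                 /* outer handler is active */
    static bool nmi_repeat;                  /* a nested NMI arrived    */

    static void nmi_work(void)
    {
        /* the real handler body */
    }

    void nmi_entry(void)
    {
        if (nmi_running) {
            /* nested entry: just note it and get out again quickly */
            nmi_repeat = true;
            return;
        }

        nmi_running = true;
        do {
            nmi_repeat = false;
            nmi_work();                      /* re-run if we were nested */
        } while (nmi_repeat);
        nmi_running = false;
    }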
> 
> As a question to the community: have I missed or misrepresented any
> points above which might influence the design of the solution?  I
> think the list is complete, but I would not be surprised if there is
> a case which has not been considered yet.
> 
> ~Andrew
> 
> 
> [1] In an effort to prevent a flamewar over my comment: the situation
> we find ourselves in now is almost certainly the result of unforeseen
> interactions between individual features, but we are left to pick up
> the many pieces in a way which can't be completely resolved.
> 
> -- 
> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> T: +44 (0)1223 225 900, http://www.citrix.com
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

