[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH V4] X86/vMCE: handle broken page with regard to migration



On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
> Ping?

Sorry I've been meaning to reply but didn't manage to yet. Also you
replied to V4 saying to ignore it, so I was half waiting for V5 but I
see this should actually be labelled V5 anyway.

I'm afraid I still don't fully grok the reason for the loop that goes
with:
+            /*
+             * At the last iter, count the number of broken pages after 
sending,
+             * and if there are more than before sending, do one or more iter
+             * to make sure the pages are marked broken on the receiving side.
+             */

Can we go through it one more time? Sorry.

Let me outline the sequence of events and you can point out where I'm
going wrong. I'm afraid this has turned out to be rather long, again I'm
sorry for that.

First we do some number of iterations with the guest live. If an MCE
occurs during this phase then the page will be marked dirty and we will
pick this up on the next iteration and resend the page with the dirty
type etc and all is fine. This all looks good to me, so we don't need to
worry about anything at this stage.

Eventually we get to the last iteration, at which point we pause the
guest. From here on in the guest itself is not going to cause an MCE
(e.g. by touching its RAM) because it is not running but we must still
account for the possibility of an asynchronous MCE of some sort e.g.
triggered by the error detection mechanisms in the hardware, cosmic rays
and such like.

The final iteration proceeds roughly as follows.

     1. The domain is paused
     2. We scan the dirty bitmap and add dirty pages to the batch of
        pages to process (there may be several batches in the last
        iteration, we only need to concern ourselves with any one batch
        here).
     3. We map all of the pages in the resulting batch with
        xc_map_foreign_bulk
     4. We query the types of all the pages in the batch with
        xc_get_pfn_type_batch
     5. We iterate over the batch, checking the type of each page, in
        some cases we do some incidental processing.
     6. We send the types of the pages in the batch over the wire.
     7. We iterate over the batch again, and send any normal page (not
        broken, xtab etc) over the wire. Actually we do this as runs of
        normal pages, but the key point is we avoid touching any special
        page (including ones marked as broken by #4)

Is this sequence of events accurate?

Now lets consider the consequences of an MCE occurring at various stages
here.

Any MCE which happens before #4 is fine, we will pick that up in #4 and
the following steps will do the right thing.

Note that I am assuming that the mapping step in #3 is safe even for a
broken page, so long as we don't actually try and use the mapping (more
on that later), is this true?

If an MCE occurs after #4 then the page will be marked as dirty in the
bitmap and Xen will internally mark it as broken, but we won't see
either of those with the current algorithm. There are two cases to think
about here AFAICT,
     A. The page was not already dirty at #2. In this case we know that
        the guest hasn't dirtied the page since the previous iteration
        and therefore the target has a good copy of this page from that
        time. The page isn't even in the batch we are processing So we
        don't particularly care about the MCE here and can, from the PoV
        of migrating this guest, ignore it.
     B. The page was already dirty (but not broken, we handled that case
        above in "Any MCE which happens before #4...") at #2 which means
        we have do not have an up to date copy on the target. This has
        two subcases:
             I. The MCE occurs before (or during) #6 (sending the page)
                and therefore we do not have a good up to date copy of
                that data at either end.
            II. The MCE occurs after #6, in which case we already have a
                good copy at the target end.

To fix B you have added an 8th step to the above:

        8. Query the types of the pages again, using
        xc_get_pfn_type_batch, and if there are more pages dirty now
        than we say at #4 (actually #5 when we scanned the array, but
        that distinction doesn't matter) then a new MCE must have
        occurred. Go back to #2 and try again.

This won't do anything for A since the page wasn't in the batch to start
with and so neither #4 or #8 will look at its type, this is good and
proper.

So now we consider the two subcases of B. Lets consider B.II first since
it seems to be the more obvious case.

In case B.II the target end already has a good copy of the data page,
there is no need to mark the page as broken on the far end, nor to
arrange for a vMCE to be injected. I don't know if/how we arrange for
vMCEs to be injected under these circumstances, however even if a vMCE
does get injected into the guest when it eventually gets unpaused on the
target then all that will happen is that it will needlessly throw away a
good page. However this is a rare corner case which is not worth
concerning ourselves with (it's largely indistinguishable from case A).
If the MCE had happened even a single cycle earlier then this would have
been a B.I event instead of a B.II one. In any case there is no need to
return to #2 and try again, everything will be just fine if we complete
the migration at this point.

In case B.I the MCE occurs before (or while) we send the page onto the
wire. We will therefore try to read from this page because we haven't
looked at the type since #4 and have no idea that it is now broken.
Reading from the broken page will cause a fault, perhaps causing a vMCE
to be delivered to dom0, which causes the kernel to kill the process
doing the migration. Or maybe it kills dom0 or the host entirely. Either
way the idea of looping again is rather moot.

Have I missed a case which needs thinking about?

I suspect B.I is the case where you are most likely to find a flaw in my
argument. Is there something else which is done in this case which would
allow us to continue?

Ian.





_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.