Xen project Mailing List

Re: [Xen-devel] [PATCH V11 4/5] p2m: Always use hostp2m when clipping rangesets

To: "George Dunlap" <George.Dunlap@xxxxxxxxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Thu, 20 Dec 2018 06:52:04 -0700

Cc: Wei Liu <wei.liu2@xxxxxxxxxx>, Razvan Cojocaru <rcojocaru@xxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, george.dunlap@xxxxxxxxxx, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>

Delivery-date: Thu, 20 Dec 2018 13:52:14 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

>>> On 20.12.18 at 13:59, <george.dunlap@xxxxxxxxxx> wrote: > On 12/20/18 8:20 AM, Jan Beulich wrote: >>>>> On 19.12.18 at 18:26, <george.dunlap@xxxxxxxxxx> wrote: >>> On 12/13/18 10:43 AM, Jan Beulich wrote: >>>>>>> On 13.12.18 at 11:22, <rcojocaru@xxxxxxxxxxxxxxx> wrote: >>>>> For my own part, I see no reason why not clipping end should not work >>>>> when updating the ranges only (as long as start continues to be <= >>>>> unclipped_end). >>>>> >>>>> Would that modification + testing of it help this series continue? >>>> >>>> I think so, at least as far as I'm concerned. But I think we really need >>>> George's opinion as well. >>> >>> We are going off into the weeds a little bit here I think. >>> >>> If I understand Jan's concern properly, he's concerned about a situation >>> like this: >>> >>> [start] p2m->max_mapped_pfn == 0xfff >>> 1. change_type_range ram => logdirty, [0x900, 0x1200) >>> >>> Obviously the actual p2m entries can only be changed from 0x900 to >>> 0xfff; but what about the logdirty ranges? At the moment, the result >>> will be a rangeset with [0x900, 0xfff]. >>> >>> Jan is asking whether the rangeset should instead be [0x900, 0x11ff]. >>> >>> So the time when it would matter would be a situation like the following: >>> >>> 2. p2m_set_entry(0x1100, M) >>> >>> 3. change_entry_type_global(ram => logdirty) >>> >>> 4. change_entry_type_global(logdirty => ram) >>> >>> Under the current regime gfn 0x1100 would be have type ram_rw both after >>> step 2, and after step 4. >>> >>> If we used Jan's suggestion, then it would be marked as ram_rw after >>> step 2, and logdirty after step 4. >> >> Afaict it would be marked logdirty also after step 2, at least >> effectively (to the outside world), due to ept_get_entry()'s call >> to p2m_recalc_type(). > > That's not what I'm seeing. Let's consider the ept entry for gfn 0x1100 > at/after the various stages: > > [start]: empty (valid bit clear) > 1. change_type_range doesn't touch this, so still empty. > 2. ept_set_entry(M) > - Calls recalc_type(). This will walk the ept table down to the > particular ept entry, resolving the `recalc` bit at each level. > - Finally it will set the entry to point to M, with the recalc bit > clear, and the entry *not* misconfigured. Well of course - if the caller specified p2m_ram_rw. But this is wrong for the caller to do for any page inside the logdirty range. ept_set_entry() is a function which is not itself implementing policy; it depends on higher levels getting things right together with the page tables being in proper state. > Guest writes will not trigger an EPT fault at this point, so the most > important part of the "outside world" will not effectively see a logdirty. > > What about ept_get_entry() after point 2? Well, it calls > p2m_recalc_type() with "recalc || ept->recalc". The first is > accumulated by walking down the ept tables; but in this case those bits > will already have been cleared by the recalc_type() at the top of > set_entry. And of course, the gfn's own ept entry will have the recalc > bit clear. > > So p2m_recalc_type() will be called with `recalc` set to zero. When > that's the case, it always returns the type passed to it, without > checking logdirty. > > Did I miss anything? I don't think so, except perhaps me having said ... >> It may well be that there are more bugs >> here (like ept_set_entry() not honoring this, but then again ... this. IOW what you describe might match the current situation, but I'm unconvinced it's intended/wanted behavior. >> this is perhaps something the callers should already take care >> of), but that's the behavior I'd expect, and why I think the >> range should not be clipped for the purpose of insertion into >> the rangeset. >> >>> But of course that's no different than what would happen if >>> max_mapped_pfn were 0x2000, but gfns 0x1000-11ff just happened to be empty. >>> >>> Under normal circumstances, neither of these situations should happen; >>> and in neither case will catastrophic consequences happen (unless you >>> were relying on hap_track_dirty_vram for something other than tracking >>> dirty vram). >>> >>> I'm inclined to say that ideally, change_type_range should pass an error >>> up if end > max_mapped_pfn. >>> >>> But of course, it doesn't return an error at the moment, so that's out >>> of scope for this series. >>> >>> I take it, Jan, that in the absence of changing the behavior, you'd like >>> the comment to look something like this? >>> >>> "Always clip the rangeset down to the host p2m. NB that this means the >>> logdirty_range will also be clipped, so in the future gfns in >>> (host_max_pfn, end) range won't be affected by change_entry_type_global. >>> We should probably return an error in this case instead, as it's almost >>> certainly a mistake; but that's left as a clean-up for another time." >> >> Well, not exactly. IMO at least p2m_change_type_range(..., >> 0, ULONG_MAX) should match p2m_change_entry_type_global(), >> with the exception of the rangeset modification (which in the >> p2m_change_entry_type_global() global case is replaced by >> modifying p2m->global_logdirty). > > That's one sensible interface I considered; but I don't think it's the > best one. It has the advantage that from an interface perspective it's > clean and satisfying. But I'm having difficulty imagining a situation > where that behavior would lead to better outcomes. On the contrary, the > only time I can imagine this situation happening at the moment is if > there were a bug in the device model -- in which case, returning an > error would be a much more helpful thing to do. I would agree if ->max_mapped_gfn was under qemu's direct control. But the value depends on guest behavior. It just so happens that for dirty vram tracking it's perhaps quite unlikely for the range to live above ->max_mapped_gfn, but I wouldn't be surprised if things broke when moving the frame buffer pretty high up in (guest) physical memory, far beyond RAM (and above 4Gb). Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.