
Re: [Xen-devel] <summary-1> (v2) Design proposal for RMRR fix



> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Friday, January 09, 2015 5:46 PM
> 
> >>> On 09.01.15 at 07:57, <kevin.tian@xxxxxxxxx> wrote:
> > 1) 'fail' vs. 'warn' upon gfn conflict
> >
> > Assigning a device which fails the RMRR conflict check (i.e. the
> > intended gfns are already allocated to other resources) brings an
> > unknown stability problem (the device may clobber those valid
> > resources) and potentially a security issue within the VM (though no
> > worse than what a malicious driver can do w/o a virtual IOMMU).
> >
> > So by default we should not move forward if a gfn conflict is detected
> > when setting up the RMRR identity mapping - the so-called 'fail' policy.
> >
> > One open question, though, is whether we want to allow the admin to
> > override the default 'fail' policy with a 'warn' policy, i.e. logging
> > the conflict details but letting the device assignment succeed. USB was
> > discussed as one example before (it happens to work today despite the
> > <1MB conflict), so it might be good to let enthusiasts try device
> > assignment, or to give flexibility to users who have already verified
> > that the predicted potential problem is not a real issue for their
> > specific deployment.
> >
> > I'd like to hear your votes on whether to provide such a 'warn' option.
> 
> Yes, I certainly see value in such an option, as a means to avoid the
> other possible perceived regression of no longer being able to pass
> through certain devices at all.
> 
> > 1.1) per-device 'warn' vs. global 'warn'
> >
> > Both Tim and Jan prefer 'warn' as a per-device option for the admin
> > rather than a global option.
> >
> > At first glance a per-device 'warn' option provides more fine-grained
> > control than a global option. Thinking it through, though, allowing one
> > device with a potential problem isn't any more correct or secure than
> > allowing multiple devices with potential problems. Even though in
> > practice a device like USB can work despite the <1MB conflict, as Jan
> > pointed out there are always corner cases we might not know about. So
> > as long as we open the door for one device, it already implies a
> > problematic environment to users, and the user's judgement on whether
> > he can live with the problem is not affected by how many devices the
> > door is opened for (he needs to study the warning messages and do the
> > verification either way if he chooses to live with it).
> >
> > Given that, imo if we agree to provide a 'warn' option, just providing
> > a global override (per-VM, of course) is acceptable and simpler.
> 
> If the admin determined that ignoring the RMRR requirements for one
> device is safe, that doesn't (and shouldn't) mean this is the case for
> all other devices too.

I don't think the admin can determine whether it's 100% safe. What the
admin can decide is whether he can live with the potential problem, based
on his purpose or on some experiments; only the device vendor knows when
and how the RMRR is used. So as long as 'warn' is enabled for one device,
it already means a problematic environment, and adding more devices is
just the same situation.

> 
> > 1.2) when to 'fail'
> >
> > There is one open question: whether we should fail immediately in the
> > domain builder if a conflict is detected.
> >
> > Jan's comment is yes, we should 'fail' the VM creation since it's an
> > error.
> >
> > My previous point was more about mimicking native behaviour, where a
> > device failure (in our case only a potential device failure, since the
> > VM is not powered on yet) doesn't impact the user until its function is
> > actually touched. In our case, even if the domain builder fails to
> > re-arrange guest RAM to skip the reserved regions, we still have a
> > centralized policy (either 'fail' or 'warn', per the conclusion above)
> > in the Xen hypervisor when the device is actually assigned. So a 'warn'
> > should be fine, though I don't insist on this strongly.
> 
> See my earlier reply: Failure to add a device to me is more like a
> device preventing a bare metal system from coming up altogether.

Not all devices are required for bare metal to boot; a broken device
causes a problem only when it is actually used in the boot process. Say
that at power-up the disk (inserted in a PCI slot) is broken (not sure
whether you would call such a thing 'failure to add a device'); it only
becomes an error when the BIOS tries to read the disk.

Note that the device assignment path is where it is actually decided
whether a device will be presented to the guest, not domain build time.
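To make the 'fail' vs. 'warn' decision point concrete, below is a rough
stand-alone sketch of the kind of check the assignment path would do. All
the names, the callback and the 'relaxed' parameter are made-up
illustrations on my side, not the actual Xen interfaces:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative types only - not the real Xen structures. */
struct rmrr_region {
    uint64_t base_gfn, end_gfn;   /* guest frames the RMRR must occupy */
};

/* Hypothetical query: is this gfn still unused in the guest's p2m? */
typedef bool (*gfn_is_unused_fn)(uint64_t gfn);

/*
 * Returns 0 if the device may be assigned, -1 otherwise.  'relaxed' stands
 * for the per-VM override discussed in 1.1; the default is 'fail'.
 */
int check_rmrr_conflicts(const struct rmrr_region *rmrr, size_t nr,
                         gfn_is_unused_fn gfn_is_unused, bool relaxed)
{
    for (size_t i = 0; i < nr; i++) {
        for (uint64_t gfn = rmrr[i].base_gfn; gfn < rmrr[i].end_gfn; gfn++) {
            if (gfn_is_unused(gfn))
                continue;
            if (!relaxed)
                return -1;        /* 'fail': refuse the assignment */
            fprintf(stderr, "warning: RMRR gfn %#llx already in use\n",
                    (unsigned long long)gfn);
            break;                /* 'warn': report it and carry on */
        }
    }
    return 0;
}

Note that a single per-VM 'relaxed' flag is all this needs, which matches
the global override argued for in 1.1; hot-added devices would go through
the same check since they use the same assignment path.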

> 
> > Another point is about hotplug. 'fail' for future devices is too
> > strict, but to differentiate those from statically assigned devices,
> > the domain builder would then need to maintain a per-device reserved
> > region structure. Just 'warn' keeps things simple.
> 
> Whereas here I agree - hotplug should just fail (without otherwise
> impacting the guest).

So 'should' -> 'shouldn't'?

> 
> > 2) RMRR management
> >
> > George raised a good point that RMRR reserved regions can be
> > maintained in the toolstack, with the toolstack telling Xen which
> > regions are to be reserved. Besides providing more flexibility,
> > another benefit noted by Jan is being able to specify the reserved
> > regions of another node (one the guest might be migrated to) as a
> > preparation for migration.
> >
> > While this sounds like a good long-term plan, my feeling is that it
> > could be a parallel effort driven by the toolstack experts. Xen can't
> > simply rely on user space to set up all the necessary reserved
> > regions, since that would violate Xen's isolation philosophy. Whatever
> > the toolstack may tell Xen, Xen still needs to set up identity
> > mappings for all reserved regions reported for the assigned device.
> 
> Of course. If the tool stack failed to reserve a certain page in a guest's
> memory map, failure will result.
> 
> > So I still prefer the current approach, i.e. having Xen organize the
> > reserved regions according to the assigned devices, and then having
> > libxc/hvmloader query them to avoid conflicts. In the future a new
> > interface can be created to let the toolstack specify plain reserved
> > regions to Xen for whatever reason, as a complement.
> 
> Indeed we should presumably allow for both - the guest config may
> specify regions independent of what the host properties are, yet by
> default reserved regions within the guest layout will depend on host
> properties (and guest config settings - as pointed out before, the
> default ought to be no reserved regions anyway). How much of this

If no device is assigned, then yes, there are no reserved regions at all.
The later open question of report-all vs. report-sel is only relevant
when assignment is involved.

> gets implemented right away vs deferred until found necessary is a
> different question.

Agreed.
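Just to spell out in code what "Xen still needs to set up identity
mappings for all reserved regions reported for the assigned device"
amounts to, here is a minimal sketch; the helper names are made up, and
the real work of course goes through the p2m/IOMMU code:

#include <stddef.h>
#include <stdint.h>

struct rmrr_region {
    uint64_t base_pfn, end_pfn;   /* reserved range, in page frames */
};

/* Hypothetical stand-in for the hypervisor's p2m/IOMMU mapping primitive. */
typedef int (*map_identity_fn)(uint64_t pfn);

/*
 * Identity-map (gfn == mfn) every page of each reserved region reported
 * for the assigned device, whatever the toolstack did or did not reserve
 * in the guest memory map.
 */
int setup_rmrr_identity_maps(const struct rmrr_region *rmrr, size_t nr,
                             map_identity_fn map_identity)
{
    for (size_t i = 0; i < nr; i++)
        for (uint64_t pfn = rmrr[i].base_pfn; pfn < rmrr[i].end_pfn; pfn++) {
            int rc = map_identity(pfn);
            if (rc)
                return rc;        /* conflict: handled per the policy in 1) */
        }
    return 0;
}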

> 
> > 3) report-sel vs. report-all
> >
> > report-sel means reporting reserved regions selectively (all devices
> > that might potentially be assigned are listed, to cover hotplug, but
> > doing this is not user friendly)
> >
> > report-all means reporting the reserved regions of all available
> > devices on the platform (covering hotplug with enough flexibility)
> >
> > report-sel has been Jan's preference from the start, as report-all
> > leaves some confusing reserved regions visible to the end user.
> >
> > otoh, our proposal favours report-all as a simpler option, because we
> > don't think the user should make assumptions about the e820 layout,
> > which is a platform attribute; at most it's similar to a physical
> > layout.
> >
> > First, report-all doesn't cause more conflicts than report-sel by any
> > reasonable thinking. A virtual platform is simpler than a physical
> > one, and since those regions can be reserved on the physical platform,
> > it's reasonable to assume the same reservation can succeed on the
> > virtual platform (putting the <1MB conflict aside).
> 
> As again said in an earlier reply, tying the guest layout to the one of
> the host where it boots is going to lead to inconsistencies when the
> guest later gets migrated to a host with a different memory layout.

My point is that such an inconsistency can arise even without this design:
as I said in another reply, if you hot-remove a device (a boot device with
an RMRR reported), there's no way to erase that knowledge from the guest
layout either.
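To make the report-all vs. report-sel difference concrete, a small sketch
of how the guest's reserved-region list could be assembled under either
option; the names and the single-device scope are simplifying assumptions
of mine:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rmrr_entry {
    uint64_t base, end;           /* host physical reserved range */
    uint16_t sbdf;                /* (simplified) the device in its scope */
};

/*
 * report-all: every RMRR on the host becomes a reserved region in the
 * guest layout.  report-sel: only those whose device is actually assigned
 * (or expected to be hot-plugged later).
 */
size_t build_guest_reserved(const struct rmrr_entry *host, size_t nr_host,
                            const uint16_t *assigned, size_t nr_assigned,
                            bool report_all, struct rmrr_entry *out)
{
    size_t n = 0;

    for (size_t i = 0; i < nr_host; i++) {
        bool pick = report_all;

        for (size_t j = 0; !pick && j < nr_assigned; j++)
            pick = (host[i].sbdf == assigned[j]);

        if (pick)
            out[n++] = host[i];
    }
    return n;
}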

> 
> > Second, I'm not sure to what level users care about those reserved
> > regions. At most it's the same layout as the physical one, so even
> > sensitive users won't see it as a UFO. :-) And the e820 is a platform
> > attribute, so users shouldn't make assumptions about it.
> 
> Just consider the case where, in order to accommodate the reserved
> regions, low memory needs to be reduced from the default of over
> 3Gb to say 1Gb. If the guest OS then is incapable of using memory
> above 4Gb (say Linux with HIGHMEM=n), there is a significant
> difference to be seen by the user.

That makes some sense... but if so, it's also a limitation of your
proposal below about avoiding fiddling with lowmem, if there's a region at
1GB. I think for this we can go with your earlier assumption that we only
support the case where reserved regions are reasonably high, say above
3GB. Violating that assumption results in a warning (so guest RAM is not
moved) and the later device assignment will fail, as sketched below.
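As a rough sketch of that rule (the threshold, names and message are all
just for illustration):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MB(x) ((uint64_t)(x) << 20)

struct region { uint64_t base, end; };

/*
 * Pick the end of guest lowmem.  'reasonably_high' is the assumed
 * threshold (say 3GB).  A region below it (other than the <1MB legacy
 * area) does not cause guest RAM to be rearranged: it is only reported,
 * and the later device assignment is left to fail or warn per the policy
 * in 1).
 */
uint64_t pick_lowmem_end(uint64_t default_lowmem_end,
                         const struct region *rsvd, size_t nr,
                         uint64_t reasonably_high)
{
    uint64_t lowmem_end = default_lowmem_end;

    for (size_t i = 0; i < nr; i++) {
        if (rsvd[i].end <= MB(1))
            continue;             /* legacy <1MB case */
        if (rsvd[i].base < reasonably_high) {
            fprintf(stderr, "region %#llx-%#llx too low, RAM not moved\n",
                    (unsigned long long)rsvd[i].base,
                    (unsigned long long)rsvd[i].end);
            continue;
        }
        if (rsvd[i].base < lowmem_end)
            lowmem_end = rsvd[i].base;   /* clamp lowmem below the region */
    }
    return lowmem_end;
}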

> 
> > 4) handling conflicts
> >
> > There are several points discussed.
> >
> > Jan raised a good point that a reasonable assumption can be made to
> > avoid splitting lowmem into a scattered structure, i.e. assuming
> > reserved regions are only <1MB or above host lowmem. A scattered
> > structure has an impact on the RAM layout shared between the domain
> > builder and hvmloader, and further, per George's comment, impacts
> > upstream qemu. With that reasonable assumption the domain builder can
> > always arrange lowmem below the high reserved regions and thus
> > preserve the existing coarse-grained structure with lowmem/highmem and
> > the MMIO hole. Detection will of course still be done if a reserved
> > region breaks that assumption, but no attempt is made to break up
> > guest RAM to avoid the conflict.
> >
> > Following that, the hvmloader changes would become simpler too,
> > focusing on BIOS/ACPI and PCI BARs.
> >
> > There are two other ideas from Jan. One is to move more layout stuff
> > (like PCI BAR, etc.)
> 
> To clarify - I didn't mean libxc to do BAR assignments, all I meant was
> that it would need to size the MMIO hole in a way that hvmloader can
> do the assignments without needing to fiddle with the lowmem/highmem
> split.

Sorry for the misunderstanding.
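If I now read it right, the calculation on the libxc side would be roughly
along these lines - size the MMIO hole once, large enough for the expected
BARs and any high reserved regions, so hvmloader only places BARs inside
it (the names and the alignment choice are my assumptions):

#include <stddef.h>
#include <stdint.h>

#define MB(x) ((uint64_t)(x) << 20)
#define GB(x) ((uint64_t)(x) << 30)

struct region { uint64_t base, end; };

/*
 * Choose the start of the below-4GB MMIO hole so it covers the estimated
 * BAR space plus any high reserved (RMRR) regions; hvmloader then assigns
 * BARs within the hole without ever moving the lowmem/highmem split.
 */
uint64_t size_mmio_hole(uint64_t bar_space_estimate,
                        const struct region *rsvd, size_t nr)
{
    uint64_t hole_start = GB(4) - bar_space_estimate;

    for (size_t i = 0; i < nr; i++)
        if (rsvd[i].base < GB(4) && rsvd[i].base < hole_start)
            hole_start = rsvd[i].base;   /* hole must also cover this */

    /* Align down to 256MB - purely an illustrative choice. */
    return hole_start & ~(MB(256) - 1);
}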

Thanks
Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

