RE: [PATCH v5 1/4] domctl: introduce a new domain create flag, XEN_DOMCTL_CDF_evtchn_fifo, ...



> -----Original Message-----
> From: Jan Beulich <jbeulich@xxxxxxxx>
> Sent: 04 December 2020 07:53
> To: paul@xxxxxxx
> Cc: 'Paul Durrant' <pdurrant@xxxxxxxxxx>; 'Eslam Elnikety' 
> <elnikety@xxxxxxxxxx>; 'Ian Jackson'
> <iwj@xxxxxxxxxxxxxx>; 'Wei Liu' <wl@xxxxxxx>; 'Anthony PERARD' 
> <anthony.perard@xxxxxxxxxx>; 'Andrew
> Cooper' <andrew.cooper3@xxxxxxxxxx>; 'George Dunlap' 
> <george.dunlap@xxxxxxxxxx>; 'Julien Grall'
> <julien@xxxxxxx>; 'Stefano Stabellini' <sstabellini@xxxxxxxxxx>; 'Christian 
> Lindig'
> <christian.lindig@xxxxxxxxxx>; 'David Scott' <dave@xxxxxxxxxx>; 'Volodymyr 
> Babchuk'
> <Volodymyr_Babchuk@xxxxxxxx>; 'Roger Pau Monné' <roger.pau@xxxxxxxxxx>; 
> xen-devel@xxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH v5 1/4] domctl: introduce a new domain create flag, 
> XEN_DOMCTL_CDF_evtchn_fifo,
> ...
> 
> On 03.12.2020 18:07, Paul Durrant wrote:
> >> From: Jan Beulich <jbeulich@xxxxxxxx>
> >> Sent: 03 December 2020 15:57
> >>
> >> On 03.12.2020 16:45, Paul Durrant wrote:
> >>>> From: Jan Beulich <jbeulich@xxxxxxxx>
> >>>> Sent: 03 December 2020 15:23
> >>>>
> >>>> On 03.12.2020 13:41, Paul Durrant wrote:
> >>>>> From: Paul Durrant <pdurrant@xxxxxxxxxx>
> >>>>>
> >>>>> ...to control the visibility of the FIFO event channel operations
> >>>>> (EVTCHNOP_init_control, EVTCHNOP_expand_array, and EVTCHNOP_set_priority)
> >>>>> to the guest.
> >>>>>
> >>>>> These operations were added to the public header in commit d2d50c2f308f
> >>>>> ("evtchn: add FIFO-based event channel ABI") and the first implementation
> >>>>> appeared in the two subsequent commits: edc8872aeb4a ("evtchn: implement
> >>>>> EVTCHNOP_set_priority and add the set_priority hook") and 88910061ec61
> >>>>> ("evtchn: add FIFO-based event channel hypercalls and port ops"). Prior to
> >>>>> that, a guest issuing those operations would receive a return value of
> >>>>> -ENOSYS (not implemented) from Xen. Guests aware of the FIFO operations but
> >>>>> running on an older (pre-4.4) Xen would fall back to using the 2-level
> >>>>> event channel interface upon seeing this return value.
> >>>>>
> >>>>> Unfortunately the uncontrollable appearance of these new operations in
> >>>>> Xen 4.4 onwards has implications for hibernation of some Linux guests.
> >>>>> During resume from hibernation, there are two kernels involved: the
> >>>>> "boot" kernel and the "resume" kernel. The guest boot kernel may default
> >>>>> to using FIFO operations and instruct Xen via EVTCHNOP_init_control to
> >>>>> switch from 2-level to FIFO. On the other hand, the resume kernel keeps
> >>>>> assuming 2-level, because it was hibernated on a version of Xen that did
> >>>>> not support the FIFO operations.
> >>>>>
> >>>>> To maintain compatibility it is necessary to make Xen behave as it did
> >>>>> before the new operations were added and hence the code in this patch
> >>>>> ensures that, if XEN_DOMCTL_CDF_evtchn_fifo is not set, the FIFO event
> >>>>> channel operations will again result in -ENOSYS being returned to the
> >>>>> guest.
> >>>>
> >>>> I have to admit I'm now even more concerned about the control for such
> >>>> going into Xen, all the more with the now 2nd use in the subsequent patch.
> >>>> The implication of this would seem to be that whenever we add new
> >>>> hypercalls or sub-ops, a domain creation control would also need
> >>>> adding determining whether that new sub-op is actually okay to use by
> >>>> a guest. Or else I'd be keen to up front see criteria at least roughly
> >>>> outlined by which it could be established whether such an override
> >>>> control is needed.
> >>>>
> >>>
> >>> Ultimately I think any new hypercall (or related set of hypercalls) added
> >>> to the ABI needs to be opt-in on a per-domain basis, so that we know that
> >>> from when a domain is first created it will not see a change in its
> >>> environment unless the VM administrator wants that to happen.
> >>
> >> A new hypercall appearing is a change to the guest's environment, yes,
> >> but a backwards compatible one. I don't see how this would harm a guest.
> >
> > Say we have a guest which is aware of the newer Xen and is coded to use the
> > new hypercall *but* we start it on an older Xen where the new hypercall is
> > not implemented. *Then* we migrate it to the newer Xen... the new hypercall
> > suddenly becomes available and changes the guest behaviour. In the worst
> > case, the guest is sufficiently confused that it crashes.
> 
> If a guest recognizes a new hypercall is unavailable, why would it ever try
> to make use of it again, unless it knows it may appear after having been
> migrated (which implies it's prepared for the sudden appearance)? This to
> me still looks like an attempt to work around guest bugs. And a halfhearted
> one, as you cherry pick just two out of many additions to the original 3.0
> ABI.
> 

This is exactly an attempt to work around guest bugs, which is something a 
cloud provider needs to do where possible.
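
(For reference, a rough sketch of the probe-and-fallback pattern the commit
message describes, i.e. what a well-behaved guest does at boot. It assumes a
Linux-style HYPERVISOR_event_channel_op() wrapper and the public struct
evtchn_init_control; the helper name and how the control-block frame is
obtained are made up for illustration:

    /* Try to switch vCPU0 to the FIFO ABI; fall back to 2-level on -ENOSYS. */
    static bool probe_fifo_abi(uint64_t control_block_gfn)
    {
        struct evtchn_init_control init = {
            .control_gfn = control_block_gfn, /* frame holding vCPU0's control block */
            .offset      = 0,
            .vcpu        = 0,
        };
        int rc = HYPERVISOR_event_channel_op(EVTCHNOP_init_control, &init);

        if ( rc == -ENOSYS )   /* pre-4.4 Xen, or XEN_DOMCTL_CDF_evtchn_fifo clear */
            return false;      /* caller stays on the 2-level ABI */

        return rc == 0;        /* FIFO ABI now active for vCPU0 */
    }

The bug class being worked around is a resume kernel that skips this probe and
simply assumes the ABI it was hibernated under.)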

> >> This and ...
> >>
> >>>> I'm also not convinced such controls really want to be opt-in rather
> >>>> than opt-out.
> >>>
> >>> They really need to be opt-in I think. From a cloud provider PoV it is
> >>> important that nothing in a customer's environment changes unless we want
> >>> it to. Otherwise we have no way to deploy an updated hypervisor version
> >>> without risking crashing their VMs.
> >>
> >> ... this sound to me more like workarounds for buggy guests than
> >> functionality the hypervisor _needs_ to have. (I can appreciate
> >> the specific case here for the specific scenario you provide as
> >> an exception.)
> >
> > If we want to have a hypervisor that can be used in a cloud environment
> > then Xen absolutely needs this capability.
> 
> As per above you can conclude that I'm still struggling to see the
> "why" part here.
> 

Imagine you are a customer. You boot your OS and everything is just fine... you 
run your workload and all is good. You then shut down your VM and restart it. 
Now it starts to crash. Who are you going to blame? You did nothing to your OS 
or application software, so of course you are going to blame the cloud provider.

Now imagine you are the cloud provider, running Xen. What you did was start to 
upgrade your hosts from an older version of Xen to a newer version of Xen, to 
pick up various bug fixes and make sure you are running a version that is 
within the security support envelope. You identify that your customer's problem 
is a bug in their OS that was latent on the old version of the hypervisor but 
is now manifesting on the new one because it has buggy support for a hypercall 
that was added between the two versions. How are you going to fix this issue, 
and get your customer up and running again? Of course you'd like your customer 
to upgrade their OS, but they can't even boot it to do that. You really need a 
solution that can restore the old VM environment, at least temporarily, for 
that customer.

> >>>>> --- a/xen/arch/arm/domain.c
> >>>>> +++ b/xen/arch/arm/domain.c
> >>>>> @@ -622,7 +622,8 @@ int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
> >>>>>      unsigned int max_vcpus;
> >>>>>
> >>>>>      /* HVM and HAP must be set. IOMMU may or may not be */
> >>>>> -    if ( (config->flags & ~XEN_DOMCTL_CDF_iommu) !=
> >>>>> +    if ( (config->flags &
> >>>>> +          ~(XEN_DOMCTL_CDF_iommu | XEN_DOMCTL_CDF_evtchn_fifo)) !=
> >>>>>           (XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap) )
> >>>>>      {
> >>>>>          dprintk(XENLOG_INFO, "Unsupported configuration %#x\n",
> >>>>> --- a/xen/arch/arm/domain_build.c
> >>>>> +++ b/xen/arch/arm/domain_build.c
> >>>>> @@ -2478,7 +2478,8 @@ void __init create_domUs(void)
> >>>>>          struct domain *d;
> >>>>>          struct xen_domctl_createdomain d_cfg = {
> >>>>>              .arch.gic_version = XEN_DOMCTL_CONFIG_GIC_NATIVE,
> >>>>> -            .flags = XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap,
> >>>>> +            .flags = XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap |
> >>>>> +                     XEN_DOMCTL_CDF_evtchn_fifo,
> >>>>>              .max_evtchn_port = -1,
> >>>>>              .max_grant_frames = 64,
> >>>>>              .max_maptrack_frames = 1024,
> >>>>> --- a/xen/arch/arm/setup.c
> >>>>> +++ b/xen/arch/arm/setup.c
> >>>>> @@ -805,7 +805,8 @@ void __init start_xen(unsigned long boot_phys_offset,
> >>>>>      struct bootmodule *xen_bootmodule;
> >>>>>      struct domain *dom0;
> >>>>>      struct xen_domctl_createdomain dom0_cfg = {
> >>>>> -        .flags = XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap,
> >>>>> +        .flags = XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap |
> >>>>> +                 XEN_DOMCTL_CDF_evtchn_fifo,
> >>>>>          .max_evtchn_port = -1,
> >>>>>          .max_grant_frames = gnttab_dom0_frames(),
> >>>>>          .max_maptrack_frames = -1,
> >>>>> --- a/xen/arch/x86/setup.c
> >>>>> +++ b/xen/arch/x86/setup.c
> >>>>> @@ -738,7 +738,8 @@ static struct domain *__init create_dom0(const module_t *image,
> >>>>>                                           const char *loader)
> >>>>>  {
> >>>>>      struct xen_domctl_createdomain dom0_cfg = {
> >>>>> -        .flags = IS_ENABLED(CONFIG_TBOOT) ? XEN_DOMCTL_CDF_s3_integrity : 0,
> >>>>> +        .flags = XEN_DOMCTL_CDF_evtchn_fifo |
> >>>>> +                 (IS_ENABLED(CONFIG_TBOOT) ? XEN_DOMCTL_CDF_s3_integrity : 0),
> >>>>>          .max_evtchn_port = -1,
> >>>>>          .max_grant_frames = -1,
> >>>>>          .max_maptrack_frames = -1,
> >>>>> --- a/xen/common/domain.c
> >>>>> +++ b/xen/common/domain.c
> >>>>> @@ -307,7 +307,7 @@ static int sanitise_domain_config(struct xen_domctl_createdomain *config)
> >>>>>           ~(XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap |
> >>>>>             XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off |
> >>>>>             XEN_DOMCTL_CDF_xs_domain | XEN_DOMCTL_CDF_iommu |
> >>>>> -           XEN_DOMCTL_CDF_nested_virt) )
> >>>>> +           XEN_DOMCTL_CDF_nested_virt | XEN_DOMCTL_CDF_evtchn_fifo) )
> >>>>>      {
> >>>>>          dprintk(XENLOG_INFO, "Unknown CDF flags %#x\n", config->flags);
> >>>>>          return -EINVAL;
> >>>>
> >>>> All of the hunks above point out a scalability issue if we were to
> >>>> follow this route for even just a fair part of new sub-ops, and I
> >>>> suppose you've noticed this with the next patch presumably touching
> >>>> all the same places again.
> >>>
> >>> Indeed. This solution works for now but is probably not what we want in
> >>> the long run. Same goes for the current way we control viridian features
> >>> via an HVM param. It is good enough for now IMO since domctl is not a
> >>> stable interface. Any ideas about how we might implement a better
> >>> interface in the longer term?
> >>
> >> While it has other downsides, Jürgen's proposal doesn't have any
> >> similar scalability issue afaics. Another possible model would
> >> seem to be to key new hypercalls to hypervisor CPUID leaf bits,
> >> and derive their availability from a guest's CPUID policy. Of
> >> course this won't work when needing to retrofit guarding like
> >> you want to do here.
> >
> > Ok, I'll take a look at hypfs as an immediate solution, if that's preferred.
> 
> Well, as said - there are also downsides with that approach. I'm
> not convinced it should be just the three of us to determine which
> one is better overall, the more that you don't seem to be convinced
> anyway (and I'm not going to claim I am, in either direction). It
> is my understanding that based on the claimed need for this (see
> above), this may become very widely used functionality, and if so
> we want to make sure we won't regret the route we went.
> 

Agreed, but we don't need the final top-to-bottom solution now. The only part 
of this series that introduces a stable interface is the libxl part, and I 
think the 'feature' bitmap is reasonably future-proof (being modelled on the 
viridian feature bitmap, which has been extended over several releases since it 
was introduced).
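
(To illustrate the shape only: a minimal, hypothetical sketch of the kind of
check such a bitmap allows on the libxl side, following the viridian pattern.
The enum value, the 'features' bitmap and the helper below are invented names,
not the series' actual identifiers; only libxl_bitmap_test() and
XEN_DOMCTL_CDF_evtchn_fifo are real:

    /* Hypothetical feature bit numbers; new ones would be appended over time. */
    enum { DOMAIN_FEATURE_EVTCHN_FIFO = 0 };

    static void apply_features(const libxl_bitmap *features, /* hypothetical */
                               uint32_t *cdf_flags)
    {
        /* A clear bit leaves the CDF flag unset, so an existing guest keeps
         * the pre-4.4 behaviour (FIFO ops return -ENOSYS). */
        if ( libxl_bitmap_test(features, DOMAIN_FEATURE_EVTCHN_FIFO) )
            *cdf_flags |= XEN_DOMCTL_CDF_evtchn_fifo;
    }

Because bits are only ever appended, a toolstack built against an older list
simply leaves newer bits clear, which is what makes the scheme future-proof.)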

> For starters, just to get a better overall picture, besides the two
> overrides you add here, do you have any plans to retroactively add
> further controls for past ABI additions?

I don't have any specific plans. The two I deal with here are the causes of 
observed differences in guest behaviour: one is an actual crash, and the other 
affects PV driver behaviour in a way that may or may not be the cause of other 
crashes... but it is still something we need to have control over.

> I don't suppose you have
> plans to consistently do this for all of them? I ask because I
> could see us going the route you've chosen for very few retroactive
> additions of overrides, but use the CPUID approach (provided
> there's a suitable Arm counterpart, and ideally some understanding
> whether something similar could also be found for at least the two
> planned new ports) for ABI additions going forward.
> 

That sounds reasonable and I'd be perfectly happy to rip and replace the use of 
domain creation flags when we have something 'better'. We don't need to wait 
for that though... this series is perfectly functional and doesn't take us down 
any one-way streets (below libxl) AFAICT.

  Paul

> Jan