
Re: [ANNOUNCE] Xen 4.15 release schedule and feature tracking

On Thu, 2021-01-14 at 19:02 +0000, Andrew Cooper wrote:
> On 14/01/2021 16:06, Ian Jackson wrote:
> > The last posting date for new feature patches for Xen 4.15 is
> > tomorrow. [1]  We seem to be getting a reasonably good flood of
> > stuff trying to meet this deadline :-).
> > 
> > Patches for new features posted after tomorrow will be deferred to
> > the next Xen release after 4.15.  NB the primary responsibility for
> > driving a feature's progress to meet the release schedule lies with
> > the feature's proponent(s).
> > 
> > 
> >   As a reminder, here is the release schedule:
> > + (unchanged information indented with spaces):
> > 
> >    Friday 15th January    Last posting date
> > 
> >        Patches adding new features should be posted to the mailing
> >        list by this date, although perhaps not in their final
> >        version.
> > 
> >    Friday 29th January    Feature freeze
> > 
> >        Patches adding new features should be committed by this
> >        date.
> >        Straightforward bugfixes may continue to be accepted by
> >        maintainers.
> > 
> >    Friday 12th February **tentative**   Code freeze
> > 
> >        Bugfixes only, all changes to be approved by the Release
> >        Manager.
> > 
> >    Week of 12th March **tentative**    Release
> >        (probably Tuesday or Wednesday)
> > 
> >   Any patches containing substantial refactoring are to be treated
> >   as new features, even if the intent is to fix bugs.
> > 
> >   Freeze exceptions will not be routine, but may be granted in
> >   exceptional cases for small changes on the basis of risk
> >   assessment.  Large series will not get exceptions.  Contributors
> >   *must not* rely on getting, or expect, a freeze exception.
> > 
> > + New or improved tests (supposing they do not involve refactoring,
> > + or even build system reorganisation), and documentation
> > + improvements, will generally be treated as bugfixes.
> > 
> >   The code freeze and release dates are provisional and will be
> >   adjusted in the light of apparent code quality etc.
> > 
> >   If as a feature proponent you feel your feature is at risk and
> >   there is something the Xen Project could do to help, please
> >   consult me or the Community Manager.  In such situations please
> >   reach out earlier rather than later.
> > 
> > 
> > In my last update I asked this:
> > 
> > > If you are working on a feature you want in 4.15 please let me
> > > know about it.  Ideally I'd like a little stanza like this:
> > > 
> > > S: feature name
> > > O: feature owner (proponent) name
> > > E: feature owner (proponent) email address
> > > P: your current estimate of the probability of it making 4.15,
> > >    as a %age
> > > 
> > > But free-form text is OK too.  Please reply to this mail.
> > I received one mail.  Thanks to Oleksandr Andrushchenko for his
> > update on the following feature:
> > 
> >   IOREQ feature (+ virtio-mmio) on Arm
> >   
> > https://www.mail-archive.com/xen-devel@xxxxxxxxxxxxxxxxxxxx/msg87002.html
> > 
> >   Julien Grall <julien@xxxxxxx>
> >   Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>
> > 
> > I see that V4 of this series was just posted.  Thanks, Oleksandr.
> > I'll make a separate enquiry about your series.
> > 
> > I think if people don't find the traditional feature tracking
> > useful, I will try to assemble Release Notes information later,
> > during the freeze, when fewer people are rushing to try to meet
> > the deadlines.
> (Now I have working email.)
> 
> Features:
> 
> 1) acquire_resource fixes.
> Not really a new feature - entirely bugfixing a preexisting one.
> Developed by me to help 2).  Reasonably well acked, but awaiting
> feedback on v3.
> 
> 2) External Processor Trace support.
> Developed by Michał.  Depends on 1), and awaiting a new version being
> posted.
> As far as I'm aware, both Intel and CERT have production systems
> deployed using this functionality, so it is very highly desirable to
> get it into 4.15.
> 
> 3) Initial Trenchboot+SKINIT support.
> I've got two patches I need to clean up and submit, which are the
> first part of the Trenchboot + Dynamic Root of Trust on AMD support.
> This will get Xen into a position where it can be started via the
> new grub "secure_launch" protocol.
> Later patches (i.e. post 4.15) will add support for Intel TXT (i.e.
> without tboot), as well as the common infrastructure for the TPM
> event log and further measurements during the boot process.
> 
> 4) "simple" autotest support.
> Bugs:
> 
> 1) HPET/PIT issue on newer Intel systems.  This has had literally
> tens of reports across the devel and users mailing lists, and
> prevents Xen from booting at all on the past two generations of
> Intel laptops.  I've finally got a repro and posted a fix to the
> list, but it is still in progress.
> 
> 2) "scheduler broken" bugs.  We've had 4 or 5 reports of Xen not
> working, and very little investigation into what's going on.
> Suspicion is that there might be two bugs: one with smt=0 on recent
> AMD hardware, and one more general "some workloads cause negative
> credit" which might or might not be specific to credit2 (debugging
> feedback differs - there might also be 3 underlying issues).
Yep, so, let's try to summarize/collect the ones I think you may be
referring to:

1) There is one report about Credit2 not working, while Credit1 was
fine. It's this one:


It's the one where it somehow happens that one or more vCPUs manage to
run for a really, really long timeslice, much longer than the
scheduler would have allowed them, and this causes problems. _If_
that's it, my investigation so far seems to show that this happens
despite the scheduler code trying to enforce (via timers) the proper
timeslice limits; when it does happen, it makes the scheduler very
unhappy. I've seen reports of it occurring on both Credit and Credit2,
but Credit2 definitely seems to be more sensitive to it.

I've actually been trying to track it down for a while now, but I can't
easily reproduce it, so it's proving to be challenging.

2) Then there has been this one:


Here, the reporter said that "[credit1] results is an observable
delay, unusable performance; credit2 seems to be the only usable
scheduler". This is the one that Andrew also mentioned, happening on
Ryzen and with SMT disabled (as this is on QubesOS, IIRC).

Here, doing "dom0_max_vcpus=1 dom0_vcpus_pin" seemed to mitigate the
problem but, of course, with obvious limitations. I don't have a Ryzen
handy, but I have a Zen and a Zen2. I checked there and again could
not reproduce it (although what I tried was upstream Xen, not
QubesOS).

3) Then I recall this one:


This also started as a "scheduler, probably Credit2" bug. But it then
turned out to manifest on both Credit1 and Credit2, and it started to
happen on 4.14 while it was not there in 4.13... And nothing major
changed in scheduling between those two releases, I think.

During the analysis, we thought we had identified a livelock, but then
could not pinpoint what exactly was going on. Oh, and then it was also
discovered that Credit2 + PVH dom0 seemed to be a working
configuration, and it's weird for a scheduling issue to have a (dom0)
domain type dependency, I think. But that could be anything really...
and I'm sure happy to keep digging.

4) There's the NULL scheduler + ARM + vwfi=native issue:


This looks like something that we saw before but which remained
unfixed, although not exactly in this form. If it's that one, the
analysis is done, and we're working on a patch. If it's something
else, or even something similar but slightly different... well, we'll
have to see when we have the patch.

5) We're also dealing with this bug report, although it is being
reported against Xen 4.13 (openSUSE's packaged version of it):


This is again on recent AMD hardware and here, "dom0_max_vcpus=4
dom0_vcpus_pin" works ok, but only until a (Windows) HVM guest is
started; when that happens, we get crashes/hangs.

If the guests are PV, things are apparently fine. If the HVM guests
use a different set of CPUs than dom0 (e.g., vm.cpumask="4-63" in
xl.conf), things are fine as well.
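
For anyone wanting to try that second workaround, a sketch of the
relevant xl.conf fragment (the "4-63" range is just the one from the
report; pick a range that excludes the pCPUs dom0 is pinned to on
your host):

```
# /etc/xen/xl.conf (illustrative)
# Keep newly created guests off the pCPUs that dom0 is pinned to:
vm.cpumask="4-63"
```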

Again, a scheduler issue with a scheduling algorithm dependency was
theorized, and it will be investigated (if the user can come back with
answers, which may take some time, as explained in the report). The
different behavior with different kinds of guests is a little weird
for an issue of this kind, IME, but let's see.

6) If we want, we can include this too (hopefully just for reference):


Indeed, the symptoms were similar, such as hanging during boot but
everything being fine with dom0_max_vcpus=1. However, Jan is currently
investigating this one, and they're heading toward problems with TSC
reliability reporting and rendezvous, but let's see.

Did I forget any?

As for "the plan", I am currently working on 4 (trying to come up with
a patch that fixes it) and on 1 (trying to come up with a way to track
down and uncover what I believe is the real issue).

Dario Faggioli, Ph.D
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
