
Re: Slow (higher scheduling latency?) system after S3 - regression 4.17 -> 4.19 ?


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • Date: Thu, 26 Mar 2026 16:53:03 +0100
  • Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 26 Mar 2026 15:53:15 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, Mar 26, 2026 at 02:20:04PM +0100, Jan Beulich wrote:
> On 26.03.2026 13:44, Marek Marczykowski-Górecki wrote:
> > On Thu, Mar 26, 2026 at 09:27:56AM +0100, Jan Beulich wrote:
> >> On 25.03.2026 20:13, Marek Marczykowski-Górecki wrote:
> >>> On Sat, Oct 25, 2025 at 02:23:23PM +0200, Marek Marczykowski-Górecki 
> >>> wrote:
> >>>> Hi,
> >>>>
> >>>> After updating from Xen 4.17 to 4.19 it appears that I have some
> >>>> performance issues if system was suspended before. For now this is just
> >>>> vague observation, as I don't have much data yet.
> >>>>
> >>>> Generally, the user-observable effect is that audio playback (over
> >>>> Qubes's PV audio mechanism) becomes very choppy, beyond usefulness:
> >>>> audio calls are no longer possible, and music/video playback also
> >>>> becomes useless with some players. To me this suggests some increased
> >>>> scheduling latency - maybe some timer fires too late?
> >>>>
> >>>> This doesn't always happen, but I'd say it happens quite often after
> >>>> S3; sometimes two or three suspend cycles are needed to trigger the
> >>>> issue. It's possible some other factor plays a role too - like which
> >>>> pCPU the VM gets scheduled on, maybe? But so far I have never hit
> >>>> this issue before performing any S3. I don't have a reliable
> >>>> reproducer yet.
> >>>>
> >>>> While I write that it's related to the 4.17 -> 4.19 update, there is
> >>>> also a slight possibility it's related to a hardware change (CFL ->
> >>>> MTL). But that's unlikely, since I'd probably have received many more
> >>>> bug reports about it (the stable Qubes version uses 4.17, and we have
> >>>> many users on MTL).
> >>>>
> >>>> What would be a good way to collect some useful debugging info? Some
> >>>> debug-keys (r? a?), some command line options?
> >>>
> >>> In the meantime (I've already shared some of this on Matrix), I
> >>> managed to build a much more isolated test: a single PVH domU with
> >>> VLC, preloading 1s of audio (without really playing it; the output is
> >>> set to a file). The specific command is:
> >>>
> >>>     VLC_VERBOSE=3 vlc -I cli -A afile --no-loop --play-and-exit --run-time 1 ./melodigne-quiz-blind-test-187140.mp3
> >>>
> >>> The domU rootfs (or rather: initramfs) is based on the VLC AppImage -
> >>> there is not much more inside - and the above command is run as part
> >>> of the init script (the domU is terminated shortly afterwards).
> >>>
> >>> There is no persistent state in the test; the whole thing is network
> >>> booted from the same base. The only thing changing between runs is
> >>> the Xen binary + toolstack.
> >>>
> >>> In the output I'm looking for this line:
> >>>
> >>>     main input debug: Stream buffering done (1018 ms in 2 ms)
> >>>
> >>> Before S3 it takes 0-2ms, sometimes (rarely) 10ms (not sure why).
> >>> Just after S3 it takes a very similar amount of time.
> >>> Then I wait 30min and run that command (or really the whole domU)
> >>> again a few times. I consider it "good" if I get at least one result
> >>> below 10ms. With this test in hand, I ran a bisect between
> >>> staging-4.17 and staging-4.19. And here things get interesting:
> >>>
> >>> At 498a4ded74b3f062c52e42568223dc5858d27731 I get the following times
> >>> (in ms):
> >>>
> >>>     30m after S3: 2, 2, 10, 10
> >>>     1h after S3: 2, 16, 16, 2, 17
> >>>
> >>> At 5effd810c629d9a836e46ee321ca7409dad27212:
> >>>
> >>>     30m after S3: 26, 27, 26, 15
> >>>     1h after S3: 24, 25, 37, 37, 25
> >>>
> >>> This doesn't make any sense to me. There is a clear difference, yet
> >>> 5effd810c629d9a836e46ee321ca7409dad27212 looks like a no-op change (at
> >>> least for this Intel system).
> >>
> >> Pretty odd indeed. I assume you double-checked the effect by going
> >> back and forth between the two commits.
> > 
> > Yes...
> > 
> >> If the effect persists, could you make both binaries (xen-syms or
> >> xen.efi respectively, depending on which one you use on the system)
> >> available somewhere for inspection? There has to be some relevant
> >> difference, e.g. addresses shifting slightly.
> > 
> > Sure, here: https://salmon.qubes-os.org/~marmarek/xen-perf-weird.tar.gz
> 
> While said commit changes the size of init_nonfatal_mce_checker() by 4
> bytes, that doesn't lead to any other changes, i.e. all addresses remain
> the same. Hence I'm completely lost as to explaining the observed
> behavior (or even just making a wild guess).

With staging-4.19 from a week ago +
5effd810c629d9a836e46ee321ca7409dad27212 reverted, I got the following
results:

30m after S3: 23, 22, 22, 23
1h after S3: 7, 25, 22, 8, 25

So, there are some fast runs, but most are in line with the "bad" case
above...
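
As an aside, the buffering times quoted throughout this thread can be
pulled out of saved VLC logs mechanically. A minimal sketch, assuming
the per-run logs are saved under hypothetical names like run-*.log:

```shell
# Sketch: extract the stream-buffering time (the second "N ms" in VLC's
# "Stream buffering done (1018 ms in 2 ms)" line) from saved log files
# and flag "good" runs (< 10 ms). The run-*.log names are hypothetical.
report_buffering_times() {
    for log in "$@"; do
        ms=$(sed -n 's/.*Stream buffering done ([0-9]* ms in \([0-9]*\) ms).*/\1/p' "$log")
        [ -n "$ms" ] || continue   # skip logs without the line
        if [ "$ms" -lt 10 ]; then
            echo "$log: ${ms} ms (good)"
        else
            echo "$log: ${ms} ms"
        fi
    done
}
```

Usage would be e.g. `report_buffering_times run-*.log` after a batch of
test runs.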

But I have another idea why that seemingly no-op commit may change
something: maybe it's about the stack contents during/after the function
call influencing some uninitialized variable further down the road?
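
The "all addresses remain the same" comparison mentioned earlier in the
thread can also be done mechanically. A minimal sketch, assuming the two
builds' xen-syms binaries are at hand (paths are hypothetical) and
binutils' nm is available:

```shell
# Sketch: list any symbol whose address (or type) differs between two
# builds. Prints nothing and returns 0 when the symbol maps match.
diff_symbol_addresses() {
    a="${TMPDIR:-/tmp}/syms-a.$$"
    b="${TMPDIR:-/tmp}/syms-b.$$"
    nm --defined-only "$1" | sort -k3 > "$a"
    nm --defined-only "$2" | sort -k3 > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}
```

Usage would be e.g. `diff_symbol_addresses build-good/xen-syms
build-bad/xen-syms`, with any output pointing at a shifted or resized
symbol.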

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: signature.asc
Description: PGP signature


 

