
Re: xenstore - Suggestion of batching watch events



On Tue, Jun 24, 2025 at 4:01 PM Andriy Sultanov
<sultanovandriy@xxxxxxxxx> wrote:
>
> On 6/24/25 3:51 PM, Andriy Sultanov wrote:
>
> > Suggestion:
> > WATCH_EVENT's req_id and tx_id are currently 0. Could it be possible, for
> > example, to modify this such that watch events coming as a result of a
> > transaction commit (a "batch") have tx_id of the corresponding
> > transaction
> > and req_id of, say, 2 if it's the last such watch event of a batch and 1
> > otherwise? Old clients would still ignore these values, but it would
> > allow
> > some others to detect if an update is part of a logical batch that
> > doesn't end
> > until its last event.
>
> Come to think of it, since clients could watch arbitrary parts of
> what the transaction touched, this wouldn't be as simple, xenstored
> would have to issue the "batch ended" packet per token, tracking
> that somehow internally... Perhaps transaction_start and transaction_end
> could produce WATCH_EVENT (or some other similar packet) as well so
> that this tracking could be done client-side? (standard WATCH_EVENT
> would still need their tx_id to be modified)
>
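
For reference, the wire header the suggestion would repurpose is the
existing struct xsd_sockmsg from xs_wire.h; a client that wanted to spot
batch boundaries under such a scheme might do something like the sketch
below (the two req_id markers are of course hypothetical, not part of
the protocol today):

#include <stdint.h>

/* Existing xenstore wire header (xen/include/public/io/xs_wire.h). */
struct xsd_sockmsg {
    uint32_t type;     /* XS_WATCH_EVENT for watch firings            */
    uint32_t req_id;   /* currently always 0 in watch events          */
    uint32_t tx_id;    /* currently always 0 in watch events          */
    uint32_t len;      /* length of the "<path>\0<token>\0" payload   */
};

/* Hypothetical markers under the proposed scheme. */
#define WATCH_BATCH_MEMBER 1   /* more events from this commit follow */
#define WATCH_BATCH_LAST   2   /* last event of this commit's batch   */

/* Sketch: decide whether a client can stop buffering updates. */
static int batch_complete(const struct xsd_sockmsg *hdr)
{
    if (hdr->tx_id == 0)
        return 1;                        /* not part of a transaction */
    return hdr->req_id == WATCH_BATCH_LAST;
}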

The Windows PV drivers also write a huge amount of data to xenstore.
E.g. I see 74 entries under /data, grouped by 1st level:
/data/volumes: 39
/data/scsi: 7
/data/vif: 3
/data/xd: 2
/data/cpus: 2
/data/...: 19

This is probably why the /data/updated optimization used to be beneficial.

/attr/os/hotfixes is also large (28 entries on this VM, but it used to
be in the hundreds for older versions of Windows).

In some situations xenopsd could set up more granular watches, but
then you run into scalability issues due to having thousands of
watches per domain (although you've fixed the largest problem there:
connecting/disconnecting xenstore clients causing slowdowns due to
watch trie walks).

There is also the problem that watch events don't contain enough
information, so they only act as signals to xenopsd, which then goes on
and fetches the entire xenstore subtree to figure out what actually
changed. This is the cause of some of the O(N^2) performance issues we
still have.
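
To make the round-trips concrete, here is a minimal sketch in C against
libxenstore (the watched path and token below are made up) of that
pattern: the event only hands the client a path and a token, so it has
to walk the subtree again to find out what changed.

#include <stdio.h>
#include <stdlib.h>
#include <xenstore.h>

/* Re-read an entire subtree after a watch fires; this walk is the
 * expensive part described above. */
static void dump_subtree(struct xs_handle *xsh, const char *path)
{
    unsigned int n, len, i;
    char **entries = xs_directory(xsh, XBT_NULL, path, &n);

    if (!entries)
        return;
    for (i = 0; i < n; i++) {
        char child[512];
        snprintf(child, sizeof(child), "%s/%s", path, entries[i]);
        char *val = xs_read(xsh, XBT_NULL, child, &len);
        if (val) {
            printf("%s = %.*s\n", child, (int)len, val);
            free(val);
        }
        dump_subtree(xsh, child);
    }
    free(entries);
}

int main(void)
{
    struct xs_handle *xsh = xs_open(0);
    unsigned int num;
    char **ev;

    if (!xsh)
        return 1;
    /* Path and token are purely illustrative. */
    xs_watch(xsh, "/local/domain/1/data", "data-token");
    ev = xs_read_watch(xsh, &num);      /* blocks until a watch fires */
    if (ev) {
        /* ev[XS_WATCH_PATH] is all the event tells us: no value, no
         * diff, so we walk the whole subtree again. */
        dump_subtree(xsh, ev[XS_WATCH_PATH]);
        free(ev);
    }
    xs_close(xsh);
    return 0;
}
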
We used to have a prototype xenstore cache which avoided actually
making those fetches from oxenstored; once something got into its
cache, it kept track of updates by setting up a watch on the key.
However, a cold start then took ~30 minutes (worse than not having a
cache at all). A compromise could be to cache on-the-fly (instead of
precaching everything you see): e.g. I don't think we actually care
about the values under /data/volumes and /attr/os/hotfixes, other than
for debugging purposes, so if xenopsd never fetches them, the cache
shouldn't either.
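
A very rough sketch of that cache-on-first-use idea (a toy fixed-size
table just for illustration; a real cache would use a proper map and
would also consume watch events to refresh or invalidate entries, which
is not shown here):

#include <stdlib.h>
#include <string.h>
#include <xenstore.h>

#define CACHE_SLOTS 64

static struct { char *path; char *value; } cache[CACHE_SLOTS];
static unsigned int cache_used;

/* Read through a tiny cache: a key is only cached -- and only gets a
 * watch -- the first time somebody actually asks for it, so keys that
 * are never fetched (e.g. debug-only ones) cost nothing. */
static char *cached_read(struct xs_handle *xsh, const char *path)
{
    unsigned int i, len;

    for (i = 0; i < cache_used; i++)
        if (!strcmp(cache[i].path, path))
            return cache[i].value;   /* hot path: no xenstore round-trip */

    char *val = xs_read(xsh, XBT_NULL, path, &len);
    if (!val || cache_used == CACHE_SLOTS)
        return val;

    /* First miss: remember the value and watch the key so that a later
     * watch event can refresh or invalidate this slot. */
    xs_watch(xsh, path, path /* token */);
    cache[cache_used].path = strdup(path);
    cache[cache_used].value = val;
    cache_used++;
    return val;
}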

To avoid a lot of round-trips, a new kind of watch event that tells you
the value(s) in addition to the keys might be useful. This new kind of
watch event could also be emitted once per transaction (I think the
events are already emitted at transaction commit time, and not sooner).
If filtering watch events based on tree depth would be useful in some
situations, then the new watch event could also try to do that.
But then one such "batched" watch event could become too big, larger
than what would fit into the xenstore ring (and for historical reasons
we don't support sending >4k packets through xenstore).
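
Just to illustrate the size concern, something along these lines
(entirely hypothetical -- nothing like it exists in xs_wire.h today)
would have to fit, together with all of its values, under the existing
payload limit:

#include <stdint.h>

#define XENSTORE_PAYLOAD_MAX 4096   /* existing limit from xs_wire.h */

/* Hypothetical payload for a value-carrying, per-commit watch event. */
struct batched_watch_event {
    uint32_t tx_id;        /* the commit this batch belongs to          */
    uint32_t nr_updates;   /* number of path/value records that follow  */
    /* Followed by nr_updates records of:
     *   "<path>\0" + uint32_t value_len + value bytes
     * so the client gets keys *and* values without extra reads.  A big
     * transaction could easily blow past XENSTORE_PAYLOAD_MAX and would
     * have to be split or fall back to today's path-only events. */
};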

Best regards,
--Edwin



 

