Re: Use of newer locking primitives...
On 30/08/2022 13:40, Martin Harvey wrote:
> -----Original Message-----
> From: Paul Durrant <xadimgnik@xxxxxxxxx>
>
>> I wouldn't necessarily be convinced that moving away from the atomic
>> lock/queue would be any faster.
>
> Humm. Maybe depends on H/W cache implementation. If you're doing
> interlocked ops on the lock, and non-interlocked ops on a queue, and
> they are on separate cache lines... maybe.
>
>> Limiting the queue drain may help from a fairness PoV, which I guess
>> could be achieved by actually blocking on the ring lock if the queue
>> gets to a certain length.
>
>>> - We do a XenBus GntTabRevoke for every single damn fragment
>>>   individually on the Tx fast path. There are probably some ways of
>>>   batching / deferring these so as to make everything run a bit more
>>>   smoothly.
>
>> A revoke ought to be fairly cheap; it's essentially some bit flips. It
>> does, of course, involve some memory barriering... but is that really
>> slowing things down?
>
> Well, it does do a SCHEDOP_yield, where I think perhaps
> KeStallExecutionProcessor might be better for a shorter (microsecond)
> wait. If the yield is a hypercall, then I think assorted context
> switching etc. is wasted work, and possibly longer wait granularity.

Surely KeStallExecutionProcessor() is likely to HLT while waiting, so why
is that going to be better than a yield? Also, the yield is only done if
the interlocked op fails, which it shouldn't do in the normal case.

> It would be *really nice* to get vTune / uProf working on guests with
> all the uArchitecture counters working, so we could actually find out
> exactly how many cycles get spent where. We see 99% of TxPollDpc's
> completing rapidly (microseconds), and then very occasionally we get a
> 4ms PollDpc. This is timing via ETW. If I get time to repro it (couple
> of weeks), then I might try changing the stall mechanisms, and possibly
> the locking, separately.

Right, that could be because something has just dumped (or is still
dumping) a load of packets on the queue. Whoever has the lock has to
drain the entire queue... that's the lack of fairness I was referring to.
If there's a need to let other DPCs run then perhaps that requirement
could be relaxed... but we'd have to be careful that packets don't get
stuck in the queue.

  Paul

> MH.
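To make the cache-line point above concrete, here is a minimal C sketch of
the kind of layout Martin describes: the word hit by interlocked ops lives
on its own cache line, separate from the queue state that only the lock
holder touches. The structure and field names are purely illustrative and
are not taken from the actual XENVIF code.

#include <ntddk.h>

/*
 * Illustrative only: keep the lock word on its own cache line so CPUs
 * spinning on it with interlocked ops do not bounce the line that the
 * current owner is updating with plain (non-interlocked) stores.
 */
typedef struct _TX_SHARED {
    DECLSPEC_CACHEALIGN volatile LONG   Lock;   /* interlocked ops only      */
    DECLSPEC_CACHEALIGN LIST_ENTRY      Queue;  /* touched by the lock owner */
} TX_SHARED;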
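On the per-fragment GntTabRevoke point, one way to take the revoke off the
Tx fast path would be to collect grant references and revoke them in a
single later pass. The sketch below is only an outline of that idea:
GnttabRevokeForeignAccess() is a stand-in for whatever the driver's
XENBUS_GNTTAB interface actually provides, and the batch size is an
arbitrary guess that would need real measurement.

#include <ntddk.h>

#define REVOKE_BATCH_SIZE   256

typedef ULONG GRANT_REF;

/* Placeholder for the real revoke call exposed by the XENBUS_GNTTAB
 * interface; not an actual function name in the drivers. */
VOID GnttabRevokeForeignAccess(GRANT_REF Ref);

typedef struct _REVOKE_BATCH {
    GRANT_REF   Refs[REVOKE_BATCH_SIZE];
    ULONG       Count;
} REVOKE_BATCH;

static VOID
RevokeBatchFlush(REVOKE_BATCH *Batch)
{
    ULONG Index;

    /* Revoke everything collected since the last flush, off the fast path. */
    for (Index = 0; Index < Batch->Count; Index++)
        GnttabRevokeForeignAccess(Batch->Refs[Index]);

    Batch->Count = 0;
}

static VOID
RevokeBatchAdd(REVOKE_BATCH *Batch, GRANT_REF Ref)
{
    /* Defer the revoke; flush only when the batch fills up. */
    Batch->Refs[Batch->Count++] = Ref;

    if (Batch->Count == REVOKE_BATCH_SIZE)
        RevokeBatchFlush(Batch);
}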
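The SCHEDOP_yield vs. KeStallExecutionProcessor discussion comes down to
what the contended path of the lock does. A rough sketch of the two
variants follows, assuming a simple interlocked lock word; this is not the
actual XENVIF locking code, and HypercallSchedOpYield() is a hypothetical
wrapper for the yield hypercall, not a real function in the drivers.

#include <ntddk.h>

typedef struct _TX_LOCK {
    volatile LONG   Owned;      /* 0 = free, 1 = held */
} TX_LOCK;

/* Hypothetical wrapper around the SCHEDOP_yield hypercall. */
VOID HypercallSchedOpYield(VOID);

static VOID
TxLockAcquireYield(TX_LOCK *Lock)
{
    /* Behaviour under discussion: yield the vCPU whenever the
     * interlocked acquire fails. */
    while (InterlockedCompareExchange(&Lock->Owned, 1, 0) != 0)
        HypercallSchedOpYield();
}

static VOID
TxLockAcquireStall(TX_LOCK *Lock)
{
    /* Alternative Martin suggests: a short busy-wait instead of a
     * hypercall on the contended path. */
    while (InterlockedCompareExchange(&Lock->Owned, 1, 0) != 0)
        KeStallExecutionProcessor(1);   /* spin for ~1us */
}

static VOID
TxLockRelease(TX_LOCK *Lock)
{
    InterlockedExchange(&Lock->Owned, 0);
}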
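The fairness point (and the occasional 4ms TxPollDpc) could be addressed
along the lines Paul mentions: cap how much the lock holder drains in one
DPC pass and re-queue the DPC when the cap is hit, so other DPCs get to
run but the remaining packets are not stranded. Again only a sketch: the
queue type, TxQueuePop() and TxSendPacket() are placeholders rather than
the real driver API, and the batch limit is arbitrary.

#include <ntddk.h>

#define TX_DPC_BATCH_LIMIT  64      /* packets drained per DPC pass */

typedef struct _TX_PACKET TX_PACKET;

typedef struct _TX_QUEUE {
    LIST_ENTRY  List;
    KDPC        Dpc;
} TX_QUEUE;

/* Placeholders for the real queue/transmit operations. */
TX_PACKET *TxQueuePop(TX_QUEUE *Queue);     /* returns NULL when empty */
VOID       TxSendPacket(TX_PACKET *Packet);

_Function_class_(KDEFERRED_ROUTINE)
VOID
TxPollDpc(KDPC *Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    TX_QUEUE    *Queue = Context;
    ULONG       Count;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    for (Count = 0; Count < TX_DPC_BATCH_LIMIT; Count++) {
        TX_PACKET *Packet = TxQueuePop(Queue);

        if (Packet == NULL)
            return;             /* queue fully drained */

        TxSendPacket(Packet);
    }

    /*
     * Batch limit hit: let other DPCs run, but re-queue ourselves so the
     * packets still on the queue are not left stuck there.
     */
    (VOID) KeInsertQueueDpc(&Queue->Dpc, NULL, NULL);
}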