
Re: [Xen-devel] schedulers and topology exposing questions



On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote:
>
> > So, may I ask what piece of (Linux) code are we actually talking
> > about?
> > Because I had a quick look, and could not find where what you
> > describe
> > happens....
> 
> udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout
> The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the
> UDP one by having a different timeout.
> 
Ha, recvmsg! At some point you mentioned sendmsg, and I was looking
there and seeing nothing! But yes, it indeed makes sense to consider
the receiving side... let me have a look...

So, it looks to me that this is what happens:

 udp_recvmsg(noblock=0)
  |
  ---> __skb_recv_datagram(flags=0) {
         timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */
         do { ... } while (wait_for_more_packets(timeo));
                            |
                            ---> schedule_timeout(timeo)

So, at least in Linux 4.4, the timeout used is the one defined in
sk->sk_rcvtimeo, which looks to me to be this (unless I've followed
some link wrong, which could well be the case):

http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31
#define SO_RCVTIMEO     20

So there looks to be a timeout. But anyway, let's check
schedule_timeout().

> And MAX_SCHEDULE_TIMEOUT when it eventually calls 'schedule()' just
> goes to sleep (HLT) and eventually gets woken up by VIRQ_TIMER.
> 
So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:

schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
  schedule();  /* just go to sleep; no timer is armed */
  return;
}

If the timeout is anything else than MAX_SCHEDULE_TIMEOUT (but still a
valid value), the function does:

schedule_timeout(timeout) {
  struct timer_list timer;

  setup_timer_on_stack(&timer);      /* its callback will wake us up */
  __mod_timer(&timer);               /* arm the timer */
  schedule();                        /* go to sleep */
  del_singleshot_timer_sync(&timer);
  destroy_timer_on_stack(&timer);
  return;
}

So, in both cases, it pretty much calls schedule() just about
immediately. And when schedule() is called, the calling process --which
would be our UDP receiver-- goes to sleep.

The difference is that, in the MAX_SCHEDULE_TIMEOUT case, it does not
arrange for anyone to wake up the thread that is going to sleep. In
theory, it could even be stuck forever... Of course, this depends on
whether the receiver thread is on a runqueue or not, on whether (in
case it's not) its state is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE,
etc., and, in practice, it never happens! :-D

In this case, I think we take the other branch (the one 'with a
timeout'). But even if we took this one, I would expect the receiver
thread not to be on any runqueue, but rather to be (in either
interruptible or uninterruptible state) on a wait list, from where it
is taken out when a packet arrives.

In the case of anything other than MAX_SCHEDULE_TIMEOUT, all of the
above is still true, but a timer is set before calling schedule() and
putting the thread to sleep. This means that, in case nothing happens
that would wake up such a thread, or in case it hasn't happened yet
when the timeout expires, the thread is woken up by the timer.

And in fact, schedule_timeout() is not a different way of going to
sleep with respect to just calling schedule(). It is the way you go to
sleep for at most some amount of time... but in all cases, you go to
sleep immediately!

I am also not sure I see where all that discussion you've had with
George about IPIs fits into this... The IPI that will trigger the call
to schedule() that actually puts the thread we're putting to sleep here
(i.e., the receiver) back into execution happens when the sender
manages to send a packet (actually, when the packet arrives, I think)
_or_ when the timer expires.

The two possible calls to schedule() in schedule_timeout() behave
exactly in the same way, and I don't think having a timeout or not is
responsible for any particular behavior.

What I think is happening is this: when such a call to schedule() (from
inside schedule_timeout(), I mean) is made, the receiver task just goes
to sleep, and another one, perhaps the sender, is executed. The sender
sends the packet, which arrives before the timeout, and the receiver is
woken up.

*Here* is where an IPI should or should not happen, depending on where
our receiver task is going to be executed! And where would that be?
Well, that depends on the Linux scheduler's load balancer, the behavior
of which is controlled by scheduling domains flags like BALANCE_FORK,
BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and
others, but I think these are the most likely ones to be involved
here).

So, in summary, where the receiver executes when it wakes up depends on
the configuration of such flags in the (various) scheduling domain(s).
Check, for instance, this path:

 try_to_wake_up() --> select_task_rq() --> select_task_rq_fair()

The reason why the test 'reacts' to topology changes is that the set of
flags used for the various scheduling domains --decided at the time the
scheduling domains themselves are created and configured-- depends on
the topology... So it's quite possible that exposing the SMT topology,
as opposed to not doing so, flips one of the flags in a way that makes
the benchmark work better.

If you play with the above flags (or whatever their equivalents were in
2.6.39) directly, even without exposing the SMT topology, I'm quite
sure you would be able to trigger the same behavior.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

