
RE: [Xen-devel] "cpus" config parameter broken?

Thanks for the reply and sorry for the delay in mine... I've been
having email problems.

Please note proposal and request for comments below
(marked with >>>>> Comments? <<<<<)

> > 1) Is the "cpus" parameter expected to work in a config 
> file or is it
> > somehow deprecated?
> >    (I see there is an "xm vcpu-pin" command so perhaps this is the
> > accepted way to
> >    pin cpu's?)
> It's expected to work.

Yes indeed it does work.  There were some syntax variations in the cpus
param that I didn't quite understand.  However, my misunderstanding
uncovered another interesting problem.  See below.

> > 3) Does "cpus" really have any real-world usage anyhow?  
> E.g. are most
> > uses probably just
> >     user misunderstanding where "vcpu_avail" should be used instead?
> I'm sure some admins use it to good effect in hand placing 
> domains on CPUs, especially in a NUMA context. In most cases 
> its typically best to be fully work conserving and give Xen's 
> scheduler full flexibility.

Yeah, I guess if you think of it as "poor man's hard partitioning"
it makes a lot of sense.  But if you think of it in a utility data
center context, true affinity rather than restriction may make more
sense.

And vcpu_avail should cover most app licensing/pricing concerns.

> >    what happens if the vcpu is ready to schedule but none of the
> > restricted set of pcpu's is available?
> It's a restriction. Each of the values in the mask is 
> processed modulo the number of physical CPUs.

The output from "xm vcpu-list" applies the "modulo" but apparently
the scheduler does not.  For example, on a 2-pcpu system, launching
a 2-vcpu guest with cpus=0,3 (noting that 3 mod 2 = 1), "xm vcpu-list"
shows each of the guest's 2 vcpus with "any cpu" in the
"CPU Affinity" column, reflecting the fact that 0,3 is, modulo 2,
the same as 0,1, which is the same as 0-1, which is the same as all.

However, the cpu_mask is saved as 0,3, and the scheduler simply
ignores any pcpus that don't exist (here, anything other than 0
and 1), leaving only pcpu 0 usable.  This can be observed in
"xm vcpu-list" in the above example: both guest vcpus end up
sharing processor 0.
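The divergence between the two views can be sketched in a few lines of
Python.  This is a minimal model of the two interpretations, not xend's
actual code; the function names are made up for illustration:

```python
# Minimal model of the two interpretations of cpus="0,3" on a
# 2-pcpu host.  Hypothetical names, not xend's actual code.

def display_affinity(mask, nr_pcpus):
    """What "xm vcpu-list" shows: mask values reduced modulo nr_pcpus."""
    reduced = {cpu % nr_pcpus for cpu in mask}
    if reduced == set(range(nr_pcpus)):
        return "any cpu"
    return ",".join(str(c) for c in sorted(reduced))

def schedulable_pcpus(mask, nr_pcpus):
    """What the scheduler does: raw mask, nonexistent pcpus ignored."""
    return sorted(cpu for cpu in mask if cpu < nr_pcpus)

mask = [0, 3]                       # cpus="0,3" from the config file
print(display_affinity(mask, 2))    # "any cpu" (0,3 == 0,1 mod 2)
print(schedulable_pcpus(mask, 2))   # [0]: both vcpus share pcpu 0
```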

So the results displayed by "xm vcpu-list" and the actual scheduler
placement are different, but which one is the bug?  Consider:

If a 2-vcpu guest running on an 8-pcpu machine has been restricted
to cpus="2,3,4,5" and this guest gets migrated to a 4-pcpu system,
to which pcpus should the migrated guest be restricted?  Using the
"xm vcpu-list" logic it gets all 4 pcpus, but (if cpu_mask were
preserved, which it currently isn't) the scheduler logic would give
it just two (2 and 3).  And suppose this 2-vcpu guest on the 8-pcpu
system were restricted to "5-8" and migrated to a 4-pcpu system: it
wouldn't get any processor time at all (though "xm vcpu-list" would
say each vcpu's CPU Affinity is "any").
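Working the two migration cases through the same arithmetic (a
self-contained sketch; function names are illustrative, not xend code):

```python
# Illustrative sketch of the two migration cases above; not xend code.

def modulo_view(mask, nr_pcpus):
    """The "xm vcpu-list" view: mask values reduced modulo nr_pcpus."""
    return sorted({cpu % nr_pcpus for cpu in mask})

def scheduler_view(mask, nr_pcpus):
    """The scheduler's view: nonexistent pcpus are simply unusable."""
    return sorted(cpu for cpu in mask if cpu < nr_pcpus)

# Case 1: cpus="2,3,4,5" migrated from 8 pcpus to 4 pcpus.
print(modulo_view([2, 3, 4, 5], 4))      # [0, 1, 2, 3] -- all 4 pcpus
print(scheduler_view([2, 3, 4, 5], 4))   # [2, 3] -- just two

# Case 2: cpus="5-8" (i.e. 5,6,7,8) migrated to 4 pcpus.
print(modulo_view([5, 6, 7, 8], 4))      # [0, 1, 2, 3] -- "any"
print(scheduler_view([5, 6, 7, 8], 4))   # [] -- no cpu time at all
```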

Because affinity/cpu_restriction is not currently preserved across
save/restore or migration, this is a moot discussion.  But if I
were to "fix" it so it were preserved, the decision is important.

My opinion: CPU affinity/restriction should NOT be preserved
across migration.  Or if it is, it should only be preserved
when the source and target have the same number of pcpus
(thus allowing save/restore to work OK).  Or maybe it should
only be preserved for save/restore and not for migration.
>>>>>>>>>>>>>>>>> Comments? <<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Note that vcpu_avail would still work across migration.
(Hmmm... have to look to see if vcpu_avail is currently
preserved across save/restore/migration. If not, I will
definitely need to find and fix that one.)

> There was an extension to the cpus= syntax proposed at one 
> point that I'm not sure whether it ever got checked in. The 
> idea was to allow the cpus= parameter to be a list of 
> strings, enabling a different mask to specified for each 
> VCPU. This would enable an admin to pin individual VCPUs to 
> CPUs rather than just at a domain level.

It looks like the internal vcpu data structure supports this,
and "xm vcpu-pin" supports it, but afaict there's no way to
specify per-vcpu affinity at "xm create" time.
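For reference, the proposed list-of-strings syntax would presumably
look something like this in the config file (hypothetical, since as
noted there's no way to specify it at create time today), with
"xm vcpu-pin" as the current post-creation workaround:

```
# Proposed (hypothetical) per-vcpu form: one mask string per vcpu.
vcpus = 2
cpus  = ["0,1", "2,3"]   # vcpu0 -> pcpus 0-1, vcpu1 -> pcpus 2-3

# Current workaround, run after the domain is created:
#   xm vcpu-pin mydomain 0 0,1
#   xm vcpu-pin mydomain 1 2,3
```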

> I'm not a huge fan of the cpus= mechanism. It would likely be 
> more user friendly to allow physical CPUs to be put into 
> groups then allow domains to be assigned to CPU groups. It 
> would also be better if you could specify physical CPUs by a 
> node.socket.core.thread hierarchy rather than the enumerated 
> CPU number.

Agreed, though I'll bet that would take major scheduler surgery.
And this would also further increase the confusion for migration!
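Just to make the hierarchical-naming idea concrete, the mapping from a
node.socket.core.thread address to today's enumerated CPU number might
look something like this (a hypothetical sketch with made-up topology
counts, assuming CPUs are enumerated thread-fastest):

```python
# Hypothetical sketch: address a pcpu by node.socket.core.thread
# instead of its flat enumerated number.  Topology counts are made up.

def pcpu_number(node, socket, core, thread,
                sockets_per_node=2, cores_per_socket=2,
                threads_per_core=2):
    """Map a node.socket.core.thread tuple to a flat CPU number,
    assuming threads vary fastest in the enumeration."""
    return ((node * sockets_per_node + socket) * cores_per_socket
            + core) * threads_per_core + thread

print(pcpu_number(0, 0, 0, 0))  # 0: first thread in the box
print(pcpu_number(1, 0, 1, 1))  # 11: 2nd node, 2nd core, 2nd thread
```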

I'd also like to see affinity and restriction teased apart
because they are separate concepts with different uses.

Xen-devel mailing list


