[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v10 09/25] x86: refactor psr: L3 CAT: set value: implement framework.



On 17-04-19 03:00:29, Jan Beulich wrote:
> >>> On 19.04.17 at 10:22, <yi.y.sun@xxxxxxxxxxxxxxx> wrote:
> > On 17-04-18 05:46:43, Jan Beulich wrote:
> >> >>> On 18.04.17 at 12:55, <yi.y.sun@xxxxxxxxxxxxxxx> wrote:
> >> > I made a test patch based on v10 and attached it in mail. Could you 
> >> > please
> >> > help to check it? Thanks!
> >> 
> >> This looks reasonable at the first glance, albeit I continue to be
> >> unconvinced that this is the only (reasonable) way of solving the

Do you have any other suggestion on this? Thanks!

> >> problem. After all we don't have to go through similar hoops for
> >> any other of the register state associated with a vCPU. There
> > 
> > Sorry, I do not understand your meaning clearly.
> > 1. DYM the ASSOC register which is set in 'psr_ctxt_switch_to'? In this 
> > patch,
> >    we check 'dom_ids' array to know if the domain's cos id has not been set 
> > but
> >    its 'psr_cos_ids[socket]' already has a non zero value. This case means 
> > the
> >    socket offline has happened so that we have to restore the domain's
> >    'psr_cos_ids[socket]' to default value 0 which points to default COS MSR.
> >    I think we have discussed this in previous mails and achieved agreement 
> > on
> >    such logic. 
> > 2. DYM the COS MSRs? We have discussed this before. Per your comments, when
> >    socket is online, the registers values may be modified by FW and others.
> >    So, we have to restore them to default value which is being done in
> >    'cat_init_feature'.
> > 
> > So, what is your exactly meaning here? Thanks!
> 
> Neither of the two. I'm comparing with COS/PSR-_unrelated_
> handling of register state. After all there are other MSRs which
> we need to put into the right state after a core comes online.
> 
For PSR feature, the 'cos_reg_val[]' of the feature is destroied when socket
is offline. The values in it are all 0 when socket is online again. This causes
error when user shows the CBMs. So, we have two options when socket is online:
1. Read COS MSRs values and save them into 'cos_reg_val[]'.
2. Restore COS MSRs values and 'cos_reg_val[]' values to default CBM.

Per our discussion on v9 patch 05, we decided to adopt option 2. Below are
what we discussed. FYR.

[v9 patch 05]
>> >> > +    /* cos=0 is reserved as default cbm(all bits within cbm_len are 
>> >> > 1). */
>> >> > +    feat->cos_reg_val[0] = cat_default_val(cat.cbm_len);
>> >> > +    /*
>> >> > +     * To handle cpu offline and then online case, we need read MSRs 
>> >> > back to
>> >> > +     * save values into cos_reg_val array.
>> >> > +     */
>> >> > +    for ( i = 1; i <= cat.cos_max; i++ )
>> >> > +    {
>> >> > +        rdmsrl(MSR_IA32_PSR_L3_MASK(i), val);
>> >> > +        feat->cos_reg_val[i] = (uint32_t)val;
>> >> > +    }
>> >> 
[Jan]
>> >> You mention this in the changes done, but I don't understand why 
>> >> you do this. What meaning to these values have to you? If you want 
>> >> hardware and cached values to match up, the much more conventional 
>> >> way of enforcing this would be to write the values you actually 
>> >> want (normally all zero).
>> >>
[Sun Yi] 
>> > When all cpus on a socket are offline, the free_feature() is called 
>> > to free features resources so that the values saved in 
>> > cos_reg_val[] are lost. When the socket is online again, features 
>> > are allocated again so that cos_reg_val[] members are all 
>> > initialized to 0. Only is cos_reg_val[0] initialized to default value in 
>> > this function in old codes.
>> > 
>> > But domain is still alive so that its cos id on the socket is kept. 
>> > The corresponding MSR value is kept too per test. To make 
>> > cos_reg_val[] values be same as HW to not to mislead user, we 
>> > should read back the valid values on HW into cos_reg_val[].
>> 
[Jan]
>> Okay, I understand the background, but I don't view this solution as 
>> viable: Once the last core on a socket goes offline, all references 
>> to it should be cleaned up. After all what will be brought back 
>> online may be a different physical CPU altogether; you can't assume 
>> MSR values to have survived even if it is the same CPU which comes 
>> back online, as it may have undergone a reset cycle, or BIOS/SMM may 
>> have played with the MSRs.
>> That's even a possibility for a single core coming back online, so 
>> you have to reload MSRs explicitly anyway if implicit reloading (i.e. 
>> once vCPU-s get scheduled onto it) doesn't suffice.
>> 
[Sun Yi]
> So, you think the MSRs values may not be valid after such process and 
> reloading (write MSRs to default value) is needed. If so, I would like 
> to do more operations in 'free_feature()':
> 1. Iterate all domains working on the offline socket to change
>    'd->arch.psr_cos_ids[socket]' to COS 0, i.e restore it back to init
>    status.
> 2. Restore 'socket_info[socket].cos_ref[]' to all 0.
> 
> These can make the socket's info be totally restored back to init status.
[Jan]
Yes, that's what I think is needed.

> >> are a number of cosmetic issues, but commenting on an attached
> >> (rather than inlined) patch is a little difficult.
> >> 
> > Sorry for that, I will directly send patch out next time.
> > 
> >> One remark regarding the locking though: Acquiring a lock in the
> >> context switch path should be made as low risk of long stalls as
> >> possible. Therefore you will want to consider using r/w locks
> >> instead of spin locks here, which would allow parallelism on all
> >> cores of a socket as long as COS IDs aren't being updated.
> >> 
> > In 'psr_ctxt_switch_to', I use the lock only to protect 'write' actions.
> > So, I do not understand why read-write lock is better? Anything I don't
> > consider? Please help to point out. Thanks!
> 
> Hmm, looking again I can see that r/w locks indeed may not help
> here. However, what you say still doesn't look correct to me: You
> acquire the lock depending on _only_ psra->cos_mask being non-
> zero. This means that all cores on one socket are being serialized
> here. Quite possibly all you need is for some of the checks done
> inside the locked region to be replicated (but _not_ moved) to the
> outer if(), to limit the number of times where the lock is to be
> acquired.
> 
I think your suggestions is to check old_cos outer the lock region.
If it is not 0 which means the domain's cos id does not need restore
to 0, we can directly set it into ASSOC register. That can avoid
unnecessary lock. I will send out the test patch again to ask your
help to provide review comments (you said there are also 'a number
of cosmetics issues'). Thanks!

> Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.