Re: [Xen-devel] [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
On Apr 30, 2014, at 6:20 PM, "Aravindh Puthiyaparambil (aravindp)" <aravindp@xxxxxxxxx> wrote:

>>>>> +    default:
>>>>> +        *flags &= ~(_PAGE_NX_BIT);
>>>>> +        *flags |= (_PAGE_RW|_PAGE_PRESENT);
>>>>> +        break;
>>>>> +    }
>>>>
>>>> Tim will have the final say here, but I'd suggest dropping all these
>>>> needless parentheses (I see only one case where they're really
>>>> needed).
>>>
>>> OK, I will drop all the needless ones. Could you please point out the
>>> case where they are needed? I couldn't find anything in CODING_STYLE
>>> that dictates what should be done.
>>
>> Convention is to use them when operator precedence isn't considered
>> obvious. What counts as "obvious" here differs between people, but at
>> least all precedence rules defined by school mathematics should generally
>> not require parenthesizing. In the case above that means no parentheses
>> around the right-side expressions of the assignment operators.
>
> OK, I understand and will fix them appropriately.
>
>>>>> +
>>>>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>>>>> +        return -ENOENT;
>>>>> +
>>>>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>>>>
>>>> You get "gfn" passed in and blindly overwrite it here? _If_ you need
>>>> to do this lookup, shouldn't you instead check that it matches the
>>>> passed-in one?
>>>
>>> The "gfn" and "mfn" passed in to the function should be the same as it
>>> is a PV guest. That is why I am overwriting "gfn" to get the "gpfn"
>>> from the machine_to_phys_mapping.
>>
>> Since this overwrites/discards a function parameter, this surely is worth a
>> comment (or a pseudo-comment like ASSERT(gfn == mfn_x(mfn))).
>
> I will add the ASSERT and a comment for it.
>
>>>>> +
>>>>> +    /*
>>>>> +     * Values with the MSB set denote MFNs that aren't really part of the
>>>>> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
>>>>> +     * Nothing to do here.
>>>>> +     */
>>>>> +    if ( unlikely(!VALID_M2P(gfn)) )
>>>>> +        return 0;
>>>>> +
>>>>> +    if ( gfn > (d->tot_pages - 1) )
>>>>> +        return -EINVAL;
>>>>
>>>> Hardly - a PV domain can easily make its address space sparse, and
>>>> iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>>>> on MMIO space. (And as a side note, this would better be "gfn >=
>>>> d->tot_pages".)
>>>
>>> I was using this as a guard against going out of bounds in the access
>>> lookup table, which is created based on the domain's tot_pages. Please
>>> let me know what I should be using instead of tot_pages when doing this.
>>
>> Afaict, for PV guests there's absolutely nothing you can use to bound the
>> range.
>
>>>>> +    paging_lock(d);
>>>>> +
>>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>>> +    access_table_page[page_idx] = p2ma;
>>>>> +    unmap_domain_page(access_table_page);
>>>>> +
>>>>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>>>>
>>>> Is there anything guaranteeing d->vcpu and d->vcpu[0] being non-NULL?
>>>
>>> I am not sure. I see shadow_track_dirty_vram() calling this without a check.
>>> Maybe I should check and call it only if d->vcpu is not NULL.
>>
>> You ought to check both, at least via ASSERT().
>
> OK, I will add the ASSERTs.
>
>>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>>> +
>>>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>>> +
>>>>> +    /* This is a hint to take the default permissions */
>>>>> +    if ( access_table_page[page_idx] == p2m_access_n )
>>>>> +        access_table_page[page_idx] = p2m->default_access;
>>>>
>>>> We're in "get" here - why does that modify any global state?
>>>
>>> I do not initialize the access lookup table to the default value. But
>>> the table is zeroed as part of initialization when each page is
>>> allocated using shadow_alloc_p2m_page(). I then use the "p2m_access_n"
>>> / 0 value as a hint that the default value should be returned, as
>>> p2m_access_n is not valid for PV domains. If you prefer that I
>>> initialize the table to the default value, I will gladly do so.
>>
>> I think implying _n to mean "default" is at least a latent bug anyway
>> (since that precludes setting that type). And _if_ "get" really does any
>> modification, that minimally would require a big comment.
>
> For PV domains access_n is precluded.
>
>> Furthermore, with it being possible for the default type to change, the net
>> effect might easily be different between setting the default type at
>> initialization time or when the first lookup happens.
>
> I am not sure that I follow your concern here. Setting the default access
> does not mean that the access table gets cleared. So if a particular MFN
> entry's access value was changed and the default value was then changed,
> the MFN should retain the access value it was changed to and not the
> default value. The new default value will only be returned for MFNs on
> which set_entry was not called, and I think that is correct.
>
>>>>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>>>>> +{
>>>>> +    struct domain *d = p2m->domain;
>>>>> +    mfn_t *access_lookup_table;
>>>>> +    uint32_t nr_access_table_pages;
>>>>> +    uint32_t ctr;
>>>>> +
>>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>>>>
>>>> This surely is going to be an order > 0 allocation, and you even risk
>>>> it being an order > MAX_ORDER one. The former is disallowed at runtime
>>>> by convention/agreement, the latter is an outright bug.
>>>
>>> Sorry, I was not aware of the convention. What should I be doing here
>>> instead?
>>
>> Especially with the above note about GFN space being effectively unbounded
>> for PV guests, you need to find a different data structure to represent the
>> information you need. Perhaps something similar to what log-dirty mode
>> uses, except that you would need more levels in the extreme case, and hence
>> it might be worthwhile making the actually used number of levels dynamic?
>
> I am a little confused here. In my initial discussions about this I was told
> that PV guests have bounded and contiguous memory. This is why I took the
> approach of an indexable array.
>
> "OTOH, all you need is a byte per pfn, and the great thing is that in PV
> domains, the physmap is bounded and continuous. Unlike HVM and its PCI
> holes, etc, which demand the sparse tree structure. So you can allocate an
> easily indexable array, notwithstanding super page concerns (I think/hope)."
>
> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.html
>
> Did I misunderstand something here?
> PS: I have CCed Andres too so that he can clarify.

I can muddy things up further. In the good old days that was the case. It
seems that since Linux 3.x, PV dom0 is defined with a physmap as large as the
host, with plenty of holes in it which are plugged with the m2p override. I
am unfortunately unhelpfully hazy on the details.

A situation like this will lead to max_pfn being very different from
tot_pages. This is common in HVM.

Apologies for the lack of precision,
Andres

>>>>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
>>>>>      if ( pfn == ~0ul )
>>>>>      {
>>>>>          p2m->default_access = a;
>>>>> +        if ( is_pv_domain(d) )
>>>>> +            return p2m_mem_access_set_default(p2m);
>>>>
>>>> Is it really correct to set p2m->default_access _before_ calling that
>>>> function? Perhaps it wouldn't be correct doing it the other way around
>>>> either - I suppose you need to hold the necessary lock across both
>>>> operations.
>>>
>>> I do not see the problem here. p2m_mem_access_set_default() just blows
>>> the shadows away after taking the paging_lock. There is no actual
>>> setting of a default permission for all pages in the PV case.
>>
>> But the function having a return value means it might fail, in which case
>> you shouldn't have set the new default.
>
> Good point, I will fix that.
>
>>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
>>>>>      if ( !(shadow_mode_external(v->domain)
>>>>>             && (page->count_info & PGC_count_mask) <= 3
>>>>>             && ((page->u.inuse.type_info & PGT_count_mask)
>>>>> -               == !!is_xen_heap_page(page))) )
>>>>> +               == !!is_xen_heap_page(page)))
>>>>> +        && !(shadow_mode_mem_access(v->domain)) )
>>>>
>>>> You're breaking indentation, there are pointless parentheses again,
>>>> but most importantly - why?
>>>
>>> Sorry, I meant to ask a question about this in my patch message and
>>> mark this as a workaround. I am seeing the message on all
>>> sh_remove_all_mappings() calls and I was not able to figure out why this
>>> was happening. I just added this as a workaround. I was hoping you or Tim
>>> would shed more light on this.
>>
>> I'm afraid that without detail on which pages this triggers on, and you at
>> least having spent some time finding out where the stray/extra references
>> may be coming from, it's going to be hard to help.
>
> Here are some of the prints I saw. I typically saw it for every set_entry()
> call.
>
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134b8: c=8000000000000038 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13344: c=8000000000000003 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13449: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1394b: c=8000000000000013 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 137b5: c=8000000000000003 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33fa6: c=8000000000000002 t=7400000000000000
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1345d: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33410: c=8000000000000034 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33412: c=8000000000000039 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33953: c=8000000000000011 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13a63: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f2: c=800000000000003a t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1335b: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bc: c=8000000000000036 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f0: c=8000000000000014 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bd: c=8000000000000008 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1354e: c=8000000000000010 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33443: c=8000000000000013 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 39f74: c=8000000000000008 t=7400000000000001
>
>>>>> +    /*
>>>>> +     * Do not police writes to guest memory emanating from the Xen
>>>>> +     * kernel. Trying to do so will cause the same pagefault to occur
>>>>> +     * over and over again with an event being sent to the access
>>>>> +     * listener for each fault. If the access listener's vcpu is not
>>>>> +     * scheduled during this time, the violation is never resolved and
>>>>> +     * will eventually end with the host crashing.
>>>>> +     */
>>>>> +    if ( (violation && access_w) &&
>>>>> +         (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>>>>> +    {
>>>>> +        violation = 0;
>>>>> +        rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>>>>> +                            p2m_ram_rw, p2m_access_rw);
>>>>> +    }
>>>>
>>>> This looks more like a hack than a proper solution - shouldn't the
>>>> listener be allowed to know of hypervisor-side accesses?
>>>
>>> Ideally the listener should know of the hypervisor-side accesses, but I
>>> ran into cases where the listener would not get to run and the
>>> pagefault would be retried over and over, eventually causing the host to
>>> crash. I guess more detail about the problem is needed. I will send
>>> out a separate email regarding this, CCing you and Tim.
>>
>> And if it's intended as a workaround until proper resolution only, you
>> probably should say so explicitly in the comment, avoiding the need for
>> reviewers to comment on this being a problem.
>
> Sorry, I will be more explicit about such things in the future.
>
>>>>> +/* Number of access table pages for a PV domain */
>>>>> +#define get_domain_nr_access_table_pages(d) \
>>>>> +    DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>>>>
>>>> And once again. I wonder whether this code was tested on a suitably
>>>> big pv-ops PV guest.
>>>
>>> The largest PV domain I tried this with was 28GB.
>>
>> Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
>> worked.
>
> It would have had MFNs larger than tot_pages. I don't understand how it
> would have had GFNs larger than tot_pages.
>
> Thanks,
> Aravindh

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel