Re: [Xen-devel] [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@xxxxxxxxxx> wrote:
> > @@ -43,6 +44,11 @@ static long __initdata dom0_nrpages;
> >  static long __initdata dom0_min_nrpages;
> >  static long __initdata dom0_max_nrpages = LONG_MAX;
> > 
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE 128
> > +
> > +static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
> 
> This is for your debugging only I suppose?

Probably, I wasn't sure if it was relevant, so I left it here. Would it
make sense to only print this for debug builds of the hypervisor, or
would it be better to just remove it?

> > @@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
> >          avail -= dom0_paging_pages(d, nr_pages);
> >      }
> > 
> > -    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> > +    if ( is_pv_domain(d) &&
> > +         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> 
> Perhaps better to simply force parms->p2m_base to UNSET_ADDR
> earlier on?

p2m_base is already unconditionally set to UNSET_ADDR for PVHv2 domains,
hence the added is_pv_domain check, in order to make sure that PVHv2
guests don't get into that branch, which AFAICT is only relevant to PV
guests.

> > @@ -579,8 +588,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> >              continue;
> >          }
> > 
> > -        *entry_guest = *entry;
> > -        pages = PFN_UP(entry_guest->size);
> > +        /*
> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
> > +         * that's the minimum granularity of the 2nd stage translation.
> > +         */
> > +        start = ROUNDUP(entry->addr, PAGE_SIZE);
> > +        end = (entry->addr + entry->size) & PAGE_MASK;
> > +        if ( start >= end )
> > +            continue;
> > +
> > +        entry_guest->type = E820_RAM;
> > +        entry_guest->addr = start;
> > +        entry_guest->size = end - start;
> > +        pages = PFN_DOWN(entry_guest->size);
> >          if ( (cur_pages + pages) > nr_pages )
> >          {
> >              /* Truncate region */
> > @@ -591,6 +611,8 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> >          {
> >              cur_pages += pages;
> >          }
> > +        ASSERT(IS_ALIGNED(entry_guest->addr, PAGE_SIZE) &&
> > +               IS_ALIGNED(entry_guest->size, PAGE_SIZE));
> 
> What does this guard against? Your addition arranges for things to
> be page aligned, and the only adjustment done until we get here is
> one that obviously also doesn't violate that requirement. I'm all for
> assertions when they check state which is not obviously right, but
> here I don't see the need.

Right, I'm going to remove it. I've added more seat belts than needed
when testing this and forgot to remove them.
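
FWIW, the inward rounding does what's intended on a region that is
unaligned at both ends; a standalone, untested sketch with made-up
values (not part of the patch):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1ULL << PAGE_SHIFT)
    #define PAGE_MASK  (~(PAGE_SIZE - 1))
    #define ROUNDUP(x, a) (((x) + (a) - 1) & ~((a) - 1))

    int main(void)
    {
        /* An e820 RAM entry that is page-unaligned at both ends. */
        uint64_t addr = 0x9f800, size = 0x10400;
        /* Round the start up and the end down to page boundaries. */
        uint64_t start = ROUNDUP(addr, PAGE_SIZE);  /* 0xa0000 */
        uint64_t end = (addr + size) & PAGE_MASK;   /* 0xafc00 -> 0xaf000 */

        if ( start >= end )
            puts("region smaller than a page, skipped");
        else
            printf("usable RAM: [%#llx, %#llx)\n",
                   (unsigned long long)start, (unsigned long long)end);
        return 0;
    }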
> > @@ -1657,15 +1679,238 @@ out:
> >      return rc;
> >  }
> > 
> > +/* Populate an HVM memory range using the biggest possible order. */
> > +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> > +                                             uint64_t size)
> > +{
> > +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
> > +    unsigned int order;
> > +    struct page_info *page;
> > +    int rc;
> > +
> > +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
> > +
> > +    order = MAX_ORDER;
> > +    while ( size != 0 )
> > +    {
> > +        order = min(get_order_from_bytes_floor(size), order);
> > +        page = alloc_domheap_pages(d, order, memflags);
> > +        if ( page == NULL )
> > +        {
> > +            if ( order == 0 && memflags )
> > +            {
> > +                /* Try again without any memflags. */
> > +                memflags = 0;
> > +                order = MAX_ORDER;
> > +                continue;
> > +            }
> > +            if ( order == 0 )
> > +                panic("Unable to allocate memory with order 0!\n");
> > +            order--;
> > +            continue;
> > +        }
> 
> Is it not possible to utilize alloc_chunk() here?

Not really, alloc_chunk will do a ceil calculation of the number of
needed pages, which means we would end up allocating bigger chunks than
needed and then freeing them. I prefer this approach, since we always
request the exact amount of memory that's required, so there's no need
to free leftover pages.
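
To illustrate the difference, here's a standalone, untested sketch of
the floor-based decomposition (order_floor mimics what I understand
get_order_from_bytes_floor to do): 6MB splits into a 4MB plus a 2MB
allocation with no remainder to hand back, whereas a ceil-based chunk
would allocate 8MB and then have to free 2MB:

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12

    /* Largest order whose size does not exceed the remaining bytes. */
    static unsigned int order_floor(uint64_t bytes)
    {
        unsigned int order = 0;

        while ( (2ULL << (order + PAGE_SHIFT)) <= bytes )
            order++;
        return order;
    }

    int main(void)
    {
        uint64_t size = 6ULL << 20; /* 6MB, page-aligned as required */

        while ( size != 0 )
        {
            unsigned int order = order_floor(size);

            /* Request exactly this chunk, never more than remains. */
            printf("order %u (%llu bytes)\n", order,
                   1ULL << (order + PAGE_SHIFT));
            size -= 1ULL << (order + PAGE_SHIFT);
        }
        return 0;
    }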
> > +        hvm_mem_stats[order]++;
> > +        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
> > +                                    _mfn(page_to_mfn(page)), order);
> > +        if ( rc != 0 )
> > +            panic("Failed to populate memory: [%" PRIx64 " - %" PRIx64 "] %d\n",
> 
> [<start>,<end>) please.
> 
> > +                  start, start + (((uint64_t)1) << (order + PAGE_SHIFT)), rc);
> > +        start += ((uint64_t)1) << (order + PAGE_SHIFT);
> > +        size -= ((uint64_t)1) << (order + PAGE_SHIFT);
> 
> Please prefer 1ULL over (uint64_t)1.
> 
> > +        if ( (size & 0xffffffff) == 0 )
> > +            process_pending_softirqs();
> 
> That's 4Gb at a time - isn't that a little too much?

Hm, it's the same value that's used in pvh_add_mem_mapping AFAICT. I
could reduce it to 0xfffffff, but I'm also wondering if it makes sense
to just call it on each iteration, since we are possibly mapping
regions with big orders here.

> > +    }
> > +
> > +}
> 
> Stray blank line.
> 
> > +static int __init hvm_setup_vmx_unrestricted_guest(struct domain *d)
> > +{
> > +    struct e820entry *entry;
> > +    p2m_type_t p2mt;
> > +    uint32_t rc, *ident_pt;
> > +    uint8_t *tss;
> > +    mfn_t mfn;
> > +    paddr_t gaddr = 0;
> > +    int i;
> 
> unsigned int
> 
> > +    /*
> > +     * Stole some space from the last found RAM region. One page will be
> 
> Steal
> 
> > +     * used for the identify page tables, and the remaining space for the
> 
> identity
> 
> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> > +     * to PAGE_SIZE.
> > +     */
> > +    for ( i = 1; i <= d->arch.nr_e820; i++ )
> > +    {
> > +        entry = &d->arch.e820[d->arch.nr_e820 - i];
> > +        if ( entry->type != E820_RAM ||
> > +             entry->size < PAGE_SIZE + HVM_VM86_TSS_SIZE )
> > +            continue;
> > +
> > +        entry->size -= PAGE_SIZE + HVM_VM86_TSS_SIZE;
> > +        gaddr = entry->addr + entry->size;
> > +        break;
> > +    }
> > +
> > +    if ( gaddr == 0 || gaddr < MB(1) )
> > +    {
> > +        printk("Unable to find memory to stash the identity map and TSS\n");
> > +        return -ENOMEM;
> 
> One function up you panic() on error - please be consistent. Also for
> one of the other patches I think we figured that the TSS isn't really
> required, so please only warn in that case.
> 
> > +    }
> > +
> > +    /*
> > +     * Identity-map page table is required for running with CR0.PG=0
> > +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> > +     * superpages.
> > +     */
> > +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> > +                         &mfn, &p2mt, 0, &rc);
> 
> Comment and operation don't really fit together.
> 
> > +static int __init hvm_setup_p2m(struct domain *d)
> > +{
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    unsigned long nr_pages;
> > +    int i, rc, preempted;
> > +
> > +    printk("** Preparing memory map **\n");
> 
> Debugging leftover again?

Should this be merged with the next label, so that it reads as
"Preparing and populating the memory map"?

> > +    /*
> > +     * Subtract one page for the EPT identity page table and two pages
> > +     * for the MADT replacement.
> > +     */
> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;
> 
> How do you know the MADT replacement requires two pages? Isn't
> that CPU-count dependent? And doesn't the partial page used for
> the TSS also need accounting for here?

Yes, it's CPU-count dependent. This is just an approximation: since we
only support up to 256 CPUs on HVM guests, and each Processor Local
APIC entry is 8 bytes, the CPU-related data is going to use up to 2048
bytes, which still leaves plenty of space for the IO APIC and the
Interrupt Source Override entries. We request two pages in case the
original MADT crosses a page boundary. FWIW, I could also fetch the
original MADT size earlier and use that as the upper bound here for
memory reservation.

In the RFC series we also spoke about placing the MADT in a different
position than the native one, which would mean that we would end up
stealing some space from a RAM region in order to place it, so that we
wouldn't have to do this accounting.

In fact this is slightly wrong: it doesn't need to account for the
identity page table or the VM86 TSS, since we end up stealing that
space from populated RAM regions. At the moment we only need to account
for the 2 pages that could be used by the MADT.
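
To put numbers on that approximation, an untested sketch of the
arithmetic (entry sizes per the ACPI spec; the 256 vCPU cap is the HVM
limit mentioned above):

    #include <stdio.h>

    #define PAGE_SIZE        4096
    #define HVM_MAX_CPUS      256 /* assumed HVM vCPU limit */
    #define MADT_HEADER_SIZE   44 /* ACPI header (36) + LAPIC addr + flags */
    #define LAPIC_ENTRY_SIZE    8 /* Processor Local APIC entry */

    int main(void)
    {
        /* 256 * 8 = 2048 bytes of Local APIC entries; two pages leave
         * ample room for the IO APIC and Interrupt Source Override
         * entries even if the table crosses a page boundary. */
        unsigned int bound = MADT_HEADER_SIZE +
                             HVM_MAX_CPUS * LAPIC_ENTRY_SIZE;

        printf("header + LAPIC entries: %u bytes of %u reserved\n",
               bound, 2 * PAGE_SIZE);
        return 0;
    }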
> > +    hvm_setup_e820(d, nr_pages);
> > +    do {
> > +        preempted = 0;
> > +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> > +                              &preempted);
> > +        process_pending_softirqs();
> > +    } while ( preempted );
> > +
> > +    /*
> > +     * Special treatment for memory < 1MB:
> > +     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
> > +     *  - Map everything else as 1:1.
> > +     * NB: all this only makes sense if booted from legacy BIOSes.
> > +     */
> > +    rc = modify_mmio_11(d, 0, PFN_DOWN(MB(1)), true);
> > +    if ( rc )
> > +    {
> > +        printk("Failed to map low 1MB 1:1: %d\n", rc);
> > +        return rc;
> > +    }
> > +
> > +    printk("** Populating memory map **\n");
> > +    /* Populate memory map. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        if ( d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        hvm_populate_memory_range(d, d->arch.e820[i].addr,
> > +                                  d->arch.e820[i].size);
> > +        if ( d->arch.e820[i].addr < MB(1) )
> > +        {
> > +            unsigned long end = min_t(unsigned long,
> > +                        d->arch.e820[i].addr + d->arch.e820[i].size, MB(1));
> > +
> > +            saved_current = current;
> > +            set_current(v);
> > +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
> > +                                        maddr_to_virt(d->arch.e820[i].addr),
> > +                                        end - d->arch.e820[i].addr);
> > +            set_current(saved_current);
> > +            if ( rc != HVMCOPY_okay )
> > +            {
> > +                printk("Unable to copy RAM region %#lx - %#lx\n",
> > +                       d->arch.e820[i].addr, end);
> > +                return -EFAULT;
> > +            }
> > +        }
> > +    }
> > +
> > +    printk("Memory allocation stats:\n");
> > +    for ( i = 0; i <= MAX_ORDER; i++ )
> > +    {
> > +        if ( hvm_mem_stats[MAX_ORDER - i] != 0 )
> > +            printk("Order %2u: %pZ\n", MAX_ORDER - i,
> > +                   _p(((uint64_t)1 << (MAX_ORDER - i + PAGE_SHIFT)) *
> > +                      hvm_mem_stats[MAX_ORDER - i]));
> > +    }
> > +
> > +    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
> > +    {
> > +        /*
> > +         * Since Dom0 cannot be migrated, we will only setup the
> > +         * unrestricted guest helpers if they are needed by the current
> > +         * hardware we are running on.
> > +         */
> > +        rc = hvm_setup_vmx_unrestricted_guest(d);
> 
> Calling a function of this name inside an if() checking for
> !vmx_unrestricted_guest() is, well, odd.

Yes, that's right. What about calling it hvm_setup_vmx_realmode_helpers?

> >  static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
> >                                       unsigned long image_headroom,
> >                                       module_t *initrd,
> >                                       void *(*bootstrap_map)(const module_t *),
> >                                       char *cmdline)
> >  {
> > +    int rc;
> > 
> >      printk("** Building a PVH Dom0 **\n");
> > 
> > +    /* Sanity! */
> > +    BUG_ON(d->domain_id != 0);
> > +    BUG_ON(d->vcpu[0] == NULL);
> 
> May I suggest
> 
>     BUG_ON(d->domain_id);
>     BUG_ON(!d->vcpu[0]);
> 
> in cases like this?

Yes, I have the tendency to not use '!' or direct checks unless it's a
boolean type.

> > +    process_pending_softirqs();
> 
> Why, outside of any loop?

It's the same thing that's done in construct_dom0_pv, so I thought it
was a good idea to drain any pending softirqs before starting the
domain build.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel