Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> Hi Jan --
>
> Thanks for the quick review!
>
> Mostly fixed for v3 but a couple things:
>
>>> + * Claimed memory is considered unavailable unless the request
>>> + * is made by a domain with sufficient unclaimed pages.
>>> + */
>>> + if ( (total_unclaimed_pages + request >
>>> + total_avail_pages + tmem_freeable_pages()) &&
>>> + (d == NULL || d->unclaimed_pages < request) )
>>> + goto not_found;
>>
>> The treatment of d being NULL certainly needs further thought:
>> Is it really better to fulfill the claim and fail some (perhaps
>> important) _xmalloc()?
>
> Ideally, allocation in the presence of existing claims should
> behave as if the claiming domains had actually already allocated
> the unclaimed-amount-of-memory. So I'd argue that enforcing
> the claim should be sacrosanct here.
Well, are we sure that failing an "anonymous" allocation is not going to
trigger a BUG_ON somewhere? Auditing that is a lot of code review. If we get
this wrong, Xen suddenly crashes when allocating domains close to the maximum,
which it doesn't do today, afaict.
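To make the semantics under discussion concrete, here is a minimal, free-standing restatement of the check in the quoted hunk (not the patch itself; the variable names mirror the patch, the standalone struct is mine): every outstanding claim is treated as if it had already been allocated, and only a domain drawing down its own claim is exempt; anonymous (d == NULL) allocations get no exemption.

    /* Illustrative sketch only: restates the quoted check outside the
     * allocator so the d == NULL case is easy to see.  Names mirror the
     * patch; the standalone struct is an assumption for the example. */
    #include <stdbool.h>

    struct domain_claim {
        unsigned long unclaimed_pages;  /* claimed but not yet allocated */
    };

    static bool claim_allows_allocation(unsigned long request,
                                        unsigned long total_avail_pages,
                                        unsigned long tmem_freeable_pages,
                                        unsigned long total_unclaimed_pages,
                                        const struct domain_claim *d)
    {
        /* Claimed memory is unavailable to everyone else, including
         * anonymous (d == NULL) allocations such as _xmalloc(). */
        if ( (total_unclaimed_pages + request >
              total_avail_pages + tmem_freeable_pages) &&
             (d == NULL || d->unclaimed_pages < request) )
            return false;
        return true;
    }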
>
>> Also, I'm missing a mechanism by which the tools could find out
>> how much unclaimed memory is available, in order to determine
>> (if in use) how much memory needs to be ballooned out of Dom0.
>
> OK. I'm not certain if this will be useful on a per-domain
> basis as well but, for completeness, I will also add
> unclaimed_pages into xc_dominfo etc (which causes a bump
> in XEN_DOMCTL_INTERFACE_VERSION).
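For what it's worth, here is roughly the consumer side that would go with that, as a hedged sketch: xc_interface_open()/xc_domain_getinfo() are existing libxc calls, but the unclaimed_pages field is only the addition proposed above and does not exist in xc_dominfo_t today.

    /* Hypothetical toolstack-side sketch: read the proposed per-domain
     * unclaimed-page count via libxc.  The .unclaimed_pages field is the
     * one proposed in this thread, not an existing one. */
    #include <stdio.h>
    #include <xenctrl.h>

    static int print_unclaimed(uint32_t domid)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        xc_dominfo_t info;
        int ret = -1;

        if ( !xch )
            return -1;

        /* Ask for exactly one domain's info, starting at domid. */
        if ( xc_domain_getinfo(xch, domid, 1, &info) == 1 &&
             info.domid == domid )
        {
            printf("dom%u unclaimed pages: %lu\n",
                   domid, (unsigned long)info.unclaimed_pages);
            ret = 0;
        }

        xc_interface_close(xch);
        return ret;
    }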
>
>> Similarly, but perhaps of lower priority, there is no integration
>> with the low-mem handling.
>
> I'd also consider this lower priority as Olaf and Andre
> have argued that the claim mechanism is not needed for
> sharing/paging so the two mechanisms may not
> be used together, at least for the foreseeable future.
> So I plan to skip this, unless you change your mind and
> consider it a showstopper for acceptance.
This is a slippery slope. Let's not work out the interactions with existing
subsystems before adding code to the tree. What could go wrong?
As a data point for everyone, I've found the low-mem virq extremely useful as a
sync interrupt signaling to our toolstack that it needs to get its act together
and start rebalancing memory, if it hasn't yet. I don't see how it would not be
just as useful to any other toolstack.
Andres
>
>> Finally, there still are a number of formatting issues.
>
> Hmmm... I found one I think. Is there an equivalent to
> checkpatch for hypervisor code? If you see any formatting
> issues in v3, please call them out explicitly as I am
> sincerely trying to avoid them.
>
> Thanks,
> Dan
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 15 Nov 2012 18:23:09 +0000
> From: Mats Petersson <mats.petersson@xxxxxxxxxx>
> To: Tim Deegan <tim@xxxxxxx>
> Cc: "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>
> Subject: Re: [Xen-devel] [PATCH V2] xen: vmx: Use an INT 2 call to
> process real NMI's instead of self_nmi() in VMEXIT handler
> Message-ID: <50A5330D.9020204@xxxxxxxxxx>
> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
>
> On 15/11/12 17:44, Tim Deegan wrote:
>> At 17:33 +0000 on 15 Nov (1353000782), Mats Petersson wrote:
>>> On 15/11/12 17:15, Tim Deegan wrote:
>>>> At 17:03 +0000 on 15 Nov (1352998993), Mats Petersson wrote:
>>>>>> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
>>>>>> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
>>>>>> are still bad).
>>>>>>
>>>>>> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
>>>>>> guests. We don't really have to but it saves time in the context switch
>>>>>> not to update the IDT. Using do_nmi() here means that the first NMI is
>>>>>> handled on the normal stack instead. It's also consistent with the way
>>>>>> we call do_machine_check() for the MCE case. But it needs an explicit
>>>>>> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
>>>>> Both AMD and Intel have an identical setup with regard to stacks and
>>>>> the general "what happens when we take one of these interrupts".
>>>> My reading of svm_ctxt_switch_{to,from} makes me disagree with this.
>>>> AFAICT, on SVM we're not using dedicated stacks at all.
>>> In SVM, the VMRUN returns to whatever stack you had before the VMRUN.
>>> This is not what I'm talking about, however. The stack used for the NMI
>>> and MCE comes from the interrupt descriptor entry for those respective
>>> vectors.
>> This is the code I was referring to:
>>
>>     /*
>>      * Cannot use ISTs for NMI/#MC/#DF while we are running with the guest TR.
>>      * But this doesn't matter: the IST is only req'd to handle SYSCALL/SYSRET.
>>      */
>>     idt_tables[cpu][TRAP_double_fault].a  &= ~(7UL << 32);
>>     idt_tables[cpu][TRAP_nmi].a           &= ~(7UL << 32);
>>     idt_tables[cpu][TRAP_machine_check].a &= ~(7UL << 32);
>>
>> Am I misreading it?
>
> No, you are reading it perfectly right, I'm wrong...
>
> --
> Mats
>>
>>> So in conclusion, the do_mce_exception() call probably should be a
>>> __asm__ __volatile__("int $X"), where X is the relevant vector.
>> This handles MCEs that were raised in guest context. If we've managed
>> to get this far into the exit handler, the hypervisor stack is probably
>> OK. :)
>>
>> I'd be happy to invoke the MCE handler through the IDT here, just for
>> symmetry with the other cases, but I don't think it makes much
>> difference.
>>
>> Tim.
>>
>>
>
>
>
>
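For readers following along, a minimal sketch of the "go through the IDT" option being discussed (vectors 2 and 18 are the architectural NMI and #MC vectors; whether this is actually preferable to calling do_nmi()/do_machine_check() directly is exactly the question above). Reflecting the event through the IDT picks up whatever stack the gate specifies, and the IRET on the way out re-enables NMI delivery, which a direct call to do_nmi() only gets with an explicit IRET afterwards.

    /* Illustration only, not the patch under discussion: re-inject the
     * event through the IDT from the VMEXIT handler instead of calling
     * the C handler directly. */
    static inline void deliver_nmi_via_idt(void)
    {
        /* Software INT 2 enters the NMI gate; the handler's IRET
         * re-enables NMIs that were blocked by the NMI-caused VMEXIT. */
        __asm__ __volatile__ ( "int $2" );
    }

    static inline void deliver_mce_via_idt(void)
    {
        /* Same idea for #MC, architectural vector 18. */
        __asm__ __volatile__ ( "int $18" );
    }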
> ------------------------------
>
> Message: 5
> Date: Thu, 15 Nov 2012 13:29:28 -0500
> From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Cc: "netdev@xxxxxxxxxxxxxxx" <netdev@xxxxxxxxxxxxxxx>, ANNIE LI
> <annie.li@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx"
> <xen-devel@xxxxxxxxxxxxxxxxxxx>, Roger Pau Monne
> <roger.pau@xxxxxxxxxx>
> Subject: Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in
> xen-netfront/netback
> Message-ID: <20121115182928.GB22320@xxxxxxxxxxxxxxxxxxx>
> Content-Type: text/plain; charset=iso-8859-1
>
> On Thu, Nov 15, 2012 at 11:15:06AM +0000, Ian Campbell wrote:
>> On Thu, 2012-11-15 at 10:56 +0000, Roger Pau Monne wrote:
>>> On 15/11/12 09:38, ANNIE LI wrote:
>>>>
>>>>
>>>>> On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
>>>>> Hello,
>>>>>
>>>>> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
>>>>>> This patch implements persistent grants for xen-netfront/netback. The
>>>>>> mechanism maintains page pools in netback/netfront; these pools hold
>>>>>> grant pages that are already mapped, so the CPU time currently wasted
>>>>>> on repeated grant operations is saved.
>>>>>>
>>>>>> Today netback/netfront performs map/unmap grant operations frequently
>>>>>> when transmitting/receiving packets, and those grant operations cost a
>>>>>> lot of CPU cycles. With this patch, netfront/netback maps a grant page
>>>>>> when it is first needed and then saves it in a page pool for future
>>>>>> use. All of these pages are unmapped only when the net device is
>>>>>> removed/released.
>>>>>>
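To illustrate the mechanism being described, here is a toy sketch (not the code from the patch; names and structure are assumptions): keep already-mapped grant pages on a free list so the hot path reuses them instead of issuing map/unmap hypercalls per packet.

    /* Toy sketch of a persistent-grant page pool (illustrative only). */
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <xen/grant_table.h>

    struct persistent_gnt {
        struct page      *page;   /* page backing the mapped grant */
        grant_ref_t       gref;   /* grant reference it was mapped from */
        struct list_head  node;
    };

    struct gnt_page_pool {
        struct list_head  free;   /* mapped but currently idle grants */
        spinlock_t        lock;
    };

    /* Hot path: reuse a ready-mapped grant page if one is available; the
     * caller falls back to a real GNTTABOP_map_grant_ref only when the
     * pool is empty. */
    static struct persistent_gnt *pool_get(struct gnt_page_pool *pool)
    {
        struct persistent_gnt *gnt = NULL;

        spin_lock(&pool->lock);
        if (!list_empty(&pool->free)) {
            gnt = list_first_entry(&pool->free, struct persistent_gnt, node);
            list_del(&gnt->node);
        }
        spin_unlock(&pool->lock);
        return gnt;
    }

    /* Return the page to the pool still mapped; it is only unmapped when
     * the net device is removed/released. */
    static void pool_put(struct gnt_page_pool *pool, struct persistent_gnt *gnt)
    {
        spin_lock(&pool->lock);
        list_add(&gnt->node, &pool->free);
        spin_unlock(&pool->lock);
    }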
>>>>> Do you have performance numbers available already? with/without
>>>>> persistent grants?
>>>> I have some simple netperf/netserver test result with/without persistent
>>>> grants,
>>>>
>>>> Following is result of with persistent grant patch,
>>>>
>>>> Guests, Sum, Avg, Min, Max
>>>> 1, 15106.4, 15106.4, 15106.36, 15106.36
>>>> 2, 13052.7, 6526.34, 6261.81, 6790.86
>>>> 3, 12675.1, 6337.53, 6220.24, 6454.83
>>>> 4, 13194, 6596.98, 6274.70, 6919.25
>>>>
>>>>
>>>> Following are result of without persistent patch
>>>>
>>>> Guests, Sum, Avg, Min, Max
>>>> 1, 10864.1, 10864.1, 10864.10, 10864.10
>>>> 2, 10898.5, 5449.24, 4862.08, 6036.40
>>>> 3, 10734.5, 5367.26, 5261.43, 5473.08
>>>> 4, 10924, 5461.99, 5314.84, 5609.14
>>>
>>> In the block case, the performance improvement is seen when using a large
>>> number of guests; could you run the same benchmark with the number of
>>> guests increased to 15?
>>
>> It would also be nice to see some analysis of the numbers which justify
>> why this change is a good one without every reviewer having to evaluate
>> the raw data themselves. In fact this should really be part of the
>> commit message.
>
> You mean like a nice graph, eh?
>
> I will run these patches on my 32GB box and see if I can give you
> a nice PDF/jpg.
>
>>
>> Ian.
>>
>
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 15 Nov 2012 18:29:13 +0000
> From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> To: Michael Palmeter <michael.palmeter@xxxxxxxxxx>
> Cc: Dario Faggioli <raistlin@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxx"
> <xen-devel@xxxxxxxxxxxxx>
> Subject: Re: [Xen-devel] Xen credit scheduler question
> Message-ID: <50A53479.5050901@xxxxxxxxxxxxx>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>
> On 15/11/12 15:43, Michael Palmeter wrote:
>>
>> Hi all (and Mr. Dunlap in particular),
>>
>
> Haha -- please don't call me "Mr"; I prefer "George", but if you want a
> title, use "Dr" (since I have a PhD). :-)
>
>> Example scenario:
>>
>> * Server hardware: 2 sockets, 8-cores per socket, 2 hardware threads
>> per core (total of 32 hardware threads)
>> * Test VM: a single virtual machine with a single vCPU, weight=256
>> and cap=100%
>>
>> In this scenario, from what I understand, I should be able to load the
>> Test VM with traffic to a maximum of approximately 1/32 of the
>> aggregate compute capacity of the server. The total CPU utilization
>> of the server hardware should be approximately 3.4%, plus the overhead
>> of dom0 (say 1-2). The credits available to any vCPU capped at 100%
>> should be equal to 1/32 of the aggregate compute available for the
>> whole server, correct?
>>
>
> I think to really be precise, you should say, "1/32nd of the logical cpu
> time available", where "logical cpu time" simply means, "time processing
> on one logical CPU". At the moment, that is all that either the credit1
> or credit2 schedulers look at.
>
> As I'm sure you're aware, not all "logical cpu time" is equal. If one
> thread of a hyperthread pair is running but the other idle, it will get
> significantly higher performance than if the other thread is busy. How
> much is highly unpredictable, and depends very much on exactly what
> units are shared with the other hyperthread, and the workload running on
> each unit. But even when both threads are busy, it should (in theory)
> be rare for both threads to get a throughput of 50%; the whole idea of
> HT is that threads typically get 70-80% of the full performance of the
> core (so the overall throughput is increased).
>
> But of course, while this is particularly extreme in the case of
> hyperthreads, it's also true on a smaller scale even without that --
> cores share caches, NUMA nodes share memory bandwidth, and so on. No
> attempt is made to compensate VMs for cache misses or extra memory
> latency due to sharing either. :-)
>
>> Put simply, is there a way to constrain a VM with 1 vCPU to consume no
>> more than 0.5 of a physical core (hyper-threaded) on the server
>> hardware mentioned below? Does the cap help in that respect?
>>
>
> You can use "cap" to make the VM in question get 50% of logical vcpu
> time, which on an idle system will give it 0.5 of the capacity of a
> physical core (if we don't consider Intel's Turbo Boost technology).
> But if the system becomes busy, it will get less than 0.5 of the
> processing capacity of a physical core.
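As a concrete (and hedged) example of the cap knob: the equivalent of "xl sched-credit -d <domain> -c 50", done through libxl, would look roughly like the sketch below. The function and field names are how I remember the libxl scheduling API; please double-check them against libxl.h before relying on this.

    /* Sketch, not verified against a build: cap a guest at 50% of one
     * logical CPU using the credit scheduler's per-domain cap. */
    #include <libxl.h>

    static int cap_domain_at_half_a_cpu(uint32_t domid)
    {
        libxl_ctx *ctx = NULL;
        libxl_domain_sched_params params;
        int rc;

        if ( libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, NULL) )
            return -1;

        libxl_domain_sched_params_init(&params);
        rc = libxl_domain_sched_params_get(ctx, domid, &params);
        if ( !rc )
        {
            params.cap = 50;   /* percent of one logical CPU */
            rc = libxl_domain_sched_params_set(ctx, domid, &params);
        }

        libxl_domain_sched_params_dispose(&params);
        libxl_ctx_free(ctx);
        return rc;
    }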
>
>> I have been struggling to understand how the scheduler can deal with
>> the uncertainty that hyperthreading introduces, however. I know this
>> is an issue that you are tackling in the credit2 scheduler, but I
>> would like to know what your thoughts are on this problem (if you are
>> able to share). Any insight or assistance you could offer would be
>> greatly appreciated.
>>
>
> At the moment it does not attempt to; the only thing it does is try not
> to schedule vcpus onto two hyperthreads of the same core while there is
> an idle core available. But if there are more active vcpus than cores,
> then some will share; and the ones that share a core with another vcpu
> will be charged the same as the ones that have the core all to themselves.
>
> Could you explain why your question is important to you -- i.e., what
> are you trying to accomplish? It sounds a bit like you're more
> concerned with accuracy in reporting, and control of resources, rather
> than fairness, for instance.
>
> -George
>
> ------------------------------
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
>
>
> End of Xen-devel Digest, Vol 93, Issue 157
> ******************************************
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel