Re: [Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
On 02/05/12 17:30, Dario Faggioli wrote:

Ah right -- yeah, probably since b_info is a libxl structure, you shouldn't add it in there. But in that case you should probably add another xl-specific structure and pass it through, rather than having global variables, I think. It's only used in the handful of placement functions, right?

+/* Store the policy for the domain while parsing */
+static int nodes_policy = NODES_POLICY_DEFAULT;
+
+/* Store the number of nodes to be used while parsing */
+static int num_nodes_policy = 0;

Why are "nodes_policy" and "num_nodes_policy" not passed in along with b_info?

That was my first implementation. Then I figured out that I want to do the placement in _xl_, not in _libxl_, so I really don't need to muck up build info with placement related stuff. Should I use b_info anyway, even if I don't need these fields while in libxl?

Right. This is always a bit tricky, balancing your own taste for how to do things against following the style of the code that you're modifying.

Sounds definitely nicer. I just did it like that because I found a very similar example in xl itself, but I'm open to changing this to whatever you and the libxl maintainers reach a consensus on. :-)

I had in mind no constraints at all on the ratios -- basically, if you can find N nodes such that the sum of free memory is enough to create the VM, even 99%/1%, then go for that rather than looking for N+1. Obviously finding a more balanced option would be better. One option would be to scan through, finding all sets of N nodes that satisfy the criteria, and then choose the most "balanced" one. That might be more than we need for 4.2, so another option would be to look for evenly balanced nodes first, then, if we don't find a set, look for any set. (That certainly fits with the "first fit" description!)

Also, is it really necessary for a VM to have an equal amount of memory on every node?
It seems like it would be better to have 75% on one node and 25% on a second node than to have 25% on four nodes, for example. Furthermore, insisting on an even amount fragments the node memory further -- i.e., if we chose to have 25% on four nodes instead of 75% on one and 25% on another, that would make it harder for another VM to fit on a single node as well.

Ok, that is something quite important to discuss. What you propose makes a lot of sense, although some issues come to my mind:

- which percentage should I try, and in what order? I mean, 75%/25% sounds reasonable, but maybe 80%/20% or even 60%/40% also helps your point.

Haha -- yeah, for a research paper, you'd probably implement some kind of lottery scheduling algorithm that would schedule it on one node 75% of the time and on another node 25% of the time. :-) But I think that just making the node affinity equal on both of them will be good enough for now. There will be some variability in performance, but there will be some of that anyway, depending on which node's memory the guest happens to use more.

- suppose I go for 75%/25%, what about the scheduling of the VM?

This is actually kind of a different issue, but I'll bring it up now because it's related. (Something to toss in for thinking about in 4.3, really.) Suppose there are 4 cores and 16GiB per node, and a VM has 8 vcpus and 8GiB of RAM. The algorithm you have here will attempt to put 4GiB on each of two nodes (since it will take 2 nodes to get 8 cores). However, it's pretty common for larger VMs to have way more vcpus than they actually use at any one time. So it might actually have better performance to put all 8GiB on one node and set the node affinity accordingly. In the rare event that more than 4 vcpus are active, a handful of vcpus will have all remote accesses, but the majority of the time, all of the cpus will have local accesses. (OTOH, maybe that should only be a policy thing that we recommend in the documentation...)
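The "find the sets of N nodes that fit and pick the most balanced one" idea from the discussion above can be sketched for the two-node case roughly as follows. This is illustrative only, not code from the patch series: the function name and the flat free_mem[] array are made up for the example.

```c
#include <stdint.h>

/* Illustrative sketch only (not code from this patch series): among all
 * 2-node sets whose combined free memory can host the guest, prefer the
 * set where the memory split would be most even.  free_mem[] holds the
 * free memory per node; returns 1 and fills *n1/*n2 on success. */
static int most_balanced_pair(const uint64_t *free_mem, int nr_nodes,
                              uint64_t need, int *n1, int *n2)
{
    uint64_t best_diff = UINT64_MAX;
    int i, j, found = 0;

    for (i = 0; i < nr_nodes; i++) {
        for (j = i + 1; j < nr_nodes; j++) {
            uint64_t diff;

            if (free_mem[i] + free_mem[j] < need)
                continue;       /* this pair cannot host the guest */
            diff = free_mem[i] > free_mem[j] ? free_mem[i] - free_mem[j]
                                             : free_mem[j] - free_mem[i];
            if (diff < best_diff) {
                best_diff = diff;
                *n1 = i;
                *n2 = j;
                found = 1;
            }
        }
    }
    return found;
}
```

The "more than we need for 4.2" caveat applies: this is an O(nodes^2) scan per candidate set size, which is cheap for the node counts involved, but generalizing it to arbitrary N-node sets is where the complexity starts to bite.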
Please don't get me wrong, I see your point and really think it makes sense. I've actually thought along the same lines for a while, but then I couldn't find answers to the questions above. That's why I kind of fell back to Xen's default "striped" approach (although on as few nodes as possible, which is _much_ better than Xen's current default!). It looked simple enough to write, read and understand, while still providing statistically consistent performance.

Dude, this is open source. Be opinionated. ;-) What do you think of my suggestions above?

I think if the user specifies a nodemap, and that nodemap doesn't have enough memory, we should throw an error.

Hmm -- if I'm reading this right, the only time the nodemap won't be all nodes is if (1) the user specified nodes, or (2) there's a cpumask in effect. If we're going to override that setting, wouldn't it make sense to just expand to all NUMA nodes?

As you wish; the whole "what to do if what I've been provided with doesn't work" question is in *wild guess* status, meaning I tried to figure out what would be best to do, but I might well be far from the actual correct solution, provided there is one. Trying to enlarge the nodemap step by step potentially yields better performance, but is probably not so close to the "least surprise" principle one should follow when designing UIs. :-(

Hmm -- though I suppose what you'd really want to try is adding each node in turn, rather than one at a time (i.e., if the cpus are pinned to nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before trying [1,2,3,4]).

Yep, that makes a lot of sense, thanks! I can definitely try doing that, although it will complicate the code a bit...

But that's starting to get really complicated -- I wonder if it's better to just fail and let the user change the pinning / node mapping in the configuration.

Well, that would probably be the least surprising behaviour.
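The "try [1,2,3] and [2,3,4] before [1,2,3,4]" idea amounts to checking every map with exactly one extra node before widening further. A rough sketch of that first widening step (hypothetical code, 0-indexed node bitmasks, with a simple free-memory sum standing in for the real fitting test used by the patch):

```c
#include <stdint.h>

/* Illustrative sketch (hypothetical, 0-indexed node bitmasks): starting
 * from the nodemap the user asked for, first check whether it already
 * holds enough free memory; if not, try every map with exactly one
 * extra node before giving up (the caller could then widen further,
 * or just fail as discussed in the thread). */
static unsigned int try_one_extra_node(unsigned int base, int nr_nodes,
                                       const uint64_t *free_mem,
                                       uint64_t need)
{
    uint64_t total = 0;
    int n;

    for (n = 0; n < nr_nodes; n++)
        if (base & (1u << n))
            total += free_mem[n];

    if (total >= need)
        return base;                  /* the user's map already fits */

    for (n = 0; n < nr_nodes; n++) {
        if (base & (1u << n))
            continue;                 /* already in the map */
        if (total + free_mem[n] >= need)
            return base | (1u << n);  /* one extra node is enough */
    }

    return 0;                         /* caller: widen more, or fail */
}
```

Which of the candidate single-node extensions to prefer (nearest node? most free memory?) is exactly the open heuristic question from the thread; this sketch just takes the first that fits.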
Again, just let me know which you think is best among the various alternatives and I'll go for it.

If there's a node_affinity set, but no memory on that node and memory on a *different* node, what will Xen do? It will allocate memory on some other node, right? So ATM, even if you specify a cpumask, you'll get memory on the masked nodes first, and then memory elsewhere (probably in a fairly random manner); but as much of the memory as possible will be on the masked nodes. I wonder then if we shouldn't just keep that behavior -- i.e., if there's a cpumask specified, just return the nodemask from that mask, and let Xen put as much as possible on those nodes and let the rest fall where it may. What do you think?

Sorry, wrong above -- I meant the other comment, about __add_nodes_to_nodemap(). :-)

+ if (use_cpus >= b_info->max_vcpus) {
+     rc = 0;
+     break;
+ }

Hmm -- there's got to be a better way to find out the minimum number of nodes to house a given number of vcpus than just starting at 1 and re-trying until we have enough.

+ /* Add one more node and retry fitting the domain */
+ __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);

Same comment as above.

I'm not sure I'm getting this. The whole point here is: let's consider free memory on the various nodes first, and then adjust the result if some other constraints are being violated.

No, that's not exactly what I meant. Suppose there are 4 cores per node, and a VM has 16 vcpus, and NUMA is just set to auto, with no other parameters. If I'm reading your code right, what it will do is first try to find a set of 1 node that will satisfy the constraints, then 2 nodes, then 3 nodes, then 4, &c. Since there are at most 4 cores per node, we know that 1, 2, and 3 nodes are going to fail, regardless of how much memory there is or how many cpus are offline. So why not just start with 4, if the user hasn't specified anything?
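That "start with 4" lower bound doesn't need to be discovered by retrying; it can be computed up front. A sketch, under the simplifying assumption of homogeneous nodes (the function name is made up, not from the patch):

```c
/* Illustrative sketch (assumes homogeneous nodes; the name is made up):
 * the minimum number of nodes needed to provide enough pcpus for a
 * guest's vcpus is just a ceiling division -- no need to start the
 * search from one node and retry. */
static int min_nodes_for_vcpus(int max_vcpus, int cores_per_node)
{
    if (cores_per_node <= 0 || max_vcpus <= 0)
        return 1;
    return (max_vcpus + cores_per_node - 1) / cores_per_node;
}
```

With 4 cores per node and 16 vcpus this yields 4, so the search for a fitting set of nodes can begin there.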
Then if 4 doesn't work (either because there's not enough memory, or some of the cpus are offline), we can start bumping it up to 5, 6, &c.

However, if what you mean is that I could check beforehand whether or not the user-provided configuration will give us enough CPUs, and avoid testing scenarios that are guaranteed to fail, then I agree, and I'll reshape the code to look like that. This triggers the heuristics re-designing stuff from above again, as one has to decide what to do if the user asks for "nodes=[1,3]" and I discover (earlier) that I need one more node to have enough CPUs (I mean, which node should I try first?).

That's what I was getting at -- but again, if it makes things too complicated, trading a few extra passes for a significant chunk of your debugging time is OK. :-)

Well, if the user didn't specify anything, then we can't contradict anything he specified, right? :-)

If the user doesn't specify anything, and the default is "numa=auto", then I think we're free to do whatever we think is best regarding NUMA placement; in fact, I think we should try to avoid failing VM creation if it's at all possible. I just meant what I think we should do if the user asked for specific NUMA nodes or a specific number of nodes. (I think that cpu masks should probably behave as they do now -- set the numa_affinity, but don't fail domain creation if there's not enough memory on those nodes.)

So, I'm not entirely sure I answered your question, but the point is that your idea above is the best one: if you ask for something and we don't manage to get it done, we just stop and let you figure things out. I've only one question about this approach: what if automatic placement is, or becomes, the default? I mean, avoiding any kind of fallback (which, again, makes sense to me when the user is explicitly asking for something specific) would mean a completely NUMA-unaware VM creation could be aborted even though the user did not say anything... How do we deal with this?
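On the cpumask point from earlier in the thread (derive the node affinity from the pinning mask and let the rest of the memory fall where it may): the nodemask implied by a cpumask is cheap to compute. A sketch, illustrative only, with cpu_to_node[] standing in for the real topology information rather than any actual libxl interface:

```c
/* Illustrative sketch (not libxl code): derive the nodemask implied by
 * a vcpu pinning mask, given a per-cpu node lookup table.
 * cpu_to_node[] is a stand-in for the real topology information. */
static unsigned int nodemask_from_cpumask(unsigned long cpumask,
                                          const int *cpu_to_node,
                                          int nr_cpus)
{
    unsigned int nodemask = 0;
    int cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (cpumask & (1ul << cpu))
            nodemask |= 1u << cpu_to_node[cpu];

    return nodemask;
}
```

Returning this mask and letting Xen place as much memory as possible on those nodes would preserve the current "best effort" behaviour for pinned guests, as suggested above.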
It seems like we have a number of issues here that would be good for more people to come in on -- what if I attempt to summarize the high-level decisions we're talking about, so that it's easier for more people to comment on them?

 -George

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
...

This should be in its own patch.

Ok. Thanks a lot again for taking a look!

Regards,
Dario

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel