[Xen-devel] Problems with merlot* AMD Opteron 6376 systems (Was Re: stable trees (was: [xen-4.2-testing test] 58584: regressions))
Adding Boris+Suravee+Aravind (AMD/SVM maintainers), Dario (NUMA) and Jim + Anthony (libvirt) to the CC.

TL;DR: osstest is exposing issues running on "AMD Opteron(tm) Processor 6376" in at least a couple of test cases. It would be good if someone from AMD could have a look.

The systems here == merlot[01], which seem to be having problems with the win7 live migration tests as well as with libvirt when starting PV guests. They each contain "AMD Opteron(tm) Processor 6376" processors with 32 threads in 4 nodes and seem to have a strange NUMA layout with no RAM on nodes 1 or 3.

The test history on these machines:
http://logs.test-lab.xenproject.org/osstest/results/host/merlot0.html
http://logs.test-lab.xenproject.org/osstest/results/host/merlot1.html

I just posted some analysis of the windows cases (including experiments on the old Cambridge test infra with "AMD Opteron(tm) Processor 6168" processors) in:
http://lists.xen.org/archives/html/xen-devel/2015-06/msg03713.html

I've also been investigating the libvirt guest-start failures. The symptom is a 10s timeout starting qemu. Anthony is seeing this with openstack too and did some analysis in
http://thread.gmane.org/gmane.comp.emulators.xen.devel/246473/focus=249172
onwards, but it may be that this is unrelated to the osstest failures and that for Anthony's scenario the 10s timeout could be explained by the openstack tempest tests starting lots of VMs in parallel.

However, for the osstests we are starting a single PV domain on an otherwise idle host. There should be no reason for qemu to take as long as 10s to come up in that case, even with a pessimal NUMA layout (IMHO at least). By comparison, on other hosts starting qemu seems to take 2-4s, so merlot is at least 2.5-5 times worse.

I tried running some adhoc tests on the old infra tied to the *-frog machines (which are the Opteron 6168 ones):
http://xenbits.xen.org/people/ianc/tmp/adhoc/37623/
http://xenbits.xen.org/people/ianc/tmp/adhoc/37625/

The -xsm failures are because I botched the flight configuration; the interesting information is that the other ones passed both times (migrate-support is expected to fail at the moment).

Supposing that the NUMA oddities might be what is exposing this issue, I tried an adhoc run on the merlot machines where I specified "dom0_max_vcpus=8 dom0_nodes=0" on the hypervisor command line:
http://logs.test-lab.xenproject.org/osstest/logs/58853/

Again, I messed up the config for the -xsm case, so ignore. The interesting thing is that the extra NUMA settings were apparently _not_ helpful.

From
http://logs.test-lab.xenproject.org/osstest/logs/58853/test-amd64-amd64-libvirt/serial-merlot0.log
I can see they were applied:

Jun 23 15:50:34.205057 (XEN) Command line: placeholder conswitch=x watchdog com1=115200,8n1 console=com1,vga gdb=com1 dom0_mem=512M,max:512M ucode=scan dom0_max_vcpus=8 dom0_nodes=0
[...]
Jun 23 15:50:38.309057 (XEN) Dom0 has maximum 8 VCPUs

The memory info

Jun 23 15:56:27.749008 (XEN) Memory location of each domain:
Jun 23 15:56:27.756965 (XEN) Domain 0 (total: 131072):
Jun 23 15:56:27.756983 (XEN)     Node 0: 126905
Jun 23 15:56:27.756998 (XEN)     Node 1: 0
Jun 23 15:56:27.764952 (XEN)     Node 2: 4167
Jun 23 15:56:27.764969 (XEN)     Node 3: 0

suggests at least a small amount of cross-node memory allocation (16M out of dom0's 512M total). That's probably small enough to be OK.
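As an aside, for anyone wanting to poke at the merlot machines directly, the dumps quoted above and below can be gathered from dom0 with the standard xl tooling. A minimal sketch (assuming a hypervisor with the usual debug keys available and xl run as root in dom0):

    # Host NUMA topology: per-node memory sizes and the CPU<->node
    # mapping, which is where the empty nodes 1 and 3 show up.
    xl info -n

    # 'u' is the "dump NUMA info" debug key; it writes the per-domain
    # memory placement ("Memory location of each domain") into the Xen
    # console ring, which xl dmesg then reads back.
    xl debug-keys u
    xl dmesg | tail -n 40

    # 'q' dumps domain and VCPU state, i.e. the "VCPU information and
    # callbacks" output quoted below; xl vcpu-list 0 gives a more
    # compact view of dom0's VCPU placement and affinities.
    xl debug-keys q
    xl dmesg | tail -n 80
    xl vcpu-list 0
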
And it seems as if the 8 dom0 vcpus are correctly pinned to the first 8 cpus (the ones in node 0):

Jun 23 15:56:43.797055 (XEN) VCPU information and callbacks for domain 0:
Jun 23 15:56:43.797110 (XEN)     VCPU0: CPU4 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={4}
Jun 23 15:56:43.805078 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.813121 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.813157 (XEN)     No periodic timer
Jun 23 15:56:43.821050 (XEN)     VCPU1: CPU3 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={3}
Jun 23 15:56:43.829044 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.829082 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.837051 (XEN)     No periodic timer
Jun 23 15:56:43.837084 (XEN)     VCPU2: CPU5 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={5}
Jun 23 15:56:43.845102 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.853035 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.853071 (XEN)     No periodic timer
Jun 23 15:56:43.853099 (XEN)     VCPU3: CPU7 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={7}
Jun 23 15:56:43.861102 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.869110 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.869145 (XEN)     No periodic timer
Jun 23 15:56:43.877014 (XEN)     VCPU4: CPU0 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={}
Jun 23 15:56:43.877038 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.885053 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.885088 (XEN)     No periodic timer
Jun 23 15:56:43.893085 (XEN)     VCPU5: CPU0 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={}
Jun 23 15:56:43.901075 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.901134 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.909010 (XEN)     No periodic timer
Jun 23 15:56:43.909048 (XEN)     VCPU6: CPU2 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={2}
Jun 23 15:56:43.917065 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.925055 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.925074 (XEN)     No periodic timer
Jun 23 15:56:43.925095 (XEN)     VCPU7: CPU6 [has=F] poll=0 upcall_pend=00 upcall_mask=00 dirty_cpus={6}
Jun 23 15:56:43.933119 (XEN)     cpu_hard_affinity={0-7} cpu_soft_affinity={0-7}
Jun 23 15:56:43.941080 (XEN)     pause_count=0 pause_flags=1
Jun 23 15:56:43.941129 (XEN)     No periodic timer

So whatever the issue is, it doesn't seem to be particularly related to the strange NUMA layout.

Ian.