Re: [PATCH] Windows PV drivers fail to set up RSS when vCPUs > 8
On 21/03/2022 17:42, Edwin Torok wrote:
> On 21 Mar 2022, at 16:06, Martin Harvey <martin.harvey@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: win-pv-devel <win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx> On Behalf Of Durrant, Paul
>> Sent: 19 March 2022 17:39
>> To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] Windows PV drivers fail to set up RSS when vCPUs > 8
>>
>>> I think we need a bit more in the commit comment. What is the nature of
>>> the failure... and does XENNET advertise more than 8 queues, so will the
>>> situation ever arise? Linux certainly tops out at 8 queues.
>>
>> I don't believe that XenNet ever advertises more than 8 queues, but that's
>> not quite the same as supporting more than 8 vCPUs.
>>
>> Perhaps something in the comment like:
>>
>> "Mapping between queues and vCPUs fails for more than 8 vCPUs because the
>> base of the indirection is always considered to be zero, and the mapping
>> is always performed on a direct vCPU number basis"
>>
>> Would that adequately summarise the problem? (It's not easy to explain
>> succinctly in English!)
>
> Here is another attempt at explaining it, perhaps you could put your one
> sentence in the commit title, and the one below in the commit body?
>
> The driver supports at most 8 queues; however, Windows can decide to assign
> vCPU numbers starting from a non-zero offset. E.g. vCPUs 8,9,10,11 could get
> assigned to a device if you have more than one NIC. The total number of
> vCPUs used by a single device is still at most 8, but the vCPU indexes
> themselves can be 8 or greater.
> The code previously incorrectly assumed that individual vCPU indexes are
> always below 8; however, a 1:1 mapping between vCPU indexes and queues seems
> to exist only when using a single NIC.
>
> [Full example below]

Ok, I understand now. So what we want is a more flexible queue -> CPU mapping.
I'll apply the patch locally to my dev branch and take a closer look.

Thanks,

  Paul

MH.

This summary helpfully provided by Edwin Torok.

On a VM with >8 vCPUs RSS might not work because the driver fails to set up the indirection table.

This causes the VM to only be able to reach 12.4Gbit/s with 'iperf3 -P 8', instead of 16-18Gbit/s with a working RSS setup.

This can be easily reproduced if you give a VM 32 vCPUs and create 3 network interfaces. Windows will assign 0-3 to one network interface, 4-7 to the next, and will try 8-11 I think for the next, but the driver rejects that:

PS C:\Program Files\CItrix\XenTools\Diagnostics> Get-NetAdapterRSS

    Name                                            : Ethernet 5
    InterfaceDescription                            : XenServer PV Network Device #2
    Enabled                                         : True
    NumberOfReceiveQueues                           : 8
    Profile                                         : NUMAStatic
    BaseProcessor: [Group:Number]                   : 0:0
    MaxProcessor: [Group:Number]                    : 0:31
    MaxProcessors                                   : 4
    RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0 0:1/0 0:2/0 0:3/0 0:4/0 0:5/0 0:6/0 0:7/0
                                                      0:8/0 0:9/0 0:10/0 0:11/0 0:12/0 0:13/0 0:14/0 0:15/0
                                                      0:16/0 0:17/0 0:18/0 0:19/0 0:20/0 0:21/0 0:22/0 0:23/0
                                                      0:24/0 0:25/0 0:26/0 0:27/0 0:28/0 0:29/0 0:30/0 0:31/0
    IndirectionTable: [Group:Number]                :

    Name                                            : Ethernet 4
    InterfaceDescription                            : XenServer PV Network Device #1
    Enabled                                         : True
    NumberOfReceiveQueues                           : 8
    Profile                                         : NUMAStatic
    BaseProcessor: [Group:Number]                   : 0:0
    MaxProcessor: [Group:Number]                    : 0:31
    MaxProcessors                                   : 4
    RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0 0:1/0 0:2/0 0:3/0 0:4/0 0:5/0 0:6/0 0:7/0
                                                      0:8/0 0:9/0 0:10/0 0:11/0 0:12/0 0:13/0 0:14/0 0:15/0
                                                      0:16/0 0:17/0 0:18/0 0:19/0 0:20/0 0:21/0 0:22/0 0:23/0
                                                      0:24/0 0:25/0 0:26/0 0:27/0 0:28/0 0:29/0 0:30/0 0:31/0
    IndirectionTable: [Group:Number]                : 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7
                                                      0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7
                                                      0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7
                                                      0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7
                                                      0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7
                                                      0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7
                                                      0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7
                                                      0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7 0:4 0:5 0:6 0:7

    Name                                            : Ethernet 3
    InterfaceDescription                            : XenServer PV Network Device #0
    Enabled                                         : True
    NumberOfReceiveQueues                           : 8
    Profile                                         : NUMAStatic
    BaseProcessor: [Group:Number]                   : 0:0
    MaxProcessor: [Group:Number]                    : 0:31
    MaxProcessors                                   : 4
    RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0 0:1/0 0:2/0 0:3/0 0:4/0 0:5/0 0:6/0 0:7/0
                                                      0:8/0 0:9/0 0:10/0 0:11/0 0:12/0 0:13/0 0:14/0 0:15/0
                                                      0:16/0 0:17/0 0:18/0 0:19/0 0:20/0 0:21/0 0:22/0 0:23/0
                                                      0:24/0 0:25/0 0:26/0 0:27/0 0:28/0 0:29/0 0:30/0 0:31/0
    IndirectionTable: [Group:Number]                : 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3
                                                      0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3
                                                      0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3
                                                      0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3
                                                      0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3
                                                      0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3
                                                      0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3
                                                      0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3 0:0 0:1 0:2 0:3

There is a built-in hardcoded limit of 8 queues, which is fine, but that should be completely unrelated to CPU numbers! (The total number of CPUs assigned to a NIC should be <=8, sure.)

Potential code causing the issue in xenvif receiver.c:

    for (Index = 0; Index < Size; Index++) {
        QueueMapping[Index] = KeGetProcessorIndexFromNumber(&ProcessorMapping[Index]);

        if (QueueMapping[Index] >= NumQueues)
            goto fail2;
    }

(There is also a problem that the code assumes that the group number is always 0. For now that is true, but it might change if we implement vNUMA in the future.)

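
To make the distinction between CPU numbers and queue numbers concrete, here is a minimal user-space sketch in C. It is not the patch under discussion and not xenvif code: the function names, the MAX_PROCESSORS bound and the "first appearance" assignment policy are all assumptions made for illustration. It contrasts the check quoted above (processor index compared directly against the queue count) with a mapping that only requires the number of distinct processors to fit within the queue count.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES     8    /* queue limit mentioned in the thread */
#define MAX_PROCESSORS 64   /* hypothetical bound, for this sketch only */
#define NO_QUEUE       (-1)

/* Mirrors the quoted check: fails as soon as any processor index is
 * >= num_queues, even if only a few distinct processors are used. */
int direct_index_check(const unsigned *proc_index, size_t size, int num_queues)
{
    size_t i;

    for (i = 0; i < size; i++)
        if (proc_index[i] >= (unsigned)num_queues)
            return -1;
    return 0;
}

/* Hypothetical alternative: give each distinct processor the next free
 * queue, in order of first appearance in the indirection table.
 * proc_index[i] stands in for what KeGetProcessorIndexFromNumber would
 * return for table entry i. Returns the number of queues used, or -1 if
 * an index is out of range or more than num_queues distinct processors
 * appear. */
int build_queue_mapping(const unsigned *proc_index, size_t size,
                        int num_queues, int *queue_mapping)
{
    int queue_of_proc[MAX_PROCESSORS];
    int used = 0;
    size_t i;

    assert(proc_index != NULL && queue_mapping != NULL);

    if (num_queues <= 0 || num_queues > MAX_QUEUES)
        return -1;

    for (i = 0; i < MAX_PROCESSORS; i++)
        queue_of_proc[i] = NO_QUEUE;

    for (i = 0; i < size; i++) {
        unsigned p = proc_index[i];

        if (p >= MAX_PROCESSORS)
            return -1;

        if (queue_of_proc[p] == NO_QUEUE) {
            if (used >= num_queues)
                return -1;              /* too many distinct processors */
            queue_of_proc[p] = used++;
        }
        queue_mapping[i] = queue_of_proc[p];
    }
    return used;
}
```

With a table that cycles through processors 8, 9, 10, 11 (the third NIC in the example above), direct_index_check rejects the table, while build_queue_mapping assigns queues 0 to 3. A real driver would also have to keep each queue's interrupt and DPC affinity consistent with the chosen processors, which this sketch does not attempt.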
Jun 2 09:54:57 prost qemu-dm-41[30818]: 30818@1622627697.450233:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:54:57 prost qemu-dm-41[30818]: 30818@1622627697.450320:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 09:54:57 prost qemu-dm-41[30818]: 30818@1622627697.452097:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:54:57 prost qemu-dm-41[30818]: 30818@1622627697.452180:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 09:56:14 prost qemu-dm-41[30818]: 30818@1622627774.374713:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:56:14 prost qemu-dm-41[30818]: 30818@1622627774.374798:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 09:56:14 prost qemu-dm-41[30818]: 30818@1622627774.377121:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:56:14 prost qemu-dm-41[30818]: 30818@1622627774.377203:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 09:59:00 prost qemu-dm-42[1106]: 1106@1622627940.672941:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:59:00 prost qemu-dm-42[1106]: 1106@1622627940.673058:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 09:59:00 prost qemu-dm-42[1106]: 1106@1622627940.675891:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:59:00 prost qemu-dm-42[1106]: 1106@1622627940.675993:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 09:59:39 prost qemu-dm-43[4074]: 4074@1622627979.363892:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:59:39 prost qemu-dm-43[4074]: 4074@1622627979.364008:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 09:59:39 prost qemu-dm-43[4074]: 4074@1622627979.365861:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 09:59:39 prost qemu-dm-43[4074]: 4074@1622627979.365949:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:04:41 prost qemu-dm-45[9705]: 9705@1622628281.935871:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:04:41 prost qemu-dm-45[9705]: 9705@1622628281.935965:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:04:41 prost qemu-dm-45[9705]: 9705@1622628281.937849:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:04:41 prost qemu-dm-45[9705]: 9705@1622628281.937918:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:05:00 prost qemu-dm-46[11484]: 11484@1622628300.973487:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:05:00 prost qemu-dm-46[11484]: 11484@1622628300.973588:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:05:00 prost qemu-dm-46[11484]: 11484@1622628300.976554:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:05:00 prost qemu-dm-46[11484]: 11484@1622628300.976650:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:22:54 prost qemu-dm-49[21901]: 21901@1622629374.720769:xen_platform_log xen platform: xenvif|PdoGetInterfaceGuid: fail1 (c0000034)
Jun 2 10:22:55 prost qemu-dm-49[21901]: 21901@1622629375.194122:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:22:55 prost qemu-dm-49[21901]: 21901@1622629375.194231:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:22:55 prost qemu-dm-49[21901]: 21901@1622629375.196726:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:22:55 prost qemu-dm-49[21901]: 21901@1622629375.196825:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:23:38 prost qemu-dm-50[24509]: 24509@1622629418.530046:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:23:38 prost qemu-dm-50[24509]: 24509@1622629418.530115:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:23:38 prost qemu-dm-50[24509]: 24509@1622629418.531811:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:23:38 prost qemu-dm-50[24509]: 24509@1622629418.531888:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:30:28 prost qemu-dm-51[28530]: 28530@1622629828.510968:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:30:28 prost qemu-dm-51[28530]: 28530@1622629828.511050:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:30:28 prost qemu-dm-51[28530]: 28530@1622629828.513570:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:30:28 prost qemu-dm-51[28530]: 28530@1622629828.513691:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:45:43 prost qemu-dm-52[2889]: 2889@1622630743.573791:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:45:43 prost qemu-dm-52[2889]: 2889@1622630743.573904:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)
Jun 2 10:45:43 prost qemu-dm-52[2889]: 2889@1622630743.576188:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail2
Jun 2 10:45:43 prost qemu-dm-52[2889]: 2889@1622630743.576298:xen_platform_log xen platform: xenvif|ReceiverUpdateHashMapping: fail1 (c000000d)

I tested with both Win10 and Windows Server 2016, with various CPU topologies (e.g. 12 vCPUs, all on one socket shows the same issue once Windows starts assigning CPUs > 8).

A workaround is to set MaxProcessorNumber to 7, though obviously this will limit scalability since vCPUs > 8 won't be used even if you have multiple VIFs: