
Re: Limitations for Running Xen on KVM Arm64


  • To: Julien Grall <julien@xxxxxxx>, Mohamed Mediouni <mohamed@xxxxxxxxxxxxxxxx>
  • From: "haseeb.ashraf@xxxxxxxxxxx" <haseeb.ashraf@xxxxxxxxxxx>
  • Date: Mon, 3 Nov 2025 13:09:59 +0000
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "Volodymyr_Babchuk@xxxxxxxx" <Volodymyr_Babchuk@xxxxxxxx>, "Driscoll, Dan" <dan.driscoll@xxxxxxxxxxx>, "Bachtel, Andrew" <andrew.bachtel@xxxxxxxxxxx>, "fahad.arslan@xxxxxxxxxxx" <fahad.arslan@xxxxxxxxxxx>, "noor.ahsan@xxxxxxxxxxx" <noor.ahsan@xxxxxxxxxxx>, "brian.sheppard@xxxxxxxxxxx" <brian.sheppard@xxxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>
  • Delivery-date: Mon, 03 Nov 2025 13:10:17 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: Limitations for Running Xen on KVM Arm64

Hi,

> To clarify, Xen is using the local TLB version. So it should be vmalls12e1.
If I understood correctly, won't HCR_EL2.FB make a local TLB invalidation a broadcast one anyway?

Mohamed mentioned this in an earlier email:
> If a core-local TLB invalidate was issued, this bit forces it to become a 
> broadcast, so that you don’t have to worry about flushing TLBs when moving a 
> vCPU between different pCPUs. KVM operates with this bit set.

Can you explain in exactly what scenario we could use vmalle1?
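
To make sure we are talking about the same operations, this is roughly how I picture the two instructions (just a sketch with made-up helper names and simplified barriers, not actual Xen code):

static inline void flush_guest_stage1_tlb_local(void)
{
    /* Stage-1-only invalidate for the current VMID; local to this PE,
     * but (as quoted above) forced to a broadcast when the host has
     * HCR_EL2.FB set. */
    asm volatile("dsb nshst; tlbi vmalle1; dsb nsh; isb" ::: "memory");
}

static inline void flush_guest_stage12_tlb_local(void)
{
    /* Stage-1 and stage-2 invalidate for the current VMID; the local
     * form Xen is said to use today in the discussion above. */
    asm volatile("dsb nshst; tlbi vmalls12e1; dsb nsh; isb" ::: "memory");
}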

> Before going into batching, do you have any data showing how often 
> XENMEM_remove_from_physmap is called in your setup? Similar, I would be 
> interested to know the number of TLBs flush within one hypercalls and whether 
> the regions unmapped were contiguous.
The number of times XENMEM_remove_from_physmap is invoked depends on the size of 
each binary, and each hypercall issues the TLB invalidation instruction once. If 
I use a persistent rootfs, the hypercall is invoked roughly 7458 times (+8 
approx.), which equals the sum of the kernel and DTB image pages:
domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x48000000 -> 0x4800188d  (pfn 0x48000 + 0x2 pages)

And if I use a ramdisk image, the hypercall is invoked roughly 222815 times 
(+8 approx.), which equals the sum of the kernel, ramdisk and DTB image 4k pages:
domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
domainbuilder: detail: xc_dom_alloc_segment:   module0      : 0x48000000 -> 0x7c93d000  (pfn 0x48000 + 0x3493d pages)
domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x7c93d000 -> 0x7c93e8d9  (pfn 0x7c93d + 0x2 pages)

As you can see from the address ranges in the logs above, the addresses are 
contiguous within this address space, so at best we could reduce the number of 
flushes to 3, one at the end of each image as it is removed from the physmap.
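
Something along these lines on the unmap path is what I have in mind (purely 
illustrative; unmap_one_gfn() and flush_guest_tlb_range() are made-up 
placeholders, not existing Xen interfaces):

static void remove_segment_from_physmap(struct domain *d,
                                        unsigned long first_gfn,
                                        unsigned long nr_pages)
{
    unsigned long i;

    /* Tear the mappings down without flushing after every page. */
    for ( i = 0; i < nr_pages; i++ )
        unmap_one_gfn(d, first_gfn + i);

    /* A single flush at the end of the contiguous segment. */
    flush_guest_tlb_range(d, first_gfn, nr_pages);
}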

> we may still send a few TLBs because:
> * We need to avoid long-running operations, so the hypercall may restart. So 
> we will have to flush at mininum before every restart
> * The current way we handle batching is we will process one item at the time. 
> As this may free memory (either leaf or intermediate page-tables), we will 
> need to flush the TLBs first to prevent the domain accessing the wrong 
> memory. This could be solved by keeping track of the list of memory to free. 
> But this is going to require some work and I am not entirely sure this is 
> worth it at the moment.
I think these figures show that 222815 TLB flushes are far too many, and even a 
few flushes per hypercall would be a lot better; fewer than 10 would hardly be 
noticeable.
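
The deferred-free idea you describe could look roughly like the snippet below 
(only a sketch; p2m_remove_next_page() and the linkage field are made up, and 
locking/preemption are ignored):

    struct page_info *pg, *to_free = NULL;

    /* Unlink pages from the p2m and queue them instead of flushing and
     * freeing one page at a time. */
    while ( (pg = p2m_remove_next_page(d)) != NULL )
    {
        pg->deferred_next = to_free;
        to_free = pg;
    }

    /* One flush covers every entry removed above ... */
    flush_guest_tlb();

    /* ... and only then is it safe to hand the memory back. */
    while ( to_free )
    {
        pg = to_free;
        to_free = pg->deferred_next;
        free_domheap_page(pg);
    }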

> We could use a series of TLBI IPAS2E1IS which I think is what TLBI range is 
> meant to replace (so long the addresses are contiguous in the given space).
Isn't IPAS2E1IS a range TLBI instruction? My understanding is that this 
instruction is only available on processors with range TLBI support, though I 
could be wrong. I saw its KVM emulation, which falls back to a full invalidation 
if range TLBI is not supported 
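
If I read the suggestion correctly, the non-range fallback would be a per-IPA 
loop like the sketch below (operand encoding reduced to a plain shift, barriers 
abbreviated), and it is exactly this loop that becomes expensive when every 
iteration traps to KVM:

static void flush_s2_range_by_ipa(unsigned long ipa, unsigned long nr_pages)
{
    unsigned long end = ipa + nr_pages * PAGE_SIZE;

    for ( ; ipa < end; ipa += PAGE_SIZE )
        /* One broadcast stage-2 invalidate per 4k page. */
        asm volatile("tlbi ipas2e1is, %0" :: "r" (ipa >> PAGE_SHIFT) : "memory");

    asm volatile("dsb ish; isb" ::: "memory");
}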

> On the KVM side, it would be worth looking at whether the implementation can 
> be optimized. Is this really walking block by block? Can it skip over large 
> hole (e.g. if we know a level 1 entry doesn't exist, then we can increment by 
> 1GB).
Yes, this should also be looked at from the KVM side. I think solving this 
problem needs optimization in both places, Xen and KVM: Xen is issuing this 
instruction too many times, and unless KVM can provide performance close to a 
bare-metal TLBI, it will still be a problem.
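
On the walker side, the kind of skip being suggested could be as simple as 
something like this (illustrative pseudo-code only, not the actual kvm_pgtable 
walker):

#define SZ_1G (1UL << 30)

static u64 next_ipa(u64 ipa, u64 level1_desc)
{
	/* Level-1 entry not valid: nothing is mapped in this 1GB block,
	 * so jump straight to the next 1GB boundary. */
	if (!(level1_desc & 1))
		return (ipa + SZ_1G) & ~(SZ_1G - 1);

	/* Otherwise keep walking at page granularity. */
	return ipa + PAGE_SIZE;
}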

Regards,
Haseeb


 

