[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Issue with shared information page on Xen/ARM 4.17



Hi Roger,

On 04/10/2023 13:53, Roger Pau Monné wrote:
On Wed, Oct 04, 2023 at 11:55:05AM +0100, Julien Grall wrote:
Hi Roger,

On 04/10/2023 09:13, Roger Pau Monné wrote:
On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
the changes with the handling of the shared information page appear to
have broken things for me.

With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
of the 4.17 release, mapping the shared information page doesn't work.

This is due to 71320946d5edf AFAICT.

Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
This seems a fairly reasonable change, so I had no intention of asking
for a revert (which likely would have been rejected).  There is also a
real possibility the -EBUSY comes from elsewhere.  Could also be
71320946d5edf caused a bug elsewhere to be exposed.

A good way to know would be to attempt to revert 71320946d5edf and see
if that fixes your issue.

Alternatively you can try (or similar):

diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 6ccffeaea57d..105ef3faecfd 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
                   page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
           }
           else
+        {
+            printk("%u already mapped\n", space);
               /*
                * Mandate the caller to first unmap the page before mapping it
                * again. This is to prevent Xen creating an unwanted hole in
@@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
                * to unmap it afterwards.
                */
               rc = -EBUSY;
+        }
           p2m_write_unlock(p2m);
       }

I'm using Tianocore as the first stage loader.  This continues to work
fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
While Tianocore does map the shared information page, my reading of their
source is that it properly unmaps the page and therefore shouldn't cause
trouble.

Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
the recommended address range, but my understanding is this is supposed
to be okay.

The return code is -16, which is EBUSY.

Ideas?

I think the issue is that you are mapping the shared info page over a
guest RAM page, and in order to do that you would fist need to create
a hole and then map the shared info page.  IOW: the issue is not with
edk2 not having unmapped the page, but with FreeBSD trying to map the
shared_info over a RAM page instead of a hole in the p2m.  x86
behavior is different here, and does allow mapping the shared_info
page over a RAM gfn (by first removing the backing RAM page on the
gfn).

An interesting thought.  I thought I'd tried this, but since I didn't see
such in my experiments list.  What I had tried was removing all the pages
in the suggested mapping range.  Yet this failed.

Yeah, I went too fast and didn't read the code correctly, it is not
checking that the provided gfn is already populated, but whether the
mfn intended to be mapped is already mapped at a different location.

Since this seemed reasonable, I've now tried and found it fails.  The
XENMEM_remove_from_physmap call returns 0.

XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
like edk2 hasn't unmapped the shared_info page.  The OS has no idea
at which position the shared_info page is currently mapped, and hence
can't do anything to attempt to unmap it in order to cover up for
buggy firmware.

edk2 should be the entity to issue the XENMEM_remove_from_physmap
against the gfn where it has the shared_info page mapped.  Likely
needs to be done as part of ExitBootServices() method.

FWIW, 71320946d5edf is an ABI change, and as desirable as such
behavior might be, a new hypercall should have introduced that had the
behavior that the change intended to retrofit into
XENMEM_add_to_physmap.
I can see how you think this is an ABI change but the previous behavior was
incorrect. Before this patch, on Arm, we would allow the shared page to be
mapped twice. As we don't know where the firmware had mapped it this could
result to random corruption.

Now, we could surely decide to remove the page as x86 did. But this could
leave a hole in the RAM. As the OS would not know where the hole is, this
could lead to page fault randomly during runtime.

I would say it's the job of the firmware to notify the OS where the
hole is, by modifying the memory map handled to the OS.  Or else
mapping the shared_info page in an unpopulated p2m hole.

I agree but I am not convinced that they are all doing it. At least U-boot didn't do it before we fixed it.


When using UEFI there's RAM that will always be in-use by the
firmware, as runtime services cannot be shut down, and hence the
firmware must already have a way to remove/reserve such region(s) on
the memory map.

Can either you or Elliott confirm if EDK2 reserve the region?


Neither of the two behaviors help the users. In fact, I think they only make
the experience worse because you don't know when the issue will happen.

AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
all the cases, I can't think of a way for the OS to workaround properly
buggy firmware. Therefore, returning -EBUSY is the safest we can do for our
users and I don't view it as a ABI change (someone rely on the previous
behavior is bound to failure).

I fully agree the current behavior might not be the best one, but I do
consider this part of the ABI, specially as booting guests using edk2
has now stopped working after this change.

Right. If we remove the check, you may be able to boot a guest. But are we sure that such guest would run safely?

Also, it is not really argument, but this is not the only broken part in EDK2 for Xen Arm guests. The other one I know is EDS makes assumption how some Device-Tree nodes and this will break on newer Xen.

So overall, it feels to me that EDK2 is not entirely ready to be used in production for Xen on Arm guests.

Cheers,

--
Julien Grall



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.