[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Issue with shared information page on Xen/ARM 4.17


  • To: Julien Grall <julien@xxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Wed, 4 Oct 2023 15:39:16 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=0QiJSPwtPgu9Rc9sZEOFYFdzftKCqQS+Q+S3fvi+Wog=; b=GGBLQN26J+QjaWKcQ6pIM2eWoThRJKsU4LpZlih2BLVtt24df4wXb70+Ui5vRY8IikX71ZpVJVGRTUHSZHwTojEJeEeu4N7hnhGyWmVrmBlxhyUB6nr0bwsjXxIzsTanUQwZ50HqjQsxyN2oYlKX6XRTxcFspChJGBL4Qw0Ip91ogiSMXIgrMjbZLRE0WfhrkT0OyJ35+NByWmZIVi/e5YQzz+fcYDy3oW42/5TZyKwxFnB5Uk3eyFAlfZTKmH+eYvcXy+w9ls0DRQRQpWQ9lmhcJSEoDS9lptKYa+yZNiOwLzi/LWQ+lPBYhmWDyJHDugmTxtXvfeCa1sOTDyUinw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=dPmgXIRr77/GManq4M/+rEIAejA9MhcIHcJUHF/lTZrrFYvokigBN20HLm8Iv9shUT5E86Az0+nN0iweD8/2376RVqwGbdk3hONxioHpY4Nprm4cEHOd3j3KYpuX8NnomKrUF7JkORMs+vKqYjI7coGSENgIGn6yO9I/a3zm5IY6FxaY2F30znqy8Q/9UUmhMNOmDayg9UkhM0EUuoQGrTkzEta7UZIY6prx1LyXemetEfVv2nkB6o5OOjbVDGWQSdDnsvswYMH+G2JMQAICBvN9LBkO2rsMNS3egMBVVMJ3mNyAPJw+5iTaLFTsFwSxdUX3BiOEbLFJ6k08Y1yuWQ==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Elliott Mitchell <ehem+xen@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>
  • Delivery-date: Wed, 04 Oct 2023 13:39:44 +0000
  • Ironport-data: A9a23:vHe75Ky6NdsVPQOzmiZ6t+ehxyrEfRIJ4+MujC+fZmUNrF6WrkUDz GEcWmnUa/3fYDfzc98jYd/j/UJU6JOHm99rGQM5/iAxQypGp/SeCIXCJC8cHc8wwu7rFxs7s ppEOrEsCOhuExcwcz/0auCJQUFUjPzOHvykTrecZkidfCc8IA85kxVvhuUltYBhhNm9Emult Mj75sbSIzdJ4RYtWo4vw/zF8EgHUMja4mtC5QVmPawT5zcyqlFOZH4hDfDpR5fHatE88t6SH 47r0Ly/92XFyBYhYvvNfmHTKxBirhb6ZGBiu1IOM0SQqkEqSh8ai87XAME0e0ZP4whlqvgqo Dl7WT5cfi9yVkHEsLx1vxC1iEiSN4UekFPMCSDXXcB+UyQq2pYjqhljJBheAGEWxgp4KTFv9 tg3IhQ/VBnd3OuN6rayctRCpct2eaEHPKtH0p1h5RfwKK56BLX8GeDN79Ie2yosjMdTG/qYf 9AedTdkcBXHZVtIJ0sTD5U92uyvgxETcRUB8A7T+fVxvDOVlVMsuFTuGIO9ltiiX8Jak1zev mvb12/4HgsbJJqUzj/tHneE37WUzX2jCNNKfFG+3u5ajUHQ1DEaMl4xDgKwpMGbkFa0B80Kf iT4/QJr98De7neDTMT5XhC+iG6JuFgbQdU4O+A65QTO2qfSywPEHi4PSTspQMwrsoo6SCIn0 neNnsj1Hnp/vbuNU3Wf+7yI6zSoNkA9NnQebCUJSQ8E5djLo4wpiB/LCNF5H8adhNDvBSv5x TzMqSEknqgSluYCzaD99lfC6xqSoZzOQh8w9x/gdGuv5QNkZ6aof4Wtr1Pc6J5oK4KUTUKIu nQerNSP9+AFDZyLlyulTf0EGfei4PPtDdHHqVtmHp1k8iv3/XemJdxU+GsnexovNdsYczj0Z kOVoRlW+JJYIHqta+lwfp61DMMpi6PnELwJS8zpUzaHWbApHCfvwc2kTRf4M7zF+KT0rZwCB A==
  • Ironport-hdrordr: A9a23:IMYE4aG/QYqSmojbpLqE18eALOsnbusQ8zAXPhZKOH5om62j5r iTdZEgvyMc5wxhPE3I9erwXpVoIkmzyXcW2/h1AV7KZmCP01dASrsD0WLM+UyCJ8SUzJ876U 4PSdkGNDQyNzRHZATBjTVQ3+xO/DBPysGVuds=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 04/10/2023 13:53, Roger Pau Monné wrote:
> > On Wed, Oct 04, 2023 at 11:55:05AM +0100, Julien Grall wrote:
> > > Hi Roger,
> > > 
> > > On 04/10/2023 09:13, Roger Pau Monné wrote:
> > > > On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
> > > > > On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
> > > > > > On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> > > > > > > I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current 
> > > > > > > issue is
> > > > > > > the changes with the handling of the shared information page 
> > > > > > > appear to
> > > > > > > have broken things for me.
> > > > > > > 
> > > > > > > With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a 
> > > > > > > build
> > > > > > > of the 4.17 release, mapping the shared information page doesn't 
> > > > > > > work.
> > > > > > 
> > > > > > This is due to 71320946d5edf AFAICT.
> > > > > 
> > > > > Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
> > > > > This seems a fairly reasonable change, so I had no intention of asking
> > > > > for a revert (which likely would have been rejected).  There is also a
> > > > > real possibility the -EBUSY comes from elsewhere.  Could also be
> > > > > 71320946d5edf caused a bug elsewhere to be exposed.
> > > > 
> > > > A good way to know would be to attempt to revert 71320946d5edf and see
> > > > if that fixes your issue.
> > > > 
> > > > Alternatively you can try (or similar):
> > > > 
> > > > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> > > > index 6ccffeaea57d..105ef3faecfd 100644
> > > > --- a/xen/arch/arm/mm.c
> > > > +++ b/xen/arch/arm/mm.c
> > > > @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
> > > >                    page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
> > > >            }
> > > >            else
> > > > +        {
> > > > +            printk("%u already mapped\n", space);
> > > >                /*
> > > >                 * Mandate the caller to first unmap the page before 
> > > > mapping it
> > > >                 * again. This is to prevent Xen creating an unwanted 
> > > > hole in
> > > > @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
> > > >                 * to unmap it afterwards.
> > > >                 */
> > > >                rc = -EBUSY;
> > > > +        }
> > > >            p2m_write_unlock(p2m);
> > > >        }
> > > > 
> > > > > > > I'm using Tianocore as the first stage loader.  This continues to 
> > > > > > > work
> > > > > > > fine.  The build is using tag "edk2-stable202211", commit 
> > > > > > > fff6d81270.
> > > > > > > While Tianocore does map the shared information page, my reading 
> > > > > > > of their
> > > > > > > source is that it properly unmaps the page and therefore 
> > > > > > > shouldn't cause
> > > > > > > trouble.
> > > > > > > 
> > > > > > > Notes on the actual call is gpfn was 0x0000000000040072.  This is 
> > > > > > > outside
> > > > > > > the recommended address range, but my understanding is this is 
> > > > > > > supposed
> > > > > > > to be okay.
> > > > > > > 
> > > > > > > The return code is -16, which is EBUSY.
> > > > > > > 
> > > > > > > Ideas?
> > > > > > 
> > > > > > I think the issue is that you are mapping the shared info page over 
> > > > > > a
> > > > > > guest RAM page, and in order to do that you would fist need to 
> > > > > > create
> > > > > > a hole and then map the shared info page.  IOW: the issue is not 
> > > > > > with
> > > > > > edk2 not having unmapped the page, but with FreeBSD trying to map 
> > > > > > the
> > > > > > shared_info over a RAM page instead of a hole in the p2m.  x86
> > > > > > behavior is different here, and does allow mapping the shared_info
> > > > > > page over a RAM gfn (by first removing the backing RAM page on the
> > > > > > gfn).
> > > > > 
> > > > > An interesting thought.  I thought I'd tried this, but since I didn't 
> > > > > see
> > > > > such in my experiments list.  What I had tried was removing all the 
> > > > > pages
> > > > > in the suggested mapping range.  Yet this failed.
> > > > 
> > > > Yeah, I went too fast and didn't read the code correctly, it is not
> > > > checking that the provided gfn is already populated, but whether the
> > > > mfn intended to be mapped is already mapped at a different location.
> > > > 
> > > > > Since this seemed reasonable, I've now tried and found it fails.  The
> > > > > XENMEM_remove_from_physmap call returns 0.
> > > > 
> > > > XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
> > > > like edk2 hasn't unmapped the shared_info page.  The OS has no idea
> > > > at which position the shared_info page is currently mapped, and hence
> > > > can't do anything to attempt to unmap it in order to cover up for
> > > > buggy firmware.
> > > > 
> > > > edk2 should be the entity to issue the XENMEM_remove_from_physmap
> > > > against the gfn where it has the shared_info page mapped.  Likely
> > > > needs to be done as part of ExitBootServices() method.
> > > > 
> > > > FWIW, 71320946d5edf is an ABI change, and as desirable as such
> > > > behavior might be, a new hypercall should have introduced that had the
> > > > behavior that the change intended to retrofit into
> > > > XENMEM_add_to_physmap.
> > > I can see how you think this is an ABI change but the previous behavior 
> > > was
> > > incorrect. Before this patch, on Arm, we would allow the shared page to be
> > > mapped twice. As we don't know where the firmware had mapped it this could
> > > result to random corruption.
> > > 
> > > Now, we could surely decide to remove the page as x86 did. But this could
> > > leave a hole in the RAM. As the OS would not know where the hole is, this
> > > could lead to page fault randomly during runtime.
> > 
> > I would say it's the job of the firmware to notify the OS where the
> > hole is, by modifying the memory map handled to the OS.  Or else
> > mapping the shared_info page in an unpopulated p2m hole.
> 
> I agree but I am not convinced that they are all doing it. At least U-boot
> didn't do it before we fixed it.
> 
> > 
> > When using UEFI there's RAM that will always be in-use by the
> > firmware, as runtime services cannot be shut down, and hence the
> > firmware must already have a way to remove/reserve such region(s) on
> > the memory map.
> 
> Can either you or Elliott confirm if EDK2 reserve the region?

I will defer to Elliott to check for arm.  I would be quite surprised
if it doesn't on x86, or else we would get a myriad of bug reports
about guests randomly crashing when using edk2.

> > 
> > > Neither of the two behaviors help the users. In fact, I think they only 
> > > make
> > > the experience worse because you don't know when the issue will happen.
> > > 
> > > AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
> > > all the cases, I can't think of a way for the OS to workaround properly
> > > buggy firmware. Therefore, returning -EBUSY is the safest we can do for 
> > > our
> > > users and I don't view it as a ABI change (someone rely on the previous
> > > behavior is bound to failure).
> > 
> > I fully agree the current behavior might not be the best one, but I do
> > consider this part of the ABI, specially as booting guests using edk2
> > has now stopped working after this change.
> 
> Right. If we remove the check, you may be able to boot a guest. But are we
> sure that such guest would run safely?

If the guest wants the hypervisor to enforce such behavior, let it
use the new hypercall to explicitly request the shared_info page to
not be mapped anywhere else.

But if you don't trust the bootloader, how do you know it hasn't poked
holes elsewhere in the RAM regions?

Even if the shared_info page has been unmapped, can you be sure the
bootloader has put a RAM page back in that gfn?

I understand this ABI change is done to avoid bugs, but at the cost of
diverging from x86, and breaking existing firmwares which might not be
buggy.

> Also, it is not really argument, but this is not the only broken part in
> EDK2 for Xen Arm guests. The other one I know is EDS makes assumption how
> some Device-Tree nodes and this will break on newer Xen.
> 
> So overall, it feels to me that EDK2 is not entirely ready to be used in
> production for Xen on Arm guests.

I really have no insight on this.  What are the supported way of booting
guests on Arm?  (SUPPORT.md doesn't seem to list any firmware for Arm
guests AFAICT).

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.