
Re: Xen on AWS EC2 Graviton 2 metal instances (c6g.metal)



On Mon, 2 Oct 2023, Julien Grall wrote:
> On 29/09/2023 21:29, Stefano Stabellini wrote:
> > I am very glad you managed to solve the issue!
> > 
> > It is always difficult to know what is the right thing to do when the
> > firmware provides wrong or noncompliant information.
> 
> I am a bit confused why you think the firmware is wrong here. From the ACPI
> spec (ACPI 6.5, section 5.2.25):
> 
> "GSIV for the secure EL1 timer. This value is optional, as an operating system
> executing in the nonsecure world (EL2 or EL1), will ignore the content of
> these fields."

Thanks for looking into it -- I didn't investigate the issue, I was just
assuming that the table was incomplete due to the zero value.
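
For reference, here is a minimal sketch of what tolerating the optional
GSIV could look like on the parsing side. This is an illustration, not
the actual Xen code: the field names follow ACPICA's struct
acpi_table_gtdt and the TIMER_* indices follow Xen's enum, but treat it
as untested.

/*
 * Sketch: map the GTDT timer GSIVs, treating 0 in the secure EL1
 * field as "not provided" (per ACPI 6.5, section 5.2.25) instead of
 * an error.
 */
static void __init map_gtdt_timer_irqs(const struct acpi_table_gtdt *gtdt,
                                       unsigned int timer_irq[])
{
    /* Optional for a non-secure OS: a GSIV of 0 means "ignore me". */
    if ( gtdt->secure_el1_interrupt )
        timer_irq[TIMER_PHYS_SECURE_PPI] = gtdt->secure_el1_interrupt;

    timer_irq[TIMER_PHYS_NONSECURE_PPI] = gtdt->non_secure_el1_interrupt;
    timer_irq[TIMER_VIRT_PPI] = gtdt->virtual_timer_interrupt;
    timer_irq[TIMER_HYP_PPI] = gtdt->non_secure_el2_interrupt;
}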


> So the expectation is that Xen should not read the value. In Xen, we decided
> to read it because we want to know which PPIs exist in order to find an
> unallocated one for the event channel.
>
> > On one hand a panic
> > can help debug a potentially broken firmware configuration. On the other
> > hand the panic can cause problems for users that just want to boot Xen.
> > Unfortunately, due to the complexity of ACPI, issues like this one are
> > not uncommon.
> > 
> > In this specific case, given that we don't actually use
> > TIMER_PHYS_SECURE_PPI (we do use all the others: TIMER_HYP_PPI,
> > TIMER_VIRT_PPI and TIMER_PHYS_NONSECURE_PPI), I think we could
> > safely remove the BUG at vtimer.c:75.
> 
> IIRC, MISRA has a rule about checking return values.

Yes it does: rule 17.7, which is likely to be adopted.
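
For anyone not familiar with it, Rule 17.7 is "the value returned by a
function having non-void return type shall be used". A purely
illustrative example, assuming Xen's bool-returning prototype for
vgic_reserve_virq():

void example(struct domain *d, unsigned int virq)
{
    vgic_reserve_virq(d, virq);         /* non-compliant: result discarded */

    (void)vgic_reserve_virq(d, virq);   /* compliant: explicitly discarded */

    if ( !vgic_reserve_virq(d, virq) )  /* compliant: result acted upon */
        printk(XENLOG_WARNING "vIRQ %u already reserved\n", virq);
}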


> If it doesn't, then I would
> at least query why you are suggesting to remove the BUG() but still keep the
> call. Surely, if a function returns an error, we need to investigate why the
> error is returned? And if TIMER_PHYS_SECURE_PPI is really not used, then
> why should the call be kept?
>
> As I wrote above, the goal of those calls was to ensure that all the PPIs
> described in the firmware tables were recorded so we can find a free PPI for
> the event channel.
> 
> vgic_reserve_virq() fails because the PPI is already reserved. I actually
> wonder which other path reserves it? The reason I am asking is that for some
> fields, 0 is used to mark that the interrupt does not exist (see below for
> the timer PPI). So maybe we forgot to add a check somewhere else. The other
> possible reason is the PPI might be shared (I couldn't find anything in the
> spec saying it cannot be) and vgic_reserve_virq() doesn't deal with that
> right now.

Good point. That would be interesting to know.
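
For anyone following along: if I remember right, vgic_reserve_virq() is
essentially a test_and_set_bit() on the per-domain allocated_irqs
bitmap, which is why a second reservation of the same vIRQ fails.
Roughly this (from memory, so double-check against xen/arch/arm/vgic.c):

bool vgic_reserve_virq(struct domain *d, unsigned int virq)
{
    if ( virq >= vgic_num_irqs(d) )
        return false;

    /*
     * test_and_set_bit() returns the old value of the bit, so
     * reserving an already-allocated vIRQ yields false here.
     */
    return !test_and_set_bit(virq, d->arch.vgic.allocated_irqs);
}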


> Now regarding whether the PPI is used. AFAICT, the secure timer PPI is still
> present in the firmware tables (ACPI and DT) passed to dom0. So strictly
> speaking we want to ensure the PPI value is reserved.
> 
> That said, the ACPI spec suggests that the value will be ignored by the
> guest. The Device-Tree binding doesn't have such a statement, but I suspect
> the same applies. So it should be ok to skip reserving the PPI and therefore
> allow the event channel interrupt to use it, if it is not reserved by
> someone else.

You looked into this more deeply than I did. Your suggestion makes sense
to me.
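
If we go that route, I imagine domain_vtimer_init() would end up looking
roughly like this (untested sketch, names from the 4.16-era vtimer.c,
which reserves each timer PPI with a BUG() on failure):

/*
 * Untested sketch: drop the TIMER_PHYS_SECURE_PPI reservation, since
 * an OS in the non-secure world is expected to ignore that GSIV.
 * The PPI then stays available for vgic_allocate_virq() to pick for
 * the event channel, unless something else reserves it first.
 */
if ( !vgic_reserve_virq(d, timer_get_irq(TIMER_PHYS_NONSECURE_PPI)) )
    BUG();

if ( !vgic_reserve_virq(d, timer_get_irq(TIMER_VIRT_PPI)) )
    BUG();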


> Cheers,
> 
> > On Fri, 29 Sep 2023, Driscoll, Dan wrote:
> > > All,
> > > 
> > >          Just an FYI - using the debug guidance from Julien on Graviton 2,
> > > we have successfully been able to boot Xen and 3 Linux VMs on a Graviton 2
> > > c6g.metal instance.
> > > 
> > >          The problem turned out to be that the ACPI table containing the
> > > arch timer interrupt vectors had an issue - the result was that the secure
> > > physical timer IRQ was getting set to a value of 0, which resulted in Xen
> > > panicking at vtimer.c:75 and stopping the boot.  The quick
> > > work-around for this was to just hard-code this IRQ to 29, which is the
> > > "typical" PPI assigned for this interrupt (and I suspect it isn't even
> > > used, so kind of a don't care).  This fixed the problem and we encountered
> > > no other issues.
> > > 
> > >          Out of curiosity, is the problem we found here one that has been
> > > seen before?  I guess I could argue that the ACPI tables are incorrect and
> > > should provide a valid PPI number for the secure physical timer, but I
> > > could also argue that Xen shouldn't panic if this value is 0 and should
> > > maybe replace it with a "suitable" value and continue booting, since it
> > > really is not used.  I can provide more details as well as the patch used
> > > to work around this issue - we are using Xen 4.16.1 BTW.
> > > 
> > >          Much appreciated for the support and help here... as we progress
> > > in our work in this domain, we might have some more questions but, for
> > > right now, it appears that things are working properly with the limited
> > > testing we conducted.
> > > 
> > > Thanks,
> > > Dan
> > > 
> > > > -----Original Message-----
> > > > From: Julien Grall <julien@xxxxxxx>
> > > > Sent: Wednesday, September 27, 2023 7:59 AM
> > > > To: Driscoll, Dan (DI SW CAS ES TO) <dan.driscoll@xxxxxxxxxxx>; xen-
> > > > devel@xxxxxxxxxxxxxxxxxxxx
> > > > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; Raghuraman, Arvind (DI
> > > > SW CAS
> > > > ES) <arvind.raghuraman@xxxxxxxxxxx>; Bertrand Marquis
> > > > <Bertrand.Marquis@xxxxxxx>; rahul.singh@xxxxxxx; Luca Fancellu
> > > > <Luca.Fancellu@xxxxxxx>
> > > > Subject: Re: Xen on AWS EC2 Graviton 2 metal instances (c6g.metal)
> > > > 
> > > > Hi Dan,
> > > > 
> > > > Thanks for the report.
> > > > 
> > > > On 26/09/2023 20:41, Driscoll, Dan wrote:
> > > > >      First off - sorry for the very long email, but there are a lot
> > > > > of details related to this topic and I figured more details might be
> > > > > better than less, but I could be wrong here....
> > > > > 
> > > > >      Within Siemens Embedded, we have been doing some prototyping
> > > > > using Xen for some upcoming customer-related work - this email thread
> > > > > attempts to explain what has been done here and our analysis of the
> > > > > problems we are having.
> > > > > 
> > > > >      We have done some initial prototyping to get Xen running on an
> > > > > AWS Graviton 2 instance using an EC2 Arm64 "metal" instance
> > > > > (c6g.metal - no AWS hypervisor) and ran into some problems during
> > > > > this prototyping.
> > > > > 
> > > > >      Since the Edge Workload Abstraction and Orchestration Layer
> > > > > (EWAOL) that is part of SOAFEE already has some enablement of Xen in
> > > > > various environments (including an Arm64 server environment), we used
> > > > > this as a starting point.
> > > > > 
> > > > >      We were able to successfully bring up Xen and a Yocto dom0 and
> > > > > multiple domu Yocto guests on an Arm AVA server (AVA Developer
> > > > > Platform - 32 core Neoverse N1 server) following documented steps with
> > > > > some minimal configuration changes (we simply extended the
> > > > > configuration to include 3 Linux guests):
> > > > > https://ewaol.docs.arm.com/en/kirkstone-dev/manual/build_system.html#build-system
> > > > > 
> > > > >      So, this specific EWAOL support has all the proper bitbake
> > > > > layers to generate images for both bare-metal (Linux running
> > > > > natively) and a virtualization build (using Xen) for AVA and also a
> > > > > Neoverse N1 System Development Platform (N1SDP), but we only verified
> > > > > this on AVA.
> > > > >
> > > > >      AWS also has support for EWAOL on Graviton 2, but the only
> > > > > supported
> > > > > configuration is a bare-metal configuration (Linux running natively)
> > > > > and the virtualization build hasn't been implemented in the bitbake
> > > > > layers in their repo - here is the URL for information / instructions
> > > > > on this support:
> > > > > https://github.com/aws4embeddedlinux/meta-aws-ewaol
> > > > > 
> > > > >      As part of our effort to bring this up, we did a VERY minimal
> > > > > patch to the repo used for the AWS EWAOL to generate a virtualization
> > > > > build (attached meta-aws-ewaol.patch).  The resultant build of the
> > > > > AWS EWAOL support with this patch applied does result in Xen being
> > > > > built as well as a dom0 Yocto kernel, but there is definitely missing
> > > > > support to properly build everything for this virtualization layer.
> > > > > Following the instructions for meta-aws-ewaol, we generated an AMI
> > > > > and started an EC2 instance with this AMI (c6g.metal type).  The
> > > > > resultant image does boot, but it boots into the dom0 Linux kernel
> > > > > with problems recorded in the boot log related to Xen (see
> > > > > dom0-linux-boot.txt).
> > > > > 
> > > > >          Looking more closely at the EFI partition, it was clear
> > > > > that systemd-boot was being used and it was set up to boot the dom0
> > > > > Linux kernel and not boot into Xen - the Xen EFI images were not
> > > > > present in the EFI partition and obviously no launch entries existed
> > > > > for Xen.  To rectify this, the Xen EFI image that was built as part
> > > > > of the AWS EWAOL build mentioned above was placed in the EFI
> > > > > partition, along with a Xen config file that provided the dom0 Linux
> > > > > kernel image details.  A new launch entry was added for Xen and the
> > > > > launch conf file was updated to boot Xen instead of dom0 Linux.  This
> > > > > resulted in the EC2 instance becoming "bricked" and no longer
> > > > > accessible.
> > > > > 
> > > > >          Details on the EFI-related content and changes we made are
> > > > > captured in the meta-aws-ewaol-efi-boot-changes.txt file attached
> > > > > above.
> > > > > 
> > > > >          The next step was comparing against the AVA Xen output that
> > > > > was working, and we noticed a few differences - the AVA build did
> > > > > enable the ACPI and UNSUPPORTED kconfig settings whereas the AWS Xen
> > > > > build did not.  So, we tried again to bring up another EC2 metal
> > > > > instance using the same AMI as before and utilized the AVA Xen EFI
> > > > > image and the same Xen config file.  The result was the same - a
> > > > > "bricked" instance.
> > > > > 
> > > > >          We will likely try to use the entire AVA flow on AWS
> > > > > Graviton next as it is using GRUB 2 instead of systemd-boot, and we
> > > > > hope to maybe extend or enable some of the debug output during boot.
> > > > > The AWS EC2 instances have a "serial console", but we have yet to see
> > > > > any output on this console prior to Linux boot logs - no success in
> > > > > getting EC2 serial output during EFI booting.
> > > > 
> > > > That's interesting. The documentation for AWS [1] suggests that the
> > > > logs from boot should be seen. They even have a page for
> > > > troubleshooting using GRUB [2].
> > > > 
> > > > I just launched a c6g.metal and I could access the serial console, but
> > > > then it didn't work across reboot.
> > > > 
> > > > I have tried a c6g.medium and the serial was working across reboot (I
> > > > could see some logs). So I wonder whether there is a missing serial
> > > > console configuration for baremetal?
> > > > 
> > > > > 
> > > > >          We have had a call and some email exchanges with AWS on
> > > > > this topic (Luke Harvey, Jeremy Dahan, Robert DeOliveira, and Azim
> > > > > Siddique) and they said there have been multiple virtualization
> > > > > solutions successfully booted on Graviton 2 metal instances, so they
> > > > > felt that Xen should be useable once we figured out configuration /
> > > > > boot details.  They provided some guidance on how we might go about
> > > > > some more exploration here, but nothing really specific to supporting
> > > > > Xen.
> > > > 
> > > > To be honest, without a properly working serial console, it is going
> > > > to be very difficult to debug any issue in Xen.
> > > > 
> > > > Right now, it is unclear whether Xen has output anything. If we can
> > > > confirm the serial console works as intended and there are still no
> > > > logs, then I would suggest enabling earlyprintk in Xen. For your
> > > > Graviton2, I think the following lines in xen/.config should do the
> > > > trick:
> > > > 
> > > > CONFIG_DEBUG=y
> > > > CONFIG_EARLY_UART_CHOICE_PL011=y
> > > > CONFIG_EARLY_UART_PL011=y
> > > > CONFIG_EARLY_PRINTK=y
> > > > CONFIG_EARLY_UART_BASE_ADDRESS=0x83e00000
> > > > CONFIG_EARLY_UART_PL011_BAUD_RATE=115200
> > > > 
> > > > > 
> > > > >          I have attached the following files for reference:
> > > > > 
> > > > >      * meta-aws-ewaol.patch - patch to AWS EWAOL repo found at
> > > > > https://github.com/aws4embeddedlinux/meta-aws-ewaol
> > > > >      * meta-aws-ewaol-efi-boot-changes.txt - description of
> > > > > EFI-related changes made to the AWS EWAOL EFI partition in an attempt
> > > > > to boot Xen
> > > > >      * ava.xen.config - config file for Xen build for AVA using EWAOL
> > > > > virtualization build
> > > > >      * aws.xen.config - config file for Xen build for AWS using EWAOL
> > > > > virtualization build
> > > > >      * xen-4.16.1.cfg - Xen config file placed in root of EFI boot
> > > > > partition alongside xen-4.16.1.efi image
> > > > 
> > > > May I ask why you are using 4.16.1 rather than 4.17? In general I
> > > > would recommend using the latest stable version, or even staging (the
> > > > on-going development branch), for bring-up, because we don't always
> > > > backport everything to the stable branches. So a bug may already have
> > > > been fixed in a newer revision.
> > > > 
> > > > That said, skimming through the logs, I couldn't spot any patches that
> > > > may help on
> > > > Graviton 2.
> > > > 
> > > > Best regards,
> > > > 
> > > > [1]
> > > > https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-serial-console.html
> > > > [2]
> > > > https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/grub.html
> > > > 
> > > > > 
> > > > > Dan Driscoll
> > > > > Distinguished Engineer
> > > > > Siemens DISW - Embedded Platform Solutions
> > > > 
> > > > --
> > > > Julien Grall
> > > 
> 
> -- 
> Julien Grall
> 



 

