
Re: Xen on AWS EC2 Graviton 2 metal instances (c6g.metal)



Hi Stefano,

On 29/09/2023 21:29, Stefano Stabellini wrote:
I am very glad you managed to solve the issue!

It is always difficult to know what is the right thing to do when the
firmware provides wrong or noncompliant information.

I am a bit confused about why you think the firmware is wrong here. From the ACPI spec (ACPI 6.5, section 5.2.25):

"GSIV for the secure EL1 timer. This value is optional, as an operating system executing in the nonsecure world (EL2 or EL1), will ignore the content of these fields."

So the expectation is that Xen should not read the value. In Xen, we decided to read it because we want to know which PPIs exist in order to find an unallocated one for the event channel.

On one hand, a panic
can help debug a potentially broken firmware configuration. On the other
hand, the panic can cause problems for users that just want to boot Xen.
Unfortunately due to the complexity of ACPI, issues like this one are
not uncommon.

In this specific case, given that we don't actually use
TIMER_PHYS_SECURE_PPI (we do use all the others: TIMER_HYP_PPI,
TIMER_VIRT_PPI and TIMER_PHYS_NONSECURE_PPI) then I think we could
safely remove the BUG at vtimer.c:75.

IIRC, MISRA has a rule about checking return values. If it doesn't, then I would at least query why you are suggesting to remove the BUG() but still keep the call. Surely, if a function returns an error, we need to investigate why the error is returned? And if TIMER_PHYS_SECURE_PPI is really not used, then why should the call be kept?

As I wrote above, the goal of those calls was to ensure that all the PPIs described in the firmware tables were recorded so we can find a free PPI for the event channel.

vgic_reserve_virq() fails because the PPI is already reserved. I actually wonder which other path reserves it? The reason I am asking is that, for some fields, 0 is used to mark that the interrupt doesn't exist / isn't used (see below for the timer PPI). So maybe we forgot to add a check somewhere else. The other possible reason is that the PPI might be shared (I couldn't find anything in the spec saying it cannot be) and vgic_reserve_virq() doesn't deal with that right now.

Now regarding whether the PPI is used. AFAICT, the secure timer PPI is still present in the firmware tables (ACPI and DT) passed to dom0. So strictly speaking we want to ensure the PPI value is reserved.

That said, the ACPI spec suggests that the value will be ignored by the guest. The Device-Tree binding doesn't have such a statement, but I suspect the situation is the same. So it should be ok to skip reserving the PPI and therefore allow the event channel interrupt to use it, provided it is not reserved by someone else.
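
To make that concrete, a rough sketch of what domain_vtimer_init() could look like with the secure timer reservation dropped (based on the Xen 4.16 code around vtimer.c:75; a sketch only, not a tested patch):

    /*
     * Sketch only, not a tested patch: stop reserving the secure EL1 timer
     * PPI so a firmware reporting 0 no longer trips the BUG() and the virq
     * stays available (e.g. for the event channel interrupt).
     */
    if ( !vgic_reserve_virq(d, timer_get_irq(TIMER_PHYS_NONSECURE_PPI)) )
        BUG();

    if ( !vgic_reserve_virq(d, timer_get_irq(TIMER_VIRT_PPI)) )
        BUG();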

Cheers,

On Fri, 29 Sep 2023, Driscoll, Dan wrote:
All,

         Just an FYI - using the debug guidance from Julien on Graviton 2, we 
have successfully been able to boot Xen and 3 Linux VMs on a Graviton 2 
c6g.metal instance.

         The problem turned out to be that the ACPI table containing the arch timer
interrupt vectors had an issue - the result was that the secure physical timer IRQ was
getting set to a value of 0, which resulted in Xen panicking at vtimer.c:75 and stopping
the boot.  The quick work-around for this was to just hard-code this IRQ to 29,
which is the "typical" PPI assigned for this interrupt (and I suspect it isn't
even used, so kind of a don't care).  This fixed the problem and we encountered no other
issues.
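
In essence the work-around has roughly the following shape (an illustrative sketch against Xen's ACPI GTDT parsing; the actual patch we used is the one mentioned below and may differ - the field name follows ACPICA's struct acpi_table_gtdt):

    #define FALLBACK_SECURE_TIMER_PPI 29  /* the "typical" PPI mentioned above */

    /*
     * Illustrative only: where the GTDT's secure EL1 timer interrupt is
     * recorded, fall back to the conventional PPI if the firmware reports 0.
     */
    static unsigned int __init secure_timer_irq(const struct acpi_table_gtdt *gtdt)
    {
        unsigned int irq = gtdt->secure_el1_interrupt;

        return irq ? irq : FALLBACK_SECURE_TIMER_PPI;
    }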

         Out of curiosity, is this problem one that has been seen before?
I guess I could argue that the ACPI tables are incorrect and should provide a valid PPI
number for the secure physical timer, but I could also argue that Xen shouldn't panic if
this value is 0 and should maybe replace it with a "suitable" value and continue
booting, since it really is not used.  I can provide more details as well as the patch
used to work around this issue - we are using Xen 4.16.1 BTW.

         Much appreciated for the support and help here... as we progress in 
our work in this domain, we might have some more questions but, for right now, 
it appears that things are working properly with the limited testing we 
conducted.

Thanks,
Dan

-----Original Message-----
From: Julien Grall <julien@xxxxxxx>
Sent: Wednesday, September 27, 2023 7:59 AM
To: Driscoll, Dan (DI SW CAS ES TO) <dan.driscoll@xxxxxxxxxxx>; xen-
devel@xxxxxxxxxxxxxxxxxxxx
Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; Raghuraman, Arvind (DI SW CAS
ES) <arvind.raghuraman@xxxxxxxxxxx>; Bertrand Marquis
<Bertrand.Marquis@xxxxxxx>; rahul.singh@xxxxxxx; Luca Fancellu
<Luca.Fancellu@xxxxxxx>
Subject: Re: Xen on AWS EC2 Graviton 2 metal instances (c6g.metal)

Hi Dan,

Thanks for the report.

On 26/09/2023 20:41, Driscoll, Dan wrote:
     First off - sorry for the very long email, but there are a lot of details 
related
to this topic and I figured more details might be better than less but I could 
be
wrong here....

     Within Siemens Embedded, we have been doing some prototyping using
Xen for some upcoming customer related work - this email thread attempts to
explain what has been done here and our analysis of the problems we are having.

     We have done some initial prototyping to get Xen running on an AWS
Graviton 2 instance using an EC2 Arm64 "metal" instance (c6g.metal - no AWS
hypervisor) and ran into some problems during this prototyping.

     Since the Edge Workload Abstraction and Orchestration Layer (EWAOL)
that is part of SOAFEE already has some enablement of Xen in various
environments (including an Arm64 server environment), we used this as a starting
point.

     We were able to successfully bring up Xen and a Yocto dom0 and
multiple domu Yocto guests on an Arm AVA server (AVA Developer
Platform - 32 core Neoverse N1 server) following documented steps with
some minimal configuration changes (we simply extended the
configuration to include 3 Linux guests):
https://ewaol.docs.arm.com/en/kirkstone-dev/manual/build_system.html#build-system

     So, this specific EWAOL support has all the proper bitbake layers to
generate images for both bare-metal (Linux running natively) and a 
virtualization
build (using Xen) for AVA and also a Neoverse N1 System Development Platform
(N1SDP), but we only verified this on AVA.
     AWS also has support for EWAOL on Graviton 2, but the only supported
configuration is a bare-metal configuration (Linux running natively)
and the virtualization build hasn't been implemented in the bitbake
layers in their repo - here is the URL for information / instructions
on this support:
https://github.com/aws4embeddedlinux/meta-aws-ewaol

     As part of our effort to bring this up, we did a VERY minimal patch to the
repo used for the AWS EWAOL to generate a virtualization build (attached meta-
aws-ewaol.patch).  The resultant build of the AWS EWAOL support with this patch
applied does result in Xen being built as well as a dom0 Yocto kernel, but 
there is
definitely missing support to properly build everything for this virtualization 
layer.
Following the instructions for meta-aws-ewaol,  we generated an AMI and started
an EC2 instance with this AMI (c6g.metal type).  The resultant image does boot,
but it boots into the dom0 Linux kernel with problems recorded in the boot log
related to Xen (see dom0-linux-boot.txt).

         Looking more closely at the EFI partition, it was clear that systemd-boot
was being used and it was set up to boot the dom0 Linux kernel and not boot into
Xen - the Xen EFI images were not present in the EFI partition and obviously no
launch entries existed for Xen.  To rectify this, the Xen EFI image that was built as
part of the AWS EWAOL build mentioned above was placed in the EFI partition,
along with a Xen config file that provided the dom0 Linux kernel image details.  A
new boot entry was added for Xen and the launch conf file was updated to boot Xen
instead of dom0 Linux.  This resulted in the EC2 instance becoming "bricked" and no
longer accessible.

         Details on the EFI related content and changes we made are captured in
the meta-aws-ewaol-efi-boot-changes.txt file attached above.
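
For reference, the overall shape of those changes is roughly as follows (file names, titles and option strings here are illustrative, not the exact contents of the attachment).  The systemd-boot entry (e.g. loader/entries/xen.conf) chain-loads the Xen EFI image:

    title  Xen 4.16.1 (dom0 Linux)
    efi    /xen-4.16.1.efi

and the Xen config file placed next to xen-4.16.1.efi points Xen at the dom0 kernel:

    [global]
    default=xen

    [xen]
    options=console=dtuart dom0_mem=2048M
    kernel=vmlinuz console=hvc0 root=/dev/nvme0n1p2 ro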

         The next step was comparing against the AVA Xen output that was working, and we
noticed a few differences - the AVA build did enable the ACPI and UNSUPPORTED
kconfig settings whereas the AWS Xen build did not.  So, we tried again to bring up
another EC2 metal instance using the same AMI as before and utilized the AVA
Xen EFI image instead, with the same Xen config file.  The result was the same - a
"bricked" instance.

         We will likely try to use the entire AVA flow on AWS Graviton next as 
it is
using GRUB 2 instead of systemd-boot and we hope to maybe extend or enable
some of the debug output during boot.  The AWS EC2 instances have a "serial
console", but we have yet to see any output on this console prior to Linux boot 
logs
- no success in getting EC2 serial output during EFI booting.

That's interesting. The documentation for AWS [1] suggests that the logs from 
boot
should be seen. They even have a page for troubleshooting using GRUB [2].

I just launched a c6g.metal and I could access the serial console but then it 
didn't
work across reboot.

I have tried a c6g.medium and the serial was working across reboot (I could see
some logs). So I wonder whether there is a missing configuration for the serial
console on baremetal?


         We have had a call and some email exchanges with AWS on this topic
(Luke Harvey, Jeremy Dahan, Robert DeOliveira, and Azim Siddique) and they said
there have been multiple virtualization solutions successfully booted on 
Graviton 2
metal instances, so they felt that Xen should be useable once we figured out
configuration / boot details.  They provided some guidance on how we might go about
some more exploration here, but nothing really specific to supporting Xen.

To be honest, without a properly working serial console, it is going to be very
difficult to debug any issue in Xen.

Right now, it is unclear whether Xen has output anything. If we can confirm the
serial console works as intended and there are still no logs, then I would suggest
enabling earlyprintk in Xen. For your Graviton2, I think the following lines in
xen/.config should do the trick:

CONFIG_DEBUG=y
CONFIG_EARLY_UART_CHOICE_PL011=y
CONFIG_EARLY_UART_PL011=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_UART_BASE_ADDRESS=0x83e00000
CONFIG_EARLY_UART_PL011_BAUD_RATE=115200


         I have attached the following files for reference:

     * meta-aws-ewaol.patch - patch to AWS EWAOL repo found at
https://github.com/aws4embeddedlinux/meta-aws-ewaol
     * meta-aws-ewaol-efi-boot-changes.txt - Description of EFI related
changes made to AWS EWAOL EFI partition in attempt to boot Xen
     * ava.xen.config - config file for Xen build for AVA using EWAOL
virtualization build
     * aws.xen.config - config file for Xen build for AWS using EWAOL
virtualization build
     * xen-4.16.1.cfg - Xen config file placed in root of EFI boot
partition alongside xen-4.16.1.efi image

May I ask why you are using 4.16.1 rather than 4.17? In general I would
recommend using the latest stable version or even staging (the on-going
development branch) for bring-up because we don't always backport everything to
the stable branches. So a bug may have been fixed in a newer revision.

That said, skimming through the logs, I couldn't spot any patches that may help 
on
Graviton 2.

Best regards,

[1]
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-serial-console.html
[2]
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/grub.html


Dan Driscoll
Distinguished Engineer
Siemens DISW - Embedded Platform Solutions

--
Julien Grall


--
Julien Grall



 

