
Re: Xen 4.18rc/ARM64 on Raspberry Pi 4B: Traffic in DomU crashing Dom0 when using VLANs



On Fri, Jan 19, 2024 at 8:32 PM Elliott Mitchell <ehem+xen@xxxxxxx> wrote:
>
> On Sun, Jan 14, 2024 at 10:54:24PM +0100, Paul Leiber wrote:
> >
> > > On 22.10.2023 at 07:42, Paul Leiber wrote:
> > > > On 13.10.2023 at 20:56, Paul Leiber wrote:
> > >> Hi Xen developers list,
> > >>
> > >> TL;DR:
> > >> ------
> > >>
> > >> Causing certain web server traffic on a secondary VLAN on Raspberry Pi
> > >> under vanilla Debian/UEFI in combination with Xen leads to complete
> > >> system reboot (watchdog triggering for Dom0). Other strange things are
> > >> happening.
> > >>
> > >> Description:
> > >> ----------
> > >>
> > >> I recently set up Xen (self compiled, Version 4.18-rc) on a Raspberry
> > >> Pi 4B (on vanilla Debian Bookworm, UEFI boot mode). Until some time
> > >> ago, everything worked well with Dom0, one DomU and one bridge.
> > >>
> > >> Then I wanted to actually make use of the virtualization and started
> > >> to set up a second Debian Bookworm DomU (using xen-create-image) for
> > >> monitoring my systems with zabbix (a webserver based system monitoring
> > >> solution). The bridge used for this setup was the device bridging the
> > >> hardware NIC. I installed zabbix, set it up, and everything went well,
> > >> I could access the web interface without any problem.
> > >>
> > >> Then I set up VLANs (initially using VLAN numbers 1 and 2) to separate
> > >> network traffic between the DomUs. I made the existing device bridge
> > >> VLAN 1 (bridge 1) and created a secondary device for bridging VLAN 2
> > >> (bridge 2). Using only bridge 1 / VLAN 1 everything works well, I can
> > >> access the zabbix web interface without any noticeable issue. After
> > >> switching the zabbix DomU to VLAN 2 / bridge 2, everything seemingly
> > >> keeps on working well, I can ping different devices in my network from
> > >> the zabbix DomU and vice versa, I can ssh into the machine.
> > >>
> > >> However, as soon as I remotely access the zabbix web interface, the
> > >> complete system (DomUs and Dom0) becomes unresponsive and reboots
> > >> after some time (usually seconds, sometimes 1-2 minutes). The reboot
> > >> is reliably reproducible.
> > >>
> > >> I didn't see any error message in any log (zabbix, DomU syslog, Dom0
> > >> syslog) except for the following lines immediately before the system
> > >> reboots on the Xen serial console:
> > >>
> > >> (XEN) Watchdog timer fired for domain 0
> > >> (XEN) Hardware Dom0 shutdown: watchdog rebooting machine
> > >>
> > >> As soon as I change the bridge to bridge 1 (with or without VLAN
> > >> setup), the web interface is accessible again after booting the zabbix
> > >> DomU, no reboots.
> > >>
> > >> So I assume that causing specific traffic on the virtual NIC when
> > >> using a VLAN setup with more than one VLAN under Xen makes the Dom0
> > >> system hard crash. Of course, there might be other causes that I'm not
> > >> aware of, but to me, this seems to be the most likely explanation
> > >> right now.
> > >>
> > >> What I tried:
> > >> -------------
> > >>
> > >> 1. I changed the VLAN numbers. First to 101, 102, 103 etc. This was
> > >> when I noticed another strange thing: VLANs with numbers >99 simply
> > >> don't work on my Raspberry Pi under Debian, with or without Xen. VLAN
> > >> 99 works, VLAN 100 (or everything else >99 that I tried) doesn't work.
> > >> If I choose a number >99, the VLAN is not configured, "ip a" doesn't
> > >> list it. Other Debian systems on x64 architecture don't show this
> > >> behavior, there, it was no problem to set up VLANs > 99. Therefore,
> > >> I've changed the VLANs to 10, 20, 30 etc., which worked. But it didn't
> > >> solve the initial problem of the crashing Dom0 and DomUs.
> > >>
> > >> 2. Different bridge options, without noticeable effect:
> > >> bridge_stp off      # don't use STP (spanning tree protocol)
> > >> bridge_waitport 0   # don't wait for port to be available
> > >> bridge_fd 0         # no forward delay
> > >>
> > >> 3. Removing IPv6: No noticeable effect.
> > >>
> > >> 4. Network traffic analysis: Now, here it becomes _really_ strange. I
> > >> started tcpdumps on Dom0, and depending on which interface/bridge
> > >> traffic was logged, the problem went away, meaning, the DomU was
> > >> running smoothly for hours, even when accessing the zabbix web
> > >> interface. Stopping the log makes the system crash reproducibly (as
> > >> above, after seconds up to 1-2 minutes) if I access the zabbix web
> > >> interface.
> > >>
> > >> Logging enabcm6e4ei0 (NIC): no crashes
> > >> Logging enabcm6e4ei0.10 (VLAN 10): instant crash
> > >> Logging enabcm6e4ei0.20 (VLAN 20): no crashes
> > >> Logging xenbr0 (on VLAN 10): instant crash
> > >> Logging xenbr1 (on VLAN 20): no crashes
> > >>
> > >> I am clinging to the thought that there must be a rational explanation
> > >> for why logging the traffic on certain interfaces/bridges should avoid
> > >> the crash of the complete system, while logging other
> > >> interfaces/bridges doesn't. I myself can't think of one.
> > >>
> > >> I checked the dumps of enabcm6e4ei0.10 and xenbr0 (where the system
> > >> crashes) with wireshark, nothing sticks out to me (but I am really no
> > >> expert in analyzing network traffic). Dumps can be provided.
> > >>
> > >> 5. Watchdog: I tried to dig deeper into the cause for the watchdog
> > >> triggering. However, I didn't find any useful documentation on the web
> > >> on how the watchdog works or how to enable logging.
> > >>
> > >> 6. Eliminating Xen as cause: I booted the Debian system (which in Xen
> > >> setup would be Dom0) without Xen and set it up to use the VLAN 20
> > >> bridge (the same that leads to a reboot when using it in the DomU) as
> > >> primary network interface. Everything seemed to be working, I could
> > >> download large files from the internet without any problem. Setting up
> > >> Zabbix on the base Debian system showed that the same setup (VLANs 10
> > >> and 20, bridges 1 and 2, using bridge 2 as interface for Zabbix)
> > >> without Xen is working reliably, no reboots. This points to some Xen
> > >> related component being the root cause, I think.
> > >>
> > >> 7. Eliminating Apache as root cause: Reloading the Apache starting
> > >> page hosted on DomU several times per second didn't lead to a reboot.
> > >>
> > >> 8. Recompiling Xen: Independent of which Xen master branch version I
> > >> was using (all 4.18), the behavior was the same. I didn't get Xen
> > >> working on ARM64/UEFI in version 4.17.
> > >>
> > >> Current situation:
> > >> ------------------
> >
> > (XEN) d3v0 Unhandled SMC/HVC: 0x84000050
> > (XEN) d3v0 Unhandled SMC/HVC: 0x8600ff01
> > (XEN) d3v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER0
> > (XEN) common/grant_table.c:1909:d3v0 Expanding d3 grant table from 1 to 2 frames
> > (XEN) common/grant_table.c:1909:d3v0 Expanding d3 grant table from 2 to 3 frames
> > (XEN) common/grant_table.c:1909:d3v0 Expanding d3 grant table from 3 to 4 frames
> >
> > >>
> > >> I am out of ideas what to do next. Everything that was recommended to
> > >> me on xen-users didn't lead to significant insight or solve the problem.
> > >>
> > >> I'd appreciate any hints how to troubleshoot this and/or how to
> > >> proceed otherwise.
> > >
> > > O.k., let's try to break that issue down.
> > >
> > > Firstly, how can I get more information on why the Xen watchdog
> > > triggers? Is there documentation? Are there any logs? I couldn't find
> > > anything useful with my search skills.
> > >
> >
> > After some delay, I have picked up the Raspberry Pi again, built Xen
> > 4.19-unstable, with the same result: Reboot of the complete system after
> > the Dom0 watchdog triggering when accessing the Zabbix content on a
> > webserver on DomU.
> >
> > I still would like to find out what's going wrong here, but I have no
> > idea what to do. I'd appreciate any hint.
> >
> > Not knowing if it helps, I added Xen logs from boot until Dom0
> > crash/reboot below.
> >
> > Loading Xen xen ...
> > Loading Linux 6.1.0-17-arm64 ...
> > Loading initial ramdisk ...
> > Using modules provided by bootloader in FDT
> > Xen 4.19-unstable (c/s Fri Jan 12 11:54:31 2024 +0000 git:1ec3fe1f66-dirty) EFI loader
> >   Xen 4.19-unstable
> > (XEN) Xen version 4.19-unstable (root@xxxxxxxxxxxxxxxxxxxx) (gcc (Debian 12.2.0-14) 12.2.0) debug=y Sun Jan 14 21:46:34 CET 2024
> > (XEN) Latest ChangeSet: Fri Jan 12 11:54:31 2024 +0000 git:1ec3fe1f66-dirty
> > (XEN) build-id: babb03cb6107fc46f7d8969142ccd6772a1133c3
> > (XEN) Console output is synchronous.
> > (XEN) Processor: 00000000410fd083: "ARM Limited", variant: 0x0, part 0xd08,rev 0x3
> > (XEN) 64-bit Execution:
> > (XEN)   Processor Features: 0000000000002222 0000000000000000
> > (XEN)     Exception Levels: EL3:64+32 EL2:64+32 EL1:64+32 EL0:64+32
> > (XEN)     Extensions: FloatingPoint AdvancedSIMD
> > (XEN)   Debug Features: 0000000010305106 0000000000000000
> > (XEN)   Auxiliary Features: 0000000000000000 0000000000000000
> > (XEN)   Memory Model Features: 0000000000001124 0000000000000000
> > (XEN)   ISA Features:  0000000000010000 0000000000000000
> > (XEN) 32-bit Execution:
> > (XEN)   Processor Features: 0000000000000131:0000000000011011
> > (XEN)     Instruction Sets: AArch32 A32 Thumb Thumb-2 Jazelle
> > (XEN)     Extensions: GenericTimer Security
> > (XEN)   Debug Features: 0000000003010066
> > (XEN)   Auxiliary Features: 0000000000000000
> > (XEN)   Memory Model Features: 0000000010201105 0000000040000000
> > (XEN)                          0000000001260000 0000000002102211
> > (XEN)   ISA Features: 0000000002101110 0000000013112111 0000000021232042
> > (XEN)                 0000000001112131 0000000000011142 0000000000010001
> > (XEN) Using SMC Calling Convention v1.2
> > (XEN) Using PSCI v1.1
> > (XEN) ACPI: GICC (acpi_id[0x0000] address[0xff842000] MPIDR[0x0] enabled)
> > (XEN) ACPI: GICC (acpi_id[0x0001] address[0xff842000] MPIDR[0x1] enabled)
> > (XEN) ACPI: GICC (acpi_id[0x0002] address[0xff842000] MPIDR[0x2] enabled)
> > (XEN) ACPI: GICC (acpi_id[0x0003] address[0xff842000] MPIDR[0x3] enabled)
> > (XEN) 4 CPUs enabled, 4 CPUs total
> > (XEN) SMP: Allowing 4 CPUs
> > (XEN) enabled workaround for: ARM erratum 1319537
> > (XEN) Generic Timer IRQ: phys=30 hyp=26 virt=27 Freq: 54000 KHz
> > (XEN) GICv2 initialization:
> > (XEN)         gic_dist_addr=00000000ff841000
> > (XEN)         gic_cpu_addr=00000000ff842000
> > (XEN)         gic_hyp_addr=00000000ff844000
> > (XEN)         gic_vcpu_addr=00000000ff846000
> > (XEN)         gic_maintenance_irq=25
> > (XEN) GICv2: 256 lines, 4 cpus, secure (IID 0200143b).
> > (XEN) XSM Framework v1.0.1 initialized
> > (XEN) Initialising XSM SILO mode
> > (XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
> > (XEN) Initializing Credit2 scheduler
> > (XEN)  load_precision_shift: 18
> > (XEN)  load_window_shift: 30
> > (XEN)  underload_balance_tolerance: 0
> > (XEN)  overload_balance_tolerance: -3
> > (XEN)  runqueues arrangement: socket
> > (XEN)  cap enforcement granularity: 10ms
> > (XEN) load tracking window length 1073741824 ns
> > (XEN) Allocated console ring of 32 KiB.
> > (XEN) CPU0: Guest atomics will try 16 times before pausing the domain
> > (XEN) Bringing up CPU1
> > (XEN) CPU1: Guest atomics will try 16 times before pausing the domain
> > (XEN) CPU 1 booted.
> > (XEN) Bringing up CPU2
> > (XEN) CPU2: Guest atomics will try 13 times before pausing the domain
> > (XEN) CPU 2 booted.
> > (XEN) Bringing up CPU3
> > (XEN) CPU3: Guest atomics will try 16 times before pausing the domain
> > (XEN) CPU 3 booted.
> > (XEN) Brought up 4 CPUs
> > (XEN) I/O virtualisation disabled
> > (XEN) P2M: 44-bit IPA with 44-bit PA and 8-bit VMID
> > (XEN) P2M: 4 levels with order-0 root, VTCR 0x0000000080043594
> > (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
> > (XEN) Initializing Credit2 scheduler
> > (XEN)  load_precision_shift: 18
> > (XEN)  load_window_shift: 30
> > (XEN)  underload_balance_tolerance: 0
> > (XEN)  overload_balance_tolerance: -3
> > (XEN)  runqueues arrangement: socket
> > (XEN)  cap enforcement granularity: 10ms
> > (XEN) load tracking window length 1073741824 ns
> > (XEN) Adding cpu 0 to runqueue 0
> > (XEN)  First cpu on runqueue, activating
> > (XEN) Adding cpu 1 to runqueue 0
> > (XEN) Adding cpu 2 to runqueue 0
> > (XEN) Adding cpu 3 to runqueue 0
> > (XEN) alternatives: Patching with alt table 00000a00002ee0b0 -> 00000a00002ef250
> > (XEN) CPU2 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
> > (XEN) CPU1 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
> > (XEN) CPU3 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
> > (XEN) CPU0 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
> > (XEN) *** LOADING DOMAIN 0 ***
> > (XEN) Loading d0 kernel from boot module @ 0000000030ef7000
> > (XEN) Loading ramdisk from boot module @ 000000002ee6d000
> > (XEN) Allocating 1:1 mappings totalling 1024MB for dom0:
> > (XEN) BANK[0] 0x00000040000000-0x00000080000000 (1024MB)
> > (XEN) Grant table range: 0x0000002eceb000-0x0000002ed2b000
> > (XEN) Allocating PPI 16 for event channel interrupt
> > (XEN) Loading zImage from 0000000030ef7000 to 0000000040000000-0000000041f1c7c0
> > (XEN) Loading d0 initrd from 000000002ee6d000 to 0x0000000048200000-0x000000004a288c22
> > (XEN) Loading d0 DTB to 0x0000000048000000-0x00000000480002c3
> > (XEN) Initial low memory virq threshold set at 0x4000 pages.
> > (XEN) Scrubbing Free RAM in background
> > (XEN) Std. Loglevel: All
> > (XEN) Guest Loglevel: All
> > (XEN) ***************************************************
> > (XEN) WARNING: CONSOLE OUTPUT IS SYNCHRONOUS
> > (XEN) This option is intended to aid debugging of Xen by ensuring
> > (XEN) that all output is synchronously delivered on the serial line.
> > (XEN) However it can introduce SIGNIFICANT latencies and affect
> > (XEN) timekeeping. It is NOT recommended for production use!
> > (XEN) ***************************************************
> > (XEN) 3... 2... 1...
> > (XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
> > (XEN) Freed 376kB init memory.
> > (XEN) d0v0 Unhandled SMC/HVC: 0x84000050
> > (XEN) d0v0 Unhandled SMC/HVC: 0x8600ff01
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER4
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER8
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER12
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER16
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER20
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER24
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER28
> > (XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER0
> > (XEN) d0v1: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER0
> > (XEN) d0v2: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER0
> > (XEN) d0v3: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER0
> > (XEN) d1v0 Unhandled SMC/HVC: 0x84000050
> > (XEN) d1v0 Unhandled SMC/HVC: 0x8600ff01
> > (XEN) d1v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER0
> > (XEN) common/grant_table.c:1909:d1v0 Expanding d1 grant table from 1 to 2 frames
> > (XEN) gnttab_mark_dirty not implemented yet
> > (XEN) d2v0 Unhandled SMC/HVC: 0x84000050
> > (XEN) d2v0 Unhandled SMC/HVC: 0x8600ff01
> > (XEN) d2v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER0
> > (XEN) common/grant_table.c:1909:d2v0 Expanding d2 grant table from 1 to 2 frames
> > (XEN) common/grant_table.c:1909:d2v0 Expanding d2 grant table from 2 to 3 frames
> > (XEN) common/grant_table.c:1909:d2v0 Expanding d2 grant table from 3 to 4 frames
> > (XEN) Watchdog timer fired for domain 0
> > (XEN) Hardware Dom0 shutdown: watchdog rebooting machine
>
> I'm unsure whose attention to draw to this report.
>
> This might be a scheduler issue since the watchdog timer is triggering.
>
> This might be an ACPI issue as ACPI is in use here.
>
> This might be an ARM Linux kernel issue.
>
> In the end this is someone running into trouble with Xen on an ARM
> device.  Yet despite bringing up the issue, he hasn't gotten any help...

Hey Elliott,

Thanks for raising the visibility of this.  I'm not familiar with ARM,
but if I were investigating this I'd try to figure out what the
"unhandled" error messages are.  "gnttab_mark_dirty not implemented
yet" looks pretty sus too, and also sounds like it might be something
ARM-specific.
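
As a rough starting point for the two "Unhandled SMC/HVC" values, here
is a minimal Python sketch that splits a function ID into the bit
fields defined by the Arm SMC Calling Convention.  The field layout is
from that spec; the labels in KNOWN_IDS are my own reading of these two
IDs and should be treated as assumptions to verify, not as something
the log states.  Note that the same two IDs show up once for every
domain as it boots (d0, d1, d2, d3), which suggests routine guest
probing rather than something tied to the crash.

```python
# Decode an Arm SMCCC function identifier into its bit fields.
# Layout: bit 31 = fast call, bit 30 = 64-bit convention,
# bits 29:24 = owning service, bits 15:0 = function number.

OWNERS = {
    0: "Arm Architecture",
    1: "CPU Service",
    2: "SiP Service",
    3: "OEM Service",
    4: "Standard Secure Service",
    5: "Standard Hypervisor Service",
    6: "Vendor Specific Hypervisor Service",
}

# Assumption: labels for the two IDs seen in the log; verify before relying on them.
KNOWN_IDS = {
    0x84000050: "TRNG_VERSION (firmware random number interface probe)",
    0x8600FF01: "vendor hypervisor call UID query",
}


def decode_smccc(function_id):
    """Return the SMCCC bit fields of function_id as a dict."""
    return {
        "fast_call": bool((function_id >> 31) & 1),
        "smc64": bool((function_id >> 30) & 1),
        "owner": OWNERS.get((function_id >> 24) & 0x3F, "other"),
        "function": function_id & 0xFFFF,
        "guess": KNOWN_IDS.get(function_id, "unknown"),
    }


for fid in (0x84000050, 0x8600FF01):
    print(hex(fid), decode_smccc(fid))
```

If those labels are right, both calls would be a guest kernel asking
"is this service available?" and getting told no, which would make them
noise rather than a lead.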

I don't see anything suspicious WRT the scheduler, but a simple way to
test that would be to set the scheduler to credit1 and see if that
changes things.
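
The scheduler is chosen at boot with the "sched=" option on the Xen
command line ("sched=credit" selects credit1); how you add that depends
on your bootloader setup.  To confirm the switch took effect, look for
the "Using scheduler:" line in the console log.  A small Python sketch
for pulling it out of a saved log follows; the function name and regex
are only illustrative.

```python
import re

# Matches the hypervisor boot line, for example:
# "(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)"
SCHED_LINE = re.compile(
    r"\(XEN\) Using scheduler: (?P<name>.+) \((?P<opt>\w+)\)\s*$"
)


def active_scheduler(console_log):
    """Return the scheduler option name from a Xen console log, or None."""
    for line in console_log.splitlines():
        match = SCHED_LINE.search(line)
        if match:
            return match.group("opt")
    return None


# Line taken from the log quoted above (mail quoting left in place on purpose).
sample = "> > (XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)"
print(active_scheduler(sample))
```

If the watchdog still fires with credit1 reported there, the scheduler
is probably not the culprit.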

 -George



 

