
Re: [Xen-devel] Question about running Xen on NVIDIA Jetson-TK1



Hi, Meng:

Julien is correct: a coworker and I are working on support for Tegra
SoCs, and we've made pretty good progress. There's work yet to be
done, but we have dom0 and guests booting on the Jetson TK1, Jetson
TX1, and the Google Pixel C. Unfortunately, our employer has to take
some time to verify that everything's okay to be open-sourced, so I
can't send out our work-in-progress just yet. We hope to have an RFC
patch set out soon!

There are two main hardware differences that cause Tegra SoCs to have
trouble with Xen:

- The primary interrupt controller for those systems isn't a single
GIC, as Xen expects. Instead, there's an NVIDIA Legacy Interrupt
Controller (LIC, or ICTLR) that gates all peripheral interrupts before
passing them to a standard GICv2. This interrupt controller has to be
programmed to ensure Xen can receive interrupts from the hardware
(e.g. serial), programmed to ensure that interrupts for pass-through
devices are correctly unmasked, and virtualized so dom0 can program
the "sections" related to interrupts not being routed to Xen or to a
domain for hardware passthrough.

- The serial controller on the Tegra SoCs doesn't behave in the same
way as most NS16550-compatibles; it actually adheres to the NS16550
spec a little more rigidly than most compatible controllers. A
coworker (Chris Patterson, cc'd) figured out what was going on; from
what I understand, most 16550s generate the "transmit ready" interrupt
once, when the device first can accept new FIFO entries. Both the
original 16550 and the Tegra implementation generate the "transmit
ready" interrupt /continuously/ while there's space available in the
FIFO, swamping the CPU with a constant stream of interrupts.
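
One common way to cope with a "transmit ready" interrupt that keeps
firing while the FIFO has room is to leave it enabled only while there
is data queued. The sketch below simulates that idea; it is not
necessarily what our patch set does. The IER bit (ETBEI, 0x02) is the
standard 16550 one; the struct, the function names, the ring size and
the FIFO depth are all hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UART_IER_ETBEI 0x02u /* standard 16550 IER bit: TX-ready irq enable */
#define TX_RING_SIZE   64u   /* hypothetical software buffer size */
#define TX_FIFO_DEPTH  16u   /* assumed hardware FIFO depth */

/* Simulated UART: 'ier' stands in for the memory-mapped IER register. */
struct sim_uart {
    uint8_t ier;
    uint8_t ring[TX_RING_SIZE];
    size_t head;  /* total bytes queued so far */
    size_t tail;  /* total bytes consumed so far */
    size_t sent;  /* bytes handed to the (simulated) FIFO */
};

/* Queue one byte and make sure the TX-ready interrupt is enabled. */
static int uart_queue(struct sim_uart *u, uint8_t c)
{
    assert(u != NULL);
    if (u->head - u->tail == TX_RING_SIZE)
        return -1; /* software buffer full */
    u->ring[u->head++ % TX_RING_SIZE] = c;
    u->ier |= UART_IER_ETBEI;
    return 0;
}

/* TX-ready handler: fill the FIFO, then mask the interrupt if idle. */
static void uart_tx_interrupt(struct sim_uart *u)
{
    size_t room = TX_FIFO_DEPTH;

    assert(u != NULL);
    while (room > 0 && u->tail != u->head) {
        /* Real code would write this byte to the THR register. */
        (void)u->ring[u->tail++ % TX_RING_SIZE];
        u->sent++;
        room--;
    }
    /* Nothing left to send: stop the otherwise continuous interrupt. */
    if (u->tail == u->head)
        u->ier &= (uint8_t)~UART_IER_ETBEI;
}
```

With 20 bytes queued, the first call to uart_tx_interrupt() moves 16
and leaves the interrupt enabled; the second moves the remaining 4 and
masks it, so an idle UART no longer interrupts the CPU.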

What you're seeing is likely a symptom of the first difference. In
your logs, you see messages that indicate Xen is having trouble
correctly routing IRQs that are parented by the legacy interrupt
controller:

> irq 0 not connected to primary controller.Connected to 
> /interrupt-controller@60004000
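
As background for that message, here is a minimal simulation of what
"gating" means: a peripheral interrupt only reaches the GIC if its
enable bit is set in the LIC. This is a sketch, not driver code. The
layout (32 interrupt lines per bank, five banks) is an assumption you
should check against the Tegra TRM, and every name in it (sim_lic,
lic_unmask, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LIC_IRQS_PER_BANK 32u /* assumed: 32 interrupt lines per bank */
#define LIC_NUM_BANKS     5u  /* assumed bank count; check the TRM */

/* Simulated per-bank enable state; hardware uses MMIO registers. */
struct sim_lic {
    uint32_t cpu_enable[LIC_NUM_BANKS];
};

/* Which bank a peripheral interrupt number falls into. */
static unsigned int lic_bank(unsigned int irq)
{
    return irq / LIC_IRQS_PER_BANK;
}

/* Bit within that bank's enable word. */
static uint32_t lic_bit(unsigned int irq)
{
    return UINT32_C(1) << (irq % LIC_IRQS_PER_BANK);
}

/* Let the interrupt through to the GIC. */
static void lic_unmask(struct sim_lic *lic, unsigned int irq)
{
    assert(lic_bank(irq) < LIC_NUM_BANKS);
    lic->cpu_enable[lic_bank(irq)] |= lic_bit(irq);
}

/* Stop forwarding the interrupt to the GIC. */
static void lic_mask(struct sim_lic *lic, unsigned int irq)
{
    assert(lic_bank(irq) < LIC_NUM_BANKS);
    lic->cpu_enable[lic_bank(irq)] &= ~lic_bit(irq);
}

/* True if the interrupt would currently reach the GIC. */
static bool lic_forwards(const struct sim_lic *lic, unsigned int irq)
{
    assert(lic_bank(irq) < LIC_NUM_BANKS);
    return (lic->cpu_enable[lic_bank(irq)] & lic_bit(irq)) != 0;
}
```

Until something calls the equivalent of lic_unmask() for a given line,
the GIC never sees that interrupt, no matter how the GIC itself is
configured.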

The issue here is that Xen is currently explicitly opting not to route
legacy-interrupt-controller interrupts, as they don't belong to the
primary GIC. As a result, these interrupts never make it to dom0. The
logic that needs to be tweaked is here:

http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/arm/domain_build.c;h=00dc07af637b67153d33408f34331700dff84f93;hb=HEAD#l1137

We re-write this logic in our forthcoming patch-set to be more
general. As an interim workaround, you might opt to rewrite that logic
so LIC interrupts (which have an interrupt-parent compatible with
"tegra124-ictlr", in your case) can be routed by Xen, as well. Off the
top of my head, a workaround might look like:

     /*
      * Don't map IRQ that have no physical meaning
      * ie: IRQ whose controller is not the GIC
      */
-    if ( rirq.controller != dt_interrupt_controller )
+    if ( (rirq.controller != dt_interrupt_controller) &&
+         !dt_device_is_compatible(rirq.controller, "tegra124-ictlr") )

Of course, that's off-the-cuff code I haven't tried, but hopefully it
should help to get you started.
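
To spell out the intended decision, here is a self-contained model of
the check. The types and helpers are hypothetical stand-ins, not Xen's
struct dt_device_node or dt_device_is_compatible(). One assumption to
verify: if compatible matching compares whole strings, the string
probably needs the vendor prefix, i.e. "nvidia,tegra124-ictlr" as used
in upstream Linux device trees, so the model uses that form. Check the
compatible property in your own device tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device-tree node. */
struct mock_node {
    const char *const *compatible; /* NULL-terminated list of strings */
};

/* Whole-string match against each entry of the compatible list. */
static bool node_is_compatible(const struct mock_node *n, const char *compat)
{
    assert(n != NULL && compat != NULL);
    for (const char *const *c = n->compatible; c && *c; c++)
        if (strcmp(*c, compat) == 0)
            return true;
    return false;
}

/*
 * Route the IRQ if its parent is the primary GIC, or if the parent is
 * the Tegra legacy interrupt controller sitting in front of the GIC.
 * Everything else is still skipped, as in the original check.
 */
static bool should_route_irq(const struct mock_node *irq_parent,
                             const struct mock_node *gic)
{
    assert(irq_parent != NULL && gic != NULL);
    return irq_parent == gic ||
           node_is_compatible(irq_parent, "nvidia,tegra124-ictlr");
}
```

Note that this only covers the routing decision; the LIC itself still
has to be programmed so the interrupt is unmasked.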

--Kyle

On Mon, May 16, 2016 at 3:39 PM, Meng Xu <xumengpanda@xxxxxxxxx> wrote:
> Hi Julien,
>
> On Mon, May 16, 2016 at 1:33 PM, Julien Grall <julien.grall@xxxxxxx> wrote:
>> (CC Kyle who is also working on Tegra?)
>>
>> Hi Meng,
>>
>> Many people are working on the Nvidia platform and hitting different
>> issues :/. I have CCed another person who, IIRC, is also working on it.
>
> Sure. It's good to know others are also interested in this platform.
> It will be more useful to fix it... :-)
>
>>
>> On 16/05/16 17:33, Meng Xu wrote:
>>>
>>> On Mon, May 16, 2016 at 7:33 AM, Julien Grall <julien.grall@xxxxxxx>
>>> wrote:
>>>>
>>>>
>>>> On 15/05/16 20:35, Meng Xu wrote:
>>>>>
>>>>>
>>>>> I'm trying to run Xen on the NVIDIA Jetson TK1 board. (Right now, Xen
>>>>> does not support the Jetson board officially. But I think it would be
>>>>> very interesting and useful to see it happen, since it has a GPU
>>>>> inside, which is quite popular in automotive.)
>>>>>
>>>>> I have run into a problem booting dom0 in the Xen environment. I want
>>>>> to debug the issue and maybe fix it, but I'm not sure how to debug it
>>>>> efficiently. I would really appreciate it if you could advise me a
>>>>> little on how to approach fixing the issue.
>>>>> :-)
>>>>>
>>>>> ---Below is the details----
>>>>>
>>>>> I noticed that Dushyant from IBM also tried to run Xen on the Jetson
>>>>> board. (http://www.gossamer-threads.com/lists/xen/devel/422519). I
>>>>> used the same Linux kernel (Jan Kiszka's development tree -
>>>>> http://git.kiszka.org/linux.git/, branch queues/assorted) and Ian's
>>>>> Xen repo with the hack for the Jetson board. I can see that the dom0
>>>>> kernel boots to some extent and then "stalls/spins" before it fully
>>>>> boots up.
>>>>>
>>>>> In order to figure out the possible issue, I booted the exact same
>>>>> Linux kernel natively on one CPU and collected the boot log in [1].
>>>>> I also booted the same Linux kernel as dom0 under Xen and collected
>>>>> the boot log in [2].
>>>>>
>>>>> In Xen environment, dom0 hangs after the following message
>>>>> [   10.541010] NET: Registered protocol family 10
>>>>> 6mip6: Mobile IPv6
>>>>> [   10.542510] mi
>>>>>
>>>>> In native environment, the kernel has the following log after
>>>>> initializing NET.
>>>>> [    2.934693] NET: Registered protocol family 10
>>>>> [    2.940611] mip6: Mobile IPv6
>>>>> [    2.943645] sit: IPv6 over IPv4 tunneling driver
>>>>> [    2.951303] NET: Registered protocol family 17
>>>>> [    2.955800] NET: Registered protocol family 15
>>>>> [    2.960257] can: controller area network core (rev 20120528 abi 9)
>>>>> [    2.966617] NET: Registered protocol family 29
>>>>> [    2.971098] can: raw protocol (rev 20120528)
>>>>> [    2.975384] can: broadcast manager protocol (rev 20120528 t)
>>>>> [    2.981088] can: netlink gateway (rev 20130117) max_hops=1
>>>>> [    2.986734] Bluetooth: RFCOMM socket layer initialized
>>>>> [    2.991979] Bluetooth: RFCOMM ver 1.11
>>>>> [    2.995757] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
>>>>> [    3.001109] Bluetooth: BNEP socket layer initialized
>>>>> [    3.006089] Bluetooth: HIDP (Human Interface Emulation) ver 1.2
>>>>> [    3.012052] Bluetooth: HIDP socket layer initialized
>>>>> [    3.017894] Registering SWP/SWPB emulation handler
>>>>> [    3.029675] tegra-pcie 1003000.pcie-controller: 2x1, 1x1
>>>>> configuration
>>>>> [    3.036586] +3.3V_SYS: supplied by +VDD_MUX
>>>>> [    3.040857] +3.3V_LP0: supplied by +3.3V_SYS
>>>>> [    3.045509] +1.35V_LP0(sd2): supplied by +5V_SYS
>>>>> [    3.050201] +1.05V_RUN_AVDD: supplied by +1.35V_LP0(sd2)
>>>>> [    3.057131] tegra-pcie 1003000.pcie-controller: probing port 0, using
>>>>> 2 lanes
>>>>> [    3.066479] tegra-pcie 1003000.pcie-controller: Slot present pin
>>>>> change, signature: 00000008
>>>>>
>>>>> I'm suspecting that my dom0 kernel hangs when it tries to initialize
>>>>> "can: controller area network core ". However, from Dushyant's post at
>>>>> http://www.gossamer-threads.com/lists/xen/devel/422519,  it seems
>>>>> Dushyant's dom0 kernel hangs when it tries to initialize pci_bus. (The
>>>>> linux config I used may be different from Dushyant's. That could be
>>>>> the reason.)
>>>>>
>>>>> Right now, the system just hangs and has no output indicating what the
>>>>> problem could be. Although there are a lot of error messages before
>>>>> the system hangs, I'm not sure whether I should start by solving all
>>>>> of them. Maybe some errors can be ignored?
>>>>>
>>>>> My questions are:
>>>>> 1) Do you have any suggestions on how to get more information about
>>>>> why dom0 hangs?
>>>>
>>>>
>>>>
>>>> Have you tried to dump the registers using the Xen console (Ctrl-x 3
>>>> times, then 0) to see where it gets stuck?
>>>
>>>
>>>
>>> I tried typing Ctrl-x 3 times and then 0, but nothing happens... :-(
>>> Just to confirm: once the system gets stuck, I type Ctrl-x three
>>> times directly on the host's screen. Is that correct?
>>
>>
>> Sorry, I forgot that the default way to switch console is Ctrl-a three
>> times.
>>
>> On my configuration I modified the default character to avoid issues
>> with screen.
>>
>
> Ah-ha, by typing "Ctrl-a, a" three times, I can switch to the Xen
> console now. :-)
>
> Thank you for the clarification. :-)
>
>>>
>>> Maybe the serial console is not correctly set up?
>>
>>
>> It's likely to be a problem with the serial drivers in driver. Can you try
>> Ctrl-a before the kernel boots?
>>
>>> The serial console configuration I used is as follows, could you have
>>> a quick look to see if it's because I configure the serial
>>> incorrectly?
>>
>>
>> I am not familiar with the Nvidia board. If the serial configuration is
>> working for baremetal, then it should work for Xen.
>>
>>>
>>> I used screen program to attach to the serial port.
>>> The command I used is $screen /dev/ttyUSB0 115200n8 on the host machine.
>>>
>>> On the board, I set up the device tree's /chosen node as follows:
>>> #
>>> fdt print /chosen
>>>
>>> chosen {
>>>
>>>          xen,xen-bootargs = "console=dtuart dtuart=serial0
>>> dom0_mem=512M loglvl=all guest_loglvl=all dom0_max_vcpus=1
>>> dom0_vcpus_pin maxcpus=1";
>>>
>>>          bootargs = "console=dtuart dtuart=serial0 dom0_mem=512M
>>> loglvl=all guest_loglvl=all dom0_max_vcpus=1 dom0_vcpus_pin
>>> maxcpus=1";
>>
>>
>> FYI, this property is not necessary.
>>
>>>          module {
>>>
>>>                  bootargs = "console=hvc0 console=tty1 earlyprintk=xen
>>> root=/dev/mmcblk0p1 rw rootwait";
>>
>>
>> Can you try to add "clk_ignore_unused" on the Linux command line?
>
> Yes. After trying it, I get some different, but more useful, log
> information.
>
> In the kernel booting log, it first shows this message:
>
> ---start of the message----
>
> [   10.607251] Waiting for root device /dev/mmcblk0p1...
>
> ... /* I omit some other unrelated message  */
>
> [    5.347354] sdhci-tegra 700b0400.sdhci: Got WP GPIO
>
> 3mmc0: Unknown controller version (3). You may experience problems.
>
> [    5.347464] mmc0: Unknown controller version (3). You may
> experience problems.
>
> sdhci-tegra 700b0400.sdhci: No vmmc regulator found
>
> [    5.347647] sdhci-tegra 700b0400.sdhci: No vmmc regulator found
>
> 3mmc0: Unknown controller version (3). You may experience problems.
>
> [    5.347933] mmc0: Unknown controller version (3). You may
> experience problems.
>
> sdhci-tegra 700b0600.sdhci: No vmmc regulator found
>
> [    5.348099] sdhci-tegra 700b0600.sdhci: No vmmc regulator found
>
> sdhci-tegra 700b0600.sdhci: No vqmmc regulator found
>
> [    5.348162] sdhci-tegra 700b0600.sdhci: No vqmmc regulator found
>
> 4mmc0: Invalid maximum block size, assuming 512 bytes
>
> [    5.348222] mmc0: Invalid maximum block size, assuming 512 bytes
>
> 6mmc0: SDHCI controller on 700b0600.sdhci [700b0600.sdhci] using ADMA 64-bit
>
> [    5.395208] mmc0: SDHCI controller on 700b0600.sdhci
> [700b0600.sdhci] using ADMA 64-bit
>
> 6usbcore: registered new interface driver usbhid
>
> ---end of the message----
>
> It seems that mmc0 is not correctly recognized by dom0.
>
> Later the dom0 kernel keeps printing this message:
>
> [  426.405136] mmc0: Timeout waiting for hardware interrupt.
>
> Dom0 actually hangs here, because it cannot read the eMMC device. :-(
>
> Is it because the device tree is not properly recreated by Xen?
> Do you happen to know how to fix this issue or have some idea about
> how to fix it?  I can have a look at it.
>
> BTW, dom0 didn't recognize the mmc1 controller either.
>
> Thank you very much for your time and help in this! :-)
>
> Best Regards,
>
> Meng
> -----------
> Meng Xu
> PhD Student in Computer and Information Science
> University of Pennsylvania
> http://www.cis.upenn.edu/~mengxu/


 

