
[Xen-users] Debugging sudden hangs



Hi list,
    We recently updated our system and started experiencing random
hangs. They happen, on average, once every 1.5 days (sometimes taking 2
days to occur, other times happening multiple times a day, roughly in
proportion to IO load).

    Before troubling the developers too much, I'd like to collect more
information. The problem is that the hangs occur without any
symptoms, crashes or panics. I've booted xen and dom0 with
"loglvl=all guest_loglvl=all" and "loglevel=10 debug initcall_debug"
respectively.
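
    As a sanity check that those arguments actually took effect after
the reboot, something like the sketch below could be used. It only
compares a command line string against the arguments quoted above; the
function and constant names are made up for illustration, and reading
/proc/cmdline covers the dom0 kernel only (the Xen command line would
have to be obtained separately, for example from the xen_commandline
field of "xl info" if your toolstack reports it).

```python
# Hypothetical helper names; only the boot arguments come from this mail.
XEN_EXPECTED = ["loglvl=all", "guest_loglvl=all"]
DOM0_EXPECTED = ["loglevel=10", "debug", "initcall_debug"]


def missing_boot_args(cmdline, expected):
    """Return the expected arguments that are absent from a command line string."""
    present = set(cmdline.split())
    return [arg for arg in expected if arg not in present]


def read_dom0_cmdline(path="/proc/cmdline"):
    """Read the running dom0 kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()
```

    Calling missing_boot_args(read_dom0_cmdline(), DOM0_EXPECTED) in
dom0 should return an empty list if all three kernel arguments are
active.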

    When the hang occurs, all domUs and dom0 simply stop responding to
key presses and network traffic, and there is no IO activity. Nothing
is written to the console or logs (and nothing out of the ordinary
appears beforehand). Even hitting ctrl+a multiple times in the console does
nothing (indicating xen is dead too). On the video console, we just
have a blinking cursor after the last console log (though my
understanding is that the cursor blink might be generated by the video
card rather than any indication that at least something is still
running). If the hardware WDT is on, the watchdog eventually bites and
reboots the system.

    Although I believe it isn't related (since dom0 stalls too, and
we're looking at a completely stalled system rather than just domUs
having issues with disk IO), I added "gnttab_max_frames=256" to the
xen boot arguments anyway. Didn't seem to change anything.

    Then, grasping at straws, I turned off HWPM in the BIOS, which we
had to do on another machine hosting VMware ESX. That didn't seem to
change anything either.

    At this point, I'd like to know the best way to approach
this. Can I enable further levels of debugging so that I can even
begin to look towards a certain culprit? Is there a good way to
determine if it may be the hardware?

    I've tried running the same kernel without xen and just simulating
heavy IO on the disk array without issues, which makes me lean towards
xen being part of the equation. But then again, doing random file
reads/writes isn't a good simulation of the type of workload the domUs
put on the server.
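
    For reference, the kind of random read/write test I mean looks
roughly like the sketch below. This is an illustrative
reconstruction, not the exact tool that was run; the function name,
file size, block size and read/write mix are all assumptions. It also
goes through the page cache, which is one more reason it is only a
rough stand-in for what the domUs actually do.

```python
import os
import random


def random_io(path, file_size=64 * 1024 * 1024, block_size=4096,
              ops=10000, write_ratio=0.5, seed=None):
    """Issue block-aligned random reads and writes against one scratch file.

    Returns a (reads, writes) tuple with the number of operations issued.
    """
    rng = random.Random(seed)
    blocks = file_size // block_size
    reads = writes = 0
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Size the file up front so every offset below is valid.
        os.ftruncate(fd, file_size)
        for _ in range(ops):
            offset = rng.randrange(blocks) * block_size
            if rng.random() < write_ratio:
                os.pwrite(fd, os.urandom(block_size), offset)
                writes += 1
            else:
                os.pread(fd, block_size, offset)
                reads += 1
        # Flush dirty pages so the writes reach the array.
        os.fsync(fd)
    finally:
        os.close(fd)
    return reads, writes
```

    Running several of these in parallel against files on the array
generates load, but it does not exercise the blkback/netback paths or
the VF passthrough that the domUs use.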

    OS: Debian Buster
    Kernel: 4.17.0-1-amd64
    Xen: 4.8.4-pre (Debian 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9)
    CPU: Xeon E5-2699 v4
    RAM: Samsung 96GB ECC Registered
    MB: Supermicro X10SRi-F

    In case it is relevant, since it might be IO related...
    Net: Chelsio T520-CR (2 x XGB links, shared to domU using VF)
    RAID: LSI SAS3224 with 10 SAS3 drives

Warm regards,
Liwei

