
Re: [Xen-devel] mpt3sas bug with Debian jessie kernel only under Xen - "swiotlb buffer is full"



On 04/12/16 08:32, Andy Smith wrote:
> Hi,
>
> I have a Debian jessie server with an LSI SAS controller using the
> mpt3sas driver.
>
> Under the Debian jessie amd64 kernel (linux-image-3.16.0-4-amd64
> 3.16.36-1+deb8u2) running under Xen, I cannot put the system's
> storage under heavy load without receiving a bunch of "swiotlb
> buffer is full" kernel error messages and severely degraded
> performance. Sometimes the system panics and reboots itself.
>
> These problems do not happen if booting the kernel on bare metal.
>
> With a bit of searching I found someone having a similar issue with
> the Debian jessie kernel (though 686 and several versions back) and
> the tg3 driver:
>
>     https://lists.debian.org/debian-kernel/2015/05/msg00307.html
>
> They mention that suggestions on this list led them to compile a
> kernel with NEED_DMA_MAP_STATE set.
>
> I already seem to have that set:
>
> $ grep NEED_DMA /boot/config-3.16.0-4-amd64 
> CONFIG_NEED_DMA_MAP_STATE=y
>
> Is there something similar that I could try?
>
> The machine has two SSDs in an md RAID-10 and two spinning disks in
> another RAID-10. I can induce the situation within a few seconds by
> telling mdadm to check both of those arrays at the same time. i.e.:
>
> # /usr/share/mdadm/checkarray /dev/md4 # Spinny disks
> # /usr/share/mdadm/checkarray /dev/md5 # SSDs
>
> I expect to see 200,000K/sec (my set maximum) checking rate reported
> in /proc/mdstat for md5, and about 98,000K/sec for md4. This happens
> on bare metal.
>
> Under Xen, it starts off well but then the kernel errors appear
> within a few seconds; md4's speed drops to ~90,000K/sec and md5's
> drops right down to just ~100K/sec. If the machine doesn't do a
> kernel panic and reset itself very soon, it becomes unusably slow
> anyway.
>
> I can also trigger it with fio if I run jobs against filesystems on
> both arrays at once.
>
> Some logs appended at the end of this email.
>
> Would it be useful for me to show you a "dmesg" and "xl dmesg"?
>
> Shall I try a kernel and/or hypervisor from testing?

Can you try these two patches from the XenServer Patch queue?
https://github.com/xenserver/linux-3.x.pg/blob/master/master/series#L613-L614

There are bugs in some device drivers' choice of DMA mask which cause
them to incorrectly conclude that they need bounce buffering.  Once you
hit bounce buffering, everything grinds to a halt.
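
For illustration, here is a rough sketch (not the mpt3sas probe code
itself; the function name and message are made up) of the DMA-mask
negotiation a PCI driver is expected to do.  Requesting a narrower mask
than the hardware supports forces swiotlb bounce buffering for any
buffer the mask cannot reach, which is the failure mode described above:

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Hypothetical probe fragment showing the expected mask negotiation. */
static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int rc;

        /* Ask for the full 64-bit mask the hardware supports... */
        rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
        if (rc) {
                /* ...and only fall back to 32 bits if that is refused. */
                rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
                if (rc) {
                        dev_err(&pdev->dev, "no usable DMA mask\n");
                        return rc;
                }
        }
        return 0;
}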

> Dec  4 07:06:00 elephant kernel: [22019.373653] mpt3sas 0000:01:00.0: swiotlb buffer is full (sz: 57344 bytes)
> Dec  4 07:06:00 elephant kernel: [22019.374707] mpt3sas 0000:01:00.0: swiotlb buffer is full
> Dec  4 07:06:00 elephant kernel: [22019.375754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> Dec  4 07:06:00 elephant kernel: [22019.376430] IP: [<ffffffffa004e779>] _base_build_sg_scmd_ieee+0x1f9/0x2d0 [mpt3sas]
> Dec  4 07:06:00 elephant kernel: [22019.377122] PGD 0

This alone is a clear error-handling bug in the mpt3sas driver.  It
hasn't checked the DMA mapping call for success before dereferencing the
NULL pointer it was handed back.  It is collateral damage from the
swiotlb buffer being full, but a bug nonetheless.
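
As a rough sketch of the missing check (the function and parameter names
here are invented; only the dma_mapping_error() pattern is the point),
the DMA API requires the result of a mapping call to be tested before it
is used.  When the swiotlb buffer is full the mapping fails, and skipping
this test is how the driver ends up chasing a bad pointer instead of
backing off cleanly:

#include <linux/dma-mapping.h>

/* Hypothetical helper: map a buffer and check the mapping before use. */
static int example_map_buffer(struct device *dev, void *buf, size_t len,
                              dma_addr_t *handle)
{
        *handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, *handle)) {
                /* Mapping failed (e.g. swiotlb exhausted): report it and
                 * let the caller retry or fail the I/O cleanly. */
                dev_warn(dev, "DMA mapping failed\n");
                return -ENOMEM;
        }
        return 0;
}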

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

