
Re: Fwd: ocplib+endian improvement



Right.  So I'm going to stop looking at this now.  I'd like to say
that's because I've fixed it and everything's fine, but, actually,
it's because the BMC on my test box has failed and I can't find any
other machines which are in a usable state.  Pretty much the only
concrete thing which I've found is that the ring indices in the shared
page are corrupted, with what looks like ASCII; not sure if that helps
you at all.

> [   39.830497] <0>In xen_netbk_tx_build_gops; done 95000 iterations so far
> [   39.837138] <0>nr_pending_reqs 0
> [   39.840485] <0>Ring status: rsp_prod_pvt 24cda, req_cons 238da
> [   39.846437] <0>Shared: req_prod 75422121, req_event 69646c69, rsp_prod 
> 24cda, rsp_event 6c6d6143

Decoding the indices as ASCII (MSB-first and LSB-first readings):

75422121 -> uB!!   !!Bu
69646c69 -> idli   ildi
6c6d6143 -> lmaC   Caml

These could plausibly be fragments of strings present in an OCaml
program, but it's not terribly convincing.
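
(For the record, a quick standalone snippet along these lines reproduces
the decode; each word is printed most-significant-byte first and then
least-significant-byte first, the latter being the natural reading on a
little-endian box.)

    /* Print a 32-bit ring index as ASCII in both byte orders. */
    #include <stdio.h>
    #include <stdint.h>

    static void decode(uint32_t idx)
    {
        char msb[5], lsb[5];
        int i;

        for (i = 0; i < 4; i++) {
            msb[i] = (idx >> (24 - 8 * i)) & 0xff; /* MSB first */
            lsb[i] = (idx >> (8 * i)) & 0xff;      /* LSB first */
        }
        msb[4] = lsb[4] = '\0';
        printf("%08x -> %s   %s\n", (unsigned)idx, msb, lsb);
    }

    int main(void)
    {
        decode(0x75422121);
        decode(0x69646c69);
        decode(0x6c6d6143);
        return 0;
    }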

I do have a patch which might at least stop the bug from taking down
dom0 (attached), but I've not been able to test it at all.
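
(For what it's worth, an easy invariant to check is that the frontend's
producer index never gets more than a ring's worth ahead of our
consumer.  The standalone sketch below just illustrates that check
against the values from the trace above; it isn't the attached diff,
and the 256-entry ring size is the usual single-page TX ring.)

    /* A shared ring is only sane if the frontend's producer index is
     * at most one ring's worth ahead of the backend's consumer index;
     * if it isn't, the shared page is garbage and the backend should
     * give up on that vif rather than spin forever in dom0. */
    #include <stdio.h>
    #include <stdint.h>

    #define TX_RING_SIZE 256u  /* single-page TX ring */

    static int ring_sane(uint32_t req_prod, uint32_t req_cons)
    {
        /* Unsigned subtraction copes with index wrap-around. */
        return (req_prod - req_cons) <= TX_RING_SIZE;
    }

    int main(void)
    {
        /* The req_prod/req_cons values from the trace above. */
        printf("sane = %d\n", ring_sane(0x75422121, 0x000238da));
        return 0;
    }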

So that's not desperately helpful, really.  Sorry about that.

Steven.



> http://www.recoil.org/~avsm/www-crashes-pvops.xen is the offending kernel, 
> which always gives me the traceback below when booted using xm (on Xen 4.1, 
> with either a 3.2 or a 3.7 dom0 kernel)... can you repro on your setup?  You 
> have to provide one VIF.
> 
> -anil
> 
> Begin forwarded message:
> 
> > From: Anil Madhavapeddy <anil@xxxxxxxxxx>
> > Subject: Re: ocplib+endian improvement
> > Date: 6 January 2013 21:08:18 GMT
> > To: "<cl-mirage@xxxxxxxxxxxxxxx> List" <cl-mirage@xxxxxxxxxxxxxxx>
> > Cc: Steven Smith <steven.smith@xxxxxxxxxxxx>
> > 
> > Hm, but trying out TCP seems to trigger a soft lockup in Xen netback (with 
> > both the 3.2 and the latest 3.7 kernels).  Will need to do some more 
> > debugging tomorrow...
> > 
> > Steven (Smith): any luck with the grant-free netback modification?  I could 
> > try it out at the same time as debugging this particular issue.
> > 
> > -anil
> > 
> > [  277.249069] Code: 00 89 c1 7c c8 41 59 c3 90 90 90 65 c6 04 25 41 b1 00 
> > 00 00 65 f6 04 25 40 b1 00 00 ff 74 05 e8 47 00 00 00 c3 66 0f 1f 44 00 00 
> > <65> c6 04 25 41 b1 00 00 01 c3 66 0f 1f 44 00 00 65 f6 04 25 41 
> > [  305.248782] BUG: soft lockup - CPU#0 stuck for 23s! [netback/0:3772]
> > [  305.248883] Modules linked in: xt_physdev iptable_filter ip_tables 
> > x_tables xen_netback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd 
> > auth_rpcgss nfs_acl nfs lockd dns_resolver fscache sunrpc bridge stp llc 
> > loop crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 ablk_helper 
> > cryptd xts lrw gf128mul snd_pcm sp5100_tco snd_page_alloc snd_timer snd 
> > soundcore tpm_tis tpm amd64_edac_mod edac_mce_amd i2c_piix4 i2c_core dcdbas 
> > evdev pcspkr tpm_bios edac_core microcode psmouse fam15h_power k10temp 
> > serio_raw acpi_power_meter button processor thermal_sys ext4 crc16 jbd2 
> > mbcache dm_mod sg sd_mod crc_t10dif ata_generic ohci_hcd pata_atiixp ixgbe 
> > ahci ptp libahci pps_core libata ehci_hcd dca mdio scsi_mod bnx2 usbcore 
> > usb_common
> > [  305.248932] CPU 0 
> > [  305.248934] Pid: 3772, comm: netback/0 Tainted: G        W    
> > 3.7-trunk-amd64 #1 Debian 3.7.1-1~experimental.2 Dell Inc. PowerEdge 
> > R415/08WNM9
> > [  305.248936] RIP: e030:[<ffffffffa021c153>]  [<ffffffffa021c153>] 
> > xen_netbk_tx_build_gops+0x19d/0x7ad [xen_netback]
> > [  305.248941] RSP: e02b:ffff88020eba3ca8  EFLAGS: 00000217
> > [  305.248943] RAX: 0000000073202626 RBX: ffffc90007ab5000 RCX: 
> > 000000001ce71e08
> > [  305.248945] RDX: 000000001ce71e08 RSI: ffffc90007ab02a8 RDI: 
> > ffff880653403800
> > [  305.248946] RBP: ffffc90007ab70c0 R08: ffff8806534038d8 R09: 
> > ffff88020eba3c74
> > [  305.248947] R10: ffffc90007ab0208 R11: ffffc90007ab0208 R12: 
> > ffff880653403800
> > [  305.248949] R13: 0000000000007320 R14: 0000000000000000 R15: 
> > 000000000104210e
> > [  305.248953] FS:  00007f59123a4700(0000) GS:ffff8807ff800000(0000) 
> > knlGS:0000000000000000
> > [  305.248955] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [  305.248956] CR2: 00007fd7c822b070 CR3: 00000007f2177000 CR4: 
> > 0000000000000660
> > [  305.248958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> > 0000000000000000
> > [  305.248960] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
> > 0000000000000400
> > [  305.248961] Process netback/0 (pid: 3772, threadinfo ffff88020eba2000, 
> > task ffff8807ee921180)
> > [  305.248963] Stack:
> > [  305.248964]  ffffffff81004067 ffffffff81004202 1ce71e0800000003 
> > ffff8807ee921180
> > [  305.248968]  ffffc90007ab70c0 ffff8807ff811740 0000000000000000 
> > 2074706fffff2e2f
> > [  305.248972]  ffff4f2073202626 2074706f20646c72 ffffffff810037f7 
> > ffff8807ee921180
> > [  305.248975] Call Trace:
> > [  305.248978]  [<ffffffff81004067>] ? arch_local_irq_restore+0x7/0x8
> > [  305.248982]  [<ffffffff81004202>] ? xen_mc_flush+0x11d/0x160
> > [  305.248985]  [<ffffffff810037f7>] ? xen_mc_issue.constprop.22+0x10/0x4d
> > [  305.248988]  [<ffffffff8100d02f>] ? load_TLS+0x7/0xa
> > [  305.248991]  [<ffffffff8100d60c>] ? __switch_to+0x195/0x3f8
> > [  305.248994]  [<ffffffff8105fadb>] ? mmdrop+0xd/0x1c
> > [  305.248996]  [<ffffffff81061390>] ? finish_task_switch+0x83/0xb4
> > [  305.249000]  [<ffffffff813778e9>] ? __schedule+0x4b2/0x4e0
> > [  305.249003]  [<ffffffff8107f0d3>] ? arch_local_irq_disable+0x7/0x8
> > [  305.249006]  [<ffffffff81378465>] ? _raw_spin_lock_irqsave+0x14/0x35
> > [  305.249009]  [<ffffffff8107f0d3>] ? arch_local_irq_disable+0x7/0x8
> > [  305.249012]  [<ffffffff81378465>] ? _raw_spin_lock_irqsave+0x14/0x35
> > [  305.249016]  [<ffffffffa021c897>] ? xen_netbk_kthread+0x134/0x78d 
> > [xen_netback]
> > [  305.249019]  [<ffffffff8105d78f>] ? arch_local_irq_enable+0x7/0x8
> > [  305.249022]  [<ffffffff81061357>] ? finish_task_switch+0x4a/0xb4
> > [  305.249025]  [<ffffffff81057987>] ? abort_exclusive_wait+0x79/0x79
> > [  305.249029]  [<ffffffffa021c763>] ? xen_netbk_tx_build_gops+0x7ad/0x7ad 
> > [xen_netback]
> > [  305.249032]  [<ffffffffa021c763>] ? xen_netbk_tx_build_gops+0x7ad/0x7ad 
> > [xen_netback]
> > [  305.249035]  [<ffffffff810570ac>] ? kthread+0x81/0x89
> > [  305.249038]  [<ffffffff810037f7>] ? xen_mc_issue.constprop.22+0x10/0x4d
> > [  305.249041]  [<ffffffff8105702b>] ? __kthread_parkme+0x5c/0x5c
> > [  305.249043]  [<ffffffff8137d6bc>] ? ret_from_fork+0x7c/0xb0
> > [  305.249046]  [<ffffffff8105702b>] ? __kthread_parkme+0x5c/0x5c
> > [  305.249048] Code: bc 24 80 00 00 00 4d 89 a4 24 a8 00 00 00 49 c7 84 24 
> > a0 00 00 00 58 bf 21 a0 4c 89 f6 e8 86 d9 e2 e0 e9 c3 05 00 00 8b 54 24 14 
> > <66> 8b 74 24 3e 41 8d 4f ff 0f b7 44 24 42 48 c7 44 24 30 00 00 
> > avsm@gabriel:~/src/git/mirage/mirage-www$ 
> > Message from syslogd@gabriel at Jan  6 21:06:38 ...
> > 
> > On 6 Jan 2013, at 18:44, Anil Madhavapeddy <anil@xxxxxxxxxx> wrote:
> > 
> >> I've been porting the network stack to take advantage of the cstruct 
> >> turbo-boost that Pierre and Thomas worked on.  This optimisation adds 
> >> compiler built-ins (in 4.01.0+) which let the code generator optimise away 
> >> many of the temporary values otherwise needed for low-level buffer access.
> >> 
> >> Here's a (very quick) before/after for a ping flood (which is a good 
> >> stress test of the low-level shared ring, network driver and protocol 
> >> stack).
> >> 
> >> With 4.00.1, without the optimisation:
> >> 73755 packets transmitted, 73702 received, +49 duplicates, 0% packet loss, 
> >> time 6283ms
> >> rtt min/avg/max/mdev = 0.031/0.228/1209.178/9.887 ms, pipe 14850, ipg/ewma 
> >> 0.085/0.036 ms
> >> 
> >> and with the optimisation:
> >> 41791 packets transmitted, 41764 received, +25 duplicates, 0% packet loss, 
> >> time 3539ms
> >> rtt min/avg/max/mdev = 0.030/0.188/1261.042/8.459 ms, pipe 14742, ipg/ewma 
> >> 0.084/0.039 ms
> >> 
> >> So our average latency drops quite significantly (0.228 ms -> 0.188 ms), 
> >> as does CPU load (not shown above).
> >> 
> >> I've not committed these changes to mainline yet, as I want to test TCP 
> >> a bit more first, but it's getting there!
> >> 
> >> -anil
> > 
> > 
> 

Attachment: netback-deadlock.diff
Description: Text Data


 

