Xen project Mailing List

Re: [MirageOS-devel] wireshark capture of failed download from mirage-www on ARM

From: Thomas Leonard <talex5@xxxxxxxxx>

Date: Mon, 21 Jul 2014 22:30:38 +0100

Cc: "mirageos-devel@xxxxxxxxxxxxxxxxxxxx" <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>, Anil Madhavapeddy <anil@xxxxxxxxxx>

Delivery-date: Mon, 21 Jul 2014 21:30:44 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

On 21 July 2014 21:31, Dave Scott <Dave.Scott@xxxxxxxxxx> wrote:
>
> On 21 Jul 2014, at 21:14, Thomas Leonard <talex5@xxxxxxxxx> wrote:
>
>> On 21 July 2014 20:56, Richard Mortier <Richard.Mortier@xxxxxxxxxxxxxxxx>
>> wrote:
>>> [ context for list: thomas' observation of failed download, and lots of
>>> retransmissions generally, while bringing up mirage-www on ARM ]
>>>
>>> On 21 Jul 2014, at 09:27, Thomas Leonard <talex5@xxxxxxxxx> wrote:
>>>
>>>> On 21 July 2014 17:08, Richard Mortier <Richard.Mortier@xxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>
>>>>> On 21 Jul 2014, at 09:01, Thomas Leonard <talex5@xxxxxxxxx> wrote:
>>>>>
>>>>>> Here's the wireshark capture of a failed download. It does indeed say
>>>>>> the TCP checksum is wrong. Any idea what's going on?
>>>>>>
>>>>>> Note that on ARM it uses a different function to calculate this (which
>>>>>> I took from mirage-unix). It's in the #else block here:
>>>>>>
>>>>>> https://github.com/talex5/mirage-tcpip/blob/checksum/lib/checksum_stubs.c
>>>>>
>>>>> ack; will take a look after breakfast :)
>>>>>
>>>>> just to be clear -- the ARM version is using the code from L247 marked
>>>>> "generic implementation"?
>>>>
>>>> Yes. The x86 version crashes on ARM because the 64-bit values aren't
>>>> aligned.
>>>>
>>>>> two immediate questions -- is the checksum field definitely treated as
>>>>> all zeros in the computation across the header? and is the segment
>>>>> padded with zeros to be N*16 bits for the purposes of the computation
>>>>> (but the pad not transmitted)?
>>>>
>>>> No idea. I haven't changed any code around there.
>>>
>>> this is weird-- wireshark says that the first transmission of that segment
>>> (frame#13) has an invalid checksum while the retransmission (#17) has a
>>> valid checksum. but the two checksums are the same! however #13 appears to
>>> have almost no valid data in it -- after the first 74 bytes (which are the
>>> same in both #13 and #17), the payload in #13 is zeroed out.
>>>
>>> so i guess the cstruct buffer is being recycled too soon (after the
>>> checksum calculation but before the data is actually transmitted) or
>>> something?
>>>
>>> anil, balraj (or anyone else!)-- has that part of the stack been changed
>>> recently?
>>
>> I'm seeing strange things using a simpler test case now:
>>
>> let start c s =
>>   S.listen_tcpv4 s ~port:8000 (fun flow ->
>>     let dst, dst_port = S.TCPV4.get_dest flow in
>>     C.log_s c (green "new tcp connection from %s %d"
>>       (Ipaddr.V4.to_string dst) dst_port)
>>     >>= fun () ->
>>     let data = Cstruct.of_string "Hello" in
>>     S.TCPV4.write flow data
>>     >>= fun () ->
>>     S.TCPV4.close flow
>>   );
>>   S.listen s
>>
>> This is also failing. I added a hexdump to mirage-net-xen and got this
>> in Netif.writev:
>>
>> f0 1f af 6a 9b 95 c0 ff ee c0 ff ee 08 00 45 00
>> 00 2d 52 95 00 00 26 06 c0 c8 c0 a8 00 12 c0 a8
>> 00 0b 1f 40 b4 ca 1a fe b5 69 5e 8c dd fe 50 18
>> ff ff 29 8a 00 00
>>
>> 48 65 6c 6c 6f
>>
>> That looks correct. The first block is the header, the second is the
>> payload. In wireshark, the header is identical but the payload is
>> different (20 00 00 00 08), which matches what you're seeing.
>>
>> So I guess there's some problem sending the second page to the ring.
>> Suggestions from people who know this code would be great! Could just
>> be a missing barrier or something.
>
> I think the flow is:
>
> https://github.com/mirage/mirage-net-xen/blob/master/lib/netif.ml#L408
> https://github.com/mirage/shared-memory-ring/blob/master/lwt/lwt_ring.ml#L75
> https://github.com/mirage/shared-memory-ring/blob/master/lib/ring.ml#L154
> https://github.com/mirage/shared-memory-ring/blob/master/lib/ring.ml#L102
> https://github.com/mirage/shared-memory-ring/blob/master/lib/barrier_stubs.c#L28
> -- calling "xen_mb"
>
> Perhaps to see whether "xen_mb" is working you could add a delay (via busy
> loop?) in the "memory_barrier" function (or thereabouts) in
> shared-memory-ring. Assuming the writes are committed eventually (is that a
> valid assumption?) then the busy loop would "fix it". That would be fairly
> good evidence that barriers are broken.

This is potentially a problem (in shared-memory-ring/lib/barrier.h):

  #elif defined(__arm__)
  # ifndef _M_ARM
  #define xen_mb() {}
  #define xen_rmb() {}
  #define xen_wmb() {}
  # elif _M_ARM > 6
  #define xen_mb() asm volatile ("dmb" : : : "memory")
  #define xen_rmb() asm volatile ("dmb" : : : "memory")
  #define xen_wmb() asm volatile ("dmb" : : : "memory")

From a quick Google, it looks like _M_ARM is a Microsoft-only thing.
However, the barrier code is duplicated in mirage-xen, and I think
we're using that version (in any case, changing it didn't help).

However, I've now noticed that my network test is failing on x86 too,
so there's something very odd going on...

-- 
Dr Thomas Leonard        http://0install.net/
GPG: 9242 9807 C985 3C07 44A6  8B9A AE07 8280 59A5 3CC1
GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
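
As a cross-check of the Netif.writev hexdump quoted in the thread, here is a minimal sketch in C of the standard Internet checksum applied to a TCP/IPv4 segment. It is not the code from checksum_stubs.c; the function names `sum_bytes` and `tcp4_checksum` are made up for illustration. It reads one byte at a time, so it has no alignment requirement, it expects the caller to zero the checksum field first, and it pads an odd-length segment with a zero byte for the computation only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add `len` bytes into a running 16-bit ones-complement
 * sum, big-endian word by word. Bytes are read individually, so `data` may be
 * unaligned. An odd trailing byte is treated as if followed by a zero pad
 * byte, so only the final chunk passed in may have odd length. */
uint32_t sum_bytes(uint32_t sum, const uint8_t *data, size_t len)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;
    return sum;
}

/* Hypothetical helper: TCP checksum over the IPv4 pseudo-header plus the
 * segment (TCP header and payload). The checksum field, bytes 16-17 of the
 * TCP header, must already be zero in `seg`. */
uint16_t tcp4_checksum(const uint8_t src[4], const uint8_t dst[4],
                       const uint8_t *seg, size_t seg_len)
{
    uint8_t pseudo[12];
    uint32_t sum;

    memcpy(pseudo, src, 4);                  /* source address */
    memcpy(pseudo + 4, dst, 4);              /* destination address */
    pseudo[8] = 0;                           /* reserved zero byte */
    pseudo[9] = 6;                           /* protocol number for TCP */
    pseudo[10] = (uint8_t)(seg_len >> 8);    /* TCP length, high byte */
    pseudo[11] = (uint8_t)(seg_len & 0xff);  /* TCP length, low byte */

    sum = sum_bytes(0, pseudo, sizeof pseudo);
    sum = sum_bytes(sum, seg, seg_len);

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Fed the addresses from the dump (192.168.0.18 to 192.168.0.11), the 20-byte TCP header with its checksum field zeroed, and the payload "Hello", this gives 0x298a, the value in the header. With the payload wireshark saw (20 00 00 00 08) it gives a different value, which fits the checksum having been computed over the right data and the payload changing afterwards.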
