[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [MirageOS-devel] wireshark capture of failed download from mirage-www on ARM



If it fails on x86, could you try reverting to an older Cstruct version? I worry 
about the recent bounds checks tripping some new behaviour.

Anil

> On 21 Jul 2014, at 14:30, Thomas Leonard <talex5@xxxxxxxxx> wrote:
> 
>> On 21 July 2014 21:31, Dave Scott <Dave.Scott@xxxxxxxxxx> wrote:
>> 
>>> On 21 Jul 2014, at 21:14, Thomas Leonard <talex5@xxxxxxxxx> wrote:
>>> 
>>>> On 21 July 2014 20:56, Richard Mortier <Richard.Mortier@xxxxxxxxxxxxxxxx> 
>>>> wrote:
>>>> [ context for list: thomas' observation of failed download, and lots of 
>>>> retransmissions generally, while bringing up mirage-www on ARM ]
>>>> 
>>>>> On 21 Jul 2014, at 09:27, Thomas Leonard <talex5@xxxxxxxxx> wrote:
>>>>> 
>>>>> On 21 July 2014 17:08, Richard Mortier <Richard.Mortier@xxxxxxxxxxxxxxxx> 
>>>>> wrote:
>>>> 
>>>>>>> On 21 Jul 2014, at 09:01, Thomas Leonard <talex5@xxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> Here's the wireshark capture of a failed download. It does indeed say
>>>>>>> the TCP checksum is wrong. Any idea what's going on?
>>>>>>> 
>>>>>>> Note that on ARM it uses a different function to calculate this (which
>>>>>>> I took from mirage-unix). It's in the #else block here:
>>>>>>> 
>>>>>>> https://github.com/talex5/mirage-tcpip/blob/checksum/lib/checksum_stubs.c
>>>>>> 
>>>>>> ack; will take a look after breakfast :)
>>>>>> 
>>>>>> just to be clear -- the ARM version is using the code from L247 marked 
>>>>>> "generic implementation"?
>>>>> 
>>>>> Yes. The x86 version crashes on ARM because the 64-bit values aren't 
>>>>> aligned.
>>>>> 
>>>>>> two immediate questions -- is the checksum field definitely treated as 
>>>>>> all zeros in the computation across the header?  and is the segment 
>>>>>> padded with zeros to be N*16 bits for the purposes of the computation 
>>>>>> (but the pad not transmitted)?
>>>>> 
>>>>> No idea. I haven't changed any code around there.
>>>> 
>>>> this is weird-- wireshark says that the first transmission of that segment 
>>>> (frame#13) has an invalid checksum while the retransmission (#17) has a 
>>>> valid checksum. but the two checksums are the same!  however #13 appears 
>>>> to have almost no valid data in it -- after the first 74 bytes (which are 
>>>> the same in both #13 and #17), the payload in #13 is zeroed out.
>>>> 
>>>> so i guess the cstruct buffer is being recycled too soon (after the 
>>>> checksum calculation but before the data is actually transmitted) or 
>>>> something?
>>>> 
>>>> anil, balraj (or anyone else!)-- has that part of the stack been changed 
>>>> recently?
>>> 
>>> I'm seeing strange things using a simpler test case now:
>>> 
>>> let start c s =
>>>   S.listen_tcpv4 s ~port:8000 (fun flow ->
>>>       let dst, dst_port = S.TCPV4.get_dest flow in
>>>       C.log_s c (green "new tcp connection from %s %d"
>>>                    (Ipaddr.V4.to_string dst) dst_port)
>>>       >>= fun () ->
>>>       let data = Cstruct.of_string "Hello" in
>>>       S.TCPV4.write flow data
>>>       >>= fun () ->
>>>       S.TCPV4.close flow
>>>     );
>>>   S.listen s
>>> 
>>> This is also failing. I added a hexdump to mirage-net-xen and got this
>>> in Netif.writev:
>>> 
>>> f0 1f af 6a 9b 95 c0 ff ee c0 ff ee 08 00 45 00
>>> 00 2d 52 95 00 00 26 06 c0 c8 c0 a8 00 12 c0 a8
>>> 00 0b 1f 40 b4 ca 1a fe b5 69 5e 8c dd fe 50 18
>>> ff ff 29 8a 00 00
>>> 
>>> 48 65 6c 6c 6f
>>> 
>>> That looks correct. The first block is the header, the second is the
>>> payload. In wireshark, the header is identical but the payload is
>>> different (20 00 00 00 08), which matches what you're seeing.
>>> 
>>> So I guess there's some problem sending the second page to the ring.
>>> Suggestions from people who know this code would be great! Could just
>>> be a missing barrier or something.
>> 
>> I think the flow is:
>> 
>> https://github.com/mirage/mirage-net-xen/blob/master/lib/netif.ml#L408
>> https://github.com/mirage/shared-memory-ring/blob/master/lwt/lwt_ring.ml#L75
>> https://github.com/mirage/shared-memory-ring/blob/master/lib/ring.ml#L154
>> https://github.com/mirage/shared-memory-ring/blob/master/lib/ring.ml#L102
>> https://github.com/mirage/shared-memory-ring/blob/master/lib/barrier_stubs.c#L28
>> -> calling 'xen_mb'
>> 
>> Perhaps to see whether 'xen_mb' is working you could add a delay (via busy 
>> loop?) in the 'memory_barrier' function (or thereabouts) in 
>> shared-memory-ring. Assuming the writes are committed eventually (is that a 
>> valid assumption?) then the busy loop would 'fix it'. That would be fairly 
>> good evidence that barriers are broken.
> 
> This is potentially a problem (in shared-memory-ring/lib/barrier.h):
> 
> #elif defined(__arm__)
> # ifndef _M_ARM
> #define xen_mb()   {}
> #define xen_rmb()  {}
> #define xen_wmb()  {}
> # elif _M_ARM > 6
> #define xen_mb()   asm volatile ("dmb" : : : "memory")
> #define xen_rmb()  asm volatile ("dmb" : : : "memory")
> #define xen_wmb()  asm volatile ("dmb" : : : "memory")
> 
> From a quick Google, it looks like _M_ARM is a Microsoft-only thing.
> 
> However, the barrier code is duplicated in mirage-xen, and I think
> we're using that version (in any case, changing it didn't help).
> 
> However, I've now noticed that my network test is failing on x86 too,
> so there's something very odd going on...
> 
> 
> -- 
> Dr Thomas Leonard        http://0install.net/
> GPG: 9242 9807 C985 3C07 44A6  8B9A AE07 8280 59A5 3CC1
> GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA
> 
> _______________________________________________
> MirageOS-devel mailing list
> MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
> http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel


 

