
Re: [MirageOS-devel] Systematic crash on create_bounce_frame when hitting specific data allocation threshold



On 14 December 2016 at 16:13, Vittorio Cozzolino
<vittorio.cozzolino@xxxxxxxxx> wrote:
> Ok,
>
> I've built a "lightweight" version of my original unikernel; it contains
> basically only the code necessary to trigger the crash.
>
> Do I copy-paste the code here or somewhere else? I already have an issue
> open at https://github.com/mirage/mirage/issues/731; should I update it
> and paste the unikernel code there?

Great! You can either add it to the issue or, if you want more space /
multiple files, there's a "Gist" link at the top of the GitHub page
that can be a convenient place to paste stuff.

> Best regards,
> Vittorio
>
>
>
> On 14/12/2016 16:36, Thomas Leonard wrote:
>>
>> On 14 December 2016 at 15:12, Vittorio Cozzolino
>> <vittorio.cozzolino@xxxxxxxxx> wrote:
>>>
>>> Hi Thomas,
>>>
>>> I've tried a few things:
>>>
>>> - `Gc.full_major()` unfortunately doesn't help.
>>> - Looking at the address pointed by the RIP at the moment of the
>>> exception,
>>> I can see this instruction:
>>>
>>> 25605f:       e8 7c ad ff ff          callq  250de0 <memcpy>
>>>
>>> I don't know how useful it can be, considering that I can trigger the
>>> same crash by changing the code, and in that case the referenced
>>> instruction would be something totally different (like a movl or a
>>> push). Maybe the instruction type is not closely related to the crash
>>> itself? I feel like it doesn't make much sense.
>>
>> It would be more interesting to know the caller of this function, etc.
>> It's possible that it branched to an invalid address and started
>> executing random code at some point, so the actual location of the
>> crash might not help but things further up the stack might be useful.
>>
>>> - Regarding in-lining the raw data in the code, I'm still working on
>>> it. Actually I don't fully understand what you mean: are you suggesting
>>> de-structuring the JSON and inserting a list/array of values directly
>>> into my code? Or copying the JSON output directly into my code as a
>>> static variable? I've tried the latter and the error persists. I will
>>> build the list of static values and see what happens.
>>
>> Yes, I mean putting the json in your code, as
>>
>>    let raw_json = "..."
>>
>> If it still crashes with this, you can remove the database call. If it
>> still crashes, you can remove networking completely from your
>> unikernel. You can eliminate a lot of code quickly this way.
>>
>> If you can get a unikernel that just parses a JSON string and crashes,
>> other people can try it too and it should be easy to find the cause.
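A minimal sketch of that reduced unikernel's core (editorial addition, stdlib-only: the comma-separated payload and the split-based parsing are stand-ins for the real JSON and the Yojson call, so the skeleton compiles with no extra libraries):

```ocaml
(* Sketch of the reduction described above: the payload is inlined as a
   string constant instead of being fetched over the network. The comma
   parsing below is a stdlib-only stand-in; in the real test you would
   call Yojson.Basic.from_string on raw_json instead. *)
let raw_json = "1.0,2.0,3.0,4.0"  (* placeholder for the real payload *)

let compute_average values =
  List.fold_left (+.) 0.0 values /. float_of_int (List.length values)

let () =
  raw_json
  |> String.split_on_char ','
  |> List.map float_of_string
  |> compute_average
  |> Printf.printf "Average: %f\n"   (* prints "Average: 2.500000" *)
```

If an inline payload of the real size still crashes on Xen, the network stack and the database client are ruled out as causes.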
>>
>>> Anyway, whatever I do with the retrieved JSON (even a List.iter with
>>> an empty function body), the unikernel crashes. I have the impression
>>> that the system crash is triggered as soon as I try to access the
>>> variable containing the JSON value.
>>>
>>> Best regards,
>>> Vittorio
>>>
>>>
>>> On 14/12/2016 13:45, Thomas Leonard wrote:
>>>>
>>>> On 14 December 2016 at 11:35, Vittorio Cozzolino
>>>> <vittorio.cozzolino@xxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>> I'm running a unikernel on Xen that basically accesses a remote DB,
>>>>> fetches and computes some data, and sends out the result. Apparently,
>>>>> if I try to fetch and parse a JSON response larger than an
>>>>> empirically found threshold (details at the bottom of the email), the
>>>>> PV Xen unikernel just crashes, and this is what I see when running
>>>>> sudo xl dmesg:
>>>>>
>>>>> (XEN) Pagetable walk from 00000000002c9ff8:
>>>>> (XEN)  L4[0x000] = 00000010b5f67067 0000000000000567
>>>>> (XEN)  L3[0x000] = 00000010b5f68067 0000000000000568
>>>>> (XEN)  L2[0x001] = 00000010b5f6a067 000000000000056a
>>>>> (XEN)  L1[0x0c9] = 00100010b1ac9025 00000000000002c9
>>>>> (XEN) domain_crash_sync called from entry.S: fault at ffff82d0802261be
>>>>> create_bounce_frame+0x66/0x13a
>>>>> (XEN) Domain 23 (vcpu#0) crashed on cpu#17:
>>>>> (XEN) ----[ Xen-4.6.0  x86_64  debug=n  Not tainted ]----
>>>>> (XEN) CPU:    17
>>>>> (XEN) RIP:    e033:[<0000000000258cf4>]
>>>>> (XEN) RFLAGS: 0000000000010206   EM: 1   CONTEXT: pv guest (d23v0)
>>>>> (XEN) rax: 0000000000258cf0   rbx: 0000000000000000   rcx:
>>>>> 0000000000000073
>>>>> (XEN) rdx: 0000000000442528   rsi: 0000000000000000   rdi:
>>>>> 00000000002ca018
>>>>> (XEN) rbp: 00000000002ca1e8   rsp: 00000000002ca000   r8:
>>>>> 0000000000000002
>>>>> (XEN) r9:  0000000000000007   r10: 0000000000000007   r11:
>>>>> 0000000000000000
>>>>> (XEN) r12: 00000000002ca118   r13: 0000000000000000   r14:
>>>>> 00000011238fa000
>>>>> (XEN) r15: 0000000000000074   cr0: 0000000080050033   cr4:
>>>>> 00000000001526e0
>>>>> (XEN) cr3: 00000010b5f66000   cr2: 00000000002c9ff8
>>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
>>>>> (XEN) Guest stack trace from rsp=00000000002ca000:
>>>>> (XEN)    00000000002ca118 0000000000000000 000000000025933f
>>>>> 0000000000000074
>>>>> (XEN)    00000011238fa000 0000000000000000 00000000002ca118
>>>>> 00000000002ca1e8
>>>>> (XEN)    0000000000000000 0000000000000000 0000000000000007
>>>>> 0000000000000007
>>>>> (XEN)    0000000000000002 ffff800000000000 0000000000000073
>>>>> 0000000000442528
>>>>> (XEN)    00000000002ca118 0000000000000000 ffffffffffffffff
>>>>> 0000000000256708
>>>>> (XEN)    000000010000e030 0000000000010006 00000000002ca0c8
>>>>> 000000000000e02b
>>>>> (XEN)    0000000000000ffc 3736353433323130 4645444342413938
>>>>> 4e4d4c4b4a494847
>>>>> (XEN)    00000000002ca18b 00000000002ca1e8 00000000002ca18a
>>>>> 0000000000000074
>>>>> (XEN)    00000000002566a0 00000000002ca118 00000000002561bc
>>>>> 7561662065676150
>>>>> (XEN)    696c20746120746c 646461207261656e 3062642073736572
>>>>> 706972202c306433
>>>>> (XEN)    2c38303736353220 3030207367657220 3030303030303030
>>>>> 202c383333616332
>>>>> (XEN)    6533616332207073 735f72756f202c38 3030303030302070
>>>>> 3261633230303030
>>>>> (XEN)    65646f63202c3866 ffffffff0a0d3020 0000000000000bfc
>>>>> 61665f686374614d
>>>>> (XEN)    0200006572756c69 0000000000000073 0000000000000000
>>>>> ffffffffffffffef
>>>>> (XEN)    0000000000000000 00000000002ca2e8 0000000000000000
>>>>> 00000011238fa000
>>>>> (XEN)    0000000000000074 00000000002ca338 000000000025630a
>>>>> 636f6c625f737953
>>>>> (XEN)    0000003000000030 00000000002ca2e0 00000000002ca218
>>>>> ffffffffffffffeb
>>>>> (XEN)    0000000000db03d0 0000000000256708 00000000002ca338
>>>>> 00000000002ca3e8
>>>>> (XEN)    00000000002ca2f8 ffffffffffffffe9 00000000000013fc
>>>>> 656e696665646e55
>>>>> (XEN)    7372756365725f64 75646f6d5f657669 050000000000656c
>>>>> 00000000003df368
>>>>>
>>>>> I've tried destroying and recreating the same unikernel multiple
>>>>> times and I always receive the same error. When running on Unix I
>>>>> don't bump into this issue, even when fetching and parsing multiple
>>>>> MB of data.
>>>>>
>>>>> By filling my code with logs, I figured out exactly where the
>>>>> unikernel stops: during the JSON response parsing (I'm using the
>>>>> Yojson library):
>>>>>
>>>>> let directExtractionn rawJson =
>>>>>   Log.info (fun f -> f "Initializing direct extraction");
>>>>>   let json = Yojson.Basic.from_string rawJson in
>>>>>   let result =
>>>>>     [json] |> filter_member "results" |> flatten
>>>>>     |> filter_member "series" |> flatten
>>>>>     |> filter_member "values" |> flatten
>>>>>   in
>>>>>   List.map (fun item ->
>>>>>       let datapoint =
>>>>>         match item |> index 1 with
>>>>>         | `String a -> a
>>>>>         | `Float f -> string_of_float f
>>>>>         | `Int i -> string_of_float (float_of_int i)
>>>>>         | `Bool b -> string_of_bool b
>>>>>       in
>>>>>       datapoint)
>>>>>     result
>>>>>   |> computeAverage >>= fun aver ->
>>>>>   log_lwt ~inject:(fun f -> f "Result %f" aver)
>>>>>
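One detail worth flagging in the match above (editorial note): `Yojson.Basic.t` also has `Null, `Assoc and `List constructors, so a four-case match raises Match_failure at runtime on any other value. A hedged sketch of an exhaustive version, using a stand-in variant type so it compiles without Yojson:

```ocaml
(* Editorial sketch: the match in directExtractionn covers only four of
   Yojson.Basic.t's constructors; `Null, `Assoc and `List are missing, so
   those values raise Match_failure at runtime. A stand-in variant type
   mirroring Yojson.Basic.t is used here so the sketch is self-contained. *)
type json =
  [ `String of string
  | `Float of float
  | `Int of int
  | `Bool of bool
  | `Null
  | `Assoc of (string * json) list
  | `List of json list ]

let datapoint_of_json (item : json) : string =
  match item with
  | `String a -> a
  | `Float f -> string_of_float f
  | `Int i -> string_of_float (float_of_int i)
  | `Bool b -> string_of_bool b
  | `Null -> ""                  (* explicit fallback instead of crashing *)
  | `Assoc _ | `List _ -> ""     (* ignore structured values *)
```

Whether an unhandled constructor is actually what triggers this crash is not established by the trace alone; the sketch only shows how to rule it out.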
>>>>> I know that my code is probably not really optimized and clean, but
>>>>> I'm quite shocked to see that my unikernel crashes when it has to
>>>>> extract roughly 3500 datapoints (more or less the threshold at which
>>>>> it crashes). The function computeAverage is not even called. If I run
>>>>> the same code on Unix I can parse and process up to 1M datapoints in
>>>>> less than a second. I've also tried increasing the number of vcpus
>>>>> and the amount of memory, but nothing changed (16 vcpus and 4 GB of
>>>>> memory).
>>>>>
>>>>> I would like to add that this threshold changes depending on the host
>>>>> machine:
>>>>>
>>>>> - Machine A (Ubuntu 14.04, Xen 4.6.0, 32 cores, 128 GB RAM, 10 Gb
>>>>> network interface) -> threshold is around 107 KB
>>>>> - Machine B (Debian 8.5, Xen 4.4.1, 4 cores, 8 GB RAM, 1 Gb network
>>>>> interface) -> threshold is around 33 KB
>>>>
>>>> Can you simplify the case? For example, instead of fetching the JSON,
>>>> what if you in-line the raw data in your code and parse that?
>>>>
>>>> Does adding a `Gc.full_major ()` just before the crash help? That
>>>> might indicate we're running out of memory and failing to run the GC
>>>> for some reason.
>>>>
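A sketch of that instrumentation (editorial addition; `Gc.full_major` and `Gc.stat` are standard OCaml, while the label and placement are illustrative):

```ocaml
(* Sketch: force a full major collection and log heap statistics just
   before the suspect parse, to see whether memory pressure is involved.
   Call this right before the Yojson parse in the unikernel. *)
let log_heap_state label =
  Gc.full_major ();
  let st = Gc.stat () in
  Printf.printf "[%s] live words: %d, heap words: %d, top heap: %d\n"
    label st.Gc.live_words st.Gc.heap_words st.Gc.top_heap_words

let () = log_heap_state "before parse"
```

Comparing the numbers just below and just above the crash threshold would show whether the heap is anywhere near the domain's memory limit.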
>>>> You could also use `objdump -d` or similar on the unikernel image and
>>>> see what the addresses in the stack trace correspond to.
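For example, that lookup might go something like this (editorial sketch; the image name mir-unikernel.xen is hypothetical, and the addresses are the ones from the guest stack dump above):

```
objdump -d mir-unikernel.xen > disasm.txt
grep -n '25933f\|2561bc\|256708' disasm.txt   # find the stack addresses
# With debug info present, addr2line maps addresses to source locations:
addr2line -f -e mir-unikernel.xen 0x25933f 0x2561bc 0x256708
```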
>>>>
>>>>
>>> --
>>> Vittorio Cozzolino, M.Eng.
>>> Technische Universität München - Institut für Informatik
>>> Office 01.05.041
>>> Boltzmannstr. 3, 85748 Garching, Germany
>>> Tel: +49 89 289-17356
>>> http://www.cm.in.tum.de/en/research-group/vittorio-cozzolino
>>>
>>>
>>
>>
>



-- 
talex5 (GitHub/Twitter)        http://roscidus.com/blog/
GPG: 5DD5 8D70 899C 454A 966D  6A51 7513 3C8F 94F6 E0CC
GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel

 

