Re: [MirageOS-devel] Systematic crash on create_bounce_frame when hitting specific data allocation threshold
On 14 December 2016 at 16:13, Vittorio Cozzolino
<vittorio.cozzolino@xxxxxxxxx> wrote:
> Ok,
>
> I've built a "lightweight" version of my original unikernel; it contains
> basically only the code necessary to trigger the crash.
>
> Do I copy-paste the code here or somewhere else? I already have an issue
> open here: https://github.com/mirage/mirage/issues/731. Should I update
> it and copy the unikernel code there?

Great! You can either add it to the issue or, if you want more space /
multiple files, there's a "Gist" link at the top of the GitHub page that
can be a convenient place to paste stuff.

> Best regards,
> Vittorio
>
>
> On 14/12/2016 16:36, Thomas Leonard wrote:
>>
>> On 14 December 2016 at 15:12, Vittorio Cozzolino
>> <vittorio.cozzolino@xxxxxxxxx> wrote:
>>>
>>> Hi Thomas,
>>>
>>> I've tried a few things:
>>>
>>> - `Gc.full_major ()` unfortunately doesn't help.
>>> - Looking at the address pointed to by the RIP at the moment of the
>>>   exception, I can see this instruction:
>>>
>>>     25605f: e8 7c ad ff ff    callq 250de0 <memcpy>
>>>
>>> I don't know how useful this can be, considering that I can trigger
>>> the same crash by changing the code, and in that case the referenced
>>> instruction is something totally different (like a movl or a push).
>>> Maybe the instruction type is not much related to the crash itself? I
>>> feel like it doesn't make much sense.
>>
>> It would be more interesting to know the caller of this function, etc.
>> It's possible that it branched to an invalid address and started
>> executing random code at some point, so the actual location of the
>> crash might not help, but things further up the stack might be useful.
>>
>>> - Regarding in-lining the raw data in the code, I'm still working on
>>>   it. Actually, I don't fully understand what you mean: are you
>>>   suggesting de-structuring the JSON format and inserting a list/array
>>>   of values directly into my code? Or copying the JSON output directly
>>>   into my code as a static variable? I've tried the latter and the
>>>   error persists. I will build the list of static values and see what
>>>   happens.
>>
>> Yes, I mean putting the json in your code, as
>>
>>     let raw_json = "..."
>>
>> If it still crashes with this, you can remove the database call. If it
>> still crashes, you can remove networking completely from your
>> unikernel. You can eliminate a lot of code quickly this way.
>>
>> If you can get a unikernel that just parses a JSON string and crashes,
>> other people can try it too and it should be easy to find the cause.
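To make that concrete, the stripped-down test could be as small as this
(an untested sketch: the nesting mirrors the results/series/values layout
in your parsing code, and the row count of 5000 is only picked so the
string crosses the ~100 KB threshold):

    (* Build a large JSON string in-code and parse it, with no
       networking involved. Row contents and count are illustrative. *)
    let raw_json =
      let row = "[1480000000000, 23.5]" in
      let rows = String.concat "," (Array.to_list (Array.make 5000 row)) in
      "{\"results\":[{\"series\":[{\"values\":[" ^ rows ^ "]}]}]}"

    let () =
      match Yojson.Basic.from_string raw_json with
      | `Assoc _ -> print_endline "parsed"
      | _ -> print_endline "unexpected shape"

If something like that still crashes as a unikernel but runs fine under
Unix, it's a one-file test case anyone can try.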
>>> Anyway, whatever I do with the retrieved JSON (even a List.iter with
>>> an empty function body), the unikernel crashes. I have the impression
>>> that as soon as I try to access the variable containing the JSON
>>> value, the crash is triggered.
>>>
>>> Best regards,
>>> Vittorio
>>>
>>>
>>> On 14/12/2016 13:45, Thomas Leonard wrote:
>>>>
>>>> On 14 December 2016 at 11:35, Vittorio Cozzolino
>>>> <vittorio.cozzolino@xxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>> I'm running a unikernel on Xen that basically accesses a remote DB,
>>>>> fetches and computes some data, and sends out the result.
>>>>> Apparently, if I try to fetch and parse a JSON response greater than
>>>>> an empirically found threshold (details at the bottom of the email),
>>>>> the PV Xen unikernel just crashes, and this is what I see when
>>>>> running `sudo xl dmesg`:
>>>>>
>>>>> (XEN) Pagetable walk from 00000000002c9ff8:
>>>>> (XEN)  L4[0x000] = 00000010b5f67067 0000000000000567
>>>>> (XEN)  L3[0x000] = 00000010b5f68067 0000000000000568
>>>>> (XEN)  L2[0x001] = 00000010b5f6a067 000000000000056a
>>>>> (XEN)  L1[0x0c9] = 00100010b1ac9025 00000000000002c9
>>>>> (XEN) domain_crash_sync called from entry.S: fault at ffff82d0802261be create_bounce_frame+0x66/0x13a
>>>>> (XEN) Domain 23 (vcpu#0) crashed on cpu#17:
>>>>> (XEN) ----[ Xen-4.6.0  x86_64  debug=n  Not tainted ]----
>>>>> (XEN) CPU:    17
>>>>> (XEN) RIP:    e033:[<0000000000258cf4>]
>>>>> (XEN) RFLAGS: 0000000000010206   EM: 1   CONTEXT: pv guest (d23v0)
>>>>> (XEN) rax: 0000000000258cf0   rbx: 0000000000000000   rcx: 0000000000000073
>>>>> (XEN) rdx: 0000000000442528   rsi: 0000000000000000   rdi: 00000000002ca018
>>>>> (XEN) rbp: 00000000002ca1e8   rsp: 00000000002ca000   r8:  0000000000000002
>>>>> (XEN) r9:  0000000000000007   r10: 0000000000000007   r11: 0000000000000000
>>>>> (XEN) r12: 00000000002ca118   r13: 0000000000000000   r14: 00000011238fa000
>>>>> (XEN) r15: 0000000000000074   cr0: 0000000080050033   cr4: 00000000001526e0
>>>>> (XEN) cr3: 00000010b5f66000   cr2: 00000000002c9ff8
>>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
>>>>> (XEN) Guest stack trace from rsp=00000000002ca000:
>>>>> (XEN)    00000000002ca118 0000000000000000 000000000025933f 0000000000000074
>>>>> (XEN)    00000011238fa000 0000000000000000 00000000002ca118 00000000002ca1e8
>>>>> (XEN)    0000000000000000 0000000000000000 0000000000000007 0000000000000007
>>>>> (XEN)    0000000000000002 ffff800000000000 0000000000000073 0000000000442528
>>>>> (XEN)    00000000002ca118 0000000000000000 ffffffffffffffff 0000000000256708
>>>>> (XEN)    000000010000e030 0000000000010006 00000000002ca0c8 000000000000e02b
>>>>> (XEN)    0000000000000ffc 3736353433323130 4645444342413938 4e4d4c4b4a494847
>>>>> (XEN)    00000000002ca18b 00000000002ca1e8 00000000002ca18a 0000000000000074
>>>>> (XEN)    00000000002566a0 00000000002ca118 00000000002561bc 7561662065676150
>>>>> (XEN)    696c20746120746c 646461207261656e 3062642073736572 706972202c306433
>>>>> (XEN)    2c38303736353220 3030207367657220 3030303030303030 202c383333616332
>>>>> (XEN)    6533616332207073 735f72756f202c38 3030303030302070 3261633230303030
>>>>> (XEN)    65646f63202c3866 ffffffff0a0d3020 0000000000000bfc 61665f686374614d
>>>>> (XEN)    0200006572756c69 0000000000000073 0000000000000000 ffffffffffffffef
>>>>> (XEN)    0000000000000000 00000000002ca2e8 0000000000000000 00000011238fa000
>>>>> (XEN)    0000000000000074 00000000002ca338 000000000025630a 636f6c625f737953
>>>>> (XEN)    0000003000000030 00000000002ca2e0 00000000002ca218 ffffffffffffffeb
>>>>> (XEN)    0000000000db03d0 0000000000256708 00000000002ca338 00000000002ca3e8
>>>>> (XEN)    00000000002ca2f8 ffffffffffffffe9 00000000000013fc 656e696665646e55
>>>>> (XEN)    7372756365725f64 75646f6d5f657669 050000000000656c 00000000003df368
>>>>>
>>>>> I've tried to destroy and create the same unikernel multiple times
>>>>> and I always get the same error. When running on Unix I don't bump
>>>>> into this issue, even when fetching and parsing multiple MB of data.
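(An aside that may help with reading the dump above: the stack words are
64-bit little-endian values, so the runs of ASCII-looking hex can be
decoded with a throwaway helper - a quick, untested sketch:)

    (* Decode one stack word back to ASCII by reading its hex byte
       pairs right to left (little-endian). *)
    let ascii_of_word w =
      let n = String.length w in
      String.init (n / 2) (fun i ->
        Char.chr (int_of_string ("0x" ^ String.sub w (n - 2 * (i + 1)) 2)))

    (* ascii_of_word "7561662065676150" = "Page fau" *)

Decoded that way, the run starting at `7561662065676150` reads roughly
"Page fault at linear address db03d0, rip 256708, regs ... sp 2ca3e8,
our_sp ... code 0" - that is, the guest's own fault message captured on
its stack - and `61665f686374614d 0200006572756c69` spells
"Match_failure".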
>>>>>
>>>>> By filling my code with logs, I figured out where exactly the
>>>>> unikernel stops: during the parsing of the JSON response (I'm using
>>>>> the Yojson library):
>>>>>
>>>>> let directExtractionn rawJson =
>>>>>   Log.info (fun f -> f "Initializing direct extraction");
>>>>>   let json = Yojson.Basic.from_string rawJson in
>>>>>   let result =
>>>>>     [json] |> filter_member "results" |> flatten
>>>>>            |> filter_member "series" |> flatten
>>>>>            |> filter_member "values" |> flatten
>>>>>   in
>>>>>   List.map
>>>>>     (fun item ->
>>>>>        let datapoint =
>>>>>          match item |> index 1 with
>>>>>          | `String a -> a
>>>>>          | `Float f -> string_of_float f
>>>>>          | `Int i -> string_of_float (float_of_int i)
>>>>>          | `Bool b -> string_of_bool b
>>>>>        in
>>>>>        datapoint)
>>>>>     result
>>>>>   |> computeAverage >>= fun aver ->
>>>>>   log_lwt ~inject:(fun f -> f "Result %f" aver)
>>>>>
>>>>> I know my code is probably not really optimized or clean, but I'm
>>>>> quite shocked to see that the unikernel crashes when it has to
>>>>> extract roughly 3500 datapoints (that is more or less the threshold
>>>>> at which it crashes). The function computeAverage is not even
>>>>> called. If I run the same code on Unix I can parse and process up to
>>>>> 1M datapoints in less than a second. I've also tried increasing the
>>>>> number of vcpus and the memory (16 vcpus and 4 GB of memory), but
>>>>> nothing changed.
>>>>>
>>>>> I would like to add that this threshold changes depending on the
>>>>> host machine:
>>>>>
>>>>> - Machine A (Ubuntu 14.04, Xen 4.6.0, 32 cores, 128 GB RAM, 10 Gb
>>>>>   network interface) -> threshold is around 107 KB
>>>>> - Machine B (Debian 8.5, Xen 4.4.1, 4 cores, 8 GB RAM, 1 Gb network
>>>>>   interface) -> threshold is around 33 KB
>>>>
>>>> Can you simplify the case? For example, instead of fetching the JSON,
>>>> what if you in-line the raw data in your code and parse that?
>>>>
>>>> Does adding a `Gc.full_major ()` just before the crash help? That
>>>> might indicate we're running out of memory and failing to run the GC
>>>> for some reason.
>>>>
>>>> You could also use `objdump -d` or similar on the unikernel image and
>>>> see what the addresses in the stack trace correspond to.
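Concretely, something like this (the image name is a placeholder for
whatever your build produced):

    objdump -d mir-unikernel.xen | less

then searching for `25605f:` and scrolling up to the nearest symbol tells
you which function contains that memcpy call. Doing the same for the
stack-dump words that look like code addresses (25933f, 256708, 2566a0,
2561bc, 25630a) should recover most of the call chain, since some of
those are likely return addresses.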
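One more small thing I noticed in directExtractionn: the match on
`item |> index 1` doesn't cover `Null, `List or `Assoc, so any unexpected
row raises Match_failure (which happens to be one of the strings visible
in the stack dump, for what that's worth). That's probably unrelated to
the page fault, but worth closing off while you minimise. A sketch,
assuming Yojson.Basic.Util is open as your use of filter_member/index
suggests (the helper name is mine):

    let string_of_datapoint item =
      match item |> index 1 with
      | `String a -> a
      | `Float f -> string_of_float f
      | `Int i -> string_of_float (float_of_int i)
      | `Bool b -> string_of_bool b
      | `Null -> "null"
      | json -> Yojson.Basic.to_string json  (* `List and `Assoc *)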
>>>
>>> --
>>> Vittorio Cozzolino, M.Eng.
>>> Technische Universität München - Institut für Informatik
>>> Office 01.05.041
>>> Boltzmannstr. 3, 85748 Garching, Germany
>>> Tel: +49 89 289-17356
>>> http://www.cm.in.tum.de/en/research-group/vittorio-cozzolino
>
> --
> Vittorio Cozzolino, M.Eng.
> Technische Universität München - Institut für Informatik
> Office 01.05.041
> Boltzmannstr. 3, 85748 Garching, Germany
> Tel: +49 89 289-17356
> http://www.cm.in.tum.de/en/research-group/vittorio-cozzolino

--
talex5 (GitHub/Twitter)        http://roscidus.com/blog/
GPG: 5DD5 8D70 899C 454A 966D  6A51 7513 3C8F 94F6 E0CC
GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel