
Re: oxenstored performance issue when starting VMs in parallel



We tested on the latest 4.14; same issue.

We tried an oxenstored replacement, lixs: https://github.com/cnplab/lixs

This basically solves the problem: no more 100% CPU (or only a few
spikes), and all the VMs are responsive!

One problem though: everything works fine except during "xl
destroy", where xl complains that it cannot delete the VIF interface.
The leaked VIFs accumulate and, after a few hours / days, dom0 starts
complaining about network interfaces and has to be rebooted...

So lixs is not a solution, and it has not been actively maintained or
developed for 4 years.
A supported Xen solution/workaround would be better...

Jerome

On Mon, Sep 21, 2020 at 17:25, Fanny Dwargee <fdwargee6@xxxxxxxxx> wrote:
>
>
>
> On Mon, Sep 21, 2020 at 15:10, jerome leseinne 
> (<jerome.leseinne@xxxxxxxxx>) wrote:
>>
>> Hello,
>>
>> We are developing a solution based on Xen 4.13 which is constantly
>> creating / destroying VMs.
>>
>> To summarize our lifecycle :
>>
>> - xl restore vmX
>> - xl cd-insert ....
>> - We do our stuff for ~ 2 minutes
>> - xl destroy vmX
>>
>> So our VMs have a life of approximately 2 minutes.
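>>
>> (Illustrative only: a minimal Python sketch of the lifecycle above,
>> driving xl via subprocess. The VM name, save image and ISO path are
>> placeholders, and "hdc" is just an example virtual device for cd-insert,
>> not our real configuration.)
>>
>> #!/usr/bin/env python3
>> # Minimal sketch of the restore / cd-insert / work / destroy cycle.
>> import subprocess
>> import time
>>
>> def run_cycle(vm_name, save_image, iso_path):
>>     subprocess.check_call(["xl", "restore", save_image])                  # xl restore vmX
>>     subprocess.check_call(["xl", "cd-insert", vm_name, "hdc", iso_path])  # xl cd-insert ...
>>     time.sleep(120)                                                       # our work, ~2 minutes
>>     subprocess.check_call(["xl", "destroy", vm_name])                     # xl destroy vmX
>>
>> if __name__ == "__main__":
>>     run_cycle("vmX", "/path/to/vmX.save", "/path/to/payload.iso")        # placeholder paths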
>>
>> The number of VMs we run in parallel depends on the underlying server.
>>
>> We are seeing the issue on our largest server, which runs 30 VMs
>> (HVM) in parallel.
>>
>> On this server, oxenstored is constantly running at 100% CPU usage and
>> some VMs are almost stuck or unresponsive.
>>
>> This is not a hardware issue: 72 Xeon cores, 160 GB of memory and a
>> very fast I/O subsystem.
>> Everything else is running smoothly on the server.
>>
>> What we see in xenstore-access.log is that the number of WATCH
>> events matches the number of currently running VMs,
>>
>> so, for example, a single WRITE event is followed by around 30 watch
>> events:
>>
>> [20200918T15:15:18.045Z]  A41354       write
>> /local/domain/0/backend/qdisk/1311/5632
>> [20200918T15:15:18.046Z]  A41248       w event
>> backend/qdisk/1311/5632 38ed11d9-9a38-4022-ad75-7c571d4886ed
>> [20200918T15:15:18.046Z]  A41257       w event
>> backend/qdisk/1311/5632 98fa91b8-e88b-4667-9813-d95196257288
>> [20200918T15:15:18.046Z]  A40648       w event
>> backend/qdisk/1311/5632 e6fd9a35-61ec-4750-93eb-999fb7f662fc
>> [20200918T15:15:18.046Z]  A40542       w event
>> backend/qdisk/1311/5632 6a39c858-2fd4-46e4-a810-485a41328f8c
>> [20200918T15:15:18.046Z]  A41141       w event
>> backend/qdisk/1311/5632 8762d552-b4b4-41ef-a2aa-23700f790ea2
>> [20200918T15:15:18.046Z]  A41310       w event
>> backend/qdisk/1311/5632 4dc2a9ae-6388-4b0c-9c98-df3c897a832f
>> [20200918T15:15:18.046Z]  A40660       w event
>> backend/qdisk/1311/5632 6abf244d-5939-4540-b176-4ec7d14b392c
>> [20200918T15:15:18.046Z]  A41347       w event
>> backend/qdisk/1311/5632 ecb93157-9929-43e2-8ed4-f5e78ab2f37d
>> [20200918T15:15:18.046Z]  A41015       w event
>> backend/qdisk/1311/5632 a1fec49f-e7cc-4059-87d3-ce43f386746e
>> [20200918T15:15:18.046Z]  A41167       w event
>> backend/qdisk/1311/5632 e9419014-9fd2-47c0-b79d-30f99d9530d6
>> [20200918T15:15:18.046Z]  A41100       w event
>> backend/qdisk/1311/5632 a2754a91-ecd6-4b6b-87ea-b68db8b888df
>> [20200918T15:15:18.046Z]  A41147       w event
>> backend/qdisk/1311/5632 176a1c3c-add7-4710-a7ee-3b5548d7a56a
>> [20200918T15:15:18.046Z]  A41305       w event
>> backend/qdisk/1311/5632 afe7933b-c92d-4403-8d6c-2e530558c937
>> [20200918T15:15:18.046Z]  A40616       w event
>> backend/qdisk/1311/5632 35fa45e0-21e8-4666-825b-0c3d629f378d
>> [20200918T15:15:18.046Z]  A40951       w event
>> backend/qdisk/1311/5632 230eb42f-d700-46ce-af61-89242847a978
>> [20200918T15:15:18.046Z]  A40567       w event
>> backend/qdisk/1311/5632 39cc7ffb-5045-4120-beb7-778073927c93
>> [20200918T15:15:18.046Z]  A41363       w event
>> backend/qdisk/1311/5632 9e42e74a-80fb-46e8-81f2-718628bf70f6
>> [20200918T15:15:18.046Z]  A40740       w event
>> backend/qdisk/1311/5632 1a64af31-fee6-45be-b8d8-c98baa5e162f
>> [20200918T15:15:18.046Z]  A40632       w event
>> backend/qdisk/1311/5632 466ef522-cb76-4117-8e93-42471897c353
>> [20200918T15:15:18.046Z]  A41319       w event
>> backend/qdisk/1311/5632 19ea986b-e303-4180-b833-c691b2b32819
>> [20200918T15:15:18.046Z]  A40677       w event
>> backend/qdisk/1311/5632 fb01629a-033b-41d6-8349-cec82e570238
>> [20200918T15:15:18.046Z]  A41152       w event
>> backend/qdisk/1311/5632 84ce9e29-a5cc-42a1-a47b-497b95767885
>> [20200918T15:15:18.047Z]  A41233       w event
>> backend/qdisk/1311/5632 ea944ad3-3af6-4688-8076-db1eac25d8e9
>> [20200918T15:15:18.047Z]  A41069       w event
>> backend/qdisk/1311/5632 ce57e169-e1ea-4fb5-b97f-23e651f49d79
>> [20200918T15:15:18.047Z]  A41287       w event
>> backend/qdisk/1311/5632 d31110c8-ae0b-4b9d-b71f-aa2985addd1a
>> [20200918T15:15:18.047Z]  A40683       w event
>> backend/qdisk/1311/5632 f0e4b0a0-fad0-4bb7-b01e-b8a31107ba3d
>> [20200918T15:15:18.047Z]  A41177       w event
>> backend/qdisk/1311/5632 9ff80e49-4cca-4ec9-901a-d30198104f29
>> [20200918T15:15:18.047Z]  D0           w event
>> backend/qdisk/1311/5632 FFFFFFFF8276B520
>> [20200918T15:15:18.047Z]  A40513       w event
>> backend/qdisk/1311/5632 d35a9a42-c15e-492c-a70d-d8b20bafec8f
>> [20200918T15:15:18.047Z]  A41354       w event
>> backend/qdisk/1311/5632 e4456ca4-70f4-4afc-9ba1-4a1cfd74c8e6
>>
>> We are not sure this is the root cause of the issue, but it is the
>> only real difference we can see in the log.
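>>
>> (For reference, a minimal sketch of how one could count watch events per
>> write from xenstore-access.log. It assumes one record per line as in the
>> raw log, the records above being only line-wrapped by mail, and the
>> path-containment check is just a rough heuristic.)
>>
>> #!/usr/bin/env python3
>> # Rough count of "w event" records fired after each "write" record, to
>> # confirm the ~30-events-per-write pattern.
>> import re
>> import sys
>> from collections import defaultdict
>>
>> events_per_write = defaultdict(int)   # write path -> watch events seen after it
>> current_write = None
>>
>> with open(sys.argv[1]) as log:
>>     for line in log:
>>         m = re.search(r"\]\s+\S+\s+(write|w event)\s+(\S+)", line)
>>         if not m:
>>             continue
>>         op, path = m.group(1), m.group(2)
>>         if op == "write":
>>             current_write = path
>>         elif current_write is not None and path in current_write:
>>             # watch event paths here are relative, e.g. backend/qdisk/1311/5632
>>             events_per_write[current_write] += 1
>>
>> for path, count in sorted(events_per_write.items(), key=lambda kv: -kv[1])[:10]:
>>     print(count, path)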
>>
>> We don't understand why the number of WATCH events is related to the
>> number of concurrently running VMs.
>> A watch should be registered and fired only for the current
>> domain ID, so a write to a specific node path should trigger only one
>> watch event, not 30 as in our case.
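>>
>> (Purely as an illustration of why one write could legitimately produce N
>> watch events: with xenstore watch semantics, every connection holding a
>> watch on the written node or one of its ancestors gets its own event. A
>> toy Python model, not oxenstored's actual OCaml code; the watch path and
>> tokens below are made up.)
>>
>> class Connection:
>>     """One xenstore client connection holding (path, token) watches."""
>>     def __init__(self, name):
>>         self.name = name
>>         self.watches = []                       # list of (path, token)
>>
>>     def add_watch(self, path, token):
>>         self.watches.append((path, token))
>>
>>     def fire(self, path, token):
>>         print(f"{self.name}: w event {path} {token}")
>>
>> def dispatch_write(connections, written_path):
>>     """Fire one event per matching watch, i.e. per watching connection."""
>>     for conn in connections:
>>         for watched_path, token in conn.watches:
>>             # A watch matches writes to the watched node or anything below it.
>>             if written_path == watched_path or written_path.startswith(watched_path + "/"):
>>                 conn.fire(written_path, token)
>>
>> # 30 connections, each watching the same backend subtree (hypothetical
>> # watch path), receive 30 events for a single write:
>> conns = [Connection(f"A{41000 + i}") for i in range(30)]
>> for c in conns:
>>     c.add_watch("/local/domain/0/backend", f"token-{c.name}")
>> dispatch_write(conns, "/local/domain/0/backend/qdisk/1311/5632")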
>>
>> Any ideas / comments ?
>>
>> Thanks
>>
>> Jerome Leseinne
>>
>
> Jerome,
> we are experiencing very similar issues on Xen v4.12.3 (Debian 10.4) with a 
> similar setup (128 GB RAM, 48 cores). In our case we start and stop dozens of 
> HVM VMs in parallel, restoring each from a memory save file and automatically 
> analyzing a software's behaviour inside the guest for a few minutes.
>
> Any ideas/comments for improving oxenstored performance will be very 
> welcome.
>