
Re: [Xen-devel] 4.2.1: Poor write performance for DomU.



Steven Haigh <netwiz@xxxxxxxxx> wrote:
>On 06/09/13 23:33, Konrad Rzeszutek Wilk wrote:
>> On Thu, Sep 05, 2013 at 06:28:25PM +1000, Steven Haigh wrote:
>>> On 21/08/13 02:48, Konrad Rzeszutek Wilk wrote:
>>>> On Mon, Mar 25, 2013 at 01:21:09PM +1100, Steven Haigh wrote:
>>>>> So, based on my tests yesterday, I decided to break the RAID6 and
>>>>> pull a drive out of it to test directly on the 2Tb drives in
>>>>> question.
>>>>>
>>>>> The array in question:
>>>>> # cat /proc/mdstat
>>>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>>>> md2 : active raid6 sdd[4] sdc[0] sde[1] sdf[5]
>>>>>       3907026688 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]
>>>>>
>>>>> # mdadm /dev/md2 --fail /dev/sdf
>>>>> mdadm: set /dev/sdf faulty in /dev/md2
>>>>> # mdadm /dev/md2 --remove /dev/sdf
>>>>> mdadm: hot removed /dev/sdf from /dev/md2
>>>>>
>>>>> So, all tests are to be done on /dev/sdf.
>>>>> Model Family:     Seagate SV35
>>>>> Device Model:     ST2000VX000-9YW164
>>>>> Serial Number:    Z1E17C3X
>>>>> LU WWN Device Id: 5 000c50 04e1bc6f0
>>>>> Firmware Version: CV13
>>>>> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
>>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>>
>>>>> From the Dom0:
>>>>> # dd if=/dev/zero of=/dev/sdf bs=1M count=4096 oflag=direct
>>>>> 4096+0 records in
>>>>> 4096+0 records out
>>>>> 4294967296 bytes (4.3 GB) copied, 30.7691 s, 140 MB/s
>>>>>
>>>>> Create a single partition on the drive, and format it with ext4:
>>>>> Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes
>>>>> 255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
>>>>> Units = sectors of 1 * 512 = 512 bytes
>>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>>> Disk identifier: 0x98d8baaf
>>>>>
>>>>>    Device Boot      Start         End      Blocks   Id  System
>>>>> /dev/sdf1            2048  3907029167  1953513560   83  Linux
>>>>>
>>>>> Command (m for help): w
>>>>>
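
As a side note on the 4K physical sectors: partition alignment can be sanity-checked once the partition exists (a quick cross-check sketch; a start sector of 2048 is a multiple of 8, i.e. already 4 KiB-aligned):

# blockdev --getalignoff /dev/sdf1            <- 0 means no misalignment penalty
# cat /sys/block/sdf/sdf1/alignment_offset    <- the same figure via sysfs
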
>>>>> # mkfs.ext4 -j /dev/sdf1
>>>>> ......
>>>>> Writing inode tables: done
>>>>> Creating journal (32768 blocks): done
>>>>> Writing superblocks and filesystem accounting information: done
>>>>>
>>>>> Mount it on the Dom0:
>>>>> # mount /dev/sdf1 /mnt/esata/
>>>>> # cd /mnt/esata/
>>>>> # bonnie++ -d . -u 0:0
>>>>> ....
>>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>>>> xenhost.lan.crc. 2G   425  94 133607  24 60544  12   973  95 209114  17 296.4   6
>>>>> Latency             70971us     190ms     221ms   40369us   17657us     164ms
>>>>>
>>>>> So from the Dom0: 133MB/sec write, 209MB/sec read.
>>>>>
>>>>> Now, I'll attach the full disk to a DomU:
>>>>> # xm block-attach zeus.vm phy:/dev/sdf xvdc w
>>>>>
>>>>> And we'll test from the DomU.
>>>>>
>>>>> # dd if=/dev/zero of=/dev/xvdc bs=1M count=4096 oflag=direct
>>>>> 4096+0 records in
>>>>> 4096+0 records out
>>>>> 4294967296 bytes (4.3 GB) copied, 32.318 s, 133 MB/s
>>>>>
>>>>> Partition the same as in the Dom0 and create an ext4 filesystem on it:
>>>>>
>>>>> I notice something interesting here. In the Dom0, the device is seen as:
>>>>> Units = sectors of 1 * 512 = 512 bytes
>>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>>>
>>>>> In the DomU, it is seen as:
>>>>> Units = sectors of 1 * 512 = 512 bytes
>>>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>>>>
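
A quick way to compare how each side reports the geometry (a sketch using util-linux's blockdev, assumed to be available in both the Dom0 and the guest; the flags print logical sector, physical sector, minimum I/O and optimal I/O sizes):

# blockdev --getss --getpbsz --getiomin --getioopt /dev/sdf     <- from the Dom0
# blockdev --getss --getpbsz --getiomin --getioopt /dev/xvdc    <- from the DomU

The same values are exported under /sys/block/<dev>/queue/{logical_block_size,physical_block_size,minimum_io_size,optimal_io_size}.
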
>>>>> Not sure if this could be related - but continuing testing:
>>>>>     Device Boot      Start         End      Blocks   Id  System
>>>>> /dev/xvdc1            2048  3907029167  1953513560   83  Linux
>>>>>
>>>>> # mkfs.ext4 -j /dev/xvdc1
>>>>> ....
>>>>> Allocating group tables: done
>>>>> Writing inode tables: done
>>>>> Creating journal (32768 blocks): done
>>>>> Writing superblocks and filesystem accounting information: done
>>>>>
>>>>> # mount /dev/xvdc1 /mnt/esata/
>>>>> # cd /mnt/esata/
>>>>> # bonnie++ -d . -u 0:0
>>>>> ....
>>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>>>> zeus.crc.id.au   2G   396  99 116530  23 50451  15  1035  99 176407  23 313.4   9
>>>>> Latency             34615us     130ms     128ms   33316us   74401us     130ms
>>>>>
>>>>> So still... 116MB/sec write, 176MB/sec read to the physical device
>>>>> from the DomU. More than acceptable.
>>>>>
>>>>> It leaves me to wonder.... Could the Dom0 seeing the drives as
>>>>> 4096 byte sectors, while the DomU sees them as 512 byte sectors,
>>>>> cause an issue?
>>>>
>>>> There is some overhead in that. I still have this in my mailbox,
>>>> so I am not sure whether this issue was ever resolved. I know that the
>>>> indirect descriptor patches in xen-blkback and xen-blkfront are meant to
>>>> resolve some of these issues - by being able to carry a bigger payload.
>>>>
>>>> Did you ever try v3.11 kernel in both dom0 and domU? Thanks.
>>>
>>> Ok, so I finally got around to building kernel 3.11 RPMs today for
>>> testing. I upgraded both the Dom0 and DomU to the same kernel:
>> 
>> Woohoo!
>>>
>>> DomU:
>>> # dmesg | grep blkfront
>>> blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
>>> blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
>>>
>>> Looks good.
>>>
>>> Transfer tests using bonnie++ as per before:
>>> # bonnie -d . -u 0:0
>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>> zeus.crc.id.au   2G   603  92 58250   9 62248  14   886  99 295757  30 492.3  13
>>> Latency             27305us     124ms     158ms   34222us   16865us     374ms
>>> Version  1.96       ------Sequential Create------ --------Random Create--------
>>> zeus.crc.id.au      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>                  16 10048  22 +++++ +++ 17849  29 11109  25 +++++ +++ 18389  31
>>> Latency             17775us     154us     180us   16008us      38us      58us
>>>
>>> Still seems to be a massive discrepancy between Dom0 and DomU write
>>> speeds. Interestingly, sequential block reads are nearly 300MB/sec,
>>> yet sequential writes were only ~58MB/sec.
>> 
>> OK, so the other thing that people were pointing out is that you
>> can use the xen-blkfront.max parameter. By default it is 32, but try 8.
>> Or 64. Or 256.
>
>Ahh - interesting.
>
>I used the following:
>Kernel command line: ro root=/dev/xvda rd_NO_LUKS rd_NO_DM
>LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us
>crashkernel=auto console=hvc0 xen-blkfront.max=X
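
For anyone repeating this, the value that actually took effect can be read back after each reboot - a sketch, assuming the 3.11 frontend exposes the module parameter read-only under sysfs:

# grep -o 'xen-blkfront.max=[0-9]*' /proc/cmdline    <- what was requested at boot
# cat /sys/module/xen_blkfront/parameters/max        <- what blkfront is actually using
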
>
>8:
>Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>zeus.crc.id.au   2G   696  92 50906   7 46102  11  1013  97 256784  27 496.5  10
>Latency             24374us     199ms     117ms   30855us   38008us   85175us
>
>16:
>Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>zeus.crc.id.au   2G   675  92 58078   8 57585  13  1005  97 262735  25 505.6  10
>Latency             24412us     187ms     183ms   23661us   53850us     232ms
>
>32:
>Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>zeus.crc.id.au   2G   698  92 57416   8 63328  13  1063  97 267154  24 498.2  12
>Latency             24264us     199ms   81362us   33144us   22526us     237ms
>
>64:
>Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>zeus.crc.id.au   2G   574  86 88447  13 68988  17   897  97 265128  27 493.7  13
>
>128:
>Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>zeus.crc.id.au   2G   702  97 107638  14 70158  15  1045  97 255596  24 491.0  12
>Latency             27279us   17553us     134ms   29771us   38392us   65761us
>
>256:
>Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>zeus.crc.id.au   2G   689  91 102554  14 67337  15  1012  97 262475  24 484.4  12
>Latency             20642us     104ms     189ms   36624us   45286us   80023us
>
>So, as a nice summary:
>8: 50MB/sec
>16: 58MB/sec
>32: 57MB/sec
>64: 88MB/sec
>128: 107MB/sec
>256: 102MB/sec
>
>So, maybe it's coincidence, maybe it isn't - but the best (factoring in
>margin of error) seems to be 128 - which happens to match the 128k chunk
>size of the underlying RAID6 array on the Dom0.
>
># cat /proc/mdstat
>md2 : active raid6 sdd[5] sdc[4] sdf[1] sde[0]
>     3906766592 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]
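
The arithmetic behind that hunch, as a rough sketch: each indirect segment covers a 4 KiB page, so max=N allows up to N x 4 KiB per request. max=128 gives 512 KiB, a whole multiple of the 128 KiB chunk and of the 256 KiB full stripe (two data disks x 128 KiB on a 4-disk RAID6), which favours full-stripe writes over read-modify-write. The array geometry can be read back from sysfs or mdadm:

# cat /sys/block/md2/md/chunk_size            <- chunk size in bytes (131072 = 128 KiB)
# cat /sys/block/md2/queue/optimal_io_size    <- preferred full-stripe write size in bytes
# mdadm --detail /dev/md2 | grep 'Chunk Size'
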
>
>> The indirect descriptor allows us to put more I/Os on the ring - and
>> I am hoping that will:
>>  a) solve your problem
>
>Well, it looks like this solves the issue - at least increasing the max
>gives almost double the write speed - and no change to read speeds
>(within margin of error).
>
>>  b) not solve your problem, but demonstrate that the issue is not with
>>     the ring, but with something else making your writes slower.
>> 
>> Hmm, are you by any chance using O_DIRECT when running bonnie++ in
>> dom0? The xen-blkback tacks on O_DIRECT to all write requests. This is
>> done to not use the dom0 page cache - otherwise you end up with
>> a double buffer where the writes are insane speed - but with absolutely
>> no safety.
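
The double-buffering effect is easy to see from dom0 with dd (an illustration only - these write to the scratch disk used earlier and the exact figures will vary):

# dd if=/dev/zero of=/dev/sdf bs=1M count=4096                    <- buffered: the dom0 page cache absorbs the writes
# dd if=/dev/zero of=/dev/sdf bs=1M count=4096 conv=fdatasync     <- buffered, but timed until the data is on disk
# dd if=/dev/zero of=/dev/sdf bs=1M count=4096 oflag=direct       <- O_DIRECT: bypasses the page cache, as blkback does for guest writes
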
>> 
>> If you want to try disabling that (so no O_DIRECT), I would do this
>> little change:
>> 
>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>> index bf4b9d2..823b629 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -1139,7 +1139,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>                 break;
>>         case BLKIF_OP_WRITE:
>>                 blkif->st_wr_req++;
>> -               operation = WRITE_ODIRECT;
>> +               operation = WRITE;
>>                 break;
>>         case BLKIF_OP_WRITE_BARRIER:
>>                 drain = true;
>
>With the above results, is this still useful?

No. There is no need. Awesome that this fixed it. Roger had mentioned that
he had seen similar behavior. We should probably do a patch that interrogates
the backend for the optimal segment size and informs the frontend - so it can
set it appropriately.
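
In the meantime, what the backend currently advertises can be inspected by hand from dom0 - a sketch, where 51744 is the conventional virtual device number for xvdc (adjust for your device) and the exact keys (sectors, sector-size, the feature-* flags) depend on the blkback version:

# xenstore-ls /local/domain/0/backend/vbd/$(xm domid zeus.vm)/51744
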

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

