
RE: [Xen-devel] open/stat64 syscalls run faster on Xen VM than standard Linux



> -----Original Message-----
> From: xuehai zhang [mailto:hai@xxxxxxxxxxxxxxx] 
> Sent: 28 November 2005 17:17
> To: Petersson, Mats
> Cc: Anthony Liguori; Xen Mailing List; Kate Keahey; Tim Freeman
> Subject: Re: [Xen-devel] open/stat64 syscalls run faster on 
> Xen VM than standard Linux
> 
> Petersson, Mats wrote:
> >>-----Original Message-----
> >>From: xuehai zhang [mailto:hai@xxxxxxxxxxxxxxx]
> >>Sent: 28 November 2005 15:51
> >>To: Petersson, Mats
> >>Cc: Anthony Liguori; Xen Mailing List; Kate Keahey; Tim Freeman
> >>Subject: Re: [Xen-devel] open/stat64 syscalls run faster on Xen VM 
> >>than standard Linux
> >>
> >>Mats,
> >>
> >>I mounted the loopback file in dom0, chrooted to the mountpoint and
> >>redid the experiment. The results are attached below. The time of the
> >>open and stat64 calls is similar to the XenLinux case and also much
> >>smaller than the standard Linux case. So, either using a loopback
> >>file as the backend of XenLinux or directly mounting it in the local
> >>filesystem will result in some benefit (maybe just caused by the
> >>extra layer of block caching) for the performance of some system
> >>calls.
> > 
> > 
> > Yes, I think the caching of the blocks in two layers will be the
> > reason you get this effect. The loopback file is cached once in the
> > fs handling the REAL HARD DISK, and then other blocks would be
> > cached in the fs handling the loopback.
> 
> Is "the fs handling the REAL HD" the dom0's filesystem? Is 
> the cache used here is the dom0's disk buffer cache or 
> something else? What is "the fs handling the loopback"? Is 
> the filesystem seen inside of the XenLinux or still the 
> filesystem of dom0? What is the cache used in this case?

In both native Linux and XenLinux, it's the file-system that handles the
actual disk where the VBD file lives. The software that manages the
file-system, aka the file-system driver, has a block cache. Since you're
then mounting the VBD through a different (maybe the same type of)
file-system driver, which handles this "disk", that driver has its own
additional set of block caching. Of course, if a block is cached in the
upper-level (VBD) file-system driver, the lower-level file-system driver
doesn't get the call, and thus no action needs to be taken there. It
also means that any access to a DIFFERENT block in the lower-level
driver can be cached, because there's now more space in that cache,
since the first cached access didn't even touch the second driver. It's
similar to doubling the block cache. [The exact effect of doubling the
block cache would depend on its internal architecture, and I know
NOTHING about that, so I can't say whether it would have the same effect
or a different one.]
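For reference, roughly how that doubly-cached setup looks in dom0 (just
a sketch of the idea - the image path /var/tmp/vbd.img and mount point
/mnt/vbd are placeholder names, and you'd need a minimal root
file-system, including strace, copied into the image before the chroot
step works):

# Create a file-backed "disk" on the real file-system; blocks of this
# file are cached by the lower-level (real disk) file-system driver.
dd if=/dev/zero of=/var/tmp/vbd.img bs=1M count=512
mke2fs -j -F /var/tmp/vbd.img

# Mount it through the loop driver; everything inside this mount goes
# through a second file-system driver with its own block cache.
mkdir -p /mnt/vbd
mount -o loop /var/tmp/vbd.img /mnt/vbd

# Copy a minimal root file-system into /mnt/vbd, then chroot in and
# rerun the benchmark so it only touches the doubly-cached "disk".
chroot /mnt/vbd /bin/sh -c "strace -c /bin/sh -c /bin/echo foo"
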
>
> > In this case the directory of the file(s) involved in your benchmark
> > is probably held entirely in memory, whilst when you use a real disk
> > to do the same thing, you could end up with some "real" accesses to
> > the disk device itself.
> 
> To confirm our hypothesis that two-layer block caching is the real
> cause, what experiments can I do to show that a block is read from a
> cache instead of the hard disk on XenLinux, but has to be read from
> the hard disk on standard Linux? Maybe I can use "vmstat" in dom0 to
> track blocks received/sent during the execution of the benchmark.

I think we've just proven that it's the VBD that causes the speed
difference. Within a few percent, your last test showed the same numbers
as the Xen+Linux combination - writes being slower on Xen+Linux, which
is what I'd expect, but otherwise the system calls are pretty much the
same.

I'd expect execve to be much slower on Xen+Linux, as it's messing about
with loading an application and doing page-table manipulations.

We get similar times for old_mmap, open is about the same, read is
reasonably close (slower on XenLinux, also expected, I think), and
similarly for brk, munmap, etc. So the main difference between XenLinux
and native Linux is that you're running one with two levels of
block-caching; if you don't use the VBD/loopback-mounted file as your
"disk", you only get one level of caching and thus slower access.

> 
> > Next question will probably be why write is slower in Xen+Linux than
> > native Linux - something I can't say for sure, but I would expect it
> > to be because the write is going through Xen in the Xen+Linux case
> > and straight through Linux in the native Linux case. But that's just
> > a guess. [And since it's slower in Xen, I don't expect you to be
> > surprised by this]. And the write call is almost identical to the
> > native Linux one, as you'd expect.
> 
> I also agree that the overhead of the write system call in the VM is
> caused by Xen. I actually ran a "dd" benchmark to create a disk file
> from /dev/zero on both machines, and the VM is slower than the
> physical machine, as we expected.
> 
> So, the benchmark experiments I've done so far suggest that XenLinux
> using loopback files as VBD backends shows better performance (faster
> execution) on some of the system calls, like open and stat64, but
> worse performance (slower execution) than standard Linux on other
> system calls, like write. Does this mean different applications may
> have different execution behaviors on the VM than on standard Linux?
> In other words, some applications run faster on the VM and some
> slower, compared with the physical machine?

Yes, but for larger (i.e., more realistic) workloads it would probably
even out, since the advantage of any cache only really shows if you get
lots of hits in it that the smaller cache alone couldn't have provided.
If you have a cache of, say, 2MB, then the difference between 2MB and
4MB when you access 3.5MB of data would be very noticeable [numbers just
pulled from thin air, I have no idea how much data a test case such as
yours will access]. If, on the other hand, you run a full Linux kernel
compile that accesses several hundred megabytes of source code and
include files, and generates several megabytes of object files, symbol
files and the end-product binary, it would be much less noticeable
whether the cache is 2MB, 4MB or even 32MB. And of course, this would
perform a lot of read/write operations that would very likely eat up any
of the previously gained performance benefit.

However, performing REAL tasks would be a much better benchmark of Xen
performance than microbenchmarking some particular system call such as
popen, since the REAL task would actually be something that a Xen-Linux
user would perhaps do in their daily work. [Although compiling the
kernel is fairly unlikely, I can certainly see people doing
code-development under Xen-Linux].
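Something along these lines would do (only a sketch - /usr/src/linux is
an assumed location for the kernel tree, and the -j value should match
the number of CPUs each system gets):

# Time the same kernel build on native Linux and on XenLinux, then
# compare the real/user/sys figures rather than individual syscalls.
cd /usr/src/linux
make clean > /dev/null
time make -j2 bzImage > /dev/null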

Running a large database under Xen-Linux would be another likely task,
or a web-server. Some of these would obviously take longer, others may
run faster; I have no idea which it is. My point is rather that you're
analyzing ONE C-library function and drawing big conclusions from it.
Running a wider spread of loads on the system would give more accurate
average results. Averages are good for describing something where you
have many different samples, but as my father once said, "If you stand
with one foot in ice and the other in hot water, on average your feet
are comfortable". So don't JUST look at the average; look at the max/min
values too to determine how things stand.
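For instance (a quick sketch, assuming GNU time is installed as
/usr/bin/time and using /tmp/times.txt as a scratch file), repeat the
run a couple of dozen times and look at the spread as well as the mean:

# Record the elapsed time of each run, then print min, max and average.
rm -f /tmp/times.txt
for i in $(seq 1 20); do
    /usr/bin/time -f "%e" /bin/sh -c /bin/echo foo \
        > /dev/null 2>> /tmp/times.txt
done
sort -n /tmp/times.txt | awk '{ sum += $1; v[NR] = $1 }
    END { printf "min=%s max=%s avg=%.4f\n", v[1], v[NR], sum / NR }'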

Yes, I do realize that testing the system (real-world) performance with
such tests as compiling the Linux kernel, or running an SQL database or
web-server, is more complicated, involves less (or no) code-writing - so
it's less fun - will not test PARTICULAR points of the kernel, and will
spend a lot of time in user mode running exactly identical code on both
systems. However, I think it's a better way to get a BIG-picture
understanding of the differences between Xen and native Linux.

Btw, I would also suggest that it would make sense to test on a MODERN
system - a Pentium III must be at least 5 years old by now, and if the
rest of the components are of a similar age, then I suspect that the
performance of a modern machine is at least 2-3x better.


--
Mats
> 
> Thanks.
> 
> Xuehai
> 
> >># strace -c /bin/sh -c /bin/echo foo
> >>
> >>% time     seconds  usecs/call     calls    errors syscall
> >>------ ----------- ----------- --------- --------- ----------------
> >>  21.93    0.000490         490         1           write
> >>  16.34    0.000365          24        15           old_mmap
> >>  15.26    0.000341          38         9         3 open
> >>   9.62    0.000215          43         5           read
> >>   7.97    0.000178          10        18           brk
> >>   7.79    0.000174          87         2           munmap
> >>   4.07    0.000091           8        11           rt_sigaction
> >>   3.27    0.000073          12         6           close
> >>   2.91    0.000065          11         6           fstat64
> >>   2.28    0.000051           9         6           rt_sigprocmask
> >>   2.15    0.000048          24         2           access
> >>   1.75    0.000039          13         3           uname
> >>   1.66    0.000037          19         2           stat64
> >>   0.40    0.000009           9         1           getpgrp
> >>   0.40    0.000009           9         1           getuid32
> >>   0.36    0.000008           8         1           time
> >>   0.36    0.000008           8         1           getppid
> >>   0.36    0.000008           8         1           getgid32
> >>   0.31    0.000007           7         1           getpid
> >>   0.27    0.000006           6         1           execve
> >>   0.27    0.000006           6         1           geteuid32
> >>   0.27    0.000006           6         1           getegid32
> >>------ ----------- ----------- --------- --------- ----------------
> >>100.00    0.002234                    95         3 total
> >>
> >>Thanks.
> >>
> >>Xuehai
> >>
> >>
> >>Petersson, Mats wrote:
> >>
> >>>>-----Original Message-----
> >>>>From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >>>>[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
> >>>>Anthony Liguori
> >>>>Sent: 28 November 2005 14:39
> >>>>To: xuehai zhang
> >>>>Cc: Xen Mailing List
> >>>>Subject: Re: [Xen-devel] open/stat64 syscalls run faster on Xen VM
> >>>>than standard Linux
> >>>>
> >>>>This may just be the difference between having the extra level of
> >>>>block caching from using a loopback device.
> >>>>
> >>>>Try running the same benchmark on a domain that uses an actual
> >>>>partition.  While the syscalls may appear to be faster, I imagine
> >>>>it's because the cost of pulling in a block has already been paid,
> >>>>so the overall workload is unaffected.
> >>>
> >>>
> >>>And this would be the same as running standard Linux with the
> >>>loopback file-system mounted and chroot to the local file-system,
> >>>or would that be different? [I'm asking because I don't actually
> >>>understand enough about how it works to know what difference it
> >>>makes, and I would like to know, because at some point I'll
> >>>probably need to know this.]
> >>>
> >>>--
> >>>Mats
> >>>
> >>>
> >>>>Regards,
> >>>>
> >>>>Anthony Liguori
> >>>>
> >>>>xuehai zhang wrote:
> >>>
> >>>[snip]
> >>>
> >>>
> >>
> >>
> >>
> > 
> > 
> 
> 
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

