5 Years of Linux Kernel Releases Benchmarked
An anonymous reader writes "Phoronix has published benchmarks of the past five years worth of Linux kernel releases, from the Linux 2.6.12 through Linux 2.6.37 (dev) releases. The results from these benchmarks of 26 versions show that, for the most part, new features haven't affected performance."
What about running the same study on the Windows kernel from XP to 7?
It seems almost every benchmark that had any difference was slower in more modern kernels. It's not all sunshine and roses.
They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?
If they test in a VM, on only one particular hardware configuration, then the results only apply to that specific test setup. If the fact that the experiments are run inside a VM introduces variability into the results, then this will show up as a large variance. However, having a larger variance does not in itself negate the results - but remember that the results can't be generalised to other configurations - they only apply to this particular setup.
In order to produce experimental results that can be generalised you need to run your experiments on a randomised configuration of hardware and VM host software. Either test every possible combination of factors - hardware, VM host sw, sw under test - (full factorial), or some subset (fractional factorial).
I'm usually one of the first to bash Phoronix for not doing multiple replicates or any statistical analysis of their experiments, but things appear to have changed this time. Some of the big criticisms of Phoronix's benchmarks in the past were that they didn't consider whether or not their results were significant - instead doing only one replicate for each configuration, plotting a barchart, and concluding "X was 5 FPS faster. Therefore it wins!" Apparently they're now doing multiple replicates and some proper statistics to calculate whether or not observed differences are actually statistically significant ("our kernel test results were automated, easily reproducible, and statistically significant"). Also the graphs are showing error bars +/- 1SD. This is good. This means that if you want to reproduce their experiments, it should be easy to do so. You can get an idea from the graphs whether a difference is significant.
(Having said that, I'm not sure why some of the data points don't have error bars - presumably the standard deviation was very low? I also can't see the number of replicates mentioned anywhere - maybe he used his "dynamic number of trials" scheme, but statistically speaking this may well be a bad thing - if he is using only 2 or 3 trials there is some probability of getting the first few samples with similar variance, he should probably stick to doing a fixed 10 to 30 replicates instead).
The "get to statistical variance" has been in Phoronix Test Suite for the better part of a year.
As part of the new work happening with Phoronix Test Suite, and the online aggregation site OpenBenchmarking.org, we'll be looking to expose the raw data and allow people to view a particular set of results in a possible more meaningful way. What is being examined now is raw data (scatter diagram), box plot (percentiles), violin plots (kernel function based), full standard error reporting (error bars, numerical reference to SD and SE.
Of course the general articles just show a simple form.
Obviously, infinite time and infinite runs with a broad variance of hardware would be better. As per usual, contact us at Phoronix with a fully baked suggestion for improvements in Phoronix Test Suite or a benchmark suggestion or article suggestion and we are more than willing to consider it.
In addition, a VM will use available assigned cores on the host, without locking them 1:1. This changes the behavior quite a bit, especially when it comes to CPU cache. The guest thinks it is running on the same core, but in reality it jumps between them, and has to reload from higher level cache or even memory.
Worse, from a benchmarking standpoint, hyperthreading will be exposed to the guest as separate CPUs. An intelligent scheduler would want to run distinct tasks on different cores, but can't do so in the VM.
And, it also depends on what other VMs are running on the hosts. Because virtual machines these days do "intelligent paging" and keep only one copy of identical pages. So if you're running two VMs with the same kernel or OS level, they're likely to run faster than two different OSes.
Anyhow, the test is horribly flawed from another point of view -- they test new hardware with old kernels. That's not fair, because the old kernels don't have optimizations for new hardware that didn't exist.
Anyhow, I'd be much more interested in finding out how old hardware would perform if upgrading to a new kernel. The more than 50% increase in size for a basic kernel over the last few years is probably why my old server with little RAM by today's standards runs faster on 2.6.17 than on 2.6.34. It's quite possible that the kernel itself runs faster, but if it leaves less memory for the system, and with the apps you run it starts swapping, it's overall going to be much slower.