5 Years of Linux Kernel Releases Benchmarked

← Back to Stories (view on slashdot.org)

5 Years of Linux Kernel Releases Benchmarked

Posted by samzenpus on Wednesday November 3, 2010 @06:49PM from the line-them-up dept.

An anonymous reader writes "Phoronix has published benchmarks of the past five years worth of Linux kernel releases, from the Linux 2.6.12 through Linux 2.6.37 (dev) releases. The results from these benchmarks of 26 versions show that, for the most part, new features haven't affected performance."

6 of 52 comments (clear)

Min score:

Reason:

Sort:

Windows Kernels by Anonymous Coward · 2010-11-04 01:32 · Score: 5, Interesting

What about running the same study on the Windows kernel from XP to 7?
Results don't support conclusion by QuantumBeep · 2010-11-04 01:53 · Score: 2, Interesting

It seems almost every benchmark that had any difference was slower in more modern kernels. It's not all sunshine and roses.
1. Re:Results don't support conclusion by CAIMLAS · 2010-11-04 07:10 · Score: 2, Interesting
  
  Not only that, but they only looked at the kernel with a specific version of GCC. Due to this, the performance differences could theoretically be not only accounted for by minute differences in how the compiler handles things.
  The bigger thing with Linux performance isn't just the kernel - it's the entire stack. You've got the kernel, sure - and then you've got the core libraries (glibc, etc.) and the compiler which built them. These all can change performance significantly, and in real-world environments, the two are usually associated.
  I'd be interested in seeing the results if they went back and looked at the kernel readme files and applied "requires version x or newer of y" and built everything that way. I suspect you'd see a performance curve inversely related to the kernel version.
  
  --
  ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Re:Virtual machine, really? by chrb · 2010-11-04 03:18 · Score: 2, Interesting

They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?
If they test in a VM, on only one particular hardware configuration, then the results only apply to that specific test setup. If the fact that the experiments are run inside a VM introduces variability into the results, then this will show up as a large variance. However, having a larger variance does not in itself negate the results - but remember that the results can't be generalised to other configurations - they only apply to this particular setup.
In order to produce experimental results that can be generalised you need to run your experiments on a randomised configuration of hardware and VM host software. Either test every possible combination of factors - hardware, VM host sw, sw under test - (full factorial), or some subset (fractional factorial).
I'm usually one of the first to bash Phoronix for not doing multiple replicates or any statistical analysis of their experiments, but things appear to have changed this time. Some of the big criticisms of Phoronix's benchmarks in the past were that they didn't consider whether or not their results were significant - instead doing only one replicate for each configuration, plotting a barchart, and concluding "X was 5 FPS faster. Therefore it wins!" Apparently they're now doing multiple replicates and some proper statistics to calculate whether or not observed differences are actually statistically significant ("our kernel test results were automated, easily reproducible, and statistically significant"). Also the graphs are showing error bars +/- 1SD. This is good. This means that if you want to reproduce their experiments, it should be easy to do so. You can get an idea from the graphs whether a difference is significant.
(Having said that, I'm not sure why some of the data points don't have error bars - presumably the standard deviation was very low? I also can't see the number of replicates mentioned anywhere - maybe he used his "dynamic number of trials" scheme, but statistically speaking this may well be a bad thing - if he is using only 2 or 3 trials there is some probability of getting the first few samples with similar variance, he should probably stick to doing a fixed 10 to 30 replicates instead).
Re:Virtual machine, really? by mtippett · 2010-11-04 03:45 · Score: 2, Interesting

The "get to statistical variance" has been in Phoronix Test Suite for the better part of a year.
As part of the new work happening with Phoronix Test Suite, and the online aggregation site OpenBenchmarking.org, we'll be looking to expose the raw data and allow people to view a particular set of results in a possible more meaningful way. What is being examined now is raw data (scatter diagram), box plot (percentiles), violin plots (kernel function based), full standard error reporting (error bars, numerical reference to SD and SE.
Of course the general articles just show a simple form.
Obviously, infinite time and infinite runs with a broad variance of hardware would be better. As per usual, contact us at Phoronix with a fully baked suggestion for improvements in Phoronix Test Suite or a benchmark suggestion or article suggestion and we are more than willing to consider it.
Re:Virtual machine, really? by arth1 · 2010-11-04 09:52 · Score: 2, Interesting

In addition, a VM will use available assigned cores on the host, without locking them 1:1. This changes the behavior quite a bit, especially when it comes to CPU cache. The guest thinks it is running on the same core, but in reality it jumps between them, and has to reload from higher level cache or even memory.
Worse, from a benchmarking standpoint, hyperthreading will be exposed to the guest as separate CPUs. An intelligent scheduler would want to run distinct tasks on different cores, but can't do so in the VM.
And, it also depends on what other VMs are running on the hosts. Because virtual machines these days do "intelligent paging" and keep only one copy of identical pages. So if you're running two VMs with the same kernel or OS level, they're likely to run faster than two different OSes.
Anyhow, the test is horribly flawed from another point of view -- they test new hardware with old kernels. That's not fair, because the old kernels don't have optimizations for new hardware that didn't exist.
Anyhow, I'd be much more interested in finding out how old hardware would perform if upgrading to a new kernel. The more than 50% increase in size for a basic kernel over the last few years is probably why my old server with little RAM by today's standards runs faster on 2.6.17 than on 2.6.34. It's quite possible that the kernel itself runs faster, but if it leaves less memory for the system, and with the apps you run it starts swapping, it's overall going to be much slower.