5 Years of Linux Kernel Releases Benchmarked

Posted by samzenpus on Wednesday November 3, 2010 @06:49PM from the line-them-up dept.

An anonymous reader writes "Phoronix has published benchmarks of the past five years' worth of Linux kernel releases, from the Linux 2.6.12 through Linux 2.6.37 (dev) releases. The results from these benchmarks of 26 versions show that, for the most part, new features haven't affected performance."

Windows Kernels by Anonymous Coward · 2010-11-04 01:32 · Score: 5, Interesting

What about running the same study on the Windows kernel from XP to 7?
1. Re:Windows Kernels by coolsnowmen · 2010-11-04 04:38 · Score: 2, Insightful
  
  While interesting, it isn't exactly the same; in Linux, you can actually just change the kernel without changing all the services and startup software.
Virtual machine, really? by edelholz · 2010-11-04 01:47 · Score: 5, Insightful

They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?
1. Re:Virtual machine, really? by mtippett · 2010-11-04 02:05 · Score: 4, Informative
  
  Considering the efforts going into VM these days and the massive deployments in Fortune 500 companies, the performance of VM based systems is predictable. All the testing with Phoronix Test Suite is repeated until there is less than 3% variance between the results - or the result set is discarded.
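  A minimal sketch of the repeat-until-stable idea described above. The comment does not say how "variance" is measured, so this assumes relative standard deviation (SD divided by mean). The names `run_until_stable`, `min_runs`, and `max_runs` are hypothetical and are not the Phoronix Test Suite API.

```python
import statistics
from typing import Callable, List, Optional


def relative_sd(samples: List[float]) -> float:
    """Sample standard deviation as a fraction of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def run_until_stable(
    run_once: Callable[[], float],
    threshold: float = 0.03,  # the "less than 3%" figure from the comment
    min_runs: int = 3,        # assumption: not stated in the comment
    max_runs: int = 10,       # assumption: not stated in the comment
) -> Optional[List[float]]:
    """Repeat a benchmark until its relative SD drops below `threshold`.

    Returns the samples, or None if the result set has to be discarded.
    """
    samples = [run_once() for _ in range(min_runs)]
    while relative_sd(samples) >= threshold:
        if len(samples) >= max_runs:
            return None  # still too noisy: discard the result set
        samples.append(run_once())
    return samples


# Usage with canned timings standing in for a real benchmark run.
timings = iter([10.0, 10.1, 9.9, 10.0])
stable = run_until_stable(lambda: next(timings))
print(stable)
```

  Here the first three canned timings already agree to within about 1%, so no further runs are taken.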
  Realistically, looking at older kernels on modern hardware is actually a very critical dimension for corporate server environments. There are applications in that space that are deployed and supported only on some old distribution. Being able to understand how Red Hat 7.1 will act vs Red Hat 5 is critical for some environments.
2. Re:Virtual machine, really? by chrb · 2010-11-04 03:18 · Score: 2, Interesting
  
  They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?
  If they test in a VM, on only one particular hardware configuration, then the results only apply to that specific test setup. If the fact that the experiments are run inside a VM introduces variability into the results, then this will show up as a large variance. However, having a larger variance does not in itself negate the results - but remember that the results can't be generalised to other configurations - they only apply to this particular setup.
  In order to produce experimental results that can be generalised you need to run your experiments on a randomised configuration of hardware and VM host software. Either test every possible combination of factors - hardware, VM host sw, sw under test - (full factorial), or some subset (fractional factorial).
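  To make the full versus fractional factorial distinction concrete, here is a small sketch. The factor names and levels are hypothetical placeholders, not the configurations used in the article. The fraction shown is a standard one-third fraction of a 3x3x3 design (a Latin square), in which every pair of levels from any two factors still appears exactly once.

```python
import itertools
import random

# Hypothetical factor levels, purely for illustration.
factors = {
    "hardware": ["host_a", "host_b", "host_c"],
    "vm_host": ["kvm", "xen", "vmware"],
    "kernel": ["2.6.12", "2.6.24", "2.6.37"],
}

# Full factorial: every combination of every factor level (3 * 3 * 3 runs).
full = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]

# Fractional factorial: keep only index triples whose sum is divisible by 3.
# This Latin-square fraction needs 9 runs instead of 27.
level_indices = [range(len(levels)) for levels in factors.values()]
fraction = [
    {name: factors[name][i] for name, i in zip(factors, idx)}
    for idx in itertools.product(*level_indices)
    if sum(idx) % 3 == 0
]

# Randomise the run order so drift over time is not confounded with a factor.
random.Random(0).shuffle(fraction)

print(len(full), "runs in the full design,", len(fraction), "in the fraction")
```

  The trade-off is that the smaller design can estimate each factor's main effect but cannot separate out interactions between factors.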
  I'm usually one of the first to bash Phoronix for not doing multiple replicates or any statistical analysis of their experiments, but things appear to have changed this time. Some of the big criticisms of Phoronix's benchmarks in the past were that they didn't consider whether or not their results were significant - instead doing only one replicate for each configuration, plotting a barchart, and concluding "X was 5 FPS faster. Therefore it wins!" Apparently they're now doing multiple replicates and some proper statistics to calculate whether or not observed differences are actually statistically significant ("our kernel test results were automated, easily reproducible, and statistically significant"). Also the graphs are showing error bars +/- 1SD. This is good. This means that if you want to reproduce their experiments, it should be easy to do so. You can get an idea from the graphs whether a difference is significant.
  (Having said that, I'm not sure why some of the data points don't have error bars; presumably the standard deviation was very low? I also can't see the number of replicates mentioned anywhere. Maybe he used his "dynamic number of trials" scheme, but statistically speaking this may well be a bad thing: with only 2 or 3 trials there is some probability that the first few samples happen to agree closely by chance. He should probably stick to a fixed 10 to 30 replicates instead.)
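  The small-sample concern can be illustrated with a quick simulation. This is a sketch under stated assumptions: timings are normally distributed with a true relative SD of 5%, and a result set is accepted when its sample relative SD is under 3%. Both figures and the function name `fraction_passing` are chosen for illustration only.

```python
import random
import statistics


def fraction_passing(n_runs: int, true_rel_sd: float, threshold: float,
                     trials: int, rng: random.Random) -> float:
    """Share of simulated result sets whose sample relative SD is below threshold."""
    passed = 0
    for _ in range(trials):
        # Simulated timings around a mean of 100 with the given true spread.
        samples = [rng.gauss(100.0, 100.0 * true_rel_sd) for _ in range(n_runs)]
        if statistics.stdev(samples) / statistics.mean(samples) < threshold:
            passed += 1
    return passed / trials


rng = random.Random(42)  # fixed seed so the simulation is repeatable
for n in (3, 10, 30):
    share = fraction_passing(n, true_rel_sd=0.05, threshold=0.03,
                             trials=2000, rng=rng)
    print(f"{n:2d} runs: {share:.1%} of noisy result sets look stable")
```

  Under these assumptions roughly three in ten of the 3-run sets pass the check even though the underlying noise is well above the threshold, while sets of 30 runs almost never do.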
3. Re:Virtual machine, really? by Anonymous Coward · 2010-11-04 03:40 · Score: 2, Insightful
  
  They tested in a VM. Now where's the proof that by itself doesn't affect performance in an unpredictable way?
  Does it matter?
  They are after deltas, not absolutes.
  *IF* they test each kernel in the same VM on the same metal then any change is valid. The numbers are abstract; the difference between releases is what is key.
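  A sketch of reporting deltas rather than absolutes. The timings are made up for illustration and are not from the article; `percent_change` is a hypothetical helper. Note that this only cancels VM overhead if that overhead affects every kernel in the same proportion, which is an assumption.

```python
from typing import Dict

# Made-up timings (seconds, lower is better), all measured in one fixed setup.
seconds_by_kernel: Dict[str, float] = {
    "2.6.12": 50.0,
    "2.6.24": 48.0,
    "2.6.37": 60.0,
}


def percent_change(baseline: float, value: float) -> float:
    """Change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = seconds_by_kernel["2.6.12"]
for kernel, seconds in seconds_by_kernel.items():
    print(f"{kernel}: {percent_change(baseline, seconds):+.1f}% vs 2.6.12")
```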
4. Re:Virtual machine, really? by mtippett · 2010-11-04 03:45 · Score: 2, Interesting
  
  The "get to statistical variance" has been in Phoronix Test Suite for the better part of a year.
  As part of the new work happening with Phoronix Test Suite, and the online aggregation site OpenBenchmarking.org, we'll be looking to expose the raw data and allow people to view a particular set of results in a possibly more meaningful way. What is being examined now is raw data (scatter diagram), box plots (percentiles), violin plots (kernel function based), and full standard error reporting (error bars, numerical reference to SD and SE).
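  For readers unfamiliar with the quantities named above, a short sketch of how SD, SE, and box-plot quartiles could be computed from raw run data. The timings are made up, and this uses only the Python standard library, not anything from Phoronix Test Suite or OpenBenchmarking.org.

```python
import math
import statistics

# Made-up raw timings (seconds) for one benchmark on one kernel.
runs = [12.1, 11.8, 12.4, 12.0, 11.9, 12.6, 12.2, 12.0]

mean = statistics.mean(runs)
sd = statistics.stdev(runs)                       # sample standard deviation
se = sd / math.sqrt(len(runs))                    # standard error of the mean
q1, median, q3 = statistics.quantiles(runs, n=4)  # quartiles for a box plot

print(f"mean={mean:.3f}s  SD={sd:.3f}  SE={se:.3f}")
print(f"quartiles: {q1:.2f} / {median:.2f} / {q3:.2f}")
```

  SD describes the spread of individual runs, while SE describes how well the mean is pinned down and shrinks as more runs are added.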
  Of course the general articles just show a simple form.
  Obviously, infinite time and infinite runs with a broad variance of hardware would be better. As per usual, contact us at Phoronix with a fully baked suggestion for improvements in Phoronix Test Suite or a benchmark suggestion or article suggestion and we are more than willing to consider it.
5. Re:Virtual machine, really? by arth1 · 2010-11-04 09:52 · Score: 2, Interesting
  
  In addition, a VM will use available assigned cores on the host, without locking them 1:1. This changes the behavior quite a bit, especially when it comes to CPU cache. The guest thinks it is running on the same core, but in reality it jumps between them, and has to reload from higher level cache or even memory.
  Worse, from a benchmarking standpoint, hyperthreading will be exposed to the guest as separate CPUs. An intelligent scheduler would want to run distinct tasks on different cores, but can't do so in the VM.
  It also depends on what other VMs are running on the host, because virtual machines these days do "intelligent paging" and keep only one copy of identical pages. So if you're running two VMs with the same kernel or OS level, they're likely to run faster than two different OSes.
  Anyhow, the test is horribly flawed from another point of view -- they test new hardware with old kernels. That's not fair, because the old kernels don't have optimizations for new hardware that didn't exist.
  Anyhow, I'd be much more interested in finding out how old hardware would perform if upgrading to a new kernel. The more than 50% increase in size for a basic kernel over the last few years is probably why my old server with little RAM by today's standards runs faster on 2.6.17 than on 2.6.34. It's quite possible that the kernel itself runs faster, but if it leaves less memory for the system, and with the apps you run it starts swapping, it's overall going to be much slower.
Results don't support conclusion by QuantumBeep · 2010-11-04 01:53 · Score: 2, Interesting

It seems almost every benchmark that had any difference was slower in more modern kernels. It's not all sunshine and roses.
1. Re:Results don't support conclusion by timeOday · 2010-11-04 04:43 · Score: 4, Informative
  I would agree it's not all sunshine and roses, but let's at least look a little more closely. There are some disturbing regressions in there, although keep in mind other improvements (such as moving to a journalling filesystem) may come at a cost to performance, which may be justified.
  Better:
    Apache Compilation: 40% less time
    Disk Transactions: 50% less time
  Worse:
    GnuPG File Encryption: 60% more time
    Time to transfer 10GB via the TCP network loop-back: 100% more time
    Apache static web page serving: 50% more time
    IOZone Writes: 20% more time
  Same:
    CAMELLIA256-ECB cipher
    OpenSSL
    NASA's NPB
    TTSIOD 3D renderer
    C-Ray multi-threaded ray-tracing
    Crafty, an open-source chess engine
    MAFFT multiple-sequence alignment (a molecular biology workload)
    Himeno Poisson Pressure Solver
    Blowfish performance with John The Ripper
    LAME MP3 encoding
    7-Zip compression
    Dhrystone 2
    FS-Mark
    IOZone Reads
    Threaded IO tester
    Parallel BZip2 compression
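  The percentages above are changes in elapsed time, which do not map one-to-one onto changes in throughput. A small sketch of the conversion, using the figures quoted in this comment; `throughput_change` is a hypothetical helper name.

```python
def throughput_change(time_change_pct: float) -> float:
    """Percent change in throughput implied by a percent change in run time."""
    time_ratio = 1.0 + time_change_pct / 100.0
    return (1.0 / time_ratio - 1.0) * 100.0


# Time changes as listed in the comment (negative means less time).
time_changes = {
    "Apache Compilation": -40.0,
    "Disk Transactions": -50.0,
    "GnuPG File Encryption": 60.0,
    "TCP loop-back transfer": 100.0,
    "Apache static web page serving": 50.0,
    "IOZone Writes": 20.0,
}

for name, pct in time_changes.items():
    print(f"{name}: {pct:+.0f}% time, {throughput_change(pct):+.1f}% throughput")
```

  For example, taking 100% more time means half the throughput, while taking 50% less time means double the throughput.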
2. Re:Results don't support conclusion by CAIMLAS · 2010-11-04 07:10 · Score: 2, Interesting
  
  Not only that, but they only looked at the kernel with a specific version of GCC. Because of this, the performance differences could theoretically be accounted for entirely by minute differences in how the compiler handles things.
  The bigger thing with Linux performance isn't just the kernel - it's the entire stack. You've got the kernel, sure - and then you've got the core libraries (glibc, etc.) and the compiler which built them. These all can change performance significantly, and in real-world environments, the two are usually associated.
  I'd be interested in seeing the results if they went back and looked at the kernel readme files and applied "requires version x or newer of y" and built everything that way. I suspect you'd see a performance curve inversely related to the kernel version.
  
  --
  ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Overkill by TrailerTrash · 2010-11-04 02:35 · Score: 3, Funny

What more Linux benchmarking do you need besides bogomips? Jeez.
Y'all musta forgot by mark72005 · 2010-11-04 02:58 · Score: 2, Funny

I find this hard to believe, what with 2010 being the year of Linux on the desktop and all.
You call those kernel benchmarks? by m4c+north · 2010-11-04 03:15 · Score: 5, Insightful

Where are the kernel-level tests that do more than exercise the filesystem and network driver (singular) and the scheduler? More than half of those charts were flat, which could mean they weren't making appropriate measurements.
For example, show how mutexes have improved, or copy-on-write, or interrupt handlers, or timers, or workqueues, or kmalloc, or anything else that a system and kernel programmer would care about. I like the user-centric perspective: it's very good information to have and share, but don't call what you've done a kernel benchmark. Maybe call it a kernel survey of its impact on users.

--
Who's your user, program?
1. Re:You call those kernel benchmarks? by CAIMLAS · 2010-11-04 07:15 · Score: 2, Insightful
  
  IF you were running the tests on real hardware, I'd be more likely to agree.
  They weren't. They were running it on a virtualized host in KVM. This means that not only were their results largely determined by the specific network, etc. drivers they used (which can see significant revision between kernels and not accurately reflect the kernel itself), but any idiosyncratic behavior in KVM in how it treats guest interfaces may account for the discrepancies.
  
  --
  ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Re:wops by ustolemyname · 2010-11-04 03:39 · Score: 2, Insightful

Some of the changes noted in the Linux 2.6.30 kernel change-log that was used throughout the Linux testing process included...
Yeah, that new EXT4 filesystem that they didn't use for obvious reasons. Huge impact on the results.
ugh by buddyglass · 2010-11-04 04:15 · Score: 5, Informative

I love that Phoronix is willing to take the time to run tests like this. I just wish they'd learn how to run meaningful tests. For instance, why are they testing a bunch of CPU-bound things? Kernel won't affect that unless we're talking about SMP performance. If you want to test the kernel, test how well it handles SMP, network I/O and disk I/O. And bear in mind that disk I/O will be hugely affected by which filesystem is used and its configurable settings.
Another problem with their article is that it tests individual kernels. Most folks don't use a vanilla kernel. They use one provided by their distro, which may have distro-specific patches that address some of the performance problems (or add new ones). What I would have preferred to see is a comparison of different distro releases over the last 5 years, focusing on the most popular ones (say Ubuntu, Fedora and SuSE).
The meaningful tests (and their results) were:
1. GnuPG: avoid 2.6.30 and later.
2. Loopback TCP: avoid 2.6.30 and later.
3. Apache Compilation: avoid 2.6.29 and earlier.
4. Apache static content: avoid 2.6.12, 2.6.25, 2.6.26, then 2.6.30 and later.
5. PostMark: avoid 2.6.29 and earlier.
6. FS-Mark: avoid 2.6.17 and earlier, 2.6.29, then 2.6.33 to 2.6.36.
7. ioZone: unless you're willing to run 2.6.21 or earlier, avoid 2.6.29 and you're fine.
8. Threaded I/O: avoid 2.6.20 and earlier, 2.6.29, then 2.6.33 to 2.6.36.
Based on these results, #1 and #2 seem to be testing the same thing, and tests #3 and #5 seem to be testing the inverse of whatever that thing is. 2.6.29 seems to be especially crappy, performing worse than the kernels immediately before and immediately after it on tests #6, #7 and #8. In terms of recent kernels, tests #6 and #8 suggest a regression in 2.6.33 that has been resolved in 2.6.37.
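One way to spot a release like 2.6.29 mechanically is to flag any version that scores worse than both of its neighbours. The sketch below does that; the scores are made up for illustration and are not taken from the article, and `local_regressions` and its 5% tolerance are hypothetical choices.

```python
from typing import Dict, List


def local_regressions(scores: Dict[str, float], tolerance: float = 0.05) -> List[str]:
    """Versions scoring worse than both neighbours by more than `tolerance`.

    Scores are "higher is better"; dict order is taken as release order.
    """
    versions = list(scores)
    flagged = []
    for prev, cur, nxt in zip(versions, versions[1:], versions[2:]):
        # The weaker neighbour, minus a noise allowance, sets the bar.
        floor = min(scores[prev], scores[nxt]) * (1.0 - tolerance)
        if scores[cur] < floor:
            flagged.append(cur)
    return flagged


# Made-up throughput numbers, not taken from the article.
example = {
    "2.6.27": 100.0,
    "2.6.28": 101.0,
    "2.6.29": 70.0,
    "2.6.30": 99.0,
    "2.6.31": 98.0,
}
print(local_regressions(example))
```

With these made-up numbers only 2.6.29 is flagged. A sustained regression spanning several releases would need a different check, since each affected release would look similar to its neighbours.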
If it were me, I'd look at either running 2.6.37 (when it's released) or fall back to 2.6.32 if my hardware was supported.
1. Re:ugh by mtippett · 2010-11-04 05:55 · Score: 3, Insightful
  
  This made me laugh - in a good way, not at you :).
  When Phoronix does a distro-comparison the crowd calls out that the tests are only really testing gcc differences, and should have fewer variables changing. When Phoronix does a fixed comparison varying only one part of the system, the crowd calls out that it isn't a good basis since people don't run it that way.
  Phoronix runs tests in different ways to explore the performance landscape. For some it gives precisely the information that they need; for others it's completely irrelevant. In this particular case, I'm glad that the data gave you enough to have some open questions about 2.6.32 vs 2.6.37. If people walk away with those sorts of first-order interpretations, the article served its purpose.
  Of course the next step would be how do we take a tighter look at the delta between 2.6.32 and 2.6.37 - any thoughts?
  Regarding meaningful vs meaningless tests: the tests Phoronix runs are a collection of tests to explore. The tests were run, and for some of them, the results yielded nothing interesting but were still reported. You don't know until you run the tests, and if the tests are run, you report on them. Some tests may be stable now, but may have sensitivity to other parts of the system. Even CPU-bound tests will yield different results in different cases (scheduler, etc).
2. Re:ugh by TheLink · 2010-11-04 08:29 · Score: 2, Insightful
  
  I suspect the scheduler would make a bigger difference if you were running multiple processes at the same time.
  
  e.g. multiple processes in various scenarios:
  CPU intensive.
  disk IO intensive.
  network IO intensive, single NIC.
  network IO intensive, two NICs.
  network IO intensive, four NICs.
  And various combinations of CPU, disk, network.
  
  Then latency tests:
  One to X processes with high CPU, while measuring latency experienced by another process.
  One to X processes with high IO, while measuring latency experienced by another process.
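  A rough sketch of the CPU-load latency test proposed above: start some busy-looping processes, then measure how late a sleeping process wakes up. The function names and default parameters are hypothetical, and wake-up delay after `time.sleep` is only one possible stand-in for "latency experienced by another process".

```python
import multiprocessing as mp
import statistics
import time
from typing import List


def burn_cpu(stop_at: float) -> None:
    """Busy-loop until the wall-clock deadline to generate CPU load."""
    while time.time() < stop_at:
        pass


def wakeup_delays(samples: int, interval: float) -> List[float]:
    """How much later than requested each sleep(interval) returned, in seconds."""
    delays = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval)
        delays.append(time.perf_counter() - start - interval)
    return delays


def latency_under_load(n_burners: int, samples: int = 200,
                       interval: float = 0.005) -> float:
    """Median wake-up delay while `n_burners` processes keep the CPU busy."""
    # Deadline is a safety net; burners are normally terminated below.
    stop_at = time.time() + samples * interval * 2 + 1.0
    burners = [mp.Process(target=burn_cpu, args=(stop_at,))
               for _ in range(n_burners)]
    for proc in burners:
        proc.start()
    try:
        return statistics.median(wakeup_delays(samples, interval))
    finally:
        for proc in burners:
            proc.terminate()
            proc.join()
```

  To try it, call `latency_under_load(n)` for n in, say, 0, 1, 2 and 4 from inside an `if __name__ == "__main__":` guard (required on platforms that spawn rather than fork) and compare the medians across kernels. The IO variants in the list would need a disk or network load generator in place of `burn_cpu`.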