New Linux 2.5 Benchmarks

I'm really sorry. by FreeLinux · 2002-11-16 09:57 · Score: 5, Informative

Try it again.

In a reply on lkml to Aaron Lehmann, who had praised the contest results of the latest 2.5-mm kernel, Andrew Morton [interview] explains some of the important performance and design differences between the 2.4 stable series and the 2.5 development series, and backs them up with illustrative benchmarks.

Most significant gains can be expected at the high end such as large machines, large numbers of threads, large disks, large amounts of memory etc. [...] For the uniprocessors and small servers, there will be significant gains in some corner cases. And some losses. [...] Generally, 2.6 should be "nicer to use" on the desktop. But not appreciably faster.

From: Aaron Lehmann
To: linux-kernel
Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest
Date: Mon Nov 11 2002 - 18:04:53 AKST

On Tue, Nov 12, 2002 at 10:31:38AM +1100, Con Kolivas wrote:
> Here are the latest contest (http://contest.kolivas.net) benchmarks up to and
> including 2.5.47.

This is just great to see. Most previous contest runs made me cringe when I saw how -mm and recent 2.5 kernels were faring, but it looks like Andrew has done something right in 2.5.47-mm1. I hope the appropriate get merged so that 2.6.0 has stunning performance across the board.

From: Andrew Morton
To: linux-kernel mailing list
Subject: Re: [BENCHMARK] 2.5.47{-mm1} with contest
Date: Tue Nov 12 2002 - 02:04:23 AKST

Aaron Lehmann wrote:
>
> On Tue, Nov 12, 2002 at 10:31:38AM +1100, Con Kolivas wrote:
> > Here are the latest contest (http://contest.kolivas.net) benchmarks up to and
> > including 2.5.47.
>
> This is just great to see. Most previous contest runs made me cringe
> when I saw how -mm and recent 2.5 kernels were faring, but it looks
> like Andrew has done something right in 2.5.47-mm1. I hope the appropriate get merged so that 2.6.0 has stunning performance across
> the board.

Tuning of 2.5 has really hardly started. In some ways, it should be tested against 2.3.99 (well, not really, but...)

It will never be stunningly better than 2.4 for normal workloads on
normal machines, because 2.4 just ain't that bad.

What is being addressed in 2.5 is the areas where 2.4 fell down: large machines, large numbers of threads, large disks, large amounts
of memory, etc. There have been really big gains in that area.

For the uniprocessors and small servers, there will be significant gains in some corner cases. And some losses. Quite a lot of work has gone into "fairness" issues: allowing tasks to make equal progress when the machine is under load. Not stalling tasks for unreasonable
amounts of time, etc. Simple operations such as copying a forest of files from one part of the disk to another have taken a bit of a hit from this. (But copying them to another disk got better).

Generally, 2.6 should be "nicer to use" on the desktop. But not appreciably faster. Significantly slower when there are several processes causing a lot of swapout. That is one area where fairness really hurts throughput. The old `make -j30 bzImage' with mem=128M takes 1.5x as long with 2.5. Because everyone makes equal progress.

Most of the VM gains involve situations where there are large amounts of dirty data in the machine. This has always been a big problem
for Linux, and I think we've largely got it under control now. There are still a few issues in the page reclaim code wrt this, but they're
fairly obscure (I'm the only person who has noticed them ;))

There are some things which people simply have not yet noticed.

Andrea's kernel is the fastest which 2.4 has to offer; let's tickle its weak spots:

Run mke2fs against six disks at the same time, mem=1G:

2.4.20-rc1aa1:
0.04s user 13.16s system 51% cpu 25.782 total
0.05s user 31.53s system 63% cpu 49.542 total
0.05s user 29.04s system 58% cpu 49.544 total
0.05s user 31.07s system 62% cpu 50.017 total
0.06s user 29.80s system 58% cpu 50.983 total
0.06s user 23.30s system 43% cpu 53.214 total

2.5.47-mm2:
0.04s user 2.94s system 48% cpu 6.168 total
0.04s user 2.89s system 39% cpu 7.473 total
0.05s user 3.00s system 37% cpu 8.152 total
0.06s user 4.33s system 43% cpu 9.992 total
0.06s user 4.35s system 42% cpu 10.484 total
0.04s user 4.32s system 32% cpu 13.415 total

Write six 4G files to six disks in parallel, mem=1G:

2.4.20-rc1aa1:
0.01s user 63.17s system 7% cpu 13:53.26 total
0.05s user 63.43s system 7% cpu 14:07.17 total
0.03s user 65.94s system 7% cpu 14:36.25 total
0.01s user 66.29s system 7% cpu 14:38.01 total
0.08s user 63.79s system 7% cpu 14:45.09 total
0.09s user 65.22s system 7% cpu 14:46.95 total

2.5.47-mm2:
0.03s user 53.95s system 39% cpu 2:18.27 total
0.03s user 58.11s system 30% cpu 3:08.23 total
0.02s user 57.43s system 30% cpu 3:08.47 total
0.03s user 54.73s system 23% cpu 3:52.43 total
0.03s user 54.72s system 23% cpu 3:53.22 total
0.03s user 46.14s system 14% cpu 5:29.71 total

Compile a kernel while running `while true;do;./dbench 32;done' against
the same disk. mem=128m:

2.4.20-rc1aa1:
Throughput 17.7491 MB/sec (NB=22.1863 MB/sec 177.491 MBit/sec)
Throughput 16.6311 MB/sec (NB=20.7888 MB/sec 166.311 MBit/sec)
Throughput 17.0409 MB/sec (NB=21.3012 MB/sec 170.409 MBit/sec)
Throughput 17.4876 MB/sec (NB=21.8595 MB/sec 174.876 MBit/sec)
Throughput 15.3017 MB/sec (NB=19.1271 MB/sec 153.017 MBit/sec)
Throughput 18.0726 MB/sec (NB=22.5907 MB/sec 180.726 MBit/sec)
Throughput 18.2769 MB/sec (NB=22.8461 MB/sec 182.769 MBit/sec)
Throughput 19.152 MB/sec (NB=23.94 MB/sec 191.52 MBit/sec)
Throughput 14.2632 MB/sec (NB=17.8291 MB/sec 142.632 MBit/sec)
Throughput 20.5007 MB/sec (NB=25.6258 MB/sec 205.007 MBit/sec)
Throughput 24.9471 MB/sec (NB=31.1838 MB/sec 249.471 MBit/sec)
Throughput 20.36 MB/sec (NB=25.45 MB/sec 203.6 MBit/sec)
make -j4 bzImage 412.28s user 36.90s system 15% cpu 47:11.14 total

2.5.46:
Throughput 19.3907 MB/sec (NB=24.2383 MB/sec 193.907 MBit/sec)
Throughput 16.6765 MB/sec (NB=20.8456 MB/sec 166.765 MBit/sec)
make -j4 bzImage 412.16s user 36.92s system 83% cpu 8:55.74 total

2.5.47-mm2:
Throughput 15.0539 MB/sec (NB=18.8174 MB/sec 150.539 MBit/sec)
Throughput 21.6388 MB/sec (NB=27.0485 MB/sec 216.388 MBit/sec)
make -j4 bzImage 413.88s user 35.90s system 94% cpu 7:56.68 total - fifo_batch strikes again
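
(Editor's aside, not part of the original message.) To put the kernel-compile results above on one scale, the wall-clock totals can be converted to seconds and compared. This is only a sketch: the helper name `to_seconds` is made up for illustration, and the three totals are copied from the `make -j4 bzImage` lines above. By this arithmetic the 2.5 kernels finish the build roughly 5.3x and 5.9x sooner than 2.4.20-rc1aa1 under the same dbench load.

```python
def to_seconds(total: str) -> float:
    """Convert a wall-clock total such as '47:11.14' or '25.782' to seconds."""
    seconds = 0.0
    for part in total.split(":"):
        # Each ':'-separated field is worth 60x the field to its right.
        seconds = seconds * 60 + float(part)
    return seconds


# Wall-clock totals for `make -j4 bzImage`, copied from the results above.
compile_totals = {
    "2.4.20-rc1aa1": "47:11.14",
    "2.5.46": "8:55.74",
    "2.5.47-mm2": "7:56.68",
}

baseline = to_seconds(compile_totals["2.4.20-rc1aa1"])
for kernel, total in compile_totals.items():
    secs = to_seconds(total)
    print(f"{kernel:14s} {secs:8.2f} s  {baseline / secs:5.2f}x vs 2.4.20-rc1aa1")
```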

It's the "doing multiple things at the same time" which gets better; the
straightline throughput of "one thing at a time" won't change much at all.

Corner cases....

Re:Can't get a speedup of more than 10 by certron · 2002-11-16 10:05 · Score: 3, Informative

"It's impossible to get a speedup of more than 10 with any processor-related activities.

Using Amdahl's Law, one can find that
Speedup = (s + p ) / (s + p / N ) where N is the number of processors, s is the amount of time spent (by a serial processor) on serial parts of a program and p is the amount of time spent (by a serial processor) on parts of the program that can be done in parallel."
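
(Editor's aside.) The quoted formula is easy to try out numerically. The sketch below is a direct transcription of it; the function names and the 10%/90% split are assumptions chosen for illustration, not something stated in the quote. With that split the speedup approaches, but never exceeds, (s + p) / s = 10 as N grows, which is one way a figure of 10 could arise.

```python
def amdahl_speedup(s: float, p: float, n: int) -> float:
    """Speedup = (s + p) / (s + p / N), as in the formula quoted above."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (s + p) / (s + p / n)


def amdahl_limit(s: float, p: float) -> float:
    """Upper bound on the speedup as N grows without limit: (s + p) / s."""
    if s <= 0:
        raise ValueError("no finite bound when the serial time is zero")
    return (s + p) / s


# Assumed split for illustration: 10% serial time, 90% parallelizable time.
for n in (1, 2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.1, 0.9, n), 2))
print("limit:", amdahl_limit(0.1, 0.9))
```

Note that the bound depends entirely on the serial share s; a different split gives a different ceiling.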

While I'm no expert in software engineering (and I haven't really looked over the equation you put too closely) I think it assumes the original was written with some sort of intelligence behind it. I bet I could write some really atrocious code that would be so incredibly inefficient that almost anyone else could get a huge performance gain from it.

I'm not sure if I would have to try hard or not try at all to write really bad code. :-)

--

fair.org counterpunch.com truthout.com indymedia.org salon.com
eff.org guerrilla.net debian.org gentoo.org

Well by kentyman · 2002-11-16 10:15 · Score: 2, Informative

For those of you wondering, this is not a proof that you cannot optimize something to be more than 10 times faster in general.

For example, suppose you have an algorithm A that takes X time. And then suppose you change it to algorithm B that takes 11X time by making it do algorithm A 11 times. Well algorithm B can be optimized to be 11 times faster by making it algorithm A instead, since they give the same result.

Anyway, just wanted to make sure no one was missing the "processor-related activities" clause in your statement.

--
You know where you are? You're in the $PATH, baby. You're gonna get executed!

Re:Make it simple please by jericho4.0 · 2002-11-16 10:15 · Score: 5, Informative

It'll be quite a while before recompiling a kernel gets any simpler. Recompiling assumes that you know (somewhat) what you're doing. Keep at it. It took me at least 10 tries before I compiled a bootable kernel.

Quick hint: install the kernel sources that came with your distro. Use the .config file found in them for your first compile. These are the settings that your kernel was compiled with. Then you can use make xconfig to alter a known working config. Good luck.

--
"A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis

Re:This is This is the exact opposite of my findin by be-fan · 2002-11-16 10:19 · Score: 5, Informative

Um, doing benchmarks between an Athlon XP and a Pentium 4 is folly. The P4 has notoriously slow context switching performance. Also, if you are running a small number of threads, your computer isn't spending a whole lot of time thread switching anyway, so the hit doesn't really affect you. When you have lots of threads, scheduling becomes far more important, and so the increase is much more noticeable.

--
A deep unwavering belief is a sure sign you're missing something...

Re:Can't get a speedup of more than 10 by Daniel+Phillips · 2002-11-16 10:25 · Score: 4, Informative

Informative? I don't think so. (Moderators, please check the crack that you are smoking)

Amdahl's law makes a (wrong) statement about the amount of speedup that can be obtained through parallel as opposed to serial execution. (By the way, the number 10 doesn't come into it anywhere. You might as well have mentioned the speed of sound.)

Here, we are talking about the comparative performance of two operating systems running on the same number of processors. Since there is no limit on how stupidly the original could have been implemented, there is correspondingly no limit on the amount of possible speedup due to a better implementation.

Anyway, if you think you know something about Amdahl's law, you need to google for "Gustafson's law". Executive summary: Amdahl was wrong. Exactly how wrong is still a matter of debate, but it's generally agreed that it lies somewhere between "very" and "completely". Please don't quote this nonsense in support of anything, just don't do it.
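
(Editor's aside.) For readers who have not met Gustafson's law: it assumes the problem size grows with the number of processors, and gives a scaled speedup of s + (1 - s) * N, where s is the serial fraction of the time measured on the parallel machine. The sketch below puts it next to Amdahl's fixed-size formula. The function names and the 10% serial fraction are illustrative assumptions, and the two fractions are measured on different runs, so the numbers are not a like-for-like comparison.

```python
def gustafson_scaled_speedup(serial_fraction: float, n: int) -> float:
    """Scaled speedup s + (1 - s) * N, with s measured on the parallel run."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return serial_fraction + (1.0 - serial_fraction) * n


def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Fixed-size speedup 1 / (s + (1 - s) / N), with s measured on the serial run."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


# Assumed serial fraction of 10% in both models, purely for illustration.
for n in (1, 16, 1024):
    print(n, round(amdahl_speedup(0.1, n), 2), round(gustafson_scaled_speedup(0.1, n), 2))
```

The fixed-size model flattens out near 10 while the scaled model keeps growing almost linearly, which is the heart of the disagreement in this thread.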

--
Have you got your LWN subscription yet?

Re:So what does this mean for the everyday linux u by iabervon · 2002-11-16 10:30 · Score: 5, Informative

You'll get better interactive performance under load. So if you're encoding an mp3 and writing your home directory to a CD, your mouse cursor won't stick and your windows will refresh reasonably well. Unless you're doing something kind of disk/processor intensive, you won't notice the difference, because 2.4 is too good already for there to be much improvement. If you try to encode 32 mp3s at the same time, 2.6 will actually do worse than 2.4, but at least it won't make ls quite so slow.

The main goals are interactivity (input gets handled quickly), low latency (your mp3 player gets a chance to send the next second of audio to the sound card before this second is over), and fairness (every program makes at least a little progress after a short amount of time).

Re:So what does this mean for the everyday linux u by Azar · 2002-11-16 10:31 · Score: 5, Informative

Overall throughput has not increased (actually, it is believed to have decreased). So the overall speed of the system is relatively equal to the 2.4 series of kernels. You probably won't see any major performance speedups in any apps you use.

However, the overall responsiveness of the system is improved. Most people who have used it have claimed that it felt much faster than the 2.4 series. You won't have starved processes.

This means if you're running XMMS and you compile a kernel, XMMS won't just hang until the compilation is done. The kernel developers have done a great job in improving -fairness- between processes.

Mostly, the results will be seen on Big Iron and server applications, but the overall desktop experience is expected to improve.

Re:Andrew Morton?? by Anonymous Coward · 2002-11-16 10:41 · Score: 1, Informative

He used the word correctly. Andrew Morton is infamous, after basing his career on gossipy biographies of minor celebrities.

Re:Triple? by Alan · 2002-11-16 10:47 · Score: 4, Informative

Well, there was an nvidia driver update that advertised an increase in performance of something like 25%, which isn't that bad....

Re:inexperience by Anonymous Coward · 2002-11-16 11:01 · Score: 1, Informative

I believe part of the slow load times is due to the fact that glibc on most modern distros does not include object preloading technology. The latest glibc has this I believe, and the only distro I know that uses the latest glibc are currently beta. I think around the Redhat 9.2 timeframe you will see linux that is more than suitable for the desktop.

Re:inexperience by myz24 · 2002-11-16 11:07 · Score: 2, Informative

The short answer is that KDE is written in C++.

The long answer is that anything written in C++ on Linux will load slowly (but should run fairly quickly once loaded) because of something to do with loading the C++ libraries and some other compiler gook. I can't remember where I read it, or how I found it on google, but apparently this will be fixed soon in glibc.

Of course, I could be WAY off, so if someone could back me up...

Wow, you can disprove Amdahl's law? by Fefe · 2002-11-16 11:15 · Score: 5, Informative

Please write and publish a paper about it!

This is a major breakthrough in computer science.

It also is quite unlikely, since Amdahl's law is a trivial observation that is completely independent of parallelization or even software engineering (it also applies to hardware design or even accounting). Basically, it says: if initially only 10% of X (CPU cycles, money, whatever you are trying to save) is spent in the part you are optimizing, there is an upper bound of 10% to the X you can save.

I'm very interested in how you can disprove that.

Re:Wow, you can disprove Amdahl's law? by Daniel+Phillips · 2002-11-16 13:33 · Score: 3, Informative

> Please write and publish a paper about it!

Such rhetoric, oh my.

> This is a major breakthrough in computer science.
>
> It also is quite unlikely, since Amdahl's law is a trivial observation that is completely independent of parallelization or even software engineering (it also applies to hardware design or even accounting). Basically, it says: if initially only 10% of X (CPU cycles, money, whatever you are trying to save) is spent in the part you are optimizing, there is an upper bound of 10% to the X you can save.

Sorry, wrong law. You seem to be thinking "90% of the time in 10% of the code", a rule of thumb that nobody to my knowledge has ventured to dignify with the term "law". Amdahl's Law (which IMHO doesn't deserve the dignity either) was an attempt to make a statement about the limitations of parallel computing. Relying on wrong assumptions, he drew wrong conclusions, and in the event, parallel clusters have gone on to scale nearly linearly into the tens of thousands of processors, a result he would have liked to have proved impossible.

Read more here.

--
Have you got your LWN subscription yet?

Re:Wow, you can disprove Amdahl's law? by Anonymous Coward · 2002-11-16 17:44 · Score: 1, Informative

No, the original poster got it right. Amdahl's law is simply a formalism of the statement "make the common case fast". Simply put, if you are speeding something up on the chip, speed up something that gets used a whole lot. If your feature is only used 1% of the time, and you double the speed, you save only 0.5% of the run time. If, on the other hand, your feature is used 50% of the time and you double the speed, you save 25% of the run time. Amdahl's Law helps us make formal arguments about where we should optimize.
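
(Editor's aside.) The arithmetic in this comment can be checked in a few lines. Doubling the speed of a feature used 1% or 50% of the time removes 0.5% or 25% of the run time respectively, which corresponds to overall speedup factors of about 1.005x and 1.33x. The function names below are invented for this sketch.

```python
def time_saved(fraction: float, factor: float) -> float:
    """Share of the original run time removed when `fraction` of it gets `factor` times faster."""
    return fraction * (1.0 - 1.0 / factor)


def overall_speedup(fraction: float, factor: float) -> float:
    """Old run time divided by new run time for the same change."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# The two cases from the comment above: a rarely used and a heavily used feature.
for fraction in (0.01, 0.50):
    saved = time_saved(fraction, 2.0)
    speedup = overall_speedup(fraction, 2.0)
    print(f"used {fraction:.0%} of the time: saves {saved:.1%}, overall {speedup:.3f}x")
```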

Re:Make it simple please by Anna+Merikin · 2002-11-16 12:17 · Score: 4, Informative

I grew up with DOS, too. If you installed Borland's Sidekick (many did) successfully, you can compile. That's the stuff that went on in Sidekick's install process: it used Borland's compiler -- and that's why it ran so well.

I just finished *this morning* compiling a 2.2.22 kernel (yes, RH-6.2) for my box. Use the .config file from the stock kernel sources for your distro, usually in /usr/src/linux* (you may have to install them). Open a root terminal window in the source directory under /usr/src, issue `make xconfig', choose the .config from the load configuration file box and start disabling everything you KNOW you do not need. The help buttons are mostly very helpful. If your box is used for web surfing, compile in ppp; same with lpd if you need to print. Unless you have a SCSI drive, disable all SCSI boxes. Load as much of your equipment into the kernel as you can, and disable the modules that enable hardware and features you don't have or use, like firewire or USB. Make sure equipment you DO HAVE is supported either in the kernel or as a module. Keep doing `next' until the end, when there is no `next.' Choose Main Menu, then save the new configuration.

Do a `make dep bzImage modules modules_install', copy the System.map file from the top of the source tree to /boot as System.map-new.kernel.number, then drill down to /usr/src/linux/arch/i386/boot and copy bzImage to /boot as vmlinuz-number.of.kernel.

From /usr/src/linux, do `make modules_install'.
Modify /etc/lilo.conf to include the new kernel and System.map. Activate lilo (/sbin/lilo -v -v).

Reboot into new kernel. If you get lots of error messages about modules not loading, reboot at the command prompt, and everything will have been rewritten magically. Use your new kernel for testing. You may find you want to try another configuration. Do it all again, changing the Makefile each time under line 3 EXTRAVERSION with another digit or letter to keep it from overwriting a working kernel when you copy in to /boot and to keep the modules straight (though they appear not to care....)
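
(Editor's aside.) Bumping EXTRAVERSION by hand on every build is easy to forget, so here is one way it could be scripted. This is only a sketch: the function name, the `-test2` suffix and the sample Makefile header are made up for illustration, and it assumes the Makefile has a line beginning with `EXTRAVERSION` followed by `=`.

```python
import re


def set_extraversion(makefile_text: str, suffix: str) -> str:
    """Return makefile_text with its first EXTRAVERSION assignment set to suffix."""
    pattern = re.compile(r"^EXTRAVERSION[ \t]*=.*$", re.MULTILINE)
    # A callable replacement keeps backslashes in `suffix` from being interpreted.
    new_text, count = pattern.subn(
        lambda match: f"EXTRAVERSION = {suffix}", makefile_text, count=1
    )
    if count == 0:
        raise ValueError("no EXTRAVERSION line found")
    return new_text


# Illustrative header in the style of a 2.2-series kernel Makefile.
sample = "VERSION = 2\nPATCHLEVEL = 2\nSUBLEVEL = 22\nEXTRAVERSION =\n"
print(set_extraversion(sample, "-test2"))
```

To apply it for real, one would read the Makefile from the source tree, pass its text through the function, and write the result back before running the build.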

Frankly, I've tried nine builds and although my kernels are smaller than stock, use about 5Kb less RAM and benchmarks seem to indicate about 5-6 per cent increase in speed, I feel no difference in use.

I do feel better knowing I am using the latest (and perhaps the last) kernel in the 2.2.x series, though. FWIW.

Re:This is This is the exact opposite of my findin by be-fan · 2002-11-16 13:47 · Score: 4, Informative

The instructions involved in the context switch are slow on the Pentium 4. The P4 has a long internal pipeline to flush, and a huge amount of internal state to synchronize, which makes context switches slow. For example, an interrupt/return pair takes 2000 clock cycles on the P4!

--
A deep unwavering belief is a sure sign you're missing something...

Re:Performance gains mostly for high-end by blakestah · 2002-11-16 16:30 · Score: 4, Informative

Yes.

The fine-grained locking improvements on SMP will make it noticeably better for SMP boxes.

A very big improvement is that IDE has been parallelized, meaning that if you use multiple IDE devices at once you will see a "night and day" difference in performance.

If you are uniprocessor and all SCSI and already use low-latency patches, well, as you were.

Re:Disk buffers & memory subsystem updated?? by charnerd · 2002-11-16 23:56 · Score: 1, Informative

Turn on the "Sticky" bit. That tells the OS that the process is a priority and should be kept in memory if at all possible.

Disable your swap. by Effugas · 2002-11-17 01:00 · Score: 3, Informative

Buy more RAM and disable swap. Or just disable it -- at 1Gb, you're close to what you need anyway.

I'm serious. With another gig costing a hundred dollars -- maybe less -- the overhead of disk-based VM is just no longer justified.

WinXP benefits from this optimization even more than Linux.

Yours Truly,

Dan Kaminsky
DoxPara Research
http://www.doxpara.com

Re:Disk buffers & memory subsystem updated?? by sheepman · 2002-11-17 02:41 · Score: 4, Informative

Configure kswapd.
For example, add the following to /etc/sysctl.conf

vm.kswapd = 12800 512 8

When there is no free memory, kswapd will then free more
memory than it does by default.

Re:Disk buffers & memory subsystem updated?? by muixA · 2002-11-17 04:03 · Score: 3, Informative

Linux does not honor the sticky bit.

man chmod: ...and the Linux kernel ignores the sticky bit on files. Other kernels may use the sticky bit on files for system-defined purposes. On some systems, only the superuser can set the sticky bit on files.
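
(Editor's aside.) For anyone unsure whether the sticky bit is even set on a given path, it can be inspected from Python's standard `os` and `stat` modules. The helper names below are invented for this sketch. Being able to read the bit says nothing about the kernel acting on it; per the man page excerpt above, Linux ignores it on files.

```python
import os
import stat


def has_sticky_bit(mode: int) -> bool:
    """True if the sticky bit (S_ISVTX, octal 1000) is set in a mode value."""
    return bool(mode & stat.S_ISVTX)


def path_has_sticky_bit(path: str) -> bool:
    """Check the sticky bit on an existing file or directory."""
    return has_sticky_bit(os.stat(path).st_mode)


print(has_sticky_bit(0o1755))  # mode value with the sticky bit set
print(has_sticky_bit(0o0755))  # same permissions without it
```

On many systems `path_has_sticky_bit("/tmp")` is true, since the bit is commonly set on shared temporary directories.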
--
Matt
