Slashdot Mirror


Dual Athlon Preview: Linux Kernel Compile Smokes

Mr. Flibble writes: "The fellows over at NewsForge have an article describing how they were able to test the 'World's First Dual DDR Athlon' running Mandrake 7.2 on a prerelease motherboard and chipset. The surprising thing is that the dual system was 142% faster in a kernel compile than a single processor system!" Jeff (of NewsForge) says this is the genuine truth. Now if only the right motherboards would start showing up in quantity on pricewatch ...

17 of 177 comments

  1. 142% Faster? Ok, but did you............. by Anonymous Coward · · Score: 4



    Ah, yes... I can see it now.

    Marketing Bozo: "Ok folks, here's the 'before' version. Wow, that's mighty slow! Let's Ctrl-C out of that kernel build, drop in that second Athlon, and build that kernel again!"

    A few minutes of fiddling pass

    Crowd: (ooooh---ahhh!)

    Marketing Bozo: "Ok! Off we go! Wow, look at that sucker haul! It's nearly 150% faster than the single processor version!"

    Crowd: (ooooooh!---ahhh!) (clap clap clap)

    A voice from the back of the crowd speaks...

    Bowie: Hey dickweed! You forgot to MAKE CLEAN!

    A fight breaks out..

    Crowd: "Kill him!! KILL HIM! He runs that PROPAGANDA page! His words and ideas bring fear, destruction and DEATH to all who listen!! KILL HIM!!"

    The sounds of the beating continue as Marketing Bozo takes pre-orders for his motherboard..

    Just another day at the convention..

  2. Re:Ace's Hardware by maraist · · Score: 5

    I did a similar test using a 466 Dual Celeron system with 128Meg of memory on Red Hat 6.x (With that special Abit board).

    In order to be scientific, you need a control.. I'm sorry to say that this reviewer did no such thing. You point out that -j helps even on single-CPU systems, and this definitely was the case with my test results (I can go dig them up if anybody is interested). BUT, there is a limit to the performance enhancement of -jxxx, since a single task running at full throttle is much faster than 2 or 10 tasks switching back and forth. So what I did, for both single and dual CPU modes, was run with the bare make, then -j 2, then -j 3, -j 4, and finally -j 5 (where performance was being hurt).

    I don't recall exactly, but I believe beyond -j 4 I was swapping to disk (though I know I achieved that phenomenon at a sufficiently large number).

    Another problem with the experiment was that the slower method was run first.. There is the issue of disk caching - namely that the second test stood the chance of having key libraries and possibly most source code still in cache during launch, which would dramatically reduce the IO latency. An ideal test of CPU performance would be to put half a gig of memory in there, run it through once, "reboot", then run it for the other.. This is precisely what I did, and I do believe there were several seconds shaved off for cached recompiles.

    Personally, I like dual procs just so I can watch xosview's dual-CPU meters flop back and forth. :) Additionally it's great for compiling / MP3ing in the background.. Almost zero lag is noticed (not to mention the almost 100% increase in MP3 encoding performance; I believe that was mostly CPU bound).

    -Michael

    --
    -Michael
  3. What's that noise?! by dstone · · Score: 5

    The make -j3 lets make run three processes at once, which would lead to a speedup even on a single processor system, because disk I/O and CPU-bound compilation can overlap.

    The noise you hear is the sound of thousands of single-CPU /. readers typing "make -j3 bzImage".

  4. Ace's Hardware by NovaX · · Score: 5

    There's a better news bite at Ace's about this. Basically, the second compilation used 3 threads, so the CPU may have had less idle time and less of an I/O bottleneck than the single.

    "Unfortunately, the benchmarks vary significantly between the two tests in that the first is completely serialized while the second (dual-processor) test is run with three parallel make processes (notice the -j flag). Because the first system is running with only a single build instance, the processor is spending a great deal of time simply waiting on IO. Meanwhile, the dual-processor test was performed with not just two, but, in fact, three make processes. The difference here is that a processor will not be completely idle while waiting on IO in the second test, as there are two additional build processes running concurrently. This is why the use of the -j parameter is often recommended even for uniprocessor systems, as a parallel make will often yield much higher CPU utilization and thus faster compiles.

    "Until then, it is very difficult to make a representative statement about the performance of a dual-processor Athlon system from this benchmark."



    -----------------------------------------

    --

    "Open Source?" - Press any key to continue
  5. Re:142%? by Mr+Z · · Score: 4

    No, it was a bogus test. The author did "make bzImage" for the uniprocessor test and "make -j3 bzImage" on the multiprocessor test. If he'd done "make -j3 bzImage" for both, he would've discovered that the machine sped up by less than 100% most likely.

    The thing is, "make -jX" for about 1 < X <= 4 still gives a speedup on uniprocessor systems because some compile tasks can be in disk-wait while others sit on the CPU. (The optimal number for X depends on how fast your disks are and how much RAM you have. If X is too big, you start swapping, and end up losing performance.)

    --Joe
    --
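    The disk-wait overlap described above can be sketched with a toy model. Everything here is an assumption made for illustration: the function names, the idea that every compile job costs the same CPU and I/O time, and the perfect overlap with no swapping or context-switch cost. It is not a measurement of any system in this thread.

```python
def serial_time(jobs, cpu_s, io_s):
    """Wall time for a plain 'make': the CPU sits idle during every I/O wait."""
    return jobs * (cpu_s + io_s)


def overlapped_time(jobs, cpu_s, io_s, parallel, cpus=1):
    """Idealized wall time for 'make -j<parallel>' on `cpus` processors.

    Toy assumptions: I/O waits overlap perfectly with other jobs' CPU work,
    nothing swaps, and task switching is free.
    """
    # Lower bound if the processors never go idle.
    cpu_bound = jobs * cpu_s / min(cpus, parallel)
    # Lower bound if each of the `parallel` slots runs its jobs back to back.
    latency_bound = jobs * (cpu_s + io_s) / parallel
    return max(cpu_bound, latency_bound)


if __name__ == "__main__":
    # Hypothetical workload: 100 files, 1.0 s CPU and 0.25 s disk wait each.
    base = serial_time(100, 1.0, 0.25)
    for cpus in (1, 2):
        t = overlapped_time(100, 1.0, 0.25, parallel=3, cpus=cpus)
        print(f"-j3 on {cpus} CPU(s): {t:.1f}s vs {base:.1f}s serial")
```

    With these made-up numbers, -j3 already beats the serial build on one CPU, so comparing a serial single-CPU run against a -j3 dual-CPU run overstates what the second processor contributes.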
  6. lamer benchmark by valentyn · · Score: 3
    Gee - that could not have been done worse. They actually make the 1 processor kernel with ``make bzImage'', and the (so called) dual processor with ``make clean; make -j3 bzImage''.

    This suggests that they made a kernel on the same system before, and tried to ``undo'' the make.

    This is stupid. Why? Because:

    • they did not run a ``make dep''. This means the so-called ``single processor compile'' (which it is not!) is set back several seconds (make dep takes 40 seconds on my SMP Celeron 466).
    • The SMP version can take advantage of this, as ``make clean'' does not need ``make dep'' anymore. (AFAIK).
    • ``make -j3'' is *not* the same as ``testing an SMP compile''.

    If they really wanted a single vs. dual processor kernel compile test, they should have started with two real kernels, one for uniprocessor, one for SMP.

    Then make a ``test config'' .config-file, for example with ``cp arch/i386/defconfig .config; make oldconfig'' (and press a couple of enters). Copy this file to ``Testconfig'' or something.

    Now start the system with the single processor kernel and run the following:

    make mrproper; cp Testconfig .config; make oldconfig; make dep; time make -j$N bzImage

    ... for $N being 1-3. Write down results. This is the ``single processor'' kernel compile time. The ``make mrproper'' makes sure there is no garbage left (another, even better, way of testing would be unpacking a new kernel source tree for every test).

    Now reboot the system and run the dual processor kernel. Recompile, with -j$N maybe going up to 4 or 5 or so.

    Now *that* is something that comes close to a benchmark.
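    The procedure above could be scripted roughly as follows. This is a minimal sketch, assuming it is run from the top of a kernel source tree that already contains the ``Testconfig'' file; the function names and the injectable `run` and `clock` parameters are made up for illustration. Note that ``make oldconfig'' may still prompt for input, as the comment says.

```python
import subprocess
import time

# Preparation steps taken from the command line quoted above.
PREPARE = [
    ["make", "mrproper"],
    ["cp", "Testconfig", ".config"],
    ["make", "oldconfig"],
    ["make", "dep"],
]


def build_command(jobs):
    """The timed step: make -j<jobs> bzImage."""
    return ["make", f"-j{jobs}", "bzImage"]


def run_once(jobs, tree=".", run=subprocess.run, clock=time.perf_counter):
    """Reset the tree, then return the wall time of one build in seconds."""
    for cmd in PREPARE:
        run(cmd, cwd=tree, check=True)
    start = clock()
    run(build_command(jobs), cwd=tree, check=True)
    return clock() - start


def sweep(job_counts=(1, 2, 3), **kwargs):
    """Time one clean build per -j value; run this once per booted kernel."""
    return {jobs: run_once(jobs, **kwargs) for jobs in job_counts}
```

    Calling sweep() under the uniprocessor kernel and again under the SMP kernel (with a wider range of job counts) gives the two sets of numbers to compare. Only the final make is timed, so the cost of ``make dep'' does not leak into either result.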

    --
    my other sig is a 500 page novel
  7. Linux SMP kernel "does the right thing." by CaptainAbstraction · · Score: 3

    Linux SMP kernel does the right thing as far as cache synchronization, according to the text Understanding the Linux Kernel published by O'Reilly, in reference to kernel 2.2:

    The section "Hardware Cache" in Chapter 2, Memory Addressing, explained that the contents of the hardware cache and the RAM maintain their consistency at the hardware level. The same approach holds in the case of a dual processor. ... But now updating becomes more time-consuming: whenever a CPU modifies its hardware cache it must check whether the same data is contained in the other hardware cache and, if so, notify the other CPU to update it with the proper value. This activity is often called cache snooping. Luckily, all this is done at the hardware level and is of no concern to the kernel.

    Hope this helps

    Cheers,
    Andrew

  8. Some comparable benchmarks... by Wolfstar · · Score: 5
    Being bored and with a comparable machine, I decided to do some tests of my own.

    System: SuSE 7.0, kernel 2.4.1 compiled with Uniprocessor and APIC/IO_APIC.

    Athlon 1.1GHz, Asus A7V motherboard. FSB is 100MHz DDR. Memory is 256 megs at PC133, ATA66 5400RPM drive with ReiserFS.

    I performed three series of tests. All tests were performed in single/double/triple thread order, and each thread compile had its own directory.

    First test, all three had been make config'd per the original article, followed by make dep. After that, I rebooted and did all three compiles without rebooting. Second series started the process over again by make mrproper/make oldconfig/make dep/time make -jN bzImage, with N being the corresponding thread. Finally, I did a make mrproper/make oldconfig/make dep and rebooted each time before the compile.

    I should note that on several occasions, I got odd results; whether this was caching of some sort or not I don't know, but I would get 3m35s on a single thread and 1m9s on a -j2 with a removed and recreated directory, as well as one or two other occasions - unfortunately, all the other occasions were when I was accidentally failing to use "time make -j2 bzImage" and instead was only doing "make -j2 bzImage", so I have no empirical proof. At any rate, here are the recorded ones.

    Round 1

    Straight
    real 3m17.571s
    user 2m54.660s
    sys 0m13.120s

    -j2
    real 3m13.772s
    user 2m58.390s
    sys 0m13.390s

    -j3
    real 3m13.470s
    user 2m59.390s
    sys 0m13.180s

    Round 2

    Straight
    real 3m8.048s
    user 2m54.780s
    sys 0m13.140s

    -j2
    real 3m11.912s
    user 2m58.050s
    sys 0m13.590s

    -j3
    real 3m12.532s
    user 2m58.370s
    sys 0m13.900s

    Fresh-boot compile

    Single thread was not redone; it was the Round 1.

    -j2
    real 3m15.634s
    user 2m58.030s
    sys 0m13.700s

    -j3
    real 3m16.433s
    user 2m59.310s
    sys 0m13.290s

    As you can see, not much of a variation on here. The times are also a hell of a lot better than a 1.2GHz system single-threaded with DDR SDRAM, which makes me wonder what precisely is slowing down the 1.2GHz...

    Food for thought.

    --
    You thought that this sig was what you think that I thought you wanted me to think. I think.
  9. Re:And here's the pic by dongkiru · · Score: 3
    Heh... Yeah, the first thing I noticed when I saw this board (in person, not the picture; I work for ASL, and saw this board last week when my boss was testing it) was the power connector. Basically, this is a very rough draft of the motherboard from Tyan, and our company works closely with Tyan on their board development. So hopefully they'll change the locations of the connectors for the final release.

    The reclining DIMM slots are there because the DDR memory for this motherboard is fairly tall, and Tyan would like to be able to use this motherboard in 1U rack systems. The reclining DIMM slots do waste a lot of real estate, which could've been used to place the power connector closer to the edge. But the market for 1U rack mount systems appears to be growing rapidly, so I think the reclining DIMM slot is very important.

    For those of you who are complaining that you just bought a system, or who are looking to buy one, this board isn't even supposed to be announced until March, so don't hold your breath.

  10. The lie of -j3 and no "make dep" by joto · · Score: 4

    I tried the same test on my uniprocessor system, running first "time make bzImage" then "make clean", and last "time make -j 2 bzImage":

    Single thread:

    597.00user 46.40system 12:11.08elapsed 88%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (789303major+881687minor)pagefaults 0swaps

    Two threads on one processor:

    511.41user 31.30system 9:21.66elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (489357major+669019minor)pagefaults 0swaps

    By the same logic as they used in this benchmark, my uniprocessor system is thus 31 percent faster than the same old uniprocessor system. Bah! I just wish people weren't posting nonsensical benchmarks like this. At least, they should _try_ to make it somewhat representative...
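    For anyone checking the arithmetic, here is a small sketch of how a "percent faster" figure falls out of two elapsed times. The helper names are made up, and the parser only handles the M:SS.ss form shown in the output above (longer runs print an hours field, which this ignores).

```python
def elapsed_seconds(stamp):
    """Convert an 'M:SS.ss' elapsed field, as printed above, to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def percent_faster(slow_s, fast_s):
    """How much faster the quick run is, as a percentage of its own time."""
    return (slow_s / fast_s - 1.0) * 100.0


single = elapsed_seconds("12:11.08")  # plain make
double = elapsed_seconds("9:21.66")   # make -j 2 on the same single CPU

print(f"{percent_faster(single, double):.1f}% faster")
```

    On the two elapsed times above this comes out at roughly 30 percent, in line with the figure quoted. By the same definition, "142% faster" means the slower build took about 2.42 times as long as the faster one.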

  11. Re:Not quite a perfect comparison by mian · · Score: 3
    Here are my tests. I didn't remove CPUs or anything, since these servers are in production and at the co-location ISP.

    Linux 2.2.18 (Dual p3-550, 1Gb ram, all SCSI compiling Apache 1.3.17)

    make
    real 0m37.244s
    user 0m23.900s
    sys 0m6.000s

    make -j3
    real 0m26.915s
    user 0m24.360s
    sys 0m6.020s

    make -j4
    real 0m23.724s
    user 0m24.130s
    sys 0m5.880s

    make -j5
    real 0m20.154s
    user 0m22.940s
    sys 0m5.000s

    make -j6
    real 0m21.326s
    user 0m24.120s
    sys 0m5.830s

    FreeBSD (Dual p3-550, 512Mb ram, all SCSI compiling Apache 1.3.17)

    make
    39.458u 5.635s 0:48.99 92.0% 1686+1874k 0+1249io 0pf+0w

    make -j3
    40.007u 5.725s 0:32.53 140.5% 1696+1884k 0+1645io 1pf+0w

    make -j4
    40.027u 5.817s 0:32.73 140.0% 1691+1877k 0+1631io 0pf+0w

    make -j5
    40.154u 5.832s 0:31.74 144.8% 1701+1884k 1+1628io 0pf+0w
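    To find the sweet spot in a sweep like the Linux one above, the "real" lines can be parsed and compared. A minimal sketch, assuming the bash-style "real XmY.YYYs" format shown; the regex and names are illustrative, and the FreeBSD lines use a different (csh) layout that this does not handle.

```python
import re

# Matches bash's time output, e.g. "real 0m37.244s".
REAL_RE = re.compile(r"real\s+(\d+)m([\d.]+)s")


def real_seconds(line):
    """Extract wall-clock seconds from a 'real XmY.YYYs' line."""
    match = REAL_RE.search(line)
    if match is None:
        raise ValueError(f"no 'real' time in {line!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# The Linux 2.2.18 wall-clock figures posted above.
linux_runs = {
    "make": "real 0m37.244s",
    "make -j3": "real 0m26.915s",
    "make -j4": "real 0m23.724s",
    "make -j5": "real 0m20.154s",
    "make -j6": "real 0m21.326s",
}

times = {cmd: real_seconds(line) for cmd, line in linux_runs.items()}
best = min(times, key=times.get)
print(best, f"{times['make'] / times[best]:.2f}x vs plain make")
```

    On these figures the fastest run is -j5, with -j6 already slightly slower, which fits the point made elsewhere in the thread that the gain from -j tops out.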

  12. Not quite a perfect comparison by jerky · · Score: 5

    The article states:

    The kernel was then compiled using "time make bzImage." The dual processor results were then done by first doing "make clean" then "time make -j3 bzImage".

    This isn't really a good way to compare single processor results to dual processor results. The make -j3 lets make run three processes at once, which would lead to a speedup even on a single processor system, because disk I/O and CPU-bound compilation can overlap. The only totally fair way to compare is to boot a non-SMP kernel, run the benchmark, then boot an SMP kernel and run exactly the same benchmark.

    Even though the 142% speedup is bogus, the two minute kernel compile is pretty damn fast.

  13. The good and bad of AMD's MP by DeafDumbBlind · · Score: 5

    The good thing about AMD's upcoming dual systems is that each processor has a separate bus to the memory, unlike Intel systems where all the chips share the same bus.

    The bad thing is that so far only Tyan has announced a MB based on the 760MP chipset, and that MB is definitely suited for servers; it won't fit in a standard ATX case.

    --


    Jesus used to be my co-pilot, but we crashed in the mountains and I had to eat him.
  14. How fast does it by Anagon · · Score: 5
    boot Windows 2000? Now that's a test~

  15. Glad to hear it... by rich22 · · Score: 3

    but wouldn't a better test involve removing one of the processors, compiling a kernel (while running a non-SMP kernel), and then lather-rinse-repeat with both processors in under an SMP-enabled kernel? According to the make man page, the -j option controls the number of jobs allowed to run simultaneously. Depending on what you are making, the -jN option can even speed up compile times on single processor machines. The 142 percent "performance" increase may be partially explained by this.

    As one of those dual-Celeron guys (bang for the buck!), I love to see AMD finally show off dual processor machines. But the next time we get a chance to play with one, let's try to make a more realistic comparison.

  16. Switches invalidate the results (also: 4-way SMP) by NortonDC · · Score: 3

    The "-j3" switch with the make is why it got a greater-than-linear improvement.

    See Ace's Hardware for a discussion of exactly this:

    "[T]he dual-processor test was performed with not just two, but, in fact, three make processes. The difference here is that a processor will not be completely idle while waiting on IO in the second test, as there are two additional build processes running concurrently. This is why the use of the -j parameter is often recommended even for uniprocessor systems, as a parallel make will often yield much higher CPU utilization and thus faster compiles."

    Also, see reader comments saying that AMD demonstrated a 4-way SMP Athlon system at LinuxWorld.

  17. Re:Lies, statistics and benchmarks by martyb · · Score: 3
    This is how a serious benchmark should be done, with the machine state as similar as possible before each test.

    I agree. I think this post was on the right track in performing a number of tests to find out where the sweet spot is for the -j argument. There have been hypotheses posted here that caching effects may have interfered with the results. (I wonder if interim/final files' locations on the disk could vary the results, too -- longer seek/write times... maybe need to defrag the disk between iterations, too?)

    BUT, it strikes me that EACH test should be repeated a sufficient number of times so that the durations measured vary within a desired confidence level (statistics term -- standard deviation and variance and other stuff whose name and vague concepts I recall but I learned too long ago to recall, now). At an absolute minimum, doing each test twice and having results that vary within, say, a couple seconds would counter the concerns that there was some unknown but suspected optimization happening (e.g. disk cache, left over interim files, etc.).

    Personally, I'd still prefer to see each test performed at least 3 times. In my experience, I've seen very close 2-try results where the results on the 3rd time sometimes confirmed them, but other times refuted them. (Yes, I know it's not "scientific", but I'd rather repeat an unnecessary test than omit a necessary one!)

    Then, to make sure there were no accumulated small effects from running all those tests, repeat the very first test one more time to confirm that its results fell in line with the original results.
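    The repeat-and-compare idea can be sketched with Python's statistics module. The sample values are the -j2 and -j3 "real" times Wolfstar posted earlier in this thread, converted to seconds; they come from three different setups (Round 1, Round 2, fresh boot), so treating them as repeats is an assumption made only to have numbers to feed in. The "two combined standard deviations" threshold is a crude rule of thumb chosen for illustration, not a formal significance test.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation); needs at least two runs."""
    return statistics.mean(samples), statistics.stdev(samples)


def clearly_different(a, b, factor=2.0):
    """Crude check: do the means differ by more than the runs' own scatter?"""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) > factor * (sd_a + sd_b)


# Wall-clock seconds from the -j2 and -j3 results posted earlier.
j2 = [193.772, 191.912, 195.634]
j3 = [193.470, 192.532, 196.433]

for label, runs in (("-j2", j2), ("-j3", j3)):
    mean, sd = summarize(runs)
    print(f"{label}: mean {mean:.1f}s, stdev {sd:.1f}s over {len(runs)} runs")
print("clearly different:", clearly_different(j2, j3))
```

    If the gap between two configurations is smaller than the scatter within each one, as it is here, more runs are needed before calling one faster.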