Slashdot Mirror


23 Second Kernel Compiles

b-side.org writes "As a fine testament to how quickly linux is absorbing technology formerly available only to the computing elite, an LKML member posted a 23 second kernel compile time to the list this morning as a result of building a 16-way NUMA cluster. The NUMA technology comes gifted from IBM and SGI. Just one year ago, a Sequent NUMA-Q would have cost you about USD $100,000. These days, you can probably build a 16-way Xeon (4X 4-way SMP) system off of ebay for two grand, and the NUMA comes free of charge!"

28 of 222 comments

  1. Tempting... by JoeLinux · · Score: 4, Insightful

    ok..I'm NOT about to start the proverbial deluge of people wanting to know about a beowulf cluster of these things. But what I will ask is this: if it can do that for a kernel, I wonder how long it will take to do Mozilla, or XFree? It'd be interesting to see those stats.

    JoeLinux

    1. Re:Tempting... by castlan · · Score: 4, Interesting

      A Beowulf cluster of these? That's so 2 years ago... I'd love to see a NUMA-linked cluster of these! And I wonder how long it would take that cluster running GNOME under XFree86 to have Mozilla render this page nested at -1!

      Seriously, I wonder how long it takes to boot. Every NUMA machine I've ever used took more than its fair share of time to boot... much more than a standard Unix server. It would be pretty funny if compiling the kernel turned out to be trivial compared to booting!

    2. Re:Tempting... by hansendc · · Score: 4, Informative

      Seriously, I wonder how long it takes to boot.

      They do take a good bit of time to boot. In fact, it makes me much more careful when booting new kernels on them because if I screw up, I've got to wait 5 minutes, or so, for it to boot again! But, they do boot a lot faster when you run them as a single quad and turn off the detection of other quads.

  2. $500 for a quad xeon? by Anonymous Coward · · Score: 3, Informative

    No way. Just a no-CPU, no-memory case and
    motherboard costs $500. More like $2000
    to $3000 for an old quad.

    1. Re:$500 for a quad xeon? by Wells2k · · Score: 5, Informative

      No way. Just a no-CPU, no-memory case and
      motherboard costs $500. More like $2000
      to $3000 for an old quad.


      I am actually in the process of building a quad xeon right now with bits and pieces I bought off of eBay, and this is certainly doable. Not sure about the $500, but $2000-$3000 is high. I have the motherboard and memory riser now for $150, I am pretty sure that I can get a used rackmount case for $100 or so, the CPUs are going to cost around $60-70 each (P-III Xeon 500's), and memory is cheap as well.

      I figure I will be in it for around $1000 in the end. Yes, $500 is a low number, but I also know that your estimates of $2000-3000 are high.
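
      A quick tally of the parts priced above, as a rough Python sketch. The function name and its defaults are made up for illustration; memory, disks, and everything else are left out because no price was given for them, and the case figure is only the "$100 or so" guess.

```python
# Sum only the parts that were actually priced in the post:
# motherboard + memory riser ($150), a used rackmount case (~$100),
# and four P-III Xeon 500s at $60-70 each. Everything else is unknown.

def known_parts_total(cpu_price, cpu_count=4, board_and_riser=150, case=100):
    """Return the cost in dollars of the parts with a stated price."""
    return board_and_riser + case + cpu_count * cpu_price

if __name__ == "__main__":
    low = known_parts_total(60)    # cheapest CPU estimate
    high = known_parts_total(70)   # priciest CPU estimate
    print(f"priced parts: ${low} to ${high}")
    print(f"left over from a $1000 budget: ${1000 - high} to ${1000 - low}")
```

      So the priced parts land right around the disputed $500, and the rest of the $1000 estimate is whatever memory and the remaining bits end up costing.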

  3. 42 seconds by decep · · Score: 4, Informative

    23 seconds is impressive. I, personally, have seen a 42 second compile time of a 2.2 series kernel on an Intel 8-way system (8GB RAM, 8 550MHz PIII Xeons w/ 1MB L2). It was in the 1 minute range with a 2.4 kernel.

    Definitely the most impressive x86 system I have ever seen.

    1. Re:42 seconds by kigrwik · · Score: 5, Funny

      Arthur Dent: Ford, I've got it ! "What's the kernel compile time in seconds on an Intel 8-way Xeon ?"
      Ford: 42 ! We're made !

      --
      -- don't discount flying pigs until you have good air defense
  4. Was I the only one... by leviramsey · · Score: 4, Funny

    ...who wondered, "I didn't know that Clive Cussler had gotten into cluster design?"

  5. ok this is NOT a troll by autopr0n · · Score: 4, Interesting

    But, does anyone know how NUMA compares with, say, a beowulf cluster? Does NUMA allow you to 'bind' multiple systems into one, so that I wouldn't need to rewrite my software? Did these guys use a stock GCC or something special? I know you would need to use MPI or similar for beowulf. Is NUMA as scalable as Beowulf in terms of building huge-ass machines (of course if I was going to expend the effort to do that, I might as well want to write custom software).

    If this type of system would allow 'supercomputer' performance on regular programs... well... that would be really nice. How much work is it to setup?

    --
    autopr0n is like, down and stuff.
    1. Re:ok this is NOT a troll by macinslak · · Score: 5, Informative

      NUMA is rather different than Beowulf.

      NUMA is just a strategy used for making computers that are too large for normal SMP techniques. I read a few good papers on sgi.com a couple of years ago that explained it in detail, and the NUMA link in the article had a quick definition. NUMA systems run one incarnation of one OS throughout the whole cluster, and usually imply some kind of crazy-ass bandwidth running between different machines. I don't think you could actually create a NUMA cluster of separate quad Xeon boxes, and it would probably be ungodly slow if you tried.

      There probably isn't any difference for kernel compiles between the two, but NUMA clusters don't require any reworking of normal multithreaded programs to utilize the cluster and can be commanded as one coherent entity (make -j 32, wheee).

    2. Re:ok this is NOT a troll by jelson · · Score: 4, Informative

      NUMA is somewhere in between clustering (e.g. Beowulf) and SMP.

      On a normal desktop machine, you typically have one CPU and one set of main memory. The CPU is basically the only user of the memory (other than DMA from peripherals, etc.) so there's no problem.

      SMP machines have multiple CPUs, but each process running on each CPU can still see every byte of the same main memory. This can be a bottleneck as you scale up, since larger and larger numbers of processors that can theoretically run in parallel are being serviced by the same, serial memory.

      NUMA means that there are multiple sets of main memory -- typically one chunk of main memory for every processor. Despite the fact that memory is physically distributed, it still looks the same as one big set of centralized main memory -- that is, every processor sees the same (large) address space. Every processor can access every byte of memory. Of course, there is a performance penalty for accessing nonlocal memory, but NUMA machines typically have extremely fast interconnects to minimize this cost.

      Multicomputers and clusters such as Beowulf completely disconnect memory spaces from each other. That is, each processor has its own independent view of its own independent memory. The only way to share data across processors is by explicit message-passing.

      I think the advantage of NUMA over beowulf from the point of view of compiling a kernel is just that you can launch 32 parallel copies of gcc, and the cost of migrating those processes to processors is nearly 0. With beowulf, you'd have to write a special version of 'make' that understood MPI or some other way of manually distributing processes to processors. Even with something like MOSIX, an OS that automatically migrates processes to remote nodes in a multicomputer for you, the cost of process migration is very high compared to the typically short lifetime of a typical instantiation of 'gcc', so it's not a big win. (MOSIX is basically control software on top of a beowulf style cluster, and the kernel mods needed to do transparent process migration)

      I hope this clarified the situation rather than further confusing you. :-)
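
      To put rough numbers on the local vs. nonlocal penalty, here is a minimal Python sketch. The latencies (100 ns local, 400 ns remote) and the function name are assumptions picked for illustration, not measurements from any real NUMA machine.

```python
# Toy model of average memory latency on a NUMA machine.
# All numbers below are illustrative assumptions, not measurements.

def average_latency_ns(local_fraction, local_ns=100.0, remote_ns=400.0):
    """Weighted average of local and remote access latency.

    local_fraction is the share of memory accesses that hit the
    processor's own chunk of memory; the rest cross the interconnect.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

if __name__ == "__main__":
    for frac in (1.0, 0.9, 0.5, 0.0):
        avg = average_latency_ns(frac)
        print(f"{frac:.0%} local -> {avg:.0f} ns average")
```

      The model is crude (no caches, no contention), but it shows why keeping a process near its own memory matters, and why a faster interconnect shrinks the gap.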

  6. Alternatively... by Ed+Avis · · Score: 4, Funny

    You can also get 23-second kernel compiles in software using Compilercache :-).

    --
    -- Ed Avis ed@membled.com
  7. Re:Who would have guessed... by BitwizeGHC · · Score: 4, Interesting

    No, this is a case of free software and cheap hardware making technologies available now to many people for whom it wasn't available (i.e., outside the realm of affordability because it was only sold by expensive proprietary vendors) just a short time ago. That is a more significant change than the endless treadmill of Moore's Law to which we had become accustomed.

    --
    N4st0r, trixx0r h0bb1tz0rz! Th3y st0l3 0ur pr3c10uzz!
  8. this may be good but... by m0RpHeus · · Score: 3, Insightful

    This may be good news, but what the heck! They should have at least included the .config that they used so that we can know which drivers/modules were compiled with it, or whether this is just a bare-bones kernel with only enough to run the basics. We need to know the complexity of the configuration before we can really say it's fast.

    --
    Take-off every .sig! For Great Justice!
  9. Yeah, that's great, and all... by Wakko+Warner · · Score: 4, Funny

    But where can I get a NUMA cluster for $80? Should I Ask Slashdot?

    - A.P.

    --
    "Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
  10. Re:Why? by quintessent · · Score: 4, Informative

    is there some hidden application of this that I'm not seeing?

    How about doing other stuff really fast?

    3D modeling. 3D simulations. Even extensive photoshop editing with complex filters can benefit from this kind of raw speed.

    It wouldn't be a catchy headline, though, if it said "render a scene of a house in 40 seconds--oh, and here are the details of the scene so you can be impressed if you understand 3D rendering..."

    There are hundreds of applications for this, many of which we don't do every day on our desktop simply because they take too much juice to be useful. With ever-faster computers, we will continue to envision and benefit from these new possibilities.

  11. HELLLOOOOOO??? by Anonymous Coward · · Score: 4, Insightful

    You can't build a NUMA cluster worth a crap without a fast, low-latency interconnect.

    Sequent's NUMA Boxen use a flavor of SCI (Scalable Coherent Interface) which is integrated into the memory controller.

    While you can use some sort of PCI-based interconnect, the results are just plain not worth it.

    Infiniband should be better, though I've heard the latency is too high to make this a marketable solution.

    Keep your eyes on IBM's Summit chipset based systems. These are quads tied together with a "scalability port" and go up to 16-way. They should go to 32 or higher by 2003. That's when NUMA will -finally- be inevitable...

  12. Great news for mozilla and nautilus... by boris_the_hacker · · Score: 5, Funny

    ... with the advent of this new technology and raw speed, you should actually be able to use them!

    [this is actually a joke]

    --
    chris at darkrock dot co dot uk
    http colon slash slash www dot darkrock dot co dot uk
  13. IBM and Sequent being good citizens by swirlyhead · · Score: 5, Informative

    I went and looked at the email and noticed that the very first patch he mentions was from the woman who came and gave a talk to EUGLUG last spring. For one of our Demo Days we emailed IBM and asked them if they would send down someone to talk about IBM's Linux effort. We were kind of worried that they would send a marketing type in a suit who would tell us all about how much money they were going to spend, etc., etc. But we were very pleasantly surprised when they sent down a hardcore engineer who had been with Sequent until they were swallowed by IBM.

    She did a pretty broad-ranging overview of the linux projects currently in place at IBM, and then dived into the NUMA/Q stuff that she had been working on. The main gist of which is that Sequent had these 16-way fault-tolerant redundant servers that needed linux because the number of applications that ran on the native OS was small and getting smaller. Turned out that even the SMP code that was in the current tree at the time did not quite do it. She had some fairly hairy debugging stories; apparently sprinkling print statements through the code doesn't work too well when you're dealing with boot time on a multiprocessor system because it causes the kernel to serialize when in normal circumstances it wouldn't...

    I think the end result of all this progress with multiprocessor systems is that we'll be able to go down to the hardware store, buy more nodes, plug 'em into the bus, and compute away.

  14. Re:Why? by JabberWokky · · Score: 3, Informative
    Maybe this is a silly question.. but why would you want to compile a kernel in 23 seconds?

    That's not the point - kernel compilation (or the compilation of any large project like KDE or XFree[1]) is a fairly common benchmark for general performance. It chews up disk access and memory and works the CPU quite nicely.

    [1] Large is, of course, a relative thing. Also, some compilers (notably Borland) are incredibly efficient at compiling (sometimes through manipulating the language specs so the programmer lines things up so the compiler can just go through the source once and compile as it goes).

    Still, benchmarks are suspect to begin with, and kernel compile time is a decent loose benchmark. What was that quote from Linus about the latest release being so good he benchmarked an infinite loop at just under 6 seconds? :)

    --
    Evan

    --
    "$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
  15. Re:Who would have guessed... by sydb · · Score: 3, Funny

    Woah, for a moment I could have sworn that was a Jon Katz article...

    --
    Yours Sincerely, Michael.
  16. Nice try... by castlan · · Score: 4, Informative

    But the reserve for this machine is $3850. The article says 16 way, which would be four of these four-way SMP systems. That also doesn't take into account the need for a high-bandwidth, low latency interconnect (like SGI's NumaLink.) If you aren't expecting more than 16-way SMP, then you can probably get away with switched Gigabit Ethernet, as long as it is kept distinct from the normal network connectivity. If the Gigabit upgrade is still dual port, then you are set. If not, you'll need another NIC - though you will only really need one for the whole cluster.

    Maybe instead of two grand, the poster meant twenty-grand. Either way, $20 grand is better than $100K!

  17. Re:Why? by LinuxHam · · Score: 5, Insightful

    but why would you want to compile a kernel in 23 seconds?

    I think this benchmark is used time and time again because it's really the only one that nearly any Linux user would be able to compare their own experiences to. If they said 1.2 GFLOPS, I (and I suspect most others) could only say "Wow, that sounds like a lot. I wonder what that looks like." OTOH, I have seen how long it takes to download 33 Slackware diskettes in parallel on a v.34 modem, and I still run 3 P75's today.

    I've been told that I will soon be deploying Beowulf HPC clusters to many clients, including universities and biomedical firms. If they were to tell me that the clusters will be able to do protein folds (or whatever they call it -- referring back to the nuclear simulation discussion) in "only 4 weeks", I won't have a clue as to how to scale that relative to customary performance of the day.

    Sure, there are many other applications that are run on clusters, but kernel compiles are the ones that all of us do. It can give us an idea of what kind of performance you'd get out of other processor-intensive operations. And many people will tell you there are so many variables with kernel compiles that it's ridiculous to compare the results.

    Check out beowulf.org and see what people are doing with cluster computing. I've always wanted to open a site that compiles kernels for you. Just select the patches you want applied and paste the .config file. I'll compile it, and send back to you by email a clickable link to download your custom tarball. Of course no one here would trust a remotely compiled kernel :)

    --
    Intelligent Life on Earth
  18. Re:Why? by sohp · · Score: 5, Funny

    Never ask a geek, "why?". Just nod your head and back away slowly.

  19. make bzImage is not a very good benchmark by wowbagger · · Score: 3, Informative
    I would assert that a simple "time make -j32 bzImage" (which is what is being quoted) is not a very good benchmark as it is.

    Reason? Not enough information as to the options.
    • What version kernel was he building (actually, the LKML post did give this, but as a general statement this objection still stands)
    • What were his compile options? Building a kernel with everything possible built as modules will take a great deal less time to build bzImage (the non-module part of the kernel) than would a kernel with everything built in.
    • Then there's the issue of buffercache - to be consistent you would have to do a "make -j32 bzImage && make -j32 clean && time make -j32 bzImage" in order to have a consistent set of files in the VFS buffercache.
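
    The warm-the-cache-then-time idea in the last bullet can be sketched in Python roughly like this. The function names are hypothetical and the build commands are passed in as plain shell strings, so this is only one assumed way to script it, not what the LKML poster actually ran.

```python
# Sketch of a warm-cache timing harness: build once (untimed) so the
# source files end up in the buffer cache, clean, then time a second build.
import subprocess
import time

def timed_run(cmd):
    """Run a shell command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True)  # raises if the command fails
    return time.perf_counter() - start

def warm_cache_benchmark(build_cmd, clean_cmd):
    """Warm-up build, clean, then return the time of the second build."""
    subprocess.run(build_cmd, shell=True, check=True)  # warm-up, not timed
    subprocess.run(clean_cmd, shell=True, check=True)  # remove build output
    return timed_run(build_cmd)                        # the number to report
```

    From the top of a configured kernel tree, something like warm_cache_benchmark("make -j32 bzImage", "make -j32 clean") would give the figure to report, ideally alongside the kernel version and the .config.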

    Nevertheless:

    I WANT ONE
  20. Sorry, Anton Blanchard Wins by nbvb · · Score: 3, Informative

    http://samba.org/~anton/e10000/maketime_24

    Wheeeeeee!

    And seriously, I saw some comments about needing a really fast interconnect... check out Sun's Wildcat.

    --NBVB

  21. Re:hmph by Paul+Jakma · · Score: 3, Insightful

    what about the interconnect? the machine in question is /not/ a simple beowulf cluster, it's NUMA. Non Uniform Memory Architecture, which implies there is some form of memory architecture, and that the main difference between that architecture and that of a normal computer is that it is non-uniform.

    Ie, the CPUs in this computer share a common address space and can reference any memory, just that some memory (eg located at another node) has a higher cost of access than other memory. (as opposed to a typical SMP system where all memory has an equal 'cost of access').

    at the moment, under linux, this implies that there is special hardware in between those CPUs to provide the memory coherency - ie lots of bucks - cause there is no software means of providing that coherency (least not in linux today).

    NB: normal linux SMP could run fine on a NUMA machine (from the memory management POV), but it would be slower because it would not take the non-uniform bit into account.

    anyway... despite what the post says, this machine is /not/ a collection of cheap PCs connected via 100/1G ethernet or other high-speed packet interconnect.

    --
    I use Friend/Foe + mod-point modifiers as a karma/reputation system.
  22. Re:Why is this alternative funny? by Webmonger · · Score: 3, Informative

    Probably because compilercache is a way to AVOID compiling. . .
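
    For anyone wondering what "avoid compiling" looks like in practice, here is a toy Python sketch of the general caching idea. It assumes the cache is keyed on a hash of the preprocessed source plus the compiler flags; the class and its names are hypothetical and this is not taken from compilercache's actual code.

```python
# Toy compiler cache: only call the (expensive) compile step on a miss.
# Hypothetical illustration of the idea, not compilercache's implementation.
import hashlib

class ToyCompilerCache:
    def __init__(self, compile_fn):
        self.compile_fn = compile_fn  # the slow step we want to skip
        self.store = {}               # cache key -> cached "object code"
        self.hits = 0
        self.misses = 0

    def _key(self, preprocessed_source, flags):
        """Hash the flags and the preprocessed source into one key."""
        h = hashlib.sha256()
        h.update("\0".join(flags).encode())
        h.update(b"\0\0")             # separator between flags and source
        h.update(preprocessed_source.encode())
        return h.hexdigest()

    def compile(self, preprocessed_source, flags):
        key = self._key(preprocessed_source, flags)
        if key in self.store:         # same input seen before: no compile
            self.hits += 1
            return self.store[key]
        self.misses += 1
        obj = self.compile_fn(preprocessed_source, flags)
        self.store[key] = obj
        return obj
```

    The second build of an unchanged tree then becomes mostly hashing and lookups, which is why a 23-second "compile" out of a cache isn't much of a compile at all.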