23 Second Kernel Compiles
b-side.org writes "As a fine testament to how quickly Linux is absorbing technology formerly available only to the computing elite, an LKML member posted a 23-second kernel compile time to the list this morning as a result of building a 16-way NUMA cluster. The NUMA technology comes gifted from IBM and SGI. Just one year ago, a Sequent NUMA-Q would have cost you about USD $100,000. These days, you can probably build a 16-way Xeon (4X 4-way SMP) system off of eBay for two grand, and the NUMA comes free of charge!"
OK... I'm NOT about to start the proverbial deluge of people wanting to know about a Beowulf cluster of these things. But what I will ask is this: if it can do that for a kernel, I wonder how long it will take to do Mozilla, or XFree? It'd be interesting to see those stats.
JoeLinux
No way. Just a no-CPU, no-memory case and
motherboard costs $500. More like $2000
to $3000 for an old quad.
23 seconds is impressive. I, personally, have seen a 42-second compile time of a 2.2 series kernel on an Intel 8-way system (8GB RAM, 8 550MHz PIII Xeons w/ 1MB L2). It was in the 1 minute range with a 2.4 kernel.
Definitely the most impressive x86 system I have ever seen.
...who wondered, "I didn't know that Clive Cussler had gotten into cluster design?"
Remember, this was in 1996. Now, how much did we progress in the last five-six years? :)
"Ten years from now, they could do it in a few seconds." -- The Racketeer of the Hellfire Club, 1993, Phrack 42
Maybe this is a silly question.. but why would you want to compile a kernel in 23 seconds? I mean, ok, it's cool and everything, but is there some hidden application of this that I'm not seeing? Or are people really devoting hardcore time to this just because they can?
6 years ago, a kernel compile for me took about 3 hours. These days, it takes less than 3 minutes, which is more than fast enough for my needs. So, you can push it down to 23 seconds with a few thousand $ - what's the point? Someone help me out here!
But, does anyone know how NUMA compares with, say, a beowulf cluster? Does NUMA allow you to 'bind' multiple systems into one, so that I wouldn't need to rewrite my software? Did these guys use a stock GCC or something special? I know you would need to use MPI or similar for beowulf. Is NUMA as scalable as Beowulf in terms of building huge-ass machines (of course if I was going to expend the effort to do that, I might as well want to write custom software).
If this type of system would allow 'supercomputer' performance on regular programs... well... that would be really nice. How much work is it to set up?
autopr0n is like, down and stuff.
You can also get 23-second kernel compiles in software using Compilercache :-).
-- Ed Avis ed@membled.com
No, this is a case of free software and cheap hardware making technologies available now to many people for whom it wasn't available (i.e., outside the realm of affordability because it was only sold by expensive proprietary vendors) just a short time ago. That is a more significant change than the endless treadmill of Moore's Law to which we had become accustomed.
N4st0r, trixx0r h0bb1tz0rz! Th3y st0l3 0ur pr3c10uzz!
This may be good news, but what the heck! They should have at least included the .config that they used so that we can know which drivers/modules are compiled with it, or maybe this is just a bare-bones kernel, enough to run the basics. We need to know the complexity of the configuration before we can really say it's fast.
Take-off every
But where can I get a NUMA cluster for $80? Should I Ask Slashdot?
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
That's on a "16 way NUMA-Q, 700MHz P3's, 4Gb RAM".
I've been following that thread wondering if anybody would post better results with a dual Athlon or similar. Any lucky souls with really cool hardware who want to post benchmarks? In fact, it would be interesting to know how quickly the kernel compiles on a single P3/700, just to get an idea of how it scales.
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
... just so long as I never have to program on Sequent iron, and its insidious operating system, ever again. Of course, that was 10 years ago when Dynix, trying to be the best of both worlds, was really neither ATT nor Berkeley!
... hence, making it a REAL-world bargain.
... of course the other problem was indeed the expense, leaving us in situations where we had to program at odd hours and off-days because the client couldn't afford a "development" machine.
... two issues which I would hope are solved by a 16-way Xeon for $2K
healyourchurchwebsite.com - WWJB?
You can't build a NUMA cluster worth a crap without a fast, low-latency interconnect.
Sequent's NUMA Boxen use a flavor of SCI (Scalable Coherent Interface) which is integrated into the memory controller.
While you can use some sort of PCI-based interconnect, the results are just plain not worth it.
InfiniBand should be better, though I've heard the latency is too high to make this a marketable solution.
Keep your eyes on IBM's Summit chipset based systems. These are quads tied together with a "scalability port" and go up to 16-way. They should go to 32 or higher by 2003. That's when NUMA will -finally- be inevitable...
... with the advent of this new technology and raw speed, you should actually be able to use them!
[this is actually a joke]
chris at darkrock dot co dot uk
http colon slash slash www dot darkrock dot co dot uk
I went and looked at the email and noticed that the very first patch he mentions was from the woman who came and gave a talk to EUGLUG last spring. For one of our Demo Days we emailed IBM and asked them if they would send down someone to talk about IBM's Linux effort. We were kind of worried that they would send a marketing type in a suit who would tell us all about how much money they were going to spend, etc., etc. But we were very pleasantly surprised when they sent down a hardcore engineer who had been with Sequent until they were swallowed by IBM.
She did a pretty wide-ranging overview of the Linux projects currently in place at IBM, and then dived into the NUMA-Q stuff that she had been working on. The main gist of which is that Sequent had these 16-way fault-tolerant redundant servers that needed Linux because the number of applications that ran on the native OS was small and getting smaller. Turned out that even the SMP code that was in the current tree at the time did not quite do it. She had some fairly hairy debugging stories; apparently, sprinkling print statements through the code doesn't work too well when you're dealing with boot time on a multiprocessor system, because it causes the kernel to serialize when in normal circumstances it wouldn't...
I think the end result of all this progress with multiprocessor systems is that we'll be able to go down to the hardware store, buy more nodes, plug 'em into the bus, and compute away.
Yes, it can be applied to other stuff...and there are always CPU-bound problems that can use the speed. (I hope someone knowledgeable about current computer graphics technology can comment on what could be done with the machine under discussion.)
Woah, for a moment I could have sworn that was a Jon Katz article...
Yours Sincerely, Michael.
I'm afraid I have to disagree entirely, mate. I'm no neo-luddite by
any stretch of the imagination... I too spend a good proportion
[English is hilarious] of my time on the internet. I could, indeed, be
said to be leading a 'double life' by the unobservant. Notwithstanding
Mr Postman, Still and Talbot whom I cannot speak for; your assessment
of the intrinsically 'good' or 'evil' nature of technology is far from
clearly correct. Your allusion to the internet as a block of marble,
awaiting us to sculpt meaning into its form by using it is desperately
far from the truth. For example, books are not tabula rasa objects,
waiting for readers to impress upon them meaning and effect. When you
read the Bible, the Koran, Hermann Hesse or whoever, is it not the
author that steers your experience of reading?
There are many forms of media in our lives, and the internet is just
one of them
and accessible to many folks does not detract from its power to
affect, to sometimes enormous proportions, our culture, purpose and
ultimately 'mystical' existence.
Some instances of a particular medium may merely 'incline' us to
consider something... bland books, poor television programs or vacuous
theatre productions, but there are some instances that inspire us and
drive us to better our existence, or in some cases to cause blight,
cruelty and ruinous events.... Language, for example, has allowed our
brains to extend far beyond the confines of our bony skulls and
enables us to communicate and share ideas. If you've ever been in the
presence of a great speaker, you will know instantly that words are
not merely empty sounds awaiting our interpretations, but are weighted
vehicles for the influential dissemination of ideas, and are very
seldom 'neutral'.
So my point is this: the internet is not a neutral object awaiting our
interpretation, but is a rich and varied medium that can influence you
.. it can shock, scare, amuse, frighten.... and more things than you
can find in a thesaurus or dictionary, to boot; and it is NOT guided
or limited by your own mind...
But the reserve for this machine is $3850. The article says 16 way, which would be four of these four-way SMP systems. That also doesn't take into account the need for a high-bandwidth, low-latency interconnect (like SGI's NUMAlink). If you aren't expecting more than 16-way SMP, then you can probably get away with switched Gigabit Ethernet, as long as it is kept distinct from the normal network connectivity. If the Gigabit upgrade is still dual-port, then you are set. If not, you'll need another NIC, though you will only really need one for the whole cluster.
Maybe instead of two grand, the poster meant twenty grand. Either way, twenty grand is better than $100K!
NUMA means Non-Uniform Memory Access. It is a kind of computer where you have shared memory but you don't have the same access time for every processor to every memory position. Therefore, every processor will have access to all memory, but sometimes it will take longer (if the memory belongs to another processor) and sometimes shorter.
In a Beowulf cluster, you don't have shared memory (except inside a node, if you have an SMP machine) and you must use message passing to communicate (unless you are using DSM, Distributed Shared Memory, maybe with SCI).
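To make the "same memory, different access time" idea concrete, here is a toy sketch. The latency figures and the four-CPUs-per-node layout are made-up assumptions for illustration, not measurements of a NUMA-Q or any other real box.

```python
# Toy model of NUMA memory access cost. All latencies are made-up
# illustrative numbers, not measurements of any real machine.

LOCAL_NS = 100     # assumed cost of touching memory on the CPU's own node
REMOTE_NS = 400    # assumed cost of touching memory on another node
CPUS_PER_NODE = 4  # assumed layout: a 16-way box built from four 4-way quads


def node_of_cpu(cpu: int) -> int:
    """Map a CPU number to the node (quad) it sits on."""
    return cpu // CPUS_PER_NODE


def access_cost_ns(cpu: int, memory_node: int) -> int:
    """Every CPU can reach every node's memory; only the cost differs."""
    return LOCAL_NS if node_of_cpu(cpu) == memory_node else REMOTE_NS


def average_cost_ns(cpu: int, accessed_nodes: list[int]) -> float:
    """Average cost for one CPU touching memory on the given nodes."""
    total = sum(access_cost_ns(cpu, node) for node in accessed_nodes)
    return total / len(accessed_nodes)


if __name__ == "__main__":
    # CPU 5 lives on node 1. Keeping its data on node 1 is cheap ...
    print(average_cost_ns(5, [1, 1, 1, 1]))
    # ... while spreading the same accesses over all four nodes is not.
    print(average_cost_ns(5, [0, 1, 2, 3]))
```

On a Beowulf cluster there would be no `access_cost_ns` for remote memory at all: the program would have to send a message and wait for the answer.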
Reason? Not enough information as to the options.
Nevertheless:
I WANT ONE
www.eFax.com are spammers
http://samba.org/~anton/e10000/maketime_24
Wheeeeeee!
And seriously, I saw some comments about needing a really fast interconnect... check out Sun's Wildcat.
--NBVB
Very nice. :)
my 386-dx40 with weitek coprocessor and 8M ram,
at 1.36 bogomips, will compile a 2.2 kernel in only 27 hours 13 minutes.
what about the interconnect? the machine in question is /not/ a simple beowulf cluster, it's NUMA. Non Uniform Memory Architecture, which implies there is some form of memory architecture, and that the main difference between that architecture and that of a normal computer is that it is non-uniform.
Ie, the CPUs in this computer share a common address space and can reference any memory, just that some memory (eg located at another node) has a higher cost of access than other memory. (as opposed to a typical SMP system where all memory has an equal 'cost of access').
at the moment, under linux, this implies that there is special hardware in between those CPUs to provide the memory coherency - ie lots of bucks - cause there is no software means of providing that coherency (least not in linux today).
NB: normal linux SMP could run fine on a NUMA machine (from the memory management POV), but it would be slower because it would not take the non-uniform bit into account.
anyway... despite what the post says, this machine is /not/ a collection of cheap PCs connected via 100/1G ethernet or other high-speed packet interconnect.
I use Friend/Foe + mod-point modifiers as a karma/reputation system.
How well would Firewire, Fibre Channel, or SCA work as NUMA interconnects? How would these guys compare, pricewise and in effectiveness, to 1000baseT?
By computer graphics technology, do you mean a render-farm? That would be much better suited to a standard Beowulf cluster, because the interprocess communication is minimal. That is an example of an "embarrassingly parallel" computing problem. As for live graphics, an Onyx workstation doesn't benefit from CPU power so much as its Reality Engine/Infinite Reality graphics pipeline. When you need better graphics performance, you can utilize multiple graphics pipelines. Some of the Onyx 3000s can use (I think) as many as 16 different IR3s for improved graphics output, like in RealityCenters.
The point of this article isn't that kernel compilation is fast because it is usually CPU bound, and 16 CPUs alleviate that problem. In fact kernel compilation isn't strictly CPU bound... there are other performance limits too, especially disk performance. The significance of this article is that multithreaded kernel compiles benefit from the increased interprocess communication potential in NUMA architectures... performance would be much worse trying to spread that across a Beowulf cluster.
While rendering (not displaying) graphics or running basic number crunching does not benefit much from a NUMA setup as compared to a Beowulf-style setup, some complex equations do benefit... computing the first million digits of Pi would use interprocess communication, as would large-scale data mining applications. It's been a few years since I've been there, but I saw a huge cluster of Origin 2000s CC-NUMAed together with one Onyx 2, which handled displaying the results of the data mining. (An Onyx2 is basically an Origin 2000 with a graphics pipeline. An Onyx 3000 without any graphics bricks is an Origin 3000.)
That's good, but compilation is awfully parallelizable: You could (almost) just assign a computer to compile each individual source file; the total time would be the time to compile the slowest file plus link time. You could accomplish this with a shell script and a network file system -- what's the benefit of doing it with a shared-memory system like NUMA?
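A rough sketch of that timing argument, for anyone who wants to play with the numbers. The per-file compile times and the link time below are invented for illustration, and the model deliberately ignores header dependencies, disk and network file system traffic, which is exactly where the shared-memory question gets interesting.

```python
import heapq


def build_time(compile_secs, workers, link_secs):
    """Estimate wall-clock build time for independent compile jobs.

    Jobs are handed out longest-first to whichever worker frees up
    earliest (a simple greedy schedule), then one serial link step runs.
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    finish_times = [0.0] * workers  # when each worker becomes idle
    heapq.heapify(finish_times)
    for job in sorted(compile_secs, reverse=True):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + job)
    return max(finish_times) + link_secs


if __name__ == "__main__":
    # Hypothetical per-file compile times in seconds, plus a 4 s link.
    files = [12.0, 7.5, 3.0, 3.0, 1.5]
    for n in (1, 2, 5):
        print(n, "workers:", build_time(files, n, 4.0), "seconds")
```

With one worker per file, the result collapses to the slowest file plus the link, as the parent says; adding workers beyond that buys nothing.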
When it's upgrade time, I can start a compile, go to the pub, have a few beers, go back, see that the compile failed (because of , sparc32 and linux 2.4 don't seem to mix very well without some heavy tweaking), fix mistake, start again, and go back to the pub :)
Thanks to my slow sparcstations, I have a life! :)
Beowulf clusters are considered horizontal scaling, while NUMA clusters are considered vertical scaling. From my experience (SGI CC-NUMA) a NUMA cluster looks like a single computer system, with a single shared memory. (SGI systems are even Cache-Coherent, so that there is minimal performance loss if your data is in the RAM of the most distant CPU... a significant issue with 256 nodes). This means that you don't have to deal with MPI or other systems to deal with disparate memory of separate machines, so you can mostly code as if it were a single supercomputer. In fact, that is how SGI actually makes their supercomputers.
NUMA clusters tend to have scalability problems related to the cache coherence issue, so for a vertically scalable CC-NUMA box, you have to pay SGI the big bux. I haven't looked at IBM's NUMA technology, but if they own Sequent, then they probably have similar capability.
As for the work to set one up, SGI's 3000 line is fairly trivial, the hardware is designed to handle it, and I think you only need NUMAlink cables to scale beyond what fits inside a deskside case, if not a full height case. Now if you have a wall of these systems, you will need the NUMAlink (nee CrayLink) lovin'. As for an Intel-based system, I suspect it wouldn't be nearly as easy... unless your vendor provides the setup for you. On your own, you would need to futz with cabling the systems together, just like in a Beowulf. Except that your performance depends on finding a reasonably priced, high-bandwidth, low-latency interconnect. Gigabit Ethernet won't scale very far, so going past 16 CPUs would be very unpleasant. If you expend the effort, you will end up with a cluster of machines that behave very much like a "supercomputer" though. Good luck!
Probably because compilercache is a way to AVOID compiling. . .
These machines were designed to run huge databases. The IO scalability isn't there in Linux yet as it was in Dynix/PTX, and there hasn't been so much work on the scalability of Linux as there has on PTX, but it'll get there pretty soon ;-)
So yes, it will apply to other stuff, though maybe not as well as it could, quite yet.
Don't forget to add about $10,000 per quad for the custom interconnect, which is what really makes this machine work
Let me get this right... you buy a used computer, and then go straight to the manufacturer for replacement parts??? (Surely you know 'accessories' are one of their higher-margin profit centers!)
Still... if you're in the Seattle, WA area, stop in the Boeing Surplus Retail Store. I was there last week, and they had a bucket of what looked like 80-pin 2.1GB Compaq hot-plug drives. They were just sitting there next to a cash register like candy would be at a supermarket. I don't remember the price, but I want to say they wanted ~$5 each for 'em.
They were also selling an Indigo ($50), lots of PCs (mostly old Dell OptiPlex models, $20 - $300, Pentium MMXs to P-IIs), and even a Barco data-grade projector ($2500). Fun place to go and blow half a day poking around.
"...America's great minds of today, teaching America's great minds of tomorrow. Poor bastards." -- A Beautiful Min
It would be quite easy to configure the kernel build process for several machines to each make a .o file, and then send them to a master machine for a final link.
If the end goal of this was to just compile kernels fast, you would be right. These numbers were posted because everybody knows how fast their kernel compiles. If someone posts TPC-H or SpecWeb99 numbers, no one notices. Normal people can say, "Wow, that is fast!"
Well, a 23-second kernel compile is impressive and all, but the most important question I would have of such a machine is: How fast can it run Quake-3?
If it can do 1280 * 1024 * 32bpp at 300 frames/second, then I'm getting one.
Schwab
Editor, A1-AAA AmeriCaptions
You can buy the bits needed to build your own NUMA hardware system out of separate boxes relatively[1] cheaply. The speed depends on how you manage the memory and I/O. You'd need Linux to support it as a coherent whole though and I'm not sure that it does.
[1] For large values of relatively.
Government of the people, by corporate executives, for corporate profits.
Ooh, for 5 years or so.
They've since been bought by EMC and closed down but they had it working *and* scaling to 32 CPUs and on the market. 64 CPU systems were well on the way but I don't recall if they finished them.
Government of the people, by corporate executives, for corporate profits.
Female Prison Rape in NY
1. 2.4.18, and I also told you what patches I was using (though some of them won't be published until next week).
2. OK, I just posted the config file. http://lse.sourceforge.net/numa/config.mem
3. I did five kernel compiles in a row (though I omitted to mention that).
Hi Martin!
--
Daniel
Life's a bitch but somebody's gotta do it.
In fact, somebody please go and mod up all oxfletch's posts on this article, he's Martin Bligh, the guy who did this.
Life's a bitch but somebody's gotta do it.
Actually, it still takes quite some time. From the FAQ:
kernel    compilercache     time
default   no                5m28.860s
default   yes, but empty    6m56.490s
default   yes, filled       2m51.900s
modified  yes, filled       3m58.730s
It looks pretty safe, especially if you've been burned by a badly written Makefile. The FAQ explains the difference between compilercache and makefiles pretty well.
As a bonus, compilercache ignores changes made to the comments (since it uses the preprocessed source (with the comments stripped) to calculate an MD5 checksum), so you can fix/add comments without worrying about an extra long compile.
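For the curious, the idea can be sketched in a few lines. This is not compilercache's actual code: the regex is a crude stand-in for running the real C preprocessor (it does not understand string literals), and `ToyCompilerCache` and `fake_compile` are hypothetical names made up for this example.

```python
import hashlib
import re

# Matches /* ... */ and // ... comments. A toy stand-in for the real
# C preprocessor; it will be fooled by comment markers inside strings.
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def cache_key(source: str) -> str:
    """MD5 of the source with each comment replaced by a single space."""
    stripped = _COMMENT.sub(" ", source)
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()


class ToyCompilerCache:
    """Only calls the (expensive) compiler when the key is new."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._objects = {}  # cache key -> previously built output
        self.misses = 0     # how many times we really compiled

    def build(self, source: str):
        key = cache_key(source)
        if key not in self._objects:
            self.misses += 1
            self._objects[key] = self._compile(source)
        return self._objects[key]


def fake_compile(source: str) -> str:
    """Hypothetical compiler: just labels the output with its key."""
    return "obj-" + cache_key(source)


if __name__ == "__main__":
    cache = ToyCompilerCache(fake_compile)
    cache.build("int answer(void) { return 42; } /* old note */\n")
    cache.build("int answer(void) { return 42; } /* reworded note */\n")
    cache.build("int answer(void) { return 43; } /* old note */\n")
    print("real compiles:", cache.misses)
```

Rewording a comment reuses the cached result, while changing the code itself forces a real compile.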
I probably won't use it, though -- my projects tend to require only one file to be recompiled per build.
HIV Crosses Species Barrier... into Muppets