23 Second Kernel Compiles
b-side.org writes "As a fine testament to how quickly Linux is absorbing technology formerly available only to the computing elite, an LKML member posted a 23 second kernel compile time to the list this morning as a result of building a 16-way NUMA cluster. The NUMA technology comes gifted from IBM and SGI. Just one year ago, a Sequent NUMA-Q would have cost you about USD $100,000. These days, you can probably build a 16-way Xeon (4X 4-way SMP) system off of eBay for two grand, and the NUMA comes free of charge!"
No way. Just a no-CPU, no-memory case and motherboard costs $500. More like $2000 to $3000 for an old quad.
23 seconds is impressive. I, personally, have seen a 42 second compile time of a 2.2 series kernel on an Intel 8-way system (8GB RAM, 8 550MHz PIII Xeons w/ 1MB L2). It was in the 1 minute range with a 2.4 kernel.
Definitely the most impressive x86 system I have ever seen.
Is there some hidden application of this that I'm not seeing?
How about doing other stuff really fast?
3D modeling. 3D simulations. Even extensive Photoshop editing with complex filters can benefit from this kind of raw speed.
It wouldn't be a catchy headline, though, if it said "render a scene of a house in 40 seconds--oh, and here are the details of the scene so you can be impressed if you understand 3D rendering..."
There are hundreds of applications for this, many of which we don't do every day on our desktop simply because they take too much juice to be useful. With ever-faster computers, we will continue to envision and benefit from these new possibilities.
Donate background CPU time to fight cancer.
NUMA is rather different than Beowulf.
NUMA is just a strategy used for making computers that are too large for normal SMP techniques. I read a few good papers on sgi.com a couple of years ago that explained it in detail, and the NUMA link in the article had a quick definition. NUMA systems run one incarnation of one OS throughout the whole cluster, and usually imply some kind of crazy-ass bandwidth running between different machines. I don't think you could actually create a NUMA cluster of separate quad Xeon boxes, and it would probably be ungodly slow if you tried.
There probably isn't any difference for kernel compiles between the two, but NUMA clusters don't require any reworking of normal multithreaded programs to utilize the cluster and can be commanded as one coherent entity (make -j 32, wheee).
I went and looked at the email and noticed that the very first patch he mentions was from the woman who came and gave a talk to EUGLUG last spring. For one of our Demo Days we emailed IBM and asked them if they would send down someone to talk about IBM's Linux effort. We were kind of worried that they would send a marketing type in a suit who would tell us all about how much money they were going to spend, etc., etc. But we were very pleasantly surprised when they sent down a hardcore engineer who had been with Sequent until they were swallowed by IBM.
She did a pretty broad-ranging overview of the Linux projects currently in place at IBM, and then dived into the NUMA-Q stuff that she had been working on. The main gist was that Sequent had these 16-way fault-tolerant redundant servers that needed Linux because the number of applications that ran on the native OS was small and getting smaller. It turned out that even the SMP code in the current tree at the time did not quite do it. She had some fairly hairy debugging stories: apparently sprinkling print statements through the code doesn't work too well when you're dealing with boot time on a multiprocessor system, because it causes the kernel to serialize when in normal circumstances it wouldn't...
I think the end result of all this progress with multiprocessor systems is that we'll be able to go down to the hardware store, buy more nodes, plug 'em into the bus, and compute away.
That's not the point - kernel compilation (or the compilation of any large project like KDE or XFree[1]) is a fairly common benchmark for general performance. It chews up disk access and memory and works the CPU quite nicely.
[1] Large is, of course, a relative thing. Also, some compilers (notably Borland) are incredibly efficient at compiling (sometimes through manipulating the language specs so the programmer lines things up so the compiler can just go through the source once and compile as it goes).
Still, benchmarks are suspect to begin with, and kernel compile time is a decent loose benchmark. What was that quote from Linus about the latest release being so good he benchmarked an infinite loop at just under 6 seconds? :)
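Using a build as a loose benchmark comes down to measuring the wall-clock time of one command. Here is a minimal Python sketch of that idea. The helper name `time_command` is made up for illustration, and the `make -j 32 bzImage` mentioned in the comment is only an example of what one might time in a kernel tree, not the command behind the 23-second result.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return (exit_code, wall_clock_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.perf_counter() - start


# Harmless stand-in so the sketch runs anywhere. In a kernel source tree one
# might instead pass something like ["make", "-j", "32", "bzImage"].
exit_code, seconds = time_command([sys.executable, "-c", "pass"])
print(f"exit code {exit_code}, {seconds:.2f} s wall clock")
```

A single run like this mixes in disk cache state and whatever else the machine is doing, which is part of why such numbers are only a loose benchmark.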
--
Evan
"$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
But the reserve for this machine is $3850. The article says 16-way, which would be four of these four-way SMP systems. That also doesn't take into account the need for a high-bandwidth, low-latency interconnect (like SGI's NUMAlink). If you aren't expecting more than 16-way SMP, then you can probably get away with switched Gigabit Ethernet, as long as it is kept distinct from the normal network connectivity. If the Gigabit upgrade is still dual port, then you are set. If not, you'll need another NIC, though you will only really need one for the whole cluster.
Maybe instead of two grand, the poster meant twenty grand. Either way, $20K is better than $100K!
NUMA means Non-Uniform Memory Access. It is a kind of computer where you have shared memory but you don't have the same access time from every processor to every memory location. Every processor has access to all memory, but an access takes longer when the memory belongs to another processor.
In a Beowulf cluster, you don't have shared memory (except inside a node, if the node is an SMP machine) and you must use message passing to communicate (unless you are using DSM, Distributed Shared Memory, maybe with SCI).
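The difference in programming model can be sketched in a few lines of Python. This is only an illustration: threads in one process stand in for processors, and a `queue.Queue` stands in for the network. A real Beowulf job would use something like MPI or PVM instead. The function names are made up for the example.

```python
import queue
import threading

DATA = list(range(1, 101))
CHUNKS = [DATA[i::4] for i in range(4)]  # four workers, interleaved slices


def run_workers(worker, chunks):
    """Start one thread per chunk and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def shared_memory_sum(chunks):
    """Shared-memory style: every worker updates one shared total."""
    total = [0]
    lock = threading.Lock()

    def worker(chunk):
        for value in chunk:
            with lock:  # all workers see and modify the same memory
                total[0] += value

    run_workers(worker, chunks)
    return total[0]


def message_passing_sum(chunks):
    """Message-passing style: private partial sums, explicit messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the only data that leaves the worker

    run_workers(worker, chunks)
    return sum(mailbox.get() for _ in chunks)


print(shared_memory_sum(CHUNKS), message_passing_sum(CHUNKS))
```

Both functions compute the same answer. The point is that the second one has to be written around explicit sends and receives, which is the rework a shared-memory machine lets you skip.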
Reason? Not enough information as to the options.
Nevertheless:
I WANT ONE
www.eFax.com are spammers
http://samba.org/~anton/e10000/maketime_24
Wheeeeeee!
And seriously, I saw some comments about needing a really fast interconnect... check out Sun's Wildcat.
--NBVB
I'll give you the benefit of the doubt and assume you're not just a trolling karma whore here. The answer is as obvious as always: faster is always better. If there's nothing which needed that speed, it's because it wasn't previously viable and nothing got written with it in mind. If every computer were this fast, it would make compiling huge projects viable on small workstations.
And here's a great example - where I work there are three things that reduce productivity because of technical bottlenecks:
Of these, the major bottleneck is compiling. If it takes 30 seconds just to recompile a single source file and link everything, you end up writing and debugging code in "batch" fashion rather than in tiny increments. And it's 30 seconds where you're not doing anything except waiting for the results.
If I had a machine like this on my desk, I'd probably get twice as much work done.
brilliant that you're the only person who caught that, and no one has modded you up. ;) just further proof that the /. editors (not to mention the average reader/moderator) don't actually know anything at all about technology, and aren't in any way interested in fact checking submissions.
Beowulf clusters are considered horizontal scaling, while NUMA clusters are considered vertical scaling. From my experience (SGI CC-NUMA) a NUMA cluster looks like a single computer system, with a single shared memory. (SGI systems are even cache-coherent, so that there is minimal performance loss if your data is in the RAM of the most distant CPU, a significant issue with 256 nodes.) This means that you don't have to deal with MPI or other systems to deal with disparate memory of separate machines, so you can mostly code as if it were a single supercomputer. In fact, that is how SGI actually makes their supercomputers.
NUMA clusters tend to have scalability problems related to the cache coherence issue, so for a vertically scalable CC-NUMA box, you have to pay SGI the big bux. I haven't looked at IBM's NUMA technology, but if they own Sequent, then they probably have similar capability.
As for the work to set one up, SGI's 3000 line is fairly trivial, the hardware is designed to handle it, and I think you only need NUMAlink cables to scale beyond what fits inside a deskside case, if not a full height case. Now if you have a wall of these systems, you will need the NUMAlink (née CrayLink) lovin'. As for an Intel based system, I suspect it wouldn't be nearly as easy... unless your vendor provides the setup for you. On your own, you would need to futz with cabling the systems together, just like in a Beowulf. Except that your performance depends on finding a reasonably priced, high-bandwidth, low-latency interconnect. Gigabit Ethernet won't scale very far, so going past 16 CPUs would be very unpleasant. If you expend the effort, you will end up with a cluster of machines that behave very much like a "supercomputer" though. Good luck!
Probably because compilercache is a way to AVOID compiling. . .
These machines were designed to run huge databases. The IO scalability isn't there in Linux yet as it was in Dynix/PTX, and there hasn't been so much work on the scalability of Linux as there has on PTX, but it'll get there pretty soon ;-)
So yes, it will apply to other stuff, though maybe not as well as it could, quite yet.
NUMA is somewhere in between clustering (e.g. Beowulf) and SMP.
:-)
On a normal desktop machine, you typically have one CPU and one set of main memory. The CPU is basically the only user of the memory (other than DMA from peripherals, etc.) so there's no problem.
SMP machines have multiple CPUs, but each process running on each CPU can still see every byte of the same main memory. This can be a bottleneck as you scale up, since larger and larger numbers of processors that can theoretically run in parallel are being serviced by the same, serial memory.
NUMA means that there are multiple sets of main memory -- typically one chunk of main memory for every processor. Despite the fact that memory is physically distributed, it still looks the same as one big set of centralized main memory -- that is, every processor sees the same (large) address space. Every processor can access every byte of memory. Of course, there is a performance penalty for accessing nonlocal memory, but NUMA machines typically have extremely fast interconnects to minimize this cost.
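The "performance penalty for accessing nonlocal memory" can be put into a back-of-the-envelope model: the average access time is a weighted mix of local and remote latency. Here is a small Python sketch. The latency figures are invented purely for illustration and are not measurements of any machine discussed here.

```python
def average_access_ns(local_ns, remote_ns, local_fraction):
    """Mean memory access time when local_fraction of accesses are local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, chosen only to make the arithmetic easy to follow.
LOCAL_NS = 100.0
REMOTE_NS = 500.0

for fraction in (1.0, 0.75, 0.25):
    mean = average_access_ns(LOCAL_NS, REMOTE_NS, fraction)
    print(f"{fraction:.0%} local accesses: {mean:.0f} ns on average")
```

Under this simple model, keeping a process's memory on its own node matters as much as the raw speed of the interconnect.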
Multicomputers and clusters such as Beowulf completely disconnect memory spaces from each other. That is, each processor has its own independent view of its own independent memory. The only way to share data across processors is by explicit message-passing.
I think the advantage of NUMA over Beowulf from the point of view of compiling a kernel is just that you can launch 32 parallel copies of gcc, and the cost of migrating those processes to processors is nearly 0. With Beowulf, you'd have to write a special version of 'make' that understood MPI or some other way of manually distributing processes to processors. Even with something like MOSIX, an OS that automatically migrates processes to remote nodes in a multicomputer for you, the cost of process migration is very high compared to the typically short lifetime of a typical instantiation of 'gcc', so it's not a big win. (MOSIX is basically control software on top of a Beowulf-style cluster, plus the kernel mods needed to do transparent process migration.)
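The migration argument can be made concrete with a toy model: if each short-lived job pays a fixed placement cost before it does any work, that cost eats directly into the parallel speedup. Here is a rough Python sketch. All of the numbers (320 jobs, 2-second compiles, 2-second migration) are hypothetical and chosen only to show the shape of the effect.

```python
import math


def makespan(jobs, cpus, job_seconds, placement_seconds):
    """Wall-clock time for equal-length jobs run in rounds across cpus,
    where every job pays placement_seconds before it starts working."""
    rounds = math.ceil(jobs / cpus)
    return rounds * (job_seconds + placement_seconds)


# Hypothetical workload: 320 compiler runs of 2 s each on 32 CPUs.
JOBS, CPUS, JOB_SECONDS = 320, 32, 2.0
serial_seconds = JOBS * JOB_SECONDS

for label, placement in (("near-free placement", 0.0), ("costly migration", 2.0)):
    elapsed = makespan(JOBS, CPUS, JOB_SECONDS, placement)
    print(f"{label}: {elapsed:.0f} s, speedup {serial_seconds / elapsed:.0f}x")
```

With these made-up figures, a migration cost equal to the job length halves the speedup from 32x to 16x. The longer each job runs, the less the same overhead matters.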
I hope this clarified the situation rather than further confusing you.
NUMA provides you with a single system image, so there's no need to rewrite your software. At the moment, we're working on default behaviours so that normal software works reasonably well. For something like a large database, we're providing APIs that will allow you to specify things about how processes interact with their memory and each other, allowing you to increase performance further.
;-)
The hardware looks a little like 4 x 4way SMP boxes, with a huge fat interconnect pipe slung down the back (10 - 20 Gbit/s IIRC). But there's all sorts of smart cache coherency / mem transparency hardware in there too, to make the whole machine look like a single normal machine.
Yes, I used stock GCC (redhat 6.2).
Re scalability, the largest machine you can build out of this stuff would be a 64 proc P3/900 with 64GB of RAM. SGI can build larger machines, but I think they're ia64 based, which has its own problems.
It's not that hard to set up, but not something you would build in your bedroom.
It doesn't scale too well yet. A single quad (fairly standard SMP 4 way) will do the dirty deed in about 47s.
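To put a number on "doesn't scale too well", the 47-second single-quad time can be set against the 23-second 16-way time from the story and fitted to an Amdahl-style model, T(n) = serial + parallel / n. This is a sketch under a big assumption: two data points determine the model exactly, so they cannot tell us whether it is the right one. The function name is invented for the example.

```python
def fit_amdahl(n1, t1, n2, t2):
    """Fit T(n) = serial + parallel / n through two (cpus, seconds) points.

    Returns (serial_seconds, parallel_cpu_seconds).
    """
    parallel = (t1 - t2) / (1.0 / n1 - 1.0 / n2)
    serial = t1 - parallel / n1
    return serial, parallel


# Figures from the thread: 47 s on one 4-way quad, 23 s on the 16-way box.
serial, parallel = fit_amdahl(4, 47.0, 16, 23.0)
one_cpu = serial + parallel

print(f"speedup from 4 to 16 CPUs: {47.0 / 23.0:.2f}x (ideal would be 4x)")
print(f"implied serial part: {serial:.0f} s of a {one_cpu:.0f} s one-CPU build")
print(f"implied ceiling on speedup over one CPU: {one_cpu / serial:.1f}x")
```

Read this way, roughly a tenth of the build behaves as if it were serial, which is enough to explain why quadrupling the CPU count only halves the time. Whether that "serial" part is really the link step, I/O, or interconnect overhead is not something these two numbers can say.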
Don't forget to add about $10,000 per quad for the custom interconnect, which is what really makes this machine work.
Seriously, I wonder how long it takes to boot.
They do take a good bit of time to boot. In fact, it makes me much more careful when booting new kernels on them because if I screw up, I've got to wait 5 minutes, or so, for it to boot again! But, they do boot a lot faster when you run them as a single quad and turn off the detection of other quads.
Stop hiding behind the AC and people might pay you attention.
You appear to be equating clocking and processor speed like apples and oranges. They aren't. If we consider all of the technological advances in the modern ia32 processors vs. their earlier brethren, then the comparison is even less favorable... Modern processors should be exceptionally faster. But they aren't. There are two primary reasons for this: increasing inefficiency, and increasing complexity. Present day programmers are far less motivated to write "good code" because they live in the fallacy that the processor is fast enough to run anything. ("No one will notice the difference.") In fact, they are generally incapable of generating efficient code as they've never been taught to think that way. These people will surely spend an eternity in computing hell writing programs in BASIC on 1MHz machines that have a 32x16 character console displayed on a 12" BW TV. (Any resemblance to the movie Brazil is unintentional.)
Complexity breeds more inefficiency. As the saying goes, "Make it work. Then make it fast."
As for my comments about Sparc... Unless Sun is deploying reverse engineered alien technologies, the core of the processor (i.e. how it adds and subtracts) hasn't changed much. It's the clock speed (how fast it runs through the "add" procedure) that makes it faster. The efficient adaptation of code to the native 64-bit environment also helps a lot. (Better code + better compiler yields faster execution.)