If you add support in the hardware for a single processor to handle the unrolled loop at once, you end up with a vector processor. This is how the compiler for Crays works.
That is a *very* far cry from firing up a thread to do the work. I would say that there are very few kinds of loops that a compiler could prove were thread safe that would be worth firing off a thread, and all the overhead that takes, to perform the task. And before y'all jump all over that, I can think of a few degenerate cases that would work also and even a few that aren't so degenerate. At this point though, I think multi-threading is best left up to the programmer, not the compiler, if you want maximum performance.
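To make the "thread safe loop" point a bit more concrete, here is a minimal sketch (mine, not anything a real compiler does). The names `scale`, `running_sum` and `split_in_two` are made up for illustration. It splits a loop into two halves as if each half went to its own thread: that is harmless when the iterations are independent, and gives the wrong answer when each iteration needs the result of the previous one.

```python
def scale(xs, k):
    # Each iteration touches only its own element: iterations are independent.
    return [k * x for x in xs]

def running_sum(xs):
    # Each iteration needs the total from the previous one
    # (a loop-carried dependence).
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

def split_in_two(fn, xs):
    # Hypothetical naive "parallelizer": run fn on each half separately,
    # as if each half went to its own thread, then concatenate the results.
    mid = len(xs) // 2
    return fn(xs[:mid]) + fn(xs[mid:])

data = [1, 2, 3, 4, 5, 6]
independent_ok = split_in_two(lambda part: scale(part, 2), data) == scale(data, 2)
dependent_ok = split_in_two(running_sum, data) == running_sum(data)
print(independent_ok, dependent_ok)
```

The compiler has to prove the first property before it can split anything, and even then the work per chunk has to be big enough to pay for starting the thread.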
How hard this is depends on the architecture of the OS. In Irix it is hard. I don't know how hard it was for Sun to get to where they are. I have heard, however, that Sun's dynamic subtract ability doesn't work under certain conditions. The trick is, with NUMA, you are migrating both the memory and the processes. We can migrate processes very easily. Getting all the memory off the node is a lot harder. Sun doesn't have to deal with this until they get NUMA.
As for the E10k followon being NUMA, I'll believe it when I see it. Sun has previously said that they don't think NUMA is a good thing. Also, we heard they were working on an architecture called COMA (bad name, but it stands for Cache Only Memory Architecture) where you treat all of memory as a cache and let cache lines move wherever. If they *are* really doing NUMA on 1000 processors, they are going to find that the jump from 64 to 1024 is more like scaling a cliff than a gentle slope... Besides, Sun's NUMA stuff is vapor - ours runs *now*:)
Err, hmm. Someone care to point me at something that'll explain segment registers? I am apparently a lot more clueless on this subject than I thought:)
That won't help on Intel. There's no way to make it so you can't jump to the stack unless the stack is not executable. The way a lot of these buffer overflows work is that they trick the program into corrupting its own stack and rewriting the return address so that instead of returning to where it should (some address in the code segment) it "returns" to some arbitrary address on the stack that contains the code to execute. Just because the segments aren't run right together in memory (actually, they aren't this way now) doesn't prevent a jump from one to the other as long as they share the same address space. Unless you are suggesting that they *not* share the same address space. That would require (if you could even make it work) a *very* large number of context switches which would hurt performance a *lot* more than just doing the right thing which is to not read data from an untrusted source without checking it to make sure it's sane.
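As a rough illustration of "checking it to make sure it's sane", here is a small sketch of a length check in front of a fixed-size buffer. Python won't let you smash a stack, so this only shows the shape of the check, not the exploit. `BUF_SIZE` and `copy_checked` are names I made up for the example.

```python
BUF_SIZE = 16  # hypothetical fixed buffer size, chosen only for the example

def copy_checked(buf: bytearray, data: bytes) -> int:
    """Copy data into buf only if it fits; return the number of bytes copied."""
    if len(data) > len(buf):
        # Refuse oversized input instead of writing past the end of the buffer.
        raise ValueError("input longer than buffer")
    buf[:len(data)] = data  # same-length slice assignment, buf keeps its size
    return len(data)

buf = bytearray(BUF_SIZE)
copied = copy_checked(buf, b"hello")

try:
    copy_checked(buf, b"A" * 64)  # untrusted input that is too long
    rejected = False
except ValueError:
    rejected = True
```

In C the equivalent is checking the length before the copy (or using a bounded copy) rather than trusting whatever the other end sent.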
You *can* flip off the execute bit on processors that have one; Intel x86, however, doesn't. On Intel, read==execute.
...was the Thinking Machines cube. Ok, granted it's a *big* cube, but when you have 65,536 processors, you suck a lot of space. So Cobalt was beat by *two* people to the cube idea (the Thinking Machines cube even has lots of nice blinky lights:) - Thinking Machines (now owned by Gores Technology - www.gores.com) in 1985 and NeXT (now owned by Apple) in the late 80's. I think Cobalt isn't exactly going to win on originality here...
For a picture of the Thinking Machines cube, see http://forest.bigw.org/cmu_presentation/sld007.htm (yeah, I know it came from PowerPoint, but...)
Yeah - the Origin 2000 deskside models (and also Onyx II deskside) are cubes, but they are just a *bit* bigger than Apple's:) They are about 2 feet on each side. Maybe even a bit bigger than that.
This, sadly, is not correct. You can hot swap I/O on Origin 3000 but *not* C-bricks without a reboot. Yeah, it'd be really nice if that weren't true, but the software to subtract running processors from the system is *REALLY*REALLY* hard. It may get done at some point (ie, the point where there's enough customer demand and engineer resources to make it feasible), though, since hard==fun for kernel programmers :) Brick addition (ie, adding a C-brick to a running system without rebooting the system) is easier and will probably happen before brick subtraction. Neither is a committed project (*sniff* :)
However, what you *can* do is shut down a single partition of a multi-partition system without affecting the rest of it. Also, we have some stuff in Irix to throw away pages that have double bit errors in them without panic'ing the system in some cases. More RAS features are planned to be added over time.
There is a very important difference between ccNUMA machines like the O2000/O3000/Sequent and something like Ethernet (of any flavor). That is that the communication doesn't have to go through the I/O channel, which means *zero* syscalls to do communication between threads - it's all memory-to-memory. That means much lower latency on things like MPI jobs. Cards like Myrinet are trying to accomplish the same thing (direct user->user transfers) but as far as I know, you can only push data down them, not pull data from the other side.
Actually, you can get better than just TCP/IP communication - the partitions can share memory (actually, that's how the IP communication works - one partition pulls data over via this thing called a "block transfer engine" - think hardware bcopy). Starting around the November timeframe, we'll have an MPI implementation that works over this interconnect without the additional latency of IP.
As someone else pointed out, this system will eventually be released with Merc^H^H^H^HItanium chips and will be running Linux. We are working on getting a SysV shared memory driver to share memory cross partition for it.
(I speak for myself, not for SGI, though I am in the group that does the partitioning software)
Actually, to see how you can slice up one of these machines after it's installed, check out "system partitioning". (http://www.sgi.com/origin/3000/partitioning.html) The software is very cool (then again, I was one of the people who wrote it so I'm a bit biased:) - you can break up a system into multiple smaller systems that can communicate over the memory interconnect. You can even reconfigure without rebooting unaffected hardware. You can power off an entire partition, service it, and bring it back online without taking down the rest of the system. I believe Marketing is calling this "cluster-in-a-box"...
Actually, Linux has some limited support for Origin 2000, which is this thing's predecessor (very similar as far as architecture, but O3000 is a lot faster:) In order to scale Linux/BSD/any other OS to a machine this large, you end up with two things that would be rather unsavory to the Linux philosophy: lots and lots of locks and trading performance on single processor systems for better performance on large systems. You end up with a lot of locks because you have a lot of threads all trying to do the same or very similar things at once. Something like "lock_kernel" just doesn't work. Linux tries to minimize locks to minimize locking confusion. Then, when you've added all these locks, they really don't help at all on small systems so you lose a bit of performance there. Also not something that is good in the Linux world.
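Here is a user-space analogy for "lots and lots of locks" versus one big lock, as a sketch only. This is not kernel code, and `StripedCounter` is a made-up class. Instead of one global lock around the whole table (the `lock_kernel` style), each stripe of the table gets its own lock, so threads working on different stripes don't wait on each other. Python's interpreter lock means you won't see a real speedup here; the point is just the structure.

```python
import threading

class StripedCounter:
    """Counters split across stripes, each protected by its own lock."""

    def __init__(self, n_stripes=8):
        self._locks = [threading.Lock() for _ in range(n_stripes)]
        self._counts = [dict() for _ in range(n_stripes)]

    def _stripe(self, key):
        # Pick the stripe (and therefore the lock) that owns this key.
        return hash(key) % len(self._locks)

    def increment(self, key):
        i = self._stripe(key)
        with self._locks[i]:  # only this stripe is locked, not the whole table
            self._counts[i][key] = self._counts[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._counts[i].get(key, 0)

def worker(counter, keys, reps):
    for _ in range(reps):
        for k in keys:
            counter.increment(k)

counter = StripedCounter()
keys = ["inode", "page", "proc", "sock"]
threads = [threading.Thread(target=worker, args=(counter, keys, 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
totals = {k: counter.get(k) for k in keys}
```

The cost is visible even in the sketch: more locks to create, more code deciding which lock to take, and none of it buys anything when there is only one thread.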
That said, if you check out the "partitioning" feature of the machine, you can break it up into multiple smaller machines that can share memory (well, not yet but we're working on it:) and communicate via direct memory->memory copies.
When you get Windoze 2000 to boot on 512 processors all at once in a single image, then we'll talk about multiprocessing support. I do that regularly with Irix. Cray, with Unicos/mk can boot 1800 processors. Windows does.... Hmm... 8? Or do you consider supercomputing a "trivial" task? If not, be prepared to explain why *none* of the machines listed at top500.org are running Windows.
Yeah - that's what I get for posting before I'm fully awake:) There is no Origin 2010. There is no backside cache on Origin. Spinlock support has always been in Origin - that's what the "cc" part of "ccNUMA" gets you.... But the best is where he claims that 1998 was 5 years ago:)
With the way Unix does load averages, a load of 20 means that you are running 20 processors full bore and the other 172 processors are just sitting around running idle. The machine would be really responsive up to a load of 191. Then things would start slowing down, but the same would happen on any single processor system when the load goes over 1.
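The arithmetic, as a quick sketch: this treats the load average as simply "number of runnable processes", which is a simplification, and assumes the 192 processors implied by the 20 + 172 above. The function names are made up for the example.

```python
def idle_processors(n_cpus, load):
    # Processors with nothing to run when 'load' processes are runnable.
    return max(n_cpus - load, 0)

def waiting_processes(n_cpus, load):
    # Runnable processes that have to wait for a processor.
    return max(load - n_cpus, 0)

print(idle_processors(192, 20))    # the 192p machine at a load of 20
print(waiting_processes(192, 20))  # nothing is queued yet
print(waiting_processes(1, 2))     # a single processor box at a load of 2
```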
What you noticed about the machine plodding through code (but a lot of it at once) is because the point of Origin is for parallel programs. The fastest way to run a program with n threads is on an n processor machine (neglecting other activity). So, if you aren't writing parallel code, you get basically no benefit from a parallel system. The way to speed up single thread code in the supercomputing world is to run on a vector based system.
It does support multi-module. I've tried it on a two module system and it boots fine. I'll probably try it on a 4 module system at some point in the near future. It does have some *cough* issues, though, mostly in the error handling for errors that come from Hub (the ASIC that runs the ccNUMA protocol between node boards). In fact there are some errors where I can't even figure out where the interrupts are going... But overall, it seems to be fairly stable and supports at least the Base I/O card that runs most of the ports on O2k. Not sure how well it performs on Onyx2 (I'd imagine it won't support Infinite Reality gfx, but it should support booting since the hardware is the same as Origin 2000).
Uh, ok - I have never even *heard* of the Origin 2010, nor can I find *any* mention of it anywhere in the Irix kernel code, which I look at/modify on a daily basis. Spinlock support has been available, AFAIK, since the first day Origin shipped - it's kinda necessary to support an OS boot. And I've never heard of a backside cache on any of our systems. Origin, BTW, is not SMP - it's NUMA. As for getting one several years before it was publicly released, I wasn't at SGI then, so I don't know what the Origin beta program was like, but the current machines in beta *certainly* weren't ready to ship 2 years ago in beta.
And no, I haven't rendered a 3D game myself (though I have written a few simple Open GL programs). However, I *do* work with Origin hardware and software *every* day since my group is responsible for kernel support on it.
I will ask some people Monday, though, about the 2010 and I'll post more then (if anyone's actually heard of it...)
For an example of a *really* huge cluster of these machines, check out the ASCI project at LANL. They have a cluster of 48 nodes of Origin 2000 where each node has 128 processors. It's not Beowulf, but it's similar in that you can run an MPI job across the cluster.
What a moronic troll. The *processor* is what's slow, not the OS. And BTW - Linux runs on this piece of hardware so you just contradicted yourself. Titanic was generated on Alpha processors that happened to be running Linux. Much as I hate to admit it, they probably could have gone just as fast had they been Alphas running NT (except that they probably would have had to reinstall the OS a bunch of times:) That had way more to do with the speed than the OS since rendering is just a huge number crunching application. The machine in this article doesn't even have a graphics card. My SGI, BTW, *can* get 80+ fps. So there.
Lastly, compare Infinite Reality 3 graphics with *any* PC graphics card. The PC crap won't even come *close* to catching it.
Yes - go to oss.sgi.com and look at "Ports". The mips64 port of Linux runs on the Origin 200/2000. In fact, I was running it on an 8p Origin yesterday.
That would be a neat feat since the Origin 2000 was first released in 1996 which, last time I checked, was less than 5 years ago..... As for woefully archaic, find me another machine that scales to 512 processors in a single system image. Have you ever *really* used an SGI Origin 2000???
I don't know what we sell these things for used, but new, a 16p Origin 2000 lists (though sales generally discounts at various percentages that I don't claim to understand) for around $80,000 the last time I checked. It is fairly powerful - you get a *lot* of memory bandwidth in these things (800 megabytes/second between nodes where a node is 2 processors, an I/O channel and a bunch o' RAM). I would imagine that it would easily take out a PIII 800, dual or otherwise. Also, remember that Irix scales *very* well (though this is running an *old* version) - 16p is very very close to 16 times as fast as 1p.
1) It's not *that* heavy. Any decent floor should hold it. I've wheeled these things around quite a bit. 2) Yeah - these suckers kick out a lot of heat, especially since they have almost a gig of RAM in there (that generates a lot of heat - you should see the one with 196 gigs of RAM that I work with - the entire room changes weather when you power up :) 3) You can easily get the kind of power this thing draws from a standard residential panel if you have some spare circuits - it doesn't need three-phase power. 4) It has a 30 amp twistlock plug on it, though I'm not sure how much it actually draws. I think it uses 240 volts, though. Think stove or dryer power.
It's not your head. The place you really hear the difference between an mp3 and a CD is in the cymbals. Listen to the upper ranges in the high crash cymbals. If you hear a static'y rushing noise, it's an mp3 (or a cheap microphone doing the recording:) If you hear them crisply, it's a CD.
No - it doesn't. The data bus is the width of one processor cache line, not the width of a word. The idea is that you fetch one entire cache line over the bus at a time because if you reference word 1 you are likely to want word 2 sometime in the near future as well. Then, if you walk through an array of 1000 elements with two words per cache line, you'll only need 500 memory references, and memory references are *very* expensive operations. A 64 bit processor would have a 128 bit or 256 bit data bus. The Origin 2000 uses a 128 *byte* data bus between the L2 cache and memory. I don't remember the cache line size of the L1 cache. The PowerPC is, AFAIK, a 32 bit processor at its core. Altivec extensions, however, allow the G4 to do operations on more than one word in the processor at once.
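The 1000-elements-in-500-references figure is just a ceiling division, sketched below. The helper name `line_fetches` is made up. The second call assumes 8-byte elements and an array that starts on a line boundary, neither of which is stated above, so that a 128 byte line holds 16 of them.

```python
def line_fetches(n_elements, elements_per_line):
    # Number of cache line fetches needed to walk the array once,
    # assuming the array starts on a cache line boundary.
    return -(-n_elements // elements_per_line)  # ceiling division

print(line_fetches(1000, 2))   # two words per line, as in the example above
print(line_fetches(1000, 16))  # assumed: 8-byte elements in a 128 byte line
```

The wider the line, the fewer trips to memory for a sequential walk, which is the whole reason the bus is wider than a word.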
And I definitely agree - O3k is "real hardware" :)
Wait till next year. It's going to cost a lot though :) We're working on the clustering software for it now. CrayLink is now called NUMAlink, BTW.