I think you are mistaken. There are no vector processors in Blue Gene/L. BG/L is composed entirely of IBM PowerPC 440 cores. Each node (out of roughly 65,000) has 2 PPC scalar cores. In most cases one runs the application, and the other handles the message passing. Blue Gene/P uses 4-core nodes, but is otherwise similar.
The Cell processor has many scalar cores, which can be programmed to behave a little bit like a vector processor, though they really aren't one. Cell processors are not currently used in Blue Gene systems, though I suspect they might be someday. Modified Cells are used in some supercomputers, but they are still the third focus in IBM's supercomputing arsenal.
So here's what you're missing: vector processors aren't about doing a lot of math. True, they do that very well, but that's not where they excel. Where vector processors really shine is memory bandwidth. Vector operations let you take that 4 terabytes/second of memory bandwidth and actually use it, not spend it all flushing out cache lines. On this machine, a single load instruction can fetch 2KB of data.
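To put the 2KB figure in perspective, here is a rough sketch of the arithmetic. The 256-element vector register, the 64-bit element width, and the 64-byte cache line are assumptions picked for illustration (none of them are stated above), and the function names are hypothetical.

```python
# Rough arithmetic only; register and cache-line sizes are assumptions.
VECTOR_ELEMENTS = 256    # assumed elements per vector register
BYTES_PER_ELEMENT = 8    # assumed 64-bit (double-precision) words
CACHE_LINE_BYTES = 64    # assumed cache line size on a commodity CPU


def bytes_per_vector_load(elements=VECTOR_ELEMENTS, width=BYTES_PER_ELEMENT):
    """Bytes moved by one full-length vector load."""
    return elements * width


def cache_lines_equivalent(nbytes, line=CACHE_LINE_BYTES):
    """Number of cache-line fills needed to move the same amount of data."""
    return -(-nbytes // line)  # ceiling division


if __name__ == "__main__":
    n = bytes_per_vector_load()
    print(n, "bytes per vector load")
    print(cache_lines_equivalent(n), "cache-line fills for the same data")
```

Under those assumptions, one vector load moves as much data as a few dozen cache-line fills, which is the gap the paragraph above is describing.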
Cell (and many GPUs or future whatever) have the ability to do a LOT of math, but they do it on a very tiny amount of data. These vector CPUs have dozens or hundreds of memory controllers. That's a lot of RAM chips, and a lot of copper wires between memory and the CPU. I'm sure the motherboard is dozens of layers thick for all the traces. In short, you can't get all that capability on a commodity processor, because the commodity market won't pay for all the memory bandwidth, which is expensive to engineer, and expensive per-unit.
Unless/until there is a major change in the cost of memory and memory bandwidth, there will still be a need for special-purpose supercomputing processors. This is not to say that Cray and NEC will continue to be the people to make such a thing. I'm sure IBM could come up with a Cell-derived processor with a TON of real memory bandwidth, or maybe Nvidia could. The question is: will they want to? I figure there's a lot more money to be made selling videogame consoles than there is at the high end of the supercomputer market.
Supercomputer OSes, like all Unix OSes, have gained functionality over the years. In the supercomputing world, data storage and I/O performance are almost as important as the computational job. Thus a lot of attention is paid to filesystems. SUPER-UX is pretty stripped down, but getting better. Cray UNICOS is no longer based on that System V stuff; it is either based on IRIX on the X1, or is Linux-derived on the XT4. The compute nodes are pretty stripped down, but the login nodes are pretty much off-the-shelf Linux. I think it's a logical direction for supercomputer makers.
NEC keeps plugging along at the CPUs. Those things are incredible. They are, however, VERY VERY vector dependent. They do not run scalar code very fast at all. This many pipe-sets and ALUs per pipe requires very long vector lengths to vectorize well. What I'd love to see is one of these vector CPUs tied very closely to a high-speed scalar CPU like an Opteron, Xeon, or POWER6. For a while it sounded like IBM was getting back into the vector game with a POWER6-derived processor, but that seems to have faded away.
The 16-way SMP also helps. On the older NEC machines, you only ever wanted a reservation on 7 out of 8 CPUs, so the last one could run the OS. Now you can probably grab 15 out of 16 without paying the penalty.
The really exciting thing about this system is the new interconnect. NEC had been limping along with the previous generation of IXS interconnect since the SX-5. 8GB/s is fast, but not when each node has so much horsepower.
The one thing missing from this literature, of course, is any talk of price. For the last few generations, NEC has had the highest per-node performance of any supercomputer, but each node cost so much that other solutions were equally attractive, if not more so. Anyone have some clues as to the pricing?
64 cores do not make a supercomputer. There are database servers with more cores than this, and there have been for years. Technical computer, sure. Maybe even high-performance computer. Definitely NOT supercomputer. 8 systems, that's what? 4GB of RAM? There are laptops that can hold that much memory.
If you went to a technical conference like, for example, Supercomputing '07, you would get laughed off the floor calling that a supercomputer. Supercomputer is a changing definition, but I don't think I'd call anything a supercomputer that didn't have at least 1TF of peak double-precision performance, and at least 200GB of RAM.
Who both have Windows server boxes, as well as Linux and high-end Unix and mainframe products. These guys are trying to be an end-to-end provider of servers, software, storage, and middleware. Sun was cutting itself out of a big chunk of business. Champion whatever cause you like, but there aren't very many data centers without Windows servers in them. From the point of view of the Sun salesman, you can either have a machine room full of Suns and HPs, or you can have a machine room full of Suns and Suns alone. What they want to avoid is the machine room full of HP and HP alone.
No, that's not true at all. There are thousands of high-performance computing clusters out there. In my current job, I work on the top 100 sort, but in the past, I spent a lot of time working on the sort that don't make the Top 500 list. They're generally 1-2 racks of rackmount sleds running some sort of MPI application, or bulk processing of video feeds, render frames, whatever. I remember one oil industry machine room that had over a hundred clusters, each with 40-80 processors. I suppose they could have connected them all together and called it a single, big cluster, but they didn't run any large jobs, so why kill your reliability by making the cluster bigger than it needs to be?
This Microwulf thing is neat for a university project. However, clusters are a lot of work to administer, even if you come up with some really good tools. I don't know why everyone would want one next to their desks. That's a lot of work. They should build a smaller number of really good machines, and pay fools like me to administer them. Don't ask scientists to be cluster admins.
Do you really think the blogosphere plays by different rules than old media? What distinguishes bloggers from other journalists? The fundamental differences, as near as I can figure out, are only in degree, and in number. Instead of 5-10 major media conglomerates, spread out over a couple dozen network and cable television outlets plus some radio, newspapers, and magazines, you have a few hundred significant bloggers for any particular topic area. So you have a lot more voices out there, but does that really mean that those voices use radically better, or at least different, means of getting out an idea? I'm not convinced that they do. What makes you believe that blogging can "move beyond" the methods journalists have used in the past?
As for Mr. Asher, I don't mean to attack him, except to say that he uses the same tricks and tools that modern journalism demands. The title of the article, and the subsequent Slashdot posting's title, convey the sense that the greater scientific body's empirical data on climate has changed completely. These titles, along with the tone of Mr. Asher's language, invoke a conspiracy theory on the part of NASA and the mainstream media, while the actual text of the blog entry seems to suggest that the impact of the data adjustment really is fairly minor. In short, the title is sensational to lure in readers, while the actual story is much more mundane. This is a regular occurrence in much of television and newspaper journalism, of which I am also not a huge fan. I hold Mr. Asher to the same standards I hold other journalistic media, and feel underserved by the veracity and competence of both.
Michael Asher isn't likely to listen to your suggestions. Past blogs of his tend very heavily towards the sensationalist. Who can blame him? Amongst the cacophony of blogs in the technical readership space, how else do you get your voice heard?
However, if you're very careful about reading his "story", you'll find that he even admits that this has very little effect on actual global-warming theory (most of which has to do with air temperature over the oceans). He just wants the sensational headline, so that buzzword farms like Slashdot will link to his blog, and he'll get a few more regular readers. Journalism 101.
Too true, too true; which is why we're seeing more and more memory technologies use packet-serialized protocols over differential-pair traces (Rambus, FB-DIMMs, etc.). It's a good time to be a SerDes designer.
Since the article is mostly about what is or isn't a marketable product, rather than what is technologically feasible, I still think that heat and power usage are a more relevant limiting factor. People are more interested in the processors that fit in a quarter-inch-thick cell phone and run off a 3-ounce battery than they are in 200-watt POWER6 processors.
The author of the CNET article has confused this definition, but you can hardly blame a journalist for making this mistake. It's one of the most common mistakes out there, and it builds on all the other people who have screwed this one up.
Nonetheless, the point of the article is the same, regardless of your understanding of Moore's law. Do most consumers really need to keep ahead of the technology curve? Most processor reviews on (tomshardware, anandtech, techreport, insert-your-favorite-site-here) run CPUs through a battery of recent video games, synthetic benchmarks, and high-def video encoding jobs. Why? Because ALL of the processors they test perform very well for the tasks most computer users need them to do: surf the web and check email, with a little Word or Excel upon occasion.
I think it's a pretty good point. I already have a MUCH faster processor in the console under my TV than in the computer on my desk. If Sony/Microsoft/Apple win the war for entertainment center convergence, why does the stand-alone computer really need to get any faster?
These two things are coincident, but not correlated.
Multicore is becoming popular because instruction-level parallelism has approached a practical limit, not because of capacitance. Basically processor designers are getting all these "free" transistors, and don't know what to do with them except add cores.
Processor speed limits come from heat generated by switching, combined with heat from leakage current. Improved transistor density actually reduces the heat generated by switching, but that has to be balanced against the increased leakage current from a smaller lithography process.
The result, however, is the same. Instead of "When is a processor 'fast enough'?" you get "When does a processor have 'enough' transistors?" At what point will consumers stop paying for more and more advanced processor designs? Moore's law doesn't come for free. Building current-generation chip fabs costs billions of dollars. At some point they'll be able to make transistors small enough that you can get 25 trillion on a chip that costs $7, and that will be good enough for everyone. (Or will it?) At what point do the economics of making the next generation of integrated circuit not pay off? It's an interesting question, though somewhat weirdly phrased.
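A common first-order model makes this trade-off concrete: switching power goes roughly as activity x capacitance x voltage squared x frequency, while leakage adds a term that does not depend on clock speed. The sketch below uses that textbook approximation. Every number in it is made up for illustration and does not describe any real process, and the function names are hypothetical.

```python
def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """First-order switching power in watts: alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def leakage_power(voltage_v, leakage_a):
    """Static power in watts, burned regardless of clock speed: V * I_leak."""
    return voltage_v * leakage_a


def total_power(activity, capacitance_f, voltage_v, freq_hz, leakage_a):
    """Sum of the switching and leakage terms."""
    return (dynamic_power(activity, capacitance_f, voltage_v, freq_hz)
            + leakage_power(voltage_v, leakage_a))


if __name__ == "__main__":
    # Hypothetical "older" and "shrunk" processes at the same clock speed.
    # The shrink lowers capacitance and voltage but raises leakage current.
    processes = {
        "old": dict(activity=0.1, capacitance_f=200e-9, voltage_v=1.4,
                    freq_hz=3e9, leakage_a=5.0),
        "new": dict(activity=0.1, capacitance_f=140e-9, voltage_v=1.2,
                    freq_hz=3e9, leakage_a=20.0),
    }
    for name, p in processes.items():
        dyn = dynamic_power(p["activity"], p["capacitance_f"],
                            p["voltage_v"], p["freq_hz"])
        leak = leakage_power(p["voltage_v"], p["leakage_a"])
        print(f"{name}: switching {dyn:.1f} W, leakage {leak:.1f} W")
```

With these invented inputs the shrink cuts the switching term while the leakage term grows, which is the balance described above.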
The point of this posting is to ask the question: Are Western computer shoppers content with slower technology than what is cutting edge?
The answer to that is a resounding: sometimes. For the last several years, most users haven't had a really good reason to upgrade their computers, because a 1 GHz computer can do everything most people need. Gamers, of course, are an exception, but they make up a very small part of the marketplace. In big-box retailers you'll often find that shoppers are looking for features more than they are concerned about performance. One of those features is often greater and greater portability. People would rather have their computer (or smartphone, more often than not) with them more of the time than have a faster computer. Some day a new killer app may come along that forces people to upgrade to faster components, and Windows Vista is doing that in a very artificial way, but the trend is still present. Ubiquity trumps performance for most buyers.
No, the fact that every modern OS is bloated and inefficient comes from having to support legacy software, on a large range of existing hardware, and for a wide variety of users/customers all with their own specialized demands.
When I think about the tiny and elegant microkernels and nanokernels that run some embedded devices and supercomputer nodes, the reason they can be written as such is that the developer is able to say: "No compilers or linkers will be run on this OS. VIM is forbidden. There are only eight I/O devices and we aren't adding any more. No, you may not connect this to your keyfob or graphics tablet."
AT&T tried to rewrite Unix more elegantly. It's called Plan 9, and no one runs it because you can't install your favorite X-based web browser. Microsoft tried to write a microkernel VMS; it was called Windows NT 3.1. It sucked so bad they had to rewrite it as a monolithic kernel in the next versions.
Users want clean and fast, but aren't willing to give up very much in exchange for that. They want to install every piece of old hardware they've ever purchased. Users want to run old software. BeOS is still out there. You can run it if you want to. Are you willing to give up Firefox, and your Ogg player, and OpenOffice? Are you willing to give up your wireless mouse? Unless you see a large number of Slashdot dorks running clean, elegant OSes, don't expect normal users to make that sacrifice.
This assumes that the kernel is a single common software project.
It isn't. A few filesystem developers might have to make changes to elevator or allocator code, but most developers of XXXXfs don't really need to make changes outside of that directory. Developers writing a driver for the XXXX model SCSI controller don't really need to interact with the people mucking with ALSA, or GART, or whatever.
The kernel might be contained in a single source repository, but it's really a few hundred, mostly-independent software projects.
That's not actually true. Everyone in the industry knows who the customers are. Some of the specifics are secret, and what codes are actually run on the systems is not known. Really huge machines do have to be manufactured, and pushed through QA before being sent to a customer, even a classified customer. The on-site installation guys have security clearance, but no HPC company has all of the manufacturing and QA people cleared. If the secret sites were buying fastest-in-the-world sized computers, people would know about it.
I'll point out that the Blue Gene at Livermore and the XT3 at Sandia are classified computers. No one is saying what the machines do, but they're happy to admit that they exist, and sometimes even what they paid for them.
That really depends on the application. Top500 uses a benchmark called Linpack to determine how fast a machine is. Linpack is basically designed to show off big systems. It can squeeze almost every bit of performance out of a system. It offers lots of instruction-level parallelism for superscalar and VLIW designs to exploit. It makes great use of FP multiply-accumulate. The program and the data easily fit inside processor caches. The amount of message passing between nodes is tiny. It's the perfect benchmark for cheap systems. More CPUs = better. Even cheaply designed clusters can get 80% of peak running Linpack.
Other applications are lucky to get 15% of peak on the most balanced systems, and often around 1% of peak on cheap clusters.
There is an effort by DARPA and NSF to rank machines with much more meaningful benchmarks. This effort, called HPCC (http://icl.cs.utk.edu/hpcc/), tests a wider range of system performance characteristics. Even these benchmarks are easy to misinterpret, but they're better than simply relying on Linpack. The real answer, of course, is that you have to figure out what you're going to use the machine for, and use the actual application as a benchmark, so in some regards it's pretty meaningless to rank machines at all. It does provide bragging rights, and helps with marketing.
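To see what those percentages mean in practice, here is a minimal sketch that converts a fraction of peak into sustained performance. The 10 TFLOPS peak machine is hypothetical, the fractions are the rough ones quoted above, and the function names are made up for illustration.

```python
def fraction_of_peak(sustained_flops, peak_flops):
    """Sustained performance as a fraction of theoretical peak."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return sustained_flops / peak_flops


def sustained_from_fraction(peak_flops, fraction):
    """Sustained FLOPS implied by a given fraction of peak."""
    return peak_flops * fraction


if __name__ == "__main__":
    peak = 10e12  # hypothetical machine with 10 TFLOPS theoretical peak
    cases = (
        ("Linpack on a cheap cluster", 0.80),
        ("real code on a balanced system", 0.15),
        ("real code on a cheap cluster", 0.01),
    )
    for label, frac in cases:
        tflops = sustained_from_fraction(peak, frac) / 1e12
        print(f"{label}: {tflops:.1f} TFLOPS sustained")
```

The same nominal peak can therefore mean anything from most of the advertised number down to a small sliver of it, depending on the application.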
Bechtolsheim compares 131,000 cores of Blue Gene/L to 131,000 cores of Constellation, with the Sun system offering 3 times the performance.
This is hardly a fair comparison. IBM installed a 131,000-core BG/L 2 years ago, and it's been running customer code for more than a year. The Sun system won't be built until late this year, and probably won't be running real customer code until this time next year. Furthermore, the BG/L machine is designed with a low-power node, assuming that a larger number of cores would be used. In IBM's older BG/L design, there are 2048 cores in a rack. Sun is packing 768 Opteron cores in a rack. So a per-square-meter measure gives IBM's 3-year-old design only a 20% disadvantage to Sun's not-yet-released machine.
All of that is moot, of course, as theoretical peak performance is a crappy way to measure supercomputer performance anyway. The Opteron is a great processor, and InfiniBand is a decent, though not remarkable, interconnect. I'd be a little concerned, were I to buy the Sun solution, that the InfiniBand bandwidth is being shared by 16 processor cores. That's quite a bit less interconnect performance per processor than IBM's Blue Gene, POWER5, Cray's XT, or SGI's Altix. There's certainly plenty of memory on each of these Constellation blades. That said, there is a list of applications that perform very well on Blue Gene, and Sun has a lot of ground to make up in terms of OS, software, and establishing a relationship with HPC customers.
I'll add: This is not an unusual arrangement for existing InfiniBand networks. The distinction is that they have all of these 864 switch modules in a cabinet, and the wiring is probably traces on a backplane, rather than flexible cables. This improves the reliability, reduces the cost, and makes it a whole lot easier to install. That may sound silly, but you're talking about 10,000 cables, each with connectors on both ends. Even buying in bulk, that's a lot of money in cables.
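The per-rack arithmetic can be sketched from the figures quoted here. This assumes the 3x claim is per core (since the comparison is at equal core counts) and that both racks occupy the same floor area. Neither assumption is confirmed, and rack footprint in particular would shift the result.

```python
BGL_CORES_PER_RACK = 2048      # from the post
SUN_CORES_PER_RACK = 768       # from the post
SUN_PER_CORE_ADVANTAGE = 3.0   # assumed: the quoted 3x at equal core counts


def relative_rack_performance(cores, per_core_perf):
    """Rack performance in arbitrary units (one BG/L core = 1.0)."""
    return cores * per_core_perf


if __name__ == "__main__":
    bgl = relative_rack_performance(BGL_CORES_PER_RACK, 1.0)
    sun = relative_rack_performance(SUN_CORES_PER_RACK, SUN_PER_CORE_ADVANTAGE)
    print(f"BG/L rack relative to Sun rack: {bgl / sun:.2f}")
```

Under those assumptions the older BG/L rack lands within roughly 10-20% of the Sun rack; the exact gap depends on the real rack footprints.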
Actually 3456 is 12 X 12 X 12 X 2. It's not actually a 3456 port router, it's a fat tree of 24-port router modules. Each rank 1 & 2 module has 12 ports down and 12 ports up. The rank 3 modules have 12 ports down, and 12 sidelink ports to one another. Thus you end up with a 3456 port, rank 3.5 fat tree all in one box.
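The port and module counts can be checked with a few lines of arithmetic. This sketch simply follows the description above (24-port modules split 12 down and 12 up, with sidelinks at rank 3) and assumes every rank is fully populated; the function names are hypothetical.

```python
PORTS_PER_MODULE = 24
DOWN_PORTS = PORTS_PER_MODULE // 2   # 12 down; the other 12 go up or sideways
RANKS = 3


def external_ports(down=DOWN_PORTS):
    """12 x 12 x 12 x 2, as in the post."""
    return down * down * down * 2


def modules_per_rank(total_ports, down=DOWN_PORTS):
    """Each rank terminates every link from the rank below, 12 per module."""
    return total_ports // down


if __name__ == "__main__":
    ports = external_ports()
    per_rank = modules_per_rank(ports)
    print(ports, "external ports")
    print(per_rank, "modules per rank,", per_rank * RANKS, "modules total")
```

With three ranks of 288 modules each, the total comes to 864, which matches the switch-module count mentioned earlier for this cabinet.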
Actually Sun's offering is a 4-socket Opteron blade system.
Though the T2 does offer dramatically improved floating-point performance compared to the T1, I've seen no evidence that it outperforms quad-core Opterons on the sort of HPC workloads needed for these sorts of systems. The T2 is designed for transaction processing, which has very different needs.
The main reason I don't think Sun would sell T2 for supercomputing is that they haven't mentioned it at all, and it's the sort of thing they would want to talk up when marketing the T2.
Yeah, there are a lot of systems out there with theoretical limits of several petaflops. Cray, IBM, NEC, even SGI have systems that could theoretically hit several PFlops if you had enough money.
I'm waiting to see a customer actually purchase one, and for it to be installed, and actually running customer code, before I really care.
Well, perhaps it's more accurate to say that IBM is not selling BG to everyone and their mother, because a limited number of applications port well to the machine. If you happen to have a big need to run one of those applications, they'll sell one to you. But if you don't run one of those apps, they'll probably try to sell you p570s instead. It must be nice to be in those IBM salespeople's shoes, and have so many options to sell you.
True, but the capabilities of UFS don't really exceed HFS+. ZFS, on the other hand, is a thoroughly modern filesystem. UFS is just as rusty as HFS+.
Who both have windows server boxes, as well as linux and high-end unix and mainframe products. These guys are trying to be an end-to-end provider of servers, software, storage, and middleware. Sun was cutting themselves out of a big chunk of business. Champion whatever cause you like, but there aren't very many data centers without windows servers in them. From the point of view of the sun salesman, you can either have a machine room full of suns and HPs, or you can have a machine room full of suns and suns alone. What they want to avoid, is the machine room full of HP and HP alone.
No that's not true at all. There are thousands of high-performance computing clusters out there. In my current job, I work on the top 100 sort, but in the past, I spent a lot of time working on the sort that don't make the top 500 list. They're generally 1-2 racks of rackmount sleds running some sort of MPI application, or bulk processing of video feeds, render frames, whatever. I remember one oil industry machine room that had over a hundred clusters, each with 40-80 processors. I suppose they could have connected them all together and called it a single, big cluster, but they didn't run any large jobs, so why kill your reliability by making the cluster bigger than it needs to be?
This microwolf thing is neat for a university project. However, clusters are a lot of work to administer, even if you come up with some really good tools. I don't know why everyone would want one next to their desks. That's a lot of work. They should build a smaller number of really good machines, and pay fools like me to administer them. Don't ask scientists to be cluster admins.
Do you really think the blogosphere plays by different rules than old media? What distinguishes bloggers from other journalists? The fundamental differences, so near as I can figure out, are only in degree, and in number. Instead of 5-10 major media conglomerates, spread out over a couple dozen network and cable television outlets plus some radio, newspaper, and magazines, you have a few hundred significant bloggers for any particular topic area. So you have a lot more voices out there, but does that really mean that those voices use radically better, or at least different, means of getting out an idea? I'm not convinced that they do. What makes you believe that blogging can "move beyond" the methods journalists have used in the past?
As for Mr. Asher, I don't mean to attack him, except to say that he uses the same tricks and tools that modern journalism demands. The title of the article, and the subsequent slashdot posting's title, convey the sense that the greater scientific body's empirical data on climate has changed completely. These titles, along with the tone of Mr. Asher's language invokes a conspiracy theory on the part of nasa and the mainsteam media, while the actual text of the blog entry seems to suggest that the impact of the data adjustment really is fairly minor. In short, the title is sensational to lure in readers, while the actual story is much more mundane. This is a regular occurence for much of the television and newspaper journalism, of which I am also not a huge fan. I hold Mr. Asher up to the same standards I hold other journalistic mediums, and feel underserved by the veracity and competence of both.
Michael Asher isn't likely to listen to your suggestions. Past blogs of his tend very heavily towards the sensationalist. Who can blame him. Amongst the cacophony of blogs in the technical readership space, how else do you get your voice heard.
However, if you're very careful about reading his "story", you'll find that he even admits that this has very little effect on actual global-warming theory. (most of which has to do with air temperature over the oceans) He just wants the sensational headline, so that buzzword farms like slashdot will link to his blog, and he'll get a few more regular readers. Journalism101.
Too true, too true; which is why we're seeing more and more memory technologies use packet-serialized protocols over differential-pair traces. (rambus, fb-dimms, etc) It's a good time to be a serdes designer.
Since the article is mostly about what is or isn't a marketable product, rather than what is technologically feasible, I still think that heat and power usage is a more relevant limiting factor. People are more interested in the processors that fit in a quarter-inch thick cell phone, and run off a 3 ounce battery than they are in 200 watt power6 processors.
The author of the cnet article has confused this definition, but you can hardly blame a journalist from making this mistake. It's one of the most common mistakes out there, and it builds on all the other people who have screwed this one up.
Nonetheless, the point of the articly is the same, regardless of your understanding of the moore's law. Do most consumers really need to keep ahead of the technology curve? Most processor reviews on (tomshardware, anandtech, techreport, insert-your-favorite-site-here) run CPUs through a battery of recent video games, synthetic benchmarks, and high-def video encoding jobs. Why? Because ALL of the processors they test, perform very well for the tasks most computer users need them to do: surf the web, and check email, with a little word or excel upon occasion.
I think it's a pretty good point. I already have a MUCH faster processor in the console under my TV than in the computer on my desk. If sony/microsoft/apple win the war for entertainment center convergence, why does the stand-alone-computer really need to get any faster?
These two things are coincident, but not correlated.
multicore is becoming popular because instruction-level-parallelism has approached a practical limit, not capacitance. Basically processor designers are getting all these "free" transistors, and don't know what to do with them except add cores.
Processor speed limits come from heat generated by switching speeds, combined with heat from leakage current. Improved transistor density actually improves the heat generated by switching, but has to be balanced against the increased leakage current from a smaller lithography process.
The result, however, is the same. Instead of "When is a processor "fast enough"?" you get "When does a processor have "enough" transistors?" At what point will consumers stop paying for more and more advanced processor designs? Moores law doesn't come for free. Building current generation chip fabs costs billions of dollars. At some point they'll be able to make transistors small enough that you can get 25trillion on a chip that costs $7 and that will be good enough for everyone. (Or will it?) At what point do the economics of making the next generation of integrated circuit not pay off? It's an interesting question, though somewhat wierdly phrased.
The point of this posting is to ask the question: Are western computer shoppers content with slower technology than what is cutting edge?
The answer to that is resoudly: sometimes. For the last several years, most users haven't had a really good reason to upgrade their computers, because a 1 ghz computer can do everything most people need. Gamers, of course, are an exception, but they make up a very small part of the marketplace. In big-box retailers you'll often find that shoppers are looking for features more than they are concerned about performance. One of those features is often greater and greater portability. People would rather have their computer (or smartphone, more often than not) with them more of the time, rather than have a faster computer. Some day a new killer-app may come along that forces people to upgrade to faster components, and windows vista is doing that in a very artificial way, but the trend is still present. Ubiquity trumps performance, for most buyers.
No, the fact that every modern OS is bloated and inefficient comes from having to support legacy software, on a large range of existing hardware, and for a wide variety of users/customers all with their own specialized demands.
When I think about the tiny and elegant microkernels and nanokernels that run some embedded devices and supercomputer nodes, the reason they can be written as such, is that the developer is able to say: "No compilers or linkers will be run on this OS. VIM is forbidden. There are only eight I/O devices and we aren't adding any more. No you may not connect this to your keyfob or graphics tablet."
AT&T tried to rewrite Unix more elegantly. It's called Plan 9, and no one runs it because you can't install your favorite X-based web browser. Microsoft tried to write a microkernel VMS; it was called Windows NT 3.1. It sucked so bad they had to rewrite it as a monolithic kernel in the next versions.
Users want clean and fast, but aren't willing to give up very much in exchange for that. They want to install every piece of old hardware they've ever purchased. Users want to run old software. BeOS is still out there. You can run it if you want to. Are you willing to give up Firefox, and your Ogg player, and OpenOffice? Are you willing to give up your wireless mouse? Unless you see a large number of Slashdot dorks running clean, elegant OSes, don't expect normal users to make that sacrifice.
This assumes that the kernel is a single common software project.
It isn't. A few filesystem developers might have to make changes to elevator or allocator code, but most developers of XXXXfs don't really need to make changes outside of that directory. Developers writing a driver for the XXXX model SCSI controller don't really need to interact with the people mucking with ALSA, or GART, or whatever.
The kernel might be contained in a single source repository, but it's really a few hundred, mostly-independent software projects.
That's not actually true.
Everyone in the industry knows who the customers are. Some of the specifics are secret, and what codes are actually run on the systems is not known. Really huge machines do have to be manufactured, and pushed through Q/A before being sent to a customer, even a classified customer. The on-site installation guys have security clearance, but no HPC company has all of the manufacturing and Q/A people cleared. If the secret-sites were buying fastest-in-the-world sized computers, people would know about it.
I'll point out that the Blue Gene at Livermore and the XT3 at Sandia are classified computers. No one is saying what the machines do, but they're happy to admit that they exist, and sometimes even what they paid for them.
That really depends on the application.
Top500 uses a benchmark called Linpack to determine how fast a machine is. Linpack is basically designed to show off big systems. It can squeeze almost every bit of performance out of a system. It offers lots of instruction-level parallelism for superscalar and VLIW designs to exploit. It makes great use of FP multiply-accumulate. The program and the data easily fit inside processor caches. The amount of message passing between nodes is tiny. It's the perfect benchmark for cheap systems: more CPUs = better. Even cheaply designed clusters can get 80% of peak running Linpack.
Other applications are lucky to get 15% of peak on the most balanced systems, and often around 1% of peak on cheap clusters.
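To make those percentages concrete, here is a small sketch that turns "fraction of peak" into sustained throughput. The 100 TFLOPS peak is a made-up round number, not any real machine; the 80%, 15% and 1% figures are the rough ones quoted above.

```python
def sustained_tflops(peak_tflops, percent_of_peak):
    """Sustained throughput when an application reaches a given percent of peak."""
    return peak_tflops * percent_of_peak / 100


PEAK_TFLOPS = 100.0  # hypothetical machine, illustrative only

linpack_on_cluster = sustained_tflops(PEAK_TFLOPS, 80)  # Linpack on a cheap cluster
app_on_balanced = sustained_tflops(PEAK_TFLOPS, 15)     # real app, well-balanced system
app_on_cluster = sustained_tflops(PEAK_TFLOPS, 1)       # real app, cheap cluster

print(f"Linpack, cheap cluster:    {linpack_on_cluster:.0f} TFLOPS")
print(f"Real app, balanced system: {app_on_balanced:.0f} TFLOPS")
print(f"Real app, cheap cluster:   {app_on_cluster:.0f} TFLOPS")
print(f"Balanced vs. cheap on the real app: {app_on_balanced / app_on_cluster:.0f}x")
```

Two machines with the same peak (and nearly the same Linpack score) can differ by an order of magnitude on the code you actually care about, which is why the Linpack ranking says so little.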
There is an effort by DARPA and NSF to rank machines with much more meaningful benchmarks. This effort, called HPCC, tests a wider range of system performance characteristics. Even these benchmarks are easy to misinterpret, but they're better than simply relying on Linpack. (http://icl.cs.utk.edu/hpcc/) The real answer, of course, is that you have to figure out what you're going to use the machine for, and use the actual application as a benchmark, so in some regards it's pretty meaningless to rank machines at all. It does provide bragging rights, and helps with marketing.
Bechtolsheim compares 131,000 cores of Blue Gene/L to 131,000 cores of Constellation, with the Sun system offering 3 times the performance.
This is hardly a fair comparison. IBM installed a 131,000-core BG/L 2 years ago, and it's been running customer code for more than a year. The Sun system won't be built until late this year, and probably won't be running real customer code until this time next year. Furthermore, the BG/L machine is designed with a low-power node, assuming that a larger number of cores would be used. In IBM's older BG/L design, there are 2048 cores in a rack. Sun is packing 768 Opteron cores in a rack. So a per-square-meter measure gives IBM's 3-year-old design only a 20% disadvantage to Sun's not-yet-released machine.
All of that is moot, of course, as theoretical peak performance is a crappy way to measure supercomputer performance anyway. The Opteron is a great processor, and InfiniBand is a decent, though not remarkable, interconnect. I'd be a little concerned, were I to buy the Sun solution, that the InfiniBand bandwidth is being shared by 16 processor cores. That's quite a bit less interconnect performance per processor than IBM's Blue Gene or POWER5, Cray's XT, or SGI's Altix. There's certainly plenty of memory on each of these Constellation blades. That said, there is a list of applications that perform very well on Blue Gene, and Sun has a lot of ground to make up in terms of OS, software, and establishing a relationship with the HPC customers.
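For anyone who wants to check the per-rack arithmetic, here is a back-of-the-envelope sketch. It takes the figures in this post at face value (2048 vs. 768 cores per rack, 3x performance at equal core count) and assumes, purely for simplicity, that the two racks occupy the same floor space.

```python
BGL_CORES_PER_RACK = 2048    # figure quoted in the post
SUN_CORES_PER_RACK = 768     # figure quoted in the post
SUN_PER_CORE_SPEEDUP = 3.0   # "3 times the performance" at equal core count


def per_rack_performance(cores_per_rack, per_core_perf):
    """Relative peak performance of one rack, in units of one BG/L core."""
    return cores_per_rack * per_core_perf


bgl_rack = per_rack_performance(BGL_CORES_PER_RACK, 1.0)
sun_rack = per_rack_performance(SUN_CORES_PER_RACK, SUN_PER_CORE_SPEEDUP)

# Fraction by which a BG/L rack trails a Sun rack (equal footprint assumed).
bgl_deficit = 1 - bgl_rack / sun_rack

print(f"BG/L rack: {bgl_rack:.0f}, Sun rack: {sun_rack:.0f}")
print(f"BG/L per-rack disadvantage: {bgl_deficit:.1%}")
```

With these round inputs the gap comes out nearer 11% than 20%. The exact inputs behind the 20% figure aren't spelled out (rack footprints, for one, may differ), but either way a 3x headline shrinks to a small per-rack difference.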
It's nice to have more options, however.
I'll add:
This is not an unusual arrangement for existing InfiniBand networks. The distinction is that they have all of these 864 switch modules in a cabinet, and the wiring is probably traces on a backplane rather than flexible cables. This improves the reliability, reduces the cost, and makes it a whole lot easier to install. That may sound silly, but you're talking about 10,000 cables, each with a connector on each end. Even buying in bulk, that's a lot of money in cables.
Actually 3456 is 12 X 12 X 12 X 2. It's not actually a 3456 port router, it's a fat tree of 24-port router modules. Each rank 1 & 2 module has 12 ports down and 12 ports up. The rank 3 modules have 12 ports down, and 12 sidelink ports to one another. Thus you end up with a 3456 port, rank 3.5 fat tree all in one box.
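The port count falls out of the usual fat-tree arithmetic. This is a generic sketch, not anything taken from Sun's documentation: with switch modules of a given radix, each level splits its ports half down and half up, and the top level turns its "up" half back toward the hosts, which is where the final factor of 2 comes from. The function name is my own.

```python
def fat_tree_endpoints(radix, levels):
    """Host ports supported by a fat tree built from radix-port switch modules.

    Each level fans out by radix/2; the top level uses all of its ports
    toward the hosts, which contributes the extra factor of 2.
    """
    half = radix // 2
    return 2 * half ** levels


# 24-port modules, three ranks: 12 x 12 x 12 x 2
print(fat_tree_endpoints(radix=24, levels=3))
```

The same formula gives 288 ports for a two-rank tree of 24-port modules.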
Actually, Sun's offering is a 4-socket Opteron blade system.
Though the T2 does offer dramatically improved floating-point performance compared to the T1, I've seen no evidence that it outperforms quad-core Opterons on the sort of HPC workloads needed for these sorts of systems. The T2 is designed for transaction processing, which has very different needs.
The main reason I don't think Sun would sell the T2 for supercomputing is that they haven't mentioned it at all, and it's the sort of thing they would want to talk up when marketing the T2.
Yeah, there are a lot of systems out there with theoretical limits of several petaflops. Cray, IBM, NEC, even SGI have systems that could theoretically hit several PFlops if you had enough money.
I'm waiting to see a customer actually purchase one, and for it to be installed, and actually running customer code, before I really care.
Well, perhaps it's more accurate to say that IBM is not selling BG to everyone and their mother, because a limited number of applications port well to the machine. If you happen to have a big need to run one of those applications, they'll sell one to you. But if you don't run one of those apps, they'll probably try to sell you p570s instead. It must be nice to be in those IBM salespeople's shoes, and have so many options to sell you.