That's true. Just have a look at the queues of some of the petaflop machines. They're usually at least 5x oversubscribed, meaning that more projects apply for compute time than is actually available.
Sure, I was using SLI as an abbreviation for a multi-GPU system. And since I was referring to a hypothetical desktop, it might even run AMD GPUs, not just Nvidia chips. But yeah, I know: AMD GPUs generally suck at scientific computing. Sadly.
Me neither. AMD's Interlagos (a.k.a. Bulldozer) chips have proven to absolutely suck at floating point performance. And in supercomputing floating point means everything. As much as I love the eternal underdog AMD, I can only hope Cray will soon start selling Intel systems, too. Sandy Bridge's AVX implementation is much better as the internal datapaths (L2->L1, L1->registers) are more elaborate.
Let's look this up. 7 years ago #1 on the Top500 was an IBM BlueGene/L at 70 TFLOPS. I can't see that performance anywhere close on the desktop or even on the notebook market.
Assuming you're running a good SLI system and that your GPUs actually deliver the performance the manufacturer claims, you'd get, in the best case, something around 1.5 TFLOPS, which corresponds roughly to a 1998 ASCI Red.
Compute resources don't come for free; you pay per use. You'll only be able to harness the Cloud if your business is sustainable. But if it is, then you could afford to buy compute resources anyway -- albeit on a smaller scale. The only real difference is that with cloud services you can save some money if you don't run jobs 24/7.
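To put the "save some money if you don't run jobs 24/7" point into numbers, here is a rough sketch. Both prices are made up purely for illustration (they are not real cloud or hardware quotes), and the function names are hypothetical.

```python
# Hypothetical prices, chosen only to illustrate the break-even argument.
CLOUD_PRICE_PER_NODE_HOUR = 2.00    # assumed pay-per-use rate in dollars
OWNED_COST_PER_NODE_YEAR = 8000.00  # assumed purchase + power + admin, per year
HOURS_PER_YEAR = 24 * 365


def yearly_cloud_cost(utilization: float) -> float:
    """Cost of renting one node for the given fraction of the year."""
    return CLOUD_PRICE_PER_NODE_HOUR * HOURS_PER_YEAR * utilization


def break_even_utilization() -> float:
    """Utilization above which owning the node is cheaper than renting it."""
    return OWNED_COST_PER_NODE_YEAR / (CLOUD_PRICE_PER_NODE_HOUR * HOURS_PER_YEAR)


if __name__ == "__main__":
    for utilization in (0.10, 0.50, 1.00):
        print(f"{utilization:4.0%} busy: "
              f"cloud ${yearly_cloud_cost(utilization):,.0f} "
              f"vs owned ${OWNED_COST_PER_NODE_YEAR:,.0f}")
    print(f"break-even at {break_even_utilization():.0%} utilization")
```

With these assumed prices, renting wins while the node sits idle most of the year and owning wins once it is busy around the clock. Where the crossover lies depends entirely on the real prices.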
Yep, I do absolutely agree. Eat as much pizza as you want, but be sure to burn those calories by powering your laptop afterwards. It's a win-win situation!
Yes, but the existing planes are constantly updated. Military planes are not like commodity cars, which get built once and only receive new wipers every now and then. The air force plans to use them for decades. Also, the insight gained will influence the next generation of fighters.
Sadly, both stories lack details on how the FPGAs are used in the computing architecture. Instead they go to great lengths listing telephone-number-like, meaningless speedup comparisons with conventional hardware. A typical drawback of FPGAs is that they cannot accommodate as many floating point units (FPUs) per chip as current GPUs and that FPGAs run at about 10x lower clock speeds. Their advantage, however, is that the internal chip architecture can be reconfigured to match the algorithm, so that all FPUs run at maximum efficiency. At the end of the day, whether a code runs best on FPGAs, GPUs or standard CPUs really depends on the algorithm. This is also the reason why one cannot say that an FPGA is X times faster than a GPU.
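A back-of-the-envelope sketch of why a single "X times faster" number is meaningless: sustained throughput is roughly FPU count times clock rate times the fraction of FPUs the algorithm keeps busy. The chip figures below are invented for illustration (not vendor data) and the `Chip` type is hypothetical. Only the 10x clock gap and the smaller FPU count come from the argument above.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    fpus: int          # floating point units on the chip
    clock_ghz: float   # clock frequency in GHz
    efficiency: float  # fraction of FPUs doing useful work for this algorithm


def sustained_gflops(chip: Chip) -> float:
    """Assume one operation per FPU per cycle, scaled by algorithm fit."""
    return chip.fpus * chip.clock_ghz * chip.efficiency


def fpga_speedup_over_gpu(gpu_efficiency: float) -> float:
    """Ratio of FPGA to GPU throughput; all chip numbers are made up."""
    gpu = Chip("gpu", fpus=512, clock_ghz=1.5, efficiency=gpu_efficiency)
    # Fewer FPUs and a 10x lower clock, but reconfigured to keep every FPU busy.
    fpga = Chip("fpga", fpus=256, clock_ghz=0.15, efficiency=1.0)
    return sustained_gflops(fpga) / sustained_gflops(gpu)


if __name__ == "__main__":
    for gpu_efficiency in (0.5, 0.1, 0.02):
        ratio = fpga_speedup_over_gpu(gpu_efficiency)
        print(f"GPU at {gpu_efficiency:.0%} efficiency: FPGA/GPU = {ratio:.2f}")
```

The same two chips come out as "FPGA far slower" or "FPGA faster" depending only on how well the algorithm maps onto the GPU.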
Maxeler, the manufacturer of the machine, had a booth at SC11. The basic component is the MAX3 card, a PCIe 2.0 8x card with up to 96 GB of DRAM on board. The boards are optimized for data stream processing. This is not unlike how GPUs are architected.
Up to 4 of those boards are located in a MaxNode, which can then be networked via 10Gbit Ethernet or InfiniBand. Multiple MaxNodes can be put into a MaxRack, which can also be seen in the WSJ article. The MAX3 boards can be connected via a custom MaxRing network, which provides a bandwidth of 8 GB/s.
The real use case for the camera is not to watch Coke bottles in super slo-mo, but to investigate how molecules absorb light of different wavelengths. There is a real scientific need for this camera. And of course, as mentioned earlier, it can't trace individual photons.
ps: needless to say that I did like my own summary much better (for being informative), but that may just be me.
I can only say what I've been told. However, I didn't want to create the impression that I despise IBM's patent policy. They hardly ever attack (troll) with their patents and act very reasonably regarding prior art.
That said, I'd distinguish between two issues here: #1 trivial patents vs. "real" inventions and #2 patents as a means to drive innovation vs. patents as a war chest to fight off competition. Regarding #1: while I don't have any papers to back up that claim, I've got the impression that IBM's patents are very seldom trivial ones. If a /. article raves about another trivial patent, then it's often from MS, Apple or Amazon, but never from Big Blue. Regarding #2 (patents and licensing them), I'd argue that technology is moving so fast today that making an invention alone already yields a large benefit to the inventor. He can enter the market months or even years earlier. Competition that is merely reverse engineering or imitating the products will always trail behind.
I'm not an expert on racetrack memory, but yeah: I'd love to see many more patents invalidated because of prior art. Today, patents seldom serve their original purpose. When I was at IBM they admitted that they mostly used the patents to defend against lawsuits from other companies which were claiming infringement of their own patents. Every big player in the business does this (as can be seen in the recent smartphone patent wars), but that's leading off topic...
IBM is the company which gets the most patents awarded. Every single year. For decades. They don't do research out of goodwill, but for profit. Yes, not just short-term profit, but profit over the long haul. That's why they still exist. After 100 years.
It's hard to compare IBM to Apple, since they target completely different customers: Apple is consumers, IBM is business.
Though, when you buy a system like that, the cost isn't the hardware, it's the field and support engineers available 24/7, customer support, projects and power consumption that are the big costs. There used to be a joke, "Buy a super-computer from us, and we'll throw the building in for free"
Wrong. Actually, current systems (e.g. Blue Waters) easily cost $200 million to procure, and that is just the hardware and support for 1 year, excluding staff, power etc.
Modern day supercomputer systems use a standardized rack frame system and intercommunication fabric so that the oldest and slowest nodes can be pulled out, while the newest and fastest ones can be slotted in straight away. That removes the overhead of having to construct a new building, power supply system, air conditioning and network infrastructure just to do a simple upgrade.
Sorry, but wrong again. Modern supercomputers quite often use custom interconnects (e.g. Cray's Seastar or Gemini or Fujitsu's Tofu). Also, as K and Jaguar show, the cooling solutions are commonly custom, too. This is because node density is growing exponentially and off-the-shelf interconnects and cooling can't keep up with this.
I don't have any knowledge of what those change requests were, so I don't know the answer. Everything I have read indicates that IBM wanted too much money.
From what I have read, it seems that they were. They couldn't keep their costs low enough to justify the expense.
True, but only because of the strict requirements of NCSA. If they had been willing to change them, a BlueGene/Q would have been viable.
Ah, I misunderstood. I don't think directives have been around all that long (PGI's earlier directives and CAPS's directives come to mind) and they certainly weren't standardized. OpenACC, like OpenMP, should allow scientists to write more portable accelerator-enabled code. In fact the OpenACC stuff came out of the OpenMP accelerator committee as explained here. I think it's highly likely some version of it will be incorporated into OpenMP.
The reason why I'm so allergic to annotation-based parallelization is the experience folks have had with OpenMP. The common fallacy about OpenMP is that it is sufficient to place a "#pragma omp parallel for" in front of your inner loops and *poof* your performance goes up. But in reality your performance may very well go down, unless your code is embarrassingly parallel. Simulation codes in particular are tightly coupled and memory bound. The parallelization on GPUs is also very different from the code on traditional multi-cores. On the latter you'll want to do pipelined cache blocking, while on the former you'll want to do tiling in the GPU DRAM. These are differences in the high-level algorithm of a kernel, something which is beyond the compiler to change. Even with annotations.
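To make the "memory bound" point concrete, here is a small sketch using the roofline model. That model is my own choice of illustration, not something from the discussion above, and all hardware numbers are invented. It estimates performance as the smaller of what the cores can compute and what the memory system can feed them.

```python
def attainable_gflops(cores: int,
                      gflops_per_core: float,
                      memory_bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: limited by either compute or memory traffic."""
    compute_roof = cores * gflops_per_core
    memory_roof = memory_bandwidth_gbs * flops_per_byte
    return min(compute_roof, memory_roof)


if __name__ == "__main__":
    # Hypothetical machine: 10 GFLOPS per core, 40 GB/s shared memory bandwidth.
    per_core, bandwidth = 10.0, 40.0

    # A naive stencil sweep that does few flops per byte moved (assumed 0.25).
    naive = [attainable_gflops(c, per_core, bandwidth, 0.25) for c in (1, 2, 4, 8)]
    print("naive kernel:  ", naive)

    # Same kernel after restructuring for cache reuse (assumed 2.0 flops/byte).
    blocked = [attainable_gflops(c, per_core, bandwidth, 2.0) for c in (1, 2, 4, 8)]
    print("blocked kernel:", blocked)
```

In this model the naive kernel gains nothing from more threads because the memory roof is already hit with one core. Only the algorithmic change (more reuse per byte) lets the extra cores pay off. The model ignores threading overhead, so it shows a flat line rather than the actual slowdown one can see in practice.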
Instead of a revamped OpenMP, I expect OpenCL to grab a larger share of the market when it comes to writing portable code. Even though OpenCL code by far isn't write once, run everywhere.
This article explains that five years ago when NCSA made the bid, accelerators were very exotic technology. The move toward GPUs was actually at the behest of scientists who now see a way forward to speed up their codes with accelerators. Technology shifts and we adapt.
If they are so willing to adapt, why weren't they willing to accommodate IBM's change requests? It's not like IBM was totally unwilling to build a $200 million machine.
None? I know of several. It's all still in its infancy of course, but I'm convinced it's possible to get good speedup from GPUs on real science codes. It's not applicable to everything, but then that's why they aren't CPUs.
I was referring to annotations for GPU offloading. Codes that run on GPUs are in fact so common nowadays that you'll be asked at conferences why you didn't try CUDA if you present any performance measurement sans GPU benchmarks. :-)
That's similar to what PGI is doing. And you know what? It's not that simple. You seldom achieve competitive performance with this annotation-type parallelization, simply because the codes were written with different architectures in mind.
This is also the reason why the original design emphasized single-thread performance so much. The alternative to having POWER7 cores running at 5 GHz would have been to buy a BlueGene/Q with many more, but slower, cores. They didn't go down that avenue because they knew that their codes wouldn't scale well to that number of cores.
None of the supercomputer codes I know uses such a type of parallelization or accelerator offloading. And the reason for that is not that folks enjoy doing work that a tool could handle for them, but because the tools don't work as well as advertised.
Or am I the only one remembering this from the good old BattleTech times? BTW: I want my Warhammer equipped with dual Gauss cannons, please. ^^
Actually most of the cabinets will be XE6, not XK6. Most codes at U of I aren't GPU ready.
How long until we can power devices from human body heat or even by the ATP derived from nutrients in our blood?
Oh noes, what a fail day! ^^ But if /. is my current working directory, then /. becomes ./
...didn't know this was a fake. Kind of spoils the fun of having a ./ story posted for the first time. :-/
ps: I'm not affiliated with the creators of the video.
Haha, I almost forgot about that game. Damn, gotta fire up Dosbox to give it a shot ASAP.
Again what learned, as one would say in Germany -- literally. ;-) I didn't know that one, so thanks!
Where can I click *like* for this post? ;-)
I'll just quickly link to the German news site Spiegel Online where they've summarized the clues of a number of experts. Google translation here.