Titan is a capability machine, which distinguishes it from capacity machines. As such it is designed for large/extreme scale jobs (which includes full system runs). I expect the techs are just now prepping Linpack for the next Top500 at SC12.
The ratio of 2 GB/core isn't going away anytime soon. The reason is: a) the speed per core is stagnating, thus adding more memory per core just means that one would end up with more memory per core than it could process in a timely manner and b) if you need more memory, you'll just allocate more nodes; the additional cores you get by that don't exactly hurt.
I don't see this as a contradiction to my post. Of course you don't work crunchtime for years, you do that to reach your goal, and then you go back to "normal". That's what I meant by reset.
Do you know the second guy to prove that a^2+b^2=c^2 in a right triangle? Or the second physicist to postulate that E=mc^2? Who was the second to reach the North Pole? Right: their names are written in Wikipedia, the first ones are engraved in our brains. It's got nothing to do with modern science: it's always been that way.
Can't find the word "every" in my post. Obviously there is a lot of BS out there. But brilliant stuff, too. Stuff that revolutionizes our lives.
I'm not saying that the system by itself is good. I'm just saying that it drives people into enslaving themselves in pursuit of degrees, career, and yes, their place in history. Even though the latter are few. Approx. as few as there are, say... Olympic gold medal winners? Would you call top athletes mental, too?
This is not limited to Astrophysics. I know a lot of students (both grad and PhD) that work basically 'round the clock, from fields as diverse as bio-chemistry, materials science and computer science. I'm hesitant to even call this a problem. What few realize at the beginning of their academic career is that science is actually a lot like sports: it is constant competition. It's all about who can discover/prove/engineer the next milestone first. There is no such thing in science as a runner-up. Those who come in second are the first to be scooped. Period.
Now, why do we work crunchtime in science? That's the difference to sports: our brain is our muscle, and that doesn't get sore -- provided sufficient sleep (5-6h sleep is sustainable), nutrition (sugar!), and the occasional reset. Working longer hours gets you to the results faster and gives you better chances of publishing. May sound sick to outsiders, but the truth is that getting a paper out is highly addictive. In a way we're all hooked on achievement.
Hello and welcome to the Internet. You must be new here. We here do sometimes happen to speak in strange tongues, sometimes we might even portray a desirable thing as being appalling. That is a rhetorical trick called irony, sometimes even sarcasm. HTH.
What the author meant to say was this: the government probably did reveal more than they meant to, as spying on their own citizens on a regular basis is so 1984, and no one from 2012 wants to live in a future like that and we'll have elections here in Germany soon, so please move along, we'll spy on nobody from now on, at least not if you're not one of the baddies, yes, we do decide who is bad, yes, all your data will be kept, oops, did we say this aloud?... get it?
Is there an architecture with a memory bandwidth superior to GPUs? Nope. Commercial FPGA boards get in the range of O(20 GB/s), but definitely not O(200 GB/s), which is where GPUs stand.
However, (Nvidia) GPUs are not designed for heavy integer arithmetic. FPGAs do this quite well, and even though their DRAM controllers are generally worse, they can often avoid much of the memory traffic altogether by keeping intermediate data in their comparatively large on-chip SRAM. BTW: Xilinx have just recently announced their first products for 28 nm production...
They need to make the chip designs. These are specific to the manufacturing process, and that's the tricky part. Chip simulation and validation are expensive in terms of compute time and labor. That's all I'm saying.
OK, maybe 100nm was a bit too much, yet I don't see the 28nm coming. Just to give you a comparison: Samsung is manufacturing the brand new Galaxy S3 SoC in 40nm. Why don't they use 28nm? Don't they want it? Hell yes, but it's not that easy. Think about that.
The power argument and the architecture's openness are sensible, I don't argue against that. Yet, the performance per Watt seems grossly inflated. If you look at today's most power efficient HPC chip, the CPU of IBM's Blue Gene/Q, then you'll see that they achieve less than 4 GFLOPS/Watt. Adapteva claims more than 9. So they're twice as good as IBM? Really?
Yes, I know. But "planning to" is not the same as actually doing so. That process is expensive: mask creation alone costs a fortune. It'll only work if they order millions of chips up front. Not exactly the thing you can do if your funding is a Kickstarter project.
Adapteva is creating false expectations here. Their chip won't deliver performance on par with GPUs (or CPUs, for that matter) and still be cheap. Why? Because it's not a thing that a startup can do in today's world of computing. For such a chip you need to use the latest CMOS processes and a huge team to design/optimize the ASIC (especially if it's meant to be a low power chip) -- both of which are extremely costly. If it was that easy, then we'd see more competition and not Intel, AMD, Nvidia and IBM as the only global players in the HPC arena.
If you're a small startup, then you'll be bound to 100nm processes (at best), and have to use automated layouts (not the hand-optimized ones e.g. Intel uses). Both reduce performance and increase power draw.
I work at the Chair for Computer Architecture at FAU. We have some of the very brightest minds working on custom chips for industry solutions. This 2D CPU matrix that Adapteva proposes is something that my colleagues played with years ago. It's a good approach and I personally believe that this will be the shape of CPUs to come. It started with the ring bus on the IBM Cell, now Intel's Nehalem has got a partitioned L3 cache connected with a... ring bus and Intel's Xeon Phi (MIC) even got a 2D on-chip grid network. But even my colleagues concede that a) on FPGAs you'll always be trailing GPUs concerning floating point performance (it's something FPGAs are particularly bad at) and b) even when designing an ASIC you'll always be beaten by GPUs in terms of performance, assuming similar prices and power consumption. Those are simply beasts, optimized down to the bone. It's the result of a multi-billion mass market. That's also the reason why there is no next IBM Cell chip for a PlayStation 4: Cell was too expensive to develop to keep up with the competition. Its market is too small compared to the ubiquitous GPUs.
For teaching parallel computing I'd always suggest a GPU. The tools are there, the performance is great and you'll be able to use the knowledge gained in real-world projects.
Sorry, but you're milking the cow twice: process shrinks allow us to pack more transistors on a chip. This would amount to a growth of about 2^3 = 8 in a period of 5 years, as you correctly estimate. But this already includes the increase in parallelism. Today's supercomputers apparently aren't growing to many more racks:
Roadrunner: 296 racks
Jaguar: 150 racks IIRC
K computer: 768 racks (huge exception)
Sequoia (Blue Gene/Q): 96 racks
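For the record, the arithmetic behind the factor of 8 can be written out as a tiny sketch. The 20-month doubling period is my assumption (it is what 2^3 over 5 years implies), and `growth_factor` is just a made-up helper name:

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Growth in transistor count if it doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)

# Assumption: one doubling every 20 months, i.e. three doublings in 5 years.
print(growth_factor(60, 20))  # 2^3
```

The point being: if that factor already accounts for the added parallelism on the chip, the rest of the growth would have to come from more racks, and the list above suggests it mostly doesn't.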
The reason why we can't just buy an infinite number of racks and network them is the MTBF. In the K computer the MTBF means that node failures occur every couple of hours, rendering longer full system runs almost impossible.
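A back-of-the-envelope sketch of why the rack count is capped by MTBF: assuming independent node failures with a constant failure rate, the system-level MTBF is roughly the per-node MTBF divided by the number of nodes. The per-node MTBF and the node counts below are made-up illustrative values, not measured figures for the K computer or any other machine:

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """System MTBF, assuming independent nodes with constant failure rates.

    Failure rates add up, so the system MTBF shrinks linearly with node count.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_hours / num_nodes

# Hypothetical per-node MTBF of 25 years (illustration only).
NODE_MTBF_HOURS = 25 * 365 * 24

for nodes in (1_000, 10_000, 80_000):
    hours = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
    print(f"{nodes:>6} nodes -> a node failure every {hours:.1f} h on average")
```

Even with a very reliable node, a machine with tens of thousands of them ends up with a failure every few hours under these assumptions, which is why full system runs depend on checkpointing.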
I'm talking about fixed-point numbers, which are (almost) the same as integers with respect to the logic, but keep in mind the decimal point, which is fixed (hence the name) at a certain place. Who in his right mind would propose using pure integer arithmetic? You need a way to represent, say, 1.337.
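To make that concrete, here is a minimal sketch of fixed-point arithmetic on top of plain integers. The 16 fractional bits and all helper names are my own choices for illustration, not any particular hardware format:

```python
FRAC_BITS = 16            # assumed number of fractional bits
SCALE = 1 << FRAC_BITS    # the point sits 16 bits from the right

def to_fixed(x: float) -> int:
    """Store x as an integer count of 1/SCALE steps (round to nearest)."""
    return round(x * SCALE)

def to_float(q: int) -> float:
    """Interpret a fixed-point integer as a real value again."""
    return q / SCALE

def fixed_add(a: int, b: int) -> int:
    """Addition is plain integer addition; the point stays where it is."""
    return a + b

def fixed_mul(a: int, b: int) -> int:
    """The product carries 2*FRAC_BITS fractional bits, so shift back."""
    return (a * b) >> FRAC_BITS

q = to_fixed(1.337)
print(q, to_float(q))  # the stored integer and the value it stands for
```

The adder and multiplier are integer units; only the interpretation (and the shift after a multiply) knows where the point is. Note that the stored value is only the nearest multiple of 1/SCALE, not 1.337 itself.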
"First off, any metric which yields a single number is bound to be misleading as it is easy to find two applications a and b where a runs faster than b on machine 1 and slower than b on machine 2."
Part of the point I was making.
Sure, but apparently people still want such a metric, however imperfect and misleading it may be.
"it's much easier to prove numerical stability with floating point numbers"
False.
How eloquent. Simple example: I have two numbers a and b, approximately of the same size, with their LSB being tainted by rounding errors. If I add them on a floating point machine the last two bits of the mantissa will be tainted, but because the mantissa will be cut during normalization we again end up with just the LSB being tainted.
On a fixed point machine, however, adding both numbers will either result in an overflow or in two bits being tainted. And so on. Care to disprove me?
Numerical stability is not the same as exactness, which you are referring to. Exactness is something we can never achieve (just think of irrational numbers: impossible to store all digits). So we have to clip numbers and resort to rounding, which introduces errors. When caring for numerical stability one usually tries to prove that the errors introduced by the imperfect representation of numbers in computers are less than an acceptable limit "foo". And these proofs are simpler for floating-point arithmetic.
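A small sketch of one reason such proofs tend to be easier: rounding a real number to a double gives a relative error bounded by the unit roundoff no matter how large or small the number is (within the normal range), whereas fixed point only bounds the absolute error, so the relative error depends on the magnitude. The 16-bit fixed-point format and the test values are assumptions for illustration; exact rational arithmetic is used so the measured errors are not themselves rounded:

```python
import sys
from fractions import Fraction

FRAC_BITS = 16                            # assumed fixed-point format
SCALE = 1 << FRAC_BITS
U = Fraction(sys.float_info.epsilon) / 2  # unit roundoff of a double, 2^-53

def float_rel_error(x: Fraction) -> Fraction:
    """Exact relative error of rounding x to the nearest double."""
    return abs(Fraction(float(x)) - x) / abs(x)

def fixed_rel_error(x: Fraction) -> Fraction:
    """Exact relative error of rounding x to the nearest multiple of 1/SCALE."""
    q = Fraction(round(x * SCALE), SCALE)
    return abs(q - x) / abs(x)

for x in (Fraction(1337, 10**7), Fraction(1337, 1000), Fraction(13370)):
    print(f"x = {float(x):<10g} "
          f"float: {float(float_rel_error(x)):.2e}  "
          f"fixed: {float(fixed_rel_error(x)):.2e}")
```

With a uniform relative bound you can carry one constant through the whole error analysis; with fixed point you also have to track the range of every intermediate value. This only illustrates representation error, it is not a stability proof for any particular algorithm.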
...sort of. And whoever rated the parent "insightful" apparently has little insight into HPC and supercomputing. "Interesting" might have been appropriate.
First off, any metric which yields a single number is bound to be misleading as it is easy to find two applications a and b where a runs faster than b on machine 1 and slower than b on machine 2. But since we want such a simple metric, we might just as well settle for the one we already have. Why flops? Because applications use them. I know that the calls at conferences for fixed point logic (more or less integer arithmetic) are getting louder as you can actually prove that you can save some power (fixed point needs fewer transistors), but simultaneously users prefer floating point because it's much easier to prove numerical stability with floating point numbers. And correctness always trumps.
If you need more memory, simply allocate more nodes. Problem solved. Hardly anyone needs more than 2 GB/core.
My wife will assure you that 1+1=3, or even 4, for sufficiently large values of me.
Epic post!
Exactly what I thought. Why reinvent the wheel? Shouldn't be too difficult to make BSD real-time capable.
I'd mod the parent "+1 funny" if I had any powers to do so.
Was he satisfied with his new Linux desktop? And did they install KDE or Gnome?
Agreed. 2 GB/core seems to be the current consensus on almost all machines except for IBM Blue Gene, which has just 1 GB per core.
The summary mentions that 2 teraflops are generated by the CPUs while 8 are generated by the Knights Bridge chips. It should say petaflops.
Amen.