Supercruncher Applications
starheight writes "Bill McColl has written an article contrasting traditional massively parallel supercomputing with a whole new generation of compute-intensive apps that require massively scalable architectures and can deliver both incredible throughput and real-time responsiveness when processing millions or billions of tasks."
Just in time for Vista!
Dell's website consumer pricing generator.
Argh.
Imagine a Beowulf cluster of-- oh. Wait.
"No freeman shall ever be debarred the use of arms." -- Thomas Jefferson
How many hours does it take vista to boot on this thing?
Looking at his examples (Search, Ecommerce, Software-as-a-Service, Infrastructure-as-a-Service, Fraud Detection) I have to think "wow, single point of failure". Lots and lots of fault-tolerance needed to put all your eggs in one basket like that.
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
Good point. A single point of failure not only causes your entire system to go down, but stops the several billion processes you're running all at once. How long would it take to get things running again if something simple stopped? How long if it's a processor that fries out? An hour? A day? Several days? How much money are you losing when that happens?
The first half of his list seems a bit flighty. They lean more towards buzz and less useful applications. But the second half is much more practical and likely. There are many potentially interesting applications coming up, but I don't think we'll directly see most of them publicly on the internet. So I give him a +0.5 Insightful.
Developers: We can use your help.
Yes ... it includes RFID tracking to reduce theft, and ... manage traffic!?!?
We need our next generations of supercomputers to follow you around, knowing where you are at all times ... so umm, we can change the traffic lights when the roads get busy for you ....
~Director of NSA Domestic Spying Program
With his excessive use of "massively" this is obviously a ginormous supercruncher.
-- www.globaltics.net
Political discussion for a new world
There had better be a CPU dedicated to Error detection and correction!
"No freeman shall ever be debarred the use of arms." -- Thomas Jefferson
I've seen the things - trust me, they're massive!
Slow news day, huh?
Can we please have a "no links to random, boring blogs week" on Slashdot?
The term "massively parallel" indicates a system operating without those constraints.
Engineering is the art of compromise.
bah weep grana weep minibom
"Using supercomputers to test the next-generation version of the SMP code, we get good scaling to many more cores than in the Intel prototype, and we expect to do even better in the future."
http://forum.folding-community.org/fpost166684.html#166684
http://fahwiki.net/index.php/SMP_client
My main side project is real time ray tracing software. It is very nearly not subject to Amdahl's Law. In the terminology of the Wiki article, F is approximately zero for Ray Tracing. It will scale very well past 10 cores and may well be able to make good use of 100 cores. Memory bandwidth seems to be the limiting factor (that determines F) but that may not be a problem with enough cache and good code. It's also the only potential mass-market use for a lot of cores. nVidia your days are numbered.
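To put rough numbers on that, here is a minimal sketch of Amdahl's Law, assuming F means the serial (non-parallelizable) fraction of the work. The function name and the sample values of F are made up for illustration, not measurements from any ray tracer.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup predicted by Amdahl's Law for serial fraction F on a given core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Illustrative values of F only; a real F would have to be measured.
for f in (0.0, 0.01, 0.1):
    row = [round(amdahl_speedup(f, n), 1) for n in (10, 100, 1000)]
    print("F =", f, "speedup on 10/100/1000 cores:", row)
```

With F near zero the speedup tracks the core count almost exactly, while even F = 0.1 caps it below 10 no matter how many cores are added.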
There is no such thing as "massively parallel!" It makes no sense! Parallel is qualitative, NOT quantitative! Things are either parallel or they're not, there are no degrees of "parallelness!"
Sure there are. Say you want to find the maximum of 4 integers. You can do that in parallel, but you won't gain much if you have more than two processors (or execution units). Contrast this with, say, rendering an image using a path tracer, where each ray is independent of the others. The first problem is hard to scale up, the second one isn't. I'd say that means that ray tracing is a "more parallel" task.
Also, writing an algorithm that has to run efficiently on 10000 processors is not exactly the same as writing one that has to run on 4 processors, in the same way that writing a multiplayer game that handles four players isn't the same as writing one that can handle thousands of concurrent players. So they toss on the "massive" part to separate the cases. At least that's my take on it.
Actually, "massively parallel" has a meaning. For example, the 131,072 CPU beast designed by IBM. This computer is designed to solve problems that have another term attached to them: "embarrassingly parallel" problems. Your average task is not embarrassingly parallel, and thus is difficult to scale to a massively parallel system. It would take a lot of effort, see?
But some problems can use massively parallel computers, designed to solve embarrassingly parallel problems.
"Give me a SUPER number crunch."
"We have a 32.33, repeating, of course, percent chance of survival."
"That's better than we usually do."
"Never give up, for that is just the time and place when the tide will change." -Harriet Beecher Stowe ^_^
# Dense linear algebra
# Sparse linear algebra
What about Average linear algebra?
# Structured grids
# Unstructured grids
Are there any other types?
(** Warning: Car analogy...)
Isn't that kind of like selling a car and listing on the spec sheet:
# Goes slow
# Goes fast
Every time I start to have faith in humanity, I ruin it by driving to work between 7 and 8 am.
Additionally, many of these computers don't run just 1 application. IBM's blue gene, and many other Dept. of Energy/Defense/* computers run a large number of research applications, ranging from 10's to 1000's of cores. It is very rare that a single program gets to run on such a large machine for any length of time by itself, so in most cases, programs don't have to scale to 100,000 PE's, but rather they scale to hundreds or a few thousand. Far more applications can scale well to hundreds than thousands, and still have reasonable speedup.
Microsoft Sucks, F/OSS Rocks. I get mod points now right?
...maybe it should be 'degrees of parallel scalability'
i.e. [these algorithms are] massively parallel scalable.
buzzwords help too.
It's not stupid. It's Advanced.
Bill McColl, for those who aren't familiar with him, was the driving force behind the bulk synchronous parallel (BSP) model of programming. This model, while available in the MPI-2 spec, is not widely used as is. Instead, its major contribution is inspiring remote direct memory access and the partitioned global address space, among others.
Last time we spoke, Bill said that he was interested in the issue of massively scaled computers that can handle fault tolerance pre-emptively. He compared today's supercomputers (Blue Gene, Cray XT4, Altix, etc) to a racing car that was really fast for a few hours a week, but wasn't even reliable enough to get the groceries. He was also interested in computers that can handle a continuous influx of data (as his blog post mentions), similar to managing millions of RSS feeds.
An example application domain for this stuff would be Wall Street firms that have to run time series analysis on streaming data. Prof. McColl is really on the right track here.
Fault-tolerance is either built into the problem or into the application. Take search, for example: if one search server on the backend that is handling 0.1% of the web sites goes down, you may not know or even care that those results are missing (assuming the system doesn't have something built in to give that query to another node searching the same dataset).
In fraud detection, thinking of the credit card companies, it's typically looking for patterns after the transaction has already gone through, and if one node of the cluster goes down, maybe you give the same transaction list to another node. You never find every case of fraud this way, but you want something that can search as many (or all) of the transactions as quickly as possible to reduce the time between the first instance and shutting down the account.
For the other examples, you just build it into the system, e.g. one HA broker on the front that can give out a task to another node if the first one goes down. When you build a system like this, single points of failure in the server farm aren't the concern. It's the mean time between failures and the process to replace nodes, the power and cooling requirements, failure points outside of the nodes, etc.
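A rough sketch of the "give the task to another node" idea, with simulated nodes. Everything here (NodeDown, make_node, dispatch, the failure rates) is hypothetical and made up for illustration; a real HA broker would also have to deal with timeouts, partial results and its own redundancy.

```python
import random

class NodeDown(Exception):
    """Raised by a simulated worker node that has failed."""

def make_node(name, fail_rate, rng):
    # Hypothetical worker: squares its input, but fails some fraction of the time.
    def run(task):
        if rng.random() < fail_rate:
            raise NodeDown(name)
        return task * task
    return run

def dispatch(task, nodes):
    """Try each node in turn; hand the task to the next one if a node is down."""
    for node in nodes:
        try:
            return node(task)
        except NodeDown:
            continue  # this node failed, give the same task to another one
    raise RuntimeError("all nodes failed for task %r" % (task,))

rng = random.Random(0)
nodes = [
    make_node("a", 0.5, rng),
    make_node("b", 0.5, rng),
    make_node("c", 0.0, rng),  # assumed always up, so every task eventually completes
]
results = [dispatch(t, nodes) for t in range(10)]
print(results)
```

The caller never sees the individual node failures, only the completed results, which is the sense in which a single dead node stops being the concern.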
Either that, or your imagination is lacking somewhat. Personally, I've wanted lots of cores since I was in kindergarten. I'm quite sure I can find a use for them all.
What? You are on drugs, yes? And not the good kind?
What about video encoding? Besides codec parallelism, you can also parallelize the distance between two keyframes, handing that chunk off to a core (or node) for processing. This is very mass-market - more and more people want to make snazzy home movies.
In fact, far more people would like to do this than render 3d movies.
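The keyframe-to-keyframe chunking could look something like the sketch below. It is only an illustration: frames are stand-in characters, encode_chunk is a made-up placeholder for a real encoder, and a thread pool is used for simplicity where a real encoder would more likely farm chunks out to separate processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def split_at_keyframes(frames):
    """Group frames into chunks, each one starting at a keyframe ('K')."""
    chunks = []
    for frame in frames:
        if frame == "K" or not chunks:
            chunks.append([])
        chunks[-1].append(frame)
    return chunks

def encode_chunk(chunk):
    # Placeholder for a real encoder: just report how many frames were handled.
    return len(chunk)

frames = list("KPPPKPPKPPPP")  # K = keyframe, P = frame that depends on earlier frames
chunks = split_at_keyframes(frames)
with ThreadPoolExecutor(max_workers=4) as pool:
    # Chunks do not depend on each other, so they can be encoded concurrently.
    encoded = list(pool.map(encode_chunk, chunks))
print(encoded)
```

Because each chunk begins at a keyframe, no chunk needs frames from another one, which is what makes the hand-off to separate cores possible.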
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
As a standard power user running internet apps/office apps/video processing (home/tv)
At one point you have the app running on a core, the OS on one, the graphics on the GPU, the network on a cpu. You get lower latency because your app's cpu doesn't have to time slice with the others.
I can see parallel makes, conversion (wav2mp3, video formats etc), formatting (commercial skipping, panorama stitching). I/O is going to be the ultimate bottleneck.
What kind of consumer applications would benefit?
None of those offer or require real-time guarantees.
Google Alerts is here now.
A better article would have started with the table that defines "supercruncher" and proceeded to describe the architectural issues of building one. Ideally it would have addressed the software challenges.
The top parent's point about the term parallel is correct in the literal sense. Parallel is true or false and there is no spectrum of parallel in the mathematical sense. The term 'concurrent processing' might be more correct (degrees of correctness?) but parallel has slipped into common language.
How well suited a problem is to a multi-processor platform can be quantified. It's hard to scale up when you define it as 4 integers. Try finding the maximum of n integers.
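array[1..n]
int parallelMax(array, lowIndex, highIndex) {
    // base cases: one or two elements left
    if (lowIndex == highIndex)
        return array[lowIndex]
    if (highIndex - lowIndex == 1)
        return max(array[lowIndex], array[highIndex])
    // otherwise split the range in half and search both halves concurrently
    middle = (lowIndex + highIndex) / 2
    open new thread and calculate a = parallelMax(array, lowIndex, middle)
    open new thread and calculate b = parallelMax(array, middle + 1, highIndex)
    wait for both threads to finish
    return max(a, b)
}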
With one processor we need to check every integer and compare it to the current max. This takes n-1 comparisons, one after another. With n/2 processors we can do half of the remaining comparisons at the same time in each round, giving us the same number of comparisons in on the order of log(n) rounds instead of n steps. So finding a max will see some benefit from having more processors, and that benefit is on the order of n - log(n) steps saved, or a speedup of roughly n/log(n).
This benefit will be different for different problems (given the best known algorithm), and by sorting these benefits you could get a 'spectrum of benefit of concurrency' which would denote the 'degrees of parallelism' that the top parent is speaking of.
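For anyone who wants to run it, here is a small sketch of the same idea as an iterative pairwise reduction instead of the recursive parallelMax pseudocode above. tree_max is a made-up name, the input is assumed to be non-empty, and Python threads will not make comparisons this cheap any faster; the point is only to count the rounds.

```python
from concurrent.futures import ThreadPoolExecutor

def tree_max(values, pool):
    """Pairwise-reduce values to their maximum; return (maximum, number_of_rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        # Every pair in a round is independent, so the pool can compare them concurrently.
        level = list(pool.map(max, pairs))
        rounds += 1
    return level[0], rounds

data = [7, 3, 9, 1, 4, 8, 2, 6]
with ThreadPoolExecutor(max_workers=4) as pool:
    result, rounds = tree_max(data, pool)
print(result, rounds)
```

Eight values need seven comparisons either way, but here they are spread over three rounds rather than seven sequential steps.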
In most cases, researchers request a specific number of cores, based on experience of how well their code scales. Some codes do auto-scale, depending on available cores, but these are rarer. The way it works is a batch queue system... Users submit a job requiring 2000 cores, and wait until that many are available. Then, when the cores become available, their job runs for 6-48hrs or more, depending on the job. In most cases, a large number of researchers are in contention for computing time, and wait their turn in line. The good ones tend to understand the system better, and will submit workloads that reflect the currently available resources, thus limiting the time their work spends sitting in the queue.
Microsoft Sucks, F/OSS Rocks. I get mod points now right?
It's rare for an entire machine like that to fail. More likely is 1 processor board, or similar subsystem, which you can design for (I didn't get a result back, try again) in software, and, like the T3E which shipped with redundant processors, in hardware as well. If you have enough processors, you could stripe your job across several, so if one doesn't return a result, a second one will. Now, locating your only one of these machines in California might not be the best idea (we had an earthquake which started a eucalyptus grove fire, but don't worry, the mudslide put it out), but it's unlikely that you'll lose an entire one.
Just to geek out for a moment, picture a system large enough to finally troll through all of that data NASA brought back from the Mariner missions, and cross-reference it against what they get daily now from the various Mars probes. Finally turn all of that data into information, as the blog says.
the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
100 cores is not massively parallel. The kind of scaling we're talking about is much higher. Think thousands of cores each with hundreds of threads.
This is the kind of scaling that weather centers are just starting to reach today. It's the kind of scaling that will require a radical rethinking of how consumer software is designed and what tools we need to make that design process easier.
In this world, software is king. You won't care who your chip vendor is. You'll care who provides your compiler, debugger, performance analysis tools and other such things.
Fascinating that a story purporting to be about supercomputers is actually a summary of Weightless Economy theory. The theory is that the wealthiest countries can't achieve more wealth by implementing things anymore. They can't increase their net worth by manufacturing or solving math problems. They have to turn instead to philosophical goals like people management, interpreting literature, creating works of art.
The supercomputer function is still the same. It still solves algebra, n-body methods, structured grids, and finite state machines. The user of the supercomputer is different. The user is now living on $1 a day in Mongolia.
For the wealthiest countries to stay wealthy, they have to focus on not the computing part but marketing the computing, creating the interface to the math, managing the business around the computing.
Not to mention that speedup (i.e., best serial running time over best parallel running time) and efficiency (i.e., speedup over number of processors) are well defined ways of quantifying how well a piece of code scales in parallel. If, for example, you're still getting 95% efficiency running on 2000 nodes, then I'd call that pretty darn good given Amdahl's Law, and "massively scalable". The way the efficiency curve falls off as you increase the number of processors tells you a lot about how parallel a piece of code is.
Granted massively parallel is a fuzzy, qualitative phrase, but parallel efficiency at high numbers of processors is a pretty good measure.
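Those two definitions are simple enough to write down directly. The timings below are hypothetical numbers chosen to reproduce the 95%-on-2000-nodes case, not real benchmark results.

```python
def speedup(serial_time, parallel_time):
    """Best serial running time over best parallel running time."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, processors):
    """Speedup divided by the number of processors used."""
    return speedup(serial_time, parallel_time) / processors

# Hypothetical wall-clock timings in seconds, made up for illustration.
serial = 3800.0
timings = {1: 3800.0, 100: 39.0, 2000: 2.0}
for procs, t in timings.items():
    print(procs, round(speedup(serial, t), 1), round(efficiency(serial, t, procs), 3))
```

Plotting efficiency against processor count for real timings gives the fall-off curve described above.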
As used in the field of "real-time computing/systems," satisfying time constraints is a correctness criterion, not simply a performance metric.
Doug Jensen
Some time ago while doing some research into 'massively parallel' applications for a bio-research company I wrote an auto scaling hack on top of the Pov Ray PVM port. It worked fairly well at monitoring cpu loads across a network, dicing up the scenes to be rendered and shipping off chunks of work to various CPUs as they were available.
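The dispatching part of something like that can be sketched with a shared work queue. This is not the POV-Ray PVM code; make_tiles, worker and the tile size are invented for illustration, local threads stand in for remote CPUs, and "rendering" a tile is replaced by counting its pixels.

```python
import queue
import threading

def make_tiles(width, height, tile):
    """Dice a width x height image into (x, y, w, h) tiles."""
    return [(x, y, min(tile, width - x), min(tile, height - y))
            for y in range(0, height, tile)
            for x in range(0, width, tile)]

def worker(name, todo, done):
    # Each worker pulls the next tile as soon as it is free, so faster
    # workers naturally end up handling more tiles.
    while True:
        try:
            tile = todo.get_nowait()
        except queue.Empty:
            return
        x, y, w, h = tile
        done.put((name, tile, w * h))  # stand-in for rendering: count pixels

todo, done = queue.Queue(), queue.Queue()
for tile in make_tiles(100, 60, 32):
    todo.put(tile)

threads = [threading.Thread(target=worker, args=("cpu%d" % i, todo, done))
           for i in range(3)]
for th in threads:
    th.start()
for th in threads:
    th.join()

pixels = sum(done.get()[2] for _ in range(done.qsize()))
print(pixels)
```

Pulling work on demand rather than pre-assigning it is one simple way to keep unevenly loaded CPUs busy.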
Overall the research project covered scaling from the CPU/core through cache to DRAM to disk to network even up to the point of when you'd have to actually scale the dispatcher in order to keep all of the processors busy. It was interesting stuff and produced some nice graphs of performance curves clearly indicating what was the bottleneck for each type of computing problem that we evaluated.
Damn it was nice working for that company.
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO